Re: Pushing a whole set of pdf-files to solr
Your errors may simply have been caused by improperly encoded documents, or by some encoding that is not supported. Hard to say.

Start with a simple case, then build on success. I think you're just trying to do too much all at once. Do one PDF file first, then work up to a directory, and only when you've mastered all that should you try large numbers of unknown documents from somewhere else.

Try a simple PDF, like one you create yourself by outputting from an Office app. And try an Office (MS or Open) file as well.

-- Jack Krupansky

-----Original Message-----
From: sdspieg
Sent: Wednesday, April 24, 2013 7:57 PM
To: solr-user@lucene.apache.org
Subject: Re: Pushing a whole set of pdf-files to solr

[...]
Re: Pushing a whole set of pdf-files to solr
(Just documenting my experiences.) I stopped and restarted Solr in the Tomcat web application manager. Everything seems fine:

<http://lucene.472066.n3.nabble.com/file/n4058786/4-25-2013_2-38-43_AM.png>

And yet I still get that same error message.

--
View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4058786.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Pushing a whole set of pdf-files to solr
I am still struggling with this. I have Solr 4.2.1.2013.03.26.08.26.55 installed. So are you telling me that I should somehow install the older version of that tool that comes with Solr 3.x? Because with the newer version I get the errors I already mentioned.

Now I suppose I may be an atypical user, as I am running all of this under Windows and really just want to find an easy way to get a whole bunch of files from a local folder (on my hard drive) into my local version of Solr. Is there really no easier way of doing this?

-Stephan

--
View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4058776.html
Re: Pushing a whole set of pdf-files to solr
Yes, there is the version that comes with Solr 3.x.

I'm not aware of an encoding issue.

-- Jack Krupansky

-----Original Message-----
From: sdspieg
Sent: Wednesday, April 10, 2013 8:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Pushing a whole set of pdf-files to solr

[...]
Re: Pushing a whole set of pdf-files to solr
Jack - I apologize for my ignorance here, but when you keep emphasizing 'new', does that mean that there is ANOTHER version of this tool than the one that is built into solr-4.2.1?

And on the encoding issue: I thought PDF was platform-agnostic? Or is the problem on my Windows system, i.e. that it extracts the (correctly encoded) text into Win-1251, which Solr then has a problem with? But can't I change that somewhere then?

--
View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4055010.html
Re: Pushing a whole set of pdf-files to solr
The newer SimplePostTool can in fact recurse a directory of PDFs. Just print the usage for the tool (java -jar post.jar -help). I'm sure it lists the command options.

-- Jack Krupansky

-----Original Message-----
From: sdspieg
Sent: Tuesday, April 09, 2013 9:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Pushing a whole set of pdf-files to solr

[...]
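As a rough illustration of the recursive mode mentioned above, here is a minimal Python sketch that only assembles the post.jar command line for a folder of PDFs. The system properties -Dauto, -Drecursive, -Dfiletypes and -Durl are the ones the Solr 4.x tool is understood to list in its usage output; please confirm them against "java -jar post.jar -help" for your install. The function name, the folder and the Tomcat port 8080 are assumptions made up for this example.

```python
import subprocess


def build_post_command(folder, post_jar="post.jar", solr_update_url=None):
    """Assemble a post.jar invocation that walks `folder` and posts PDFs.

    Assumption: these -D properties match the usage text of the Solr 4.x
    post.jar; verify with `java -jar post.jar -help` before relying on them.
    """
    cmd = ["java", "-Dauto=yes", "-Drecursive=yes", "-Dfiletypes=pdf"]
    if solr_update_url:
        # Only needed when Solr is not at the tool's default URL,
        # e.g. when it runs under Tomcat on another port (assumption).
        cmd.append("-Durl=" + solr_update_url)
    cmd += ["-jar", post_jar, folder]
    return cmd


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    command = build_post_command(
        "C:/articles", solr_update_url="http://localhost:8080/solr/update"
    )
    print(command)
    # To actually run it from the exampledocs directory, one could call:
    # subprocess.run(command, check=True)
```

Passing the argument list to subprocess.run avoids quoting problems with folder names that contain spaces, which matter for file names like the ones in this thread.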
Re: Pushing a whole set of pdf-files to solr
On 10 April 2013 08:11, sdspieg wrote:
> Another progress report. I 'flattened' all the folders which contained the
> pdf files with Fileboss and then moved the pdf files to the directory where
> I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I
> then ran "java -Ddata=files -jar post.jar *.pdf" and in the command window
> it seemed to be working fine (these are just academic articles in pdf-format
> that I downloaded with Zotero from EBSCO):
[...]

If it works, great, but it is not generally advisable to have a large number of files under one directory. However, that is not the source of your error here.

> But then when I looked in solr, I saw the following:
> 04:34:41
> SEVERE
> SolrCore
> org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at
> char #10, byte #-1)
[...]

Your files seem to have some encoding other than UTF-8. My random guess would be Windows-1252. You need to convert the files to UTF-8.

Regards,
Gora
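A note on the error quoted above, offered as one possible reading rather than a confirmed diagnosis: PDF files are binary, and a commonly seen PDF header has a second line whose bytes include 0xE2 followed by 0xE3, which is not valid UTF-8. If post.jar sent the files with its default XML content type (an assumption; nothing in the thread confirms it), Solr would try to read each PDF as UTF-8 XML and could fail in much this way. The small Python sketch below checks both properties; the sample header is made up for illustration and is not taken from any file in the thread.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF magic marker."""
    return data.startswith(b"%PDF-")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: a version line followed by a binary comment line.
# 0xE2 announces a three-byte UTF-8 sequence, but 0xE3 is not a valid
# continuation byte, so decoding fails at that point.
SAMPLE_PDF_HEADER = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

if __name__ == "__main__":
    print("looks like PDF:", looks_like_pdf(SAMPLE_PDF_HEADER))
    print("valid UTF-8:   ", is_valid_utf8(SAMPLE_PDF_HEADER))
```

If this reading is right, converting the files to UTF-8 would not help, since a PDF is not a text file; the files would need to go through the extracting handler instead of the plain XML update handler.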
Re: Pushing a whole set of pdf-files to solr
Another progress report. I 'flattened' all the folders which contained the pdf files with Fileboss and then moved the pdf files to the directory where I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I then ran "java -Ddata=files -jar post.jar *.pdf" and in the command window it seemed to be working fine (these are just academic articles in pdf-format that I downloaded with Zotero from EBSCO):

    04/10/2013  12:20 AM    159,224    Vorontsov - 2012 - The Korea-Russia Gas Pipeline Project Past, Pres.pdf
    04/10/2013  12:12 AM    3,885,056  Walker - 2012 - Asia competes for energy security.pdf
    04/10/2013  12:45 AM    66,195     Whitmill - 2012 - Is UK Energy Policy Driving Energy Innovation - or.pdf
    04/10/2013  12:29 AM    2,208,367  Wietfeld - 2011 - Understanding Middle East Gas Exporting Behavior.pdf
    04/10/2013  12:59 AM    3,011,185  Wiseman - 2011 - Expanding Regional Renewable Governance.pdf
    04/10/2013  12:38 AM    180,692    Woudhuysen - 2012 - Innovation in Energy Expressions of a Crisis, and.pdf
    04/10/2013  12:49 AM    229,991    Yergin - 2012 - How Is Energy Remaking the World.pdf
    04/10/2013  12:40 AM    3,397,328  Young - 2012 - Industrial Gases. (cover story).pdf
    04/10/2013  01:36 AM    73,125     Zimmerer - 2011 - New Geographies of Energy Introduction to the Spe.pdf

... and so on, all together some 300 articles. But then when I looked in Solr, I saw the following:

    04:34:41 SEVERE SolrCore org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)
    04:34:41 SEVERE SolrCore org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)

... and a lot more of those. I'd like to think I made SOME progress, but it also seems like I'm still not close to being there. Any suggestions from the experts here on what I am doing wrong? Thanks!

-Stephan

--
View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054920.html
Re: Pushing a whole set of pdf-files to solr
On 10 April 2013 07:28, sdspieg wrote:
> I am able to run the "java -jar post.jar -help" command which I found here:
> http://docs.lucidworks.com/display/solr/Running+Solr. But now how can I tell
> post to post all pdf files in a certain folder (preferably recursively) to a
> collection? Could anybody please post the exact command for that?
[...]

There are two options:
* I am not familiar with Microsoft Windows, but writing some kind of a batch script that recurses down a directory, and posts files to Solr should be easy.
* One could use the Solr DataImportHandler with FileDataSource to handle the filesystem traversal, and TikaEntityProcessor to handle the indexing of rich content. Please see:
  http://wiki.apache.org/solr/DataImportHandler
  http://wiki.apache.org/solr/TikaEntityProcessor

Regards,
Gora
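The first option above (a script that recurses down a directory) could be sketched in Python instead of a Windows batch file, which keeps it portable. This is only the traversal half, under the assumption that each PDF gets a document id derived from its path relative to the starting folder; the function name and the id scheme are made up for this example and are not something the thread prescribes.

```python
import os


def find_pdfs(root):
    """Yield (path, doc_id) for every .pdf file under `root`, recursively.

    doc_id is the path relative to `root` with forward slashes, so the
    same file gets the same id no matter where the script is run from.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                doc_id = os.path.relpath(path, root).replace(os.sep, "/")
                yield path, doc_id


if __name__ == "__main__":
    import sys

    # Dry run: list what would be posted, without contacting Solr.
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, doc_id in find_pdfs(start):
        print(doc_id, "<-", path)
```

Because the walk is recursive, there is no need to flatten the folders first; each (path, doc_id) pair can then be handed to whatever posting method is used.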
Re: Pushing a whole set of pdf-files to solr
I am able to run the "java -jar post.jar -help" command which I found here: http://docs.lucidworks.com/display/solr/Running+Solr.

But now how can I tell post to post all pdf files in a certain folder (preferably recursively) to a collection? Could anybody please post the exact command for that?

--
View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054916.html
Re: Pushing a whole set of pdf-files to solr
Thanks for those replies. I will look into them. But if anyone knows of a site that describes step by step how a Windows user who has already installed Solr (and Tomcat) can easily feed a folder (and subfolders) with 100s of PDFs into Solr, or would be willing to write down those steps, I would really appreciate the reference. And I bet you there are lots of people like me...

--
View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054915.html
Re: Pushing a whole set of pdf-files to solr
The newer release of SimplePostTool with Solr 4.x makes it easy to post PDF files from a directory, including automatically adding the file name to a field. But SolrCell is the direct API, which the tool uses as well.

-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI
Sent: Tuesday, April 09, 2013 6:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Pushing a whole set of pdf-files to solr

[...]
Re: Pushing a whole set of pdf-files to solr
Apache Solr 4 Cookbook says that:

    curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@cookbook.pdf"

Is that what you want?

2013/4/10 sdspieg
> If anybody could still help me out with this, I'd really appreciate it.
> Thanks!
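For anyone without curl on Windows, the same single-file request can be approximated with the Python standard library. This is a sketch under a few assumptions: it sends the PDF as the raw request body with an application/pdf content type rather than as a multipart form upload like curl -F, and the base URL and function names are placeholders invented for this example. The update/extract path and the literal.id and commit parameters are the ones from the curl command above.

```python
import urllib.parse
import urllib.request


def extract_url(solr_base, doc_id, commit=True):
    """Build the /update/extract URL with a URL-encoded literal.id."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return solr_base.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def post_pdf(solr_base, path, doc_id):
    """POST one PDF to Solr and return the HTTP status code.

    Assumption: the extracting handler is enabled on this Solr instance
    and accepts a raw application/pdf request body.
    """
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        extract_url(solr_base, doc_id),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Dry run: show the URL only; call post_pdf(...) to actually send a file.
    print(extract_url("http://localhost:8983/solr", "cookbook.pdf"))
```

Encoding the id with urlencode matters here because file names with spaces or parentheses, such as the article titles in this thread, would otherwise produce a broken URL.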
Re: Pushing a whole set of pdf-files to solr
If anybody could still help me out with this, I'd really appreciate it. Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054885.html