Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi all,

I tried to upgrade Tika 0.8 to Tika 0.10 on Solr 3.3.0, following similar steps, but it failed.

1. Replace the jars in /contrib/extraction/ with the following: fontbox-1.6.0, jempbox-1.6.0, pdfbox-1.6.0, tika-core-0.10, tika-parsers-0.10.
2. Copy all the jars in /contrib/langid/* from Solr 3.5.0.
3. Copy /dist/apache-solr-langid-3.5.0 from Solr 3.5.0.
4. Configure solrconfig.xml in Solr 3.3.0, adding the following lib entries and the definition of the updateRequestProcessorChain:

    <lib dir="../../contrib/langid/lib" />
    <lib dir="../../dist/" regex="apache-solr-langid-\d.*\.jar" />

    <updateRequestProcessorChain name="langid">
      <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
        <str name="langid.fl">text,title,author</str>
        <str name="langid.langField">language_s</str>
        <str name="langid.fallback">en</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

Errors (typical errors when a factory is not found):

    org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389)
        at ...

Has anyone tried something similar before? Please advise. Thank you.

Best Regards,
Bing

--
View this message in context: http://lucene.472066.n3.nabble.com/upgrading-to-Tika-0-9-on-Solr-1-4-1-tp2570526p3772177.html
Sent from the Solr - User mailing list archive at Nabble.com.
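An "Error loading class" message like the one above usually means that no jar visible to Solr contains the named class. A minimal sketch of one way to check this is to scan the configured lib directories for the class entry. The function name find_class_in_jars is hypothetical, and the directories are assumptions copied from the lib entries in the message, resolved relative to the Solr instance directory.

```python
import zipfile
from pathlib import Path


def find_class_in_jars(class_name, lib_dirs):
    """Return the paths of the jars under lib_dirs that contain class_name."""
    # A class a.b.C is stored in a jar (a zip archive) as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for lib_dir in lib_dirs:
        for jar in sorted(Path(lib_dir).glob("*.jar")):
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(str(jar))
    return hits


if __name__ == "__main__":
    factory = ("org.apache.solr.update.processor."
               "LangDetectLanguageIdentifierUpdateProcessorFactory")
    # Assumed locations, taken from the <lib> entries in solrconfig.xml.
    for hit in find_class_in_jars(factory, ["../../contrib/langid/lib", "../../dist"]):
        print(hit)
```

If the scan prints nothing, the class is not in any of the scanned jars, and it would be worth checking whether the copied langid jar actually ships that factory.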
Re: upgrading to Tika 0.9 on Solr 1.4.1
I have upgraded my Solr distribution to 3.2 and also the jars referenced by my application (in particular, the Solr jar in my application, which calls Solr, was still 1.4.1, hence the javabin exception). I also updated pdfbox/jempbox/fontbox to the latest versions and Tika to 0.9, which fixed things for me!

-- Surendranadh
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Chris, Andreas,

I have upgraded to Solr 3.2, and everything seems fine now. I will have to integrate this into my application and watch for any further issues. Again, thanks for your patience and time.

--Surendra
Re: upgrading to Tika 0.9 on Solr 1.4.1
Glad it worked out!

Cheers,
Chris

On Jun 22, 2011, at 5:14 AM, Surendra wrote:

> Hi Chris, Andreas,
> I have upgraded to Solr 3.2, and everything seems fine now. I will have to integrate this into my application and watch for any further issues. Again, thanks for your patience and time.
> --Surendra

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Chris,

I did a proper checkout of Tika 0.9, built the jars as described at http://tika.apache.org/0.9/gettingstarted.html, and replaced the existing Tika 0.4 jars with the 0.9 jars. I don't see any difference. The documents are getting indexed, but the fmap.content field (attr_content) is still not available to me. Am I missing something? In the meantime, I'm digging further into this issue. Any further help would be great!

Thanks for your time.

-- Surendra
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Andreas,

I tried Solr 3.1 as well as 3.2, and I was not able to overcome these issues with the newer versions either. I need the query attr_content:* to return results (with 1.4.1 it does), which is not happening. Indexing works well in 3.1, but in 3.2 I get the following error:

    Invalid version or the data in not in 'javabin' format

--Surendra
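javabin is the binary response format that SolrJ clients use by default, so one way to check whether attr_content is populated, independently of any SolrJ client, might be to query Solr over plain HTTP and ask for a JSON response. The sketch below only builds such a URL. The helper name select_url, the base URL, and the presence of an id field are assumptions, not details from this thread.

```python
from urllib.parse import urlencode


def select_url(base_url, field, rows=10):
    """Build a /select URL matching every document that has a value in `field`."""
    params = {
        "q": f"{field}:*",       # e.g. attr_content:*
        "fl": f"id,{field}",     # assumes the schema has an `id` field
        "rows": rows,
        "wt": "json",            # JSON response writer instead of javabin
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed base URL of a local Solr instance.
    print(select_url("http://localhost:8983/solr", "attr_content"))
```

Opening the printed URL in a browser or with curl shows whether the server itself returns attr_content, which helps separate indexing problems from client-side version mismatches.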
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Surendra,

Thanks. Besides replacing the tika-*-0.9.jar files, you also need to replace the jar files of the other libraries Tika depends on, since they have been upgraded as well. It's also possible that, because of API changes, Solr 1.4.1 won't work with Tika 0.9 without modifying the ExtractingRequestHandler code...

Cheers,
Chris

On Jun 21, 2011, at 12:28 AM, Surendra wrote:

> Hi Chris,
> I did a proper checkout of Tika 0.9, built the jars as described at http://tika.apache.org/0.9/gettingstarted.html, and replaced the existing Tika 0.4 jars with the 0.9 jars. I don't see any difference. The documents are getting indexed, but the fmap.content field (attr_content) is still not available to me. Am I missing something? In the meantime, I'm digging further into this issue. Any further help would be great!
> Thanks for your time.
> -- Surendra
Re: upgrading to Tika 0.9 on Solr 1.4.1
We are successfully extracting PDF content with Solr 3.1 and Tika 0.9.

Replace

    fontbox-1.3.1.jar
    jempbox-1.3.1.jar
    pdfbox-1.3.1.jar
    tika-core-0.8.jar
    tika-parsers-0.8.jar

with

    fontbox-1.4.0.jar
    jempbox-1.4.0.jar
    pdfbox-1.4.0.jar
    tika-core-0.9.jar
    tika-parsers-0.9.jar

I'm not entirely certain whether a recompile of Solr was necessary.

Andreas

From: Surendra csnsha...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tue, June 21, 2011 5:18:31 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

> Hi Andreas,
> I tried Solr 3.1 as well as 3.2, and I was not able to overcome these issues with the newer versions either. I need the query attr_content:* to return results (with 1.4.1 it does), which is not happening. Indexing works well in 3.1, but in 3.2 I get the following error:
> Invalid version or the data in not in 'javabin' format
> --Surendra
Re: upgrading to Tika 0.9 on Solr 1.4.1
Mattmann, Chris A (388J) chris.a.mattmann at jpl.nasa.gov writes:

> Hi Jo,
> You may consider checking out Tika trunk, where a Tika JAX-RS web service [1] was recently committed as part of the tika-server module. You could probably wire DIH into it and accomplish the same thing.
> Cheers, Chris
> [1] https://issues.apache.org/jira/browse/TIKA-593

Hey Chris,

I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib) after building them from the source provided by Tika, and now I have an issue. I am extracting PDF content using Solr. I have mapped fmap.content to attr_content in the configurable params, so that I can see the entire extracted document. After the Tika update, attr_content no longer appears in the search results. When I restore the old Tika 0.4 jars, attr_content appears again. I didn't see any exceptions in the console. Is this a known behavior that someone has already faced? Can you guide me in resolving it?

-- Surendra
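For readers unfamiliar with the fmap.content mapping described above, here is a rough sketch of how such an extraction request could be assembled. The parameter names (literal.id, fmap.content, uprefix, commit) are ExtractingRequestHandler parameters documented on the Solr wiki. The helper name extract_url, the /update/extract handler path, the base URL, and the document id are assumptions, not details from this thread.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, content_field="attr_content"):
    """Build an ExtractingRequestHandler URL that maps Tika's content field."""
    params = {
        "literal.id": doc_id,           # unique key for the uploaded document
        "fmap.content": content_field,  # rename Tika's `content` field
        "uprefix": "attr_",             # prefix for fields unknown to the schema
        "commit": "true",
    }
    # Assumes the handler is registered at /update/extract in solrconfig.xml.
    return f"{base_url}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(extract_url("http://localhost:8983/solr", "doc1"))
```

The PDF itself would then be sent as the body of a POST to this URL. Whether attr_content shows up in search results also depends on the target field being stored in the schema, which is worth confirming when the field goes missing after a jar swap.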
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Surendra,

On Jun 20, 2011, at 4:59 AM, Surendra wrote:

> Hey Chris,
> I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib) after building them from the source provided by Tika, and now I have an issue. I am extracting PDF content using Solr. I have mapped fmap.content to attr_content in the configurable params, so that I can see the entire extracted document. After the Tika update, attr_content no longer appears in the search results. When I restore the old Tika 0.4 jars, attr_content appears again. I didn't see any exceptions in the console. Is this a known behavior that someone has already faced? Can you guide me in resolving it?

I don't think you can simply add the new tika-core-0.9 and tika-parsers-0.9 jars to extraction/lib; I think you'll need to replace the set of prior Tika jars in there. Have a look here to see which jars you would need to replace, HTH: http://tika.apache.org/0.9/gettingstarted.html

Cheers,
Chris
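The add-versus-replace point above can be checked mechanically: if old and new versions of the same jar sit side by side in extraction/lib, the upgrade was an add rather than a replace. A small sketch under the assumption that jars follow the usual artifact-version.jar naming; the function name conflicting_jars and the sample file list are illustrative only.

```python
import re
from collections import defaultdict

# Matches names such as "tika-core-0.9.jar" -> ("tika-core", "0.9").
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.-]*?)\.jar$")


def conflicting_jars(filenames):
    """Map each artifact present in more than one version to its sorted versions."""
    versions = defaultdict(set)
    for name in filenames:
        match = JAR_NAME.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


if __name__ == "__main__":
    # Hypothetical directory listing after adding, not replacing, tika-core.
    lib = ["tika-core-0.4.jar", "tika-core-0.9.jar",
           "tika-parsers-0.9.jar", "pdfbox-1.4.0.jar"]
    print(conflicting_jars(lib))  # {'tika-core': ['0.4', '0.9']}
```

Feeding it the output of os.listdir on the extraction lib directory would list any artifacts that still have a stale version next to the new one.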
Re: upgrading to Tika 0.9 on Solr 1.4.1
I've unsuccessfully attempted to go down this road. There are API changes, some of which I was able to solve by taking code snippets from Solr 3.1. Some extraction-related tests wouldn't pass (look for 'Solr 1.4.1 and Tika 0.9 - some tests not passing' in the archive). Ultimately, I decided that the then newly released Solr 3.1 was the less rocky route. Not sure if that is an option for you.

Andreas

From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
To: solr-user@lucene.apache.org
Sent: Mon, June 20, 2011 7:18:34 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

> Hi Surendra,
> I don't think you can simply add the new tika-core-0.9 and tika-parsers-0.9 jars to extraction/lib; I think you'll need to replace the set of prior Tika jars in there. Have a look here to see which jars you would need to replace, HTH: http://tika.apache.org/0.9/gettingstarted.html
> Cheers, Chris
Re: upgrading to Tika 0.9 on Solr 1.4.1
Your best bet is perhaps upgrading to the latest 1.4 branch, i.e. 1.4.2-dev (http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/). It includes Tika 0.8-SNAPSHOT and is a compatible drop-in (war/jar) replacement with lots of other bug fixes you'd also like (check changes.txt).

    svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4
    cd branch-1.4
    ant dist

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24. feb. 2011, at 21.42, jo wrote:

> I have tried the steps indicated here: http://wiki.apache.org/solr/ExtractingRequestHandler, and when I try to parse a document nothing happens, no error. I have copied the jar files everywhere, and nothing. Can anyone give me the steps to upgrade just Tika? BTW, Solr 1.4.1 currently has Tika 0.4. Thank you.
Re: upgrading to Tika 0.9 on Solr 1.4.1
You don't want to use 0.8 if you're parsing PDF.

Your best bet is perhaps upgrading to the latest 1.4 branch, i.e. 1.4.2-dev (http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/). It includes Tika 0.8-SNAPSHOT and is a compatible drop-in (war/jar) replacement with lots of other bug fixes you'd also like (check changes.txt).

    svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4
    cd branch-1.4
    ant dist

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24. feb. 2011, at 21.42, jo wrote:

> I have tried the steps indicated here: http://wiki.apache.org/solr/ExtractingRequestHandler, and when I try to parse a document nothing happens, no error. I have copied the jar files everywhere, and nothing. Can anyone give me the steps to upgrade just Tika? BTW, Solr 1.4.1 currently has Tika 0.4. Thank you.
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi Jo,

You may consider checking out Tika trunk, where a Tika JAX-RS web service [1] was recently committed as part of the tika-server module. You could probably wire DIH into it and accomplish the same thing.

Cheers,
Chris

[1] https://issues.apache.org/jira/browse/TIKA-593

On Feb 24, 2011, at 12:42 PM, jo wrote:

> I have tried the steps indicated here: http://wiki.apache.org/solr/ExtractingRequestHandler, and when I try to parse a document nothing happens, no error. I have copied the jar files everywhere, and nothing. Can anyone give me the steps to upgrade just Tika? BTW, Solr 1.4.1 currently has Tika 0.4. Thank you.
Re: upgrading to Tika 0.9 on Solr 1.4.1
You guys are great. I will stick to the release version for now, and if I have problems parsing I will give the branch jars a try. The reason I am looking to upgrade Tika is that it keeps improving on things like languages, MIME type support, and so on.

Thanks again,
JO
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi, if you want to index PDF files, then use Tika 0.6, because 0.7 and 0.8 do not correctly detect the pdfParse
Re: upgrading to Tika 0.9 on Solr 1.4.1
According to the Tika release notes, it's fixed in 0.9. Haven't tried it myself.

    A critical backwards incompatible bug in PDF parsing that was introduced in Tika 0.8 has been fixed. (TIKA-548)

Andreas

From: Darx Oman darxo...@gmail.com
To: solr-user@lucene.apache.org
Sent: Fri, February 25, 2011 10:33:39 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

> Hi, if you want to index PDF files, then use Tika 0.6, because 0.7 and 0.8 do not correctly detect the pdfParse
Re: upgrading to Tika 0.9 on Solr 1.4.1
Yep, it's fixed in 0.9.

Cheers,
Chris

On Feb 25, 2011, at 2:37 PM, Andreas Kemkes wrote:

> According to the Tika release notes, it's fixed in 0.9. Haven't tried it myself.
> A critical backwards incompatible bug in PDF parsing that was introduced in Tika 0.8 has been fixed. (TIKA-548)
> Andreas