Re: Solr throws TikaException while parsing sample PDF
Can somebody please guide me here?

On Tue, Apr 20, 2010 at 10:53 AM, Praveen Agrawal wrote:
> I'm using the Solr 1.4 distribution with Solr Cell. Can I update only
> Tika to the new version in the Solr 1.4 distribution? If yes, is there
> a guide?
> Thanks.
>
> On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi wrote:
>> Praveen Agrawal wrote:
>>> Hi Grant,
>>> I tried the command-line tool of Tika 0.7 (the newest version), and it
>>> parsed the file. I believe Solr 1.4 contains Tika 0.4.
>>> Do you suggest upgrading to the new Tika? Can I upgrade only Tika in
>>> Solr 1.4, or do I need to wait until Solr ships with the new Tika?
>>> Thanks.
>>
>> Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI.
>>
>> Koji
>>
>> --
>> http://www.rondhuit.com/en/
Re: Solr throws TikaException while parsing sample PDF
I'm using the Solr 1.4 distribution with Solr Cell. Can I update only Tika to the new version in the Solr 1.4 distribution? If yes, is there a guide?
Thanks.

On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi wrote:
> Praveen Agrawal wrote:
>> Hi Grant,
>> I tried the command-line tool of Tika 0.7 (the newest version), and it
>> parsed the file. I believe Solr 1.4 contains Tika 0.4.
>> Do you suggest upgrading to the new Tika? Can I upgrade only Tika in
>> Solr 1.4, or do I need to wait until Solr ships with the new Tika?
>> Thanks.
>
> Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI.
>
> Koji
>
> --
> http://www.rondhuit.com/en/
Re: Solr throws TikaException while parsing sample PDF
Praveen Agrawal wrote:
> Hi Grant,
> I tried the command-line tool of Tika 0.7 (the newest version), and it
> parsed the file. I believe Solr 1.4 contains Tika 0.4.
> Do you suggest upgrading to the new Tika? Can I upgrade only Tika in
> Solr 1.4, or do I need to wait until Solr ships with the new Tika?
> Thanks.

Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI.

Koji

--
http://www.rondhuit.com/en/
Re: Solr throws TikaException while parsing sample PDF
Hi Grant,
I tried the command-line tool of Tika 0.7 (the newest version), and it parsed the file. I believe Solr 1.4 contains Tika 0.4.
Do you suggest upgrading to the new Tika? Can I upgrade only Tika in Solr 1.4, or do I need to wait until Solr ships with the new Tika?
Thanks.

On Sun, Apr 18, 2010 at 11:24 PM, Grant Ingersoll wrote:
> Can you extract content from this using Tika's standalone command-line
> tool? PDFs are notorious for extraction problems. To me, it looks like
> a bug in PDFBox. I would try to isolate it down to there and then send,
> if possible, the sample document to PDFBox and see if they can come up
> with a fix.
>
> -Grant
>
> On Apr 18, 2010, at 1:12 PM, pk wrote:
>> Hi,
>> While posting a sample PDF (one that comes with the Solr distribution)
>> to Solr, I'm getting a TikaException.
>> I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) to post the
>> PDF to Solr.
>> Other sample PDFs can be parsed and indexed successfully. I'm getting
>> the same error with some other PDFs as well (but Adobe Reader can open
>> them fine, so I don't think they are badly formatted or corrupt).
>> Here is the trace...
Re: Solr throws TikaException while parsing sample PDF
Can you extract content from this using Tika's standalone command-line tool? PDFs are notorious for extraction problems. To me, it looks like a bug in PDFBox. I would try to isolate it down to there and then send, if possible, the sample document to PDFBox and see if they can come up with a fix.

-Grant

On Apr 18, 2010, at 1:12 PM, pk wrote:
> Hi,
> While posting a sample PDF (one that comes with the Solr distribution)
> to Solr, I'm getting a TikaException.
> I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) to post the
> PDF to Solr.
> Other sample PDFs can be parsed and indexed successfully. I'm getting
> the same error with some other PDFs as well (but Adobe Reader can open
> them fine, so I don't think they are badly formatted or corrupt).
> Here is the trace...
>
> found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
> Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {} 0 640
> Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
>     at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>     at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>     at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>     at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>     at java.lang.Thread.run(Thread.java:595)
> Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>     ... 20 more
> Caused by: java.util.zip.ZipException: incorrect header check
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
>     at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
>     at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
>     at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
>     at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
>     at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
>     at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
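
The innermost cause in the trace is a java.util.zip.ZipException raised from InflaterInputStream.read, called by PDFBox's FlateFilter. As a rough illustration of that failure mode, and not a reproduction of the PDFBox code path, the sketch below uses only the JDK to show that InflaterInputStream accepts a well-formed zlib stream but throws a ZipException when the bytes it is handed do not begin with a valid zlib header. The class name FlateHeaderCheck and its helper methods are hypothetical names chosen for this example. Why PDFBox received such bytes for this particular PDF is not something the sketch can answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical helper class for illustration only; not part of Solr, Tika, or PDFBox.
class FlateHeaderCheck {

    // Compress bytes into a zlib stream, the format a Flate-encoded stream is expected to hold.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress with InflaterInputStream, the same JDK class that appears in the trace.
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Returns null when the data inflates cleanly, otherwise the exception message.
    static String tryInflate(byte[] data) {
        try {
            inflate(data);
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        byte[] good = deflate("hello".getBytes(StandardCharsets.UTF_8));
        byte[] bad = "not zlib data".getBytes(StandardCharsets.UTF_8);

        System.out.println("valid stream error:   " + tryInflate(good));
        System.out.println("invalid stream error: " + tryInflate(bad));
    }
}
```

If Tika's standalone tool parses the same file without error, as reported above for Tika 0.7, that points to the PDF parsing code bundled with Solr 1.4 rather than to the file itself.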