Can you extract content from this PDF using Tika's standalone command-line tool?
PDFs are notoriously hard to extract text from. To me, this looks like a bug in
PDFBox: the innermost exception in your trace is thrown from PDFBox's
FlateFilter. I would try to isolate it to PDFBox and then, if possible, send
the sample document to the PDFBox project to see if they can come up with a fix.
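
For reference, the innermost "incorrect header check" is what the JDK's
InflaterInputStream reports when the bytes it is handed do not start with a
valid zlib header. Below is a minimal sketch of that failure in isolation,
using only java.util.zip. The class name FlateHeaderCheck and the sample bytes
are made up for illustration; it does not call PDFBox or Tika and says nothing
about why the stream in this particular PDF is rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical helper class, not part of Solr, Tika, or PDFBox.
public class FlateHeaderCheck {

    /** Returns true if the bytes inflate as a zlib stream, false if a ZipException is thrown. */
    public static boolean inflatesCleanly(byte[] data) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            // Drain the stream; invalid zlib data fails on the first read.
            while (in.read(buf) != -1) {
                // nothing to do with the decoded bytes here
            }
            return true;
        } catch (ZipException e) {
            System.out.println("ZipException: " + e.getMessage());
            return false;
        }
    }

    /** Compresses raw bytes into a zlib stream (header + deflate data + checksum). */
    public static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content; any bytes would do.
        byte[] text = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);

        // A properly deflated stream inflates without error.
        System.out.println(inflatesCleanly(deflate(text)));

        // Uncompressed bytes passed off as a zlib stream are rejected.
        System.out.println(inflatesCleanly(text));
    }
}
```

If the standalone Tika run fails the same way, that points at the stream data
or at how PDFBox hands it to the inflater, rather than at Solr.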

-Grant

On Apr 18, 2010, at 1:12 PM, pk wrote:

> 
> Hi,
> While posting a sample PDF (one that ships with the Solr distribution) to
> Solr, I'm getting a TikaException.
> I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) to post the PDF.
> Other sample PDFs are parsed and indexed successfully. I'm getting the same
> error with some other PDFs as well (Adobe Reader opens them fine, so I
> don't think they are malformed or corrupt). Here is
> the trace:
> 
> ........
> found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf ::
> size=286242
> Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {} 0 640
> Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
>        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>        at java.lang.Thread.run(Thread.java:595)
> Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
>        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>        ... 20 more
> Caused by: java.util.zip.ZipException: incorrect header check
>        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
>        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
>        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
>        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
>        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
>        at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
>        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
>        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
>        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>        ... 24 more
> 
> Apr 18, 2010 10:31:34 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract
> params={wt=javabin&waitFlush=true&literal.indexDate=2010-04-18+&commit=true&waitSearcher=true&version=1&literal.id=C%253A%255Csolr_1.4.0%255Cdocs%255CInstalling%2BSolr%2Bin%2BTomcat.pdf}
> status=500 QTime=640
> Exception in handling an uplaoded file:C:\solr_1.4.0\docs\Installing Solr in
> Tomcat.pdf :
> Internal Server Error
> 
> Internal Server Error
> 
> request:
> http://localhost:8080/solr/update/extract?literal.id=C%3A%5Csolr_1.4.0%5Cdocs%5CI
> nstalling+Solr+in+Tomcat.pdf&literal.indexDate=2010-04-18
> &commit=true&waitFlush=true&wait
> Searcher=true&wt=javabin&version=1
> org.apache.solr.common.SolrException: Internal Server Error
> 
> Internal Server Error
> 
> request:
> http://localhost:8080/solr/update/extract?literal.id=C%3A%5Csolr_1.4.0%5Cdocs%5CI
> nstalling+Solr+in+Tomcat.pdf&literal.indexDate=2010-04-18
> &commit=true&waitFlush=true&wait
> Searcher=true&wt=javabin&version=1
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>        at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:170)
>        at com.solr.common.Util.submitToSolr(Util.java:100)
>        at com.solr.search.admin.UploadedFilePostHandler.handleUpload(UploadedFilePostHandler.java:62)
>        at org.apache.jsp.admin.post_jsp._jspService(post_jsp.java:71)
>        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
>        at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
>        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:331)
>        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
>        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
>        at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
>        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
>        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
>        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>        at java.lang.Thread.run(Thread.java:595)
> 
> 
> Any help or hints would be appreciated.
> Thanks.
> -- 
> View this message in context: 
> http://n3.nabble.com/Solr-throws-TikaException-while-parsing-sample-PDF-tp728016p728016.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
