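Can you extract content from this file using Tika's standalone command-line tool? PDFs are notorious for extraction problems. To me, this looks like a bug in PDFBox: the root cause in your trace is a java.util.zip.ZipException ("incorrect header check") thrown from org.pdfbox.filter.FlateFilter, below both Solr and Tika. I would try to isolate it to PDFBox and then, if possible, send the sample document to the PDFBox project and see if they can come up with a fix.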
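For what it's worth, the innermost exception can be reproduced with nothing but the JDK, which may help when narrowing things down. The sketch below is only an illustration, not PDFBox's code: the class name FlateHeaderCheck and its helper methods are made up for this example. It compresses a few bytes into a zlib stream, then damages the first header byte and shows that java.util.zip.InflaterInputStream rejects the damaged copy with a ZipException, the same exception type that FlateFilter runs into in the trace.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical helper class, written for illustration only.
class FlateHeaderCheck {

    // Compress raw bytes into a zlib-wrapped deflate stream.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Inflate the whole stream. Returns null on success, or the
    // ZipException message if the data is not a valid zlib stream.
    static String inflateError(byte[] data) {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // discard the output; only validity matters here
            }
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] good = deflate(
            "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII));
        System.out.println("valid stream error:     " + inflateError(good));

        byte[] bad = good.clone();
        bad[0] = 0x00; // damage the first byte of the zlib header
        System.out.println("corrupted stream error: " + inflateError(bad));
    }
}
```

If the failing PDF's compressed stream could be dumped to a file, feeding those bytes to inflateError would show whether the JDK alone rejects them, independent of Solr and Tika.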
-Grant

On Apr 18, 2010, at 1:12 PM, pk wrote:

>
> Hi,
> while posting a sample pdf (that comes with Solr dist'n) to solr, i'm
> getting a TikaException.
> Using Solr-1.4, SolrJ (StreamingUpdateSolrServer) for posting pdf to solr.
> Other sample pdfs can be parsed and indexed successfully.. I'm getting same
> error with some other pdfs also (but adobe reader can open them fine, so i
> dont think they have an issue in formatting or are corrupt etc)... Here is
> the trace...
>
> ........
> found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
> Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {} 0 640
> Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
>         at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>         at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>         at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>         at java.lang.Thread.run(Thread.java:595)
> Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>         ... 20 more
> Caused by: java.util.zip.ZipException: incorrect header check
>         at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
>         at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
>         at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
>         at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
>         at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
>         at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
>         at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>         ... 24 more
>
> Apr 18, 2010 10:31:34 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract params={wt=javabin&waitFlush=true&literal.indexDate=2010-04-18+&commit=true&waitSearcher=true&version=1&literal.id=C%253A%255Csolr_1.4.0%255Cdocs%255CInstalling%2BSolr%2Bin%2BTomcat.pdf} status=500 QTime=640
> Exception in handling an uplaoded file:C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :
> Internal Server Error
>
> Internal Server Error
>
> request:
> http://localhost:8080/solr/update/extract?literal.id=C%3A%5Csolr_1.4.0%5Cdocs%5CInstalling+Solr+in+Tomcat.pdf&literal.indexDate=2010-04-18 &commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=1
> org.apache.solr.common.SolrException: Internal Server Error
>
> Internal Server Error
>
> request:
> http://localhost:8080/solr/update/extract?literal.id=C%3A%5Csolr_1.4.0%5Cdocs%5CInstalling+Solr+in+Tomcat.pdf&literal.indexDate=2010-04-18 &commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=1
>         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
>         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>         at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:170)
>         at com.solr.common.Util.submitToSolr(Util.java:100)
>         at com.solr.search.admin.UploadedFilePostHandler.handleUpload(UploadedFilePostHandler.java:62)
>         at org.apache.jsp.admin.post_jsp._jspService(post_jsp.java:71)
>         at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
>         at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:331)
>         at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
>         at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
>         at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>         at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>         at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>         at java.lang.Thread.run(Thread.java:595)
>
>
> Any help/hint could be useful..
> thanks..
> --
> View this message in context:
> http://n3.nabble.com/Solr-throws-TikaException-while-parsing-sample-PDF-tp728016p728016.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search