Hi Grant,

I tried the command-line tool from Tika 0.7 (the newest release), and it parsed the file. I believe Solr 1.4 ships with Tika 0.4. Do you suggest upgrading to the new Tika? Can I upgrade only Tika within Solr 1.4, or do I need to wait until Solr ships with a newer Tika?

Thanks.
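For context on the root cause in the trace quoted below: the innermost exception, java.util.zip.ZipException: incorrect header check, is raised by the JDK's inflater when the bytes it is given do not begin with a valid zlib header. The following is only a minimal sketch of that JDK behaviour using java.util.zip alone; the class and method names (HeaderCheckDemo, deflate, inflateError) are made up for illustration and are not part of PDFBox, Tika, or Solr, and it does not reproduce the actual PDF stream that failed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical demo class, not taken from PDFBox or Tika.
public class HeaderCheckDemo {

    /** Compresses raw bytes into a zlib-wrapped deflate stream. */
    public static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(raw);
        }
        return buf.toByteArray();
    }

    /** Returns null if the data inflates cleanly, otherwise the ZipException message. */
    public static String inflateError(byte[] data) throws IOException {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[1024];
            while (in.read(chunk) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "stream content".getBytes(StandardCharsets.UTF_8);
        // Properly compressed data inflates without error.
        System.out.println("valid zlib data: " + inflateError(deflate(text)));
        // Uncompressed bytes fed to the inflater fail the zlib header check.
        System.out.println("plain bytes:     " + inflateError(text));
    }
}
```

If this is indeed what happens inside FlateFilter.decode, it would be consistent with a PDF stream whose data does not look like zlib to the PDFBox version bundled with Solr 1.4, even though newer code reads it fine.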
On Sun, Apr 18, 2010 at 11:24 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Can you extract content from this using Tika's standalone command line
> tool? PDF's are notorious for problems in extracting. To me, it looks like
> a bug in PDFBox. I would try to isolate it down to there and then send, if
> possible, the sample document to PDFBox and see if they can come up w/ a
> fix.
>
> -Grant
>
> On Apr 18, 2010, at 1:12 PM, pk wrote:
>
> > Hi,
> > while posting a sample pdf (that comes with Solr dist'n) to solr, i'm
> > getting a TikaException.
> > Using Solr-1.4, SolrJ (StreamingUpdateSolrServer) for posting pdf to solr.
> > Other sample pdfs can be parsed and indexed successfully.. I;m getting same
> > error with some other pdfs also (but adobe reader can open them fine, so i
> > dont think they have an issue in formatting or are corrupt etc)... Here is
> > the trace...
> >
> > ........
> > found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
> > Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 640
> > Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
> >     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
> >     at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
> >     at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
> >     at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
> >     at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
> >     at java.lang.Thread.run(Thread.java:595)
> > Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
> >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >     ... 20 more
> > Caused by: java.util.zip.ZipException: incorrect header check
> >     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
> >     at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
> >     at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
> >     at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
> >     at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
> >     at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
> >     at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
> >     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
> >     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
> >     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
> >     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
> >     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
> >     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
> >     ... 24 more
> >
> > Apr 18, 2010 10:31:34 PM org.apache.solr.core.SolrCore execute
> > INFO: [] webapp=/solr path=/update/extract params={wt=javabin&waitFlush=true&literal.indexDate=2010-04-18+&commit=true&waitSearcher=true&version=1&literal.id=C%253A%255Csolr_1.4.0%255Cdocs%255CInstalling%2BSolr%2Bin%2BTomcat.pdf} status=500 QTime=640
> > Exception in handling an uplaoded file:C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :
> > Internal Server Error
> >
> > Internal Server Error
> >
> > request: http://localhost:8080/solr/update/extract?literal.id=C%3A%5Csolr_1.4.0%5Cdocs%5CInstalling+Solr+in+Tomcat.pdf&literal.indexDate=2010-04-18 &commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=1
> > org.apache.solr.common.SolrException: Internal Server Error
> >
> > Internal Server Error
> >
> > request: http://localhost:8080/solr/update/extract?literal.id=C%3A%5Csolr_1.4.0%5Cdocs%5CInstalling+Solr+in+Tomcat.pdf&literal.indexDate=2010-04-18 &commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=1
> >     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
> >     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
> >     at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:170)
> >     at com.solr.common.Util.submitToSolr(Util.java:100)
> >     at com.solr.search.admin.UploadedFilePostHandler.handleUpload(UploadedFilePostHandler.java:62)
> >     at org.apache.jsp.admin.post_jsp._jspService(post_jsp.java:71)
> >     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
> >     at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
> >     at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:331)
> >     at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
> >     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
> >     at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
> >     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
> >     at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
> >     at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
> >     at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
> >     at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
> >     at java.lang.Thread.run(Thread.java:595)
> >
> >
> > Any help/hint could be useful..
> > thanks..
> > --
> > View this message in context: http://n3.nabble.com/Solr-throws-TikaException-while-parsing-sample-PDF-tp728016p728016.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search