Re: Solr throws TikaException while parsing sample PDF

2010-04-21 Thread Praveen Agrawal
Can somebody please guide me here?


On Tue, Apr 20, 2010 at 10:53 AM, Praveen Agrawal wrote:

> I'm using the Solr 1.4 distribution with Solr Cell. Can I upgrade only Tika
> in the Solr 1.4 distribution? If yes, is there a guide?
> Thanks.
>
>
>
> On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi wrote:
>
>> Praveen Agrawal wrote:
>>
>>> Hi Grant,
>>> I tried the command-line tool of Tika 0.7 (the newest), and it parsed the
>>> file. I believe Solr 1.4 ships with Tika 0.4.
>>> Do you suggest upgrading to the new Tika? Can I upgrade only Tika in
>>> Solr 1.4, or do I need to wait until Solr ships with a new Tika?
>>> Thanks.
>>>
>>>
>> Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI.
>>
>> Koji
>>
>> --
>> http://www.rondhuit.com/en/
>>
>>
>


Re: Solr throws TikaException while parsing sample PDF

2010-04-19 Thread Praveen Agrawal
I'm using the Solr 1.4 distribution with Solr Cell. Can I upgrade only Tika
in the Solr 1.4 distribution? If yes, is there a guide?
Thanks.


On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi wrote:

> Praveen Agrawal wrote:
>
>> Hi Grant,
>> I tried the command-line tool of Tika 0.7 (the newest), and it parsed the
>> file. I believe Solr 1.4 ships with Tika 0.4.
>> Do you suggest upgrading to the new Tika? Can I upgrade only Tika in
>> Solr 1.4, or do I need to wait until Solr ships with a new Tika?
>> Thanks.
>>
>>
> Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI.
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>
>


Re: Solr throws TikaException while parsing sample PDF

2010-04-19 Thread Koji Sekiguchi

Praveen Agrawal wrote:

> Hi Grant,
> I tried the command-line tool of Tika 0.7 (the newest), and it parsed the
> file. I believe Solr 1.4 ships with Tika 0.4.
> Do you suggest upgrading to the new Tika? Can I upgrade only Tika in
> Solr 1.4, or do I need to wait until Solr ships with a new Tika?
> Thanks.

Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI.

Koji

--
http://www.rondhuit.com/en/



Re: Solr throws TikaException while parsing sample PDF

2010-04-19 Thread Praveen Agrawal
Hi Grant,
I tried the command-line tool of Tika 0.7 (the newest), and it parsed the
file. I believe Solr 1.4 ships with Tika 0.4.
Do you suggest upgrading to the new Tika? Can I upgrade only Tika in
Solr 1.4, or do I need to wait until Solr ships with a new Tika?
Thanks.


On Sun, Apr 18, 2010 at 11:24 PM, Grant Ingersoll wrote:

> Can you extract content from this using Tika's standalone command line
> tool?  PDFs are notorious for extraction problems.  To me, it looks like
> a bug in PDFBox.  I would try to isolate it down to there and then send, if
> possible, the sample document to PDFBox and see if they can come up with a
> fix.
>
> -Grant

Re: Solr throws TikaException while parsing sample PDF

2010-04-18 Thread Grant Ingersoll
Can you extract content from this using Tika's standalone command line tool?
PDFs are notorious for extraction problems.  To me, it looks like a bug in
PDFBox.  I would try to isolate it down to there and then send, if possible,
the sample document to PDFBox and see if they can come up with a fix.

-Grant
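
One way to see where a wrapped failure like this originates is to walk the
exception's cause chain down to the innermost exception and read its stack
frames. The following is only a minimal sketch in plain Java: RootCause and
rootCause are hypothetical names for this illustration, and ordinary
RuntimeExceptions stand in for SolrException and TikaException so that no
Solr or Tika jars are needed.

```java
import java.util.zip.ZipException;

final class RootCause {
    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in chain shaped like the one in the trace: an outer wrapper,
        // a middle "Unable to extract PDF content", and an inner ZipException.
        Throwable inner = new ZipException("incorrect header check");
        Throwable middle = new RuntimeException("Unable to extract PDF content", inner);
        Throwable outer = new RuntimeException(middle);

        Throwable root = rootCause(outer);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
        // The frames of the root cause show which code raised it.
        for (StackTraceElement frame : root.getStackTrace()) {
            System.out.println("    at " + frame);
        }
    }
}
```

In the trace quoted below, the frames under the innermost "Caused by" are in
java.util.zip and org.pdfbox, which is consistent with looking at PDFBox
first.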

On Apr 18, 2010, at 1:12 PM, pk wrote:

> 
> Hi,
> while posting a sample PDF (one that comes with the Solr distribution) to
> Solr, I'm getting a TikaException.
> I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) to post the PDF to
> Solr. Other sample PDFs are parsed and indexed successfully. I'm getting the
> same error with some other PDFs too (Adobe Reader opens them fine, so I
> don't think they are badly formatted or corrupt). Here is
> the trace:
> 
> 
> found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
> Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {} 0 640
> Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
>     at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>     at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>     at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>     at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>     at java.lang.Thread.run(Thread.java:595)
> Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>     ... 20 more
> Caused by: java.util.zip.ZipException: incorrect header check
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
>     at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
>     at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
>     at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
>     at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
>     at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
>     at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>
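
The innermost cause in the trace, java.util.zip.ZipException: incorrect
header check, is raised through InflaterInputStream when bytes that are
supposed to be zlib (Flate) data do not begin with a valid zlib header. The
sketch below reproduces that class of failure with java.util.zip alone,
outside Solr, Tika, and PDFBox. FlateProbe, probe, and deflate are
hypothetical names for this illustration, and the sample bytes are made up,
not taken from the PDF in question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

final class FlateProbe {
    /** Returns null if the bytes inflate cleanly, else the ZipException message. */
    static String probe(byte[] data) throws IOException {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        }
    }

    /** Compresses bytes into a zlib stream, the format a Flate filter expects. */
    static byte[] deflate(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream zlib = new DeflaterOutputStream(out)) {
            zlib.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up content; any short ASCII text would do.
        byte[] plain = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        System.out.println("valid zlib stream -> " + probe(deflate(plain)));
        System.out.println("uncompressed bytes -> " + probe(plain));
    }
}
```

This only illustrates how the exception arises. It does not say whether the
PDF itself or the PDFBox stream handling is at fault, which is why sending
the sample document to the PDFBox project remains the useful next step.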