Re: ERROR tika.TikaParser org.apache.pdfbox.io.PushBackInputStream

Markus Jelsma Wed, 08 Sep 2010 03:08:40 -0700

This description fooled me once too. Hasn't it been patched yet? There is a
patch now [1], please commit.


[1]: https://issues.apache.org/jira/browse/NUTCH-900

On Wednesday 14 July 2010 07:10:47 Mattmann, Chris A (388J) wrote:
> No problem, Brad! If you'd like feel free to create an issue in Nutch JIRA
>  (http://issues.apache.org/jira/browse/NUTCH), and provide a documentation
>  patch spelling out the difference below. That would really help out!
> 
> To create a patch:
> 
> 
>  1.  svn co http://svn.apache.org/repos/asf/nutch/trunk ./nutch
>  2.  cd nutch
>  3.  edit conf/nutch-default.xml with your documentation update
>  4.  svn status - make sure that you see 1 file changed:
>      conf/nutch-default.xml
>  5.  svn diff > NUTCH-xxx.<your last name>.<yyMMdd>.patch.txt
>      where xxx is the issue # that JIRA creates
> 
> Then just attach the patch from #5 in JIRA and I'll get it committed to the
>  sources...
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> 
> 
> On 7/13/10 9:49 PM, "brad" <[email protected]> wrote:
> 
> Chris,
> Thank you for the help.  Based on what you said, I decided to install
> Tika 0.7 and try to parse the file using tika-app to see what happens.
> 
> java -jar tika-app/target/tika-app-*.jar -t spring2010_tcm24-5512.pdf
> 
> It parsed the file completely without issue.
> 
> So I figured the issue must be something with the configuration. I did
> some hunting around for more information.  It turns out I misunderstood
> the setting file.content.limit. It appears that it applies only to files
> retrieved via file://, not to files retrieved via http://, which
> is probably how most of the fetched content is being downloaded.  I need
> to use http.content.limit instead of file.content.limit.
> 
> When I changed it, everything appears to be running correctly.
> 
> Ironically, the description for both properties in nutch-default.xml is
> identical, and file.content.limit is the first property set in the file,
> whereas http.content.limit comes about 120 lines later.
> 
> <property>
>   <name>file.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be truncated;
>   otherwise, no truncation at all.
>   </description>
> </property>
> ...
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be truncated;
>   otherwise, no truncation at all.
>   </description>
> </property>
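
The distinction described above can be sketched in a few lines of Python. This is only an illustration of the behaviour stated in the two property descriptions, not Nutch's actual code: the helper names `limit_property_for` and `apply_limit` are made up for this example, and only the file:// and http:// schemes mentioned in the message are mapped.

```python
from urllib.parse import urlparse

# Scheme-to-property mapping taken from the message above; other schemes
# are deliberately left out because the thread does not discuss them.
LIMIT_PROPERTY_BY_SCHEME = {
    "file": "file.content.limit",
    "http": "http.content.limit",
}


def limit_property_for(url):
    """Return the name of the limit property for a URL, or None if unknown."""
    return LIMIT_PROPERTY_BY_SCHEME.get(urlparse(url).scheme.lower())


def apply_limit(content, limit):
    """Truncate content as the property description states.

    A nonnegative limit truncates longer content to that many bytes;
    a negative limit means no truncation at all.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


if __name__ == "__main__":
    url = "http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf"
    pdf = bytes(438870)  # placeholder bytes with the size reported in the thread
    print(limit_property_for(url))       # which property governs this fetch
    print(len(apply_limit(pdf, 65536)))  # default value shown in the description
    print(len(apply_limit(pdf, -1)))     # negative value: no truncation
```

Under these assumptions, a 438,870-byte PDF fetched over http:// would be cut down to the default 65536 bytes no matter how large file.content.limit is set, which is consistent with what was observed here.
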
> 
> Thanks again for the help!
> Brad
> 
> ---------------------------------------------------------------------------
> Hi Brad,
> 
> This might be a PDFBox issue, which is the underlying Java library that Tika
> wraps for PDF, and in turn Nutch wraps through parse-tika.
> 
> You may want to download Apache PDFBox and try parsing the PDF file with it
> outside of Nutch and Tika. If it works with the latest version (I think
> 1.2?) then you can manually replace the version of PDFBox that parse-tika
> uses in Nutch (though the versions might not be forwards/backwards
> compatible). But it is at least worth a shot until we upgrade the Tika deps
> and in turn the Nutch deps...
> 
> Cheers,
> Chris
> 
> 
> 
> On 7/13/10 11:15 AM, "brad" <[email protected]> wrote:
> 
> I'm getting the following error on a regular basis with PDFs on Nutch 1.1
> 
> 2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing
> http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
> java.io.IOException: expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@721ba923
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> 2010-07-13 10:57:32,788 WARN  fetcher.Fetcher - Error parsing:
> http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@721ba923
> 
> I manually downloaded the file and it opens just fine.  The document
> properties show that it is 438,870 bytes long.  PDF version 1.4 (Acrobat
> 5.x).
> 
> I have file.content.limit set to 1310720 (1,310,720) bytes, so file size
> should not be the issue.  I checked several of the files that I got the error
> on, and every file was smaller than the file.content.limit.
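
One way to test whether a size limit is involved is to check if the bytes handed to the parser still end like a complete PDF. The Python sketch below is an illustration under stated assumptions, not something taken from Nutch or Tika: `looks_truncated_pdf` is a hypothetical helper, and it relies only on the PDF convention that a complete file ends with an `%%EOF` marker, which readers typically look for within the last 1024 bytes.

```python
def looks_truncated_pdf(data, tail_size=1024):
    """Return True if no %%EOF marker appears in the last tail_size bytes.

    A missing marker suggests, but does not prove, that the content was
    cut off before the end of the file.
    """
    return b"%%EOF" not in data[-tail_size:]


if __name__ == "__main__":
    # Tiny stand-in for a PDF: header, one stream object, end-of-file marker.
    complete = (
        b"%PDF-1.4\n1 0 obj\nstream\n"
        + b"x" * 5000
        + b"\nendstream\nendobj\n%%EOF\n"
    )
    cut = complete[:2048]  # simulate a content limit smaller than the file
    print(looks_truncated_pdf(complete))
    print(looks_truncated_pdf(cut))
```

Content cut off in the middle of a stream would also be consistent with the `expected='endstream' actual=''` message in the log: the parser runs out of data before it finds the end of the stream.
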
> 
> Any ideas on what the problem may be and how to resolve it?
> 
> Thanks
> Brad
> 
> 
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
