Re: ERROR tika.TikaParser org.apache.pdfbox.io.PushBackInputStream

Mattmann, Chris A (388J) Tue, 13 Jul 2010 22:11:26 -0700

No problem, Brad! If you'd like, feel free to create an issue in Nutch JIRA 
(http://issues.apache.org/jira/browse/NUTCH) and provide a documentation patch 
spelling out the difference below. That would really help out!


To create a patch:


 1.  svn co http://svn.apache.org/repos/asf/nutch/trunk ./nutch
 2.  cd nutch
 3.  edit conf/nutch-default.xml with your documentation update
 4.  svn status - make sure that you see 1 file changed: conf/nutch-default.xml
 5.  svn diff > NUTCH-xxx.<your last name>.<yyMMdd>.patch.txt, where xxx is the
     issue # that JIRA creates

Then just attach the patch from #5 in JIRA and I'll get it committed to the 
sources...

Thanks!

Cheers,
Chris



On 7/13/10 9:49 PM, "brad" <[email protected]> wrote:

Chris,
Thank you for the help.  Based on what you said, I decided to install
Tika 0.7 and try to parse the file using tika-app to see what happens.

java -jar tika-app/target/tika-app-*.jar -t spring2010_tcm24-5512.pdf

It parsed the file completely without issue.

So I figured the issue must be something with the configuration. I did
some hunting around for more information.  It turns out I misunderstood
the setting file.content.limit.  It appears that it applies only to files
retrieved via file://, not to files retrieved via http://, which is
probably how most of the fetched content is being downloaded.  I need
to use http.content.limit instead of file.content.limit.

When I changed it, everything appears to be running correctly.
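
A minimal sketch of that change, assuming the override goes in
conf/nutch-site.xml (the usual place for local overrides of
nutch-default.xml) and reusing the 1310720 value that was set for
file.content.limit. The description text here is illustrative only:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: this file is conf/nutch-site.xml, whose properties
       override the defaults in conf/nutch-default.xml. -->
  <property>
    <name>http.content.limit</name>
    <!-- 1310720 bytes; larger than the 438,870-byte PDF that failed.
         A negative value would disable truncation entirely. -->
    <value>1310720</value>
    <description>Length limit, in bytes, for content fetched over
    http://. Content longer than this is truncated.
    </description>
  </property>
</configuration>
```

With the default of 65536 bytes, a PDF larger than that is cut off before
the parser sees it, which is consistent with an error about a missing
'endstream'.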

Ironically, the description for both properties in nutch-default.xml is
identical, and file.content.limit is the first property set in the file,
whereas http.content.limit comes about 120 lines later.

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
...
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

Thanks again for the help!
Brad

----------------------------------------------------------------------------
Hi Brad,

This might be a PDFBox issue; PDFBox is the underlying Java library that
Tika wraps for PDF, and that Nutch in turn wraps through parse-tika.

You may want to download Apache PDFBox and try parsing the PDF file with it
outside of Nutch and Tika. If it works with the latest version (I think
1.2?), then you can manually replace the version of PDFBox that parse-tika
uses in Nutch (though the versions might not be forwards/backwards
compatible). But it is at least worth a shot until we upgrade the Tika deps
and in turn the Nutch deps...

Cheers,
Chris



On 7/13/10 11:15 AM, "brad" <[email protected]> wrote:

I'm getting the following error on a regular basis with PDFs on Nutch 1.1

2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
2010-07-13 10:57:32,788 WARN  fetcher.Fetcher - Error parsing: http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923

I manually downloaded the file and it opens just fine.  The document
properties show that it is 438,870 bytes long.  PDF version 1.4 (Acrobat
5.x).

I have file.content.limit set to 1310720 (1,310,720) bytes, so file size
should not be the issue.  I checked several of the files that I got the
error on, and every file was smaller than the file.content.limit.

Any ideas on what the problem may be and how to resolve it?

Thanks
Brad




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





