This description fooled me once too, and it still hadn't been patched? Now it is [1]; please commit.
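
For anyone who hits the same truncation, here is a minimal sketch of the override Brad describes below. It assumes the setting goes in conf/nutch-site.xml (the usual place for local overrides of nutch-default.xml), and the value 1310720 is only the figure Brad had been using for file.content.limit, not a recommendation:

```xml
<!-- conf/nutch-site.xml: local overrides of conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Assumed value, borrowed from Brad's file.content.limit setting.
         A negative value would disable truncation altogether. -->
    <value>1310720</value>
    <description>Length limit, in bytes, for content fetched over http://.
    Content fetched over file:// is governed by file.content.limit instead.
    </description>
  </property>
</configuration>
```

If the crawl also reads local files via file://, file.content.limit would presumably need the same treatment.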
[1]: https://issues.apache.org/jira/browse/NUTCH-900

On Wednesday 14 July 2010 07:10:47 Mattmann, Chris A (388J) wrote:
> No problem, Brad! If you'd like, feel free to create an issue in Nutch JIRA
> (http://issues.apache.org/jira/browse/NUTCH), and provide a documentation
> patch spelling out the difference below. That would really help out!
>
> To create a patch:
>
> 1. svn co http://svn.apache.org/repos/asf/nutch/trunk ./nutch
> 2. cd nutch
> 3. edit conf/nutch-default.xml with your documentation update
> 4. svn status - make sure that you see 1 file changed,
>    conf/nutch-default.xml
> 5. svn diff > NUTCH-xxx.<your last name>.<yyMMdd>.patch.txt
>    where xxx is the issue # that JIRA creates
>
> Then just attach the patch from #5 in JIRA and I'll get it committed to the
> sources...
>
> Thanks!
>
> Cheers,
> Chris
>
> On 7/13/10 9:49 PM, "brad" <[email protected]> wrote:
>
> Chris,
> Thank you for the help. Based on what you said, I decided to install
> Tika 0.7 and try to parse the file using tika-app to see what happens.
>
> java -jar tika-app/target/tika-app-*.jar -t spring2010_tcm24-5512.pdf
>
> It parsed the file completely without issue.
>
> So I figured the issue must be something with the configuration. I did
> some hunting around for more information. It turns out I misunderstood
> the setting file.content.limit. It appears that it applies only to files
> retrieved via file://, not to files retrieved via http://, which
> is probably how most of the fetched content is being downloaded. I need
> to use http.content.limit instead of file.content.limit.
>
> When I changed it, everything appeared to run correctly.
>
> Ironically, the descriptions of both properties in nutch-default.xml are
> identical, and file.content.limit is the first property set in the file,
> whereas http.content.limit comes about 120 lines later.
>
> <property>
>   <name>file.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
> ...
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
>
> Thanks again for the help!
> Brad
>
> ---------------------------------------------------------------------------
>
> Hi Brad,
>
> This might be a PDFBox issue, which is the underlying Java library that
> Tika wraps for PDF, and in turn Nutch wraps through parse-tika.
>
> You may want to download Apache PDFBox and try parsing the PDF file with it
> outside of Nutch and Tika. If it works with the latest version (I think
> 1.2?) then you can manually replace the version of PDFBox that parse-tika
> uses in Nutch (though they might not be forwards/backwards compatible).
> But it is at least worth a shot until we upgrade the Tika deps and in turn
> the Nutch deps...
>
> Cheers,
> Chris
>
> On 7/13/10 11:15 AM, "brad" <[email protected]> wrote:
>
> I'm getting the following error on a regular basis with PDFs on Nutch 1.1:
>
> 2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing
> http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
> java.io.IOException: expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@721ba923
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> 2010-07-13 10:57:32,788 WARN fetcher.Fetcher - Error parsing:
> http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@721ba923
>
> I manually downloaded the file and it opens just fine. The document
> properties show that it is 438,870 bytes long, PDF version 1.4 (Acrobat
> 5.x).
>
> I have file.content.limit set to 1310720 (1,310,720) bytes, so file size
> should not be the issue. I checked several of the files that I got the
> error on, and every file was smaller than the file.content.limit.
>
> Any ideas on what the problem may be and how to resolve it?
>
> Thanks
> Brad
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

