Hi Brad,

This might be a PDFBox issue: PDFBox is the underlying Java library that Tika wraps for PDF, and Nutch in turn wraps Tika through parse-tika.

You may want to download Apache PDFBox and try parsing the PDF file with it outside of Nutch and Tika. If it works with the latest version (I think 1.2?), then you can manually replace the version of PDFBox that parse-tika uses in Nutch (though the two versions might not be forward/backward compatible). But it is at least worth a shot until we upgrade the Tika deps and in turn the Nutch deps...

Cheers,
Chris

On 7/13/10 11:15 AM, "brad" <[email protected]> wrote:

I'm getting the following error on a regular basis with PDFs on Nutch 1.1:

2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
2010-07-13 10:57:32,788 WARN fetcher.Fetcher - Error parsing: http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923

I manually downloaded the file and it opens just fine. The document properties show that it is 438,870 bytes long, PDF version 1.4 (Acrobat 5.x). I have file.content.limit set to 1310720 (1,310,720) bytes, so file size should not be the issue.
I checked several of the files that I got the error on, and every file was smaller than the file.content.limit.

Any ideas on what the problem may be and how to resolve it?

Thanks,
Brad

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
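
An error like expected='endstream' actual='' is the kind of thing a parser tends to report when the data stops partway through a stream, so it may be worth ruling out truncation before swapping libraries. The sketch below is only a rough heuristic and is not part of Nutch, Tika, or PDFBox: the class PdfTruncationCheck, its method names, and the 1024-byte tail window are all hypothetical choices made for illustration. It checks that the data starts with %PDF- and has an %%EOF marker near its end, and compares a length against a content limit, treating a negative limit as "no truncation" (my understanding of how the Nutch content limit properties behave). Since the URL is fetched over http, it is also an assumption worth verifying whether http.content.limit, rather than file.content.limit, is the setting that applies here; I believe its default is 65536, but please check nutch-default.xml.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Rough heuristic for spotting a PDF that was cut off before its trailer. */
final class PdfTruncationCheck {

    // How far from the end of the data to look for %%EOF (an arbitrary choice).
    private static final int TAIL_WINDOW = 1024;

    /** True if the data starts with a PDF header and has %%EOF near the end. */
    static boolean looksComplete(byte[] data) {
        // ISO-8859-1 maps each byte to exactly one char, so offsets are preserved.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("%PDF-")) {
            return false;
        }
        int tailStart = Math.max(0, text.length() - TAIL_WINDOW);
        return text.indexOf("%%EOF", tailStart) >= 0;
    }

    /** True if content of this length would be cut at the limit (negative = no limit). */
    static boolean exceedsLimit(long contentLength, long limit) {
        return limit >= 0 && contentLength > limit;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java PdfTruncationCheck <file.pdf> <limit-in-bytes>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        long limit = Long.parseLong(args[1]);
        System.out.println("bytes on disk:  " + data.length);
        System.out.println("looks complete: " + looksComplete(data));
        System.out.println("exceeds limit:  " + exceedsLimit(data.length, limit));
    }
}
```

Run it against the manually downloaded copy with whichever limit you think applies. A complete file on disk only shows that the source document is fine; what matters is whether the bytes Nutch handed to the parser were cut short, so a file that "exceeds limit" for the setting actually in effect would point at truncation rather than at the PDF library.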

