Hi, I'm seeing inconsistent behaviour when parsing pdf/word/ppt files. Parsing and indexing work fine for most files, but fail for a few. Here is an excerpt from the logs:

-- crawler log
Error parsing: http://mysite/applications.pdf: failed(2,202): Content truncated at 11974 bytes. Parser can't handle incomplete pdf file.

-- hadoop log
2006-10-02 17:49:30,187 WARN fetcher.Fetcher - Error parsing: http://mysite/test.doc: failed(2,202): Content truncated at 11981 bytes. Parser can't handle incomplete file.

I have set the content limit to "-1" in the nutch-site.xml file so that content is not truncated, but it doesn't seem to work. Has anybody seen something similar? Do I have to delete the property from nutch-default.xml just in case?

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  </description>
</property>

Finally, I have a separate engine indexing the same documents, so I don't think it is an issue with the web server.

Thanks for any help.

Omar
