Hi,

I'm seeing inconsistent results when parsing PDF/Word/PPT files. Parsing
and indexing work fine for most files but fail for a few. Here is an
excerpt from the logs:

-- crawler log

Error parsing: http://mysite/applications.pdf: failed(2,202): Content
truncated at 11974 bytes. Parser can't handle incomplete pdf file.

-- hadoop log

2006-10-02 17:49:30,187 WARN  fetcher.Fetcher - Error parsing:
http://mysite/test.doc: failed(2,202): Content truncated at 11981 bytes.
Parser can't handle incomplete file.

I do have the content limit set to "-1" in nutch-site.xml (the property is
shown below), so content should not be truncated, but the setting doesn't
seem to take effect. Has anybody seen something similar? Do I have to delete
the property from nutch-default.xml just in case?

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  </description>
</property>
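
To double-check which limits actually end up in effect, something like the
following rough sketch could be used. It assumes the usual Hadoop-style
<configuration>/<property> layout, with later files overriding earlier ones.
The helper name content_limits and the two inline sample documents are
hypothetical, and the 65536 values are placeholders, not values read from a
real nutch-default.xml.

```python
import xml.etree.ElementTree as ET


def content_limits(*config_xml):
    """Collect every '*.content.limit' property from the given XML strings.

    Documents are processed in order, so a later one (e.g. the site file)
    overrides an earlier one (e.g. the defaults file).
    """
    limits = {}
    for text in config_xml:
        root = ET.fromstring(text)
        for prop in root.iter("property"):
            name = (prop.findtext("name") or "").strip()
            value = (prop.findtext("value") or "").strip()
            if name.endswith(".content.limit"):
                limits[name] = int(value)
    return limits


# Placeholder stand-in for nutch-default.xml (values are assumptions).
DEFAULTS = """<configuration>
  <property><name>file.content.limit</name><value>65536</value></property>
  <property><name>http.content.limit</name><value>65536</value></property>
</configuration>"""

# Stand-in for nutch-site.xml, containing only the override shown above.
SITE = """<configuration>
  <property><name>file.content.limit</name><value>-1</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, value in sorted(content_limits(DEFAULTS, SITE).items()):
        print(name, value)
```

With these sample inputs only file.content.limit is overridden; any other
limit keeps its placeholder value. I have not confirmed which property
governs the http:// URLs in the logs above.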

Finally, I have a separate engine indexing the same documents, so I don't
think the issue is with the web server.

Thanks for any help.

Omar

