I have been testing different Nutch configurations to see
what slows down the fetching process on my servers (Nutch 0.7.1).
My general experience is that PDF parsing is Nutch's Achilles heel.
Nutch works fine on older computers, but with the combination of
|parse-(text|html|pdf)
and http.content.limit = -1 (needed to get PDF parsing to work), Nutch
sometimes freezes completely.
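
For reference, here is a minimal sketch of how that combination might look as overrides in conf/nutch-site.xml. It assumes the 0.7-style <nutch-conf> root element. Only the parse-(text|html|pdf) part and the -1 limit come from the setup described above; the other plugin names in plugin.includes are illustrative assumptions, not the exact list in use.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides (0.7.x layout assumed) -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: every plugin name here except parse-(text|html|pdf)
         is a placeholder for whatever the installation already lists. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of fetched content, so
         PDFs are downloaded whole, however large they are. -->
    <value>-1</value>
  </property>
</nutch-conf>
```

With the limit disabled, every PDF is fetched and handed to the parser in full, regardless of size.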
Are any improvements to PDF parsing planned for the next
version of Nutch (0.8)?
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general