tsmori wrote:
I'm running into an interesting problem that I think comes down to the interplay
of a few settings, and I'm not entirely clear on how they affect the crawl.
Currently I have:
content.limit = -1
fetcher.threads = 1000
fetcher.threads.per.host = 100
indexer.max.tokens = 750000
I also increased the JAVA_HEAP space to account for the additional tokens.
I'm not getting any out of memory errors, so that part should be okay.
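For reference, here is roughly what those settings look like in my
conf/nutch-site.xml. This is a sketch from memory, and the canonical property
names may differ by Nutch version (the content limit is usually
http.content.limit and the fetcher thread count fetcher.threads.fetch):

  <configuration>
    <!-- -1 means no limit on the size of downloaded content -->
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
    </property>
    <!-- total number of fetcher threads -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>1000</value>
    </property>
    <!-- maximum concurrent threads against a single host -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>100</value>
    </property>
    <!-- maximum number of tokens indexed per document -->
    <property>
      <name>indexer.max.tokens</name>
      <value>750000</value>
    </property>
  </configuration>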
The problem is that with the content limit set high or left unset (I have
tried other values as well), I get fetch errors with NullPointerExceptions on
one set of files (HTML files). These are fairly large HTML files, but none
over 1 MB. If I set the content limit to a reasonable amount, say 5 MB, the
NullPointerExceptions go away, but then I get a lot of truncation errors on a
different group of files (PDF files, all over 5 MB).
Could you please copy the full stack trace, including line numbers?
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com