I had some problems fetching gzipped content when setting content.limit=-1
in the 0.8 version. Maybe that's part of your problem? Hope this helps.
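
If you end up capping the limit instead of leaving it at -1, the override
goes in conf/nutch-site.xml. Here is a rough sketch of what I mean; the
property names are my assumption from the 0.8-era nutch-default.xml, so
check your own copy for the exact names:

  <?xml version="1.0"?>
  <!-- conf/nutch-site.xml: values here override nutch-default.xml -->
  <configuration>
    <property>
      <!-- assumed property name; caps fetched HTTP content at a finite
           size (20 MB here) instead of -1/unlimited, big enough that
           large PDFs still come through untruncated -->
      <name>http.content.limit</name>
      <value>20971520</value>
    </property>
    <property>
      <!-- assumed property name; the same cap for file: fetches -->
      <name>file.content.limit</name>
      <value>20971520</value>
    </property>
  </configuration>

The idea is just to keep the limit finite (that was what mattered in my
case) while making it larger than your biggest PDFs.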

On Fri, May 1, 2009 at 3:43 PM, tsmori <[email protected]> wrote:

>
> I'm having an interesting problem that I think revolves around the
> interplay of a few settings, and I'm not really clear on how they
> affect the crawl.
>
> Currently I have:
>
> content.limit = -1
> fetcher.threads = 1000
> fetcher.threads.per.host = 100
> indexer.max.tokens = 750000
>
> I also increased the JAVA_HEAP space to account for the additional tokens.
> I'm not getting any out of memory errors, so that part should be okay.
>
> The problem is that with the content limit set high or not set at all (I
> have tried other values), I get fetch errors with NullPointerExceptions
> on one set of files (HTML files). These are fairly large HTML files, but
> not over 1 MB. If I set the content limit to a reasonable amount, say
> 5 MB, the NullPointerExceptions go away, but I get a lot of truncation
> errors on a different group of files (PDF files, all over 5 MB).
>
> I'm trying to find a sweet spot where I can fetch/index all of my PDF
> files while not having the crawl bomb out, which it does if it gets too
> many errors.
>
> I'm not sure if the threads and threads-per-host settings play any role.
> I feel like I got a better crawl when I had them set a little more
> modestly, but I read in another thread somewhere that a good server
> should handle those settings, and I'm running this on a quad-core
> Opteron server.
>
> I'm also not sure whether some of the parse settings are affecting
> anything. I got rid of index-more, but ultimately I think I'd like to
> put that back if I can.
>
