I had some problems fetching gzipped content when setting content.limit = -1 in the 0.8 version... Maybe that's part of your problem? Hope this helps.
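
For reference, here is a minimal nutch-site.xml sketch of the kind of compromise I mean. I'm assuming your settings map onto the http.content.limit, fetcher.threads.fetch, fetcher.threads.per.host, and indexer.max.tokens properties; the concrete values (a ~5 MB limit, more modest thread counts) are just illustrative, not a recommendation:

    <?xml version="1.0"?>
    <configuration>
      <!-- cap fetched content at ~5 MB instead of -1 (unlimited) -->
      <property>
        <name>http.content.limit</name>
        <value>5242880</value>
      </property>
      <!-- more modest thread settings than 1000 / 100 -->
      <property>
        <name>fetcher.threads.fetch</name>
        <value>50</value>
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <value>10</value>
      </property>
      <!-- keep the raised token limit you already use -->
      <property>
        <name>indexer.max.tokens</name>
        <value>750000</value>
      </property>
    </configuration>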
On Fri, May 1, 2009 at 3:43 PM, tsmori <[email protected]> wrote:
>
> I'm having an interesting problem that I think revolves around the interplay
> of a few settings, and I'm not really clear on how they affect the crawl.
>
> Currently I have:
>
> content.limit = -1
> fetcher.threads = 1000
> fetcher.threads.per.host = 100
> indexer.max.tokens = 750000
>
> I also increased the JAVA_HEAP space to account for the additional tokens.
> I'm not getting any out-of-memory errors, so that part should be okay.
>
> The problem is that with the content limit set high or removed entirely (I
> have tried other values), I get fetch errors with NullPointerExceptions on
> one set of files (HTML files); these are fairly large HTML files, but not
> over 1 MB. If I set the content limit to a reasonable amount, say 5 MB, the
> NullPointerExceptions go away, but I get a lot of truncation errors on a
> different group of files (PDF files, all over 5 MB).
>
> I'm trying to find a sweet spot where I can fetch/index all of my PDF files
> without the crawl bombing out, which it does if it gets too many errors.
>
> I'm not sure whether the threads and threads-per-host settings play any
> role. I feel like I got a better crawl when I set them a little more
> modestly, but I read in another thread somewhere that a good server should
> handle those settings, and I'm running this on a quad-core Opteron server.
>
> I'm also not sure whether some of the parse settings are affecting
> anything. I got rid of index-more, but ultimately I think I'd like to put
> that back if I can.
>
>
> --
> View this message in context:
> http://www.nabble.com/NullPointerExceptions-in-Fetch-tp23333304p23333304.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
