I'm using Nutch 0.9.

It appears that Nutch is ignoring the http.content.limit value in the config
file.  I have left this setting at the default (64K), and the httpclient
plugin logs that value (...httpclient.Http - http.content.limit = 65536),
yet Nutch is attempting to fetch a 115MB file.
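For reference, this is the property as I have it (the stock default from
conf/nutch-default.xml; description paraphrased from memory):

```xml
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If nonnegative, content longer than this should be truncated;
  otherwise no truncation is applied.</description>
</property>
```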

I have verified that the server is indeed sending the content-length header.

As you might imagine, this takes a very long time, and the Fetcher gives up
on that fetcher thread:
2007-05-10 08:29:31,963 WARN  fetcher.Fetcher - Aborting with 1 hung
threads.

Then, the next time around the generate/fetch/update loop, Nutch again tries
to fetch that same document.  I'm doing a very deep crawl, and this
combination of behaviors wound up giving me 10 threads all hung fetching the
same file.  Bandwidth issues followed...

I've groveled the code, and while I see HttpBase reading the
http.content.limit value from the config, I can't find anywhere that the
value is actually used.
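For illustration, this is the kind of guard I expected to find somewhere in
the plugin's response-reading path -- a hypothetical sketch of my own, not
Nutch's actual code (readWithLimit is a name I made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ContentLimitSketch {

    // Read at most contentLimit bytes from the stream; anything beyond
    // the limit is discarded (i.e. the content is truncated, the way
    // http.content.limit is documented to behave).
    static byte[] readWithLimit(InputStream in, int contentLimit)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            if (contentLimit >= 0 && total + n > contentLimit) {
                // Keep only the bytes that fit under the limit, then stop.
                out.write(buf, 0, contentLimit - total);
                break;
            }
            out.write(buf, 0, n);
            total += n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A 100,000-byte "response" should come back truncated to 65536.
        byte[] big = new byte[100_000];
        byte[] result = readWithLimit(new ByteArrayInputStream(big), 65536);
        System.out.println(result.length); // prints 65536
    }
}
```

Something like this, checked either against the Content-Length header up
front or while reading the body, is what I'd expect the limit to do.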

Now this is an .flv file, and sure, I could easily filter the URL out based
on its extension, but I'd like to find a more generic solution.
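(For the record, the extension workaround would be a line like this in
conf/regex-urlfilter.txt -- but it only papers over this one file type:)

```
# skip Flash video files
-\.flv$
```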

So is it true that http.content.limit is not implemented yet, or is there
something I don't understand about the way it is meant to work?

Regards
Charlie
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general