The answer is that http.content.limit is indeed broken in the
protocol-httpclient plugin, though it doesn't really look like it's
entirely Nutch's fault.

The org.apache.nutch.protocol.httpclient.HttpResponse class is doing the
right thing in trying to abort the GET at the content limit, but when it
calls close on the input stream of the request at HttpResponse.java:120,
the org.apache.commons.httpclient.AutoCloseInputStream class goes off and
tries to read the entire response anyway.

I found that if get.abort() is called when the content goes over limit, the
request is terminated and Nutch is able to do the right thing.
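
To illustrate the difference, here is a minimal, self-contained sketch that uses only the JDK. It is not the Nutch or commons-httpclient code: DrainOnCloseStream is a hypothetical stand-in for a response stream that reads the rest of the body when closed, its abort() method stands in for get.abort(), and fetch() is an invented helper that stops reading at the content limit. The bytesPulled counter plays the role of bytes taken off the network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ContentLimitSketch {

    /** Hypothetical stand-in for a response stream that drains the body on close(). */
    static class DrainOnCloseStream extends InputStream {
        private final InputStream in;
        private boolean aborted = false;
        long bytesPulled = 0; // bytes taken from the simulated connection

        DrainOnCloseStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) {
                bytesPulled++;
            }
            return b;
        }

        /** Stand-in for get.abort(): after this, close() no longer drains. */
        void abort() {
            aborted = true;
        }

        @Override
        public void close() throws IOException {
            if (!aborted) {
                // Simulates the behavior described above: read to end of body.
                while (read() >= 0) {
                    // discard
                }
            }
            in.close();
        }
    }

    /** Keeps at most 'limit' bytes, then closes the stream (optionally aborting first). */
    static byte[] fetch(DrainOnCloseStream body, int limit, boolean abortOnLimit)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (out.size() < limit) {
            int n = body.read(buf, 0, Math.min(buf.length, limit - out.size()));
            if (n < 0) {
                break; // body ended before the limit was reached
            }
            out.write(buf, 0, n);
        }
        if (abortOnLimit && out.size() >= limit) {
            body.abort(); // limit reached: drop the rest instead of draining it
        }
        body.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        int limit = 65536; // the default http.content.limit
        byte[] big = new byte[1000000]; // simulated oversized response

        DrainOnCloseStream s1 = new DrainOnCloseStream(new ByteArrayInputStream(big));
        byte[] kept1 = fetch(s1, limit, false);
        System.out.println("close only:  kept " + kept1.length + ", pulled " + s1.bytesPulled);

        DrainOnCloseStream s2 = new DrainOnCloseStream(new ByteArrayInputStream(big));
        byte[] kept2 = fetch(s2, limit, true);
        System.out.println("abort first: kept " + kept2.length + ", pulled " + s2.bytesPulled);
    }
}
```

In both runs only 65536 bytes are kept, but without the abort the whole simulated body is still pulled through the stream on close, which is the behavior that hangs the fetcher thread on a very large file.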

The default protocol-http plugin does not use Apache Commons HttpClient, and
it enforces the limit correctly.


On 5/10/07, charlie w <[EMAIL PROTECTED]> wrote:


I'm using Nutch 0.9.

It appears that Nutch is ignoring the http.content.limit number in the
config file.  I have left this setting at the default (64K), and the
httpclient plugin logs that value
(...httpclient.Http - http.content.limit=65536), yet Nutch is attempting to
fetch a 115MB file.

I have verified that the server is indeed sending the content-length
header.

As you might imagine, this takes a very long time, and the Fetcher gives
up on that fetcher thread:
2007-05-10 08:29:31,963 WARN  fetcher.Fetcher - Aborting with 1 hung threads.

Then, the next time around the generate/fetch/update loop, Nutch again
tries to fetch that same document.  I'm doing a very deep crawl, and this
combination of behaviors wound up giving me 10 threads all hung fetching the
same file.  Bandwidth issues followed...

I've groveled through the code, and while I see HttpBase reading the
http.content.limit value from the config, I can't find anywhere that the
value is actually used.

Now this is an FLV file, and sure, I could easily filter the URL out based
on its extension, but I'd like to find a more generic solution.

So is it true that http.content.limit is not implemented yet or is there
something I don't understand about the way it is meant to work?

Regards
Charlie
