I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is difficult to maintain.

I think we should thus merge these, with a configuration parameter determining which http backend is used, much like parse-html, which can switch between neko and tagsoup.
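
A minimal sketch of what such a switch could look like, assuming a single merged plug-in. The property name "http.backend", the value names, and all class names here are hypothetical, not existing Nutch settings or classes:

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical common interface for the two HTTP implementations. */
interface HttpBackend {
    String name();
}

/** Stand-in for the lightweight implementation (protocol-http). */
final class SocketBackend implements HttpBackend {
    public String name() { return "socket"; }
}

/** Stand-in for the commons-httpclient implementation (protocol-httpclient). */
final class HttpClientBackend implements HttpBackend {
    public String name() { return "httpclient"; }
}

final class BackendSelector {
    // Hypothetical configuration key, used for illustration only.
    static final String KEY = "http.backend";

    /** Picks a backend from configuration, defaulting to the socket one. */
    static HttpBackend select(Properties conf) {
        String choice = conf.getProperty(KEY, "socket").trim().toLowerCase(Locale.ROOT);
        switch (choice) {
            case "socket":
                return new SocketBackend();
            case "httpclient":
                return new HttpClientBackend();
            default:
                throw new IllegalArgumentException("Unknown HTTP backend: " + choice);
        }
    }
}
```

Failing loudly on an unknown value is a design choice in this sketch; silently falling back to the default would hide configuration typos.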

What do others think?

Merging would be great - at least then there's only one plug-in to focus debugging energies on.

BTW, we've been tweaking code in this area to fix some issues we've run into. Some of the changes are minor, others are more significant. Some questions:

1. We needed to modify the commons-httpclient code to fix one hang that sometimes occurs in ChunkedInputStream.exhaustInputStream(). We found sites that were trickling lots of data back to us (e.g. 60Kbits/sec), so we'd wind up waiting a really long time (up to two hours) for a fetcher thread to terminate.

What we did was have this routine throw an HttpException (cause is InterruptedException) whenever it notices that its thread has been interrupted. Then we monitor performance in fetcher.Fetcher.run() and interrupt any thread that has been working on a URL past a configurable time limit.
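
Roughly, the two pieces fit together like the JDK-only sketch below. This is not the patched commons-httpclient or Fetcher code: the class names and the one-second grace period are assumptions, and InterruptedIOException stands in for the HttpException described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;

final class InterruptibleDrain {
    /** Reads and discards the rest of the stream, giving up once the thread is interrupted. */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) >= 0) {
            total += n;
            // Checked after every read, so a slowly trickling stream is abandoned promptly.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException("interrupted after " + total + " bytes");
            }
        }
        return total;
    }
}

final class FetchWatchdog {
    // Assumed grace period for the worker to unwind after being interrupted.
    private static final long GRACE_MILLIS = 1000;

    /** Runs the task on its own thread; returns true if it had to be interrupted. */
    static boolean runWithLimit(Runnable task, long limitMillis) {
        if (limitMillis <= 0) {
            throw new IllegalArgumentException("limitMillis must be positive");
        }
        Thread worker = new Thread(task, "fetcher-sketch");
        worker.setDaemon(true);
        worker.start();
        try {
            worker.join(limitMillis);
            if (!worker.isAlive()) {
                return false; // finished within the limit
            }
            worker.interrupt();
            worker.join(GRACE_MILLIS);
            return true;
        } catch (InterruptedException e) {
            // The watchdog itself was interrupted: pass it on and restore the flag.
            worker.interrupt();
            Thread.currentThread().interrupt();
            return true;
        }
    }
}
```

Note that the interrupt is only noticed between reads, so this helps with servers that keep trickling data but not with a single read that blocks indefinitely.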

We also modified HttpMethodDirector.HttpMethodDirector(). It now sets the connection manager time-out (http.connection-manager.timeout HTTP parameter) to 10 minutes, rather than letting this default to 0. This prevents the connection manager from looping forever when it doesn't have a free connection to satisfy the client. It's not obvious how we get into this state of no free connections, but it has happened, and at least we don't hang now.
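
The effect of a non-zero time-out can be illustrated with a JDK-only analogy. The sketch below is not commons-httpclient code; the BoundedPool class is hypothetical and a Semaphore stands in for the connection manager's pool. The point is only that a caller waiting for a free slot gets a failure after the time-out instead of waiting forever:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical pool with a bounded wait for a free slot. */
final class BoundedPool {
    private final Semaphore permits;
    private final long timeoutMillis;

    BoundedPool(int size, long timeoutMillis) {
        this.permits = new Semaphore(size);
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns true if a slot was obtained within the time-out, false otherwise. */
    boolean acquire() {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Returns a slot to the pool. */
    void release() {
        permits.release();
    }
}
```

With a pool of size 1, a second acquire() returns false after the time-out rather than hanging, and succeeds again once release() has been called.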

Plus some minor changes to tone down the level of logging for some messages, so our logs (when running at INFO) show only important status and real warnings/errors.

So the question here is what to do with these changes. I will try to get them integrated into the commons-httpclient code, but that might take a while before they circle back into Nutch. Suggestions for what to do in the short term?

2. Our other changes are a mixture of dealing more effectively with bad hosts so fetcher threads don't get hung up, and changes to do a better job of crawling a limited domain space (vertical crawl).

The first set of changes seems like something that could get merged in (if deemed useful) without too much effort. The second set is more architectural in nature - and I'm a bit worried about what happens when we try to integrate these into 0.8. Plus we're still in the middle of getting the wrinkles ironed out, so it would be premature to submit any patches.

But are we going to be running into trouble by waiting? Would it make sense to send out patches of what we've done to date, even if the code isn't ready for prime time?

Thanks for any advice,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
