Re: protocol-http versus protocol-httpclient

Ken Krugler Wed, 09 Nov 2005 20:29:52 -0800

> I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is difficult to maintain.
> I think we should thus merge these, with a configuration parameter determining which http backend is used, much like parse-html, which can switch between neko and tagsoup.
> What do others think?

Merging would be great - at least then there's only one plug-in to focus debugging energies on.
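
To make the merge idea concrete, here is a minimal sketch of a single plug-in that picks its HTTP backend from configuration. Everything in it is an assumption for illustration: the property name "http.backend", the HttpBackend interface, and both implementation classes are hypothetical and are not existing Nutch code.

```java
import java.util.Locale;
import java.util.Properties;

public class HttpBackendSelector {

    /** Hypothetical property name, not an existing Nutch setting. */
    public static final String BACKEND_KEY = "http.backend";

    /** Hypothetical common interface that both backends would implement. */
    public interface HttpBackend {
        String name();
    }

    /** Stand-in for the code that currently lives in protocol-http. */
    static final class BuiltInBackend implements HttpBackend {
        public String name() { return "protocol-http"; }
    }

    /** Stand-in for the code that currently lives in protocol-httpclient. */
    static final class HttpClientBackend implements HttpBackend {
        public String name() { return "protocol-httpclient"; }
    }

    /** Picks a backend from the configuration, defaulting to the built-in one. */
    public static HttpBackend select(Properties conf) {
        String choice = conf.getProperty(BACKEND_KEY, "http")
                .trim().toLowerCase(Locale.ROOT);
        switch (choice) {
            case "http":
                return new BuiltInBackend();
            case "httpclient":
                return new HttpClientBackend();
            default:
                throw new IllegalArgumentException(
                        "Unknown value for " + BACKEND_KEY + ": " + choice);
        }
    }
}
```

Rejecting unknown values, rather than silently falling back, would make a mistyped setting show up at startup instead of as a surprising change in crawl behaviour.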

BTW, we've been tweaking code in this area to fix some issues we've run into. Some of the changes are minor, others are more significant. Some questions:

1. We needed to modify the commons-httpclient code to fix one hang that sometimes occurs in ChunkedInputStream.exhaustInputStream(). We found sites that were trickling lots of data back to us (e.g. 60 Kbits/sec), so we'd wind up waiting a really long time (up to two hours) for a fetcher thread to terminate.

What we did was have this routine throw an HttpException (with an InterruptedException as its cause) whenever it notices that its thread has been interrupted. Then we monitor progress in fetcher.Fetcher.run() and interrupt any thread that has been working on a URL past a configurable time limit.
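
The pattern can be sketched with the standard library alone. This is not our actual patch to commons-httpclient or to Fetcher; the class name, the EndlessStream stand-in for a trickling server, and the use of a plain IOException in place of HttpException are all assumptions made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.InputStream;

public class InterruptibleDrainSketch {

    /** Stand-in for a server that never stops sending: every read yields a byte. */
    static final class EndlessStream extends InputStream {
        @Override
        public int read() {
            return 'x';
        }
    }

    /**
     * Reads the stream to its end, but checks the interrupt flag after every
     * read and aborts with an exception whose cause is InterruptedException.
     */
    public static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) >= 0) {
            total += n;
            if (Thread.currentThread().isInterrupted()) {
                throw new IOException("interrupted while draining response",
                        new InterruptedException());
            }
        }
        return total;
    }

    /**
     * Watchdog: runs drain() on a worker thread, interrupts the worker once
     * the time limit has passed, and reports whether the worker stopped
     * because of that interrupt.
     */
    public static boolean drainIsAbortedAfter(long limitMillis) {
        if (limitMillis <= 0) {
            // join(0) would wait forever, which is exactly the hang to avoid.
            throw new IllegalArgumentException("limit must be positive");
        }
        final boolean[] aborted = {false};
        Thread worker = new Thread(() -> {
            try {
                drain(new EndlessStream());
            } catch (IOException e) {
                aborted[0] = e.getCause() instanceof InterruptedException;
            }
        });
        worker.start();
        try {
            worker.join(limitMillis);   // wait up to the limit for a normal finish
            if (worker.isAlive()) {
                worker.interrupt();     // past the limit: ask the worker to stop
            }
            worker.join();              // drain() honours the flag, so this returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return aborted[0];
    }
}
```

Note that the interrupt only takes effect between reads. A read that is itself blocked on a socket would still need a socket timeout to return control to the loop.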

We also modified HttpMethodDirector.HttpMethodDirector(). It now sets the connection manager time-out (http.connection-manager.timeout HTTP parameter) to 10 minutes, rather than letting this default to 0. This prevents the connection manager from looping forever when it doesn't have a free connection to satisfy the client. It's not obvious how we get into this state of no free connections, but it has happened, and at least we don't hang now.
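
The difference between a time-out of 0 and a bounded one can be shown with a small stand-alone analogue. This is not commons-httpclient code; BoundedPoolSketch and its methods are hypothetical, and a Semaphore stands in for the connection manager's pool of free connections.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BoundedPoolSketch {
    private final Semaphore free;
    private final long timeoutMillis;

    /** A timeout of 0 means "wait indefinitely", mirroring the default described above. */
    public BoundedPoolSketch(int maxConnections, long timeoutMillis) {
        this.free = new Semaphore(maxConnections);
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Tries to claim a connection slot. Returns true on success, or false if
     * the timeout elapsed or the calling thread was interrupted while waiting.
     */
    public boolean acquire() {
        try {
            if (timeoutMillis <= 0) {
                free.acquire();          // can block forever if nothing is released
                return true;
            }
            return free.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a slot to the pool. */
    public void release() {
        free.release();
    }
}
```

With a positive timeout, a caller that finds the pool exhausted gets a failure it can log and recover from, instead of a thread that never returns.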

Plus some minor changes to tone down the level of logging for some messages, so our logs (when running at INFO) show only important status and real warnings/errors.

So the question here is what to do with these changes. I will try to get them integrated into the commons-httpclient code, but it might take a while before they circle back into Nutch. Suggestions for what to do in the short term?

2. Our other changes are a mixture of dealing more effectively with bad hosts so fetcher threads don't get hung up, and changes to do a better job of crawling a limited domain space (vertical crawl).

The first set of changes seems like something that could get merged in (if deemed useful) without too much effort. The second set is more architectural in nature - and I'm a bit worried about what happens when we try to integrate these into 0.8. Plus we're still in the middle of getting the wrinkles ironed out, so it would be premature to submit any patches.

But are we going to be running into trouble by waiting? Would it make sense to send out patches of what we've done to date, even if the code isn't ready for prime time?


Thanks for any advice,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
