I was recently benchmarking fetching at a site with lots of
bandwidth, and it seemed to me that protocol-http is capable of
faster crawling than protocol-httpclient. So I don't think we should
discard protocol-http just yet. But there's a lot of duplicate code
between these, which is difficult to maintain.
I think we should thus merge these, with a configuration parameter
determining which http backend is used, much like parse-html, which
can switch between neko and tagsoup.
What do others think?
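
For illustration, the merge proposed above could look roughly like the sketch below: one plug-in, with a configuration value picking the backend. Everything named here is an assumption made for the example: the HttpBackend interface, the two stand-in classes, the "http.backend" property name, and the use of java.util.Properties in place of Nutch's own configuration class. None of it is taken from the Nutch source.

```java
import java.util.Properties;

/** Hypothetical common interface that both HTTP backends would implement. */
interface HttpBackend {
    String name();
}

/** Stand-in for the socket-based code currently in protocol-http. */
class SimpleSocketBackend implements HttpBackend {
    public String name() { return "simple"; }
}

/** Stand-in for the commons-httpclient-based code in protocol-httpclient. */
class HttpClientBackend implements HttpBackend {
    public String name() { return "httpclient"; }
}

class HttpBackendFactory {
    // Hypothetical property name; not an existing Nutch setting.
    static final String BACKEND_KEY = "http.backend";

    /** Picks the backend named in the configuration, defaulting to "simple". */
    static HttpBackend create(Properties conf) {
        String choice = conf.getProperty(BACKEND_KEY, "simple");
        switch (choice) {
            case "simple":
                return new SimpleSocketBackend();
            case "httpclient":
                return new HttpClientBackend();
            default:
                throw new IllegalArgumentException(
                    "Unknown value for " + BACKEND_KEY + ": " + choice);
        }
    }
}
```

With this shape, shared code (robots handling, politeness delays, and so on) would live once in the plug-in, and only the backend classes would differ.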
Merging would be great - at least then there's only one plug-in to
focus debugging energies on.
BTW, we've been tweaking code in this area to fix some issues we've
run into. Some of the changes are minor, others are more significant.
Some questions:
1. We needed to modify the commons-httpclient code to fix a hang
that sometimes occurs in ChunkedInputStream.exhaustInputStream(). We
found sites that were trickling lots of data back to us slowly (e.g.
at 60 Kbits/sec), so we'd wind up waiting a really long time (up to
two hours) for a fetcher thread to terminate.
What we did was have this routine throw an HttpException (cause is
InterruptedException) whenever it notices that its thread has been
interrupted. Then we monitor performance in fetcher.Fetcher.run() and
interrupt any thread that has been working on a URL past a
configurable time limit.
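
A minimal sketch of the idea, not our actual patch: a drain loop that gives up once its thread has been interrupted, plus a watchdog that interrupts a worker that runs past a time limit. The class and method names are made up for the example, and it throws the JDK's InterruptedIOException as a stand-in for the HttpException we use in the modified commons-httpclient code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;

class InterruptibleDrain {
    /** Reads and discards the stream, but stops once the thread is interrupted. */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) >= 0) {
            total += n;
            // Check the interrupt flag after every read instead of reading to EOF blindly.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException("interrupted after " + total + " bytes");
            }
        }
        return total;
    }
}

class FetchWatchdog {
    /**
     * Runs the task on a worker thread and interrupts it if it is still
     * running after limitMillis (which must be > 0, since join(0) waits forever).
     * Returns true if the task finished within the limit.
     * The task must respond to interruption, or the second join() will block.
     */
    static boolean runWithLimit(Runnable task, long limitMillis) {
        Thread worker = new Thread(task, "fetcher-sketch");
        worker.start();
        try {
            worker.join(limitMillis);
            if (!worker.isAlive()) {
                return true;
            }
            worker.interrupt();
            worker.join();
        } catch (InterruptedException e) {
            // The watchdog itself was interrupted: pass it on and restore the flag.
            worker.interrupt();
            Thread.currentThread().interrupt();
        }
        return false;
    }
}
```

The interrupt only helps if the read itself returns from time to time, which is the case for a host that trickles data; a read that blocks forever would still need a socket timeout.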
We also modified HttpMethodDirector.HttpMethodDirector(). It now sets
the connection manager time-out (http.connection-manager.timeout HTTP
parameter) to 10 minutes, rather than letting this default to 0. This
prevents the connection manager from looping forever when it doesn't
have a free connection to satisfy the client. It's not obvious how we
get into this state of no free connections, but it has happened, and
at least we don't hang now.
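
To show what the time-out buys us, here is a small stand-alone sketch using only the JDK. It is not commons-httpclient code: BoundedConnectionPool and its methods are hypothetical names, and a Semaphore stands in for the connection manager's pool. The point is just that a caller waiting for a free connection gets an answer after a bounded wait instead of waiting indefinitely.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical pool that hands out at most maxConnections permits. */
class BoundedConnectionPool {
    private final Semaphore permits;
    private final long timeoutMillis;

    BoundedConnectionPool(int maxConnections, long timeoutMillis) {
        this.permits = new Semaphore(maxConnections);
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Waits up to timeoutMillis for a free connection.
     * Returns false on time-out or interruption rather than blocking forever.
     */
    boolean acquire() {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
            return false;
        }
    }

    /** Returns a connection to the pool. */
    void release() {
        permits.release();
    }
}
```

A caller that gets false back can log the URL and move on, which at least turns a silent hang into a visible failure.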
We also made some minor changes to tone down the level of logging for
some messages, so our logs (when running at INFO) show only important
status and real warnings/errors.
So the question here is what to do with these changes. I will try to
get them integrated into the commons-httpclient code, but it might
take a while before they circle back into Nutch. Suggestions for what
to do in the short term?
2. Our other changes are a mixture of dealing more effectively with
bad hosts so fetcher threads don't get hung up, and changes to do a
better job of crawling a limited domain space (vertical crawl).
The first set of changes seems like something that could get merged in
(if deemed useful) without too much effort. The second set is more
architectural in nature - and I'm a bit worried about what happens
when we try to integrate these into 0.8. Plus we're still in the
middle of getting the wrinkles ironed out, so it would be premature
to submit any patches.
But are we going to be running into trouble by waiting? Would it make
sense to send out patches of what we've done to date, even if the
code isn't ready for prime time?
Thanks for any advice,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers