Todd Lipcon wrote:

The issue I see with decreasing max crawl delay is that it essentially
blacklists those hosts. Even if I can only crawl these hosts 1/10th as fast,
I'd still like to have them in my index. I guess this is where the hostdb
will help once that JIRA is implemented, so this kind of thing can be taken
into account at generate time.

Ah, ok, now I get your point. I agree - it's better to at least fetch some of them as time permits, rather than none.


The current implementation (in Fetcher2) tries to do just that; see the
QueueFeeder thread. Or did you have something else in mind?


Sorry - I just had in mind making that factor of 50 a tunable parameter for
those who have more RAM available.

Sure, let's make this configurable.
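E.g. something along these lines where the QueueFeeder is created - just a
sketch, assuming the feeder is currently set up roughly as
"new QueueFeeder(input, fetchQueues, threadCount * 50)"; the property name
"fetcher.queue.depth.multiplier" is only a suggestion, and keeping 50 as the
default preserves the current behaviour:

  // Sketch: make the hard-coded feeder capacity factor configurable.
  // "fetcher.queue.depth.multiplier" is a suggested property name, with
  // the current factor of 50 as its default.
  int threadCount = getConf().getInt("fetcher.threads.fetch", 10);
  int queueDepthMultiplier =
      getConf().getInt("fetcher.queue.depth.multiplier", 50);

  // instead of: new QueueFeeder(input, fetchQueues, threadCount * 50)
  feeder = new QueueFeeder(input, fetchQueues,
      threadCount * queueDepthMultiplier);

The default could also go into nutch-default.xml so the knob is documented
for people who have more RAM to spend on the queues.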

Crawl delay is not the only factor that slows down fetching. One common
issue is hosts that accept connections relatively quickly, but then trickle
the data very slowly, byte by byte. Since we have per-host queues, it would
make sense to monitor not only the req/sec rate but also the bytes/sec rate,
and if it falls too low, terminate the queue.


Yep - though, similar to the issue with max crawl delay, I don't want to
terminate the queue. Rather, I just want the queue to be marked as
"terminable" so that it won't hold up map task completion when it's the
only queue left. If there are other queues still making reasonable progress,
we should keep trickling those bytes from the slow host.

I agree, this makes sense. The "terminable" status nicely conveys the idea, and it's easy to implement and check.
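Something along these lines, perhaps - a rough sketch only, the class and
member names below are invented for illustration and are not taken from the
actual Fetcher2 / FetchItemQueue code:

  // Illustration only; names are made up, not from Fetcher2.
  public class HostQueueStats {
    private long bytesFetched = 0;
    private long activeMillis = 0;
    private volatile boolean terminable = false;

    /** Called after each completed (or timed-out) request for this host. */
    public synchronized void update(long bytes, long elapsedMillis,
                                    long minBytesPerSecond) {
      bytesFetched += bytes;
      activeMillis += elapsedMillis;
      long bytesPerSecond = activeMillis > 0
          ? (bytesFetched * 1000) / activeMillis : Long.MAX_VALUE;
      // Don't drop the queue - just mark it, so it keeps trickling as long
      // as other queues are still making progress.
      if (bytesPerSecond < minBytesPerSecond) {
        terminable = true;
      }
    }

    public boolean isTerminable() { return terminable; }
  }

Then, near the end of the map task, when every remaining queue reports
isTerminable(), the fetcher can drop them instead of waiting, so the slow
hosts never hold up task completion but still get fetched while other work
is in flight. A real implementation would also want a minimum sample size
before declaring a queue terminable.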

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
