On Thu, 2006-03-09 at 21:51 +0200, Gal Nitzan wrote:
 Actually there is a property in conf: generate.max.per.host

That has proven to be problematic.

foo.domain.com
bar.domain.com
baz.domain.com
*** Repeat up to 4 Million times for some content generator sites ***

Each of these gets a different slot which effectively stalls everything
else.

Are there any objections to changing this to be one bucket per domain
instead of one per hostname?

That sounds like a good idea.

From what I remember when we did this, generating the base domain for a URL is a bit of a fuzzy problem. You have to handle things like language code suffixes, shortened versions of .com under some country codes (.co.jp), etc.
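
To make the fuzziness concrete, here is a rough sketch of a naive base-domain extractor. It is not Nutch code: the class name, the method name, and the tiny suffix list are all made up for illustration, and a real list of two-label suffixes would be far longer. It also mangles raw IP addresses, which would need separate handling.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class BaseDomain {
    // Hypothetical and deliberately incomplete: a real list of two-label
    // suffixes (.co.jp, .co.uk, ...) is much longer and changes over time.
    private static final Set<String> TWO_LABEL_SUFFIXES =
        new HashSet<>(Arrays.asList("co.jp", "co.uk", "com.au"));

    /** Returns a best-guess base domain to use as the bucket key. */
    public static String baseDomain(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        String[] labels = h.split("\\.");
        if (labels.length <= 2) {
            return h; // already a bare domain (or a single label)
        }
        String lastTwo = labels[labels.length - 2] + "." + labels[labels.length - 1];
        // Keep one extra label when the last two form a known suffix.
        int keep = TWO_LABEL_SUFFIXES.contains(lastTwo) ? 3 : 2;
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - keep; i < labels.length; i++) {
            if (sb.length() > 0) {
                sb.append('.');
            }
            sb.append(labels[i]);
        }
        return sb.toString();
    }
}
```

With this, foo.domain.com, bar.domain.com and baz.domain.com all map to the single key domain.com, while www.example.co.jp maps to example.co.jp. Any suffix missing from the list silently produces the wrong bucket, which is the fuzzy part.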

Eventually we shifted to resolving domains to IP addresses. I think there's been discussion of that on this list previously, to help ensure threads on different TaskTracker nodes don't hit the same server at the same time.
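
A minimal sketch of the IP-based grouping, again only as an illustration and not how Nutch or our crawler actually does it. The class and method names are hypothetical. The resolver is passed in as a function so the grouping can be tried without touching DNS; the `resolve` helper shows one plausible default using the standard `InetAddress.getByName` call, falling back to the hostname when lookup fails.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class IpBuckets {
    /** Resolves a hostname to an IP string; falls back to the hostname itself. */
    public static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return host; // assumption: unresolvable hosts get their own bucket
        }
    }

    /** Groups hostnames into buckets keyed by whatever the resolver returns. */
    public static Map<String, List<String>> group(
            List<String> hosts, Function<String, String> resolver) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String host : hosts) {
            String key = resolver.apply(host);
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(host);
        }
        return buckets;
    }
}
```

Calling `IpBuckets.group(hosts, IpBuckets::resolve)` would do live lookups. Note this picks only the first address for a host, so round-robin DNS or hosts with several addresses could still end up spread across more than one bucket.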

For the cases you've run into, do they resolve down to a limited number of unique IP addresses?

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
