On Thu, 2006-03-09 at 21:51 +0200, Gal Nitzan wrote:
Actually there is a property in conf: generate.max.per.host
That has proven to be problematic.
foo.domain.com
bar.domain.com
baz.domain.com
*** Repeat up to 4 Million times for some content generator sites ***
Each of these gets a different slot which effectively stalls everything
else.
Are there any objections to changing this to be one bucket per domain
instead of one per hostname?
That sounds like a good idea.
From what I remember when we did this, generating the base domain for
a URL is a bit of a fuzzy problem. You run into things like language
code suffixes and shortened versions of .com under some country codes
(.co.jp), so simply keeping the last two labels isn't enough.
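
To make the fuzziness concrete, here is a minimal sketch of a hostname-to-domain bucket key. DomainBucket, baseDomain and the suffix list are hypothetical names made up for illustration, not anything in Nutch, and the hard-coded list is deliberately incomplete; a real version would probably want a maintained suffix list.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): maps a hostname to a per-domain bucket key. */
final class DomainBucket {
    // Assumption: a tiny, incomplete sample of two-label suffixes such as .co.jp.
    private static final Set<String> TWO_LABEL_SUFFIXES =
        Set.of("co.jp", "co.uk", "com.au", "co.il");

    /** Returns the last two labels, or the last three when the host ends in a known two-label suffix. */
    static String baseDomain(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1); // drop a trailing root dot
        }
        String[] labels = h.split("\\.");
        if (labels.length <= 2) {
            return h; // already a bare domain (or a single label)
        }
        String lastTwo = labels[labels.length - 2] + "." + labels[labels.length - 1];
        int keep = TWO_LABEL_SUFFIXES.contains(lastTwo) ? 3 : 2;
        // Note: a dotted IPv4 literal is not special-cased and would be truncated here.
        return String.join(".", Arrays.copyOfRange(labels, labels.length - keep, labels.length));
    }
}
```

With this, foo.domain.com, bar.domain.com and baz.domain.com all land in the domain.com bucket, while www.example.co.jp stays as example.co.jp. Any suffix missing from the list falls back to the last two labels, which is exactly where the fuzziness bites.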
Eventually we shifted to resolving domains to IP addresses. I think
there's been discussion of that on this list previously, to help
ensure threads on different TaskTracker nodes don't hit the same
server at the same time.
For the cases you've run into, do they resolve down to a limited
number of unique IP addresses?
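
One rough way to check that is to group the hostnames by resolved address and count the buckets. This is only a sketch: IpBuckets, resolve and groupByIp are hypothetical names, not Nutch code, and the resolver is passed in as a parameter so the grouping can be tried without live DNS.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of Nutch): buckets hostnames by the IP they resolve to. */
final class IpBuckets {
    /** DNS-backed resolver; returns null when the host does not resolve. */
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    /** Groups hosts by resolved address, falling back to the hostname itself when unresolved. */
    static Map<String, List<String>> groupByIp(List<String> hosts,
                                               Function<String, String> resolver) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String host : hosts) {
            String ip = resolver.apply(host);
            String key = (ip != null) ? ip : host; // unresolved hosts get their own bucket
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(host);
        }
        return buckets;
    }
}
```

Calling IpBuckets.groupByIp(hosts, IpBuckets::resolve) on a sample of the generated hostnames and looking at the size of the returned map gives the number of distinct addresses. Note that getByName returns only one address per host, so round-robin DNS could split a single site across several buckets.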
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers