On Thu, 2006-03-09 at 21:51 +0200, Gal Nitzan wrote:
> Actually there is a property in conf: generate.max.per.host

That has proven to be problematic.

foo.domain.com
bar.domain.com
baz.domain.com
*** Repeat up to 4 Million times for some content generator sites ***

Each of these gets a different slot which effectively stalls everything
else.

Are there any objections to changing this to be one bucket per domain
instead of one per hostname?
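To make the per-domain bucketing concrete, here is a minimal standalone sketch of the host-to-domain collapsing it would need. This is not Nutch code; the short list of two-label suffixes is a rough stand-in for a full public-suffix list:

```java
import java.util.Arrays;
import java.util.List;

public class DomainBucket {
    // Hypothetical second-level suffixes that need three labels kept;
    // a real implementation would consult a complete suffix list.
    private static final List<String> TWO_LABEL_TLDS =
            Arrays.asList("co.uk", "com.au", "co.jp");

    /** Collapse a hostname to the domain used as its generator bucket. */
    public static String registeredDomain(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        int n = labels.length;
        if (n <= 2) return host.toLowerCase();
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        int keep = TWO_LABEL_TLDS.contains(lastTwo) ? 3 : 2;
        if (n <= keep) return host.toLowerCase();
        StringBuilder sb = new StringBuilder();
        for (int i = n - keep; i < n; i++) {
            if (sb.length() > 0) sb.append('.');
            sb.append(labels[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // foo/bar/baz.domain.com all fall into the same bucket
        System.out.println(registeredDomain("foo.domain.com"));    // domain.com
        System.out.println(registeredDomain("bar.domain.co.uk"));  // domain.co.uk
    }
}
```

With this, the 4 million foo/bar/baz subdomains above collapse into a single bucket instead of 4 million slots.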

> So if you add a message in Generator.java at the appropriate place... you
> have what you wish...


> -----Original Message-----
> From: Rod Taylor [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, March 08, 2006 7:28 PM
> To: Nutch Developer List
> Subject: Proposal for Avoiding Content Generation Sites
> 
> We've indexed several content generation sites that we want to
> eliminate. One had hundreds of thousands of sub-domains spread across
> several domains (up to 50M pages in total). Quite annoying.
> 
> First is to allow for cleaning up.  This consists of a new option to
> "updatedb" which can scrub the database of all URLs which no longer
> match URLFilter settings (regex-urlfilter.txt). This allows a change in
> the urlfilter to be reflected against Nutch's current dataset, something
> I think others have asked for in the past.
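The scrub pass described above amounts to re-running every stored URL through the current filters and dropping the non-matches. A standalone sketch of the idea, with a hard-coded reject pattern standing in for regex-urlfilter.txt and a made-up domain name for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DbScrub {
    // Stand-in for a '-' (reject) rule from regex-urlfilter.txt;
    // "spammy-generator.com" is a hypothetical example domain.
    private static final Pattern REJECT =
            Pattern.compile("^https?://([^/]*\\.)?spammy-generator\\.com/");

    /** Return only the URLs that still pass the filter, i.e. what a
     *  scrubbing updatedb would keep in the database. */
    public static List<String> scrub(List<String> crawlDbUrls) {
        List<String> kept = new ArrayList<>();
        for (String url : crawlDbUrls) {
            if (!REJECT.matcher(url).find()) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

After a urlfilter change, one such pass over the database removes everything the new rules reject.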
> 
> Second is to treat a subdomain as being in the same bucket as the domain
> for the generator.  This means that *.domain.com or *.domain.co.uk would
> create 2 buckets instead of one per hostname. There is a high likelihood
> that sub-domains will be on the same servers as the primary domain
> and should be rate-limited as such.  generate.max.per.host would work
> more as generate.max.per.domain instead.
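A sketch of how the generator-side cap could apply per domain rather than per host. This is illustrative only, not the Generator.java implementation; the domain collapsing here is crudely approximated by keeping the last two hostname labels:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DomainLimiter {
    /**
     * Select at most maxPerDomain URLs per domain bucket, the way
     * generate.max.per.host caps individual hosts today.
     */
    public static List<String> select(List<String> candidates, int maxPerDomain) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : candidates) {
            String host = URI.create(url).getHost();
            String[] l = host.split("\\.");
            // Crude last-two-labels bucket; ignores co.uk-style suffixes.
            String domain = l.length < 2 ? host
                    : l[l.length - 2] + "." + l[l.length - 1];
            int seen = counts.getOrDefault(domain, 0);
            if (seen < maxPerDomain) {
                counts.put(domain, seen + 1);
                selected.add(url);
            }
        }
        return selected;
    }
}
```

Every subdomain now draws down the same budget, so a site with hundreds of thousands of hostnames can no longer monopolize a fetch list.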
> 
> 
> Third is ongoing detection. I would like to add a feature to Nutch which
> could report anomalies during updatedb or generate. For example, if any
> given domain.com bucket during generate is found to have more than 5000
> URLs queued for download, it should be flagged for manual review: write
> a record to a text file that a person can pick up and check, to confirm
> we haven't wandered into a garbage content generation site.
> If we are in a content generation site, the person would add this domain
> to the urlfilter and the next updatedb would clean out all URLs from
> that location.
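The detection step could be as simple as the following sketch: tally URLs per domain bucket and report every bucket over the threshold. The names here are made up for illustration; in the proposal each flagged record would be appended to a text file for review:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AnomalyReport {
    /** Return "domain<TAB>count" records for domains whose URL count
     *  exceeds the review threshold (e.g. 5000). */
    public static List<String> flag(Map<String, Integer> urlsPerDomain,
                                    int threshold) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Integer> e : urlsPerDomain.entrySet()) {
            if (e.getValue() > threshold) {
                // In the proposal this record would go to a text file
                // that a person reviews before the next updatedb.
                flagged.add(e.getKey() + "\t" + e.getValue());
            }
        }
        return flagged;
    }
}
```

A reviewer who confirms a flagged domain is a content generator adds it to the urlfilter, and the scrubbing updatedb from the first proposal cleans it out.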
> 
> 
> Are there any thoughts or objections to this? The first two are pretty
> straightforward. Detection is not so easy.
> 
-- 
Rod Taylor <[EMAIL PROTECTED]>



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
