On Thu, 2006-03-09 at 21:51 +0200, Gal Nitzan wrote:
> Actually there is a property in conf: generate.max.per.host
That has proven to be problematic:

  foo.domain.com
  bar.domain.com
  baz.domain.com
  *** Repeat up to 4 million times for some content generator sites ***

Each of these gets a different slot, which effectively stalls everything
else. Are there any objections to changing this to be one bucket per
domain instead of one per hostname? (Rough sketches of how each of the
three proposed pieces could look are appended below the quoted message.)

> So if you add a message in Generator.java at the appropriate place...
> you have what you wish...
>
> -----Original Message-----
> From: Rod Taylor [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 08, 2006 7:28 PM
> To: Nutch Developer List
> Subject: Proposal for Avoiding Content Generation Sites
>
> We've indexed several content generation sites that we want to
> eliminate. One had hundreds of thousands of sub-domains spread across
> several domains (up to 50M pages in total). Quite annoying.
>
> First is to allow for cleanup. This consists of a new option to
> "updatedb" which can scrub the database of all URLs which no longer
> match the URLFilter settings (regex-urlfilter.txt). This allows a
> change in the urlfilter to be reflected against Nutch's current
> dataset, something I think others have asked for in the past.
>
> Second is to treat a sub-domain as being in the same bucket as the
> domain for the generator. This means that *.domain.com or
> *.domain.co.uk would create 2 buckets instead of one per hostname.
> There is a high likelihood that sub-domains will be on the same
> servers as the primary domain and should be rate-limited as such.
> generate.max.per.host would work more like generate.max.per.domain
> instead.
>
> Third is ongoing detection. I would like to add a feature to Nutch
> which could report anomalies during updatedb or generate. For
> example, if any given domain.com bucket during generate is found to
> have more than 5000 URLs to be downloaded, it should be flagged for
> manual review: write a record to a text file which can be read by a
> person to confirm that we haven't gotten into a garbage content
> generation site. If we are in a content generation site, the person
> would add that domain to the urlfilter and the next updatedb would
> clean out all URLs from that location.
>
> Are there any thoughts or objections to this? Points 1 and 2 are
> pretty straightforward. Detection is not so easy.

--
Rod Taylor <[EMAIL PROTECTED]>
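
To make the first point concrete, here is a rough, untested sketch of
the updatedb scrub pass, written against the org.apache.nutch.net.URLFilters
API as I understand it (filter() returns null when any configured filter
rejects a URL); the CrawlDbScrubber name is made up for illustration:

    // Hedged sketch only: re-apply the configured URLFilters to every
    // CrawlDb entry during updatedb and drop the URLs they now reject.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilters;

    public class CrawlDbScrubber {          // illustrative name
      private final URLFilters filters;

      public CrawlDbScrubber(Configuration conf) {
        this.filters = new URLFilters(conf);
      }

      /** True if the URL still passes regex-urlfilter.txt and friends. */
      public boolean keep(String url) {
        try {
          // URLFilters.filter() returns null when any filter rejects the URL.
          return filters.filter(url) != null;
        } catch (Exception e) {
          return false; // treat filter errors as a rejection
        }
      }
    }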
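
For the per-domain bucketing, a minimal sketch of a host-to-bucket key.
The two-label heuristic and the small SECOND_LEVEL set are stand-ins; a
real implementation would need a full public-suffix list to handle
registries like .co.uk correctly:

    // Sketch of reducing a hostname to its registered domain so that all
    // sub-domains share one generator bucket. SECOND_LEVEL is a tiny
    // stand-in for a real public-suffix list.
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DomainBucket {
      private static final Set<String> SECOND_LEVEL =
          new HashSet<String>(Arrays.asList("co.uk", "org.uk", "com.au"));

      /** foo.bar.domain.com -> domain.com; bar.domain.co.uk -> domain.co.uk */
      public static String bucketKey(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        int n = labels.length;
        if (n <= 2) return host.toLowerCase();
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        // Keep three labels for two-level registries, two otherwise.
        return SECOND_LEVEL.contains(lastTwo)
            ? labels[n - 3] + "." + lastTwo
            : lastTwo;
      }
    }

With that key, foo.domain.com, bar.domain.com, and baz.domain.com all
land in the single domain.com bucket, so generate.max.per.host
effectively becomes generate.max.per.domain.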
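
And for detection, an illustrative per-bucket counter that writes the
review file. The 5000 threshold comes from the proposal; the class name
and tab-separated report format are hypothetical, and it reuses
bucketKey() from the sketch above:

    // Illustrative detection pass: count generated URLs per domain bucket
    // and write any bucket over the threshold to a report file for a
    // person to review.
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    public class GenerateAnomalyReport {    // illustrative name
      private static final int THRESHOLD = 5000; // from the proposal
      private final Map<String, Integer> counts =
          new HashMap<String, Integer>();

      /** Call once per URL selected by generate. */
      public void record(String url) throws MalformedURLException {
        String domain = DomainBucket.bucketKey(new URL(url).getHost());
        Integer c = counts.get(domain);
        counts.put(domain, c == null ? 1 : c + 1);
      }

      /** One "domain<TAB>count" line per suspicious bucket. */
      public void report(String path) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(path));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          if (e.getValue() > THRESHOLD) {
            out.println(e.getKey() + "\t" + e.getValue());
          }
        }
        out.close();
      }
    }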
