On Thu, 2006-03-09 at 12:09 -0800, Doug Cutting wrote:
> Rod Taylor wrote:
> > First is to allow for cleaning up. This consists of a new option to
> > "updatedb" which can scrub the database of all URLs which no longer
> > match URLFilter settings (regex-urlfilter.txt). This allows a change in
> > the urlfilter to be reflected against Nutch's current dataset, something
> > I think others have asked for in the past.
>
> Yes, this would be a welcome addition. Note that Andrzej recently
> committed a change that causes Generate to filter urls, which achieves
> the same effect, but w/o removing them from the database, so they're
> still consuming space & time.

Excellent. I'll put someone on this.

> > Second is to treat a subdomain as being in the same bucket as the domain
> > for the generator. This means that *.domain.com or *.domain.co.uk would
> > create 2 buckets instead of one per hostname. There is a high likelihood
> > that sub-domains will be on the same servers as the primary domain
> > and should be rate-limited as such. generate.max.per.host would work
> > more as generate.max.per.domain instead.
>
> This could be implemented by adding a new plugin extension point for
> hostname normalization. The default implementation would be a no-op.

Reasonable enough.

> > Third is ongoing detection. I would like to add a feature to Nutch which
> > could report anomalies during updatedb or generate. For example, if any
> > given domain.com bucket during generate is found to have more than 5000
> > URLs to be downloaded, it should be flagged for a manual review. Write a
> > record to a text file which can be read and picked up by a person to
> > confirm that we haven't gotten into a garbage content generation site.
>
> A simple way to implement this would be to have the generator log each
> host that exceeds the limit. Then you can simply grep the logs for
> these messages. Good enough?

Good enough. Thanks for the hints at direction.

-- 
Rod Taylor <[EMAIL PROTECTED]>

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
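
For illustration only, here is a rough sketch of how the second and third ideas above might fit together: hostnames are normalized to a domain bucket, and any bucket over a limit is logged so it can be grepped later. This is not Nutch code and does not use any Nutch API. The names domain_bucket, flag_large_buckets and TWO_LABEL_SUFFIXES are hypothetical, the suffix set is a tiny hand-written stand-in for a real public suffix list, and IP-address hosts are not handled.

```python
import logging
from collections import Counter
from urllib.parse import urlsplit

log = logging.getLogger("generator-sketch")

# Assumption: a tiny hand-written set of two-label suffixes.
# A real crawler would need a complete public suffix list.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def domain_bucket(url):
    """Map a URL to a domain bucket, e.g. a.b.domain.co.uk -> domain.co.uk."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Keep three labels when the last two form a known two-label suffix,
    # otherwise keep two (domain.com).
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def flag_large_buckets(urls, limit=5000):
    """Count URLs per domain bucket and log each bucket above the limit."""
    counts = Counter(domain_bucket(u) for u in urls)
    flagged = sorted(d for d, n in counts.items() if n > limit)
    for d in flagged:
        # One line per offending bucket, so the log can simply be grepped.
        log.warning("domain %s exceeds limit: %d URLs > %d", d, counts[d], limit)
    return flagged
```

With this sketch, www.domain.com and images.domain.com fall into the same domain.com bucket, while shop.domain.co.uk falls into domain.co.uk, which matches the two buckets described above. The limit of 5000 is taken from the example in the thread.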
