On Thu, 2006-03-09 at 12:09 -0800, Doug Cutting wrote:
> Rod Taylor wrote:
> > First is to allow for cleaning up.  This consists of a new option to
> > "updatedb" which can scrub the database of all URLs which no longer
> > match URLFilter settings (regex-urlfilter.txt). This allows a change in
> > the urlfilter to be reflected against Nutch's current dataset, something
> > I think others have asked for in the past.
> 
> Yes, this would be a welcome addition.  Note that Andrzej recently 
> committed a change that causes Generate to filter urls, which achieves 
> the same effect, but w/o removing them from the database, so they're 
> still consuming space & time.

Excellent. I'll put someone on this.
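
To make the cleanup step concrete, here is a rough sketch in Python. It is
not Nutch code: the function names and the dict standing in for the database
are made up for illustration, and the rule handling (lines starting with
"+" or "-", first matching rule wins, no match means reject) is my
understanding of the regex-urlfilter.txt convention rather than a checked
implementation of it.

```python
import re

# Illustrative rules in the "+regex" / "-regex" style; not a real config.
RULES = r"""
# skip common image suffixes
-\.(gif|jpg|png)$
# accept anything under example.com
+^https?://([a-z0-9-]+\.)*example\.com/
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches nothing is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


def scrub(db, rules):
    """Return a copy of db (url -> metadata) without URLs the rules reject."""
    return {url: meta for url, meta in db.items() if accepts(rules, url)}


db = {
    "http://www.example.com/a.html": 1,
    "http://www.example.com/logo.gif": 2,
    "http://other.org/": 3,
}
kept = scrub(db, parse_rules(RULES))
```

The point of doing this at updatedb time rather than only at generate time
is that rejected entries actually leave the database instead of being
skipped on every pass.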

> > Second is to treat a subdomain as being in the same bucket as the domain
> > for the generator.  This means that *.domain.com or *.domain.co.uk would
> > create 2 buckets instead of one per hostname. There is a high
> > likelihood that sub-domains will be on the same servers as the primary domain
> > and should be rate-limited as such.  generate.max.per.host would work
> > more as generate.max.per.domain instead.
> 
> This could be implemented by adding a new plugin extension point for 
> hostname normalization.  The default implementation would be a no-op.

Reasonable enough.
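
A sketch of what such a normalizer might look like, in Python rather than
as a Nutch plugin. Everything here is hypothetical: the function names are
invented, and the tiny hard-coded suffix set only stands in for a proper
public suffix list, which a real implementation would need in order to
handle cases like co.uk correctly.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative only: a real normalizer would consult a full public suffix list.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def host_key(url):
    """Default (no-op) normalization: bucket by the hostname itself."""
    return (urlsplit(url).hostname or "").lower()


def domain_key(url):
    """Bucket by registered domain, so subdomains share one bucket."""
    host = host_key(url)
    try:
        ipaddress.ip_address(host)
        return host  # IP addresses have no domain part to collapse
    except ValueError:
        pass
    labels = host.split(".")
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


urls = [
    "http://www.domain.com/a",
    "http://images.domain.com/b",
    "http://www.domain.co.uk/c",
    "http://shop.eu.domain.co.uk/d",
]
buckets = {domain_key(u) for u in urls}
```

With domain_key the four URLs fall into two buckets, where host_key would
give four, and the existing per-host limit would then apply per domain.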

> > Third is ongoing detection. I would like to add a feature to Nutch which
> > could report anomalies during updatedb or generate. For example, if any
> > given domain.com bucket during generate is found to have more than 5000
> > URLs to be downloaded, it should be flagged for a manual review. Write a
> > record to a text file which can be read and picked up by a person to
> > confirm that we haven't gotten into a garbage content generation site.
> 
> A simple way to implement this would be to have the generator log each 
> host that exceeds the limit.  Then you can simply grep the logs for 
> these messages.  Good enough?

Good enough.
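
For reference, the logging approach could look roughly like the sketch
below. It is Python, not the actual generator; the logger name, the
HOST_OVER_LIMIT marker and the function name are all invented for the
example, and the limit of 5000 is just the figure from my original mail.

```python
import logging
from collections import Counter
from urllib.parse import urlsplit

log = logging.getLogger("generator_sketch")  # hypothetical logger name


def flag_oversized_hosts(urls, limit=5000):
    """Log one greppable line per host with more than `limit` URLs."""
    counts = Counter((urlsplit(u).hostname or "").lower() for u in urls)
    flagged = sorted(host for host, n in counts.items() if n > limit)
    for host in flagged:
        # Fixed marker at the start so the line is easy to grep for.
        log.warning("HOST_OVER_LIMIT host=%s urls=%d limit=%d",
                    host, counts[host], limit)
    return flagged


sample = ["http://a.example.com/%d" % i for i in range(4)]
sample.append("http://b.example.com/1")
flagged = flag_oversized_hosts(sample, limit=3)
```

Grepping the log for the marker string then gives the list of hosts to
review by hand.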

Thanks for the hints at direction.

-- 
Rod Taylor <[EMAIL PROTECTED]>



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
