Rod,
a few days ago I wrote a small tool that filters a crawlDb.
You can find it here now:
http://issues.apache.org/jira/browse/NUTCH-226
Give it a try and let me know whether it works for you; in any case,
back up your crawlDb first!
I have only tested it with a small crawlDb, so use it at your own risk. :)
Stefan
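The core idea behind such a re-filtering tool is simple: walk every URL in the crawl db and keep only the entries that still pass the current filter rules. The sketch below illustrates that idea in plain Java; the class name `CrawlDbRefilter`, the method names, and the regex-based reject list are all illustrative stand-ins (the real implementation is in NUTCH-226 and uses Nutch's URLFilter plugin chain, not raw regexes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the re-filtering idea discussed in this thread.
// Not the NUTCH-226 code: names and filter rules are illustrative only.
public class CrawlDbRefilter {

    // Stand-in for Nutch's URLFilter chain: reject any URL matching one
    // of the "crap domain" patterns, accept everything else.
    static boolean passesFilters(String url, List<Pattern> rejectPatterns) {
        for (Pattern p : rejectPatterns) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }

    // Re-filter the db: emit only the entries whose URL still passes.
    static List<String> refilter(List<String> dbUrls, List<Pattern> rejectPatterns) {
        List<String> kept = new ArrayList<>();
        for (String url : dbUrls) {
            if (passesFilters(url, rejectPatterns)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Pattern> reject = Arrays.asList(Pattern.compile("crap-domain\\.example"));
        List<String> db = Arrays.asList(
                "http://good.example/page",
                "http://crap-domain.example/spam");
        // Only the URL from the good domain survives re-filtering.
        System.out.println(refilter(db, reject));
    }
}
```

In Nutch itself this same pass would run as a map job over the crawl db segments, which is why re-running it over a 100M-URL db is feasible without rebuilding from scratch.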
On 08.03.2006, at 19:47, Rod Taylor wrote:
On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
Don't generate URLs that don't pass URLFilters.
Just to be clear, this is to support folks changing their filters
while they're crawling, right? We already filter before we
Yes, and this seems to be the most common case. This is especially
important since there are no tools yet to clean up the DB.
I have this situation now. There are over 100M URLs in my DB from crap
domains that I want to get rid of.
Adding a --refilter option to updatedb seemed like the most obvious
course of action.
A completely separate command so it could be initiated by hand would
also work for me.
--
Rod Taylor <[EMAIL PROTECTED]>
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com