Rod,
a few days ago I wrote a small tool that filters a crawlDb.
You can find it here now:
http://issues.apache.org/jira/browse/NUTCH-226
Give it a try and let me know whether it works for you; in any case,
back up your crawlDb first!
I have only tested it with a small crawlDb, so use it at your own risk. :)
Stefan
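The core idea behind such a re-filtering tool is simple: walk every URL in the crawl db and keep only the entries that still pass the current filter rules. The sketch below illustrates that idea in plain Java; the class name `CrawlDbRefilter`, the method names, and the regex-based reject list are all illustrative stand-ins (the real implementation is in NUTCH-226 and uses Nutch's URLFilter plugin chain, not raw regexes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the re-filtering idea discussed in this thread.
// Not the NUTCH-226 code: names and filter rules are illustrative only.
public class CrawlDbRefilter {

    // Stand-in for Nutch's URLFilter chain: reject any URL matching one
    // of the "crap domain" patterns, accept everything else.
    static boolean passesFilters(String url, List<Pattern> rejectPatterns) {
        for (Pattern p : rejectPatterns) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }

    // Re-filter the db: emit only the entries whose URL still passes.
    static List<String> refilter(List<String> dbUrls, List<Pattern> rejectPatterns) {
        List<String> kept = new ArrayList<>();
        for (String url : dbUrls) {
            if (passesFilters(url, rejectPatterns)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Pattern> reject = Arrays.asList(Pattern.compile("crap-domain\\.example"));
        List<String> db = Arrays.asList(
                "http://good.example/page",
                "http://crap-domain.example/spam");
        // Only the URL from the good domain survives re-filtering.
        System.out.println(refilter(db, reject));
    }
}
```

In Nutch itself this same pass would run as a map job over the crawl db segments, which is why re-running it over a 100M-URL db is feasible without rebuilding from scratch.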
On 08.03.2006, at 19:47, Rod Taylor wrote:
On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
Don't generate URLs that don't pass URLFilters.
Just to be clear, this is to support folks changing their filters
while they're crawling, right? We already filter before we
Yes, and this seems to be the most common case. This is especially
important since there are no tools yet to clean up the DB.
I have this situation now. There are over 100M URLs in my DB from crap
domains that I want to get rid of.
Adding a --refilter option to updatedb seemed like the most obvious
course of action.
A completely separate command so it could be initiated by hand would
also work for me.
--
Rod Taylor <[EMAIL PROTECTED]>
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com