Rod, I just posted my PruneDB.java file to:
http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html
(104 lines, Nutch 0.7 only.)
License granted to anyone to hack/copy this as they wish. Should be easy
to adapt to 0.8.
Usage: PruneDB <db> [-s]
Where: <db> is the path of the Nutch db to prune
       -s   simulate: parses the db, but doesn't delete any pages
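
In case the blog post moves, here's a minimal sketch of the idea (not the
posted file): walk every page in the db, re-run the URL filters over each
URL, and delete the pages that no longer pass. The real PruneDB.java uses
Nutch 0.7's WebDB reader/writer and URLFilter plugin; the tiny interfaces
below are hypothetical stand-ins so the pruning loop itself is clear, and
the db/filter wiring is left out.

  import java.util.List;

  // Sketch only: the real tool is at the URL above. PageStore and
  // UrlFilter are made-up stand-ins for Nutch 0.7's WebDB classes and
  // URLFilter plugin, not actual Nutch API.
  interface PageStore {
    List<String> urls();            // every URL currently in the db
    void deletePage(String url);    // remove one page from the db
    void close();
  }

  interface UrlFilter {
    // Nutch-style contract: return the URL if it passes, null if rejected.
    String filter(String url);
  }

  public class PruneDbSketch {

    // Counts the pages that fail the filters; deletes them from the db
    // unless simulate is true (the -s behavior above).
    static long prune(PageStore db, UrlFilter filter, boolean simulate) {
      long pruned = 0;
      for (String url : db.urls()) {
        if (filter.filter(url) == null) {   // filter now rejects this URL
          pruned++;
          if (!simulate) {
            db.deletePage(url);             // actually drop it
          }
        }
      }
      db.close();
      return pruned;
    }
  }

With -s you'd call prune(db, filter, true), which only counts what would be
deleted, same as the simulate mode described above.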
--Matt
On Mar 8, 2006, at 1:47 PM, Rod Taylor wrote:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
> > Doug Cutting wrote:
> > > [EMAIL PROTECTED] wrote:
> > > > Don't generate URLs that don't pass URLFilters.
> > >
> > > Just to be clear, this is to support folks changing their filters
> > > while they're crawling, right? We already filter before we
> >
> > Yes, and this seems to be the most common case. This is especially
> > important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M URLs in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command, so it could be initiated by hand, would
> also work for me.
>
> --
> Rod Taylor <[EMAIL PROTECTED]>
--
Matt Kangas / [EMAIL PROTECTED]