As for reading the crawldb, you can use
org.apache.nutch.crawl.CrawlDbReader. It can dump the crawldb into a
readable text file as well as query individual URLs.
Run it without arguments to see its usage.
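CrawlDbReader is exposed through the "readdb" command of the nutch script, so you don't normally invoke the class directly. A rough sketch of the common invocations (the crawldb path crawl/crawldb is just an example, substitute your own):

```shell
# Dump the whole crawldb into readable text files under crawldb_dump/
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Print aggregate statistics (URL counts per fetch status, score stats)
bin/nutch readdb crawl/crawldb -stats

# Query the CrawlDatum record for a single URL
bin/nutch readdb crawl/crawldb -url http://example.com/page.html
```

The dump output is plain text, so you can grep it to verify which of the accidentally injected URLs actually made it into the db.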
On 10/31/2011 08:47 PM, Markus Jelsma wrote:
Hi
Write a regex URL filter and use it the next time you update the db; the
unwanted URLs will disappear. Be sure to back up the db first in case your
regex catches valid URLs. Nutch 1.5 will have an option to keep the previous
version of the DB after an update.
cheers
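To illustrate the suggestion above, a filter rule for the unwanted URLs would go into conf/regex-urlfilter.txt. Rules are evaluated top to bottom and the first match wins: a leading '-' rejects, a leading '+' accepts. The hostname below is only a placeholder for whatever you accidentally injected:

```
# Reject the accidentally injected URLs (example pattern)
-^http://bad\.example\.com/

# Accept everything else
+.
```

Then run your next updatedb with filtering enabled (in 1.x the updatedb command accepts a -filter flag) and the rejected URLs should be dropped from the new crawldb. As noted, back up the db first.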
We accidentally injected some URLs into the crawl database and I need to
remove them. From what I understand, in 1.4 I can view and modify the URLs
and indexes, but I can't seem to find any information on how to do this.
Is there any documentation available on this?