As for reading the crawldb, you can use
org.apache.nutch.crawl.CrawlDbReader. It can dump the crawldb into a
readable text file as well as query individual URLs.
Run it without arguments to see its usage.
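CrawlDbReader is exposed through the "readdb" command of the nutch script, so you don't normally invoke the class directly. A rough sketch of the common invocations (the crawldb path crawl/crawldb is just an example, substitute your own):

```shell
# Dump the whole crawldb into readable text files under crawldb_dump/
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Print aggregate statistics (URL counts per fetch status, score stats)
bin/nutch readdb crawl/crawldb -stats

# Query the CrawlDatum record for a single URL
bin/nutch readdb crawl/crawldb -url http://example.com/page.html
```

The dump output is plain text, so you can grep it to verify which of the accidentally injected URLs actually made it into the db.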
On 10/31/2011 08:47 PM, Markus Jelsma wrote:
Hi
Write a regex URL filter and use it the next time you update the db; the
unwanted URLs will disappear. Be sure to back up the db first in case your
regex catches valid URLs. Nutch 1.5 will have an option to keep the previous
version of the DB after an update.
cheers
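To illustrate the suggestion above, a filter rule for the unwanted URLs would go into conf/regex-urlfilter.txt. Rules are evaluated top to bottom and the first match wins: a leading '-' rejects, a leading '+' accepts. The hostname below is only a placeholder for whatever you accidentally injected:

```
# Reject the accidentally injected URLs (example pattern)
-^http://bad\.example\.com/

# Accept everything else
+.
```

Then run your next updatedb with filtering enabled (in 1.x the updatedb command accepts a -filter flag) and the rejected URLs should be dropped from the new crawldb. As noted, back up the db first.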
We accidentally injected some URLs into the crawl database and I need to
remove them. From what I understand, in 1.4 I can view and modify the URLs
and indexes, but I can't seem to find any information on how to do this.
Is there any documentation available on this?