Already did that, but it doesn't allow me to delete URLs from the list to
be crawled.
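For reference, the CrawlDbReader functionality mentioned in the quoted advice below is exposed through the `readdb` command; a typical invocation (the crawldb and output paths here are hypothetical) looks like:

```shell
# Dump the whole crawldb to readable text files
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Print overall crawldb statistics
bin/nutch readdb crawl/crawldb -stats

# Query the status of a single URL
bin/nutch readdb crawl/crawldb -url http://example.com/page.html
```

Note that `readdb` is read-only: it dumps and queries but does not delete, which is why filtering on the next update is the suggested route.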

On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <[email protected]> wrote:

> As for reading the crawldb, you can use org.apache.nutch.crawl.CrawlDbReader.
> This allows for dumping the crawldb into a readable text file as well as
> querying individual URLs. Run it without args to see its usage.
>
>
> On 10/31/2011 08:47 PM, Markus Jelsma wrote:
>
>> Hi
>>
>> Write a regex URL filter and use it the next time you update the db; the
>> URLs will disappear. Be sure to back up the db first in case your regex
>> catches valid URLs. Nutch 1.5 will have an option to keep the previous
>> version of the DB after update.
>>
>> cheers
>>
>>> We accidentally injected some URLs into the crawl database and I need to
>>> remove them. From what I understand, in 1.4 I can view and modify the
>>> URLs and indexes, but I can't seem to find any information on how to do
>>> this.
>>>
>>> Is there anything regarding this available?
>>>
>>
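Nutch's regex URL filter reads rules in order, where a line starting with `-` rejects matching URLs and a line starting with `+` accepts them, and the first matching rule wins. A minimal Python sketch of that first-match-wins behavior (the rules and URLs here are hypothetical, not from this thread):

```python
import re

# Hypothetical rules mimicking regex-urlfilter.txt semantics:
# a '-' rule rejects, a '+' rule accepts; the first match wins.
RULES = [
    ("-", re.compile(r"^http://example\.com/accidental/")),  # drop the bad URLs
    ("+", re.compile(r".")),                                 # accept everything else
]

def accepts(url):
    """Return True if the URL survives the filter."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: rejected by default

print(accepts("http://example.com/accidental/page.html"))  # False
print(accepts("http://example.com/good/page.html"))        # True
```

Because rejected URLs are simply dropped on the next `updatedb` pass, the order of rules matters: put the narrow `-` exclusions before the catch-all `+` rule.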
