> I think you must add a regex to regex-urlfilter.txt. In that case those
> urls will not be fetched by the fetcher.

Yes, but if you also apply that filter when running updatedb, those urls will 
disappear from the crawldb entirely.
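For example, with the default regex-urlfilter.txt syntax (rules are applied top-down, first match wins; the host below is just a placeholder for whatever you injected by accident) the filter file could look like:

```
# regex-urlfilter.txt
# Drop everything under the accidentally injected host (example host):
-^http://badhost\.example\.com/

# Keep the usual catch-all accept rule last:
+.
```

Then, assuming your Nutch version supports the -filter flag on the crawldb update job, something like `bin/nutch updatedb crawl/crawldb crawl/segments/* -filter` should re-filter the existing db entries so the matching records are removed, not just skipped at fetch time. As Markus says below, back up the crawldb first in case the regex matches valid urls.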

> -----Original Message-----
> From: Bai Shen <[email protected]>
> To: user <[email protected]>
> Sent: Tue, Nov 1, 2011 10:35 am
> Subject: Re: Removing urls from crawl db
> 
> 
> Already did that.  But it doesn't allow me to delete urls from the list to
> be crawled.
> 
> On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema 
<[email protected]>wrote:
> > As for reading the crawldb, you can use
> > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > crawldb into a readable textfile as well as querying individual urls.
> > Run without args to see its usage.
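As a rough sketch of what that looks like from the command line (paths here are assumptions; `crawl/crawldb` is just an example crawldb location):

```shell
# Dump the whole crawldb to readable text files under dump_dir:
bin/nutch readdb crawl/crawldb -dump dump_dir

# Look up the status of a single url:
bin/nutch readdb crawl/crawldb -url http://example.com/some/page

# Print summary statistics (url counts per status, etc.):
bin/nutch readdb crawl/crawldb -stats
```

Running `bin/nutch readdb` with no arguments prints the full usage, which is the authoritative list of options for your version.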
> > 
> > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> >> Hi
> >> 
> >> Write a regex URL filter and use it the next time you update the db;
> >> the URLs will disappear. Be sure to back up the db first in case your
> >> regex catches valid URLs. Nutch 1.5 will have an option to keep the
> >> previous version of the db after an update.
> >> 
> >> cheers
> >> 
> >>> We accidentally injected some urls into the crawl database and I need
> >>> to go remove them.  From what I understand, in 1.4 I can view and
> >>> modify the urls and indexes.  But I can't seem to find any information
> >>> on how to do this.
> >>> 
> >>> Is there anything regarding this available?
