If I want to remove example.org from my CrawlDB using regex filters, I'll add:

-^http://example\.org/

and run updatedb with filtering enabled. The URLs will then be deleted.
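In full, a sketch of that workflow might look like this (the crawl paths and segment name are assumptions for illustration, not your actual layout; the grep line only sanity-checks that the pattern matches what you expect before you let it loose on the db):

```shell
# The filter rule to add to conf/regex-urlfilter.txt, where a '-' prefix
# rejects matching URLs and a '+' prefix accepts them:
#   -^http://example\.org/
rule='^http://example\.org/'

# Sanity-check which URLs the pattern would reject, using grep -E with
# the same extended regex:
printf '%s\n' \
  'http://example.org/page1' \
  'http://example.com/page2' | grep -E "$rule"
# Only http://example.org/page1 matches, so only it would be dropped.

# Back up the CrawlDB first (as advised below), then re-run updatedb with
# filtering enabled so existing entries are re-checked against the filters.
# Paths and segment name are placeholders:
# cp -r crawl/crawldb crawl/crawldb.bak
# bin/nutch updatedb crawl/crawldb crawl/segments/20111110 -filter
```

Note that rules in regex-urlfilter.txt are applied top-down, so the exclusion line must come before any broad `+.` catch-all rule.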

On Thursday 10 November 2011 16:36:24 Bai Shen wrote:
> Can you give me an example of how would I set my URL filter to do this?
> Right now I'm just using the default.
> 
> On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma
> 
> <[email protected]>wrote:
> > Hi
> > 
> > Write a regex URL filter and use it the next time you update the db;
> > the offending URLs will disappear. Be sure to back up the db first in
> > case your regex catches valid URLs. Nutch 1.5 will have an option to
> > keep the previous version of the DB after update.
> > 
> > cheers
> > 
> > > We accidentally injected some urls into the crawl database and I need
> > > to go remove them.  From what I understand, in 1.4 I can view and
> > > modify the urls and indexes.  But I can't seem to find any information
> > > on how to do this.
> > > 
> > > Is there anything regarding this available?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
