No idea where I got that impression from. I just thought it was one of the reasons to move to 1.4 even though it's still in dev.
On Tue, Nov 1, 2011 at 4:54 PM, Markus Jelsma <[email protected]> wrote:

> > It seems like there would be a better way to do that.
>
> The problem is that there are many files storing URL's: CrawlDB, LinkDB,
> WebGraph DB's, segment data. There is in Nutch 1.x no single place where
> you can find an URL.
>
> For example, if we find URL patterns we don't want, we write additional
> filters for them and have to update all DB's again, which can take
> minutes, hours or days depending on size and cluster capacity.
>
> > I thought 1.4 was going to have a Luke-style capability in regards to
> > its data?
>
> Where did you read that? That is, unfortunately, not the case :)
>
> > On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma <[email protected]> wrote:
> >
> > > > I think you must add a regex to regex-urlfilter.txt. In that case
> > > > those urls will not be fetched by the fetcher.
> > >
> > > Yes, but if you use it when doing updatedb, it will disappear from
> > > the crawldb entirely.
> > >
> > > > -----Original Message-----
> > > > From: Bai Shen <[email protected]>
> > > > To: user <[email protected]>
> > > > Sent: Tue, Nov 1, 2011 10:35 am
> > > > Subject: Re: Removing urls from crawl db
> > > >
> > > > Already did that. But it doesn't allow me to delete urls from the
> > > > list to be crawled.
> > > >
> > > > On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <[email protected]> wrote:
> > > >
> > > > > As for reading the crawldb, you can use
> > > > > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > > > > crawldb into a readable text file as well as querying individual
> > > > > urls. Run without args to see its usage.
> > > > >
> > > > > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > Write a regex URL filter and use it the next time you update the
> > > > > > db; it will disappear. Be sure to backup the db first in case
> > > > > > your regex catches valid URL's. Nutch 1.5 will have an option to
> > > > > > keep the previous version of the DB after update.
> > > > > >
> > > > > > cheers
> > > > > >
> > > > > > > We accidentally injected some urls into the crawl database and
> > > > > > > I need to go remove them. From what I understand, in 1.4 I can
> > > > > > > view and modify the urls and indexes. But I can't seem to find
> > > > > > > any information on how to do this.
> > > > > > >
> > > > > > > Is there anything regarding this available?
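For anyone following the regex-urlfilter.txt suggestion above, here is a minimal Python sketch of how those rules behave. This is not Nutch's actual RegexURLFilter code; the function and rule list below are made up for illustration. The assumed semantics (each line is `+` accept or `-` reject followed by a regex, the first matching rule wins, and a URL matching no rule is rejected) follow the comments in the stock regex-urlfilter.txt, so verify against your own Nutch version.

```python
import re

def filter_url(url, rules):
    """Sketch of regex-urlfilter semantics: rules is a list of
    (sign, pattern) pairs, where sign is '+' (accept) or '-' (reject).
    The first rule whose regex matches the URL decides its fate."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    # No rule matched: reject, mirroring the stock filter's default.
    return False

# Hypothetical rules: drop the accidentally injected host, keep the rest.
rules = [
    ('-', r'^http://bad\.example\.com/'),
    ('+', r'.'),
]

print(filter_url('http://bad.example.com/page', rules))   # False
print(filter_url('http://good.example.org/page', rules))  # True
```

With a rule like the `-^http://bad\.example\.com/` line in regex-urlfilter.txt, the next `updatedb` should drop those URLs from the crawldb, which is why backing up the db first (as Markus says above) matters.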

