No idea where I got that impression from. I just thought it was one of the reasons to move to 1.4 even though it's still in dev.
On Tue, Nov 1, 2011 at 4:54 PM, Markus Jelsma <[email protected]> wrote:

> > It seems like there would be a better way to do that.
>
> The problem is that there are many files storing URL's: CrawlDB, LinkDB,
> WebGraph DB's, segment data. There is in Nutch 1.x no single place where
> you can find an URL.
>
> For example, if we find URL patterns we don't want, we write additional
> filters for them and have to update all DB's again, which can take
> minutes, hours or days depending on size and cluster capacity.
>
> > I thought 1.4 was going to have a Luke-style capability in regards to
> > its data?
>
> Where did you read that? That is, unfortunately, not the case :)
>
> > On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma <[email protected]> wrote:
> >
> > > > I think you must add a regex to regex-urlfilter.txt. In that case
> > > > those urls will not be fetched by the fetcher.
> > >
> > > Yes, but if you use it when doing updatedb, it will disappear from
> > > the crawldb entirely.
> > >
> > > > -----Original Message-----
> > > > From: Bai Shen <[email protected]>
> > > > To: user <[email protected]>
> > > > Sent: Tue, Nov 1, 2011 10:35 am
> > > > Subject: Re: Removing urls from crawl db
> > > >
> > > > Already did that. But it doesn't allow me to delete urls from the
> > > > list to be crawled.
> > > >
> > > > On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <[email protected]> wrote:
> > > >
> > > > > As for reading the crawldb, you can use
> > > > > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > > > > crawldb into a readable text file as well as querying individual
> > > > > urls. Run without args to see its usage.
> > > > >
> > > > > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > Write a regex URL filter and use it the next time you update the
> > > > > > db; it will disappear. Be sure to backup the db first in case
> > > > > > your regex catches valid URL's. Nutch 1.5 will have an option to
> > > > > > keep the previous version of the DB after update.
> > > > > >
> > > > > > cheers
> > > > > >
> > > > > > > We accidentally injected some urls into the crawl database and
> > > > > > > I need to go remove them. From what I understand, in 1.4 I can
> > > > > > > view and modify the urls and indexes. But I can't seem to find
> > > > > > > any information on how to do this.
> > > > > > >
> > > > > > > Is there anything regarding this available?
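For anyone following the regex-urlfilter.txt suggestion above, here is a minimal Python sketch of how those rules behave. This is not Nutch's actual RegexURLFilter code; the function and rule list below are made up for illustration. The assumed semantics (each line is `+` accept or `-` reject followed by a regex, the first matching rule wins, and a URL matching no rule is rejected) follow the comments in the stock regex-urlfilter.txt, so verify against your own Nutch version.

```python
import re

def filter_url(url, rules):
    """Sketch of regex-urlfilter semantics: rules is a list of
    (sign, pattern) pairs, where sign is '+' (accept) or '-' (reject).
    The first rule whose regex matches the URL decides its fate."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    # No rule matched: reject, mirroring the stock filter's default.
    return False

# Hypothetical rules: drop the accidentally injected host, keep the rest.
rules = [
    ('-', r'^http://bad\.example\.com/'),
    ('+', r'.'),
]

print(filter_url('http://bad.example.com/page', rules))   # False
print(filter_url('http://good.example.org/page', rules))  # True
```

With a rule like the `-^http://bad\.example\.com/` line in regex-urlfilter.txt, the next `updatedb` should drop those URLs from the crawldb, which is why backing up the db first (as Markus says above) matters.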

