This technique looks fine if filtering only needs to be done once (or to avoid crawling certain URLs that were erroneously injected into the crawldb).
But is it the right way to go for larger crawls? In a scenario where certain domains are periodically injected to be crawled and/or removed, because some domains no longer need to be crawled (though they could be reintroduced at a later stage), won't managing regex-urlfilter.txt become tedious as the filters accumulate? Isn't there a way to completely remove URLs (from certain domains) from the crawldb when they are no longer required to be crawled, and to inject them again at a later stage if need be?

Thanks,
--Sudip.

On Fri, Nov 11, 2011 at 2:00 AM, Markus Jelsma <[email protected]> wrote:
> Uh, the filter checker immediately produces output.
>
> > Interesting. What kind of output should I expect to see? So far it's been
> > running for a while with no output.
> >
> > On Thu, Nov 10, 2011 at 1:51 PM, Markus Jelsma
> > <[email protected]> wrote:
> > > You can use bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> > > to test.
> > >
> > > > Okay. So I would just put that above the +. line, right?
> > > >
> > > > Thanks.
> > > >
> > > > On Thu, Nov 10, 2011 at 10:42 AM, Markus Jelsma
> > > > <[email protected]> wrote:
> > > > > If I want to remove example.org from my CrawlDB using regex filters,
> > > > > I'll add:
> > > > >
> > > > > -^http://example\.org/
> > > > >
> > > > > and run updatedb with filtering enabled. The URLs will then be
> > > > > deleted.
> > > > >
> > > > > On Thursday 10 November 2011 16:36:24 Bai Shen wrote:
> > > > > > Can you give me an example of how I would set my URL filter to do
> > > > > > this? Right now I'm just using the default.
> > > > > >
> > > > > > On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma
> > > > > > <[email protected]> wrote:
> > > > > > > Hi
> > > > > > >
> > > > > > > Write a regex URL filter and use it the next time you update the
> > > > > > > db; it will disappear. Be sure to back up the db first in case
> > > > > > > your regex catches valid URLs. Nutch 1.5 will have an option to
> > > > > > > keep the previous version of the DB after update.
> > > > > > >
> > > > > > > cheers
> > > > > > >
> > > > > > > > We accidentally injected some urls into the crawl database and
> > > > > > > > I need to go remove them. From what I understand, in 1.4 I can
> > > > > > > > view and modify the urls and indexes. But I can't seem to find
> > > > > > > > any information on how to do this.
> > > > > > > >
> > > > > > > > Is there anything regarding this available?
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
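For anyone following along: the behaviour Markus relies on (rules in regex-urlfilter.txt are tried in order, the first match wins, `-` rejects the URL and `+` accepts it, with `+.` as the catch-all at the bottom) can be sketched outside Nutch like this. This is only an illustration of the rule semantics, not Nutch's actual RegexURLFilter code, and the sample URLs are made up:

```python
import re

# Rules in regex-urlfilter.txt order: first matching rule wins.
# '-' rejects the URL, '+' accepts it.
RULES = [
    ("-", re.compile(r"^http://example\.org/")),  # drop everything under example.org
    ("+", re.compile(r".")),                      # the default '+.' catch-all: accept the rest
]

def accepts(url):
    """Return True if the URL survives filtering, False if it is dropped."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: URL is rejected

# Rough idea of what an updatedb run with filtering enabled does:
# URLs the filter rejects do not make it into the new crawldb.
crawldb = [
    "http://example.org/page1",
    "http://example.com/page2",
]
kept = [u for u in crawldb if accepts(u)]  # only the example.com URL remains
```

This also shows why the `-` rule has to sit above the `+.` line, as asked earlier in the thread: if `+.` came first, every URL would be accepted before the reject rule was ever consulted.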

