Maybe. I'll give it a try once I manage to remove the bad urls from my crawldb.
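A rough sketch of how that cleanup might look, assuming the -^http://www\.youtube\.com/ rule is already active in regex-urlfilter.txt and the CrawlDB lives under crawl/crawldb (paths are placeholders), is to write a filtered copy with the CrawlDbMerger and then swap it in:

  # Write a filtered copy of the CrawlDB; -filter runs every record through
  # the configured URL filter plugins, so the youtube entries get dropped.
  bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

  # After sanity-checking the output, replace the old CrawlDB with it.
  mv crawl/crawldb crawl/crawldb-old
  mv crawl/crawldb-filtered crawl/crawldb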
On Mon, Jun 11, 2012 at 2:55 AM, Matthias Paul <magethle.nu...@gmail.com> wrote:

> Wouldn't it be enough to filter and normalize urls once during the parsing?
> Then in generate, update and invert links it shouldn't be necessary any more.
>
> On Fri, Jun 8, 2012 at 3:53 PM, Bai Shen <baishen.li...@gmail.com> wrote:
> > I'm attempting to filter during the generating. I removed the noFilter and
> > noNorm flags from my generate job. I have around 10M records in my crawldb.
> >
> > The generate job has been running for several days now. Is there a way to
> > check its progress and/or make sure it's not hung?
> >
> > Also, is there a faster way to do this? It seems like I shouldn't need to
> > filter the entire crawldb every time I generate a segment. Just the new
> > urls that were found in the latest fetch.
> >
> > On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >
> >> -----Original message-----
> >> > From: Bai Shen <baishen.li...@gmail.com>
> >> > Sent: Tue 22-May-2012 19:40
> >> > To: user@nutch.apache.org
> >> > Subject: URL filtering and normalization
> >> >
> >> > Somehow my crawler started fetching youtube. I'm not really sure why, as I
> >> > have db.ignore.external.links set to true.
> >>
> >> Weird!
> >>
> >> > I've since added the following line to my regex-urlfilter.txt file.
> >> >
> >> > -^http://www\.youtube\.com/
> >>
> >> For domain filtering you should use the domain-urlfilter or
> >> domain-blacklistfilter. It's faster and easier to maintain.
> >>
> >> > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> >> > -noFilter and -noNorm options with generate. I'm also not using the
> >> > -filter and -normalize options for updatedb.
> >>
> >> You must either filter out all YT records from the CrawlDB or filter
> >> during generating.
> >>
> >> > According to Markus in this thread, the normalization and filtering should
> >> > still occur even when using the above options and using 1.4:
> >> >
> >> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
> >> >
> >> > Is there a setting I'm missing? I'm not seeing anything in the logs
> >> > regarding this.
> >> >
> >> > Thanks.
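P.S. For the archives, a minimal sketch of running with filtering on during generate, as suggested above; the crawl/crawldb and crawl/segments paths, the topN value, and the segment name are placeholders:

  # Filtering and normalizing are on by default in generate; just drop -noFilter/-noNorm.
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000

  # Optionally apply the same filters when new links are merged back into the CrawlDB.
  bin/nutch updatedb crawl/crawldb crawl/segments/20120611010203 -filter -normalize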