"bad" URLs are already and still in. You'll need to update your db with the 'updatedb' command

On Monday, June 11, 2012, Bai Shen wrote:

> > > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> > > -noFilter and -noNorm options with generate. I'm also not using the
> > > -filter and -normalize options for updatedb.
> >
> > You must either filter out all YT records from the CrawlDB or filter
> > during generating.
>
> I just tried this and it didn't work.
>
> In my nutch-site.xml I have urlfilter-regex in the plugin.includes.
> In my regex-urlfilter.txt I have -^http://www\.youtube\.com/ right above
> the +. at the bottom.
>
> Yet when I run a crawldb dump, the youtube urls still show up. What am I
> missing?
>
> Thanks.