Re: URL filtering and normalization

2012-06-11 Thread remi tassing
"bad" URLs are already and still in. You'll need to update your db with the 'updatedb' command

On Monday, June 11, 2012, Bai Shen wrote:
> > > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> > > -noFilter and -noNorm options with generate. I'm also not using th

Re: URL filtering and normalization

2012-06-11 Thread Bai Shen
> > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> > -noFilter and -noNorm options with generate. I'm also not using the
> > -filter and -normalize options for updatedb.
>
> You must either filter out all YT records from the CrawlDB or filter
> during generating.
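
The two options named in the quoted reply can be pictured with a toy model. This is only a sketch in Python, not Nutch code: the CrawlDB is modelled as a plain set of URLs, and `url_filter`, `generate`, `purge`, and the example.org URL are hypothetical names invented for illustration.

```python
import re

# Exclusion pattern taken from the rule quoted in the thread.
YOUTUBE = re.compile(r"^http://www\.youtube\.com/")

def url_filter(url):
    """Toy stand-in for the URL filter chain: True means the URL is kept."""
    return YOUTUBE.search(url) is None

def generate(crawldb, use_filter):
    """Pick URLs for the next fetch list from a toy CrawlDB (a set of URLs)."""
    return sorted(u for u in crawldb if not use_filter or url_filter(u))

def purge(crawldb):
    """Return a copy of the toy CrawlDB with rejected URLs removed."""
    return {u for u in crawldb if url_filter(u)}

# Hypothetical contents: one ordinary URL and one already-stored YouTube URL.
crawldb = {"http://example.org/", "http://www.youtube.com/watch?v=abc"}

print(generate(crawldb, use_filter=False))         # YouTube URL is still selected
print(generate(crawldb, use_filter=True))          # option 1: filter while generating
print(generate(purge(crawldb), use_filter=False))  # option 2: filter the db first
```

In this model a URL that is already stored keeps being selected until one of the two steps applies the filter, which matches the symptom described in the thread.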

Re: URL filtering and normalization

2012-06-11 Thread Bai Shen
I generate a segment. Just the new > > urls that were found in the latest fetch. > > > > On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma > > wrote: > > > >> > >> -----Original message- > >> > From:Bai Shen > >> > Sent: T

Re: URL filtering and normalization

2012-06-10 Thread Matthias Paul
ai Shen >> > Sent: Tue 22-May-2012 19:40 >> > To: user@nutch.apache.org >> > Subject: URL filtering and normalization >> > >> > Somehow my crawler started fetching youtube.  I'm not really sure why as >> I >> > have db.ignore.external.links

Re: URL filtering and normalization

2012-06-08 Thread Bai Shen
m:Bai Shen > > Sent: Tue 22-May-2012 19:40 > > To: user@nutch.apache.org > > Subject: URL filtering and normalization > > > > Somehow my crawler started fetching youtube. I'm not really sure why as > I > > have db.ignore.external.links set t

Re: URL filtering and normalization

2012-05-22 Thread Bai Shen
On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma wrote: > > -Original message- > > From:Bai Shen > > Sent: Tue 22-May-2012 19:40 > > To: user@nutch.apache.org > > Subject: URL filtering and normalization > > > > Somehow my crawler started fetchin

Re: URL filtering and normalization

2012-05-22 Thread Bai Shen
On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma wrote: > > -Original message- > > From:Bai Shen > > Sent: Tue 22-May-2012 19:40 > > To: user@nutch.apache.org > > Subject: URL filtering and normalization > > > > Somehow my crawler started fetchin

RE: URL filtering and normalization

2012-05-22 Thread Markus Jelsma
-Original message-
> From:Bai Shen
> Sent: Tue 22-May-2012 19:40
> To: user@nutch.apache.org
> Subject: URL filtering and normalization
>
> Somehow my crawler started fetching youtube. I'm not really sure why as I
> have db.ignore.external.links set to true

URL filtering and normalization

2012-05-22 Thread Bai Shen
Somehow my crawler started fetching youtube. I'm not really sure why as I have db.ignore.external.links set to true.

I've since added the following line to my regex-urlfilter.txt file.

-^http://www\.youtube\.com/

However, I'm still seeing youtube urls in the fetch logs. I'm using the -noFilter
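
To see which URLs the rule quoted above would actually reject, here is a minimal sketch in Python that mimics first-match-wins evaluation of a regex-urlfilter.txt style rule list. It is an approximation, not the Nutch implementation: the trailing `+.` catch-all is assumed (the stock file ships with one, but the poster's file is not shown), and the sample URLs are made up.

```python
import re

# ('-' = exclude, '+' = include). The first rule is the one from the message;
# the catch-all accept rule is an assumption about the rest of the file.
RULES = [
    ("-", re.compile(r"^http://www\.youtube\.com/")),
    ("+", re.compile(r".")),
]

def accepts(url, rules=RULES):
    """Return True if the first matching rule is '+'; False for '-' or no match."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False

# Hypothetical sample URLs, chosen to vary scheme and host.
samples = [
    "http://www.youtube.com/watch?v=abc",
    "https://www.youtube.com/watch?v=abc",
    "http://youtube.com/watch?v=abc",
    "http://m.youtube.com/watch?v=abc",
]

for url in samples:
    print(url, "->", "kept" if accepts(url) else "dropped")
```

Under these assumptions only the first sample is dropped, because the pattern is anchored to `http://www.youtube.com/`; https URLs and other YouTube hosts would pass even when filtering is enabled.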