"bad" URLs are already in the db and are still there. You'll need to update your db with the
'updatedb' command.
On Monday, June 11, 2012, Bai Shen wrote:
> > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> > -noFilter and -noNorm options with generate. I'm also not using the
> > -filter and -normalize options for updatedb.
>
> You must either filter out all YT records from the CrawlDB or filter
> during generating.
>
I generate a segment. Just the new
> > urls that were found in the latest fetch.
> >
> > On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
> > wrote:
> >
> >>
> >> -----Original message-----
> >> > From: Bai Shen
> >> > Sent: Tue 22-May-2012 19:40
> >> > To: user@nutch.apache.org
> >> > Subject: URL filtering and normalization
> >> >
> >> > Somehow my crawler started fetching youtube. I'm not really sure why as I
> >> > have db.ignore.external.links set to true.
Somehow my crawler started fetching youtube. I'm not really sure why as I
have db.ignore.external.links set to true.
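
I've since added the following line to my regex-urlfilter.txt file.

-^http://www\.youtube\.com/

However, I'm still seeing youtube urls in the fetch logs. I'm using the
-noFilter and -noNorm options with generate. I'm also not using the
-filter and -normalize options for updatedb.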
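
To illustrate why the new rule has no visible effect here, below is a minimal Python sketch of how a regex-urlfilter.txt style rule list might be evaluated. It is not Nutch code. It assumes the filter checks rules in order, lets the first matching rule decide, and rejects a URL that matches no rule. The `+.` catch-all line, the function names `parse_rules` and `accepts`, and the sample URLs are hypothetical and only there for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule decides, no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules(r"""
# the rule from the message, followed by a hypothetical catch-all
-^http://www\.youtube\.com/
+.
""")

# Hypothetical URLs standing in for records already stored in the db.
stored = [
    "http://www.youtube.com/watch?v=abc",
    "http://example.org/page.html",
]

# Filters skipped (as with -noFilter): every stored URL passes through.
unfiltered = list(stored)

# Filters applied: the youtube record is dropped.
filtered = [u for u in stored if accepts(RULES, u)]
```

In this sketch the rule only removes a URL at a step where the filter is actually run, so records already stored keep coming back while filtering is skipped. Also note that the pattern as written matches only URLs beginning with http://www.youtube.com/, so an https URL or a different youtube hostname would fall through to the catch-all.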