Maybe. I'll give it a try once I manage to remove the bad urls from my crawldb.
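A rough sketch of how that cleanup might look, assuming the -^http://www\.youtube\.com/ rule is already active in regex-urlfilter.txt and the CrawlDB lives under crawl/crawldb (paths are placeholders), is to write a filtered copy with the CrawlDbMerger and then swap it in:

  # Write a filtered copy of the CrawlDB; -filter runs every record through
  # the configured URL filter plugins, so the youtube entries get dropped.
  bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

  # After sanity-checking the output, replace the old CrawlDB with it.
  mv crawl/crawldb crawl/crawldb-old
  mv crawl/crawldb-filtered crawl/crawldb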
On Mon, Jun 11, 2012 at 2:55 AM, Matthias Paul <magethle.nu...@gmail.com> wrote:

> Wouldn't it be enough to filter and normalize urls once during the parsing?
> Then in generate, update and invert links it shouldn't be necessary any more.
>
> On Fri, Jun 8, 2012 at 3:53 PM, Bai Shen <baishen.li...@gmail.com> wrote:
> > I'm attempting to filter during the generating. I removed the noFilter and
> > noNorm flags from my generate job. I have around 10M records in my crawldb.
> >
> > The generate job has been running for several days now. Is there a way to
> > check its progress and/or make sure it's not hung?
> >
> > Also, is there a faster way to do this? It seems like I shouldn't need to
> > filter the entire crawldb every time I generate a segment. Just the new
> > urls that were found in the latest fetch.
> >
> > On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >
> >> -----Original message-----
> >> > From: Bai Shen <baishen.li...@gmail.com>
> >> > Sent: Tue 22-May-2012 19:40
> >> > To: user@nutch.apache.org
> >> > Subject: URL filtering and normalization
> >> >
> >> > Somehow my crawler started fetching youtube. I'm not really sure why, as I
> >> > have db.ignore.external.links set to true.
> >>
> >> Weird!
> >>
> >> > I've since added the following line to my regex-urlfilter.txt file.
> >> >
> >> > -^http://www\.youtube\.com/
> >>
> >> For domain filtering you should use the domain-urlfilter or
> >> domain-blacklistfilter. It's faster and easier to maintain.
> >>
> >> > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> >> > -noFilter and -noNorm options with generate. I'm also not using the
> >> > -filter and -normalize options for updatedb.
> >>
> >> You must either filter out all YT records from the CrawlDB or filter
> >> during generating.
> >>
> >> > According to Markus in this thread, the normalization and filtering should
> >> > still occur even when using the above options and using 1.4:
> >> >
> >> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
> >> >
> >> > Is there a setting I'm missing? I'm not seeing anything in the logs
> >> > regarding this.
> >> >
> >> > Thanks.
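P.S. For the archives, a minimal sketch of running with filtering on during generate, as suggested above; the crawl/crawldb and crawl/segments paths, the topN value, and the segment name are placeholders:

  # Filtering and normalizing are on by default in generate; just drop -noFilter/-noNorm.
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000

  # Optionally apply the same filters when new links are merged back into the CrawlDB.
  bin/nutch updatedb crawl/crawldb crawl/segments/20120611010203 -filter -normalize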