I was referring to adding some "maintenance" functionality beyond removing 404s: I want to remove entries that have certain values in order to keep the database small, or, since you mentioned it, to have them flagged like the 404s so they are neither re-visited nor passed on to Solr.
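
Roughly what I have in mind, as a minimal map-only sketch (assuming Hadoop's
mapreduce API and borrowing the DBFilter pattern from the thread below; the
status test and the "_groomed_" metadata key are placeholders for whatever
criteria I end up using, not existing Nutch conventions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;

// Map-only pass over the crawldb: emit only the records worth keeping,
// or rewrite them so later jobs skip them.
public class GroomDbMapper extends Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  @Override
  protected void map(Text url, CrawlDatum datum, Context context)
      throws IOException, InterruptedException {

    // Placeholder criterion: drop records already marked gone;
    // substitute whatever "certain values" should be groomed away.
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
      return; // not written to the new db -> effectively deleted
    }

    // Alternative: keep the record but flag it (hypothetical key), so a
    // custom generator/indexing filter can skip it instead of deleting:
    // datum.getMetaData().put(new Text("_groomed_"), new Text("true"));

    context.write(url, datum);
  }
}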
On Wed, Jun 15, 2016 at 7:40 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Hi,
>
> it's possible to remove the 404s along with an ordinary updatedb:
>
>   bin/nutch updatedb -Ddb.update.purge.404=true ...
>
> But that's better done only from time to time for maintenance;
> otherwise 404s found via dead links are fetched again and again.
>
> Sebastian
>
> On 06/14/2016 10:23 PM, Lewis John Mcgibbney wrote:
> > Hi BlackIce,
> >
> > On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:
> >
> >> From: BlackIce <blackice...@gmail.com>
> >> To: user@nutch.apache.org
> >> Cc:
> >> Date: Mon, 13 Jun 2016 14:19:42 +0200
> >> Subject: Crawldb
> >> I would like to "groom" the crawldb.... My guess is that it should be an
> >> easy thing just to build upon the function that removes the 404 status
> >> and duplicates. But where do I find these?
> >>
> > A good example of this can be seen in DeduplicationJob.java [0], in
> > particular the public static class DBFilter, which implements
> > Mapper<Text, CrawlDatum, BytesWritable, CrawlDatum>.
> > From here you can access every CrawlDatum record and edit values as you
> > wish.
> > Hopefully this will get you started.
> > Thanks
> >
> > [0]
> > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/DeduplicationJob.java#L77-L130
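
P.S. For completeness, a rough sketch of how such a mapper could be wired up
as a standalone map-only job (plain Hadoop job setup; the paths and the
SequenceFile output are illustrative only, since the real crawldb is stored
as MapFiles and Nutch's own CrawlDb jobs additionally take care of locking
and installing the new version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.nutch.crawl.CrawlDatum;

public class GroomDbJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "groom-crawldb");
    job.setJarByClass(GroomDbJob.class);
    job.setMapperClass(GroomDbMapper.class);
    job.setNumReduceTasks(0);                 // map-only copy-through
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. crawldb/current
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1])); // groomed copy
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}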