Hi BlackIce,

On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:

> From: BlackIce <blackice...@gmail.com>
> To: user@nutch.apache.org
> Date: Mon, 13 Jun 2016 14:19:42 +0200
> Subject: Crawldb
>
> I would like to "groom" the crawldb... My guess is that it should be an
> easy thing just to build upon the function that removes the 404 status
> and duplicates. But where do I find these?

A good example of this can be seen in DeduplicationJob.java [0], in particular the public static class DBFilter, which implements Mapper<Text, CrawlDatum, BytesWritable, CrawlDatum>. From there you can access every CrawlDatum record and edit values as you wish. Hopefully this will get you started.

Thanks

[0] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/DeduplicationJob.java#L77-L130
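For what it's worth, here is a rough, self-contained sketch of the "grooming" idea. It is plain Java with no Hadoop on the classpath, so the class name, the keep() helper, and the simplified status bytes are mine; in Nutch itself you would check CrawlDatum.getStatus() against CrawlDatum.STATUS_DB_GONE inside a Mapper, the way DBFilter does, and simply not collect the records you want dropped:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a real job would implement Hadoop's Mapper and
// read/write org.apache.nutch.crawl.CrawlDatum records, as DBFilter does.
public class CrawlDbGroomSketch {

    // Simplified status codes mirroring Nutch's CrawlDatum constants;
    // STATUS_DB_GONE marks the 404 / permanently-gone state in the crawldb.
    static final byte STATUS_DB_FETCHED = 2;
    static final byte STATUS_DB_GONE = 3;

    // The per-record decision a grooming mapper would make:
    // return false to drop the record from the new crawldb, true to keep it.
    static boolean keep(String url, byte status) {
        return status != STATUS_DB_GONE;
    }

    public static void main(String[] args) {
        // Stand-in for the crawldb: url -> status.
        Map<String, Byte> crawldb = new LinkedHashMap<>();
        crawldb.put("http://example.org/ok", STATUS_DB_FETCHED);
        crawldb.put("http://example.org/missing", STATUS_DB_GONE);

        // In MapReduce this loop is the framework feeding map(); emitting a
        // record keeps it, returning without emitting drops it.
        Map<String, Byte> groomed = new LinkedHashMap<>();
        for (Map.Entry<String, Byte> e : crawldb.entrySet()) {
            if (keep(e.getKey(), e.getValue())) {
                groomed.put(e.getKey(), e.getValue());
            }
        }
        System.out.println(groomed.keySet());
    }
}
```

The same pattern extends to any other edit: instead of dropping the record, mutate the datum (score, fetch interval, metadata) before emitting it.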