Hi BlackIce,

On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:

> From: BlackIce <blackice...@gmail.com>
> To: user@nutch.apache.org
> Date: Mon, 13 Jun 2016 14:19:42 +0200
> Subject: Crawldb
>
> I would like to "groom" the crawldb... My guess is that it should be an
> easy thing just to build upon the function that removes the 404 status
> and duplicates. But where do I find these?

A good example of this can be seen in DeduplicationJob.java [0], in particular the public static class DBFilter, which implements Mapper<Text, CrawlDatum, BytesWritable, CrawlDatum>. From there you can access every CrawlDatum record and edit values as you wish. Hopefully this will get you started.

Thanks

[0] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/DeduplicationJob.java#L77-L130
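For what it's worth, here is a rough, self-contained sketch of the "grooming" idea. It is plain Java with no Hadoop on the classpath, so the class name, the keep() helper, and the simplified status bytes are mine; in Nutch itself you would check CrawlDatum.getStatus() against CrawlDatum.STATUS_DB_GONE inside a Mapper, the way DBFilter does, and simply not collect the records you want dropped:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a real job would implement Hadoop's Mapper and
// read/write org.apache.nutch.crawl.CrawlDatum records, as DBFilter does.
public class CrawlDbGroomSketch {

    // Simplified status codes mirroring Nutch's CrawlDatum constants;
    // STATUS_DB_GONE marks the 404 / permanently-gone state in the crawldb.
    static final byte STATUS_DB_FETCHED = 2;
    static final byte STATUS_DB_GONE = 3;

    // The per-record decision a grooming mapper would make:
    // return false to drop the record from the new crawldb, true to keep it.
    static boolean keep(String url, byte status) {
        return status != STATUS_DB_GONE;
    }

    public static void main(String[] args) {
        // Stand-in for the crawldb: url -> status.
        Map<String, Byte> crawldb = new LinkedHashMap<>();
        crawldb.put("http://example.org/ok", STATUS_DB_FETCHED);
        crawldb.put("http://example.org/missing", STATUS_DB_GONE);

        // In MapReduce this loop is the framework feeding map(); emitting a
        // record keeps it, returning without emitting drops it.
        Map<String, Byte> groomed = new LinkedHashMap<>();
        for (Map.Entry<String, Byte> e : crawldb.entrySet()) {
            if (keep(e.getKey(), e.getValue())) {
                groomed.put(e.getKey(), e.getValue());
            }
        }
        System.out.println(groomed.keySet());
    }
}
```

The same pattern extends to any other edit: instead of dropping the record, mutate the datum (score, fetch interval, metadata) before emitting it.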