Hi,

it's possible to remove the 404s along with an ordinary updatedb:
bin/nutch updatedb -Ddb.update.purge.404=true ...

But that is best done only from time to time for maintenance; otherwise, 404s found via dead links are fetched again and again.

Sebastian

On 06/14/2016 10:23 PM, Lewis John Mcgibbney wrote:
> Hi BlackIce,
>
> On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:
>
>> From: BlackIce <blackice...@gmail.com>
>> To: user@nutch.apache.org
>> Cc:
>> Date: Mon, 13 Jun 2016 14:19:42 +0200
>> Subject: Crawldb
>> I would like to "groom" the crawldb... My guess is that it should be an
>> easy thing to just build upon the function that removes the 404 status and
>> duplicates. But where do I find these?
>>
> A good example of this can be seen in DeduplicationJob.java [0], in
> particular the public static class DBFilter, which implements Mapper<Text,
> CrawlDatum, BytesWritable, CrawlDatum>.
> From there you can access every CrawlDatum record and edit its values as you
> wish.
> Hopefully this will get you started.
> Thanks
>
> [0]
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/DeduplicationJob.java#L77-L130
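To make the filtering idea concrete, here is a rough sketch of the logic such a mapper would apply, with the Hadoop plumbing stripped out. The class and method names are hypothetical; only the status value mirrors Nutch's CrawlDatum.STATUS_DB_GONE constant (the "gone"/404 state), so treat this as an illustration of the technique, not Nutch code.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class CrawlDbGroomer {
    // Mirrors CrawlDatum.STATUS_DB_GONE in Nutch: the page is gone (e.g. 404).
    public static final byte STATUS_DB_GONE = 0x03;

    /**
     * Hypothetical helper: given url -> status entries (standing in for the
     * Text/CrawlDatum pairs the real DBFilter mapper sees), drop every entry
     * whose status marks it as gone and keep the rest unchanged.
     */
    public static Map<String, Byte> purgeGone(Map<String, Byte> crawlDb) {
        return crawlDb.entrySet().stream()
                .filter(e -> e.getValue() != STATUS_DB_GONE)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

In the real job this decision runs per record inside the mapper's map() method: simply not emitting a CrawlDatum to the output collector removes it from the rewritten crawldb, which is essentially what db.update.purge.404 does during updatedb.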