I was referring to adding some "maintenance" functionality beyond removing 404s: I want to remove entries that have certain values in order to keep the database small, or, since you mentioned it, to have them flagged like the 404s so they are neither re-visited nor passed on to Solr.
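
Roughly what I have in mind, as a minimal map-only sketch (assuming Hadoop's
mapreduce API and borrowing the DBFilter pattern from the thread below; the
status test and the "_groomed_" metadata key are placeholders for whatever
criteria I end up using, not existing Nutch conventions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;

// Map-only pass over the crawldb: emit only the records worth keeping,
// or rewrite them so later jobs skip them.
public class GroomDbMapper extends Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  @Override
  protected void map(Text url, CrawlDatum datum, Context context)
      throws IOException, InterruptedException {

    // Placeholder criterion: drop records already marked gone;
    // substitute whatever "certain values" should be groomed away.
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
      return; // not written to the new db -> effectively deleted
    }

    // Alternative: keep the record but flag it (hypothetical key), so a
    // custom generator/indexing filter can skip it instead of deleting:
    // datum.getMetaData().put(new Text("_groomed_"), new Text("true"));

    context.write(url, datum);
  }
}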
On Wed, Jun 15, 2016 at 7:40 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Hi,
>
> it's possible to remove the 404s along with an ordinary updatedb:
>
>   bin/nutch updatedb -Ddb.update.purge.404=true ...
>
> But that's better done only from time to time for maintenance;
> otherwise 404s found via dead links are fetched again and again.
>
> Sebastian
>
> On 06/14/2016 10:23 PM, Lewis John Mcgibbney wrote:
> > Hi BlackIce,
> >
> > On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:
> >
> >> From: BlackIce <blackice...@gmail.com>
> >> To: user@nutch.apache.org
> >> Cc:
> >> Date: Mon, 13 Jun 2016 14:19:42 +0200
> >> Subject: Crawldb
> >> I would like to "groom" the crawldb.... My guess is that it should be an
> >> easy thing just to build upon the function that removes the 404 status
> >> and duplicates. But where do I find these?
> >>
> > A good example of this can be seen in DeduplicationJob.java [0], in
> > particular the public static class DBFilter, which implements
> > Mapper<Text, CrawlDatum, BytesWritable, CrawlDatum>.
> > From here you can access every CrawlDatum record and edit values as you
> > wish.
> > Hopefully this will get you started.
> > Thanks
> >
> > [0]
> > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/DeduplicationJob.java#L77-L130
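
P.S. For completeness, a rough sketch of how such a mapper could be wired up
as a standalone map-only job (plain Hadoop job setup; the paths and the
SequenceFile output are illustrative only, since the real crawldb is stored
as MapFiles and Nutch's own CrawlDb jobs additionally take care of locking
and installing the new version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.nutch.crawl.CrawlDatum;

public class GroomDbJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "groom-crawldb");
    job.setJarByClass(GroomDbJob.class);
    job.setMapperClass(GroomDbMapper.class);
    job.setNumReduceTasks(0);                 // map-only copy-through
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. crawldb/current
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1])); // groomed copy
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}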