Hi,

it's possible to remove the 404s as part of an ordinary updatedb run:

bin/nutch updatedb -Ddb.update.purge.404=true ...

But it's better to do this only from time to time as maintenance;
otherwise, the purged 404s are rediscovered via dead links and fetched
again and again.
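
For example (the paths below are placeholders; updatedb takes the crawldb
and one or more segments, or -dir with the segments directory):

bin/nutch updatedb -Ddb.update.purge.404=true crawl/crawldb crawl/segments/20160614123456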

Sebastian

On 06/14/2016 10:23 PM, Lewis John Mcgibbney wrote:
> Hi BlackIce,
> 
> On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:
> 
>> From: BlackIce <blackice...@gmail.com>
>> To: user@nutch.apache.org
>> Cc:
>> Date: Mon, 13 Jun 2016 14:19:42 +0200
>> Subject: Crawldb
>> I would like to "groom" the crawldb.... My guess is that it should be an
>> easy thing to just build upon the function that removes the 404 status and
>> duplicates. But where do I find these?
>>
>>
> A good example of this can be seen in DeduplicationJob.java [0], in
> particular the public static class DBFilter, which implements Mapper<Text,
> CrawlDatum, BytesWritable, CrawlDatum>.
> From there you can access every CrawlDatum record and edit values as you
> wish.
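> 
> For illustration, here is a minimal sketch of such a mapper (not Nutch's
> actual code; the class name is made up, and it assumes the mapred API
> used by Nutch 1.x) that drops the 404 records:
> 
>   import java.io.IOException;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.MapReduceBase;
>   import org.apache.hadoop.mapred.Mapper;
>   import org.apache.hadoop.mapred.OutputCollector;
>   import org.apache.hadoop.mapred.Reporter;
>   import org.apache.nutch.crawl.CrawlDatum;
> 
>   // Hypothetical mapper in the style of DeduplicationJob's DBFilter:
>   // it keeps every CrawlDatum except those marked gone (e.g. HTTP 404).
>   public class PurgeGoneFilter extends MapReduceBase
>       implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {
> 
>     public void map(Text url, CrawlDatum datum,
>         OutputCollector<Text, CrawlDatum> output, Reporter reporter)
>         throws IOException {
>       if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
>         return; // drop 404s; everything else passes through unchanged
>       }
>       output.collect(url, datum);
>     }
>   }
> 
> (The job around it would read the records of the existing crawldb and
> write the filtered output out as the new one.)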
> Hopefully this will get you started.
> Thanks
> 
> [0]
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/DeduplicationJob.java#L77-L130
> 
