Hi Harry,

The generator supports JEXL expressions; see [1] for the fields you can reference. Metadata is exposed as-is.

It's very simple:
# bin/nutch generate -expr "status == db_unfetched"
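For example (paths are assumptions; `crawldb` and `segments` stand in for your actual directories), you can restrict generation to unfetched URLs and then sanity-check the status distribution in the crawldb with readdb:

```shell
# Generate a segment containing only URLs still in db_unfetched state
bin/nutch generate crawldb segments -expr "status == db_unfetched"

# Print per-status counts (db_unfetched, db_fetched, db_gone, ...)
# to verify how many unfetched URLs remain
bin/nutch readdb crawldb -stats
```

Comparing the db_unfetched count before and after a fetch/updatedb cycle is a quick way to see whether the cycle is actually making progress on new URLs.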

Cheers

[1] 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524

-----Original message-----
> From:Harry Waye <ha...@arachnys.com>
> Sent: Wednesday 20th July 2016 15:40
> To: user@nutch.apache.org
> Subject: Generate segment of only unfetched urls
> 
> I'm using this to generate a segment:
> 
> bin/nutch generate -D mapred.child.java.opts=-Xmx6000m -D
> mapred.map.tasks.speculative.execution=false -D
> mapreduce.map.speculative=false -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapreduce.reduce.speculative=false -D mapred.map.output.compress=true
> -Dgenerate.max.count=20000 -D mapred.reduce.tasks=100 crawldb segments
> -noFilter -noNorm -numFetchers 19
> 
> I'm seeing that the increase in fetched urls after updatedb runs is much
> smaller than the number of successfully fetched documents in the segment.
> I'm wondering if some of the urls that were downloaded early in the life
> of the crawldb are being downloaded again, hence the smaller delta.
> 
> I'm going to try to debug but just thought I'd ask a few questions first:
> 
>  * what's the easiest way to verify that the urls in the segment are urls
> that have never been fetched?
>  * if that's not the case, does someone know what would be the appropriate
> command to use to only fetch unfetched urls?
>  * I'm using generate.max.count in the hope that it will give the best
> throughput for each of our crawl cycles, i.e. maximising thread usage --
> does that sound sensible?
> 
> Cheers
> Harry
