RE: Generate segment of only unfetched urls

Markus Jelsma Thu, 21 Jul 2016 04:10:55 -0700

Hello - support for Jexl expressions in the generator was added in Nutch 1.12.
Markus

-----Original message-----
> From:Harry Waye <ha...@arachnys.com>
> Sent: Thursday 21st July 2016 12:26
> To: user@nutch.apache.org
> Subject: Re: Generate segment of only unfetched urls
> 
> I may have missed a common courtesy by not providing the Nutch version. I'm
> using 1.11. It looks like generate doesn't support Jexl in this version.
> I'm going to have a look to see if it's easily back-portable or if a later
> 1.x release has support.
> 
> Cheers
> 
> On Thu, Jul 21, 2016 at 11:19 AM Harry Waye <ha...@arachnys.com> wrote:
> 
> > Fantastic, thanks Markus
> >
> > On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> >> Hi Harry,
> >>
> >> The generator has Jexl support, check [1] for fields. Metadata is as-is.
> >>
> >> It's very simple:
> >> # bin/nutch generate -expr "status == db_unfetched"
> >>
> >> Cheers
> >>
> >> [1]
> >> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
> >>
> >>
> >>
> >> -----Original message-----
> >> > From:Harry Waye <ha...@arachnys.com>
> >> > Sent: Wednesday 20th July 2016 15:40
> >> > To: user@nutch.apache.org
> >> > Subject: Generate segment of only unfetched urls
> >> >
> >> > I'm using this to generate a segment:
> >> >
> >> > bin/nutch generate -D mapred.child.java.opts=-Xmx6000m -D
> >> > mapred.map.tasks.speculative.execution=false -D
> >> > mapreduce.map.speculative=false -D
> >> > mapred.reduce.tasks.speculative.execution=false -D
> >> > mapreduce.reduce.speculative=false -D mapred.map.output.compress=true
> >> > -Dgenerate.max.count=20000 -D mapred.reduce.tasks=100 crawldb segments
> >> > -noFilter -noNorm -numFetchers 19
> >> >
> >> > I'm seeing that the change in fetched urls after updatedb runs is much
> >> > smaller than the number of successfully fetched documents for the
> >> > segment. I'm wondering if some of the urls that were downloaded at the
> >> > beginning of the life of the crawldb are being downloaded again, hence
> >> > the delta being lower.
> >> >
> >> > I'm going to try to debug but just thought I'd ask a few questions
> >> > first:
> >> >
> >> >  * what's the easiest way to verify that the urls in the segment are
> >> >    urls that have never been fetched?
> >> >  * if that's not the case, does someone know what would be the
> >> >    appropriate command to use to only fetch unfetched urls?
> >> >  * I'm using generate.max.count in the hope that it will give the best
> >> >    throughput for each of our crawl cycles, i.e. maxing out thread
> >> >    usage; does that sound sensible?
> >> >
> >> > Cheers
> >> > Harry
> >>
> >
>
