Great, thanks Markus, will give it a go.

On Thu, Jul 21, 2016 at 12:10 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello - Support for it was added in Nutch 1.12.
> Markus
>
>
> -----Original message-----
> > From: Harry Waye <ha...@arachnys.com>
> > Sent: Thursday 21st July 2016 12:26
> > To: user@nutch.apache.org
> > Subject: Re: Generate segment of only unfetched urls
> >
> > I may have missed a common courtesy by not providing the Nutch version.
> > I'm using 1.11. It looks like generate doesn't support Jexl in this
> > version. I'm going to have a look to see if it's easily back-portable
> > or if a later 1.x has support.
> >
> > Cheers
> >
> > On Thu, Jul 21, 2016 at 11:19 AM Harry Waye <ha...@arachnys.com> wrote:
> >
> > > Fantastic, thanks Markus
> > >
> > > On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma <markus.jel...@openindex.io>
> > > wrote:
> > >
> > >> Hi Harry,
> > >>
> > >> The generator has Jexl support, check [1] for fields. Metadata is as-is.
> > >>
> > >> It's very simple:
> > >> # bin/nutch generate -expr "status == db_unfetched"
> > >>
> > >> Cheers
> > >>
> > >> [1]
> > >> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
> > >>
> > >>
> > >> -----Original message-----
> > >> > From: Harry Waye <ha...@arachnys.com>
> > >> > Sent: Wednesday 20th July 2016 15:40
> > >> > To: user@nutch.apache.org
> > >> > Subject: Generate segment of only unfetched urls
> > >> >
> > >> > I'm using this to generate a segment:
> > >> >
> > >> > bin/nutch generate -D mapred.child.java.opts=-Xmx6000m \
> > >> >   -D mapred.map.tasks.speculative.execution=false \
> > >> >   -D mapreduce.map.speculative=false \
> > >> >   -D mapred.reduce.tasks.speculative.execution=false \
> > >> >   -D mapreduce.reduce.speculative=false \
> > >> >   -D mapred.map.output.compress=true \
> > >> >   -D generate.max.count=20000 -D mapred.reduce.tasks=100 \
> > >> >   crawldb segments -noFilter -noNorm -numFetchers 19
> > >> >
> > >> > I'm seeing that the increase in fetched urls after updatedb runs is
> > >> > much smaller than the number of successfully fetched documents in the
> > >> > segment. I'm wondering if some of the urls that were downloaded early
> > >> > in the life of the crawldb are being downloaded again, hence the
> > >> > lower delta.
> > >> >
> > >> > I'm going to try to debug but just thought I'd ask a few questions
> > >> > first:
> > >> >
> > >> > * what's the easiest way to verify that the urls in the segment are
> > >> > urls that have never been fetched?
> > >> > * if that's not the case, does someone know what would be the
> > >> > appropriate command to use to only fetch unfetched urls?
> > >> > * I'm using generate.max.count in the hope that it will give the best
> > >> > throughput for each of our crawl cycles, i.e. maxing out thread
> > >> > usage. Does that sound sensible?
> > >> >
> > >> > Cheers
> > >> > Harry
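A few follow-up notes on the commands in this thread. First, on Nutch 1.12 or later, the -expr filter can be combined with the options from the original generate invocation. A sketch; only the bare -expr usage appears above, so treat the combined form as untested:

  bin/nutch generate crawldb segments \
    -expr "status == db_unfetched" \
    -noFilter -noNorm -numFetchers 19

Per the CrawlDatum source linked as [1], the Jexl context should also expose fields such as fetchTime, retries, interval and score, so an expression like "status == db_unfetched && retries < 3" ought to be possible as well; verify the exact field names against the linked line for your version.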
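Second, on verifying that a segment contains only never-fetched urls: the stock readdb and readseg tools can answer this without custom debugging. A sketch, assuming the default crawldb and segments paths from the thread; the example url is a placeholder:

  # per-status counts (db_unfetched, db_fetched, ...) in the crawldb
  bin/nutch readdb crawldb -stats

  # generated/fetched/parsed counts for each segment
  bin/nutch readseg -list -dir segments

  # spot-check one url's CrawlDatum (status, fetch time, retries)
  bin/nutch readdb crawldb -url http://example.com/some/page

Running readdb -stats before and after updatedb and comparing the growth of db_fetched against the fetched count from readseg -list shows directly whether the segment re-fetched urls the crawldb already knew about.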
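Finally, on generate.max.count: it caps the number of urls taken from a single host (or domain, depending on generate.count.mode) into one segment, so it mainly protects against politeness-throttled hosts dominating a fetch cycle rather than raising raw volume. A sketch of making the mode explicit; whether host or domain is the better unit depends on the crawl:

  bin/nutch generate crawldb segments \
    -D generate.max.count=20000 \
    -D generate.count.mode=host \
    -noFilter -noNorm -numFetchers 19

With a per-host cap in place, fetcher threads are less likely to sit idle waiting on crawl delays for a few large hosts, so keeping it set, as in the original command, seems sensible.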