Great, thanks Markus, will give it a go.

On Thu, Jul 21, 2016 at 12:10 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello - Support for it was added in Nutch 1.12.
> Markus
>
>
> -----Original message-----
> > From: Harry Waye <ha...@arachnys.com>
> > Sent: Thursday 21st July 2016 12:26
> > To: user@nutch.apache.org
> > Subject: Re: Generate segment of only unfetched urls
> >
> > I may have missed a common courtesy by not providing the Nutch version.
> > I'm using 1.11. It looks like generate doesn't support Jexl in this
> > version. I'm going to have a look to see if it's easily back-portable
> > or if a later 1.x has support.
> >
> > Cheers
> >
> > On Thu, Jul 21, 2016 at 11:19 AM Harry Waye <ha...@arachnys.com> wrote:
> >
> > > Fantastic, thanks Markus
> > >
> > > On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma <markus.jel...@openindex.io>
> > > wrote:
> > >
> > >> Hi Harry,
> > >>
> > >> The generator has Jexl support, check [1] for fields. Metadata is as-is.
> > >>
> > >> It's very simple:
> > >> # bin/nutch generate -expr "status == db_unfetched"
> > >>
> > >> Cheers
> > >>
> > >> [1]
> > >> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
> > >>
> > >>
> > >> -----Original message-----
> > >> > From: Harry Waye <ha...@arachnys.com>
> > >> > Sent: Wednesday 20th July 2016 15:40
> > >> > To: user@nutch.apache.org
> > >> > Subject: Generate segment of only unfetched urls
> > >> >
> > >> > I'm using this to generate a segment:
> > >> >
> > >> > bin/nutch generate -D mapred.child.java.opts=-Xmx6000m \
> > >> >   -D mapred.map.tasks.speculative.execution=false \
> > >> >   -D mapreduce.map.speculative=false \
> > >> >   -D mapred.reduce.tasks.speculative.execution=false \
> > >> >   -D mapreduce.reduce.speculative=false \
> > >> >   -D mapred.map.output.compress=true \
> > >> >   -D generate.max.count=20000 -D mapred.reduce.tasks=100 \
> > >> >   crawldb segments -noFilter -noNorm -numFetchers 19
> > >> >
> > >> > I'm seeing that the increase in fetched urls after updatedb runs is
> > >> > much smaller than the number of successfully fetched documents in the
> > >> > segment. I'm wondering if some of the urls that were downloaded early
> > >> > in the life of the crawldb are being downloaded again, hence the
> > >> > lower delta.
> > >> >
> > >> > I'm going to try to debug but just thought I'd ask a few questions
> > >> > first:
> > >> >
> > >> > * what's the easiest way to verify that the urls in the segment are
> > >> > urls that have never been fetched?
> > >> > * if that's not the case, does someone know what would be the
> > >> > appropriate command to use to only fetch unfetched urls?
> > >> > * I'm using generate.max.count in the hope that it will give the best
> > >> > throughput for each of our crawl cycles, i.e. maxing out thread
> > >> > usage. Does that sound sensible?
> > >> >
> > >> > Cheers
> > >> > Harry
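A few follow-up notes on the commands in this thread. First, on Nutch 1.12 or later, the -expr filter can be combined with the options from the original generate invocation. A sketch; only the bare -expr usage appears above, so treat the combined form as untested:

  bin/nutch generate crawldb segments \
    -expr "status == db_unfetched" \
    -noFilter -noNorm -numFetchers 19

Per the CrawlDatum source linked as [1], the Jexl context should also expose fields such as fetchTime, retries, interval and score, so an expression like "status == db_unfetched && retries < 3" ought to be possible as well; verify the exact field names against the linked line for your version.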
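Second, on verifying that a segment contains only never-fetched urls: the stock readdb and readseg tools can answer this without custom debugging. A sketch, assuming the default crawldb and segments paths from the thread; the example url is a placeholder:

  # per-status counts (db_unfetched, db_fetched, ...) in the crawldb
  bin/nutch readdb crawldb -stats

  # generated/fetched/parsed counts for each segment
  bin/nutch readseg -list -dir segments

  # spot-check one url's CrawlDatum (status, fetch time, retries)
  bin/nutch readdb crawldb -url http://example.com/some/page

Running readdb -stats before and after updatedb and comparing the growth of db_fetched against the fetched count from readseg -list shows directly whether the segment re-fetched urls the crawldb already knew about.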
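Finally, on generate.max.count: it caps the number of urls taken from a single host (or domain, depending on generate.count.mode) into one segment, so it mainly protects against politeness-throttled hosts dominating a fetch cycle rather than raising raw volume. A sketch of making the mode explicit; whether host or domain is the better unit depends on the crawl:

  bin/nutch generate crawldb segments \
    -D generate.max.count=20000 \
    -D generate.count.mode=host \
    -noFilter -noNorm -numFetchers 19

With a per-host cap in place, fetcher threads are less likely to sit idle waiting on crawl delays for a few large hosts, so keeping it set, as in the original command, seems sensible.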