Thanks for that information. I've moved on to using regex-urlfilter instead of trying to filter by depth. It's probably better for what I'm trying to do anyway.
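For anyone finding this thread later: a minimal regex-urlfilter.txt sketch of the approach mentioned above, i.e. keeping the crawl on the seed hosts instead of limiting by depth. The host patterns here are just the seeds from this thread; adjust them to your own seed list. In this file the first matching pattern wins, `+` accepts and `-` rejects.

```
# Accept only URLs on the seed hosts (first matching rule wins)
+^https?://(www\.)?cnn\.com/
+^https?://(www\.)?thedailybeast\.com/
+^https?://(www\.)?nytimes\.com/
# Reject everything else
-.
```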
From: Sebastian Nagel <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Sent: Monday, September 25, 2017 9:36 AM
Subject: Re: depth scoring filter

Hi Michael,

I've just tried it with 1.12 and the recent master of 1.x - it works as expected, except for meta refresh redirects and when the fetcher isn't parsing. Actually, this has been an open issue for a few months. I'll try to address it in the next days - https://issues.apache.org/jira/browse/NUTCH-2261

A little background on what happens for meta refresh redirects:
- the _depth_ is copied from the link source to the link target in the segment
- when the CrawlDb is updated with links and fetch status from the segment
- _depth_=1000 is the fall-back if no _depth_ is found in the segment's CrawlDatum

But there may be some other reason. Starting from http://www.cnn.com/ with 3 cycles I got only one page with the weird _depth_=1000. Maybe try it slowly, cycle by cycle, and check whether one item in the CrawlDb goes wrong...

Best,
Sebastian

On 09/22/2017 04:57 AM, Michael Coffey wrote:
> I am still having trouble with the depth scoring filter, and now I have a
> simpler test case. It does work, somewhat, when I give it a list of 50 seed
> URLs, but when I give it a very short list, it fails.
> I have tried depth.max values in the range of 1-6. None of them work for the
> short-list cases.
>
> If my seed list contains just http://www.cnn.com/
> it can do one generate/fetch/update cycle, but then fails saying "0 records
> selected for fetching" on the next cycle.
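To make Sebastian's explanation concrete, here is a toy sketch (not the actual Nutch code; class and method names are made up for illustration) of the fall-back behavior he describes: a link target inherits _depth_ from the metadata of its link source, and if no _depth_ is found in the segment's CrawlDatum - the NUTCH-2261 meta-refresh case - the fall-back value 1000 is used instead.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the _depth_ fall-back described above (hypothetical names,
// not Nutch internals). The real plugin also tracks hops against
// scoring.depth.max; this only shows where _depth_=1000 comes from.
public class DepthFallback {
    static final int FALLBACK_DEPTH = 1000; // used when _depth_ is missing

    // Depth copied from the link source's metadata to the link target.
    static int inheritedDepth(Map<String, String> sourceMeta) {
        String d = sourceMeta.get("_depth_");
        if (d == null) {
            return FALLBACK_DEPTH; // metadata lost -> the "weird" 1000
        }
        return Integer.parseInt(d);
    }

    public static void main(String[] args) {
        Map<String, String> seed = new HashMap<>();
        seed.put("_depth_", "1");
        System.out.println(inheritedDepth(seed));            // 1
        System.out.println(inheritedDepth(new HashMap<>())); // 1000
    }
}
```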
> The same is true if I give it this short list of urls:
> http://www.thedailybeast.com/
> http://www.thedailybeast.com
> https://thedailybeast.com/
> https://thedailybeast.com
>
> The same is true for this short list of urls:
> https://nytimes.com/
> http://www.nytimes.com/
> https://www.nytimes.com/
>
> In each case, the first cycle updates a reasonable-looking list of urls into
> the crawldb, so it seems strange that the depth filter stops it from
> selecting anything in subsequent rounds.
> The cnn seed works fine when I use opic and not scoring-depth.
>
> Here is a partial listing of the readdb dump from the failing cnn trial:
>
> http://www.cnn.com/     Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri Sep 22 15:47:46 PDT 2017
> Modified time: Thu Sep 21 15:47:46 PDT 2017
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: d9a6e1aaedca7795ea469dce4929704a
> Metadata:
>         _depth_=1
>         _pst_=success(1), lastModified=0
>         _rs_=77
>         Content-Type=text/html
>         _maxdepth_=3
>         nutch.protocol.code=200
>
> http://www.google.com/  Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
>         _depth_=1000
>         _maxdepth_=3
>
> http://www.googletagservices.com/       Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:12 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
>         _depth_=1000
>         _maxdepth_=3
>
> http://www.i.cdn.cnn.com/       Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
>         _depth_=1000
>         _maxdepth_=3
>
> http://www.ugdturner.com/       Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:11 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
>         _depth_=1000
>         _maxdepth_=3
>
> http://z.cdn.turner.com/        Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:12 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
>         _depth_=1000
>         _maxdepth_=3
>
> https://plus.google.com/+cnn/posts      Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
>         _depth_=1000
>         _maxdepth_=3
>
>
> From: Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>
> To: user <user@nutch.apache.org>
> Sent: Tuesday, September 19, 2017 11:43 PM
> Subject: Re: depth scoring filter
>
> Hi,
>
> On 20 September 2017 at 06:36, Michael Coffey <mcof...@yahoo.com.invalid>
> wrote:
>
>> I am trying to develop a news crawler and I want to prohibit it from
>> wandering too far away from the seed list that I provide.
>> It seems like I should use the DepthScoringFilter, but I am having trouble
>> getting it to work. After a few crawl cycles, all the _depth_ metadata say
>> either 1 or 1000. Scores, meanwhile, vary from 0 to 1 and mostly don't look
>> like depths.
>> I have added a scoring.depth.max property to nutch-site.xml:
>>
>> <property>
>>   <name>scoring.depth.max</name>
>>   <value>3</value>
>> </property>
>>
> I use the same plugin to only index the seed plus one level below. The value
> for this is 2, so your setup crawls the seed plus two levels below.
>
> I never looked at the values for the _depth_ metadata and frankly, because
> it does what it's supposed to do, I personally don't care what it stores in
> its metadata here.
>
>> What do I need to do to limit the crawl frontier so it won't go more than N
>> hops from the seed list, if that is possible?
>>
> As said above, it should be enough to set the value to N+1.
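Summing up Jigal's N+1 rule as a nutch-site.xml fragment: seeds count as depth 1, so to allow N hops below the seeds you set the value to N+1. The example below uses N=3 (so value 4) purely for illustration:

```xml
<!-- Seeds are depth 1, so N hops below the seeds means value = N+1.
     Here N=3, hence 4; Jigal's setup (seeds + one level) uses 2. -->
<property>
  <name>scoring.depth.max</name>
  <value>4</value>
</property>
```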