[ https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1564:
-----------------------------------

    Description:
In a continuous crawl with adaptive fetch scheduling, documents not modified for a longer time may be fetched in every cycle.

A continuous crawl is run daily with 3 cycles and the following scheduling intervals (freshness matters):
{code}
db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval = 86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max = 604800 (7 days)
{code}

On Apr 18 a URL is generated and fetched (from segment dump):
{code}
Crawl Generate::
Status: 2 (db_fetched)
Fetch time: Mon Apr 15 19:43:22 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

Crawl Fetch::
Status: 33 (fetch_success)
Fetch time: Thu Apr 18 01:23:51 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

Running CrawlDb update results in a next fetch time in the past (which forces an immediate refetch in the next cycle):
{code}
Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
{code}
if (SYNC_DELTA) {
  // try to synchronize with the time of change
  long delta = (fetchTime - modifiedTime) / 1000L;
  if (delta > interval) interval = delta;
  refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
}
if (interval < MIN_INTERVAL) {
  interval = MIN_INTERVAL;
} else if (interval > MAX_INTERVAL) {
  interval = MAX_INTERVAL;
}
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}

{{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the past ({{delta}} * 0.3). After adding {{interval}} (adjusted to {{MAX_INTERVAL}} = 7 days) to {{refTime}}, the next fetch "should" take place 2 days in the past (Apr 16).

According to the [javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html] (if I understand it correctly), sync_delta has two aims when we know that a document hasn't been modified for a long time:
* increase the fetch interval immediately (not step by step)
* because we expected the document to change within the adaptive interval (but it hasn't), shift the "reference time", i.e. expect a change soon.

These two aims are somewhat contradictory. In any case, the next fetch time should always be within the range of (currentFetchTime + MIN_INTERVAL) to (currentFetchTime + MAX_INTERVAL) and never in the past.

This problem has been noted by [~pascaldimassimo] in [1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and [2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].

  was:
In a continuous crawl with adaptive fetch scheduling documents not modified for a longer time are may be fetched in every cycle.

> AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1564
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1564
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.6, 2.1
>            Reporter: Sebastian Nagel
>            Priority: Critical
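
As a rough check of the arithmetic in the description, the following sketch replays the quoted sync_delta steps with the timestamps from the segment dump. The class name SyncDeltaSketch, the fixed UTC offsets used for CET/CEST, the long/double types, and the clampedNextFetchTime helper are assumptions made for illustration only. The clamp shows just one possible way to keep the result inside the range proposed in the description; it is not necessarily how Nutch does or should resolve this.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical stand-alone sketch; not part of Nutch.
class SyncDeltaSketch {
  // Values taken from the configuration listed in the description (seconds).
  static final double SYNC_DELTA_RATE = 0.3;
  static final long MIN_INTERVAL = 86400L;
  static final long MAX_INTERVAL = 604800L;

  // Timestamps from the segment dump, as epoch milliseconds.
  // Assumption: CEST = UTC+2 and CET = UTC+1.
  static final long FETCH_TIME = OffsetDateTime
      .of(2013, 4, 18, 1, 23, 51, 0, ZoneOffset.ofHours(2)).toInstant().toEpochMilli();
  static final long MODIFIED_TIME = OffsetDateTime
      .of(2013, 3, 19, 1, 7, 42, 0, ZoneOffset.ofHours(1)).toInstant().toEpochMilli();

  // Replays the steps quoted in the description (interval in seconds, times in ms).
  static long nextFetchTime(long fetchTime, long modifiedTime, long interval) {
    long delta = (fetchTime - modifiedTime) / 1000L;
    if (delta > interval) interval = delta;
    long refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
    if (interval < MIN_INTERVAL) {
      interval = MIN_INTERVAL;
    } else if (interval > MAX_INTERVAL) {
      interval = MAX_INTERVAL;
    }
    return refTime + Math.round(interval * 1000.0);
  }

  // Assumed illustration: force the result into
  // [fetchTime + MIN_INTERVAL, fetchTime + MAX_INTERVAL].
  static long clampedNextFetchTime(long fetchTime, long modifiedTime, long interval) {
    long next = nextFetchTime(fetchTime, modifiedTime, interval);
    long earliest = fetchTime + MIN_INTERVAL * 1000L;
    long latest = fetchTime + MAX_INTERVAL * 1000L;
    return Math.max(earliest, Math.min(latest, next));
  }

  public static void main(String[] args) {
    long next = nextFetchTime(FETCH_TIME, MODIFIED_TIME, MAX_INTERVAL);
    long clamped = clampedNextFetchTime(FETCH_TIME, MODIFIED_TIME, MAX_INTERVAL);
    // The unclamped value lies before the current fetch time.
    System.out.println("current fetch: " + Instant.ofEpochMilli(FETCH_TIME));
    System.out.println("next (quoted): " + Instant.ofEpochMilli(next));
    System.out.println("next (clamp):  " + Instant.ofEpochMilli(clamped));
  }
}
```

With the dump's values, the quoted calculation yields a next fetch time roughly two days before the current fetch, consistent with the Apr 16 entry shown in the CrawlDb record, while the clamped variant falls back to currentFetchTime + MIN_INTERVAL.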