[jira] [Updated] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified

Sebastian Nagel (JIRA) Tue, 29 Oct 2013 12:52:03 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Nagel updated NUTCH-1564:
-----------------------------------

    Description: 
In a continuous crawl with adaptive fetch scheduling, documents that have not 
been modified for a long time may be fetched in every cycle.

A continuous crawl is run daily with 3 cycles and the following scheduling 
intervals (freshness matters):
{code}
db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta   = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default               = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval =  86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max                   = 604800 (7 days)
{code}

On Apr 18 a URL is generated and fetched (from segment dump):
{code}
Crawl Generate::
Status: 2 (db_fetched)
Fetch time: Mon Apr 15 19:43:22 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

Crawl Fetch::
Status: 33 (fetch_success)
Fetch time: Thu Apr 18 01:23:51 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

Running CrawlDb update results in a next fetch time in the past (which forces 
an immediate refetch in the next cycle):
{code}
Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
{code}
  if (SYNC_DELTA) {
    // try to synchronize with the time of change
    long delta = (fetchTime - modifiedTime) / 1000L;
    if (delta > interval) interval = delta;
    refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
  }
  if (interval < MIN_INTERVAL) {
    interval = MIN_INTERVAL;
  } else if (interval > MAX_INTERVAL) {
    interval = MAX_INTERVAL;
  }
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}
{{delta}} is about 30 days (Apr 18 - Mar 19). {{refTime}} is then about 9 days 
in the past ({{delta}} * 0.3). After adding {{interval}} (clamped to 
{{MAX_INTERVAL}} = 7 days) to {{refTime}}, the next fetch "should" take place 
about 2 days in the past (Apr 16).
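
The arithmetic above can be checked with a small standalone sketch. It is not the Nutch class itself: the class name {{SyncDeltaExample}} and the method {{nextFetchTime}} are made up for illustration, the part of the original method elided by "..." is left out, CET/CEST are assumed to be UTC+1/UTC+2, and the rate is taken as the double value 0.3.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class SyncDeltaExample {
  // Values taken from the configuration listed in this report.
  static final double SYNC_DELTA_RATE = 0.3;
  static final long MIN_INTERVAL = 86400L;   // seconds (1 day)
  static final long MAX_INTERVAL = 604800L;  // seconds (7 days)

  /**
   * Illustrative copy of the quoted sync_delta logic (hypothetical helper).
   * fetchTime and modifiedTime are epoch milliseconds, interval is seconds.
   */
  static long nextFetchTime(long fetchTime, long modifiedTime, long interval) {
    // try to synchronize with the time of change
    long delta = (fetchTime - modifiedTime) / 1000L;
    if (delta > interval) interval = delta;
    long refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
    // clamp the interval, but note that refTime is not clamped
    if (interval < MIN_INTERVAL) {
      interval = MIN_INTERVAL;
    } else if (interval > MAX_INTERVAL) {
      interval = MAX_INTERVAL;
    }
    return refTime + Math.round(interval * 1000.0);
  }

  public static void main(String[] args) {
    // Timestamps from the segment dump; CEST = UTC+2, CET = UTC+1 (assumed).
    long fetch = OffsetDateTime.of(2013, 4, 18, 1, 23, 51, 0, ZoneOffset.ofHours(2))
        .toInstant().toEpochMilli();
    long modified = OffsetDateTime.of(2013, 3, 19, 1, 7, 42, 0, ZoneOffset.ofHours(1))
        .toInstant().toEpochMilli();
    long next = nextFetchTime(fetch, modified, 604800L);
    System.out.println("next fetch: " + Instant.ofEpochMilli(next).atOffset(ZoneOffset.ofHours(2)));
    System.out.println("offset from current fetch (s): " + (next - fetch) / 1000L);
  }
}
```

With the timestamps from the dump the result falls within the second Apr 16 01:37:00 (UTC+2), the fetch time reported after the CrawlDb update. More generally, in this sketch the scheduled time lies before the current fetch whenever {{delta}} * 0.3 exceeds the clamped interval, which with these settings means {{delta}} > 604800 / 0.3 = 2016000 seconds (about 23.3 days).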

According to the 
[javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html]
 (if I understand it correctly), sync_delta has two aims when we know that a 
document has not been modified for a long time:
* increase the fetch interval immediately (not step by step)
* because we expect the document to be changed within the adaptive interval 
(but it hasn't), we shift the "reference time", i.e. we expect a change soon.

These two aims are somewhat contradictory. In any case, the next fetch time 
should always be within the range of (currentFetchTime + MIN_INTERVAL) and 
(currentFetchTime + MAX_INTERVAL) and never in the past.
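
One way to express this constraint is to clamp the computed fetch time itself, not only the interval. The following is a minimal sketch under that assumption, not a patch for Nutch: the class {{FetchTimeClamp}} and the method {{clampNextFetchTime}} are hypothetical names, and the timestamp in {{main}} is an arbitrary example value.

```java
public class FetchTimeClamp {
  /**
   * Hypothetical helper (not part of Nutch): forces a candidate next fetch
   * time into [fetchTime + minInterval, fetchTime + maxInterval].
   * Times are epoch milliseconds, intervals are seconds.
   */
  static long clampNextFetchTime(long fetchTime, long candidate,
                                 long minInterval, long maxInterval) {
    long earliest = fetchTime + minInterval * 1000L;
    long latest = fetchTime + maxInterval * 1000L;
    return Math.max(earliest, Math.min(candidate, latest));
  }

  public static void main(String[] args) {
    long fetchTime = 1_366_240_000_000L;        // arbitrary example timestamp
    long inThePast = fetchTime - 172_010_700L;  // about 2 days before the fetch
    long clamped = clampNextFetchTime(fetchTime, inThePast, 86400L, 604800L);
    // A candidate in the past is moved to fetchTime + MIN_INTERVAL.
    System.out.println((clamped - fetchTime) / 1000L); // prints 86400
  }
}
```

A candidate that already lies inside the allowed range is returned unchanged, so the shift of the reference time would still have an effect as long as it stays within the bounds.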

This problem has been noted by [~pascaldimassimo] in 
[1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and 
[2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].




> AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not 
> modified
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1564
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1564
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.6, 2.1
>            Reporter: Sebastian Nagel
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.1#6144)
