[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

Karl Wright (JIRA) Thu, 13 Feb 2014 04:50:28 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900288#comment-13900288
 ]


Karl Wright commented on CONNECTORS-850:
----------------------------------------

Here's the algorithm that MCF uses to calculate when to refetch a document in 
dynamic crawling.

First, it keeps track, over all time, of the first time the document was 
fetched, and the last time it was fetched, and the number of changes that took 
place in-between, to come up with an estimated value for the average time 
between changes.  When you change the document, of course, this value is 
affected, but may not be affected that strongly if the document had a long 
period of stability.  (If you want to make this history go away for a document, 
you can click the "reindex all documents" link on the output connection's view 
page.  That causes MCF to forget everything about what's been indexed before.)
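
To illustrate why a single change may not shift the estimate much, here is a minimal sketch. It is not the MCF source: the class name, method name, and the exact formula (observed time span divided by the number of changes plus one) are assumptions made purely for illustration.

```java
// Hypothetical sketch only; MCF's actual estimator may differ.
public final class ChangeIntervalSketch {
    private ChangeIntervalSketch() {}

    /**
     * Estimates the average time between changes from the first fetch time,
     * the last fetch time, and the number of changes seen in between.
     * Assumed formula: (last - first) / (changes + 1), using integer division.
     * Any time unit works as long as both timestamps use the same one.
     */
    public static long estimateAverageChangeInterval(long firstFetchTime,
                                                     long lastFetchTime,
                                                     int changeCount) {
        if (lastFetchTime < firstFetchTime) {
            throw new IllegalArgumentException("last fetch precedes first fetch");
        }
        if (changeCount < 0) {
            throw new IllegalArgumentException("change count must be non-negative");
        }
        long observedSpan = lastFetchTime - firstFetchTime;
        return observedSpan / (changeCount + 1L);
    }

    public static void main(String[] args) {
        // Times in days: one change over 100 days of history.
        System.out.println(estimateAverageChangeInterval(0L, 100L, 1));
        // One more change a day later barely moves the estimate.
        System.out.println(estimateAverageChangeInterval(0L, 101L, 2));
    }
}
```

Under that assumed formula, a document with 100 days of history and one change gets an estimate of 50 days; a second change one day later only lowers it to 33 days, because the long stable period still dominates.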

The actual time determined for the next fetch is calculated here:

{code}
    public Long calculateDocumentRescheduleTime(long currentTime, long timeAmt, String localIdentifier)
    {
      Long recrawlTime = null;
      Long recrawlInterval = job.getInterval();
      if (recrawlInterval != null)
      {
        Long maxInterval = job.getMaxInterval();
        long actualInterval = recrawlInterval.longValue() + timeAmt;
        if (maxInterval != null && actualInterval > maxInterval.longValue())
          actualInterval = maxInterval.longValue();
        recrawlTime = new Long(currentTime + actualInterval);
      }
      if (Logging.scheduling.isDebugEnabled())
        Logging.scheduling.debug("Default rescan time for document '"+localIdentifier+"' is "+((recrawlTime==null)?"NEVER":recrawlTime.toString()));
      Long lowerBound = getDocumentRescheduleLowerBoundTime(localIdentifier);
      if (lowerBound != null)
      {
        if (recrawlTime == null || recrawlTime.longValue() < lowerBound.longValue())
        {
          recrawlTime = lowerBound;
          if (Logging.scheduling.isDebugEnabled())
            Logging.scheduling.debug(" Rescan time overridden for document '"+localIdentifier+"' due to lower bound; new value is "+recrawlTime.toString());
        }
      }
      Long upperBound = getDocumentRescheduleUpperBoundTime(localIdentifier);
      if (upperBound != null)
      {
        if (recrawlTime == null || recrawlTime.longValue() > upperBound.longValue())
        {
          recrawlTime = upperBound;
          if (Logging.scheduling.isDebugEnabled())
            Logging.scheduling.debug(" Rescan time overridden for document '"+localIdentifier+"' due to upper bound; new value is "+recrawlTime.toString());
        }
      }
      return recrawlTime;
    }

{code}

As you can see, both the estimated average time between changes (timeAmt) and 
any time bounds the connector sets go into the calculation.  The job's minimum 
recrawl interval (job.getInterval()) and maximum recrawl interval 
(job.getMaxInterval()) also matter.  The key part of the calculation is as 
follows:

{code}
        Long maxInterval = job.getMaxInterval();
        long actualInterval = recrawlInterval.longValue() + timeAmt;
        if (maxInterval != null && actualInterval > maxInterval.longValue())
          actualInterval = maxInterval.longValue();
        recrawlTime = new Long(currentTime + actualInterval);
{code}

The actual interval chosen is the job's minimum recrawl interval, plus the 
average time between changes for the document, capped by the job's maximum 
recrawl interval.
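
For a worked example, here is a standalone sketch of the same logic with the job and connector values passed in as plain parameters. The class name, method name, and parameter names are made up for illustration and are not MCF APIs; a null argument means "not set", mirroring the null checks in the method above.

```java
// Standalone illustration of the reschedule calculation; not MCF code.
public final class RescheduleSketch {
    private RescheduleSketch() {}

    /**
     * Returns the next fetch time, or null if the document is never rescanned.
     * minInterval / maxInterval stand in for job.getInterval() and
     * job.getMaxInterval(); lowerBound / upperBound stand in for the
     * connector-supplied reschedule bounds (absolute times).
     */
    public static Long nextFetchTime(long currentTime, long timeAmt,
                                     Long minInterval, Long maxInterval,
                                     Long lowerBound, Long upperBound) {
        Long recrawlTime = null;
        if (minInterval != null) {
            // Minimum interval plus average time between changes...
            long actualInterval = minInterval + timeAmt;
            // ...capped by the maximum interval, when one is set.
            if (maxInterval != null && actualInterval > maxInterval) {
                actualInterval = maxInterval;
            }
            recrawlTime = currentTime + actualInterval;
        }
        // Connector bounds override the default in either direction.
        if (lowerBound != null && (recrawlTime == null || recrawlTime < lowerBound)) {
            recrawlTime = lowerBound;
        }
        if (upperBound != null && (recrawlTime == null || recrawlTime > upperBound)) {
            recrawlTime = upperBound;
        }
        return recrawlTime;
    }

    public static void main(String[] args) {
        // Times in minutes: now = 1000, average change interval = 300,
        // minimum interval = 60, maximum interval = 240, no connector bounds.
        System.out.println(nextFetchTime(1000L, 300L, 60L, 240L, null, null));
    }
}
```

With those numbers the uncapped interval would be 60 + 300 = 360 minutes, so the maximum of 240 applies and the next fetch lands at 1240. Without a maximum interval the same document would be scheduled at 1360.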

Hope that clarifies things.


> Maximum interval in dynamic crawling
> ------------------------------------
>
>                 Key: CONNECTORS-850
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-850
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.4.1
>            Reporter: Florian Schmedding
>            Assignee: Karl Wright
>            Priority: Minor
>              Labels: features
>             Fix For: ManifoldCF 1.5
>
>
> Currently, the dynamic crawling method used for a continuous job extends the 
> reseed and recrawl intervals when no changes are found in a checked document. 
> However, it should be possible to restrict this extension to a maximum value 
> in order to make sure that new documents are discovered within a certain 
> interval.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
