Re: Continuous crawling

Karl Wright Sat, 04 Jan 2014 08:26:12 -0800

Hi Florian,

What you are seeing is "dynamic crawling" behavior.  The time between
refetches of a document is based on the history of fetches of that
document.  The recrawl interval is only the initial time between fetches
of a document; if the document does not change, the interval for that
document increases according to a formula.


I would need to look at the code to be able to give you the precise
formula, but if you need a limit on the amount of time between document
fetch attempts, I suggest you create a ticket and I will look into adding
that as a feature.
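
To illustrate the general idea only, here is a small Python sketch of an
interval that grows while a document stays unchanged, with an optional
upper limit of the kind described above.  Everything in it is an
assumption made for illustration: the doubling growth factor, the reset
to the initial interval when a change is seen, and the names
next_interval, simulate, and max_interval_ms are all hypothetical.  It is
not the actual formula used by the crawler.

```python
def next_interval(current_ms, changed, initial_ms,
                  max_interval_ms=None, growth_factor=2.0):
    """Return the wait before the next fetch of one document.

    Hypothetical model, not the crawler's real formula:
    - if the document changed, fall back to the initial recrawl interval;
    - otherwise multiply the current interval by growth_factor;
    - if max_interval_ms is given, never exceed it.
    """
    if changed:
        return initial_ms
    grown = int(current_ms * growth_factor)
    if max_interval_ms is not None:
        grown = min(grown, max_interval_ms)
    return grown


def simulate(change_history, initial_ms, max_interval_ms=None):
    """Apply next_interval over a sequence of fetch results.

    change_history is a list of booleans, one per fetch, saying whether
    the document had changed.  Returns the interval chosen after each fetch.
    """
    intervals = []
    current = initial_ms
    for changed in change_history:
        current = next_interval(current, changed, initial_ms, max_interval_ms)
        intervals.append(current)
    return intervals
```

Under these assumptions, a document that never changes is fetched less
and less often, which would match a web server seeing requests only
after several configured intervals have elapsed.  Passing a value for
max_interval_ms shows how a limit would bound that growth.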

Thanks,
Karl



On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
[email protected]> wrote:

> Hello,
>
> the parameters reseed interval and recrawl interval of a continuous
> crawling job are not quite clear to me. The documentation says that the
> reseed interval is the time after which the seeds are checked again, and
> the recrawl interval is the time after which a document is checked for
> changes.
>
> However, we observed that the recrawl interval for a document increases
> after each check. On the other hand, the reseed interval seems to be set
> up correctly in the database metadata about the seed documents. Yet the
> web server does not receive requests each time the interval elapses, but
> only after several intervals have elapsed.
>
> We are using a web connector. The web server does not tell the client to
> cache the documents. Any help would be appreciated.
>
> Best regards,
> Florian
>
>
>
>
