Hi Florian,

I was unable to reproduce the behavior you described.

Could you open your job's view page and post a screenshot of it?  I want to
see what your schedule record(s) look like.

Thanks,
Karl



On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]> wrote:

> Hi Florian,
>
> I've never noted this behavior before.  I'll see if I can reproduce it
> here.
>
> Karl
>
>
>
> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> the scheduled job seems to work as expected. However, it runs twice: it
>> starts at the beginning of the scheduled time window, finishes, and
>> immediately starts again. After finishing the second run it waits for
>> the next scheduled time. Why does it run twice? The start method is
>> "Start at beginning of schedule window".
>>
>> Yes, you're right about the checking guarantee. Currently, our interval is
>> long enough for a complete crawler run.
>>
>> Best,
>> Florian
>>
>>
>> > Hi Florian,
>> >
>> > It is impossible to *guarantee* that a document will be checked, because
>> > if load on the crawler is high enough, it will fall behind.  But I will
>> > look into adding the feature you request.
>> >
>> > Karl
>> >
>> >
>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>> > [email protected]> wrote:
>> >
>> >> Hi Karl,
>> >>
>> >> yes, in our case it is necessary to make sure that new documents are
>> >> discovered and indexed within a certain interval. I have created a
>> >> feature
>> >> request on that. In the meantime we will try to use a scheduled job
>> >> instead.
>> >>
>> >> Thanks for your help,
>> >> Florian
>> >>
>> >>
>> >> > Hi Florian,
>> >> >
>> >> > What you are seeing is "dynamic crawling" behavior.  The time between
>> >> > refetches of a document is based on the history of fetches of that
>> >> > document.  The recrawl interval is the initial time between document
>> >> > fetches, but if a document does not change, the interval for the
>> >> > document increases according to a formula.
>> >> >
>> >> > I would need to look at the code to be able to give you the precise
>> >> > formula, but if you need a limit on the amount of time between
>> >> > document fetch attempts, I suggest you create a ticket and I will
>> >> > look into adding that as a feature.
>> >> >
>> >> > Thanks,
>> >> > Karl
>> >> >
>> >> >
>> >> >
>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>> >> > [email protected]> wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> the parameters "reseed interval" and "recrawl interval" of a
>> >> >> continuous crawling job are not quite clear to me. The documentation
>> >> >> says that the reseed interval is the time after which the seeds are
>> >> >> checked again, and the recrawl interval is the time after which a
>> >> >> document is checked for changes.
>> >> >>
>> >> >> However, we observed that the recrawl interval for a document
>> >> >> increases after each check. On the other hand, the reseed interval
>> >> >> seems to be set up correctly in the database metadata about the seed
>> >> >> documents. Yet the web server does not receive requests each time the
>> >> >> interval elapses but only after several intervals have elapsed.
>> >> >>
>> >> >> We are using a web connector. The web server does not tell the
>> >> >> client to cache the documents. Any help would be appreciated.
>> >> >>
>> >> >> Best regards,
>> >> >> Florian
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>
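[Editor's note: the adaptive ("dynamic crawling") behavior Karl describes above
can be illustrated with a simple multiplicative-backoff sketch. This is a
hypothetical illustration only, not ManifoldCF's actual formula (Karl notes the
precise formula would require checking the code); the growth factor, cap, and
class name here are all assumptions.]

```java
// Hypothetical sketch of adaptive recrawl scheduling: the refetch interval
// starts at the configured recrawl interval and grows each time a document
// is fetched and found unchanged, up to an assumed maximum.
// NOT the actual ManifoldCF formula.
public class RecrawlBackoff {
    private final long baseIntervalMs;   // configured recrawl interval
    private final long maxIntervalMs;    // assumed upper bound on the interval
    private final double growthFactor;   // assumed multiplier per unchanged fetch

    public RecrawlBackoff(long baseIntervalMs, long maxIntervalMs, double growthFactor) {
        this.baseIntervalMs = baseIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
        this.growthFactor = growthFactor;
    }

    /** Interval to wait after this many consecutive fetches with no change. */
    public long nextIntervalMs(int unchangedFetches) {
        double interval = baseIntervalMs * Math.pow(growthFactor, unchangedFetches);
        return Math.min((long) interval, maxIntervalMs);
    }

    public static void main(String[] args) {
        // Base interval 1 minute, cap 1 hour, doubling per unchanged fetch.
        RecrawlBackoff b = new RecrawlBackoff(60_000L, 3_600_000L, 2.0);
        for (int i = 0; i <= 8; i++) {
            System.out.println(i + " unchanged fetches -> " + b.nextIntervalMs(i) + " ms");
        }
    }
}
```

Under a scheme like this, a document that never changes is refetched less and
less often, which matches Florian's observation that the web server sees
requests only after several configured intervals have elapsed.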
