Hi Karl,

I've just observed that the job started according to its schedule and
crawled all documents correctly (I had chosen to re-ingest all documents
before the run). However, after finishing the last document (zero active
documents) it was somehow aborted and restarted immediately. Is this
expected behavior?

Best,
Florian


> Hi Florian,
>
> Based on this schedule, your crawls will be able to start whenever the
> hour turns.  So they can start every hour on the hour.  If the last
> crawl crossed an hour boundary, the next crawl will start immediately,
> I believe.
>
> Karl
>
>
>
> On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> these are the values:
>> Priority:                  5
>> Start method:              Start at beginning of schedule window
>> Schedule type:             Scan every document once
>> Minimum recrawl interval:  Not applicable
>> Expiration interval:       Not applicable
>> Reseed interval:           Not applicable
>> Scheduled time:            Any day of week, every hour (12 am through 11 pm)
>> Maximum run time:          No limit
>> Job invocation:            Complete
>>
>> Maybe it is because I changed the job from continuous crawling to this
>> schedule. I also started it a few times manually. I didn't notice
>> anything strange in the job setup or in the respective entries in the
>> database.
>>
>> Regards,
>> Florian
>>
>> > Hi Florian,
>> >
>> > I was unable to reproduce the behavior you described.
>> >
>> > Could you view your job, and post a screenshot of that page?  I want
>> > to see what your schedule record(s) look like.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>> >
>> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]>
>> wrote:
>> >
>> >> Hi Florian,
>> >>
>> >> I've never noted this behavior before.  I'll see if I can
>> >> reproduce it here.
>> >>
>> >> Karl
>> >>
>> >>
>> >>
>> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>> >> [email protected]> wrote:
>> >>
>> >>> Hi Karl,
>> >>>
>> >>> the scheduled job seems to work as expected. However, it runs
>> >>> twice: it starts at the beginning of the scheduled time, finishes,
>> >>> and immediately starts again. After finishing the second run it
>> >>> waits for the next scheduled time. Why does it run twice? The
>> >>> start method is "Start at beginning of schedule window".
>> >>>
>> >>> Yes, you're right about the checking guarantee. Currently, our
>> >>> interval is long enough for a complete crawler run.
>> >>>
>> >>> Best,
>> >>> Florian
>> >>>
>> >>>
>> >>> > Hi Florian,
>> >>> >
>> >>> > It is impossible to *guarantee* that a document will be checked,
>> >>> > because if load on the crawler is high enough, it will fall
>> >>> > behind.  But I will look into adding the feature you request.
>> >>> >
>> >>> > Karl
>> >>> >
>> >>> >
>> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>> >>> > [email protected]> wrote:
>> >>> >
>> >>> >> Hi Karl,
>> >>> >>
>> >>> >> yes, in our case it is necessary to make sure that new
>> >>> >> documents are discovered and indexed within a certain interval.
>> >>> >> I have filed a feature request for that. In the meantime we
>> >>> >> will try to use a scheduled job instead.
>> >>> >>
>> >>> >> Thanks for your help,
>> >>> >> Florian
>> >>> >>
>> >>> >>
>> >>> >> > Hi Florian,
>> >>> >> >
>> >>> >> > What you are seeing is "dynamic crawling" behavior.  The time
>> >>> >> > between refetches of a document is based on the history of
>> >>> >> > fetches of that document.  The recrawl interval is the initial
>> >>> >> > time between document fetches, but if a document does not
>> >>> >> > change, the interval for the document increases according to a
>> >>> >> > formula.
>> >>> >> >
>> >>> >> > I would need to look at the code to be able to give you the
>> >>> >> > precise formula, but if you need a limit on the amount of time
>> >>> >> > between document fetch attempts, I suggest you create a ticket
>> >>> >> > and I will look into adding that as a feature.
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> > Karl
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>> >>> >> > [email protected]> wrote:
>> >>> >> >
>> >>> >> >> Hello,
>> >>> >> >>
>> >>> >> >> the parameters reseed interval and recrawl interval of a
>> >>> >> >> continuous crawling job are not quite clear to me. The
>> >>> >> >> documentation says that the reseed interval is the time after
>> >>> >> >> which the seeds are checked again, and the recrawl interval
>> >>> >> >> is the time after which a document is checked for changes.
>> >>> >> >>
>> >>> >> >> However, we observed that the recrawl interval for a
>> >>> >> >> document increases after each check. On the other hand, the
>> >>> >> >> reseed interval seems to be set up correctly in the database
>> >>> >> >> metadata about the seed documents. Yet the web server does
>> >>> >> >> not receive requests each time the interval elapses, but
>> >>> >> >> only after several intervals have elapsed.
>> >>> >> >>
>> >>> >> >> We are using a web connector. The web server does not tell
>> >>> >> >> the client to cache the documents. Any help would be
>> >>> >> >> appreciated.
>> >>> >> >>
>> >>> >> >> Best regards,
>> >>> >> >> Florian

