Hi Florian,

Based on this schedule, your crawls can start whenever the hour turns, i.e. every hour on the hour. If the last crawl ran past an hour boundary, the next schedule window has already opened, so the next crawl will start immediately, I believe.
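[Editor's note: the window logic described above can be sketched as follows. This is a minimal illustration, not ManifoldCF's actual scheduler code; the function name and parameters are invented for the example. The assumption is simply that a window opens at the top of each scheduled hour, so a run that crosses into a scheduled hour is immediately eligible to start again.]

```python
from datetime import datetime, timedelta

def window_opened_during_run(start, finish, scheduled_hours):
    """Illustrative sketch (not ManifoldCF code): return True if a new
    schedule window opened while a crawl was running, assuming windows
    open at the top of each scheduled hour."""
    # First hour boundary after the crawl started.
    boundary = start.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    while boundary <= finish:
        if boundary.hour in scheduled_hours:
            # A window opened mid-run, so the job restarts immediately.
            return True
        boundary += timedelta(hours=1)
    return False

# With all 24 hours scheduled (as in Florian's job), a crawl running
# from 12:00 to 13:20 crosses the 13:00 boundary, so a second run
# starts as soon as the first finishes.
every_hour = set(range(24))
print(window_opened_during_run(datetime(2014, 1, 15, 12, 0),
                               datetime(2014, 1, 15, 13, 20),
                               every_hour))  # True
print(window_opened_during_run(datetime(2014, 1, 15, 12, 0),
                               datetime(2014, 1, 15, 12, 40),
                               every_hour))  # False
```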
Karl

On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <[email protected]> wrote:

> Hi Karl,
>
> these are the values:
>
> Priority: 5
> Start method: Start at beginning of schedule window
> Schedule type: Scan every document once
> Minimum recrawl interval: Not applicable
> Expiration interval: Not applicable
> Reseed interval: Not applicable
> Scheduled time: Any day of week at 12 am, 1 am, 2 am, 3 am, 4 am, 5 am,
> 6 am, 7 am, 8 am, 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm, 3 pm, 4 pm,
> 5 pm, 6 pm, 7 pm, 8 pm, 9 pm, 10 pm, 11 pm
> Maximum run time: No limit
> Job invocation: Complete
>
> Maybe it is because I've changed the job from continuous crawling to this
> schedule. I started it a few times manually, too. I didn't notice
> anything strange in the job setup or in the respective entries in the
> database.
>
> Regards,
> Florian
>
>> Hi Florian,
>>
>> I was unable to reproduce the behavior you described.
>>
>> Could you view your job and post a screenshot of that page? I want to
>> see what your schedule record(s) look like.
>>
>> Thanks,
>> Karl
>>
>> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Florian,
>>>
>>> I've never noted this behavior before. I'll see if I can reproduce it
>>> here.
>>>
>>> Karl
>>>
>>> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <[email protected]> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> the scheduled job seems to work as expected. However, it runs twice:
>>>> it starts at the beginning of the scheduled time, finishes, and
>>>> immediately starts again. After finishing the second run it waits for
>>>> the next scheduled time. Why does it run twice? The start method is
>>>> "Start at beginning of schedule window".
>>>>
>>>> Yes, you're right about the checking guarantee. Currently, our
>>>> interval is long enough for a complete crawler run.
>>>>
>>>> Best,
>>>> Florian
>>>>
>>>>> Hi Florian,
>>>>>
>>>>> It is impossible to *guarantee* that a document will be checked,
>>>>> because if load on the crawler is high enough, it will fall behind.
>>>>> But I will look into adding the feature you request.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <[email protected]> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> yes, in our case it is necessary to make sure that new documents are
>>>>>> discovered and indexed within a certain interval. I have created a
>>>>>> feature request on that. In the meantime we will try to use a
>>>>>> scheduled job instead.
>>>>>>
>>>>>> Thanks for your help,
>>>>>> Florian
>>>>>>
>>>>>>> Hi Florian,
>>>>>>>
>>>>>>> What you are seeing is "dynamic crawling" behavior. The time
>>>>>>> between refetches of a document is based on the history of fetches
>>>>>>> of that document. The recrawl interval is the initial time between
>>>>>>> document fetches, but if a document does not change, the interval
>>>>>>> for the document increases according to a formula.
>>>>>>>
>>>>>>> I would need to look at the code to be able to give you the precise
>>>>>>> formula, but if you need a limit on the amount of time between
>>>>>>> document fetch attempts, I suggest you create a ticket and I will
>>>>>>> look into adding that as a feature.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> the parameters reseed interval and recrawl interval of a
>>>>>>>> continuous crawling job are not quite clear to me. The
>>>>>>>> documentation says that the reseed interval is the time after
>>>>>>>> which the seeds are checked again, and the recrawl interval is the
>>>>>>>> time after which a document is checked for changes.
>>>>>>>>
>>>>>>>> However, we observed that the recrawl interval for a document
>>>>>>>> increases after each check. On the other hand, the reseed interval
>>>>>>>> seems to be set up correctly in the database metadata about the
>>>>>>>> seed documents. Yet the web server does not receive requests each
>>>>>>>> time the interval elapses, but only after several intervals have
>>>>>>>> elapsed.
>>>>>>>>
>>>>>>>> We are using a web connector. The web server does not tell the
>>>>>>>> client to cache the documents. Any help would be appreciated.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Florian
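[Editor's note: Karl's description of dynamic crawling in the quoted thread can be made concrete with a small sketch. The thread does not give ManifoldCF's precise formula, so the multiplicative growth factor below is purely an assumption for illustration; the function and parameter names are invented.]

```python
def next_recrawl_interval(current_minutes, document_changed,
                          base_minutes=60.0, growth_factor=2.0,
                          cap_minutes=None):
    """Hypothetical adaptive recrawl schedule (not ManifoldCF's actual
    formula): the configured recrawl interval is only the starting
    point; an unchanged document is refetched less and less often."""
    if document_changed:
        # A change resets the interval to the configured baseline.
        return base_minutes
    grown = current_minutes * growth_factor
    if cap_minutes is not None:
        # A cap would bound the time between fetch attempts -- the
        # kind of limit Florian's feature request asks for.
        grown = min(grown, cap_minutes)
    return grown

# An unchanged document drifts to ever-longer intervals: 60 -> 120 ->
# 240 -> 480 minutes after three checks without a change...
interval = 60.0
for _ in range(3):
    interval = next_recrawl_interval(interval, document_changed=False)
print(interval)  # 480.0

# ...which would explain why the web server sees a request only after
# several configured intervals have elapsed.
```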
