Hi Florian, I've never noticed this behavior before. I'll see if I can reproduce it here.
Karl

On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding
<[email protected]> wrote:

> Hi Karl,
>
> the scheduled job seems to work as expected. However, it runs twice:
> it starts at the beginning of the scheduled time, finishes, and
> immediately starts again. After finishing the second run, it waits for
> the next scheduled time. Why does it run twice? The start method is
> "Start at beginning of schedule window".
>
> Yes, you're right about the checking guarantee. Currently, our interval
> is long enough for a complete crawler run.
>
> Best,
> Florian
>
>
> > Hi Florian,
> >
> > It is impossible to *guarantee* that a document will be checked,
> > because if load on the crawler is high enough, it will fall behind.
> > But I will look into adding the feature you request.
> >
> > Karl
> >
> >
> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding
> > <[email protected]> wrote:
> >
> >> Hi Karl,
> >>
> >> yes, in our case it is necessary to make sure that new documents are
> >> discovered and indexed within a certain interval. I have created a
> >> feature request for that. In the meantime we will try to use a
> >> scheduled job instead.
> >>
> >> Thanks for your help,
> >> Florian
> >>
> >>
> >> > Hi Florian,
> >> >
> >> > What you are seeing is "dynamic crawling" behavior. The time
> >> > between refetches of a document is based on the history of fetches
> >> > of that document. The recrawl interval is the initial time between
> >> > document fetches, but if a document does not change, the interval
> >> > for that document increases according to a formula.
> >> >
> >> > I would need to look at the code to give you the precise formula,
> >> > but if you need a limit on the amount of time between document
> >> > fetch attempts, I suggest you create a ticket and I will look into
> >> > adding that as a feature.
> >> >
> >> > Thanks,
> >> > Karl
> >> >
> >> >
> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding
> >> > <[email protected]> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> the parameters "reseed interval" and "recrawl interval" of a
> >> >> continuous crawling job are not quite clear to me. The
> >> >> documentation says that the reseed interval is the time after
> >> >> which the seeds are checked again, and the recrawl interval is
> >> >> the time after which a document is checked for changes.
> >> >>
> >> >> However, we observed that the recrawl interval for a document
> >> >> increases after each check. On the other hand, the reseed
> >> >> interval seems to be set up correctly in the database metadata
> >> >> for the seed documents. Yet the web server does not receive a
> >> >> request each time the interval elapses, but only after several
> >> >> intervals have elapsed.
> >> >>
> >> >> We are using a web connector. The web server does not tell the
> >> >> client to cache the documents. Any help would be appreciated.
> >> >>
> >> >> Best regards,
> >> >> Florian
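[Editor's note: Karl describes "dynamic crawling", where the interval between refetches widens each time a document is found unchanged. He does not give the exact formula, and the sketch below is NOT ManifoldCF's implementation; it is only an illustrative model, assuming a simple exponential backoff with a reset on change, to show the kind of behavior Florian observed.]

```python
def next_recrawl_interval(current_interval: float,
                          changed: bool,
                          base_interval: float = 60.0,
                          growth_factor: float = 2.0,
                          max_interval: float = 86400.0) -> float:
    """Hypothetical model of a dynamic-crawl backoff (not ManifoldCF's code).

    current_interval: seconds waited before the fetch that just completed.
    changed:          whether the fetched document had changed.
    """
    if changed:
        # Document changed: fall back to the configured base recrawl interval.
        return base_interval
    # Document unchanged: widen the interval, but never beyond the cap.
    return min(current_interval * growth_factor, max_interval)


# Under this model, a document that never changes is fetched at
# 60s, 120s, 240s, 480s, ... which matches the observation that the
# server sees requests less and less often over time.
interval = 60.0
for _ in range(3):
    interval = next_recrawl_interval(interval, changed=False)
```

A cap like `max_interval` is exactly the kind of limit Florian's feature request asks for: without it, the interval between fetch attempts can grow without bound.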
