Hi Karl,

I've just observed that the job started according to its schedule and crawled all documents correctly (I chose to re-ingest all documents before the run). However, after finishing the last document (zero active documents) it was somehow aborted and restarted immediately. Is this expected behavior?
Best,
Florian

> Hi Florian,
>
> Based on this schedule, your crawls will be able to start whenever the
> hour turns. So they can start every hour on the hour. If the last crawl
> crossed an hour boundary, the next crawl will start immediately, I
> believe.
>
> Karl
>
> On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> these are the values:
>>
>> Priority: 5
>> Start method: Start at beginning of schedule window
>> Schedule type: Scan every document once
>> Minimum recrawl interval: Not applicable
>> Expiration interval: Not applicable
>> Reseed interval: Not applicable
>> Scheduled time: Any day of week at 12 am, 1 am, 2 am, 3 am, 4 am,
>> 5 am, 6 am, 7 am, 8 am, 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm, 3 pm,
>> 4 pm, 5 pm, 6 pm, 7 pm, 8 pm, 9 pm, 10 pm, 11 pm
>> Maximum run time: No limit
>> Job invocation: Complete
>>
>> Maybe it is because I've changed the job from continuous crawling to
>> this schedule. I also started it manually a few times. I couldn't see
>> anything strange in the job setup or in the corresponding entries in
>> the database.
>>
>> Regards,
>> Florian
>>
>>> Hi Florian,
>>>
>>> I was unable to reproduce the behavior you described.
>>>
>>> Could you view your job, and post a screenshot of that page? I want
>>> to see what your schedule record(s) look like.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]>
>>> wrote:
>>>
>>>> Hi Florian,
>>>>
>>>> I've never noted this behavior before. I'll see if I can reproduce
>>>> it here.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> the scheduled job seems to work as expected. However, it runs
>>>>> twice: it starts at the beginning of the scheduled time, finishes,
>>>>> and immediately starts again. After finishing the second run it
>>>>> waits for the next scheduled time. Why does it run twice? The start
>>>>> method is "Start at beginning of schedule window".
>>>>>
>>>>> Yes, you're right about the checking guarantee. Currently, our
>>>>> interval is long enough for a complete crawler run.
>>>>>
>>>>> Best,
>>>>> Florian
>>>>>
>>>>>> Hi Florian,
>>>>>>
>>>>>> It is impossible to *guarantee* that a document will be checked,
>>>>>> because if load on the crawler is high enough, it will fall
>>>>>> behind. But I will look into adding the feature you request.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> yes, in our case it is necessary to make sure that new documents
>>>>>>> are discovered and indexed within a certain interval. I have
>>>>>>> created a feature request for that. In the meantime we will try
>>>>>>> to use a scheduled job instead.
>>>>>>>
>>>>>>> Thanks for your help,
>>>>>>> Florian
>>>>>>>
>>>>>>>> Hi Florian,
>>>>>>>>
>>>>>>>> What you are seeing is "dynamic crawling" behavior. The time
>>>>>>>> between refetches of a document is based on the history of
>>>>>>>> fetches of that document. The recrawl interval is the initial
>>>>>>>> time between document fetches, but if a document does not
>>>>>>>> change, the interval for the document increases according to a
>>>>>>>> formula.
>>>>>>>>
>>>>>>>> I would need to look at the code to be able to give you the
>>>>>>>> precise formula, but if you need a limit on the amount of time
>>>>>>>> between document fetch attempts, I suggest you create a ticket
>>>>>>>> and I will look into adding that as a feature.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> the parameters reseed interval and recrawl interval of a
>>>>>>>>> continuous crawling job are not quite clear to me. The
>>>>>>>>> documentation says that the reseed interval is the time after
>>>>>>>>> which the seeds are checked again, and the recrawl interval is
>>>>>>>>> the time after which a document is checked for changes.
>>>>>>>>>
>>>>>>>>> However, we observed that the recrawl interval for a document
>>>>>>>>> increases after each check. On the other hand, the reseed
>>>>>>>>> interval seems to be set up correctly in the database metadata
>>>>>>>>> for the seed documents. Yet the web server does not receive
>>>>>>>>> requests each time the interval elapses, but only after several
>>>>>>>>> intervals have elapsed.
>>>>>>>>>
>>>>>>>>> We are using a web connector. The web server does not tell the
>>>>>>>>> client to cache the documents. Any help would be appreciated.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Florian
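[Editor's note] Karl's description of "dynamic crawling" (an initial recrawl interval that grows while a document stays unchanged) can be sketched as follows. This is a minimal illustration only: the doubling factor, the cap, and the function name are assumptions made for the sketch, not ManifoldCF's actual formula, which Karl notes would require checking the code.

```python
# Illustrative sketch of adaptive ("dynamic") recrawl scheduling.
# NOTE: the growth factor and cap below are assumptions for
# illustration only; they are NOT ManifoldCF's actual formula.

def next_recrawl_interval(current_interval_min, document_changed,
                          initial_interval_min=60,
                          growth_factor=2,
                          max_interval_min=24 * 60):
    """Return the next recrawl interval in minutes.

    If the document changed, fall back to the initial interval;
    otherwise grow the interval, up to a cap.
    """
    if document_changed:
        return initial_interval_min
    return min(current_interval_min * growth_factor, max_interval_min)

# An unchanged document is refetched less and less often:
interval = 60
history = []
for _ in range(6):
    history.append(interval)
    interval = next_recrawl_interval(interval, document_changed=False)
print(history)  # [60, 120, 240, 480, 960, 1440] -- doubles, then hits the cap
```

This kind of backoff explains the observation in the original question: the web server sees requests less often than the configured recrawl interval once documents stop changing.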

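[Editor's note] Karl's explanation of the back-to-back runs (a run that crosses an hour boundary leaves the new hour's window unserved, so the job becomes eligible to start again immediately) can be sketched as a small decision function. All names here are hypothetical illustrations, not ManifoldCF's scheduler API.

```python
from datetime import datetime

# Hypothetical sketch of "start at beginning of schedule window" logic
# for a job scheduled every hour on the hour.  Names and rules are
# illustrative assumptions, not ManifoldCF's actual scheduler code.

def hour_window(t: datetime) -> datetime:
    """Return the start of the hourly schedule window containing t."""
    return t.replace(minute=0, second=0, microsecond=0)

def should_start(now: datetime, last_run_started: datetime) -> bool:
    """Start again if the current hour's window has not yet had a run.

    If the previous run crossed an hour boundary, no run started inside
    the window that opened mid-run, so the job is eligible to start
    again as soon as the previous run ends.
    """
    return hour_window(now) > hour_window(last_run_started)

# A run started at 12:00 that finishes at 12:40 does not restart:
print(should_start(datetime(2014, 1, 15, 12, 40),
                   datetime(2014, 1, 15, 12, 0)))   # False

# A run started at 12:00 that crosses into 13:xx restarts immediately:
print(should_start(datetime(2014, 1, 15, 13, 10),
                   datetime(2014, 1, 15, 12, 0)))   # True
```

Under this reading, the double run Florian saw is expected: the first run crossed into the next hourly window, so a second run fired immediately and then the job waited for the next scheduled time.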