Hi Karl,

as commented on https://issues.apache.org/jira/browse/CONNECTORS-880, the incorrect repetition of the job was caused by a case-insensitive collation in MySQL. Thanks for your help.
Regards,
Florian

> Hi Florian,
>
> That's the whole point; the exception is taking place but not being properly logged due to a bug. That's why it has been so confusing. CONNECTORS-880 supposedly fixes the bug at least, but not the cause of the underlying exception that is triggering it.
>
> Karl
>
> On Wed, Feb 5, 2014 at 10:07 AM, Florian Schmedding <[email protected]> wrote:
>
> Hi Karl,
>
> thanks for the fix. However, it is a bit difficult to try it because I do not have a test system with the same setup. Before doing it I'm going to log all output from Manifold to check if there is some error visible when a job completes and restarts unexpectedly.
>
> Best,
> Florian
>
> Any luck with this?
> Karl
>
> On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright <[email protected]> wrote:
>
> I've created a branch at:
> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 .
> This contains my proposed fix; please try it out. If you would like, I can also attach a patch, although I'm not certain it would apply properly onto MCF 1.4.1 sources.
>
> Karl
>
> On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright <[email protected]> wrote:
>
> Hi Florian,
>
> I'm pretty sure now that what is happening is that your output connector is throwing some kind of exception when it is asked to remove documents during the cleanup phase of the crawl. The state transitions in the framework seem to be incorrect under these conditions, and the error is likely not logged into the job's error field. The ticket I've created to address this is CONNECTORS-880.
>
> Karl
>
> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright <[email protected]> wrote:
>
> The code path for an abort sequence looks pretty iron-clad.
> The fact that the bad-case output:
>
> >>>>>>
> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
> <<<<<<
>
> does not include:
>
> >>>>>>
> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
> <<<<<<
>
> is very significant, because it is in that method that the last-check time would be updated typically, in the method JobManager.finishJob(). If an abort took place, it would have started BEFORE all this; once the job state gets set to STATUS_SHUTTINGDOWN, there is no way that the job can be aborted either manually or by repository-connector related activity. At that time the job is cleaning up documents that are no longer reachable. I will check to see what happens if the output connector throws an exception during this phase; it's the only thing I can think of that might potentially derail the job from finishing.
>
> Karl
>
> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright <[email protected]> wrote:
>
> Hi Florian,
>
> The only way this can happen is if the proper job termination state sequence does not take place. When MCF checks to see if a job should be started, if it determines that the answer is "no" it updates the job record immediately with a new "last checked" value. But if it starts the job, it waits for the job completion to take place before updating the job's "last checked" time.
> When a job aborts, at first glance it looks like it also does the right thing, but clearly that's not true, and there must be a bug somewhere in how this condition is handled.
>
> I'll create a ticket to research this. In the interim, I suggest you figure out why your job is aborting in the first place.
>
> Thanks,
> Karl
>
> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <[email protected]> wrote:
>
> Hi Florian,
>
> I do not expect errors to appear in the tomcat log.
>
> But this is interesting:
>
> Good:
>
> >>>>>>
> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND within interval 1391439592120 to 1391439602151
> ...
>
> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found within interval 1391440412615 to 1391440427102
> <<<<<<
> "last checked" time for job is updated.
>
> Bad:
>
> >>>>>>
> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391446804106
> ...
>
> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447647733
> <<<<<<
> Note that the "last checked" time is NOT updated.
>
> I don't understand why, in one case, the "last checked" time is being updated for the job, and is not in another case. I will look to see if there is any way in the code that this can happen.
>
> Karl
>
> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <[email protected]> wrote:
>
> Hi Karl,
>
> there are no errors in the Tomcat logs. Currently, the Manifold log contains only the job log messages (<property name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log snippets, one from a normal run, and one where the job got repeated two times. I noticed the thread sequence "Finisher - Job reset - Job notification" when the job finally terminates, and the thread sequence "Finisher - Job notification" when the job gets restarted again instead of terminating.
>
> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439582108, and now it is 1391439592119
> DEBUG 2014-02-03 15:59:52,131 (Job start thread) - No time match found within interval 1391439582108 to 1391439592119
> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND within interval 1391439592120 to 1391439602151
> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job '1385573203052' is within run window at 1391439602151 ms. (which starts at 1391439600000 ms.)
> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job start for job 1385573203052
> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job 1385573203052 for startup
> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now started
> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job 1385573203052 for shutdown
> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job 1385573203052 in need of notification
> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found within interval 1391440412615 to 1391440427102
>
> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446784053, and now it is 1391446794074
> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - No time match found within interval 1391446784053 to 1391446794074
> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391446804106
> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job '1385573203052' is within run window at 1391446804106 ms. (which starts at 1391446800000 ms.)
> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job start for job 1385573203052
> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job 1385573203052 for startup
> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now started
> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job 1385573203052 for shutdown
> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job 1385573203052 in need of notification
> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447647733
> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job '1385573203052' is within run window at 1391447647733 ms. (which starts at 1391446800000 ms.)
> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447657740
> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447657740
> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job '1385573203052' is within run window at 1391447657740 ms. (which starts at 1391446800000 ms.)
> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job start for job 1385573203052
> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job 1385573203052 for startup
> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now started
> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391448479353
> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391448479353
> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job '1385573203052' is within run window at 1391448479353 ms. (which starts at 1391446800000 ms.)
> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job start for job 1385573203052
> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job 1385573203052 for startup
> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now started
> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job 1385573203052 for shutdown
> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 now completed
> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job 1385573203052 in need of notification
> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391449283114, and now it is 1391449292400
> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - No time match found within interval 1391449283114 to 1391449292400
>
> Do you need another log output?
>
> Best,
> Florian
>
> Also, what does the log have to say? If there is an error aborting the job, there should be some record of it in the manifoldcf.log.
>
> Thanks,
> Karl
>
> On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <[email protected]> wrote:
>
> Hi Florian,
>
> Please run the job manually, when outside the scheduling window or with the scheduling off. What is the reason for the job abort?
>
> Karl
>
> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <[email protected]> wrote:
>
> Hi Karl,
>
> yes, I've coincidentally seen "Aborted" in the end time column when I refreshed the job status just after the number of active documents was zero. At the next refresh the job was starting up. After looking in the history I found out that it even started a third time. You can see the history of a single day below (job continue, end, start, stop, unwait, wait). The start method is "Start at beginning of schedule window". Job invocation is "complete". Hop count mode is "Delete unreachable documents".
>
> 02.03.2014 18:41 job end
> 02.03.2014 18:28 job start
> 02.03.2014 18:14 job start
> 02.03.2014 18:00 job start
> 02.03.2014 17:49 job end
> 02.03.2014 17:27 job end
> 02.03.2014 17:13 job start
> 02.03.2014 17:00 job start
> 02.03.2014 16:13 job end
> 02.03.2014 16:00 job start
> 02.03.2014 15:41 job end
> 02.03.2014 15:27 job start
> 02.03.2014 15:14 job start
> 02.03.2014 15:00 job start
> 02.03.2014 14:13 job end
> 02.03.2014 14:00 job start
> 02.03.2014 13:13 job end
> 02.03.2014 13:00 job start
> 02.03.2014 12:27 job end
> 02.03.2014 12:14 job start
> 02.03.2014 12:00 job start
> 02.03.2014 11:13 job end
> 02.03.2014 11:00 job start
> 02.03.2014 10:13 job end
> 02.03.2014 10:00 job start
> 02.03.2014 09:29 job end
> 02.03.2014 09:14 job start
> 02.03.2014 09:00 job start
>
> Best,
> Florian
>
> Hi Florian,
>
> Jobs don't just abort randomly. Are you sure that the job aborted? Or did it just restart?
>
> As for "is this normal", it depends on how you have created your job. If you selected the "Start within schedule window" selection, MCF will restart the job whenever it finishes and run it until the end of the scheduling window.
>
> Karl
>
> On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <[email protected]> wrote:
>
> Hi Karl,
>
> I've just observed that the job was started according to its schedule and crawled all documents correctly (I've chosen to re-ingest all documents before the run). However, after finishing the last document (zero active documents) it was somehow aborted and restarted immediately. Is this an expected behavior?
>
> Best,
> Florian
>
> Hi Florian,
>
> Based on this schedule, your crawls will be able to start whenever the hour turns. So they can start every hour on the hour.
> If the last crawl crossed an hour boundary, the next crawl will start immediately, I believe.
>
> Karl
>
> On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <[email protected]> wrote:
>
> Hi Karl,
>
> these are the values:
>
> Priority: 5
> Start method: Start at beginning of schedule window
> Schedule type: Scan every document once
> Minimum recrawl interval: Not applicable
> Expiration interval: Not applicable
> Reseed interval: Not applicable
> Scheduled time: Any day of week at 12 am 1 am 2 am 3 am 4 am 5 am 6 am 7 am 8 am 9 am 10 am 11 am 12 pm 1 pm 2 pm 3 pm 4 pm 5 pm 6 pm 7 pm 8 pm 9 pm 10 pm 11 pm
> Maximum run time: No limit
> Job invocation: Complete
>
> Maybe it is because I've changed the job from continuous crawling to this schedule. I started it a few times manually, too. I couldn't notice anything strange in the job setup or in the respective entries in the database.
>
> Regards,
> Florian
>
> Hi Florian,
>
> I was unable to reproduce the behavior you described.
>
> Could you view your job, and post a screen shot of that page? I want to see what your schedule record(s) look like.
>
> Thanks,
> Karl
>
> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]> wrote:
>
> Hi Florian,
>
> I've never noted this behavior before. I'll see if I can reproduce it here.
>
> Karl
>
> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <[email protected]> wrote:
>
> Hi Karl,
>
> the scheduled job seems to work as expected. However, it runs two times: It starts at the beginning of the scheduled time, finishes, and immediately starts again. After finishing the second run it waits for the next scheduled time. Why does it run two times?
> The start method is "Start at beginning of schedule window".
>
> Yes, you're right about the checking guarantee. Currently, our interval is long enough for a complete crawler run.
>
> Best,
> Florian
>
> Hi Florian,
>
> It is impossible to *guarantee* that a document will be checked, because if load on the crawler is high enough, it will fall behind. But I will look into adding the feature you request.
>
> Karl
>
> On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <[email protected]> wrote:
>
> Hi Karl,
>
> yes, in our case it is necessary to make sure that new documents are discovered and indexed within a certain interval. I have created a feature request on that. In the meantime we will try to use a scheduled job instead.
>
> Thanks for your help,
> Florian
>
> Hi Florian,
>
> What you are seeing is "dynamic crawling" behavior. The time between refetches of a document is based on the history of fetches of that document. The recrawl interval is the initial time between document fetches, but if a document does not change, the interval for the document increases according to a formula.
>
> I would need to look at the code to be able to give you the precise formula, but if you need a limit on the amount of time between document fetch attempts, I suggest you create a ticket and I will look into adding that as a feature.
>
> Thanks,
> Karl
>
> On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <[email protected]> wrote:
>
> Hello,
>
> the parameters reseed interval and recrawl interval of a continuous crawling job are not quite clear to me. The documentation tells that the reseed interval is the time after which the seeds are checked again, and the recrawl interval is the time after which a document is checked for changes.
>
> However, we observed that the recrawl interval for a document increases after each check. On the other hand, the reseed interval seems to be set up correctly in the database metadata about the seed documents.
> Yet the web server does not receive requests at each time the interval elapses but only after several intervals have elapsed.
>
> We are using a web connector. The web server does not tell the client to cache the documents. Any help would be appreciated.
>
> Best regards,
> Florian
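
The "last checked" bookkeeping Karl describes in this thread (update immediately when the answer is "no", otherwise wait for job completion) can be sketched roughly as follows. This is only an illustration in Python, not ManifoldCF code: the names `Job`, `window_start_in`, `check_job` and `finish_job` are hypothetical, the hourly window and the interval semantics are assumptions, and the timestamps are taken from the log excerpts above.

```python
from dataclasses import dataclass
from typing import Optional

HOUR_MS = 3_600_000  # assumption: the schedule fires every hour on the hour


@dataclass
class Job:
    last_checked_ms: int
    running: bool = False


def window_start_in(start_ms: int, end_ms: int) -> Optional[int]:
    """Return the first hour boundary in (start_ms, end_ms], or None.

    The exact interval semantics inside MCF are an assumption here.
    """
    boundary = (start_ms // HOUR_MS + 1) * HOUR_MS
    return boundary if boundary <= end_ms else None


def check_job(job: Job, now_ms: int) -> bool:
    """Decide whether to start the job; True means it was started."""
    if job.running:
        return False
    if window_start_in(job.last_checked_ms, now_ms) is None:
        # "No time match found": record the check immediately.
        job.last_checked_ms = now_ms
        return False
    # "Time match FOUND": start, and defer the update until completion.
    job.running = True
    return True


def finish_job(job: Job, now_ms: int, completed_cleanly: bool) -> None:
    """End a run; only a clean completion advances last_checked_ms."""
    job.running = False
    if completed_cleanly:
        job.last_checked_ms = now_ms


job = Job(last_checked_ms=1391446794075)
print(check_job(job, 1391446804106))   # True: 18:00 boundary is in the interval
finish_job(job, 1391447647733, completed_cleanly=False)  # completion step skipped
print(check_job(job, 1391447647733))   # True again: same boundary still matches
finish_job(job, 1391449283114, completed_cleanly=True)
print(check_job(job, 1391449292400))   # False: no boundary since last check
```

Under these assumptions, a run that ends without the completion step leaves `last_checked_ms` at its old value, so the same hour boundary keeps matching and the job is started again, which is the pattern visible in the "Bad" log excerpt.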
