Re: Continuous crawling

Florian Schmedding Wed, 12 Feb 2014 08:48:19 -0800

Hi Karl,

as commented on https://issues.apache.org/jira/browse/CONNECTORS-880 the
incorrect repetition of the job was caused by a case-insensitive collation
in MySQL. Thanks for your help.


Regards,
Florian
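
A minimal sketch of how a case-insensitive collation can cause this kind of
mix-up. It assumes, purely for illustration, that the affected column holds
short codes that differ only in letter case; the thread does not say which
column was involved, and the codes and the select_by_status helper below are
hypothetical.

```python
def select_by_status(rows, wanted, case_sensitive):
    """Mimic `WHERE status = wanted` under the two kinds of collation."""
    if case_sensitive:
        return [r for r in rows if r["status"] == wanted]
    # A case-insensitive collation treats "S" and "s" as equal.
    return [r for r in rows if r["status"].casefold() == wanted.casefold()]


# Hypothetical rows: two jobs whose status codes differ only in case.
jobs = [
    {"id": 1, "status": "S"},
    {"id": 2, "status": "s"},
]

binary_ids = [r["id"] for r in select_by_status(jobs, "s", case_sensitive=True)]
ci_ids = [r["id"] for r in select_by_status(jobs, "s", case_sensitive=False)]

print(binary_ids)  # [2]
print(ci_ids)      # [1, 2]
```

Under the case-insensitive comparison a query meant for one code also returns
rows carrying the other, which is how distinct states can be confused.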

> Hi Florian,
>
> That's the whole point; the exception is taking place but not being
> properly logged due to a bug.  That's why it has been so confusing.
> CONNECTORS-880 supposedly fixes the bug at least, but not the cause of the
> underlying exception that is triggering it.
>
>
> Karl
>
>
>
> On Wed, Feb 5, 2014 at 10:07 AM, Florian Schmedding <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> thanks for the fix. However, it is a bit difficult to try it because I do
>> not have a test system with the same setup. Before doing it I'm going to
>> log all output from Manifold to check if there is some error visible when
>> a job completes and restarts unexpectedly.
>>
>> Best,
>> Florian
>>
>>
>> > Any luck with this?
>> > Karl
>> >
>> >
>> > On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright <[email protected]> wrote:
>> >
>> >> I've created a branch at:
>> >> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 .
>> >> This contains my proposed fix; please try it out.  If you would like, I
>> >> can also attach a patch, although I'm not certain it would apply
>> >> properly onto MCF 1.4.1 sources.
>> >>
>> >> Karl
>> >>
>> >>
>> >>
>> >> On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright <[email protected]> wrote:
>> >>
>> >>> Hi Florian,
>> >>>
>> >>> I'm pretty sure now that what is happening is that your output
>> >>> connector is throwing some kind of exception when it is asked to remove
>> >>> documents during the cleanup phase of the crawl.  The state transitions
>> >>> in the framework seem to be incorrect under these conditions, and the
>> >>> error is likely not logged into the job's error field.  The ticket I've
>> >>> created to address this is CONNECTORS-880.
>> >>>
>> >>> Karl
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright <[email protected]> wrote:
>> >>>
>> >>>> The code path for an abort sequence looks pretty iron-clad.  The fact
>> >>>> that the bad-case output:
>> >>>>
>> >>>>
>> >>>> >>>>>>
>> >>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
>> >>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
>> >>>> <<<<<<
>> >>>>
>> >>>> is not including:
>> >>>>
>> >>>>
>> >>>> >>>>>>
>> >>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
>> >>>> <<<<<<
>> >>>>
>> >>>> is very significant, because it is in that method that the last-check
>> >>>> time would be updated typically, in the method JobManager.finishJob().
>> >>>> If an abort took place, it would have started BEFORE all this; once the
>> >>>> job state gets set to STATUS_SHUTTINGDOWN, there is no way that the job
>> >>>> can be aborted either manually or by repository-connector related
>> >>>> activity.  At that time the job is cleaning up documents that are no
>> >>>> longer reachable.  I will check to see what happens if the output
>> >>>> connector throws an exception during this phase; it's the only thing I
>> >>>> can think of that might potentially derail the job from finishing.
>> >>>>
>> >>>> Karl
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright <[email protected]> wrote:
>> >>>>
>> >>>>> Hi Florian,
>> >>>>>
>> >>>>> The only way this can happen is if the proper job termination state
>> >>>>> sequence does not take place.  When MCF checks to see if a job should
>> >>>>> be started, if it determines that the answer is "no" it updates the
>> >>>>> job record immediately with a new "last checked" value.  But if it
>> >>>>> starts the job, it waits for the job completion to take place before
>> >>>>> updating the job's "last checked" time.  When a job aborts, at first
>> >>>>> glance it looks like it also does the right thing, but clearly that's
>> >>>>> not true, and there must be a bug somewhere in how this condition is
>> >>>>> handled.
>> >>>>>
>> >>>>> I'll create a ticket to research this. In the interim, I suggest you
>> >>>>> figure out why your job is aborting in the first place.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Karl
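
The "last checked" bookkeeping described in the message above can be sketched
as follows. This is an illustration only, not MCF's actual implementation: the
window_start_in helper is hypothetical, and it assumes schedule windows open on
exact hour boundaries, which matches the window start values in the quoted
DEBUG lines. The timestamps are copied from those log lines.

```python
HOUR_MS = 60 * 60 * 1000  # assumed window spacing: one window per hour


def window_start_in(last_checked_ms, now_ms, period_ms=HOUR_MS):
    """Return the window start inside (last_checked_ms, now_ms], else None.

    Assumes windows open on exact multiples of period_ms since the epoch.
    """
    boundary = (now_ms // period_ms) * period_ms  # latest boundary <= now
    return boundary if last_checked_ms < boundary else None


# Normal start: the 16:00 boundary lies between the two timestamps.
good_start = window_start_in(1391439592120, 1391439602151)
# After a clean completion "last checked" has moved past the boundary.
after_completion = window_start_in(1391440412615, 1391440427102)
# Bad case: "last checked" is stale, so the same boundary matches again.
stale_check = window_start_in(1391446794075, 1391447647733)

print(good_start)        # 1391439600000
print(after_completion)  # None
print(stale_check)       # 1391446800000
```

As long as "last checked" stays at its pre-start value, every later check finds
the same window start again, which is consistent with the repeated job starts
in the history.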
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <[email protected]> wrote:
>> >>>>>
>> >>>>>> Hi Florian,
>> >>>>>>
>> >>>>>> I do not expect errors to appear in the tomcat log.
>> >>>>>>
>> >>>>>> But this is interesting:
>> >>>>>>
>> >>>>>> Good:
>> >>>>>>
>> >>>>>> >>>>>>
>> >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
>> >>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND within interval 1391439592120 to 1391439602151
>> >>>>>>  ...
>> >>>>>>
>> >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
>> >>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found within interval 1391440412615 to 1391440427102
>> >>>>>> <<<<<<
>> >>>>>> "last checked" time for job is updated.
>> >>>>>>
>> >>>>>> Bad:
>> >>>>>>
>> >>>>>> >>>>>>
>> >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
>> >>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391446804106
>> >>>>>>  ...
>> >>>>>>
>> >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
>> >>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391447647733
>> >>>>>> <<<<<<
>> >>>>>> Note that the "last checked" time is NOT updated.
>> >>>>>>
>> >>>>>> I don't understand why, in one case, the "last checked" time is being
>> >>>>>> updated for the job, and is not in another case.  I will look to see
>> >>>>>> if there is any way in the code that this can happen.
>> >>>>>>
>> >>>>>> Karl
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <
>> >>>>>> [email protected]> wrote:
>> >>>>>>
>> >>>>>>> Hi Karl,
>> >>>>>>>
>> >>>>>>> there are no errors in the Tomcat logs. Currently, the Manifold log
>> >>>>>>> contains only the job log messages (<property
>> >>>>>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log
>> >>>>>>> snippets, one from a normal run, and one where the job got repeated
>> >>>>>>> two times. I noticed the thread sequence "Finisher - Job reset - Job
>> >>>>>>> notification" when the job finally terminates, and the thread
>> >>>>>>> sequence "Finisher - Job notification" when the job gets restarted
>> >>>>>>> again instead of terminating.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439582108, and now it is 1391439592119
>> >>>>>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) -  No time match found within interval 1391439582108 to 1391439592119
>> >>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
>> >>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND within interval 1391439592120 to 1391439602151
>> >>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job '1385573203052' is within run window at 1391439602151 ms. (which starts at 1391439600000 ms.)
>> >>>>>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job start for job 1385573203052
>> >>>>>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job 1385573203052 for startup
>> >>>>>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now started
>> >>>>>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job 1385573203052 for shutdown
>> >>>>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
>> >>>>>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job 1385573203052 in need of notification
>> >>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
>> >>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found within interval 1391440412615 to 1391440427102
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446784053, and now it is 1391446794074
>> >>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) -  No time match found within interval 1391446784053 to 1391446794074
>> >>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
>> >>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391446804106
>> >>>>>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job '1385573203052' is within run window at 1391446804106 ms. (which starts at 1391446800000 ms.)
>> >>>>>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job start for job 1385573203052
>> >>>>>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job 1385573203052 for startup
>> >>>>>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now started
>> >>>>>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job 1385573203052 for shutdown
>> >>>>>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job 1385573203052 in need of notification
>> >>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
>> >>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391447647733
>> >>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job '1385573203052' is within run window at 1391447647733 ms. (which starts at 1391446800000 ms.)
>> >>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447657740
>> >>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391447657740
>> >>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job '1385573203052' is within run window at 1391447657740 ms. (which starts at 1391446800000 ms.)
>> >>>>>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job start for job 1385573203052
>> >>>>>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job 1385573203052 for startup
>> >>>>>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now started
>> >>>>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
>> >>>>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
>> >>>>>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391448479353
>> >>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) -  Time match FOUND within interval 1391446794075 to 1391448479353
>> >>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job '1385573203052' is within run window at 1391448479353 ms. (which starts at 1391446800000 ms.)
>> >>>>>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job start for job 1385573203052
>> >>>>>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job 1385573203052 for startup
>> >>>>>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now started
>> >>>>>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job 1385573203052 for shutdown
>> >>>>>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 now completed
>> >>>>>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job 1385573203052 in need of notification
>> >>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391449283114, and now it is 1391449292400
>> >>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) -  No time match found within interval 1391449283114 to 1391449292400
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Do you need another log output?
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Florian
>> >>>>>>>
>> >>>>>>> > Also, what does the log have to say?  If there is an error aborting
>> >>>>>>> > the job, there should be some record of it in the manifoldcf.log.
>> >>>>>>> >
>> >>>>>>> > Thanks,
>> >>>>>>> > Karl
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <[email protected]> wrote:
>> >>>>>>> >
>> >>>>>>> >> Hi Florian,
>> >>>>>>> >>
>> >>>>>>> >> Please run the job manually, when outside the scheduling window or
>> >>>>>>> >> with the scheduling off.  What is the reason for the job abort?
>> >>>>>>> >>
>> >>>>>>> >> Karl
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <
>> >>>>>>> >> [email protected]> wrote:
>> >>>>>>> >>
>> >>>>>>> >>> Hi Karl,
>> >>>>>>> >>>
>> >>>>>>> >>> yes, I've coincidentally seen "Aborted" in the end time column
>> >>>>>>> >>> when I refreshed the job status just after the number of active
>> >>>>>>> >>> documents was zero. At the next refresh the job was starting up.
>> >>>>>>> >>> After looking in the history I found out that it even started a
>> >>>>>>> >>> third time. You can see the history of a single day below (job
>> >>>>>>> >>> continue, end, start, stop, unwait, wait). The start method is
>> >>>>>>> >>> "Start at beginning of schedule window". Job invocation is
>> >>>>>>> >>> "complete". Hop count mode is "Delete unreachable documents".
>> >>>>>>> >>>
>> >>>>>>> >>> 02.03.2014 18:41        job end
>> >>>>>>> >>> 02.03.2014 18:28        job start
>> >>>>>>> >>> 02.03.2014 18:14        job start
>> >>>>>>> >>> 02.03.2014 18:00        job start
>> >>>>>>> >>> 02.03.2014 17:49        job end
>> >>>>>>> >>> 02.03.2014 17:27        job end
>> >>>>>>> >>> 02.03.2014 17:13        job start
>> >>>>>>> >>> 02.03.2014 17:00        job start
>> >>>>>>> >>> 02.03.2014 16:13        job end
>> >>>>>>> >>> 02.03.2014 16:00        job start
>> >>>>>>> >>> 02.03.2014 15:41        job end
>> >>>>>>> >>> 02.03.2014 15:27        job start
>> >>>>>>> >>> 02.03.2014 15:14        job start
>> >>>>>>> >>> 02.03.2014 15:00        job start
>> >>>>>>> >>> 02.03.2014 14:13        job end
>> >>>>>>> >>> 02.03.2014 14:00        job start
>> >>>>>>> >>> 02.03.2014 13:13        job end
>> >>>>>>> >>> 02.03.2014 13:00        job start
>> >>>>>>> >>> 02.03.2014 12:27        job end
>> >>>>>>> >>> 02.03.2014 12:14        job start
>> >>>>>>> >>> 02.03.2014 12:00        job start
>> >>>>>>> >>> 02.03.2014 11:13        job end
>> >>>>>>> >>> 02.03.2014 11:00        job start
>> >>>>>>> >>> 02.03.2014 10:13        job end
>> >>>>>>> >>> 02.03.2014 10:00        job start
>> >>>>>>> >>> 02.03.2014 09:29        job end
>> >>>>>>> >>> 02.03.2014 09:14        job start
>> >>>>>>> >>> 02.03.2014 09:00        job start
>> >>>>>>> >>>
>> >>>>>>> >>> Best,
>> >>>>>>> >>> Florian
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>> > Hi Florian,
>> >>>>>>> >>> >
>> >>>>>>> >>> > Jobs don't just abort randomly.  Are you sure that the job
>> >>>>>>> >>> > aborted?  Or did it just restart?
>> >>>>>>> >>> >
>> >>>>>>> >>> > As for "is this normal", it depends on how you have created your
>> >>>>>>> >>> > job.  If you selected the "Start within schedule window"
>> >>>>>>> >>> > selection, MCF will restart the job whenever it finishes and run
>> >>>>>>> >>> > it until the end of the scheduling window.
>> >>>>>>> >>> >
>> >>>>>>> >>> > Karl
>> >>>>>>> >>> >
>> >>>>>>> >>> >
>> >>>>>>> >>> >
>> >>>>>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <
>> >>>>>>> >>> > [email protected]> wrote:
>> >>>>>>> >>> >
>> >>>>>>> >>> >> Hi Karl,
>> >>>>>>> >>> >>
>> >>>>>>> >>> >> I've just observed that the job was started according to its
>> >>>>>>> >>> >> schedule and crawled all documents correctly (I've chosen to
>> >>>>>>> >>> >> re-ingest all documents before the run). However, after
>> >>>>>>> >>> >> finishing the last document (zero active documents) it was
>> >>>>>>> >>> >> somehow aborted and restarted immediately. Is this an expected
>> >>>>>>> >>> >> behavior?
>> >>>>>>> >>> >>
>> >>>>>>> >>> >> Best,
>> >>>>>>> >>> >> Florian
>> >>>>>>> >>> >>
>> >>>>>>> >>> >>
>> >>>>>>> >>> >> > Hi Florian,
>> >>>>>>> >>> >> >
>> >>>>>>> >>> >> > Based on this schedule, your crawls will be able to start
>> >>>>>>> >>> >> > whenever the hour turns.  So they can start every hour on the
>> >>>>>>> >>> >> > hour.  If the last crawl crossed an hour boundary, the next
>> >>>>>>> >>> >> > crawl will start immediately, I believe.
>> >>>>>>> >>> >> >
>> >>>>>>> >>> >> > Karl
>> >>>>>>> >>> >> >
>> >>>>>>> >>> >> >
>> >>>>>>> >>> >> >
>> >>>>>>> >>> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
>> >>>>>>> >>> >> > [email protected]> wrote:
>> >>>>>>> >>> >> >
>> >>>>>>> >>> >> >> Hi Karl,
>> >>>>>>> >>> >> >>
>> >>>>>>> >>> >> >> these are the values:
>> >>>>>>> >>> >> >> Priority:                  5
>> >>>>>>> >>> >> >> Start method:              Start at beginning of schedule window
>> >>>>>>> >>> >> >> Schedule type:             Scan every document once
>> >>>>>>> >>> >> >> Minimum recrawl interval:  Not applicable
>> >>>>>>> >>> >> >> Expiration interval:       Not applicable
>> >>>>>>> >>> >> >> Reseed interval:           Not applicable
>> >>>>>>> >>> >> >> Scheduled time:            Any day of week at 12 am 1 am 2 am 3 am 4 am
>> >>>>>>> >>> >> >>                            5 am 6 am 7 am 8 am 9 am 10 am 11 am 12 pm
>> >>>>>>> >>> >> >>                            1 pm 2 pm 3 pm 4 pm 5 pm 6 pm 7 pm 8 pm 9 pm
>> >>>>>>> >>> >> >>                            10 pm 11 pm
>> >>>>>>> >>> >> >> Maximum run time:          No limit
>> >>>>>>> >>> >> >> Job invocation:            Complete
>> >>>>>>> >>> >> >>
>> >>>>>>> >>> >> >> Maybe it is because I've changed the job from continuous
>> >>>>>>> >>> >> >> crawling to this schedule. I started it a few times manually,
>> >>>>>>> >>> >> >> too. I couldn't notice anything strange in the job setup or in
>> >>>>>>> >>> >> >> the respective entries in the database.
>> >>>>>>> >>> >> >>
>> >>>>>>> >>> >> >> Regards,
>> >>>>>>> >>> >> >> Florian
>> >>>>>>> >>> >> >>
>> >>>>>>> >>> >> >> > Hi Florian,
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >> > I was unable to reproduce the behavior you described.
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >> > Could you view your job, and post a screen shot of that page?
>> >>>>>>> >>> >> >> > I want to see what your schedule record(s) look like.
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >> > Thanks,
>> >>>>>>> >>> >> >> > Karl
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]> wrote:
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >> >> Hi Florian,
>> >>>>>>> >>> >> >> >>
>> >>>>>>> >>> >> >> >> I've never noted this behavior before.  I'll see if I can
>> >>>>>>> >>> >> >> >> reproduce it here.
>> >>>>>>> >>> >> >> >>
>> >>>>>>> >>> >> >> >> Karl
>> >>>>>>> >>> >> >> >>
>> >>>>>>> >>> >> >> >>
>> >>>>>>> >>> >> >> >>
>> >>>>>>> >>> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>> >>>>>>> >>> >> >> >> [email protected]> wrote:
>> >>>>>>> >>> >> >> >>
>> >>>>>>> >>> >> >> >>> Hi Karl,
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>> the scheduled job seems to work as expected. However, it runs
>> >>>>>>> >>> >> >> >>> two times: It starts at the beginning of the scheduled time,
>> >>>>>>> >>> >> >> >>> finishes, and immediately starts again. After finishing the
>> >>>>>>> >>> >> >> >>> second run it waits for the next scheduled time. Why does it
>> >>>>>>> >>> >> >> >>> run two times? The start method is "Start at beginning of
>> >>>>>>> >>> >> >> >>> schedule window".
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>> Yes, you're right about the checking guarantee. Currently, our
>> >>>>>>> >>> >> >> >>> interval is long enough for a complete crawler run.
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>> Best,
>> >>>>>>> >>> >> >> >>> Florian
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>> > Hi Florian,
>> >>>>>>> >>> >> >> >>> >
>> >>>>>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will be
>> >>>>>>> >>> >> >> >>> > checked, because if load on the crawler is high enough, it
>> >>>>>>> >>> >> >> >>> > will fall behind.  But I will look into adding the feature
>> >>>>>>> >>> >> >> >>> > you request.
>> >>>>>>> >>> >> >> >>> >
>> >>>>>>> >>> >> >> >>> > Karl
>> >>>>>>> >>> >> >> >>> >
>> >>>>>>> >>> >> >> >>> >
>> >>>>>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>> >>>>>>> >>> >> >> >>> > [email protected]> wrote:
>> >>>>>>> >>> >> >> >>> >
>> >>>>>>> >>> >> >> >>> >> Hi Karl,
>> >>>>>>> >>> >> >> >>> >>
>> >>>>>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that new
>> >>>>>>> >>> >> >> >>> >> documents are discovered and indexed within a certain
>> >>>>>>> >>> >> >> >>> >> interval. I have created a feature request on that. In the
>> >>>>>>> >>> >> >> >>> >> meantime we will try to use a scheduled job instead.
>> >>>>>>> >>> >> >> >>> >>
>> >>>>>>> >>> >> >> >>> >> Thanks for your help,
>> >>>>>>> >>> >> >> >>> >> Florian
>> >>>>>>> >>> >> >> >>> >>
>> >>>>>>> >>> >> >> >>> >>
>> >>>>>>> >>> >> >> >>> >> > Hi Florian,
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling" behavior.  The
>> >>>>>>> >>> >> >> >>> >> > time between refetches of a document is based on the
>> >>>>>>> >>> >> >> >>> >> > history of fetches of that document.  The recrawl interval
>> >>>>>>> >>> >> >> >>> >> > is the initial time between document fetches, but if a
>> >>>>>>> >>> >> >> >>> >> > document does not change, the interval for the document
>> >>>>>>> >>> >> >> >>> >> > increases according to a formula.
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >> > I would need to look at the code to be able to give you the
>> >>>>>>> >>> >> >> >>> >> > precise formula, but if you need a limit on the amount of
>> >>>>>>> >>> >> >> >>> >> > time between document fetch attempts, I suggest you create
>> >>>>>>> >>> >> >> >>> >> > a ticket and I will look into adding that as a feature.
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >> > Thanks,
>> >>>>>>> >>> >> >> >>> >> > Karl
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>> >>>>>>> >>> >> >> >>> >> > [email protected]> wrote:
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >> >> Hello,
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl interval of a
>> >>>>>>> >>> >> >> >>> >> >> continuous crawling job are not quite clear to me. The
>> >>>>>>> >>> >> >> >>> >> >> documentation says that the reseed interval is the time
>> >>>>>>> >>> >> >> >>> >> >> after which the seeds are checked again, and the recrawl
>> >>>>>>> >>> >> >> >>> >> >> interval is the time after which a document is checked for
>> >>>>>>> >>> >> >> >>> >> >> changes.
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >> However, we observed that the recrawl interval for a
>> >>>>>>> >>> >> >> >>> >> >> document increases after each check. On the other hand, the
>> >>>>>>> >>> >> >> >>> >> >> reseed interval seems to be set up correctly in the
>> >>>>>>> >>> >> >> >>> >> >> database metadata about the seed documents. Yet the web
>> >>>>>>> >>> >> >> >>> >> >> server does not receive requests each time the interval
>> >>>>>>> >>> >> >> >>> >> >> elapses but only after several intervals have elapsed.
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >> We are using a web connector. The web server does not tell
>> >>>>>>> >>> >> >> >>> >> >> the client to cache the documents. Any help would be
>> >>>>>>> >>> >> >> >>> >> >> appreciated.
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >> Best regards,
>> >>>>>>> >>> >> >> >>> >> >> Florian
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >>
>> >>>>>>> >>> >> >> >>> >> >
>> >>>>>>> >>> >> >> >>> >>
>> >>>>>>> >>> >> >> >>> >>
>> >>>>>>> >>> >> >> >>> >>
>> >>>>>>> >>> >> >> >>> >
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>>
>> >>>>>>> >>> >> >> >>
>> >>>>>>> >>> >> >> >
>> >>>>>>> >>> >> >>
>> >>>>>>> >>> >> >>
>> >>>>>>> >>> >> >>
>> >>>>>>> >>> >> >
>> >>>>>>> >>> >>
>> >>>>>>> >>> >>
>> >>>>>>> >>> >>
>> >>>>>>> >>> >
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>
>> >>>>>>> >
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>>
>
