Re: Continuous crawling

Karl Wright Tue, 04 Feb 2014 10:31:01 -0800

Hi Florian,

The only way this can happen is if the proper job termination state
sequence does not take place.  When MCF checks to see if a job should be
started, if it determines that the answer is "no" it updates the job record
immediately with a new "last checked" value.  But if it starts the job, it
waits for the job completion to take place before updating the job's "last
checked" time.  When a job aborts, at first glance it looks like it also
does the right thing, but clearly that's not true, and there must be a bug
somewhere in how this condition is handled.


I'll create a ticket to research this. In the interim, I suggest you figure
out why your job is aborting in the first place.

Thanks,
Karl


On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <[email protected]> wrote:

> Hi Florian,
>
> I do not expect errors to appear in the tomcat log.
>
> But this is interesting:
>
> Good:
>
> >>>>>>
> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job
> 1385573203052 needs to be started; it was last checked at 1391439592120,
> and now it is 1391439602151
> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND
> within interval 1391439592120 to 1391439602151
> ...
>
> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job
> 1385573203052 needs to be started; it was last checked at 1391440412615,
> and now it is 1391440427102
> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found
> within interval 1391440412615 to 1391440427102
> <<<<<<
> "last checked" time for job is updated.
>
> Bad:
>
> >>>>>>
> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job
> 1385573203052 needs to be started; it was last checked at 1391446794075,
> and now it is 1391446804106
> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND
> within interval 1391446794075 to 1391446804106
> ...
>
> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job
> 1385573203052 needs to be started; it was last checked at 1391446794075,
> and now it is 1391447647733
> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND
> within interval 1391446794075 to 1391447647733
> <<<<<<
> Note that the "last checked" time is NOT updated.
>
> I don't understand why, in one case, the "last checked" time is being
> updated for the job, and is not in another case.  I will look to see if
> there is any way in the code that this can happen.
>
> Karl
>
>
>
> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> there are no errors in the Tomcat logs. Currently, the Manifold log
>> contains only the job log messages (<property
>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log
>> snippets, one from a normal run, and one where the job got repeated two
>> times. I noticed the thread sequence "Finisher - Job reset - Job
>> notification" when the job finally terminates, and the thread sequence
>> "Finisher - Job notification" when the job gets restarted again instead of
>> terminating.
>>
>>
>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391439582108,
>> and now it is 1391439592119
>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) -  No time match found
>> within interval 1391439582108 to 1391439592119
>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391439592120,
>> and now it is 1391439602151
>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) -  Time match FOUND
>> within interval 1391439592120 to 1391439602151
>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job '1385573203052' is
>> within run window at 1391439602151 ms. (which starts at 1391439600000 ms.)
>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job start
>> for job 1385573203052
>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job 1385573203052
>> for startup
>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now
>> started
>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job 1385573203052
>> for shutdown
>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now
>> completed
>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job
>> 1385573203052 in need of notification
>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391440412615,
>> and now it is 1391440427102
>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) -  No time match found
>> within interval 1391440412615 to 1391440427102
>>
>>
>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391446784053,
>> and now it is 1391446794074
>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) -  No time match found
>> within interval 1391446784053 to 1391446794074
>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>> and now it is 1391446804106
>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) -  Time match FOUND
>> within interval 1391446794075 to 1391446804106
>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job '1385573203052' is
>> within run window at 1391446804106 ms. (which starts at 1391446800000 ms.)
>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job start
>> for job 1385573203052
>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job 1385573203052
>> for startup
>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now
>> started
>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job 1385573203052
>> for shutdown
>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job
>> 1385573203052 in need of notification
>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>> and now it is 1391447647733
>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) -  Time match FOUND
>> within interval 1391446794075 to 1391447647733
>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job '1385573203052' is
>> within run window at 1391447647733 ms. (which starts at 1391446800000 ms.)
>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>> and now it is 1391447657740
>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) -  Time match FOUND
>> within interval 1391446794075 to 1391447657740
>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job '1385573203052' is
>> within run window at 1391447657740 ms. (which starts at 1391446800000 ms.)
>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job start
>> for job 1385573203052
>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job 1385573203052
>> for startup
>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now
>> started
>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052
>> for shutdown
>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job
>> 1385573203052 in need of notification
>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391446794075,
>> and now it is 1391448479353
>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) -  Time match FOUND
>> within interval 1391446794075 to 1391448479353
>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job '1385573203052' is
>> within run window at 1391448479353 ms. (which starts at 1391446800000 ms.)
>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job start
>> for job 1385573203052
>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job 1385573203052
>> for startup
>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now
>> started
>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job 1385573203052
>> for shutdown
>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 now
>> completed
>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job
>> 1385573203052 in need of notification
>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job
>> 1385573203052 needs to be started; it was last checked at 1391449283114,
>> and now it is 1391449292400
>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) -  No time match found
>> within interval 1391449283114 to 1391449292400
>>
>>
>> Do you need another log output?
>>
>> Best,
>> Florian
>>
>> > Also, what does the log have to say?  If there is an error aborting the
>> > job, there should be some record of it in the manifoldcf.log.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <[email protected]> wrote:
>> >
>> >> Hi Florian,
>> >>
>> >> Please run the job manually, when outside the scheduling window or with
>> >> the scheduling off.  What is the reason for the job abort?
>> >>
>> >> Karl
>> >>
>> >>
>> >>
>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <
>> >> [email protected]> wrote:
>> >>
>> >>> Hi Karl,
>> >>>
>> >>> yes, I've coincidentally seen "Aborted" in the end time column when I
>> >>> refreshed the job status just after the number of active documents was
>> >>> zero. At the next refresh the job was starting up. After looking in
>> the
>> >>> history I found out that it even started a third time. You can see the
>> >>> history of a single day below (job continue, end, start, stop, unwait,
>> >>> wait). The start method is "Start at beginning of schedule window".
>> Job
>> >>> invocation is "complete". Hop count mode is "Delete unreachable
>> >>> documents".
>> >>>
>> >>> 02.03.2014 18:41        job end
>> >>> 02.03.2014 18:28        job start
>> >>> 02.03.2014 18:14        job start
>> >>> 02.03.2014 18:00        job start
>> >>> 02.03.2014 17:49        job end
>> >>> 02.03.2014 17:27        job end
>> >>> 02.03.2014 17:13        job start
>> >>> 02.03.2014 17:00        job start
>> >>> 02.03.2014 16:13        job end
>> >>> 02.03.2014 16:00        job start
>> >>> 02.03.2014 15:41        job end
>> >>> 02.03.2014 15:27        job start
>> >>> 02.03.2014 15:14        job start
>> >>> 02.03.2014 15:00        job start
>> >>> 02.03.2014 14:13        job end
>> >>> 02.03.2014 14:00        job start
>> >>> 02.03.2014 13:13        job end
>> >>> 02.03.2014 13:00        job start
>> >>> 02.03.2014 12:27        job end
>> >>> 02.03.2014 12:14        job start
>> >>> 02.03.2014 12:00        job start
>> >>> 02.03.2014 11:13        job end
>> >>> 02.03.2014 11:00        job start
>> >>> 02.03.2014 10:13        job end
>> >>> 02.03.2014 10:00        job start
>> >>> 02.03.2014 09:29        job end
>> >>> 02.03.2014 09:14        job start
>> >>> 02.03.2014 09:00        job start
>> >>>
>> >>> Best,
>> >>> Florian
>> >>>
>> >>>
>> >>> > Hi Florian,
>> >>> >
>> >>> > Jobs don't just abort randomly.  Are you sure that the job aborted?
>> >>> Or
>> >>> > did
>> >>> > it just restart?
>> >>> >
>> >>> > As for "is this normal", it depends on how you have created your
>> job.
>> >>>  If
>> >>> > you selected the "Start within schedule window" selection, MCF will
>> >>> > restart
>> >>> > the job whenever it finishes and run it until the end of the
>> >>> scheduling
>> >>> > window.
>> >>> >
>> >>> > Karl
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <
>> >>> > [email protected]> wrote:
>> >>> >
>> >>> >> Hi Karl,
>> >>> >>
>> >>> >> I've just observed that the job was started according to its
>> >>> schedule
>> >>> >> and
>> >>> >> crawled all documents correctly (I've chosen to re-ingest all
>> >>> documents
>> >>> >> before the run). However, after finishing the last document (zero
>> >>> active
>> >>> >> documents) it was somehow aborted and restarted immediately. Is
>> this
>> >>> an
>> >>> >> expected behavior?
>> >>> >>
>> >>> >> Best,
>> >>> >> Florian
>> >>> >>
>> >>> >>
>> >>> >> > Hi Florian,
>> >>> >> >
>> >>> >> > Based on this schedule, your crawls will be able to start
>> whenever
>> >>> the
>> >>> >> > hour
>> >>> >> > turns.  So they can start every hour on the hour.  If the last
>> >>> crawl
>> >>> >> > crossed an hour boundary, the next crawl will start immediately,
>> I
>> >>> >> > believe.
>> >>> >> >
>> >>> >> > Karl
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
>> >>> >> > [email protected]> wrote:
>> >>> >> >
>> >>> >> >> Hi Karl,
>> >>> >> >>
>> >>> >> >> these are the values:
>> >>> >> >> Priority:       5       Start method:   Start at beginning of
>> >>> >> schedule
>> >>> >> >> window
>> >>> >> >> Schedule type:  Scan every document once        Minimum recrawl
>> >>> >> >> interval:
>> >>> >> >>       Not
>> >>> >> >> applicable
>> >>> >> >> Expiration interval:    Not applicable  Reseed interval:
>> >>> Not
>> >>> >> >> applicable
>> >>> >> >> Scheduled time:         Any day of week at 12 am 1 am 2 am 3 am
>> 4
>> >>> am
>> >>> >> 5
>> >>> >> >> am
>> >>> >> >> 6 am 7
>> >>> >> >> am 8 am 9 am 10 am 11 am 12 pm 1 pm 2 pm 3 pm 4 pm 5 pm 6 pm 7
>> pm
>> >>> 8
>> >>> >> pm 9
>> >>> >> >> pm 10 pm 11 pm
>> >>> >> >> Maximum run time:       No limit        Job invocation:
>> >>> >> Complete
>> >>> >> >>
>> >>> >> >> Maybe it is because I've changed the job from continuous
>> crawling
>> >>> to
>> >>> >> >> this
>> >>> >> >> schedule. I started it a few times manually, too. I couldn't
>> >>> notice
>> >>> >> >> anything strange in the job setup or in the respective entries
>> in
>> >>> the
>> >>> >> >> database.
>> >>> >> >>
>> >>> >> >> Regards,
>> >>> >> >> Florian
>> >>> >> >>
>> >>> >> >> > Hi Florian,
>> >>> >> >> >
>> >>> >> >> > I was unable to reproduce the behavior you described.
>> >>> >> >> >
>> >>> >> >> > Could you view your job, and post a screen shot of that page?
>> >>> I
>> >>> >> want
>> >>> >> >> to
>> >>> >> >> > see what your schedule record(s) look like.
>> >>> >> >> >
>> >>> >> >> > Thanks,
>> >>> >> >> > Karl
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright
>> >>> <[email protected]>
>> >>> >> >> wrote:
>> >>> >> >> >
>> >>> >> >> >> Hi Florian,
>> >>> >> >> >>
>> >>> >> >> >> I've never noted this behavior before.  I'll see if I can
>> >>> >> reproduce
>> >>> >> >> it
>> >>> >> >> >> here.
>> >>> >> >> >>
>> >>> >> >> >> Karl
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>> >>> >> >> >> [email protected]> wrote:
>> >>> >> >> >>
>> >>> >> >> >>> Hi Karl,
>> >>> >> >> >>>
>> >>> >> >> >>> the scheduled job seems to work as expecetd. However, it
>> runs
>> >>> two
>> >>> >> >> >>> times:
>> >>> >> >> >>> It starts at the beginning of the scheduled time, finishes,
>> >>> and
>> >>> >> >> >>> immediately starts again. After finishing the second run it
>> >>> waits
>> >>> >> >> for
>> >>> >> >> >>> the
>> >>> >> >> >>> next scheduled time. Why does it run two times? The start
>> >>> method
>> >>> >> is
>> >>> >> >> >>> "Start
>> >>> >> >> >>> at beginning of schedule window".
>> >>> >> >> >>>
>> >>> >> >> >>> Yes, you're right about the checking guarantee. Currently,
>> >>> our
>> >>> >> >> interval
>> >>> >> >> >>> is
>> >>> >> >> >>> long enough for a complete crawler run.
>> >>> >> >> >>>
>> >>> >> >> >>> Best,
>> >>> >> >> >>> Florian
>> >>> >> >> >>>
>> >>> >> >> >>>
>> >>> >> >> >>> > Hi Florian,
>> >>> >> >> >>> >
>> >>> >> >> >>> > It is impossible to *guarantee* that a document will be
>> >>> >> checked,
>> >>> >> >> >>> because
>> >>> >> >> >>> > if
>> >>> >> >> >>> > load on the crawler is high enough, it will fall behind.
>> >>> But
>> >>> I
>> >>> >> >> will
>> >>> >> >> >>> look
>> >>> >> >> >>> > into adding the feature you request.
>> >>> >> >> >>> >
>> >>> >> >> >>> > Karl
>> >>> >> >> >>> >
>> >>> >> >> >>> >
>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>> >>> >> >> >>> > [email protected]> wrote:
>> >>> >> >> >>> >
>> >>> >> >> >>> >> Hi Karl,
>> >>> >> >> >>> >>
>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that new
>> >>> >> documents
>> >>> >> >> are
>> >>> >> >> >>> >> discovered and indexed within a certain interval. I have
>> >>> >> created
>> >>> >> >> a
>> >>> >> >> >>> >> feature
>> >>> >> >> >>> >> request on that. In the meantime we will try to use a
>> >>> >> scheduled
>> >>> >> >> job
>> >>> >> >> >>> >> instead.
>> >>> >> >> >>> >>
>> >>> >> >> >>> >> Thanks for your help,
>> >>> >> >> >>> >> Florian
>> >>> >> >> >>> >>
>> >>> >> >> >>> >>
>> >>> >> >> >>> >> > Hi Florian,
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling" behavior.
>>  The
>> >>> >> time
>> >>> >> >> >>> between
>> >>> >> >> >>> >> > refetches of a document is based on the history of
>> >>> fetches
>> >>> >> of
>> >>> >> >> that
>> >>> >> >> >>> >> > document.  The recrawl interval is the initial time
>> >>> between
>> >>> >> >> >>> document
>> >>> >> >> >>> >> > fetches, but if a document does not change, the
>> interval
>> >>> for
>> >>> >> >> the
>> >>> >> >> >>> >> document
>> >>> >> >> >>> >> > increases according to a formula.
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >> > I would need to look at the code to be able to give you
>> >>> the
>> >>> >> >> >>> precise
>> >>> >> >> >>> >> > formula, but if you need a limit on the amount of time
>> >>> >> between
>> >>> >> >> >>> >> document
>> >>> >> >> >>> >> > fetch attempts, I suggest you create a ticket and I
>> will
>> >>> >> look
>> >>> >> >> into
>> >>> >> >> >>> >> adding
>> >>> >> >> >>> >> > that as a feature.
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >> > Thanks,
>> >>> >> >> >>> >> > Karl
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>> >>> >> >> >>> >> > [email protected]> wrote:
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >> >> Hello,
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl interval of
>> >>> a
>> >>> >> >> >>> continuous
>> >>> >> >> >>> >> >> crawling job are not quite clear to me. The
>> >>> documentation
>> >>> >> >> tells
>> >>> >> >> >>> that
>> >>> >> >> >>> >> the
>> >>> >> >> >>> >> >> reseed interval is the time after which the seeds are
>> >>> >> checked
>> >>> >> >> >>> again,
>> >>> >> >> >>> >> and
>> >>> >> >> >>> >> >> the recrawl interval is the time after which a
>> document
>> >>> is
>> >>> >> >> >>> checked
>> >>> >> >> >>> >> for
>> >>> >> >> >>> >> >> changes.
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >> However, we observed that the recrawl interval for a
>> >>> >> document
>> >>> >> >> >>> >> increases
>> >>> >> >> >>> >> >> after each check. On the other hand, the reseed
>> >>> interval
>> >>> >> seems
>> >>> >> >> to
>> >>> >> >> >>> be
>> >>> >> >> >>> >> set
>> >>> >> >> >>> >> >> up correctly in the database metadata about the seed
>> >>> >> >> documents.
>> >>> >> >> >>> Yet
>> >>> >> >> >>> >> the
>> >>> >> >> >>> >> >> web server does not receive requests at each time the
>> >>> >> interval
>> >>> >> >> >>> >> elapses
>> >>> >> >> >>> >> >> but
>> >>> >> >> >>> >> >> only after several intervals have elapsed.
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >> We are using a web connector. The web server does not
>> >>> tell
>> >>> >> the
>> >>> >> >> >>> client
>> >>> >> >> >>> >> to
>> >>> >> >> >>> >> >> cache the documents. Any help would be appreciated.
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >> Best regards,
>> >>> >> >> >>> >> >> Florian
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >>
>> >>> >> >> >>> >> >
>> >>> >> >> >>> >>
>> >>> >> >> >>> >>
>> >>> >> >> >>> >>
>> >>> >> >> >>> >
>> >>> >> >> >>>
>> >>> >> >> >>>
>> >>> >> >> >>>
>> >>> >> >> >>
>> >>> >> >> >
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>
>> >
>>
>>
>>
>

Re: Continuous crawling

Reply via email to