Hi Florian,

Please try running the job manually, either outside the scheduling window or
with scheduling turned off.  What reason is reported for the job abort?

Karl



On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <
[email protected]> wrote:

> Hi Karl,
>
> yes, I've coincidentally seen "Aborted" in the end time column when I
> refreshed the job status just after the number of active documents was
> zero. At the next refresh the job was starting up. After looking in the
> history I found out that it even started a third time. You can see the
> history of a single day below (job continue, end, start, stop, unwait,
> wait). The start method is "Start at beginning of schedule window". Job
> invocation is "complete". Hop count mode is "Delete unreachable
> documents".
>
> 02.03.2014 18:41        job end
> 02.03.2014 18:28        job start
> 02.03.2014 18:14        job start
> 02.03.2014 18:00        job start
> 02.03.2014 17:49        job end
> 02.03.2014 17:27        job end
> 02.03.2014 17:13        job start
> 02.03.2014 17:00        job start
> 02.03.2014 16:13        job end
> 02.03.2014 16:00        job start
> 02.03.2014 15:41        job end
> 02.03.2014 15:27        job start
> 02.03.2014 15:14        job start
> 02.03.2014 15:00        job start
> 02.03.2014 14:13        job end
> 02.03.2014 14:00        job start
> 02.03.2014 13:13        job end
> 02.03.2014 13:00        job start
> 02.03.2014 12:27        job end
> 02.03.2014 12:14        job start
> 02.03.2014 12:00        job start
> 02.03.2014 11:13        job end
> 02.03.2014 11:00        job start
> 02.03.2014 10:13        job end
> 02.03.2014 10:00        job start
> 02.03.2014 09:29        job end
> 02.03.2014 09:14        job start
> 02.03.2014 09:00        job start
>
> Best,
> Florian
>
>
> > Hi Florian,
> >
> > Jobs don't just abort randomly.  Are you sure that the job aborted?  Or
> > did it just restart?
> >
> > As for "is this normal", it depends on how you have created your job.
> > If you selected the "Start within schedule window" option, MCF will
> > restart the job whenever it finishes and run it until the end of the
> > scheduling window.
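To illustrate the restart behavior described above, here is a simplified model (an illustration only, not ManifoldCF's actual scheduler code; the window length and run duration are hypothetical):

```python
# Simplified model of the "Start within schedule window" behavior: while the
# current time is still inside the window, a finished job is started again.
# This is an illustration only, not ManifoldCF's scheduler code.

def runs_in_window(window_start, window_end, run_minutes):
    """Return start times (minutes) of successive runs in one schedule window."""
    starts = []
    t = window_start
    while t < window_end:        # still inside the window: start the job (again)
        starts.append(t)
        t += run_minutes         # the job finishes and is immediately restarted
    return starts

# A 60-minute window with a ~13-minute crawl produces several starts per
# window, similar to the repeated "job start" entries in the history above.
print(runs_in_window(0, 60, 13))    # -> [0, 13, 26, 39, 52]
```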
> >
> > Karl
> >
> >
> >
> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <
> > [email protected]> wrote:
> >
> >> Hi Karl,
> >>
> >> I've just observed that the job was started according to its schedule
> >> and crawled all documents correctly (I had chosen to re-ingest all
> >> documents before the run). However, after finishing the last document
> >> (zero active documents) it was somehow aborted and restarted
> >> immediately. Is this expected behavior?
> >>
> >> Best,
> >> Florian
> >>
> >>
> >> > Hi Florian,
> >> >
> >> > Based on this schedule, your crawls will be able to start whenever
> >> > the hour turns.  So they can start every hour on the hour.  If the
> >> > last crawl crossed an hour boundary, the next crawl will start
> >> > immediately, I believe.
> >> >
> >> > Karl
> >> >
> >> >
> >> >
> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
> >> > [email protected]> wrote:
> >> >
> >> >> Hi Karl,
> >> >>
> >> >> these are the values:
> >> >> Priority: 5
> >> >> Start method: Start at beginning of schedule window
> >> >> Schedule type: Scan every document once
> >> >> Minimum recrawl interval: Not applicable
> >> >> Expiration interval: Not applicable
> >> >> Reseed interval: Not applicable
> >> >> Scheduled time: Any day of week at 12 am 1 am 2 am 3 am 4 am 5 am
> >> >> 6 am 7 am 8 am 9 am 10 am 11 am 12 pm 1 pm 2 pm 3 pm 4 pm 5 pm
> >> >> 6 pm 7 pm 8 pm 9 pm 10 pm 11 pm
> >> >> Maximum run time: No limit
> >> >> Job invocation: Complete
> >> >>
> >> >> Maybe it is because I've changed the job from continuous crawling to
> >> >> this schedule. I started it a few times manually, too. I didn't
> >> >> notice anything strange in the job setup or in the respective entries
> >> >> in the database.
> >> >>
> >> >> Regards,
> >> >> Florian
> >> >>
> >> >> > Hi Florian,
> >> >> >
> >> >> > I was unable to reproduce the behavior you described.
> >> >> >
> >> >> > Could you view your job, and post a screen shot of that page?  I
> >> >> > want to see what your schedule record(s) look like.
> >> >> >
> >> >> > Thanks,
> >> >> > Karl
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]>
> >> >> wrote:
> >> >> >
> >> >> >> Hi Florian,
> >> >> >>
> >> >> >> I've never noticed this behavior before.  I'll see if I can
> >> >> >> reproduce it here.
> >> >> >>
> >> >> >> Karl
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
> >> >> >> [email protected]> wrote:
> >> >> >>
> >> >> >>> Hi Karl,
> >> >> >>>
> >> >> >>> the scheduled job seems to work as expected. However, it runs
> >> >> >>> twice: it starts at the beginning of the scheduled time, finishes,
> >> >> >>> and immediately starts again. After finishing the second run it
> >> >> >>> waits for the next scheduled time. Why does it run twice? The
> >> >> >>> start method is "Start at beginning of schedule window".
> >> >> >>>
> >> >> >>> Yes, you're right about the checking guarantee. Currently, our
> >> >> >>> interval is long enough for a complete crawler run.
> >> >> >>>
> >> >> >>> Best,
> >> >> >>> Florian
> >> >> >>>
> >> >> >>>
> >> >> >>> > Hi Florian,
> >> >> >>> >
> >> >> >>> > It is impossible to *guarantee* that a document will be
> >> >> >>> > checked, because if load on the crawler is high enough, it will
> >> >> >>> > fall behind.  But I will look into adding the feature you
> >> >> >>> > request.
> >> >> >>> >
> >> >> >>> > Karl
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
> >> >> >>> > [email protected]> wrote:
> >> >> >>> >
> >> >> >>> >> Hi Karl,
> >> >> >>> >>
> >> >> >>> >> yes, in our case it is necessary to make sure that new
> >> >> >>> >> documents are discovered and indexed within a certain interval.
> >> >> >>> >> I have created a feature request for that. In the meantime we
> >> >> >>> >> will try to use a scheduled job instead.
> >> >> >>> >>
> >> >> >>> >> Thanks for your help,
> >> >> >>> >> Florian
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> > Hi Florian,
> >> >> >>> >> >
> >> >> >>> >> > What you are seeing is "dynamic crawling" behavior.  The
> >> >> >>> >> > time between refetches of a document is based on the history
> >> >> >>> >> > of fetches of that document.  The recrawl interval is the
> >> >> >>> >> > initial time between document fetches, but if a document does
> >> >> >>> >> > not change, the interval for the document increases according
> >> >> >>> >> > to a formula.
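The general shape of such a dynamic recrawl schedule is a multiplicative backoff on unchanged documents. A rough sketch of that idea (hypothetical base, factor, and cap values; not the actual ManifoldCF formula):

```python
# Illustrative dynamic-recrawl backoff -- NOT ManifoldCF's actual formula.
# An unchanged document's next-fetch interval grows multiplicatively, up to
# a cap; a detected change resets it to the base interval.

def next_interval(current, changed, base=60.0, factor=3.0, cap=86400.0):
    """Return the next recrawl interval in seconds (hypothetical parameters)."""
    if changed:
        return base                       # document changed: reset to base
    return min(current * factor, cap)     # unchanged: back off, bounded by cap

# Successive intervals for a document that never changes:
interval, history = 60.0, []
for _ in range(5):
    interval = next_interval(interval, changed=False)
    history.append(interval)
print(history)    # -> [180.0, 540.0, 1620.0, 4860.0, 14580.0]
```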
> >> >> >>> >> >
> >> >> >>> >> > I would need to look at the code to be able to give you the
> >> >> >>> >> > precise formula, but if you need a limit on the amount of
> >> >> >>> >> > time between document fetch attempts, I suggest you create a
> >> >> >>> >> > ticket and I will look into adding that as a feature.
> >> >> >>> >> >
> >> >> >>> >> > Thanks,
> >> >> >>> >> > Karl
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
> >> >> >>> >> > [email protected]> wrote:
> >> >> >>> >> >
> >> >> >>> >> >> Hello,
> >> >> >>> >> >>
> >> >> >>> >> >> the parameters reseed interval and recrawl interval of a
> >> >> >>> >> >> continuous crawling job are not quite clear to me. The
> >> >> >>> >> >> documentation says that the reseed interval is the time
> >> >> >>> >> >> after which the seeds are checked again, and the recrawl
> >> >> >>> >> >> interval is the time after which a document is checked for
> >> >> >>> >> >> changes.
> >> >> >>> >> >>
> >> >> >>> >> >> However, we observed that the recrawl interval for a
> >> >> >>> >> >> document increases after each check. On the other hand, the
> >> >> >>> >> >> reseed interval seems to be set up correctly in the database
> >> >> >>> >> >> metadata about the seed documents. Yet the web server does
> >> >> >>> >> >> not receive requests each time the interval elapses, but
> >> >> >>> >> >> only after several intervals have elapsed.
> >> >> >>> >> >>
> >> >> >>> >> >> We are using a web connector. The web server does not tell
> >> >> >>> >> >> the client to cache the documents. Any help would be
> >> >> >>> >> >> appreciated.
> >> >> >>> >> >>
> >> >> >>> >> >> Best regards,
> >> >> >>> >> >> Florian
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >
>
>
>
