[Bacula-users] Rescheduled job stuck

Arno Lehmann Fri, 23 Jun 2006 13:07:45 -0700

[I'm astonished how often my ISP manages to get its mail smarthosts
listed in some RBL server...]

Hi,

I will ignore your reports concerning crashing DIR - that seems to be
FreeBSD specific.

Nevertheless, I see what you report concerning rescheduled jobs hanging
in a state they could not be in.

I upgraded to 1.38.10 yesterday, coming from 1.38.6. The rest of the
system is SuSE 8.1, kernel 2.4.21, not enough memory :-), MySQL on a
different host. I don't think this matters, though.

This morning, when my backup jobs ran, my notebook was not available.
This job uses rescheduling in case of an error, and it has a run before
script that checks if the machine is available.

Usually I start working some time after the scheduled job start, so the
notebook is turned on, some minutes later the rescheduled job connects
the then-running notebook, and voilà, the backup runs.

This morning, I had a very unreliable WLAN connectivity (by now there
are about 6 to 10 WLANs overlapping. Crap.) The result was that, at some
point in time, the job was also stuck in a state it should not be in:

Running Jobs:
  JobId Level   Name                       Status
======================================================================
   6467 Increme  DracheStd.2006-06-23_08.20.00 is waiting on max Storage
jobs
====

while, at the same time, 'sta sd=HPDAT' reports:
#sta sd=HPDAT
Connecting to Storage daemon HPDAT at goblin:9103

goblin-sd Version: 1.38.10 (08 June 2006) i586-pc-linux-gnu suse 8.1
Daemon started 23-Jun-06 07:32, 10 Jobs run since started.

Running Jobs:
No Jobs running.
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
...

Funny.

All other jobs ran normally, but I noticed that the jobs with a higher
priority value ran even though this stuck job is still waiting.

The last lines from Baculas log are:
23-Jun 10:25 goblin-dir: RunBefore: ping: unknown host drache
23-Jun 10:25 goblin-dir: RunBefore: Host drache is DOWN!
23-Jun 10:25 goblin-dir: DracheStd.2006-06-23_10.20.36 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 10:25 goblin-dir: Rescheduled Job DracheStd.2006-06-23_10.20.36
at 23-Jun-2006 10:25 to re-run in 600 seconds (23-Jun-2006 10:35).
23-Jun 10:25 goblin-dir: Job DracheStd.2006-06-23_10.20.36 waiting 600
seconds for scheduled start time.

In other words, this looks as though the job was normally rescheduled.
Interestingly, and that might be a clue, the normal job run order was
somehow different than what I expected.

The job was scheduled to run at 8:20, together with lots of other jobs
which usually all run in parallel.

In the log, today I see the following entries for the job in question:

23-Jun 08:20 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 08:20 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 08:20 to re-run in 600 seconds (23-Jun-2006 08:30).
23-Jun 08:20 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 08:30 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 08:30 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 08:30 to re-run in 600 seconds (23-Jun-2006 08:40).
23-Jun 08:30 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 08:40 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 08:40 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 08:40 to re-run in 600 seconds (23-Jun-2006 08:50).
23-Jun 08:40 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 08:50 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 08:50 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 08:50 to re-run in 600 seconds (23-Jun-2006 09:00).
23-Jun 08:50 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 09:00 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 09:00 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 09:00 to re-run in 600 seconds (23-Jun-2006 09:10).
23-Jun 09:00 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 09:10 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 09:10 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 09:10 to re-run in 600 seconds (23-Jun-2006 09:20).
23-Jun 09:10 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 09:20 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 09:20 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 09:20 to re-run in 600 seconds (23-Jun-2006 09:30).
23-Jun 09:20 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 09:30 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 09:30 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00
at 23-Jun-2006 09:30 to re-run in 600 seconds (23-Jun-2006 09:40).
23-Jun 09:30 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600
seconds for scheduled start time.
23-Jun 09:40 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 10:25 goblin-dir: DracheStd.2006-06-23_10.20.36 Fatal error:
RunBeforeJob error: ERR=Child exited with code 1
23-Jun 10:25 goblin-dir: Rescheduled Job DracheStd.2006-06-23_10.20.36
at 23-Jun-2006 10:25 to re-run in 600 seconds (23-Jun-2006 10:35).
23-Jun 10:25 goblin-dir: Job DracheStd.2006-06-23_10.20.36 waiting 600
seconds for scheduled start time.

You note that between 9:40 and 10:25 no tries and reschedules took
place. That is probably due to the lower priority jobs that ran then:
The normal jobs all run with priority 10.
One job to dump some databases runs with priority 30, and according to
the job report, it ran while DracheStd still waited:

> Job:                    GoblinDB.2006-06-23_08.25.06
...
>   Scheduled time:         23-Jun-2006 08:25:05
>   Start time:             23-Jun-2006 09:42:24
>   End time:               23-Jun-2006 09:42:29

The Catalog backup, at priority 100, also ran:

> Job:                    BackupCatalog.2006-06-23_08.34.00
...
>   Scheduled time:         23-Jun-2006 08:34:00
>   Start time:             23-Jun-2006 10:02:17
>   End time:               23-Jun-2006 10:25:04
>   Elapsed time:           22 mins 47 secs

These jobs should not normally be started while a job with priority 10
is still active, I think.

I have good reason to assume that this problem is new with 1.38.10 and
did not exist in 1.38.6. My reasoning:

I have a job running at priority 999 which shuts down the backup server.
The server is started daily in the morning by its built-in timer.
During the few months I ran 1.38.6, there were some cases where I was
out of the office for some days. My backup jobs are all scheduled daily.
When I returned, I never saw a stuck job like today, rather I found the
job reports with "Fatal Error" for DracheStd, and Ok for Shutdown.

Anyway, back to the problem at hand:
I suspect the rescheduling makes some trouble. There must be differences
between how rescheduling works in 1.38.10 compared to 1.38.6. Either the
DIR-internal job status is screwed up (to something indicating "Waiting
for storage"), or the SD loses track of the jobs activities, or
something else.

Przemysław, can you try to verify this behaviour? I.e., if your DIR does
not crash when a scheduled job starts, do you observe the same I do?
Like job is rescheduled some times, but after some rescheduling events
the status is "waiting for storage" and other jobs, with higher priority
values, are started.

Finally, after I cancel the stuck job, it is marked as cancelled in the
sta dir output, but does not (immediately) quit. For example, no job
report is generated, no log entries appended, the jobs remains in the
DIR status report as "running", and the number of threads / processes of
bacula-dir does not decrease.

Even after stopping and restarting the SD this job does not vanish from
the "running" list of sta dir.

Kern, in case you read this - I will file a bug report shortly.

Arno

On 6/23/2006 2:11 PM, Przemysław Staniszewski wrote:
> Przemysław Staniszewski wrote:
> 
> Hello
> 
>  I put everything one again on the server. I have this strange situation.
> 
> When run jobs that could not be done - because host is offline, job (in
> logs) is rescheduled in 3600 seconds and after that for a while I can
> see this in Running Jobs: with waiting for its start time (correct). But
> if I look later there is No Jobs running. Strange. There are no more logs.
> 
>  Now after that I run by hand the same last job where host is down and
> have xxx is waiting on max Storage jobs
> 
> And second situation: when run 2 jobs and they become cancel - next jobs
> is waiting on the max storage jobs and just waiting. All jobs running
> after them just waiting on max on the storage jobs.
> 
> In my Job is:
> 
> Max Wait Time = 2 minutes
> Reschedule Interval = 1 hours
> Reschedule On Errors = yes
> Reschedule Interval = 1 hours
> Rerun Failed Levels = yes
> 
> And its all. Does anybody have something like that (bacula 1.38.9) ?
> 

-- 
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

-- 
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

-- 
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

[Bacula-users] Rescheduled job stuck

Reply via email to