[I'm astonished how often my ISP manages to get its mail smarthosts listed in some RBL server...]
Hi, I will ignore your reports concerning crashing DIR - that seems to be FreeBSD specific. Nevertheless, I see what you report concerning rescheduled jobs hanging in a state they could not be in. I upgraded to 1.38.10 yesterday, coming from 1.38.6. The rest of the system is SuSE 8.1, kernel 2.4.21, not enough memory :-), MySQL on a different host. I don't think this matters, though. This morning, when my backup jobs ran, my notebook was not available. This job uses rescheduling in case of an error, and it has a run before script that checks if the machine is available. Usually I start working some time after the scheduled job start, so the notebook is turned on, some minutes later the rescheduled job connects the then-running notebook, and voilà, the backup runs. This morning, I had a very unreliable WLAN connectivity (by now there are about 6 to 10 WLANs overlapping. Crap.) The result was that, at some point in time, the job was also stuck in a state it should not be in: Running Jobs: JobId Level Name Status ====================================================================== 6467 Increme DracheStd.2006-06-23_08.20.00 is waiting on max Storage jobs ==== while, at the same time, 'sta sd=HPDAT' reports: #sta sd=HPDAT Connecting to Storage daemon HPDAT at goblin:9103 goblin-sd Version: 1.38.10 (08 June 2006) i586-pc-linux-gnu suse 8.1 Daemon started 23-Jun-06 07:32, 10 Jobs run since started. Running Jobs: No Jobs running. ==== Jobs waiting to reserve a drive: ==== Terminated Jobs: ... Funny. All other jobs ran normally, but I noticed that the jobs with a higher priority value ran even though this stuck job is still waiting. The last lines from Baculas log are: 23-Jun 10:25 goblin-dir: RunBefore: ping: unknown host drache 23-Jun 10:25 goblin-dir: RunBefore: Host drache is DOWN! 23-Jun 10:25 goblin-dir: DracheStd.2006-06-23_10.20.36 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 10:25 goblin-dir: Rescheduled Job DracheStd.2006-06-23_10.20.36 at 23-Jun-2006 10:25 to re-run in 600 seconds (23-Jun-2006 10:35). 23-Jun 10:25 goblin-dir: Job DracheStd.2006-06-23_10.20.36 waiting 600 seconds for scheduled start time. In other words, this looks as though the job was normally rescheduled. Interestingly, and that might be a clue, the normal job run order was somehow different than what I expected. The job was scheduled to run at 8:20, together with lots of other jobs which usually all run in parallel. In the log, today I see the following entries for the job in question: 23-Jun 08:20 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 08:20 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 08:20 to re-run in 600 seconds (23-Jun-2006 08:30). 23-Jun 08:20 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 08:30 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 08:30 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 08:30 to re-run in 600 seconds (23-Jun-2006 08:40). 23-Jun 08:30 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 08:40 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 08:40 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 08:40 to re-run in 600 seconds (23-Jun-2006 08:50). 23-Jun 08:40 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 08:50 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 08:50 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 08:50 to re-run in 600 seconds (23-Jun-2006 09:00). 23-Jun 08:50 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 09:00 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 09:00 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 09:00 to re-run in 600 seconds (23-Jun-2006 09:10). 23-Jun 09:00 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 09:10 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 09:10 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 09:10 to re-run in 600 seconds (23-Jun-2006 09:20). 23-Jun 09:10 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 09:20 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 09:20 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 09:20 to re-run in 600 seconds (23-Jun-2006 09:30). 23-Jun 09:20 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 09:30 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 09:30 goblin-dir: Rescheduled Job DracheStd.2006-06-23_08.20.00 at 23-Jun-2006 09:30 to re-run in 600 seconds (23-Jun-2006 09:40). 23-Jun 09:30 goblin-dir: Job DracheStd.2006-06-23_08.20.00 waiting 600 seconds for scheduled start time. 23-Jun 09:40 goblin-dir: DracheStd.2006-06-23_08.20.00 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 10:25 goblin-dir: DracheStd.2006-06-23_10.20.36 Fatal error: RunBeforeJob error: ERR=Child exited with code 1 23-Jun 10:25 goblin-dir: Rescheduled Job DracheStd.2006-06-23_10.20.36 at 23-Jun-2006 10:25 to re-run in 600 seconds (23-Jun-2006 10:35). 23-Jun 10:25 goblin-dir: Job DracheStd.2006-06-23_10.20.36 waiting 600 seconds for scheduled start time. You note that between 9:40 and 10:25 no tries and reschedules took place. That is probably due to the lower priority jobs that ran then: The normal jobs all run with priority 10. One job to dump some databases runs with priority 30, and according to the job report, it ran while DracheStd still waited: > Job: GoblinDB.2006-06-23_08.25.06 ... > Scheduled time: 23-Jun-2006 08:25:05 > Start time: 23-Jun-2006 09:42:24 > End time: 23-Jun-2006 09:42:29 The Catalog backup, at priority 100, also ran: > Job: BackupCatalog.2006-06-23_08.34.00 ... > Scheduled time: 23-Jun-2006 08:34:00 > Start time: 23-Jun-2006 10:02:17 > End time: 23-Jun-2006 10:25:04 > Elapsed time: 22 mins 47 secs These jobs should not normally be started while a job with priority 10 is still active, I think. I have good reason to assume that this problem is new with 1.38.10 and did not exist in 1.38.6. My reasoning: I have a job running at priority 999 which shuts down the backup server. The server is started daily in the morning by its built-in timer. During the few months I ran 1.38.6, there were some cases where I was out of the office for some days. My backup jobs are all scheduled daily. When I returned, I never saw a stuck job like today, rather I found the job reports with "Fatal Error" for DracheStd, and Ok for Shutdown. Anyway, back to the problem at hand: I suspect the rescheduling makes some trouble. There must be differences between how rescheduling works in 1.38.10 compared to 1.38.6. Either the DIR-internal job status is screwed up (to something indicating "Waiting for storage"), or the SD loses track of the jobs activities, or something else. Przemysław, can you try to verify this behaviour? I.e., if your DIR does not crash when a scheduled job starts, do you observe the same I do? Like job is rescheduled some times, but after some rescheduling events the status is "waiting for storage" and other jobs, with higher priority values, are started. Finally, after I cancel the stuck job, it is marked as cancelled in the sta dir output, but does not (immediately) quit. For example, no job report is generated, no log entries appended, the jobs remains in the DIR status report as "running", and the number of threads / processes of bacula-dir does not decrease. Even after stopping and restarting the SD this job does not vanish from the "running" list of sta dir. Kern, in case you read this - I will file a bug report shortly. Arno On 6/23/2006 2:11 PM, Przemysław Staniszewski wrote: > Przemysław Staniszewski wrote: > > Hello > > I put everything one again on the server. I have this strange situation. > > When run jobs that could not be done - because host is offline, job (in > logs) is rescheduled in 3600 seconds and after that for a while I can > see this in Running Jobs: with waiting for its start time (correct). But > if I look later there is No Jobs running. Strange. There are no more logs. > > Now after that I run by hand the same last job where host is down and > have xxx is waiting on max Storage jobs > > And second situation: when run 2 jobs and they become cancel - next jobs > is waiting on the max storage jobs and just waiting. All jobs running > after them just waiting on max on the storage jobs. > > In my Job is: > > Max Wait Time = 2 minutes > Reschedule Interval = 1 hours > Reschedule On Errors = yes > Reschedule Interval = 1 hours > Rerun Failed Levels = yes > > And its all. Does anybody have something like that (bacula 1.38.9) ? > -- IT-Service Lehmann [EMAIL PROTECTED] Arno Lehmann http://www.its-lehmann.de -- IT-Service Lehmann [EMAIL PROTECTED] Arno Lehmann http://www.its-lehmann.de -- IT-Service Lehmann [EMAIL PROTECTED] Arno Lehmann http://www.its-lehmann.de Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users