Hi,

On Wednesday 17 March 2010 00:17:59, Hugh Brown wrote:
> This is a complicated problem; apologies in advance if there's any
> missing information.
>
> I'm running Bacula 5.0.1 on CentOS 5.4, x86_64. I came back from a
> week's vacation today to discover that the storage daemon had become
> hung one day after my vacation started. :-(
>
> Three jobs were running, and everything else was stacked up behind
> that, waiting on Max Storage Jobs (currently set to 3). There were
> two bacula-sd processes listed: one old (dating from the last reboot
> of the machine) and one young (dated from the time the hung jobs had
> run).
>
> I installed the debuginfo RPM, and ran btraceback (output attached)
> with the PID of the younger job as an argument. After that, I tried
> "kill -3 [younger PID]", and when that didn't work ran "kill -9
> [younger PID]". At that point, surprisingly, the three hung jobs
> finished, and the jobs that had been waiting to run began to run. From
> the reports (attached), it looked like the director had tried to kill
> the jobs since they'd run far too long, but this evidently did not
> succeed:
>
> Error: Watchdog sending kill after 518427 secs to thread stalled reading
> Storage daemon.
>
> After doing some searching, I came across bug #1527
> (http://bugs.bacula.org/view.php?id=1527), which looks similar to
> problem in one respect: the output of "status storage" in bconsole
> just hung when it got to "Used volume status". (I'm afraid I did not
> keep a copy of the output.) However, the tracebacks from that bug
> look different from mine, so I'm not sure that it's the same.
Yes, but your backtrace looks very strange, so I'm not sure that we can trust it.

> As I mentioned, I came across this bug a week after it occurred
> (sigh), so my ability to get more info is limited. I will be running
> backups again tonight and will be watching closely; I've added
> monitoring for big stacks of long-running jobs, which should hopefully
> catch this if it happens again.
>
> My questions are:
>
> -- Is this backtrace worth submitting as a bug report?
>
> -- Does this look like the same problem reported in #1527? If so,
> should I recompile bacula with the lockmgr option as shown in the bug
> report?

You can also apply the attached patch.

> -- The director tried to kill the long-standing job but failed. Is
> this just another symptom of a deadlock in bacula-sd, or is there
> something else going on?

Without a good backtrace on all components, it's hard to say. Once you have applied the patch, turn on the lock manager; you can then submit a bug report if you find a deadlock.

Bye

> Thanks in advance for any advice you can give, and please let me know
> if you need any further info.
>
> --
> Hugh Brown, Systems Manager
> The Centre for High-Throughput Biology
> [email protected]
