On Friday 19 March 2010 19:09:01, Kern Sibbald wrote:
> Hello,
>
> I recommend that you submit this as a bug report. Please include your
> bacula-dir.conf and bacula-sd.conf as well as the two files you included
> here.
>
> On the timeout for the alert command. Adding it would require yet another
> Bacula directive to specify the timeout, and that is really a feature
> request (not something we fix as a bug in general). Also, if the alert
> command takes any time or stalls your system, then you have a hardware or
> software problem that should be fixed.
>
> If it is the Alert command that is holding things up, then the simplest
> thing to do is to remove the alert command.

Hi Hugh,

Another solution that you can implement quickly is to call a simple
wrapper script that enforces the timeout and kills the alert command
after x seconds if it hangs.

Bye

> Best regards,
>
> Kern
>
> On Friday 19 March 2010 18:53:47 Hugh Brown wrote:
> > (Sorry, once more with actual attachments.)
> >
> > Kern Sibbald wrote:
> > > At this point, before sending anything, first, please ensure the patch
> > > is applied. If so, 90% probability you will not have any more
> > > problems. If you do, the lock manager will produce a nice dump with
> > > additional information -- if it is not emailed to you, you should find
> > > two files in your working directory that contain the traceback and the
> > > bactrace outputs.
> > >
> > > The lock manager will not prevent lockups, but it will detect deadlock
> > > situations, and then blow up the SD so as to produce a useful dump.
> >
> > I ran into this problem again last night (symptoms: no response after
> > "Used volume status:" when running "status storage" on bconsole; extra
> > bacula-sd process, which strace shows is running futex() over and over
> > again), and managed to get the traceback and the lock dump.
> >
> > Unfortunately, the deadlock detection did not seem to work; I left
> > things hung for about 20 minutes or so before running "kill -6" on the
> > parent SD process. (That still left the child, so I had to "kill -9"
> > that one.) However, I'm hoping that the info is still useful; if so,
> > let me know and I'll file a bug.
> >
> > And now for some uninformed speculation:
> >
> > Looking at the backtrace and the lock dump, it seems that one thread
> > (0x4519d940) held the two locks that were being waited for by other
> > threads. In turn, that lock-holding thread had finished a job (jcr
> > 0x12a92d58), ran release_device (dcr 0x12ac15b8), and was running the
> > alert command (which is set to "sh -c '/usr/sbin/smartctl -H -l error
> > -q errorsonly -d scsi %c'") and waiting for output. I'm assuming that
> > the thread was still waiting for this when I killed it.
> >
> > Looking at the code for release_device(), it seems that the alert
> > command is called without the optional watchdog timer:
> >
> >     alert = get_pool_memory(PM_FNAME);
> >     alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
> >     bpipe = open_bpipe(alert, 0, "r");
> >
> > (Line 529-531 in acquire.c, version 5.0.1)
> >
> > Obviously, if something's wrong w/the alert command or my hardware,
> > that's bad. But would it be a good thing to call the alert command
> > with, say, a 60-second watchdog timer to avoid this kind of problem?
> > If there are other issues at work that make this just a workaround,
> > wouldn't it still be good to be alerted that there's a problem? (Here
> > I'm assuming that either the lock manager could do this (and
> > kill/segfault the process, producing a backtrace), or that the timeout
> > could be caught w/o greater harm and turned into a log message.)
> >
> > Natch, I'm not a programmer (let alone a Bacula dev), nor do I play
> > one on TV, but I'm very curious about what's going on under the hood.
> > If I've mistaken something or missed the point entirely, I'd be
> > grateful if someone could point it out.
> >
> > Thanks again for your time!
> >
> > --
> > Hugh Brown, Systems Manager
> > The Centre for High-Throughput Biology
> > [email protected]
>
> ------------------------------------------------------------------------------
> Download Intel® Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> _______________________________________________
> Bacula-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/bacula-devel
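
P.S. For what it's worth, the wrapper-script idea from the reply above could
look roughly like the sketch below. It is only an illustration, not anything
shipped with Bacula: the script name, the function name run_with_timeout, and
the exit code 124 (borrowed from the coreutils "timeout" convention) are all
assumptions. It runs the command given on its command line, returns its output,
and kills it if it has not finished within the given number of seconds.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run a command and kill it if it exceeds a time limit."""
import subprocess
import sys

# Assumption: reuse the exit code that coreutils `timeout` reports on expiry.
TIMEOUT_EXIT_CODE = 124


def run_with_timeout(argv, timeout_s):
    """Run argv without a shell; return (exit_code, combined_output).

    If the command is still running after timeout_s seconds, it is killed
    and a short message is returned in place of its output.
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        msg = "command timed out after %s s: %s\n" % (timeout_s, " ".join(argv))
        return TIMEOUT_EXIT_CODE, msg
    return done.returncode, done.stdout.decode(errors="replace")


# Command-line use: <script> SECONDS COMMAND [ARGS...]
# (does nothing when invoked without those arguments)
if __name__ == "__main__" and len(sys.argv) >= 3:
    code, output = run_with_timeout(sys.argv[2:], float(sys.argv[1]))
    sys.stdout.write(output)
    sys.exit(code)
```

Assuming the script were saved as /usr/local/bin/alert-wrapper.py (a made-up
path), the alert command quoted above would become something like
"sh -c '/usr/local/bin/alert-wrapper.py 60 /usr/sbin/smartctl -H -l error -q
errorsonly -d scsi %c'". Note that only the direct child is killed, so the
command is best passed as a plain argument list rather than through another
shell.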
