Hello,

I recommend that you submit this as a bug report.  Please include your 
bacula-dir.conf and bacula-sd.conf as well as the two files you included 
here.

Regarding a timeout for the alert command: adding one would require yet another 
Bacula directive to specify the timeout, and that is really a feature request 
(not something we generally fix as a bug).  Also, if the alert command takes 
any significant time or stalls your system, then you have a hardware or 
software problem that should be fixed.

If it is the alert command that is holding things up, then the simplest thing 
to do is to remove it from your configuration.

Best regards,

Kern

On Friday 19 March 2010 18:53:47 Hugh Brown wrote:
> (Sorry, once more with actual attachments.)
>
> Kern Sibbald wrote:
> > At this point, before sending anything, first, please ensure the patch is
> > applied.  If so, 90% probability you will not have any more problems.  If
> > you do, the lock manager will produce a nice dump with additional
> > information -- if it is not emailed to you, you should find two files in
> > your working directory that contain the traceback and the bactrace
> > outputs.
> >
> > The lock manager will not prevent lockups, but it will detect deadlock
> > situations, and then blow up the SD so as to produce a useful dump.
>
> I ran into this problem again last night (symptoms: no response after
> "Used volume status:" when running "status storage" on bconsole; extra
> bacula-sd process, which strace shows is running futex() over and over
> again), and managed to get the traceback and the lock dump.
>
> Unfortunately, the deadlock detection did not seem to work; I left
> things hung for about 20 minutes or so before running "kill -6" on the
> parent SD process.  (That still left the child, so I had to "kill -9"
> that one.)  However, I'm hoping that the info is still useful; if so,
> let me know and I'll file a bug.
>
> And now for some uninformed speculation:
>
> Looking at the backtrace and the lock dump, it seems that one thread
> (0x4519d940) held the two locks that were being waited for by other
> threads.  In turn, that lock-holding thread had finished a job (jcr
> 0x12a92d58), ran release_device (dcr 0x12ac15b8), and was running the
> alert command (which is set to "sh -c '/usr/sbin/smartctl -H -l error
> -q errorsonly -d scsi %c'") and waiting for output.  I'm assuming that
> the thread was still waiting for this when I killed it.
>
> Looking at the code for release_device(), it seems that the alert
> command is called without the optional watchdog timer:
>
>       alert = get_pool_memory(PM_FNAME);
>       alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
>       bpipe = open_bpipe(alert, 0, "r");
>
> (Lines 529-531 in acquire.c, version 5.0.1)
>
> Obviously, if something's wrong w/the alert command or my hardware,
> that's bad.  But would it be a good thing to call the alert command
> with, say, a 60-second watchdog timer to avoid this kind of problem?
> If there are other issues at work that make this just a workaround,
> wouldn't it still be good to be alerted that there's a problem?  (Here
> I'm assuming that either the lock manager could do this (and
> kill/segfault the process, producing a backtrace), or that the timeout
> could be caught w/o greater harm and turned into a log message.)
>
> Natch, I'm not a programmer (let alone a Bacula dev), nor do I play
> one on TV, but I'm very curious about what's going on under the hood.
> If I've mistaken something or missed the point entirely, I'd be
> grateful if someone could point it out.
>
> Thanks again for your time!
>
> --
> Hugh Brown, Systems Manager
> The Centre for High-Throughput Biology
> [email protected]
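
For illustration only, the watchdog idea raised in the quoted message can be
sketched in plain C as follows.  This is not Bacula code: run_with_timeout is
a hypothetical helper, and the 100 ms polling interval and the return codes
are assumptions chosen for the example.  It simply shows one way a caller
could stop waiting on a hung external command instead of blocking forever.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a Bacula function).
 * Runs `cmd` through /bin/sh and waits at most `timeout_sec` seconds.
 * Returns the command's exit status (0-255), -1 if the command was
 * killed because the timeout expired, or -2 on any other failure.
 */
int run_with_timeout(const char *cmd, int timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -2;
    }
    if (pid == 0) {
        /* Child: use its own process group so helpers can be killed too. */
        setpgid(0, 0);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                      /* only reached if exec failed */
    }

    /* Parent: also set the group, in case the child has not done so yet. */
    setpgid(pid, pid);

    time_t deadline = time(NULL) + timeout_sec;
    int status;
    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            return WIFEXITED(status) ? WEXITSTATUS(status) : -2;
        }
        if (r < 0) {
            return -2;
        }
        if (time(NULL) >= deadline) {
            break;                       /* watchdog expired */
        }
        struct timespec ts = {0, 100 * 1000 * 1000};   /* sleep 100 ms */
        nanosleep(&ts, NULL);
    }

    /* Timed out: kill the child's process group, then reap the child. */
    kill(-pid, SIGKILL);
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}
```

With a helper like this, a caller could pass a 60-second limit and turn a -1
result into a log message rather than waiting indefinitely while holding
locks.  Whether that is appropriate inside release_device() is a separate
design question that this sketch does not address.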



_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
