[Bacula-devel] Problem with SD hang in 5.0.1

Hugh Brown Tue, 16 Mar 2010 16:21:17 -0700

This is a complicated problem; apologies in advance if there's any
missing information.


I'm running Bacula 5.0.1 on CentOS 5.4, x86_64.  I came back from a
week's vacation today to discover that the storage daemon had become
hung one day after my vacation started. :-(

Three jobs were running, and everything else was stacked up behind
that, waiting on Max Storage Jobs (currently set to 3).  There were
two bacula-sd processes listed: one old (dating from the last reboot
of the machine) and one young (dated from the time the hung jobs had
run).

I installed the debuginfo RPM, and ran btraceback (output attached)
with the PID of the younger job as an argument.  After that, I tried
"kill -3 [younger PID]", and when that didn't work ran "kill -9
[younger PID]".  At that point, surprisingly, the three hung jobs
finished, and the jobs that had been waiting to run began to run. From
the reports (attached), it looked like the director had tried to kill
the jobs since they'd run far too long, but this evidently did not
succeed:

        Error: Watchdog sending kill after 518427 secs to thread stalled 
reading Storage daemon.

After doing some searching, I came across bug #1527
(http://bugs.bacula.org/view.php?id=1527), which looks similar to
problem in one respect: the output of "status storage" in bconsole
just hung when it got to "Used volume status".  (I'm afraid I did not
keep a copy of the output.)  However, the tracebacks from that bug
look different from mine, so I'm not sure that it's the same.

As I mentioned, I came across this bug a week after it occurred
(sigh), so my ability to get more info is limited.  I will be running
backups again tonight and will be watching closely; I've added
monitoring for big stacks of long-running jobs, which should hopefully
catch this if it happens again.

My questions are:

-- Is this backtrace worth submitting as a bug report?

-- Does this look like the same problem reported in #1527?  If so,
should I recompile bacula with the lockmgr option as shown in the bug
report?

-- The director tried to kill the long-standing job but failed.  Is
this just another symptom of a deadlock in bacula-sd, or is there
something else going on?

Thanks in advance for any advice you can give, and please let me know
if you need any further info.

--
Hugh Brown, Systems Manager
The Centre for High-Throughput Biology
[email protected]

[Thread debugging using libthread_db enabled]
[New Thread 0x45816940 (LWP 22908)]
0x00000035aa4df9ee in __lll_lock_wait_private () from /lib64/libc.so.6
$1 = "cbs-01-sd", '\0' <repeats 20 times>
$2 = 0x1d69e528 "bacula-sd"
$3 = 0x1d69e568 "/usr/sbin/"
$4 = 0x0
$5 = 0x34f4841d1e "5.0.1 (24 February 2010)"
$6 = 0x34f4841d37 "x86_64-redhat-linux-gnu"
$7 = 0x34f4841d4f "redhat"
$8 = 0x34f4841a15 ""
$9 = "agnatha.chibi.ubc.ca", '\0' <repeats 29 times>
#0  0x00000035aa4df9ee in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00000035aa4d06a5 in _L_lock_1206 () from /lib64/libc.so.6
#2  0x00000035aa4d05be in closelog () from /lib64/libc.so.6
#3  0x00000034f4818bf5 in open_bpipe (prog=<value optimized out>, wait=<value 
optimized out>, mode=<value optimized out>) at bpipe.c:138
#4  0x000000000040ceb5 in release_device (dcr=0x1d98f8d8) at acquire.c:531
#5  0x000000000040f31b in do_append_data (jcr=0x1d6c0b38) at append.c:329
#6  0x000000000042263f in append_data_cmd (jcr=0x1d6c0b38) at fd_cmds.c:203
#7  0x0000000000421df3 in do_fd_commands (jcr=0x1d6c0b38) at fd_cmds.c:162
#8  0x0000000000422559 in run_job (jcr=0x1d6c0b38) at fd_cmds.c:124
#9  0x0000000000422bb4 in run_cmd (jcr=0x1d6c0b38) at job.c:225
#10 0x000000000041f4b8 in handle_connection_request (arg=0x1d798148) at 
dircmd.c:233
#11 0x00000034f4839fee in workq_server (arg=<value optimized out>) at 
workq.c:346
#12 0x00000035ab006617 in start_thread () from /lib64/libpthread.so.0
#13 0x00000035aa4d3c2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x45816940 (LWP 22908)):
#0  0x00000035aa4df9ee in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00000035aa4d06a5 in _L_lock_1206 () from /lib64/libc.so.6
#2  0x00000035aa4d05be in closelog () from /lib64/libc.so.6
#3  0x00000034f4818bf5 in open_bpipe (prog=<value optimized out>, wait=<value 
optimized out>, mode=<value optimized out>) at bpipe.c:138
#4  0x000000000040ceb5 in release_device (dcr=0x1d98f8d8) at acquire.c:531
#5  0x000000000040f31b in do_append_data (jcr=0x1d6c0b38) at append.c:329
#6  0x000000000042263f in append_data_cmd (jcr=0x1d6c0b38) at fd_cmds.c:203
#7  0x0000000000421df3 in do_fd_commands (jcr=0x1d6c0b38) at fd_cmds.c:162
#8  0x0000000000422559 in run_job (jcr=0x1d6c0b38) at fd_cmds.c:124
#9  0x0000000000422bb4 in run_cmd (jcr=0x1d6c0b38) at job.c:225
#10 0x000000000041f4b8 in handle_connection_request (arg=0x1d798148) at 
dircmd.c:233
#11 0x00000034f4839fee in workq_server (arg=<value optimized out>) at 
workq.c:346
#12 0x00000035ab006617 in start_thread () from /lib64/libpthread.so.0
#13 0x00000035aa4d3c2d in clone () from /lib64/libc.so.6
#0  0x00000035aa4df9ee in __lll_lock_wait_private () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000035aa4d06a5 in _L_lock_1206 () from /lib64/libc.so.6
No symbol table info available.
#2  0x00000035aa4d05be in closelog () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000034f4818bf5 in open_bpipe (prog=<value optimized out>, wait=<value 
optimized out>, mode=<value optimized out>) at bpipe.c:138
138           closelog();                     /* close syslog if open */
Current language:  auto; currently c++
bargv = Could not find the frame base for "open_bpipe(char*, int, char const*)".
bargc = <value optimized out>
i = <value optimized out>
readp = Could not find the frame base for "open_bpipe(char*, int, char const*)".
writep = Could not find the frame base for "open_bpipe(char*, int, char 
const*)".
tprog = Could not find the frame base for "open_bpipe(char*, int, char const*)".
mode_read = <value optimized out>
bpipe = <value optimized out>
save_errno = <value optimized out>
#4  0x000000000040ceb5 in release_device (dcr=0x1d98f8d8) at acquire.c:531
531           bpipe = open_bpipe(alert, 0, "r");
alert = (POOLMEM *) 0x35aa753c90 "\002"
status = <value optimized out>
bpipe = <value optimized out>
line = 
"h\201y\035\000\000\000\000?\000\000\000\000\000\000\000??d\000\000\000\000\000??w\035\000\000\000\000??\230\035\000\000\000\000?\177\201?4\000\000\000??q\035\000\000\000\000p#r?5\000\000\000c\000\001\000\000\000\000\000\020\fl\035\000\000\000\000i?\203?\000\000\000\000c\000\001\000\000\000\000\000h\201y\035\000\000\000\000d\000\001\000\000\000\000\000??d\000\000\000\000\000??w\035\000\000\000\000??\230\035\000\000\000\000j\202\201?4\000\000\00022201\000\000\...@y\201e\000\000\000\000`x\201e\000\000\000\000??m\035\000\000\000\000\230?m\035\000\000\000\000\000??\000\000?\003\235?\003\000\000\000\000\000\000"...
jcr = (JCR *) 0x1d6c0b38
dev = (DEVICE *) 0x1d6beb78
tbuf = 
"?\000\000\000?\000\000\000\220?\230\035\000\000\000\000x?\230\035\000\000\000\0009\215\202?4",
 '\0' <repeats 11 times>, 
"\215aD\000\000\000\000\000\215aD\000\000\000\000\000??\203?4\000\000\000??\230\035",
 '\0' <repeats 20 times>, "??\230\035\000\000\000\000??\230\035"
#5  0x000000000040f31b in do_append_data (jcr=0x1d6c0b38) at append.c:329
329        release_device(dcr);
n = <value optimized out>
file_index = 199
stream = 1
last_file_index = 199
fd = <value optimized out>
ok = true
rec = {link = {next = 0x0, prev = 0x0}, File = 0, Block = 0, VolSessionId = 
474, VolSessionTime = 1267395547, FileIndex = 199, Stream = 1, data_len = 84, 
remainder = 0, state = 0, bsr = 0x0, ser_buf = '\0' <repeats 11 times>, 
  data = 0x1d88a3c0 "", match_stat = 0, last_VolSessionId = 0, 
last_VolSessionTime = 0, last_FileIndex = 0}
buf1 = 
"199\000\000\000\000\000\000??\000\000?\003\235H\201y\035\000\000\000\000d\000\001\000\000\000\000\000?\001\000\000\000\000\000\000\000??\000\000?\003\235??\210\035\000\000\000\000????",
 '\0' <repeats 12 times>, 
"\020\000\000\000\000\000\000\000??\210\035\000\000\000\000????\000\000\000\000\000\\\201E"
buf2 = 
"??\210\035\000\000\000\000\000??\000\000?\003\235c\000\001\000\000\000\000\000\001\000\000\000\000\000\000\000H\201y\035\000\000\000\000c\000\001\000\000\000\000\000\230tm\035\000\000\000\000d\000\001\000\000\000\000\000
 
?d\000\000\000\000\0008\vl\035\000\000\000\000\230tm\035\000\000\000\000j\202\201?4\000\000\000\030\000\000"
dcr = (DCR *) 0x1d98f8d8
dev = (DEVICE *) 0x1d6beb78
ec = "81.53 M\000\230tm\035", '\0' <repeats 12 times>, 
"??\210\035\000\000\000\000\004\000\000\000\000\000\000\000M}\201?4\000\000\0008\v"
job_elapsed = <value optimized out>
#6  0x000000000042263f in append_data_cmd (jcr=0x1d6c0b38) at fd_cmds.c:203
203           if (do_append_data(jcr)) {
fd = (BSOCK *) 0x1d6d7498
#7  0x0000000000421df3 in do_fd_commands (jcr=0x1d6c0b38) at fd_cmds.c:162
162                 if (!fd_cmds[i].func(jcr)) {    /* do command */
stat = <value optimized out>
i = 1
fd = (BSOCK *) 0x1d6d7498

06-Mar 21:05 cbs-01-dir JobId 10222: Start Backup JobId 10222, 
Job=example-03-var.2010-03-06_21.05.01_14
06-Mar 21:05 cbs-01-dir JobId 10222: Using Device "Drive-0"
06-Mar 21:05 cbs-01-sd JobId 10222: Spooling data ...
example-03-fd JobId 10222:      /var/lib/nfs/rpc_pipefs is a different 
filesystem. Will not descend from /var into /var/lib/nfs/rpc_pipefs
06-Mar 21:05 cbs-01-sd JobId 10222: Job write elapsed time = 00:00:01, Transfer 
rate = 81.53 M Bytes/second
06-Mar 21:05 cbs-01-sd JobId 10222: Committing spooled data to Volume "000004". 
Despooling 81,620,978 bytes ...
06-Mar 21:05 cbs-01-sd JobId 10222: Despooling elapsed time = 00:00:01, 
Transfer rate = 81.62 M Bytes/second
16-Mar 12:01 cbs-01-sd JobId 10222: Alert: cbs-01-sd: signal.c:141-10222 sig=3 
Quit
16-Mar 12:01 cbs-01-sd JobId 10222: Alert: Bacula interrupted by signal 3: Quit
16-Mar 12:01 cbs-01-sd JobId 10222: 3997 Bad alert command: sh -c 
'/usr/sbin/tapeinfo -f /dev/changer |grep TapeAlert|cat': ERR=Child died from 
signal 9: Kill, unblockable.
16-Mar 12:01 cbs-01-sd JobId 10222: Sending spooled attrs to the Director. 
Despooling 53,665 bytes ...
12-Mar 21:05 cbs-01-dir JobId 10222: Error: Watchdog sending kill after 518411 
secs to thread stalled reading Storage daemon.
16-Mar 12:01 cbs-01-dir JobId 10222: Bacula cbs-01-dir 5.0.1 (24Feb10): 
16-Mar-2010 12:01:46
  Build OS:               x86_64-redhat-linux-gnu redhat
  JobId:                  10222
  Job:                    example-03-var.2010-03-06_21.05.01_14
  Backup Level:           Incremental, since=2010-03-05 21:07:20
  Client:                 "example-03-fd" 3.0.1 (30Apr09) 
x86_64-redhat-linux-gnu,redhat,
  FileSet:                "var" 2009-07-09 12:17:34
  Pool:                   "Daily" (From Run IncPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "tape" (From Job resource)
  Scheduled time:         06-Mar-2010 21:05:01
  Start time:             06-Mar-2010 21:05:39
  End time:               16-Mar-2010 12:01:46
  Elapsed time:           9 days 13 hours 56 mins 7 secs
  Priority:               10
  FD Files Written:       199
  SD Files Written:       199
  FD Bytes Written:       81,515,540 (81.51 MB)
  SD Bytes Written:       81,539,880 (81.53 MB)
  Rate:                   0.1 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         000004
  Volume Session Id:      474
  Volume Session Time:    1267395547
  Last Volume Bytes:      331,962,624,000 (331.9 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK -- with warnings

16-Mar 12:01 cbs-01-dir JobId 10222: Begin pruning Jobs older than 1 month 26 
days .
16-Mar 12:01 cbs-01-dir JobId 10222: No Jobs found to prune.
16-Mar 12:01 cbs-01-dir JobId 10222: Begin pruning Jobs.
16-Mar 12:01 cbs-01-dir JobId 10222: Pruned Files from 18 Jobs for client 
example-03-fd from catalog.
16-Mar 12:01 cbs-01-dir JobId 10222: End auto prune.

06-Mar 21:05 cbs-01-dir JobId 10218: Start Backup JobId 10218, 
Job=example-01-opt.2010-03-06_21.05.01_10
06-Mar 21:05 cbs-01-dir JobId 10218: Using Device "Drive-0"
06-Mar 21:05 cbs-01-sd JobId 10218: Spooling data ...
example-01-fd JobId 10218:      /var/lib/nfs/rpc_pipefs is a different 
filesystem. Will not descend from /var into /var/lib/nfs/rpc_pipefs
06-Mar 21:05 cbs-01-sd JobId 10218: Job write elapsed time = 00:00:08, Transfer 
rate = 11.21 M Bytes/second
06-Mar 21:05 cbs-01-sd JobId 10218: Committing spooled data to Volume "000004". 
Despooling 89,821,921 bytes ...
16-Mar 12:01 cbs-01-sd JobId 10218: Despooling elapsed time = 00:00:01, 
Transfer rate = 89.82 M Bytes/second
16-Mar 12:01 cbs-01-sd JobId 10218: Sending spooled attrs to the Director. 
Despooling 57,214 bytes ...
12-Mar 21:05 cbs-01-dir JobId 10218: Error: Watchdog sending kill after 518411 
secs to thread stalled reading Storage daemon.
16-Mar 12:01 cbs-01-dir JobId 10218: Bacula cbs-01-dir 5.0.1 (24Feb10): 
16-Mar-2010 12:01:47
  Build OS:               x86_64-redhat-linux-gnu redhat
  JobId:                  10218
  Job:                    example-01-opt.2010-03-06_21.05.01_10
  Backup Level:           Incremental, since=2010-03-05 21:07:04
  Client:                 "example-01-fd" 3.0.1 (30Apr09) 
x86_64-redhat-linux-gnu,redhat,
  FileSet:                "opt" 2010-01-06 21:05:02
  Pool:                   "Daily" (From Run IncPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "tape" (From Job resource)
  Scheduled time:         06-Mar-2010 21:05:01
  Start time:             06-Mar-2010 21:05:33
  End time:               16-Mar-2010 12:01:47
  Elapsed time:           9 days 13 hours 56 mins 14 secs
  Priority:               10
  FD Files Written:       216
  SD Files Written:       216
  FD Bytes Written:       89,706,629 (89.70 MB)
  SD Bytes Written:       89,732,903 (89.73 MB)
  Rate:                   0.1 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         000004
  Volume Session Id:      472
  Volume Session Time:    1267395547
  Last Volume Bytes:      332,052,489,216 (332.0 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK -- with warnings

16-Mar 12:01 cbs-01-dir JobId 10218: Begin pruning Jobs older than 1 month 26 
days .
16-Mar 12:01 cbs-01-dir JobId 10218: No Jobs found to prune.
16-Mar 12:01 cbs-01-dir JobId 10218: Begin pruning Jobs.
16-Mar 12:01 cbs-01-dir JobId 10218: Pruned Files from 29 Jobs for client 
example-01-fd from catalog.
16-Mar 12:01 cbs-01-dir JobId 10218: End auto prune.

06-Mar 21:05 cbs-01-dir JobId 10202: Start Backup JobId 10202, 
Job=example-03-var.2010-03-06_21.05.00_54
06-Mar 21:05 cbs-01-dir JobId 10202: Using Device "Drive-0"
06-Mar 21:05 cbs-01-sd JobId 10202: Spooling data ...
06-Mar 21:05 cbs-01-fd JobId 10202:      /var/lib/nfs/rpc_pipefs is a different 
filesystem. Will not descend from /var into /var/lib/nfs/rpc_pipefs
06-Mar 21:07 cbs-01-sd JobId 10202: Job write elapsed time = 00:02:47, Transfer 
rate = 22.79 M Bytes/second
06-Mar 21:07 cbs-01-sd JobId 10202: Committing spooled data to Volume "000004". 
Despooling 3,810,523,377 bytes ...
16-Mar 12:02 cbs-01-sd JobId 10202: Despooling elapsed time = 00:01:08, 
Transfer rate = 56.03 M Bytes/second
16-Mar 12:02 cbs-01-sd JobId 10202: Sending spooled attrs to the Director. 
Despooling 85,522 bytes ...
12-Mar 21:08 cbs-01-dir JobId 10202: Error: Watchdog sending kill after 518427 
secs to thread stalled reading Storage daemon.
16-Mar 12:02 cbs-01-dir JobId 10202: Bacula cbs-01-dir 5.0.1 (24Feb10): 
16-Mar-2010 12:02:56
  Build OS:               x86_64-redhat-linux-gnu redhat
  JobId:                  10202
  Job:                    example-03-var.2010-03-06_21.05.00_54
  Backup Level:           Incremental, since=2010-03-05 21:05:07
  Client:                 "example-03-fd" 5.0.1 (24Feb10) 
x86_64-redhat-linux-gnu,redhat,
  FileSet:                "var" 2009-07-09 12:17:34
  Pool:                   "Daily" (From Run IncPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "tape" (From Job resource)
  Scheduled time:         06-Mar-2010 21:05:00
  Start time:             06-Mar-2010 21:05:07
  End time:               16-Mar-2010 12:02:56
  Elapsed time:           9 days 13 hours 57 mins 49 secs
  Priority:               10
  FD Files Written:       315
  SD Files Written:       315
  FD Bytes Written:       3,806,944,538 (3.806 GB)
  SD Bytes Written:       3,806,982,715 (3.806 GB)
  Rate:                   4.6 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         000004
  Volume Session Id:      459
  Volume Session Time:    1267395547
  Last Volume Bytes:      335,862,632,448 (335.8 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK -- with warnings

16-Mar 12:02 cbs-01-dir JobId 10202: Begin pruning Jobs older than 1 month 26 
days .
16-Mar 12:02 cbs-01-dir JobId 10202: No Jobs found to prune.
16-Mar 12:02 cbs-01-dir JobId 10202: Begin pruning Jobs.
16-Mar 12:03 cbs-01-dir JobId 10202: Pruned Files from 41 Jobs for client 
example-03-fd from catalog.
16-Mar 12:03 cbs-01-dir JobId 10202: End auto prune.

------------------------------------------------------------------------------
Download Intel&#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev

_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

[Bacula-devel] Problem with SD hang in 5.0.1

Reply via email to