Hi Kern, I'll upgrade to 1.36.3 and see what happens. Maybe "Fix deadlock in multiple simultaneous jobs." (from ReleaseNotes) could be the right one. I already setup this site with 1.36.3 FileFormat because I knew it's going to be required!
Regards Volker On Mi, 13 Jul 2005, Kern Sibbald wrote: > Hello Volker, > > There were one or two race conditions that I fixed in 1.36.3. You might look > at the release notes and see if they appy to you. Beware 1.36.3 requires the > new format FileSets (and hence a Full backup unless you explicitly disable > it). > > On Tuesday 12 July 2005 00:24, Volker Sauer wrote: > > Hello, > > > > after some weeks of proper operation, my new bacula site causes > > problems. > > > > Problem description: > > After setting up the site, job scheduling worked for about 2 weeks. Now, > > quite often, the director hangs after starting the scheduled jobs at > > night. Connecting via console is not possible anymore. > > > > The only solution so far was restarting the director and manually > > running the jobs. After a few days (maybe even the next night) the > > problem occurs again. > > Looking into the archives of this list, I found some similar problem > > descriptions but none of the suggest things lead to a solution (at least > > none reported). > > > > The amazing thing is, that I have a second site with very similar > > config but only another loader, that works fine for years now. > > Now, I found out, that it could have come from a hanging sd, too. See > > below why. Question is: why hangs the dir because the sd is hanging? > > > > I haven't been able to produce a trace-file. I'll set debuglevel=100 and > > trace=1 now and see if I can gather some information and post it to this > > list as soon as possbile. > > What I can provide (so far) is some other debugging output (see below). > > > > Does anyone have ideas or comments on this odd behaviour?? > > > > Regards > > Volker > > > > Debugging-Session attached: > > > > 8<------------------------------------------------------------------------- > >------ > > > > System: Debian Sarge Kernel 2.6.8-2-k7-smp > > Bacula: 1.36.2-2sarge1 > > Storage: Overland 10x DLT with Quantum DLT40/80 > > Database: mysql 4.0.24-10 (Size: 1.4G) > > dir, sd and mysql on the same machine. Approx. 15 Clients. DiskSpooling > > for all machines except bacula-host itself. > > > > ----------------------------------------------------- > > dakar: / 7# bconsole > > Connecting to Director dakar:9101 > > Director authorization problem. > > Most likely the passwords do not agree. > > Please see > > http://www.bacula.org/html-manual/faq.html#AuthorizationErrors for help. > > ----------------------------------------------------- > > (passwords are definitely okay). > > > > > > This is the conmesg-file so far: > > ----------------------- > > dakar: /var/lib/bacula 33# cat backup-dir.conmsg > > > > 11-Jul 21:00 backup-dir: Start Backup JobId > > 899,Job=paris-home.archived.2005-07-11_21.00.00 11-Jul 21:00 backup-dir: > > Start Backup JobId 900,Job=paris-netboot.2005-07-11_21.00.01 11-Jul 21:00 > > backup-sd: 3301 Issuing autochanger "loaded drive 0" command. 11-Jul 21:00 > > backup-sd: 3302 Autochanger "loaded drive 0", result is Slot 3. 11-Jul > > 21:00 backup-sd: Volume "DiffInc-03" previously written, moving to end of > > data. 11-Jul 21:00 backup-dir: Start Backup JobId 907, > > Job=bali-rootfs.2005-07-11_21.00.08 ------------------------ > > The bali-fd is up and seems to okay. The director seems to wait for > > something. Nothing else in this file > > > > > > There's nothing in the syslog (SCSI error etc.), either: > > ---------------------------------------------------- > > Jul 11 20:40:30 dakar -- MARK -- > > Jul 11 21:00:30 dakar -- MARK -- > > Jul 11 21:17:01 dakar /USR/SBIN/CRON[12590]: (root) CMD ( run-parts > > --report /etc/cron.hourly) > > Jul 11 21:40:30 dakar -- MARK -- > > Jul 11 22:00:30 dakar -- MARK -- > > Jul 11 22:17:01 dakar /USR/SBIN/CRON[12737]: (root) CMD ( run-parts > > --report /etc/cron.hourly) > > Jul 11 22:40:30 dakar -- MARK -- > > Jul 11 23:00:30 dakar -- MARK -- > > ----------------------------------------------------- > > > > Here's what strace -p says: > > > > on bacula-director-process: > > dakar: ~ 2# strace -v -p 19051 > > Process 19051 attached - interrupt to quit > > futex(0x80c5680, FUTEX_WAIT, 2, NULL > > (running bconsole doesn't produce any additional output). It seems to > > wait for something. > > > > > > on bacula-sd-process: > > dakar: ~ 3# strace -v -p 19099 > > Process 19099 attached - interrupt to quit > > select(5, [4], NULL, NULL, NULL <unfinished ...> > > Process 19099 detached > > > > > > on bacula-fd on bali-fd: > > bali: ~ 4# strace -v -p 3301 > > Process 3301 attached - interrupt to quit > > select(4, [3], NULL, NULL, NULL <unfinished ...> > > Process 3301 detached > > > > > > on bacula-fd on paris-fd: > > paris: ~# strace -v -p 14280 > > Process 14280 attached - interrupt to quit > > select(4, [3], NULL, NULL, NULL > > > > Something is quite strange about the bacula-fd on paris: > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 14280 root 16 0 263m 221m 2840 S 0.0 10.9 0:00.02 > > /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf > > > > The bacula-fd is 263M in size!! Hell, why so much?? I don't know if > > this has something to do with the director. > > > > What I'll do right now is restart paris-fd and see what happens: > > > > Results: > > - bacula-fd on paris is normal size > > - bconsole still doesn't connect > > - backup-dir.conmsg contains messages about non-reachable paris: > > > > 11-Jul 23:33 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal > > error: Network error with FD during Backup: ERR=No data available > > 11-Jul 23:33 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error: > > Network error with FD during Backup: ERR=No data available > > 11-Jul 23:34 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal > > error: No Job status returned from FD. > > 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error: > > No Job status returned from FD. > > 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Error: Bacula > > 1.36.2 (28Feb05): 11-Jul-2005 23:34:14 > > JobId: 900 > > Job: paris-netboot.2005-07-11_21.00.01 > > Backup Level: Incremental, since=2005-07-08 21:00:02 > > Client: paris-fd > > FileSet: "paris-netboot" 2005-06-11 01:35:56 > > Pool: "DiffInc" > > Storage: "DLT" > > Start time: 11-Jul-2005 21:00:02 > > [.........] > > > > bacula-dir still FUTEX_WAITing: > > dakar: /var/lib/bacula 9# strace -p 19051 > > Process 19051 attached - interrupt to quit > > futex(0x80c5680, FUTEX_WAIT, 2, NULL <unfinished ...> > > > > Well, so far. I will restart bacula-dir now und run some jobs manually: > > ..... > > After running the jobs and waiting for 10 minutes nothing happens. the > > bacula-dir still works and stat sd says: > > > > Device status: > > Device "/dev/tape" is mounted with Volume "DiffInc-03" > > Device is BLOCKED waiting for appendable media. > > Total Bytes Read=0 Blocks Read=0 Bytes/block=0 > > Positioned at File=19 Block=0 > > Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898 > > max bytes/job. > > Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max > > bytes. > > > > I'll try unmount and mount: > > *umount > > Automatically selected Storage: DLT > > 3001 Device "/dev/tape" unmounted. > > * > > *mount > > Automatically selected Storage: DLT > > 3001 Device /dev/tape is mounted with Volume "DiffInc-03" > > * > > > > But Again: > > > > Device status: > > Device "/dev/tape" is mounted with Volume "DiffInc-03" > > Device is BLOCKED waiting for appendable media. > > Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512 > > Positioned at File=0 Block=0 > > Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898 > > max bytes/job. > > Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max > > bytes. > > > > > > This time the bacula-sd seems to be stuck. Maybe this causes the > > bacula-fd after hours of waiting to hang??? > > > > I'm restarting sd: > > > > 12-Jul 00:01 backup-sd: hanau-web.2005-07-11_23.44.52 Fatal error: Job > > 922 canceled. > > 12-Jul 00:01 backup-sd: donar-home.2005-07-11_23.44.39 Fatal error: Job > > 921 canceled. > > 12-Jul 00:01 backup-sd: paris-home.guest.2005-07-11_23.42.33 Fatal > > error: Job 915 canceled. > > 12-Jul 00:01 backup-sd: paris-home.archived.2005-07-11_23.42.29 Fatal > > error: Job 914 canceled. > > 12-Jul 00:01 backup-sd: caracas.2005-07-11_23.42.18 Fatal error: Job 913 > > canceled. > > 12-Jul 00:01 donar-fd: donar-home.2005-07-11_23.44.39 Fatal error: > > job.c:1665 Bad response to Append Data command. Wanted 3000 OK data > > , got 3903 Error append data > > > > 12-Jul 00:01 paris-fd: paris-home.guest.2005-07-11_23.42.33 Fatal error: > > job.c:1665 Bad response to Append Data command. Wanted 3000 OK data > > , got 3903 Error append data > > > > 12-Jul 00:01 paris-fd: paris-home.archived.2005-07-11_23.42.29 Fatal > > error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK > > data > > > > [ ... errors of broken jobs removed ...] > > > > 12-Jul 00:01 backup-dir: Start Backup JobId 918, > > Job=paris-home.staff.3.2005-07-11_23.42.44 > > 12-Jul 00:01 backup-dir: Start Backup JobId 919, > > Job=paris-home.prak.2005-07-11_23.42.49 > > 12-Jul 00:01 backup-sd: 3301 Issuing autochanger "loaded drive 0" > > command. > > 12-Jul 00:01 backup-sd: 3302 Autochanger "loaded drive 0", result is > > Slot 3. > > 12-Jul 00:01 backup-sd: Volume "DiffInc-03" previously written, moving > > to end of data. > > * > > * > > 12-Jul 00:02 backup-sd: Ready to append to end of Volume "DiffInc-03" at > > file=19. > > 12-Jul 00:02 backup-sd: Spooling data ... > > 12-Jul 00:02 backup-sd: Spooling data ... > > > > > > now it seems to work.... ok, all jobs finished. > > > > Question is: is there a connection between a hanging sd (which I'm not > > sure if it was hanging while the dir was hanging!?) and the hangig dir?? > > > > I'll try to get a trace file of the situation! > > > > End Debugging Output > > ---------------------------------------------------------------------- > > -- > Best regards, > > Kern > > ("> > /\ > V_V > > -- Volker Sauer * Alexanderstrasse 39/217 * 64283 Darmstadt Telefon: 06151-154260 * Mobil: 0179-6901475 * ICQ#98164307 mailto:[EMAIL PROTECTED] * http://www.volker-sauer.de PGPKey-Fingerprint: DB2611C7B12E0B2739992E4F7E354E4D5DD5D0E0
signature.asc
Description: Digital signature