Hi Kern,

I'll upgrade to 1.36.3 and see what happens. Maybe "Fix deadlock in
multiple simultaneous jobs." (from the ReleaseNotes) is the relevant fix.
I already set up this site with the 1.36.3 FileSet format because I knew it
was going to be required!
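
In the meantime, here is a rough sketch (not part of Bacula) of how the
conmsg file could be scanned for jobs that were started but never produced
a termination report. It assumes the line layout seen in the output quoted
below; the function name unfinished_jobs and the sample text are purely
illustrative.

```python
import re

# Assumed layout, taken from the quoted conmsg output:
#   "... backup-dir: Start Backup JobId 899,Job=..."   -> job was started
#   "  JobId:                  900"                    -> termination report
# \s+ also matches a newline, so line-wrapped entries are still found.
START_RE = re.compile(r"Start Backup JobId\s+(\d+)")
SUMMARY_RE = re.compile(r"^\s*JobId:\s+(\d+)", re.MULTILINE)


def unfinished_jobs(log_text):
    """Return sorted JobIds that were started but have no termination report."""
    started = {int(job_id) for job_id in START_RE.findall(log_text)}
    reported = {int(job_id) for job_id in SUMMARY_RE.findall(log_text)}
    return sorted(started - reported)


# Illustrative sample assembled from lines in the quoted log.
SAMPLE = """\
11-Jul 21:00 backup-dir: Start Backup JobId 899,Job=paris-home.archived.2005-07-11_21.00.00
11-Jul 21:00 backup-dir: Start Backup JobId 900,Job=paris-netboot.2005-07-11_21.00.01
11-Jul 21:00 backup-dir: Start Backup JobId 907, Job=bali-rootfs.2005-07-11_21.00.08
11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Error: Bacula 1.36.2 (28Feb05): 11-Jul-2005 23:34:14
  JobId:                  900
"""

if __name__ == "__main__":
    print(unfinished_jobs(SAMPLE))
```

For the sample text above this prints [899, 907]. On a real system the
text would be read from the conmsg file instead.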

Regards
Volker

On Wed, 13 Jul 2005, Kern Sibbald wrote:

> Hello Volker,
> 
> There were one or two race conditions that I fixed in 1.36.3.  You might look 
> at the release notes and see if they apply to you.  Beware 1.36.3 requires the 
> new format FileSets (and hence a Full backup unless you explicitly disable 
> it).
> 
> On Tuesday 12 July 2005 00:24, Volker Sauer wrote:
> > Hello,
> >
> > after some weeks of proper operation, my new bacula site causes
> > problems.
> >
> > Problem description:
> > After setting up the site, job scheduling worked for about 2 weeks. Now,
> > quite often, the director hangs after starting the scheduled jobs at
> > night. Connecting via console is not possible anymore.
> >
> > The only solution so far was restarting the director and manually
> > running the jobs. After a few days (maybe even the next night) the
> > problem occurs again.
> > Looking into the archives of this list, I found some similar problem
> > descriptions, but none of the suggested things led to a solution (at least
> > none reported).
> >
> > The amazing thing is that I have a second site with a very similar
> > config, differing only in the loader, that has worked fine for years.
> > I have now found out that the problem could also come from a hanging sd;
> > see below for why. The question is: why does the dir hang because the sd
> > is hanging?
> >
> > I haven't been able to produce a trace-file. I'll set debuglevel=100 and
> > trace=1 now and see if I can gather some information and post it to this
> > list as soon as possible.
> > What I can provide (so far) is some other debugging output (see below).
> >
> > Does anyone have ideas or comments on this odd behaviour??
> >
> > Regards
> > Volker
> >
> > Debugging-Session attached:
> >
> > 8<-------------------------------------------------------------------------------
> >
> > System: Debian Sarge Kernel 2.6.8-2-k7-smp
> > Bacula: 1.36.2-2sarge1
> > Storage: Overland 10x DLT with Quantum DLT40/80
> > Database: mysql 4.0.24-10 (Size: 1.4G)
> > dir, sd and mysql on the same machine. Approx. 15 Clients. DiskSpooling
> > for all machines except bacula-host itself.
> >
> > -----------------------------------------------------
> > dakar: / 7# bconsole
> > Connecting to Director dakar:9101
> > Director authorization problem.
> > Most likely the passwords do not agree.
> > Please see
> > http://www.bacula.org/html-manual/faq.html#AuthorizationErrors for help.
> > -----------------------------------------------------
> > (passwords are definitely okay).
> >
> >
> > This is the conmsg file so far:
> > -----------------------
> > dakar: /var/lib/bacula 33# cat backup-dir.conmsg
> >
> > 11-Jul 21:00 backup-dir: Start Backup JobId 899,Job=paris-home.archived.2005-07-11_21.00.00
> > 11-Jul 21:00 backup-dir: Start Backup JobId 900,Job=paris-netboot.2005-07-11_21.00.01
> > 11-Jul 21:00 backup-sd: 3301 Issuing autochanger "loaded drive 0" command.
> > 11-Jul 21:00 backup-sd: 3302 Autochanger "loaded drive 0", result is Slot 3.
> > 11-Jul 21:00 backup-sd: Volume "DiffInc-03" previously written, moving to end of data.
> > 11-Jul 21:00 backup-dir: Start Backup JobId 907, Job=bali-rootfs.2005-07-11_21.00.08
> > ------------------------
> > The bali-fd is up and seems to be okay. The director seems to wait for
> > something. There is nothing else in this file.
> >
> >
> > There's nothing in the syslog (SCSI error etc.), either:
> > ----------------------------------------------------
> > Jul 11 20:40:30 dakar -- MARK --
> > Jul 11 21:00:30 dakar -- MARK --
> > Jul 11 21:17:01 dakar /USR/SBIN/CRON[12590]: (root) CMD (   run-parts --report /etc/cron.hourly)
> > Jul 11 21:40:30 dakar -- MARK --
> > Jul 11 22:00:30 dakar -- MARK --
> > Jul 11 22:17:01 dakar /USR/SBIN/CRON[12737]: (root) CMD (   run-parts --report /etc/cron.hourly)
> > Jul 11 22:40:30 dakar -- MARK --
> > Jul 11 23:00:30 dakar -- MARK --
> > -----------------------------------------------------
> >
> > Here's what strace -p says:
> >
> > on bacula-director-process:
> >     dakar: ~ 2# strace -v -p 19051
> >     Process 19051 attached - interrupt to quit
> >     futex(0x80c5680, FUTEX_WAIT, 2, NULL
> > (running bconsole doesn't produce any additional output). It seems to
> > wait for something.
> >
> >
> > on bacula-sd-process:
> >     dakar: ~ 3# strace -v -p 19099
> >     Process 19099 attached - interrupt to quit
> >     select(5, [4], NULL, NULL, NULL <unfinished ...>
> >     Process 19099 detached
> >
> >
> > on bacula-fd on bali-fd:
> >     bali: ~ 4# strace -v -p 3301
> >     Process 3301 attached - interrupt to quit
> >     select(4, [3], NULL, NULL, NULL <unfinished ...>
> >     Process 3301 detached
> >
> >
> > on bacula-fd on paris-fd:
> >     paris: ~# strace -v -p 14280
> >     Process 14280 attached - interrupt to quit
> >     select(4, [3], NULL, NULL, NULL
> >
> > Something is quite strange about the bacula-fd on paris:
> >
> > PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 14280 root      16   0  263m 221m 2840 S  0.0 10.9   0:00.02 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf
> >
> > The bacula-fd is 263M in size!! Hell, why so much?? I don't know if
> > this has something to do with the director.
> >
> > What I'll do right now is restart paris-fd and see what happens:
> >
> > Results:
> > - bacula-fd on paris is normal size
> > - bconsole still doesn't connect
> > - backup-dir.conmsg contains messages about non-reachable paris:
> >
> > 11-Jul 23:33 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
> > error: Network error with FD during Backup: ERR=No data available
> > 11-Jul 23:33 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
> > Network error with FD during Backup: ERR=No data available
> > 11-Jul 23:34 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
> > error: No Job status returned from FD.
> > 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
> > No Job status returned from FD.
> > 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Error: Bacula
> > 1.36.2 (28Feb05): 11-Jul-2005 23:34:14
> >   JobId:                  900
> >   Job:                    paris-netboot.2005-07-11_21.00.01
> >   Backup Level:           Incremental, since=2005-07-08 21:00:02
> >   Client:                 paris-fd
> >   FileSet:                "paris-netboot" 2005-06-11 01:35:56
> >   Pool:                   "DiffInc"
> >   Storage:                "DLT"
> >   Start time:             11-Jul-2005 21:00:02
> > [.........]
> >
> > bacula-dir still FUTEX_WAITing:
> >     dakar: /var/lib/bacula 9# strace -p 19051
> >     Process 19051 attached - interrupt to quit
> >     futex(0x80c5680, FUTEX_WAIT, 2, NULL <unfinished ...>
> >
> > That's it so far. I will now restart bacula-dir and run some jobs manually:
> > .....
> > After running the jobs and waiting for 10 minutes, nothing happens. The
> > bacula-dir still works, and stat sd says:
> >
> > Device status:
> > Device "/dev/tape" is mounted with Volume "DiffInc-03"
> >     Device is BLOCKED waiting for appendable media.
> >     Total Bytes Read=0 Blocks Read=0 Bytes/block=0
> >     Positioned at File=19 Block=0
> > Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
> > max bytes/job.
> > Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
> > bytes.
> >
> > I'll try unmount and mount:
> > *umount
> > Automatically selected Storage: DLT
> > 3001 Device "/dev/tape" unmounted.
> > *
> > *mount
> > Automatically selected Storage: DLT
> > 3001 Device /dev/tape is mounted with Volume "DiffInc-03"
> > *
> >
> > But again:
> >
> > Device status:
> > Device "/dev/tape" is mounted with Volume "DiffInc-03"
> >     Device is BLOCKED waiting for appendable media.
> >     Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512
> >     Positioned at File=0 Block=0
> > Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
> > max bytes/job.
> > Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
> > bytes.
> >
> >
> > This time the bacula-sd seems to be stuck. Maybe this causes the
> > bacula-fd to hang after hours of waiting???
> >
> > I'm restarting sd:
> >
> > 12-Jul 00:01 backup-sd: hanau-web.2005-07-11_23.44.52 Fatal error: Job
> > 922 canceled.
> > 12-Jul 00:01 backup-sd: donar-home.2005-07-11_23.44.39 Fatal error: Job
> > 921 canceled.
> > 12-Jul 00:01 backup-sd: paris-home.guest.2005-07-11_23.42.33 Fatal
> > error: Job 915 canceled.
> > 12-Jul 00:01 backup-sd: paris-home.archived.2005-07-11_23.42.29 Fatal
> > error: Job 914 canceled.
> > 12-Jul 00:01 backup-sd: caracas.2005-07-11_23.42.18 Fatal error: Job 913
> > canceled.
> > 12-Jul 00:01 donar-fd: donar-home.2005-07-11_23.44.39 Fatal error:
> > job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
> > , got 3903 Error append data
> >
> > 12-Jul 00:01 paris-fd: paris-home.guest.2005-07-11_23.42.33 Fatal error:
> > job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
> > , got 3903 Error append data
> >
> > 12-Jul 00:01 paris-fd: paris-home.archived.2005-07-11_23.42.29 Fatal
> > error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK
> > data
> >
> > [ ... errors of broken jobs removed ...]
> >
> > 12-Jul 00:01 backup-dir: Start Backup JobId 918,
> > Job=paris-home.staff.3.2005-07-11_23.42.44
> > 12-Jul 00:01 backup-dir: Start Backup JobId 919,
> > Job=paris-home.prak.2005-07-11_23.42.49
> > 12-Jul 00:01 backup-sd: 3301 Issuing autochanger "loaded drive 0"
> > command.
> > 12-Jul 00:01 backup-sd: 3302 Autochanger "loaded drive 0", result is
> > Slot 3.
> > 12-Jul 00:01 backup-sd: Volume "DiffInc-03" previously written, moving
> > to end of data.
> > *
> > *
> > 12-Jul 00:02 backup-sd: Ready to append to end of Volume "DiffInc-03" at
> > file=19.
> > 12-Jul 00:02 backup-sd: Spooling data ...
> > 12-Jul 00:02 backup-sd: Spooling data ...
> >
> >
> > now it seems to work.... ok, all jobs finished.
> >
> > The question is: is there a connection between a hanging sd (I'm not
> > sure whether it was hanging while the dir was hanging!?) and the hanging dir??
> >
> > I'll try to get a trace file of the situation!
> >
> > End Debugging Output
> > ----------------------------------------------------------------------
> 
> -- 
> Best regards,
> 
> Kern
> 
>   (">
>   /\
>   V_V
> 
> 

-- 
  Volker Sauer  *  Alexanderstrasse 39/217  *  64283 Darmstadt
  Telefon: 06151-154260  *  Mobil: 0179-6901475 * ICQ#98164307
  mailto:[EMAIL PROTECTED]  *  http://www.volker-sauer.de
  PGPKey-Fingerprint: DB2611C7B12E0B2739992E4F7E354E4D5DD5D0E0

