Hello,

Volker Sauer wrote:

Hi Kern,

I'll upgrade to 1.36.3 and see what happens. Maybe "Fix deadlock in
multiple simultaneous jobs." (from ReleaseNotes) could be the right one.
I already setup this site with 1.36.3 FileFormat because I knew it's
going to be required!

One thing to note -
I had the same problem of a locking DIR, which worked ok after a restart, and I could never find a reason (partly because I never investigated with gdb, but that's beyond my skills and as long as I could restart my backups rather easily that was ok).

With 1.36.3 this problem vanished.

Until yesterday.

I noticed the director hanging. Couldn't connect to it, no tape activity, some jobs were started but nothing was written to tape.

I killed and restarted the bacula processes - DIR, SD and FD - because I used the init script.

I noticed soon afterwards that I had a hardware problem: One of my old museum-class tape drives dies, seriously impacting the SCSI bus. The result was that a CD-RW drive didn't start, thus no device was allocated to it, the sg device for one autochanger had changed, and mtx segfaulted when accessing the configured changer device.

This could, obvioulsly, only affect the SD directly, but the DIR was stuck, too.

After turning off the broken tape drive, and modifying baculas configuration appropriately, everything works flawlessly again.

Conclusion 1:
SD problems _can_ hang the DIR. This confirms what Volker found. Probably there is some timeout in the DIR, but in such a case this timeout is either too long (my opinion) or shouldn't block console connections (and probably the rest of the DIR working).

Conclusion 2:
mtx is crap. It shouldn't segfault. I'll look at Peters replacement, although I couldn't get it to work before... I also heard about mover...

Conclusion 3:
Don't use decades-old hardware in a backup system ;-)

Conclusion 4:
When the SD hangs, check your system logs for hardware errors. Carefully. Take them serious (Had I done this yesterday morning would have been much more fun).

Arno

Regards
Volker

On Mi, 13 Jul 2005, Kern Sibbald wrote:


Hello Volker,

There were one or two race conditions that I fixed in 1.36.3. You might look at the release notes and see if they appy to you. Beware 1.36.3 requires the new format FileSets (and hence a Full backup unless you explicitly disable it).

On Tuesday 12 July 2005 00:24, Volker Sauer wrote:

Hello,

after some weeks of proper operation, my new bacula site causes
problems.

Problem description:
After setting up the site, job scheduling worked for about 2 weeks. Now,
quite often, the director hangs after starting the scheduled jobs at
night. Connecting via console is not possible anymore.

The only solution so far was restarting the director and manually
running the jobs. After a few days (maybe even the next night) the
problem occurs again.
Looking into the archives of this list, I found some similar problem
descriptions but none of the suggest things lead to a solution (at least
none reported).

The amazing thing is, that I have a second site with very similar
config but only another loader, that works fine for years now.
Now, I found out, that it could have come from a hanging sd, too. See
below why. Question is: why hangs the dir because the sd is hanging?

I haven't been able to produce a trace-file. I'll set debuglevel=100 and
trace=1 now and see if I can gather some information and post it to this
list as soon as possbile.
What I can provide (so far) is some other debugging output (see below).

Does anyone have ideas or comments on this odd behaviour??

Regards
Volker

Debugging-Session attached:

8<-------------------------------------------------------------------------
------

System: Debian Sarge Kernel 2.6.8-2-k7-smp
Bacula: 1.36.2-2sarge1
Storage: Overland 10x DLT with Quantum DLT40/80
Database: mysql 4.0.24-10 (Size: 1.4G)
dir, sd and mysql on the same machine. Approx. 15 Clients. DiskSpooling
for all machines except bacula-host itself.

-----------------------------------------------------
dakar: / 7# bconsole
Connecting to Director dakar:9101
Director authorization problem.
Most likely the passwords do not agree.
Please see
http://www.bacula.org/html-manual/faq.html#AuthorizationErrors for help.
-----------------------------------------------------
(passwords are definitely okay).


This is the conmesg-file so far:
-----------------------
dakar: /var/lib/bacula 33# cat backup-dir.conmsg

11-Jul 21:00 backup-dir: Start Backup JobId
899,Job=paris-home.archived.2005-07-11_21.00.00 11-Jul 21:00 backup-dir:
Start Backup JobId 900,Job=paris-netboot.2005-07-11_21.00.01 11-Jul 21:00
backup-sd: 3301 Issuing autochanger "loaded drive 0" command. 11-Jul 21:00
backup-sd: 3302 Autochanger "loaded drive 0", result is Slot 3. 11-Jul
21:00 backup-sd: Volume "DiffInc-03" previously written, moving to end of
data. 11-Jul 21:00 backup-dir: Start Backup JobId 907,
Job=bali-rootfs.2005-07-11_21.00.08 ------------------------
The bali-fd is up and seems to okay. The director seems to wait for
something. Nothing else in this file


There's nothing in the syslog (SCSI error etc.), either:
----------------------------------------------------
Jul 11 20:40:30 dakar -- MARK --
Jul 11 21:00:30 dakar -- MARK --
Jul 11 21:17:01 dakar /USR/SBIN/CRON[12590]: (root) CMD (   run-parts
--report /etc/cron.hourly)
Jul 11 21:40:30 dakar -- MARK --
Jul 11 22:00:30 dakar -- MARK --
Jul 11 22:17:01 dakar /USR/SBIN/CRON[12737]: (root) CMD (   run-parts
--report /etc/cron.hourly)
Jul 11 22:40:30 dakar -- MARK --
Jul 11 23:00:30 dakar -- MARK --
-----------------------------------------------------

Here's what strace -p says:

on bacula-director-process:
        dakar: ~ 2# strace -v -p 19051
        Process 19051 attached - interrupt to quit
        futex(0x80c5680, FUTEX_WAIT, 2, NULL
(running bconsole doesn't produce any additional output). It seems to
wait for something.


on bacula-sd-process:
        dakar: ~ 3# strace -v -p 19099
        Process 19099 attached - interrupt to quit
        select(5, [4], NULL, NULL, NULL <unfinished ...>
        Process 19099 detached


on bacula-fd on bali-fd:
        bali: ~ 4# strace -v -p 3301
        Process 3301 attached - interrupt to quit
        select(4, [3], NULL, NULL, NULL <unfinished ...>
        Process 3301 detached


on bacula-fd on paris-fd:
        paris: ~# strace -v -p 14280
        Process 14280 attached - interrupt to quit
        select(4, [3], NULL, NULL, NULL

Something is quite strange about the bacula-fd on paris:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14280 root 16 0 263m 221m 2840 S 0.0 10.9 0:00.02 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf

The bacula-fd is 263M in size!! Hell, why so much?? I don't know if
this has something to do with the director.

What I'll do right now is restart paris-fd and see what happens:

Results:
- bacula-fd on paris is normal size
- bconsole still doesn't connect
- backup-dir.conmsg contains messages about non-reachable paris:

11-Jul 23:33 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
error: Network error with FD during Backup: ERR=No data available
11-Jul 23:33 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
Network error with FD during Backup: ERR=No data available
11-Jul 23:34 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
error: No Job status returned from FD.
11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
No Job status returned from FD.
11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Error: Bacula
1.36.2 (28Feb05): 11-Jul-2005 23:34:14
 JobId:                  900
 Job:                    paris-netboot.2005-07-11_21.00.01
 Backup Level:           Incremental, since=2005-07-08 21:00:02
 Client:                 paris-fd
 FileSet:                "paris-netboot" 2005-06-11 01:35:56
 Pool:                   "DiffInc"
 Storage:                "DLT"
 Start time:             11-Jul-2005 21:00:02
[.........]

bacula-dir still FUTEX_WAITing:
        dakar: /var/lib/bacula 9# strace -p 19051
        Process 19051 attached - interrupt to quit
        futex(0x80c5680, FUTEX_WAIT, 2, NULL <unfinished ...>

Well, so far. I will restart bacula-dir now und run some jobs manually:
.....
After running the jobs and waiting for 10 minutes nothing happens. the
bacula-dir still works and stat sd says:

Device status:
Device "/dev/tape" is mounted with Volume "DiffInc-03"
   Device is BLOCKED waiting for appendable media.
   Total Bytes Read=0 Blocks Read=0 Bytes/block=0
   Positioned at File=19 Block=0
Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
max bytes/job.
Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
bytes.

I'll try unmount and mount:
*umount
Automatically selected Storage: DLT
3001 Device "/dev/tape" unmounted.
*
*mount
Automatically selected Storage: DLT
3001 Device /dev/tape is mounted with Volume "DiffInc-03"
*

But Again:

Device status:
Device "/dev/tape" is mounted with Volume "DiffInc-03"
   Device is BLOCKED waiting for appendable media.
   Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512
   Positioned at File=0 Block=0
Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
max bytes/job.
Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
bytes.


This time the bacula-sd seems to be stuck. Maybe this causes the
bacula-fd after hours of waiting to hang???

I'm restarting sd:

12-Jul 00:01 backup-sd: hanau-web.2005-07-11_23.44.52 Fatal error: Job
922 canceled.
12-Jul 00:01 backup-sd: donar-home.2005-07-11_23.44.39 Fatal error: Job
921 canceled.
12-Jul 00:01 backup-sd: paris-home.guest.2005-07-11_23.42.33 Fatal
error: Job 915 canceled.
12-Jul 00:01 backup-sd: paris-home.archived.2005-07-11_23.42.29 Fatal
error: Job 914 canceled.
12-Jul 00:01 backup-sd: caracas.2005-07-11_23.42.18 Fatal error: Job 913
canceled.
12-Jul 00:01 donar-fd: donar-home.2005-07-11_23.44.39 Fatal error:
job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
, got 3903 Error append data

12-Jul 00:01 paris-fd: paris-home.guest.2005-07-11_23.42.33 Fatal error:
job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
, got 3903 Error append data

12-Jul 00:01 paris-fd: paris-home.archived.2005-07-11_23.42.29 Fatal
error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK
data

[ ... errors of broken jobs removed ...]

12-Jul 00:01 backup-dir: Start Backup JobId 918,
Job=paris-home.staff.3.2005-07-11_23.42.44
12-Jul 00:01 backup-dir: Start Backup JobId 919,
Job=paris-home.prak.2005-07-11_23.42.49
12-Jul 00:01 backup-sd: 3301 Issuing autochanger "loaded drive 0"
command.
12-Jul 00:01 backup-sd: 3302 Autochanger "loaded drive 0", result is
Slot 3.
12-Jul 00:01 backup-sd: Volume "DiffInc-03" previously written, moving
to end of data.
*
*
12-Jul 00:02 backup-sd: Ready to append to end of Volume "DiffInc-03" at
file=19.
12-Jul 00:02 backup-sd: Spooling data ...
12-Jul 00:02 backup-sd: Spooling data ...


now it seems to work.... ok, all jobs finished.

Question is: is there a connection between a hanging sd (which I'm not
sure if it was hanging while the dir was hanging!?) and the hangig dir??

I'll try to get a trace file of the situation!

End Debugging Output
----------------------------------------------------------------------

--
Best regards,

Kern

 (">
 /\
 V_V





--
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de


-------------------------------------------------------
SF.Net email is sponsored by: Discover Easy Linux Migration Strategies
from IBM. Find simple to follow Roadmaps, straightforward articles,
informative Webcasts and more! Get everything you need to get up to
speed, fast. http://ads.osdn.com/?ad_id=7477&alloc_id=16492&op=click
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Reply via email to