Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36

2005-11-14 Thread Kern Sibbald
Hello,

On Monday 14 November 2005 00:47, Steve Ellis wrote:
 Kern Sibbald said:
  Hello again,
 
  You didn't by any chance recently upgrade from a 2.4 kernel to a 2.6
  kernel, did you? I am seeing all kinds of hangs and other funny behavior
  in the Storage daemon due to the change in the behavior of the open()
  call for tape drives from one kernel to another.

 Thanks for looking at this so quickly Kern-

Well, if something is fundamentally broken, I would like to fix it, and I'm 
just now testing 1.38.1 for release.


 No, I am running a 2.6 kernel, but I have been running it for 18 months or
 so.  I'm running a vintage Fedora Core 2 release--too lazy (and afraid) to
 upgrade on this system that is critical to my home network.  There has not
 been a new Core2 kernel in quite some time--my last kernel upgrade was in
 March, which I'm positive I was running, at least by August (I know I
 rebooted about that time).

 I'm a networking software engineer, so although I have a lot of capability
 to maintain, fix and debug a lot of stuff here at home, I don't have much
 in the way of spare time--consequently, I tend to keep using things if
 they are still working.  I did want to switch to Bacula 1.38, LTO2 and
 Fedora Core4, but have so far only done the first upgrade (bacula).  I saw
 messages on bacula-users about recent 2.6 changes, and was hoping that any
 dust would have settled by the time I got there (presumably when I get
 around to FC4--or FC5, if I continue to put it off any longer).

 If it would help, I can turn on some sd logging, or something.  The poll
 interval suggestion will probably work for me for now, especially once I
 get the LTO2 drive online, making nearly all of my backups a 1 tape
 affair.

After looking into this a bit here (I still have more testing to do), I am 
more and more convinced that your problem is due to the kernel change.

Basically, what I see is that if there is no tape in the drive, the open() 
call blocks either in the OS or in Bacula code, then fails at some point, 
and your job is terminated.  The old behavior of the OS was to always permit 
open() on the drive regardless of whether or not there was a tape in it.

I don't know when the change occurred -- i.e. what version of the kernel. 
Given the new kernel development mode, it is very likely that it came during 
one of the various 2.6.x releases.  There is a certain logic in what they 
have changed, but IMO, it is a perverse way of dealing with the situation (no 
tape in the drive), and will cause all kinds of problems.

If increasing the poll time works for you, OK, but after the tests I did here, 
I don't really think it will work.  The real fix is going to take a major 
redesign of Bacula, which currently expects to always open a drive, and when 
it cannot, it fails the job.

There are two workarounds for this situation that I see at the current time:

1. Remove the Offline On Unmount directive; this will leave the old tape in the drive
and allow Bacula to continue to open the drive.  However, you should 
probably set your poll time to 5 minutes so it doesn't wear the tape too 
much (I think that most modern tape drivers don't even re-read the tape.
They simply cache the first block and keep returning it).

2. If you keep the Offline on Unmount, you can probably prevent the failure
by increasing the Maximum Open Wait to some large value.  This will
cause Bacula to continue to try to open the drive even if it fails. I find
this solution a bit less satisfactory than the above.
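As a rough sketch only, the two workarounds above might look like this when applied to the Device resource from Steve's bacula-sd.conf (quoted further down in this thread). The two resources are alternatives, not meant to coexist in one file, and the 24-hour Maximum Open Wait is an arbitrary stand-in for "some large value", not a tested recommendation.

```conf
# Workaround 1 (alternative A): leave the tape in the drive so that
# open() keeps succeeding, and poll less often to reduce tape wear.
Device {
  Name = DDS4
  # ... other directives unchanged from the posted config ...
  Offline On Unmount = No
  Volume Poll Interval = 5 min
}

# Workaround 2 (alternative B): keep ejecting the tape, but let the
# Storage daemon keep retrying open() much longer before failing the job.
Device {
  Name = DDS4
  # ... other directives unchanged from the posted config ...
  Offline On Unmount = Yes
  Maximum Open Wait = 24 hours   # arbitrary example of a "large value"
}
```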


I still have not run tests to see if the Polling is broken in 1.38, which is a 
possibility since the code that does the waiting was moved around and 
enhanced.  My previous tests simulated your situation (no tape in the drive) 
and never got very far because the OS prevented the drive from being opened, 
and thus the polling code was never used.

-- 
Best regards,

Kern

  (
  /\
  V_V


---
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42 plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36

2005-11-14 Thread John Stoffel

Kern You didn't by any chance recently upgrade from a 2.4 kernel to a
Kern 2.6 kernel did you?  I am seeing all kinds of hangs and other
Kern funny behavior in the Storage daemon due to the change in the
Kern behavior of the open() call for tape drives from one kernel to
Kern another.

When my DLT 7k was alive, I was only running Linux 2.6 kernels and I
never had a problem using the system.  Now that my drive hangs the bus
on EOT, I've been driveless and without backups.  Ouch!

So take my words with a grain of salt... but I think Linux kernel 2.6
is fine for SCSI tapes.  The drive hung on a Solaris system as well...




Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36

2005-11-13 Thread Kern Sibbald
Hello,

I might well have added some additional insanity checks to guard against bad 
tapes or bad tape drives, and this could be interacting with the poll 
feature.  

Why don't you up your poll interval to 5 minutes and see if that increases the 
time Bacula waits before giving up.  If it does, then at least you have a 
work around -- increase the poll interval to be sufficiently long that it 
will not fail, or disable the polling and simply mount the drive.
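For reference, the two options above might look like this in the Device resource quoted below. This is a minimal sketch, assuming the rest of the resource stays as posted. In Bacula, a Volume Poll Interval of zero (the default) turns polling off, after which the new tape has to be mounted by hand from the console. If Steve's guess of a fixed number of polls is right, going from 1 min to 5 min should stretch the roughly 25-minute wait to about two hours (about 25 polls at 5 min each), but that is a hypothesis to test, not a known limit.

```conf
# Option A (hypothesis test): poll less often, so that any cap on the
# number of polls translates into a longer total wait.
Device {
  Name = DDS4
  # ... other directives as posted ...
  Volume Poll Interval = 5 min
}

# Option B: disable polling (0 is the default) and mount the new tape
# manually from the console once it is in the drive.
Device {
  Name = DDS4
  # ... other directives as posted ...
  Volume Poll Interval = 0
}
```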

In the meantime, I'll take a look at the various insanity checks that I have 
(particularly any that I added) ...

On Sunday 13 November 2005 19:42, Steve Ellis wrote:
 I'd been eagerly awaiting 1.38, as well as eagerly awaiting a much better
 tape drive, now I have both, which made me very excited (both the new
 features in 1.38 and the LTO2 drive that is replacing my DDS4 are
 _extremely_ cool), but see a different and undesirable behavior with 1.38
 (even on my old tape drive).

 A little background:

 My server runs headless downstairs in the garage, I usually get to the
 bacula console from my desktop machine upstairs, and I don't have an
 autoloader.  Consequently, it is much more convenient if bacula spits out
 the tape that it doesn't want (if it is full, or I forgot to change it),
 and waits for me to insert the correct tape.  Previously, in 1.36.?, with
 my config (below), bacula would patiently wait a long, long time for me to
 get around to giving it the tape it wanted.  When I gave it the tape it
 wanted, it would automatically mount and start using it.  Now, it looks
 like it is only willing to wait about 25 minutes before giving up, and if
 the drive is unloaded, all subsequent jobs (requiring the same device)
 fail 20 or so minutes after they start too.


 My guess is that there is now a limit on the number of times bacula will
 poll the device waiting for the new tape, and since I've set a pretty
 short poll interval (1 minute), it gives up too easily.

 Actually, I believe this was a problem in an earlier release, which Kern
 fixed when I ran into it, and the fix was in the 1.36 build I was using
 (which I hope wasn't my own local customization).

 At any rate, anyone who wants to operate their drive in the way I do will
 hit this problem if they are not quick in putting in the correct tape,
 unless there is a config file option to control the number of polls of
 which I am not aware (I did look in the manual section for the device
 configuration and didn't see anything).  If there is another way to
 accomplish what I want, or even something close to what I want, I'd like
 to hear about it.


 -se

 Here's the relevant clip from my bacula-sd.conf:

 Device {
   Name = DDS4
   Media Type = DDS-4
   Archive Device = /dev/nst1
   Automatic Mount = Yes   # when device opened, read it
   Always Open = Yes
   Volume Poll Interval = 1 min
   Close On Poll = Yes
   Offline On Unmount = Yes
   Removable Media = Yes
   Random Access = No
   Maximum Spool Size = 10737418240
   Spool Directory = /backup/bacula/spool
   Alert Command = sh -c 'tapeinfo -f %c |grep TapeAlert|cat'
   Maximum Network Buffer Size = 262144
 }

-- 
Best regards,

Kern

  (
  /\
  V_V




Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36

2005-11-13 Thread Steve Ellis

Kern Sibbald said:
 Hello again,

 You didn't by any chance recently upgrade from a 2.4 kernel to a 2.6
 kernel, did you? I am seeing all kinds of hangs and other funny behavior
 in the Storage daemon due to the change in the behavior of the open()
 call for tape drives from one kernel to another.

Thanks for looking at this so quickly Kern-

No, I am running a 2.6 kernel, but I have been running it for 18 months or
so.  I'm running a vintage Fedora Core 2 release--too lazy (and afraid) to
upgrade on this system that is critical to my home network.  There has not
been a new Core2 kernel in quite some time--my last kernel upgrade was in
March, which I'm positive I was running, at least by August (I know I
rebooted about that time).

I'm a networking software engineer, so although I have a lot of capability
to maintain, fix and debug a lot of stuff here at home, I don't have much
in the way of spare time--consequently, I tend to keep using things if
they are still working.  I did want to switch to Bacula 1.38, LTO2 and
Fedora Core4, but have so far only done the first upgrade (bacula).  I saw
messages on bacula-users about recent 2.6 changes, and was hoping that any
dust would have settled by the time I got there (presumably when I get
around to FC4--or FC5, if I continue to put it off any longer).

If it would help, I can turn on some sd logging, or something.  The poll
interval suggestion will probably work for me for now, especially once I
get the LTO2 drive online, making nearly all of my backups a 1 tape
affair.

Thanks!

-- 
-se

Steve Ellis

