Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36
Hello,

On Monday 14 November 2005 00:47, Steve Ellis wrote:
> Kern Sibbald said:
> > Hello again,
> >
> > You didn't by any chance recently upgrade from a 2.4 kernel to a 2.6
> > kernel did you? I am seeing all kinds of hangs and other funny behavior
> > in the Storage daemon due to the change in the behavior of the open()
> > call for tape drives from one kernel to another.
>
> Thanks for looking at this so quickly, Kern.

Well, if something is fundamentally broken, I would like to fix it, and I'm just now testing 1.38.1 for release.

> No, I am running a 2.6 kernel, but I have been running it for 18 months
> or so. I'm running a vintage Fedora Core 2 release--too lazy (and afraid)
> to upgrade on this system that is critical to my home network. There has
> not been a new Core 2 kernel in quite some time--my last kernel upgrade
> was in March, which I'm positive I was running, at least by August (I
> know I rebooted about that time).
>
> I'm a networking software engineer, so although I have a lot of
> capability to maintain, fix and debug a lot of stuff here at home, I
> don't have much in the way of spare time--consequently, I tend to keep
> using things if they are still working. I did want to switch to Bacula
> 1.38, LTO2 and Fedora Core 4, but have so far only done the first upgrade
> (Bacula). I saw messages on bacula-users about recent 2.6 changes, and
> was hoping that any dust would have settled by the time I got there
> (presumably when I get around to FC4--or FC5, if I continue to put it off
> any longer).
>
> If it would help, I can turn on some sd logging, or something. The poll
> interval suggestion will probably work for me for now, especially once I
> get the LTO2 drive online, making nearly all of my backups a 1 tape
> affair.

After looking into this a bit here (I still have more testing to do), I am more and more convinced that your problem is due to the kernel change.
Basically, what I see is that if there is no tape in the drive, the open() call blocks either in the OS or in Bacula code, it then fails at some point, and your job is terminated. The old behavior of the OS was to always permit open() on the drive regardless of whether or not there was a tape in it.

I don't know when the change occurred -- i.e. what version of the kernel. Given the new kernel development mode, it is very likely that it came during one of the various 2.6.x releases. There is a certain logic in what they have changed, but IMO, it is a perverse way of dealing with the situation (no tape in the drive), and will cause all kinds of problems.

If increasing the poll time works for you, OK, but after the tests I did here, I don't really think it will work. The real fix is going to take a major redesign of Bacula, which currently expects to always open a drive, and when it cannot, it fails the job.

There are two workarounds for this situation that I see at the current time:

1. Remove the Offline On Unmount directive. This will leave the old tape in the drive and allow Bacula to continue to open the drive. However, you should probably set your poll time to 5 minutes so it doesn't wear the tape too much (I think that most modern tape drivers don't even re-read the tape. They simply cache the first block and keep returning it).

2. If you keep the Offline On Unmount, you can probably prevent the failure by increasing the Maximum Open Wait to some large value. This will cause Bacula to continue to try to open the drive even if it fails. I find this solution a bit less satisfactory than the above.

I still have not run tests to see if the polling is broken in 1.38, which is a possibility since the code that does the waiting was moved around and enhanced. My previous tests simulated your situation (no tape in the drive) and never got very far because the OS prevented the drive from being opened, and thus the polling code was never used.
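As a rough sketch only, the two workarounds could look like this when applied to the DDS4 Device resource from the bacula-sd.conf quoted elsewhere in this thread. The directive names are the ones mentioned in the thread; the 8 hours value for Maximum Open Wait is purely an illustrative assumption (the message only says "some large value"), and none of this has been tested against 1.38.

```
Device {
  Name = DDS4
  Archive Device = /dev/nst1
  # ... remaining directives unchanged from the original resource ...

  # Workaround 1: leave the old tape loaded so open() keeps succeeding,
  # and poll less often to limit tape wear.
  Offline On Unmount = No          # was Yes
  Volume Poll Interval = 5 min     # was 1 min

  # Workaround 2 (alternative to the above, not in addition to it):
  # keep ejecting the tape, but let the Storage daemon keep retrying
  # the open for much longer before it fails the job.
  # Offline On Unmount = Yes
  # Maximum Open Wait = 8 hours    # assumed value, pick one that suits you
}
```

Only one of the two variants should be active at a time. After changing the resource, the Storage daemon needs to be restarted for the new values to take effect.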
--
Best regards,

Kern

---
SF.Net email is sponsored by: Tame your development challenges with Apache's Geronimo App Server. Download it for free - -and be entered to win a 42 plasma tv or your very own Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36
Kern> You didn't by any chance recently upgrade from a 2.4 kernel to a
Kern> 2.6 kernel did you? I am seeing all kinds of hangs and other
Kern> funny behavior in the Storage daemon due to the change in the
Kern> behavior of the open() call for tape drives from one kernel to
Kern> another.

When my DLT 7k was alive, I was only running Linux 2.6 kernels and I never had a problem using the system. Now that my drive hangs the bus on EOT, I've been driveless and without backups. Ouch!

So take my words with a grain of salt... but I think Linux kernel 2.6 is fine for SCSI tapes. The drive hung on a Solaris system as well...
Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36
Hello,

I might well have added some additional insanity checks to guard against bad tapes or bad tape drives, and this could be interacting with the poll feature.

Why don't you up your poll interval to 5 minutes and see if that increases the time Bacula waits before giving up. If it does, then at least you have a workaround -- increase the poll interval to be sufficiently long that it will not fail, or disable the polling and simply mount the drive.

In the meantime, I'll take a look at the various insanity checks that I have (particularly any that I added) ...

On Sunday 13 November 2005 19:42, Steve Ellis wrote:
> I'd been eagerly awaiting 1.38, as well as eagerly awaiting a much better
> tape drive. Now I have both, which made me very excited (both the new
> features in 1.38 and the LTO2 drive that is replacing my DDS4 are
> _extremely_ cool), but I see a different and undesirable behavior with
> 1.38 (even on my old tape drive).
>
> A little background: My server runs headless downstairs in the garage, I
> usually get to the bacula console from my desktop machine upstairs, and I
> don't have an autoloader. Consequently, it is much more convenient if
> bacula spits out the tape that it doesn't want (if it is full, or I
> forgot to change it), and waits for me to insert the correct tape.
>
> Previously, in 1.36.?, with my config (below), bacula would patiently
> wait a long, long time for me to get around to giving it the tape it
> wanted. When I gave it the tape it wanted, it would automatically mount
> and start using it. Now, it looks like it is only willing to wait about
> 25 minutes before giving up, and if the drive is unloaded, all subsequent
> jobs (requiring the same device) fail 20 or so minutes after they start
> too.
>
> My guess is that there is now a limit on the number of times bacula will
> poll the device waiting for the new tape, and since I've set a pretty
> short poll interval (1 minute), it gives up too easily.
> Actually, I believe this was a problem in an earlier release, which Kern
> fixed when I saw it, but it was fixed in the 1.36 build I was using
> (which I hope wasn't my own local customization).
>
> At any rate, anyone who wants to operate their drive in the way I do will
> hit this problem if they are not quick in putting in the correct tape,
> unless there is a config file option to control the number of polls of
> which I am not aware (I did look in the manual section for the device
> configuration and didn't see anything).
>
> If there is another way to accomplish what I want, or even something
> close to what I want, I'd like to hear about it.
>
> -se
>
> Here's the relevant clip from my bacula-sd.conf:
>
> Device {
>   Name = DDS4
>   Media Type = DDS-4
>   Archive Device = /dev/nst1
>   Automatic Mount = Yes          # when device opened, read it
>   Always Open = Yes
>   Volume Poll Interval = 1 min
>   Close On Poll = Yes
>   Offline On Unmount = Yes
>   Removable Media = Yes
>   Random Access = No
>   Maximum Spool Size = 10737418240
>   Spool Directory = /backup/bacula/spool
>   Alert Command = sh -c 'tapeinfo -f %c |grep TapeAlert|cat'
>   Maximum Network Buffer Size = 262144
> }

--
Best regards,

Kern
Re: [Bacula-users] Different and undesirable behavior with 1.38 than with 1.36
Kern Sibbald said:
> Hello again,
>
> You didn't by any chance recently upgrade from a 2.4 kernel to a 2.6
> kernel did you? I am seeing all kinds of hangs and other funny behavior
> in the Storage daemon due to the change in the behavior of the open()
> call for tape drives from one kernel to another.

Thanks for looking at this so quickly, Kern.

No, I am running a 2.6 kernel, but I have been running it for 18 months or so. I'm running a vintage Fedora Core 2 release--too lazy (and afraid) to upgrade on this system that is critical to my home network. There has not been a new Core 2 kernel in quite some time--my last kernel upgrade was in March, which I'm positive I was running, at least by August (I know I rebooted about that time).

I'm a networking software engineer, so although I have a lot of capability to maintain, fix and debug a lot of stuff here at home, I don't have much in the way of spare time--consequently, I tend to keep using things if they are still working. I did want to switch to Bacula 1.38, LTO2 and Fedora Core 4, but have so far only done the first upgrade (Bacula). I saw messages on bacula-users about recent 2.6 changes, and was hoping that any dust would have settled by the time I got there (presumably when I get around to FC4--or FC5, if I continue to put it off any longer).

If it would help, I can turn on some sd logging, or something. The poll interval suggestion will probably work for me for now, especially once I get the LTO2 drive online, making nearly all of my backups a 1 tape affair.

Thanks!

--
-se
Steve Ellis