Re: [Bacula-users] Bacula BETA 1.38.3 (14 December 2005) released

2005-12-19 Thread Kern Sibbald
On Monday 19 December 2005 18:54, Rick Knight wrote:
> Kern Sibbald wrote:
> >Hello,
> >
> >I have released the second BETA version 1.38.3 (14 December 2005) as a tar
> >file to Source Forge.  This version has a rewrite of the reservation
> >algorithm that hopefully will improve situations where users were finding
> > all jobs waiting to reserve a drive.  I've also reworked the way Bacula
> > opens a drive, so it is more likely to succeed.
> >
> >Changes since the last beta are:
> >
> >14Dec05
> >- Correct reservation system to do a last ditch try
> >  for any mounted volume, then anyone anywhere.
> >- Add quotes around table Version because of
> >  error in MySQL 4.1.15 -- bug report submitted.
> >- Correct some minor problems with btape in the fill
> >  command.
> >- Updates to ssh-tunnel from Joshua Kugler.
> >- Added a report.pl program from Jonas Bjorklund.
> >- Simplify the O_NONBLOCK open() code for tape drives,
> >  and always open nonblocking.
> >- Do not wait for open() if EIO returned (shouldn't happen).
> >- Eliminate 3 argument to tape open().
> >- Correct the slot # edited in the 3995 Bad autochanger unload
> >  message.
> >- With -S on bscan (show progress) do not divide by zero.
> >13Dec05
> >- Make cancel pthread_cond_signal() pthread_cond_broadcast().
> >- When dcr is freed, also broadcast dev->wait_next_vol signal.
> >- Remove unused code in wait_for_device.
> >- Make wait_for_device() always return after 120 seconds of wait.
> >12Dec05
> >- Use localhost if no network configured
> >11Dec05
> >- Eliminated duplicate MaxVolBytes in cat update -- bug 509.
> >- Remove debug print.
> >- Add bail_out in error during state file reading.
> >
> >Best regards,
> >
> >Kern
> >
> >
> >---
> >This SF.net email is sponsored by: Splunk Inc. Do you grep through log
> > files for problems?  Stop!  Download the new AJAX search engine that
> > makes searching your log files as easy as surfing the  web.  DOWNLOAD
> > SPLUNK! http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
> >___
> >Bacula-users mailing list
> >Bacula-users@lists.sourceforge.net
> >https://lists.sourceforge.net/lists/listinfo/bacula-users
>
> Kern,
>
> Last night's backups ran perfectly. No "waiting to reserve" or any
> other errors.

Thanks for the feedback.  It is nice to hear that it is now working better.

I'll either release a 3rd beta with more corrections before the end of the 
week, or will go directly to 1.38.3 ...

-- 
Best regards,

Kern

  (">
  /\
  V_V




Re: [Bacula-users] Bacula BETA 1.38.3 (14 December 2005) released

2005-12-19 Thread Rick Knight

Kern Sibbald wrote:


Hello,

I have released the second BETA version 1.38.3 (14 December 2005) as a tar 
file to Source Forge.  This version has a rewrite of the reservation 
algorithm that hopefully will improve situations where users were finding all 
jobs waiting to reserve a drive.  I've also reworked the way Bacula opens a 
drive, so it is more likely to succeed.


Changes since the last beta are:

14Dec05
- Correct reservation system to do a last ditch try
 for any mounted volume, then anyone anywhere.
- Add quotes around table Version because of 
 error in MySQL 4.1.15 -- bug report submitted.

- Correct some minor problems with btape in the fill
 command.
- Updates to ssh-tunnel from Joshua Kugler.
- Added a report.pl program from Jonas Bjorklund.
- Simplify the O_NONBLOCK open() code for tape drives,
 and always open nonblocking.
- Do not wait for open() if EIO returned (shouldn't happen).
- Eliminate 3 argument to tape open().
- Correct the slot # edited in the 3995 Bad autochanger unload
 message.
- With -S on bscan (show progress) do not divide by zero.
13Dec05
- Make cancel pthread_cond_signal() pthread_cond_broadcast().
- When dcr is freed, also broadcast dev->wait_next_vol signal.
- Remove unused code in wait_for_device.
- Make wait_for_device() always return after 120 seconds of wait.
12Dec05
- Use localhost if no network configured
11Dec05
- Eliminated duplicate MaxVolBytes in cat update -- bug 509.
- Remove debug print.
- Add bail_out in error during state file reading.
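
The two 13Dec05 waiting changes listed above (broadcast instead of signal, and a wait that always returns after 120 seconds) can be illustrated with a minimal Python sketch. It uses threading.Condition as a stand-in for the pthread condition variable; the class and method names are hypothetical, and this is not Bacula's actual code.

```python
import threading


class DeviceWaiter:
    """Toy model of several jobs waiting on one device (hypothetical names)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._released = False

    def wait_for_device(self, timeout=120.0):
        """Block until the device is released or `timeout` seconds pass.

        Returns True if the device was released, False on timeout.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._released, timeout=timeout)

    def release(self):
        """Wake every waiter, not just one (the broadcast analogue)."""
        with self._cond:
            self._released = True
            self._cond.notify_all()
```

With notify() instead of notify_all(), only one of several waiting jobs would be woken; the others would sit until their own timeout expired, which is roughly the symptom a broadcast avoids.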

Best regards,

Kern




Kern,

Last night's backups ran perfectly. No "waiting to reserve" or any
other errors.


Thanks,
Rick Knight




Re: [Bacula-users] Bacula BETA 1.38.3 (14 December 2005) released

2005-12-17 Thread Rick Knight

Kern Sibbald wrote:


Hello,

I have released the second BETA version 1.38.3 (14 December 2005) as a tar 
file to Source Forge.  This version has a rewrite of the reservation 
algorithm that hopefully will improve situations where users were finding all 
jobs waiting to reserve a drive.  I've also reworked the way Bacula opens a 
drive, so it is more likely to succeed.


Changes since the last beta are:

14Dec05
- Correct reservation system to do a last ditch try
 for any mounted volume, then anyone anywhere.
- Add quotes around table Version because of 
 error in MySQL 4.1.15 -- bug report submitted.

- Correct some minor problems with btape in the fill
 command.
- Updates to ssh-tunnel from Joshua Kugler.
- Added a report.pl program from Jonas Bjorklund.
- Simplify the O_NONBLOCK open() code for tape drives,
 and always open nonblocking.
- Do not wait for open() if EIO returned (shouldn't happen).
- Eliminate 3 argument to tape open().
- Correct the slot # edited in the 3995 Bad autochanger unload
 message.
- With -S on bscan (show progress) do not divide by zero.
13Dec05
- Make cancel pthread_cond_signal() pthread_cond_broadcast().
- When dcr is freed, also broadcast dev->wait_next_vol signal.
- Remove unused code in wait_for_device.
- Make wait_for_device() always return after 120 seconds of wait.
12Dec05
- Use localhost if no network configured
11Dec05
- Eliminated duplicate MaxVolBytes in cat update -- bug 509.
- Remove debug print.
- Add bail_out in error during state file reading.
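
The 14Dec05 items about always opening the tape drive nonblocking, and not waiting when open() returns EIO, can be sketched as follows. This is a Python illustration for a POSIX system only; the function names and the retry policy for errors other than EIO are assumptions, not taken from the Bacula source.

```python
import errno
import os


def open_tape_nonblocking(path):
    """Open `path` read/write with O_NONBLOCK (POSIX only).

    Returns (fd, None) on success or (None, errno_value) on failure.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    except OSError as exc:
        return None, exc.errno
    return fd, None


def should_retry(err):
    """Assumed policy: retry on other errors, but give up at once on EIO."""
    return err is not None and err != errno.EIO
```

A caller would loop on open_tape_nonblocking() while should_retry() is true, sleeping between attempts, and stop immediately on EIO.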

Best regards,

Kern




Kern,

I just installed the latest 1.38.3 and the problem appears to be solved.
Manual backups are running fine. I have an incremental job scheduled for
tonight and that will be a better test, but so far so good.


Thanks for a great tool and for the support.

Rick Knight




Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-13 Thread Kern Sibbald
On Tuesday 13 December 2005 04:05, Rob wrote:
> What happened is that the upgrade to 1.38 overwrote my modified mtx-changer
> script with the default, 

Hmmm. Bacula really should not overwrite a script that has been changed by the
user; at the same time, any new changes may be important.  Perhaps this is
something I need to modify ...

> so you are correct, that was the problem. Still 
> leaves the strange error message though.

That is a simple "typo" error -- the wrong variable was being displayed, so 
you can ignore the fact that the Slot number was not correct.  There was an 
error however.

Thanks for the feedback.

>
> Thanks,
> Rob
>
>
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Kern Sibbald
> Sent: Monday, December 12, 2005 2:46 PM
> To: bacula-users@lists.sourceforge.net
> Cc: Rob
> Subject: Re: [Bacula-users] Bacula BETA 1.38.3
>
> On Monday 12 December 2005 20:10, Rob wrote:
> > FYI, I haven't had time to look into it much, but I have been seeing errors
> > with my auto changer since 1.38.1 that I had never seen with 1.36.*
> > before that look a lot like these. As Kern said, as if something seems to
> > be missing from the log, see:
> >
> > 04-Dec 03:34 bug-sd: End of Volume "NJO008D" at 80:11492 on device
> > "Drive-1" (/dev/nst0). Write of 64512 bytes got -1.
> > 04-Dec 03:35 bug-sd: spider.2005-12-04_03.05.04 Error: Re-read of last
> > block failed. Last block=80530 Current block=14717.
> > 04-Dec 03:35 bug-sd: End of medium on Volume "NJO008D"
> > Bytes=45,428,287,520 Blocks=704,222 at 04-Dec-2005 03:35.
> > 04-Dec 03:35 bug-sd: 3301 Issuing autochanger "loaded drive 0" command.
> > 04-Dec 03:35 bug-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
> > 04-Dec 03:35 bug-sd: 3307 Issuing autochanger "unload slot 8, drive 0"
> > command.
> > 04-Dec 03:35 bug-sd: 3995 Bad autochanger "unload slot 9, drive 0":
> > ERR=Child exited with code 1.
> > 04-Dec 03:35 bug-sd: Please mount Volume "NJO009D" on Storage Device
> > "Drive-1" (/dev/nst0) for Job spider.2005-12-04_03.05.04
>
> I'm beginning to think that the error message that edits the slot number is
> just broken.  The error you are seeing is because there is a problem with
> your mtx-changer script.  The error the previous person was seeing was
> because of a misconfiguration (due to incorrect documentation).
>
> > Rob
> >
> > -Original Message-
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Kern Sibbald
> > Sent: Monday, December 12, 2005 9:20 AM
> > To: bacula-users@lists.sourceforge.net
> > Cc: Volker Dierks
> > Subject: Re: [Bacula-users] Bacula BETA 1.38.3
> >
> > On Monday 12 December 2005 12:52, Volker Dierks wrote:
> > > Hello,
> > >
> > > Volker Dierks wrote:
> > > >> Usually, I'd see if the problem can be reproduced with the existing
> > > >> system setup. If that's possible, I'd first check if the actual
> > > >> cause might be purely SCSI device related.
> > > >
> > > > That's what I'm going to do first. I'll create the second pool again
> > > > (with the same tapes) and put all nodes into that pool ...
> > >
> > > I've done this tonight .. in turn:
> > > - the backup started as planned on drive two with the same tape as
> > >   Thursday (the tape was already mounted so no mtx stuff took place)
> > > - after some minutes (and 500 MB of data written to that tape) everything
> > >   hung again .. so I restarted everything and disabled that tape
> > > - I mounted the next tape and started the backup again. After 7 GB of
> > >   data written to that tape (and 5 nodes backed up successfully) I went
> > >   to bed.
> > >
> > > Up to this point, it looked like the problems were truly caused by the tape.
> > > But this morning I got the following mail:
> > > 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: block.c:538
> > > Write error at 12:5438 on device "Drive-2" (/dev/nst1).
> > > ERR=Input/output error. 12-Dec 03:24 mw-mcs-sd:
> > > nfs-1.2005-12-12_02.15.08 Error: Error writing final EOF to tape. This
> > > Volume may not be readable. dev.c:1553 ioctl MTWEOF error on "Drive-2"
> > > (/dev/nst1). ERR=No such device or address. 12-Dec 03:24
> >
> > Unles

RE: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Rob
What happened is that the upgrade to 1.38 overwrote my modified mtx-changer
script with the default, so you are correct, that was the problem. Still
leaves the strange error message though.

Thanks,
Rob


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Kern Sibbald
Sent: Monday, December 12, 2005 2:46 PM
To: bacula-users@lists.sourceforge.net
Cc: Rob
Subject: Re: [Bacula-users] Bacula BETA 1.38.3

On Monday 12 December 2005 20:10, Rob wrote:
> FYI, I haven't had time to look into it much, but I have been seeing errors
> with my auto changer since 1.38.1 that I had never seen with 1.36.* before
> that look a lot like these. As Kern said, as if something seems to be
> missing from the log, see:
>
> 04-Dec 03:34 bug-sd: End of Volume "NJO008D" at 80:11492 on device
> "Drive-1" (/dev/nst0). Write of 64512 bytes got -1.
> 04-Dec 03:35 bug-sd: spider.2005-12-04_03.05.04 Error: Re-read of last
> block failed. Last block=80530 Current block=14717.
> 04-Dec 03:35 bug-sd: End of medium on Volume "NJO008D" Bytes=45,428,287,520
> Blocks=704,222 at 04-Dec-2005 03:35.
> 04-Dec 03:35 bug-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 04-Dec 03:35 bug-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
> 04-Dec 03:35 bug-sd: 3307 Issuing autochanger "unload slot 8, drive 0"
> command.
> 04-Dec 03:35 bug-sd: 3995 Bad autochanger "unload slot 9, drive 0":
> ERR=Child exited with code 1.
> 04-Dec 03:35 bug-sd: Please mount Volume "NJO009D" on Storage Device
> "Drive-1" (/dev/nst0) for Job spider.2005-12-04_03.05.04

I'm beginning to think that the error message that edits the slot number is 
just broken.  The error you are seeing is because there is a problem with 
your mtx-changer script.  The error the previous person was seeing was 
because of a misconfiguration (due to incorrect documentation).

>
> Rob
>
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Kern Sibbald
> Sent: Monday, December 12, 2005 9:20 AM
> To: bacula-users@lists.sourceforge.net
> Cc: Volker Dierks
> Subject: Re: [Bacula-users] Bacula BETA 1.38.3
>
> On Monday 12 December 2005 12:52, Volker Dierks wrote:
> > Hello,
> >
> > Volker Dierks wrote:
> > >> Usually, I'd see if the problem can be reproduced with the existing
> > >> system setup. If that's possible, I'd first check if the actual cause
> > >> might be purely SCSI device related.
> > >
> > > That's what I'm going to do first. I'll create the second pool again
> > > (with the same tapes) and put all nodes into that pool ...
> >
> > I've done this tonight .. in turn:
> > - the backup started as planned on drive two with the same tape as
> >   Thursday (the tape was already mounted so no mtx stuff took place)
> > - after some minutes (and 500 MB of data written to that tape) everything
> >   hung again .. so I restarted everything and disabled that tape
> > - I mounted the next tape and started the backup again. After 7 GB of
> >   data written to that tape (and 5 nodes backed up successfully) I went
> >   to bed.
> >
> > Up to this point, it looked like the problems were truly caused by the tape.
> > But this morning I got the following mail:
> > 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: block.c:538
> > Write error at 12:5438 on device "Drive-2" (/dev/nst1). ERR=Input/output
> > error. 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: Error
> > writing final EOF to tape. This Volume may not be readable. dev.c:1553
> > ioctl MTWEOF error on "Drive-2" (/dev/nst1). ERR=No such device or
> > address. 12-Dec 03:24
>
> Unless you have 7GB tapes, this looks like a hardware problem: bad media,
> dirty tape drive, bad drive, bad SCSI cables (or improperly installed), bad
> SCSI card, ...
>
> These kinds of problems typically generate a number of kernel (SCSI)
> messages in the log.
>
> > mw-mcs-sd: End of medium on Volume "MW-MCS-1-12" Bytes=7,078,064,979
> > Blocks=109,722 at 12-Dec-2005 03:24. 12-Dec 03:24 mw-mcs-sd: 3301 Issuing
> > autochanger "loaded drive 1" command. 12-Dec 03:24 mw-mcs-sd: 3302
> > Autochanger "loaded drive 1", result is Slot 12. 12-Dec 04:10 mw-mcs-sd:
> > 3307 Issuing autochanger "unload slot 12, drive 1" command. 12-Dec 04:14
> > mw-mcs-sd: 3995 Bad autochanger "unload slot 13, drive 1": ERR=Child died
> > from signal 15: Termination.
>
> This looks like you don't have your autoch

Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Volker Dierks

Kern Sibbald schrieb:


On Monday 12 December 2005 20:10, Rob wrote:
 


FYI, I haven't had time to look into it much, but I have been seeing errors
with my auto changer since 1.38.1 that I had never seen with 1.36.* before
that look a lot like these. As Kern said, as if something seems to be
missing from the log, see:

04-Dec 03:34 bug-sd: End of Volume "NJO008D" at 80:11492 on device
"Drive-1" (/dev/nst0). Write of 64512 bytes got -1.
04-Dec 03:35 bug-sd: spider.2005-12-04_03.05.04 Error: Re-read of last
block failed. Last block=80530 Current block=14717.
04-Dec 03:35 bug-sd: End of medium on Volume "NJO008D" Bytes=45,428,287,520
Blocks=704,222 at 04-Dec-2005 03:35.
04-Dec 03:35 bug-sd: 3301 Issuing autochanger "loaded drive 0" command.
04-Dec 03:35 bug-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
04-Dec 03:35 bug-sd: 3307 Issuing autochanger "unload slot 8, drive 0"
command.
04-Dec 03:35 bug-sd: 3995 Bad autochanger "unload slot 9, drive 0":
ERR=Child exited with code 1.
04-Dec 03:35 bug-sd: Please mount Volume "NJO009D" on Storage Device
"Drive-1" (/dev/nst0) for Job spider.2005-12-04_03.05.04
   



I'm beginning to think that the error message that edits the slot number is 
just broken.  The error you are seeing is because there is a problem with 
your mtx-changer script.  The error the previous person was seeing was 
because of a misconfiguration (due to incorrect documentation).
 

Sorry, but I've fooled you. The "Maximum Changer Wait = ..." option was
added to the attached configuration this morning. Everything posted below
was without this configuration directive. Sorry ...

Volker

 


Rob

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Kern Sibbald
Sent: Monday, December 12, 2005 9:20 AM
To: bacula-users@lists.sourceforge.net
Cc: Volker Dierks
Subject: Re: [Bacula-users] Bacula BETA 1.38.3

On Monday 12 December 2005 12:52, Volker Dierks wrote:
   


Hello,

Volker Dierks wrote:
 


Usually, I'd see if the problem can be reproduced with the existing
system setup. If that's possible, I'd first check if the actual cause
might be purely SCSI device related.
 


That's what I'm going to do first. I'll create the second pool again
(with the same tapes) and put all nodes into that pool ...
   


I've done this tonight .. in turn:
- the backup started as planned on drive two with the same tape as
 Thursday (the tape was already mounted so no mtx stuff took place)
- after some minutes (and 500 MB of data written to that tape) everything
 hung again .. so I restarted everything and disabled that tape
- I mounted the next tape and started the backup again. After 7 GB of
 data written to that tape (and 5 nodes backed up successfully) I went
 to bed.

Up to this point, it looked like the problems were truly caused by the tape.
But this morning I got the following mail:
12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: block.c:538
Write error at 12:5438 on device "Drive-2" (/dev/nst1). ERR=Input/output
error. 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: Error
writing final EOF to tape. This Volume may not be readable. dev.c:1553
ioctl MTWEOF error on "Drive-2" (/dev/nst1). ERR=No such device or address.
12-Dec 03:24

Unless you have 7GB tapes, this looks like a hardware problem: bad media,
dirty tape drive, bad drive, bad SCSI cables (or improperly installed), bad
SCSI card, ...

These kinds of problems typically generate a number of kernel (SCSI)
messages in the log.

   


mw-mcs-sd: End of medium on Volume "MW-MCS-1-12" Bytes=7,078,064,979
Blocks=109,722 at 12-Dec-2005 03:24. 12-Dec 03:24 mw-mcs-sd: 3301 Issuing
autochanger "loaded drive 1" command. 12-Dec 03:24 mw-mcs-sd: 3302
Autochanger "loaded drive 1", result is Slot 12. 12-Dec 04:10 mw-mcs-sd:
3307 Issuing autochanger "unload slot 12, drive 1" command. 12-Dec 04:14
mw-mcs-sd: 3995 Bad autochanger "unload slot 13, drive 1": ERR=Child died
from signal 15: Termination.
 


This looks like you don't have your autochanger script properly configured
as one user pointed out -- setting the sleep longer may help.  However, I do
not understand why in one message it says "unload slot 12", then on the next
line it says "unload slot 13 ... ERR".  There seems to be something missing as
Bacula will normally issue a "loaded drive" or load a drive before unloading
it for a second time.

   


12-Dec 04:14 mw-mcs-sd: Please mount Volume
"MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job
nfs-1.2005-12-12_02.15.08 12-Dec 05:14 mw-mcs-sd: Please mount Volume
"MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job
nfs-1.2005-12-12_0

Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Kern Sibbald
On Monday 12 December 2005 20:10, Rob wrote:
> FYI, I haven't had time to look into it much, but I have been seeing errors
> with my auto changer since 1.38.1 that I had never seen with 1.36.* before
> that look a lot like these. As Kern said, as if something seems to be
> missing from the log, see:
>
> 04-Dec 03:34 bug-sd: End of Volume "NJO008D" at 80:11492 on device
> "Drive-1" (/dev/nst0). Write of 64512 bytes got -1.
> 04-Dec 03:35 bug-sd: spider.2005-12-04_03.05.04 Error: Re-read of last
> block failed. Last block=80530 Current block=14717.
> 04-Dec 03:35 bug-sd: End of medium on Volume "NJO008D" Bytes=45,428,287,520
> Blocks=704,222 at 04-Dec-2005 03:35.
> 04-Dec 03:35 bug-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 04-Dec 03:35 bug-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
> 04-Dec 03:35 bug-sd: 3307 Issuing autochanger "unload slot 8, drive 0"
> command.
> 04-Dec 03:35 bug-sd: 3995 Bad autochanger "unload slot 9, drive 0":
> ERR=Child exited with code 1.
> 04-Dec 03:35 bug-sd: Please mount Volume "NJO009D" on Storage Device
> "Drive-1" (/dev/nst0) for Job spider.2005-12-04_03.05.04

I'm beginning to think that the error message that edits the slot number is 
just broken.  The error you are seeing is because there is a problem with 
your mtx-changer script.  The error the previous person was seeing was 
because of a misconfiguration (due to incorrect documentation).

>
> Rob
>
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Kern Sibbald
> Sent: Monday, December 12, 2005 9:20 AM
> To: bacula-users@lists.sourceforge.net
> Cc: Volker Dierks
> Subject: Re: [Bacula-users] Bacula BETA 1.38.3
>
> On Monday 12 December 2005 12:52, Volker Dierks wrote:
> > Hello,
> >
> > Volker Dierks wrote:
> > >> Usually, I'd see if the problem can be reproduced with the existing
> > >> system setup. If that's possible, I'd first check if the actual cause
> > >> might be purely SCSI device related.
> > >
> > > That's what I'm going to do first. I'll create the second pool again
> > > (with the same tapes) and put all nodes into that pool ...
> >
> > I've done this tonight .. in turn:
> > - the backup started as planned on drive two with the same tape as
> >   Thursday (the tape was already mounted so no mtx stuff took place)
> > - after some minutes (and 500 MB of data written to that tape) everything
> >   hung again .. so I restarted everything and disabled that tape
> > - I mounted the next tape and started the backup again. After 7 GB of
> >   data written to that tape (and 5 nodes backed up successfully) I went
> >   to bed.
> >
> > Up to this point, it looked like the problems were truly caused by the tape.
> > But this morning I got the following mail:
> > 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: block.c:538
> > Write error at 12:5438 on device "Drive-2" (/dev/nst1). ERR=Input/output
> > error. 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: Error
> > writing final EOF to tape. This Volume may not be readable. dev.c:1553
> > ioctl MTWEOF error on "Drive-2" (/dev/nst1). ERR=No such device or
> > address. 12-Dec 03:24
>
> Unless you have 7GB tapes, this looks like a hardware problem: bad media,
> dirty tape drive, bad drive, bad SCSI cables (or improperly installed), bad
> SCSI card, ...
>
> These kinds of problems typically generate a number of kernel (SCSI)
> messages in the log.
>
> > mw-mcs-sd: End of medium on Volume "MW-MCS-1-12" Bytes=7,078,064,979
> > Blocks=109,722 at 12-Dec-2005 03:24. 12-Dec 03:24 mw-mcs-sd: 3301 Issuing
> > autochanger "loaded drive 1" command. 12-Dec 03:24 mw-mcs-sd: 3302
> > Autochanger "loaded drive 1", result is Slot 12. 12-Dec 04:10 mw-mcs-sd:
> > 3307 Issuing autochanger "unload slot 12, drive 1" command. 12-Dec 04:14
> > mw-mcs-sd: 3995 Bad autochanger "unload slot 13, drive 1": ERR=Child died
> > from signal 15: Termination.
>
> This looks like you don't have your autochanger script properly configured
> as one user pointed out -- setting the sleep longer may help.  However, I do
> not understand why in one message it says "unload slot 12", then on the next
> line it says "unload slot 13 ... ERR".  There seems to be something missing as
> Bacula will normally issue a "loaded drive" or load a drive before
> unloading it for a second tim

RE: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Rob
FYI, I haven't had time to look into it much, but since 1.38.1 I have been
seeing errors with my auto changer that I never saw with 1.36.*, and they
look a lot like these. As Kern said, something seems to be missing from the
log; see:

04-Dec 03:34 bug-sd: End of Volume "NJO008D" at 80:11492 on device "Drive-1"
(/dev/nst0). Write of 64512 bytes got -1.
04-Dec 03:35 bug-sd: spider.2005-12-04_03.05.04 Error: Re-read of last block
failed. Last block=80530 Current block=14717.
04-Dec 03:35 bug-sd: End of medium on Volume "NJO008D" Bytes=45,428,287,520
Blocks=704,222 at 04-Dec-2005 03:35.
04-Dec 03:35 bug-sd: 3301 Issuing autochanger "loaded drive 0" command.
04-Dec 03:35 bug-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
04-Dec 03:35 bug-sd: 3307 Issuing autochanger "unload slot 8, drive 0"
command.
04-Dec 03:35 bug-sd: 3995 Bad autochanger "unload slot 9, drive 0":
ERR=Child exited with code 1.
04-Dec 03:35 bug-sd: Please mount Volume "NJO009D" on Storage Device
"Drive-1" (/dev/nst0) for Job spider.2005-12-04_03.05.04

Rob

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Kern Sibbald
Sent: Monday, December 12, 2005 9:20 AM
To: bacula-users@lists.sourceforge.net
Cc: Volker Dierks
Subject: Re: [Bacula-users] Bacula BETA 1.38.3

On Monday 12 December 2005 12:52, Volker Dierks wrote:
> Hello,
>
> Volker Dierks wrote:
> >> Usually, I'd see if the problem can be reproduced with the existing
> >> system setup. If that's possible, I'd first check if the actual cause
> >> might be purely SCSI device related.
> >
> > That's what I'm going to do first. I'll create the second pool again
> > (with the same tapes) and put all nodes into that pool ...
>
> I've done this tonight .. in turn:
> - the backup started as planned on drive two with the same tape as
>   Thursday (the tape was already mounted so no mtx stuff took place)
> - after some minutes (and 500 MB of data written to that tape) everything
>   hung again .. so I restarted everything and disabled that tape
> - I mounted the next tape and started the backup again. After 7 GB of
>   data written to that tape (and 5 nodes backed up successfully) I went
>   to bed.
>
> Up to this point, it looked like the problems were truly caused by the tape.
> But this morning I got the following mail:
> 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: block.c:538 Write
> error at 12:5438 on device "Drive-2" (/dev/nst1). ERR=Input/output error.
> 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: Error writing
> final EOF to tape. This Volume may not be readable. dev.c:1553 ioctl MTWEOF
> error on "Drive-2" (/dev/nst1). ERR=No such device or address. 12-Dec 03:24

Unless you have 7GB tapes, this looks like a hardware problem: bad media, 
dirty tape drive, bad drive, bad SCSI cables (or improperly installed), bad 
SCSI card, ...

These kinds of problems typically generate a number of kernel (SCSI)
messages in the log.

> mw-mcs-sd: End of medium on Volume "MW-MCS-1-12" Bytes=7,078,064,979
> Blocks=109,722 at 12-Dec-2005 03:24. 12-Dec 03:24 mw-mcs-sd: 3301 Issuing
> autochanger "loaded drive 1" command. 12-Dec 03:24 mw-mcs-sd: 3302
> Autochanger "loaded drive 1", result is Slot 12. 12-Dec 04:10 mw-mcs-sd:
> 3307 Issuing autochanger "unload slot 12, drive 1" command. 12-Dec 04:14
> mw-mcs-sd: 3995 Bad autochanger "unload slot 13, drive 1": ERR=Child died
> from signal 15: Termination. 

This looks like you don't have your autochanger script properly configured
as one user pointed out -- setting the sleep longer may help.  However, I do
not understand why in one message it says "unload slot 12", then on the next
line it says "unload slot 13 ... ERR".  There seems to be something missing as
Bacula will normally issue a "loaded drive" or load a drive before unloading
it for a second time.
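
The slot discrepancy described above (a 3307 "unload slot N" command followed by a 3995 error naming a different slot) can be spotted mechanically in a Storage daemon log. The following is a minimal Python sketch; the regular expression is derived only from the message texts quoted in this thread, and the function name is hypothetical. It expects log text without mail quote markers.

```python
import re

# Matches both the command message (3307) and the error message (3995).
EVENT_RE = re.compile(
    r'(3307 Issuing|3995 Bad) autochanger "unload slot (\d+), drive (\d+)"'
)


def find_slot_mismatches(log_text):
    """Return (issued_slot, reported_slot, drive) for each 3995 error whose
    slot differs from the latest 3307 unload command on the same drive."""
    text = " ".join(log_text.split())  # undo line wrapping inside messages
    last_issued = {}  # drive -> slot named in the latest 3307 message
    mismatches = []
    for kind, slot, drive in EVENT_RE.findall(text):
        slot, drive = int(slot), int(drive)
        if kind.startswith("3307"):
            last_issued[drive] = slot
        elif drive in last_issued and last_issued[drive] != slot:
            mismatches.append((last_issued[drive], slot, drive))
    return mismatches


SAMPLE = '''04-Dec 03:35 bug-sd: 3307 Issuing autochanger "unload slot 8, drive 0"
command.
04-Dec 03:35 bug-sd: 3995 Bad autochanger "unload slot 9, drive 0":
ERR=Child exited with code 1.'''
```

Calling find_slot_mismatches(SAMPLE) on the excerpt from Rob's log reports slot 8 issued versus slot 9 in the error for drive 0. A mismatch alone does not show that the wrong slot was unloaded; the unload itself may still have failed for another reason.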

> 12-Dec 04:14 mw-mcs-sd: Please mount Volume 
> "MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job
> nfs-1.2005-12-12_02.15.08 12-Dec 05:14 mw-mcs-sd: Please mount Volume
> "MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job
> nfs-1.2005-12-12_02.15.08 12-Dec 07:14 mw-mcs-sd: Please mount Volume
> "MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job
> nfs-1.2005-12-12_02.15.08 12-Dec 08:59 nfs-1-fd: nfs-1.2005-12-12_02.15.08
> Fatal error: backup.c:498 Network send error to SD. ERR=Broken pipe 12-Dec
> 08:59 mw-mcs-dir: nfs-1.2005-12-12_02.15.08 Error: Bacula 1.38.2 (20Nov05):
> 12-Dec-2005 08:59:32
>
> At 08:59 I stopped bacula-dir and -sd. Th

Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Arno Lehmann

Hello,

Kern Sibbald schrieb:

On Monday 12 December 2005 12:54, Volker Dierks wrote:


Sorry, I forgot to attach the file ... here it is.




Well, you fell into a documentation error.  Please remove the line from your
Device resources that says:

  Maximum Changer Wait = 10 minutes

it is incorrect.  Despite what the *old* documentation said, the time *must* 
be specified in seconds.


Yes. ;-)

Even better: Also edit the script to use the wait_for_tape function (or
whatever it's called). Assuming you can reliably get the tape status,
this is a much cleaner solution - I've seen tape load times from a few
seconds to some minutes on one and the same drive (DLT, admittedly :-)
but I prefer to wait until a drive is ready over the fixed timeout.


Also, in case the security timeout in the function waiting for the tape
to be loaded ever triggers, it is useful to write some log data,
especially containing tapealert data. I think that, given a sufficiently
long emergency timeout, almost all cases of a tape not recognized by the
tape drive indicate a serious failure.


For example my configuration:
- in mtx-changer's wait_for_drive function, I poll every three seconds 
for some hundred times.
- In bacula-sd.conf, I set a timeout of 20 minutes or something. This 
should never be reached, of course.
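
The polling approach described above can be sketched roughly as follows. This is only an illustration in Python, not the actual mtx-changer shell code: the function name mirrors wait_for_drive, but the is_ready callback, the 300-poll default and the injectable sleep are assumptions made for the example. In a real script, is_ready would wrap whatever command reliably reports the drive status on your system.

```python
import time
from typing import Callable


def wait_for_drive(is_ready: Callable[[], bool],
                   poll_interval: float = 3.0,
                   max_polls: int = 300,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the drive until it reports ready.

    Returns True as soon as is_ready() succeeds, or False once
    max_polls attempts have been used up (the emergency timeout).
    """
    for _ in range(max_polls):
        if is_ready():
            return True
        # Drive not ready yet: wait a little and ask again.
        sleep(poll_interval)
    return False


if __name__ == "__main__":
    # Hypothetical drive that becomes ready on the third poll.
    answers = iter([False, False, True])
    if not wait_for_drive(lambda: next(answers), sleep=lambda s: None):
        # This is the place to write log data (e.g. TapeAlert output).
        print("drive never became ready")
```

With these defaults the loop gives up after roughly 15 minutes, which stays below the 20 minute timeout on the bacula-sd.conf side, so the changer script reports the failure before the Storage daemon gives up on it.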



> If you want to set it to 10 minutes you need to set it to 600.
>
> You probably also want to increase the sleep in mtx-changer ...



Definitely...

Arno

--
IT-Service Lehmann[EMAIL PROTECTED]
Arno Lehmann  http://www.its-lehmann.de


---
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Alan Brown

On Sun, 11 Dec 2005, Kern Sibbald wrote:


> The bscan problem that I found caused it to generate a JobMedia record in the
> database that had an end FileIndex one less than it should have been.  This
> was the last record on a Volume, and the record was continued on the next
> Volume.  When Bacula constructed a bsr, the "optimization" code had this
> off-by-one problem, so when the restore job ran, the last (partial) record on
> the first tape was ignored.  When the restore job got the second tape up,
> after reading the first (partial) record, it realized that the first part of
> the record from the first Volume was not there, so my insanity check code
> aborted.


That sounds about right.

Perhaps it needed larger tape spools to trigger?

AB




Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Kern Sibbald
On Monday 12 December 2005 12:54, Volker Dierks wrote:
> Sorry, I forgot to attach the file ... here it is.
>

Well, you fell into a documentation error.  Please remove the line from your
Device resources that says:

  Maximum Changer Wait = 10 minutes

it is incorrect.  Despite what the *old* documentation said, the time *must* 
be specified in seconds.

If you want to set it to 10 minutes you need to set it to 600.
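
To avoid this kind of slip when writing the Device resource, the conversion can be done mechanically. The following is just a small Python sketch; the helper names to_seconds and changer_wait_directive are made up for this example and are not part of Bacula.

```python
# Seconds per unit; only the units needed for this example.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def to_seconds(spec: str) -> int:
    """Convert a spec such as '10 minutes' into a number of seconds."""
    amount, unit = spec.split()
    # Accept both singular and plural unit names.
    return int(amount) * UNIT_SECONDS[unit.rstrip("s")]


def changer_wait_directive(spec: str) -> str:
    """Render the directive with the time given in plain seconds."""
    return f"Maximum Changer Wait = {to_seconds(spec)}"


if __name__ == "__main__":
    print(changer_wait_directive("10 minutes"))
```

For "10 minutes" this produces the line with the value 600, as stated above.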

You probably also want to increase the sleep in mtx-changer ...

-- 
Best regards,

Kern

  (">
  /\
  V_V




Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Kern Sibbald
On Monday 12 December 2005 12:52, Volker Dierks wrote:
> Hello,
>
> Volker Dierks wrote:
> >> Usually, I'd see if the problem can be reproduced with the existing
> >> system setup. If that's possible, I'd first check if the actual cause
> >> might be purely SCSI device related.
> >
> > That's what I'm going to do first. I'll create the second pool again
> > (with the same tapes) and put all nodes into that pool ...
>
> I've done this tonight .. in turn:
> - the backup started as planned on drive two with the same tape as
>   Thursday (the tape was already mounted so no mtx stuff took place)
> - after some minutes (and 500 MB of data written to that tape) everything
>   hung again .. so I restarted everything and disabled that tape
> - I mounted the next tape and started the backup again. After 7 GB of
>   data written to that tape (and 5 successfully backed-up nodes) I went to
>   bed.
>
> Up to here, it looked like the problems were truly caused by the tape.
> But this morning I got the following mail:
> 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: block.c:538 Write error at 12:5438 on device "Drive-2" (/dev/nst1). ERR=Input/output error.
> 12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: Error writing final EOF to tape. This Volume may not be readable. dev.c:1553 ioctl MTWEOF error on "Drive-2" (/dev/nst1). ERR=No such device or address.

Unless you have 7GB tapes, this looks like a hardware problem: bad media, 
dirty tape drive, bad drive, bad SCSI cables (or improperly installed), bad 
SCSI card, ...

These kinds of problems typically generate a number of kernel (SCSI) messages 
in the log.

> 12-Dec 03:24 mw-mcs-sd: End of medium on Volume "MW-MCS-1-12" Bytes=7,078,064,979 Blocks=109,722 at 12-Dec-2005 03:24.
> 12-Dec 03:24 mw-mcs-sd: 3301 Issuing autochanger "loaded drive 1" command.
> 12-Dec 03:24 mw-mcs-sd: 3302 Autochanger "loaded drive 1", result is Slot 12.
> 12-Dec 04:10 mw-mcs-sd: 3307 Issuing autochanger "unload slot 12, drive 1" command.
> 12-Dec 04:14 mw-mcs-sd: 3995 Bad autochanger "unload slot 13, drive 1": ERR=Child died from signal 15: Termination.

This looks like you don't have your autochanger script properly configured as 
one user pointed out -- setting the sleep longer may help.  However, I do not 
understand why in one message it says "unload slot 12", then on the next line 
it says "unload slot 13 ... ERR".  There seems to be something missing as 
Bacula will normally issue a "loaded drive" or load a drive before unloading 
it for a second time.

> 12-Dec 04:14 mw-mcs-sd: Please mount Volume "MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job nfs-1.2005-12-12_02.15.08
> 12-Dec 05:14 mw-mcs-sd: Please mount Volume "MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job nfs-1.2005-12-12_02.15.08
> 12-Dec 07:14 mw-mcs-sd: Please mount Volume "MW-MCS-1-13" on Storage Device "Drive-2" (/dev/nst1) for Job nfs-1.2005-12-12_02.15.08
> 12-Dec 08:59 nfs-1-fd: nfs-1.2005-12-12_02.15.08 Fatal error: backup.c:498 Network send error to SD. ERR=Broken pipe
> 12-Dec 08:59 mw-mcs-dir: nfs-1.2005-12-12_02.15.08 Error: Bacula 1.38.2 (20Nov05): 12-Dec-2005 08:59:32
>
> At 08:59 I stopped bacula-dir and -sd. The kernel log contains the
> same SCSI ABORT messages as posted before, starting at 02:54:
> Dec 12 02:54:30 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message

If you are getting SCSI ABORT messages, then either there is some hardware 
problem or the Bacula Device resource is not set up correctly for that drive.

Did you pass *all* the tests in the Tape Testing chapter?
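
To get a quick overview of which SCSI device is producing the ABORT messages, the kernel log can be scanned for them. The Python sketch below assumes the message format seen in this thread ("scsiH:C:T:L: Attempting to queue an ABORT message"); the function name count_scsi_aborts is made up for the example, and other drivers or kernel versions may word the message differently.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   Dec 12 02:54:30 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message
ABORT_RE = re.compile(
    r"kernel: (scsi\d+:\d+:\d+:\d+): Attempting to queue an ABORT message"
)


def count_scsi_aborts(lines: Iterable[str]) -> Counter:
    """Count ABORT attempts per SCSI address (host:channel:target:lun)."""
    counts: Counter = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Dec  9 01:18:59 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message",
        "Dec  9 01:18:59 backup kernel: CDB: 0xa 0x0 0x0 0xfc 0x0 0x0",
        "Dec 12 02:54:30 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message",
    ]
    for address, count in count_scsi_aborts(sample).items():
        print(address, count)
```

In practice the lines would come from the system's kernel log file (its location depends on the syslog configuration). If all hits point at the same target, that supports the suspicion that one drive, or its cabling, is at fault.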

>
> The last thing I can imagine is: all tapes which have been used in Drive-2
> up to now had previously been used (by Amanda). This is the way I recycled
> them:
> mt -f /dev/nst1 rewind
> mt -f /dev/nst1 setdensity 0x89

I always find explicitly setting the density this way *very* prone to error.

> mt -f /dev/nst1 rewind
> mt -f /dev/nst1 weof
> mt -f /dev/nst1 weof
> write the Bacula label
>
> Perhaps this is not the right way? I've attached our configuration and
> would be very thankful if someone could confirm that it's correct. It's
> the one-drive configuration pointing to Pool: DRIVE-2. When using this
> configuration against Pool: DRIVE-1 (all tapes in this pool are fresh
> new ones) everything is working fine.
>
> Volker
>
> PS: I'm running "mt -f /dev/nst1 erase" on MW-MCS-1-12 atm. If this
> fails, I would say that drive two is faulty.
>
>

-- 
Best re

Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Volker Dierks

Sorry, I forgot to attach the file ... here it is.

Volker


conf.tgz
Description: GNU Unix tar archive


Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-12 Thread Volker Dierks

Hello,

Volker Dierks wrote:
>> Usually, I'd see if the problem can be reproduced with the existing
>> system setup. If that's possible, I'd first check if the actual cause
>> might be purely SCSI device related.
>
> That's what I'm going to do first. I'll create the second pool again
> (with the same tapes) and put all nodes into that pool ...

I've done this tonight .. in turn:
- the backup started as planned on drive two with the same tape as
 Thursday (the tape was already mounted so no mtx stuff took place)
- after some minutes (and 500 MB of data written to that tape) everything
 hung again .. so I restarted everything and disabled that tape
- I mounted the next tape and started the backup again. After 7 GB of
 data written to that tape (and 5 successfully backed-up nodes) I went to
 bed.

Up to here, it looked like the problems were truly caused by the tape.
But this morning I got the following mail:
12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: block.c:538 Write error at 
12:5438 on device "Drive-2" (/dev/nst1). ERR=Input/output error.
12-Dec 03:24 mw-mcs-sd: nfs-1.2005-12-12_02.15.08 Error: Error writing final 
EOF to tape. This Volume may not be readable.
dev.c:1553 ioctl MTWEOF error on "Drive-2" (/dev/nst1). ERR=No such device or 
address.
12-Dec 03:24 mw-mcs-sd: End of medium on Volume "MW-MCS-1-12" 
Bytes=7,078,064,979 Blocks=109,722 at 12-Dec-2005 03:24.
12-Dec 03:24 mw-mcs-sd: 3301 Issuing autochanger "loaded drive 1" command.
12-Dec 03:24 mw-mcs-sd: 3302 Autochanger "loaded drive 1", result is Slot 12.
12-Dec 04:10 mw-mcs-sd: 3307 Issuing autochanger "unload slot 12, drive 1" 
command.
12-Dec 04:14 mw-mcs-sd: 3995 Bad autochanger "unload slot 13, drive 1": 
ERR=Child died from signal 15: Termination.
12-Dec 04:14 mw-mcs-sd: Please mount Volume "MW-MCS-1-13" on Storage Device 
"Drive-2" (/dev/nst1) for Job nfs-1.2005-12-12_02.15.08
12-Dec 05:14 mw-mcs-sd: Please mount Volume "MW-MCS-1-13" on Storage Device 
"Drive-2" (/dev/nst1) for Job nfs-1.2005-12-12_02.15.08
12-Dec 07:14 mw-mcs-sd: Please mount Volume "MW-MCS-1-13" on Storage Device 
"Drive-2" (/dev/nst1) for Job nfs-1.2005-12-12_02.15.08
12-Dec 08:59 nfs-1-fd: nfs-1.2005-12-12_02.15.08 Fatal error: backup.c:498 
Network send error to SD. ERR=Broken pipe
12-Dec 08:59 mw-mcs-dir: nfs-1.2005-12-12_02.15.08 Error: Bacula 1.38.2 
(20Nov05): 12-Dec-2005 08:59:32

At 08:59 I stopped bacula-dir and -sd. The kernel log contains the
same SCSI ABORT messages as posted before, starting at 02:54:
Dec 12 02:54:30 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message

The last thing I can imagine is: all tapes which have been used in Drive-2
up to now had previously been used (by Amanda). This is the way I recycled
them:
mt -f /dev/nst1 rewind
mt -f /dev/nst1 setdensity 0x89
mt -f /dev/nst1 rewind
mt -f /dev/nst1 weof
mt -f /dev/nst1 weof
write the Bacula label

Perhaps this is not the right way? I've attached our configuration and
would be very thankful if someone could confirm that it's correct. It's
the one-drive configuration pointing to Pool: DRIVE-2. When using this
configuration against Pool: DRIVE-1 (all tapes in this pool are fresh
new ones) everything is working fine.

Volker

PS: I'm running "mt -f /dev/nst1 erase" on MW-MCS-1-12 atm. If this
   fails, I would say that drive two is faulty.




Re: [Bacula-devel] Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-11 Thread Phil Stracchino
Kern Sibbald wrote:
> The bscan problem that I found caused it to generate a JobMedia record in the 
> database that had an end FileIndex one less than it should have been.  This 
> was the last record on a Volume, and the record was continued on the next 
> Volume.  When Bacula constructed a bsr, the "optimization" code had this 
> off-by-one problem, so when the restore job ran, the last (partial) record on 
> the first tape was ignored.  When the restore job got the second tape up, 
> after reading the first (partial) record, it realized that the first part of 
> the record from the first Volume was not there, so my insanity check code 
> aborted.
> 
> What surprises me is that this never triggered before in all the years I ran 
> it.  I wish I had more time to devote to regression testing as I would 
> develop a case that is 100% sure to exercise this problem ...

I recall finding (and fixing) a very similar bug in IBM's backup.exe
program that shipped with DOS 3.20.  I spent about five hours on the
phone with a frantic New York stockbroker figuring out the problem and
walking him through patching his backup disks so that the backup would
restore properly.


-- 
 Phil Stracchino   [EMAIL PROTECTED]
Renaissance Man, Unix generalist, Perl hacker
 Mobile: 603-216-7037 Landline: 603-886-3518




Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-11 Thread Kern Sibbald
On Sunday 11 December 2005 10:32, Volker Dierks wrote:
> Hello,
>
> Arno Lehmann wrote:
> > Well, I haven't tried jobs going to different drives in one autochanger,
> > so I won't discuss that part of your report.
>
> Hopefully this is supported?! The "Maximum Changer Wait" option seems
> reasonable for the situation that both drives need a new tape at the
> same time. I'll increase this to 10 minutes because the loader is slow
> and freshly loaded tapes are read in even more slowly. I had to change the
> sleep time in mtx-changer (the load case) to 140 seconds. But to make
> that point clear .. no tape change was initiated whilst the described
> incident occurred.
>
> > Usually, I'd see if the problem can be reproduced with the existing
> > system setup. If that's possible, I'd first check if the actual cause
> > might be purely SCSI device related.
>
> That's what I'm going to do first. I'll create the second pool again
> (with the same tapes) and put all nodes into that pool. So a complete
> (Full) backup of all nodes should be done on drive two (150 GB). If
> this succeeds I'll try both drives again with 1.38.2 because all SCSI
> cables have been changed on Friday. If it fails again, I'll give 1.38.3
> a try.

I didn't read your first email very carefully, but I do have the following 
comments that I hope will help you:

1. Bacula itself uses only standard ioctl(), read() and write() calls, so it 
would be very hard for it to damage any hardware.  mtx does use the raw SCSI 
channel, so there is more potential for problems there.  That said, the 
mtx-changer script calls mtx only in standard well documented ways.

2. If you are having problems with multiple drives and/or multiple pools or 
"lockups" with such combinations, you will save yourself some pain by going 
directly to version 1.38.3 -- I wouldn't hesitate in those cases ...

>
> Volker
>
>

-- 
Best regards,

Kern

  (">
  /\
  V_V




Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-11 Thread Volker Dierks

Hello,

Arno Lehmann wrote:
> Well, I haven't tried jobs going to different drives in one autochanger,
> so I won't discuss that part of your report.

Hopefully this is supported?! The "Maximum Changer Wait" option seems
reasonable for the situation that both drives need a new tape at the
same time. I'll increase this to 10 minutes because the loader is slow
and freshly loaded tapes are read in even more slowly. I had to change the
sleep time in mtx-changer (the load case) to 140 seconds. But to make
that point clear .. no tape change was initiated whilst the described
incident occurred.

> Usually, I'd see if the problem can be reproduced with the existing
> system setup. If that's possible, I'd first check if the actual cause
> might be purely SCSI device related.

That's what I'm going to do first. I'll create the second pool again
(with the same tapes) and put all nodes into that pool. So a complete
(Full) backup of all nodes should be done on drive two (150 GB). If
this succeeds I'll try both drives again with 1.38.2 because all SCSI
cables have been changed on Friday. If it fails again, I'll give 1.38.3
a try.

Volker




Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-11 Thread Kern Sibbald
On Saturday 10 December 2005 22:34, Alan Brown wrote:
> On Sat, 10 Dec 2005, Kern Sibbald wrote:
> > As an aside: when testing 1.38.3, the bscan regression script failed
> > (bscan aborted due to a logic error).  I think bscan has been around and
> > mostly unmodified for about 3 years now, and so this regression test has
> > been run thousands of times with no problem. As a consequence, it was
> > surprising to find that the bug has existed since the first bscan, and
> > not so surprising that it involved a record that was split between two
> > Volumes ...
>
> I wonder if that fix will solve the spanning issue many of us had been
> seeing when testing bscan on autochangers? :)

I don't remember this issue, could you fill me in on what it is?

The bscan problem that I found caused it to generate a JobMedia record in the 
database that had an end FileIndex one less than it should have been.  This 
was the last record on a Volume, and the record was continued on the next 
Volume.  When Bacula constructed a bsr, the "optimization" code had this 
off-by-one problem, so when the restore job ran, the last (partial) record on 
the first tape was ignored.  When the restore job got the second tape up, 
after reading the first (partial) record, it realized that the first part of 
the record from the first Volume was not there, so my insanity check code 
aborted.

What surprises me is that this never triggered before in all the years I ran 
it.  I wish I had more time to devote to regression testing as I would 
develop a case that is 100% sure to exercise this problem ...
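
The failure mode described above can be illustrated with a toy model. This is emphatically not Bacula's code or its data structures: Record, select and restore are invented for the example, and the only thing carried over from the description is the idea that an end FileIndex that is one too small drops the first half of a record that spans two Volumes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    file_index: int
    part: str  # "whole", "head" (continues on next Volume) or "tail"


def select(records: List[Record], first: int, last: int) -> List[Record]:
    """Keep only records inside the FileIndex range from the catalog."""
    return [r for r in records if first <= r.file_index <= last]


def restore(vol1: List[Record], vol2: List[Record]) -> List[int]:
    """Replay both Volumes and check that split records are complete."""
    restored = []
    pending_head = None
    for rec in vol1 + vol2:
        if rec.part == "head":
            pending_head = rec.file_index
        elif rec.part == "tail":
            # Sanity check: the first part must have been read already.
            if pending_head != rec.file_index:
                raise RuntimeError(f"missing first part of record {rec.file_index}")
            restored.append(rec.file_index)
            pending_head = None
        else:
            restored.append(rec.file_index)
    return restored


volume1 = [Record(1, "whole"), Record(2, "whole"), Record(3, "head")]
volume2 = [Record(3, "tail"), Record(4, "whole")]

# Correct end FileIndex (3): the split record is restored.
print(restore(select(volume1, 1, 3), volume2))
# An end FileIndex of 2 would skip the head of record 3, and
# restore(select(volume1, 1, 2), volume2) raises RuntimeError.
```

A regression case along these lines only needs a record that straddles the Volume boundary, which also suggests why runs that did not happen to split a record never hit the bug.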

-- 
Best regards,

Kern

  (">
  /\
  V_V




Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-10 Thread Arno Lehmann

Hello,

Volker Dierks wrote:
> ...
> perhaps I'll give it a try. But a little tale first.
>
> We've got a HP 2/20 Library with 2 DLT-8000 drives. Our backup box is running
> Debian GNU/Linux 3.0, Bacula 1.38.2 and 11 nodes. The system went into
> production on Wednesday (with one drive), with tremendous success. Bacula is
> really great.
>
> To speed things up, I tried to activate the second drive on Thursday. I've
> created a second pool and relabeled some tapes into that pool. Everything I've
> found - regarding using multiple drives - says that several pools are needed.
> These were the configuration changes:
>
> ...
> After 20 minutes I tried to cancel the (still stuck) jobs without success.
> Thus I stopped bacula-dir and bacula-sd, which left two bacula-sd processes
> in status D behind. They couldn't be killed so I rebooted the box. This also
> failed with a booted kernel saying that init couldn't find the root partition.
>
> After a poweroff/on the box came up as usual.


Well, I haven't tried jobs going to different drives in one autochanger, 
so I won't discuss that part of your report.



> My conclusion is that the second drive is faulty and blew up the SCSI bus
> (see the kernel log at the end). Job 2 was stuck at 160 MB. In the meantime
> job 1 finished writing 450 MB and job 3 was started. If I remember correctly,
> job 3 was able to write 2.6 GB to drive one until it also got stuck. I don't
> know if a faulty tape can cause such an incident.


Hardly, but that doesn't mean it's impossible. Similar kernel driver 
reports and SCSI subsystem hangs have occurred here, and I'm quite sure - 
again, not absolutely - that they resulted from a combination of a drive 
hardware error and an imperfect driver.


In fact, there are reports that the aic7xxx driver doesn't work 
correctly in all cases, caused by different hardware on different SCSI 
HBAs. As far as I know, there have been some issues with the controller 
chips handled by this driver, which Adaptec tried to resolve by a number 
of "silent" hardware updates. The Adaptec-supplied Windows drivers 
obviously know how to handle the different hardware capabilities (and 
errors, as some might say), but the Linux drivers don't implement the 
necessary functions for all cases. This all is third-hand knowledge and 
completely NOT backed up by any real understanding of the AIC chips and 
the corresponding drivers, by the way. Still, I found the source code of 
the Linux drivers quite interesting, as there are some references to 
special handling of certain conditions on some AIC chips.


By the way: Here, when I saw such errors, they were, as far as I can 
say, always caused by actual SCSI errors from some devices - I had a 
spool disk dying during despooling, for example, and I had some real 
tape drive errors that could only be recovered by power cycling the tape 
drive. Still, some of the errors I could identify *should* have been 
handled by the drivers without a SCSI subsystem breakdown.


Usually, I'd see if the problem can be reproduced with the existing 
system setup. If that's possible, I'd first check if the actual cause 
might be purely SCSI device related.


> On the other hand (which is what I hope) there could be a configuration error
> (Job {} and Client {} didn't have Maximum Concurrent Jobs set) or the changes
> in this BETA will fix this behaviour.


Well, you can always try it, assuming you accept using beta software in 
a production system. Having read Kern's report, personally, I'd try it, 
but I don't have really vital data here. Of course, as far as I see, 
it's unlikely that Bacula can destroy existing data; in the worst cases 
I can imagine, you might lose some existing volumes and your catalog, I 
think.


Arno


> I've planned to add the second drive again tomorrow and use another tape.
> Should I also upgrade to 1.38.3?
>
> Volker
>
> Dec  9 01:18:59 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message
> Dec  9 01:18:59 backup kernel: CDB: 0xa 0x0 0x0 0xfc 0x0 0x0
> Dec  9 01:18:59 backup kernel: scsi1: At time of recovery, card was not paused
> Dec  9 01:18:59 backup kernel: >> Dump Card State Begins <
> Dec  9 01:18:59 backup kernel: scsi1: Dumping Card State while idle, at SEQADDR 0x8
> Dec  9 01:18:59 backup kernel: Card was paused
> Dec  9 01:18:59 backup kernel: ACCUM = 0x0, SINDEX = 0x3, DINDEX = 0xe4, ARG_2 = 0x0
> Dec  9 01:18:59 backup kernel: HCNT = 0x0 SCBPTR = 0x0
> Dec  9 01:18:59 backup kernel: SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0]
> Dec  9 01:18:59 backup kernel: LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0]
> Dec  9 01:18:59 backup kernel: SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8]
> Dec  9 01:18:59 backup kernel: SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4]
> Dec  9 01:18:59 backup kernel: SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89]
> Dec  9 01:18:59 backup kernel: STACK: 0x0 0x163 0x109 0x3
>
> Dec  9 01:18:59 backup kernel: SCB count = 5
> Dec  9 01:18:59 backup kernel: K

Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-10 Thread Alan Brown

On Sat, 10 Dec 2005, Kern Sibbald wrote:


> As an aside: when testing 1.38.3, the bscan regression script failed (bscan
> aborted due to a logic error).  I think bscan has been around and mostly
> unmodified for about 3 years now, and so this regression test has been run
> thousands of times with no problem. As a consequence, it was surprising to
> find that the bug has existed since the first bscan, and not so surprising
> that it involved a record that was split between two Volumes ...

I wonder if that fix will solve the spanning issue many of us had been 
seeing when testing bscan on autochangers? :)


AB





Re: [Bacula-users] Bacula BETA 1.38.3

2005-12-10 Thread Volker Dierks

Hello Kern, hello all,

> Volume Bacula will want on Monday, ...  The major change is a total revamp of
> the inner loop of the device reservation code following the algorithm
> proposed in a recent email.  This appears to correct the problems of getting
> multiple autochanger drives running simultaneously, as well as several other
> reported problems.


perhaps I'll give it a try. But a little tale first.

We've got a HP 2/20 Library with 2 DLT-8000 drives. Our backup box is running
Debian GNU/Linux 3.0, Bacula 1.38.2 and 11 nodes. The system has gone into
production on Wednesday (with one drive) and tremendous success. Bacula is
really great.

To speed things up, I tried to activate the second drive on Thursday. I've
created a second pool and relabeled some tapes into that pool. Everything I've
found - regarding using multiple drives - says that several pools are needed.
These were the configuration changes:

bacula-dir.conf:
Director {
   Maximum Concurrent Jobs = 2    # was 1
}

Storage {
   Maximum Concurrent Jobs = 2    # was unset
}

Job {} and Client {} have Maximum Concurrent Jobs unset

bacula-sd.conf:
Storage {
   Maximum Concurrent Jobs = 20   # unchanged
}

I also put some nodes into that pool. This is what happened:

Job 1 (pool a) started to write on drive one. Job 2 (pool b) started to write
on drive two (the new one). Great. Then, job 1 finished and job three (pool a)
was started. At this time I noticed that job 2 seemed to be stuck (written
blocks didn't increase any more). A little bit later job 3 was also stuck.
After 20 minutes I tried to cancel the (still stuck) jobs without success.
Thus I stopped bacula-dir and bacula-sd, which left two bacula-sd processes
in status D behind. They couldn't be killed so I rebooted the box. This also
failed with a booted kernel saying that init couldn't find the root partition.
After a poweroff/on the box came up as usual.

My conclusion is that the second drive is faulty and blew up the SCSI bus
(see the kernel log at the end). Job 2 was stuck at 160 MB. In the meantime
job 1 finished writing 450 MB and job 3 was started. If I remember correctly,
job 3 was able to write 2.6 GB to drive one until it also got stuck. I don't
know if a faulty tape can cause such an incident.

On the other hand (which is what I hope) there could be a configuration error
(Job {} and Client {} didn't have Maximum Concurrent Jobs set) or the changes
in this BETA will fix this behaviour.

I've planned to add the second drive again tomorrow and use another tape.
Should I also upgrade to 1.38.3?

Volker

Dec  9 01:18:59 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message
Dec  9 01:18:59 backup kernel: CDB: 0xa 0x0 0x0 0xfc 0x0 0x0
Dec  9 01:18:59 backup kernel: scsi1: At time of recovery, card was not paused
Dec  9 01:18:59 backup kernel: >> Dump Card State Begins <
Dec  9 01:18:59 backup kernel: scsi1: Dumping Card State while idle, at SEQADDR 0x8
Dec  9 01:18:59 backup kernel: Card was paused
Dec  9 01:18:59 backup kernel: ACCUM = 0x0, SINDEX = 0x3, DINDEX = 0xe4, ARG_2 = 0x0
Dec  9 01:18:59 backup kernel: HCNT = 0x0 SCBPTR = 0x0
Dec  9 01:18:59 backup kernel: SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] 
Dec  9 01:18:59 backup kernel: LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0] 
Dec  9 01:18:59 backup kernel: SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8] 
Dec  9 01:18:59 backup kernel: SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4] 
Dec  9 01:18:59 backup kernel: SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89] 
Dec  9 01:18:59 backup kernel: STACK: 0x0 0x163 0x109 0x3

Dec  9 01:18:59 backup kernel: SCB count = 5
Dec  9 01:18:59 backup kernel: Kernel NEXTQSCB = 2
Dec  9 01:18:59 backup kernel: Card NEXTQSCB = 2
Dec  9 01:18:59 backup kernel: QINFIFO entries: 
Dec  9 01:18:59 backup kernel: Waiting Queue entries: 
Dec  9 01:18:59 backup kernel: Disconnected Queue entries: 1:4 
Dec  9 01:18:59 backup kernel: QOUTFIFO entries: 
Dec  9 01:18:59 backup kernel: Sequencer Free SCB List: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
Dec  9 01:18:59 backup kernel: Sequencer SCB Info: 
Dec  9 01:18:59 backup kernel:   0 SCB_CONTROL[0xc0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
Dec  9 01:18:59 backup kernel:   1 SCB_CONTROL[0x44] SCB_SCSIID[0x57] SCB_LUN[0x0] SCB_TAG[0x4] 
Dec  9 01:18:59 backup kernel:   2 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Dec  9 01:18:59 backup kernel:   3 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Dec  9 01:18:59 backup kernel:   4 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Dec  9 01:18:59 backup kernel:   5 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Dec  9 01:18:59 backup kernel:   6 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Dec  9 01:18:59 backup kernel:   7 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Dec  9 01:18:59 backup