Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-12 Thread Steve Costaras


On 2011-07-12 05:38, Martin Simmons wrote:
> Yes, that looks mostly normal.
>
> I would report that log output as a bug at bugs.bacula.org.
>
> I'm a little surprised that it specifically asked for the volume named FA0016
> though:
>
>2011-07-10 03SD-loki JobId 6: Please mount Volume "FA0016" or label a new 
> one for:
>
> but you then issued the label command for that volume.
>
> Was FA0016 in the database already?  If not, how did bacula predict the name?

Yes, I pre-populate the database with the range of tapes for each pool, 
since I already have the barcoded tapes.
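For a barcoded autochanger, this kind of pre-population can be done in one pass from bconsole with the `label barcodes` command; a sketch of the sort of session meant here (the storage and pool names are taken from the thread, the slot range is illustrative):

```
*label barcodes storage=LTO4 pool=BackupSetFA slots=1-24
```

Bacula reads the barcodes from the changer and labels each tape with its barcode name, creating the catalog entries up front, which matches the "Wrote label to prelabeled Volume" messages seen later in the thread.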

--
All of the data generated in your IT infrastructure is seriously valuable.
Why? It contains a definitive record of application performance, security 
threats, fraudulent activity, and more. Splunk takes this data and makes 
sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-d2d-c2
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-12 Thread Martin Simmons
> On Mon, 11 Jul 2011 16:00:15 -0500, Steve Costaras said:
> 
> On 2011-07-11 06:13, Martin Simmons wrote:
> >> On Sun, 10 Jul 2011 12:17:55 +, Steve Costaras said:
> >>
> >> I am trying a full backup/multi-job to a single client and all was going 
> >> well until this morning when I received the error below.   All other jobs 
> >> were also canceled.
> >>
> >> My question is twofold:
> >>
> >> 1) What the heck is this error?  I can unmount the drive, issue a rawfill 
> >> to
> >> the tape w/ btape and no problems?
> >> ...
> >> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" 
> >> (/dev/nst0)
> >> Requesting to mount LTO4 ...
> >> 3905 Bizarre wait state 7
> >> Do not forget to mount the drive!!!
> >> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on 
> >> device "LTO4" (/dev/nst0)
> >> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" 
> >> (/dev/nst0) at 10-Jul-2011 03:51.
> >> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on 
> >> read-only Volume. dev="LTO4" (/dev/nst0)
> >> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 
> >> Blocks=0 at 10-Jul-2011 03:51.
> >> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
> >> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic 
> >> error. Cannot write overflow block to device "LTO4" (/dev/nst0). 
> >> ERR=Input/output error
> > Do you regularly see the "3905 Bizarre wait state 7" message?  It could be 
> > an
> > indication of problems (and everything after that could be a consequence of
> > it).
> >
> > What are the messages that lead up to that point?
> Nothing, really; this was the 17th tape in a row on a ~3-day (so far) 
> backup. No messages in /var/log/messages. Previous messages from 
> bacula are below; as you can see, it just blows chunks right after FA0016 
> is mounted, and all concurrent jobs are killed. And I've tested that tape 
> before the backup ran, and again right after this failure, with btape. 
> No problems.

Yes, that looks mostly normal.

I would report that log output as a bug at bugs.bacula.org.

I'm a little surprised that it specifically asked for the volume named FA0016
though:

  2011-07-10 03SD-loki JobId 6: Please mount Volume "FA0016" or label a new one 
for:

but you then issued the label command for that volume.

Was FA0016 in the database already?  If not, how did bacula predict the name?

__Martin



Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-11 Thread Steve Costaras


On 2011-07-11 06:13, Martin Simmons wrote:
>> On Sun, 10 Jul 2011 12:17:55 +, Steve Costaras said:
>>
>> I am trying a full backup/multi-job to a single client and all was going 
>> well until this morning when I received the error below.   All other jobs 
>> were also canceled.
>>
>> My question is twofold:
>>
>> 1) What the heck is this error?  I can unmount the drive, issue a rawfill to
>> the tape w/ btape and no problems?
>> ...
>> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
>> Requesting to mount LTO4 ...
>> 3905 Bizarre wait state 7
>> Do not forget to mount the drive!!!
>> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on 
>> device "LTO4" (/dev/nst0)
>> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" 
>> (/dev/nst0) at 10-Jul-2011 03:51.
>> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on 
>> read-only Volume. dev="LTO4" (/dev/nst0)
>> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 
>> Blocks=0 at 10-Jul-2011 03:51.
>> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
>> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. 
>> Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output 
>> error
> Do you regularly see the "3905 Bizarre wait state 7" message?  It could be an
> indication of problems (and everything after that could be a consequence of
> it).
>
> What are the messages that lead up to that point?
Nothing, really; this was the 17th tape in a row on a ~3-day (so far) 
backup. No messages in /var/log/messages. Previous messages from 
bacula are below; as you can see, it just blows chunks right after FA0016 
is mounted, and all concurrent jobs are killed. And I've tested that tape 
before the backup ran, and again right after this failure, with btape. 
No problems.



---
*label storage=LTO4 pool=BackupSetFA volume=FA0015
Connecting to Storage daemon LTO4 at loki:9103 ...
Sending label command for Volume "FA0015" Slot 14 ...
3000 OK label. VolBytes=1024 DVD=0 Volume="FA0015" Device="LTO4" (/dev/nst0)
Requesting to mount LTO4 ...
3001 Device "LTO4" (/dev/nst0) is mounted with Volume "FA0015"
*
2011-07-10 00SD-loki JobId 3: Wrote label to prelabeled Volume "FA0015" 
on device "LTO4" (/dev/nst0)
2011-07-10 00SD-loki JobId 3: New volume "FA0015" mounted on device 
"LTO4" (/dev/nst0) at 10-Jul-2011 00:48.
*
2011-07-10 00SD-loki JobId 3: Despooling elapsed time = 01:21:56, 
Transfer rate = 70.98 M Bytes/second
2011-07-10 00SD-loki JobId 3: Alert: smartctl version 5.38 
[x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
2011-07-10 00SD-loki JobId 3: Alert: Home page is 
http://smartmontools.sourceforge.net/
2011-07-10 00SD-loki JobId 3: Alert:
2011-07-10 00SD-loki JobId 3: Alert: TapeAlert: OK
2011-07-10 00SD-loki JobId 3: Alert:
2011-07-10 00SD-loki JobId 3: Alert: Error counter log:
2011-07-10 00SD-loki JobId 3: Alert:           Errors Corrected by       Total   Correction     Gigabytes    Total
2011-07-10 00SD-loki JobId 3: Alert:               ECC      rereads/     errors   algorithm      processed    uncorrected
2011-07-10 00SD-loki JobId 3: Alert:           fast | delayed  rewrites  corrected  invocations  [10^9 bytes]  errors
2011-07-10 00SD-loki JobId 3: Alert: read:        0       0         0         0          0         0.000        0
2011-07-10 00SD-loki JobId 3: Alert: write:    3010       0      3010      3010       3010         0.000        0
2011-07-10 00SD-loki JobId 3: Sending spooled attrs to the Director. 
Despooling 65,784,417 bytes ...
2011-07-10 00DIR-loki JobId 3: Bacula DIR-loki 5.0.3 (04Aug10): 
10-Jul-2011 00:58:04
   Build OS:   x86_64-unknown-linux-gnu ubuntu 10.04
   JobId:  3
   Job:JOB-loki_var_ftp_.2011-07-07_17.45.00_05
   Backup Level:   Full
   Client: "FD-loki" 5.0.3 (04Aug10) 
x86_64-unknown-linux-gnu,ubuntu,10.04
   FileSet:"FS-loki_var_ftp_" 2011-07-06 18:00:00
   Pool:   "BackupSetFA" (From Run FullPool override)
   Catalog:"MyCatalog" (From Client resource)
   Storage:"LTO4" (From Pool resource)
   Scheduled time: 07-Jul-2011 17:45:00
   Start time: 07-Jul-2011 17:50:30
   End time:   10-Jul-2011 00:58:04
   Elapsed time:   2 days 7 hours 7 mins 34 secs
   Priority:   50
   FD Files Written:   186,287
   SD Files Written:   186,287
   FD Bytes Written:   2,925,298,735,317 (2.925 TB)
   SD Bytes Written:   2,925,332,067,132 (2.925 TB)
   Rate:   14740.4 KB/s
   Software Compression:   None
   VSS:no
   Encryption: no
   Accurate:   yes
   Volume name(s): 
FA0001|FA0002|FA0005|FA0006|FA0010|FA0011|FA0014

Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-11 Thread Martin Simmons
> On Sun, 10 Jul 2011 12:17:55 +, Steve Costaras said:
> 
> I am trying a full backup/multi-job to a single client and all was going well 
> until this morning when I received the error below.   All other jobs were 
> also canceled.  
> 
> My question is twofold:
> 
> 1) What the heck is this error?  I can unmount the drive, issue a rawfill to
> the tape w/ btape and no problems?
> ...
> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on 
> device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" 
> (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on 
> read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 
> Blocks=0 at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. 
> Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output 
> error

Do you regularly see the "3905 Bizarre wait state 7" message?  It could be an
indication of problems (and everything after that could be a consequence of
it).

What are the messages that lead up to that point?

__Martin



Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread Dan Langille
Resending, with additional information.

On Jul 10, 2011, at 3:18 PM, Steve Costaras wrote:

>  
> -Original Message-
> From: Dan Langille [mailto:d...@langille.org]
> Sent: Sunday, July 10, 2011 12:58 PM
> To: stev...@chaven.com
> Cc: bacula-users@lists.sourceforge.net
> Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block 
> to device "LTO4"
> 
> >> 
> >> 2) since everything is spooled first, there should be NO error that should 
> >> cancel a job. A tape drive could fail, a tape could burst into flame, all 
> >> that would be needed was bacula to know that >>there was an issue and give 
> >> the admin a simple statement do you want to fix the issue or cancel?, the 
> >> admin to fix the problem, and then bacula told to restart from the last 
> >> block that was >>stored successfully OR if need be from the beginning of 
> >> the spooled data file.
> 
> >This I do know. Although, at first glance it seems easy to do this, it is 
> >not. If it was trivial to do, I assure you, it would already be in place.
> 
> >> Canceling jobs that run for days for TB's of data is just screwed up.
> 
> >I suggest running smaller jobs. I don't mean to sound trite, but that really 
> >is the solution. Given that the alternative is non-trivial, the sensible 
> >choice is, I'm afraid, cancel the job.
> 
> I'm already kicking off 20+ jobs for a single system. This does not 
> work when we're talking over the 100TB/nearly 200TB mark. And when these 
> errors happen it does not matter how many jobs you have, as /all/ outstanding 
> jobs fail when you have concurrency (in this case all jobs that were queued and 
> were not even writing to the same tape were canceled).
This sounds like a configuration issue.  Queued jobs should not be cancelled 
when a previous job cancels.  FYI, I've never seen this happen on my systems.  
I think this is something you need to follow up on.

> This does not happen with any other enterprise backup software not that they 
> should be 100% mimicked.
> With the data sizes we have today I don't see why there are not better error 
> handling checks/routines.


This is open source software.  Stuff gets written because someone wants it.  
Clearly, nobody who wants it has written it. That is why it does not exist.

But sorry, that's not helping you find a solution.  James Harper has some good 
points. :)  I hope it leads somewhere.

-- 
Dan Langille - http://langille.org



Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread Dan Langille

On Jul 10, 2011, at 3:18 PM, Steve Costaras wrote:

>  
> -Original Message-
> From: Dan Langille [mailto:d...@langille.org]
> Sent: Sunday, July 10, 2011 12:58 PM
> To: stev...@chaven.com
> Cc: bacula-users@lists.sourceforge.net
> Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block 
> to device "LTO4"
> 
> >> 
> >> 2) since everything is spooled first, there should be NO error that should 
> >> cancel a job. A tape drive could fail, a tape could burst into flame, all 
> >> that would be needed was bacula to know that >>there was an issue and give 
> >> the admin a simple statement do you want to fix the issue or cancel?, the 
> >> admin to fix the problem, and then bacula told to restart from the last 
> >> block that was >>stored successfully OR if need be from the beginning of 
> >> the spooled data file.
> 
> >This I do know. Although, at first glance it seems easy to do this, it is 
> >not. If it was trivial to do, I assure you, it would already be in place.
> 
> >> Canceling jobs that run for days for TB's of data is just screwed up.
> 
> >I suggest running smaller jobs. I don't mean to sound trite, but that really 
> >is the solution. Given that the alternative is non-trivial, the sensible 
> >choice is, I'm afraid, cancel the job.
> 
> I'm already kicking off 20+ jobs for a single system. This does not 
> work when we're talking over the 100TB/nearly 200TB mark. And when these 
> errors happen it does not matter how many jobs you have, as /all/ outstanding 
> jobs fail when you have concurrency (in this case all jobs that were queued and 
> were not even writing to the same tape were canceled).
This sounds like a configuration issue.  Queued jobs should not be cancelled 
when a previous job cancels.

> This does not happen with any other enterprise backup software not that they 
> should be 100% mimicked.
> With the data sizes we have today I don't see why there are not better error 
> handling checks/routines.


This is open source software.  Stuff gets written because someone wants it.  
Clearly, nobody who wants it has written it. That is why it does not exist.

-- 
Dan Langille - http://langille.org



Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread Steve Costaras


> Just had a quick look... the "read-only" message is this in stored/block.c:
>
>    if (!dev->can_append()) {
>       dev->dev_errno = EIO;
>       Jmsg1(jcr, M_FATAL, 0, _("Attempt to write on read-only Volume. dev=%s\n"),
>             dev->print_name());
>       return false;
>    }
>
> And can_append() is:
>
> int can_append() const { return state & ST_APPEND; }
>
> so it does seem pretty basic unless there is a race somewhere in getting the
> value of 'state'.
>
> Are there any kernel messages that might indicate a problem somewhere at that
> time?


Nothing related to bacula/tape modules. I am running zfsonlinux for the file 
system here, and there is a known bug with it causing soft lockups for 60-120 
seconds:

[121423.079640] BUG: soft lockup - CPU#5 stuck for 61s! [z_wr_iss/5:5354]

The system recovers, though. This normally happens at delete time (txg_sync), 
which, as this was a new tape mount, would/could be close to the time when an 
old spool file was being deleted (spool sizes are 800G, which is the same size 
as the LTO4 tape).

I did not see anything like that happen at the time, though. When it normally 
happens there is a complete system 'freeze' for a couple of seconds and then 
recovery; I was in via ssh, did not see that, and was able to umount & run 
btape commands.
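For context, the spool lifecycle being discussed is governed by the Storage daemon's Device resource and the Job's Spool Data setting. A minimal sketch, assuming standard Bacula directive names (the 800G figure comes from this message; the spool path and everything else here is illustrative, not this site's actual config):

```
# bacula-sd.conf (Device resource)
Device {
  Name = LTO4
  Archive Device = /dev/nst0
  Media Type = LTO-4
  Spool Directory = /var/spool/bacula   # illustrative path
  Maximum Spool Size = 800G             # matches the tape size quoted above
}

# bacula-dir.conf (Job resource)
Job {
  Spool Data = yes                      # spool to disk, then despool to tape
}
```

The spool file is deleted after each despool, which is exactly the delete-time window where the ZFS txg_sync lockup described above could coincide with a new tape mount.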









Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread James Harper
> 
> No idea, unless we can find out what triggered the original message. Without
> doing anything physical, I did an umount storage=LTO4 from bacula and then
> went and did a full btape rawfill without a single problem on the volume:
> 
> *status
>  Bacula status: file=0 block=1
>  Device status: ONLINE IM_REP_EN file=0 block=1
> btape: btape.c:2133 Device status: 641. ERR=
> *rewind
> btape: btape.c:578 Rewound "LTO4" (/dev/nst0)
> *rawfill
> btape: btape.c:2847 Begin writing raw blocks of 2097152 bytes.
> +++ (...)
> Write failed at block 384701. stat=-1 ERR=No space left on device
> btape: btape.c:410 Volume bytes=806.7 GB. Write rate = 106.1 MB/s
> btape: btape.c:608 Wrote 1 EOF to "LTO4" (/dev/nst0)
> *
> 
> zero problems at all.
> 

Just had a quick look... the "read-only" message is this in stored/block.c:

   if (!dev->can_append()) {
      dev->dev_errno = EIO;
      Jmsg1(jcr, M_FATAL, 0, _("Attempt to write on read-only Volume. dev=%s\n"),
            dev->print_name());
      return false;
   }

And can_append() is:

int can_append() const { return state & ST_APPEND; }

so it does seem pretty basic unless there is a race somewhere in getting the 
value of 'state'.

Are there any kernel messages that might indicate a problem somewhere at that 
time?

James


Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread Steve Costaras

No idea, unless we can find out what triggered the original message. Without 
doing anything physical, I did an umount storage=LTO4 from bacula and then went 
and did a full btape rawfill without a single problem on the volume:

*status
 Bacula status: file=0 block=1
 Device status: ONLINE IM_REP_EN file=0 block=1
btape: btape.c:2133 Device status: 641. ERR=
*rewind
btape: btape.c:578 Rewound "LTO4" (/dev/nst0)
*rawfill
btape: btape.c:2847 Begin writing raw blocks of 2097152 bytes.
+++ (...)
Write failed at block 384701. stat=-1 ERR=No space left on device
btape: btape.c:410 Volume bytes=806.7 GB. Write rate = 106.1 MB/s
btape: btape.c:608 Wrote 1 EOF to "LTO4" (/dev/nst0)
*

zero problems at all.




-Original Message-
From: James Harper [mailto:james.har...@bendigoit.com.au]
Sent: Sunday, July 10, 2011 06:42 PM
To: stev...@chaven.com, bacula-users@lists.sourceforge.net
Subject: RE: [Bacula-users] Catastrophic error. Cannot write overflow block to 
device "LTO4"

> 
> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on
> device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4"
> (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on
> read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024
> Blocks=0 at 10-Jul-2011 03:51.

This probably isn't helpful, but why does Bacula think that the volume is 
read-only?

James



Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread James Harper
> 
> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on
> device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4"
> (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on
> read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024
> Blocks=0 at 10-Jul-2011 03:51.

This probably isn't helpful, but why does Bacula think that the volume is 
read-only?
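One way to narrow this down (a suggestion, not something established in the thread): check whether the cartridge is physically write-protected, and what status Bacula has recorded for the volume, e.g.:

```
# physical write-protect tab: GNU mt reports WR_PROT among the status flags
mt -f /dev/nst0 status

# Bacula's view, from bconsole: VolStatus must be "Append" for the SD to write
*llist volume=FA0016
```

If both look normal, the "read-only" state is internal to the storage daemon rather than a property of the tape.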

James



Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread Steve Costaras

-Original Message-
From: Dan Langille [mailto:d...@langille.org]
Sent: Sunday, July 10, 2011 12:58 PM
To: stev...@chaven.com
Cc: bacula-users@lists.sourceforge.net
Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block to 
device "LTO4"

>>
>> 2) since everything is spooled first, there should be NO error that should 
>> cancel a job. A tape drive could fail, a tape could burst into flame, all 
>> that would be needed was bacula to know that >>there was an issue and give 
>> the admin a simple statement do you want to fix the issue or cancel?, the 
>> admin to fix the problem, and then bacula told to restart from the last 
>> block that was >>stored successfully OR if need be from the beginning of the 
>> spooled data file.

>This I do know. Although, at first glance it seems easy to do this, it is not. 
>If it was trivial to do, I assure you, it would already be in place.

>> Canceling jobs that run for days for TB's of data is just screwed up.

>I suggest running smaller jobs. I don't mean to sound trite, but that really 
>is the solution. Given that the alternative is non-trivial, the sensible 
>choice is, I'm afraid, cancel the job.

I'm already kicking off 20+ jobs for a single system. This does not work when 
we're talking over the 100TB/nearly 200TB mark. And when these errors happen it 
does not matter how many jobs you have, as /all/ outstanding jobs fail when you 
have concurrency (in this case all jobs that were queued and were not even 
writing to the same tape were canceled). This does not happen with any other 
enterprise backup software, not that they should be 100% mimicked. With the 
data sizes we have today I don't see why there are not better error-handling 
checks/routines.







Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread Dan Langille

On Jul 10, 2011, at 8:17 AM, Steve Costaras wrote:

> 
> 
> I am trying a full backup/multi-job to a single client and all was going well 
> until this morning when I received the error below.   All other jobs were 
> also canceled.  
> 
> My question is twofold:
> 
> 1) What the heck is this error?   I can unmount the drive, issue a rawfill to 
> the tape w/ btape and no problems?   

I don't know.  Perhaps someone else will.

> 
> 2) since everything is spooled first, there should be NO error that should 
> cancel a job.   A tape drive could fail, a tape could burst into flame,  all 
> that would be needed was bacula to know that there was an issue and give the 
> admin a simple statement do you want to fix the issue or cancel?, the admin 
> to fix the problem, and then bacula told to restart from the last block that 
> was stored successfully OR if need be from the beginning of the spooled data 
> file.

This I do know.  Although, at first glance it seems easy to do this, it is not.  
If it was trivial to do, I assure you, it would already be in place.

> Canceling jobs that run for days for TB's of data is just screwed up.

I suggest running smaller jobs.  I don't mean to sound trite, but that really 
is the solution.  Given that the alternative is non-trivial, the sensible 
choice is, I'm afraid, cancel the job.

> 
> Steve 
> 
> 
> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on 
> device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" 
> (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on 
> read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 
> Blocks=0 at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. 
> Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output 
> error
> 
> *
> 2011-07-10 03SD-loki JobId 6: Despooling elapsed time = 02:32:53, Transfer 
> rate = 93.64 M Bytes/second
> 2011-07-10 03SD-loki JobId 6: Job write elapsed time = 57:37:54, Transfer 
> rate = 8.278 M Bytes/second
> 2011-07-10 03FD-loki JobId 6: Error: bsock.c:393 Write error sending 65536 
> bytes to Storage daemon:loki:9103: ERR=Connection reset by peer
> 2011-07-10 03FD-loki JobId 6: Fatal error: backup.c:1024 Network send error 
> to SD. ERR=Connection reset by peer
> 2011-07-10 03SD-loki JobId 7: Fatal error: block.c:439 Attempt to write on 
> read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 7: Fatal error: spool.c:301 Fatal append error on 
> device "LTO4" (/dev/nst0): ERR=block.c:1015 Read zero bytes at 0:0 on device 
> "LTO4" (/dev/nst0).
> 
> 2011-07-10 03SD-loki JobId 7: Despooling elapsed time = 00:00:01, Transfer 
> rate = 858.9 G Bytes/second
> *
> 2011-07-10 03DIR-loki JobId 6: Error: Bacula DIR-loki 5.0.3 (04Aug10): 
> 10-Jul-2011 03:52:08
>  Build OS:   x86_64-unknown-linux-gnu ubuntu 10.04
>  JobId:  6
>  Job:
> JOB-loki_var_ftp_pub_Multimedia_DVD.2011-07-07_17.45.01_08
>  Backup Level:   Full
>  Client: "FD-loki" 5.0.3 (04Aug10) 
> x86_64-unknown-linux-gnu,ubuntu,10.04
>  FileSet:"FS-loki_var_ftp_pub_Multimedia_DVD" 2011-07-06 
> 18:00:01
>  Pool:   "BackupSetFA" (From Run FullPool override)
>  Catalog:"MyCatalog" (From Client resource)
>  Storage:"LTO4" (From Pool resource)
>  Scheduled time: 07-Jul-2011 17:45:01
>  Start time: 07-Jul-2011 17:50:30
>  End time:   10-Jul-2011 03:52:08
>  Elapsed time:   2 days 10 hours 1 min 38 secs
>  Priority:   50
>  FD Files Written:   452
>  SD Files Written:   452
>  FD Bytes Written:   1,717,640,639,816 (1.717 TB)
>  SD Bytes Written:   1,717,632,388,872 (1.717 TB)
>  Rate:   8222.4 KB/s
>  Software Compression:   None
>  VSS:no
>  Encryption: no
>  Accurate:   yes
>  Volume name(s): FA0011|FA0012|FA0015
>  Volume Session Id:  6
>  Volume Session Time:1310078212
>  Last Volume Bytes:  1,024 (1.024 KB)
>  Non-fatal FD errors:1
>  SD Errors:  0
>  FD termination status:  Error
>  SD termination status:  Error
>  Termination:*** Backup Error ***
> ---
> 
> 
> 

[Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 Thread Steve Costaras


I am trying a full backup/multi-job to a single client and all was going well 
until this morning when I received the error below.   All other jobs were also 
canceled.  

My question is twofold:

1) What the heck is this error? I can unmount the drive and run a full rawfill 
to the tape with btape with no problems.

2) Since everything is spooled first, there should be NO error that cancels a 
job. A tape drive could fail, a tape could burst into flame; all that would be 
needed is for bacula to know there was an issue, give the admin a simple prompt 
(do you want to fix the issue or cancel?), let the admin fix the problem, and 
then tell bacula to restart from the last block that was stored successfully, 
or if need be from the beginning of the spooled data file.

Canceling jobs that run for days for TBs of data is just screwed up.

Steve 


3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
Requesting to mount LTO4 ...
3905 Bizarre wait state 7
Do not forget to mount the drive!!!
2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on 
device "LTO4" (/dev/nst0)
2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" 
(/dev/nst0) at 10-Jul-2011 03:51.
2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on 
read-only Volume. dev="LTO4" (/dev/nst0)
2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 
Blocks=0 at 10-Jul-2011 03:51.
2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. 
Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output error

*
2011-07-10 03SD-loki JobId 6: Despooling elapsed time = 02:32:53, Transfer rate 
= 93.64 M Bytes/second
2011-07-10 03SD-loki JobId 6: Job write elapsed time = 57:37:54, Transfer rate 
= 8.278 M Bytes/second
2011-07-10 03FD-loki JobId 6: Error: bsock.c:393 Write error sending 65536 
bytes to Storage daemon:loki:9103: ERR=Connection reset by peer
2011-07-10 03FD-loki JobId 6: Fatal error: backup.c:1024 Network send error to 
SD. ERR=Connection reset by peer
2011-07-10 03SD-loki JobId 7: Fatal error: block.c:439 Attempt to write on 
read-only Volume. dev="LTO4" (/dev/nst0)
2011-07-10 03SD-loki JobId 7: Fatal error: spool.c:301 Fatal append error on 
device "LTO4" (/dev/nst0): ERR=block.c:1015 Read zero bytes at 0:0 on device 
"LTO4" (/dev/nst0).

2011-07-10 03SD-loki JobId 7: Despooling elapsed time = 00:00:01, Transfer rate 
= 858.9 G Bytes/second
*
2011-07-10 03DIR-loki JobId 6: Error: Bacula DIR-loki 5.0.3 (04Aug10): 
10-Jul-2011 03:52:08
  Build OS:   x86_64-unknown-linux-gnu ubuntu 10.04
  JobId:  6
  Job:
JOB-loki_var_ftp_pub_Multimedia_DVD.2011-07-07_17.45.01_08
  Backup Level:   Full
  Client: "FD-loki" 5.0.3 (04Aug10) 
x86_64-unknown-linux-gnu,ubuntu,10.04
  FileSet:"FS-loki_var_ftp_pub_Multimedia_DVD" 2011-07-06 
18:00:01
  Pool:   "BackupSetFA" (From Run FullPool override)
  Catalog:"MyCatalog" (From Client resource)
  Storage:"LTO4" (From Pool resource)
  Scheduled time: 07-Jul-2011 17:45:01
  Start time: 07-Jul-2011 17:50:30
  End time:   10-Jul-2011 03:52:08
  Elapsed time:   2 days 10 hours 1 min 38 secs
  Priority:   50
  FD Files Written:   452
  SD Files Written:   452
  FD Bytes Written:   1,717,640,639,816 (1.717 TB)
  SD Bytes Written:   1,717,632,388,872 (1.717 TB)
  Rate:   8222.4 KB/s
  Software Compression:   None
  VSS:no
  Encryption: no
  Accurate:   yes
  Volume name(s): FA0011|FA0012|FA0015
  Volume Session Id:  6
  Volume Session Time:1310078212
  Last Volume Bytes:  1,024 (1.024 KB)
  Non-fatal FD errors:1
  SD Errors:  0
  FD termination status:  Error
  SD termination status:  Error
  Termination:*** Backup Error ***
---


