Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
On 2011-07-12 05:38, Martin Simmons wrote:
> Yes, that looks mostly normal.
>
> I would report that log output as a bug at bugs.bacula.org.
>
> I'm a little surprised that it specifically asked for the volume named FA0016
> though:
>
>   2011-07-10 03SD-loki JobId 6: Please mount Volume "FA0016" or label a new
>   one for:
>
> but you then issued the label command for that volume.
>
> Was FA0016 in the database already? If not, how did bacula predict the name?

Yes, I pre-populate the database with the range of tapes for each pool since I already have the bar-coded tapes.

--
All of the data generated in your IT infrastructure is seriously valuable.
Why? It contains a definitive record of application performance, security
threats, fraudulent activity, and more. Splunk takes this data and makes
sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-d2d-c2
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
> On Mon, 11 Jul 2011 16:00:15 -0500, Steve Costaras said:
>
> On 2011-07-11 06:13, Martin Simmons wrote:
> >> On Sun, 10 Jul 2011 12:17:55 +, Steve Costaras said:
> >>
> >> I am trying a full backup/multi-job to a single client and all was going
> >> well until this morning when I received the error below. All other jobs
> >> were also canceled.
> >>
> >> My question is two fold:
> >>
> >> 1) What the heck is this error? I can unmount the drive, issue a rawfill
> >> to the tape w/ btape and no problems?
> >> ...
> >> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> >> Requesting to mount LTO4 ...
> >> 3905 Bizarre wait state 7
> >> Do not forget to mount the drive!!!
> >> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on device "LTO4" (/dev/nst0)
> >> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 03:51.
> >> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
> >> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 Blocks=0 at 10-Jul-2011 03:51.
> >> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
> >> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output error
> >
> > Do you regularly see the "3905 Bizarre wait state 7" message? It could be
> > an indication of problems (and everything after that could be a consequence
> > of it).
> >
> > What are the messages that lead up to that point?
>
> Nothing, really, this was the 17th tape in a row on a ~3day (so far)
> backup. No messages in /var/log/messages. Previous messages from bacula
> are below; as you can see it just blows chunks right after FA0016 is
> mounted, all concurrent jobs are killed. And I've tested that tape before
> the backup ran and again right after this failure with btape. No problems.

Yes, that looks mostly normal.

I would report that log output as a bug at bugs.bacula.org.

I'm a little surprised that it specifically asked for the volume named FA0016 though:

  2011-07-10 03SD-loki JobId 6: Please mount Volume "FA0016" or label a new one for:

but you then issued the label command for that volume.

Was FA0016 in the database already? If not, how did bacula predict the name?

__Martin
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
On 2011-07-11 06:13, Martin Simmons wrote:
>> On Sun, 10 Jul 2011 12:17:55 +, Steve Costaras said:
>>
>> I am trying a full backup/multi-job to a single client and all was going
>> well until this morning when I received the error below. All other jobs
>> were also canceled.
>>
>> My question is two fold:
>>
>> 1) What the heck is this error? I can unmount the drive, issue a rawfill to
>> the tape w/ btape and no problems?
>> ...
>> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
>> Requesting to mount LTO4 ...
>> 3905 Bizarre wait state 7
>> Do not forget to mount the drive!!!
>> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on device "LTO4" (/dev/nst0)
>> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 03:51.
>> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
>> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 Blocks=0 at 10-Jul-2011 03:51.
>> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
>> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output error
>
> Do you regularly see the "3905 Bizarre wait state 7" message? It could be an
> indication of problems (and everything after that could be a consequence of it).
>
> What are the messages that lead up to that point?

Nothing, really, this was the 17th tape in a row on a ~3day (so far) backup. No messages in /var/log/messages. Previous messages from bacula are below; as you can see it just blows chunks right after FA0016 is mounted, all concurrent jobs are killed. And I've tested that tape before the backup ran and again right after this failure with btape. No problems.
---
*label storage=LTO4 pool=BackupSetFA volume=FA0015
Connecting to Storage daemon LTO4 at loki:9103 ...
Sending label command for Volume "FA0015" Slot 14 ...
3000 OK label. VolBytes=1024 DVD=0 Volume="FA0015" Device="LTO4" (/dev/nst0)
Requesting to mount LTO4 ...
3001 Device "LTO4" (/dev/nst0) is mounted with Volume "FA0015"
*
2011-07-10 00SD-loki JobId 3: Wrote label to prelabeled Volume "FA0015" on device "LTO4" (/dev/nst0)
2011-07-10 00SD-loki JobId 3: New volume "FA0015" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 00:48.
*
2011-07-10 00SD-loki JobId 3: Despooling elapsed time = 01:21:56, Transfer rate = 70.98 M Bytes/second
2011-07-10 00SD-loki JobId 3: Alert: smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
2011-07-10 00SD-loki JobId 3: Alert: Home page is http://smartmontools.sourceforge.net/
2011-07-10 00SD-loki JobId 3: Alert:
2011-07-10 00SD-loki JobId 3: Alert: TapeAlert: OK
2011-07-10 00SD-loki JobId 3: Alert:
2011-07-10 00SD-loki JobId 3: Alert: Error counter log:
2011-07-10 00SD-loki JobId 3: Alert:          Errors Corrected by    Total   Correction   Gigabytes    Total
2011-07-10 00SD-loki JobId 3: Alert:          ECC rereads/errors     algorithm    processed    uncorrected
2011-07-10 00SD-loki JobId 3: Alert:          fast | delayed  rewrites  corrected  invocations  [10^9 bytes]  errors
2011-07-10 00SD-loki JobId 3: Alert: read:    00       0         0         0        0.000        0
2011-07-10 00SD-loki JobId 3: Alert: write:   30100    3010      3010      3010     0.000        0
2011-07-10 00SD-loki JobId 3: Sending spooled attrs to the Director. Despooling 65,784,417 bytes ...
2011-07-10 00DIR-loki JobId 3: Bacula DIR-loki 5.0.3 (04Aug10): 10-Jul-2011 00:58:04
  Build OS:               x86_64-unknown-linux-gnu ubuntu 10.04
  JobId:                  3
  Job:                    JOB-loki_var_ftp_.2011-07-07_17.45.00_05
  Backup Level:           Full
  Client:                 "FD-loki" 5.0.3 (04Aug10) x86_64-unknown-linux-gnu,ubuntu,10.04
  FileSet:                "FS-loki_var_ftp_" 2011-07-06 18:00:00
  Pool:                   "BackupSetFA" (From Run FullPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "LTO4" (From Pool resource)
  Scheduled time:         07-Jul-2011 17:45:00
  Start time:             07-Jul-2011 17:50:30
  End time:               10-Jul-2011 00:58:04
  Elapsed time:           2 days 7 hours 7 mins 34 secs
  Priority:               50
  FD Files Written:       186,287
  SD Files Written:       186,287
  FD Bytes Written:       2,925,298,735,317 (2.925 TB)
  SD Bytes Written:       2,925,332,067,132 (2.925 TB)
  Rate:                   14740.4 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               yes
  Volume name(s):         FA0001|FA0002|FA0005|FA0006|FA0010|FA0011|FA0014
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
> On Sun, 10 Jul 2011 12:17:55 +, Steve Costaras said:
>
> I am trying a full backup/multi-job to a single client and all was going well
> until this morning when I received the error below. All other jobs were
> also canceled.
>
> My question is two fold:
>
> 1) What the heck is this error? I can unmount the drive, issue a rawfill to
> the tape w/ btape and no problems?
> ...
> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 Blocks=0 at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output error

Do you regularly see the "3905 Bizarre wait state 7" message? It could be an indication of problems (and everything after that could be a consequence of it).

What are the messages that lead up to that point?

__Martin
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
Resending, with additional information.

On Jul 10, 2011, at 3:18 PM, Steve Costaras wrote:

> -----Original Message-----
> From: Dan Langille [mailto:d...@langille.org]
> Sent: Sunday, July 10, 2011 12:58 PM
> To: stev...@chaven.com
> Cc: bacula-users@lists.sourceforge.net
> Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
>
> >> 2) since everything is spooled first, there should be NO error that should
> >> cancel a job. A tape drive could fail, a tape could burst into flame; all
> >> that would be needed was for bacula to know that there was an issue, give
> >> the admin a simple statement (do you want to fix the issue or cancel?), the
> >> admin to fix the problem, and then bacula told to restart from the last
> >> block that was stored successfully OR, if need be, from the beginning of
> >> the spooled data file.
>
> > This I do know. Although, at first glance, it seems easy to do this, it is
> > not. If it was trivial to do, I assure you, it would already be in place.
>
> >> Canceling jobs that run for days for TB's of data is just screwed up.
>
> > I suggest running smaller jobs. I don't mean to sound trite, but that really
> > is the solution. Given that the alternative is non-trivial, the sensible
> > choice is, I'm afraid, to cancel the job.
>
> I'm already kicking off 20+ jobs for a single system. This does not work
> when we're talking over the 100TB/nearly 200TB mark. And when these errors
> happen it does not matter how many jobs you have, as /all/ outstanding jobs
> fail when you have concurrency (in this case all jobs that were queued, and
> were not even writing to the same tape, were canceled).

This sounds like a configuration issue. Queued jobs should not be cancelled when a previous job cancels. FYI, I've never seen this happen on my systems. I think this is something you need to follow up on.

> This does not happen with any other enterprise backup software, not that
> they should be 100% mimicked.
> With the data sizes we have today I don't see why there are not better error
> handling checks/routines.

This is open source software. Stuff gets written because someone wants it. Clearly, nobody who wants it has written it. That is why it does not exist.

But sorry, that's not helping you find a solution. James Harper has some good points. :) I hope it leads somewhere.

--
Dan Langille - http://langille.org
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device"LTO4"
On Jul 10, 2011, at 3:18 PM, Steve Costaras wrote:

> -----Original Message-----
> From: Dan Langille [mailto:d...@langille.org]
> Sent: Sunday, July 10, 2011 12:58 PM
> To: stev...@chaven.com
> Cc: bacula-users@lists.sourceforge.net
> Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
>
> >> 2) since everything is spooled first, there should be NO error that should
> >> cancel a job. A tape drive could fail, a tape could burst into flame; all
> >> that would be needed was for bacula to know that there was an issue, give
> >> the admin a simple statement (do you want to fix the issue or cancel?), the
> >> admin to fix the problem, and then bacula told to restart from the last
> >> block that was stored successfully OR, if need be, from the beginning of
> >> the spooled data file.
>
> > This I do know. Although, at first glance, it seems easy to do this, it is
> > not. If it was trivial to do, I assure you, it would already be in place.
>
> >> Canceling jobs that run for days for TB's of data is just screwed up.
>
> > I suggest running smaller jobs. I don't mean to sound trite, but that really
> > is the solution. Given that the alternative is non-trivial, the sensible
> > choice is, I'm afraid, to cancel the job.
>
> I'm already kicking off 20+ jobs for a single system. This does not work
> when we're talking over the 100TB/nearly 200TB mark. And when these errors
> happen it does not matter how many jobs you have, as /all/ outstanding jobs
> fail when you have concurrency (in this case all jobs that were queued, and
> were not even writing to the same tape, were canceled).

This sounds like a configuration issue. Queued jobs should not be cancelled when a previous job cancels.

> This does not happen with any other enterprise backup software, not that
> they should be 100% mimicked.
> With the data sizes we have today I don't see why there are not better error
> handling checks/routines.

This is open source software. Stuff gets written because someone wants it. Clearly, nobody who wants it has written it. That is why it does not exist.

--
Dan Langille - http://langille.org
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
> Just had a quick look... the "read-only" message is this in stored/block.c:
>
>    if (!dev->can_append()) {
>       dev->dev_errno = EIO;
>       Jmsg1(jcr, M_FATAL, 0, _("Attempt to write on read-only Volume. dev=%s\n"), dev->print_name());
>       return false;
>    }
>
> And can_append() is:
>
>    int can_append() const { return state & ST_APPEND; }
>
> so it does seem pretty basic unless there is a race somewhere in getting the
> value of 'state'.
>
> Are there any kernel messages that might indicate a problem somewhere at
> that time?

Nothing related to bacula/tape modules. I am running zfsonlinux for the file system here, and there is a known bug with that causing soft lockups for 60-120 seconds:

  [121423.079640] BUG: soft lockup - CPU#5 stuck for 61s! [z_wr_iss/5:5354]

Though the system recovers. This normally happens at delete time (txg_sync), which, as this was a new tape mount, would/could be close to the time when an old spool was being deleted (spool sizes are 800G, which is the same size as the LTO4 tape). Though I did not see anything like that happen at the time; when it normally happens there is a complete system 'freeze' for a couple seconds and then recovery. I was in via ssh, did not see that, and was able to umount & run btape commands.
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
> no idea, if we can find out what triggered the original message. Without
> doing anything physical, I did an umount storage=LTO4 from bacula and then
> went and did a full btape rawfill without a single problem on the volume:
>
> *status
>  Bacula status: file=0 block=1
>  Device status: ONLINE IM_REP_EN file=0 block=1
> btape: btape.c:2133 Device status: 641. ERR=
> *rewind
> btape: btape.c:578 Rewound "LTO4" (/dev/nst0)
> *rawfill
> btape: btape.c:2847 Begin writing raw blocks of 2097152 bytes.
> +++ (...)
> Write failed at block 384701. stat=-1 ERR=No space left on device
> btape: btape.c:410 Volume bytes=806.7 GB. Write rate = 106.1 MB/s
> btape: btape.c:608 Wrote 1 EOF to "LTO4" (/dev/nst0)
> *
>
> zero problems at all.

Just had a quick look... the "read-only" message is this in stored/block.c:

   if (!dev->can_append()) {
      dev->dev_errno = EIO;
      Jmsg1(jcr, M_FATAL, 0, _("Attempt to write on read-only Volume. dev=%s\n"), dev->print_name());
      return false;
   }

And can_append() is:

   int can_append() const { return state & ST_APPEND; }

so it does seem pretty basic unless there is a race somewhere in getting the value of 'state'.

Are there any kernel messages that might indicate a problem somewhere at that time?

James
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
no idea, if we can find out what triggered the original message. Without doing anything physical, I did an umount storage=LTO4 from bacula and then went and did a full btape rawfill without a single problem on the volume:

*status
 Bacula status: file=0 block=1
 Device status: ONLINE IM_REP_EN file=0 block=1
btape: btape.c:2133 Device status: 641. ERR=
*rewind
btape: btape.c:578 Rewound "LTO4" (/dev/nst0)
*rawfill
btape: btape.c:2847 Begin writing raw blocks of 2097152 bytes.
+++ (...)
Write failed at block 384701. stat=-1 ERR=No space left on device
btape: btape.c:410 Volume bytes=806.7 GB. Write rate = 106.1 MB/s
btape: btape.c:608 Wrote 1 EOF to "LTO4" (/dev/nst0)
*

zero problems at all.

-----Original Message-----
From: James Harper [mailto:james.har...@bendigoit.com.au]
Sent: Sunday, July 10, 2011 06:42 PM
To: stev...@chaven.com, bacula-users@lists.sourceforge.net
Subject: RE: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 Blocks=0 at 10-Jul-2011 03:51.

This probably isn't helpful, but why does Bacula think that the volume is read-only?

James
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 Blocks=0 at 10-Jul-2011 03:51.

This probably isn't helpful, but why does Bacula think that the volume is read-only?

James
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
-----Original Message-----
From: Dan Langille [mailto:d...@langille.org]
Sent: Sunday, July 10, 2011 12:58 PM
To: stev...@chaven.com
Cc: bacula-users@lists.sourceforge.net
Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

>> 2) since everything is spooled first, there should be NO error that should
>> cancel a job. A tape drive could fail, a tape could burst into flame; all
>> that would be needed was for bacula to know that there was an issue, give
>> the admin a simple statement (do you want to fix the issue or cancel?), the
>> admin to fix the problem, and then bacula told to restart from the last
>> block that was stored successfully OR, if need be, from the beginning of
>> the spooled data file.

> This I do know. Although, at first glance, it seems easy to do this, it is
> not. If it was trivial to do, I assure you, it would already be in place.

>> Canceling jobs that run for days for TB's of data is just screwed up.

> I suggest running smaller jobs. I don't mean to sound trite, but that really
> is the solution. Given that the alternative is non-trivial, the sensible
> choice is, I'm afraid, to cancel the job.

I'm already kicking off 20+ jobs for a single system. This does not work when we're talking over the 100TB/nearly 200TB mark. And when these errors happen it does not matter how many jobs you have, as /all/ outstanding jobs fail when you have concurrency (in this case all jobs that were queued, and were not even writing to the same tape, were canceled).

This does not happen with any other enterprise backup software, not that they should be 100% mimicked.

With the data sizes we have today I don't see why there are not better error handling checks/routines.
Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
On Jul 10, 2011, at 8:17 AM, Steve Costaras wrote:

> I am trying a full backup/multi-job to a single client and all was going
> well until this morning when I received the error below. All other jobs
> were also canceled.
>
> My question is two fold:
>
> 1) What the heck is this error? I can unmount the drive, issue a rawfill to
> the tape w/ btape and no problems?

I don't know. Perhaps someone else will.

> 2) since everything is spooled first, there should be NO error that should
> cancel a job. A tape drive could fail, a tape could burst into flame; all
> that would be needed was for bacula to know that there was an issue, give
> the admin a simple statement (do you want to fix the issue or cancel?), the
> admin to fix the problem, and then bacula told to restart from the last
> block that was stored successfully OR, if need be, from the beginning of
> the spooled data file.

This I do know. Although, at first glance, it seems easy to do this, it is not. If it was trivial to do, I assure you, it would already be in place.

> Canceling jobs that run for days for TB's of data is just screwed up.

I suggest running smaller jobs. I don't mean to sound trite, but that really is the solution. Given that the alternative is non-trivial, the sensible choice is, I'm afraid, to cancel the job.

> Steve
>
> 3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
> Requesting to mount LTO4 ...
> 3905 Bizarre wait state 7
> Do not forget to mount the drive!!!
> 2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on device "LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 Blocks=0 at 10-Jul-2011 03:51.
> 2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
> 2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. Cannot write overflow block to device "LTO4" (/dev/nst0). ERR=Input/output error
>
> *
> 2011-07-10 03SD-loki JobId 6: Despooling elapsed time = 02:32:53, Transfer rate = 93.64 M Bytes/second
> 2011-07-10 03SD-loki JobId 6: Job write elapsed time = 57:37:54, Transfer rate = 8.278 M Bytes/second
> 2011-07-10 03FD-loki JobId 6: Error: bsock.c:393 Write error sending 65536 bytes to Storage daemon:loki:9103: ERR=Connection reset by peer
> 2011-07-10 03FD-loki JobId 6: Fatal error: backup.c:1024 Network send error to SD. ERR=Connection reset by peer
> 2011-07-10 03SD-loki JobId 7: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
> 2011-07-10 03SD-loki JobId 7: Fatal error: spool.c:301 Fatal append error on device "LTO4" (/dev/nst0): ERR=block.c:1015 Read zero bytes at 0:0 on device "LTO4" (/dev/nst0).
> 2011-07-10 03SD-loki JobId 7: Despooling elapsed time = 00:00:01, Transfer rate = 858.9 G Bytes/second
> *
> 2011-07-10 03DIR-loki JobId 6: Error: Bacula DIR-loki 5.0.3 (04Aug10): 10-Jul-2011 03:52:08
>   Build OS:               x86_64-unknown-linux-gnu ubuntu 10.04
>   JobId:                  6
>   Job:                    JOB-loki_var_ftp_pub_Multimedia_DVD.2011-07-07_17.45.01_08
>   Backup Level:           Full
>   Client:                 "FD-loki" 5.0.3 (04Aug10) x86_64-unknown-linux-gnu,ubuntu,10.04
>   FileSet:                "FS-loki_var_ftp_pub_Multimedia_DVD" 2011-07-06 18:00:01
>   Pool:                   "BackupSetFA" (From Run FullPool override)
>   Catalog:                "MyCatalog" (From Client resource)
>   Storage:                "LTO4" (From Pool resource)
>   Scheduled time:         07-Jul-2011 17:45:01
>   Start time:             07-Jul-2011 17:50:30
>   End time:               10-Jul-2011 03:52:08
>   Elapsed time:           2 days 10 hours 1 min 38 secs
>   Priority:               50
>   FD Files Written:       452
>   SD Files Written:       452
>   FD Bytes Written:       1,717,640,639,816 (1.717 TB)
>   SD Bytes Written:       1,717,632,388,872 (1.717 TB)
>   Rate:                   8222.4 KB/s
>   Software Compression:   None
>   VSS:                    no
>   Encryption:             no
>   Accurate:               yes
>   Volume name(s):         FA0011|FA0012|FA0015
>   Volume Session Id:      6
>   Volume Session Time:    1310078212
>   Last Volume Bytes:      1,024 (1.024 KB)
>   Non-fatal FD errors:    1
>   SD Errors:              0
>   FD termination status:  Error
>   SD termination status:  Error
>   Termination:            *** Backup Error ***
> ---
[Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
I am trying a full backup/multi-job to a single client and all was going well until this morning when I received the error below. All other jobs were also canceled.

My question is two fold:

1) What the heck is this error? I can unmount the drive, issue a rawfill to the tape w/ btape and no problems?

2) since everything is spooled first, there should be NO error that should cancel a job. A tape drive could fail, a tape could burst into flame; all that would be needed was for bacula to know that there was an issue, give the admin a simple statement (do you want to fix the issue or cancel?), the admin to fix the problem, and then bacula told to restart from the last block that was stored successfully OR, if need be, from the beginning of the spooled data file.

Canceling jobs that run for days for TB's of data is just screwed up.

Steve

3000 OK label. VolBytes=1024 DVD=0 Volume="FA0016" Device="LTO4" (/dev/nst0)
Requesting to mount LTO4 ...
3905 Bizarre wait state 7
Do not forget to mount the drive!!!
2011-07-10 03SD-loki JobId 6: Wrote label to prelabeled Volume "FA0016" on device "LTO4" (/dev/nst0)
2011-07-10 03SD-loki JobId 6: New volume "FA0016" mounted on device "LTO4" (/dev/nst0) at 10-Jul-2011 03:51.
2011-07-10 03SD-loki JobId 6: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
2011-07-10 03SD-loki JobId 6: End of medium on Volume "FA0016" Bytes=1,024 Blocks=0 at 10-Jul-2011 03:51.
2011-07-10 03SD-loki JobId 6: Fatal error: Job 6 canceled.
2011-07-10 03SD-loki JobId 6: Fatal error: device.c:192 Catastrophic error. Cannot write overflow block to device "LTO4" (/dev/nst0).
ERR=Input/output error

*
2011-07-10 03SD-loki JobId 6: Despooling elapsed time = 02:32:53, Transfer rate = 93.64 M Bytes/second
2011-07-10 03SD-loki JobId 6: Job write elapsed time = 57:37:54, Transfer rate = 8.278 M Bytes/second
2011-07-10 03FD-loki JobId 6: Error: bsock.c:393 Write error sending 65536 bytes to Storage daemon:loki:9103: ERR=Connection reset by peer
2011-07-10 03FD-loki JobId 6: Fatal error: backup.c:1024 Network send error to SD. ERR=Connection reset by peer
2011-07-10 03SD-loki JobId 7: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="LTO4" (/dev/nst0)
2011-07-10 03SD-loki JobId 7: Fatal error: spool.c:301 Fatal append error on device "LTO4" (/dev/nst0): ERR=block.c:1015 Read zero bytes at 0:0 on device "LTO4" (/dev/nst0).
2011-07-10 03SD-loki JobId 7: Despooling elapsed time = 00:00:01, Transfer rate = 858.9 G Bytes/second
*
2011-07-10 03DIR-loki JobId 6: Error: Bacula DIR-loki 5.0.3 (04Aug10): 10-Jul-2011 03:52:08
  Build OS:               x86_64-unknown-linux-gnu ubuntu 10.04
  JobId:                  6
  Job:                    JOB-loki_var_ftp_pub_Multimedia_DVD.2011-07-07_17.45.01_08
  Backup Level:           Full
  Client:                 "FD-loki" 5.0.3 (04Aug10) x86_64-unknown-linux-gnu,ubuntu,10.04
  FileSet:                "FS-loki_var_ftp_pub_Multimedia_DVD" 2011-07-06 18:00:01
  Pool:                   "BackupSetFA" (From Run FullPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "LTO4" (From Pool resource)
  Scheduled time:         07-Jul-2011 17:45:01
  Start time:             07-Jul-2011 17:50:30
  End time:               10-Jul-2011 03:52:08
  Elapsed time:           2 days 10 hours 1 min 38 secs
  Priority:               50
  FD Files Written:       452
  SD Files Written:       452
  FD Bytes Written:       1,717,640,639,816 (1.717 TB)
  SD Bytes Written:       1,717,632,388,872 (1.717 TB)
  Rate:                   8222.4 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               yes
  Volume name(s):         FA0011|FA0012|FA0015
  Volume Session Id:      6
  Volume Session Time:    1310078212
  Last Volume Bytes:      1,024 (1.024 KB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***
---