Re: 2.4.5: dumping directly to tape failure: no retry?
* Jean-Francois Malouin <[EMAIL PROTECTED]> [20051102 07:58]: > * Jean-Louis Martineau <[EMAIL PROTECTED]> [20051101 19:56]: > > Hi Jean-Francois, > > > > Thanks for the good report. > > > > The problem is the result of the dumper, which is FAILED, in that > > case the dump is not retried. > > > > Could you try this patch. > > I just applied the patch and will let know how it goes. Looks like that patch did the work! Amanda retried a full backup of the filesystem /data/mril/mril2 that failed and luckily enough the same situation arised as yesterday when amanda hit EOT while dumping directly to tape. But this time it retried and dumped it successfully. amdump log excerpts (sorry for the long lines!) driver: dumping yorick:/data/mril/mril2 directly to tape [...] driver: send-cmd time 4867.717 to taper: PORT-WRITE 00-00031 yorick feff9ffe7f /data/mril/mril2 0 20051103 driver: send-cmd time 4867.721 to dumper0: PORT-DUMP 00-00031 16845 yorick feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP |;bsd-auth;index; [...] taper: writing end marker. [stk_40-conf2-22 ERR kb 103840288 fm 16] changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -info changer: got exit: 0 str: 15 10 1 1 changer_query: changer return was 10 1 1 changer_query: searchable = 1 changer_find: looking for stk_40-conf2-23 changer is searchable = 1 changer_search: stk_40-conf2-23 changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -search stk_40-conf2-23 changer: got exit: 0 str: 16 /hw/tape/tps12d2nrnsv taper: slot 16: date 20050930 label stk_40-conf2-23 (exact label match) taper: read label `stk_40-conf2-23' date `20050930' taper: wrote label `stk_40-conf2-23' date `20051103' dumper: kill index command driver: result time 12869.605 from dumper0: TRY-AGAIN 00-00031 ["data write: Broken pipe"] driver: result time 12869.606 from taper: TRY-AGAIN 00-00031 [writing file: No space left on device] driver: error time 12869.632 serial gen mismatch driver: dumping yorick:/data/mril/mril2 directly to tape driver: send-cmd time 12869.633 to taper: PORT-WRITE 00-00032 yorick feff9ffe7f /data/mril/mril2 0 20051103 driver: send-cmd time 12869.637 to dumper0: PORT-DUMP 00-00032 16655 yorick feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP |;bsd-auth;index; driver: result time 21175.492 from dumper0: DONE 00-00032 9630451696304515 8306 [sec 8305.590 kb 96304515 kps 11595.1 orig-kb 96304516] taper: reader-side: got label stk_40-conf2-23 filenum 1 driver: result time 21175.550 from taper: DONE 00-00032 stk_40-conf2-23 1 [sec 8305.906 kb 96304516 kps 11594.7 {wr: writers 3009518 rdwait 2319.975 wrwait 5796.334 filemark 2.688}] Thanks again. This list is fantastic! jf > > Thanks a bunch, > jf > > > > > > > On Tue, Nov 01, 2005 at 11:37:02AM -0500, Jean-Francois Malouin wrote: > > > dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031 > > > > -- > > Jean-Louis Martineau email: [EMAIL PROTECTED] > > Département IRO, Université de Montréal > > C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 1-3529 > > Montréal, Canada, H3C 3J7Fax: (514) 343-5834 > > > --- amanda-2.4.5p1.orig/server-src/dumper.c 2005-09-20 20:11:38.0 > > -0400 > > +++ amanda-2.4.5p1.new/server-src/dumper.c 2005-11-01 19:42:29.0 > > -0500 > > @@ -1387,8 +1387,7 @@ int mesgfd, datafd, indexfd, outfd; > > rc = 2; > > goto failed; > > } > > - if(update_dataptr(&outfd, size1)) { > > - rc = 2; > > + if((rc=update_dataptr(&outfd, size1))) { > > goto failed; > > } > > if(size1 == 0) { > > -- <° ><
Re: 2.4.5: dumping directly to tape failure: no retry?
On Tuesday 01 November 2005 11:37, Jean-Francois Malouin wrote: >Hi, > >[ BTW, I noticed a thread addressing this question yesterday or so > but I don't want to hijack it... ] > >I just upgraded to 2.4.5 and noticed that if a DLE is sent directly to >tape it is not retried if it hits EOT (a full dump that won't fit in >the holdding space). I have runtapes = 10 and never run into this >problem before, or at least with the previous version I was running, >2.4.4p3-20040805. OS: irix-6.5.x with an STK L80 library and LTO1 >Ultriums. If it hits EOT while writing directly to the tape, then the file is lost. Thats been true for as long as I've been using amanda, something like 6 or 7 years now. So that behaviour is not new. In order for amanda to try it on a new tape, it would have to restart the DLE from scratch and amanda has never to my knowledge attempted to do that. >Here is some excerpts form the amdump logfile. In the following look >at yorick:/data/mril/mril2. It's the one that failed. > >To me it looks like it was being dumped directly to tape label >stk_40-conf2-28, it hit EOT and was not retried. amanda >just went on with the other DLEs in her schedule. > > >amdump logfile: > > >taper: slot 11: date 20050925 label stk_40-conf2-28 (exact label > match) taper: read label `stk_40-conf2-28' date `20050925' >taper: wrote label `stk_40-conf2-28' date `20051101' >driver: result time 98.468 from taper: TAPER-OK > >[...] > >FAILED QUEUE: empty >DONE QUEUE: > >[...] > > 10: yorick /data/mril/mril2 > >[...] > >ANALYZING ESTIMATES... > >[...] > >GENERATING SCHEDULE: > >DUMP yorick feff9ffe7f /data/mril/mril2 20051101 2 0 >1970:1:1:0:0:0 958730408789 3 2005:10:28:10:51:39 95873040 281565 > >[ a full dump of ~92GB is scheduled ] > > >[...] > >driver: send-cmd time 15078.866 to dumper0: FILE-DUMP 01-00029 >/holddisk/conf2/stk_40/20051101031508/shadow._.1 shadow >feff9ffe0f / NODEVICE 1 2005:10:30:12:33:6 2097056 DUMP 265696 > >|;bsd-auth;index; > >[...] > >driver: finished-cmd time 15241.837 taper wrote shadow:/ >driver: dumping yorick:/data/mril/mril2 directly to tape >driver: send-cmd time 15241.838 to taper: PORT-WRITE 00-00031 yorick >feff9ffe7f /data/mril/mril2 0 20051101 >driver: result time 15241.842 from taper: PORT 16614 >driver: send-cmd time 15241.842 to dumper0: PORT-DUMP 00-00031 16614 >yorick feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP > >|;bsd-auth;index; > > [ Here EOT is encountered, next tape is loaded > and then 3 DLEs are dumped to tape but not the one >that hit EOT ] > >taper: writing end marker. [stk_40-conf2-28 ERR kb 103821568 fm 16] >changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -info >changer: got exit: 0 str: 11 10 1 1 >changer_query: changer return was 10 1 1 >changer_query: searchable = 1 >changer_find: looking for stk_40-conf2-29 changer is searchable = 1 >changer_search: stk_40-conf2-29 >changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx >-search stk_40-conf2-29 >changer: got exit: 0 str: 12 /hw/tape/tps12d2nrnsv >taper: slot 12: date 20050926 label stk_40-conf2-29 (exact label > match) taper: read label `stk_40-conf2-29' date `20050926' >taper: wrote label `stk_40-conf2-29' date `20051101' >dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031 >["data write : Broken pipe"] >driver: result time 23806.849 from taper: TRY-AGAIN 00-00031 [writing >file: No space left on device] >driver: error time 23806.888 serial gen mismatch >driver: dumping yorick:/data/noel/noel1 directly to tape >driver: send-cmd time 23806.888 to taper: PORT-WRITE 00-00032 yorick >feff9ffe7f /data/noel/noel1 1 20051101 > >[...] > >driver: result time 23806.891 from taper: PORT 16475 >driver: send-cmd time 23806.891 to dumper0: PORT-DUMP 00-00032 16475 >yorick feff9ffe7f /data/noel/noel1 NODEVICE 1 2005:10:30:15:52:23 > DUMP > >|;bsd-auth;index; > >driver: state time 23806.891 free kps: 1023791 space: 52428800 taper: >writing idle-dumpers: 9 qlen tapeq: 0 runq: 2 roomq: 0 wakeup: 86400 >driver-idle: not-idle >driver: interface-state time 23806.891 if : free 1022391 if LE0: free >400 if LOCAL: free 1000 >driver: hdisk-state time 23806.891 hdisk 0: free 52428800 dumpers 0 >dumper: time 8564.873taper: s: streamtream_accept: connection from >_client127.0.0.1: connected to .16650127.0.0.1taper: t.16475 >ry_socksize: dumper: receive stream_clientbuffer size is : our side is >32768 >0.0.0.0.16650 > >[...] > >driver: result time 23815.032 from dumper0: DONE 00-00032 37 36 8 [sec >7.917 kb 36 kps 4.5 orig-kb 37] >taper: reader-side: got label stk_40-conf2-29 filenum 1 >driver: result time 23819.646 from taper: DONE 00-00032 >stk_40-conf2-29 1 [sec 12.748 kb 37 kps 2.9 >{wr: writers 3 rdwait 5.444 wrwait 4.611 filemark 2.691} ] >driver: error time 23819.646 serial gen mismatch >driver: dumping yorick:/data/neurolin
Re: 2.4.5: dumping directly to tape failure: no retry?
* Jean-Louis Martineau <[EMAIL PROTECTED]> [20051101 19:56]: > Hi Jean-Francois, > > Thanks for the good report. > > The problem is the result of the dumper, which is FAILED, in that > case the dump is not retried. > > Could you try this patch. I just applied the patch and will let know how it goes. Thanks a bunch, jf > > > On Tue, Nov 01, 2005 at 11:37:02AM -0500, Jean-Francois Malouin wrote: > > dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031 > > -- > Jean-Louis Martineau email: [EMAIL PROTECTED] > Département IRO, Université de Montréal > C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 1-3529 > Montréal, Canada, H3C 3J7Fax: (514) 343-5834 > --- amanda-2.4.5p1.orig/server-src/dumper.c 2005-09-20 20:11:38.0 > -0400 > +++ amanda-2.4.5p1.new/server-src/dumper.c2005-11-01 19:42:29.0 > -0500 > @@ -1387,8 +1387,7 @@ int mesgfd, datafd, indexfd, outfd; > rc = 2; > goto failed; > } > - if(update_dataptr(&outfd, size1)) { > - rc = 2; > + if((rc=update_dataptr(&outfd, size1))) { > goto failed; > } > if(size1 == 0) {
Re: 2.4.5: dumping directly to tape failure: no retry?
Hi Jean-Francois, Thanks for the good report. The problem is the result of the dumper, which is FAILED, in that case the dump is not retried. Could you try this patch. On Tue, Nov 01, 2005 at 11:37:02AM -0500, Jean-Francois Malouin wrote: > dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031 -- Jean-Louis Martineau email: [EMAIL PROTECTED] Département IRO, Université de Montréal C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 1-3529 Montréal, Canada, H3C 3J7Fax: (514) 343-5834 --- amanda-2.4.5p1.orig/server-src/dumper.c 2005-09-20 20:11:38.0 -0400 +++ amanda-2.4.5p1.new/server-src/dumper.c 2005-11-01 19:42:29.0 -0500 @@ -1387,8 +1387,7 @@ int mesgfd, datafd, indexfd, outfd; rc = 2; goto failed; } - if(update_dataptr(&outfd, size1)) { - rc = 2; + if((rc=update_dataptr(&outfd, size1))) { goto failed; } if(size1 == 0) {
Re: 2.4.5: dumping directly to tape failure: no retry?
On Tue, Nov 01, 2005 at 04:30:33PM -0500, Jean-Francois Malouin wrote: > > Is it reasonable to restart the entire dump of the DLE? > > Are there any other "dump" failures that are retried? > > To the first question: yes, as long as the dump takes lees than > dtimeout as I'm saying below. > My understanding of dtimeout is the amount of idle time, i.e. no communications between client and server, regarding the dump. For example when the dump program is deciding what to do but before it begins dumping. If correct, why does that come into play for your decision about total time of the dump? -- Jon H. LaBadie [EMAIL PROTECTED] JG Computing 4455 Province Line Road(609) 252-0159 Princeton, NJ 08540-4322 (609) 683-7220 (fax)
Re: 2.4.5: dumping directly to tape failure: no retry?
* Jon LaBadie <[EMAIL PROTECTED]> [20051101 16:02]: > On Tue, Nov 01, 2005 at 02:55:28PM -0500, Scott R. Burns wrote: > > I observed this same issue with: > > > > 2.4.4p4 > > NetBSD/i386 > > HP DAT24*6 > > > > My last DLE was > than the remaining tape, but would fit entirely on the > > next empty tape in the changer as it was the only remaining DLE. It was > > reported as retrying but never did. In this case my holding space was < the > > compressed size of this DLE but it would fit onto tape. > > > > So it was writing direct to tape. And after the initial failure > did not retry. Sounds like the same situation Jeff Allison > describes in the thread "spanning tapes". > > > Ultimately I found enough drive space to reserve for a holding space > than > > this DLE size but I would like to gain that back if this issue could be > > fixed. > > I wonder what is the best approach to a "failed" dump. Note, it is > not a failed taping such as would occur if the taping were coming > from the holding disk. The reason for the dump failure was tape > related, but the dump did fail. > > Is it reasonable to restart the entire dump of the DLE? > Are there any other "dump" failures that are retried? To the first question: yes, as long as the dump takes lees than dtimeout as I'm saying below. About the 2nd one I'll just say something about dump failures that are *not* retried. I've seen dump failures because of holdding disk attrition: there was plenny of space at the beginning of the backup run but since I have many amanda configurations competing for holdding disk resources, in the event that it fills up the DLE that was migrating to tape will fail and there won't be a retry. I'm not quite sure how such a situation should be deal with as I know that my setup is far from being 'standard'. This being said maybe something like 'well, looks like /holddisk is full, so let's retry with a dump direct to tape' would make sense? > > GH pointed out that Jeff's total dump time was over 15 hours. > As the failed DLE was the only large (and I think only remote) > DLE, most of those 15 hours were spent on the one dump. Should > another 15 hours be expended on the retry of a failed dump? > > Just soliciting opinions. I believe this should be dealt with a proper/meaningful value for dtimeout. In the case I presented above the dump took something like 3hr before failing only because amanda had no way of knowing which DLEs made it to tape and how much tape capacity had been used before the failed one. I certainly would want to see a retry in a case like this because the 'failed' DLE *can* fit in a tape and amanda should try as hard as she can to write the DLE to tape-- as long as it takes less than dtimeout. jf > > -- > Jon H. LaBadie [EMAIL PROTECTED] > JG Computing > 4455 Province Line Road(609) 252-0159 > Princeton, NJ 08540-4322 (609) 683-7220 (fax)
Re: 2.4.5: dumping directly to tape failure: no retry?
On Tue, Nov 01, 2005 at 02:55:28PM -0500, Scott R. Burns wrote: > I observed this same issue with: > > 2.4.4p4 > NetBSD/i386 > HP DAT24*6 > > My last DLE was > than the remaining tape, but would fit entirely on the > next empty tape in the changer as it was the only remaining DLE. It was > reported as retrying but never did. In this case my holding space was < the > compressed size of this DLE but it would fit onto tape. > So it was writing direct to tape. And after the initial failure did not retry. Sounds like the same situation Jeff Allison describes in the thread "spanning tapes". > Ultimately I found enough drive space to reserve for a holding space > than > this DLE size but I would like to gain that back if this issue could be > fixed. I wonder what is the best approach to a "failed" dump. Note, it is not a failed taping such as would occur if the taping were coming from the holding disk. The reason for the dump failure was tape related, but the dump did fail. Is it reasonable to restart the entire dump of the DLE? Are there any other "dump" failures that are retried? GH pointed out that Jeff's total dump time was over 15 hours. As the failed DLE was the only large (and I think only remote) DLE, most of those 15 hours were spent on the one dump. Should another 15 hours be expended on the retry of a failed dump? Just soliciting opinions. -- Jon H. LaBadie [EMAIL PROTECTED] JG Computing 4455 Province Line Road(609) 252-0159 Princeton, NJ 08540-4322 (609) 683-7220 (fax)
RE: 2.4.5: dumping directly to tape failure: no retry?
I observed this same issue with: 2.4.4p4 NetBSD/i386 HP DAT24*6 My last DLE was > than the remaining tape, but would fit entirely on the next empty tape in the changer as it was the only remaining DLE. It was reported as retrying but never did. In this case my holding space was < the compressed size of this DLE but it would fit onto tape. Ultimately I found enough drive space to reserve for a holding space > than this DLE size but I would like to gain that back if this issue could be fixed. Search the archives for detailed reporting on the problem by me. Scott... Scott R. Burns NETCON Technologies Incorporated Suite 135 - 4474 Blakie Road London, Ontario, Canada N6L 1G7 Voice: +1.519.652.0401 Fax: +1.519.652.9275 Web: www.netcontech.com -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Jean-Francois Malouin Sent: Tuesday, November 01, 2005 11:37 AM To: AMANDA users Subject: 2.4.5: dumping directly to tape failure: no retry? Hi, [ BTW, I noticed a thread addressing this question yesterday or so but I don't want to hijack it... ] I just upgraded to 2.4.5 and noticed that if a DLE is sent directly to tape it is not retried if it hits EOT (a full dump that won't fit in the holdding space). I have runtapes = 10 and never run into this problem before, or at least with the previous version I was running, 2.4.4p3-20040805. OS: irix-6.5.x with an STK L80 library and LTO1 Ultriums. Here is some excerpts form the amdump logfile. In the following look at yorick:/data/mril/mril2. It's the one that failed. To me it looks like it was being dumped directly to tape label stk_40-conf2-28, it hit EOT and was not retried. amanda just went on with the other DLEs in her schedule. amdump logfile: taper: slot 11: date 20050925 label stk_40-conf2-28 (exact label match) taper: read label `stk_40-conf2-28' date `20050925' taper: wrote label `stk_40-conf2-28' date `20051101' driver: result time 98.468 from taper: TAPER-OK [...] FAILED QUEUE: empty DONE QUEUE: [...] 10: yorick /data/mril/mril2 [...] ANALYZING ESTIMATES... [...] GENERATING SCHEDULE: DUMP yorick feff9ffe7f /data/mril/mril2 20051101 2 0 1970:1:1:0:0:0 958730408789 3 2005:10:28:10:51:39 95873040 281565 [ a full dump of ~92GB is scheduled ] [...] driver: send-cmd time 15078.866 to dumper0: FILE-DUMP 01-00029 /holddisk/conf2/stk_40/20051101031508/shadow._.1 shadow feff9ffe0f / NODEVICE 1 2005:10:30:12:33:6 2097056 DUMP 265696 |;bsd-auth;index; [...] driver: finished-cmd time 15241.837 taper wrote shadow:/ driver: dumping yorick:/data/mril/mril2 directly to tape driver: send-cmd time 15241.838 to taper: PORT-WRITE 00-00031 yorick feff9ffe7f /data/mril/mril2 0 20051101 driver: result time 15241.842 from taper: PORT 16614 driver: send-cmd time 15241.842 to dumper0: PORT-DUMP 00-00031 16614 yorick feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP |;bsd-auth;index; [ Here EOT is encountered, next tape is loaded and then 3 DLEs are dumped to tape but not the one that hit EOT ] taper: writing end marker. [stk_40-conf2-28 ERR kb 103821568 fm 16] changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -info changer: got exit: 0 str: 11 10 1 1 changer_query: changer return was 10 1 1 changer_query: searchable = 1 changer_find: looking for stk_40-conf2-29 changer is searchable = 1 changer_search: stk_40-conf2-29 changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -search stk_40-conf2-29 changer: got exit: 0 str: 12 /hw/tape/tps12d2nrnsv taper: slot 12: date 20050926 label stk_40-conf2-29 (exact label match) taper: read label `stk_40-conf2-29' date `20050926' taper: wrote label `stk_40-conf2-29' date `20051101' dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031 ["data write : Broken pipe"] driver: result time 23806.849 from taper: TRY-AGAIN 00-00031 [writing file: No space left on device] driver: error time 23806.888 serial gen mismatch driver: dumping yorick:/data/noel/noel1 directly to tape driver: send-cmd time 23806.888 to taper: PORT-WRITE 00-00032 yorick feff9ffe7f /data/noel/noel1 1 20051101 [...] driver: result time 23806.891 from taper: PORT 16475 driver: send-cmd time 23806.891 to dumper0: PORT-DUMP 00-00032 16475 yorick feff9ffe7f /data/noel/noel1 NODEVICE 1 2005:10:30:15:52:23 DUMP |;bsd-auth;index; driver: state time 23806.891 free kps: 1023791 space: 52428800 taper: writing idle-dumpers: 9 qlen tapeq: 0 runq: 2 roomq: 0 wakeup: 86400 driver-idle: not-idle driver: interface-state time 23806.891 if : free 1022391 if LE0: free 400 if LOCAL: free 1000 driver: hdisk-state time 23806.891 hdisk 0: free 52428800 dumpers 0 dumper: time 8564.873taper: s: streamtream_accept: connection from _client127.0.0.1: connected to .16650127.0.0.1taper: t.16475 ry_socksize: dumper: receive stream_clientbuffer size is : our side is 327