Re: 2.4.5: dumping directly to tape failure: no retry?

2005-11-03 Thread Jean-Francois Malouin
* Jean-Francois Malouin [EMAIL PROTECTED] [20051102 07:58]:
 * Jean-Louis Martineau [EMAIL PROTECTED] [20051101 19:56]:
  Hi Jean-Francois,
  
  Thanks for the good report.
  
  The problem is the result of the dumper, which is FAILED, in that
  case the dump is not retried.
  
  Could you try this patch.
 
 I just applied the patch and will let know how it goes.

Looks like that patch did the work! Amanda retried a full backup of
the filesystem /data/mril/mril2 that failed and luckily enough the
same situation arised as yesterday when amanda hit EOT while dumping
directly to tape. But this time it retried and dumped it successfully.


amdump log excerpts (sorry for the long lines!)


driver: dumping yorick:/data/mril/mril2 directly to tape

[...]

driver: send-cmd time 4867.717 to taper: PORT-WRITE 00-00031 yorick
feff9ffe7f /data/mril/mril2 0 20051103
driver: send-cmd time 4867.721 to dumper0: PORT-DUMP 00-00031 16845
yorick feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP
|;bsd-auth;index;

[...]

taper: writing end marker. [stk_40-conf2-22 ERR kb 103840288 fm 16]
changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -info
changer: got exit: 0 str: 15 10 1 1
changer_query: changer return was 10 1 1
changer_query: searchable = 1
changer_find: looking for stk_40-conf2-23 changer is searchable = 1
changer_search: stk_40-conf2-23
changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -search 
stk_40-conf2-23
changer: got exit: 0 str: 16 /hw/tape/tps12d2nrnsv
taper: slot 16: date 20050930 label stk_40-conf2-23 (exact label match)
taper: read label `stk_40-conf2-23' date `20050930'
taper: wrote label `stk_40-conf2-23' date `20051103'
dumper: kill index command
driver: result time 12869.605 from dumper0: TRY-AGAIN 00-00031 [data write: 
Broken pipe]
driver: result time 12869.606 from taper: TRY-AGAIN 00-00031 [writing file: No 
space left on device]
driver: error time 12869.632 serial gen mismatch
driver: dumping yorick:/data/mril/mril2 directly to tape
driver: send-cmd time 12869.633 to taper: PORT-WRITE 00-00032 yorick 
feff9ffe7f /data/mril/mril2 0 20051103
driver: send-cmd time 12869.637 to dumper0: PORT-DUMP 00-00032 16655 yorick 
feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP |;bsd-auth;index;
driver: result time 21175.492 from dumper0: DONE 00-00032 9630451696304515 8306 
[sec 8305.590 kb 96304515 kps 11595.1 orig-kb 96304516]
taper: reader-side: got label stk_40-conf2-23 filenum 1
driver: result time 21175.550 from taper: DONE 00-00032 stk_40-conf2-23 1 
[sec 8305.906 kb 96304516 kps 11594.7 {wr: writers 3009518 rdwait 2319.975 
wrwait 5796.334 filemark 2.688}]


Thanks again. This list is fantastic!
jf

 
 Thanks a bunch,
 jf
 
  
  
  On Tue, Nov 01, 2005 at 11:37:02AM -0500, Jean-Francois Malouin wrote:
   dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031
  
  -- 
  Jean-Louis Martineau email: [EMAIL PROTECTED] 
  Département IRO, Université de Montréal
  C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 1-3529
  Montréal, Canada, H3C 3J7Fax: (514) 343-5834
 
  --- amanda-2.4.5p1.orig/server-src/dumper.c 2005-09-20 20:11:38.0 
  -0400
  +++ amanda-2.4.5p1.new/server-src/dumper.c  2005-11-01 19:42:29.0 
  -0500
  @@ -1387,8 +1387,7 @@ int mesgfd, datafd, indexfd, outfd;
  rc = 2;
  goto failed;
  }
  -   if(update_dataptr(outfd, size1)) {
  -   rc = 2;
  +   if((rc=update_dataptr(outfd, size1))) {
  goto failed;
  }
  if(size1 == 0) {
 
 

-- 
° 


Re: 2.4.5: dumping directly to tape failure: no retry?

2005-11-02 Thread Jean-Francois Malouin
* Jean-Louis Martineau [EMAIL PROTECTED] [20051101 19:56]:
 Hi Jean-Francois,
 
 Thanks for the good report.
 
 The problem is the result of the dumper, which is FAILED, in that
 case the dump is not retried.
 
 Could you try this patch.

I just applied the patch and will let know how it goes.

Thanks a bunch,
jf

 
 
 On Tue, Nov 01, 2005 at 11:37:02AM -0500, Jean-Francois Malouin wrote:
  dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031
 
 -- 
 Jean-Louis Martineau email: [EMAIL PROTECTED] 
 Département IRO, Université de Montréal
 C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 1-3529
 Montréal, Canada, H3C 3J7Fax: (514) 343-5834

 --- amanda-2.4.5p1.orig/server-src/dumper.c   2005-09-20 20:11:38.0 
 -0400
 +++ amanda-2.4.5p1.new/server-src/dumper.c2005-11-01 19:42:29.0 
 -0500
 @@ -1387,8 +1387,7 @@ int mesgfd, datafd, indexfd, outfd;
   rc = 2;
   goto failed;
   }
 - if(update_dataptr(outfd, size1)) {
 - rc = 2;
 + if((rc=update_dataptr(outfd, size1))) {
   goto failed;
   }
   if(size1 == 0) {




Re: 2.4.5: dumping directly to tape failure: no retry?

2005-11-02 Thread Gene Heskett
On Tuesday 01 November 2005 11:37, Jean-Francois Malouin wrote:
Hi,

[ BTW, I noticed a thread addressing this question yesterday or so
  but I don't want to hijack it... ]

I just upgraded to 2.4.5 and noticed that if a DLE is sent directly to
tape it is not retried if it hits EOT (a full dump that won't fit in
the holdding space). I have runtapes = 10 and never run into this
problem before, or at least with the previous version I was running,
2.4.4p3-20040805. OS: irix-6.5.x with an STK L80 library and LTO1
Ultriums.

If it hits EOT while writing directly to the tape, then the file is
lost.  Thats been true for as long as I've been using amanda, something
like 6 or 7 years now.  So that behaviour is not new. In order for
amanda to try it on a new tape, it would have to restart the DLE from
scratch and amanda has never to my knowledge attempted to do that.

Here is some excerpts form the amdump logfile. In the following look
at yorick:/data/mril/mril2. It's the one that failed.

To me it looks like it was being dumped directly to tape label
stk_40-conf2-28, it hit EOT and was not retried. amanda
just went on with the other DLEs in her schedule.


amdump logfile:


taper: slot 11: date 20050925 label stk_40-conf2-28 (exact label
 match) taper: read label `stk_40-conf2-28' date `20050925'
taper: wrote label `stk_40-conf2-28' date `20051101'
driver: result time 98.468 from taper: TAPER-OK

[...]

FAILED QUEUE: empty
DONE QUEUE:

[...]

 10: yorick /data/mril/mril2

[...]

ANALYZING ESTIMATES...

[...]

GENERATING SCHEDULE:

DUMP yorick feff9ffe7f /data/mril/mril2 20051101 2 0
1970:1:1:0:0:0 958730408789 3 2005:10:28:10:51:39 95873040 281565

[ a full dump of ~92GB is scheduled ]


[...]

driver: send-cmd time 15078.866 to dumper0: FILE-DUMP 01-00029
/holddisk/conf2/stk_40/20051101031508/shadow._.1 shadow
feff9ffe0f / NODEVICE 1 2005:10:30:12:33:6 2097056 DUMP 265696

|;bsd-auth;index;

[...]

driver: finished-cmd time 15241.837 taper wrote shadow:/
driver: dumping yorick:/data/mril/mril2 directly to tape
driver: send-cmd time 15241.838 to taper: PORT-WRITE 00-00031 yorick
feff9ffe7f /data/mril/mril2 0 20051101
driver: result time 15241.842 from taper: PORT 16614
driver: send-cmd time 15241.842 to dumper0: PORT-DUMP 00-00031 16614
yorick feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP

|;bsd-auth;index;

   [ Here EOT is encountered, next tape is loaded
 and then 3 DLEs are dumped to tape but not the one
that hit EOT ]

taper: writing end marker. [stk_40-conf2-28 ERR kb 103821568 fm 16]
changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -info
changer: got exit: 0 str: 11 10 1 1
changer_query: changer return was 10 1 1
changer_query: searchable = 1
changer_find: looking for stk_40-conf2-29 changer is searchable = 1
changer_search: stk_40-conf2-29
changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx
-search stk_40-conf2-29
changer: got exit: 0 str: 12 /hw/tape/tps12d2nrnsv
taper: slot 12: date 20050926 label stk_40-conf2-29 (exact label
 match) taper: read label `stk_40-conf2-29' date `20050926'
taper: wrote label `stk_40-conf2-29' date `20051101'
dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031
[data write : Broken pipe]
driver: result time 23806.849 from taper: TRY-AGAIN 00-00031 [writing
file: No space left on device]
driver: error time 23806.888 serial gen mismatch
driver: dumping yorick:/data/noel/noel1 directly to tape
driver: send-cmd time 23806.888 to taper: PORT-WRITE 00-00032 yorick
feff9ffe7f /data/noel/noel1 1 20051101

[...]

driver: result time 23806.891 from taper: PORT 16475
driver: send-cmd time 23806.891 to dumper0: PORT-DUMP 00-00032 16475
yorick feff9ffe7f /data/noel/noel1 NODEVICE 1 2005:10:30:15:52:23
 DUMP

|;bsd-auth;index;

driver: state time 23806.891 free kps: 1023791 space: 52428800 taper:
writing idle-dumpers: 9 qlen tapeq: 0 runq: 2 roomq: 0 wakeup: 86400
driver-idle: not-idle
driver: interface-state time 23806.891 if : free 1022391 if LE0: free
400 if LOCAL: free 1000
driver: hdisk-state time 23806.891 hdisk 0: free 52428800 dumpers 0
dumper: time 8564.873taper: s: streamtream_accept: connection from
_client127.0.0.1: connected to .16650127.0.0.1taper: t.16475
ry_socksize: dumper: receive stream_clientbuffer size is : our side is
32768
0.0.0.0.16650

[...]

driver: result time 23815.032 from dumper0: DONE 00-00032 37 36 8 [sec
7.917 kb 36 kps 4.5 orig-kb 37]
taper: reader-side: got label stk_40-conf2-29 filenum 1
driver: result time 23819.646 from taper: DONE 00-00032
stk_40-conf2-29 1 [sec 12.748 kb 37 kps 2.9
{wr: writers 3 rdwait 5.444 wrwait 4.611 filemark 2.691} ]
driver: error time 23819.646 serial gen mismatch
driver: dumping yorick:/data/neuroling/neuroling1 directly to tape

[...]

taper: DONE [idle wait: 13960.445 secs]
taper: writing end marker. 

RE: 2.4.5: dumping directly to tape failure: no retry?

2005-11-01 Thread Scott R. Burns
I observed this same issue with:

2.4.4p4
NetBSD/i386
HP DAT24*6

My last DLE was  than the remaining tape, but would fit entirely on the
next empty tape in the changer as it was the only remaining DLE. It was
reported as retrying but never did. In this case my holding space was  the
compressed size of this DLE but it would fit onto tape.

Ultimately I found enough drive space to reserve for a holding space  than
this DLE size but I would like to gain that back if this issue could be
fixed.

Search the archives for detailed reporting on the problem by me.

Scott...

Scott R. Burns
NETCON Technologies Incorporated
Suite 135 - 4474 Blakie Road
London, Ontario, Canada
N6L 1G7
Voice: +1.519.652.0401
Fax: +1.519.652.9275
Web: www.netcontech.com

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Jean-Francois Malouin
Sent: Tuesday, November 01, 2005 11:37 AM
To: AMANDA users
Subject: 2.4.5: dumping directly to tape failure: no retry?


Hi,

[ BTW, I noticed a thread addressing this question yesterday or so
  but I don't want to hijack it... ]

I just upgraded to 2.4.5 and noticed that if a DLE is sent directly to
tape it is not retried if it hits EOT (a full dump that won't fit in
the holdding space). I have runtapes = 10 and never run into this
problem before, or at least with the previous version I was running,
2.4.4p3-20040805. OS: irix-6.5.x with an STK L80 library and LTO1
Ultriums.

Here is some excerpts form the amdump logfile. In the following look
at yorick:/data/mril/mril2. It's the one that failed.

To me it looks like it was being dumped directly to tape label
stk_40-conf2-28, it hit EOT and was not retried. amanda
just went on with the other DLEs in her schedule.


amdump logfile:


taper: slot 11: date 20050925 label stk_40-conf2-28 (exact label match)
taper: read label `stk_40-conf2-28' date `20050925'
taper: wrote label `stk_40-conf2-28' date `20051101'
driver: result time 98.468 from taper: TAPER-OK

[...]

FAILED QUEUE: empty
DONE QUEUE:

[...]

 10: yorick /data/mril/mril2

[...]

ANALYZING ESTIMATES...

[...]

GENERATING SCHEDULE:

DUMP yorick feff9ffe7f /data/mril/mril2 20051101 2 0
1970:1:1:0:0:0 958730408789 3 2005:10:28:10:51:39 95873040 281565

[ a full dump of ~92GB is scheduled ]


[...]

driver: send-cmd time 15078.866 to dumper0: FILE-DUMP 01-00029
/holddisk/conf2/stk_40/20051101031508/shadow._.1 shadow
feff9ffe0f / NODEVICE 1 2005:10:30:12:33:6 2097056 DUMP 265696
|;bsd-auth;index;

[...]

driver: finished-cmd time 15241.837 taper wrote shadow:/
driver: dumping yorick:/data/mril/mril2 directly to tape
driver: send-cmd time 15241.838 to taper: PORT-WRITE 00-00031 yorick
feff9ffe7f /data/mril/mril2 0 20051101
driver: result time 15241.842 from taper: PORT 16614
driver: send-cmd time 15241.842 to dumper0: PORT-DUMP 00-00031 16614
yorick feff9ffe7f /data/mril/mril2 NODEVICE 0 1970:1:1:0:0:0 DUMP
|;bsd-auth;index;

   [ Here EOT is encountered, next tape is loaded
 and then 3 DLEs are dumped to tape but not the one
that hit EOT ]

taper: writing end marker. [stk_40-conf2-28 ERR kb 103821568 fm 16]
changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx -info
changer: got exit: 0 str: 11 10 1 1
changer_query: changer return was 10 1 1
changer_query: searchable = 1
changer_find: looking for stk_40-conf2-29 changer is searchable = 1
changer_search: stk_40-conf2-29
changer: opening pipe to: /opt/amanda/amanda2/libexec/chg-zd-mtx
-search stk_40-conf2-29
changer: got exit: 0 str: 12 /hw/tape/tps12d2nrnsv
taper: slot 12: date 20050926 label stk_40-conf2-29 (exact label match)
taper: read label `stk_40-conf2-29' date `20050926'
taper: wrote label `stk_40-conf2-29' date `20051101'
dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031
[data write : Broken pipe]
driver: result time 23806.849 from taper: TRY-AGAIN 00-00031 [writing
file: No space left on device]
driver: error time 23806.888 serial gen mismatch
driver: dumping yorick:/data/noel/noel1 directly to tape
driver: send-cmd time 23806.888 to taper: PORT-WRITE 00-00032 yorick
feff9ffe7f /data/noel/noel1 1 20051101

[...]

driver: result time 23806.891 from taper: PORT 16475
driver: send-cmd time 23806.891 to dumper0: PORT-DUMP 00-00032 16475
yorick feff9ffe7f /data/noel/noel1 NODEVICE 1 2005:10:30:15:52:23 DUMP
|;bsd-auth;index;
driver: state time 23806.891 free kps: 1023791 space: 52428800 taper:
writing idle-dumpers: 9 qlen tapeq: 0 runq: 2 roomq: 0 wakeup: 86400
driver-idle: not-idle
driver: interface-state time 23806.891 if : free 1022391 if LE0: free
400 if LOCAL: free 1000
driver: hdisk-state time 23806.891 hdisk 0: free 52428800 dumpers 0
dumper: time 8564.873taper: s: streamtream_accept: connection from
_client127.0.0.1: connected to .16650127.0.0.1taper: t.16475
ry_socksize: dumper: receive stream_clientbuffer size is : our side is
32768

Re: 2.4.5: dumping directly to tape failure: no retry?

2005-11-01 Thread Jon LaBadie
On Tue, Nov 01, 2005 at 02:55:28PM -0500, Scott R. Burns wrote:
 I observed this same issue with:
 
 2.4.4p4
 NetBSD/i386
 HP DAT24*6
 
 My last DLE was  than the remaining tape, but would fit entirely on the
 next empty tape in the changer as it was the only remaining DLE. It was
 reported as retrying but never did. In this case my holding space was  the
 compressed size of this DLE but it would fit onto tape.
 

So it was writing direct to tape.  And after the initial failure
did not retry.  Sounds like the same situation Jeff Allison
describes in the thread spanning tapes.

 Ultimately I found enough drive space to reserve for a holding space  than
 this DLE size but I would like to gain that back if this issue could be
 fixed.

I wonder what is the best approach to a failed dump.  Note, it is
not a failed taping such as would occur if the taping were coming
from the holding disk.  The reason for the dump failure was tape
related, but the dump did fail.

Is it reasonable to restart the entire dump of the DLE?
Are there any other dump failures that are retried?

GH pointed out that Jeff's total dump time was over 15 hours.
As the failed DLE was the only large (and I think only remote)
DLE, most of those 15 hours were spent on the one dump.  Should
another 15 hours be expended on the retry of a failed dump?

Just soliciting opinions.

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road(609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Re: 2.4.5: dumping directly to tape failure: no retry?

2005-11-01 Thread Jean-Francois Malouin
* Jon LaBadie [EMAIL PROTECTED] [20051101 16:02]:
 On Tue, Nov 01, 2005 at 02:55:28PM -0500, Scott R. Burns wrote:
  I observed this same issue with:
  
  2.4.4p4
  NetBSD/i386
  HP DAT24*6
  
  My last DLE was  than the remaining tape, but would fit entirely on the
  next empty tape in the changer as it was the only remaining DLE. It was
  reported as retrying but never did. In this case my holding space was  the
  compressed size of this DLE but it would fit onto tape.
  
 
 So it was writing direct to tape.  And after the initial failure
 did not retry.  Sounds like the same situation Jeff Allison
 describes in the thread spanning tapes.
 
  Ultimately I found enough drive space to reserve for a holding space  than
  this DLE size but I would like to gain that back if this issue could be
  fixed.
 
 I wonder what is the best approach to a failed dump.  Note, it is
 not a failed taping such as would occur if the taping were coming
 from the holding disk.  The reason for the dump failure was tape
 related, but the dump did fail.
 
 Is it reasonable to restart the entire dump of the DLE?
 Are there any other dump failures that are retried?

To the first question: yes, as long as the dump takes lees than
dtimeout as I'm saying below.

About the 2nd one I'll just say something about dump failures that are
*not* retried. I've seen dump failures because of holdding disk
attrition: there was plenny of space at the beginning of the backup
run but since I have many amanda configurations competing for holdding
disk resources, in the event that it fills up the DLE that was
migrating to tape will fail and there won't be a retry. I'm not quite
sure how such a situation should be deal with as I know that my setup
is far from being 'standard'. This being said maybe something like
'well, looks like /holddisk is full, so let's retry with a dump direct
to tape' would make sense?

 
 GH pointed out that Jeff's total dump time was over 15 hours.
 As the failed DLE was the only large (and I think only remote)
 DLE, most of those 15 hours were spent on the one dump.  Should
 another 15 hours be expended on the retry of a failed dump?
 
 Just soliciting opinions.

I believe this should be dealt with a proper/meaningful value for
dtimeout. In the case I presented above the dump took something like
3hr before failing only because amanda had no way of knowing which
DLEs made it to tape and how much tape capacity had been used before
the failed one. I certainly would want to see a retry in a case like
this because the 'failed' DLE *can* fit in a tape and amanda should
try as hard as she can to write the DLE to tape-- as long as it takes
less than dtimeout.

jf

 
 -- 
 Jon H. LaBadie  [EMAIL PROTECTED]
  JG Computing
  4455 Province Line Road(609) 252-0159
  Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Re: 2.4.5: dumping directly to tape failure: no retry?

2005-11-01 Thread Jon LaBadie
On Tue, Nov 01, 2005 at 04:30:33PM -0500, Jean-Francois Malouin wrote:

  Is it reasonable to restart the entire dump of the DLE?
  Are there any other dump failures that are retried?
 
 To the first question: yes, as long as the dump takes lees than
 dtimeout as I'm saying below.
 

My understanding of dtimeout is the amount of idle time, i.e.
no communications between client and server, regarding the dump.
For example when the dump program is deciding what to do but
before it begins dumping.

If correct, why does that come into play for your decision
about total time of the dump?

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road(609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Re: 2.4.5: dumping directly to tape failure: no retry?

2005-11-01 Thread Jean-Louis Martineau
Hi Jean-Francois,

Thanks for the good report.

The problem is the result of the dumper, which is FAILED, in that
case the dump is not retried.

Could you try this patch.


On Tue, Nov 01, 2005 at 11:37:02AM -0500, Jean-Francois Malouin wrote:
 dumper: driver: result time 23806.848 from dumper0: FAILED 00-00031

-- 
Jean-Louis Martineau email: [EMAIL PROTECTED] 
Département IRO, Université de Montréal
C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 1-3529
Montréal, Canada, H3C 3J7Fax: (514) 343-5834
--- amanda-2.4.5p1.orig/server-src/dumper.c 2005-09-20 20:11:38.0 
-0400
+++ amanda-2.4.5p1.new/server-src/dumper.c  2005-11-01 19:42:29.0 
-0500
@@ -1387,8 +1387,7 @@ int mesgfd, datafd, indexfd, outfd;
rc = 2;
goto failed;
}
-   if(update_dataptr(outfd, size1)) {
-   rc = 2;
+   if((rc=update_dataptr(outfd, size1))) {
goto failed;
}
if(size1 == 0) {