Will retry behavior again

2005-09-01 Thread dobryanskaya
Hello, guys, 
A couple of days ago I asked about the retry behavior and how to delay the 
dumps. The dump-delaying configuration option worked (somewhat, as far as dump 
delaying goes ;) ), but it does not seem to solve the problem: we still have 2 
FS failing because of "No space left", and only 2 tapes were actually used (with 
4 scheduled). 

I checked the amdump file for what was going on and here is the related section 
of it: 
--
driver: dumping host:hercules directly to tape
driver: send-cmd time 3.553 to taper: PORT-WRITE 00-00029 host feff9ffe7f hercules 0 20050901
taper: try_socksize: receive buffer size is 65536
taper: stream_server: waiting for connection: 0.0.0.0.33378
driver: result time 3.553 from taper: PORT 33378
driver: send-cmd time 3.553 to dumper0: PORT-DUMP 00-00029 33378 host feff9ffe7f hercules NODEVICE 0 1970:1:1:0:0:0 GNUTAR
[skipped]
changer: opening pipe to: /usr/local/libexec/chg-zd-mtx -slot current
changer: got exit: 0 str: 2 /dev/nst0
taper: slot 2: date 20050901 label weekly2 (active tape)
changer: opening pipe to: /usr/local/libexec/chg-zd-mtx -slot next
changer: got exit: 0 str: 3 /dev/nst0
taper: slot 3: date X label weekly3 (new tape)
taper: read label `weekly3' date `X'
taper: wrote label `weekly3' date `20050901'
dumper: kill index command
driver: result time 39992.417 from dumper0: FAILED 00-00029 [data write: Connection reset by peer]
driver: result time 39992.417 from taper: TRY-AGAIN 00-00029 [writing file: No space left on device]
driver: error time 39992.429 serial gen mismatch
^^^
--
So AMANDA advanced to the next tape and the taper's request to retry was 
actually made, but then a "serial gen mismatch" happened. 

I searched Google for the phrase and did not find anything helpful about this 
error. 
Does anybody know what this error means and how to deal with it? 


Also, I'm positive that we had enough space on the holding disk to hold this 
particular FS. Why did it go directly to tape (see the very first line of the log)? 

Thanks. 




Re: Will retry behavior again

2005-09-01 Thread Jon LaBadie
On Thu, Sep 01, 2005 at 02:40:14PM -0400, [EMAIL PROTECTED] wrote:
 
 
 Also, I'm positive that we had enough space on the holding disk to hold
 this particular FS. Why did it go directly to tape (see the very first
 line of the log)? 

Have you adjusted the holding disk reserve parameter?
By default it reserves 100% for incrementals in case of
degraded operation (eg. no tape available).

In my case, the holding disk can store several days
worth of normal backups, full and incremental.  So I've
adjusted the reserve down to 10 or 20%.
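
As a rough illustration of what the reserve percentage means in terms of space, here is a minimal Python sketch. It only does the arithmetic implied above (a percentage of the holding disk set aside for incrementals in degraded mode); the function name and the 50 GB holding disk size are made up for the example, and it is not taken from Amanda's code.

```python
def space_for_fulls_kb(holding_kb, reserve_percent):
    """Holding-disk space (in KB) left for full dumps in degraded mode.

    Assumption: `reserve` is the percentage of the holding disk set aside
    for incrementals when no tape is available, as described above.
    """
    if not 0 <= reserve_percent <= 100:
        raise ValueError("reserve must be between 0 and 100")
    return holding_kb * (100 - reserve_percent) // 100


# Hypothetical 50 GB holding disk, expressed in KB.
HOLDING_KB = 50 * 1024 * 1024

default_space = space_for_fulls_kb(HOLDING_KB, 100)  # default reserve of 100%
tuned_space = space_for_fulls_kb(HOLDING_KB, 20)     # reserve lowered to 20%
```

With the default of 100% nothing is left for fulls in degraded mode, which is why lowering the value matters when the holding disk is large.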

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road    (609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Re: Will retry behavior again

2005-09-01 Thread dobryanskaya
Actually it was set to 0, since it is supposed to be an archival dump (no 
incrementals). 



Re: Will retry behavior again

2005-09-01 Thread Paul Bijnens

[EMAIL PROTECTED] wrote:
 ...

 taper: wrote label `weekly3' date `20050901'
 dumper: kill index command
 driver: result time 39992.417 from dumper0: FAILED 00-00029 [data write: Connection reset by peer]
 driver: result time 39992.417 from taper: TRY-AGAIN 00-00029 [writing file: No space left on device]
 driver: error time 39992.429 serial gen mismatch
 ^^^
 --
 so, AMANDA advanced to the next tape, the taper request to retry was
 actually made, but serial gen mismatch has happened.


The driver keeps a table of which dumper is handling what.
When it receives a command, it also checks that table
(dumper number to the filesystem it is currently handling).

The 00029 is the generation number.
Fine.

First it receives a FAILED from dumper0, and thus it frees the table
entry, effectively setting generation number to 0, to indicate it's
doing nothing.

Less than a millisecond later (both results carry the timestamp
39992.417), it receives a TRY-AGAIN from taper, which refers to the
same generation number, but that entry was freed just before.  So
amanda says that it received a command for which the generation
number did not match.

OK, that explains the error message and what it means.
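
To make the bookkeeping concrete, here is a minimal Python sketch of the table logic just described. It is an illustration only and not Amanda's driver code: the class `SerialTable`, its methods, and the `replay` helper are made up for this example, and the "slot-generation" reading of `00-00029` is inferred from the log above.

```python
class SerialTable:
    """Toy model of a driver-side table mapping slots to active jobs."""

    def __init__(self, slots=2, first_gen=1):
        self.gen = [0] * slots      # generation 0 means "slot is idle"
        self.job = [None] * slots
        self.next_gen = first_gen

    def allocate(self, slot, job):
        """Start a job in a slot and return its serial, e.g. '00-00029'."""
        self.gen[slot] = self.next_gen
        self.job[slot] = job
        self.next_gen += 1
        return f"{slot:02d}-{self.gen[slot]:05d}"

    def free(self, serial):
        """Mark the slot idle again by resetting its generation to 0."""
        slot, _ = self._parse(serial)
        self.gen[slot] = 0
        self.job[slot] = None

    def lookup(self, serial):
        """Return the job for a serial, or fail if the generation differs."""
        slot, gen = self._parse(serial)
        if self.gen[slot] != gen:
            raise LookupError("serial gen mismatch")
        return self.job[slot]

    @staticmethod
    def _parse(serial):
        slot, gen = serial.split("-")
        return int(slot), int(gen)


def replay(events, table):
    """Apply (source, result, serial) events in order and collect outcomes."""
    outcomes = []
    for source, result, serial in events:
        try:
            job = table.lookup(serial)
        except LookupError as err:
            outcomes.append(f"{source} {result}: {err}")
            continue
        outcomes.append(f"{source} {result}: handled for {job}")
        if result == "FAILED":
            table.free(serial)  # a FAILED result ends the job and frees the entry
    return outcomes


# Order seen in the log: FAILED from the dumper first, then TRY-AGAIN from taper.
table = SerialTable(first_gen=29)
serial = table.allocate(0, "host:hercules")
observed = replay(
    [("dumper0", "FAILED", serial), ("taper", "TRY-AGAIN", serial)], table
)

# Reverse order: TRY-AGAIN arrives while the entry is still live.
table2 = SerialTable(first_gen=29)
serial2 = table2.allocate(0, "host:hercules")
reversed_order = replay(
    [("taper", "TRY-AGAIN", serial2), ("dumper0", "FAILED", serial2)], table2
)
```

In the first replay the TRY-AGAIN hits an already freed entry and is reported as a mismatch; in the second both results are matched to the job.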

The strange thing above seems to be the order of the events.
When bumping into EOT, I would expect the sequence:
- First taper bumps into end of tape:
  taper:  TRY-AGAIN 00-00029 [...No space left on device]
- then driver tells the port-dumper to kill whatever it is doing:
  driver: ABORT 00-00029      <-- this command is missing above!!!
  dumper: kill index command

But the kill index comes in first, followed by the driver logging the
FAILED from dumper0, and only then by taper saying the tape is full.
From this sequence, it seems amanda made the correct decision not to
retry as taper instructed, because the dumper signalled a fatal error
first.

Why would that happen???  I don't know.


 I searched Google for the phrase and did not find anything helpful
 about this error.
 Does anybody know what this error means and how to deal with it?

 Also, I'm positive that we had enough space on the holding disk to hold
 this particular FS. Why did it go directly to tape (see the very first
 line of the log)?


Shot in the dark: maybe a "holdingdisk no" in the dumptype?
See the output of "amadmin weekly disklist hercules".

Another shot in the dark: what version of amanda is the server?
Older versions also had a notion of negative chunksize: dumps larger
than the absolute value of chunksize were port-dumped too.  Maybe you
have a negative chunksize?  The same older versions of amanda could
also port-dump when chunksize was omitted.  This is all from memory;
I don't even have an old man page around (except in my archive backups :-)
Amanda 2.4.2 already dropped support for negative chunksizes
(and warns if you use them).

--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/  email:  [EMAIL PROTECTED]
***
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
* quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...*
* ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
***


Re: Will retry behavior again

2005-09-01 Thread Vera



Paul, 

we are running amanda v.2.4.5 on FC3. These are forced full dumps 
(dumplevel 0). I do not use a negative chunksize, but I do use a negative 
"usedisk" (which sets how much of a disk must stay unoccupied by disk images). 
Do you think it may be somehow related? 

Also, I double-checked "reserve" (set to 0), and "holdingdisk" was at its 
default (meaning I did not touch it at all before this last run; I have now set 
it explicitly to "yes"). We also cleared some HDD space, and I'm going to 
create a second holding disk. 

Is there anything you (all) can think of that I must do before the next 
weekly run? 

 The strange thing above seems the order of the events.
 When bumping into EOT, I would expect the sequence:
 - First taper bumps into end of tape:
   taper:  TRY-AGAIN 00-00029 [...No space left on device]
 - then driver says to port-dumper:
   kill whatever you're doing
   driver: ABORT 00-00029   This command is missing above!!!
   dumper: kill index command

Here is what I found strange: 
1) it was taping to tape 2 (weekly2), 
2) when it received the NODEVICE, it switched to weekly3, which is a new 
   tape(!) with no dumps written to it so far, and then 
3) kill index, FAILED, TRY-AGAIN, mismatch. 

I would expect the sequence to be: 
1) tape: weekly2, NODEVICE(?), FAILED
2) load tape weekly3
3) TRY-AGAIN, whatever...

The same happened for another DLE, when amanda switched from weekly1 to 
weekly2: that DLE failed as well. The saddest part is that the biggest DLEs 
tend to fail. I really do not want to split this configuration into two, and I 
am currently thinking about writing a script which would check whether any DLE 
has failed, adjust the configuration and force the dump on the same day. I 
realize this sounds awkward, but I do not see any other solution. Do you?
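
For the "check whether any DLE has failed" part, a minimal Python sketch could look like the following. It is only a sketch under stated assumptions: the field layout of the PORT-DUMP and result lines is inferred from the single amdump excerpt quoted earlier in this thread and may differ in other versions, the function name `failed_dles` is hypothetical, and lines wrapped by a mail client must be rejoined before parsing.

```python
import re

# Layout assumed from the quoted log:
#   driver: send-cmd time T to dumperN: PORT-DUMP SERIAL PORT HOST FEATURES DISK ...
SEND_RE = re.compile(
    r"driver: send-cmd time \S+ to (dumper\d+): "
    r"PORT-DUMP (\d+-\d+) \d+ (\S+) \S+ (\S+) "
)
#   driver: result time T from SOURCE: RESULT SERIAL [reason]
RESULT_RE = re.compile(r"driver: result time \S+ from (\S+): (\S+) (\d+-\d+)")


def failed_dles(amdump_text):
    """Return sorted 'host:disk' names for which a FAILED result was logged."""
    by_serial = {}
    failed = set()
    for line in amdump_text.splitlines():
        m = SEND_RE.search(line)
        if m:
            _, serial, host, disk = m.groups()
            by_serial[serial] = f"{host}:{disk}"
            continue
        m = RESULT_RE.search(line)
        if m and m.group(2) == "FAILED":
            # Fall back to the raw serial if the PORT-DUMP line was not seen.
            failed.add(by_serial.get(m.group(3), m.group(3)))
    return sorted(failed)


# Lines taken from the excerpt above, with mail line wraps removed.
SAMPLE = """\
driver: send-cmd time 3.553 to dumper0: PORT-DUMP 00-00029 33378 host feff9ffe7f hercules NODEVICE 0 1970:1:1:0:0:0 GNUTAR
driver: result time 3.553 from taper: PORT 33378
driver: result time 39992.417 from dumper0: FAILED 00-00029 [data write: Connection reset by peer]
driver: result time 39992.417 from taper: TRY-AGAIN 00-00029 [writing file: No space left on device]
driver: error time 39992.429 serial gen mismatch
"""

failures = failed_dles(SAMPLE)
```

What to do with the resulting list (adjusting the configuration, forcing a dump) is left out on purpose, since that depends on the local setup.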

Thanks
Vera


Will retry behavior

2005-08-30 Thread dobryanskaya
Hello all, 

We have recently decided to add a weekly archival (full) dump configuration, 
and it is not running as smoothly as we hoped.
 
Here is the problem. For the full dumps we use 4 tapes (40G) per run. Small 
filesystems are copied over without any problems; the problems start with big 
filesystems. 
Surprisingly, we found that only 3 tapes were used, but 2 big FS failed 
to go to tape because there was no space left. 

Here is the part of AMANDA's full dump report that covers the failed FS:

NOTES:
  taper: tape weekly1 kb 35900256 fm 14 writing file: No space left on device
  taper: retrying zeus: /hmssql.0 on new tape: [writing file: No space left on device]
  taper: tape weekly2 kb 35878464 fm 3 writing file: No space left on device
  taper: retrying athena: /F.0 on new tape: [writing file: No space left on device]
  taper: tape weekly3 kb 0 fm 0 [OK]
---
The result is that athena and zeus failed. :( 

The question is whether I can control this, maybe by specifying the number of 
attempts, or by changing the dumporder (currently sSsS; we also tried sssS) or 
taperalgo (it was first, and we are going to try firstfit). Or maybe it is 
possible to control more precisely the order in which FS are written to tape?

Any advice is appreciated.  

Thanks
Vera



Re: Will retry behavior

2005-08-30 Thread Jon LaBadie

Someone recently posted a patch that delays taping until all dumps are
collected in the holding disk.  Then the largest fit algorithm should
make optimal use of your tapes.  But of course you need sufficient
holding disk.
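
To show why having all dumps available before taping helps, here is a small Python sketch of a greedy "largest fit" packing. It is an illustration under stated assumptions and not Amanda's taper code: the function name, the dump names and the sizes (in GB) are invented for the example.

```python
def pack_largest_fit(dump_sizes, tape_capacity, tape_count):
    """Fill each tape by repeatedly taking the largest waiting dump that fits.

    Returns (tapes, unplaced): a list of dump-name lists, one per tape, and
    the names that did not fit on any tape.
    """
    # Largest first, so one pass per tape always meets the largest fitting
    # dump before any smaller one.
    waiting = sorted(dump_sizes.items(), key=lambda kv: kv[1], reverse=True)
    tapes = []
    for _ in range(tape_count):
        free = tape_capacity
        contents, still_waiting = [], []
        for name, size in waiting:
            if size <= free:
                contents.append(name)
                free -= size
            else:
                still_waiting.append((name, size))
        waiting = still_waiting
        tapes.append(contents)
    return tapes, [name for name, _ in waiting]


# Hypothetical dump sizes in GB and a tape that holds about 35 GB.
DUMPS = {"big1": 30, "big2": 28, "s1": 6, "s2": 5, "s3": 4, "s4": 3}

two_tapes, left_over = pack_largest_fit(DUMPS, tape_capacity=35, tape_count=2)
three_tapes, none_left = pack_largest_fit(DUMPS, tape_capacity=35, tape_count=3)
```

Each big dump is placed first and the remaining space is topped up with a small one, which only works when the small dumps are still waiting in the holding disk at that point.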

IIRC there is a DLE parameter to delay, or don't start until.  Perhaps
you could delay smaller DLEs.

Multiple archive configs, with the same index and tape lists might be used.
Perhaps then the large DLEs could be done by themselves or each with just
a few smaller DLEs delayed.
-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road    (609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Re: Will retry behavior

2005-08-30 Thread Mike Delaney
On Tue, Aug 30, 2005 at 05:19:29PM -0400, [EMAIL PROTECTED] wrote:
 
 NOTES:
   taper: tape weekly1 kb 35900256 fm 14 writing file: No space left on device
   taper: retrying zeus: /hmssql.0 on new tape: [writing file: No space left on device]
   taper: tape weekly2 kb 35878464 fm 3 writing file: No space left on device
   taper: retrying athena: /F.0 on new tape: [writing file: No space left on device]
   taper: tape weekly3 kb 0 fm 0 [OK]
 ---
 The result is that athena and zeus are failed. :( 

I think you're misreading the message.  The first line says that amanda failed
to write a DLE to tape weekly1 because the tape was full.  The second line
says that it retried taping DLE zeus:/hmssql.0 on a new tape [the last attempt
failed because the first tape was full].  There's nothing there that says that
zeus:/hmssql.0 didn't make it on to tape.

The same goes for the 3rd and 4th lines w.r.t. athena:/F.0.

If those dumps really had failed to tape, there would have been a big NOTICE
IN ALL CAPS up at the very top of the report stating that some DLEs had
failed to tape.
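
For anyone who wants to pull these NOTES lines apart mechanically, here is a minimal Python sketch. It assumes the two line shapes quoted above ("taper: tape ..." and "taper: retrying ...") and nothing else about the report; the function name and regular expressions are my own and may need adjusting for other Amanda versions.

```python
import re

# taper: tape LABEL kb KBYTES fm FILEMARKS STATUS
TAPE_RE = re.compile(r"taper: tape (\S+) kb (\d+) fm (\d+) (.+)")
# taper: retrying HOST:DISK.LEVEL on new tape: [reason]
RETRY_RE = re.compile(r"taper: retrying (\S+?):\s*(\S+)\.(\d+) on new tape")


def summarize_notes(notes_text):
    """Split NOTES lines into per-tape records and retried DLEs."""
    tapes, retried = [], []
    for raw in notes_text.splitlines():
        line = raw.strip()
        m = TAPE_RE.search(line)
        if m:
            tapes.append((m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)))
            continue
        m = RETRY_RE.search(line)
        if m:
            retried.append((m.group(1), m.group(2), int(m.group(3))))
    return tapes, retried


NOTES = """\
  taper: tape weekly1 kb 35900256 fm 14 writing file: No space left on device
  taper: retrying zeus: /hmssql.0 on new tape: [writing file: No space left on device]
  taper: tape weekly2 kb 35878464 fm 3 writing file: No space left on device
  taper: retrying athena: /F.0 on new tape: [writing file: No space left on device]
  taper: tape weekly3 kb 0 fm 0 [OK]
"""

tapes, retried = summarize_notes(NOTES)
full_tapes = [label for label, kb, fm, status in tapes if "No space left" in status]
weekly1_gib = tapes[0][1] / (1024 * 1024)  # kb written to weekly1, in GiB
```

Whether a retried DLE finally made it to tape is not visible in these lines alone; as said above, the top of the report is the place to check for that.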



Re: Will retry behavior

2005-08-30 Thread Vera

Jon LaBadie wrote:

  Here is the problem. For the full dumps we use 4 tapes (40G) on a run.
  Small filesystems are copied over without any problems, problems start
  with big filesystems.
  Surprisingly, we found out that only 3 tapes were used, but 2 big FS
  failed to go on tape, because there was no space left.

 Someone recently posted a patch that delays taping until all dumps are
 collected in the holding disk.  Then the largest fit algorithm should
 make optimal use of your tapes.  But of course you need sufficient
 holding disk.


Jon,

thank you.

The major problem we are facing is the free HDD space available - it is 
definitely insufficient to hold all disks. Unfortunately it is not going to 
get better in the foreseeable future.



 IIRC there is a DLE parameter to delay, or don't start until.  Perhaps
 you could delay smaller DLEs.


Do you mean just assigning the order in which dumps are created (and 
eventually written to tape)?  So, basically, if we have 16 FS and 4 of 
them are large, then delaying 1 big and 3 small FS by a couple of hours 
would probably work?

In that case the first 4 FS are written to the first tape, then the next 4 
are started (and retried on the next tape if the tape runs out), and so on?

Do you think it would work?

 Multiple archive configs, with the same index and tape lists might be
 used.
 Perhaps then the large DLEs could be done by themselves or each with just
 a few smaller DLEs delayed.

I do not really want to do multiple configs. :(

