Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Paul Lussier
Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

> Paul Lussier wrote:
>>> You should add a spindle for dle on the same physical disk, it can be
>>> a lot faster.
>>> 
>> I don't understand this statement.  Could you clarify please?
>>   
> man amanda
> search for spindle
> All DLE of a physical disk should have the same spindle (>0).
> It's generally faster to run them sequentially instead of in parallel,
> just think about head movement.

Ahh, right, I've been down that route before.  This system is an NFS
appliance like a NetApp containing ~5TB striped across a single RAID5
array.  In this case, head movement (i.e. thrashing) isn't an issue.

Consider for a moment, an NFS server with 20 exports all on the same
"spindle" being accessed simultaneously by several hundred clients.
Since the specs on this file server are supposed to handle this scenario,
having 1 of those clients doing simultaneous recursions of all its
exports should hardly put any stress on the system.

In fact, when I had spindles set on the individual DLEs such that the
backups occurred sequentially, the estimate was taking far longer than
it is now.  Currently the estimates for these DLEs in parallel are at
about 9 hours.  Sequentially we were looking at somewhere close to 35.

Several of those DLEs are up around 500-600GB each, and therefore
*each one* takes close to 9 hours.  The aggregate time when done
sequentially is 9n, where n=# of DLEs of that similar size.
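To make the 9n arithmetic concrete, here is a minimal sketch. It assumes an idealized model in which parallel estimates do not slow each other down, so the parallel wall-clock time is the longest single estimate and the sequential time is the sum. The function name and the choice of four 9-hour DLEs are illustrative assumptions, not measurements from this system.

```python
def estimate_wall_clock_hours(per_dle_hours, parallel):
    """Total estimate-phase time: longest DLE if parallel, sum if sequential."""
    if not per_dle_hours:
        return 0.0
    return max(per_dle_hours) if parallel else sum(per_dle_hours)

# Assumption: four large DLEs, each taking about 9 hours to estimate.
large_dles = [9.0, 9.0, 9.0, 9.0]

sequential = estimate_wall_clock_hours(large_dles, parallel=False)
parallel = estimate_wall_clock_hours(large_dles, parallel=True)
print(f"sequential: {sequential:.0f} h, parallel: {parallel:.0f} h")
```

With n=4 the sequential figure comes out at 36 hours, which is in line with the "close to 35" observed above.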

I think in parallel is fine, we just need to get the amandad and
sendsize to cooperate.

>> Are you suggesting this is currently possible, or that it might be a
>> good solution for the future? 
>
> for future.

That's what I suspected.  

Thank you for all your help.  I've recompiled with a higher
REP_TIMEOUT (15) and am re-running the test to be sure that's it.

Provided I don't run into any more problems after that, I'll likely
set my estimates to 'calcsize' for the next test and see what happens.

-- 
Thanks,
Paul


Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Jean-Louis Martineau

Paul Lussier wrote:

You should add a spindle for dle on the same physical disk, it can be
a lot faster.



I don't understand this statement.  Could you clarify please?
  

man amanda
search for spindle
All DLE of a physical disk should have the same spindle (>0).
It's generally faster to run them sequentially instead of in parallel, 
just think about head movement.
  

A solution could be to add an 'etimeout' in amanda-client.conf,
amandad could use it instead of REP_TIMEOUT.
Maybe the server could send its own timeout to amandad.



Are you suggesting this is currently possible, or that it might be a
good solution for the future?  I saw in amandad.c there are comments
mentioning that REP_TIMEOUT and ACK_TIMEOUT should be configurable.
I think that's a good future direction :)
  

for future.


Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Paul Lussier
Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

> Paul Lussier wrote:
>>
>> Can you point me to where in the docs this is mentioned?
>
> It's not documented; it's not a server limit, it's a client limit we
> added to be sure amandad will eventually terminate.

Ahh, that's why I never knew about it :) Perhaps some mention of it
could be made in the docs for the next release.  With storage sizes
only ever increasing, it's probably only a matter of time before
someone else runs into this (if they're lucky, they'll search these
archives :)
   
> Historical data are built from successful backups; the first estimate will
> be way off, but it will learn.

Oh, okay.  I didn't realize it could learn that way.

> You should add a spindle for dle on the same physical disk, it can be
> a lot faster.

I don't understand this statement.  Could you clarify please?

> A solution could be to add an 'etimeout' in amanda-client.conf,
> amandad could use it instead of REP_TIMEOUT.
> Maybe the server could send its own timeout to amandad.

Are you suggesting this is currently possible, or that it might be a
good solution for the future?  I saw in amandad.c there are comments
mentioning that REP_TIMEOUT and ACK_TIMEOUT should be configurable.
I think that's a good future direction :)
-- 
Thanks,
Paul


Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Jean-Louis Martineau

Paul Lussier wrote:

Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

  

Why did you never post the error from the amandad debug file?



I thought I had.  I've got etimeout set to 72000, so seeing it timeout
near 21000 set off alarms for me.

  

---
amandad: time 21603.544: /usr/local/libexec/sendsize timed out waiting
for REP data
amandad: time 21603.781: sending NAK pkt:
<
ERROR timeout on reply pipe

---


amanda has a timeout of 6 hours (21600 seconds).



Can you point me to where in the docs this is mentioned?  I've never
seen this mentioned before (though I wasn't really looking for it) and
I can't seem to find it anywhere right now (running on no sleep and no
caffeine!)
  


It's not documented; it's not a server limit, it's a client limit we
added to be sure amandad will eventually terminate.
  

You can change it in amanda-src/amandad.c
Change the value of REP_TIMEOUT.

Since the estimate is really slow, you could try calcsize or server.



I had intentionally avoided using either of those because:

 a) I'm trying to set up a new configuration which has no history and
'server' option indicates it needs historical data to estimate with.

 b) I wanted to use 'client' to be as accurate as possible in order to
create the historical data 'server' requires so I could eventually
switch to that.
  
Historical data are built from successful backups; the first estimate will be
way off, but it will learn.


You should add a spindle for dle on the same physical disk, it can be a 
lot faster.

I notice that in 'man amanda.conf' for the "estimate" or
"(c,d,e)timeout" parameter there is no mention of what the maximum
timeout is (it must be in here somewhere, I'm just not finding it...)

I set my (e,d)timeout to 72000, or 20 hours. Could there be mention in
the documentation of what the max timeout is (21600) closer to the
various timeout parameters, *or* some kind of warning if amanda.conf
has timeout parameters which are set in excess of compiled in limits?

Also, is there some means of checking the amanda.conf file for these
types of parameter violations?  If not, I could probably come up with
a config-file parser/checker like this (with a little guidance) if
people were interested. My complete ignorance of the code base informs
me: "It's just a simple perl script. No, really!" :)
  
A solution could be to add an 'etimeout' in amanda-client.conf, amandad 
could use it instead of REP_TIMEOUT.

Maybe the server could send its own timeout to amandad.



Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Paul Bijnens

On 2007-10-12 15:45, Paul Lussier wrote:
---


amanda has a timeout of 6 hours (21600 seconds).


Can you point me to where in the docs this is mentioned?  I've never
seen this mentioned before (though I wasn't really looking for it) and
I can't seem to find it anywhere right now (running on no sleep and no
caffeine!)


rfc2324

But you better set "estimate calcsize" or even "estimate server"
instead of waiting almost 24 hours now...


--
Paul Bijnens, xplanation Technology Services        Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/  email:  [EMAIL PROTECTED]
***
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out  *
***



Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Paul Lussier
Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

> Why did you never post the error from the amandad debug file?

I thought I had.  I've got etimeout set to 72000, so seeing it timeout
near 21000 set off alarms for me.

> ---
> amandad: time 21603.544: /usr/local/libexec/sendsize timed out waiting
> for REP data
> amandad: time 21603.781: sending NAK pkt:
> <
> ERROR timeout on reply pipe
>>
> ---
>
> amanda has a timeout of 6 hours (21600 seconds).

Can you point me to where in the docs this is mentioned?  I've never
seen this mentioned before (though I wasn't really looking for it) and
I can't seem to find it anywhere right now (running on no sleep and no
caffeine!)

> You can change it in amanda-src/amandad.c
> Change the value of REP_TIMEOUT.
>
> Since the estimate is really slow, you could try calcsize or server.

I had intentionally avoided using either of those because:

 a) I'm trying to set up a new configuration which has no history and
'server' option indicates it needs historical data to estimate with.

 b) I wanted to use 'client' to be as accurate as possible in order to
create the historical data 'server' requires so I could eventually
switch to that.

I notice that in 'man amanda.conf' for the "estimate" or
"(c,d,e)timeout" parameter there is no mention of what the maximum
timeout is (it must be in here somewhere, I'm just not finding it...)

I set my (e,d)timeout to 72000, or 20 hours. Could there be mention in
the documentation of what the max timeout is (21600) closer to the
various timeout parameters, *or* some kind of warning if amanda.conf
has timeout parameters which are set in excess of compiled in limits?

Also, is there some means of checking the amanda.conf file for these
types of parameter violations?  If not, I could probably come up with
a config-file parser/checker like this (with a little guidance) if
people were interested. My complete ignorance of the code base informs
me: "It's just a simple perl script. No, really!" :)
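A rough sketch of the kind of checker described above, written in Python rather than perl. It assumes the only compiled-in ceiling known from this thread, amandad's REP_TIMEOUT of 21600 seconds, and applies it to etimeout only. The names COMPILED_LIMITS and check_timeouts are hypothetical, and the parsing handles only simple "name value # comment" lines, not the full amanda.conf syntax.

```python
import re

# Assumption: the only compiled-in ceiling known from this thread is
# amandad's REP_TIMEOUT (21600 s), which caps the estimate phase.
COMPILED_LIMITS = {"etimeout": 21600}

def check_timeouts(conf_text, limits=COMPILED_LIMITS):
    """Return a warning for each timeout that exceeds its known limit."""
    warnings = []
    for line in conf_text.splitlines():
        # Drop trailing comments, then expect "name integer" and nothing else.
        line = line.split("#", 1)[0].strip()
        match = re.match(r"^(\w+)\s+(-?\d+)$", line)
        if not match:
            continue
        name, value = match.group(1).lower(), int(match.group(2))
        limit = limits.get(name)
        if limit is not None and value > limit:
            warnings.append(f"{name} {value} exceeds compiled-in limit {limit}")
    return warnings

sample = """\
etimeout  72000  # number of seconds per filesystem for estimates.
dtimeout  72000 # number of idle seconds before a dump is aborted.
ctimeout  30  # maximum number of seconds that amcheck waits
"""

for warning in check_timeouts(sample):
    print("WARNING:", warning)
```

Run against the timeout settings quoted in this thread, it flags only etimeout, since no limit for dtimeout or ctimeout has been stated here.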

Thanks for hitting me with a clue.  I'll go recompile now :)
-- 
Thanks,
Paul



Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Jean-Louis Martineau

Why did you never post the error from the amandad debug file?
---
amandad: time 21603.544: /usr/local/libexec/sendsize timed out waiting 
for REP data

amandad: time 21603.781: sending NAK pkt:
<
ERROR timeout on reply pipe
>
---

amanda has a timeout of 6 hours (21600 seconds).
You can change it in amanda-src/amandad.c
Change the value of REP_TIMEOUT.
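As a rough illustration of the interaction described here: the server-side etimeout can be set as high as you like, but the client gives up once REP_TIMEOUT passes, so the smaller of the two is what matters for an estimate. This is a simplified model of that behaviour, not Amanda's actual code. The function name is hypothetical, and only the 21600-second value comes from this thread.

```python
# Simplified model: the client-side limit caps whatever the server allows.
REP_TIMEOUT = 6 * 3600  # 21600 s, the compiled-in amandad limit quoted above

def effective_estimate_timeout(etimeout, rep_timeout=REP_TIMEOUT):
    """Seconds an estimate may really take before one side gives up."""
    return min(etimeout, rep_timeout)

# With etimeout set to 72000, the client still stops waiting at 21600 s,
# which matches the "time 21603.544 ... timed out" line in the amandad log.
print(effective_estimate_timeout(72000))
```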

Since the estimate is really slow, you could try calcsize or server.

Jean-Louis

Paul Lussier wrote:

Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

  

If sendsize is not running, it's because it crashed.



Hmm, it's definitely not running, but I don't see any trace of a crash.
Is there more verbose logging that can be turned on somewhere?

  

I don't understand why amandad finished before sendsize; can you post the
complete amandad and sendsize debug files?



Of course, attached.  amdump is still running, btw, so I can send that
log, or any other that's useful.

Thanks again!
  




Re: sendsize finishes, planner doesn't notice...

2007-10-12 Thread Paul Lussier
Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

> If sendsize is not running, it's because it crashed.

Hmm, it's definitely not running, but I don't see any trace of a crash.
Is there more verbose logging that can be turned on somewhere?

> I don't understand why amandad finished before sendsize; can you post the
> complete amandad and sendsize debug files?

Of course, attached.  amdump is still running, btw, so I can send that
log, or any other that's useful.

Thanks again!
-- 
Thanks,
Paul



sendsize.20071009224835.debug.bz2
Description: sendsize debug log


amandad.20071009224834.debug.bz2
Description: amandad debug log


Re: sendsize finishes, planner doesn't notice...

2007-10-11 Thread Jean-Louis Martineau

Paul Lussier wrote:

Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

  

It's weird.

Do you have an amdump log file or just amdump.1?
The only way to get this is if you killed amanda process on the
server, maybe a server crash.
Do you still have amanda process running on the server?



I do now. I started amanda off Tuesday night at "Tue Oct  9 22:48:34 2007".

According to the /var/log/amanda/amandad/amandad.20071009224834.debug file:

  amandad: time 21604.147: pid 26218 finish time Wed Oct 10 04:48:39 2007

According to sendsize.20071009224835.debug:

amanda2:/var/log/amanda/client/offsite# tail sendsize.20071009224835.debug 
errmsg is /usr/local/libexec/runtar exited with status 1: see /var/log/amanda/client/offsite/sendsize.20071009224835.debug

sendsize[26687]: time 37138.237: done with amname /permabit/user/uz dirname 
/permabit/user spindle -1
sendsize[26379]: time 37823.330: Total bytes written: 541649408000 (505GiB, 
14MiB/s)
sendsize[26379]: time 37823.453: .
sendsize[26379]: time 37823.453: estimate time for /permabit/user/eh level 0: 
37823.251
sendsize[26379]: time 37823.453: estimate size for /permabit/user/eh level 0: 
528954500 KB
sendsize[26379]: time 37823.453: waiting for runtar "/permabit/user/eh" child
sendsize[26379]: time 37823.453: after runtar /permabit/user/eh wait
errmsg is /usr/local/libexec/runtar exited with status 1: see 
/var/log/amanda/client/offsite/sendsize.20071009224835.debug
sendsize[26379]: time 37823.537: done with amname /permabit/user/eh dirname 
/permabit/user spindle -1

So, sendsize claims to be done, yet planner doesn't think so:
  
sendsize doesn't claim to be done; I don't see the "finish time" line
at the end of the log.

Is it still running?
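One way to check this mechanically is to search the debug file for a "finish time" line. The sketch below assumes the sendsize finish line has the same shape as the amandad one quoted in this thread ("pid N finish time ..."); the function name is hypothetical.

```python
import re

# Shape taken from the amandad line quoted in this thread:
#   amandad: time 21604.147: pid 26218 finish time Wed Oct 10 04:48:39 2007
# Assumption: sendsize writes its finish line in the same format.
FINISH = re.compile(r"^\s*\S+: time [\d.]+: pid \d+ finish time ", re.MULTILINE)

def has_finished(debug_log_text):
    """True if the debug log contains a 'finish time' line."""
    return FINISH.search(debug_log_text) is not None

amandad_tail = "  amandad: time 21604.147: pid 26218 finish time Wed Oct 10 04:48:39 2007"
sendsize_tail = ("sendsize[26379]: time 37823.537: done with amname "
                 "/permabit/user/eh dirname /permabit/user spindle -1")

print(has_finished(amandad_tail), has_finished(sendsize_tail))
```

A "done with amname" line only says that one DLE's estimate is complete; it is not the same as the process reporting its own finish.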

  planner: time 16531.383: got partial result for host amanda2 disk \
 /permabit/user/uz: 0 -> -2K, -1 -> -2K, -1 -> -2K
  [...]
  planner: time 16531.384: got partial result for host amanda2 disk \
 /permabit/user/eh: 0 -> -2K, -1 -> -2K, -1 -> -2K

amdump is currently still running, amandad has finished, but we're
still waiting for estimates which will never arrive.

I also find it disturbing that the debug log I'm looking at,
sendsize.20071009224835.debug, tells me to look at the log I'm looking
at for further information:
 
errmsg is /usr/local/libexec/runtar exited with status 1: see \

/var/log/amanda/client/offsite/sendsize.20071009224835.debug

Any idea why amandad is dying before sending the estimate data back to
the planner?  My etimeout is currently set to:
  

amandad didn't die; its log shows it finished correctly.

Am I missing something extremely obvious?  I've been using amanda for
over a decade, and I can't figure out why she's behaving like this.

If there's any more information you need in order to help me figure
this out, please let me know, the suspense here is killing me :)
  

If sendsize is not running, it's because it crashed.

I don't understand why amandad finished before sendsize; can you post the
complete amandad and sendsize debug files?


Jean-Louis


Re: sendsize finishes, planner doesn't notice...

2007-10-11 Thread Deb Baddorf

This seems a bit similar to firewall issues we had a while back ---
the sendsize estimate took long enough that the connection FROM the
server was closed.  The firewall only allowed connections made by the
server, or replies back through the same connection... and needed to be
opened for the client to start a new connection back TO the server,  when
the estimate took over a certain amount of time.
   My understanding of it may be poor, but perhaps this will jog somebody's
mind
Deb



At 3:39 PM -0400 10/11/07, Paul Lussier wrote:

Jean-Louis Martineau <[EMAIL PROTECTED]> writes:


 It's weird.

 Do you have an amdump log file or just amdump.1?
 The only way to get this is if you killed amanda process on the
 server, maybe a server crash.
 Do you still have amanda process running on the server?


I do now. I started amanda off Tuesday night at "Tue Oct  9 22:48:34 2007".

According to the /var/log/amanda/amandad/amandad.20071009224834.debug file:

  amandad: time 21604.147: pid 26218 finish time Wed Oct 10 04:48:39 2007

According to sendsize.20071009224835.debug:

amanda2:/var/log/amanda/client/offsite# tail sendsize.20071009224835.debug
errmsg is /usr/local/libexec/runtar exited with status 1: see 
/var/log/amanda/client/offsite/sendsize.20071009224835.debug
sendsize[26687]: time 37138.237: done with amname /permabit/user/uz 
dirname /permabit/user spindle -1
sendsize[26379]: time 37823.330: Total bytes written: 541649408000 
(505GiB, 14MiB/s)

sendsize[26379]: time 37823.453: .
sendsize[26379]: time 37823.453: estimate time for /permabit/user/eh 
level 0: 37823.251
sendsize[26379]: time 37823.453: estimate size for /permabit/user/eh 
level 0: 528954500 KB

sendsize[26379]: time 37823.453: waiting for runtar "/permabit/user/eh" child
sendsize[26379]: time 37823.453: after runtar /permabit/user/eh wait
errmsg is /usr/local/libexec/runtar exited with status 1: see 
/var/log/amanda/client/offsite/sendsize.20071009224835.debug
sendsize[26379]: time 37823.537: done with amname /permabit/user/eh 
dirname /permabit/user spindle -1


So, sendsize claims to be done, yet planner doesn't think so:

  planner: time 16531.383: got partial result for host amanda2 disk \
 /permabit/user/uz: 0 -> -2K, -1 -> -2K, -1 -> -2K
  [...]
  planner: time 16531.384: got partial result for host amanda2 disk \
 /permabit/user/eh: 0 -> -2K, -1 -> -2K, -1 -> -2K

amdump is currently still running, amandad has finished, but we're
still waiting for estimates which will never arrive.

I also find it disturbing that the debug log I'm looking at,
sendsize.20071009224835.debug, tells me to look at the log I'm looking
at for further information:

errmsg is /usr/local/libexec/runtar exited with status 1: see \
/var/log/amanda/client/offsite/sendsize.20071009224835.debug

Any idea why amandad is dying before sending the estimate data back to
the planner?  My etimeout is currently set to:

  # grep timeout /etc/amanda/offsite/amanda.conf
  etimeout  72000  # number of seconds per filesystem for estimates.
  dtimeout  72000 # number of idle seconds before a dump is aborted.
  ctimeout  30  # maximum number of seconds that amcheck waits
  amanda2:/var/log/amanda/server/offsite# su - backup -c 'amadmin 
offsite config' | grep -i timeout

  ETIMEOUT  72000
  DTIMEOUT  72000
  CTIMEOUT  30

  amanda2:/var/log/amanda/server/offsite# /usr/local/sbin/amgetconf 
offsite etimeout

72000

su - backup -c 'amadmin offsite version'
build: VERSION="Amanda-2.5.2p1"
   BUILT_DATE="Tue Sep 4 15:45:27 EDT 2007"
   BUILT_MACH="Linux amanda2 2.6.18-4-686 #1 SMP Mon Mar 26 
17:17:36 UTC 2007 i686 GNU/Linux"

   CC="gcc-4.2"
   CONFIGURE_COMMAND="'./configure' '--prefix=/usr/local' 
'--enable-shared' '--sysconfdir=/etc' '--localstatedir=/var/lib' 
'--with-gnutar-listdir=/var/lib/amanda/gnutar-lists' 
'--with-index-server=localhost' '--with-user=backup' 
'--with-group=backup' '--with-bsd-security' '--with-amandahosts' 
'--with-smbclient=/usr/bin/smbclient' 
'--with-debugging=/var/log/amanda' 
'--with-dumperdir=/usr/lib/amanda/dumper.d' 
'--with-tcpportrange=5,50100' '--with-udpportrange=840,860' 
'--with-maxtapeblocksize=256' '--with-ssh-security'"

paths: bindir="/usr/local/bin" sbindir="/usr/local/sbin"
   libexecdir="/usr/local/libexec" mandir="/usr/local/man"
   AMANDA_TMPDIR="/tmp/amanda"
   AMANDA_DBGDIR="/var/log/amanda" CONFIG_DIR="/etc/amanda"
   DEV_PREFIX="/dev/" RDEV_PREFIX="/dev/" DUMP=UNDEF
   RESTORE=UNDEF VDUMP=UNDEF VRESTORE=UNDEF XFSDUMP=UNDEF
   XFSRESTORE=UNDEF VXDUMP=UNDEF VXRESTORE=UNDEF
   SAMBA_CLIENT=UNDEF GNUTAR="/bin/tar"
   COMPRESS_PATH="/bin/gzip" UNCOMPRESS_PATH="/bin/gzip"
   LPRCMD="/usr/bin/lpr" MAILER="/usr/bin/Mail"
   listed_incr_dir="/var/lib/amanda/gnutar-lists"
defs:  DEFAULT_SERVER="localhost" DEFAULT_CONFIG="DailySet1"
   DEFAULT_TAPE_SERVER="localhost" HAVE_MMAP NEED_STRSTR
   

Re: sendsize finishes, planner doesn't notice...

2007-10-11 Thread Paul Lussier
Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

> It's weird.
>
> Do you have an amdump log file or just amdump.1?
> The only way to get this is if you killed amanda process on the
> server, maybe a server crash.
> Do you still have amanda process running on the server?

I do now. I started amanda off Tuesday night at "Tue Oct  9 22:48:34 2007".

According to the /var/log/amanda/amandad/amandad.20071009224834.debug file:

  amandad: time 21604.147: pid 26218 finish time Wed Oct 10 04:48:39 2007

According to sendsize.20071009224835.debug:

amanda2:/var/log/amanda/client/offsite# tail sendsize.20071009224835.debug 
errmsg is /usr/local/libexec/runtar exited with status 1: see 
/var/log/amanda/client/offsite/sendsize.20071009224835.debug
sendsize[26687]: time 37138.237: done with amname /permabit/user/uz dirname 
/permabit/user spindle -1
sendsize[26379]: time 37823.330: Total bytes written: 541649408000 (505GiB, 
14MiB/s)
sendsize[26379]: time 37823.453: .
sendsize[26379]: time 37823.453: estimate time for /permabit/user/eh level 0: 
37823.251
sendsize[26379]: time 37823.453: estimate size for /permabit/user/eh level 0: 
528954500 KB
sendsize[26379]: time 37823.453: waiting for runtar "/permabit/user/eh" child
sendsize[26379]: time 37823.453: after runtar /permabit/user/eh wait
errmsg is /usr/local/libexec/runtar exited with status 1: see 
/var/log/amanda/client/offsite/sendsize.20071009224835.debug
sendsize[26379]: time 37823.537: done with amname /permabit/user/eh dirname 
/permabit/user spindle -1

So, sendsize claims to be done, yet planner doesn't think so:

  planner: time 16531.383: got partial result for host amanda2 disk \
 /permabit/user/uz: 0 -> -2K, -1 -> -2K, -1 -> -2K
  [...]
  planner: time 16531.384: got partial result for host amanda2 disk \
 /permabit/user/eh: 0 -> -2K, -1 -> -2K, -1 -> -2K

amdump is currently still running, amandad has finished, but we're
still waiting for estimates which will never arrive.

I also find it disturbing that the debug log I'm looking at,
sendsize.20071009224835.debug, tells me to look at the log I'm looking
at for further information:
 
errmsg is /usr/local/libexec/runtar exited with status 1: see \
/var/log/amanda/client/offsite/sendsize.20071009224835.debug

Any idea why amandad is dying before sending the estimate data back to
the planner?  My etimeout is currently set to:

  # grep timeout /etc/amanda/offsite/amanda.conf
  etimeout  72000  # number of seconds per filesystem for estimates.
  dtimeout  72000 # number of idle seconds before a dump is aborted.
  ctimeout  30  # maximum number of seconds that amcheck waits
  amanda2:/var/log/amanda/server/offsite# su - backup -c 'amadmin offsite 
config' | grep -i timeout
  ETIMEOUT  72000
  DTIMEOUT  72000
  CTIMEOUT  30

  amanda2:/var/log/amanda/server/offsite# /usr/local/sbin/amgetconf offsite 
etimeout
72000

su - backup -c 'amadmin offsite version'
build: VERSION="Amanda-2.5.2p1"
   BUILT_DATE="Tue Sep 4 15:45:27 EDT 2007"
   BUILT_MACH="Linux amanda2 2.6.18-4-686 #1 SMP Mon Mar 26 17:17:36 UTC 
2007 i686 GNU/Linux"
   CC="gcc-4.2"
   CONFIGURE_COMMAND="'./configure' '--prefix=/usr/local' '--enable-shared' 
'--sysconfdir=/etc' '--localstatedir=/var/lib' 
'--with-gnutar-listdir=/var/lib/amanda/gnutar-lists' 
'--with-index-server=localhost' '--with-user=backup' '--with-group=backup' 
'--with-bsd-security' '--with-amandahosts' 
'--with-smbclient=/usr/bin/smbclient' '--with-debugging=/var/log/amanda' 
'--with-dumperdir=/usr/lib/amanda/dumper.d' '--with-tcpportrange=5,50100' 
'--with-udpportrange=840,860' '--with-maxtapeblocksize=256' 
'--with-ssh-security'"
paths: bindir="/usr/local/bin" sbindir="/usr/local/sbin"
   libexecdir="/usr/local/libexec" mandir="/usr/local/man"
   AMANDA_TMPDIR="/tmp/amanda"
   AMANDA_DBGDIR="/var/log/amanda" CONFIG_DIR="/etc/amanda"
   DEV_PREFIX="/dev/" RDEV_PREFIX="/dev/" DUMP=UNDEF
   RESTORE=UNDEF VDUMP=UNDEF VRESTORE=UNDEF XFSDUMP=UNDEF
   XFSRESTORE=UNDEF VXDUMP=UNDEF VXRESTORE=UNDEF
   SAMBA_CLIENT=UNDEF GNUTAR="/bin/tar"
   COMPRESS_PATH="/bin/gzip" UNCOMPRESS_PATH="/bin/gzip"
   LPRCMD="/usr/bin/lpr" MAILER="/usr/bin/Mail"
   listed_incr_dir="/var/lib/amanda/gnutar-lists"
defs:  DEFAULT_SERVER="localhost" DEFAULT_CONFIG="DailySet1"
   DEFAULT_TAPE_SERVER="localhost" HAVE_MMAP NEED_STRSTR
   HAVE_SYSVSHM LOCKING=POSIX_FCNTL SETPGRP_VOID DEBUG_CODE
   AMANDA_DEBUG_DAYS=4 BSD_SECURITY RSH_SECURITY USE_AMANDAHOSTS
   CLIENT_LOGIN="backup" FORCE_USERID HAVE_GZIP
   COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
   COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"


Am I missing something extremely obvious?  I've been using amanda for
over a decade, and I can't figure out why she's behaving like this.

If there's any more information you need in order to help me figure
this out, please let me know, the suspense he

Re: sendsize finishes, planner doesn't notice...

2007-10-04 Thread Paul Lussier
Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

> It's weird.
>
> Do you have an amdump log file or just amdump.1?
> The only way to get this is if you killed amanda process on the
> server, maybe a server crash.
> Do you still have amanda process running on the server?

No, the reason it's a .1 is because I killed the process on the server
after 12 hours of inactivity.  I'm currently running another dump
attempt with locally compiled 2.5.2 vs. the Debian package.  My theory
is that 2.5.2 doesn't have this problem.

I could have let it run to completion, but it would have taken 3 days or so...
-- 
Thanks,
Paul


Re: sendsize finishes, planner doesn't notice...

2007-10-04 Thread Jean-Louis Martineau

It's weird.

Do you have an amdump log file or just amdump.1?
The only way to get this is if you killed amanda process on the server, 
maybe a server crash.

Do you still have amanda process running on the server?

Paul Lussier wrote:

Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

  

Can you send me the complete amdump log file, planner.*.debug,
amandad.*.debug and sendsize.*.debug?



Sure, they're all attached.
  




Re: sendsize finishes, planner doesn't notice...

2007-10-04 Thread Jean-Louis Martineau

Paul Lussier wrote:

Jean-Louis Martineau <[EMAIL PROTECTED]> writes:

  

Look at the amdump log file; it lists all estimates received from the clients.

Are you sure the sendsize debug file you look at is the correct one?
sendsize will continue even after an estimate timeout.



It was the only sendsize log at the time, and it hadn't been updated
since 20:00ish last night.  Additionally, all the gnutar processes on
the client had exited, along with the amandad controlling process.
In the sendsize log, I see lots of things like:

sendsize[8151]: time 0.814: calculating for amname /permabit/user/uz, dirname 
/permabit/user, spindle -1
sendsize[8151]: time 0.814: getting size via gnutar for /permabit/user/uz level 0
sendsize[8151]: time 0.816: spawning /usr/lib/amanda/runtar in pipeline
sendsize[8151]: argument list: runtar offsite /bin/tar --create --file 
/dev/null --directory /permabit/user --one-file-system --listed-incremental 
/var/lib/amanda/gnutar-lists/amanda2_permabit_user_uz_0.new --sparse 
--ignore-failed-read --totals --exclude-from 
/tmp/amanda/sendsize._permabit_user_uz.20071003113106.exclude .
sendsize[8151]: time 17808.869: /bin/tar: 
./mfortson/dev/mfortson-prodtest/main/src/java/server: file changed as we read 
it
sendsize[8151]: time 33454.109: Total bytes written: 515029217280 (480GiB, 
15MiB/s)
sendsize[8151]: time 33454.253: .
sendsize[8151]: estimate time for /permabit/user/uz level 0: 33453.437
sendsize[8151]: estimate size for /permabit/user/uz level 0: 502958220 KB
sendsize[8151]: time 33454.253: waiting for runtar "/permabit/user/uz" child
sendsize[8151]: time 33454.253: after runtar /permabit/user/uz wait
sendsize[8151]: time 33454.359: done with amname /permabit/user/uz dirname 
/permabit/user spindle -1

Which I interpreted, perhaps incorrectly, to mean that sendsize had
communicated this estimate back to the planner.

Not necessarily; check the amandad.*.debug files to see whether the client
sent the estimate.

  In the amdump log, I can see:

amanda2:/permabit/user/uz overdue 13790 days for level 0
setup_estimate: amanda2:/permabit/user/uz: command 0, options: none
last_level -1 next_level0 -13790 level_days 0getting estimates 0 (-2) -1 
(-2) -1 (-2)
...
planner: time 18804.756: got partial result for host amanda2 disk /permabit/user/uz: 0 
-> -2K, -1 -> -2K, -1 -> -2K
  

The planner never received the estimate.

So, the last time planner heard from sendsize was at 18804.756, yet
sendsize actually finished at time 33454.359. This was around 20:49
last night.  When I checked it this morning at about 08:30, almost 12
hours later, planner was still waiting for sendsize.  Yet, on the
client, there were no gnutar processes left, nor was there an amandad
process.  Everything on the client was completely quiescent.

Am I missing something here?  Will the server contact the client again
later via amandad to gather any estimates which have been written to
the log since the last time?  I didn't think the server polled the
clients this way.  I always thought a session was established via
amandad and that sendsize or whatever fed data directly back to amdump
via that amandad process.
  

You are right.

Can you send me the complete amdump log file, planner.*.debug, 
amandad.*.debug and sendsize.*.debug?


Jean-Louis



Re: sendsize finishes, planner doesn't notice...

2007-10-04 Thread Jean-Louis Martineau

Paul Lussier wrote:

Hi all,

I'm using amanda 2.5.1p1-2.1 from Debian/stable.

I have several file systems which take hours to estimate and dump.
My amanda.conf contains:

  etimeout  10800 # 3 hours
  dtimeout   7200 # 2 hours
  ctimeout 30

My sendsize log reports the following:

  $ egrep "estimate (time|size) for" sendsize.20071003113105.debug \
  |grep '/permabit/user'|sort
  ...
  sendsize[8132]: estimate size for /permabit/user/eh level 0: -1 KB
  sendsize[8132]: estimate time for /permabit/user/eh level 0: 18804.285
  sendsize[8136]: estimate size for /permabit/user/il level 0: 470515080 KB
  sendsize[8136]: estimate time for /permabit/user/il level 0: 33523.568
  sendsize[8137]: estimate size for /permabit/user/mp level 0: 388366900 KB
  sendsize[8137]: estimate time for /permabit/user/mp level 0: 31830.040
  sendsize[8144]: estimate size for /permabit/user/qt level 0: 438384190 KB
  sendsize[8144]: estimate time for /permabit/user/qt level 0: 33232.123
  sendsize[8151]: estimate size for /permabit/user/uz level 0: 502958220 KB
  sendsize[8151]: estimate time for /permabit/user/uz level 0: 33453.437
  sendsize[8301]: estimate size for /permabit/user/assar level 0: 169842670 KB
  sendsize[8301]: estimate time for /permabit/user/assar level 0: 15124.977

I'm assuming that the number which is not in KB is in seconds.  Which
means that the lowest one of these took over 4 hours to complete, and
I need to increase both (e,d)timeout to at least 9 hours to accommodate
the highest of these.

The strange thing is that all these estimates *did* complete from what
I can tell in the sendsize log.  Yet the planner doesn't seem to think
they have:

  $ amstatus offsite | grep getting
  amanda2:/permabit/release  getting estimate
  amanda2:/permabit/user/eh  getting estimate
  amanda2:/permabit/user/il  getting estimate
  amanda2:/permabit/user/mp  getting estimate
  amanda2:/permabit/user/qt  getting estimate
  amanda2:/permabit/user/uz  getting estimate

I *assume* it's because of the timeout bug in amanda 2.5.1:

  $ amadmin offsite config | grep -i timeout
  ETIMEOUT  22
  DTIMEOUT  21
  CTIMEOUT  190030

Which seems to indicate that planner is going to sit around for 61+
hours waiting for estimates to show up?  What I'm not quite certain
of though, is why doesn't planner notice that these DLEs have
completed?  It noticed all the other DLEs have completed their
estimate phase, so why not these?

Is there something in the logs I can look for to determine how planner
notices that sendsize has completed for a given DLE?
  


Look at the amdump log file; it lists all estimates received from the clients.

Are you sure the sendsize debug file you look at is the correct one?
sendsize will continue even after an estimate timeout.



sendsize finishes, planner doesn't notice...

2007-10-04 Thread Paul Lussier

Hi all,

I'm using amanda 2.5.1p1-2.1 from Debian/stable.

I have several file systems which take hours to estimate and dump.
My amanda.conf contains:

  etimeout  10800 # 3 hours
  dtimeout   7200 # 2 hours
  ctimeout 30

My sendsize log reports the following:

  $ egrep "estimate (time|size) for" sendsize.20071003113105.debug \
  |grep '/permabit/user'|sort
  ...
  sendsize[8132]: estimate size for /permabit/user/eh level 0: -1 KB
  sendsize[8132]: estimate time for /permabit/user/eh level 0: 18804.285
  sendsize[8136]: estimate size for /permabit/user/il level 0: 470515080 KB
  sendsize[8136]: estimate time for /permabit/user/il level 0: 33523.568
  sendsize[8137]: estimate size for /permabit/user/mp level 0: 388366900 KB
  sendsize[8137]: estimate time for /permabit/user/mp level 0: 31830.040
  sendsize[8144]: estimate size for /permabit/user/qt level 0: 438384190 KB
  sendsize[8144]: estimate time for /permabit/user/qt level 0: 33232.123
  sendsize[8151]: estimate size for /permabit/user/uz level 0: 502958220 KB
  sendsize[8151]: estimate time for /permabit/user/uz level 0: 33453.437
  sendsize[8301]: estimate size for /permabit/user/assar level 0: 169842670 KB
  sendsize[8301]: estimate time for /permabit/user/assar level 0: 15124.977

I'm assuming that the number which is not in KB is in seconds.  Which
means that the lowest one of these took over 4 hours to complete, and
I need to increase both (e,d)timeout to at least 9 hours to accommodate
the highest of these.
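For reference, a small sketch that converts the "estimate time" lines listed above from seconds to hours. It assumes, as the message does, that the figure is in seconds. The regular expression and function name are illustrative and rely only on the line format shown in the log excerpt.

```python
import re

# Matches lines such as:
#   sendsize[8132]: estimate time for /permabit/user/eh level 0: 18804.285
LINE = re.compile(r"estimate time for (\S+) level (\d+): ([\d.]+)")

def estimate_hours(log_text):
    """Map (disk, level) to estimate duration in hours (assumes seconds)."""
    return {
        (m.group(1), int(m.group(2))): float(m.group(3)) / 3600.0
        for m in LINE.finditer(log_text)
    }

log = """\
sendsize[8132]: estimate time for /permabit/user/eh level 0: 18804.285
sendsize[8136]: estimate time for /permabit/user/il level 0: 33523.568
sendsize[8137]: estimate time for /permabit/user/mp level 0: 31830.040
sendsize[8144]: estimate time for /permabit/user/qt level 0: 33232.123
sendsize[8151]: estimate time for /permabit/user/uz level 0: 33453.437
sendsize[8301]: estimate time for /permabit/user/assar level 0: 15124.977
"""

hours = estimate_hours(log)
for (disk, level), h in sorted(hours.items(), key=lambda item: item[1]):
    print(f"{disk} level {level}: {h:.1f} h")
```

By this conversion the shortest listed estimate (assar) is about 4.2 hours and the longest (il) about 9.3 hours, so a timeout of exactly 9 hours would still be slightly too short for the slowest DLE.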

The strange thing is that all these estimates *did* complete from what
I can tell in the sendsize log.  Yet the planner doesn't seem to think
they have:

  $ amstatus offsite | grep getting
  amanda2:/permabit/release  getting estimate
  amanda2:/permabit/user/eh  getting estimate
  amanda2:/permabit/user/il  getting estimate
  amanda2:/permabit/user/mp  getting estimate
  amanda2:/permabit/user/qt  getting estimate
  amanda2:/permabit/user/uz  getting estimate

I *assume* it's because of the timeout bug in amanda 2.5.1:

  $ amadmin offsite config | grep -i timeout
  ETIMEOUT  22
  DTIMEOUT  21
  CTIMEOUT  190030

Which seems to indicate that planner is going to sit around for 61+
hours waiting for estimates to show up?  What I'm not quite certain
of though, is why doesn't planner notice that these DLEs have
completed?  It noticed all the other DLEs have completed their
estimate phase, so why not these?

Is there something in the logs I can look for to determine how planner
notices that sendsize has completed for a given DLE?

-- 
Thanks,
Paul