In the amandad.XXX.debug log I have the following lines, which I'm assuming are the error report for the problem. Now the question is how to fix it :-)
-Nick
amandad: time 0.010: amandahosts security check passed
amandad: time 0.010: running service "/usr/lib/amanda/sendsize"
amandad: time 182.436: sending REP packet:
----
Amanda 2.4 REP HANDLE 005-40813308 SEQ 1102082216
OPTIONS features=fffffeff9ffe0f;
/ 0 SIZE 301197
/ 1 SIZE 100
/u00 0 SIZE 143930
/u00 1 SIZE 41411
/usr 0 SIZE 880958
/usr 1 SIZE 79
/usr/local 0 SIZE 174
/usr/local 1 SIZE 47
/var 0 SIZE 299300
/var 1 SIZE 2857
----

amandad: time 192.437: dgram_recv: timeout after 10 seconds
amandad: time 192.437: waiting for ack: timeout, retrying
amandad: time 202.439: dgram_recv: timeout after 10 seconds
amandad: time 202.439: waiting for ack: timeout, retrying
amandad: time 212.441: dgram_recv: timeout after 10 seconds
amandad: time 212.442: waiting for ack: timeout, retrying
amandad: time 222.444: dgram_recv: timeout after 10 seconds
amandad: time 222.444: waiting for ack: timeout, retrying
amandad: time 232.446: dgram_recv: timeout after 10 seconds
amandad: time 232.446: waiting for ack: timeout, giving up!
amandad: time 232.446: pid 21896 finish time Fri Dec 3 09:01:32 2004
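For anyone who wants to pull the same picture out of a longer debug file, here is a rough Python sketch that scans amandad debug lines and reports when the REP packet was sent, how many ack retries followed, and when amandad gave up. The function name summarize_ack_wait and the returned keys are my own invention, and the line format is assumed to match the excerpt above; other Amanda versions may log differently.

```python
import re

# Matches lines such as "amandad: time 192.437: waiting for ack: timeout, retrying"
LINE_RE = re.compile(r"amandad: time (\d+\.\d+): (.*)")

def summarize_ack_wait(lines):
    """Summarize the REP/ack exchange found in amandad debug lines.

    Hypothetical helper: assumes the message wording seen in the
    excerpt above ("sending REP packet", "waiting for ack: ...").
    """
    rep_sent_at = None
    retries = 0
    gave_up_at = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # packet body, separators, blank lines
        t, msg = float(m.group(1)), m.group(2)
        if msg.startswith("sending REP packet"):
            rep_sent_at = t
        elif msg.startswith("waiting for ack: timeout, retrying"):
            retries += 1
        elif msg.startswith("waiting for ack: timeout, giving up"):
            gave_up_at = t
    return {"rep_sent_at": rep_sent_at, "retries": retries, "gave_up_at": gave_up_at}

if __name__ == "__main__":
    sample = [
        "amandad: time 182.436: sending REP packet:",
        "amandad: time 192.437: waiting for ack: timeout, retrying",
        "amandad: time 232.446: waiting for ack: timeout, giving up!",
    ]
    print(summarize_ack_wait(sample))
```

If retries is non-zero and gave_up_at is set, the client finished its estimate but never heard an ACK for the reply, which is what the log above shows.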
Paul Bijnens wrote:
Nick Danger wrote:
Nope - still a problem. The error is still as below:
FAILURE AND STRANGE DUMP SUMMARY:
dominion.h /var lev 0 FAILED [Estimate timeout from dominion.xxx]
dominion.h /usr/local lev 0 FAILED [Estimate timeout from dominion.xxx]
dominion.h /usr lev 0 FAILED [Estimate timeout from dominion.xxx]
dominion.h /u00 lev 0 FAILED [Estimate timeout from dominion.xxx]
dominion.h / lev 0 FAILED [Estimate timeout from dominion.xxx]
I have the timeout in amanda.conf set to an ungodly high number of
etimeout -12000 # total number of seconds for estimates.
[...]
sendsize: debug 1 pid 26242 ruid 33 euid 33: start at Thu Dec 2 11:25:07 2004
sendsize: version 2.4.4p1
[...]
sendsize: time 172.473: pid 26242 finish time Thu Dec 2 11:27:59 2004
The estimate really takes only 173 seconds. That means that etimeout is plenty (better lower it again to normal values).
The problem seems to be in the reply packet.
I've already seen problems with a UDP-packet overflow, but that's unlikely. That problem happened with older versions where the UDP size was only 8Kbyte or so. Currently it is 64K, but it could be limited by the OS too, of course. The reply packet is usually larger than the request packet, because it contains 1 to 3 lines for each DLE (level 0, current level, current plus 1). In amandad.DATETIME.debug, you can find the request packet, and the reply packet. Any weird limitation on UDP packet size on one of the hosts (or intermediate routers/firewalls)?
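One way to check for a UDP size limit independently of Amanda is to bounce datagrams of increasing size off a small echo listener. The following is only a sketch under my own assumptions: start_echo_server and probe_udp_size are made-up names, the port is whatever you choose to open, and a real test would run the listener on the server and the probe on the client (and the other way round) so the packets cross the same routers and firewalls.

```python
import socket
import threading

def start_echo_server(host="127.0.0.1", port=0):
    """Start a UDP echo listener in a daemon thread; return (socket, port).

    Hypothetical helper. port=0 lets the OS pick a free port.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind((host, port))

    def loop():
        while True:
            try:
                data, addr = srv.recvfrom(65535)
            except OSError:
                break  # socket was closed
            srv.sendto(data, addr)

    threading.Thread(target=loop, daemon=True).start()
    return srv, srv.getsockname()[1]

def probe_udp_size(host, port, size, timeout=2.0):
    """Return True if a datagram of `size` bytes comes back intact."""
    payload = b"x" * size
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(65535)
        except OSError:
            # Covers both a receive timeout and "message too long" on send.
            return False
        return data == payload

if __name__ == "__main__":
    srv, port = start_echo_server()
    for size in (512, 1024, 1400):
        print(size, probe_udp_size("127.0.0.1", port, size))
```

If small sizes succeed and larger ones silently time out, something on the path is dropping big or fragmented UDP packets, which would fit a request getting through while the (larger) reply does not.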
Another problem could be in the iptables modules for amanda, where a bug has already been introduced twice. I don't know the current status of that bug. If you don't need them, do not use the amanda iptables modules. Try "lsmod | grep amanda" (on intermediate firewalls too!).
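If you need to run that module check on several Linux boxes from a script, the same information that lsmod shows can be read from /proc/modules. A minimal sketch, with function names of my own choosing; the module names in the sample text are only examples of what such modules might be called, not a statement about what your kernel ships.

```python
def amanda_modules(proc_modules_text):
    """Return names of loaded kernel modules whose name contains 'amanda'.

    Expects text in the /proc/modules layout, where the module name is
    the first whitespace-separated field on each line.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "amanda" in fields[0]:
            names.append(fields[0])
    return names

def loaded_amanda_modules(path="/proc/modules"):
    """Read the module list from `path` (Linux only) and filter it."""
    with open(path) as f:
        return amanda_modules(f.read())

if __name__ == "__main__":
    # Illustrative sample only; names and sizes are assumptions.
    sample = (
        "ip_conntrack_amanda 2048 0 - Live 0xe0a5c000\n"
        "ip_tables 16384 2 - Live 0xe0a40000\n"
    )
    print(amanda_modules(sample))
```

An empty list means no amanda helper module is loaded on that machine.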
Maybe try a network traffic dump (with tcpdump or similar program) on client *and* host?