Hello, I have Amanda 2.5.1 running on a Mandriva 2008 server and 8 clients which are running either Amanda 2.4 or Amanda 2.5.1. Generally everything is working okay but I am getting the following intermittent error which occurs on one or more DLEs:
FAILURE AND STRANGE DUMP SUMMARY: HOSTNAME /opt lev 1 FAILED [cannot read header: got 0 instead of 32768] HOSTNAME /opt lev 1 FAILED [too many dumper retry: "[request failed: timeout waiting for ACK]"] The corresponding information in /tmp/amanda/sendbackup... on the client is: sendbackup: debug 1 pid 14475 ruid 203 euid 203: start at Sat May 31 00:13:12 2008 /usr/local/libexec/sendbackup: version 2.4.3 parsed request as: program `DUMP' disk `/opt' device `/opt' level 1 since 2008:5:29:23:29:11 options `|;auth=BSD;compress-fast;index;' sendbackup: try_socksize: send buffer size is 65536 sendbackup: time 0.004: stream_server: waiting for connection: 0.0.0.0.40506 sendbackup: time 0.004: stream_server: waiting for connection: 0.0.0.0.40507 sendbackup: time 0.005: stream_server: waiting for connection: 0.0.0.0.40508 sendbackup: time 0.005: waiting for connect on 40506, then 40507, then 40508 sendbackup: time 30.015: stream_accept: timeout after 30 seconds sendbackup: time 30.015: timeout on data port 40506 sendbackup: time 60.025: stream_accept: timeout after 30 seconds sendbackup: time 60.025: timeout on mesg port 40507 sendbackup: time 90.035: stream_accept: timeout after 30 seconds sendbackup: time 90.035: timeout on index port 40508 sendbackup: time 90.035: pid 14475 finish time Sat May 31 00:14:42 2008 Sometimes this happens for all disks on a single host and sometimes it only happens on one or two. There doesn't seem to be any correlation between dump levels, disk, disk size or host, and I can't see any other patterns in the failures. These failures seem to occur on average 1 day out of 3, but only one or two hosts will fail each time. The problems only started when we upgraded from Amanda version 2.5.0 to 2.5.1 on the server as part of an upgrade to Mandriva 2008. I've had a search through the mailing list archive and similar errors seem to point to network issues. Capturing the packets between the server and one of the hosts gave the following packet trace: Amanda 2.5 REQ HANDLE 000-0000000b SEQ 1213076860 SECURITY USER amanda SERVICE sendbackup OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME; DUMP /opt 1 2008:5:29:23:29:11 OPTIONS |;auth=BSD;compress-fast;index; Amanda 2.4 ACK HANDLE 000-0000000b SEQ 1213076860 Amanda 2.4 REP HANDLE 000-0000000b SEQ 1213076860 CONNECT DATA 40506 MESG 40507 INDEX 40508 OPTIONS features=fffffeff9f00; Amanda 2.5 REQ HANDLE 000-0000000b SEQ 1213076860 SECURITY USER amanda SERVICE sendbackup OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME; DUMP /opt 1 2008:5:29:23:29:11 OPTIONS |;auth=BSD;compress-fast;index; Amanda 2.4 REP HANDLE 000-0000000b SEQ 1213076860 CONNECT DATA 40506 MESG 40507 INDEX 40508 OPTIONS features=fffffeff9f00; Amanda 2.5 REQ HANDLE 000-0000000b SEQ 1213076860 SECURITY USER amanda SERVICE sendbackup OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME; DUMP /opt 1 2008:5:29:23:29:11 OPTIONS |;auth=BSD;compress-fast;index; ... For comparison, a packet trace for a successful connection between the same client and server looked like this: Amanda 2.5 REQ HANDLE 000-00000000 SEQ 1212388333 SECURITY USER amanda SERVICE sendbackup OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME; DUMP /export/home1 0 1970:1:1:0:0:0 OPTIONS |;auth=BSD;srvcomp-fast;index; Amanda 2.4 ACK HANDLE 000-00000000 SEQ 1212388333 Amanda 2.4 REP HANDLE 000-00000000 SEQ 1212388333 CONNECT DATA 40403 MESG 40404 INDEX 40405 OPTIONS features=fffffeff9f00; Amanda 2.5 ACK HANDLE 000-00000000 SEQ 1212388333 So it looks like the client is responding but the server is not receiving the packet. The packet capture software (Wireshark) is running on the same machine as the Amanda server so the packet is definitely being received by the server, but for some reason the Amanda process doesn't receive it and it sends out another request - this repeats until it eventually gives up. Does anyone have any suggestions as to what might be going on here? We've been trying to work out what the problem could be for some time now so any ideas would be very welcome. Regards, Michael