planner timeouts
Regarding the planner timeouts. Here are the amdump logs on the server regarding the clients... planner: time 1.154: got partial result for host twofish.cgx.transora.com disk /home: 0 - -2K, -1 - -2K, -1 - -2K taper: using label `ON1006L3' date `20081112130548' driver: result time 26.700 from taper: TAPER-OK driver: state time 26.700 free kps: 9000 space: 524288000 taper: idle idle-dumpers: 27 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle driver: interface-state time 26.700 if default: free 9000 driver: hdisk-state time 26.700 hdisk 0: free 524288000 dumpers 0 planner: time 302.185: getting estimates took 302.181 secs FAILED QUEUE: 0: twofish.cgx.transora.com /home DONE QUEUE: empty ANALYZING ESTIMATES... planner: FAILED twofish.cgx.transora.com /home 20081112130548 0 [disk /home, all estimate timed out] INITIAL SCHEDULE (size 64): DELAYING DUMPS IF NEEDED, total_size 64, tape length 703561728 mark 0 delay: Total size now 64. PROMOTING DUMPS IF NEEDED, total_lev0 0, balanced_size 0... planner: time 302.186: analysis took 0.000 secs oldlog errors: STATS driver startup time 0.182 ERROR planner Request to twofish.cgx.transora.com failed: timeout waiting for REP WARNING planner disk twofish.cgx.transora.com:/home, estimate of level 0 timed out. FAIL planner twofish.cgx.transora.com /home 20081112130548 0 [disk /home, all estimate timed out] FINISH planner date 20081112130548 time 302.186 WARNING driver WARNING: got empty schedule from planner FINISH driver date 20081112130548 time 303.225 Client logs: Run tar log: 1226516749.495366: runtar: pid 27706 ruid 30063 euid 0: start at Wed Nov 12 13:05:49 2008 1226516749.495652: runtar: version 2.6.0p2 1226516749.505371: runtar: /usr/local/bin/tar version: tar (GNU tar) 1.19 1226516749.505857: runtar: config: Full 1226516749.516449: runtar: pid 27706 ruid 0 euid 0: rename at Wed Nov 12 13:05:49 2008 1226516749.516894: runtar: running: /usr/local/bin/tar --create --file /dev/null --directory /home --one-file-system --listed-incremental /usr/local/var/amanda/gnutar-lists/twofish.cgx.transora.com_home_0.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._home.20081112130549.exclude . 1226516749.516936: runtar: pid 27706 finish time Wed Nov 12 13:05:49 2008 Sendsize log: 1226516749.413525: sendsize: pid 27703 ruid 30063 euid 30063: start at Wed Nov 12 13:05:49 2008 1226516749.413851: sendsize: version 2.6.0p2 1226516749.414107: sendsize: warning: errors processing config file /etc/amanda/amanda-client.conf (non-fatal) 1226516749.414762: sendsize: warning: errors processing config file /etc/amanda/Full/amanda-client.conf (non-fatal) 1226516749.454464: sendsize: pid 27703 ruid 30063 euid 30063: rename at Wed Nov 12 13:05:49 2008 1226516749.457190: sendsize: waiting for any estimate child: 1 running 1226516749.458632: sendsize: calculating for amname /home, dirname /home, spindle -1 1226516749.458774: sendsize: getting size via gnutar for /home level 0 1226516749.461231: sendsize: Spawning /opt/amanda/client/libexec/amanda/runtar runtar Full /usr/local/bin/tar --create --file /dev/null --directory /home --one-file-system --listed-incremental /usr/local/var/amanda/gnutar-lists/twofish.cgx.transora.com_home_0.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._home.20081112130549.exclude . in pipeline amandad.20081112130548.debug log: 1226516749.378493: amandad: creating new service: sendsize OPTIONS features=9ffe00;maxdumps=8;hostname=twofish.cgx.transora .com;config=Full; GNUTAR /home 0 1970:1:1:0:0:0 -1 OPTIONS |;auth=BSD;index;exclude-list=/home/amanda/exclude_list; 1226516749.381505: amandad: sending ACK pkt: 1226516749.381671: amandad: dgram_send_addr(addr=26180, dgram=ff350c3c) 1226516749.381692: amandad: (sockaddr_in *)26180 = { 2, 1018, 10.1.1.104 } 1226516749.381708: amandad: dgram_send_addr: ff350c3c-socket = 0 1226516749.414819: amandad: sending PREP pkt: OPTIONS features=9ffe00; 1226516749.414857: amandad: dgram_send_addr(addr=26180, dgram=ff350c3c) 1226516749.414874: amandad: (sockaddr_in *)26180 = { 2, 1018, 10.1.1.104 } 1226516749.414912: amandad: dgram_send_addr: ff350c3c-socket = 0 1226516850.419333: amandad: dgram_recv(dgram=ff350c3c, timeout=0, fromaddr=ff360c28) 1226516850.419470: amandad: (sockaddr_in *)ff360c28 = { 2, 1018, 10.1.1.104 } 1226516850.420038: amandad: received REQ pkt: SERVICE sendsize OPTIONS features=9ffe00;maxdumps=8;hostname=twofish.cgx.transora .com;config=Full; GNUTAR /home 0 1970:1:1:0:0:0 -1 OPTIONS |;auth=BSD;index;exclude-list=/home/amanda/exclude_list; 1226516850.420071: amandad: received dup P_REQ packet, ACKing it 1226516850.420086: amandad: sending ACK pkt: 1226516850.420108: amandad: dgram_send_addr(addr=26180, dgram=ff350c3c) 1226516850.420122
Re: planner timeouts and udp packet size
Jon LaBadie wrote: On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote: I've got this amanda client (server and client at v2.4.5) with a 1.5TB raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't make it to tape as the planner timeouts. Now, would someone be kind enough to jolt my memory as to what is the relationship between UDP packet size and the number/size of DLEs entries along with their ... From a Dec. 2004 reply by Paul Bijenes: Wow, you surely have a large mail archive. I didn't have that message anymore. The last updated (5 minutes ago) explanatin is found here: http://wiki.zmanda.com/index.php/Why_does_%27amdump%27_report_%60results_missing%27%3F#UDP_packet_too_large.3F or shorter: http://tinyurl.com/ca7pv -- Paul Bijnens, XplanationTel +32 16 397.511 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax +32 16 397.512 http://www.xplanation.com/ email: [EMAIL PROTECTED] *** * I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, * * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, * * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, * * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, * * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... * * ... Are you sure? ... YES ... Phew ... I'm out * ***
Re: planner timeouts and udp packet size
* Paul Bijnens [EMAIL PROTECTED] [20051207 07:41]: Jon LaBadie wrote: On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote: I've got this amanda client (server and client at v2.4.5) with a 1.5TB raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't make it to tape as the planner timeouts. Now, would someone be kind enough to jolt my memory as to what is the relationship between UDP packet size and the number/size of DLEs entries along with their ... From a Dec. 2004 reply by Paul Bijenes: Wow, you surely have a large mail archive. I didn't have that message anymore. The last updated (5 minutes ago) explanatin is found here: http://wiki.zmanda.com/index.php/Why_does_%27amdump%27_report_%60results_missing%27%3F#UDP_packet_too_large.3F or shorter: http://tinyurl.com/ca7pv Thank you all for the help. I found that the udp reply packet is short of 37 bytes from 32Kb taking into account the IP and UPD encapsulation and I can't seem to find what the max size of udp packet on irix6.5 -- anyone knows? All I can gather from systune (kernel tunable utilities on irix) is udp_sendspace = 61440 (0xf000) 60Kb Anyways, I modified the dumptypes to use exclude list on the client and now I'm below the udp request from last night is ~27Kb and I see no timeouts but my exclude fails :( Ah, well... I'll post another question when I'm sure that I haven't made (yet) another stupid mistake... regards, jf -- Paul Bijnens, XplanationTel +32 16 397.511 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax +32 16 397.512 http://www.xplanation.com/ email: [EMAIL PROTECTED] *** * I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, * * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, * * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, * * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, * * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... * * ... Are you sure? ... YES ... Phew ... I'm out * *** -- °
Re: planner timeouts and udp packet size
Replying to myself, * Jean-Francois Malouin [EMAIL PROTECTED] [20051207 16:03]: * Paul Bijnens [EMAIL PROTECTED] [20051207 07:41]: Jon LaBadie wrote: On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote: I've got this amanda client (server and client at v2.4.5) with a 1.5TB raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't make it to tape as the planner timeouts. Now, would someone be kind enough to jolt my memory as to what is the relationship between UDP packet size and the number/size of DLEs entries along with their ... From a Dec. 2004 reply by Paul Bijenes: Wow, you surely have a large mail archive. I didn't have that message anymore. The last updated (5 minutes ago) explanatin is found here: http://wiki.zmanda.com/index.php/Why_does_%27amdump%27_report_%60results_missing%27%3F#UDP_packet_too_large.3F or shorter: http://tinyurl.com/ca7pv Thank you all for the help. I found that the udp reply packet is short of 37 bytes from 32Kb taking into account the IP and UPD encapsulation and I can't seem to find what the max size of udp packet on irix6.5 -- anyone knows? All I can gather from systune (kernel tunable utilities on irix) is udp_sendspace = 61440 (0xf000) 60Kb Anyways, I modified the dumptypes to use exclude list on the client and now I'm below the udp request from last night is ~27Kb and I see no timeouts but my exclude fails :( Ah, well... I'll post another question when I'm sure that I haven't made (yet) another stupid mistake... stupid it was indeed: I had an extra white space in front of every exclude pattern... jf regards, jf -- Paul Bijnens, XplanationTel +32 16 397.511 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax +32 16 397.512 http://www.xplanation.com/ email: [EMAIL PROTECTED] *** * I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, * * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, * * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, * * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, * * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... * * ... Are you sure? ... YES ... Phew ... I'm out * ***
planner timeouts and udp packet size
Hi, I've got this amanda client (server and client at v2.4.5) with a 1.5TB raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't make it to tape as the planner timeouts. Now, would someone be kind enough to jolt my memory as to what is the relationship between UDP packet size and the number/size of DLEs entries along with their exclude list? The FAQ mentions something about this (UDP packet size of 64Kb and results missing) but that might not be quite up to date. This raid partition has been splitted using quite a few 'exclude append ./blah' in the disklist and I wonder if it would be better to have them in a file on the client side rather than explicitely in the disklist file per se. Would that make any difference at all regarding what get stuffed in the UDP packets at the estimate phase -- making them smaller -- so that I wouldn't hit that hard limit? I hope this makes sense :) TIA, jf
Re: planner timeouts and udp packet size
On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote: Hi, I've got this amanda client (server and client at v2.4.5) with a 1.5TB raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't make it to tape as the planner timeouts. Now, would someone be kind enough to jolt my memory as to what is the relationship between UDP packet size and the number/size of DLEs entries along with their exclude list? The FAQ mentions something about this (UDP packet size of 64Kb and results missing) but that might not be quite up to date. This raid partition has been splitted using quite a few 'exclude append ./blah' in the disklist and I wonder if it would be better to have them in a file on the client side rather than explicitely in the disklist file per se. Would that make any difference at all regarding what get stuffed in the UDP packets at the estimate phase -- making them smaller -- so that I wouldn't hit that hard limit? From a Dec. 2004 reply by Paul Bijenes: || || The problem seems to be in the reply packet. || || I've already seen problems with a UDP-packet overflow, but that's || unlikely. That problem happened with older versions where the UDP || size was only 8Kbyte or so. Currently it is 64K, but it could be || limited by the OS too, of course. The reply packet is usually larger || than the request packet, because it contains 1 to 3 lines for each || DLE (level 0, current level, current plus 1). || -- Jon H. LaBadie [EMAIL PROTECTED] JG Computing 4455 Province Line Road(609) 252-0159 Princeton, NJ 08540-4322 (609) 683-7220 (fax)
Re: planner timeouts
Charles Sprickman wrote: That did work. I removed a few entries from disklist and got a good run. And that brings up another question... What else is sent in that request? Something seems to vary based on the success of the last run. In the sendsize request Amanda can ask for a level 0, level X, and level X+1, dump. That could be two or three estimates, depending on the previous run (e.g. no need to ask for level 2, when a we only have a level 0 to base on). ps - ^X^C sounds very familiar from a very long time ago, what is it? emacs -- Paul Bijnens, XplanationTel +32 16 397.511 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax +32 16 397.512 http://www.xplanation.com/ email: [EMAIL PROTECTED] *** * I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, * * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, * * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, * * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, * * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... * * ... Are you sure? ... YES ... Phew ... I'm out * ***
Re: planner timeouts
On Thu, 8 Sep 2005, Paul Bijnens wrote: http://bway.net/~spork/amanda-tcpdump.txt You'll note that the frags thing is still set, but if you compare what the server sent to what the client received, it is identical. I believe you have a UDP fragementation network problem somewhere inbetween the two hosts, or maybe the devel2 host itself. I believe you're on the right track here, and that this is not an amanda issue. Your input has been really helpful in giving me some direction. Compare these two lines (first: view on devel2, second: view on h13): 00:07:40.444231 devel2.945 h13.amanda: udp 1465 (ttl 64, id 45642, len 1493) 00:07:40.484232 devel2.945 h13.amanda: udp 1465 (frag 45642:[EMAIL PROTECTED]) (ttl 51, len 1492) The host devel2, sends a packet, claiming length 1493 bytes. (1465 real payload + 28 bytes UDP packet header = 1493 bytes). That's it right there. The server is back at an office on DSL. Supposedly without PPPoE. But after finally getting into the router I see that even though there's a routed static subnet and all, it is still doing PPPoE. MTU of 1492 on PPPoE. Sent packet of 1493, received 1492. I think I see where this is going. The host h13, is apparently 13 hops away from the client (ttl 64-51). That's pretty strange, when you say they are connected to the same switch. Even more strange is that the reply packets travel 15 hops (ttl 64-49). But I'm not the specialist here. I wasn't clear, devel2, the amanda server is on a DSL line. The hosts that it backs up however are all on the same LAN in the same subnet on the same switch in a datacenter. The server packet however now says the total length is 1492, but the payload length is still 1465. Strange because the 28 byte UDP header seems be shortened by 1 byte? That could be when the UDP checksum, which is optional, is removed from the header. Some device inbetween removed the checksum byte. Must be a real firewall or layer 3 switch, changing packets it transmits. I'm eyeing the Netopia/Cayman DSL router very suspiciously at the moment. :) As workaround, you could try to reduce the UDP packet by e.g. leaving out some DLE's or omitting compress-fast from the options, or anything that shortens the string and makes the complete datagram fit in one packet. That did work. I removed a few entries from disklist and got a good run. And that brings up another question... What else is sent in that request? Something seems to vary based on the success of the last run. After that one good partial run, I uncommented everything again in disklist for that host and had a successful full run. In other words, after a good partial run the next batch of sendsize requests were smaller by a few bytes. Why? That would also explain why I now see a second and third host showing the same symptoms. Something in that request grows and pushes me past the 1492 byte limit. Anyhow, I've got some really good information to work with now, thanks again for all your help and insight. Now I have to look at how PMTUD works and how this setup is breaking it. Thanks, Charles ps - ^X^C sounds very familiar from a very long time ago, what is it? -- Paul Bijnens, XplanationTel +32 16 397.511 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax +32 16 397.512 http://www.xplanation.com/ email: [EMAIL PROTECTED] *** * I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, * * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, * * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, * * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, * * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... * * ... Are you sure? ... YES ... Phew ... I'm out * ***
Re: planner timeouts
Charles Sprickman wrote: On Tue, 6 Sep 2005, Paul Bijnens wrote: http://bway.net/~spork/amanda-tcpdump.txt You'll note that the frags thing is still set, but if you compare what the server sent to what the client received, it is identical. I believe you have a UDP fragementation network problem somewhere inbetween the two hosts, or maybe the devel2 host itself. Compare these two lines (first: view on devel2, second: view on h13): 00:07:40.444231 devel2.945 h13.amanda: udp 1465 (ttl 64, id 45642, len 1493) 00:07:40.484232 devel2.945 h13.amanda: udp 1465 (frag 45642:[EMAIL PROTECTED]) (ttl 51, len 1492) The host devel2, sends a packet, claiming length 1493 bytes. (1465 real payload + 28 bytes UDP packet header = 1493 bytes). It also seems to be split over many UDP packets, but the following packet is never seen in the trace. THe packet that is shown is only 1486 bytes long. There should be a second packet with the rest of 7 bytes, but that is missing from the trace. From the content of the first packet, that second packet should contain index+newline, indeed 7 bytes. The host h13, is apparently 13 hops away from the client (ttl 64-51). That's pretty strange, when you say they are connected to the same switch. Even more strange is that the reply packets travel 15 hops (ttl 64-49). But I'm not the specialist here. The server packet however now says the total length is 1492, but the payload length is still 1465. Strange because the 28 byte UDP header seems be shortened by 1 byte? That could be when the UDP checksum, which is optional, is removed from the header. Some device inbetween removed the checksum byte. Must be a real firewall or layer 3 switch, changing packets it transmits. Anyway, the trailing 7 byte packet never got sent. I would take a look at the UDP configuration of that client, that seems to have trouble with splitting large UDP datagrams. Some OS-es need some twiggling with parameters to tune this. As workaround, you could try to reduce the UDP packet by e.g. leaving out some DLE's or omitting compress-fast from the options, or anything that shortens the string and makes the complete datagram fit in one packet. I'm totally stumped here. We had another server just start doing the same thing the other day. They are all on the same subnet/lan/switch, two don't work, the remaining 10 or so do. Verify if it is indeed the size of the UDP packet sent. You mean the two that don't work are on the same subnet/lan/switch. Or are all the server and client on the same subnet/lan/switch? At least the devel2 machine is not sending the continuation packet, and the subnet/switch/lan seems to be strange too. -- Paul Bijnens, XplanationTel +32 16 397.511 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax +32 16 397.512 http://www.xplanation.com/ email: [EMAIL PROTECTED] *** * I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, * * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, * * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, * * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, * * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... * * ... Are you sure? ... YES ... Phew ... I'm out * ***
Re: planner timeouts
Charles Sprickman wrote: Me neither... Is there any way to increase the verbosity of amandad? inetd config is good (it's a client, so it's only got the single line for amandad), nothing in the system logs, all of the stuff in the libexec directory appears to have correct perms (the proper things are setuid to my amanda user)... No, but in the tcpdump trace below, I would add -n 1500 -X to verify what exactly is sent. The devel2 (server) view: 19:46:23.967162 devel2.937 h13.blah.com.amanda: udp 117 19:46:24.074337 h13.blah.com.amanda devel2.937: udp 50 19:46:24.249414 h13.blah.com.amanda devel2.937: udp 81 19:46:24.249497 devel2.937 h13.blah.com.amanda: udp 50 19:46:24.489787 devel2.937 h13.blah.com.amanda: udp 1465 19:46:34.497794 devel2.937 h13.blah.com.amanda: udp 1465 19:46:44.508815 devel2.937 h13.blah.com.amanda: udp 1465 The h13 (client) view: 19:46:23.982760 devel2.937 h13.blah.com.amanda: udp 117 19:46:24.054390 h13.blah.com.amanda devel2.937: udp 50 19:46:24.230317 h13.blah.com.amanda devel2.937: udp 81 19:46:24.264200 devel2.937 h13.blah.com.amanda: udp 50 19:46:24.523791 devel2.937 h13.blah.com.amanda: udp 1465 (frag 59731:[EMAIL PROTECTED]) 19:46:34.531821 devel2.937 h13.blah.com.amanda: udp 1465 (frag 62763:[EMAIL PROTECTED]) 19:46:44.542471 devel2.937 h13.blah.com.amanda: udp 1465 (frag 9535:[EMAIL PROTECTED]) I'm not a tcpdump guru, it is suspect that the client sends something flagged as frag, while the server does not see the frag flag. Could be tcpdump implementation difference on the two machines too. The server side logs some info in the amdump file, when it receives the packets. Alas only after decoding them. If the server didn't see them, it won't log anything. Have a careful look anyway in the amdump file around that time (the dumpfile contains milestones in seconds from start of file, so that you can locate the lines around which the packets were received. If you're brave, yYou could also try to strace planner. Because planner is a suid-root program, you need to make an strace copy with has also the suid-root bit set. Then add that command in the amdump shell script around line 100, which then becomes a long line: /usr/local/bin/strace-suid -o /tmp/planner-strace -s 1500 $libexecdir/planner$SUF $@ | ...driver... (and don't forget to remove the strace-suid program afterwards!) -- Paul Bijnens, XplanationTel +32 16 397.511 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax +32 16 397.512 http://www.xplanation.com/ email: [EMAIL PROTECTED] *** * I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, * * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, * * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, * * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, * * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... * * ... Are you sure? ... YES ... Phew ... I'm out * ***
Re: planner timeouts
Charles Sprickman wrote: h13 (client) debug logs. Note that there is two-way communication, and everything seems to go correctly. In the debug dir, there are only amandad debug logs, nothing else. That doesn't sound right to me. There should be a sendbackup log file as well, a runtar one, and so on. Can you verify your inetd config on that particular client, to see whether there's something afoul? Have a look at the system logs as well, while you're at it. amanda might be unable to run any secondary programs, for instance. Me neither... Is there any way to increase the verbosity of amandad? inetd config is good (it's a client, so it's only got the single line for amandad), nothing in the system logs, all of the stuff in the libexec directory appears to have correct perms (the proper things are setuid to my amanda user)... GETTING ESTIMATES... planner: time 30.956: error result for host h13.blah.com disk /spool: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/qmail/bin: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/qmail/control: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/db/pkg: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /usr/local/: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /home: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /: Request to h13.blah.com timed out. planner: time 30.956: getting estimates took 30.811 secs Does that spell a 30s timeout somewhere? amanda.conf not taken into account, perhaps? And the obligatory question, did you double-check that there's no firewall between that particular client and server? (If you did, triple-check. :-) ) I have bumped up all the timeouts in amanda.conf to ridiculously large values. :) There are firewalls, and the tcpdump trace I sent was taken with each host's firewall software (ipfilter) disabled. Additionally, the firewall logs blocked traffic and had nothing to say about this. What else can I look at here? I'm also including the tcpdump output again at the end of this message. Thanks, Charles The devel2 (server) view: 19:46:23.967162 devel2.937 h13.blah.com.amanda: udp 117 19:46:24.074337 h13.blah.com.amanda devel2.937: udp 50 19:46:24.249414 h13.blah.com.amanda devel2.937: udp 81 19:46:24.249497 devel2.937 h13.blah.com.amanda: udp 50 19:46:24.489787 devel2.937 h13.blah.com.amanda: udp 1465 19:46:34.497794 devel2.937 h13.blah.com.amanda: udp 1465 19:46:44.508815 devel2.937 h13.blah.com.amanda: udp 1465 The h13 (client) view: 19:46:23.982760 devel2.937 h13.blah.com.amanda: udp 117 19:46:24.054390 h13.blah.com.amanda devel2.937: udp 50 19:46:24.230317 h13.blah.com.amanda devel2.937: udp 81 19:46:24.264200 devel2.937 h13.blah.com.amanda: udp 50 19:46:24.523791 devel2.937 h13.blah.com.amanda: udp 1465 (frag 59731:[EMAIL PROTECTED]) 19:46:34.531821 devel2.937 h13.blah.com.amanda: udp 1465 (frag 62763:[EMAIL PROTECTED]) 19:46:44.542471 devel2.937 h13.blah.com.amanda: udp 1465 (frag 9535:[EMAIL PROTECTED]) Alex -- Alexander Jolk / BUF Compagnie tel +33-1 42 68 18 28 / fax +33-1 42 68 18 29
Re: planner timeouts
Charles Sprickman wrote: h13 (client) debug logs. Note that there is two-way communication, and everything seems to go correctly. In the debug dir, there are only amandad debug logs, nothing else. That doesn't sound right to me. There should be a sendbackup log file as well, a runtar one, and so on. Can you verify your inetd config on that particular client, to see whether there's something afoul? Have a look at the system logs as well, while you're at it. amanda might be unable to run any secondary programs, for instance. GETTING ESTIMATES... planner: time 30.956: error result for host h13.blah.com disk /spool: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/qmail/bin: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/qmail/control: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/db/pkg: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /usr/local/: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /home: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /: Request to h13.blah.com timed out. planner: time 30.956: getting estimates took 30.811 secs Does that spell a 30s timeout somewhere? amanda.conf not taken into account, perhaps? And the obligatory question, did you double-check that there's no firewall between that particular client and server? (If you did, triple-check. :-) ) Alex -- Alexander Jolk / BUF Compagnie tel +33-1 42 68 18 28 / fax +33-1 42 68 18 29
planner timeouts
Hi, Recently we started having trouble getting backups from one of our servers. One day it was working, the next it wasn't. I've been digging around quite a bit and I'm not able to really see where the failure is. It appears that the planner process on the client machine times out, but I'm not sure why, and I'm seeing nothing in the debug logs on the client to indicate what the problem is. I've reviewed all the changes on the client from when it worked until when it didn't and haven't seen anything that involves amanda or networking. We do run firewalls on each host, and I've tested with them disabled. This is a rather old version of Amanda and we will be updating as soon as we get the OS and other software up-to-date on all the hosts. Version is 2.4.3. The server (devel2) is in one location (an office with dsl, no pppoe) and the clients are located in a datacenter. All of the other hosts on the same network are backing up with no problems. All other hosts are running the same version of amanda, and most are running the same version of FreeBSD (4.11). Some hosts have more data to backup, some have less. All the timeouts in amanda.conf on the server have been bumped up: ctimeout 32000 # timeout for clients dtimeout 16000 # timeout for dump processes etimeout 32000 # timeout for dump estimates The logs are lengthy, I apologize in advance. Below I have included the job output from the server, all client debug logs, all server debug logs, normal logs from the server, and a tcpdump of the session as seen from both hosts. The host devel2 is the backup server, h13 is the client that is failing. job output: FAILURE AND STRANGE DUMP SUMMARY: h13.blah /spool lev 0 FAILED [Request to h13.blah.com timed out.] h13.blah /var/qmail/bin lev 0 FAILED [Request to h13.blah.com timed out.] h13.blah /var/qmail/control lev 0 FAILED [Request to h13.blah.com timed out.] h13.blah /var/db/pkg lev 0 FAILED [Request to h13.blah.com timed out.] h13.blah /usr/local/ lev 0 FAILED [Request to h13.blah.com timed out.] h13.blah /home lev 0 FAILED [Request to h13.blah.com timed out.] h13.blah / lev 0 FAILED [Request to h13.blah.com timed out.] h13 (client) debug logs. Note that there is two-way communication, and everything seems to go correctly. In the debug dir, there are only amandad debug logs, nothing else. amandad: time 0.000: got packet: Amanda 2.4 REQ HANDLE 000-80930808 SEQ 1125445583 SECURITY USER operator SERVICE noop OPTIONS features=feff9f00; amandad: time 0.000: sending ack: Amanda 2.4 ACK HANDLE 000-80930808 SEQ 1125445583 amandad: time 0.089: bsd security: remote host devel2 user operator local user operator amandad: time 0.175: amandahosts security check passed amandad: time 0.175: running service noop amandad: time 0.176: sending REP packet: Amanda 2.4 REP HANDLE 000-80930808 SEQ 1125445583 OPTIONS features=feff9f00; amandad: time 0.210: got packet: Amanda 2.4 ACK HANDLE 000-80930808 SEQ 1125445583 amandad: time 0.210: pid 19416 finish time Tue Aug 30 19:46:24 2005 devel2 (server) debug logs. Not sure if this is useful, but here it is: amtrmidx.20050830194655.debug : h13.blah.com / rm /var/db/amanda/index/h13.blah.com/_/20050707_2.gz rm /var/db/amanda/index/h13.blah.com/_/20050703_2.gz h13.blah.com /home rm /var/db/amanda/index/h13.blah.com/_home/20050707_6.gz h13.blah.com /usr/local/ rm /var/db/amanda/index/h13.blah.com/_usr_local_/20050707_0.gz rm /var/db/amanda/index/h13.blah.com/_usr_local_/20050703_2.gz h13.blah.com /var/db/pkg rm /var/db/amanda/index/h13.blah.com/_var_db_pkg/20050707_1.gz rm /var/db/amanda/index/h13.blah.com/_var_db_pkg/20050703_0.gz h13.blah.com /var/qmail/control rm /var/db/amanda/index/h13.blah.com/_var_qmail_control/20050707_1.gz rm /var/db/amanda/index/h13.blah.com/_var_qmail_control/20050703_1.gz h13.blah.com /var/qmail/bin rm /var/db/amanda/index/h13.blah.com/_var_qmail_bin/20050707_1.gz rm /var/db/amanda/index/h13.blah.com/_var_qmail_bin/20050703_1.gz h13.blah.com /spool amtrmlog.20050830194654.debug: amtrmlog: debug 1 pid 39378 ruid 2 euid 2: start at Tue Aug 30 19:46:54 2005 /usr/local/libexec/amanda/amtrmlog: version 2.4.3 Keeping 48 log files amtrmlog: pid 39378 finish time Tue Aug 30 19:46:55 2005 standard log file. we do see the planner timeout here, but it's not clear to me why.: /var/log/amanda/log.1: dumper: dgram_bind: socket bound to 0.0.0.0.943 dumper: pid 39290 executable dumper2 version 2.4.3, using port 943 dumper: dgram_bind: socket bound to 0.0.0.0.945 dumper: pid 39292 executable dumper4 version 2.4.3, using port 945 dumper: dgram_bind: socket bound to 0.0.0.0.942 dumper: pid 39289 executable dumper1 version 2.4.3, using port 942 dumper: dgram_bind: socket bound to 0.0.0.0.944 dumper: pid 39291 executable dumper3 version 2.4.3, using port 944 planner: time 0.028: dgram_bind: