planner timeouts

2008-11-12 Thread Krahn, Anderson
Regarding the planner timeouts.

 

 

Here are the amdump logs on the server regarding the clients...

 

 

planner: time 1.154: got partial result for host
twofish.cgx.transora.com disk /home: 0 - -2K, -1 - -2K, -1 - -2K

taper: using label `ON1006L3' date `20081112130548'

driver: result time 26.700 from taper: TAPER-OK

driver: state time 26.700 free kps: 9000 space: 524288000 taper:
idle idle-dumpers: 27 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0
driver-idle: not-idle

driver: interface-state time 26.700 if default: free 9000

driver: hdisk-state time 26.700 hdisk 0: free 524288000 dumpers 0

planner: time 302.185: getting estimates took 302.181 secs

FAILED QUEUE:

  0: twofish.cgx.transora.com /home

DONE QUEUE: empty

 

ANALYZING ESTIMATES...

planner: FAILED twofish.cgx.transora.com /home 20081112130548 0 [disk
/home, all estimate timed out]

INITIAL SCHEDULE (size 64):

 

DELAYING DUMPS IF NEEDED, total_size 64, tape length 703561728 mark 0

  delay: Total size now 64.

 

PROMOTING DUMPS IF NEEDED, total_lev0 0, balanced_size 0...

planner: time 302.186: analysis took 0.000 secs

 

oldlog errors:

 

STATS driver startup time 0.182

ERROR planner Request to twofish.cgx.transora.com failed: timeout
waiting for REP

WARNING planner disk twofish.cgx.transora.com:/home, estimate of level 0
timed out.

FAIL planner twofish.cgx.transora.com /home 20081112130548 0 [disk
/home, all estimate timed out]

FINISH planner date 20081112130548 time 302.186

WARNING driver WARNING: got empty schedule from planner

FINISH driver date 20081112130548 time 303.225

 

 

Client logs:

 

Run tar log:

1226516749.495366: runtar: pid 27706 ruid 30063 euid 0: start at Wed Nov
12 13:05:49 2008

1226516749.495652: runtar: version 2.6.0p2

1226516749.505371: runtar: /usr/local/bin/tar version: tar (GNU tar)
1.19

 

1226516749.505857: runtar: config: Full

1226516749.516449: runtar: pid 27706 ruid 0 euid 0: rename at Wed Nov 12
13:05:49 2008

1226516749.516894: runtar: running: /usr/local/bin/tar --create --file
/dev/null --directory /home --one-file-system --listed-incremental
/usr/local/var/amanda/gnutar-lists/twofish.cgx.transora.com_home_0.new
--sparse --ignore-failed-read --totals --exclude-from
/tmp/amanda/sendsize._home.20081112130549.exclude .

1226516749.516936: runtar: pid 27706 finish time Wed Nov 12 13:05:49
2008

 

Sendsize log:

1226516749.413525: sendsize: pid 27703 ruid 30063 euid 30063: start at
Wed Nov 12 13:05:49 2008

1226516749.413851: sendsize: version 2.6.0p2

1226516749.414107: sendsize: warning: errors processing config file
/etc/amanda/amanda-client.conf (non-fatal)

1226516749.414762: sendsize: warning: errors processing config file
/etc/amanda/Full/amanda-client.conf (non-fatal)

1226516749.454464: sendsize: pid 27703 ruid 30063 euid 30063: rename at
Wed Nov 12 13:05:49 2008

1226516749.457190: sendsize: waiting for any estimate child: 1 running

1226516749.458632: sendsize: calculating for amname /home, dirname
/home, spindle -1

1226516749.458774: sendsize: getting size via gnutar for /home level 0

1226516749.461231: sendsize: Spawning
/opt/amanda/client/libexec/amanda/runtar runtar Full /usr/local/bin/tar
--create --file /dev/null --directory /home --one-file-system
--listed-incremental
/usr/local/var/amanda/gnutar-lists/twofish.cgx.transora.com_home_0.new
--sparse --ignore-failed-read --totals --exclude-from
/tmp/amanda/sendsize._home.20081112130549.exclude . in pipeline

 

amandad.20081112130548.debug log:

 

1226516749.378493: amandad: creating new service: sendsize

OPTIONS
features=9ffe00;maxdumps=8;hostname=twofish.cgx.transora
.com;config=Full;

GNUTAR /home  0 1970:1:1:0:0:0 -1  OPTIONS
|;auth=BSD;index;exclude-list=/home/amanda/exclude_list;

 

1226516749.381505: amandad: sending ACK pkt:

 

 

1226516749.381671: amandad: dgram_send_addr(addr=26180, dgram=ff350c3c)

1226516749.381692: amandad: (sockaddr_in *)26180 = { 2, 1018, 10.1.1.104
}

1226516749.381708: amandad: dgram_send_addr: ff350c3c-socket = 0

1226516749.414819: amandad: sending PREP pkt:

 

OPTIONS features=9ffe00;

 

1226516749.414857: amandad: dgram_send_addr(addr=26180, dgram=ff350c3c)

1226516749.414874: amandad: (sockaddr_in *)26180 = { 2, 1018, 10.1.1.104
}

1226516749.414912: amandad: dgram_send_addr: ff350c3c-socket = 0

1226516850.419333: amandad: dgram_recv(dgram=ff350c3c, timeout=0,
fromaddr=ff360c28)

1226516850.419470: amandad: (sockaddr_in *)ff360c28 = { 2, 1018,
10.1.1.104 }

1226516850.420038: amandad: received REQ pkt:

 

SERVICE sendsize

OPTIONS
features=9ffe00;maxdumps=8;hostname=twofish.cgx.transora
.com;config=Full;

GNUTAR /home  0 1970:1:1:0:0:0 -1  OPTIONS
|;auth=BSD;index;exclude-list=/home/amanda/exclude_list;

 

1226516850.420071: amandad: received dup P_REQ packet, ACKing it

1226516850.420086: amandad: sending ACK pkt:

 

 

1226516850.420108: amandad: dgram_send_addr(addr=26180, dgram=ff350c3c)

1226516850.420122

Re: planner timeouts and udp packet size

2005-12-07 Thread Paul Bijnens

Jon LaBadie wrote:

On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote:


I've got this amanda client (server and client at v2.4.5) with a 1.5TB
raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't
make it to tape as the planner timeouts. Now, would someone be kind
enough to jolt my memory as to what is the relationship between UDP
packet size and the number/size of DLEs entries along with their

...



From a Dec. 2004 reply by Paul Bijenes:


Wow, you surely have a large mail archive.  I didn't have that message 
anymore.


The last updated (5 minutes ago) explanatin is found here:

http://wiki.zmanda.com/index.php/Why_does_%27amdump%27_report_%60results_missing%27%3F#UDP_packet_too_large.3F

or shorter:

http://tinyurl.com/ca7pv


--
Paul Bijnens, XplanationTel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax  +32 16 397.512
http://www.xplanation.com/  email:  [EMAIL PROTECTED]
***
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
***




Re: planner timeouts and udp packet size

2005-12-07 Thread Jean-Francois Malouin
* Paul Bijnens [EMAIL PROTECTED] [20051207 07:41]:
 Jon LaBadie wrote:
 On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote:
 
 I've got this amanda client (server and client at v2.4.5) with a 1.5TB
 raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't
 make it to tape as the planner timeouts. Now, would someone be kind
 enough to jolt my memory as to what is the relationship between UDP
 packet size and the number/size of DLEs entries along with their
 ...
 
 From a Dec. 2004 reply by Paul Bijenes:
 
 Wow, you surely have a large mail archive.  I didn't have that message 
 anymore.
 
 The last updated (5 minutes ago) explanatin is found here:
 
 http://wiki.zmanda.com/index.php/Why_does_%27amdump%27_report_%60results_missing%27%3F#UDP_packet_too_large.3F
 
 or shorter:
 
 http://tinyurl.com/ca7pv

Thank you all for the help.

I found that the udp reply packet is short of 37 bytes from 32Kb
taking into account the IP and UPD encapsulation and I can't seem to
find what the max size of udp packet on irix6.5 -- anyone knows? All I
can gather from systune (kernel tunable utilities on irix) is
udp_sendspace = 61440 (0xf000) 60Kb

Anyways, I modified the dumptypes to use exclude list on the client
and now I'm below the udp request from last night is ~27Kb and I see
no timeouts but my exclude fails :( 
Ah, well... I'll post another question when I'm sure that I haven't
made (yet) another stupid mistake...

regards,
jf

 
 
 -- 
 Paul Bijnens, XplanationTel  +32 16 397.511
 Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax  +32 16 397.512
 http://www.xplanation.com/  email:  [EMAIL PROTECTED]
 ***
 * I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
 * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
 * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
 * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
 * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
 * ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
 ***
 

-- 
° 


Re: planner timeouts and udp packet size

2005-12-07 Thread Jean-Francois Malouin
Replying to myself,

* Jean-Francois Malouin [EMAIL PROTECTED] [20051207 16:03]:
 * Paul Bijnens [EMAIL PROTECTED] [20051207 07:41]:
  Jon LaBadie wrote:
  On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote:
  
  I've got this amanda client (server and client at v2.4.5) with a 1.5TB
  raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't
  make it to tape as the planner timeouts. Now, would someone be kind
  enough to jolt my memory as to what is the relationship between UDP
  packet size and the number/size of DLEs entries along with their
  ...
  
  From a Dec. 2004 reply by Paul Bijenes:
  
  Wow, you surely have a large mail archive.  I didn't have that message 
  anymore.
  
  The last updated (5 minutes ago) explanatin is found here:
  
  http://wiki.zmanda.com/index.php/Why_does_%27amdump%27_report_%60results_missing%27%3F#UDP_packet_too_large.3F
  
  or shorter:
  
  http://tinyurl.com/ca7pv
 
 Thank you all for the help.
 
 I found that the udp reply packet is short of 37 bytes from 32Kb
 taking into account the IP and UPD encapsulation and I can't seem to
 find what the max size of udp packet on irix6.5 -- anyone knows? All I
 can gather from systune (kernel tunable utilities on irix) is
 udp_sendspace = 61440 (0xf000) 60Kb
 
 Anyways, I modified the dumptypes to use exclude list on the client
 and now I'm below the udp request from last night is ~27Kb and I see
 no timeouts but my exclude fails :( 
 Ah, well... I'll post another question when I'm sure that I haven't
 made (yet) another stupid mistake...

stupid it was indeed:
I had an extra white space in front of every exclude pattern...

jf

 
 regards,
 jf
 
  
  
  -- 
  Paul Bijnens, XplanationTel  +32 16 397.511
  Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax  +32 16 397.512
  http://www.xplanation.com/  email:  [EMAIL PROTECTED]
  ***
  * I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
  * F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
  * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
  * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
  * init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
  * ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
  ***
  
 


planner timeouts and udp packet size

2005-12-06 Thread Jean-Francois Malouin
Hi,

I've got this amanda client (server and client at v2.4.5) with a 1.5TB
raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't
make it to tape as the planner timeouts. Now, would someone be kind
enough to jolt my memory as to what is the relationship between UDP
packet size and the number/size of DLEs entries along with their
exclude list? The FAQ mentions something about this (UDP packet size
of 64Kb and results missing) but that might not be quite up to date. 

This raid partition has been splitted using quite a few 'exclude
append ./blah' in the disklist and I wonder if it would be better to
have them in a file on the client side rather than explicitely in the
disklist file per se. Would that make any difference at all regarding
what get stuffed in the UDP packets at the estimate phase -- making
them smaller -- so that I wouldn't hit that hard limit?

I hope this makes sense :)

TIA,
jf


Re: planner timeouts and udp packet size

2005-12-06 Thread Jon LaBadie
On Tue, Dec 06, 2005 at 12:06:19PM -0500, Jean-Francois Malouin wrote:
 Hi,
 
 I've got this amanda client (server and client at v2.4.5) with a 1.5TB
 raid splitted/chunked in ~70 DLEs (using gtar) and some of them don't
 make it to tape as the planner timeouts. Now, would someone be kind
 enough to jolt my memory as to what is the relationship between UDP
 packet size and the number/size of DLEs entries along with their
 exclude list? The FAQ mentions something about this (UDP packet size
 of 64Kb and results missing) but that might not be quite up to date. 
 
 This raid partition has been splitted using quite a few 'exclude
 append ./blah' in the disklist and I wonder if it would be better to
 have them in a file on the client side rather than explicitely in the
 disklist file per se. Would that make any difference at all regarding
 what get stuffed in the UDP packets at the estimate phase -- making
 them smaller -- so that I wouldn't hit that hard limit?
 

From a Dec. 2004 reply by Paul Bijenes:

||
|| The problem seems to be in the reply packet.
|| 
|| I've already seen problems with a UDP-packet overflow, but that's
|| unlikely.  That problem happened with older versions where the UDP
|| size was only 8Kbyte or so. Currently it is 64K, but it could be
|| limited by the OS too, of course.  The reply packet is usually larger
|| than the request packet, because it contains 1 to 3 lines for each
|| DLE (level 0, current level, current plus 1).
||

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road(609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Re: planner timeouts

2005-09-12 Thread Paul Bijnens

Charles Sprickman wrote:


That did work.  I removed a few entries from disklist and got a good 
run.  And that brings up another question...  What else is sent in that 
request?  Something seems to vary based on the success of the last run. 


In the sendsize request Amanda can ask for a level 0, level X, and
level X+1, dump.  That could be two or three estimates, depending on
the previous run (e.g. no need to ask for level 2, when a we only have
a level 0 to base on).


ps - ^X^C sounds very familiar from a very long time ago, what is it?


emacs

--
Paul Bijnens, XplanationTel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax  +32 16 397.512
http://www.xplanation.com/  email:  [EMAIL PROTECTED]
***
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
***




Re: planner timeouts

2005-09-09 Thread Charles Sprickman

On Thu, 8 Sep 2005, Paul Bijnens wrote:


http://bway.net/~spork/amanda-tcpdump.txt

You'll note that the frags thing is still set, but if you compare what 
the server sent to what the client received, it is identical.


I believe you have a UDP fragementation network problem somewhere
inbetween the two hosts, or maybe the devel2 host itself.


I believe you're on the right track here, and that this is not an amanda 
issue.  Your input has been really helpful in giving me some direction.



Compare these two lines (first: view on devel2, second: view on h13):

00:07:40.444231 devel2.945  h13.amanda: udp 1465 (ttl 64, id 45642, len 
1493)
00:07:40.484232 devel2.945  h13.amanda: udp 1465 (frag 45642:[EMAIL PROTECTED]) (ttl 
51, len 1492)


The host devel2, sends a packet, claiming length 1493 bytes.
(1465 real payload + 28 bytes UDP packet header = 1493 bytes).


That's it right there.  The server is back at an office on DSL. 
Supposedly without PPPoE.  But after finally getting into the router I see 
that even though there's a routed static subnet and all, it is still doing 
PPPoE.  MTU of 1492 on PPPoE.  Sent packet of 1493, received 1492.  I 
think I see where this is going.



The host h13, is apparently 13 hops away from the client (ttl 64-51).
That's pretty strange, when you say they are connected to the same
switch.  Even more strange is that the reply packets travel 15 hops
(ttl 64-49).  But I'm not the specialist here.


I wasn't clear, devel2, the amanda server is on a DSL line.  The hosts 
that it backs up however are all on the same LAN in the same subnet on the 
same switch in a datacenter.



The server packet however now says the total length is 1492, but the
payload length is still 1465.  Strange because the 28 byte UDP header
seems be shortened by 1 byte? That could be when the UDP checksum, which
is optional, is removed from the header.  Some device inbetween removed
the checksum byte.  Must be a real firewall or layer 3 switch, changing
packets it transmits.


I'm eyeing the Netopia/Cayman DSL router very suspiciously at the moment. 
:)



As workaround, you could try to reduce the UDP packet by e.g.
leaving out some DLE's or omitting compress-fast from the options, or
anything that shortens the string and makes the complete datagram fit
in one packet.


That did work.  I removed a few entries from disklist and got a good 
run.  And that brings up another question...  What else is sent in that 
request?  Something seems to vary based on the success of the last run. 
After that one good partial run, I uncommented everything again in 
disklist for that host and had a successful full run.  In other words, 
after a good partial run the next batch of sendsize requests were smaller 
by a few bytes.  Why?


That would also explain why I now see a second and third host showing the 
same symptoms.  Something in that request grows and pushes me past the 
1492 byte limit.


Anyhow, I've got some really good information to work with now, thanks 
again for all your help and insight.  Now I have to look at how PMTUD 
works and how this setup is breaking it.


Thanks,

Charles

ps - ^X^C sounds very familiar from a very long time ago, what is it?



--
Paul Bijnens, XplanationTel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax  +32 16 397.512
http://www.xplanation.com/  email:  [EMAIL PROTECTED]
***
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
***





Re: planner timeouts

2005-09-08 Thread Paul Bijnens

Charles Sprickman wrote:

On Tue, 6 Sep 2005, Paul Bijnens wrote:

http://bway.net/~spork/amanda-tcpdump.txt

You'll note that the frags thing is still set, but if you compare what 
the server sent to what the client received, it is identical.



I believe you have a UDP fragementation network problem somewhere
inbetween the two hosts, or maybe the devel2 host itself.

Compare these two lines (first: view on devel2, second: view on h13):

00:07:40.444231 devel2.945  h13.amanda: udp 1465 (ttl 64, id 45642, len 
1493)
00:07:40.484232 devel2.945  h13.amanda: udp 1465 (frag 45642:[EMAIL PROTECTED]) 
(ttl 51, len 1492)


The host devel2, sends a packet, claiming length 1493 bytes.
(1465 real payload + 28 bytes UDP packet header = 1493 bytes).
It also seems to be split over many UDP packets, but the following
packet is never seen in the trace.  THe packet that is shown is only 
1486 bytes long.  There should be a second packet with the rest

of 7 bytes, but that is missing from the trace.
From the content of the first packet, that second packet should
contain index+newline, indeed 7 bytes.

The host h13, is apparently 13 hops away from the client (ttl 64-51).
That's pretty strange, when you say they are connected to the same
switch.  Even more strange is that the reply packets travel 15 hops
(ttl 64-49).  But I'm not the specialist here.

The server packet however now says the total length is 1492, but the
payload length is still 1465.  Strange because the 28 byte UDP header
seems be shortened by 1 byte? That could be when the UDP checksum, which
is optional, is removed from the header.  Some device inbetween removed
the checksum byte.  Must be a real firewall or layer 3 switch, changing
packets it transmits.

Anyway, the trailing 7 byte packet never got sent.

I would take a look at the UDP configuration of that client, that seems
to have trouble with splitting large UDP datagrams.

Some OS-es need some twiggling with parameters to tune this.

As workaround, you could try to reduce the UDP packet by e.g.
leaving out some DLE's or omitting compress-fast from the options, or
anything that shortens the string and makes the complete datagram fit
in one packet.


I'm totally stumped here.  We had another server just start doing the 
same thing the other day.  They are all on the same subnet/lan/switch, 
two don't work, the remaining 10 or so do.


Verify if it is indeed the size of the UDP packet sent.
You mean the two that don't work are on the same subnet/lan/switch.
Or are all the server and client on the same subnet/lan/switch?

At least the devel2 machine is not sending the continuation packet,
and the subnet/switch/lan seems to be strange too.


--
Paul Bijnens, XplanationTel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax  +32 16 397.512
http://www.xplanation.com/  email:  [EMAIL PROTECTED]
***
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
***




Re: planner timeouts

2005-09-06 Thread Paul Bijnens

Charles Sprickman wrote:


Me neither...  Is there any way to increase the verbosity of amandad? 
inetd config is good (it's a client, so it's only got the single line 
for amandad), nothing in the system logs, all of the stuff in the 
libexec directory appears to have correct perms (the proper things are 
setuid to my amanda user)...


No, but in the tcpdump trace below, I would add -n 1500 -X to verify
what exactly is sent.


The devel2 (server) view:

19:46:23.967162 devel2.937  h13.blah.com.amanda: udp 117
19:46:24.074337 h13.blah.com.amanda  devel2.937: udp 50
19:46:24.249414 h13.blah.com.amanda  devel2.937: udp 81
19:46:24.249497 devel2.937  h13.blah.com.amanda: udp 50
19:46:24.489787 devel2.937  h13.blah.com.amanda: udp 1465
19:46:34.497794 devel2.937  h13.blah.com.amanda: udp 1465
19:46:44.508815 devel2.937  h13.blah.com.amanda: udp 1465

The h13 (client) view:

19:46:23.982760 devel2.937  h13.blah.com.amanda: udp 117
19:46:24.054390 h13.blah.com.amanda  devel2.937: udp 50
19:46:24.230317 h13.blah.com.amanda  devel2.937: udp 81
19:46:24.264200 devel2.937  h13.blah.com.amanda: udp 50
19:46:24.523791 devel2.937  h13.blah.com.amanda: udp 1465 (frag
59731:[EMAIL PROTECTED])
19:46:34.531821 devel2.937  h13.blah.com.amanda: udp 1465 (frag
62763:[EMAIL PROTECTED])
19:46:44.542471 devel2.937  h13.blah.com.amanda: udp 1465 (frag
9535:[EMAIL PROTECTED])


I'm not a tcpdump guru, it is suspect that the client sends something
flagged as frag, while the server does not see the frag flag.
Could be tcpdump implementation difference on the two machines too.

The server side logs some info in the amdump file, when it receives
the packets.  Alas only after decoding them.  If the server
didn't see them, it won't log anything.  Have a careful look anyway
in the amdump file around that time (the dumpfile contains milestones
in seconds from start of file, so that you can locate the lines around
which the packets were received.

If you're brave, yYou could also try to strace planner.
Because planner is a suid-root program, you need to make an strace
copy with has also the suid-root bit set.  Then add that command in
the amdump shell script around line 100, which then becomes a long line:

  /usr/local/bin/strace-suid -o /tmp/planner-strace -s 1500 
$libexecdir/planner$SUF $@ | ...driver...


(and don't forget to remove the strace-suid program afterwards!)


--
Paul Bijnens, XplanationTel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUMFax  +32 16 397.512
http://www.xplanation.com/  email:  [EMAIL PROTECTED]
***
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  Are you sure?  ...   YES   ...   Phew ...   I'm out  *
***




Re: planner timeouts

2005-09-05 Thread Charles Sprickman

Charles Sprickman wrote:
h13 (client) debug logs.  Note that there is two-way communication, and 
everything seems to go correctly.  In the debug dir, there are only 
amandad debug logs, nothing else.


That doesn't sound right to me.  There should be a sendbackup log file as 
well, a runtar one, and so on.  Can you verify your inetd config on that 
particular client, to see whether there's something afoul?  Have a look at 
the system logs as well, while you're at it.  amanda might be unable to run 
any secondary programs, for instance.


Me neither...  Is there any way to increase the verbosity of amandad? 
inetd config is good (it's a client, so it's only got the single line for 
amandad), nothing in the system logs, all of the stuff in the libexec 
directory appears to have correct perms (the proper things are setuid to 
my amanda user)...



GETTING ESTIMATES...
planner: time 30.956: error result for host h13.blah.com disk /spool: 
Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk 
/var/qmail/bin: Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk 
/var/qmail/control: Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk /var/db/pkg: 
Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk /usr/local/: 
Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk /home: 
Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk /: Request to 
h13.blah.com timed out.

planner: time 30.956: getting estimates took 30.811 secs


Does that spell a 30s timeout somewhere?  amanda.conf not taken into account, 
perhaps?  And the obligatory question, did you double-check that there's no 
firewall between that particular client and server?  (If you did, 
triple-check. :-) )


I have bumped up all the timeouts in amanda.conf to ridiculously large 
values. :)  There are firewalls, and the tcpdump trace I sent was taken 
with each host's firewall software (ipfilter) disabled.  Additionally, the 
firewall logs blocked traffic and had nothing to say about this.


What else can I look at here?

I'm also including the tcpdump output again at the end of this message.

Thanks,

Charles

The devel2 (server) view:

19:46:23.967162 devel2.937  h13.blah.com.amanda: udp 117
19:46:24.074337 h13.blah.com.amanda  devel2.937: udp 50
19:46:24.249414 h13.blah.com.amanda  devel2.937: udp 81
19:46:24.249497 devel2.937  h13.blah.com.amanda: udp 50
19:46:24.489787 devel2.937  h13.blah.com.amanda: udp 1465
19:46:34.497794 devel2.937  h13.blah.com.amanda: udp 1465
19:46:44.508815 devel2.937  h13.blah.com.amanda: udp 1465

The h13 (client) view:

19:46:23.982760 devel2.937  h13.blah.com.amanda: udp 117
19:46:24.054390 h13.blah.com.amanda  devel2.937: udp 50
19:46:24.230317 h13.blah.com.amanda  devel2.937: udp 81
19:46:24.264200 devel2.937  h13.blah.com.amanda: udp 50
19:46:24.523791 devel2.937  h13.blah.com.amanda: udp 1465 (frag
59731:[EMAIL PROTECTED])
19:46:34.531821 devel2.937  h13.blah.com.amanda: udp 1465 (frag
62763:[EMAIL PROTECTED])
19:46:44.542471 devel2.937  h13.blah.com.amanda: udp 1465 (frag
9535:[EMAIL PROTECTED])



Alex


--
Alexander Jolk / BUF Compagnie
tel +33-1 42 68 18 28 /  fax +33-1 42 68 18 29




Re: planner timeouts

2005-08-31 Thread Alexander Jolk

Charles Sprickman wrote:
h13 (client) debug logs.  Note that there is two-way communication, and 
everything seems to go correctly.  In the debug dir, there are only 
amandad debug logs, nothing else.


That doesn't sound right to me.  There should be a sendbackup log file 
as well, a runtar one, and so on.  Can you verify your inetd config on 
that particular client, to see whether there's something afoul?  Have a 
look at the system logs as well, while you're at it.  amanda might be 
unable to run any secondary programs, for instance.



GETTING ESTIMATES...
planner: time 30.956: error result for host h13.blah.com disk /spool: 
Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk 
/var/qmail/bin: Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk 
/var/qmail/control: Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk 
/var/db/pkg: Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk 
/usr/local/: Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk /home: 
Request to h13.blah.com timed out.
planner: time 30.956: error result for host h13.blah.com disk /: Request 
to h13.blah.com timed out.

planner: time 30.956: getting estimates took 30.811 secs


Does that spell a 30s timeout somewhere?  amanda.conf not taken into 
account, perhaps?  And the obligatory question, did you double-check 
that there's no firewall between that particular client and server?  (If 
you did, triple-check. :-) )


Alex


--
Alexander Jolk / BUF Compagnie
tel +33-1 42 68 18 28 /  fax +33-1 42 68 18 29



planner timeouts

2005-08-30 Thread Charles Sprickman

Hi,

Recently we started having trouble getting backups from one of our 
servers.  One day it was working, the next it wasn't.  I've been digging 
around quite a bit and I'm not able to really see where the failure is. 
It appears that the planner process on the client machine times out, but 
I'm not sure why, and I'm seeing nothing in the debug logs on the client 
to indicate what the problem is.  I've reviewed all the changes on the 
client from when it worked until when it didn't and haven't seen anything 
that involves amanda or networking.  We do run firewalls on each host, and 
I've tested with them disabled.


This is a rather old version of Amanda and we will be updating as soon as 
we get the OS and other software up-to-date on all the hosts.  Version is 
2.4.3.


The server (devel2) is in one location (an office with dsl, no pppoe) and 
the clients are located in a datacenter.  All of the other hosts on the 
same network are backing up with no problems.  All other hosts are running 
the same version of amanda, and most are running the same version of 
FreeBSD (4.11).  Some hosts have more data to backup, some have less.


All the timeouts in amanda.conf on the server have been bumped up:
ctimeout 32000  # timeout for clients
dtimeout 16000  # timeout for dump processes
etimeout 32000  # timeout for dump estimates

The logs are lengthy, I apologize in advance.  Below I have included the 
job output from the server, all client debug logs, all server debug logs, 
normal logs from the server, and a tcpdump of the session as seen from 
both hosts.  The host devel2 is the backup server, h13 is the client 
that is failing.




job output:

FAILURE AND STRANGE DUMP SUMMARY:
  h13.blah /spool lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah /var/qmail/bin lev 0 FAILED [Request to h13.blah.com timed 
out.]
  h13.blah /var/qmail/control lev 0 FAILED [Request to h13.blah.com timed 
out.]

  h13.blah /var/db/pkg lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah /usr/local/ lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah /home lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah / lev 0 FAILED [Request to h13.blah.com timed out.]

h13 (client) debug logs.  Note that there is two-way communication, and 
everything seems to go correctly.  In the debug dir, there are only 
amandad debug logs, nothing else.


amandad: time 0.000: got packet:

Amanda 2.4 REQ HANDLE 000-80930808 SEQ 1125445583
SECURITY USER operator
SERVICE noop
OPTIONS features=feff9f00;


amandad: time 0.000: sending ack:

Amanda 2.4 ACK HANDLE 000-80930808 SEQ 1125445583


amandad: time 0.089: bsd security: remote host devel2 user operator local 
user operator

amandad: time 0.175: amandahosts security check passed
amandad: time 0.175: running service noop
amandad: time 0.176: sending REP packet:

Amanda 2.4 REP HANDLE 000-80930808 SEQ 1125445583
OPTIONS features=feff9f00; 



amandad: time 0.210: got packet:

Amanda 2.4 ACK HANDLE 000-80930808 SEQ 1125445583


amandad: time 0.210: pid 19416 finish time Tue Aug 30 19:46:24 2005

devel2 (server) debug logs.  Not sure if this is useful, but here it is:

amtrmidx.20050830194655.debug :

h13.blah.com /
rm /var/db/amanda/index/h13.blah.com/_/20050707_2.gz
rm /var/db/amanda/index/h13.blah.com/_/20050703_2.gz
h13.blah.com /home
rm /var/db/amanda/index/h13.blah.com/_home/20050707_6.gz
h13.blah.com /usr/local/
rm /var/db/amanda/index/h13.blah.com/_usr_local_/20050707_0.gz
rm /var/db/amanda/index/h13.blah.com/_usr_local_/20050703_2.gz
h13.blah.com /var/db/pkg
rm /var/db/amanda/index/h13.blah.com/_var_db_pkg/20050707_1.gz
rm /var/db/amanda/index/h13.blah.com/_var_db_pkg/20050703_0.gz
h13.blah.com /var/qmail/control
rm /var/db/amanda/index/h13.blah.com/_var_qmail_control/20050707_1.gz
rm /var/db/amanda/index/h13.blah.com/_var_qmail_control/20050703_1.gz
h13.blah.com /var/qmail/bin
rm /var/db/amanda/index/h13.blah.com/_var_qmail_bin/20050707_1.gz
rm /var/db/amanda/index/h13.blah.com/_var_qmail_bin/20050703_1.gz
h13.blah.com /spool

amtrmlog.20050830194654.debug:

amtrmlog: debug 1 pid 39378 ruid 2 euid 2: start at Tue Aug 30 19:46:54 
2005

/usr/local/libexec/amanda/amtrmlog: version 2.4.3
Keeping 48 log files
amtrmlog: pid 39378 finish time Tue Aug 30 19:46:55 2005

standard log file.  we do see the planner timeout here, but it's not 
clear to me why.:


/var/log/amanda/log.1:

dumper: dgram_bind: socket bound to 0.0.0.0.943
dumper: pid 39290 executable dumper2 version 2.4.3, using port 943
dumper: dgram_bind: socket bound to 0.0.0.0.945
dumper: pid 39292 executable dumper4 version 2.4.3, using port 945
dumper: dgram_bind: socket bound to 0.0.0.0.942
dumper: pid 39289 executable dumper1 version 2.4.3, using port 942
dumper: dgram_bind: socket bound to 0.0.0.0.944
dumper: pid 39291 executable dumper3 version 2.4.3, using port 944
planner: time 0.028: dgram_bind: