Re: BUG (was: Re: Handitarded....odd (partial) estimate timeout errors.)
--On January 5, 2006 4:49:53 PM +0100 Paul Bijnens <[EMAIL PROTECTED]> wrote:

> Michael Loftis wrote:
>> Paul asked for the logs, it seems like there's an amanda bug. The units
>
> Yes, indeed, there is a bug in Amanda!
> You have 236 DLEs for that host, and from my reading of the code the
> REQuest UDP packet is limited to 32K instead of 64K (see planner.c
> lines 1377-1383). (Need to update the documentation!)

Woot, I'm NOT crazy! :D ...did I just say woot? My apologies.

> It seems that the planner splits up the REQuest packet into separate
> UDP packets when exceeding MAX_DGRAM/2, i.e. 32K. Your first request
> was 32580 bytes. Adding the next string to that request would have
> exceeded the 32768 limit. The reason for the division by 2 seems to be
> to reserve space for error replies on each of those.

I knew it was size related, but my packets were significantly smaller than MAX_DGRAM. This definitely explains it.

> However, the amandad client expects one and only one REQuest packet.
> Any other REQuest packet coming from the same connection (5-tuple:
> protocol, remotehost, remoteport, localhost, localport) and having
> type "REQ" is considered a duplicate. It should actually test for the
> handle and sequence to be identical too. It does not.
>
> It's not quickly fixed either: when receiving the first "REQ" packet,
> the amandad client forks and execs the request program (sendsize in
> this case) and reads the results from a pipe. By the time the second,
> non-identical request comes in (with different handle, sequence --
> which is currently not checked), sendsize is already started and
> cannot be given additional DLEs to estimate.
>
> As a temporary workaround, you could shorten the exclude-list string
> for that host by creating a symlink:
>
>     ln -s /etc/amanda/exclude.gtar /.excl

Yeah... This will help for a time. Hopefully long enough for a patch to fix amandad. I'll have to create a separate type for this server, since we've got well over a hundred now and they all share that main backup type.

I figured shortening the UDP packets somehow would help. I knew it was just odd that it wasn't quite right, and I seemed to be running into the problem way too early :)

> and use that as exclude-list: this shortens each line by 20 bytes,
> which would shrink the packet to fit again. (236 DLEs * 20 = 4720
> bytes less in a REQuest UDP packet for that host!)
>
>> Anyway I'm getting a headache thinking about it :) all my other DLEs
>> seem ok for that host, and the ones that it misses are not always
>> exactly the same, but all seem to be non-calcsize estimated.
>
> Just bad luck for those entries that happen to go at the end of the
> queue.
>
> On the other hand, when really unlucky, you could have up to three
> estimates for each DLE, overflowing even the 4K we saved by shrinking
> the exclude string...

Like I said, hopefully by then either the hackers (or myself) will have put together a patch.

> ...

I see three ways to fix this... one of which I don't know will fix it: what about turning wait=yes to wait=no in my xinetd.conf? Not sure what that would break. The other two involve code... multiple sendsize's, *or* a protocol change to wait for a 'final start' packet, or an amandad change to wait a few extra seconds before starting the actual sendsize, coalescing the results.

And you're right, the other ways aren't easy... one involves possibly breaking the protocol too.
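
The duplicate check discussed above can be illustrated with a small Python sketch. This is not Amanda's code (amandad is written in C); the `Packet` type, the function names, and the example hosts, ports, and handle values are all hypothetical. It only contrasts the check as Paul describes it (5-tuple plus type "REQ") with the stricter one he says it should be (handle and sequence compared as well).

```python
from typing import NamedTuple


class Packet(NamedTuple):
    # The first five fields form the connection 5-tuple named in the thread.
    proto: str
    remote_host: str
    remote_port: int
    local_host: str
    local_port: int
    ptype: str      # packet type, e.g. "REQ"
    handle: str     # hypothetical handle value
    sequence: int   # hypothetical sequence number


def is_duplicate_as_described(first: Packet, pkt: Packet) -> bool:
    """Check as described in the thread: same 5-tuple and type "REQ"."""
    return pkt[:5] == first[:5] and pkt.ptype == "REQ"


def is_duplicate_strict(first: Packet, pkt: Packet) -> bool:
    """Stricter check: handle and sequence must match too."""
    return (is_duplicate_as_described(first, pkt)
            and (pkt.handle, pkt.sequence) == (first.handle, first.sequence))


# Two REQ packets from the same connection, carrying different DLEs.
first = Packet("udp", "server", 850, "client", 10080, "REQ", "000-0001", 1)
second = first._replace(handle="000-0002", sequence=2)

dropped_today = is_duplicate_as_described(first, second)   # second REQ is discarded
dropped_strict = is_duplicate_strict(first, second)        # second REQ is recognised as new
```

Note that, as Paul points out, recognising the second request as new is not enough on its own: sendsize is already running by then, so the client would still need a way to act on the extra DLEs.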
BUG (was: Re: Handitarded....odd (partial) estimate timeout errors.)
Michael Loftis wrote:
> Paul asked for the logs, it seems like there's an amanda bug. The units

Yes, indeed, there is a bug in Amanda!
You have 236 DLEs for that host, and from my reading of the code the REQuest UDP packet is limited to 32K instead of 64K (see planner.c lines 1377-1383). (Need to update the documentation!)

It seems that the planner splits up the REQuest packet into separate UDP packets when exceeding MAX_DGRAM/2, i.e. 32K. Your first request was 32580 bytes. Adding the next string to that request would have exceeded the 32768 limit. The reason for the division by 2 seems to be to reserve space for error replies on each of those.

However, the amandad client expects one and only one REQuest packet. Any other REQuest packet coming from the same connection (5-tuple: protocol, remotehost, remoteport, localhost, localport) and having type "REQ" is considered a duplicate. It should actually test for the handle and sequence to be identical too. It does not.

It's not quickly fixed either: when receiving the first "REQ" packet, the amandad client forks and execs the request program (sendsize in this case) and reads the results from a pipe. By the time the second, non-identical request comes in (with different handle, sequence -- which is currently not checked), sendsize is already started and cannot be given additional DLEs to estimate.

As a temporary workaround, you could shorten the exclude-list string for that host by creating a symlink:

    ln -s /etc/amanda/exclude.gtar /.excl

and use that as exclude-list: this shortens each line by 20 bytes, which would shrink the packet to fit again. (236 DLEs * 20 = 4720 bytes less in a REQuest UDP packet for that host!)

> Anyway I'm getting a headache thinking about it :) all my other DLEs
> seem ok for that host, and the ones that it misses are not always
> exactly the same, but all seem to be non-calcsize estimated.

Just bad luck for those entries that happen to go at the end of the queue.

On the other hand, when really unlucky, you could have up to three estimates for each DLE, overflowing even the 4K we saved by shrinking the exclude string...

-- 
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  [EMAIL PROTECTED]
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup,   *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown,   *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ... "Are you sure?" ... YES ... Phew ... I'm out                    *
***********************************************************************
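
The size arithmetic in this message can be checked with a rough Python sketch. It is not the planner's actual code (planner.c is C): the greedy packing rule, the 64K value for MAX_DGRAM, and the 150-byte request lines are assumptions chosen for illustration, and packet headers are ignored. Only the 236 DLEs, the 32768-byte limit, and the 20 bytes saved per line come from the thread.

```python
MAX_DGRAM = 64 * 1024        # assumed 64K, as implied by the thread
REQ_LIMIT = MAX_DGRAM // 2   # 32768, the limit Paul reads from planner.c


def split_requests(lines, limit=REQ_LIMIT):
    """Greedily pack request lines into packets of at most `limit` bytes."""
    packets, current, size = [], [], 0
    for line in lines:
        n = len(line.encode())
        # Start a new packet when adding this line would exceed the limit.
        if current and size + n > limit:
            packets.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        packets.append(current)
    return packets


# Hypothetical request lines: 150 bytes each before the workaround,
# 130 bytes each after the exclude-list path is shortened by 20 bytes.
long_lines = ["x" * 149 + "\n"] * 236
short_lines = ["x" * 129 + "\n"] * 236

before = split_requests(long_lines)    # spills into a second packet
after = split_requests(short_lines)    # fits in a single packet
saved = 236 * 20                       # bytes saved by the symlink workaround
```

With these made-up line lengths the unshortened request needs two packets, and the second one is what the client would treat as a duplicate. After shortening, everything fits in one packet, but with little headroom if the number of estimates per DLE grows.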