On Fri, Nov 24, 2006 at 05:22:05PM +0100, Håkan Olsson wrote:
> 5. the selected SPI (or "larval" SA state) on the local system is
> updated with the keying material, timeouts etc - i.e the "real" SA is
> finalized
>
> This continues until all negotiations are complete -- however there
> is a limit on how long this "larval" SA lives in the kernel... as you
> may guess it's 60 seconds. (The idea being if a negotiation has not
> completed in 60 seconds something has probably failed.)
>
> Since the hosts seems to be a bit slow in running IKE negotiations,
> you hit the 60 second limit before all negotiations are complete, all
> remaining "larval" SAs are dropped and when isakmpd tries to "update"
> them into real SAs this of course fails. ("No such process" approx
> means "no SA found" here.)
Thank you for that very clear description. Is this 60-second timeout a
tunable? Or can you point me to where it's defined in the kernel? I'd
like to try increasing it.

However, at this stage I don't really understand why setting -D 5=99,
which generates copious logs, makes it work. In fact I can get to 3,000
tunnels (6,000 flows) within a couple of minutes with this flag set.
Perhaps this extra logging delays the starts of some of the
negotiations, somehow spreading the workload. (Maybe having a
workload-spreading option, so that no more than N outstanding exchanges
are present at once, would be a useful control anyway.)

> PS
> When I tried between two ~700Mhz P-III machines a while back, setting
> up 4096 (or was it 8k) SAs was no problem. Another developer had a
> scenario setting up 40960 SAs over loopback on his laptop -- mainly a
> test of kernel memory usage, but he did not hit the 60s larval-SA
> time limit there either.

I can think of several possibilities as to why some negotiations are
taking more than 60 seconds. For instance:

(1) The Cisco 7301 may be slow to respond. It does have a VAM2+ crypto
accelerator installed, but I don't know if it's used for isakmp
exchanges, or just for symmetric encryption/decryption. (However, 'show
proc cpu history' suggests CPU load is no more than about 25%.)

(2) There may be packet loss and retransmissions, maybe due to some
network buffer overflowing, either on OpenBSD or Cisco. The OpenBSD box
is using a nasty rl0 card, because that's the only spare interface I
had available to go into the test LAN. Having said that, watching with
'top' I don't see the interrupt load go above 10%.

I'm not sure how to probe deeper to get a handle on what's actually
happening though. Perhaps isakmpd -L logging might shed some light,
although I don't fancy decoding QM exchanges by hand :-(

Regards,

Brian.