Hi, I am hunting a couple of corner cases, IKEv2 rekey initiator failures. These issues appear when testing clones. Think of 100 IKEv2 Child SAs under one IKE SA and rekeying them all. In the test rekey margin and salife are short. Short values do not matter, because looking back in Tuomo production systems logs, with larger rekey margin and saflifetime, I see the same issues.
First one is possibly fixed. https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz The pluto log will show the following debugline and rekey response is dropped. Message ID: CHILD #1.#6 XXX: expecting initiator==NULL - suspect record'n'send with an out-of-order wrong packet response; discarding packet: ike.initiator.sent=7 ike.initiator.recv=6 ike.responder.sent=-1 ike.responder.recv=-1 child.wip.initiator=5 child.wip.responder=-1 this is due to delete messages jumping the queue and Message ID seems to sync up again with next dpd or delete message. I am not sure that is RFC complaint. Second One seems to harder to trace. Here are the symptoms that I found so far. Rekaying IPsec fails with the following message and the connections disappears. If you have auto=route it may come back again. auto=route is a bandaid. Apr 6 13:58:50.367487: | ikev2_child_sa_respond returned STF_INTERNAL_ERROR Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished abort this exchange the internal error seems to caused EVENT_SA_EXPIRE, timeout in 0 seconds which removed predecessor state before rekey send the message. Apr 6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0 seconds for #3500 Apr 6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as #3699 for IPSEC SA Apr 6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state #3500 It seems when replace is hit, code decided to rekey, and also scheduled an EXPIRE. Usually margin is 70s or so Which is enough to rekey. In this case it margin end as 0. Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I ran it manually for longer time. It took 58 minutes before this error to occured. WARNING: the following log is big 100s Mega bytes, careful when clicking on the link in a browser. It is better to download it using wget to read. https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/west.pluto.log.gz I let the test run, it happend twice in about 6 hours. Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished abort this exchange Apr 6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320 vanished abort this exchange Looking back in older logs, Tuomo's log, I see the same sequence and connrection restarting either due to keying tries or traffic. He is running with default rekey margin and fuzz, while the test has shorter rekey margin. Note it is not just because of shorter rekey margin or salifetime. The oldest seems to hapeend in "Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to rekey #912 vanished abort this exchange" I am wondering where is the bug? pluto's rekey margin calcuation, or some other logic forcing the EXPIRE event. -antony _______________________________________________ Swan-dev mailing list Swan-dev@lists.libreswan.org https://lists.libreswan.org/mailman/listinfo/swan-dev