Re: [Swan] Fwd: Problem with random rekey failures

Paul Wouters Tue, 15 Jun 2021 08:40:55 -0700

On Tue, 15 Jun 2021, Miguel Ponce Antolin wrote:

> I have been suffering a random problem with libreswan v3.25 when connecting an
> AWS EC2 Instance running Libreswan and a Cisco ASA on the other end.


Is it possible to test v4.4? We have RPMs built at
download.libreswan.org/binaries/

Specifically, with that many subnets you likely need this fix from 4.4:

* IKEv2: Connections would not always switch when needed [Andrew/Paul]

But the changelog between 3.25 and 4.4 is huge. There might be other
items you need too.

Alternatively, you can try splitting your subnets into different
conns, e.g.:


       conn vpn
           type=tunnel
           authby=secret
           # use auto=ignore, will be read in via also= statements
           auto=ignore
           left=%defaultroute
           leftid=xxx.xxx.xxx.120
           leftsubnets=xxx.xxx.xxx.80/28
           right=xxx.xxx.xxx.45
           rightid=xxx.xxx.xxx.45
           # no rightsubnet= here
           # don't use this with more than one subnet...
           leftsourceip=xxx.xxx.xxx.92
           ikev2=insist
           ike=aes256-sha2;dh14
           esp=aes256-sha256
           keyexchange=ike
           ikelifetime=28800s
           salifetime=28800s
           dpddelay=30
           dpdtimeout=120
           dpdaction=restart
           encapsulation=no

      conn vpn-1
        also=vpn
        auto=start
        rightsubnet=10.subnet.1.0/22

      conn vpn-2
        also=vpn
        auto=start
        rightsubnet=10.subnet.2.0/20

      [...]

      conn vpn-18
        also=vpn
        auto=start
        rightsubnet=10.subnet.18.9/32


This uses a slightly different code path to get all the tunnels loaded and 
active.

> We tried to "force" a reconnect by pinging an IP in various
> rightsubnets, but while the problem is active we continuously see this
> kind of log:


That would be hacky and not really solve race conditions.

> Jun 11 11:17:25.795153: "vpn/1x15" #221: message id deadlock? wait sending, add to send next list using parent #165 unacknowledged 1 next message id=63 ike exchange window 1


Note that this is a bit of a concern. You can only have one IKE message
outstanding, and this indicates that the Cisco might not be answering
that outstanding message. In that case the only thing libreswan can do is
wait longer or restart _everything_ related to that IKE SA, which
means all tunnels. We did reduce the chance of a message ID deadlock
at some point in the past with our pending() code, so again, testing
with an upgraded libreswan would be useful.

> Is there any troubleshooting we could do in order to know where the rekey
> request is lost, or why it is not trying to rekey at all when this problem is
> active?


Depending on what the issues are, you can try to ensure that either libreswan
or the Cisco is always the rekey initiator by tweaking ikelifetime and
salifetime. E.g., try ikelifetime=24h with salifetime=8h and most likely
the Cisco will trigger all the rekeys. Or use ikelifetime=2h and
salifetime=1h to make libreswan most likely always initiate the rekeys.
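
As a rough sketch, the two lifetime options would replace the
ikelifetime=/salifetime= lines in the shared "conn vpn" section, so that
every vpn-N conn inherits them through also=. The values are the ones
suggested above. Which side actually rekeys first depends on the lifetimes
configured on the Cisco, which are not known here, so treat the comments
as assumptions and pick only one option:

```
       conn vpn
           # (other settings unchanged)
           #
           # Option A: long lifetimes on the libreswan side. Assuming the
           # Cisco is configured with shorter ones, it will most likely
           # initiate all the rekeys.
           ikelifetime=24h
           salifetime=8h
           #
           # Option B: short lifetimes on the libreswan side. Assuming the
           # Cisco is configured with longer ones, libreswan will most
           # likely initiate all the rekeys. Use instead of Option A.
           #ikelifetime=2h
           #salifetime=1h
```

Having a single side initiate every rekey does not fix a peer that leaves
an IKE message unanswered, but it should make the logs easier to read when
working out where a rekey request gets lost.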

Paul
