>Synopsis: Inactive Wireguard connections never time out
>Category: Networking
>Environment:
System : OpenBSD 7.6
Details : OpenBSD 7.6-current (CUSTOM) #4: Sat Dec 14 01:13:27 GMT
2024
lloyd@bsdtst01:/sys/arch/amd64/compile/CUSTOM
Architecture: OpenBSD.amd64
Machine : amd64
>Description:
I've been running the Wireguard logging patch I submitted to -tech for
a few days
with little other issue. Recently I did some testing with a device over
cellular
(dynamic client behind CG-NAT). The testing was completed and the
client powered
off and disconnected from the remote network. In other words, it was no
longer
at the remote IP address (typical of a roaming client).
The next morning I checked the syslog to find 17 hours worth of the
following
messages (until I rebooted the box), repeated every 5 seconds, with
x.x.x.x is
the remote IP address of the wireless provider's CG-NAT server:
--snip--
09:27:00 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:05 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:05 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:11 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:11 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:16 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:16 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:21 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:21 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:26 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:26 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:31 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:31 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:36 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:36 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:42 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:42 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:47 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 3)
09:27:47 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:52 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:27:52 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:57 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 3)
09:27:57 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:03 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:28:03 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:08 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 3)
09:28:08 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:13 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
09:28:13 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:18 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 3)
09:28:18 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:23 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds,
retrying (try 2)
--snip--
As you can see, it appears the Wireguard service will continue to try
and reach out to
the remote endpoint forever, even after it's long gone. If I understand
correctly, it's
supposed to try MAX_TIMER_HANDSHAKES times (aka 18) then stop. The log
entries most often
have the value of "try 2" repeated, with occasional "try 3". I grep'd
the logs and over
17 hours of data, I have exactly 4 log entries of "try 4" and 3 log
entries of "try 5".
I double checked my patch and it does not seem to alter any logic,
simply presentation of
debug logging information.
The value of t below is passed as a pointer to
wg_timers_run_retry_handshake():
if (t->t_handshake_retries <= MAX_TIMER_HANDSHAKES)
My C is a bit rusty, but my suspicion is one of the instances of:
t->t_handshake_retries = 0;
scattered elsewhere in the code is being called, zeroing out that
counter before it has
a chance to complete and creating a race condition.
>How-To-Repeat:
Connect a remote endpoint to an OpenBSD Wireguard server. Shutdown or
disconnect the remote
endpoint. Observe any debug logs generated.
>Fix:
Unknown