Re: [Openvpn-devel] weird issue with server failover when Not using keepalive

Arne Schwabe Fri, 04 Dec 2020 06:39:53 -0800

On 04.12.20 at 11:59, Jan Just Keijser wrote:
> hey guys,
> 
> I'm posting this on behalf of the eduVPN team. François Kooman spent a
> long time debugging an issue and finally managed to find the piece of
> code that causes the weird behavior.
> Let me explain:
> 
> For eduVPN, multiple openvpn instances are offered, both on UDP and TCP
> ports and the client config that is used lists all of these instances.
> The client can then do automatic roll-over to a TCP based setup if UDP
> is not working (blocked) for some reason.
> Now François had *not* set the keepalive option in the TCP setup, as a
> TCP connection has a keepalive of its own, more or less and this caused
> some very odd behaviour:
> 
> 1) the client tries to connect to a UDP based server; server is
> down/blocked, hence openvpn does a failover to the next remote
> 2) openvpn connects but after exactly 2 minutes the connection is restarted
> 3) the reconnects keep happening every 2 minutes suggesting it is a
> ping-restart/keepalive setting
> 
> We've tracked this down to the following piece of code, which has been
> present in the OpenVPN code base since v2.1 (which was the first version
> to support connection entries). File is init.c, here from v2.4.9:
> 
> static void
> update_options_ce_post(struct options *options)
> {
> #if P2MP
>     /*
>      * In pull mode, we usually import --ping/--ping-restart parameters from
>      * the server.  However we should also set an initial default --ping-restart
>      * for the period of time before we pull the --ping-restart parameter
>      * from the server.
>      */
>     if (options->pull
>         && options->ping_rec_timeout_action == PING_UNDEF
>         && proto_is_dgram(options->ce.proto))
>     {
>         options->ping_rec_timeout = PRE_PULL_INITIAL_PING_RESTART;
>         options->ping_rec_timeout_action = PING_RESTART;
>     }
> #endif
> }
> 
> When failing over, this function 'update_options_ce_post' is called and
> for UDP based connections, the ping_rec_timeout is updated.
> *Why?*
> Note that ping_rec_timeout is a GLOBAL option and affects all connection
> entries, both TCP and UDP based. Comment out the call to
> 'update_options_ce_post' and the restarts are gone.
> Shall we just comment out/remove this particular piece of code altogether?


No, it still serves a purpose in normal/other setups. It just breaks
*your* setup. Without it, a client can be stuck forever if the
connection times out after the control channel has been established but
before a PUSH_REPLY has been received, if my quick analysis here is
correct.

But the logic of proto_is_dgram and how it is set is strange, I definitely
grant you that. This might also be a bug that got more exposed by my
removal of the non-connection logic in OpenVPN 2.4. Before OpenVPN 2.4
we had completely different code paths for profiles containing
<connection> blocks and profiles without them.
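
To make the interaction concrete, here is a minimal standalone sketch of
the effect described above. It is not the OpenVPN code: the structs are
cut down to the fields involved, select_entry() is a hypothetical helper
standing in for the failover path, and the value 120 for
PRE_PULL_INITIAL_PING_RESTART is an assumption chosen to match the
2 minute restarts in the report.

```c
#include <stdbool.h>

/* Simplified stand-ins for the OpenVPN types, not the real definitions. */
enum ping_action { PING_UNDEF = 0, PING_EXIT, PING_RESTART };
enum proto { PROTO_UDP, PROTO_TCP };

/* Assumed value (seconds), matching the observed 2 minute restarts. */
#define PRE_PULL_INITIAL_PING_RESTART 120

struct connection_entry {
    enum proto proto;
};

struct options {
    bool pull;
    int ping_rec_timeout;                     /* global, shared by all entries */
    enum ping_action ping_rec_timeout_action; /* global as well */
    struct connection_entry ce;               /* currently selected entry */
};

static bool
proto_is_dgram(enum proto p)
{
    return p == PROTO_UDP;
}

/* Same condition as the quoted function, applied to the reduced struct. */
static void
update_options_ce_post(struct options *options)
{
    if (options->pull
        && options->ping_rec_timeout_action == PING_UNDEF
        && proto_is_dgram(options->ce.proto))
    {
        options->ping_rec_timeout = PRE_PULL_INITIAL_PING_RESTART;
        options->ping_rec_timeout_action = PING_RESTART;
    }
}

/* Hypothetical helper: select a connection entry, then run the hook. */
static void
select_entry(struct options *options, struct connection_entry ce)
{
    options->ce = ce;
    update_options_ce_post(options);
}
```

In this model, selecting a UDP entry and then a TCP entry leaves
ping_rec_timeout at 120 for the TCP connection, because nothing resets
the global value on failover. Starting directly with a TCP entry leaves
it at 0.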


> JJK
> 
> PS This leads me to think that perhaps the ping-* options should be made
> connection-entry specific.  That way, you can have different behavior
> for TCP based setups and UDP based setups. Also note (see below) that
> this problem also affects failover from one UDP based server to the
> next if --keepalive is disabled, so it's not "just" UDP vs TCP.

Normally keepalive settings should be set by the server, since a
mismatching keepalive between client and server is not a good idea anyway.
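
As a rough sketch of why a server-side setting matters here: the code
comment quoted earlier says the pre-pull default only covers the time
before --ping-restart is pulled from the server. The model below assumes
that a pushed value simply replaces the default, and that a restart is
due once nothing has been received for the timeout. The struct
push_reply and both functions are hypothetical names for illustration,
and the sketch does not explain why a single-remote config behaves
differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed value (seconds), matching the observed 2 minute restarts. */
#define PRE_PULL_INITIAL_PING_RESTART 120

/* Hypothetical model of the ping-restart part of a PUSH_REPLY. */
struct push_reply {
    bool has_ping_restart;
    int ping_restart; /* seconds */
};

/* Timeout the client ends up with once the pull phase is over. */
static int
effective_ping_restart(int pre_pull_timeout, const struct push_reply *reply)
{
    if (reply != NULL && reply->has_ping_restart)
    {
        return reply->ping_restart; /* the server's value wins */
    }
    return pre_pull_timeout; /* nothing pushed: the default stays */
}

/* True once no packet has arrived for `timeout` seconds; 0 disables it. */
static bool
ping_restart_due(int timeout, long last_rx, long now)
{
    return timeout > 0 && now - last_rx >= timeout;
}
```

With a server that pushes nothing, the client keeps the 120 s default in
this model, and a link that stays silent for 120 s then triggers a
restart.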

> I managed to recreate this setup and I can even get the same odd
> behaviour in a 100% UDP based setup.
> Server config:
> ###############
> proto udp
> port 1194
> dev tun
> server 10.200.0.0 255.255.255.0
> dh       dh2048.pem
> ca       ca.crt
> cert     server.crt
> key      server.key
> persist-key
> persist-tun
> topology subnet
> user  nobody
> group nobody  # use "group nogroup" on some distros
> cipher aes-256-cbc
> auth   sha256
> ###############
> 
> (yes, I know the server blurts out
>   WARNING: --keepalive option is missing from server config
> on startup)
> 
> Client config:
> ###############
> client
> remote <server> 1195 udp  ## use a non available port first
> remote <server> 1194 udp
> ### remote <server> 1195 tcp
> dev tun
> nobind
> remote-cert-tls server
> ca       ca.crt
> cert     client1.crt
> key      client1.key
> cipher aes-256-cbc
> auth   sha256
> ################
> 
> So the client first connects to a (non-existent) server, and then fails
> over to the second entry, and we get a connection. Then, every 2 minutes
> we get a connection restart.
> 
> If I change the client config to list only a single
>   remote <server> 1194 udp
> line then this reconnect behavior does NOT occur ?!?!?!?
> 
This might be a bug in the initialisation order, namely that the ping
timer is armed before next_connection_entry is called. If you force a
reconnect by restarting the server or with kill -USR1, does it then also
show the disconnect-after-120s behaviour?

Arne


_______________________________________________
Openvpn-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openvpn-devel
