Re: WireGuard host crashes roughly every week

2021-08-05 Thread Matt P.
Thanks so much Matt!

It works! I re-enabled PersistentKeepalive overnight and mbufs are staying
low.

The failed handshakes are still occurring. "ifconfig wg0 debug" filled my dmesg
with hundreds of lines like:

> wg0: Handshake for peer 10 did not complete after 5 seconds, retrying (try 6)
> wg0: Sending handshake initiation to peer 10
> wg0: Zeroing out keys for peer 10

But I don't have any evidence that this is hurting anything :)

--Matt

> On Aug 4, 2021, at 5:36 AM, Matt Dunwoodie wrote:
> 
> On Tue, 3 Aug 2021 13:02:15 -0500
> "Matt P." wrote:
> 
>> Hi Stuart!
>> 
>> Your advice led me to discover that the issue happens only with the
>> "PersistentKeepalive = 25" option I had enabled on each wg-quick
>> peer. It looks like you could recreate it by making a few no-address
>> peers with this option enabled.
> 
> Hi Matt,
> 
> This insight was very helpful. It looks like mbufs are not freed if
> we're sending to a peer with no endpoint. Specifically, "wg_send" is
> expected to free the mbuf if there is an error sending. This (untested)
> patch should fix it.
> 
> Cheers,
> Matt
> 
> diff --git if_wg.c if_wg.c
> index 18333eda4cb..5f4319558ab 100644
> --- if_wg.c
> +++ if_wg.c
> @@ -810,6 +810,7 @@ wg_send(struct wg_softc *sc, struct wg_endpoint *e, struct mbuf *m)
>  			    IPPROTO_IPV6);
>  #endif
>  	} else {
> +		m_freem(m);
>  		return EAFNOSUPPORT;
>  	}
> 
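
To make the ownership rule behind the patch concrete, here is a toy userland
model in Python. It is not kernel code. It assumes only what the patch
description says: the caller hands the buffer to the send routine, and the send
routine must free it on every path, including the error path taken for a peer
with no endpoint. The names Pool, send, leaked_after and the free_on_error
switch are made up for illustration.

```python
import errno


class Pool:
    """Counts buffers that have been allocated but not yet freed."""

    def __init__(self):
        self.in_use = 0

    def alloc(self):
        self.in_use += 1
        return object()  # stand-in for an mbuf

    def free(self, buf):
        self.in_use -= 1


def send(pool, family, buf, free_on_error):
    """Model of a send routine that owns `buf` once it is called.

    Returns 0 on success or an errno value on failure. A family of None
    stands for a peer with no endpoint (hypothetical encoding).
    """
    if family in ("inet", "inet6"):
        pool.free(buf)  # a completed send consumes the buffer
        return 0
    if free_on_error:
        pool.free(buf)  # patched behaviour: free before returning the error
    return errno.EAFNOSUPPORT


def leaked_after(keepalives, free_on_error):
    """Send `keepalives` packets to an endpoint-less peer; return buffers left."""
    pool = Pool()
    for _ in range(keepalives):
        send(pool, None, pool.alloc(), free_on_error)
    return pool.in_use


if __name__ == "__main__":
    print("without free:", leaked_after(100, free_on_error=False))
    print("with free:   ", leaked_after(100, free_on_error=True))
```

In this model every keepalive aimed at an endpoint-less peer strands one buffer
unless the error path frees it, which is consistent with the steadily climbing
mbuf count reported earlier in the thread.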



Re: WireGuard host crashes roughly every week

2021-08-03 Thread Matt P.
(Resending, as I forgot to include the mailing list itself)

> On Aug 1, 2021, at 3:37 AM, Stuart Henderson wrote:
> 
> It is always good to include dmesg when reporting a problem.
> 
> An outline of the wireguard and other network config would be
> useful too. If you can give instructions to reproduce that would
> be ideal. If not then as much information about the setup as
> possible so we can try to reproduce.
> 
> Does anything funny show up in dmesg if you do "ifconfig wg0
> debug"? (replace/repeat wg0 if you have other wg interfaces).


Hi Stuart!

Your advice led me to discover that the issue happens only with the
"PersistentKeepalive = 25" option I had enabled on each wg-quick peer. It looks
like you could recreate it by making a few no-address peers with this option
enabled.

In /etc/wireguard/wg0.conf I have a config file for wg-quick:

> [Interface]
> PrivateKey = 
> ListenPort = 5  
> Address = 10.0.166.1/24
> SaveConfig = false
> MTU = 1400
> 
> [Peer]
> # ExamplePeer1
> PresharedKey =
> PublicKey =
> AllowedIPs = 10.0.166.2/32
> PersistentKeepalive = 25

... And so on.

With PersistentKeepalive enabled, 'ifconfig wg0 debug' leaves these messages
in the dmesg:

> wg0: Handshake for peer 6 did not complete after 5 seconds, retrying (try 18)
> wg0: Sending handshake initiation to peer 6
> wg0: Sending handshake initiation to peer 3
> wg0: Sending handshake initiation to peer 7
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 2 did not complete after 5 seconds, retrying (try 18)
> wg0: Sending handshake initiation to peer 2
> wg0: Sending handshake initiation to peer 1
> wg0: Handshake for peer 4 did not complete after 5 seconds, retrying (try 14)
> wg0: Sending handshake initiation to peer 4
> wg0: Sending handshake initiation to peer 5
> wg0: Handshake for peer 6 did not complete after 5 seconds, retrying (try 19)
> wg0: Sending handshake initiation to peer 6
> wg0: Handshake for peer 3 did not complete after 5 seconds, retrying (try 2)
> wg0: Sending handshake initiation to peer 3
> wg0: Handshake for peer 2 did not complete after 5 seconds, retrying (try 19)
> wg0: Sending handshake initiation to peer 2
> wg0: Handshake for peer 0 did not complete after 5 seconds, retrying (try 2)
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 7 did not complete after 5 seconds, retrying (try 2)
> wg0: Sending handshake initiation to peer 7
> wg0: Handshake for peer 5 did not complete after 5 seconds, retrying (try 2)
> wg0: Sending handshake initiation to peer 5
> wg0: Handshake for peer 4 did not complete after 5 seconds, retrying (try 15)
> wg0: Sending handshake initiation to peer 4
> wg0: Handshake for peer 1 did not complete after 5 seconds, retrying (try 2)
> wg0: Sending handshake initiation to peer 1

You can see the peers don't have pre-configured addresses, as they are usually
phones and not connected. But with PersistentKeepalive it looks like WireGuard
is trying to connect to them, despite having no idea where to find them.
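
A quick way to see which peers are stuck retrying is to tally the highest
"try N" value per peer. The sketch below is only an illustration in Python: it
assumes the debug lines look exactly like the ones quoted above, and the names
RETRY_RE, max_retries and SAMPLE are made up for this example.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   wg0: Handshake for peer 6 did not complete after 5 seconds, retrying (try 18)
# An optional leading "> " is tolerated so quoted mail text also parses.
RETRY_RE = re.compile(
    r"^(?:>\s*)?\w+: Handshake for peer (\d+) did not complete "
    r"after \d+ seconds, retrying \(try (\d+)\)"
)


def max_retries(lines):
    """Return {peer_id: highest retry count seen} from wg debug lines."""
    worst = defaultdict(int)
    for line in lines:
        m = RETRY_RE.match(line.strip())
        if m:
            peer, attempt = int(m.group(1)), int(m.group(2))
            worst[peer] = max(worst[peer], attempt)
    return dict(worst)


# A few lines copied from the dmesg excerpt above.
SAMPLE = """\
wg0: Handshake for peer 6 did not complete after 5 seconds, retrying (try 18)
wg0: Sending handshake initiation to peer 6
wg0: Handshake for peer 2 did not complete after 5 seconds, retrying (try 18)
wg0: Handshake for peer 4 did not complete after 5 seconds, retrying (try 14)
wg0: Handshake for peer 6 did not complete after 5 seconds, retrying (try 19)
wg0: Handshake for peer 3 did not complete after 5 seconds, retrying (try 2)
""".splitlines()

if __name__ == "__main__":
    for peer, tries in sorted(max_retries(SAMPLE).items()):
        print(f"peer {peer}: up to try {tries}")
```

Lines that are not retry messages, such as "Sending handshake initiation", are
simply skipped.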

I commented out the PersistentKeepalive lines and the number of mbufs stays low,
as it should. The VPN still works fine. Supposedly PersistentKeepalive prevents
a NAT from dropping your connection after 30 seconds without traffic, which
I've never seen happen, but I figured better safe than sorry.

With PersistentKeepalive disabled on the server (enabled on the client), if I
connect to the server and then disconnect, the server begins trying to
handshake with the missing peer again, but this time it _doesn't_ raise the
mbufs.

> wg0: Receiving handshake initiation from peer 0
> wg0: Sending handshake response to peer 0
> wg0: Receiving keepalive packet from peer 0
> wg0: Sending keepalive packet to peer 0
> wg0: Receiving keepalive packet from peer 0
> wg0: Receiving keepalive packet from peer 0
> wg0: Receiving keepalive packet from peer 0
> wg0: Receiving keepalive packet from peer 0
> wg0: Retrying handshake with peer 0 because we stopped hearing back after 15 seconds
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 0 did not complete after 5 seconds, retrying (try 2)
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 0 did not complete after 5 seconds, retrying (try 3)
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 0 did not complete after 5 seconds, retrying (try 4)
> wg0: Sending handshake initiation to peer 0
> wg0: Retrying handshake with peer 0 because we stopped hearing back after 15 seconds
> wg0: Handshake for peer 0 did not complete after 5 seconds, retrying (try 2)
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 0 did not complete after 5 seconds, retrying (try 3)
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 0 did not complete after 5 seconds, retrying (try 4)
> wg0: Sending handshake initiation to peer 0
> wg0: Handshake for peer 0 did not complete after 5

Re: WireGuard host crashes roughly every week

2021-07-31 Thread Matt P.
Hi Todd!

You're right, the number of mbufs on the machine in question is steadily 
climbing.

This is a few minutes after a reboot, with an RC script starting wireguard 
automatically:

> 27836 mbufs in use:
> 27827 mbufs allocated to data
> 3 mbufs allocated to packet headers
> 6 mbufs allocated to socket names and addresses
> 0/16 mbuf 2048 byte clusters in use (current/peak)
> 20/75 mbuf 2112 byte clusters in use (current/peak)
> 0/8 mbuf 4096 byte clusters in use (current/peak)
> 0/0 mbuf 8192 byte clusters in use (current/peak)
> 0/0 mbuf 9216 byte clusters in use (current/peak)
> 0/0 mbuf 12288 byte clusters in use (current/peak)
> 0/0 mbuf 16384 byte clusters in use (current/peak)
> 0/0 mbuf 65536 byte clusters in use (current/peak)
> 7192/7192/524288 Kbytes allocated to network (current/peak/max)
> 0 requests for memory denied
> 0 requests for memory delayed
> 0 calls to protocol drain routines

And then, just a second or two later:

> 27874 mbufs in use:
> 27863 mbufs allocated to data
> 5 mbufs allocated to packet headers
> 6 mbufs allocated to socket names and addresses
> 0/16 mbuf 2048 byte clusters in use (current/peak)
> 20/75 mbuf 2112 byte clusters in use (current/peak)
> 0/8 mbuf 4096 byte clusters in use (current/peak)
> 0/0 mbuf 8192 byte clusters in use (current/peak)
> 0/0 mbuf 9216 byte clusters in use (current/peak)
> 0/0 mbuf 12288 byte clusters in use (current/peak)
> 0/0 mbuf 16384 byte clusters in use (current/peak)
> 0/0 mbuf 65536 byte clusters in use (current/peak)
> 7204/7204/524288 Kbytes allocated to network (current/peak/max)
> 0 requests for memory denied
> 0 requests for memory delayed
> 0 calls to protocol drain routines

From the nearly identical Pi (sans wireguard):

> 72 mbufs in use:
> 42 mbufs allocated to data
> 1 mbuf allocated to packet headers
> 29 mbufs allocated to socket names and addresses
> 12/64 mbuf 2048 byte clusters in use (current/peak)
> 0/0 mbuf 2112 byte clusters in use (current/peak)
> 0/8 mbuf 4096 byte clusters in use (current/peak)
> 0/0 mbuf 8192 byte clusters in use (current/peak)
> 0/0 mbuf 9216 byte clusters in use (current/peak)
> 0/0 mbuf 12288 byte clusters in use (current/peak)
> 0/0 mbuf 16384 byte clusters in use (current/peak)
> 0/0 mbuf 65536 byte clusters in use (current/peak)
> 216/216/131072 Kbytes allocated to network (current/peak/max)
> 0 requests for memory denied
> 0 requests for memory delayed
> 0 calls to protocol drain routines
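
Comparing these snapshots by eye gets tedious, so here is a small sketch in
Python that pulls the "N mbufs in use:" total out of "netstat -m" text and
reports the change between consecutive samples. It assumes the output format
shown above. The names IN_USE_RE, mbufs_in_use and growth are made up for this
example, and how the text is captured (for instance by saving the command's
output to files) is left to the reader.

```python
import re

# Matches the summary line, e.g. "27836 mbufs in use:". An optional leading
# "> " is tolerated so quoted mail text also parses.
IN_USE_RE = re.compile(r"^\s*(?:>\s*)?(\d+) mbufs? in use:", re.MULTILINE)


def mbufs_in_use(netstat_output):
    """Extract the 'N mbufs in use:' total from `netstat -m` text."""
    m = IN_USE_RE.search(netstat_output)
    if m is None:
        raise ValueError("no 'mbufs in use' line found")
    return int(m.group(1))


def growth(samples):
    """Return the change in mbuf count between consecutive samples."""
    counts = [mbufs_in_use(s) for s in samples]
    return [b - a for a, b in zip(counts, counts[1:])]


if __name__ == "__main__":
    # First two lines of the two snapshots taken on the WireGuard box.
    first = "27836 mbufs in use:\n27827 mbufs allocated to data\n"
    second = "27874 mbufs in use:\n27863 mbufs allocated to data\n"
    print(growth([first, second]))  # prints [38]
```

For the two snapshots from the WireGuard box the difference is 38 mbufs. A
count that keeps growing like this across many samples is the pattern to watch
for.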


I tried disabling the wg startup. When I start the box I have very few mbufs
(around 50), like on the other machine. Once I start wireguard manually the
count begins climbing again, though it is nowhere near the "27836 mbufs in
use" I see when wireguard loads at boot.

When I stop wireguard (with wg-quick, destroying the interface), the number of 
mbufs stays where it is but stops climbing.

What should I do next?

--Matt

> On Jul 30, 2021, at 9:31 AM, Todd C. Miller wrote:
> On Thu, 29 Jul 2021 20:09:12 -0500, "Matt P." wrote:
> 
>> I have an OpenBSD box that breaks after a week or so of running. All network
>> traffic stops reaching the box. If I look at the screen or serial output, I
>> can get the "login:" prompt, and when I enter my name I get prompted for a
>> password, but once I enter a password it hangs. Key presses and control codes
>> still show on the screen, but the login never succeeds or fails. I thought
>> control-C might cause it to go back to the login prompt, but it doesn't. I
>> have to hard reboot the box to get it back.
> 
> This may be due to a memory leak.  You could monitor the output of
> "netstat -m" and also "vmstat -m" and watch for memory use increasing
> over time.  The number of mbufs in use reported by "netstat -m"
> should be relatively stable.
> 
> - todd



WireGuard host crashes roughly every week

2021-07-29 Thread Matt P.
Hi all.

I have an OpenBSD box that breaks after a week or so of running. All network 
traffic stops reaching the box. If I look at the screen or serial output, I can 
get the "login:" prompt, and when I enter my name I get prompted for a 
password, but once I enter a password it hangs. Key presses and control codes 
still show on the screen, but the login never succeeds or fails. I thought 
control-C might cause it to go back to the login prompt, but it doesn't. I have 
to hard reboot the box to get it back.

This box runs a WireGuard server accessible from the internet, and I think
WireGuard is related to the crashing. I used to run the same WireGuard
configuration on a different OpenBSD machine (a Raspberry Pi instead of x64),
and the same crashing would happen. I blamed the crashing on the Pi port of
OpenBSD, which is why I switched machines, but after the switch the crashing
stopped on the Pi and started on the x64 box.

I'm a newbie at systems administration, and don't know where to go from here.
There are no kernel panics to send, and I didn't see anything in the log files
about the crash. What should I do?

--Matt