Re: Instability during large transfers

2017-02-16 Thread Samuel Holland

On 02/16/17 12:38, Samuel Holland wrote:

> I will enable netconsole on the firewall now that I can hopefully
> reproduce the panic, but since the networking setup there is rather
> complicated (vlans on top of bridges) I'm not sure if I will get all
> of the panic messages.


Success! I reproduced the panic and got full logs.

I ran:
$ ssh  cat /dev/zero | dd of=/dev/null bs=4096 status=progress
108295966720 bytes (108 GB, 101 GiB) copied, 17437 s, 6.2 MB/s
packet_write_wait: Connection to  port 22: Broken pipe

26440386+4 records in
26440386+4 records out
108299829248 bytes (108 GB, 101 GiB) copied, 18413 s, 5.9 MB/s

Attached is the extended netconsole output from the firewall.
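
For anyone trying to capture a similar log, here is a minimal sketch of how the netconsole module parameter is put together. The leading "+" requests the extended console format. Every port, address, interface name, and MAC below is a hypothetical placeholder, not the firewall's actual setup, and the helper name netconsole_param is made up for illustration. The script only prints the modprobe command rather than running it.

```shell
# Build a netconsole= parameter string in the kernel's documented form:
#   [+]src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# Arguments: src_port src_ip dev tgt_port tgt_ip tgt_mac
# (netconsole_param is a hypothetical helper, not part of any tool.)
netconsole_param() {
    printf '+%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values only (documentation address ranges), not from the thread.
param=$(netconsole_param 6665 192.0.2.1 eth0 6666 192.0.2.2 00:00:5e:00:53:01)

# Print the command instead of executing it, so nothing is loaded by accident.
echo "modprobe netconsole netconsole=$param"
```

The receiving machine then needs a UDP listener on the target port to save the messages.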

Hope this helps,
Samuel


wireguardpanic.gz
Description: application/gzip
___
WireGuard mailing list
WireGuard@lists.zx2c4.com
https://lists.zx2c4.com/mailman/listinfo/wireguard


Re: Instability during large transfers

2017-02-17 Thread Jason A. Donenfeld
Hey Samuel,

Thanks very much for the excellent debugging output. I'll try to
reproduce this as well on my systems.

The stack trace does indicate that the OOPS is happening in padata,
not in wireguard, so I wonder if this is some bug caused either by
grsecurity or by something else that was then fixed, but since your
kernel is a bit old (4.7.10) maybe the fix didn't make it. In either
case, I'll try to reproduce on that kernel and on newer kernels and
will get back to you.

I presume you have most PaX options turned on?

Thanks,
Jason


Re: Instability during large transfers

2017-02-17 Thread Samuel Holland

Hello,

On 02/17/17 07:36, Jason A. Donenfeld wrote:

> The stack trace does indicate that the OOPS is happening in padata,
> not in wireguard, so I wonder if this is some bug caused either by
> grsecurity or by something else that was then fixed, but since your
> kernel is a bit old (4.7.10) maybe the fix didn't make it. In either
> case, I'll try to reproduce on that kernel and on newer kernels and
> will get back to you.


There do not appear to be any relevant changes to padata in the past few
years, and grsecurity doesn't look like it affects padata much, but that
doesn't rule it out:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=padata
https://grsecurity.net/changelog-test.txt


> I presume you have most PaX options turned on?


Attached is my config.gz (it's the same on all machines).


> Thanks,
> Jason


Thanks,
Samuel



config-4.7.10-hardened.gz
Description: application/gzip


Re: Instability during large transfers

2017-03-01 Thread Samuel Holland

On 02/17/17 07:36, Jason A. Donenfeld wrote:

> Thanks very much for the excellent debugging output. I'll try to
> reproduce this as well on my systems.


I assume you have not been able to reproduce this issue.


> The stack trace does indicate that the OOPS is happening in padata,
> not in wireguard, so I wonder if this is some bug caused either by
> grsecurity or by something else that was then fixed, but since your
> kernel is a bit old (4.7.10) maybe the fix didn't make it. In either
> case, I'll try to reproduce on that kernel and on newer kernels and
> will get back to you.
>
> I presume you have most PaX options turned on?


Since this is on 4.7.10 (that is pre-4.9), this is not related to the
other bug recently reported.

I have disabled all grsecurity/PaX options in my kernel config
(attached) and was able to trigger the bug again. This is with WireGuard
commit f97b7e34bda436ac4572697a8770837eec7470b6 and debugging enabled.
Again attached is the dmesg.

I used the same ssh cat /dev/zero | dd of=/dev/null pipeline as before.
This time I got "192656101376 bytes (193 GB, 179 GiB) copied, 41643 s,
4.6 MB/s" before the connection was broken.
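
As a quick cross-check of the two runs, the average rates can be recomputed from the byte counts and elapsed times that dd printed. This is only a sketch using integer shell arithmetic; the helper name avg_kB_per_s is made up for illustration, and the figures are the ones quoted in this thread.

```shell
# Average rate in kB/s (1 kB = 1000 bytes, the unit dd uses for "MB/s"),
# rounded down. $1 = bytes copied, $2 = elapsed seconds.
# (avg_kB_per_s is a hypothetical helper name.)
avg_kB_per_s() {
    echo $(( $1 / $2 / 1000 ))
}

# Final dd totals from the first report and from this one.
echo "first run:  $(avg_kB_per_s 108299829248 18413) kB/s"
echo "second run: $(avg_kB_per_s 192656101376 41643) kB/s"
```

Both results agree with the rounded rates dd reported (5.9 MB/s and 4.6 MB/s), so the second run moved more data before failing but at a lower average rate.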

Interestingly, when the firewall came back up, I again had the issue
where devices were continuing to handshake, but no data went through
(and I could confirm this with the wireguard debug output in dmesg).

I was unable to reproduce this issue with a spare laptop (ThinkPad
X220), even after leaving it running for about three days. Since the
router has a rather weak Atom CPU (http://ark.intel.com/products/78866),
I suspect that a race condition triggered by the high load might be involved.

Is there anything else I can do to debug this? Enable some kernel
debugging option? Try a vanilla kernel? Try a newer kernel?


> Thanks,
> Jason


Thanks,
Samuel Holland


panic_grsec_disabled.config.gz
Description: application/gzip


panic_grsec_disabled.dmesg.gz
Description: application/gzip


Re: Instability during large transfers

2017-03-21 Thread Samuel Holland

Hello,

On 03/01/17 16:44, Samuel Holland wrote:

> Is there anything else I can do to debug this? Enable some kernel
> debugging option? Try a vanilla kernel? Try a newer kernel?


I've upgraded to vanilla (kernel.org) Linux 4.9.15, and while I've been
unable to trigger the same panic so far, I'm getting a similar linked
list corruption warning. This time, instead of failing to pass data once
the warning hits, some packets are coming in out of order with 3000-5000
ms latency. Attached are the dmesg output, kernel config, and ping logs
from inside the VPN. The actual link latency between peers is around 10 ms.

As I still haven't been able to reproduce this issue on another machine,
I cannot rule out hardware issues, but this exact machine worked fine
for a year straight (2x 6 months uptime) using OpenVPN. So I'm looking
at either WireGuard or padata as the cause.

The kernel has linked list debugging turned on.



I upgraded again to 4.9.16, and enabled every debug and self-test option
that I could find that seemed relevant. The stack traces and symptoms
still look similar, except that now, when it panics (instead of just
warning), I don't get a log; the netconsole output just suddenly cuts
over to the next boot.

You can also see how it stops sending keepalive packets once the
warnings hit.

At this point, unless you want me to try something, I'm going to stop
sending logs. I'm not very familiar with the kernel's
timer/workqueue/crypto subsystems, so there's not much debugging I can
do on my own until I have time to do research (and I have no idea when
that will be). But if you can't reproduce the issue, then it will be
rather difficult for you to debug as well.

I'd still like to get this fixed, and I'll gladly help any way I can.

Thanks,
Samuel Holland


warning_4.9.15.config.xz
Description: application/xz


warning_4.9.15.dmesg.xz
Description: application/xz


warning_4.9.15.dmesg_secondboot.xz
Description: application/xz


warning_4.9.15.vpn_ping.xz
Description: application/xz


warning_4.9.16.dmesg.xz
Description: application/xz


Re: Instability during large transfers

2017-03-21 Thread Jason A. Donenfeld
Hey Samuel,

Thanks for sticking with it... This is a super hard bug to track down!
I wish I could reproduce it.

Can you try two odd things, separately, and report back?

1. Compile with DEBUG_ATOMIC_SLEEP. Do you get any additional warnings?
2. Hack kernel/sched/Makefile to remove "-fno-omit-frame-pointer" and
see if the backtraces you get are a bit more coherent.
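
For the first item, a minimal sketch of checking whether a kernel config file already enables the relevant options. It assumes that "linked list debugging" earlier in the thread refers to CONFIG_DEBUG_LIST; the helper names and the sample file contents are hypothetical.

```shell
# Succeeds if the given kernel config file sets CONFIG_<option>=y.
# $1 = path to a config file, $2 = option name without the CONFIG_ prefix.
# (config_has and report_debug_options are hypothetical helper names.)
config_has() {
    grep -q "^CONFIG_$2=y\$" "$1"
}

# Report the state of the two debugging options for one config file.
report_debug_options() {
    for opt in DEBUG_ATOMIC_SLEEP DEBUG_LIST; do
        if config_has "$1" "$opt"; then
            echo "$opt: enabled"
        else
            echo "$opt: not enabled"
        fi
    done
}

# Demo against a throwaway sample file with made-up contents.
sample=$(mktemp)
printf '%s\n' 'CONFIG_DEBUG_LIST=y' '# CONFIG_DEBUG_ATOMIC_SLEEP is not set' > "$sample"
report_debug_options "$sample"
rm -f "$sample"
```

In a kernel source tree, a missing option can usually be switched on with scripts/config --enable DEBUG_ATOMIC_SLEEP followed by make olddefconfig, before rebuilding.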

Feel free to find me on IRC.

Jason