Re: Instability during large transfers
On 02/16/17 12:38, Samuel Holland wrote:
> I will enable netconsole on the firewall now that I can hopefully reproduce the panic, but since the networking setup there is rather complicated (vlans on top of bridges) I'm not sure if I will get all of the panic messages.

Success! I reproduced the panic and got full logs. I ran:

    $ ssh cat /dev/zero | dd of=/dev/null bs=4096 status=progress
    108295966720 bytes (108 GB, 101 GiB) copied, 17437 s, 6.2 MB/s
    packet_write_wait: Connection to port 22: Broken pipe
    26440386+4 records in
    26440386+4 records out
    108299829248 bytes (108 GB, 101 GiB) copied, 18413 s, 5.9 MB/s

Attached is the extended netconsole output from the firewall.

Hope this helps,
Samuel

wireguardpanic.gz
Description: application/gzip

___
WireGuard mailing list
WireGuard@lists.zx2c4.com
https://lists.zx2c4.com/mailman/listinfo/wireguard
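The two dd summary lines quoted above can be cross-checked with a little arithmetic. The following is only a sketch: it assumes the first line is the last `status=progress` update printed before the SSH connection broke, and the helper name `dd_summary` is made up for illustration.

```python
import re

# Matches a GNU dd summary line like the ones quoted in the message above.
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([\d.]+) s")


def dd_summary(line):
    """Return (bytes, seconds, recomputed rate in MB/s) for one dd summary line."""
    m = DD_SUMMARY.search(line)
    if m is None:
        raise ValueError("not a dd summary line: %r" % line)
    nbytes, seconds = int(m.group(1)), float(m.group(2))
    return nbytes, seconds, nbytes / seconds / 1e6


# Both lines are copied verbatim from the report.
progress = "108295966720 bytes (108 GB, 101 GiB) copied, 17437 s, 6.2 MB/s"
final = "108299829248 bytes (108 GB, 101 GiB) copied, 18413 s, 5.9 MB/s"

p_bytes, p_secs, p_rate = dd_summary(progress)
f_bytes, f_secs, f_rate = dd_summary(final)

# Data that arrived between the last progress update and dd exiting.
stall_bytes = f_bytes - p_bytes
stall_secs = f_secs - p_secs

print("rate at last progress update: %.1f MB/s" % p_rate)
print("overall rate at exit:         %.1f MB/s" % f_rate)
print("last %d s moved only %d bytes" % (stall_secs, stall_bytes))
```

Under that assumption, the numbers suggest the transfer had effectively stalled for roughly a quarter of an hour before SSH reported the broken pipe, which is consistent with the firewall going down rather than the link merely slowing.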
Re: Instability during large transfers
Hey Samuel,

Thanks very much for the excellent debugging output. I'll try to reproduce this as well on my systems.

The stack trace does indicate that the OOPS is happening in padata, not in wireguard, so I wonder if this is some bug caused either by grsecurity or by something else that was then fixed, but since your kernel is a bit old (4.7.10) maybe the fix didn't make it. In either case, I'll try to reproduce on that kernel and on newer kernels and will get back to you.

I presume you have most PaX options turned on?

Thanks,
Jason
Re: Instability during large transfers
Hello,

On 02/17/17 07:36, Jason A. Donenfeld wrote:
> The stack trace does indicate that the OOPS is happening in padata, not in wireguard, so I wonder if this is some bug caused either by grsecurity or by something else that was then fixed, but since your kernel is a bit old (4.7.10) maybe the fix didn't make it. In either case, I'll try to reproduce on that kernel and on newer kernels and will get back to you.

There do not appear to be any relevant changes to padata in the past few years, and grsecurity doesn't look like it affects padata much, but that doesn't rule it out:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=padata
https://grsecurity.net/changelog-test.txt

> I presume you have most PaX options turned on?

Attached is my config.gz (it's the same on all machines).

> Thanks,
> Jason

Thanks,
Samuel

config-4.7.10-hardened.gz
Description: application/gzip
Re: Instability during large transfers
On 02/17/17 07:36, Jason A. Donenfeld wrote:
> Thanks very much for the excellent debugging output. I'll try to reproduce this as well on my systems.

I assume you have not been able to reproduce this issue.

> The stack trace does indicate that the OOPS is happening in padata, not in wireguard, so I wonder if this is some bug caused either by grsecurity or by something else that was then fixed, but since your kernel is a bit old (4.7.10) maybe the fix didn't make it. In either case, I'll try to reproduce on that kernel and on newer kernels and will get back to you.
>
> I presume you have most PaX options turned on?

Since this is on 4.7.10 (that is pre-4.9), this is not related to the other bug recently reported.

I have disabled all grsecurity/PaX options in my kernel config (attached) and was able to trigger the bug again. This is with WireGuard commit f97b7e34bda436ac4572697a8770837eec7470b6 and debugging enabled. Again attached is the dmesg.

I used the same SSH cat /dev/zero | dd of=/dev/null as before. This time I got "192656101376 bytes (193 GB, 179 GiB) copied, 41643 s, 4.6 MB/s" before the connection was broken. Interestingly, when the firewall came back up, I again had the issue where devices were continuing to handshake, but no data went through (and I could confirm this with the wireguard debug output in dmesg).

I was unable to reproduce this issue with a spare laptop (ThinkPad X220), even after leaving it running for about three days. Since the router has a rather weak Atom CPU (http://ark.intel.com/products/78866), I suspect maybe a race condition due to the high load might be involved?

Is there anything else I can do to debug this? Enable some kernel debugging option? Try a vanilla kernel? Try a newer kernel?

> Thanks,
> Jason

Thanks,
Samuel Holland

panic_grsec_disabled.config.gz
Description: application/gzip
panic_grsec_disabled.dmesg.gz
Description: application/gzip
Re: Instability during large transfers
Hello,

On 03/01/17 16:44, Samuel Holland wrote:
> Is there anything else I can do to debug this? Enable some kernel debugging option? Try a vanilla kernel? Try a newer kernel?

I've upgraded to vanilla (kernel.org) Linux 4.9.15, and while I've been unable to trigger the same panic so far, I'm getting a similar linked list corruption warning. This time, instead of failing to pass data once the warning hits, some packets are coming in out of order with 3000-5000 ms latency. Attached are the dmesg output, kernel config, and ping logs for inside the VPN. The actual link latency between peers is around 10 ms.

As I still haven't been able to reproduce this issue on another machine, I cannot rule out hardware issues, but this exact machine worked fine for a year straight (2x 6 months uptime) using OpenVPN. So I'm looking at either WireGuard or padata as the cause. The kernel has linked list debugging turned on.

I upgraded again to 4.9.16, and enabled every debug and self-test option that I could find that seemed relevant. The stack traces and symptoms still look similar, except that now when it panics (instead of just warning), I don't get a log; the netconsole just suddenly cuts over to the next boot. You can also see how it stops sending keepalive packets once the warnings hit.

At this point, unless you want me to try something, I'm going to stop sending logs. I'm not very familiar with the kernel's timer/workqueue/crypto subsystems, so there's not much debugging I can do on my own until I have time to do research (and I have no idea when that will be). But if you can't reproduce the issue, then it will be rather difficult for you to debug as well.

I'd still like to get this fixed, and I'll gladly help any way I can.

Thanks,
Samuel Holland

warning_4.9.15.config.xz
Description: application/xz
warning_4.9.15.dmesg.xz
Description: application/xz
warning_4.9.15.dmesg_secondboot.xz
Description: application/xz
warning_4.9.15.vpn_ping.xz
Description: application/xz
warning_4.9.16.dmesg.xz
Description: application/xz
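The symptom described in this message (replies arriving out of order with multi-second latency) can be picked out of a ping log mechanically. The sketch below assumes the log uses the usual iputils `ping` reply format; the format of the attached log is not shown in the thread, and the sample lines, the address, and the 1000 ms threshold are made up for illustration.

```python
import re

# Assumed reply format: "... icmp_seq=N ttl=T time=X ms" (iputils ping).
PING_REPLY = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")


def analyze_ping(lines, slow_ms=1000.0):
    """Return (out_of_order_seqs, slow_replies) for ping reply lines."""
    out_of_order, slow = [], []
    highest_seq = -1
    for line in lines:
        m = PING_REPLY.search(line)
        if m is None:
            continue  # skip headers, statistics, and anything else
        seq, rtt = int(m.group(1)), float(m.group(2))
        if seq < highest_seq:
            # A reply with a lower sequence number than one already seen.
            out_of_order.append(seq)
        highest_seq = max(highest_seq, seq)
        if rtt >= slow_ms:
            slow.append((seq, rtt))
    return out_of_order, slow


# Hypothetical sample, NOT taken from the attached vpn_ping log.
sample = [
    "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=10.3 ms",
    "64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=9.8 ms",
    "64 bytes from 192.0.2.1: icmp_seq=5 ttl=64 time=11.0 ms",
    "64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=3120 ms",
    "64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=4075 ms",
]

out_of_order, slow = analyze_ping(sample)
print("out-of-order replies:", out_of_order)
print("slow replies:", slow)
```

Correlating the timestamps of the flagged replies with the first list-corruption warning in dmesg would show whether the reordering starts exactly when the warning hits.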
Re: Instability during large transfers
Hey Samuel,

Thanks for sticking with it... This is a super hard bug to track down! I wish I could reproduce it.

Can you try two odd things, separately, and report back?

1. Compile with DEBUG_ATOMIC_SLEEP. Do you get any additional warnings?
2. Hack kernel/sched/Makefile to remove "-fno-omit-frame-pointer" and see if the backtraces you get are a bit more coherent.

Feel free to find me on IRC.

Jason
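Before rebuilding, it can help to confirm which of the debugging options mentioned in this thread are actually set in a kernel config. This is a minimal sketch: `CONFIG_DEBUG_ATOMIC_SLEEP` corresponds to the option named above, `CONFIG_DEBUG_LIST` is assumed to be the "linked list debugging" option referred to earlier, and the sample config text is invented rather than taken from the attached configs.

```python
def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to 'y' in kernel .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled symbols appear as "# CONFIG_FOO is not set" and are skipped.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[: -len("=y")])
    return enabled


# Options discussed in the thread (the second mapping is an assumption).
WANTED = ["CONFIG_DEBUG_ATOMIC_SLEEP", "CONFIG_DEBUG_LIST"]

# Hypothetical config excerpt, NOT taken from the attached files.
sample = """\
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
"""

missing = [opt for opt in WANTED if opt not in enabled_options(sample)]
print("missing options:", missing)
```

To check a real build, the text of the `.config` file from the kernel tree could be passed in instead of the sample string.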