Re: [PATCH] net: sk == 0xffffffff fix - not for commit
Actually found what looks to be a fix for this in another thread.

http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg574770.html
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854

Cheers,
Andy
--
To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On Fri, Jan 24, 2014 at 07:38:31AM -0600, Andrew Ruder wrote:
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg574770.html
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854

Just a little further confirmation. This appears in __inet_lookup_established as the last four instructions before returning.

+440: bl   __rcu_read_unlock
+444: sub  sp, r11, #40     ; 0x28
+448: ldr  r0, [r11, #-48]  ; 0x30
+452: ldm  sp, {r4, r5, r6, r7, r8, r9, r10, r11, sp, pc}
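One way to read these four instructions, offered as a sketch rather than a confirmed diagnosis: `sub sp, r11, #40` moves the stack pointer to r11-40 before `ldr r0, [r11, #-48]` reads the return value from r11-48, which by then lies 8 bytes below sp. Anything that pushes onto the same stack between the two instructions (an interrupt, for instance) may overwrite that slot. The toy model below illustrates the window. Everything in it (`struct toy_cpu`, `toy_interrupt`, the frame size, the 0xffffffff fill pattern) is invented for illustration and is not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 64

/* Hypothetical toy machine: a descending stack, indexed in 32-bit words. */
struct toy_cpu {
    uint32_t stack[STACK_WORDS];
    int sp;   /* index of the lowest live word; words below it are free */
    int fp;   /* frame pointer index */
};

/* Offsets from fp in words: #-48 bytes = 12 words, #-40 bytes = 10 words. */
enum { SK_SLOT = 12, SAVED_REGS = 10 };

/* Model an interrupt: the handler pushes n words just below sp.
 * The 0xffffffff fill value is arbitrary, not what a real handler writes. */
static void toy_interrupt(struct toy_cpu *cpu, int n)
{
    assert(cpu->sp - n >= 0);
    for (int i = 1; i <= n; i++)
        cpu->stack[cpu->sp - i] = 0xffffffffu;
}

/* Set up a frame whose live area still covers the slot holding sk. */
static void toy_enter(struct toy_cpu *cpu, uint32_t sk)
{
    memset(cpu->stack, 0, sizeof cpu->stack);
    cpu->fp = 40;
    cpu->sp = cpu->fp - 16;              /* assumed frame size */
    cpu->stack[cpu->fp - SK_SLOT] = sk;
}

/* Order seen in the 4.7 output: load the return value, then move sp. */
uint32_t epilogue_load_first(uint32_t sk, int irq_between)
{
    struct toy_cpu cpu;
    toy_enter(&cpu, sk);
    uint32_t r0 = cpu.stack[cpu.fp - SK_SLOT];   /* ldr r0, [fp, #-48] */
    if (irq_between)
        toy_interrupt(&cpu, 4);
    cpu.sp = cpu.fp - SAVED_REGS;                /* sub sp, fp, #40 */
    return r0;
}

/* Order seen in the 4.8 output: move sp, then load from below it. */
uint32_t epilogue_sp_first(uint32_t sk, int irq_between)
{
    struct toy_cpu cpu;
    toy_enter(&cpu, sk);
    cpu.sp = cpu.fp - SAVED_REGS;                /* sub sp, fp, #40 */
    if (irq_between)
        toy_interrupt(&cpu, 4);                  /* overwrites the sk slot */
    return cpu.stack[cpu.fp - SK_SLOT];          /* ldr r0, [fp, #-48] */
}
```

Under these assumptions, epilogue_sp_first(sk, 1) returns the fill pattern instead of sk, while epilogue_load_first(sk, 1) and both functions without the simulated interrupt return sk unchanged.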
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On 16.01.2014 17:29, Eric Dumazet wrote:
> On Thu, 2014-01-16 at 16:21 +0100, Andrzej Pietrasiewicz wrote:
> > I used gcc-linaro-arm-linux-gnueabihf-4.8-2013.05.
> >
> > If I change the toolchain to
> > gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415
> > the problem seems to have gone away.
>
> It's totally possible some barrier was not properly handled by the
> compiler.
>
> You could disassemble the function on both toolchains and try to spot
> the issue.

So I gave it a try.

Below is the part of the ARM assembly code which corresponds to the last lines of __inet_lookup_established():

C source:
=========

found:
        rcu_read_unlock();
        return sk;
}

assembly for toolchain 4.7:
===========================

c0333bb8:  ebf4bb6e  bl   c0062978 __rcu_read_unlock
c0333bbc:  e51b0030  ldr  r0, [fp, #-48]  ; 0x30
c0333bc0:  e24bd028  sub  sp, fp, #40     ; 0x28
c0333bc4:  e89daff0  ldm  sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
c0333bc8:  e5132018  ldr  r2, [r3, #-24]

assembly for toolchain 4.8:
===========================

c033ff5c:  ebf4927e  bl   c006495c __rcu_read_unlock
c033ff60:  e24bd028  sub  sp, fp, #40     ; 0x28
c033ff64:  e51b0030  ldr  r0, [fp, #-48]  ; 0x30
c033ff68:  e89daff0  ldm  sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
c033ff6c:  e5113018  ldr  r3, [r1, #-24]

What can be seen is that the usage of registers is slightly different, and, what is more important, the _order_ of ldr/sub is different.

Now, if I swap the instructions at addresses c033ff60 and c033ff64 in the 4.8-generated vmlinux, the problem seems gone! Well, at least the binary behaves the same way as the 4.7-generated one.

Here is a _hypothesis_ of what _might_ be happening:

The function in question puts its return value in register r0. In both cases the return value is fetched from a memory location at offset #-48 relative to the frame pointer. However, in the 4.7-generated binary the ldr executes in the branch delay slot, whereas in the 4.8-generated binary it is the sub which executes in the branch delay slot. That way, in the 4.7-generated binary the return value is fetched before __rcu_read_unlock begins, but in the 4.8-generated binary it is fetched some time later. That might be enough for someone else to schedule in and corrupt the data to be copied to r0 and returned from the function.

As I said, this is just a hypothesis.

AP
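Independently of the hypothesis above, one property of the two listings can be checked directly from the instruction words: the load reads from fp-48 while the sub sets sp to fp-40, so in the 4.8 ordering the load targets an address 8 bytes below the already-adjusted stack pointer. A minimal decoder sketch for just these two encodings follows. The function names are hypothetical, and it handles only the immediate-offset LDR and unrotated-immediate SUB forms that appear in the listings.

```c
#include <assert.h>
#include <stdint.h>

/* Base register Rn lives in bits 19..16 of both encodings. */
unsigned arm_rn(uint32_t insn) { return (insn >> 16) & 0xf; }

/* Destination register Rd lives in bits 15..12. */
unsigned arm_rd(uint32_t insn) { return (insn >> 12) & 0xf; }

/* LDR Rd, [Rn, #imm]: imm12 in bits 11..0, U bit (23) set means add,
 * clear means subtract. Returns the signed byte offset from Rn. */
int32_t arm_ldr_imm_offset(uint32_t insn)
{
    int32_t imm = (int32_t)(insn & 0xfff);
    return (insn & (1u << 23)) ? imm : -imm;
}

/* SUB Rd, Rn, #imm8 with a zero rotate field (bits 11..8).
 * Returns the byte offset of the new Rd relative to Rn. */
int32_t arm_sub_imm_offset(uint32_t insn)
{
    assert(((insn >> 8) & 0xf) == 0);   /* rotated immediates not handled */
    return -(int32_t)(insn & 0xff);
}

/* Distance in bytes by which the load address lies below the new sp,
 * assuming both instructions use the same base register. */
int32_t bytes_below_sp(uint32_t ldr_insn, uint32_t sub_insn)
{
    assert(arm_rn(ldr_insn) == arm_rn(sub_insn));
    return arm_sub_imm_offset(sub_insn) - arm_ldr_imm_offset(ldr_insn);
}
```

For the words in the listings, bytes_below_sp(0xe51b0030, 0xe24bd028) evaluates to 8. Whether that gap is what corrupts sk is a separate question; the sketch only confirms the arithmetic.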
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On 17.01.2014 13:18, Andrzej Pietrasiewicz wrote:
> Here is a _hypothesis_ of what _might_ be happening:
> [...]
> However, in the 4.7-generated binary the ldr executes in the branch
> delay slot, whereas in the 4.8-generated binary it is the sub which
> executes in the branch delay slot.
> [...]
> As I said, this is just a hypothesis.

Please disregard what I have written. There is no delay slot on ARM :O

A nice hypothesis, though ;)

AP
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On Thu, 2014-01-16 at 16:21 +0100, Andrzej Pietrasiewicz wrote:
> On 10.12.2013 15:25, Eric Dumazet wrote:
> > I believe you need additional debugging to track the exact moment
> > 0xffffffff is fed to 'sk'
> >
> > It looks like a very strange bug, involving a problem in some
> > assembly helper, register save/restore, compiler bug or stack
> > corruption or something.
>
> I started with adding WARN_ON(sk == 0xffffffff); just before return
> in __inet_lookup_established(), and the problem was gone. So this
> looks very strange, like a toolchain problem.

Or a timing issue. Adding a WARN_ON() adds extra instructions and might really change the assembly output.

> I used gcc-linaro-arm-linux-gnueabihf-4.8-2013.05.
>
> If I change the toolchain to
> gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415
> the problem seems to have gone away.

It's totally possible some barrier was not properly handled by the compiler.

You could disassemble the function on both toolchains and try to spot the issue.
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On Mon, Dec 09, 2013 at 12:47:52PM +0100, Andrzej Pietrasiewicz wrote:
> With g_ether loaded the sk occasionally becomes 0xffffffff.
> It happens usually after transferring a few hundred kilobytes to
> a few tens of megabytes. If sk is 0xffffffff then dereferencing it
> causes kernel panic.

Don't know if this is relevant, but I had this very similar stack trace come up a few days ago (below). I am working on a PXA 270/xscale with gcc version 4.8.2 (Buildroot 2013.11-rc1-00028-gf388663).

Going to try to see if I can reproduce it a little more readily before I start trying to narrow down what is causing it.

===
Unable to handle kernel NULL pointer dereference at virtual address 0011
pgd = d18e
[0011] *pgd=a6d03831, *pte=, *ppte=
Internal error: Oops: 17 [#1] PREEMPT ARM
Modules linked in: zeusvirt(O) zeus16550(O) 8390p ipv6
CPU: 0 PID: 2365 Comm: sshd Tainted: G O 3.12.0+ #201
task: d7216f00 ti: d7144000 task.ti: d7144000
PC is at tcp_v4_early_demux+0xe8/0x154
LR is at __inet_lookup_established+0x1bc/0x2e0
pc : [c0341cfc]    lr : [c0329bd8]    psr: a013
sp : d7145b20  ip : d7145ae8  fp : d7145b44
r10: c0576c28  r9 : 0008  r8 : d7998800
r7 : d7063800  r6 : c6cf2480  r5 :  r4 : c6cf2480
r3 : c02ec018  r2 : d7145ad0  r1 : d7b66a28  r0 :
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 397f  Table: b18e  DAC: 0015
Process sshd (pid: 2365, stack limit = 0xd71441c8)
Stack: (0xd7145b20 to 0xd7146000)
5b20: 17bf3f0a 0016 0003 c0026d90 d71f4634 d71f4600 d7145b6c d7145b48
5b40: c03211b4 c0341c20 05ea d7bb0538 d7063800 0034 d71f4600 c6cf2480
5b60: d7145b9c d7145b70 c03218dc c0321158 1001 c0576c1c
5b80: c0577e84 c0576c14 d7145be4 d7145ba0 c02fae04 c03215d4
5ba0: c0590330 c057fc08 d7145bfc c6cf2480 c02571a0 c0576c28 07e1 c05a3dc0
5bc0: 0001 c05a3d60 c05a3d74 c05a3d60 c05a3d68 d7145bfc d7145be8
5be0: c02fb990 c02fa8f0 c05a3dc0 d7145c24 d7145c00 c02fc46c c02fb968
5c00: c02fc3dc c05a3dc0 c05a3d60 0001 012c 0040 d7145c64 d7145c28
5c20: c02fbcd0 c02fc3e8 d78af3c0 d7145c5c 8d99 0001
5c40: c05a81f0 0003 0100 3fa57e1c d7144028 c05a81ec d7145cb4 d7145c68
5c60: c0026a44 c02fbc10 d7145c8c d7145c78 c00538dc c0056ce4 8d98
5c80: 00400100 000a c0228594 6093 c0590330 d7145d54 0001
5ca0: d7bb0480 05b4 d7145ccc d7145cb8 c0026ca4 c00268f4 d7144010
5cc0: d7145ce4 d7145cd0 c0026f58 c0026c58 00ab 001a d7145d04 d7145ce8
5ce0: c000f7d0 c0026ed0 0014 d7145d20 a013 d7145d1c d7145d08
5d00: c00085bc c000f768 c02f0048 c00ca7d8 d7145d7c d7145d20 c03a7dc0 c0008590
5d20: 000118ed c05a474c c05d41cc d7bb0180 d18ed800 d7801080 06a3
5d40: 0001 d7bb0480 05b4 d7145d7c d7145d80 d7145d68 c02f0048 c00ca7d8
5d60: a013 c05a4738 d7bb0180 d7145dac d7145d80 c02f0048 c00ca7b0
5d80: 0001 00c63fc0 d7b66a00 d7b66a00 4040 05b4 d7b66a00
5da0: d7145dcc d7145db0 c032e340 c02effd0 d7145e98 4040 0008c414
5dc0: d7145e54 d7145dd0 c032f368 c032e310 d7145e24 c02ea81c c03a6040 c03a9c6c
5de0: d7145ee8 05b4 d7b66adc
5e00: d7144000 1854 05b4 27ec 0040 d7116d80 05b4
5e20: d7145e6c d7b66a00 d7145ee8 d7145e98 4040 4040
5e40: 4040 0002 d7145e74 d7145e58 c03526c8 c032eb0c d7145e78 d7116d80
5e60: d7145ee0 d7116d80 d7145ed4 d7145e78 c02e63a4 c0352688 c05a3dc0 d7142000
5e80: 0040 4040 d76701c0 d7145ee0 d7145e98
5ea0: d7145ee0 0001 0040 d7145ee8 c6cf2900
5ec0: d7145f78 d7145f44 d7145ed8 c00d1c64 c02e62e4
5ee0: 00089c28 4040 d7116d80 d7145e78 d7216f00
5f00: 4040
5f20: 00089c28 d7116d80 00089c28 d7145f78 4040 00089c28 d7145f74 d7145f48
5f40: c00d23a0 c00d1bf4 d7116d80
5f60: 00089c28 4040 d7145fa4 d7145f78 c00d2948 c00d22c0
5f80: beed167c 0003 000614dc 0004 c000ea28 d7144000 d7145fa8
5fa0: c000e7e0 c00d2908 beed167c 0003 0003 00089c28 4040 beed167c
5fc0: beed167c 0003 000614dc 0004 00089c28 00060a88 093e beed17a0
5fe0: beed167c beed1648 00029910 b6dc821c 6010 0003
[c0341cfc] (tcp_v4_early_demux+0xe8/0x154) from [c03211b4] (ip_rcv_finish+0x68/0x2c0)
[c03211b4] (ip_rcv_finish+0x68/0x2c0) from [c03218dc] (ip_rcv+0x314/0x398)
[c03218dc] (ip_rcv+0x314/0x398) from [c02fae04] (__netif_receive_skb_core+0x520/0x5d8)
[c02fae04] (__netif_receive_skb_core+0x520/0x5d8) from [c02fb990] (__netif_receive_skb+0x34/0x88)
[c02fb990] (__netif_receive_skb+0x34/0x88) from [c02fc46c] (process_backlog+0x90/0x148)
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On Tue, 2013-12-10 at 07:55 +0100, Andrzej Pietrasiewicz wrote:
> On 09.12.2013 16:31, Eric Dumazet wrote:
> > Is it happening on SMP or UP ?
>
> UP build, S5PC110

OK

I believe you need additional debugging to track the exact moment 0xffffffff is fed to 'sk'

It looks like a very strange bug, involving a problem in some assembly helper, register save/restore, compiler bug or stack corruption or something.

You should not have more than 150 instructions to decode, including __inet_lookup_established()

Since __inet_lookup_established() dereferences the socket pointer, I do not see why it would crash ~20 instructions _later_
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On Mon, 2013-12-09 at 12:47 +0100, Andrzej Pietrasiewicz wrote:
> NOT FOR COMMITTING TO MAINLINE.
>
> With g_ether loaded the sk occasionally becomes 0xffffffff.
> It happens usually after transferring a few hundred kilobytes to
> a few tens of megabytes. If sk is 0xffffffff then dereferencing it
> causes kernel panic.
>
> This is a *workaround*. I don't know enough net code to understand
> the core of the problem. However, with this patch applied the
> problems are gone, or at least pushed farther away.

Is it happening on SMP or UP ?

Crash should happen earlier in __inet_lookup_established()
Re: [PATCH] net: sk == 0xffffffff fix - not for commit
On 09.12.2013 16:31, Eric Dumazet wrote:
> On Mon, 2013-12-09 at 12:47 +0100, Andrzej Pietrasiewicz wrote:
> > With g_ether loaded the sk occasionally becomes 0xffffffff.
> > [...]
>
> Is it happening on SMP or UP ?

UP build, S5PC110

> Crash should happen earlier in __inet_lookup_established()

AP