Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
>> Okay, so i was 0, so running UP probably isn't going to help. r7 is
>> also spec_priv->rfs_chan_spec_scan.
>>
>> So, I think the question is... how is this NULL - and has it always
>> been NULL...
>
> The problem appears to be that ath_cmn_process_fft() isn't called that
> often. When it is, it crashes in ath_cmn_is_fft_buf_full() because
> spec_priv->rfs_chan_spec_scan is NULL when ATH9K_DEBUGFS=n. :-(
>
> I'm running with ATH9K_DEBUGFS=y now. If it goes a couple of days
> without crashing, I'll gin up a patch.
>

A similar patch was applied to ath-next branch:
https://patchwork.kernel.org/patch/9431163/.

--
Miaoqing
___
ath9k-devel mailing list
ath9k-devel@lists.ath9k.org
https://lists.ath9k.org/mailman/listinfo/ath9k-devel
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
On Wed, Nov 23, 2016 at 07:15:39PM +, Jason Cooper wrote:
> --- oops from v4.8.6 #2 --
> [42059.303625] Unable to handle kernel NULL pointer dereference at virtual address 0020
> [42059.311799] pgd = c0004000
> [42059.314522] [0020] *pgd=
> [42059.318162] Internal error: Oops: 17 [#1] SMP ARM
> [42059.322889] Modules linked in: ath9k ath9k_common ath9k_hw ath
> [42059.328809] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.6 #37
> [42059.334755] Hardware name: Marvell Armada 370/XP (Device Tree)
> [42059.340613] task: c0b091c0 task.stack: c0b0
> [42059.345176] PC is at ath_cmn_process_fft+0xa0/0x578 [ath9k_common]
> [42059.351388] LR is at ath_cmn_process_fft+0xc4/0x578 [ath9k_common]
> [42059.357598] pc : []lr : []psr: 8153
> [42059.357598] sp : c0b01cd0 ip : fp :
> [42059.369127] r10: c0b034d4 r9 : 0069 r8 : 006c
> [42059.374374] r7 : r6 : dcfbd340 r5 : c0b03da0 r4 :
> [42059.380930] r3 : 0001 r2 : 0008 r1 : 0004 r0 :

Well, the good news is that it's reproducible.

It looks like it could be this:

static int
ath_cmn_is_fft_buf_full(struct ath_spec_scan_priv *spec_priv)
{
        for_each_online_cpu(i)
                ret += relay_buf_full(rc->buf[i]);

where i = 8 (r2) and rc->buf is r7. That's just a guess though, as
there's precious little to go on with the Code: line - modern GCCs
don't give us much with the Code: line anymore to figure out what's
going on without the exact object files.

 e5933000 ldr r3, [r3]
 e1d330b4 ldrh r3, [r3, #4]
 e58d3030 str r3, [sp, #48] ; 0x30
 ea02 b 1c
 e7970102 ldr r0, [r7, r2, lsl #2]

What makes me wonder though is that if i=8, that means you must have a
system with 9 online CPUs, which is probably unlikely - or maybe that's
the problem, for_each_online_cpu() is going wrong...

If it's not that line of code, I don't see what else it would be based
on the output of my compiler - there's only one case in my disassembly
that corresponds with the single code line that we have to go on, and
it's this:

 a44: e5983020 ldr r3, [r8, #32]
 a48: e793010a ldr r0, [r3, sl, lsl #2] <===
 a4c: ebfe bl 0
 a50: e0844000 add r4, r4, r0
 a54: e59f9434 ldr r9, [pc, #1076]
 a58: e28a2001 add r2, sl, #1
 a5c: e3a01004 mov r1, #4
 a60: e1a9 mov r0, r9
 a64: ebfe bl 0 <_find_next_bit_le>
 a68: e5953000 ldr r3, [r5]
 a6c: e153 cmp r0, r3
 a70: e1a0a000 mov sl, r0
 a74: baf2 blt a44

I'm debating now about whether we need to dump more of the code in the
oops - both before and after the faulting instruction...

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
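
For what it's worth, the faulting instruction quoted above, `ldr r0, [r7, r2, lsl #2]`, loads from `r7 + (r2 << 2)`. Here is a rough sketch of that arithmetic, not anything from the driver: the helper names `ldr_scaled_addr` and `fault_addr_from_oops` are made up for illustration, and r7 = 0 is an assumption, since that register's value isn't legible in the dump as archived. With r2 = 8 and a NULL base the load would land on 0x20, which appears to line up with the fault address reported in the oops.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of an ARM "ldr rX, [base, index, lsl #shift]" load:
 * the index register is shifted left and added to the base register.
 * Hypothetical helper, for illustration only.
 */
static uint32_t ldr_scaled_addr(uint32_t base, uint32_t index,
				unsigned int shift)
{
	assert(shift < 32);	/* shifting a 32-bit value by >= 32 is undefined */
	return base + (index << shift);
}

/*
 * r2 = 8 is taken from the register dump. r7 = 0 (NULL) is an
 * assumption, because r7 is not readable in the archived oops.
 */
static uint32_t fault_addr_from_oops(void)
{
	return ldr_scaled_addr(0x0u, 0x8u, 2);
}
```

Read this way, r2 doesn't have to be the loop counter itself; it could just as well be the counter plus a fixed word offset into whatever r7 points at.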
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
On Wed, Nov 23, 2016 at 08:59:17PM +, Jason Cooper wrote:
> As requested on irc:

Thanks.

> 7f0: ea02 b 800
> 7f4: e7970102 ldr r0, [r7, r2, lsl #2]
> 7f8: ebfe bl 0
> 7fc: e0844000 add r4, r4, r0
> 800: e300a000 movw sl, #0
> 804: e28b2001 add r2, fp, #1
> 808: e340a000 movt sl, #0
> 80c: e3a01004 mov r1, #4
> 810: e1aa mov r0, sl
> 814: ebfe bl 0 <_find_next_bit_le>
> 818: e5953000 ldr r3, [r5]
> 81c: e153 cmp r0, r3
> 820: e1a0b000 mov fp, r0
> 824: e2802008 add r2, r0, #8
> 828: baf1 blt 7f4

Okay, so i was 0, so running UP probably isn't going to help. r7 is
also spec_priv->rfs_chan_spec_scan.

So, I think the question is... how is this NULL - and has it always
been NULL...

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
On Thu, Nov 24, 2016 at 02:06:57PM +0800, miaoq...@codeaurora.org wrote:
>
> >>Okay, so i was 0, so running UP probably isn't going to help. r7 is
> >>also spec_priv->rfs_chan_spec_scan.
> >>
> >>So, I think the question is... how is this NULL - and has it always
> >>been NULL...
> >
> >The problem appears to be that ath_cmn_process_fft() isn't called that
> >often. When it is, it crashes in ath_cmn_is_fft_buf_full() because
> >spec_priv->rfs_chan_spec_scan is NULL when ATH9K_DEBUGFS=n. :-(
> >
> >I'm running with ATH9K_DEBUGFS=y now. If it goes a couple of days
> >without crashing, I'll gin up a patch.
> >
>
> A similar patch was applied to ath-next branch:
> https://patchwork.kernel.org/patch/9431163/.

Hmm. Ok, I'm giving it a spin on my board with SMP=y, ATH9K_DEBUGFS=n
(so the only change from known crashing is the patch) and we'll see how
it goes.

Honestly, though, I think the real problem is when kernels are built
without ATH9K_DEBUGFS. Did the reporter of the crash say if that was
enabled on his system or not?

I'm concerned that there may be other code lurking that secretly
depends on ATH9K_DEBUGFS being enabled.

thx,

Jason.
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
All,

On Wed, Nov 23, 2016 at 09:40:53PM +, Jason Cooper wrote:
> I'm running with ATH9K_DEBUGFS=y now. If it goes a couple of days
> without crashing, I'll gin up a patch.

Well, it survived overnight, which it's never done before. :-)

I'm testing the relay_open() NULL patch now.

thx,

Jason.
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
On Wed, Nov 23, 2016 at 09:17:45PM +, Russell King - ARM Linux wrote:
> On Wed, Nov 23, 2016 at 08:59:17PM +, Jason Cooper wrote:
> > As requested on irc:
>
> Thanks.
>
> > 7f0: ea02 b 800
> > 7f4: e7970102 ldr r0, [r7, r2, lsl #2]
> > 7f8: ebfe bl 0
> > 7fc: e0844000 add r4, r4, r0
> > 800: e300a000 movw sl, #0
> > 804: e28b2001 add r2, fp, #1
> > 808: e340a000 movt sl, #0
> > 80c: e3a01004 mov r1, #4
> > 810: e1aa mov r0, sl
> > 814: ebfe bl 0 <_find_next_bit_le>
> > 818: e5953000 ldr r3, [r5]
> > 81c: e153 cmp r0, r3
> > 820: e1a0b000 mov fp, r0
> > 824: e2802008 add r2, r0, #8
> > 828: baf1 blt 7f4
>
> Okay, so i was 0, so running UP probably isn't going to help. r7 is
> also spec_priv->rfs_chan_spec_scan.
>
> So, I think the question is... how is this NULL - and has it always
> been NULL...

The problem appears to be that ath_cmn_process_fft() isn't called that
often. When it is, it crashes in ath_cmn_is_fft_buf_full() because
spec_priv->rfs_chan_spec_scan is NULL when ATH9K_DEBUGFS=n. :-(

I'm running with ATH9K_DEBUGFS=y now. If it goes a couple of days
without crashing, I'll gin up a patch.

thx,

Jason.
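
To make the failure mode described above concrete, here is a minimal userspace sketch of guarding the buffer-full check against a channel that was never opened. This is not the patch discussed in the thread and not the driver's code: `mock_rchan`, `mock_spec_priv`, `mock_is_fft_buf_full` and `MOCK_NR_CPUS` are all hypothetical stand-ins, and treating a missing channel as "full" is just one plausible choice.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NR_CPUS 4	/* hypothetical; stands in for the per-CPU buffer count */

/* Mock stand-ins for the relay channel and the driver's private data. */
struct mock_rchan {
	int buf_full[MOCK_NR_CPUS];	/* 1 if that CPU's buffer is full */
};

struct mock_spec_priv {
	struct mock_rchan *rfs_chan_spec_scan;	/* NULL if never opened */
};

/*
 * Count full buffers, but bail out before dereferencing the channel
 * if it was never set up. Returning 1 ("full") here is an assumption
 * made for the sketch: it tells the caller to skip the sample.
 */
static int mock_is_fft_buf_full(const struct mock_spec_priv *spec_priv)
{
	const struct mock_rchan *rc;
	int i, ret = 0;

	assert(spec_priv != NULL);
	rc = spec_priv->rfs_chan_spec_scan;
	if (!rc)
		return 1;

	for (i = 0; i < MOCK_NR_CPUS; i++)
		ret += rc->buf_full[i];

	return ret;
}
```

Without the `if (!rc)` test, the loop would read through a NULL pointer at a small fixed offset, which is the kind of fault the oops shows.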
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
On Wed, Nov 23, 2016 at 07:51:20PM +, Russell King - ARM Linux wrote:
> On Wed, Nov 23, 2016 at 07:15:39PM +, Jason Cooper wrote:
> > --- oops from v4.8.6 #2 --
> > [42059.303625] Unable to handle kernel NULL pointer dereference at virtual address 0020
> > [42059.311799] pgd = c0004000
> > [42059.314522] [0020] *pgd=
> > [42059.318162] Internal error: Oops: 17 [#1] SMP ARM
> > [42059.322889] Modules linked in: ath9k ath9k_common ath9k_hw ath
> > [42059.328809] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.6 #37
> > [42059.334755] Hardware name: Marvell Armada 370/XP (Device Tree)
> > [42059.340613] task: c0b091c0 task.stack: c0b0
> > [42059.345176] PC is at ath_cmn_process_fft+0xa0/0x578 [ath9k_common]
> > [42059.351388] LR is at ath_cmn_process_fft+0xc4/0x578 [ath9k_common]
> > [42059.357598] pc : []lr : []psr: 8153
> > [42059.357598] sp : c0b01cd0 ip : fp :
> > [42059.369127] r10: c0b034d4 r9 : 0069 r8 : 006c
> > [42059.374374] r7 : r6 : dcfbd340 r5 : c0b03da0 r4 :
> > [42059.380930] r3 : 0001 r2 : 0008 r1 : 0004 r0 :
>
> Well, the good news is that it's reproducible.
>
> It looks like it could be this:
>
> static int
> ath_cmn_is_fft_buf_full(struct ath_spec_scan_priv *spec_priv)
> {
>         for_each_online_cpu(i)
>                 ret += relay_buf_full(rc->buf[i]);
>
> where i = 8 (r2) and rc->buf is r7. That's just a guess though, as
> there's precious little to go on with the Code: line - modern GCCs
> don't give us much with the Code: line anymore to figure out what's
> going on without the exact object files.
>
> e5933000 ldr r3, [r3]
> e1d330b4 ldrh r3, [r3, #4]
> e58d3030 str r3, [sp, #48] ; 0x30
> ea02 b 1c
> e7970102 ldr r0, [r7, r2, lsl #2]
>

As requested on irc:

-->8
drivers/net/wireless/ath/ath9k/common-spectral.o: file format elf32-littlearm

Disassembly of section .text:
...
0754 :
 754: e92d4ff0 push {r4, r5, r6, r7, r8, r9, sl, fp, lr}
 758: e24dd0d4 sub sp, sp, #212 ; 0xd4
 75c: e1a04002 mov r4, r2
 760: e1a06001 mov r6, r1
 764: e58d0024 str r0, [sp, #36] ; 0x24
 768: e3a01000 mov r1, #0
 76c: e58d2018 str r2, [sp, #24]
 770: e28d0049 add r0, sp, #73 ; 0x49
 774: e3a02087 mov r2, #135 ; 0x87
 778: ebfe bl 0
 77c: e5d44007 ldrb r4, [r4, #7]
 780: e20430fd and r3, r4, #253 ; 0xfd
 784: e3530024 cmp r3, #36 ; 0x24
 788: 13540005 cmpne r4, #5
 78c: 13a04001 movne r4, #1
 790: 03a04000 moveq r4, #0
 794: 13a0 movne r0, #0
 798: 0a01 beq 7a4
 79c: e28dd0d4 add sp, sp, #212 ; 0xd4
 7a0: e8bd8ff0 pop {r4, r5, r6, r7, r8, r9, sl, fp, pc}
 7a4: e59d3018 ldr r3, [sp, #24]
 7a8: e1d380b4 ldrh r8, [r3, #4]
 7ac: e2489003 sub r9, r8, #3
 7b0: e0863009 add r3, r6, r9
 7b4: e5d30002 ldrb r0, [r3, #2]
 7b8: e210 and r0, r0, #16
 7bc: e21000ff ands r0, r0, #255 ; 0xff
 7c0: 0af5 beq 79c
 7c4: e59d3024 ldr r3, [sp, #36] ; 0x24
 7c8: e3005000 movw r5, #0
 7cc: e3405000 movt r5, #0
 7d0: e3e0b000 mvn fp, #0
 7d4: e5932000 ldr r2, [r3]
 7d8: e5937004 ldr r7, [r3, #4]
 7dc: e5923438 ldr r3, [r2, #1080] ; 0x438
 7e0: e58d2010 str r2, [sp, #16]
 7e4: e5933000 ldr r3, [r3]
 7e8: e1d330b4 ldrh r3, [r3, #4]
 7ec: e58d3030 str r3, [sp, #48] ; 0x30
 7f0: ea02 b 800
 7f4: e7970102 ldr r0, [r7, r2, lsl #2]
 7f8: ebfe bl 0
 7fc: e0844000 add r4, r4, r0
 800: e300a000 movw sl, #0
 804: e28b2001 add r2, fp, #1
 808: e340a000 movt sl, #0
 80c: e3a01004 mov r1, #4
 810: e1aa mov r0, sl
 814: ebfe bl 0 <_find_next_bit_le>
 818: e5953000 ldr r3, [r5]
 81c: e153 cmp r0, r3
 820: e1a0b000 mov fp, r0
 824: e2802008 add r2,
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
Hi Russell,

On Wed, Nov 23, 2016 at 07:51:20PM +, Russell King - ARM Linux wrote:
> On Wed, Nov 23, 2016 at 07:15:39PM +, Jason Cooper wrote:
> > --- oops from v4.8.6 #2 --
> > [42059.303625] Unable to handle kernel NULL pointer dereference at virtual address 0020
> > [42059.311799] pgd = c0004000
> > [42059.314522] [0020] *pgd=
> > [42059.318162] Internal error: Oops: 17 [#1] SMP ARM
> > [42059.322889] Modules linked in: ath9k ath9k_common ath9k_hw ath
> > [42059.328809] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.6 #37
> > [42059.334755] Hardware name: Marvell Armada 370/XP (Device Tree)
> > [42059.340613] task: c0b091c0 task.stack: c0b0
> > [42059.345176] PC is at ath_cmn_process_fft+0xa0/0x578 [ath9k_common]
> > [42059.351388] LR is at ath_cmn_process_fft+0xc4/0x578 [ath9k_common]
> > [42059.357598] pc : []lr : []psr: 8153
> > [42059.357598] sp : c0b01cd0 ip : fp :
> > [42059.369127] r10: c0b034d4 r9 : 0069 r8 : 006c
> > [42059.374374] r7 : r6 : dcfbd340 r5 : c0b03da0 r4 :
> > [42059.380930] r3 : 0001 r2 : 0008 r1 : 0004 r0 :
>
> Well, the good news is that it's reproducible.
>
> It looks like it could be this:
>
> static int
> ath_cmn_is_fft_buf_full(struct ath_spec_scan_priv *spec_priv)
> {
>         for_each_online_cpu(i)
>                 ret += relay_buf_full(rc->buf[i]);

ahhh, my config has NR_CPUS=4, this SoC is uniprocessor. I'm going to
give it a go with SMP=no.

This config is a lightly modified mvebu_v7_defconfig. However, NR_CPUS
isn't set in mvebu_v7_defconfig. Only in multi_v7_defconfig.

I suspect ath9k uses different logic for setting up the relay buffer(s)
than for the code you referenced. If SMP=no fails to fail ( :-P ) then
we'll know where to start digging.

thx,

Jason.
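
On the NR_CPUS point: as far as I understand it, `for_each_online_cpu()` only visits CPUs that are actually marked online, so NR_CPUS=4 on a uniprocessor SoC should still only ever produce i = 0. The following is a userspace imitation of that walk over a plain bitmask, just to illustrate the idea. `mock_next_online` and `mock_visit_online` are hypothetical names, and the real kernel macro works on its own cpumask type rather than an `unsigned long`.

```c
#include <assert.h>

/*
 * Return the next set bit in mask after prev, or nr_ids if there is
 * none. Hypothetical helper that imitates stepping through an
 * "online CPU" mask; pass prev = -1 to start from CPU 0.
 */
static int mock_next_online(unsigned long mask, int prev, int nr_ids)
{
	int cpu;

	for (cpu = prev + 1; cpu < nr_ids; cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	return nr_ids;	/* no more online CPUs */
}

/* Record every CPU visited in order; returns how many there were. */
static int mock_visit_online(unsigned long mask, int nr_ids, int *visited)
{
	int n = 0, cpu;

	assert(nr_ids > 0 && nr_ids <= (int)(8 * sizeof(unsigned long)));
	for (cpu = mock_next_online(mask, -1, nr_ids); cpu < nr_ids;
	     cpu = mock_next_online(mask, cpu, nr_ids))
		visited[n++] = cpu;
	return n;
}
```

With a mask of 0x1 and nr_ids = 4, the walk visits CPU 0 only, so under this reading the per-CPU buffer index never gets past 0 on a single-core board, whatever NR_CPUS says.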
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
Hi Kalle,

On Wed, Nov 23, 2016 at 09:26:42PM +0200, Kalle Valo wrote:
> Jason Cooper writes:
> > I have a Ubiquiti SR-71 mini-pcie ath9k card in a Globalscale Mirabox
> > board (Marvell Armada 370 SoC). Every day or so I get a consistent
> > crash that brings down the whole board. I've attached three oops I
> > captured on the serial port.
> >
> > I looked at the commits from v4.8.6 to v4.9-rc6, and nothing jumped out
> > at me as "this would fix it". And since it takes a day or so to trigger
> > the oops, bisecting would be a bit brutal. Does anyone have any insight
> > into this?
>
> Is this a regression, meaning that it didn't crash on older kernels but
> crashes on newer ones? Or has it always crashed?

iirc, it's always done this. It's one of my spare wifi backhauls that
spends most of its time in a cardboard box waiting for a task,
collecting dust. Kinda like the toys in Toy Story.

I pulled it out a month or so ago and the behavior started. It had
4.2.8 on it at the time. I upgraded to latest stable a few weeks ago
(v4.8.6) and I'm getting the same issue.

When I originally set it up, it didn't run long enough for me to recall
if the issue occurred. Best I recall, that was with v4.2.8.

thx,

Jason.
Re: [ath9k-devel] ath9k ARMv7 OOPS in v4.8.6, v4.2.8
Jason Cooper writes:
> All,
>
> I have a Ubiquiti SR-71 mini-pcie ath9k card in a Globalscale Mirabox
> board (Marvell Armada 370 SoC). Every day or so I get a consistent
> crash that brings down the whole board. I've attached three oops I
> captured on the serial port.
>
> I looked at the commits from v4.8.6 to v4.9-rc6, and nothing jumped out
> at me as "this would fix it". And since it takes a day or so to trigger
> the oops, bisecting would be a bit brutal. Does anyone have any insight
> into this?

Is this a regression, meaning that it didn't crash on older kernels but
crashes on newer ones? Or has it always crashed?

--
Kalle Valo