Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/10/2018 10:10 AM, Michał Kazior wrote:
> Ben,
>
> The patch is symptomatic. fq_tin_dequeue() already checks if the list
> is empty before it tries to access first entry. I see no point in
> using the _or_null() + WARN_ON.
>
> The 0x3c deref is likely an offset off of NULL base pointer. Did you
> check gdb/addr2line of the ieee80211_tx_dequeue+0xfb? Where did it
> point to?

gdb pointed to one line above the flow dereference, which is why I was
going to put some debugging in there.

> I suspect there's not enough synchronization between quiescing the
> device/ath10k after fw crashes and performing mac80211's reconfig
> procedure.

I am already running this patch which helps with some of that. That
patch never made it upstream, but it fixed problems for me earlier.

https://patchwork.kernel.org/patch/9457639/

Could easily be there are some more issues in that logic. Someone else
posted a patch to disable mac-80211 tx when FW crashes, I think...I have
not tried to backport that.

https://patchwork.kernel.org/patch/10411967/

Thanks,
Ben

> Michał
>
> On 8 June 2018 at 23:40, Arend van Spriel wrote:
>> On 6/8/2018 5:17 PM, Ben Greear wrote:
>>
>> I recalled an email from Michał leaving tieto so adding his alternate
>> email he provided back then.
>>
>> Gr. AvS
>>
>>> On 06/07/2018 04:59 PM, Cong Wang wrote:
>>>> On Thu, Jun 7, 2018 at 4:48 PM, wrote:
>>>>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>>>>> index be7c0fa..cb911f0 100644
>>>>> --- a/include/net/fq_impl.h
>>>>> +++ b/include/net/fq_impl.h
>>>>> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>>>>  			return NULL;
>>>>>  	}
>>>>>
>>>>> -	flow = list_first_entry(head, struct fq_flow, flowchain);
>>>>> +	flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
>>>>> +
>>>>> +	if (WARN_ON_ONCE(!flow))
>>>>> +		return NULL;
>>>>
>>>> This does not make sense either. list_first_entry_or_null()
>>>> returns NULL only when the list is empty, but we already check
>>>> list_empty() right before this code, and it is protected by fq->lock.
>>>
>>> Hello Michal,
>>>
>>> git blame shows you as the author of the fq_impl.h code.
>>>
>>> I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks
>>> kernel. There was an apparent mostly-null deref in the fq_tin_dequeue
>>> method. According to gdb, it was within 1 line of the dereference of
>>> 'flow'.
>>>
>>> My hack above is probably not that useful. Cong thinks maybe the
>>> locking is bad.
>>>
>>> If you get a chance, please review this thread and see if you have
>>> any ideas for a better fix (or better debugging code).
>>>
>>> As always, if you would like me to generate you a buggy firmware that
>>> will crash in the tx path and cause all sorts of mayhem in the ath10k
>>> driver and wifi stack, I will be happy to do so.
>>>
>>> https://www.mail-archive.com/netdev@vger.kernel.org/msg239738.html
>>>
>>> Thanks,
>>> Ben

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/08/2018 07:10 AM, Ben Greear wrote:
> Maybe whoever put this code together can take a stab at it.
>

This was one of the motivations for the Fixes: tag request.

By doing a git blame, you can find which commit(s) added this code,
and thus CC the author, who might not follow netdev@ closely.
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 05:13 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 4:48 PM, wrote:
>> From: Ben Greear
>>
>> While testing an ath10k firmware that often crashed under load,
>> I was seeing kernel crashes as well. One of them appeared to
>> be a dereference of a NULL flow object in fq_tin_dequeue.
>>
>> I have since fixed the firmware flaw, but I think it would be
>> worth adding the WARN_ON in case the problem appears again.
>>
>> BUG: unable to handle kernel NULL pointer dereference at 003c
>> IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
>
> Instead of adding WARN_ON(), you need to think about the locking there,
> it is suspicious:
>
> fq is from struct ieee80211_local:
>
>         struct fq *fq = &local->fq;
>
> tin is from struct txq_info:
>
>         struct fq_tin *tin = &txqi->tin;
>
> I don't know if fq and tin are supposed to be 1:1, if not there is a bug
> in the locking, because ->new_flows and ->old_flows are both inside tin
> instead of fq, but they are protected by fq->lock

Maybe whoever put this code together can take a stab at it.

Thanks,
Ben

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 04:59 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 4:48 PM, wrote:
>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>> index be7c0fa..cb911f0 100644
>> --- a/include/net/fq_impl.h
>> +++ b/include/net/fq_impl.h
>> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>  			return NULL;
>>  	}
>>
>> -	flow = list_first_entry(head, struct fq_flow, flowchain);
>> +	flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
>> +
>> +	if (WARN_ON_ONCE(!flow))
>> +		return NULL;
>
> This does not make sense either. list_first_entry_or_null()
> returns NULL only when the list is empty, but we already check
> list_empty() right before this code, and it is protected by fq->lock.

Nevermind then.

Thanks,
Ben

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On Thu, Jun 7, 2018 at 4:48 PM, wrote:
> From: Ben Greear
>
> While testing an ath10k firmware that often crashed under load,
> I was seeing kernel crashes as well. One of them appeared to
> be a dereference of a NULL flow object in fq_tin_dequeue.
>
> I have since fixed the firmware flaw, but I think it would be
> worth adding the WARN_ON in case the problem appears again.
>
> BUG: unable to handle kernel NULL pointer dereference at 003c
> IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]

Instead of adding WARN_ON(), you need to think about the locking there,
it is suspicious:

fq is from struct ieee80211_local:

        struct fq *fq = &local->fq;

tin is from struct txq_info:

        struct fq_tin *tin = &txqi->tin;

I don't know if fq and tin are supposed to be 1:1, if not there is a bug
in the locking, because ->new_flows and ->old_flows are both inside tin
instead of fq, but they are protected by fq->lock
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On Thu, Jun 7, 2018 at 4:48 PM, wrote:
> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
> index be7c0fa..cb911f0 100644
> --- a/include/net/fq_impl.h
> +++ b/include/net/fq_impl.h
> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>  			return NULL;
>  	}
>
> -	flow = list_first_entry(head, struct fq_flow, flowchain);
> +	flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
> +
> +	if (WARN_ON_ONCE(!flow))
> +		return NULL;

This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.
[PATCH v2] net-fq: Add WARN_ON check for null flow.
From: Ben Greear

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well. One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

BUG: unable to handle kernel NULL pointer dereference at 003c
IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
PGD 8001417fe067 P4D 8001417fe067 PUD 13db41067 PMD 0
Oops: [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen lm78 ]
CPU: 2 PID: 21733 Comm: ip Tainted: GW O 4.16.8+ #35
Hardware name: _ _/, BIOS 5.11 08/26/2016
RIP: 0010:ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
RSP: 0018:880172d03c30 EFLAGS: 00010286
RAX: 88013b2c RBX: 88013b2c00b8 RCX: 0898
RDX: 0001 RSI: 88013b2c00d8 RDI: 88016ac40820
RBP: 88016ac42ba0 R08: 0020 R09:
R10: 0010 R11: 001256c89fd8 R12: 88013b2c
R13: 88013b2c00d8 R14: R15: 88013b2c00d8
FS: 7f04e3606700() GS:880172d0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 003c CR3: 00013b35a005 CR4: 003606e0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 ? update_load_avg+0x607/0x6f0
 ath10k_mac_tx_push_txq+0x6e/0x220 [ath10k_core]
 ath10k_mac_tx_push_pending+0x151/0x1e0 [ath10k_core]
 ath10k_htt_txrx_compl_task+0x113e/0x1940 [ath10k_core]
 ? ath10k_ce_completed_send_next_nolock+0x6f/0x90 [ath10k_pci]
 ? ath10k_ce_completed_send_next+0x31/0x40 [ath10k_pci]
 ? ath10k_pci_htc_tx_cb+0x30/0xc0 [ath10k_pci]
 ? ath10k_bus_pci_write32+0x3c/0xa0 [ath10k_pci]
 ath10k_pci_napi_poll+0x44/0xf0 [ath10k_pci]
 net_rx_action+0x250/0x3b0
 __do_softirq+0xc2/0x2c2
 irq_exit+0x93/0xa0
 do_IRQ+0x45/0xc0
 common_interrupt+0xf/0xf

Signed-off-by: Ben Greear
---

* v2: Use list_first_entry_or_null as suggested.
 include/net/fq_impl.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..cb911f0 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
 			return NULL;
 	}
 
-	flow = list_first_entry(head, struct fq_flow, flowchain);
+	flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
+
+	if (WARN_ON_ONCE(!flow))
+		return NULL;
 
 	if (flow->deficit <= 0) {
 		flow->deficit += fq->quantum;
-- 
2.4.11