On 12/22/2017 12:31 PM, Cong Wang wrote:
> On Thu, Dec 21, 2017 at 7:06 PM, John Fastabend
> <john.fastab...@gmail.com> wrote:
>> On 12/21/2017 04:03 PM, Cong Wang wrote:
>>> __skb_array_empty() is only safe if array is never resized.
>>> pfifo_fast_dequeue() is called in TX BH context and without
>>> qdisc lock, so even after we disable BH on ->reset() path
>>> we can still race with other CPU's.
>>>
>>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>>> Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
>>> Cc: John Fastabend <john.fastab...@gmail.com>
>>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>>> ---
>>>  net/sched/sch_generic.c | 3 ---
>>>  1 file changed, 3 deletions(-)
>>>
>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>> index 00ddb5f8f430..9279258ce060 100644
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>      for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>              struct skb_array *q = band2list(priv, band);
>>>
>>> -            if (__skb_array_empty(q))
>>> -                    continue;
>>> -
>>>              skb = skb_array_consume_bh(q);
>>>      }
>>>      if (likely(skb)) {
>>>
>>
>> So this is a performance thing: we don't want to grab the consumer lock on
>> empty bands, which can be fairly common depending on traffic patterns.
>
> I understand why you had it, but it is just not safe. You don't want
> to achieve a performance gain by crashing the system, right?
huh? So my point is that the patch you submitted here is not a real fix
but a workaround. Peeking at the head of a consumer/producer ring
without a lock _should_ be fine. This _should_ work even with consumer
or producer operations happening at the same time.

After some digging, the issue is in the ptr_ring code. The peek code
(what the empty check calls) is the following,

static inline void *__ptr_ring_peek(struct ptr_ring *r)
{
        if (likely(r->size))
                return r->queue[r->consumer_head];
        return NULL;
}

So what the splat is detecting is consumer_head being out of bounds.
This happens because __ptr_ring_discard_one() increments consumer_head
and only then checks whether it overran the array size. If the above
peek happens after the increment, but before the size check, we get
the splat.

There are two ways, as far as I can see, to fix this. First, do the
check before incrementing the consumer head. Or the easier fix,

--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,

 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
-       return kcalloc(size, sizeof(void *), gfp);
+       return kcalloc(size + 1, sizeof(void *), gfp);
 }

With Jakub's help (thanks!) I was able to reproduce the original splat
and also verify that the above removes it.

To be clear, "resizing" an skb_array refers only to changing the actual
array size, not to adding/removing elements.
>
> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
> skb_array_consume_bh(), so there is no difference w.r.t. resizing.

Sorry, not following.

> And ->reset() is called in qdisc_graft() too. Let's say we have
> htb+pfifo_fast, htb_graft() calls qdisc_replace() which calls
> qdisc_reset() on pfifo_fast, so clearly pfifo_fast_reset() can run
> with pfifo_fast_dequeue() concurrently.

Yes, and this _should_ be perfectly fine for pfifo_fast.

I'm wondering though if this API can be cleaned up. What are the paths
that do a reset without a destroy? Do we really need this pattern where
reset is called and destroy only comes later? It seems destroy could do
the entire cleanup, which would simplify things. None of this has to do
with the splat though.

>> I think the right fix is to only call reset/destroy patterns after
>> waiting a grace period and for all tx_action calls in-flight to
>> complete. This is also better going forward for more complex qdiscs.
>
> But we don't even have rcu read lock in TX BH, do we?
>
> Also, people certainly don't like yet another synchronize_net()...

This needs a fix and is a _real_ bug, but removing __skb_array_empty()
doesn't help solve it at all. I will work on a fix after the holiday
break. The fix here is to ensure the destroy does not happen while
tx_action is in-flight. It can be done with qdisc_run() and checking
the correct bits in the lockless case.

.John