On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend <john.fastab...@gmail.com> wrote: > On 12/22/2017 12:31 PM, Cong Wang wrote: >> I understand why you had it, but it is just not safe. You don't want >> to achieve performance gain by crashing system, right? > > huh? So my point is the patch you submit here is not a > real fix but a work around. To peek the head of a consumer/producer ring > without a lock, _should_ be fine. This _should_ work as well with > consumer or producer operations happening at the same time. After some > digging the issue is in the ptr_ring code.
The comments disagree with you: /* Might be slightly faster than skb_array_empty below, but only safe if the * array is never resized. Also, callers invoking this in a loop must take care * to use a compiler barrier, for example cpu_relax(). */ If the comments are right, you miss a barrier here too. > > The peek code (what empty check calls) is the following, > > static inline void *__ptr_ring_peek(struct ptr_ring *r) > { > if (likely(r->size)) > return r->queue[r->consumer_head]; > return NULL; > } > > So what the splat is detecting is consumer head being 'out of bounds'. > This happens because ptr_ring_discard_one increments the consumer_head > and then checks to see if it overran the array size. If above peek > happens after the increment, but before the size check we get the > splat. There are two ways, as far as I can see, to fix this. First > do the check before incrementing the consumer head. Or the easier > fix, > > --- a/include/linux/ptr_ring.h > +++ b/include/linux/ptr_ring.h > @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct > ptr_ring *r, > > static inline void **__ptr_ring_init_queue_alloc(unsigned int size, > gfp_t gfp) > { > - return kcalloc(size, sizeof(void *), gfp); > + return kcalloc(size + 1, sizeof(void *), gfp); > } > > With Jakub's help (Thanks!) I was able to reproduce the original splat > and also verify the above removes it. > > To be clear "resizing" a skb_array only refers to changing the actual > array size not adding/removing elements. I never look into the implementation, just simply trust the comments. At least the comments above __skb_array_empty() need to improve. > >> >>> >>> Although its not logical IMO to have both reset and dequeue running at >>> the same time. Some skbs would get through others would get sent, sort >>> of a mess. I don't see how it can be an issue. The never resized bit >>> in the documentation is referring to resizing the ring size _not_ popping >>> off elements of the ring. array_empty just reads the consumer head. >>> The only ring resizing in pfifo fast should be at init and destroy where >>> enqueue/dequeue should be disconnected by then. Although based on the >>> trace I missed a case. >> >> >> Both pfifo_fast_reset() and pfifo_fast_dequeue() call >> skb_array_consume_bh(), so there is no difference w.r.t. resizing. >> > > Sorry not following. > >> And ->reset() is called in qdisc_graft() too. Let's say we have >> htb+pfifo_fast, >> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast, >> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue() >> concurrently. > > Yes and this _should_ be perfectly fine for pfifo_fast. I'm wondering > though if this API can be cleaned up. What are the paths that do a reset > without a destroy.. Do we really need to have this pattern where reset > is called then later destroy. Seems destroy could do the entire cleanup > and this would simplify things. None of this has to do with the splat > though. I don't follow your point any more. We are talking about ->reset() race with ->dequeue() which is the cause of the bug, right? If you expect ->reset() runs in parallel with ->dequeue() for pfifo_fast, why did you even mention synchronize_net() from the beginning??? Also you changed the code too, to adjust rcu grace period. > >> >> >>> >>> I think the right fix is to only call reset/destroy patterns after >>> waiting a grace period and for all tx_action calls in-flight to >>> complete. This is also better going forward for more complex qdiscs. >> >> But we don't even have rcu read lock in TX BH, do we? >> >> Also, people certainly don't like yet another synchronize_net()... >> > > This needs a fix and is a _real_ bug, but removing __skb_array_empty() > doesn't help solve this at all. Will work on a fix after the holiday > break. The fix here is to ensure the destroy is not going to happen > while tx_action is in-flight. Can be done with qdisc_run and checking > correct bits in lockless case. Sounds like you missed a lot of things with your "lockless" patches.... First qdisc rcu callback, second rcu read lock in TX BH... My quick one-line fix is to amend this bug before you going deeper in this rabbit hole.