On Mon, May 18, 2020 at 4:07 PM Bartosz Golaszewski <b...@bgdev.pl> wrote:
> On Fri, May 15, 2020 at 3:32 PM Arnd Bergmann <a...@arndb.de> wrote:
> > I would get rid of the 'count' here, as it duplicates the information
> > that is already known from the difference between head and tail, and
> > you can't update it atomically without holding a lock around the
> > access to the ring. The way I'd do this is to have the head and tail
> > pointers in separate cache lines, and then use READ_ONCE/WRITE_ONCE
> > and smp barriers to access them, with each one updated on one thread
> > but read by the other.
>
> Your previous solution seems much more reliable though. For instance
> in the above: when we're doing the TX cleanup (we got the TX ready
> irq, we're iterating over descriptors until we know there are no more
> packets scheduled (count == 0) or we encounter one that's still owned
> by DMA), a parallel TX path can schedule new packets to be sent, and I
> don't see how we can atomically check the count (understood as the
> difference between tail and head) and run a new iteration (where we'd
> modify the head or tail) without risking the other path getting in the
> way. We'd have to always check the descriptor.

It should be enough to read both pointers once at the start of each side,
then do whatever work you want to do (cleaning, sending, receiving,
refilling) and finally update the one pointer that changed. If both sides
do that, you minimize the cache line bouncing and always do a useful
amount of work that guarantees forward progress and does not interfere
with the other side.

> I experimented a bit with this and couldn't come up with anything that
> would pass any stress test.
>
> On the other hand: spin_lock_bh() works fine and I like your approach
> from the previous e-mail - except for the part about updating stats, as
> we could potentially lose some stats when we're updating in process
> context with the RX/TX paths running in parallel in napi context, but
> that would be rare enough to overlook.
>
> I hope v4 will be good enough even with spinlocks. :)

Yes, it should be fine. Avoiding all the locks is mainly an optimization
for the number of CPU cycles spent per packet; the other points are more
important to get right, in particular the flow control.

      Arnd
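
[Editor's sketch] A minimal illustration of the pointer scheme described
above: one xmit side that owns 'head' and one TX-clean side that owns
'tail', each sampling the other's pointer once per pass and publishing its
own pointer once at the end. It assumes a power-of-two ring with
free-running indices and uses smp_load_acquire()/smp_store_release() as
the barriers; struct tx_desc and the fill_desc()/free_desc()/
desc_owned_by_dma() helpers are placeholders, not the actual driver code.

    #include <linux/compiler.h>
    #include <linux/skbuff.h>
    #include <asm/barrier.h>

    /* head and tail live in separate cache lines to avoid bouncing */
    struct tx_ring {
            struct tx_desc *desc;
            unsigned int size;                              /* power of two */
            unsigned int head ____cacheline_aligned;        /* written by xmit */
            unsigned int tail ____cacheline_aligned;        /* written by clean */
    };

    /* xmit side: owns 'head', only reads 'tail' */
    static int ring_xmit(struct tx_ring *r, struct sk_buff *skb)
    {
            unsigned int head = r->head;
            unsigned int tail = READ_ONCE(r->tail);

            if (head - tail == r->size)
                    return -EBUSY;                  /* ring full */

            fill_desc(&r->desc[head & (r->size - 1)], skb);
            /* publish the descriptor contents before the new head */
            smp_store_release(&r->head, head + 1);
            return 0;
    }

    /* TX-clean side: owns 'tail', only reads 'head' */
    static void ring_clean(struct tx_ring *r)
    {
            unsigned int tail = r->tail;
            unsigned int head = smp_load_acquire(&r->head);

            while (tail != head) {
                    struct tx_desc *d = &r->desc[tail & (r->size - 1)];

                    if (desc_owned_by_dma(d))       /* DMA not done yet */
                            break;
                    free_desc(d);
                    tail++;
            }
            /* make the freed slots visible to the xmit side */
            WRITE_ONCE(r->tail, tail);
    }

Anything the xmit side queues after ring_clean() has sampled 'head' is
simply left for the next TX-completion interrupt, which is how forward
progress is guaranteed without either side re-checking the other's
pointer mid-pass.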