> Date: Sat, 9 May 2020 12:07:55 +0200
> From: Jan Klemkow <j.klem...@wemelug.de>
> 
> On Fri, May 08, 2020 at 04:36:28PM +0200, Mark Kettenis wrote:
> > > Date: Fri, 8 May 2020 15:39:26 +0200
> > > From: Jan Klemkow <j.klem...@wemelug.de>
> > > 
> > > On Mon, Apr 27, 2020 at 10:42:52AM +0200, Martin Pieuchot wrote:
> > > > So the current logic assumes that as long as the ring contains
> > > > descriptors, em_rxrefill() failures aren't fatal and it is not
> > > > necessary to schedule a refill.  Why?
> > > 
> > > I'm not sure.  I just figured out what I described below.
> > > 
> > > > Unless we understand this, it is difficult to reason about a fix.
> > > > Have you looked at other drivers, is it a common pattern?
> > > 
> > > Drivers like iavf(4) and ixl(4) just set the timeout when the alive
> > > counter (returned by if_rxr_inuse) is 0.  Other drivers like ix(4) and
> > > bnx(4) schedule the timeout when the refill was not possible.
> > > 
> > > I adjusted the patch to the behavior of ix(4) (see below).
> > > 
> > > > > In my case, the function em_rxeof is unable to pull the last one or
> > > > > two mbufs from the receive ring, because the descriptor status
> > > > > hasn't set the "descriptor done" flag (E1000_RXD_STAT_DD).  Thus,
> > > > > the ring has one or two remaining mbufs, and the timeout to put new
> > > > > ones in is never set.  When new mbufs are available later, the em
> > > > > driver stays in that situation.
> > > > 
> > > > Are you saying there's always some descriptor on the ring?  Did you
> > > > investigate why?
> > > 
> > > Yes, with printf-debugging: I see that we return from em_rxeof()
> > > because the status bit E1000_RXD_STAT_DD is not set.  And systat(8)
> > > shows one or two remaining mbufs in the ALIVE column.
> > > 
> > > > > The diff always sets the timeout to put new mbufs in the receive
> > > > > ring, when it has fewer mbufs than the "low water mark".
> > > > 
> > > > This changes the logic explained above.  With it the fatal condition
> > > > changes from 0 to `lwm'.  Another possible fix would be to always
> > > > schedule a timeout as soon as em_rxrefill() fails.
> > > 
> > > I adjusted the patch to this behavior, as I mentioned above.
> > > 
> > > > I'm not advocating for a particular change, I'm suggesting we
> > > > understand the current logic and its implications.
> > > 
> > > The logic looks to me like this: the NIC should first fill the last
> > > mbuf (descriptor), then we set a timeout to refill new mbufs
> > > (descriptors).  It seems to me that one or two mbufs may be too few
> > > for the hardware to operate correctly?!
> > > 
> > > But as far as I understand Intel's documentation, it should be fine to
> > > deal with such a low number of mbufs.  I couldn't find any text saying
> > > that the firmware needs a minimum amount of DMA memory.
> > 
> > Sorry, I was going to reply earlier to this mail thread.
> > 
> > As far as I know, OpenBSD is the only OS that tries to run with
> > partially filled rings.  When we started doing this we quickly found
> > out that there was hardware out there that didn't work with only a
> > single descriptor on the Rx ring.  This is why we introduced the low
> > water mark.
> 
> Should I make similar diffs for drivers that check for 0 descriptors as
> well?

I think a full audit would indeed be good.  But that can be done in a
follow-up diff.
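For illustration, here is a rough sketch of the receive-completion behaviour
discussed above.  Apart from E1000_RXD_STAT_DD, ISSET(), if_rxr_inuse() and
if_rxr_put(), every name is invented for the example; this is not the actual
em(4) code.

/*
 * The loop stops as soon as the hardware has not written a descriptor
 * back.  If the chip never completes the last one or two descriptors,
 * if_rxr_inuse() stays at 1 or 2, so a refill path that only
 * reschedules itself on "== 0" never runs again.
 */
static void
drv_rxeof(struct drv_queue *que)
{
	struct drv_rx_desc *desc;

	while (if_rxr_inuse(&que->rx_ring) > 0) {
		desc = drv_next_rx_desc(que);

		/* Descriptor not written back yet; stop here. */
		if (!ISSET(desc->status, E1000_RXD_STAT_DD))
			break;

		/* Hand the mbuf to the stack and release the slot. */
		drv_deliver_mbuf(que, desc);
		if_rxr_put(&que->rx_ring, 1);
	}
}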
> > I'm not certain if we ever found this documented unambiguously in the
> > hardware documentation available to us.  But there certainly are hints
> > that indicate that descriptors are read a full cache-line at a time.
> > And if the descriptor size is 16 bytes, you need 4 descriptors to
> > fill a 64-byte cache line.
> 
> Good to know.
> 
> > So what you were proposing in your first diff makes sense.  It might
> > make sense to introduce a function to check the limit though, perhaps
> > something like if_rxr_needfill() that returns true if the number of
> > descriptors on the ring is below the low water mark.
> 
> What do you think of this diff, OK?

Looks good to me.  One suggestion to improve the documentation below.

ok kettenis@ but wait a couple of days to give dlg@ a chance to comment.

> Index: share/man/man9/if_rxr_init.9
> ===================================================================
> RCS file: /cvs/src/share/man/man9/if_rxr_init.9,v
> retrieving revision 1.7
> diff -u -p -r1.7 if_rxr_init.9
> --- share/man/man9/if_rxr_init.9	17 Nov 2017 03:51:32 -0000	1.7
> +++ share/man/man9/if_rxr_init.9	9 May 2020 06:40:46 -0000
> @@ -35,6 +35,8 @@
>  .Fn if_rxr_put "struct if_rxring *rxr" "unsigned int n"
>  .Ft void
>  .Fn if_rxr_livelocked "struct if_rxring *rxr"
> +.Ft int
> +.Fn if_rxr_needrefill "struct if_rxring *rxr"
>  .Ft unsigned int
>  .Fn if_rxr_inuse "struct if_rxring *rxr"
>  .Ft unsigned int
> @@ -88,6 +90,10 @@ receive descriptor slots to the ring.
>  .Fn if_rxr_livelocked
>  can signal that that receive ring is generating too much load.
>  .Pp
> +.Fn if_rxr_needrefill
> +signals that that receive ring runs below the low watermark of descriptors and
> +the chip may not work properly.

I'd drop the "and the chip may not work properly" bit.  Best to just
document what the function does.  And if_rxr_init already documents the
lwm as the minimum number of descriptors needed to operate correctly.

> +.Pp
>  .Fn if_rxr_inuse
>  can be used to determine how many descriptor slots have been allocated
>  on the ring.
> @@ -139,6 +145,7 @@ to fill them again.
>  .Fn if_rxr_get ,
>  .Fn if_rxr_put ,
>  .Fn if_rxr_livelocked ,
> +.Fn if_rxr_needrefill ,
>  .Fn if_rxr_inuse ,
>  and
>  .Fn if_rxr_cwm
> @@ -159,6 +166,12 @@ returns the number of receive descriptor
>  The number of descriptors may be less than the
>  .Fa max
>  requested.
> +.Pp
> +.Fn if_rxr_needrefill
> +returns 1 if the number of receive descriptor slots currently in use on the
> +ring is below the value of
> +.Fa lwm .
> +Otherwise, zero is returned.
>  .Pp
>  .Fn if_rxr_inuse
>  returns the number of receive descriptor slots currently in use on the ring.
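As a usage illustration of the functions documented above, a driver's refill
path might look roughly like the sketch below.  Only the if_rxr_* calls and
timeout_add() are the real API; the drv_*/DRV_* names are invented and this
is not taken from em(4).

int
drv_rxfill(struct drv_queue *que)
{
	u_int slots;
	int post = 0;

	/* Reserve up to DRV_NRXDESC descriptor slots from the accounting. */
	for (slots = if_rxr_get(&que->rx_ring, DRV_NRXDESC); slots > 0;
	    slots--) {
		if (drv_post_rx_mbuf(que) != 0)
			break;		/* mbuf allocation failed */
		post = 1;
	}

	/* Hand back the slots we could not fill. */
	if_rxr_put(&que->rx_ring, slots);

	return (post);
}

void
drv_rxrefill(void *arg)
{
	struct drv_queue *que = arg;

	if (drv_rxfill(que))
		drv_notify_chip(que);	/* e.g. write the rx tail register */
	else if (if_rxr_needrefill(&que->rx_ring))
		timeout_add(&que->rx_refill_tmo, 1);
}

This mirrors the em(4) change below, which replaces the "== 0" check with
if_rxr_needrefill().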
> Index: sys/dev/pci/if_em.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> retrieving revision 1.351
> diff -u -p -r1.351 if_em.c
> --- sys/dev/pci/if_em.c	26 Apr 2020 20:49:56 -0000	1.351
> +++ sys/dev/pci/if_em.c	9 May 2020 09:45:07 -0000
> @@ -2833,7 +2833,7 @@ em_rxrefill(void *arg)
>  
>  	if (em_rxfill(que))
>  		E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
> -	else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
> +	else if (if_rxr_needrefill(&que->rx.sc_rx_ring))
>  		timeout_add(&que->rx_refill, 1);
>  }
>  
> Index: sys/net/if_var.h
> ===================================================================
> RCS file: /cvs/src/sys/net/if_var.h,v
> retrieving revision 1.104
> diff -u -p -r1.104 if_var.h
> --- sys/net/if_var.h	12 Apr 2020 07:02:43 -0000	1.104
> +++ sys/net/if_var.h	9 May 2020 09:29:25 -0000
> @@ -387,6 +387,7 @@ void if_rxr_init(struct if_rxring *, u_i
>  u_int if_rxr_get(struct if_rxring *, u_int);
>  
>  #define if_rxr_put(_r, _c)	do { (_r)->rxr_alive -= (_c); } while (0)
> +#define if_rxr_needrefill(_r)	((_r)->rxr_alive < (_r)->rxr_lwm)
>  #define if_rxr_inuse(_r)	((_r)->rxr_alive)
>  #define if_rxr_cwm(_r)		((_r)->rxr_cwm)
>  
> 
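To tie the new macro back to the cache-line argument earlier in the thread:
with 16-byte descriptors and a 64-byte cache line the low water mark should
be at least 64 / 16 = 4, and once a driver initializes its ring that way,
if_rxr_needrefill() is simply if_rxr_inuse() dropping below that mark.  A
minimal sketch with invented DRV_* names; only if_rxr_init() is the real
call:

#define DRV_RX_LWM	4	/* 64-byte cache line / 16-byte descriptor */
#define DRV_NRXDESC	256	/* ring size, arbitrary for the example */

void
drv_rx_ring_setup(struct if_rxring *rxr)
{
	/*
	 * After this, if_rxr_needrefill(rxr) is true whenever fewer than
	 * DRV_RX_LWM descriptors are in use, i.e. whenever
	 * if_rxr_inuse(rxr) < DRV_RX_LWM.
	 */
	if_rxr_init(rxr, DRV_RX_LWM, DRV_NRXDESC);
}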