On Fri, May 08, 2020 at 04:36:28PM +0200, Mark Kettenis wrote:
> > Date: Fri, 8 May 2020 15:39:26 +0200
> > From: Jan Klemkow <j.klem...@wemelug.de>
> >
> > On Mon, Apr 27, 2020 at 10:42:52AM +0200, Martin Pieuchot wrote:
> > > So the current logic assumes that as long as the ring contains
> > > descriptors, em_rxrefill() failures aren't fatal and it is not
> > > necessary to schedule a refill. Why?
> >
> > I'm not sure. I just figured out what I describe below.
> >
> > > Unless we understand this it is difficult to reason about a fix.
> > > Have you looked at other drivers, is it a common pattern?
> >
> > Drivers like iavf(4) and ixl(4) only set the timeout when the alive
> > counter (what if_rxr_inuse() returns) is 0. Other drivers like ix(4)
> > and bnx(4) schedule the timeout whenever the refill was not possible.
> >
> > I adjusted the patch to the behavior of ix(4) (see below).
> >
> > > > In my case, the function em_rxeof() is unable to pull the last one
> > > > or two mbufs from the receive ring, because the descriptor status
> > > > never gets the "descriptor done" flag (E1000_RXD_STAT_DD) set.
> > > > Thus, the ring keeps one or two remaining mbufs, and the timeout
> > > > to put new ones in is never set. Even when new mbufs become
> > > > available later, the em(4) driver stays stuck in that situation.
> > >
> > > Are you saying there's always some descriptor on the ring? Did you
> > > investigate why?
> >
> > Yes, with printf-debugging: I see that we return from em_rxeof()
> > because the status bit E1000_RXD_STAT_DD is not set. And systat(8)
> > shows one or two remaining mbufs in the ALIVE column.
> >
> > > > The diff always sets the timeout to put new mbufs into the
> > > > receive ring when it has fewer mbufs than the "low water mark".
> > >
> > > This changes the logic explained above. With it the fatal condition
> > > changes from 0 to `lwm'. Another possible fix would be to always
> > > schedule a timeout as soon as em_rxrefill() fails.
> >
> > I adjusted the patch to this behavior, as I mentioned above.
> >
> > > I'm not advocating for a particular change, I'm suggesting we
> > > understand the current logic and its implications.
> >
> > The logic looks to me like: the NIC should first fill the last mbuf
> > (descriptor), then we set a timeout to refill new mbufs
> > (descriptors). It seems to me that one or two mbufs may be too few
> > for the hardware to operate correctly.
> >
> > But, as far as I understand Intel's documentation, it should be fine
> > to deal with such a low number of mbufs. I couldn't find any text
> > saying that the firmware needs a minimum amount of DMA memory.
>
> Sorry, I was going to reply earlier to this mail thread.
>
> As far as I know, OpenBSD is the only OS that tries to run with
> partially filled rings. When we started doing this we quickly found
> out that there was hardware out there that didn't work with only a
> single descriptor on the Rx ring. This is why we introduced the low
> water mark.
Should I make similar diffs for drivers that check for 0 descriptors as
well?

> I'm not certain if we ever found this documented unambiguously in the
> hardware documentation available to us. But there certainly are hints
> that indicate that descriptors are read a full cache line at a time.
> And if the descriptor size is 16 bytes, you need 4 descriptors to
> fill a 64-byte cache line.

Good to know.

> So what you were proposing in your first diff makes sense. It might
> make sense to introduce a function to check the limit though, perhaps
> something like if_rxr_needfill() that returns true if the number of
> descriptors on the ring is below the low water mark.

What do you think of this diff, OK?

Thanks,
Jan

Index: share/man/man9/if_rxr_init.9
===================================================================
RCS file: /cvs/src/share/man/man9/if_rxr_init.9,v
retrieving revision 1.7
diff -u -p -r1.7 if_rxr_init.9
--- share/man/man9/if_rxr_init.9	17 Nov 2017 03:51:32 -0000	1.7
+++ share/man/man9/if_rxr_init.9	9 May 2020 06:40:46 -0000
@@ -35,6 +35,8 @@
 .Fn if_rxr_put "struct if_rxring *rxr" "unsigned int n"
 .Ft void
 .Fn if_rxr_livelocked "struct if_rxring *rxr"
+.Ft int
+.Fn if_rxr_needrefill "struct if_rxring *rxr"
 .Ft unsigned int
 .Fn if_rxr_inuse "struct if_rxring *rxr"
 .Ft unsigned int
@@ -88,6 +90,10 @@ receive descriptor slots to the ring.
 .Fn if_rxr_livelocked
 can signal that that receive ring is generating too much load.
 .Pp
+.Fn if_rxr_needrefill
+signals that the receive ring has fallen below the low water mark of
+descriptors and the chip may not work properly.
+.Pp
 .Fn if_rxr_inuse
 can be used to determine how many descriptor slots have been allocated
 on the ring.
@@ -139,6 +145,7 @@ to fill them again.
 .Fn if_rxr_get ,
 .Fn if_rxr_put ,
 .Fn if_rxr_livelocked ,
+.Fn if_rxr_needrefill ,
 .Fn if_rxr_inuse ,
 and
 .Fn if_rxr_cwm
@@ -159,6 +166,12 @@ returns the number of receive descriptor
 The number of descriptors may be less than the
 .Fa max
 requested.
+.Pp
+.Fn if_rxr_needrefill
+returns 1 if the number of receive descriptor slots currently in use on the
+ring is below the value of
+.Fa lwm .
+Otherwise, zero is returned.
 .Pp
 .Fn if_rxr_inuse
 returns the number of receive descriptor slots currently in use on the ring.
Index: sys/dev/pci/if_em.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
retrieving revision 1.351
diff -u -p -r1.351 if_em.c
--- sys/dev/pci/if_em.c	26 Apr 2020 20:49:56 -0000	1.351
+++ sys/dev/pci/if_em.c	9 May 2020 09:45:07 -0000
@@ -2833,7 +2833,7 @@ em_rxrefill(void *arg)
 
 	if (em_rxfill(que))
 		E1000_WRITE_REG(&sc->hw, RDT(que->me),
 		    que->rx.sc_rx_desc_head);
-	else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
+	else if (if_rxr_needrefill(&que->rx.sc_rx_ring))
 		timeout_add(&que->rx_refill, 1);
 }
Index: sys/net/if_var.h
===================================================================
RCS file: /cvs/src/sys/net/if_var.h,v
retrieving revision 1.104
diff -u -p -r1.104 if_var.h
--- sys/net/if_var.h	12 Apr 2020 07:02:43 -0000	1.104
+++ sys/net/if_var.h	9 May 2020 09:29:25 -0000
@@ -387,6 +387,7 @@ void if_rxr_init(struct if_rxring *, u_i
 u_int if_rxr_get(struct if_rxring *, u_int);
 
 #define if_rxr_put(_r, _c)	do { (_r)->rxr_alive -= (_c); } while (0)
+#define if_rxr_needrefill(_r)	((_r)->rxr_alive < (_r)->rxr_lwm)
 #define if_rxr_inuse(_r)	((_r)->rxr_alive)
 #define if_rxr_cwm(_r)		((_r)->rxr_cwm)