> Date: Sat, 9 May 2020 12:07:55 +0200
> From: Jan Klemkow <j.klem...@wemelug.de>
> 
> On Fri, May 08, 2020 at 04:36:28PM +0200, Mark Kettenis wrote:
> > > Date: Fri, 8 May 2020 15:39:26 +0200
> > > From: Jan Klemkow <j.klem...@wemelug.de>
> > > On Mon, Apr 27, 2020 at 10:42:52AM +0200, Martin Pieuchot wrote:
> > > > So the current logic assumes that as long as the ring contains 
> > > > descriptors
> > > > em_rxrefill() failures aren't fatal and it is not necessary to schedule 
> > > > a
> > > > refill.  Why?
> > > 
> > > I'm not sure.  I just figured out, what I described below.
> > > 
> > > > Unless we understand this it is difficult to reason about a fix.  Have
> > > > you looked at other drivers, is it a common pattern?
> > > 
> > > Drivers like iavf(4) and ixl(4) just set the timeout when the alive
> > > (if_rxr_inuse returns) counter is 0.  Other drivers like ix(4) and
> > > bnx(4) call the timeout, when refill was not possible.
> > > 
> > > I adjust the patch to the behavior of ix(4).  (see below).
> > > 
> > > > > In my case, the function em_rxeof is unable to pull the last one or 
> > > > > two
> > > > > mbufs from the receive ring, because the descriptor status hasn't set
> > > > > the "descriptor done" flag (E1000_RXD_STAT_DD).  Thus, the ring has 
> > > > > one
> > > > > or two remaining mbufs, and the timeout to put new ones in is never 
> > > > > set.
> > > > > When new mbufs are available later, the em driver stays in that
> > > > > situation.
> > > > 
> > > > Are you saying there's always some descriptor on the ring?  Did you
> > > > investigate why?
> > > 
> > > Yes, with printf-debugging:  I see that we return from em_rxeof()
> > > because the status bit E1000_RXD_STAT_DD is not set.  And systat(8)
> > > shows one or two remaining mbufs in the ALIVE column.
> > > 
> > > > > The diff always sets the timeout to put new mbufs in the receive ring,
> > > > > when it has lesser mbufs then the "low water mark".
> > > > 
> > > > This changes the logic explained above.  With it the fatal condition
> > > > changes from 0 to `lwm'.  Another possible fix would be to always
> > > > schedule a timeout as soon as em_rxrefill() fails.
> > > 
> > > I adjusted the patch to this behavior, as I mentioned above.
> > > 
> > > > I'm not advocating for a particular change, I'm suggesting we understand
> > > > the current logic and its implications.
> > > 
> > > The logic looks to me like: The NIC should first fill the last mbuf
> > > (descriptor), then we set a timeout to refill new mbufs (descriptors).
> > > It seems to me, that one or two mbufs maybe some kind of small to
> > > operate correctly?!
> > > 
> > > But, as far as I understand the documentation of Intel, It should be
> > > find to deal with such a low amount of mbufs.  I couldn't find any text
> > > that said, the firmware needs a minimum amount of DMA memory.
> > 
> > Sorry, I was going to reply earlier to this mail thread.
> > 
> > As far as I know, OpenBSD is the only OS that tries to run with
> > partially filled rings.  When we started doing this we quickly found
> > out that there was hardware out there that didn't work with only a
> > single descriptor on the Rx ring.  This is why we introduced the low
> > water mark.
> 
> Should I make similar diffs for drivers that check for 0 descriptors as
> well?

I think a full audit would indeed be good.  But that can be done in a
follow-up diff.

> > I'm not certain if we ever found this documented unambiguously the the
> > hardware documentation available to us.  But there certainly are hints
> > that indicate that descriptors are read a full cache-line at a time.
> > And if the descrioptor size is 16 bytes, you need 4 descriptors to
> > fill a 64-byte cache line.
> 
> Good to know.
> 
> > So what you were proposing in your first diff makes sense.  It maight
> > make sense to introduce a function to check the limit though, perhaps
> > something like if_rxr_needfill() that returns true if the number of
> > descriptors on the ring is below the low water mark.
> 
> What do you think of this diff, OK?

Looks good to me.  One suggestion to improve the documentation below.

ok kettenis@ but wait a couple of days to give dlg@ a chance to comment.


> Index: share/man/man9/if_rxr_init.9
> ===================================================================
> RCS file: /cvs/src/share/man/man9/if_rxr_init.9,v
> retrieving revision 1.7
> diff -u -p -r1.7 if_rxr_init.9
> --- share/man/man9/if_rxr_init.9      17 Nov 2017 03:51:32 -0000      1.7
> +++ share/man/man9/if_rxr_init.9      9 May 2020 06:40:46 -0000
> @@ -35,6 +35,8 @@
>  .Fn if_rxr_put "struct if_rxring *rxr" "unsigned int n"
>  .Ft void
>  .Fn if_rxr_livelocked "struct if_rxring *rxr"
> +.Ft int
> +.Fn if_rxr_needrefill "struct if_rxring *rxr"
>  .Ft unsigned int
>  .Fn if_rxr_inuse "struct if_rxring *rxr"
>  .Ft unsigned int
> @@ -88,6 +90,10 @@ receive descriptor slots to the ring.
>  .Fn if_rxr_livelocked
>  can signal that that receive ring is generating too much load.
>  .Pp
> +.Fn if_rxr_needrefill
> +signals that that receive ring runs below the low watermark of descriptors 
> and
> +the chip may not work properly.

I'd drop the "and the chip may not work properly" bit.  Best to just
document what the function does.  And if_rxr_init already documents
the lwm as the minimum number of descriptors needed to operate
correctly.

> +.Pp
>  .Fn if_rxr_inuse
>  can be used to determine how many descriptor slots have been allocated
>  on the ring.
> @@ -139,6 +145,7 @@ to fill them again.
>  .Fn if_rxr_get ,
>  .Fn if_rxr_put ,
>  .Fn if_rxr_livelocked ,
> +.Fn if_rxr_needrefill ,
>  .Fn if_rxr_inuse ,
>  and
>  .Fn if_rxr_cwm
> @@ -159,6 +166,12 @@ returns the number of receive descriptor
>  The number of descriptors may be less than the
>  .Fa max
>  requested.
> +.Pp
> +.Fn if_rxr_needrefill
> +returns 1 if the number of receive descriptor slots currently in use on the
> +ring is below the value of
> +.Fa lwm .
> +Otherwise, zero is returned.
>  .Pp
>  .Fn if_rxr_inuse
>  returns the number of receive descriptor slots currently in use on the ring.
> Index: sys/dev/pci/if_em.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> retrieving revision 1.351
> diff -u -p -r1.351 if_em.c
> --- sys/dev/pci/if_em.c       26 Apr 2020 20:49:56 -0000      1.351
> +++ sys/dev/pci/if_em.c       9 May 2020 09:45:07 -0000
> @@ -2833,7 +2833,7 @@ em_rxrefill(void *arg)
>  
>       if (em_rxfill(que))
>               E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
> -     else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
> +     else if (if_rxr_needrefill(&que->rx.sc_rx_ring))
>               timeout_add(&que->rx_refill, 1);
>  }
>  
> Index: sys/net/if_var.h
> ===================================================================
> RCS file: /cvs/src/sys/net/if_var.h,v
> retrieving revision 1.104
> diff -u -p -r1.104 if_var.h
> --- sys/net/if_var.h  12 Apr 2020 07:02:43 -0000      1.104
> +++ sys/net/if_var.h  9 May 2020 09:29:25 -0000
> @@ -387,6 +387,7 @@ void      if_rxr_init(struct if_rxring *, u_i
>  u_int        if_rxr_get(struct if_rxring *, u_int);
>  
>  #define if_rxr_put(_r, _c)   do { (_r)->rxr_alive -= (_c); } while (0)
> +#define if_rxr_needrefill(_r)        ((_r)->rxr_alive < (_r)->rxr_lwm)
>  #define if_rxr_inuse(_r)     ((_r)->rxr_alive)
>  #define if_rxr_cwm(_r)               ((_r)->rxr_cwm)
>  
> 

Reply via email to