> Date: Wed, 1 Apr 2020 09:40:05 +0200
> From: Patrick Wildt <patr...@blueri.se>
> 
> On Wed, Apr 01, 2020 at 09:22:07AM +0200, Patrick Wildt wrote:
> > On Wed, Apr 01, 2020 at 04:47:10PM +1100, Jonathan Gray wrote:
> > > On Wed, Apr 01, 2020 at 12:58:23PM +1100, Jonathan Gray wrote:
> > > > On Wed, Mar 18, 2020 at 01:41:06PM +0100, Patrick Wildt wrote:
> > > > > On Wed, Mar 18, 2020 at 11:22:40AM +0100, Patrick Wildt wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I've spent a few days investigating why USB ethernet adapters are so
> > > > > > horribly slow on my ARMs.  Using dt(4) I realized that it was 
> > > > > > spending
> > > > > > most of its time in memcpy.  But, why?  As it turns out, all USB 
> > > > > > data
> > > > > > buffers are mapped COHERENT, which on some/most ARMs means uncached.
> > > > > > Using cached data buffers makes the performance rise from 20 mbit/s 
> > > > > > to
> > > > > > 200 mbit/s.  Quite a difference.
> > > > > > 
> > > > > > sys/dev/usb/usb_mem.c:
> > > > > >     error = bus_dmamem_map(tag, p->segs, p->nsegs, p->size,
> > > > > >                            &p->kaddr, 
> > > > > > BUS_DMA_NOWAIT|BUS_DMA_COHERENT);
> > > > > > 
> > > > > > On x86, COHERENT is essentially a no-op.  On ARM, it depends on the 
> > > > > > SoC.
> > > > > > Some SoCs have cache-coherent USB controllers, some don't.  Mine 
> > > > > > does
> > > > > > not, so mapping it COHERENT means uncached and thus slow.
> > > > > > 
> > > > > > Why do we do that?  Well, when the code was imported in 99, it was
> > > > > > already there.  Since then we have gained infrastructure for DMA
> > > > > > syncs in the USB stack, which I think are proper.
> > > > > > 
> > > > > > sys/dev/usb/usbdi.c - usbd_transfer() (before transfer)
> > > > > > 
> > > > > >     if (!usbd_xfer_isread(xfer)) {
> > > > > >             if ((xfer->flags & USBD_NO_COPY) == 0)
> > > > > >                     memcpy(KERNADDR(&xfer->dmabuf, 0), xfer->buffer,
> > > > > >                         xfer->length);
> > > > > >             usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> > > > > >                 BUS_DMASYNC_PREWRITE);
> > > > > >     } else
> > > > > >             usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> > > > > >                 BUS_DMASYNC_PREREAD);
> > > > > >     err = pipe->methods->transfer(xfer);
> > > > > > 
> > > > > > sys/dev/usb/usbdi.c - usb_transfer_complete() (after transfer)
> > > > > > 
> > > > > >     if (xfer->actlen != 0) {
> > > > > >             if (usbd_xfer_isread(xfer)) {
> > > > > >                     usb_syncmem(&xfer->dmabuf, 0, xfer->actlen,
> > > > > >                         BUS_DMASYNC_POSTREAD);
> > > > > >                     if (!(xfer->flags & USBD_NO_COPY))
> > > > > >                             memcpy(xfer->buffer, 
> > > > > > KERNADDR(&xfer->dmabuf, 0),
> > > > > >                                 xfer->actlen);
> > > > > >             } else
> > > > > >                     usb_syncmem(&xfer->dmabuf, 0, xfer->actlen,
> > > > > >                         BUS_DMASYNC_POSTWRITE);
> > > > > >     }
> > > > > > 
> > > > > > We cannot just remove COHERENT, since some drivers, like ehci(4), 
> > > > > > use
> > > > > > the same backend to allocate their rings.  And I can't vouch for 
> > > > > > those
> > > > > > drivers' sanity.
> > > > > > 
> > > > > > As a first step, I would like to go ahead with another solution, 
> > > > > > which
> > > > > > is based on a diff from Marius Strobl, who added those syncs in the
> > > > > > first place.  Essentially it splits the memory handling into 
> > > > > > cacheable
> > > > > > and non-cacheable blocks.  The USB data transfers and everyone who 
> > > > > > uses
> > > > > > usbd_alloc_buffer() then use cacheable buffers, while code like 
> > > > > > ehci(4)
> > > > > > still don't.  This is a bit of a safer approach imho, since we don't
> > > > > > hurt the controller drivers, but speed up the data buffers.
> > > > > > 
> > > > > > Once we have verified that there are no regressions, we can adjust
> > > > > > ehci(4) and the like, add proper syncs, make sure they still work as
> > > > > > well as before, and maybe then back this out again.
> > > > > > 
> > > > > > Keep note that this is all a no-op on X86, but all the other archs 
> > > > > > will
> > > > > > profit from this.
> > > > > > 
> > > > > > ok?
> > > > > > 
> > > > > > Patrick
> > > > > 
> > > > > Update diff with inverted logic.  kettenis@ argues that we should
> > > > > invert the logic, and those who need COHERENT memory should ask
> > > > > for that explicitly, since for bus_dmamem_map() it also needs to
> > > > > be passed explicitly.  This also points out all those users that
> > > > > use usb_allocmem() internally, where we might want to have a look
> > > > > if COHERENT is actually needed or not, or if it can be refactored
> > > > > in another way.
> > > > 
> > > > These commits broke usb on imx.6 with cubox:
> > > > 
> > > > imxehci0 at simplebus3
> > > > usb0 at imxehci0: USB revision 2.0
> > > > usb0: root hub problem
> > > > imxehci1 at simplebus3
> > > > usb1 at imxehci1: USB revision 2.0
> > > > usb1: root hub problem
> > > > "usbmisc" at simplebus3 not configured
> > > 
> > > pandaboard with omap4 (another cortex a9) also has broken usb with
> > > the latest snapshot:
> > > 
> > > omehci0 at simplebus0
> > > usb0 at omehci0: USB revision 2.0
> > > usb0: root hub problem
> > 
> > I think I know what it is.  When we enqueue a request for the root
> > hub, the buffer, for which the USB subsystem allocates a DMA buffer,
> > is filled not by an external device, but by our driver.
> > 
> > Now on completion of that request, since it's doing a READ of the
> > root hub, it will do a DMA sync with POSTREAD.  Since the ehci code
> > hasn't flushed the buffer do memory, but simply did a memcpy to the
> > buffer, the POSTREAD will essentially drop whatever ehci's memcpy
> > did.
> > 
> > Can you check if that makes a difference?  It essentially forces the
> > root hub code to flush the data to the caches, so that the POSTREAD
> > can successfully flush the cache and read the data from memory.  I
> > wonder if there's a better way of doing this, but I kinda doubt it.
> > 
> > Patrick
> > 
> > diff --git a/sys/dev/usb/ehci.c b/sys/dev/usb/ehci.c
> > index a352d83eaf4..d81901d3762 100644
> > --- a/sys/dev/usb/ehci.c
> > +++ b/sys/dev/usb/ehci.c
> > @@ -2154,6 +2154,9 @@ ehci_root_ctrl_start(struct usbd_xfer *xfer)
> >             err = USBD_IOERROR;
> >             goto ret;
> >     }
> > +   if (totlen != 0)
> > +           usb_syncmem(&xfer->dmabuf, 0, totlen,
> > +               BUS_DMASYNC_PREWRITE);
> >     xfer->actlen = totlen;
> >     err = USBD_NORMAL_COMPLETION;
> >   ret:
> > 
> 
> If that diff does indeed work, I think a better fix would be to move the
> DMA syncs from the USB layer into the driver layer, so that for the root
> methods no sync has to be done, but all the individual device methods do
> proper syncs.  Preparing that diff would take some more minutes though.
> In the meantime we could use USB_DMA_COHERENT for all requests without
> an extra allocated buffer, which includes all the control requests.
> Then move the syncs to the drivers, and at last remove the flag again.

This diff works as well.  I suggest you commit it while we discuss
other options.

ok kettenis@

> diff --git a/sys/dev/usb/usbdi.c b/sys/dev/usb/usbdi.c
> index 5161afc21e5..a97e22b64a3 100644
> --- a/sys/dev/usb/usbdi.c
> +++ b/sys/dev/usb/usbdi.c
> @@ -305,7 +305,8 @@ usbd_transfer(struct usbd_xfer *xfer)
>               if (xfer->rqflags & URQ_AUTO_DMABUF)
>                       printf("usbd_transfer: has old buffer!\n");
>  #endif
> -             err = usb_allocmem(bus, xfer->length, 0, 0, &xfer->dmabuf);
> +             err = usb_allocmem(bus, xfer->length, 0, USB_DMA_COHERENT,
> +                 &xfer->dmabuf);
>               if (err)
>                       return (err);
>               xfer->rqflags |= URQ_AUTO_DMABUF;
> 
> 

Reply via email to