On Wed, Apr 01, 2020 at 09:22:07AM +0200, Patrick Wildt wrote:
> On Wed, Apr 01, 2020 at 04:47:10PM +1100, Jonathan Gray wrote:
> > On Wed, Apr 01, 2020 at 12:58:23PM +1100, Jonathan Gray wrote:
> > > On Wed, Mar 18, 2020 at 01:41:06PM +0100, Patrick Wildt wrote:
> > > > On Wed, Mar 18, 2020 at 11:22:40AM +0100, Patrick Wildt wrote:
> > > > > Hi,
> > > > > 
> > > > > I've spent a few days investigating why USB ethernet adapters are so
> > > > > horribly slow on my ARMs.  Using dt(4) I realized that it was spending
> > > > > most of its time in memcpy.  But, why?  As it turns out, all USB data
> > > > > buffers are mapped COHERENT, which on some/most ARMs means uncached.
> > > > > Using cached data buffers makes the performance rise from 20 mbit/s to
> > > > > 200 mbit/s.  Quite a difference.
> > > > > 
> > > > > sys/dev/usb/usb_mem.c:
> > > > >       error = bus_dmamem_map(tag, p->segs, p->nsegs, p->size,
> > > > >                              &p->kaddr, 
> > > > > BUS_DMA_NOWAIT|BUS_DMA_COHERENT);
> > > > > 
> > > > > On x86, COHERENT is essentially a no-op.  On ARM, it depends on the 
> > > > > SoC.
> > > > > Some SoCs have cache-coherent USB controllers, some don't.  Mine does
> > > > > not, so mapping it COHERENT means uncached and thus slow.
> > > > > 
> > > > > Why do we do that?  Well, when the code was imported in 99, it was
> > > > > already there.  Since then we have gained infrastructure for DMA
> > > > > syncs in the USB stack, which I think are proper.
> > > > > 
> > > > > sys/dev/usb/usbdi.c - usbd_transfer() (before transfer)
> > > > > 
> > > > >       if (!usbd_xfer_isread(xfer)) {
> > > > >               if ((xfer->flags & USBD_NO_COPY) == 0)
> > > > >                       memcpy(KERNADDR(&xfer->dmabuf, 0), xfer->buffer,
> > > > >                           xfer->length);
> > > > >               usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> > > > >                   BUS_DMASYNC_PREWRITE);
> > > > >       } else
> > > > >               usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> > > > >                   BUS_DMASYNC_PREREAD);
> > > > >       err = pipe->methods->transfer(xfer);
> > > > > 
> > > > > sys/dev/usb/usbdi.c - usb_transfer_complete() (after transfer)
> > > > > 
> > > > >       if (xfer->actlen != 0) {
> > > > >               if (usbd_xfer_isread(xfer)) {
> > > > >                       usb_syncmem(&xfer->dmabuf, 0, xfer->actlen,
> > > > >                           BUS_DMASYNC_POSTREAD);
> > > > >                       if (!(xfer->flags & USBD_NO_COPY))
> > > > >                               memcpy(xfer->buffer, 
> > > > > KERNADDR(&xfer->dmabuf, 0),
> > > > >                                   xfer->actlen);
> > > > >               } else
> > > > >                       usb_syncmem(&xfer->dmabuf, 0, xfer->actlen,
> > > > >                           BUS_DMASYNC_POSTWRITE);
> > > > >       }
> > > > > 
> > > > > We cannot just remove COHERENT, since some drivers, like ehci(4), use
> > > > > the same backend to allocate their rings.  And I can't vouch for those
> > > > > drivers' sanity.
> > > > > 
> > > > > As a first step, I would like to go ahead with another solution, which
> > > > > is based on a diff from Marius Strobl, who added those syncs in the
> > > > > first place.  Essentially it splits the memory handling into cacheable
> > > > > and non-cacheable blocks.  The USB data transfers and everyone who 
> > > > > uses
> > > > > usbd_alloc_buffer() then use cacheable buffers, while code like 
> > > > > ehci(4)
> > > > > still don't.  This is a bit of a safer approach imho, since we don't
> > > > > hurt the controller drivers, but speed up the data buffers.
> > > > > 
> > > > > Once we have verified that there are no regressions, we can adjust
> > > > > ehci(4) and the like, add proper syncs, make sure they still work as
> > > > > well as before, and maybe then back this out again.
> > > > > 
> > > > > Keep note that this is all a no-op on X86, but all the other archs 
> > > > > will
> > > > > profit from this.
> > > > > 
> > > > > ok?
> > > > > 
> > > > > Patrick
> > > > 
> > > > Update diff with inverted logic.  kettenis@ argues that we should
> > > > invert the logic, and those who need COHERENT memory should ask
> > > > for that explicitly, since for bus_dmamem_map() it also needs to
> > > > be passed explicitly.  This also points out all those users that
> > > > use usb_allocmem() internally, where we might want to have a look
> > > > if COHERENT is actually needed or not, or if it can be refactored
> > > > in another way.
> > > 
> > > These commits broke usb on imx.6 with cubox:
> > > 
> > > imxehci0 at simplebus3
> > > usb0 at imxehci0: USB revision 2.0
> > > usb0: root hub problem
> > > imxehci1 at simplebus3
> > > usb1 at imxehci1: USB revision 2.0
> > > usb1: root hub problem
> > > "usbmisc" at simplebus3 not configured
> > 
> > pandaboard with omap4 (another cortex a9) also has broken usb with
> > the latest snapshot:
> > 
> > omehci0 at simplebus0
> > usb0 at omehci0: USB revision 2.0
> > usb0: root hub problem
> 
> I think I know what it is.  When we enqueue a request for the root
> hub, the buffer, for which the USB subsystem allocates a DMA buffer,
> is filled not by an external device, but by our driver.
> 
> Now on completion of that request, since it's doing a READ of the
> root hub, it will do a DMA sync with POSTREAD.  Since the ehci code
> hasn't flushed the buffer do memory, but simply did a memcpy to the
> buffer, the POSTREAD will essentially drop whatever ehci's memcpy
> did.
> 
> Can you check if that makes a difference?  It essentially forces the
> root hub code to flush the data to the caches, so that the POSTREAD
> can successfully flush the cache and read the data from memory.  I
> wonder if there's a better way of doing this, but I kinda doubt it.
> 
> Patrick
> 
> diff --git a/sys/dev/usb/ehci.c b/sys/dev/usb/ehci.c
> index a352d83eaf4..d81901d3762 100644
> --- a/sys/dev/usb/ehci.c
> +++ b/sys/dev/usb/ehci.c
> @@ -2154,6 +2154,9 @@ ehci_root_ctrl_start(struct usbd_xfer *xfer)
>               err = USBD_IOERROR;
>               goto ret;
>       }
> +     if (totlen != 0)
> +             usb_syncmem(&xfer->dmabuf, 0, totlen,
> +                 BUS_DMASYNC_PREWRITE);
>       xfer->actlen = totlen;
>       err = USBD_NORMAL_COMPLETION;
>   ret:
> 

If that diff does indeed work, I think a better fix would be to move the
DMA syncs from the USB layer into the driver layer, so that for the root
methods no sync has to be done, but all the individual device methods do
proper syncs.  Preparing that diff would take some more minutes though.
In the meantime we could use USB_DMA_COHERENT for all requests without
an extra allocated buffer, which includes all the control requests.
Then move the syncs to the drivers, and at last remove the flag again.

diff --git a/sys/dev/usb/usbdi.c b/sys/dev/usb/usbdi.c
index 5161afc21e5..a97e22b64a3 100644
--- a/sys/dev/usb/usbdi.c
+++ b/sys/dev/usb/usbdi.c
@@ -305,7 +305,8 @@ usbd_transfer(struct usbd_xfer *xfer)
                if (xfer->rqflags & URQ_AUTO_DMABUF)
                        printf("usbd_transfer: has old buffer!\n");
 #endif
-               err = usb_allocmem(bus, xfer->length, 0, 0, &xfer->dmabuf);
+               err = usb_allocmem(bus, xfer->length, 0, USB_DMA_COHERENT,
+                   &xfer->dmabuf);
                if (err)
                        return (err);
                xfer->rqflags |= URQ_AUTO_DMABUF;

Reply via email to