On Wed, Apr 01, 2020 at 12:58:23PM +1100, Jonathan Gray wrote: > On Wed, Mar 18, 2020 at 01:41:06PM +0100, Patrick Wildt wrote: > > On Wed, Mar 18, 2020 at 11:22:40AM +0100, Patrick Wildt wrote: > > > Hi, > > > > > > I've spent a few days investigating why USB ethernet adapters are so > > > horribly slow on my ARMs. Using dt(4) I realized that it was spending > > > most of its time in memcpy. But, why? As it turns out, all USB data > > > buffers are mapped COHERENT, which on some/most ARMs means uncached. > > > Using cached data buffers makes the performance rise from 20 mbit/s to > > > 200 mbit/s. Quite a difference. > > > > > > sys/dev/usb/usb_mem.c: > > > error = bus_dmamem_map(tag, p->segs, p->nsegs, p->size, > > > &p->kaddr, BUS_DMA_NOWAIT|BUS_DMA_COHERENT); > > > > > > On x86, COHERENT is essentially a no-op. On ARM, it depends on the SoC. > > > Some SoCs have cache-coherent USB controllers, some don't. Mine does > > > not, so mapping it COHERENT means uncached and thus slow. > > > > > > Why do we do that? Well, when the code was imported in 99, it was > > > already there. Since then we have gained infrastructure for DMA > > > syncs in the USB stack, which I think are proper. > > > > > > sys/dev/usb/usbdi.c - usbd_transfer() (before transfer) > > > > > > if (!usbd_xfer_isread(xfer)) { > > > if ((xfer->flags & USBD_NO_COPY) == 0) > > > memcpy(KERNADDR(&xfer->dmabuf, 0), xfer->buffer, > > > xfer->length); > > > usb_syncmem(&xfer->dmabuf, 0, xfer->length, > > > BUS_DMASYNC_PREWRITE); > > > } else > > > usb_syncmem(&xfer->dmabuf, 0, xfer->length, > > > BUS_DMASYNC_PREREAD); > > > err = pipe->methods->transfer(xfer); > > > > > > sys/dev/usb/usbdi.c - usb_transfer_complete() (after transfer) > > > > > > if (xfer->actlen != 0) { > > > if (usbd_xfer_isread(xfer)) { > > > usb_syncmem(&xfer->dmabuf, 0, xfer->actlen, > > > BUS_DMASYNC_POSTREAD); > > > if (!(xfer->flags & USBD_NO_COPY)) > > > memcpy(xfer->buffer, KERNADDR(&xfer->dmabuf, 0), > > > xfer->actlen); > > > } else > > > usb_syncmem(&xfer->dmabuf, 0, xfer->actlen, > > > BUS_DMASYNC_POSTWRITE); > > > } > > > > > > We cannot just remove COHERENT, since some drivers, like ehci(4), use > > > the same backend to allocate their rings. And I can't vouch for those > > > drivers' sanity. > > > > > > As a first step, I would like to go ahead with another solution, which > > > is based on a diff from Marius Strobl, who added those syncs in the > > > first place. Essentially it splits the memory handling into cacheable > > > and non-cacheable blocks. The USB data transfers and everyone who uses > > > usbd_alloc_buffer() then use cacheable buffers, while code like ehci(4) > > > still don't. This is a bit of a safer approach imho, since we don't > > > hurt the controller drivers, but speed up the data buffers. > > > > > > Once we have verified that there are no regressions, we can adjust > > > ehci(4) and the like, add proper syncs, make sure they still work as > > > well as before, and maybe then back this out again. > > > > > > Keep note that this is all a no-op on X86, but all the other archs will > > > profit from this. > > > > > > ok? > > > > > > Patrick > > > > Update diff with inverted logic. kettenis@ argues that we should > > invert the logic, and those who need COHERENT memory should ask > > for that explicitly, since for bus_dmamem_map() it also needs to > > be passed explicitly. This also points out all those users that > > use usb_allocmem() internally, where we might want to have a look > > if COHERENT is actually needed or not, or if it can be refactored > > in another way. > > These commits broke usb on imx.6 with cubox: > > imxehci0 at simplebus3 > usb0 at imxehci0: USB revision 2.0 > usb0: root hub problem > imxehci1 at simplebus3 > usb1 at imxehci1: USB revision 2.0 > usb1: root hub problem > "usbmisc" at simplebus3 not configured
pandaboard with omap4 (another cortex a9) also has broken usb with the latest snapshot: omehci0 at simplebus0 usb0 at omehci0: USB revision 2.0 usb0: root hub problem