On Fri, Nov 23, 2018 at 2:12 AM David Laight <david.lai...@aculab.com> wrote: > > I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel. > Calling memcpy_fromio(kernel_buffer, PCIe_address, length) > generates a lot of single byte TLP.
I just tested it too - it turns out that the __inline_memcpy() code never triggers, and "memcpy_toio()" just generates a memcpy. So that code seems entirely dead. And, in fact, the codebase I looked at was the historical one, because I had been going back and looking at the history. The modern tree *does* have the "__inline_memcpy()" function I pointed at, but it's not actually hooked up to anything! This actually has been broken for _ages_. The breakage goes back to 2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and it seems nobody really ever noticed - or thought that it was ok. That commit claims that iomem has no special significance on x86, but that really really isn't true, exactly because the access size does matter. And as mentioned, the generic memory copy routines are not at all appropriate, and that has nothing to do with ERMS. Our "do it by hand" memory copy routine does things like this: .Lless_16bytes: cmpl $8, %edx jb .Lless_8bytes /* * Move data from 8 bytes to 15 bytes. */ movq 0*8(%rsi), %r8 movq -1*8(%rsi, %rdx), %r9 movq %r8, 0*8(%rdi) movq %r9, -1*8(%rdi, %rdx) retq and note how for a 8-byte copy, it will do *two* reads of the same 8 bytes, and *two* writes of the same 8 byte destination. That's perfectly ok for regular memory, and it means that the code can handle an arbitrary 8-15 byte copy without any conditionals or loop counts, but it is *not* ok for iomem. Of course, in practice it all just happens to work in almost all situations (because a lot of iomem devices simply won't care), and manual access to iomem is basically extremely rare these days anyway, but it's definitely entirely and utterly broken. End result: we *used* to do this right. For the last eight years our "memcpy_{to,from}io()" has been entirely broken, and apparently even the people who noticed oddities like David, never reported it as breakage but instead just worked around it in drivers. Ho humm. Let me write a generic routine in lib/iomap_copy.c (which already does the "user specifies chunk size" cases), and hook it up for x86. David, are you using a bus analyzer or something to do your testing? I'll have a trial patch for you asap. Linus