On 11/21/18 10:27 AM, Linus Torvalds wrote:
> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pab...@redhat.com> wrote:
>>
>> In my experiments 64 bytes was the break even point for all the CPUs I
>> had handy, but I guess that may change with other models.
> 
> Note that experiments with memcpy speed are almost invariably broken.
> microbenchmarks don't show the impact of I$, but they also don't show
> the impact of _behavior_.
> 
> For example, there might be things like "repeat strings do cacheline
> optimizations" that end up meaning that cachelines stay in L2, for
> example, and are never brought into L1. That can be a really good
> thing, but it can also mean that now the result isn't as close to the
> CPU, and the subsequent use of the cacheline can be costlier.

Totally agree, which is why all my testing was NOT microbenchmarking.

> I say "go for upping the limit to 128 bytes".

See below...

> That said, if the aio user copy is _so_ critical that it's this
> noticeable, there may be other issues. Sometimes _real_ cost of small
> user copies is often the STAC/CLAC, more so than the "rep movs".
> 
> It would be interesting to know exactly which copy it is that matters
> so much...  *inlining* the erms case might show that nicely in
> profiles.

Oh I totally agree, which is why I since went a different route. The
copy that matters is the copy_from_user() of the iocb, which is 64
bytes. Even for 4k IOs, copying 64b per IO is somewhat counter
productive for O_DIRECT.

Playing around with this:

http://git.kernel.dk/cgit/linux-block/commit/?h=aio-poll&id=ed0a0a445c0af4cfd18b0682511981eaf352d483

since we're doing a new sys_io_setup2() for polled aio anyway. This
completely avoids the iocb copy, but that's just for my initial
particular gripe.


diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
        ASM_STAC
-       cmpl $64,%edx
-       jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
+       cmpl $128,%edx
+       jb .L_copy_short_string /* less then 128 bytes, avoid costly 'rep' */
        movl %edx,%ecx
 1:     rep
        movsb

-- 
Jens Axboe

Reply via email to