On 11/21/18 10:27 AM, Linus Torvalds wrote:
> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pab...@redhat.com> wrote:
>>
>> In my experiments 64 bytes was the break even point for all the CPUs I
>> had handy, but I guess that may change with other models.
>
> Note that experiments with memcpy speed are almost invariably broken.
> microbenchmarks don't show the impact of I$, but they also don't show
> the impact of _behavior_.
>
> For example, there might be things like "repeat strings do cacheline
> optimizations" that end up meaning that cachelines stay in L2, for
> example, and are never brought into L1. That can be a really good
> thing, but it can also mean that now the result isn't as close to the
> CPU, and the subsequent use of the cacheline can be costlier.
Totally agree, which is why all my testing was NOT microbenchmarking.

> I say "go for upping the limit to 128 bytes".

See below...

> That said, if the aio user copy is _so_ critical that it's this
> noticeable, there may be other issues. Sometimes _real_ cost of small
> user copies is often the STAC/CLAC, more so than the "rep movs".
>
> It would be interesting to know exactly which copy it is that matters
> so much... *inlining* the erms case might show that nicely in
> profiles.

Oh I totally agree, which is why I've since gone a different route. The
copy that matters is the copy_from_user() of the iocb, which is 64
bytes. Even for 4k IOs, copying 64b per IO is somewhat counterproductive
for O_DIRECT. Playing around with this:

http://git.kernel.dk/cgit/linux-block/commit/?h=aio-poll&id=ed0a0a445c0af4cfd18b0682511981eaf352d483

since we're doing a new sys_io_setup2() for polled aio anyway. This
completely avoids the iocb copy, but that's just for my initial
particular gripe.

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
 	ASM_STAC
-	cmpl $64,%edx
-	jb .L_copy_short_string	/* less then 64 bytes, avoid the costly 'rep' */
+	cmpl $128,%edx
+	jb .L_copy_short_string	/* less then 128 bytes, avoid costly 'rep' */
 	movl %edx,%ecx
1:	rep movsb

-- 
Jens Axboe