So this is a fun one... While I was doing the aio polled work, I noticed
that the submitting process spent a substantial amount of time copying
data to/from userspace. For aio, that's the iocb and the io_event, which
are 64 and 32 bytes respectively. Looking closer at this, it seems that
ERMS rep movsb is SLOWER for small copies, due to a higher startup
cost.
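
To illustrate what I mean by startup cost, here's a rough userspace
sketch (not part of the patch; the rdtsc timing and names are mine,
illustration only) that times 64-byte copies done with rep movsb against
a plain word-at-a-time loop:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS	(1UL << 24)
#define LEN	64

/* not static, so the stores can't be optimized away */
char src[LEN], dst[LEN];

static inline void copy_movsb(void *to, const void *from, unsigned long len)
{
	/* what the ERMS copy boils down to: a single rep movsb */
	asm volatile("rep movsb"
		     : "+D" (to), "+S" (from), "+c" (len)
		     : : "memory");
}

static inline void copy_loop(void *to, const void *from, unsigned long len)
{
	/* crude stand-in for the unrolled copy: 8 bytes at a time */
	unsigned long *d = to;
	const unsigned long *s = from;

	for (; len >= sizeof(long); len -= sizeof(long))
		*d++ = *s++;
}

int main(void)
{
	uint64_t t0, t1, t2;
	unsigned long i;

	t0 = __rdtsc();
	for (i = 0; i < ITERS; i++) {
		copy_movsb(dst, src, LEN);
		asm volatile("" : : : "memory");	/* keep the compiler honest */
	}
	t1 = __rdtsc();
	for (i = 0; i < ITERS; i++) {
		copy_loop(dst, src, LEN);
		asm volatile("" : : : "memory");
	}
	t2 = __rdtsc();

	printf("rep movsb: %.2f cycles/copy\n", (double)(t1 - t0) / ITERS);
	printf("loop     : %.2f cycles/copy\n", (double)(t2 - t1) / ITERS);
	return 0;
}

Nothing scientific, but it's enough to see where the fixed rep movsb
startup overhead starts to matter for small sizes.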
I came up with this hack to test it out, and lo and behold, we now cut
the time spent copying in half.
Since these kinds of patches tend to lend themselves to bike shedding, I
also ran a string of kernel compilations out of RAM. Results are as
follows:
Patched : 62.86s avg, stddev 0.65s
Stock : 63.73s avg, stddev 0.67s
which would also seem to indicate that we're faster punting smaller
(< 128 byte) copies to the non-ERMS path.
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Interestingly, text size is smaller with the patch as well?!
I'm sure there are smarter ways to do this, but the results look fairly
conclusive. FWIW, the behavioral change was introduced by:
commit 954e482bde20b0e208fd4d34ef26e10afd194600
Author: Fenghua Yu <fenghua...@intel.com>
Date: Thu May 24 18:19:45 2012 -0700
x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
which contains nothing in terms of benchmarking or results, just claims
that the new hotness is better.
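
To put the sizes in context: the aio submit path pulls in each iocb with
a plain copy_from_user() of sizeof(struct iocb), i.e. 64 bytes, roughly
along these lines (simplified sketch, not the exact fs/aio.c code):

	struct iocb iocb;

	if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
		return -EFAULT;

With the change below, copies of that size take the len < 128 path
instead of ERMS.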
Signed-off-by: Jens Axboe <ax...@kernel.dk>
---
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index a9d637bc301d..7dbb78827e64 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
 {
 	unsigned ret;
 
+	/*
+	 * For smaller copies, don't use ERMS as it's slower.
+	 */
+	if (len < 128) {
+		alternative_call(copy_user_generic_unrolled,
+				copy_user_generic_string, X86_FEATURE_REP_GOOD,
+				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
+					     "=d" (len)),
+				"1" (to), "2" (from), "3" (len)
+				: "memory", "rcx", "r8", "r9", "r10", "r11");
+		return ret;
+	}
+
 	/*
 	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
 	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
 	 * Otherwise, use copy_user_generic_unrolled.
 	 */
 	alternative_call_2(copy_user_generic_unrolled,
-			 copy_user_generic_string,
-			 X86_FEATURE_REP_GOOD,
-			 copy_user_enhanced_fast_string,
-			 X86_FEATURE_ERMS,
+			 copy_user_generic_string, X86_FEATURE_REP_GOOD,
+			 copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
 			 ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
 				     "=d" (len)),
 			 "1" (to), "2" (from), "3" (len)