* Ingo Molnar <mi...@kernel.org> wrote:
> The kernel text size reduction with Jens's patch is small but real:
>
>       text     data      bss      dec     hex filename
>   19572694 11516934 19873888 50963516 309a43c vmlinux.before
>   19572468 11516934 19873888 50963290 309a35a vmlinux.after
>
> But I checked the disassembly, and it's not a real win, the new code is
> actually more complex than the old one, as expected, but GCC (7.3.0) does
> some particularly stupid things which bloat the generated code.

So I dug into this some more:

1)

Firstly I tracked down GCC bloating the might_fault() checks and the related
out-of-line exception handling code, which bloats the full generated function.

2)

But even with that complication eliminated, there's a size reduction when
Jens's patch is applied, which is puzzling:

      text     data      bss      dec     hex filename
  19563640 11516790 19882080 50962510 309a04e vmlinux.before
  19563274 11516790 19882080 50962144 3099ee0 vmlinux.after

but this is entirely due to the .altinstructions section being counted as
part of the 'text' of the vmlinux - while in reality it's not:

3)

The _real_ text part of the vmlinux gets bloated by Jens's patch:

  ffffffff81000000 <_stext>:

  before: ffffffff81b0e5e0 <__clear_user>:
  after:  ffffffff81b0e670 <__clear_user>:

I.e. we get an e5e0 => e670 bloat (0x90, i.e. 144 bytes), as expected.

In the config I tested, a later section of the kernel image first aligns away
the bloat:

  before: ffffffff82fa6321 <.altinstr_aux>:
  after:  ffffffff82fa6321 <.altinstr_aux>:

and then artificially debloats the modified kernel via the altinstructions
section:

  before:
    Disassembly of section .exit.text:
    ffffffff83160798 <intel_uncore_exit>:

  after:
    Disassembly of section .exit.text:
    ffffffff83160608 <intel_uncore_exit>:

Note that there's a third level of obfuscation here: Jens's patch actually
*adds* a new altinstructions statement:

+       /*
+        * For smaller copies, don't use ERMS as it's slower.
+        */
+       if (len < 128) {
+               alternative_call(copy_user_generic_unrolled,
+                               copy_user_generic_string, X86_FEATURE_REP_GOOD,
+                               ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
+                                           "=d" (len)),
+                               "1" (to), "2" (from), "3" (len)
+                               : "memory", "rcx", "r8", "r9", "r10", "r11");
+               return ret;
+       }
+
        /*
         * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
         * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
         * Otherwise, use copy_user_generic_unrolled.
         */
        alternative_call_2(copy_user_generic_unrolled,
-                        copy_user_generic_string,
-                        X86_FEATURE_REP_GOOD,
-                        copy_user_enhanced_fast_string,
-                        X86_FEATURE_ERMS,
+                        copy_user_generic_string, X86_FEATURE_REP_GOOD,
+                        copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
                         ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
                                     "=d" (len)),
                         "1" (to), "2" (from), "3" (len)

So how can this change possibly result in a *smaller* altinstructions section?

4)

The reason is GCC's somewhat broken __builtin_constant_p() logic, which leaves
~10% of the constant call sites actually active, but those are then optimized
by GCC's later stages: the alternative_call_2() gets optimized out and
replaced with the alternative_call() call.

This is where Jens's patch 'debloats' the vmlinux, confuses the 'size'
utility, and gains its code reduction street cred.

Note to self: watch out for patches that change altinstructions, and don't
make premature vmlinux size impact assumptions. :-)

Thanks,

	Ingo
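
P.S.: if anyone wants to poke at the __builtin_constant_p() behaviour from 4)
outside of a full kernel build, below is a minimal user-space sketch. All the
names in it (do_copy(), copy_small(), copy_large()) are made up for the
illustration - it is not the kernel's copy_*_user() code - and the
__builtin_constant_p(len) guard is only there to make the builtin's behaviour
observable. Compile it with "gcc -O2" versus "gcc -O0" and compare the
"objdump -d" output: at -O2 the copy_large() path is eliminated as dead code
after inlining, which is the same kind of late elimination that makes an
alternative_call_2() site disappear in the vmlinux comparison above.

  /*
   * Minimal user-space sketch, NOT the kernel's copy_*_user() code:
   * toy names and simplified logic, only meant to demonstrate the
   * mechanism from point 4) - once GCC's later passes have inlined
   * do_copy() and proved 'len' constant, the copy_large() branch is
   * dead code and the call to it disappears.
   */
  #include <stdio.h>
  #include <string.h>

  /* noinline keeps the helpers visible as separate calls in objdump. */
  static __attribute__((noinline))
  unsigned long copy_small(void *to, const void *from, size_t len)
  {
          memcpy(to, from, len);          /* stands in for the unrolled copy */
          return 0;
  }

  static __attribute__((noinline))
  unsigned long copy_large(void *to, const void *from, size_t len)
  {
          memcpy(to, from, len);          /* stands in for the REP/ERMS copy */
          return 0;
  }

  static inline unsigned long do_copy(void *to, const void *from, size_t len)
  {
          /* The 'len < 128' cutoff mirrors the patch's new fast path. */
          if (__builtin_constant_p(len) && len < 128)
                  return copy_small(to, from, len);

          return copy_large(to, from, len);
  }

  int main(void)
  {
          char src[64] = "hello", dst[64];

          /*
           * sizeof(src) is a compile-time constant.  At -O2, do_copy()
           * is inlined, __builtin_constant_p(len) folds to 1 and only
           * the copy_small() call survives; at -O0 the builtin
           * evaluates to 0 and copy_large() is called instead.  The
           * result is the same either way, but the generated code is
           * not.
           */
          do_copy(dst, src, sizeof(src));
          printf("%s\n", dst);
          return 0;
  }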