On Thu, Dec 21, 2023 at 2:05 PM Haakon Bugge <[email protected]> wrote:
> > On 18 Dec 2023, at 23:09, Paul Moore <[email protected]> wrote:
> > On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <[email protected]>
> > wrote:
> >>> On 14 Dec 2023, at 00:45, Paul Moore <[email protected]> wrote:
> >>> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <[email protected]>
> >>> wrote:
> >>>>
> >>>> For the most time-consuming function, when running a syscall benchmark
> >>>> with STIG compliant audit rules:
> >>>>
> >>>>   Overhead  Command      Shared Object      Symbol
> >>>>   ........  ...........  .................  .....................
> >>>>
> >>>>     27.62%  syscall_lat  [kernel.kallsyms]  [k] __audit_filter_op
> >>>>
> >>>> we apply codegen optimizations, which speed up syscall performance
> >>>> by around 17% on an Intel Cascade Lake system.
> >>>>
> >>>> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
> >>>> application that measures average syscall latency from getpid()
> >>>> running 100 million rounds.
> >>>>
> >>>> Between each perf run, we reboot the system and wait until the
> >>>> one-minute load average is less than 1.0.
> >>>>
> >>>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> >>>> amplify the changes in the audit system.
> >>>>
> >>>> Let the base kernel be v6.6-rc4 booted with "audit=1" and
> >>>> "mitigations=off" and with the commit "audit: Vary struct audit_entry
> >>>> alignment" on an Intel Cascade Lake system. The following three
> >>>> metrics are reported: nanoseconds per syscall, L1d misses per syscall,
> >>>> and finally Instructions Per Cycle (ipc).
> >>>>
> >>>> Base vs. base + this commit gives:
> >>>>
> >>>> ns per call:
> >>>>         min      avg      max      pstdev
> >>>> -       203      203      209      0.954149
> >>>> +       173      173      178      0.884534
> >>>>
> >>>> L1d misses per syscall:
> >>>>         min      avg      max      pstdev
> >>>> -       0.012    0.103    0.817    0.238352
> >>>> +       0.010    0.209    1.235    0.399416
> >>>>
> >>>> ipc:
> >>>>         min      avg      max      pstdev
> >>>> -       2.320    2.329    2.330    0.003000
> >>>> +       2.430    2.436    2.440    0.004899
> >>>>
> >>>> Signed-off-by: Håkon Bugge <[email protected]>
> >>>> ---
> >>>>  kernel/auditsc.c | 2 ++
> >>>>  1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >>>> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
> >>>> --- a/kernel/auditsc.c
> >>>> +++ b/kernel/auditsc.c
> >>>> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> >>>>   * parameter can be NULL, but all others must be specified.
> >>>>   * Returns 1/true if the filter finds a match, 0/false if none are found.
> >>>>   */
> >>>> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
> >>>
> >>> The kernel doesn't really make use of #pragma optimization statements
> >>> like this, at least not in any of the core areas, and I'm not
> >>> interested in being the first to do so. I appreciate the time and
> >>> effort that you have spent profiling the audit subsystem, but this
> >>> isn't a patch I can accept at this point in time, I'm sorry.
> >>
> >> Fair enough. Will a function attribute aka:
> >>
> >>   __attribute__((optimize("foo=bar")))
> >>
> >> be acceptable for you?
> >
> > Unless you can show me widespread acceptance of these types of
> > optimizations in core kernel code, my answer is no. I'm sorry.
>
> Hi Paul,
>
> Well, I guess we both know that I cannot show you any "widespread
> acceptance" of the sort.
> Here is what I see:
>
> arch/arm/lib/xor-neon.c:#pragma GCC optimize "tree-vectorize"
> (not what you describe as core kernel)
>
> lib/zstd/common/compiler.h:# define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
> (core, but not widespread)
>
> Then of course, there are more examples of specific optimizations, or the
> disabling thereof, in Makefiles:
>
> kernel/debug/kdb/Makefile:clean-files := gen-kdb_cmds.c
> kernel/kcsan/Makefile:CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \
> kernel/kcsan/Makefile:	-fno-stack-protector -DDISABLE_BRANCH_PROFILING
> kernel/kcsan/Makefile:CFLAGS_kcsan_test.o := $(CFLAGS_KCSAN) -fno-omit-frame-pointer
> kernel/Makefile:CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector
> kernel/Makefile:clean-files := kheaders_data.tar.xz kheaders.md5
> kernel/sched/Makefile:# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
> kernel/sched/Makefile:CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
> kernel/bpf/Makefile:cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
> kernel/entry/Makefile:CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong
> kernel/entry/Makefile:CFLAGS_common.o += -fno-stack-protector
> kernel/rcu/Makefile:KBUILD_CFLAGS += -g -fno-omit-frame-pointer
>
> My view is:
>
> Coding time-critical functions that access structs allocated by
> k{m,z}alloc and family in close temporal proximity is, with all respect,
> not good programming practice, as depicted by the example in this patch
> series.
>
> I have to accept that you will not accept this patch series or a v2 thereof.
>
> But this leaves (210 - 173) = 37 nanoseconds of unnecessary overhead per
> syscall on the floor.
>
> It should be unnecessary to quote the number of Cascade Lake systems being
> deployed, or how many of them run with audit enabled and STIG compliant
> rules, to point out the significance of those nanoseconds.
>
> I would be very appreciative if you could come up with a better and more
> acceptable solution, in order to shave off these nanoseconds.
I appreciate and understand your frustration, but sometimes the subjective
quality and maintainability of the code needs to take precedence, and in my
mind the tradeoff for a 37ns win is not desirable. Thank you again for your
time and effort.

--
paul-moore.com
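
The syscall_lat benchmark quoted above is not included in the thread. A
minimal sketch of a getpid() latency loop matching that description (100
million rounds, reporting average nanoseconds per call) might look like the
code below; the program structure and the use of the raw syscall() wrapper
are assumptions, not the actual benchmark.

  /* Rough sketch only: the real syscall_lat source is not part of this
   * thread.  Measures average getpid() syscall latency over 100M rounds. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  int main(void)
  {
          const long rounds = 100000000L;         /* 100 million rounds */
          struct timespec start, end;

          clock_gettime(CLOCK_MONOTONIC, &start);
          for (long i = 0; i < rounds; i++)
                  syscall(SYS_getpid);            /* force a real kernel entry */
          clock_gettime(CLOCK_MONOTONIC, &end);

          double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                      (end.tv_nsec - start.tv_nsec);
          printf("%.1f ns per syscall\n", ns / rounds);
          return 0;
  }

Building it with something like "gcc -O2 -o syscall_lat syscall_lat.c" and
running it under "perf stat -d -r 5 ./syscall_lat" would roughly reproduce
the setup described in the commit message.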

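For reference, the function-attribute variant floated in the exchange above
would attach the same codegen options to a single function instead of the
whole translation unit. The sketch below uses a stand-in filter loop, since
the real __audit_filter_op() signature is not quoted in this thread; it
illustrates the GCC syntax only, not an accepted approach.

  #include <stdio.h>

  /* Stand-in only: shows how the options from the rejected #pragma could be
   * expressed as a per-function GCC attribute.  Not the real audit code. */
  __attribute__((optimize("unswitch-loops", "align-loops=16", "align-jumps=16")))
  static int example_filter_op(const unsigned long *masks, int nr, unsigned long bit)
  {
          int hits = 0;

          for (int i = 0; i < nr; i++)
                  if (masks[i] & (1UL << bit))
                          hits++;
          return hits;
  }

  int main(void)
  {
          unsigned long masks[3] = { 0x1, 0x3, 0x4 };

          printf("%d\n", example_filter_op(masks, 3, 0));
          return 0;
  }

Per-object Makefile flags of the CFLAGS_*.o kind listed above would be
another conventional place to hang such options; that route was not pursued
in this thread.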