Re: [PATCH 0/4] kbuild: build speed improvment of CONFIG_TRIM_UNUSED_KSYMS
On Tue, 9 Mar 2021, Masahiro Yamada wrote:

> On Fri, Feb 26, 2021 at 4:24 AM Nicolas Pitre wrote:
> >
> > If CONFIG_TRIM_UNUSED_KSYMS is enabled then build time will increase.
> > That comes with the feature.
>
> This patch set intends to change this.
> TRIM_UNUSED_KSYMS will build without additional cost,
> like LD_DEAD_CODE_DATA_ELIMINATION.

OK... I do see how you're going about it.

> > > Modules are relocatable ELF.
> > > Clang LTO cannot eliminate any code.
> > > GCC LTO does not work with relocatable ELF
> > > in the first place.
> >
> > I don't think I follow you here. What relocatable ELF has to do with LTO?
>
> What is important is,
> GCC LTO is the feature of gcc, not binutils.
> That is, LD_FINAL is $(CC).

Exact.

> GCC LTO can be implemented for the final link stage
> by using $(CC) as the linker driver.
> Then, it can determine which code is unreachable.
> In other words, GCC LTO works only when building
> the final executable.

Yes. And it does so by filling .o files with its intermediate code
representation and not ELF code.

> On the other hand, a relocatable ELF is created
> by $(LD) -r by combining some objects together.
> The relocatable ELF can be fed to another $(LD) -r,
> or the final link stage.

You still can create relocatable ELF using LTO. But LTO stops there.
From that point on, .o files will no longer contain data that LTO can
use if you further combine those object files together. But until that
point, LTO is still usable.

> As I said above, modules are created by $(LD) -r.
> It is not possible to implement GCC LTO for modules.

If I remember correctly (that was a while ago) the problem with LTO and
the kernel had to do with the fact that every subdirectory was
gathering object files in built-in.o using ld -r. At some point we
switched to gathering object files into built-in.a files where no
linking is taking place. The real linking happens in vmlinux.o where LTO
may now do its magic.

The same is true for modules. Compiling foo_module.c into foo_module.o
will create a .o file with LTO data rather than executable code. But
when you create the final .o for the module then LTO takes place and
produces the relocatable ELF executable.

> > I've successfully used gcc LTO on the kernel quite a while ago.
> >
> > For a reference about binary size reduction with LTO and
> > CONFIG_TRIM_UNUSED_KSYMS please read this article:
> >
> > https://lwn.net/Articles/746780/
>
> Thanks for the great articles.
>
> Just for curiosity, I think you used GCC LTO from
> Andy's GitHub.

Right. I provided the reference in the preceding article:
https://lwn.net/Articles/744507/

> In the article, you took stm32_defconfig as an example,
> but ARM does not select ARCH_SUPPORTS_LTO.
>
> Did you add some local hacks to make LTO work
> for ARM?

Of course. This article was written in 2017 and no LTO support at all
was in mainline back then. But, besides adding CONFIG_LTO, very little
was needed to make it compile, and I did upstream most changes such as
commit 75fea300d7, commit a85b2257a5, commit 5d48417592,
commit 19c233b79d, etc.

> I tried the lto-5.8.1 branch, but
> I did not even succeed in building x86 + LTO.

My latest working LTO branch (i.e. last time I worked on it) is much
older than that.

Maybe people aren't very excited about LTO because it makes the time to
recompile the kernel many times longer because gcc does its
optimization passes on the whole kernel even if you modify a single
file.


Nicolas
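The relocatable-versus-executable distinction discussed in this message is recorded in the ELF header itself, in the 16-bit e_type field that follows the 16-byte e_ident array (ET_REL = 1, ET_EXEC = 2, ET_DYN = 3 in the ELF specification). A minimal Python sketch of that check follows. The function name elf_type, the helper fake_header and the two hand-built 18-byte headers are illustrative assumptions, not part of kbuild or of any tool mentioned in the thread.

```python
import struct

# e_type values from the ELF specification.
ET_NAMES = {1: "relocatable", 2: "executable", 3: "shared object"}


def elf_type(header: bytes) -> str:
    """Classify an ELF file from the first 18 bytes of its header."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is the 16-bit field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return ET_NAMES.get(e_type, f"other ({e_type})")


def fake_header(e_type: int) -> bytes:
    """Build a hypothetical 64-bit little-endian header for the demo."""
    # magic, EI_CLASS=2 (64-bit), EI_DATA=1 (LSB), EI_VERSION=1, EI_OSABI=0,
    # then 8 bytes of padding to complete the 16-byte e_ident array.
    e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
    return e_ident + struct.pack("<H", e_type)


module_like = elf_type(fake_header(1))      # header shaped like a *.ko
executable_like = elf_type(fake_header(2))  # header shaped like an executable
print(module_like, executable_like)
```

On a real build tree one would pass the first 18 bytes of the file, for example `open(path, "rb").read(18)`, instead of the synthetic headers.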
Re: [PATCH 0/4] kbuild: build speed improvment of CONFIG_TRIM_UNUSED_KSYMS
On Fri, Feb 26, 2021 at 4:24 AM Nicolas Pitre wrote:
>
> On Fri, 26 Feb 2021, Masahiro Yamada wrote:
>
> > On Fri, Feb 26, 2021 at 2:20 AM Nicolas Pitre wrote:
> > >
> > > On Fri, 26 Feb 2021, Masahiro Yamada wrote:
> > >
> > > >
> > > > Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> > > > about the build speed.
> > > >
> > > > I re-implemented this feature, and the build time cost is now
> > > > almost unnoticeable level.
> > > >
> > > > I hope this makes Linus happy.
> > >
> > > :-)
> > >
> > > I'm surprised to see that Linus is using this feature. When disabled
> > > (the default) this should have had no impact on the build time.
> >
> > Linus is not using this feature, but does build tests.
> > After pulling the module subsystem pull request in this merge window,
> > CONFIG_TRIM_UNUSED_KSYMS was enabled by allmodconfig.
>
> If CONFIG_TRIM_UNUSED_KSYMS is enabled then build time will increase.
> That comes with the feature.

This patch set intends to change this.
TRIM_UNUSED_KSYMS will build without additional cost,
like LD_DEAD_CODE_DATA_ELIMINATION.

> > > This feature provides a nice security advantage by significantly
> > > reducing the kernel input surface. And people are using that also to
> > > better what third party vendor can and cannot do with a distro kernel,
> > > etc. But that's not the reason why I implemented this feature in the
> > > first place.
> > >
> > > My primary goal was to efficiently reduce the kernel binary size using
> > > LTO even with kernel modules enabled.
> >
> >
> > Clang LTO landed in this MW.
> >
> > Do you think it will reduce the kernel binary size?
> > No, opposite.
>
> LTO ought to reduce binary size. It is rather broken otherwise.
> Having a global view before optimizing allows for the compiler to do
> project wide constant propagation and dead code elimination.
>
> > CONFIG_LTO_CLANG cannot trim any code even if it
> > is obviously unused.
> > Hence, it never reduces the kernel binary size.
> > Rather, it produces a bigger kernel.
>
> Then what's the point?

Presumably, reducing the size is not the main interest
for Googlers.

> > The reason is Clang LTO was implemented against
> > relocatable ELF (vmlinux.o).
>
> That's not true LTO then.

This is the same as what I said in the review process. :-)

https://lore.kernel.org/linux-kbuild/cak7lnasqpogohtuyzbm6n54pzpln35kdxc7vbvwzx8qwumq...@mail.gmail.com/

> > I pointed out this flaw in the review process, but
> > it was dismissed.
> >
> > This is the main reason why I did not give any Ack
> > (but it was merged via Kees Cook's tree).
>
> > So, the help text of this option should be revised:
> >
> >   This option allows for unused exported symbols to be dropped from
> >   the build. In turn, this provides the compiler more opportunities
> >   (especially when using LTO) for optimizing the code and reducing
> >   binary size. This might have some security advantages as well.
> >
> > Clang LTO is opposite to your expectation.
>
> Then Clang LTO is a misnomer. That is the option to revise not this one.
>
> > > Each EXPORT_SYMBOL() created a
> > > symbol dependency that prevented LTO from optimizing out the related
> > > code even though a tiny fraction of those exported symbols were needed.
> > >
> > > The idea behind the recursion was to catch those cases where disabling
> > > an exported symbol within a module would optimize out references to more
> > > exported symbols that, in turn, could be disabled and possibly trigger
> > > yet more code elimination. There is no way that can be achieved without
> > > extra compiler passes in a recursive manner.
> >
> > I do not understand.
> >
> > Modules are relocatable ELF.
> > Clang LTO cannot eliminate any code.
> > GCC LTO does not work with relocatable ELF
> > in the first place.
>
> I don't think I follow you here. What relocatable ELF has to do with LTO?

What is important is,
GCC LTO is the feature of gcc, not binutils.
That is, LD_FINAL is $(CC).

GCC LTO can be implemented for the final link stage
by using $(CC) as the linker driver.
Then, it can determine which code is unreachable.
In other words, GCC LTO works only when building
the final executable.

On the other hand, a relocatable ELF is created
by $(LD) -r by combining some objects together.
The relocatable ELF can be fed to another $(LD) -r,
or the final link stage.

vmlinux is an executable ELF.
modules (*.ko files) are relocatable ELFs.

You can confirm it easily by using the 'file' command.

masahiro@oscar:~/ref/linux$ file vmlinux
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=ee0cef2ff3d9f490e0f5ee1d7e74b19aa167933b, not stripped
masahiro@oscar:~/ref/linux$ file net/ipv4/netfilter/iptable_nat.ko
net/ipv4/netfilter/iptable_nat.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=4829e82f9b9e7fd65be3c19c1cf0e16a7ddf0967, not stripped

Modules are not filled with
Re: [PATCH 0/4] kbuild: build speed improvment of CONFIG_TRIM_UNUSED_KSYMS
On Fri, 26 Feb 2021, Masahiro Yamada wrote:

> On Fri, Feb 26, 2021 at 2:20 AM Nicolas Pitre wrote:
> >
> > On Fri, 26 Feb 2021, Masahiro Yamada wrote:
> >
> > >
> > > Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> > > about the build speed.
> > >
> > > I re-implemented this feature, and the build time cost is now
> > > almost unnoticeable level.
> > >
> > > I hope this makes Linus happy.
> >
> > :-)
> >
> > I'm surprised to see that Linus is using this feature. When disabled
> > (the default) this should have had no impact on the build time.
>
> Linus is not using this feature, but does build tests.
> After pulling the module subsystem pull request in this merge window,
> CONFIG_TRIM_UNUSED_KSYMS was enabled by allmodconfig.

If CONFIG_TRIM_UNUSED_KSYMS is enabled then build time will increase.
That comes with the feature.

> > This feature provides a nice security advantage by significantly
> > reducing the kernel input surface. And people are using that also to
> > better what third party vendor can and cannot do with a distro kernel,
> > etc. But that's not the reason why I implemented this feature in the
> > first place.
> >
> > My primary goal was to efficiently reduce the kernel binary size using
> > LTO even with kernel modules enabled.
>
>
> Clang LTO landed in this MW.
>
> Do you think it will reduce the kernel binary size?
> No, opposite.

LTO ought to reduce binary size. It is rather broken otherwise.
Having a global view before optimizing allows for the compiler to do
project wide constant propagation and dead code elimination.

> CONFIG_LTO_CLANG cannot trim any code even if it
> is obviously unused.
> Hence, it never reduces the kernel binary size.
> Rather, it produces a bigger kernel.

Then what's the point?

> The reason is Clang LTO was implemented against
> relocatable ELF (vmlinux.o).

That's not true LTO then.

> I pointed out this flaw in the review process, but
> it was dismissed.
>
> This is the main reason why I did not give any Ack
> (but it was merged via Kees Cook's tree).

> So, the help text of this option should be revised:
>
>   This option allows for unused exported symbols to be dropped from
>   the build. In turn, this provides the compiler more opportunities
>   (especially when using LTO) for optimizing the code and reducing
>   binary size. This might have some security advantages as well.
>
> Clang LTO is opposite to your expectation.

Then Clang LTO is a misnomer. That is the option to revise not this one.

> > Each EXPORT_SYMBOL() created a
> > symbol dependency that prevented LTO from optimizing out the related
> > code even though a tiny fraction of those exported symbols were needed.
> >
> > The idea behind the recursion was to catch those cases where disabling
> > an exported symbol within a module would optimize out references to more
> > exported symbols that, in turn, could be disabled and possibly trigger
> > yet more code elimination. There is no way that can be achieved without
> > extra compiler passes in a recursive manner.
>
> I do not understand.
>
> Modules are relocatable ELF.
> Clang LTO cannot eliminate any code.
> GCC LTO does not work with relocatable ELF
> in the first place.

I don't think I follow you here. What relocatable ELF has to do with LTO?

I've successfully used gcc LTO on the kernel quite a while ago.

For a reference about binary size reduction with LTO and
CONFIG_TRIM_UNUSED_KSYMS please read this article:

https://lwn.net/Articles/746780/


Nicolas
Re: [PATCH 0/4] kbuild: build speed improvment of CONFIG_TRIM_UNUSED_KSYMS
On Fri, Feb 26, 2021 at 2:20 AM Nicolas Pitre wrote:
>
> On Fri, 26 Feb 2021, Masahiro Yamada wrote:
>
> >
> > Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> > about the build speed.
> >
> > I re-implemented this feature, and the build time cost is now
> > almost unnoticeable level.
> >
> > I hope this makes Linus happy.
>
> :-)
>
> I'm surprised to see that Linus is using this feature. When disabled
> (the default) this should have had no impact on the build time.

Linus is not using this feature, but does build tests.
After pulling the module subsystem pull request in this merge window,
CONFIG_TRIM_UNUSED_KSYMS was enabled by allmodconfig.

> This feature provides a nice security advantage by significantly
> reducing the kernel input surface. And people are using that also to
> better what third party vendor can and cannot do with a distro kernel,
> etc. But that's not the reason why I implemented this feature in the
> first place.
>
> My primary goal was to efficiently reduce the kernel binary size using
> LTO even with kernel modules enabled.

Clang LTO landed in this MW.

Do you think it will reduce the kernel binary size?
No, opposite.

CONFIG_LTO_CLANG cannot trim any code even if it
is obviously unused.
Hence, it never reduces the kernel binary size.
Rather, it produces a bigger kernel.

The reason is Clang LTO was implemented against
relocatable ELF (vmlinux.o).

I pointed out this flaw in the review process, but
it was dismissed.

This is the main reason why I did not give any Ack
(but it was merged via Kees Cook's tree).

So, the help text of this option should be revised:

  This option allows for unused exported symbols to be dropped from
  the build. In turn, this provides the compiler more opportunities
  (especially when using LTO) for optimizing the code and reducing
  binary size. This might have some security advantages as well.

Clang LTO is opposite to your expectation.

> Each EXPORT_SYMBOL() created a
> symbol dependency that prevented LTO from optimizing out the related
> code even though a tiny fraction of those exported symbols were needed.
>
> The idea behind the recursion was to catch those cases where disabling
> an exported symbol within a module would optimize out references to more
> exported symbols that, in turn, could be disabled and possibly trigger
> yet more code elimination. There is no way that can be achieved without
> extra compiler passes in a recursive manner.

I do not understand.

Modules are relocatable ELF.
Clang LTO cannot eliminate any code.
GCC LTO does not work with relocatable ELF
in the first place.

Are you talking about a story in a perfect world?
But, I do not know how LTO can eliminate dead code
from relocatable ELF.

- Current implementation

  CLANG LTO works against vmlinux.o, so it is completely
  useless for the purpose of eliminating dead code.
  So, this case is a don't-care.
  TRIM_UNUSED_KSYMS removes only the meta data of EXPORT_SYMBOL,
  but no further optimization anyway.

- What if Clang LTO had been implemented in the final link?
  (this means LTO runs 3 times if KALLSYMS_ALL is enabled)

  With proper linker script input with /DISCARD/,
  the meta-data of EXPORT_SYMBOL() will be dropped,
  and LTO should be able to do further dead code elimination.
  So, I guess we do not need to no-op EXPORT_SYMBOL by CPP
  (unless I am missing something).

--
Best Regards
Masahiro Yamada
Re: [PATCH 0/4] kbuild: build speed improvment of CONFIG_TRIM_UNUSED_KSYMS
On Fri, 26 Feb 2021, Masahiro Yamada wrote:

>
> Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> about the build speed.
>
> I re-implemented this feature, and the build time cost is now
> almost unnoticeable level.
>
> I hope this makes Linus happy.

:-)

I'm surprised to see that Linus is using this feature. When disabled
(the default) this should have had no impact on the build time.

This feature provides a nice security advantage by significantly
reducing the kernel input surface. And people are using that also to
better what third party vendor can and cannot do with a distro kernel,
etc. But that's not the reason why I implemented this feature in the
first place.

My primary goal was to efficiently reduce the kernel binary size using
LTO even with kernel modules enabled. Each EXPORT_SYMBOL() created a
symbol dependency that prevented LTO from optimizing out the related
code even though a tiny fraction of those exported symbols were needed.

The idea behind the recursion was to catch those cases where disabling
an exported symbol within a module would optimize out references to more
exported symbols that, in turn, could be disabled and possibly trigger
yet more code elimination. There is no way that can be achieved without
extra compiler passes in a recursive manner.


Nicolas
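The recursion described in this message can be pictured as a fixed-point computation: drop the exports nobody references, let the optimizer remove code that is no longer reachable, then re-check whether that removal orphaned further exports. The Python sketch below is a toy model under stated assumptions, not the kbuild implementation. The call graph, the symbol names (core_a, x_helper and so on) and the function trim_unused_exports are all hypothetical, and "LTO" is reduced to plain reachability from entry points and kept exports.

```python
from collections import deque


def reachable(calls, starts):
    """Return every function reachable from `starts` in the call graph."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        fn = queue.popleft()
        for callee in calls.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def trim_unused_exports(calls, owner, roots, exports):
    """Shrink the export set until no further export can be dropped.

    Each loop iteration stands in for one extra build pass.
    """
    kept = set(exports)
    passes = 0
    while True:
        passes += 1
        # Code that survives: entry points plus anything still exported.
        live = reachable(calls, set(roots) | kept)
        # An export is needed only if live code in another binary uses it.
        needed = {callee
                  for fn in live
                  for callee in calls.get(fn, ())
                  if owner[callee] != owner[fn]}
        new_kept = kept & needed
        if new_kept == kept:
            return kept, live, passes
        kept = new_kept


# Hypothetical example: one kernel image and one module.
owner = {"start_kernel": "vmlinux", "core_a": "vmlinux",
         "core_b": "vmlinux", "core_c": "vmlinux",
         "x_init": "mod_x", "x_helper": "mod_x"}
calls = {"x_init": {"core_a"}, "x_helper": {"core_b"}}
roots = {"start_kernel", "x_init"}
exports = {"core_a", "core_b", "core_c", "x_helper"}

kept, live, passes = trim_unused_exports(calls, owner, roots, exports)
print(sorted(kept), passes)
```

In this model core_c and x_helper lose their exports in the first pass, but core_b only becomes droppable in the second pass, once x_helper itself has been eliminated. That second-order effect is what a single pass cannot see.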