https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240
Bug ID: 66240
Summary: RFE: extend -falign-xyz syntax
Product: gcc
Version: 5.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: vda.linux at googlemail dot com
Target Milestone: ---

Experimentally, compiling with -O2 -falign-functions=17 -falign-loops=17 -falign-jumps=17 -falign-labels=17 gives the following:
- functions are aligned using the ".p2align 5,,16" asm directive (align to 32 bytes, the next power of two above 17, but only if that takes at most 16 bytes of padding)
- loops/jumps/labels are aligned using ".p2align 5" (unconditional 32-byte alignment)

Compiling with -Os -falign-functions=17 -falign-loops=17 -falign-jumps=17 -falign-labels=17 gives the following:
- functions are not aligned
- loops/jumps/labels are aligned using ".p2align 5"

Can this be improved so that ".p2align 5,,16" is used in all cases? It shouldn't be that hard...

Next step (what this RFE is all about): -falign-functions=N is too simplistic.

Ingo Molnar ran some tests, and it looks like on the latest x86 CPUs 64-byte alignment runs fastest (he tried many other possibilities). However, developers are less than thrilled by the idea of a slam-dunk 64-byte alignment of everything. Too much waste:

On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.

I propose the following: align functions to 64-byte boundaries *IF* this does not introduce a huge amount of padding.

GNU as already supports this:

.align N1,FILL,N3

"The third expression is also absolute, and is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive."

So what we want is to put something like ".align 64,,7" before every function. 98% of functions in a typical Linux kernel have a first instruction that is 7 or fewer bytes long.
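The effect of the max-skip operand can be checked with a quick model. The Python sketch below is not GNU as code: `align_padding`, `crosses_line` and `compare` are hypothetical helpers. It assumes 64-byte cache lines and that a function's start offset within a line is equally likely to be any value from 0 to 63. It also uses the GNU as rule that if aligning would need more than the maximum, the alignment is not done at all. Under those assumptions it shows that ".align 64,,7" keeps every first instruction of up to 7 bytes inside one line, while plain ".align 64" costs 31.5 bytes of padding on average.

```python
LINE = 64  # assumed L1 cache-line (fetch block) size in bytes


def align_padding(offset: int, align: int, max_skip: int) -> int:
    """Padding that '.align align,,max_skip' would emit at location `offset`.

    If reaching the next boundary would need more than max_skip bytes,
    the alignment is not done at all, so no padding is emitted.
    """
    need = (align - offset % align) % align  # 0 if already aligned
    return need if need <= max_skip else 0


def crosses_line(start: int, length: int) -> bool:
    """True if an instruction of `length` bytes at `start` spans two lines."""
    return start // LINE != (start + length - 1) // LINE


def compare(max_skip: int = 7, max_insn: int = 7) -> dict:
    """Try every start offset within a line, each assumed equally likely."""
    offsets = range(LINE)
    plain = [align_padding(o, LINE, LINE - 1) for o in offsets]  # ".align 64"
    capped = [align_padding(o, LINE, max_skip) for o in offsets]  # ".align 64,,7"
    # Padding at the offsets where the capped directive ends up aligned.
    aligned = [p for o, p in zip(offsets, capped) if (o + p) % LINE == 0]
    lengths = range(1, max_insn + 1)
    # Commented values are for the defaults (max_skip=7, max_insn=7).
    return {
        "mean_pad_plain": sum(plain) / LINE,  # 31.5
        "mean_pad_capped": sum(capped) / LINE,  # 0.4375
        "mean_pad_when_aligned": sum(aligned) / len(aligned),  # 3.5
        # (offset, first-insn length) pairs whose first insn straddles a line
        "split_unaligned": sum(
            crosses_line(o, n) for o in offsets for n in lengths
        ),  # 21
        "split_capped": sum(
            crosses_line(o + p, n) for o, p in zip(offsets, capped) for n in lengths
        ),  # 0
    }


print(compare())
```

Under the same uniform-offset assumption, the capped directive only pads at the 8 offsets where it ends up aligned, by 3.5 bytes on average. Averaged over every function this drops to about 0.44 bytes, because most offsets get no padding at all.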
Thus, with ".align 64,,7", calling any function will, at a minimum, be able to fetch its first insn in one L1 read, not two. And this would be achieved with only ~3.5 bytes per function wasted on padding on average, whereas ".align 64" would waste ~31 bytes on average.

Please extend the -falign-foo=N syntax to, say, -falign-foo=N,M, which generates ".align M,,N-1" or equivalent (so ".align 64,,7" would be requested with -falign-functions=8,64).
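For illustration only, here is a sketch of how the requested option might map onto assembler directives. `current_directive` extrapolates the single -O2 data point above (-falign-functions=17 giving ".p2align 5,,16") to any N. `proposed_directive` is a hypothetical parser for the -falign-foo=N,M form. Neither function is GCC code, and the power-of-two check on M is an assumption about what gas accepts as a byte alignment.

```python
def current_directive(n: int) -> str:
    """Directive for -falign-functions=N, extrapolated from the N=17 case.

    Aligns to 2**L, the smallest power of two >= N, skipping at most N-1 bytes.
    """
    log2 = (n - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return f".p2align {log2},,{n - 1}"


def proposed_directive(value: str) -> str:
    """Hypothetical handling of the proposed -falign-foo=N[,M] value.

    N alone keeps today's meaning; N,M asks for M-byte alignment, but only
    when it costs at most N-1 bytes of padding.
    """
    n_text, _, m_text = value.partition(",")
    n = int(n_text)
    if n < 1:
        raise ValueError("N must be at least 1")
    if not m_text:
        return current_directive(n)
    m = int(m_text)
    # Assumption: gas wants a power-of-two byte alignment here.
    if m < 1 or m & (m - 1):
        raise ValueError("M must be a power of two")
    return f".align {m},,{n - 1}"


print(proposed_directive("17"))    # .p2align 5,,16
print(proposed_directive("8,64"))  # .align 64,,7
```

In this sketch the single-number form is left unchanged, so existing -falign-foo=N command lines keep their current meaning. The second number only adds the larger target alignment.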