[Bug target/112573] Suboptimal code generation with `-fdata-sections` on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112573 --- Comment #2 from Julius Werner --- Sorry, I don't really know what reassociation means in this context. Are you saying that the behavior is WAI? Note that the problem also exists when you write the accesses in h() as: *(a + (y + 3)) = (char)x; so I don't really think this is just an order-of-operations / overflow thing (also, how would that explain why it only occurs with -fdata-sections?). According to my understanding of C, `*(a + (y + 3))` and `a[y + 3]` are different notations for exactly the same expression, so they shouldn't generate different code.
[Bug c/112573] New: Suboptimal code generation with `-fdata-sections` on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112573 Bug ID: 112573 Summary: Suboptimal code generation with `-fdata-sections` on aarch64 Product: gcc Version: 13.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jwerner at chromium dot org Target Milestone: --- Target: aarch64 I noticed some weird code generation behavior in aarch64 that seems to be a new regression in GCC 13 or 12. Consider the following minimal test case: char a[32]; void f(int x, int y) { a[y + 3] = (char)x; a[y + 2] = (char)(x >> 8); a[y + 1] = (char)(x >> 16); a[y + 0] = (char)(x >> 24); } void h(int x, int y) { *((a + y) + 3) = (char)x; *((a + y) + 2) = (char)(x >> 8); *((a + y) + 1) = (char)(x >> 16); *((a + y) + 0) = (char)(x >> 24); } These two functions should, of course, be 100% equivalent. Strangely enough they still don't give the same code when compiling with -Os, but the difference is minimal: (File Offset: 0x40): 0: 11000c23add w3, w1, #0x3 4: 9002adrpx2, 0 (File Offset: 0x40) 4: R_AARCH64_ADR_PREL_PG_HI21 .bss 8: 9142add x2, x2, #0x0 8: R_AARCH64_ADD_ABS_LO12_NC.bss c: 13087c04asr w4, w0, #8 10: 3823c840strbw0, [x2, w3, sxtw] 14: 11000823add w3, w1, #0x2 18: 3823c844strbw4, [x2, w3, sxtw] 1c: 11000423add w3, w1, #0x1 20: 13107c04asr w4, w0, #16 24: 13187c00asr w0, w0, #24 28: 3823c844strbw4, [x2, w3, sxtw] 2c: 3821c840strbw0, [x2, w1, sxtw] 30: d65f03c0ret 0034 (File Offset: 0x74): 34: 9002adrpx2, 0 (File Offset: 0x40) 34: R_AARCH64_ADR_PREL_PG_HI21 .bss 38: 9142add x2, x2, #0x0 38: R_AARCH64_ADD_ABS_LO12_NC .bss 3c: 8b21c043add x3, x2, w1, sxtw 40: 13087c04asr w4, w0, #8 44: 39000864strbw4, [x3, #2] 48: 13107c04asr w4, w0, #16 4c: 39000464strbw4, [x3, #1] 50: 39000c60strbw0, [x3, #3] 54: 13187c00asr w0, w0, #24 58: 3821c840strbw0, [x2, w1, sxtw] 5c: d65f03c0ret However, when I add the -fdata-sections flag, the code for f() remains the same, but the code for h() becomes this weird thing: 0034 (File Offset: 0x74): 34: 93407c21sxtwx1, w1 38: 9002adrpx2, 0 (File Offset: 0x40) 38: R_AARCH64_ADR_PREL_PG_HI21 a+0x3 3c: 9142add x2, x2, #0x0 3c: R_AARCH64_ADD_ABS_LO12_NC a+0x3 40: 13087c03asr w3, w0, #8 44: 38216840strbw0, [x2, x1] 48: 9002adrpx2, 0 (File Offset: 0x40) 48: R_AARCH64_ADR_PREL_PG_HI21 a+0x2 4c: 9142add x2, x2, #0x0 4c: R_AARCH64_ADD_ABS_LO12_NC a+0x2 50: 38216843strbw3, [x2, x1] 54: 9002adrpx2, 0 (File Offset: 0x40) 54: R_AARCH64_ADR_PREL_PG_HI21 a+0x1 58: 9142add x2, x2, #0x0 58: R_AARCH64_ADD_ABS_LO12_NC a+0x1 5c: 13107c03asr w3, w0, #16 60: 13187c00asr w0, w0, #24 64: 38216843strbw3, [x2, x1] 68: 9002adrpx2, 0 (File Offset: 0x40) 68: R_AARCH64_ADR_PREL_PG_HI21 a 6c: 9142add x2, x2, #0x0 6c: R_AARCH64_ADD_ABS_LO12_NC a 70: 38216840strbw0, [x2, x1] 74: d65f03c0ret There should be absolutely no reason to reload the address of `a` from scratch for each access. `-fdata-sections` also shouldn't have any influence at all on how the code inside a function looks like, as far as I'm aware. When I try this on GCC 11.2.0 instead, the output for h() with -fdata-sections is the same short code that it is without. Also, there is of course the question why GCC doesn't just reduce this code to the `rev` instruction (it doesn't seem to be able to figure that out on either GCC 11 or GCC 13, but it would be able to do it if the `y` parameter was a constant instead).
[Bug middle-end/99578] [11/12 Regression] gcc-11 -Warray-bounds or -Wstringop-overread warning when accessing a pointer from integer literal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578 Julius Werner changed: What|Removed |Added CC||jwerner at chromium dot org --- Comment #32 from Julius Werner --- I second Kees' request to please (permanently) add a flag to disable this inference and prevent GCC from making any assumptions about object sizes for pointers cast from integers. Ideally it should be a separate one-off flag rather than a level that needs to be set for a variety of individual warnings. This is affecting almost every systems programming project there is and going against 40 years of common C practice. The standard has never really specified a way to do MMIO hardware accesses, yet it's something we need to do in practice and casting integer literals to pointers is one of the main ways we do it. I don't think it's reasonable to suddenly start in 2022 to prosecute a "violation" that has been this prevalent for decades in programming practice. If you want to make these new safety checks available for those who want them that's fine, but please retain the option to stick with the existing behavior for those projects where this really causes a ton more problems than it could ever solve.
[Bug c/97434] New: Missed dead code optimization from data flow analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97434 Bug ID: 97434 Summary: Missed dead code optimization from data flow analysis Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jwerner at chromium dot org Target Milestone: --- I found a pretty simple case where GCC cannot optimize out a redundant check. I've reduced it to the following minimal test case: unsigned int random_number(void); void eliminate_me(void); void main(void) { unsigned int a = random_number(); unsigned int b = random_number(); if (b > a) return; int x = b - 8; if (x > 0 && x > a) eliminate_me(); } I think it should be really easy to prove that eliminate_me() cannot be called, because x can never be greater than a (otherwise b would have also been greater than a and the function would have terminated earlier). I don't know anything about how compilers do data flow analysis in detail, but GCC can usually figure out so much that I'm surprised it cannot figure out this one.
[Bug tree-optimization/92903] Cannot elide byteswap when only needed to compare to multiple constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92903 --- Comment #1 from Julius Werner --- For reference, clang 10.0.0 manages to do this optimization correctly: (File Offset: 0x40): 0: b8 01 00 00 00 mov$0x1,%eax 5: 81 ff 23 45 67 89 cmp$0x89674523,%edi b: 74 08 je 15 (File Offset: 0x55) d: 81 ff 12 34 56 78 cmp$0x78563412,%edi 13: 75 01 jne16 (File Offset: 0x56) 15: c3 retq 16: 31 c0 xor%eax,%eax 18: c3 retq
[Bug c/92903] New: Cannot elide byteswap when only needed to compare to multiple constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92903 Bug ID: 92903 Summary: Cannot elide byteswap when only needed to compare to multiple constants Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jwerner at chromium dot org Target Milestone: --- I compiled the following test code on GCC 8.3.0: int test(int a) { const int swapped = __builtin_bswap32(a); if (swapped == 0x12345678 || swapped == 0x23456789) return 1; return 0; } On x86_64, I get: (File Offset: 0x40): 0: 0f cf bswap %edi 2: 81 ff 78 56 34 12 cmp$0x12345678,%edi 8: 0f 94 c0sete %al b: 81 ff 89 67 45 23 cmp$0x23456789,%edi 11: 0f 94 c2sete %dl 14: 09 d0 or %edx,%eax 16: 0f b6 c0movzbl %al,%eax 19: c3 retq And on aarch64, I get: (File Offset: 0x40): 0: 528acf01mov w1, #0x5678 // #22136 4: 5ac00800rev w0, w0 8: 72a24681movkw1, #0x1234, lsl #16 c: 6b01001fcmp w0, w1 10: 528cf121mov w1, #0x6789 // #26505 14: 72a468a1movkw1, #0x2345, lsl #16 18: 7a411004ccmpw0, w1, #0x4, ne 1c: 1a9f17e0csetw0, eq 20: d65f03c0ret In both of those cases it would have been better to omit the byteswap instruction and instead embed the constants to compare to in their swapped form right away. This works correctly when comparing to a single constant, like this: int test(int a) { const int swapped = __builtin_bswap32(a); if (swapped == 0x12345678) return 1; return 0; } (File Offset: 0x40): 0: 52868241mov w1, #0x3412 // #13330 4: 72af0ac1movkw1, #0x7856, lsl #16 8: 6b01001fcmp w0, w1 c: 1a9f17e0csetw0, eq 10: d65f03c0ret But comparing the swapped value to more than one constant somehow makes GCC miss this optimization, even if neither the swapped value nor the constants are used for anything else.
[Bug c/92716] -Os doesn't inline byteswap function even though it's a single instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716 --- Comment #1 from Julius Werner --- edit: Just noticed that when I implement it as static inline unsigned int byteswap(unsigned int x) { return __builtin_bswap32(x); } instead, it is still compiled into the same single instruction, but then the inlining works even with -Os. So I guess the decision to inline is made too early in the optimization pipeline or something?
[Bug c/92716] New: -Os doesn't inline byteswap function even though it's a single instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716 Bug ID: 92716 Summary: -Os doesn't inline byteswap function even though it's a single instruction Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jwerner at chromium dot org Target Milestone: --- I compiled the following test code for both x86_64 and aarch64 on gcc 8.3.0: static inline unsigned int byteswap(unsigned int x) { return (((x >> 24) & 0xff) << 0) | (((x >> 16) & 0xff) << 8) | (((x >> 8) & 0xff) << 16) | (((x >> 0) & 0xff) << 24); } unsigned int test(unsigned int a, unsigned int b, unsigned int c) { return byteswap(a) + byteswap(b) + byteswap(c); } On x86_64 I get: (File Offset: 0x40): 0: 89 f8 mov%edi,%eax 2: 0f c8 bswap %eax 4: c3 retq 0005 (File Offset: 0x45): 5: e8 f6 ff ff ff callq 0 (File Offset: 0x40) a: 89 f7 mov%esi,%edi c: 89 c1 mov%eax,%ecx e: e8 ed ff ff ff callq 0 (File Offset: 0x40) 13: 89 d7 mov%edx,%edi 15: 01 c1 add%eax,%ecx 17: e8 e4 ff ff ff callq 0 (File Offset: 0x40) 1c: 01 c8 add%ecx,%eax 1e: c3 retq And on aarch64 I get: (File Offset: 0x40): 0: 5ac00800rev w0, w0 4: d65f03c0ret 0008 (File Offset: 0x48): 8: a9bf7bfdstp x29, x30, [sp,#-16]! c: 910003fdmov x29, sp 10: 97fcbl 0 (File Offset: 0x40) 14: 2a0003e3mov w3, w0 18: 2a0103e0mov w0, w1 1c: 97f9bl 0 (File Offset: 0x40) 20: 0b63add w3, w3, w0 24: 2a0203e0mov w0, w2 28: 97f6bl 0 (File Offset: 0x40) 2c: 0b60add w0, w3, w0 30: a8c17bfdldp x29, x30, [sp],#16 34: d65f03c0ret So the good news is that GCC recognized this code as a byteswap function that can be implemented with a single instruction on both of these platforms. The bad news is that it then doesn't seem to realize that inlining this single instruction leads to smaller code size than wrapping it in a function and calling it, even if it is called many times. If I instead compile with -O2, the function is inlined as expected. (I also tried with clang 8.0.1 which manages to inline correctly even with -Os.)
[Bug c/89408] New: No constant folding when dereferencing string literals
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89408 Bug ID: 89408 Summary: No constant folding when dereferencing string literals Product: gcc Version: 8.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jwerner at chromium dot org Target Milestone: --- See the following minimal test case: int main(int argc, char **argv) { switch (argc) { case "C"[0]: return 1; } } When compiling this, GCC throws the error: test.c: In function 'main': test.c:6:2: error: case label does not reduce to an integer constant case "C"[0]: ^~~~ Obviously, since the string literal "C" is constant, it should be trivial to reduce it to an integer constant, just as if I had written 'C'. This may seem like a pointless construct, but it is useful when using token stringification in macros (because the preprocessor doesn't allow you to stringify directly to a character constant). For example, a macro that returns the character keycode for the key combination Ctrl+X (for any Latin letter) can be nicely written as: #define CTRL(letter) (#letter[0] & 0x1f) Unfortunately, this fails in switch statements due to the bug described here. (When used in other contexts, looking through the disassembly shows that the optimizer does in fact seem to be able to constant fold the whole expression at some point.)
[Bug c/88437] New: Excessive struct alignment on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88437 Bug ID: 88437 Summary: Excessive struct alignment on x86_64 Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jwerner at chromium dot org Target Milestone: --- Compile the following test file for the x86_64 target: struct { void *a; void *b; } mystruct __attribute__((__section__(".data.mystruct")); Then run objdump -x on the output file. The .data.mystruct section will have an alignment of 2**4, even though the largest member of the struct only requires an alignment of 2**3. GCC always seems to align structures to the whole struct size, at least to a maximum of 2**5. This behavior seems questionable at best, and it is inconsistent from all other architectures (I tested i686 and aarch64, both of which always only align to the largest member). I assume it may be intended as a cache line optimization(?), but it affects code size very negatively, so it should at least be up to the user and not hardcoded by architecture. I couldn't find any flag in the documentation to disable this. It even occurs with -Os even though that's meant to strongly favor size over other optimizations.
[Bug target/67701] Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67701 --- Comment #3 from Julius Werner --- > I suspect this is an armv7 issue only where there is missing support for > misaligned load/stores. Did a quick test on aarch64 and you're right, neither of the two issues appears there. > Also testvalue is aligned to 1 byte by default so what GCC is doing is > correct as there is no way to do a volatile unaligned loads really. I thought alignment was always tied to a type (at least by default, unless overridden by __attribute__((align)) or the like)? It's true that the expected alignment for testvalue is 1, but I'm not dereferencing testvalue... I'm dereferencing *(volatile unsigned long *)&testvalue. Shouldn't the alignment of that always be 4? (FWIW, _Alignof(*(volatile unsigned long *)&testvalue) evaluates to 4 in both compiler versions.) Another argument: if write32() was defined as a non-inline function in a separate execution unit, this problem would not appear. Should inlining a function really change behavior in this way and produce code that is clearly less efficient (to try to work around a "problem" that the -O0 version of the same code would have never found)?
[Bug c/67701] Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67701 Julius Werner changed: What|Removed |Added Target||armv7a --- Comment #1 from Julius Werner --- edit: target architecture in my example is armv7a, btw (although I doubt that this is architecture-specific)
[Bug c/67701] New: Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67701 Bug ID: 67701 Summary: Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load) Product: gcc Version: 5.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jwerner at chromium dot org Target Milestone: --- Consider the following sample program: static inline void write32(void *addr, unsigned int value) { *(volatile unsigned int *)addr = value; } extern unsigned char testvalue[]; void main() { write32(&testvalue, 4); } GCC 5.2 (with -O2 -fno-stack-protector -ffreestanding -fomit-frame-pointer) generates: main: mov r1, #4 mov r2, #0 ldr r3, .L2 ldrbr0, [r3]@ zero_extendqisi2 strbr1, [r3] ldrbr1, [r3, #1]@ zero_extendqisi2 strbr2, [r3, #1] ldrbr1, [r3, #2]@ zero_extendqisi2 strbr2, [r3, #2] ldrbr1, [r3, #3]@ zero_extendqisi2 strbr2, [r3, #3] bx lr Whereas GCC 4.9 (same arguments) generates: main: movwr3, #:lower16:testvalue movsr2, #4 movtr3, #:upper16:testvalue ldr r1, [r3] str r2, [r3] bx lr I think there are two problems with this: 1. Both GCC 5.2 and 4.9 generate superfluous load instructions to |addr|, even though the pointer is never dereferenced as an rvalue (only as the left-hand side of an assignment). This is a) unnecessary and b) dangerous since the pointer is declared volatile (meaning it could be an MMIO register with read side-effects). 2. GCC 5.2 seems to somehow assume it must treat the object as unaligned and generates byte-wise accesses. I don't understand why it would do that, since nothing here is declared __attribute__((packed)) and the explicit cast to a 4-byte-aligned type should remove any ambiguity for the compiler about which alignment it can assume. It is important for systems code to have a (portable) way to dereference an address with an explicit alignment regardless of where it came from (e.g. for MMIO registers that only support 32-bit wide accesses). Both of these issues can be solved by using __builtin_assume_aligned(addr, sizeof(unsigned long)), but I still think the compiler shouldn't have done this in the first place (and having a portable solution to this problem is preferable). Context: http://review.coreboot.org/#/c/11698