[Bug target/112573] Suboptimal code generation with `-fdata-sections` on aarch64

2023-11-16 Thread jwerner at chromium dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112573

--- Comment #2 from Julius Werner  ---
Sorry, I don't really know what reassociation means in this context. Are you
saying that the behavior is working as intended?

Note that the problem also exists when you write the accesses in h() as:

*(a + (y + 3)) = (char)x;

so I don't really think this is just an order-of-operations / overflow thing
(also, how would that explain why it only occurs with -fdata-sections?).
According to my understanding of C, `*(a + (y + 3))` and `a[y + 3]` are
different notations for exactly the same expression, so they shouldn't generate
different code.

[Bug c/112573] New: Suboptimal code generation with `-fdata-sections` on aarch64

2023-11-16 Thread jwerner at chromium dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112573

Bug ID: 112573
   Summary: Suboptimal code generation with `-fdata-sections` on
aarch64
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jwerner at chromium dot org
  Target Milestone: ---
Target: aarch64

I noticed some weird code-generation behavior on aarch64 that seems to be a new
regression in GCC 12 or 13. Consider the following minimal test case:

char a[32];

void f(int x, int y)
{
    a[y + 3] = (char)x;
    a[y + 2] = (char)(x >> 8);
    a[y + 1] = (char)(x >> 16);
    a[y + 0] = (char)(x >> 24);
}

void h(int x, int y)
{
    *((a + y) + 3) = (char)x;
    *((a + y) + 2) = (char)(x >> 8);
    *((a + y) + 1) = (char)(x >> 16);
    *((a + y) + 0) = (char)(x >> 24);
}

These two functions should, of course, be 100% equivalent. Strangely enough,
they still don't produce the same code when compiled with -Os, but the
difference is minimal:

  (File Offset: 0x40):
   0:   11000c23    add     w3, w1, #0x3
   4:   9002        adrp    x2, 0 (File Offset: 0x40)
                    4: R_AARCH64_ADR_PREL_PG_HI21    .bss
   8:   9142        add     x2, x2, #0x0
                    8: R_AARCH64_ADD_ABS_LO12_NC     .bss
   c:   13087c04    asr     w4, w0, #8
  10:   3823c840    strb    w0, [x2, w3, sxtw]
  14:   11000823    add     w3, w1, #0x2
  18:   3823c844    strb    w4, [x2, w3, sxtw]
  1c:   11000423    add     w3, w1, #0x1
  20:   13107c04    asr     w4, w0, #16
  24:   13187c00    asr     w0, w0, #24
  28:   3823c844    strb    w4, [x2, w3, sxtw]
  2c:   3821c840    strb    w0, [x2, w1, sxtw]
  30:   d65f03c0    ret

0034  (File Offset: 0x74):
  34:   9002        adrp    x2, 0 (File Offset: 0x40)
                    34: R_AARCH64_ADR_PREL_PG_HI21   .bss
  38:   9142        add     x2, x2, #0x0
                    38: R_AARCH64_ADD_ABS_LO12_NC    .bss
  3c:   8b21c043    add     x3, x2, w1, sxtw
  40:   13087c04    asr     w4, w0, #8
  44:   39000864    strb    w4, [x3, #2]
  48:   13107c04    asr     w4, w0, #16
  4c:   39000464    strb    w4, [x3, #1]
  50:   39000c60    strb    w0, [x3, #3]
  54:   13187c00    asr     w0, w0, #24
  58:   3821c840    strb    w0, [x2, w1, sxtw]
  5c:   d65f03c0    ret

However, when I add the -fdata-sections flag, the code for f() remains the
same, but the code for h() becomes this weird thing:

0034  (File Offset: 0x74):
  34:   93407c21    sxtw    x1, w1
  38:   9002        adrp    x2, 0 (File Offset: 0x40)
                    38: R_AARCH64_ADR_PREL_PG_HI21   a+0x3
  3c:   9142        add     x2, x2, #0x0
                    3c: R_AARCH64_ADD_ABS_LO12_NC    a+0x3
  40:   13087c03    asr     w3, w0, #8
  44:   38216840    strb    w0, [x2, x1]
  48:   9002        adrp    x2, 0 (File Offset: 0x40)
                    48: R_AARCH64_ADR_PREL_PG_HI21   a+0x2
  4c:   9142        add     x2, x2, #0x0
                    4c: R_AARCH64_ADD_ABS_LO12_NC    a+0x2
  50:   38216843    strb    w3, [x2, x1]
  54:   9002        adrp    x2, 0 (File Offset: 0x40)
                    54: R_AARCH64_ADR_PREL_PG_HI21   a+0x1
  58:   9142        add     x2, x2, #0x0
                    58: R_AARCH64_ADD_ABS_LO12_NC    a+0x1
  5c:   13107c03    asr     w3, w0, #16
  60:   13187c00    asr     w0, w0, #24
  64:   38216843    strb    w3, [x2, x1]
  68:   9002        adrp    x2, 0 (File Offset: 0x40)
                    68: R_AARCH64_ADR_PREL_PG_HI21   a
  6c:   9142        add     x2, x2, #0x0
                    6c: R_AARCH64_ADD_ABS_LO12_NC    a
  70:   38216840    strb    w0, [x2, x1]
  74:   d65f03c0    ret

There should be absolutely no reason to reload the address of `a` from scratch
for each access. As far as I'm aware, `-fdata-sections` also shouldn't have any
influence on what the code inside a function looks like.

When I try this on GCC 11.2.0 instead, the output for h() with -fdata-sections
is the same short code as without it.

Also, there is of course the question of why GCC doesn't just reduce this code
to the `rev` instruction (it doesn't seem to be able to figure that out on
either GCC 11 or GCC 13, but it would be able to do it if the `y` parameter
were a constant instead).

[Bug middle-end/99578] [11/12 Regression] gcc-11 -Warray-bounds or -Wstringop-overread warning when accessing a pointer from integer literal

2022-03-16 Thread jwerner at chromium dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578

Julius Werner  changed:

   What|Removed |Added

 CC||jwerner at chromium dot org

--- Comment #32 from Julius Werner  ---
I second Kees' request to please (permanently) add a flag to disable this
inference and prevent GCC from making any assumptions about object sizes for
pointers cast from integers. Ideally it should be a separate one-off flag
rather than a level that needs to be set for a variety of individual warnings.

This is affecting almost every systems programming project there is and going
against 40 years of common C practice. The standard has never really specified
a way to do MMIO hardware accesses, yet it's something we need to do in
practice, and casting integer literals to pointers is one of the main ways we
do it. I don't think it's reasonable to suddenly start prosecuting, in 2022, a
"violation" that has been this prevalent in programming practice for decades.

If you want to make these new safety checks available for those who want them
that's fine, but please retain the option to stick with the existing behavior
for those projects where this really causes a ton more problems than it could
ever solve.

[Bug c/97434] New: Missed dead code optimization from data flow analysis

2020-10-14 Thread jwerner at chromium dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97434

Bug ID: 97434
   Summary: Missed dead code optimization from data flow analysis
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jwerner at chromium dot org
  Target Milestone: ---

I found a pretty simple case where GCC cannot optimize out a redundant check.
I've reduced it to the following minimal test case:


unsigned int random_number(void);
void eliminate_me(void);

void main(void)
{
    unsigned int a = random_number();
    unsigned int b = random_number();

    if (b > a)
        return;

    int x = b - 8;

    if (x > 0 && x > a)
        eliminate_me();
}


I think it should be really easy to prove that eliminate_me() cannot be called,
because x can never be greater than a (otherwise b would also have been greater
than a and the function would have returned earlier). I don't know the details
of how compilers implement data-flow analysis, but GCC usually figures out so
much that I'm surprised it cannot figure out this one.

[Bug tree-optimization/92903] Cannot elide byteswap when only needed to compare to multiple constants

2019-12-10 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92903

--- Comment #1 from Julius Werner  ---
For reference, clang 10.0.0 manages to do this optimization correctly:

  (File Offset: 0x40):
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   81 ff 23 45 67 89       cmp    $0x89674523,%edi
   b:   74 08                   je     15 (File Offset: 0x55)
   d:   81 ff 12 34 56 78       cmp    $0x78563412,%edi
  13:   75 01                   jne    16 (File Offset: 0x56)
  15:   c3                      retq
  16:   31 c0                   xor    %eax,%eax
  18:   c3                      retq

[Bug c/92903] New: Cannot elide byteswap when only needed to compare to multiple constants

2019-12-10 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92903

Bug ID: 92903
   Summary: Cannot elide byteswap when only needed to compare to
multiple constants
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jwerner at chromium dot org
  Target Milestone: ---

I compiled the following test code on GCC 8.3.0:

int test(int a)
{
    const int swapped = __builtin_bswap32(a);

    if (swapped == 0x12345678 || swapped == 0x23456789)
        return 1;
    return 0;
}

On x86_64, I get:

  (File Offset: 0x40):
   0:   0f cf                   bswap  %edi
   2:   81 ff 78 56 34 12       cmp    $0x12345678,%edi
   8:   0f 94 c0                sete   %al
   b:   81 ff 89 67 45 23       cmp    $0x23456789,%edi
  11:   0f 94 c2                sete   %dl
  14:   09 d0                   or     %edx,%eax
  16:   0f b6 c0                movzbl %al,%eax
  19:   c3                      retq

And on aarch64, I get:

  (File Offset: 0x40):
   0:   528acf01    mov     w1, #0x5678                 // #22136
   4:   5ac00800    rev     w0, w0
   8:   72a24681    movk    w1, #0x1234, lsl #16
   c:   6b01001f    cmp     w0, w1
  10:   528cf121    mov     w1, #0x6789                 // #26505
  14:   72a468a1    movk    w1, #0x2345, lsl #16
  18:   7a411004    ccmp    w0, w1, #0x4, ne
  1c:   1a9f17e0    cset    w0, eq
  20:   d65f03c0    ret

In both of those cases it would have been better to omit the byteswap
instruction and instead embed the constants to compare to in their swapped form
right away. This works correctly when comparing to a single constant, like
this:

int test(int a)
{
    const int swapped = __builtin_bswap32(a);

    if (swapped == 0x12345678)
        return 1;
    return 0;
}

  (File Offset: 0x40):
   0:   52868241    mov     w1, #0x3412                 // #13330
   4:   72af0ac1    movk    w1, #0x7856, lsl #16
   8:   6b01001f    cmp     w0, w1
   c:   1a9f17e0    cset    w0, eq
  10:   d65f03c0    ret

But comparing the swapped value to more than one constant somehow makes GCC
miss this optimization, even if neither the swapped value nor the constants are
used for anything else.

[Bug c/92716] -Os doesn't inline byteswap function even though it's a single instruction

2019-11-28 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716

--- Comment #1 from Julius Werner  ---
edit: Just noticed that when I implement it as

static inline unsigned int byteswap(unsigned int x)
{
    return __builtin_bswap32(x);
}

instead, it is still compiled into the same single instruction, but then the
inlining works even with -Os. So I guess the decision to inline is made too
early in the optimization pipeline or something?

[Bug c/92716] New: -Os doesn't inline byteswap function even though it's a single instruction

2019-11-28 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716

Bug ID: 92716
   Summary: -Os doesn't inline byteswap function even though it's
a single instruction
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jwerner at chromium dot org
  Target Milestone: ---

I compiled the following test code for both x86_64 and aarch64 on gcc 8.3.0:

static inline unsigned int byteswap(unsigned int x)
{
    return (((x >> 24) & 0xff) << 0) |
           (((x >> 16) & 0xff) << 8) |
           (((x >> 8) & 0xff) << 16) |
           (((x >> 0) & 0xff) << 24);
}

unsigned int test(unsigned int a, unsigned int b, unsigned int c)
{
    return byteswap(a) + byteswap(b) + byteswap(c);
}

On x86_64 I get:

  (File Offset: 0x40):
   0:   89 f8                   mov    %edi,%eax
   2:   0f c8                   bswap  %eax
   4:   c3                      retq

0005 (File Offset: 0x45):
   5:   e8 f6 ff ff ff          callq  0 (File Offset: 0x40)
   a:   89 f7                   mov    %esi,%edi
   c:   89 c1                   mov    %eax,%ecx
   e:   e8 ed ff ff ff          callq  0 (File Offset: 0x40)
  13:   89 d7                   mov    %edx,%edi
  15:   01 c1                   add    %eax,%ecx
  17:   e8 e4 ff ff ff          callq  0 (File Offset: 0x40)
  1c:   01 c8                   add    %ecx,%eax
  1e:   c3                      retq

And on aarch64 I get:

  (File Offset: 0x40):
   0:   5ac00800    rev     w0, w0
   4:   d65f03c0    ret

0008 (File Offset: 0x48):
   8:   a9bf7bfd    stp     x29, x30, [sp,#-16]!
   c:   910003fd    mov     x29, sp
  10:   97fc        bl      0 (File Offset: 0x40)
  14:   2a0003e3    mov     w3, w0
  18:   2a0103e0    mov     w0, w1
  1c:   97f9        bl      0 (File Offset: 0x40)
  20:   0b63        add     w3, w3, w0
  24:   2a0203e0    mov     w0, w2
  28:   97f6        bl      0 (File Offset: 0x40)
  2c:   0b60        add     w0, w3, w0
  30:   a8c17bfd    ldp     x29, x30, [sp],#16
  34:   d65f03c0    ret

So the good news is that GCC recognized this code as a byteswap function that
can be implemented with a single instruction on both of these platforms. The
bad news is that it then doesn't seem to realize that inlining this single
instruction leads to smaller code size than wrapping it in a function and
calling it, even if it is called many times. If I instead compile with -O2, the
function is inlined as expected. (I also tried with clang 8.0.1 which manages
to inline correctly even with -Os.)

[Bug c/89408] New: No constant folding when dereferencing string literals

2019-02-19 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89408

Bug ID: 89408
   Summary: No constant folding when dereferencing string literals
   Product: gcc
   Version: 8.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jwerner at chromium dot org
  Target Milestone: ---

See the following minimal test case:

int main(int argc, char **argv)
{
    switch (argc) {
    case "C"[0]:
        return 1;
    }
}

When compiling this, GCC throws the error:

test.c: In function 'main':
test.c:6:2: error: case label does not reduce to an integer constant
  case "C"[0]:
  ^~~~

Obviously, since the string literal "C" is constant, it should be trivial to
reduce it to an integer constant, just as if I had written 'C'.

This may seem like a pointless construct, but it is useful when using token
stringification in macros (because the preprocessor doesn't allow you to
stringify directly to a character constant). For example, a macro that returns
the character keycode for the key combination Ctrl+X (for any Latin letter) can
be nicely written as:

#define CTRL(letter) (#letter[0] & 0x1f)

Unfortunately, this fails in switch statements due to the bug described here.
(When used in other contexts, looking through the disassembly shows that the
optimizer does in fact seem to be able to constant fold the whole expression at
some point.)

[Bug c/88437] New: Excessive struct alignment on x86_64

2018-12-10 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88437

Bug ID: 88437
   Summary: Excessive struct alignment on x86_64
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jwerner at chromium dot org
  Target Milestone: ---

Compile the following test file for the x86_64 target:

struct {
    void *a;
    void *b;
} mystruct __attribute__((__section__(".data.mystruct")));

Then run objdump -x on the output file. The .data.mystruct section will have an
alignment of 2**4, even though the largest member of the struct only requires
an alignment of 2**3. GCC always seems to align structures to the whole struct
size, up to a maximum of 2**5.

This behavior seems questionable at best, and it is inconsistent with all other
architectures (I tested i686 and aarch64, both of which only align to the
largest member). I assume it may be intended as a cache-line optimization(?),
but it affects code size very negatively, so it should at least be up to the
user and not hardcoded per architecture. I couldn't find any flag in the
documentation to disable this. It even occurs with -Os, even though that's
meant to strongly favor size over other optimizations.

[Bug target/67701] Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load)

2015-09-23 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67701

--- Comment #3 from Julius Werner  ---
> I suspect this is an armv7 issue only where there is missing support for 
> misaligned load/stores.

Did a quick test on aarch64 and you're right, neither of the two issues appears
there.

> Also testvalue is aligned to 1 byte by default so what GCC is doing is 
> correct as there is no way to do a volatile unaligned loads really.

I thought alignment was always tied to a type (at least by default, unless
overridden by __attribute__((align)) or the like)? It's true that the expected
alignment for testvalue is 1, but I'm not dereferencing testvalue... I'm
dereferencing *(volatile unsigned long *)&testvalue. Shouldn't the alignment of
that always be 4? (FWIW, _Alignof(*(volatile unsigned long *)&testvalue)
evaluates to 4 in both compiler versions.)

Another argument: if write32() were defined as a non-inline function in a
separate compilation unit, this problem would not appear. Should inlining a
function really change behavior in this way and produce code that is clearly
less efficient (to work around a "problem" that the -O0 version of the same
code would never have found)?


[Bug c/67701] Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load)

2015-09-23 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67701

Julius Werner  changed:

   What|Removed |Added

 Target||armv7a

--- Comment #1 from Julius Werner  ---
edit: target architecture in my example is armv7a, btw (although I doubt that
this is architecture-specific)


[Bug c/67701] New: Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load)

2015-09-23 Thread jwerner at chromium dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67701

Bug ID: 67701
   Summary: Unnecessary/bad instructions for u32-casted access to
external symbol (assumes misaligned, superfluous load)
   Product: gcc
   Version: 5.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jwerner at chromium dot org
  Target Milestone: ---

Consider the following sample program:

static inline void write32(void *addr, unsigned int value) {
    *(volatile unsigned int *)addr = value;
}

extern unsigned char testvalue[];
void main() {
    write32(&testvalue, 4);
}

GCC 5.2 (with -O2 -fno-stack-protector -ffreestanding -fomit-frame-pointer)
generates:

main:
        mov     r1, #4
        mov     r2, #0
        ldr     r3, .L2
        ldrb    r0, [r3]        @ zero_extendqisi2
        strb    r1, [r3]
        ldrb    r1, [r3, #1]    @ zero_extendqisi2
        strb    r2, [r3, #1]
        ldrb    r1, [r3, #2]    @ zero_extendqisi2
        strb    r2, [r3, #2]
        ldrb    r1, [r3, #3]    @ zero_extendqisi2
        strb    r2, [r3, #3]
        bx      lr

Whereas GCC 4.9 (same arguments) generates:

main:
        movw    r3, #:lower16:testvalue
        movs    r2, #4
        movt    r3, #:upper16:testvalue
        ldr     r1, [r3]
        str     r2, [r3]
        bx      lr

I think there are two problems with this:

1. Both GCC 5.2 and 4.9 generate superfluous load instructions to |addr|, even
though the pointer is never dereferenced as an rvalue (only as the left-hand
side of an assignment). This is a) unnecessary and b) dangerous since the
pointer is declared volatile (meaning it could be an MMIO register with read
side-effects).

2. GCC 5.2 seems to somehow assume it must treat the object as unaligned and
generates byte-wise accesses. I don't understand why it would do that, since
nothing here is declared __attribute__((packed)) and the explicit cast to a
4-byte-aligned type should remove any ambiguity for the compiler about which
alignment it can assume. It is important for systems code to have a (portable)
way to dereference an address with an explicit alignment regardless of where it
came from (e.g. for MMIO registers that only support 32-bit wide accesses).

Both of these issues can be solved by using __builtin_assume_aligned(addr,
sizeof(unsigned long)), but I still think the compiler shouldn't have done this
in the first place (and having a portable solution to this problem is
preferable).

Context: http://review.coreboot.org/#/c/11698