[Bug target/116521] New: missing optimization: xtensa tail-call

2024-08-28 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116521

Bug ID: 116521
   Summary: missing optimization: xtensa tail-call
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsaxvc at gmail dot com
  Target Milestone: ---

On GCC 12.2.0, -O2 -Wall -Wextra, the following code:

#include <stdint.h>

__attribute__ ((noinline)) uint32_t callee(uint32_t x, uint16_t y){
    return x + y;
}

__attribute__ ((noinline)) uint32_t caller(uint32_t x, uint32_t y){
    return callee(x, y);
}

compiles to these xtensa instructions:

callee:
entry   sp, 32
extui   a3, a3, 0, 16
add.n   a2, a3, a2
retw.n
caller:
entry   sp, 32
extui   a11, a3, 0, 16
mov.n   a10, a2
call8   callee
mov.n   a2, a10
retw.n

If the caller were to tail-call callee, it could be much closer to the
following ARM code (essentially, caller would not need to manipulate the
register windows):

callee:
add r0, r0, r1
bx  lr
caller:
uxth    r1, r1  // similar to extui, .., .., 0, 16
b   callee

On xtensa, this might mean that the arguments end up in different registers in
caller(); I'm not sure whether the caller or the callee is responsible for
rotating the window. This may only apply when the number of arguments of each
function matches. It's also possible I'm misunderstanding the mechanism.

[Bug target/116467] missed optimization: zero-extension duplicated on xtensa

2024-08-26 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116467

--- Comment #2 from rsaxvc at gmail dot com ---
I had wondered about that too but hadn't been able to find anything about it.
User ccrause on esp32.com knew where it was though. From
https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/silicon-solutions/compute-ip/isa-summary.pdf.

From section 10.1.4, Argument Passing in AR Registers:

> All arguments consist of an integral number of 4-byte words. Thus, the 
> minimum argument size is one word. Integer values smaller than a word (that 
> is, char and short) are stored in the least significant portion of the 
> argument word, with the upper bits set to zero for unsigned values or 
> sign-extended for signed values.

From section 10.1.5, Return Values in AR Registers:

> Return values smaller than a word are stored in the least-significant part of 
> AR[2], with the upper bits set to zero for unsigned values or sign-extended 
> for signed values.

So I think extending just once should be safe, at least for LX cores.

[Bug target/116467] New: missed optimization: zero-extension duplicated on xtensa

2024-08-22 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116467

Bug ID: 116467
   Summary: missed optimization: zero-extension duplicated on
xtensa
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsaxvc at gmail dot com
  Target Milestone: ---

On GCC 12.2.0, -O2 -Wall -Wextra, the following code:

#include <stdint.h>

__attribute__ ((noinline)) uint32_t callee(uint32_t x, uint16_t y){
    return x + y;
}

__attribute__ ((noinline)) uint32_t caller(uint32_t x, uint32_t y){
    return callee(x, y);
}

compiles to these xtensa instructions:

callee:
entry   sp, 32
extui   a3, a3, 0, 16
add.n   a2, a3, a2
retw.n
caller:
entry   sp, 32
extui   a11, a3, 0, 16
mov.n   a10, a2
call8   callee
mov.n   a2, a10
retw.n

I was surprised to find that the zero-extension (extui rDest, rSource, 0, 16)
occurs twice, once in each function. On other targets like ARM32, it looks like
a uint16_t passed in a register is assumed to already be zero-extended, so the
callee does not need to repeat the extension. ARM32, GCC 12.2, same flags:

callee:
add r0, r0, r1
bx  lr
caller:
uxth    r1, r1  // similar to extui, .., .., 0, 16
b   callee

Could xtensa do the same?

[Bug middle-end/111933] memcpy on Xtensa not optimized when n == sizeof(uint32_t) or sizeof(uint64_t)

2024-08-22 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111933

rsaxvc at gmail dot com changed:

   What|Removed |Added

 CC||rsaxvc at gmail dot com

--- Comment #3 from rsaxvc at gmail dot com ---
(In reply to Davide Bettio from comment #2)

> ...I was writing a function for reading uint32_t and uint64_t values at any 
> address...

I believe memcpy() is the right approach, as dereferencing a misaligned pointer
is undefined behaviour.

My suspicion is that treating unaligned access as unsafe is intentional for the
ESP32, because some of the internal memories, like IRAM, require strict
alignment, though most do not. Quoting from
https://blog.espressif.com/esp32-programmers-memory-model-259444d89387:

"...IRAM has access limitations in terms of alignment of address and size. If
an unaligned access is made, it results into an exception. The ESP-IDF, after
release 4.2, handles these exceptions transparently to provide load/store as
desired by the caller. As these unaligned accesses result in exception, the
access is slower than the DRAM access. Typically each exception handling
requires approximately 167 CPU cycles (i.e. 0.7 usec per access at 240 MHz or 1
usec per access at 160 MHz)."

It does look like the equivalent 16-bit unaligned load could be faster:

#include <stdint.h>
#include <string.h>

uint16_t from_unaligned_u16(void *p){
    uint16_t ret;
    memcpy(&ret, p, sizeof(ret));
    return ret;
}

readU16: //round-trips through the stack
entry   sp, 48
l8ui    a8, a2, 0
l8ui    a2, a2, 1
s8i     a8, sp, 0
s8i     a2, sp, 1
l16ui   a2, sp, 0
retw.n

uint32_t from_unaligned_u16_seq(uint8_t *p){
    uint32_t p1 = p[1];
    uint32_t p0 = p[0];
    return p0 | p1 << 8;
}

readU16Seq: //works in registers
entry   sp, 32
l8ui    a8, a2, 1
l8ui    a2, a2, 0
slli    a8, a8, 8
or      a2, a8, a2
retw.n

But for the 32-bit version I couldn't get anything shorter than what GCC did.
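For reference, a byte-wise 32-bit analogue of the sequential load above would
look something like the following (the function name is just illustrative; as
noted, I couldn't get GCC to emit anything shorter than its memcpy expansion
for the 32-bit case):

uint32_t from_unaligned_u32_seq(uint8_t *p){
    /* assemble four byte loads into a little-endian word */
    uint32_t p0 = p[0];
    uint32_t p1 = p[1];
    uint32_t p2 = p[2];
    uint32_t p3 = p[3];
    return p0 | p1 << 8 | p2 << 16 | p3 << 24;
}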

[Bug c++/116356] New: signed overflow warning has misplaced line number

2024-08-12 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116356

Bug ID: 116356
   Summary: signed overflow warning has misplaced line number
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsaxvc at gmail dot com
  Target Milestone: ---

The following snippet is from a now-deleted Stack Overflow question. When
compiled with G++ 14.1 on a 64-bit target, "g++ main.cpp -o test
-Wstrict-overflow=5 -Werror -O1" warns:

<source>: In function 'int main()':
<source>:39:1: warning: assuming signed overflow does not occur when changing X +- C1 cmp C2 to X cmp C2 -+ C1 [-Wstrict-overflow]
   39 | }
      | ^

Here '39' is the line number of the closing brace of the last scope in the
file, rather than the line number of the signed overflow. In the original code
linked from Stack Overflow, it's line 56. I've trimmed the excerpt below.

Impacts at least versions 12.2.0 to 14.2.


###BEGIN ORIGINAL EXCERPT FROM https://godbolt.org/z/W5er718fK ###

#include <ctime>
#include <iostream>

double operator-(const timespec& lhs, const timespec& rhs);

int main(){
struct timespec start_time{}, end_time{}, curr_time{};
clock_gettime(CLOCK_REALTIME, &start_time);
end_time.tv_sec = start_time.tv_sec + 10;   // Run for 10 seconds
end_time.tv_nsec = 0;
curr_time = start_time;

while(curr_time <= end_time){
std::cout << curr_time - start_time << std::endl;
clock_gettime(CLOCK_REALTIME, &curr_time);
curr_time.tv_nsec += 100ll * 250; // Every 250 ms sample the sensor
if(curr_time.tv_nsec > 10ll){
curr_time.tv_nsec -= 10ll;
curr_time.tv_sec += 1;
}
clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &curr_time, nullptr);
}
return 0;
}

double operator-(const timespec& lhs, const timespec& rhs)
{
const auto diff_seconds = (static_cast<double>(lhs.tv_sec) -
static_cast<double>(rhs.tv_sec));
const auto diff_nseconds = (static_cast<double>(lhs.tv_nsec) -
static_cast<double>(rhs.tv_nsec));

const auto diff_sec_to_ms = diff_seconds * 1000.0;
const auto diff_nsec_to_ms = diff_nseconds / 100.0;

return diff_sec_to_ms + diff_nsec_to_ms;
}

[Bug c++/116198] Wdeprecated-enum-enum-conversion triggers inside extern "C"

2024-08-02 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116198

--- Comment #3 from rsaxvc at gmail dot com ---
Thanks, that makes sense. I hadn't realized that code in a header needs to be
able to compile as whatever language includes that header, so we'll need to
fix LVGL.

[Bug c++/116198] New: Wdeprecated-enum-enum-conversion triggers inside extern "C"

2024-08-02 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116198

Bug ID: 116198
   Summary: Wdeprecated-enum-enum-conversion triggers inside
extern "C"
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsaxvc at gmail dot com
  Target Milestone: ---

-Wdeprecated-enum-enum-conversion warnings apply to C code compiled by G++;
should they? I'm assuming that if this code would not trigger a warning in C,
it shouldn't trigger a warning inside extern "C". Here's a minimal contrived
example:

   #include <stdlib.h>

   extern "C"{
      enum {TWO = 2};
      enum {ONE = 1};
      void * triggerDeprecatedEnumConversion() {
         return malloc(ONE | TWO);
      }
   }

When the above code is compiled with g++ 12.2.0 and -Wall -Wextra -O2
-std=c++23, it triggers this warning:

   <source>: In function 'void* triggerDeprecatedEnumConversion()':
   <source>:17:27: warning: bitwise operation between different enumeration
types '<unnamed enum>' and '<unnamed enum>' is deprecated
[-Wdeprecated-enum-enum-conversion]
  17 | return malloc(ONE | TWO);
 |   ^

One real-world use-case this impacts occurs in LVGL, which uses separate enums
for different parts of a bitfield that are then combined in functions in a
header, so every C++ file that includes those headers triggers this warning.
Related discussion here: https://github.com/lvgl/lvgl/issues/5119
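For what it's worth, one way a header could avoid the warning is to combine the
enumerators through a plain integer type instead of mixing the enum types
directly. A minimal sketch of that idea (names are illustrative, not LVGL's
actual API), valid as both C and C++:

#include <stdlib.h>

enum { TWO = 2 };
enum { ONE = 1 };

void *noDeprecatedEnumConversion(void) {
    /* Casting each enumerator to unsigned makes this an integer-integer
       bitwise OR, so -Wdeprecated-enum-enum-conversion does not fire when
       the header is compiled as C++. */
    return malloc((unsigned)ONE | (unsigned)TWO);
}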

Specific compiler version:
   xtensa-esp32s3-elf-c++.exe (crosstool-NG esp-12.2.0_20230208) 12.2.0
   Copyright (C) 2022 Free Software Foundation, Inc.
   This is free software; see the source for copying conditions.  There is NO
   warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[Bug target/53938] ARM target generates sub-optimal code (extra instructions) on load from memory

2024-04-24 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53938

--- Comment #6 from rsaxvc at gmail dot com ---
This also impacts Cortex-M0 & M23 on GCC 13.2.0, just with the newer extension
instructions.

Oddly, loading a volatile u8 or u16 on Cortex-M3/4/7 does not generate extra
zero-extension instructions. But those cores do still get separate
ldrb/ldrh + sxtab/sxtah sign extension instead of LDRSB/LDRSH.

[Bug target/113432] New: missing optimization: GCC UXTB zero-extends result of LDRB from volatile variables

2024-01-16 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113432

Bug ID: 113432
   Summary: missing optimization: GCC UXTB zero-extends result of
LDRB from volatile variables
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsaxvc at gmail dot com
  Target Milestone: ---

GCC for ARM 13.2.0 (-Os, -mthumb) is missing a minor optimization when loading
from a sub-word-sized volatile global variable when targeting Cortex-M0 and
Cortex-M23, but gets it right on many other ARM cores (M3/M4/M7/A7/A9). For M0
and M23, GCC adds a UXTB (unsigned zero-extend byte) or UXTH to the result of
LDRB or LDRH, which is already zero-extended. I checked as far back as GCC
5.4.1, which still has the extra UXTBs/UXTHs. Here's an example that triggers
it:

#include <stdint.h>
uint8_t regular_u8;
volatile uint8_t volatile_u8;

uint32_t load_regular_u8_as_u32(){ return regular_u8; }
uint8_t load_regular_u8_as_u8(){ return regular_u8; }
uint32_t load_volatile_u8_as_u32(){ return volatile_u8; }
uint8_t load_volatile_u8_as_u8(){ return volatile_u8; }

I would expect all four functions to assemble to:
1) address loading (either movw/movt at -O3 or a PC-relative LDR at -Os)
2) LDRB r0, [address register of the extern variable]
3) BX LR

But here's the resulting assembly - loading the non-volatile variable is as
expected, while loading the volatile variable uses an additional UXTB
instruction.

load_regular_u8_as_u32:
ldr     r3, .L2
ldrb    r0, [r3]
bx      lr
.L2:
.word   regular_u8
load_regular_u8_as_u8:
ldr     r3, .L5
ldrb    r0, [r3]
bx      lr
.L5:
.word   regular_u8
load_volatile_u8_as_u32:
ldr     r3, .L8
ldrb    r0, [r3]
uxtb    r0, r0
bx      lr
.L8:
.word   volatile_u8
load_volatile_u8_as_u8:
ldr     r3, .L11
ldrb    r0, [r3]
uxtb    r0, r0
bx      lr
.L11:
.word   volatile_u8


In the above load_volatile* cases, the UXTB can be omitted, as the LDRB already
zero-extended r0.

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2023-07-08 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #5 from rsaxvc at gmail dot com ---
A related ticket has been filed against Clang:
https://github.com/llvm/llvm-project/issues/63731

The latest umulh() function is a little shorter:

static uint64_t umulh(uint64_t a, uint64_t b)
{
    const uint32_t a_lo = a;
    const uint32_t a_hi = a >> 32;
    const uint32_t b_lo = b;
    const uint32_t b_hi = b >> 32;

    /* FOIL method of multiplication.
       See https://en.wikipedia.org/wiki/FOIL_method, but instead of binomials
       with constants a,b,c,d and variables x,y: (ax+b) * (cy+d), consider it
       with variables a,b,c,d and constants x,y = 1<<32.
       Results in one UMULL or UMLAL (when merged with the accumulate below)
       per multiply. */
    const uint64_t acc0 = (uint64_t)a_lo * b_lo;
    const uint64_t acc1 = (uint64_t)a_hi * b_lo;
    const uint64_t acc2 = (uint64_t)a_lo * b_hi;
    const uint64_t acc3 = (uint64_t)a_hi * b_hi;

    /* Accumulate the results, keeping only the top 64 bits of the 128-bit result */
    uint64_t acc = acc0;
    acc >>= 32;
    acc += acc1;
    acc += acc2;
    acc >>= 32;
    acc += acc3;

    return acc;
}
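One caveat worth noting: acc1 + acc2 plus the bits carried down from acc0 can
exceed 64 bits for some inputs, so the version above can drop a carry. A
carry-safe variant (a sketch only, not the attached implementation) splits the
cross terms into 32-bit halves before the final add:

static uint64_t umulh_carry_safe(uint64_t a, uint64_t b)
{
    const uint32_t a_lo = a;
    const uint32_t a_hi = a >> 32;
    const uint32_t b_lo = b;
    const uint32_t b_hi = b >> 32;

    const uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    const uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    const uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    const uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Low halves of the cross terms plus the carried-down bits of lo_lo;
       each term is below 2^32, so this sum cannot overflow a uint64_t. */
    const uint64_t mid = (lo_lo >> 32) + (uint32_t)hi_lo + (uint32_t)lo_hi;

    /* High 64 bits of the full 128-bit product. */
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}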

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2023-05-04 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #4 from rsaxvc at gmail dot com ---
Benchmarking shows the speedup to be highly variable depending on the CPU core
as well as the __aeabi_uldivmod() implementation, and somewhat on the numerator.

The best __aeabi_uldivmod() implementations I've seen do use 32-bit division
instructions when available, and the umulh()-based approach is only 2-3x faster
when division instructions are available.

When umull (32x32 with a 64-bit result) is available and udiv is not available,
or libc doesn't use it, the umulh()-based approach proposed here completes
28-38x faster on Cortex-M4, measured via GPIO and an oscilloscope. The wide
variation in relative speed is due to the variable execution time of
__aeabi_uldivmod(). Results are similar on ARM11.

There's a partial list of which contemporary cores have udiv here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/divide-and-conquer
It does look like things are headed towards more cores having udiv available.

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2022-08-01 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #3 from rsaxvc at gmail dot com ---
> 32-bit division by constant relies on a 32x32->64 bit multiply.

I believe only the upper half of the product bits is required to represent the
integer component of the quotient. This makes parameter passing much easier in
the uint64_t/constant case.

> We don't have an instruction for that on arm nor does gcc support 128-bit 
> integer types on a 32-bit compiler.

I ran into that as well. The attached uint64_t umulh( uint64_t a, uint64_t b )
implements a 64x64 multiply returning the upper 64 bits of the 128-bit product,
using only 32x32->64-bit multiplication, 64-bit addition, and register-aligned
shifting.

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2022-07-30 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

rsaxvc at gmail dot com changed:

   What|Removed |Added

 CC||rsaxvc at gmail dot com

--- Comment #1 from rsaxvc at gmail dot com ---
Created attachment 53388
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53388&action=edit
example umulh implementation

A C implementation of umulh which can be used to optimize uint64_t/constant
division on 32-bit processors that lack a umulh instruction but have a
32x32->64-bit multiplier.

[Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32

2022-07-30 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

Bug ID: 106484
   Summary: Failure to optimize uint64_t/constant division on
ARM32
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsaxvc at gmail dot com
  Target Milestone: ---
Target: arm

The following test function compiles into a call to __aeabi_uldivmod, even
though the divisor is a constant:

#include <stdint.h>
uint64_t ns_to_s( uint64_t ns64 )
{
    return ns64 / 1000000000ULL;
}

Cortex-M4 (-O3 -Wall -Wextra -mcpu=cortex-m4) assembly:

ns_to_s(unsigned long long):
push    {r3, lr}
adr     r3, .L4
ldrd    r2, [r3]
bl      __aeabi_uldivmod
pop     {r3, pc}
.L4:
.word   1000000000
.word   0

Interestingly, GCC 12.1 for aarch64 compiles the above C function by
implementing the division by a constant as a scaled multiplication by the
inverse, using the umulh instruction (not present on 32-bit ARM)
(-O3 -Wall -Wextra):

ns_to_s(unsigned long):
mov     x1, 23123
lsr     x0, x0, 9
movk    x1, 0xa09b, lsl 16
movk    x1, 0xb82f, lsl 32
movk    x1, 0x44, lsl 48
umulh   x0, x0, x1
lsr     x0, x0, 11
ret

Instead, if something like __umulh could be added to libgcc, then GCC could
reuse the constants generated by the aarch64 logic to implement
uint64_t/constant division. An example umulh approach is attached. On
Cortex-M4, the umulh-based approach is significantly faster, although this
depends on the specific libc __aeabi_uldivmod linked against as well as on the
numerator.
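For illustration, here is a sketch of how such a helper could be used for the
function above, reusing the magic constant and shifts from the aarch64 output
(the __umulh name and its presence in libgcc are assumptions, not an existing
interface):

#include <stdint.h>

/* Assumed libgcc-style helper: returns the high 64 bits of the 128-bit
   product a * b (see the attached C implementation for one way to write it). */
extern uint64_t __umulh(uint64_t a, uint64_t b);

uint64_t ns_to_s_via_umulh(uint64_t ns64)
{
    /* Same constants the aarch64 compiler chose above:
       ns64 / 1000000000 == __umulh(ns64 >> 9, 0x0044B82FA09B5A53) >> 11 */
    return __umulh(ns64 >> 9, 0x0044B82FA09B5A53ULL) >> 11;
}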