Hi Saurabh,
This looks good, one little nit:
> gcc/ChangeLog:
>
> * config/aarch64/iterators.md: Move UNSPEC_COND_SMAX and
> UNSPEC_COND_SMIN to correct iterators.
This should also have PR target/116934 before it - it's fine to change it when
you commit.
Speaking of which,
v2: Add more testcase fixes.
The current copysign pattern has a mismatch in the predicates and constraints -
operand[2] is a register_operand but also has an alternative X which allows any
operand. Since it is a floating point operation, having an integer alternative
makes no sense. Change the e
The current copysign pattern has a mismatch in the predicates and constraints -
operand[2] is a register_operand but also has an alternative X which allows any
operand. Since it is a floating point operation, having an integer alternative
makes no sense. Change the expander to always use the vec
Hi Richard,
> The Linaro CI is reporting an ICE while building libgfortran with this change.
So it looks like Thumb-2 oddly enough restricts the negative range of DFmode
even though that is unnecessary and inefficient. The easiest workaround turned
out to be to avoid using the checked adjust_address.
Cheers,
Hi Richard,
> Doing just this will mean that the register allocator will have to undo a
> pre/post memory operand that was accepted by the predicate (memory_operand).
> I think we really need a tighter predicate (let's call it noautoinc_mem_op)
> here to avoid that. Note that the existing uses
OK to backport to GCC 13 (it applies cleanly and regress/bootstrap passes)?
Cheers,
Wilco
On 29/11/2023 18:09, Richard Sandiford wrote:
> Wilco Dijkstra writes:
>> v2: Use UINTVAL, rename max_mops_size.
>>
>> The cpymemdi/setmemdi implementation doesn't fully support
v2: use a new arm_arch_v7ve_neon, fix use of DImode in output_move_neon
The valid offset range of LDRD in arm_legitimate_index_p is increased to
-1024..1020 if NEON is enabled since VALID_NEON_DREG_MODE includes DImode.
Fix this by moving the LDRD check earlier.
Passes bootstrap & regress, OK for commit?
Hi Christophe,
> PR target/115153
I guess this is a typo (should be 115188)?
Correct.
> +/* { dg-options "-O2 -mthumb" } */
> -mthumb is included in arm_arch_v6m, so I think you don't need to add it here?
Indeed, it's not strictly necessary. Fixed in v2:
A Thumb-1 memory operand allows
Hi Richard,
>> Essentially anything covered by HWCAP doesn't need an explicit check. So I
>> kept
>> the LS64 and PREDRES checks since they don't have a HWCAP allocated (I'm not
>> entirely convinced we need these, let alone having 3 individual bits for
>> LS64, but
>> that's something for the A
Hi Richard,
I've reworded the commit message a bit:
The CPU features initialization code uses CPUID registers (rather than
HWCAP). The equality comparisons it uses are incorrect: for example FEAT_SVE
is not set if SVE2 is available. Using HWCAPs for these is both simpler and
correct. The initi
Fix CPU features initialization. Use HWCAP rather than explicit accesses
to CPUID registers. Perform the initialization atomically to avoid multi-
threading issues.
Passes regress, OK for commit and backport?
libgcc:
PR target/115342
* config/aarch64/cpuinfo.c (__init_cpu_featu
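A minimal sketch (my own, not the actual libgcc code; the struct and field
names are hypothetical and it only builds on AArch64 Linux) of the HWCAP-based
approach described above:

#include <sys/auxv.h>
#include <asm/hwcap.h>

/* Hypothetical feature record, for illustration only.  */
struct cpu_features
{
  int has_sve;
  int has_lse;
};

static void
init_cpu_features (struct cpu_features *f)
{
  unsigned long hwcap = getauxval (AT_HWCAP);

  /* HWCAP bits are cumulative: HWCAP_SVE is also set on SVE2 systems,
     unlike an equality comparison on a CPUID register field.  */
  f->has_sve = (hwcap & HWCAP_SVE) != 0;
  f->has_lse = (hwcap & HWCAP_ATOMICS) != 0;
}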
A Thumb-1 memory operand allows single-register LDMIA/STMIA. This doesn't get
printed as LDR/STR with writeback in unified syntax, resulting in strange
assembler errors if writeback is selected. To work around this, use the 'Uw'
constraint that blocks writeback.
Passes bootstrap & regress, OK for
The valid offset range of LDRD in arm_legitimate_index_p is increased to
-1024..1020 if NEON is enabled since VALID_NEON_DREG_MODE includes DImode.
Fix this by moving the LDRD check earlier.
Passes bootstrap & regress, OK for commit?
gcc:
PR target/115153
* config/arm/arm.cc (arm
Hi Richard,
> I think this should be in a push_options/pop_options block, as for other
> intrinsics that require certain features.
But then the intrinsic would always be defined, which is contrary to what the
ACLE spec demands - it would not give a compilation error at the callsite
but give assem
Add __ARM_FEATURE_MOPS predefine. Add support for ACLE __arm_mops_memset_tag.
Passes regress, OK for commit?
gcc:
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Add __ARM_FEATURE_MOPS predefine.
* config/aarch64/arm_acle.h: Add __arm_mops_memset_tag().
gc
Improve check-function-bodies by allowing single-character function names.
Also skip '#' comments which may be emitted from inline assembler.
Passes regress, OK for commit?
gcc/testsuite:
* lib/scanasm.exp (configure_check-function-bodies): Allow single-char
function names. Skip
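For illustration, a hypothetical AArch64 testcase with a single-character
function name that the previous check-function-bodies matching would skip (the
expected assembly lines are my assumption about -O2 output):

/* { dg-do compile } */
/* { dg-options "-O2" } */
/* { dg-final { check-function-bodies "**" "" } } */

int
f (void)
{
  return 0;
}

/*
** f:
**	mov	w0, 0
**	ret
*/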
Hi Andrew,
A few comments on the implementation, I think it can be simplified a lot:
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE =
> AARCH64_FL_SM_OFF;
> #define DWARF2_UNWIND_INFO 1
>
> /* Use R0 through R3 to pass exception handling
Hi Andrew,
> I should note popcount has a similar issue which I hope to fix next week.
> Popcount cost is used during expand so it is very useful to be slightly more
> correct.
It's useful to set the cost so that all of the special cases still apply - even
if popcount is
relatively fast, it's s
Improve costing of ctz - both TARGET_CSSC and vector cases were not handled yet.
Passes regress & bootstrap - OK for commit?
gcc:
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing.
---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index
f
Add missing '\' in 2-instruction movsi/di alternatives so that they are
printed on separate lines.
Passes bootstrap and regress, OK for commit once stage 1 reopens?
gcc:
* config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force
newline in 2-instruction pattern.
(movdi
Use LDP/STP for large struct types as they have useful immediate offsets and
are typically faster.
This removes differences between little and big endian and allows use of
LDP/STP without UNSPEC.
Passes regress and bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64.cc (aarch64_clas
Use UZP1 instead of INS when combining low and high halves of vectors.
UZP1 has 3 operands which improves register allocation, and is faster on
some microarchitectures.
Passes regress & bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64-simd.md (aarch64_combine_internal):
Use
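An illustrative ACLE example (mine, not from the patch): vuzp1q_s64 maps to
UZP1, whose destination register is independent of both sources, whereas an
INS-based combine ties the destination to one of its inputs:

#include <arm_neon.h>

/* Returns { a[0], b[0] }, i.e. the low halves of both inputs.  */
int64x2_t
combine_low (int64x2_t a, int64x2_t b)
{
  return vuzp1q_s64 (a, b);
}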
According to the documentation, '^' should only have an effect during reload.
However ira-costs.cc treats it in the same way as '?' during early costing.
As a result using '^' can accidentally disable valid alternatives and cause
significant regressions (see PR114741). Avoid this by ignoring '^' duri
A few HWCAP entries are missing from aarch64/cpuinfo.c. This results in build
errors on older machines.
This counts as a trivial build fix, but since it's late in stage 4 I'll let
maintainers chip in.
OK for commit?
libgcc/
* config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32,
HWC
As mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html ,
do some additional cleanup of the macros and aliases:
Clean up the macros to add the libat_ prefixes in atomic_16.S. Emit the
alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.
Passes regress and
Hi Richard,
> This description is too brief for me. Could you say in detail how the
> new scheme works? E.g. the description doesn't explain:
>
> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS = -DHAVE_FEAT_LSE128
> -endif
That is not needed because we can include auto-config.h in atomic_16.
On Thumb-2 the use of CBZ blocks conditional execution, so change the
test to compare with a non-zero value.
gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ.
---
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
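A generic illustration of the kind of change (not the actual builtin-bswap.x
test): comparing against a non-zero value keeps a CMP-based sequence that can
be predicated, whereas a compare against zero may become CBZ/CBNZ:

extern void g (void);

void
f (int x)
{
  /* 'if (x != 0)' could be emitted as CBZ over the call, blocking
     conditional execution; 'x != 1' forces a CMP instead.  */
  if (x != 1)
    g ();
}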
Hi Richard,
> Did you test this on a thumb1 target? It seems to me that the target parts
> that you've
> removed were likely related to that. In fact, I don't see why this test
> would need to be changed at all.
The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns
wer
Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
Always build atomic_16.S and add aliases to the __atomic_* functions if
!HAVE_IFUNC.
Passes regress and bootstrap, OK for commit?
libatomic:
PR target/113986
* Makefile.in: Regenerated.
* Makefile.
Hi Richard,
> This bit isn't. The correct fix here is to fix the pattern(s) concerned to
> add the missing predicate.
>
> Note that builtin-bswap.x explicitly mentions predicated mnemonics in the
> comments.
I fixed the patterns in v2. There are likely some more, plus we could likely
merge ma
Hi Richard,
> It looks like this is really doing two things at once: disabling the
> direct emission of LDP/STP Qs, and switching the GPR handling from using
> pairs of DImode moves to single TImode moves. At least, that seems to be
> the effect of...
No, it still uses TImode for the !TARGET_SIMD
By default most patterns can be conditionalized on Arm targets. However
Thumb-2 predication requires the "predicable" attribute be explicitly
set to "yes". Most patterns are shared between Arm and Thumb(-2) and are
marked with "predicable". Given this sharing, it does not make sense to
use a di
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on th
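For example (my own sketch, not a testcase from the patch), a small fixed-size
copy can be expanded as plain 16-byte vector loads and stores; the LDP/STP
fusion pass can then pair them without any UNSPEC in the RTL:

#include <string.h>

void
copy32 (char *dst, const char *src)
{
  /* Expandable as two Q-register loads followed by two Q-register stores,
     which the fusion pass can merge into LDP/STP of Q registers.  */
  memcpy (dst, src, 32);
}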
Hi Richard,
>> That tune is only used by an obsolete core. I ran the memcpy and memset
>> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
>> There is no measurable penalty for using LDP/STP. I'm not sure why it was
>> ever added given it does not do anything useful. I'll po
(follow-on based on review comments on
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html)
Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only
used by an old core and doesn't properly support -Os. SPECINT_2017
shows that removing it has no performance difference
Hi,
>> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
>> ID).
>>
>> Passes regress, OK for commit?
>
> Ok.
Also OK to backport to GCC 13, 12 and 11?
Cheers,
Wilco
Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID).
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU.
* config/aarch64/aarch64-tune.md: Regenerated.
* doc/invoke.texi (-mcpu):
Hi Richard,
>> + rtx base = strip_offset_and_salt (XEXP (x, 1), &offset);
>
> This should be just strip_offset, so that we don't lose the salt
> during optimisation.
Fixed.
> +
> + if (offset.is_constant ())
> I'm not sure this is really required. Logically the same thing
> would app
GCC tends to optimistically create CONST of globals with an immediate offset.
However it is almost always better to CSE addresses of globals and add immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in aarch64_legitimize_
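A small example of the situation described (names and offsets made up):
several accesses to one global at nearby offsets are better served by CSE-ing
the base address and adding immediate offsets than by materializing a separate
symbol+offset CONST for each access:

extern int table[1024];

int
sum3 (void)
{
  /* Prefer one address of 'table' in a register plus three loads with
     immediate offsets, rather than three CONSTs of the form table+N.  */
  return table[100] + table[101] + table[102];
}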
Hi Richard,
>> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
>
> Since this isn't (AFAIK) a standard macro, there doesn't seem to be
> any need to put it in the header file. It could just go at the head
> of aarch64.cc instead.
Sure, I've moved it in v4.
>> + if (len <= 24 || (aarch64_tune_p
Hi Richard,
>> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance
>> once
>> the atomic is acquire, release or both. Given there is already a significant
>> overhead due
>> to the function call, PLT indirection and argument setup, it doesn't make
>> sense to add
>> extra
Hi,
>> Is there no benefit to using SWPPL for RELEASE here? Similarly for the
>> others.
>
> We started off implementing all possible memory orderings available.
> Wilco saw value in merging less restricted orderings into more
> restricted ones - mainly to reduce codesize in less frequently use
v3: rebased to latest trunk
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress & bootstrap.
gcc/ChangeLog:
* config/aarch64/aarch64.h (MAX_SET_SI
Hi Richard,
>> Enable lock-free 128-bit atomics on AArch64. This is backwards compatible
>> with
>> existing binaries, gives better performance than locking atomics and is what
>> most users expect.
>
> Please add a justification for why it's backwards compatible, rather
> than just stating that
Hi Richard,
> + rtx load[max_ops], store[max_ops];
>
> Please either add a comment explaining why 40 is guaranteed to be
> enough, or (my preference) use:
>
> auto_vec, ...> ops;
I've changed to using auto_vec since that should help reduce conflicts
with Alex' LDP changes. I double-checked maxi
Hi Richard,
Thanks for the review, now committed.
> The new aarch64_split_compare_and_swap code looks a bit twisty.
> The approach in lse.S seems more obvious. But I'm guessing you
> didn't want to spend any time restructuring the pre-LSE
> -mno-outline-atomics code, and I agree the patch in its
Hi Richard,
> +/* Maximum bytes set for an inline memset expansion. With -Os use 3 STP
> + and 1 MOVI/DUP (same size as a call). */
> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
> So it looks like this assumes we have AdvSIMD. What about
> -mgeneral-regs-only?
After my strictalign bugf
Hi,
>>> I checked codesize on SPECINT2017, and 96 had practically identical size.
>>> Using 128 would also be a reasonable Os value with a very slight size
>>> increase,
>>> and 384 looks good for O2 - however I didn't want to tune these values
>>> as this
>>> is a cleanup patch.
>>>
>>> Cheers,
>
Hi Kyrill,
> + if (!(hwcap & HWCAP_CPUID))
> + return false;
> +
> + unsigned long midr;
> + asm volatile ("mrs %0, midr_el1" : "=r" (midr));
> From what I recall that midr_el1 register is emulated by the kernel and so
> userspace software
> has to check that the kernel supports that emula
Hi Kyrill,
> + /* Reduce the maximum size with -Os. */
> + if (optimize_function_for_size_p (cfun))
> + max_set_size = 96;
> +
> This is a new "magic" number in this code. It looks sensible, but how
> did you arrive at it?
We need 1 instruction to create the value to store (DUP or MO
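The arithmetic behind the -Os limit, as the macro quoted in the review excerpt
above encodes it: one MOVI/DUP creates the value and three STPs of Q registers
store 3 * 32 = 96 bytes, so the expansion is no larger than a call:

/* Maximum bytes set for an inline memset expansion.  With -Os use 3 STP
   and 1 MOVI/DUP (same size as a call).  */
#define MAX_SET_SIZE(speed) (speed ? 256 : 96)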
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use
ping
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In
ping
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_progre
ping
v2: further cleanups, improved comments
Add support for inline memmove expansions. The generated code is identical
as for memcpy, except that all loads are emitted before stores rather than
being interleaved. The maximum size is 256 bytes which requires at most 16
registers.
Passes regre
ping
v2: Use UINTVAL, rename max_mops_size.
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/Ch
v2: Use check-function-bodies in tests
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_progress_poin
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Hi Ramana,
> I remember this to be the previous discussions and common understanding.
>
> https://gcc.gnu.org/legacy-ml/gcc/2016-06/msg00017.html
>
> and here
>
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00168.html
>
> Can you point any discussion recently that shows this has changed
ping
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In thi
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
ping
v2: Use UINTVAL, rename max_mops_size.
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/Cha
v2: further cleanups, improved comments
Add support for inline memmove expansions. The generated code is identical
as for memcpy, except that all loads are emitted before stores rather than
being interleaved. The maximum size is 256 bytes which requires at most 16
registers.
Passes regress/boot
Hi Ramana,
>> I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
>> --build=arm-none-linux-gnueabihf --with-float=hard. However it seems that the
>> default armhf settings are incorrect. I shouldn't need the --with-float=hard
>> since
>> that is obviously implied by armhf, a
Hi Ramana,
> Hope this helps.
Yes definitely!
>> Passes regress/bootstrap, OK for commit?
>
> Target ? armhf ? --with-arch , -with-fpu , -with-float parameters ?
> Please be specific.
I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
--build=arm-none-linux-gnueabihf --wit
The outline atomic functions have hidden visibility and can only be called
directly. Therefore we can remove the BTI at function entry. This improves
security by reducing the number of indirect entry points in a binary.
The BTI markings on the objects are still emitted.
Passes regress, OK for c
Hi Ramana,
>> __sync_val_compare_and_swap may be used on 128-bit types and either calls the
>> outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic
>> if
>> the value is stored successfully using STXP, but the current implementations
>> do not perform the store if the compa
Add support for inline memmove expansions. The generated code is identical
as for memcpy, except that all loads are emitted before stores rather than
being interleaved. The maximum size is 256 bytes which requires at most 16
registers.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog
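An illustrative case (my own example) of why all loads are emitted before the
stores for memmove: with overlapping source and destination, interleaved
load/store pairs could read bytes that have already been overwritten:

#include <string.h>

void
shift_down (char *buf)
{
  /* Overlapping copy: the inline expansion must load all 32 bytes into
     registers before storing any of them one byte lower.  */
  memmove (buf, buf + 1, 32);
}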
v2: Use UINTVAL, rename max_mops_size.
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
Hi Richard,
> * config/aarch64/aarch64.md (cpymemdi): Remove pattern condition.
> Shouldn't this be a separate patch? It's not immediately obvious that this
> is a necessary part of this change.
You mean this?
@@ -1627,7 +1627,7 @@ (define_expand "cpymemdi"
(match_operand:BLK 1 "m
A MOPS memmove may corrupt registers since there is no copy of the input
operands to temporary registers. Fix this by calling
aarch64_expand_cpymem_mops.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
PR target/21
* config/aarch64/aarch64.md (aarc
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
PR target/103100
* con
Hi Richard,
>> Note that aarch64_internal_mov_immediate may be called after reload,
>> so it would end up even more complex.
>
> The sequence I quoted was supposed to work before and after reload. The:
>
> rtx tmp = aarch64_target_reg (dest, DImode);
>
> would create a fresh tempor
Hi Richard,
> I was worried that reusing "dest" for intermediate results would
> prevent CSE for cases like:
>
> void g (long long, long long);
> void
> f (long long *ptr)
> {
> g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
> }
Note that aarch64_internal_mov_immediate may be called after relo
Support expansion of immediates which can be created from 2 MOVKs
and a shifted ORR or BIC instruction. Change aarch64_split_dimode_const_store
to apply if we save one instruction.
This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
Passes regress, OK for commit?
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use the
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In this case
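A usage sketch (mine) of the affected builtin: on AArch64 this goes through
the outline-atomic code or an inline LDXP/STXP loop, and the 128-bit load is
only guaranteed to be atomic if a store is performed, hence the fix stores
even when the comparison fails:

__int128
cas128 (__int128 *p, __int128 expected, __int128 desired)
{
  /* Returns the old value; the implementation must also issue a store on
     the failure path so that the LDXP read is atomic.  */
  return __sync_val_compare_and_swap (p, expected, desired);
}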
List official cores first so that -mcpu=native does not show a codename with -v
or in errors/warnings.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
(neoverse-v1): Place before zeus.
(neoverse-v2): Place b
The v7 memory ordering model allows reordering of conditional atomic
instructions.
To avoid this, make all atomic patterns unconditional. Expand atomic loads and
stores for all architectures so the memory access can be wrapped into an UNSPEC.
Passes regress/bootstrap, OK for commit?
gcc/ChangeL
Hi Richard,
(that's quick!)
> + if (size > max_copy_size || size > max_mops_size)
> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>
> Could you explain this a bit more? If I've followed the logic correctly,
> max_copy_size will always be 0 for movmem, so this "if" condition wil
A MOPS memmove may corrupt registers since there is no copy of the input
operands to temporary
registers. Fix this by calling aarch64_expand_cpymem which does this. Also
fix an issue with
STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and avoid crashing or
generating a huge
expansion
Hi Richard,
>>> Answering my own question, N1 does not officially have FEAT_LSE2.
>>
>> It doesn't indeed. However most cores support atomic 128-bit load/store
>> (part of LSE2), so we can still use the LSE2 ifunc for those cores. Since
>> there
>> isn't a feature bit for this in the CPU or HWCA
Hi Richard,
>> Why would HWCAP_USCAT not be set by the kernel?
>>
>> Failing that, I would think you would check ID_AA64MMFR2_EL1.AT.
>>
> Answering my own question, N1 does not officially have FEAT_LSE2.
It doesn't indeed. However most cores support atomic 128-bit load/store
(part of LSE2), so
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.
Passes regress, OK for commit?
libatomic/
* config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
selection.
-
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.
Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not
supported.
This results in an implicit store wh
ping
From: Wilco Dijkstra
Sent: 23 February 2023 15:11
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has Load-AcquirePC semantics similar to LDAPR,
which is less restrictive than what __ATOMIC_SEQ_CST requires. This patch
fixes this and adds comments to make it easier to see which sequence is
us
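A usage sketch (mine) of the operation being fixed: a SEQ_CST 16-byte atomic
load, which libatomic's LSE2 ifunc implements with LDP plus, after this patch,
the barrier required for sequential consistency:

__int128
load128 (__int128 *p)
{
  /* Routed to libatomic's ifunc-selected 16-byte load.  */
  return __atomic_load_n (p, __ATOMIC_SEQ_CST);
}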
Hi,
>> + /* Return-address signing state is toggled by DW_CFA_GNU_window_save
>> (where
>> + REG_UNDEFINED means enabled), or set by a DW_CFA_expression. */
>
> Needs updating to REG_UNSAVED_ARCHEXT.
>
> OK with that change, thanks, and sorry for the delays & runaround.
Thanks, I've impr
Hi,
> @Wilco, can you please send the rebased patch for patch review? We would
> need it in our openSUSE package soon.
Here is an updated and rebased version:
Cheers,
Wilco
v4: rebase and add REG_UNSAVED_ARCHEXT.
A recent change only initializes the regs.how[] during Dwarf unwinding
which resulte
Hi,
> On 1/10/23 19:12, Jakub Jelinek via Gcc-patches wrote:
>> Anyway, the sooner this makes it into gcc trunk, the better, it breaks quite
>> a lot of stuff.
>
> Yep, please, we're also waiting for this patch for pushing to our gcc13
> package.
Well I'm waiting for an OK from a maintainer... I
Hi Szabolcs,
> i would keep the assert: how[reg] must be either UNSAVED or UNDEFINED
> here, other how[reg] means the toggle cfi instruction is mixed with
> incompatible instructions for the pseudo reg.
>
> and i would add a comment about this e.g. saying that UNSAVED/UNDEFINED
> how[reg] is used
Hi Richard,
> Hmm, but the point of the original patch was to support code generators
> that emit DW_CFA_val_expression instead of DW_CFA_AARCH64_negate_ra_state.
> Doesn't this patch undo that?
Well it wasn't clear from the code or comments that was supported. I've
added that back in v2.
> Also
Enable TARGET_CONST_ANCHOR to allow complex constants to be created via
immediate add.
Use a 24-bit range as that enables a 3 or 4-instruction immediate to be
replaced by
2 additions. Fix the costing of immediate add to support 24-bit immediate and
12-bit shifted
immediates. The generated code
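An illustrative example (constants made up): with TARGET_CONST_ANCHOR and a
24-bit range, the second constant below can be derived from the first with ADD
instructions instead of a fresh multi-instruction MOV/MOVK sequence:

void g (unsigned long, unsigned long);

void
f (void)
{
  /* The arguments differ by 0x123456, a 24-bit amount reachable with a
     12-bit ADD plus a 12-bit shifted ADD.  */
  g (0x1234567890abUL, 0x1234567890abUL + 0x123456);
}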