Hi Richard,
> I was worried that reusing "dest" for intermediate results would
> prevent CSE for cases like:
>
> void g (long long, long long);
> void
> f (long long *ptr)
> {
>   g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
> }
Note that aarch64_internal_mov_immediate may be called after
Support expansion of immediates which can be created from 2 MOVKs and a
shifted ORR or BIC instruction. Change aarch64_split_dimode_const_store
to apply if we save one instruction.
This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
Passes regress, OK for commit?
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries,
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In this case
List official cores first so that -mcpu=native does not show a codename with -v
or in errors/warnings.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
(neoverse-v1): Place before zeus.
(neoverse-v2): Place
The v7 memory ordering model allows reordering of conditional atomic
instructions.
To avoid this, make all atomic patterns unconditional. Expand atomic loads and
stores for all architectures so the memory access can be wrapped into an UNSPEC.
Passes regress/bootstrap, OK for commit?
Hi Richard,
(that's quick!)
> + if (size > max_copy_size || size > max_mops_size)
> +    return aarch64_expand_cpymem_mops (operands, is_memmove);
>
> Could you explain this a bit more? If I've followed the logic correctly,
> max_copy_size will always be 0 for movmem, so this "if" condition
A MOPS memmove may corrupt registers since there is no copy of the input
operands to temporary registers. Fix this by calling aarch64_expand_cpymem
which does this. Also fix an issue with STRICT_ALIGNMENT being ignored if
TARGET_MOPS is true, and avoid crashing or generating a huge expansion
Hi Richard,
>>> Answering my own question, N1 does not officially have FEAT_LSE2.
>>
>> It doesn't indeed. However most cores support atomic 128-bit load/store
>> (part of LSE2), so we can still use the LSE2 ifunc for those cores. Since
>> there
>> isn't a feature bit for this in the CPU or
Hi Richard,
>> Why would HWCAP_USCAT not be set by the kernel?
>>
>> Failing that, I would think you would check ID_AA64MMFR2_EL1.AT.
>>
> Answering my own question, N1 does not officially have FEAT_LSE2.
It doesn't indeed. However most cores support atomic 128-bit load/store
(part of LSE2), so
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.
Passes regress, OK for commit?
libatomic/
* config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc selection.
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.
Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not
supported.
This results in an implicit store
ping
From: Wilco Dijkstra
Sent: 23 February 2023 15:11
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has Load-AcquirePC semantics similar to LDAPR,
which is less restrictive than what __ATOMIC_SEQ_CST requires. This patch
fixes this and adds comments to make it easier to see which sequence is
Hi,
>> + /* Return-address signing state is toggled by DW_CFA_GNU_window_save
>> (where
>> + REG_UNDEFINED means enabled), or set by a DW_CFA_expression. */
>
> Needs updating to REG_UNSAVED_ARCHEXT.
>
> OK with that changes, thanks, and sorry for the delays & runaround.
Thanks, I've
Hi,
> @Wilco, can you please send the rebased patch for patch review? We would
> need it in our openSUSE package soon.
Here is an updated and rebased version:
Cheers,
Wilco
v4: rebase and add REG_UNSAVED_ARCHEXT.
A recent change only initializes the regs.how[] during Dwarf unwinding
which
Hi,
> On 1/10/23 19:12, Jakub Jelinek via Gcc-patches wrote:
>> Anyway, the sooner this makes it into gcc trunk, the better, it breaks quite
>> a lot of stuff.
>
> Yep, please, we're also waiting for this patch for pushing to our gcc13
> package.
Well I'm waiting for an OK from a maintainer...
Hi Szabolcs,
> i would keep the assert: how[reg] must be either UNSAVED or UNDEFINED
> here, other how[reg] means the toggle cfi instruction is mixed with
> incompatible instructions for the pseudo reg.
>
> and i would add a comment about this e.g. saying that UNSAVED/UNDEFINED
> how[reg] is used
Hi Richard,
> Hmm, but the point of the original patch was to support code generators
> that emit DW_CFA_val_expression instead of DW_CFA_AARCH64_negate_ra_state.
> Doesn't this patch undo that?
Well it wasn't clear from the code or comments that this was supported. I've
added that back in v2.
Enable TARGET_CONST_ANCHOR to allow complex constants to be created via
immediate add. Use a 24-bit range as that enables a 3- or 4-instruction
immediate to be replaced by 2 additions. Fix the costing of immediate add
to support 24-bit immediates and 12-bit shifted immediates. The generated
Hi Andreas,
Thanks for the report, I've committed the fix:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006
Cheers,
Wilco
Ensure we only pass SI/DImode which fixes the assert.
Committed as obvious.
gcc/
PR target/108006
* config/aarch64/aarch64.c (aarch64_expand_sve_const_vector):
Fix call to aarch64_move_imm to use SI/DI.
---
diff --git a/gcc/config/aarch64/aarch64.cc
Hi,
> i don't think how[*RA_STATE] can ever be set to REG_SAVED_OFFSET,
> this pseudo reg is not spilled to the stack, it is reset to 0 in
> each frame and then toggled within a frame.
It's just a state; we can use any state we want since it is a pseudo reg.
These registers are global and
Hi Richard,
> - scalar_int_mode imode = (mode == HFmode
> - ? SImode
> - : int_mode_for_mode (mode).require ());
> + machine_mode imode = (mode == DFmode) ? DImode : SImode;
> It looks like this might mishandle DDmode, if not now
A recent change only initializes the regs.how[] during Dwarf unwinding
which resulted in an uninitialized offset used in return address signing
and random failures during unwinding. The fix is to use REG_SAVED_OFFSET
as the state where the return address signing bit is valid, and if the
state is
Hi Richard,
> Just to make sure I understand: isn't it really just MOVN? I would have
> expected a 32-bit MOVZ to be equivalent to (and add no capabilities over)
> a 64-bit MOVZ.
The 32-bit MOVZ immediates are equivalent, MOVN never overlaps, and
MOVI has some overlaps. Since we allow all 3
Hi Richard,
>> A smart reassociation pass could form more FMAs while also increasing
>> parallelism, but the way it currently works always results in fewer FMAs.
>
> Yeah, as Richard said, that seems the right long-term fix.
> It would also avoid the hack of treating PLUS_EXPR as a signal
> of an
Hi Richard,
> I guess an obvious question is: if 1 (rather than 2) was the right value
> for cores with 2 FMA pipes, why is 4 the right value for cores with 4 FMA
> pipes? It would be good to clarify how, conceptually, the core property
> should map to the fma_reassoc_width value.
1 turns off
Hi Richard,
> Can you go into more detail about:
>
> Use :option:`-mdirect-extern-access` either in shared libraries or in
> executables, but not in both. Protected symbols used both in a shared
> library and executable may cause linker errors or fail to work correctly
>
> If this is
Add a new option -mdirect-extern-access similar to other targets. This removes
GOT indirections on external symbols with -fPIE, resulting in significantly
better code quality. With -fPIC it only affects protected symbols, allowing
for more efficient shared libraries which can be linked with
Add support for AArch64 LSE and LSE2 to libatomic. Disable outline atomics,
and use LSE ifuncs for 1-8 byte atomics and LSE2 ifuncs for 16-byte atomics.
On Neoverse V1, 16-byte atomics are ~4x faster due to avoiding locks.
Note this is safe since we swap all 16-byte atomics using the same ifunc,
Add a reassocation width for FMAs in per-CPU tuning structures. Keep the
existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
FMA pipes. This improves SPECFP2017 on Neoverse V1 by ~1.5%.
Passes regress/bootstrap, OK for commit?
gcc/
PR 107413
*
Committed as trivial fix.
gcc/testsuite/
* gcc.target/aarch64/mgeneral-regs_3.c: Fix testcase.
---
diff --git a/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c
b/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c
index
Hi Richard,
Here is the immediate cleanup splitoff from the previous patch:
Simplify, refactor and improve various move immediate functions.
Allow 32-bit MOVZ/N as a valid 64-bit immediate which removes special
cases in aarch64_internal_mov_immediate. Add new constraint so the movdi
pattern
Hi Richard,
> Can you do the aarch64_mov_imm changes as a separate patch? It's difficult
> to review the two changes folded together like this.
Sure, I'll send a separate patch. So here is version 2 again:
[PATCH v2][AArch64] Improve immediate expansion [PR106583]
Improve immediate expansion
ping
Hi Richard,
>>> Sounds good, but could you put it before the mode version,
>>> to avoid the forward declaration?
>>
>> I can swap them around but the forward declaration is still required as
>> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm.
>
> OK, how about moving them
Hi Richard,
> Maybe pre-existing, but are ordered comparisons safe for the
> ZERO_EXTRACT case? If we extract the top 8 bits (say), zero extend,
> and compare with zero, the result should be >= 0, whereas TST would
> set N to the top bit.
Yes in principle zero extract should always be positive
Hi Richard,
>>> Sounds good, but could you put it before the mode version,
>>> to avoid the forward declaration?
>>
>> I can swap them around but the forward declaration is still required as
>> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm.
>
> OK, how about moving them both
Hi Richard,
> Realise this is awkward, but: CC_NZmode is for operations that set only
> the N and Z flags to useful values. If we want to take advantage of V
> being zero then I think we need a different mode.
>
> We can't go all the way to CCmode because the carry flag has the opposite
> value
Hi Richard,
>> Yes, with a more general search loop we can get that case too -
>> it doesn't trigger much though. The code that checks for this is
>> now refactored into a new function. Given there are now many
>> more calls to aarch64_bitmask_imm, I added a streamlined internal
>> entry point
Hi Richard,
> Did you consider handling the case where the movks aren't for
> consecutive bitranges? E.g. the patch handles:
> but it looks like it would be fairly easy to extend it to:
>
> 0x12345678
Yes, with a more general search loop we can get that case too -
it doesn't trigger
Since AArch64 sets all flags on logical operations, comparisons with zero
can be combined into an AND even if the condition is LE or GT.
Passes regress, OK for commit?
gcc:
PR target/105773
* config/aarch64/aarch64.cc (aarch64_select_cc_mode): Allow
GT/LE for merging
Improve expansion of immediates which can be created from a bitmask
immediate and 2 MOVKs. This reduces the number of 4-instruction
immediates in SPECINT/FP by 10-15%.
Passes regress, OK for commit?
gcc/ChangeLog:
PR target/106583
* config/aarch64/aarch64.cc
Further cleanup option processing. Remove the duplication of global
variables for CPU and tune settings so that CPU option processing is
simplified even further. Move global variables that need save and
restore due to target option processing into aarch64.opt. This removes
the need for explicit
Hi Richard,
I've added a comment - as usual it's just a number. A quick grep in gcc and
glibc showed that priorities 98-101 are used, so I just went a bit below so it
has a higher priority than typical initializations.
Cheers,
Wilco
Here is v2:
Increase the priority of the
Increase the priority of the init_have_lse_atomics constructor so it runs
before other constructors. This improves chances that rr works when LSE
atomics are supported.
Regress and bootstrap pass, OK for commit?
2022-05-24 Wilco Dijkstra
libgcc/
PR libgcc/105708
*
Hi Sebastian,
>> Note the patch still needs an appropriate commit message.
>
> Added the following ChangeLog entry to the commit message.
>
> * config/aarch64/aarch64-protos.h (atomic_ool_names): Increase
>dimension
> of str array.
> * config/aarch64/aarch64.cc
Hi Richard,
> But even if the costs are too high, the patch seems to be overcompensating.
> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.
An LDR is not a replacement for ADRP+LDR; you need a store in addition to the
original ADRP+LDR. Basically a simple spill would be
Hi,
>> It's also said that chosen alternatives might be the reason that
>> rematerialization
>> is not choosen and alternatives are chosen based on reload heuristics, not
>> based
>> on actual costs.
>
> Thanks for the pointer. Yeah, it'd be interesting to know if this
> is the same issue,
Hi Richard,
> Looks like you might have attached the old patch. The aarch64_option_restore
> change is mentioned in the changelog but doesn't appear in the patch itself.
Indeed, not sure how that happened. Here is the correct v2 anyway.
Wilco
The --with-cpu/--with-arch configure option
Hi Richard,
> Although invoking ./cc1 directly only half-works with --with-arch,
> it half-works well-enough that I'd still like to keep it working.
> But I agree we should apply your change first, then I can follow up
> with a patch to make --with-* work with ./cc1 later. (I have a version
>
Hi Richard,
> Yeah, I'm not disagreeing with any of that. It's just a question of
> whether the problem should be fixed by artificially lowering the general
> rtx costs with one particular user (RA spill costs) in mind, or whether
> it should be fixed by making the RA spill code take the factors
Hi Richard,
>> There isn't really a better way of doing this within the existing costing
>> code.
>
> Yeah, I was wondering whether we could change something there.
> ADRP+LDR is logically more expensive than a single LDR, especially
> when optimising for size, so I think it's reasonable for the
Hi Richard,
> I'm not questioning the results, but I think we need to look in more
> detail why rematerialisation requires such low costs. The point of
> comparison should be against a spill and reload, so any constant
> that is as cheap as a load should be rematerialised. If that isn't
>
The --with-cpu/--with-arch configure option processing not only checks valid
arguments but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask.
This isn't used however, since a --with-cpu is translated into a -mcpu option
which is processed as if written on the command-line (so
Improve rematerialization costs of addresses. The current costs are set too
high, which results in extra register pressure and spilling. Using lower costs
means addresses will be rematerialized more often rather than being spilled or
causing spills. This results in significant codesize
Hi Sebastian,
> Please find attached the patch amended following your recommendations.
> The number of new functions for _sync is reduced by 3x.
> I tested the patch on Graviton2 aarch64-linux.
> I also checked by hand that the outline functions in libgcc look similar to
> what GCC produces for
Hi Sebastian,
> Wilco pointed out in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162#c7
> that
> "Only __sync needs the extra full barrier, but __atomic does not."
> The attached patch does that by adding out-of-line functions for
> MEMMODEL_SYNC_*.
> Those new functions contain a barrier
Improve and generalize rotate patterns. Rotates by more than half the
bitwidth of a register are canonicalized to rotate left. Many existing
shift patterns don't handle this case correctly, so add rotate left to
the shift iterator and convert rotate left into ror during assembly
output. Add
Hi Richard,
> Can you fold in the rtx costs part of the original GOT relaxation patch?
Sure, see below for the updated version.
> I don't think there's enough information here for me to be able to review
> the patch though. I'll need to find testcases, look in detail at what
> the rtl passes
The stack protector implementation hides symbols in a const unspec, which means
movdi/movsi patterns must always support const on symbol operands and explicitly
strip away the unspec. Do this for the recently added GOT alternatives. Add a
test to ensure stack-protector tests GOT accesses as well.
ping
From: Wilco Dijkstra
Sent: 02 June 2021 11:21
To: GCC Patches
Cc: Kyrylo Tkachov ; Richard Sandiford
Subject: [PATCH] AArch64: Improve address rematerialization costs
Hi,
Given the large improvements from better register allocation of GOT accesses,
I decided to generalize it to get
v2: rebased
Hi Richard,
> - Why do we rewrite the constant moves after reload into ldr_got_small_sidi
> and ldr_got_small_? Couldn't we just get the move patterns to
> output the sequence directly?
That's possible too, however it makes the movsi/di patterns more complex.
See version v4 below.
> - I
ping
From: Wilco Dijkstra
Sent: 04 June 2021 14:44
To: Richard Sandiford
Cc: Kyrylo Tkachov ; GCC Patches
Subject: [PATCH v3] AArch64: Improve GOT addressing
Hi Richard,
This merges the v1 and v2 patches and removes the spurious MEM from
ldr_got_small_si/di. This has been rebased after
Hi Richard,
> The problem is that you're effectively asking for these values to be
> taken on faith without providing any analysis and without describing
> how you arrived at the new numbers. Did you try other values too?
> If so, how did they compare with the numbers that you finally chose?
>
Hi Richard,
> I'm just concerned that here we're using the same explanation but with
> different numbers. Why are the new numbers more right than the old ones
> (especially when it comes to code size, where the trade-off hasn't
> really changed)?
Like all tuning/costing parameters, these values
Enable the fast shift feature in Neoverse V1 and N2 tunings as well.
ChangeLog:
2021-10-18 Wilco Dijkstra
* config/aarch64/aarch64.c (neoversev1_tunings):
Enable AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND.
(neoversen2_tunings): Likewise.
---
diff --git
Tune the case-values-threshold setting for modern cores. A value of 11 improves
SPECINT2017 by 0.2% and reduces codesize by 0.04%. With -Os use value 8 which
reduces codesize by 0.07%.
Passes regress, OK for commit?
ChangeLog:
2021-10-18 Wilco Dijkstra
* config/aarch64/aarch64.c
Hi Richard,
> So rather than have two patterns that generate frintn, I think
> it would be better to change the existing frint_pattern entry to
> "roundeven" instead, and fix whatever the fallout is. Hopefully it
> shouldn't be too bad, since we already use the optab names for the
> other
Enable __builtin_roundeven[f] by adding roundeven as an alias to the
existing frintn support.
Bootstrap OK and passes regress.
ChangeLog:
2021-06-18 Wilco Dijkstra
PR target/100966
* config/aarch64/aarch64.md (UNSPEC_FRINTR): Add.
* config/aarch64/aarch64.c
Hi Richard,
This merges the v1 and v2 patches and removes the spurious MEM from
ldr_got_small_si/di. This has been rebased after [1], and the performance
gain has now doubled.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571708.html
Improve GOT addressing by treating the instructions
Hi Richard,
> No. It's never correct to completely wipe out the existing cost - you
> don't know the context where this is being used.
>
> The most you can do is not add any additional cost.
Remember that aarch64_rtx_costs starts like this:
/* By default, assume that everything has
Hi,
Given the large improvements from better register allocation of GOT accesses,
I decided to generalize it to get large gains for normal addressing too:
Improve rematerialization costs of addresses. The current costs are set too
high, which results in extra register pressure and spilling.
Hi Richard,
> Are we actually planning to do any linker relaxations here, or is this
> purely theoretical? If doing relaxations is a realistic possiblity then
> I agree that would be a good/legitimate reason to use a single define_insn
> for both instructions. In that case though, there should
Version v2 uses movsi/di for GOT accesses until after reload as suggested. This
caused worse spilling, however improving the costs of GOT accesses resulted in
better codesize and performance gains:
Improve GOT addressing by treating the instructions as a pair. This reduces
register pressure and
Hi Richard,
> Normally we should only put two instructions in the same define_insn
> if there's a specific ABI or architectural reason for not separating
> them. Doing it purely for optimisation reasons is going against the
> general direction of travel. So I think the first question is: why
>
Improve GOT addressing by emitting the instructions as a pair. This reduces
register pressure and improves code quality. With -fPIC codesize improves by
0.65% and SPECINT2017 improves by 0.25%.
Passes bootstrap and regress. OK for commit?
ChangeLog:
2021-05-05 Wilco Dijkstra
*
Hi Richard,
> Hmm, OK. I guess it makes things more consistent in that sense
> (PIC vs. non-PIC). But on the other side it's making things less
> internally consistent for non-PIC, since we don't use the GOT for
> anything else there. I guess in principle there's a danger that a
> custom *-elf
Hi Andrew,
> I thought that was changed not to use the GOT on purpose.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63874
>
> That is if the symbol is not declared in the TU, then using the GOT is
> correct thing to do.
> Is the testcase gcc.target/aarch64/pr63874.c still working or is not
>
Hi Richard,
> Just to check: I guess this part is an optimisation, because it
> means that we can share the GOT entry with other TUs. Is that right?
> I think it would be worth having a comment either way, whatever the
> rationale. A couple of other very minor things:
It's just to make the
Use a GOT indirection for extern weak symbols instead of a literal - this is
the same as PIC/PIE and mirrors LLVM behaviour. Ensure PIC/PIE use the same
offset limits for symbols that don't use the GOT.
Passes bootstrap and regress. OK for commit?
ChangeLog:
2021-04-27 Wilco Dijkstra
In aarch64_classify_symbol symbols are allowed large offsets on relocations.
This means the offset can use all of the +/-4GB offset, leaving no offset
available for the symbol itself. This results in relocation overflow and
link-time errors for simple expressions like _array + 0xff00.
To
Hi Richard,
> I specifically want to test generic SVE rather than SVE tuned for a
> specific core, so --with-arch=armv8.2-a+sve is the thing I want to test.
Btw that's not actually what you get if you use cc1 - you always get armv8.0,
so --with-arch doesn't work at all. The only case that
Hi Richard,
>>> I share Richard E's concern about the effect of this on people who run
>>> ./cc1 directly. (And I'm being selfish here, because I regularly run
>>> ./cc1 directly on toolchains configured with --with-arch=armv8.2-a+sve.)
>>> So TBH my preference would be to keep the
Hi Richard,
> I share Richard E's concern about the effect of this on people who run
> ./cc1 directly. (And I'm being selfish here, because I regularly run
> ./cc1 directly on toolchains configured with --with-arch=armv8.2-a+sve.)
> So TBH my preference would be to keep the
Hi,
>>> As for your second patch, --with-cpu-64 could be a simple alias indeed,
>>> but what is the exact definition/expected behaviour of a --with-cpu-32
>>> on a target that only supports 64-bit code? The AArch64 target cannot
>>> generate AArch32 code, so we shouldn't silently
Hi Sebastian,
I presume you're trying to unify the --with- options across most targets?
That would be very useful! However there are significant differences between
targets in how they interpret options like --with-arch=native (or -march). So
those differences also need to be looked at and fixed
Add an initial cost table for Cortex-A76 - this is copied from
cortexa57_extra_costs but updates it based on the Optimization Guide.
Use the new cost table on all Neoverse tunings and ensure the tunings
are consistent for all. As a result more compact code is generated
with more combined shift+alu
Hi Richard,
>> + if (size <= 24 || !TARGET_SIMD
>
> Nit: one condition per line when the condition spans multiple lines.
Fixed.
>> + || (size <= (max_copy_size / 2)
>> + && (aarch64_tune_params.extra_tuning_flags
>> + & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)))
>> + copy_bits =
Improve the inline memcpy expansion. Use integer load/store for copies <= 24
bytes instead of SIMD. Set the maximum copy to expand to 256 by default, except
that -Os or no Neon expands up to 128 bytes. When using LDP/STP of Q-registers,
also use Q-register accesses for the unaligned tail,
Hi Jakub,
> On Thu, Oct 08, 2020 at 11:37:24AM +0000, Wilco Dijkstra via Gcc-patches
> wrote:
>> Which optimizations does it enable that aren't possible if the value is
>> defined?
>
> See bugzilla. Note other compilers heavily optimize on those builtins
> undefined
Hi Jakub,
> Having it undefined allows optimizations, and has been that way for years.
Which optimizations does it enable that aren't possible if the value is defined?
> We just should make sure that we optimize code like x ? __builtin_c[lt]z (x)
> : 32;
> etc. properly (and I believe we do).
Btw for PowerPC it is 0..32:
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/assembler/idalangref_cntlzw_instrs.html
Wilco