Re: [PATCH v2] MATCH: Simplify (a ? x : y) eq/ne (b ? x : y) [PR111150]
On 17/07/24 14:00, Andrew Pinski wrote:
> On Wed, Jul 17, 2024 at 5:24 AM Richard Biener wrote:
>>
>> On Tue, Jul 16, 2024 at 3:36 PM Eikansh Gupta wrote:
>>>
>>> This patch adds a match pattern for `(a ? x : y) eq/ne (b ? x : y)`.
>>> In the forwprop1 pass, depending on the type of `a` and `b`, GCC produces
>>> `vec_cond` or `cond_expr`. Based on the observation that `(x != y)` is
>>> TRUE, the pattern can be optimized to produce `(a^b ? TRUE : FALSE)`.
>>>
>>> The patch adds match patterns for a, b:
>>> (a ? x : y) != (b ? x : y) --> (a^b) ? TRUE : FALSE
>>> (a ? x : y) == (b ? x : y) --> (a^b) ? FALSE : TRUE
>>> (a ? x : y) != (b ? y : x) --> (a^b) ? FALSE : TRUE
>>> (a ? x : y) == (b ? y : x) --> (a^b) ? TRUE : FALSE
>>
>> OK.
>
> Pushed as r15-2106-g44fcc1ca11e7ea (with one small change to the
> commit message in the changelog, where tabs should be used before the
> *; most likely a copy and paste error).

It seems that this change triggered an ICE in Linaro CI on 32-bit arm [1]:

--
Executing on host: /home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/builds/destdir/x86_64-pc-linux-gnu/bin/arm-eabi-g++ /home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/gcc/testsuite/g++.dg/tree-ssa/pr50.C -mthumb -march=armv8.1-m.main+mve.fp+fp.dp -mtune=cortex-m55 -mfloat-abi=hard -mfpu=auto -fdiagnostics-plain-output -nostdinc++ -I/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/builds/x86_64-pc-linux-gnu/arm-eabi/gcc-gcc.git~master-stage2/arm-eabi/libstdc++-v3/include/arm-eabi -I/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/builds/x86_64-pc-linux-gnu/arm-eabi/gcc-gcc.git~master-stage2/arm-eabi/libstdc++-v3/include -I/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/libstdc++-v3/libsupc++ -I/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/libstdc++-v3/include/backward -I/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/libstdc++-v3/testsuite/util -fmessage-length=0 -std=gnu++98 -O1 -fdump-tree-forwprop1 -Wno-psabi -S -o pr50.s (timeout = 600)

/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/gcc/testsuite/g++.dg/tree-ssa/pr50.C: In function 'v4si f1_(v4si, v4si, v4si, v4si, v4si, v4si)':
/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/gcc/testsuite/g++.dg/tree-ssa/pr50.C:13:1: error: unrecognizable insn:
(insn 22 21 26 2 (set (reg:V4SI 120 [ ])
        (unspec:V4SI [
                (reg:V4SI 136)
                (reg:V4SI 137)
                (subreg:V4BI (reg:HI 135) 0)
            ] VPSELQ_S)) "/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/gcc/testsuite/g++.dg/tree-ssa/pr50.C":12:17 -1
     (nil))
during RTL pass: vregs
/home/tcwg-buildslave/workspace/tcwg_gnu_1/abe/snapshots/gcc.git~master/gcc/testsuite/g++.dg/tree-ssa/pr50.C:13:1: internal compiler error: in extract_insn, at recog.cc:2848
0x21fd635 internal_error(char const*, ...)
	../../../../../../gcc/gcc/diagnostic-global-context.cc:491
0x9a0958 fancy_abort(char const*, int, char const*)
	../../../../../../gcc/gcc/diagnostic.cc:1725
0x840e4d _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	../../../../../../gcc/gcc/rtl-error.cc:108
0x840e6f _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	../../../../../../gcc/gcc/rtl-error.cc:116
0x83f76b extract_insn(rtx_insn*)
	../../../../../../gcc/gcc/recog.cc:2848
0xf1a805 instantiate_virtual_regs_in_insn
	../../../../../../gcc/gcc/function.cc:1612
0xf1a805 instantiate_virtual_regs
	../../../../../../gcc/gcc/function.cc:1995
0xf1a805 execute
	../../../../../../gcc/gcc/function.cc:2042
--

Should I open a bug report?

[1] https://ci.linaro.org/job/tcwg_gnu_embed_check_gcc--master-thumb_m55_hard_eabi-build/517/artifact/artifacts/00-sumfiles/
Re: [PATCH] arm: Update fp16-aapcs-[24].c after insn_propagation patch
On 19/07/24 11:25, Richard Earnshaw (lists) wrote:
> On 11/07/2024 19:31, Richard Sandiford wrote:
>> These tests used to generate:
>>
>> bl swap
>> ldr r2, [sp, #4]
>> mov r0, r2 @ __fp16
>>
>> but g:9d20529d94b23275885f380d155fe8671ab5353a means that we can
>> load directly into r0:
>>
>> bl swap
>> ldrh r0, [sp, #4] @ __fp16
>>
>> This patch updates the tests to "defend" this change.
>>
>> While there, the scans include:
>>
>> mov\tr1, r[03]}
>>
>> But if the spill of r2 occurs first, there's no real reason why
>> r2 couldn't be used as the temporary, instead of r3.
>>
>> The patch tries to update the scans while preserving the spirit
>> of the originals.
>>
>> Spot-checked with armv8l-unknown-linux-gnueabihf. OK to install?
>>
>> Richard
>
> OK.
>
> I'm not sure that these tests are really doing very much; it would probably
> be better if they could be rewritten using the gcc.target/arm/aapcs
> framework. But that's for another day.
>
> R.

Hi Richard,

It seems that this is not fully fixed on all configurations; I am still seeing:

FAIL: gcc.target/arm/fp16-aapcs-1.c scan-assembler vmov\\.f32\\ts1, s0
FAIL: gcc.target/arm/fp16-aapcs-2.c scan-assembler-times mov\\tr[01], r[0-9]+ 2
FAIL: gcc.target/arm/fp16-aapcs-2.c scan-assembler str\\tr2, ([^\\n]*).*ldrh\\tr0, \\1
FAIL: gcc.target/arm/fp16-aapcs-3.c scan-assembler vmov\\.f32\\ts1, s0
FAIL: gcc.target/arm/fp16-aapcs-4.c scan-assembler-times mov\\tr[01], r[0-9]+ 2
FAIL: gcc.target/arm/fp16-aapcs-4.c scan-assembler str\\tr2, ([^\\n]*).*ldrh\\tr0, \\1

GCC is configured with --with-float=hard --with-fpu=vfpv3-d16 --with-mode=thumb --with-tune=cortex-a9 --with-arch=armv7-a

https://ci.linaro.org/job/tcwg_gnu_cross_check_gcc--master-arm-build/1561/artifact/artifacts/00-sumfiles/
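For context, the scans under discussion are DejaGnu final-scan directives embedded in the tests. A hypothetical relaxation in the spirit Richard describes, accepting either r2 or r3 as the temporary, might look like this (illustrative only, not the actual committed pattern):

```c
/* Accept whichever of r2/r3 register allocation picked as the temporary.  */
/* { dg-final { scan-assembler {mov\tr1, r[23]} } } */
```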
Re: [to-be-committed][RISC-V][V3] DCE analysis for extension elimination
On 02/07/24 16:40, Jeff Law wrote:
> [ Actually attaching the patch this time... ]
>
> The pre-commit testing showed that making ext-dce only active at -O2 and
> above would require minor edits to the tests. In some cases we had specified
> -O1 in the test or specified no optimization level at all. Those need to be
> bumped to -O2. In one test we had one set of dg-options overriding another.
>
> The other approach that could have been taken would be to drop the -On
> argument, add an explicit -fext-dce and add dg-skip-if options. I originally
> thought that was going to be the way to go, but the dg-skip-if aspect was
> going to get ugly, as things like the interaction between unrolling, peeling
> and -ftracer would have to be accounted for and would likely need
> semi-regular adjustment.
>
> Changes since V2:
> Testsuite changes to deal with the pass only being enabled at -O2 or
> higher.

Hi Jeff,

It seems that this commit has triggered an issue on 32-bit ARM [1]:

FAIL: gcc.dg/plugin/must-tail-call-1.c -fplugin=./must_tail_call_plugin.so (internal compiler error: in df_refs_verify, at df-scan.cc:4013)

Should I open a bug report for this?

[1] https://git-us.linaro.org/toolchain/ci/interesting-commits.git/plain/gcc/sha1/98914f9eba5f19d3eb93fbce8726b5264631cba0/tcwg_bootstrap_check/master-arm-check_bootstrap/details.txt
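The testsuite edits Jeff describes amount to changes of roughly this shape (hypothetical directives shown for illustration; `-fdump-rtl-ext_dce` is assumed here to be the dump flag of the new pass):

```c
/* Before: the pass does not run below -O2, so its dump scan would fail.  */
/* { dg-options "-O1 -fdump-rtl-ext_dce" } */

/* After: bump to -O2, where ext-dce is enabled.  */
/* { dg-options "-O2 -fdump-rtl-ext_dce" } */
```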
Re: New TLS usage in libgcc_s.so.1, compatibility impact
On 15/01/24 09:46, Szabolcs Nagy wrote: > The 01/13/2024 13:49, Florian Weimer wrote: >> This commit >> >> commit 8abddb187b33480d8827f44ec655f45734a1749d >> Author: Andrew Burgess >> Date: Sat Aug 5 14:31:06 2023 +0200 >> >> libgcc: support heap-based trampolines >> >> Add support for heap-based trampolines on x86_64-linux, aarch64-linux, >> and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and >> __builtin_nested_func_ptr_deleted functions for these targets. >> >> Co-Authored-By: Maxim Blinov >> Co-Authored-By: Iain Sandoe >> Co-Authored-By: Francois-Xavier Coudert >> >> added TLS usage to libgcc_s.so.1. The way that libgcc_s is currently >> built, it ends up using a dynamic TLS variant on the Linux targets. >> This means that there is no up-front TLS allocation with glibc (but >> there would be one with musl). >> >> There is still a compatibility impact because glibc assigns a TLS module >> ID upfront. This seems to be what causes the >> ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail. We end >> up with an infinite regress during process termination because >> libgcc_s.so.1 has been loaded, resulting in a DTV update. 
When this >> happens, the bottom of the stack looks like this: >> >> #4447 0x77f288f0 in free () from >> /lib64/liblttng-ust-libc-wrapper.so.1 >> #4448 0x77fdb142 in free (ptr=) >> at ../include/rtld-malloc.h:50 >> #4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822 >> #4450 0x77fdb214 in update_get_addr (ti=0x77f2bfc0, >> gen=) at ../elf/dl-tls.c:916 >> #4451 0x77fddccc in __tls_get_addr () >> at ../sysdeps/x86_64/tls_get_addr.S:55 >> #4452 0x77f288f0 in free () from >> /lib64/liblttng-ust-libc-wrapper.so.1 >> #4453 0x77fdb142 in free (ptr=) >> at ../include/rtld-malloc.h:50 >> #4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822 >> #4455 0x77fdb214 in update_get_addr (ti=0x77f39fa0, >> gen=) at ../elf/dl-tls.c:916 >> #4456 0x77fddccc in __tls_get_addr () >> at ../sysdeps/x86_64/tls_get_addr.S:55 >> #4457 0x77f36113 in lttng_ust_cancelstate_disable_push () >>from /lib64/liblttng-ust-common.so.1 >> #4458 0x77f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1 >> #4459 0x77f5175a in lttng_ust_cleanup () from >> /lib64/liblttng-ust.so.1 >> #4460 0x77fca0f2 in _dl_call_fini ( >> closure_map=closure_map@entry=0x77fbe000) at dl-call_fini.c:43 >> #4461 0x77fce06e in _dl_fini () at dl-fini.c:114 >> #4462 0x77d82fe6 in __run_exit_handlers () from /lib64/libc.so.6 >> >> Cc:ing for awareness. >> >> The issue also requires a recent glibc with changes to DTV management: >> commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls >> access after dlopen [BZ #19924]"). If I understand things correctly, >> before this glibc change, we didn't deallocate the old DTV, so there was >> no call to the free function. > > with 19924 fixed, after a dlopen or dlclose every thread updates > its dtv on the next dynamic tls access. > > before that, dtv was only updated up to the generation of the > module being accessed for a particular tls access. 
> > so hitting the free in the dtv update path is now more likely > but the free is not new, it was there before. > > also note that this is unlikely to happen on aarch64 since > tlsdesc only does dynamic tls access after a 512-byte static > tls reservation runs out. > >> >> On the glibc side, we should recommend that intercepting mallocs and their >> dependencies use initial-exec TLS because that kind of TLS does not use >> malloc. If intercepting mallocs using dynamic TLS works at all, that's >> totally by accident, and was in the past helped by glibc bug 19924. (I > > right. > >> don't think there is anything special about libgcc_s.so.1 that triggers >> the test failure above, it is just an object with dynamic TLS that is >> implicitly loaded via dlopen at the right stage of the test.) In this >> particular case, we can also paper over the test failure in glibc by not >> calling free at all, because the argument is a null pointer: >> >> diff --git a/elf/dl-tls.c b/elf/dl-tls.c >> index 7b3dd9ab60..14c71cbd06 100644 >> --- a/elf/dl-tls.c >> +++ b/elf/dl-tls.c >> @@ -819,7 +819,8 @@ _dl_update_slotinfo (unsigned long int req_modid, size_t >> new_gen) >> dtv entry free it. Note: this is not AS-safe. */ >>/* XXX Ideally we will at some point create a memory >> pool. */ >> - free (dtv[modid].pointer.to_free); >> + if (dtv[modid].pointer.to_free != NULL) >> +  free (dtv[modid].pointer.to_free); >>dtv[modid].pointer.val = TLS_DTV_UNALLOCATED; >>dtv[modid].pointer.to_free = NULL; > > can be done, but != NULL is more likely since we do modid reuse > after dlclose. > > there is
Re: [PATCH] longlong.h: Do not use asm input cast for clang
On 12/12/22 20:52, Segher Boessenkool wrote:
> On Mon, Dec 12, 2022 at 02:10:16PM -0300, Adhemerval Zanella Netto wrote:
>> On 30/11/22 20:24, Segher Boessenkool wrote:
>>> I understand that the casts should be no-ops on the asm side (maybe they
>>> change the sign) and they are present as type-checking. Can we implement
>>> this type-checking in a different (portable) way? I think the macro you use
>>> should be named like __asm_output_check_type (..) or so to indicate the
>>> intended purpose.
>
> I didn't write that. Please quote correctly. Thanks!
>
>> I do not think trying to leverage it on the clang side would yield much; it
>> seems that it really does not want to support this extension. I am not
>> sure we can really make it portable; the best option I can think of would be
>> to add a mix of __builtin_classify_type and typeof prior to the asm call (we
>> do something similar for the powerpc64 syscall code in glibc), although it
>> would still require some gcc-specific builtins.
>>
>> I am open to ideas, since getting this header to be clang-compatible in
>> glibc requires getting it into gcc first.
>
> How do you intend to modify all the existing copies of the header that
> haven't been updated for over a decade already?
>
> If you think changing all user code that uses longlong.h is a good idea,
> please change it to not use inline asm: use builtins in some cases, but
> mostly just rewrite things in plain C. But GCC cannot rewrite user code
> (not preemptively anyway ;-) ) -- and longlong.h as encountered in the
> wild (not the one in our libgcc source code) is user code.
>
> If you think changing the copy in libgcc is a good idea, please change
> the original in glibc first?

That's my original intention [1], but Joseph stated that GCC is the
upstream source of this file.

Joseph, would you be ok with a similar patch for glibc, since gcc is
reluctant to accept it?

[1] https://sourceware.org/pipermail/libc-alpha/2022-October/143050.html
Re: [PATCH] longlong.h: Do not use asm input cast for clang
On 30/11/22 20:24, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Nov 30, 2022 at 03:16:25PM -0300, Adhemerval Zanella via Gcc-patches wrote:
>> clang by default rejects the input casts with:
>>
>> error: invalid use of a cast in a inline asm context requiring an
>> lvalue: remove the cast or build with -fheinous-gnu-extensions
>>
>> And even with -fheinous-gnu-extensions clang still throws a warning
>> and also states that this option might be removed in the future.
>> For gcc the casts are still somewhat useful [1], so just remove them
>> when clang is used.
>
> This is one of the things in inline asm that is tightly tied to GCC
> internals. You should emulate GCC's behaviour faithfully if you want
> to claim you implement the inline asm GNU C extension.

Agreed; that's why I just made it a no-op for clang, which suggests that
clang does not see much use for this extension.

> I understand that the casts should be no-ops on the asm side (maybe they
> change the sign) and they are present as type-checking. Can we implement
> this type-checking in a different (portable) way? I think the macro you use
> should be named like __asm_output_check_type (..) or so to indicate the
> intended purpose.

I do not think trying to leverage it on the clang side would yield much; it
seems that clang really does not want to support this extension. I am not
sure we can really make it portable; the best option I can think of would be
to add a mix of __builtin_classify_type and typeof prior to the asm call (we
do something similar for the powerpc64 syscall code in glibc), although it
would still require some gcc-specific builtins.

I am open to ideas, since getting this header to be clang-compatible in
glibc requires getting it into gcc first.

>
>> --- a/include/ChangeLog
>> +++ b/include/ChangeLog
>
> That should not be part of the patch? Changelog entries should be
> verbatim in the message you send.
>
> The size of this patch already makes clear this is a bad idea, imo.
> This code is already hard enough to read.
Indeed, I forgot that CL entries were not part of the commit.
[PATCH] longlong.h: Do not use asm input cast for clang
clang by default rejects the input casts with: error: invalid use of a cast in a inline asm context requiring an lvalue: remove the cast or build with -fheinous-gnu-extensions And even with -fheinous-gnu-extensions clang still throws an warning and also states that this option might be removed in the future. For gcc the cast are still useful somewhat [1], so just remove it clang is used. [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581722.html --- include/ChangeLog | 60 ++ include/longlong.h | 524 +++-- 2 files changed, 325 insertions(+), 259 deletions(-) diff --git a/include/ChangeLog b/include/ChangeLog index dda005335c0..747fc923ef5 100644 --- a/include/ChangeLog +++ b/include/ChangeLog @@ -1,3 +1,63 @@ +2022-11-30 Adhemerval Zanella + + * include/longlong.h: Modified. + [(__GNUC__) && ! NO_ASM][( (__i386__) || (__i486__)) && W_TYPE_SIZE == 32](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][( (__i386__) || (__i486__)) && W_TYPE_SIZE == 32](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][( (__i386__) || (__i486__)) && W_TYPE_SIZE == 32](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][( (__i386__) || (__i486__)) && W_TYPE_SIZE == 32](udiv_qrnnd): Modified. + [(__GNUC__) && ! NO_ASM][(( (__sparc__) && (__arch64__)) || (__sparcv9)) && W_TYPE_SIZE == 64](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][(( (__sparc__) && (__arch64__)) || (__sparcv9)) && W_TYPE_SIZE == 64](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][(( (__sparc__) && (__arch64__)) || (__sparcv9)) && W_TYPE_SIZE == 64](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__M32R__) && W_TYPE_SIZE == 32](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][(__M32R__) && W_TYPE_SIZE == 32](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][(__arc__) && W_TYPE_SIZE == 32](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][(__arc__) && W_TYPE_SIZE == 32](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][(__arm__) && ( (__thumb2__) || ! 
__thumb__) && W_TYPE_SIZE == 32][(__ARM_ARCH_2__) || (__ARM_ARCH_2A__) || (__ARM_ARCH_3__)](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__arm__) && ( (__thumb2__) || ! __thumb__) && W_TYPE_SIZE == 32](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][(__arm__) && ( (__thumb2__) || ! __thumb__) && W_TYPE_SIZE == 32](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][(__hppa) && W_TYPE_SIZE == 32](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][(__hppa) && W_TYPE_SIZE == 32](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][(__i960__) && W_TYPE_SIZE == 32](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__i960__) && W_TYPE_SIZE == 32](__umulsidi3): Modified. + [(__GNUC__) && ! NO_ASM][(__ibm032__) && W_TYPE_SIZE == 32](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][(__ibm032__) && W_TYPE_SIZE == 32](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][(__ibm032__) && W_TYPE_SIZE == 32](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__ibm032__) && W_TYPE_SIZE == 32](count_leading_zeros): Modified. + [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 32][(__mc88110__)](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 32][(__mc88110__)](udiv_qrnnd): Modified. + [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 32](add_ss): Modified. + [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 32](sub_ddmmss): Modified. + [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 32](count_leading_zeros): Modified. + [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][!((__mcoldfire__))](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][( (__mc68020__) && ! __mc68060__)](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][( (__mc68020__) && ! __mc68060__)](udiv_qrnnd): Modified. + [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][( (__mc68020__) && ! __mc68060__)](sdiv_qrnnd): Modified. + [(__GNUC__) && ! 
NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][(__mcoldfire__)](umul_ppmm): Modified. + [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32](add_ss
Re: [PATCH 3/3] elf: Add _dl_find_eh_frame function
On 17/11/2021 10:40, Florian Weimer wrote: > * Adhemerval Zanella via Libc-alpha: > >> However the code is somewhat complex and I would like to have some feedback >> if gcc will be willing to accept this change (I assume it would require >> this code merge on glibc beforehand). > > There's a long review queue on the GCC side due to the stage1 close. > It may still be considered for GCC 12. Jakub has also requested that > we hold off committing the glibc side until the GCC side is reviewed. > > I'll flesh out the commit message and NEWS entry once we have agreed > upon the interface. > >>> new file mode 100644 >>> index 00..c7313c122d >>> --- /dev/null >>> +++ b/elf/dl-find_eh_frame.c > >>> +/* Data for the main executable. There is usually a large gap between >>> + the main executable and initially loaded shared objects. Record >>> + the main executable separately, to increase the chance that the >>> + range for the non-closeable mappings below covers only the shared >>> + objects (and not also the gap between main executable and shared >>> + objects). */ >>> +static uintptr_t _dl_eh_main_map_start attribute_relro; >>> +static struct dl_eh_frame_info _dl_eh_main_info attribute_relro; >>> + >>> +/* Data for initally loaded shared objects that cannot be unlaoded. >> >> s/initally/initially and s/unlaoded/unloaded. > > Fixed. > >> >>> + The mapping base addresses are stored in address order in the >>> + _dl_eh_nodelete_mappings_bases array (containing >>> + _dl_eh_nodelete_mappings_size elements). The EH data for a base >>> + address is stored in the parallel _dl_eh_nodelete_mappings_infos. >>> + These arrays are not modified after initialization. 
*/ >>> +static uintptr_t _dl_eh_nodelete_mappings_end attribute_relro; >>> +static size_t _dl_eh_nodelete_mappings_size attribute_relro; >>> +static uintptr_t *_dl_eh_nodelete_mappings_bases attribute_relro; >>> +static struct dl_eh_frame_info *_dl_eh_nodelete_mappings_infos >>> + attribute_relro; >>> + >>> +/* Mappings created by dlopen can go away with dlclose, so a data >>> + dynamic data structure with some synchronization is needed. >> >> This sounds strange ("a data dynamic data"). > > I dropped the first data. > >> >>> + Individual segments are similar to the _dl_eh_nodelete_mappings >> >> Maybe use _dl_eh_nodelete_mappings_*, because '_dl_eh_nodelete_mappings' >> itself is not defined anywhere. > > Right. > >>> + Adding new elements to this data structure is another source of >>> + quadratic behavior for dlopen. If the other causes of quadratic >>> + behavior are eliminated, a more complicated data structure will be >>> + needed. */ >> >> This worries me, especially since we have reports that python and other dynamic >> environments use a lot of plugins and generate a lot of dlopen() calls. >> What kind of performance implications do you foresee here? > > The additional overhead is not disproportionate to the other sources of > quadratic behavior. With 1,000 dlopen'ed objects, overall run-time > seems to be comparable to the strcmp time required for soname matching, for > example, and is quite difficult to measure. So we could fix the > performance regression if we used a hash table for that … > > It's just an undesirable complexity class. The implementation is not > actually slow because it's a mostly-linear copy (although a backwards > one). Other parts of dlopen involve pointer chasing and are much > slower. Right, I agree this probably won't incur performance issues; I was just curious whether you had any numbers on it.
> > Oops, it's now: > > /* Allocate an empty segment that is at least SIZE large. PREVIOUS >points to the chain of previously allocated segments and can be >NULL. */ > >>> +/* Update the version to reflect that an update is happening. This >>> + does not change the bit that controls the active segment chain. >>> + Returns the index of the currently active segment chain. */ >>> +static inline unsigned int >>> +_dl_eh_mappings_begin_update (void) >>> +{ >>> + unsigned int v >>> += __atomic_wide_counter_fetch_add_relaxed >>> (&_dl_eh_loaded_mappings_version
Re: [PATCH 3/3] elf: Add _dl_find_eh_frame function
On 03/11/2021 13:28, Florian Weimer via Gcc-patches wrote:
> This function is similar to __gnu_Unwind_Find_exidx as used on arm.
> It can be used to speed up the libgcc unwinder.

Besides the terse patch description, the design seems ok to accomplish the lock-free read and update. There are some questions and remarks below, and I still need to revise the tests.

However, the code is somewhat complex, and I would like to have some feedback on whether gcc would be willing to accept this change (I assume it would require this code to be merged into glibc beforehand).

> ---
> NEWS | 4 +
> bits/dlfcn_eh_frame.h | 33 +
> dlfcn/Makefile | 2 +-
> dlfcn/dlfcn.h | 2 +
> elf/Makefile | 31 +-
> elf/Versions | 3 +
> elf/dl-close.c | 4 +
> elf/dl-find_eh_frame.c | 864 ++
> elf/dl-find_eh_frame.h | 90 ++
> elf/dl-find_eh_frame_slow.h | 55 ++
> elf/dl-libc_freeres.c | 2 +
> elf/dl-open.c | 5 +
> elf/rtld.c | 7 +
> elf/tst-dl_find_eh_frame-mod1.c | 10 +
> elf/tst-dl_find_eh_frame-mod2.c | 10 +
> elf/tst-dl_find_eh_frame-mod3.c | 10 +
> elf/tst-dl_find_eh_frame-mod4.c | 10 +
> elf/tst-dl_find_eh_frame-mod5.c | 11 +
> elf/tst-dl_find_eh_frame-mod6.c | 11 +
> elf/tst-dl_find_eh_frame-mod7.c | 10 +
> elf/tst-dl_find_eh_frame-mod8.c | 10 +
> elf/tst-dl_find_eh_frame-mod9.c | 10 +
> elf/tst-dl_find_eh_frame-threads.c | 237 +
> elf/tst-dl_find_eh_frame.c | 179 
> include/atomic_wide_counter.h | 14 +
> include/bits/dlfcn_eh_frame.h | 1 +
> include/link.h | 3 +
> manual/Makefile | 2 +-
> manual/dynlink.texi | 69 ++
> manual/libdl.texi | 10 -
> manual/probes.texi | 2 +-
> manual/threads.texi | 2 +-
> sysdeps/i386/bits/dlfcn_eh_frame.h | 34 +
> sysdeps/mach/hurd/i386/ld.abilist | 1 +
> sysdeps/nios2/bits/dlfcn_eh_frame.h | 34 +
> sysdeps/unix/sysv/linux/aarch64/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/alpha/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/arc/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/arm/be/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/arm/le/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/csky/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/hppa/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/i386/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/ia64/ld.abilist | 1 +
> .../unix/sysv/linux/m68k/coldfire/ld.abilist | 1 +
> .../unix/sysv/linux/m68k/m680x0/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/microblaze/ld.abilist | 1 +
> .../unix/sysv/linux/mips/mips32/ld.abilist | 1 +
> .../sysv/linux/mips/mips64/n32/ld.abilist | 1 +
> .../sysv/linux/mips/mips64/n64/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/nios2/ld.abilist | 1 +
> .../sysv/linux/powerpc/powerpc32/ld.abilist | 1 +
> .../linux/powerpc/powerpc64/be/ld.abilist | 1 +
> .../linux/powerpc/powerpc64/le/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/riscv/rv32/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/riscv/rv64/ld.abilist | 1 +
> .../unix/sysv/linux/s390/s390-32/ld.abilist | 1 +
> .../unix/sysv/linux/s390/s390-64/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/sh/be/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/sh/le/ld.abilist | 1 +
> .../unix/sysv/linux/sparc/sparc32/ld.abilist | 1 +
> .../unix/sysv/linux/sparc/sparc64/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/x86_64/64/ld.abilist | 1 +
> sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist | 1 +
> 64 files changed, 1795 insertions(+), 16 deletions(-)
> create mode 100644 bits/dlfcn_eh_frame.h
> create mode 100644 elf/dl-find_eh_frame.c
> create mode 100644 elf/dl-find_eh_frame.h
> create mode 100644 elf/dl-find_eh_frame_slow.h
> create mode 100644 elf/tst-dl_find_eh_frame-mod1.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod2.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod3.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod4.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod5.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod6.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod7.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod8.c
> create mode 100644 elf/tst-dl_find_eh_frame-mod9.c
> create mode 100644
Re: [PATCH 2/3] elf: Introduce GLRO (dl_libc_freeres), called from __libc_freeres
Maybe add a comment explaining why this will be used. Reviewed-by: Adhemerval Zanella On 03/11/2021 13:27, Florian Weimer via Gcc-patches wrote: > --- > elf/Makefile | 2 +- > elf/dl-libc_freeres.c | 24 > elf/rtld.c | 1 + > malloc/set-freeres.c | 5 + > sysdeps/generic/ldsodefs.h | 7 +++ > 5 files changed, 38 insertions(+), 1 deletion(-) > create mode 100644 elf/dl-libc_freeres.c > > diff --git a/elf/Makefile b/elf/Makefile > index cb9bcfb799..1c768bdf47 100644 > --- a/elf/Makefile > +++ b/elf/Makefile > @@ -68,7 +68,7 @@ elide-routines.os = $(all-dl-routines) dl-support > enbl-secure dl-origin \ > rtld-routines= rtld $(all-dl-routines) dl-sysdep dl-environ > dl-minimal \ >dl-error-minimal dl-conflict dl-hwcaps dl-hwcaps_split dl-hwcaps-subdirs \ >dl-usage dl-diagnostics dl-diagnostics-kernel dl-diagnostics-cpu \ > - dl-mutex > + dl-mutex dl-libc_freeres > all-rtld-routines = $(rtld-routines) $(sysdep-rtld-routines) > > CFLAGS-dl-runtime.c += -fexceptions -fasynchronous-unwind-tables Ok. > diff --git a/elf/dl-libc_freeres.c b/elf/dl-libc_freeres.c > new file mode 100644 > index 00..68f305a6f9 > --- /dev/null > +++ b/elf/dl-libc_freeres.c > @@ -0,0 +1,24 @@ > +/* Deallocating malloc'ed memory from the dynamic loader. > + Copyright (C) 2021 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. 
> + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#include > + > +void > +__rtld_libc_freeres (void) > +{ > +} Ok. > diff --git a/elf/rtld.c b/elf/rtld.c > index be2d5d8e74..847141e21d 100644 > --- a/elf/rtld.c > +++ b/elf/rtld.c > @@ -378,6 +378,7 @@ struct rtld_global_ro _rtld_global_ro attribute_relro = > ._dl_catch_error = _rtld_catch_error, > ._dl_error_free = _dl_error_free, > ._dl_tls_get_addr_soft = _dl_tls_get_addr_soft, > +._dl_libc_freeres = __rtld_libc_freeres, > #ifdef HAVE_DL_DISCOVER_OSVERSION > ._dl_discover_osversion = _dl_discover_osversion > #endif Ok. > diff --git a/malloc/set-freeres.c b/malloc/set-freeres.c > index 5c19a2725c..856ff7831f 100644 > --- a/malloc/set-freeres.c > +++ b/malloc/set-freeres.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > > #include "../nss/nsswitch.h" > #include "../libio/libioP.h" > @@ -67,6 +68,10 @@ __libc_freeres (void) > >call_function_static_weak (__libc_dlerror_result_free); > > +#ifdef SHARED > + GLRO (dl_libc_freeres) (); > +#endif > + >for (p = symbol_set_first_element (__libc_freeres_ptrs); > !symbol_set_end_p (__libc_freeres_ptrs, p); ++p) > free (*p); OK. > diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h > index 1318c36dce..c26860430c 100644 > --- a/sysdeps/generic/ldsodefs.h > +++ b/sysdeps/generic/ldsodefs.h > @@ -712,6 +712,10 @@ struct rtld_global_ro > namespace. */ >void (*_dl_error_free) (void *); >void *(*_dl_tls_get_addr_soft) (struct link_map *); > + > + /* Called from __libc_shared to deallocate malloc'ed memory. */ > + void (*_dl_libc_freeres) (void); > + > #ifdef HAVE_DL_DISCOVER_OSVERSION >int (*_dl_discover_osversion) (void); > #endif > @@ -1416,6 +1420,9 @@ __rtld_mutex_init (void) > } > #endif /* !PTHREAD_IN_LIBC */ > > +/* Implementation of GL (dl_libc_freeres). 
*/ > +void __rtld_libc_freeres (void) attribute_hidden; > + > void __thread_gscope_wait (void) attribute_hidden; > # define THREAD_GSCOPE_WAIT() __thread_gscope_wait () > > OK.
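The hook wiring in this patch — __libc_freeres dispatching into the dynamic loader through a function pointer stored in a read-only table — is a common pattern and can be sketched generically. Everything below uses made-up fake_* names purely for illustration; it is not glibc's actual code.

```c
#include <stddef.h>

/* Counter so the behavior can be observed from outside.  */
static int freed_count;

/* Plays the role of __rtld_libc_freeres: the loader-side routine that
   would release ld.so's malloc'ed memory.  */
static void
fake_rtld_freeres (void)
{
  freed_count++;
}

/* Plays the role of struct rtld_global_ro: a table of function
   pointers the loader fills in at startup and libc reads later.  */
struct fake_rtld_global_ro
{
  void (*dl_libc_freeres) (void);   /* mirrors _dl_libc_freeres */
};

/* Initialized once, as rtld.c does with ._dl_libc_freeres = ...  */
static const struct fake_rtld_global_ro fake_glro = { fake_rtld_freeres };

/* Plays the role of __libc_freeres: in a shared build libc cannot call
   the loader's routine directly, so it goes through the table.  */
void
fake_libc_freeres (void)
{
  if (fake_glro.dl_libc_freeres != NULL)
    fake_glro.dl_libc_freeres ();
}
```

The indirection matters because in a shared build libc.so cannot reference ld.so's hidden symbols directly; the pointer in the relro table is the sanctioned entry point.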
Re: [PATCH 1/3] nptl: Extract from pthread_cond_common.c
On 03/11/2021 13:27, Florian Weimer via Libc-alpha wrote: > And make it an installed header. This addresses a few aliasing > violations (which do not seem to result in miscompilation due to > the use of atomics), and also enables use of wide counters in other > parts of the library. > > The debug output in nptl/tst-cond22 has been adjusted to print > the 32-bit values instead because it avoids a big-endian/little-endian > difference. LGTM, thanks. Reviewed-by: Adhemerval Zanella > --- > bits/atomic_wide_counter.h | 35 > include/atomic_wide_counter.h | 89 +++ > include/bits/atomic_wide_counter.h | 1 + > misc/Makefile | 3 +- > misc/atomic_wide_counter.c | 127 +++ > nptl/Makefile | 13 +- > nptl/pthread_cond_common.c | 204 > nptl/tst-cond22.c | 14 +- > sysdeps/nptl/bits/thread-shared-types.h | 22 +-- > 9 files changed, 310 insertions(+), 198 deletions(-) > create mode 100644 bits/atomic_wide_counter.h > create mode 100644 include/atomic_wide_counter.h > create mode 100644 include/bits/atomic_wide_counter.h > create mode 100644 misc/atomic_wide_counter.c > > diff --git a/bits/atomic_wide_counter.h b/bits/atomic_wide_counter.h > new file mode 100644 > index 00..0687eb554e > --- /dev/null > +++ b/bits/atomic_wide_counter.h > @@ -0,0 +1,35 @@ > +/* Monotonically increasing wide counters (at least 62 bits). > + Copyright (C) 2016-2021 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. 
> + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _BITS_ATOMIC_WIDE_COUNTER_H > +#define _BITS_ATOMIC_WIDE_COUNTER_H > + > +/* Counter that is monotonically increasing (by less than 2**31 per > + increment), with a single writer, and an arbitrary number of > + readers. */ > +typedef union > +{ > + __extension__ unsigned long long int __value64; > + struct > + { > +unsigned int __low; > +unsigned int __high; > + } __value32; > +} __atomic_wide_counter; > + > +#endif /* _BITS_ATOMIC_WIDE_COUNTER_H */ Ok, it would be included in multiple places so we can't tie to a specific header. > diff --git a/include/atomic_wide_counter.h b/include/atomic_wide_counter.h > new file mode 100644 > index 00..31f009d5e6 > --- /dev/null > +++ b/include/atomic_wide_counter.h > @@ -0,0 +1,89 @@ > +/* Monotonically increasing wide counters (at least 62 bits). > + Copyright (C) 2016-2021 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. 
*/ > + > +#ifndef _ATOMIC_WIDE_COUNTER_H > +#define _ATOMIC_WIDE_COUNTER_H > + > +#include > +#include > + > +#if __HAVE_64B_ATOMICS > + > +static inline uint64_t > +__atomic_wide_counter_load_relaxed (__atomic_wide_counter *c) > +{ > + return atomic_load_relaxed (&c->__value64); > +} > + > +static inline uint64_t > +__atomic_wide_counter_fetch_add_relaxed (__atomic_wide_counter *c, > + unsigned int val) > +{ > + return atomic_fetch_add_relaxed (&c->__value64, val); > +} > + > +static inline uint64_t > +__atomic_wide_counter_fetch_a
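The 32-bit fallback that this hunk leads into (misc/atomic_wide_counter.c) is the interesting part. glibc's real algorithm is more subtle, but the underlying problem — reading a 64-bit monotonic counter stored as two 32-bit halves with a single writer — can be sketched with a simple retry loop over the high word. All names below are illustrative, not glibc's:

```c
#include <stdatomic.h>
#include <stdint.h>

/* One writer maintains a 64-bit monotonic counter as two 32-bit
   halves; readers must never observe a torn value.  */
typedef struct
{
  _Atomic uint32_t low;
  _Atomic uint32_t high;
} wide_counter;

/* Reader: retry if the high word changed while we sampled low.  */
uint64_t
wide_counter_load (wide_counter *c)
{
  for (;;)
    {
      uint32_t h1 = atomic_load_explicit (&c->high, memory_order_acquire);
      uint32_t l = atomic_load_explicit (&c->low, memory_order_acquire);
      uint32_t h2 = atomic_load_explicit (&c->high, memory_order_acquire);
      if (h1 == h2)
        return ((uint64_t) h1 << 32) | l;
    }
}

/* Single writer, increments below 2**31 (as the union's comment
   requires).  Readers may briefly see the value before the carry
   propagates; glibc's actual scheme closes that window.  */
void
wide_counter_add (wide_counter *c, uint32_t val)
{
  uint32_t old = atomic_load_explicit (&c->low, memory_order_relaxed);
  atomic_store_explicit (&c->low, old + val, memory_order_release);
  if (old + val < old)   /* low word wrapped around */
    atomic_fetch_add_explicit (&c->high, 1, memory_order_release);
}
```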
Re: GCC association with the FSF
On 12/04/2021 14:52, Alexandre Oliva wrote: > On Apr 12, 2021, Adhemerval Zanella wrote: > >> No, you are insinuating that the glibc community both as maintainer >> and contributors acted in a hateful way regarding the 'joke' >> removal. Sorry, but this is not true; > > Easy to say for someone who hasn't been the target of hate, but it's > just that it was there right then, it's *remains* there. Not exclusive > among glibc maintainers, and certainly not unanimous among them, but > there. I may even have earned it myself. But the one that Richard got > over incorrect assumptions that he commanded the reversal, that's just > another false piece of evidence often used to support the hate campaign. There was no "hate" campaign from glibc developers and maintainers; repeating the claim does not make it true. Since libc-alpha is a non-moderated list, there were a lot of unfriendly messages from undisclosed or non-representative people. What happened is that some glibc developers were *really* annoyed by the way *you* acted, not RMS; and they vocalized it. And you, instead of working toward consensus by making some concessions (as we currently try to run the glibc community), kept arguing to exhaustion that you acted for the benefit of the project. So the aforementioned 'hate' is just that we did not agree with the way *you* acted, which caused a lot of distress. > >> The main idea, which I was vocal about and shared with some glibc >> developers and maintainers, was that the "joke" has no place in a >> technical manual. > > I understand there is consensus about that now, but back then there were > too many unsettled policy issues to make that call consensually among > all relevant parties. > > The main disagreement was not over the issue proper, though. It was > about procedure, and then it was about whose opinions as much as > counted. No, the disagreement is about the way *you* did it. 
I have not seen contention and disarray like what you started since I began working on the project, about a decade ago. So please stop putting the blame for that episode on the glibc community as a whole. > > > It was a really trivial issue, but sufficiently hot-button and > triggering enough underlying issues that it got to be exploited > politically in several ugly ways. > > It can't really be understood without looking into broader contexts that > had long been mounting, and that again quite explicit in this list too. > > > But I hope we can all agree that it was a horrible mess. >
Re: GCC association with the FSF
On Sun, Apr 11, 2021 at 10:43 PM Alexandre Oliva wrote: > > On Apr 11, 2021, Adhemerval Zanella wrote: > > > All the other active maintainers suggested you shouldn't have done that, > > but you > > ignored it anyway. > > How could I possibly have ignored something that hadn't happened yet? > > > *we* glibc maintainers were fully aware that it was *you* that decided > > to act in that way > > There have been plenty of insinuations that contradict that assumption > and attempted to somehow blame it on RMS, but whether the record has > been set straight on this point now, or if it was straight already, the > point stands. No, you are insinuating that the glibc community, both maintainers and contributors, acted in a hateful way regarding the 'joke' removal. Sorry, but this is not true; there were messages that might be characterized as such, but they did not come from any of the main glibc developers or maintainers. > > As recently as a couple of weeks ago someone referred, in this list, to > RMS's voicing his objection to the removal of one of the many pieces he > wrote for the glibc manual, and then setting out to propose and discuss > policies that incided on the matter, as if those were horrible actions. > > That was almost as abhorrent as his asking a GNU developer a question > that he could have answered by just downloading the subproject's source > code and looking for the answer himself! Oh, the horror! > > > If that's not hatred, I don't really wish to know what is :-/ The main idea, which I was vocal about and shared with some glibc developers and maintainers, was that the "joke" has no place in a technical manual. You might disagree ideologically and politically with this assessment, but it is not "hatred", and this very rhetoric of trying to characterize it as such is what made me see that discussion as frustrating and disappointing.
Re: GCC association with the FSF
On Sun, Apr 11, 2021 at 8:06 PM Alexandre Oliva wrote: > > On Apr 11, 2021, Adhemerval Zanella wrote: > > > It was clear to me and others glibc maintainers that it was *you* who > > bypass the consensus to *not* reinstate the “joke”. > > I think you wrote it backwards: what I did was to revert the commit that > the person who put it in agreed shouldn't have been made at that point, > so that the debate about whether or not to install the patch could be > carried out without the fait accompli. To my surprise, it stopped. > > Then, a year or so later, when most of the GNU policies that incided on > that matter had already been discussed and approved, and they suggested > (at least to me) that the conclusion was likely that the patch was in > line with them, some other situation came up that reminded people of the > patch, it was discussed under the heat of the unrelated situation (which > I also found inappropriate), but it got applied AFAICT in accordance > with GNU and GLIBC policies. RMS briefly stated that he did not want the change to be applied; we considered his input back then, but we decided to remove the joke *regardless* of what he thought about the subject. And you used this to claim the change had no consensus and to reinstate it, in a way we had not operated in the project for a couple of years, which caused a lot of disarray. The problem was not that you did it, but how you did it. You then spent a lot of days trying to convince other glibc maintainers about your actions, to the point that Torvald and Siddhesh were fed up with your rhetoric. > > > maintainers said explicitly you shouldn’t do it. > > I do not see nor recall any such responses or reactions to my offer to > revert the patch in case the installer wouldn't do it, except the > installer saying they wouldn't do the reversal. Eventually I did it. > After the fact, some said I shouldn't have done it. > > > That's my recollection of the events. 
All the other active maintainers suggested you shouldn't have done that, but you ignored it anyway. And we did not want to start a potential back-and-forth of patch application and reversion over that petty discussion. But this is done, and I don't want to dig into it further. My point is that *we* glibc maintainers were fully aware that it was *you* who decided to act in that way, and my feeling is that the dominant response was not *hate*, but rather a lot of frustration and disappointment at how you acted.
Re: GCC association with the FSF
> On 11 Apr 2021, at 17:45, Alexandre Oliva via Gcc > wrote: > > Remember how much hate RMS got in glibc land for something I did? I > said I did it out of my own volition, I explained my why I did it, but > people wouldn't believe he had nothing to do with it! It was clear to me and other glibc maintainers that it was *you* who bypassed the consensus to *not* reinstate the “joke”. And there was no hate (at least not from my side), only *disappointment* that you used your status to do it even though most of the senior developers and maintainers said explicitly that you shouldn’t.
Re: unnormal Intel 80-bit long doubles and isnanl
On 24/11/2020 10:59, Siddhesh Poyarekar wrote: > On 11/24/20 7:11 PM, Szabolcs Nagy wrote: >> ideally fpclassify (and other classification macros) would >> handle all representations. >> >> architecturally invalid or trap representations can be a >> non-standard class but i think classifying them as FP_NAN >> would break the least amount of code. > > That's my impression too. > >>> glibc evaluates the bit pattern of the 80-bit long double and in the >>> process, ignores the integer bit, i.e. bit 63. As a result, it considers >>> the unnormal number as a valid long double and isnanl returns 0. >> >> i think m68k and x86 are different here. >> >>> >>> gcc on the other hand, simply uses the number in a floating point comparison >>> and uses the parity flag (which indicates an unordered compare, signalling a >>> NaN) to decide if the number is a NaN. The unnormal numbers behave like >>> NaNs in this respect, in that they set the parity flag and with >>> -fsignalling-nans, would result in an invalid-operation exception. As a >>> result, __builtin_isnanl returns 1 for an unnormal number. >> >> compiling isnanl to a quiet fp compare is wrong with >> -fsignalling-nans: classification is not supposed to >> signal exceptions for snan. > > I agree, but I think that issue with __builtin_isnanl is orthogonal to the > question about unnormals. Once that is fixed in gcc, we could actually use > __builtin_isnanl all the time in glibc for isnanl. > > Siddhesh What is the current take from gcc developers on this semantic change of __builtin_isnanl? Do they consider the current behavior of not classifying the 'unnormal' as NaN the expected one and expect glibc to follow it, or are they willing to align with glibc's behavior?
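For reference, the x87 80-bit format makes the leading integer bit explicit, which is what creates the 'unnormal' class in the first place: a non-zero, non-maximal biased exponent combined with a clear bit 63 is architecturally invalid. Below is a hedged sketch of a classifier that, unlike the isnanl behavior described above, does inspect bit 63. The encoding rules follow the Intel SDM; the enum and function names are invented for illustration, and this is neither glibc nor GCC code:

```c
#include <stdint.h>

/* Classes of an x87 80-bit extended value, decoded from its raw
   16-bit sign/exponent word and 64-bit significand.  */
enum x87_class
{
  X87_ZERO,
  X87_SUBNORMAL,   /* includes pseudo-denormals */
  X87_NORMAL,
  X87_INF,
  X87_NAN,
  X87_UNNORMAL,    /* nonzero exponent, integer bit clear: invalid */
  X87_PSEUDO       /* pseudo-NaN/pseudo-infinity: invalid */
};

enum x87_class
classify80 (uint16_t sign_exp, uint64_t significand)
{
  unsigned int exp = sign_exp & 0x7fff;
  int ibit = (significand >> 63) & 1;          /* the explicit bit 63 */
  uint64_t frac = significand & 0x7fffffffffffffffULL;

  if (exp == 0x7fff)                           /* inf/NaN exponent */
    {
      if (!ibit)
        return X87_PSEUDO;
      return frac == 0 ? X87_INF : X87_NAN;
    }
  if (exp == 0)
    return significand == 0 ? X87_ZERO : X87_SUBNORMAL;
  return ibit ? X87_NORMAL : X87_UNNORMAL;     /* the bit-63 check */
}
```

A classifier that ignored bit 63 would lump X87_UNNORMAL in with valid numbers, while the hardware compare treats such values as unordered — exactly the divergence described above.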
Re: [libgomp] Ask for help on an improvement for synchronization overhead
On 30/04/2020 18:12, Jakub Jelinek wrote: > On Thu, Apr 30, 2020 at 05:37:26PM -0300, Adhemerval Zanella via Gcc wrote: >> Hi all, I would like to check if someone could help me figure out >> an issue I am chasing in a libgomp patch intended to partially >> address the issue described in BZ#79784. >> >> I have identified that one of the bottlenecks is the global barrier >> used by both the thread pool and the team, which causes a lot of cache ping-pong >> on high-core-count machines. And it does not seem to be an aarch64-specific >> issue, as hinted by the bugzilla. > > This has been a topic of GSoC last year, but the student didn't deliver it > in usable form and disappeared. > See e.g. thread with "Work-stealing task scheduling" in subject from > last year on gcc-patches and other mails on the topic. In my understanding, what I am working on is not exactly related to OMP tasking, although I see that the global barrier is still an issue for omp task scheduling. What I am trying to optimize in this specific case is the barrier used in gomp_thread_pool for constructs like parallel for; maybe a per-thread barrier could also be extended to other libgomp places. > > So if you'd have time and motivation to do it properly, it would be greatly > appreciated. >
[libgomp] Ask for help on an improvement for synchronization overhead
Hi all, I would like to check if someone could help me figure out an issue I am chasing in a libgomp patch intended to partially address the issue described in BZ#79784 [1]. I have identified that one of the bottlenecks is the global barrier used by both the thread pool and the team, which causes a lot of cache ping-pong on high-core-count machines. And it does not seem to be an aarch64-specific issue, as hinted by the bugzilla. So the optimization I am implementing, similar to what the LLVM OpenMP implementation does, is to use a per-OMP-thread barrier to synchronize team/task creation. The activation scheme I have implemented so far is a simple linear one, where the master scans linearly over the children threads (LLVM OpenMP implements some fancier ones that I plan to take a look at as well). The patch I came up with so far is quite simple [2] and still requires some polish (documentation, code styling, etc.); however, there is one regression that has me scratching my head: cancel-parallel-2. It exercises OpenMP cancellation in an 'omp parallel' construct, and the issue is that I am failing to understand why the final team barrier (done in gomp_team_end, called by GOMP_parallel_end) is not synchronizing correctly with the team barrier in each OpenMP task. So any help on the design is appreciated (even if it means I should rethink it for libgomp). [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784 [2] https://github.com/zatrazz/gcc/tree/azanella/libgomp-scalability
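The "simple linear activation" described above — the master scanning per-thread state instead of all threads contending on a single shared barrier word — can be sketched roughly as follows. This is only an illustration of the idea under stated assumptions (spin-waiting, fixed thread count, made-up names), not the actual patch in [2]:

```c
#include <stdatomic.h>

#define MAX_THREADS 8

/* Each team member gets its own flag in its own cache line, so
   arrival stores do not ping-pong a shared line between cores.  */
struct per_thread_flag
{
  _Atomic int arrived;
  char pad[64 - sizeof (_Atomic int)];   /* avoid false sharing */
};

static struct per_thread_flag flags[MAX_THREADS];

/* Worker side: signal arrival on our private flag.  */
void
worker_arrive (int tid)
{
  atomic_store_explicit (&flags[tid].arrived, 1, memory_order_release);
}

/* Master side: the linear scan.  Returns once every worker in
   [1, nthreads) has arrived, then resets the flags for reuse.
   A real implementation would eventually futex-wait instead of
   spinning, and would also need a matching release phase.  */
void
master_gather (int nthreads)
{
  for (int t = 1; t < nthreads; t++)
    while (!atomic_load_explicit (&flags[t].arrived, memory_order_acquire))
      ; /* spin */
  for (int t = 1; t < nthreads; t++)
    atomic_store_explicit (&flags[t].arrived, 0, memory_order_relaxed);
}
```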
Re: [EXTERNAL] Re: How to test aarch64 when building a cross-compiler?
On 25/11/2019 17:28, Andrew Dean via gcc wrote: >>> This completes successfully. However, when I then try to run the gcc tests >>> like so: >>> runtest --outdir . --tool gcc --srcdir /path/to/gcc/gcc/testsuite >>> aarch64.exp --target aarch64-linux-gnu --target_board aarch64-sim >>> --tool_exec >>> /path_to/build_dir/install/compilers/aarch64-linux-gnu/bin/aarch64-glibc-linux-gnu-gcc --verbose -v >>> >>> I get errors like this: >>> >>> aarch64-glibc-linux-gnu-gcc: fatal error: cannot read spec file >>> 'rdimon.specs': No such file or directory >>> >>> I can see that the rdimon.specs flag is added based on this line in >>> aarch64-sim.exp: >> >> Where does aarch64-sim.exp come from? > > /usr/share/dejagnu/baseboards/aarch64-sim.exp > >> >>> >>> set_board_info ldflags "[libgloss_link_flags] [newlib_link_flags] -specs=rdimon.specs" >>> >> I think this is for baremetal/newlib targets, ie. aarch64-elf, not for >> aarch64-linux-gnu. > > Hmm.. build-many-glibcs.py doesn't like either aarch64-elf or > aarch64-linux-elf... > I get a KeyError in build_compilers and build_glibcs when it tries to look up > the config with either of those values. > Unfortunately build-many-glibcs.py does not support baremetal builds yet (since it is a tool created to build cross-compiling toolchains using glibc).
Re: [PATCH v6] aarch64: Add split-stack support
Ping (with Szabolcs remarks fixed). On 07/02/2018 16:07, Adhemerval Zanella wrote: > Changes from previous version: > > - Changed the way __morestack is called to use a branch with link > instead of a simple branch. This allows using a call instruction and > avoids possible issues with later optimization passes which might > see a branch outside the instruction block (as noticed in previous > iterations while building a more complex workload such as speccpu2006). > > - Changed the return address to use the branch-with-link value and > set x12 to save x30. This simplifies the instructions required > to set up/save the return address. > > -- > > This patch adds split-stack support on aarch64 (PR #67877). As for > other ports, this patch should be used along with glibc and gold support. > > The support is done similarly to other architectures: a split-stack field > is allocated before the TCB by glibc, a target-specific __morestack implementation > and helper functions are added in libgcc, and compiler support is adjusted > (split-stack prologue, va_start for argument handling). I also plan to > send the gold support to adjust stack allocation across split-stack > and default code calls. > > The current approach is to set the final stack adjustment using 2 instructions > at most (mov/movk), which limits stack allocation to an upper limit of 4GB. > The morestack call is non-standard, with x10 holding the requested stack > pointer, x11 the argument pointer (if required), and x12 the return > continuation address. Unwinding is handled by a personality routine that > knows how to find stack segments. > > The split-stack prologue on function entry is as follows (this goes before the > usual function prologue): > > function: > mrs x9, tpidr_el0 > ldur x9, [x9, -8] > mov x10, > movk x10, #0x0, lsl #16 > sub x10, sp, x10 > mov x11, sp # if function has stacked arguments > cmp x9, x10 > bcc .LX > main_fn_entry: > [function prologue] > .LX: > bl __morestack > b main_fn_entry > > Notes: > > 1. 
Even if a function does not allocate a stack frame, a split-stack prologue > is created. This is to avoid issues with tail calls to external symbols > which might require linker adjustment (libgo/runtime/go-varargs.c). > > 2. Basic-block reordering (enabled with -O2) will move the split-stack TCB ldur > to after the required stack calculation. > > 3. Similar to powerpc, when the linker detects a call from split-stack to > non-split-stack code, it adds 16k (or more) to the value found in > "allocate" > instructions (so non-split-stack code gets a larger stack). The amount is > tunable by a linker option. This feature is only implemented in the GNU > gold linker. > > 4. AArch64 does not initially handle stacks larger than 4G, and although it is possible > to implement that, limiting to 4G allows materializing the allocation with > only 2 instructions (mov + movk), thus simplifying the linker > adjustments required. Supporting multiple threads each requiring more > than 4G of stack is probably not that important, and likely to OOM at > run time. > > 5. The TCB support on GLIBC is meant to be included in version 2.28. > > 6. Besides regression tests, I also checked with a SPECcpu2006 run with > the additional -fsplit-stack option. I saw no regression besides 416.gamess, > which fails on trunk as well (not sure if it is some misconfiguration in my > environment). > > libgcc/ChangeLog: > > * libgcc/config.host: Use t-stack and t-stack-aarch64 for > aarch64*-*-linux. > * libgcc/config/aarch64/morestack-c.c: New file. > * libgcc/config/aarch64/morestack.S: Likewise. > * libgcc/config/aarch64/t-stack-aarch64: Likewise. > * libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific > code. > > gcc/ChangeLog: > > * common/config/aarch64/aarch64-common.c > (aarch64_supports_split_stack): New function. > (TARGET_SUPPORTS_SPLIT_STACK): New macro. > * gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove > macro. 
> * gcc/config/aarch64/aarch64-protos.h: Add > aarch64_expand_split_stack_prologue and > aarch64_split_stack_space_check. > * gcc/config/aarch64/aarch64.c (aarch64_expand_builtin_va_start): Use > internal argument pointer instead of virtual_incoming_args_rtx. > (morestack_ref): New symbol. > (aarch64_load_split_stack_value): New function. > (aarch64_expand_split_stack_prologue): Likewise. > (aarch64_inte
Re: [PATCH] Add sanitizer support for AArch64 ILP32
On 23/02/2018 11:39, Jakub Jelinek wrote: > On Fri, Feb 23, 2018 at 11:27:19AM -0300, Adhemerval Zanella wrote: >> This patch adds asan support for aarch64 ilp32. It is based on the 'official' >> glibc support branch [1] (which is in turn based on the Linux branch [2]). >> >> The main changes for libsanitizer are in the kernel ABI support for internal >> syscalls. Differently from x32, ILP32 tries to leverage the 32-bit syscalls; >> syscalls whose kernel ABI takes 64-bit arguments (ftruncate, for instance) >> pass them using the new generic Linux way for 32 bits (by splitting >> high and low into two registers). >> >> So instead of adding more ad-hoc defines for ILP32, this patch extends >> SANITIZER_USES_CANONICAL_LINUX_SYSCALLS to represent both 64-bit >> argument syscalls (value of 1) and 32-bit ones (value of 2). >> >> The shadow offset used is similar to the arm one (29). >> >> I did a sanity check on aarch64-linux-gnu and x86_64-linux-gnu without >> any regressions. >> >> PS: I sent this change to llvm-commit, but it got stalled because llvm >> lacks aarch64 ilp32 support and no one could confirm no breakage due >> to a missing buildbot. Also the idea is not to sync it back to libsanitizer. > > It will be a nightmare for merges from upstream, but because the upstream is > so uncooperative probably there is no other way. What I meant is that the idea is to sync it back to libsanitizer, sorry for the misunderstanding. I can try to push it again for compiler-rt, but it took a long time to actually get any reply. > >> gcc/ >> >> * gcc/config/aarch64/aarch64.c (aarch64_asan_shadow_offset): Add > > No gcc/ prefixes in gcc/ChangeLog. Right, fixed locally. > >> TARGET_ILP32 support. >> >> libsanitizer/ > > Nor libsanitizer/ prefixes in libsanitizer/ChangeLog. 
>> --- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h >> +++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h >> @@ -117,9 +117,9 @@ typedef signed long long sptr; // NOLINT >> typedef unsigned long uptr; // NOLINT >> typedef signed long sptr; // NOLINT >> #endif // defined(_WIN64) >> -#if defined(__x86_64__) >> -// Since x32 uses ILP32 data model in 64-bit hardware mode, we must use >> -// 64-bit pointer to unwind stack frame. >> +#if defined(__x86_64__) || SANITIZER_AARCH64_ILP32 >> +// Since x32 adn AArch64 ILP32 use ILP32 data model in 64-bit hardware >> mode, > s/adn/and/ Ack. > >> @@ -445,11 +467,7 @@ uptr internal_execve(const char *filename, char *const >> argv[], >> // - sanitizer_common.h >> bool FileExists(const char *filename) { >> struct stat st; >> -#if SANITIZER_USES_CANONICAL_LINUX_SYSCALLS >> - if (internal_syscall(SYSCALL(newfstatat), AT_FDCWD, filename, &st, 0)) >> -#else >> if (internal_stat(filename, &st)) >> -#endif >> return false; >> // Sanity check: filename is a regular file. >> return S_ISREG(st.st_mode); > > I'm uneasy about this hunk, you change behavior not just on aarch64 ilp32, > but for many other targets too. Please don't. My understanding is that it is a no-op, since for SANITIZER_USES_CANONICAL_LINUX_SYSCALLS internal_stat would be an internal_syscall(newfstatat, ...) (also, the internal_* names exist exactly to abstract this kind of construction). > > Do you really want it for GCC8? We are in stage4 and this isn't a > regression... > > Jakub > I don't have a strong opinion about that; it would be good to have in GCC8, but I think I can wait.
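For context on the shadow offset values being discussed (1<<36 for aarch64 LP64, 1<<29 proposed for ILP32): ASan maps every 8 bytes of application memory to one shadow byte at a fixed offset. A sketch of that standard mapping follows; the offset constants mirror the patch under review and are quoted here for illustration only, not as a statement about what was finally committed upstream:

```c
#include <stdint.h>

/* The standard ASan shadow mapping: Shadow = (Mem >> 3) + Offset.
   One shadow byte describes the poisoning state of an 8-byte
   application granule.  */
#define ASAN_SHADOW_OFFSET_LP64  (1ULL << 36)  /* aarch64 LP64 */
#define ASAN_SHADOW_OFFSET_ILP32 (1ULL << 29)  /* aarch64 ILP32, per the patch */

static inline uint64_t
asan_mem_to_shadow (uint64_t addr, uint64_t offset)
{
  return (addr >> 3) + offset;
}
```

The offset has to be chosen so the shadow region fits inside the target's address space, which is why a 32-bit address space (ILP32) needs a much smaller constant than LP64.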
[PATCH] Add sanitizer support for AArch64 ILP32
This patch adds asan support for aarch64 ilp32. It is based on the 'official' glibc support branch [1] (which is in turn based on the Linux branch [2]). The main changes for libsanitizer are in the kernel ABI support for internal syscalls. Differently from x32, ILP32 tries to leverage the 32-bit syscalls; syscalls whose kernel ABI takes 64-bit arguments (ftruncate, for instance) pass them using the new generic Linux way for 32 bits (by splitting high and low into two registers). So instead of adding more ad-hoc defines for ILP32, this patch extends SANITIZER_USES_CANONICAL_LINUX_SYSCALLS to represent both 64-bit argument syscalls (value of 1) and 32-bit ones (value of 2). The shadow offset used is similar to the arm one (29). I did a sanity check on aarch64-linux-gnu and x86_64-linux-gnu without any regressions. PS: I sent this change to llvm-commit, but it got stalled because llvm lacks aarch64 ilp32 support and no one could confirm no breakage due to a missing buildbot. Also the idea is not to sync it back to libsanitizer. gcc/ * gcc/config/aarch64/aarch64.c (aarch64_asan_shadow_offset): Add TARGET_ILP32 support. libsanitizer/ * libsanitizer/sanitizer_common/sanitizer_internal_defs.h (uhwptr, OFF_T, operator_new_size_type): Define for SANITIZER_AARCH64_ILP32. * libsanitizer/sanitizer_common/sanitizer_linux.cc (SYSCALL_LL64): New define: wrap a 64-bit argument for syscall. (internal_ftruncate, internal_stat, internal_lstat, internal_unlink, internal_rename, internal_lseek): Add support for SANITIZER_USES_CANONICAL_LINUX_SYSCALLS equal to 2. (FileExists): Use internal_stat regardless. * libsanitizer/sanitizer_common/sanitizer_platform.h (SANITIZER_AARCH64_ILP32): Define. (SANITIZER_USES_CANONICAL_LINUX_SYSCALLS): Define to 1 or 2 depending on the expected architecture ABI. * libsanitizer/sanitizer_common/sanitizer_platform_limits_posix.cc (CHECK_SIZE_AND_OFFSET(ipc_perm, mode)): Define for AArch64 ILP32. 
* libsanitizer/sanitizer_common/sanitizer_platform_limits_posix.h (struct_kernel_stat_sz, struct_kernel_stat64_sz, __sanitizer___kernel_uid_t, __sanitizer___kernel_gid_t): Likewise. [__sanitizer_dirent] (d_ino, d_off): Likewise. [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/arm/ilp32 [2] git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git --- gcc/config/aarch64/aarch64.c | 5 ++- .../sanitizer_common/sanitizer_internal_defs.h | 12 +++ libsanitizer/sanitizer_common/sanitizer_linux.cc | 40 ++ libsanitizer/sanitizer_common/sanitizer_platform.h | 15 ++-- .../sanitizer_platform_limits_posix.cc | 3 +- .../sanitizer_platform_limits_posix.h | 13 +-- 6 files changed, 69 insertions(+), 19 deletions(-) diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 33c90ef..eebf85b 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -16171,7 +16171,10 @@ aarch64_split_dimode_const_store (rtx dst, rtx src) static unsigned HOST_WIDE_INT aarch64_asan_shadow_offset (void) { - return (HOST_WIDE_INT_1 << 36); + if (TARGET_ILP32) +return (HOST_WIDE_INT_1 << 29); + else +return (HOST_WIDE_INT_1 << 36); } static rtx diff --git a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h index edd6a21..aef83c2 100644 --- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h +++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h @@ -117,9 +117,9 @@ typedef signed long long sptr; // NOLINT typedef unsigned long uptr; // NOLINT typedef signed long sptr; // NOLINT #endif // defined(_WIN64) -#if defined(__x86_64__) -// Since x32 uses ILP32 data model in 64-bit hardware mode, we must use -// 64-bit pointer to unwind stack frame. +#if defined(__x86_64__) || SANITIZER_AARCH64_ILP32 +// Since x32 adn AArch64 ILP32 use ILP32 data model in 64-bit hardware mode, +// we must use 64-bit pointer to unwind stack frame. 
typedef unsigned long long uhwptr; // NOLINT #else typedef uptr uhwptr; // NOLINT @@ -144,7 +144,7 @@ typedef int error_t; typedef int pid_t; #if SANITIZER_FREEBSD || SANITIZER_NETBSD || SANITIZER_MAC || \ -(SANITIZER_LINUX && defined(__x86_64__)) +(SANITIZER_LINUX && (defined(__x86_64__) || SANITIZER_AARCH64_ILP32)) typedef u64 OFF_T; #else typedef uptr OFF_T; @@ -154,8 +154,8 @@ typedef u64 OFF64_T; #if (SANITIZER_WORDSIZE == 64) || SANITIZER_MAC typedef uptr operator_new_size_type; #else -# if defined(__s390__) && !defined(__s390x__) -// Special case: 31-bit s390 has unsigned long as size_t. +# if (defined(__s390__) && !defined(__s390x__)) || SANITIZER_AARCH64_ILP32 +// Special case: 31-bit s390 and AArch64 ILP32 have unsigned long as size_t. typedef unsigned long operator_new_size_type; # else typedef
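The 64-bit-argument splitting that SANITIZER_USES_CANONICAL_LINUX_SYSCALLS == 2 is meant to model can be sketched as follows. The macro names are hypothetical (the patch's actual helper is SYSCALL_LL64, whose exact expansion is not shown here), and in the real generic 32-bit Linux ABI the register-pair ordering additionally depends on endianness and on the syscall's argument-alignment rules:

```c
#include <stdint.h>

/* Split a 64-bit syscall argument (e.g. ftruncate's length) into the
   two 32-bit halves passed in a register pair on a 32-bit kernel ABI.
   Which half goes in the lower-numbered register is endian-dependent
   in the real ABI; this sketch only shows the split itself.  */
#define SPLIT_LL_LOW(v)  ((uint32_t) ((uint64_t) (v) & 0xffffffffu))
#define SPLIT_LL_HIGH(v) ((uint32_t) ((uint64_t) (v) >> 32))
```

So a hypothetical `internal_ftruncate(fd, len)` on the 32-bit ABI would issue something like `syscall(nr, fd, SPLIT_LL_LOW(len), SPLIT_LL_HIGH(len))` on a little-endian target, rather than passing `len` in a single register as the 64-bit ABI does.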
Re: [PATCH v6] aarch64: Add split-stack support
On 13/02/2018 13:13, Szabolcs Nagy wrote: > On 07/02/18 18:07, Adhemerval Zanella wrote: > 5. The TCB support on GLIBC is meant to be included in version 2.28. >> > ... >> +/* -fsplit-stack uses a TCB field available on glibc-2.27. GLIBC also >> + exports symbol, __tcb_private_ss, to signal it has the field available >> + on TCB bloc. This aims to prevent binaries linked against newer >> + GLIBC to run on non-supported ones. */ > > > i suspect this needs to be updated since the glibc patch > is not committed yet. > > (i'll review the glibc patch, if it looks ok then it can > be committed after the gcc side is accepted.) I fixed the commit message locally, thanks for checking on this. > >> + >> +static bool >> +aarch64_supports_split_stack (bool report ATTRIBUTE_UNUSED, >> + struct gcc_options *opts ATTRIBUTE_UNUSED) >> +{ >> +#ifndef TARGET_GLIBC_MAJOR >> +#define TARGET_GLIBC_MAJOR 0 >> +#endif >> +#ifndef TARGET_GLIBC_MINOR >> +#define TARGET_GLIBC_MINOR 0 >> +#endif >> + /* Note: Can't test DEFAULT_ABI here, it isn't set until later. */ >> + if (TARGET_GLIBC_MAJOR * 1000 + TARGET_GLIBC_MINOR >= 2026) >> + return true; >> + >> + if (report) >> + error ("%<-fsplit-stack%> currently only supported on AArch64 GNU/Linux >> with glibc-2.27 or later"); >> + return false; >> +} >> + >> +#undef TARGET_SUPPORTS_SPLIT_STACK >> +#define TARGET_SUPPORTS_SPLIT_STACK aarch64_supports_split_stack >> +
[PATCH v6] aarch64: Add split-stack support
Changes from previous version:

- Changed the way __morestack is called to use a branch with link instead
  of a simple branch.  This allows using a call instruction and avoids
  possible issues with later optimization passes which might see a branch
  outside the instruction block (as noticed in previous iterations while
  building more complex workloads such as SPECcpu2006).

- Changed the return address to use the branch-with-link value and set x12
  to save x30.  This simplifies the instructions required to set up and
  save the return address.

--

This patch adds split-stack support on aarch64 (PR #67877).  As for other
ports, this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a split-stack field
is allocated before the TCB by glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and compiler
support is adjusted (split-stack prologue, va_start for argument handling).
I also plan to send the gold support to adjust stack allocation across
split-stack and default code calls.

The current approach is to set the final stack adjustment using at most
2 instructions (mov/movk), which limits stack allocation to an upper limit
of 4GB.  The morestack call is non-standard, with x10 holding the requested
stack pointer, x11 the argument pointer (if required), and x12 the
continuation address to return to.  Unwinding is handled by a personality
routine that knows how to find stack segments.

The split-stack prologue on function entry is as follows (this goes before
the usual function prologue):

function:
	mrs	x9, tpidr_el0
	ldur	x9, [x9, -8]
	mov	x10,
	movk	x10, #0x0, lsl #16
	sub	x10, sp, x10
	mov	x11, sp		# if function has stacked arguments
	cmp	x9, x10
	bcc	.LX
main_fn_entry:
	[function prologue]
.LX:
	bl	__morestack
	b	main_fn_entry

Notes:

1. Even if a function does not allocate a stack frame, a split-stack
   prologue is created.
   This is to avoid issues with tail calls for external symbols which
   might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack TCB
   ldur to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in
   "allocate" instructions (so non-split-stack code gets a larger stack).
   The amount is tunable by a linker option.  This feature is only
   implemented in the GNU gold linker.

4. AArch64 does not handle >4G stacks initially, and although it is
   possible to implement that, limiting to 4G allows materializing the
   allocation with only 2 instructions (mov + movk), thus simplifying the
   linker adjustments required.  Supporting multiple threads each requiring
   more than 4G of stack is probably not that important, and likely to OOM
   at run time.

5. The TCB support on GLIBC is meant to be included in version 2.28.

6. Besides the regression tests, I also checked with a SPECcpu2006 run with
   -fsplit-stack as an additional option.  I saw no regression besides
   416.gamess, which fails on trunk as well (not sure if due to some
   misconfiguration in my environment).

libgcc/ChangeLog:

	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
	aarch64*-*-linux.
	* libgcc/config/aarch64/morestack-c.c: New file.
	* libgcc/config/aarch64/morestack.S: Likewise.
	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
	* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
	code.

gcc/ChangeLog:

	* common/config/aarch64/aarch64-common.c
	(aarch64_supports_split_stack): New function.
	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
	macro.
	* gcc/config/aarch64/aarch64-protos.h: Add
	aarch64_expand_split_stack_prologue and
	aarch64_split_stack_space_check.
	* gcc/config/aarch64/aarch64.c (aarch64_expand_builtin_va_start):
	Use internal argument pointer instead of virtual_incoming_args_rtx.
	(morestack_ref): New symbol.
	(aarch64_load_split_stack_value): New function.
	(aarch64_expand_split_stack_prologue): Likewise.
	(aarch64_internal_arg_pointer): Likewise.
	(aarch64_file_end): Emit the split-stack note sections.
	(aarch64_split_stack_space_check): Likewise.
	(TARGET_ASM_FILE_END): New macro.
	(TARGET_INTERNAL_ARG_POINTER): Likewise.
	* gcc/config/aarch64/aarch64.h (aarch64_frame): Add
	split_stack_arg_pointer to set up the argument pointer when using
	split-stack.
	* gcc/config/aarch64/aarch64.md (UNSPECV_STACK_CHECK): New define.
	(split_stack_prologue): New expand.
	(split_stack_space_check): Likewise.
---
Re: [PATCH] Enable ifunc attribute by default for ARM GNU/Linux
On 09/10/2017 19:20, Joseph Myers wrote:
> On Mon, 9 Oct 2017, Adhemerval Zanella wrote:
>
>>  *-*-linux*)
>>  	case ${target} in
>> -	aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-*)
>> +	aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-* | arm*-*)
>
> I think the cases here are meant to be in alphabetical order.

Ack, I will change.  Is it ok to commit with the change?
[PATCH] Enable ifunc attribute by default for ARM GNU/Linux
Similar to other architectures with IFUNC binutils/glibc support, this
patch enables the ifunc attribute for ARM GNU/Linux.  Although not
required for building master GLIBC, the intention is to allow refactoring
its assembly implementations to C [1].

Tested compilation of glibc (in conjunction with a glibc patch to support
using the attribute on ARM) with build-many-glibcs.py (with a patch to add
an armv7 variant which enables multiarch).  I have not run the GCC tests
for ARM.

Adhemerval Zanella <adhemerval.zane...@linaro.org>

	* config.gcc (default_gnu_indirect_function): Default to yes
	for arm*-*-linux* with glibc.

[1] https://sourceware.org/ml/libc-alpha/2017-10/msg00334.html
---
 gcc/ChangeLog  | 5 +++++
 gcc/config.gcc | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 91a55e8..26aa8f6 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -3104,7 +3104,7 @@ case ${target} in
 	;;
 *-*-linux*)
 	case ${target} in
-	aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-*)
+	aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-* | arm*-*)
 		default_gnu_indirect_function=yes
 		;;
 	esac
-- 
2.7.4
Re: [PATCH v5] aarch64: Add split-stack initial support
Ping.

On 26/07/2017 16:08, Adhemerval Zanella wrote:
> This is an updated patch based on my previous submission [1].  The
> changes from the previous version are:
>
> - Update aarch64_supports_split_stack to return true and let the loader
>   actually emit an error if it does not provide the required TCB field.
>   The TCB field support is planned for GLIBC 2.27.
>
> - Some cleanup on morestack-c.c to avoid code duplication (ss_pointer
>   function).
>
> I am still seeing some sporadic failures, but they are due to missing
> linker support for split calls due to insufficient stack size (on pprof
> and other go tests that call backtrace on signal handling).  This could
> be mitigated by increasing the BACKOFF to a higher value, but to not
> deviate from other ports with split-stack support this patch is using
> the default values (initial stack size of 16k and backoff of 4k).  This
> should be correctly handled with proper gold support (as for other
> ports).
>
> --
>
> This patch adds split-stack support on aarch64 (PR #67877).  As for
> other ports this patch should be used along with glibc and gold support.
>
> The support is done similarly to other architectures: a split-stack
> field is allocated before the TCB by glibc, a target-specific
> __morestack implementation and helper functions are added in libgcc, and
> compiler support is adjusted (split-stack prologue, va_start for
> argument handling).  I also plan to send the gold support to adjust
> stack allocation across split-stack and default code calls.
>
> The current approach is to set the final stack adjustment using at most
> 2 instructions (mov/movk), which limits stack allocation to an upper
> limit of 4GB.  The morestack call is non-standard, with x10 holding the
> requested stack pointer, x11 the argument pointer (if required), and x12
> the continuation address to return to.  Unwinding is handled by a
> personality routine that knows how to find stack segments.
>
> The split-stack prologue on function entry is as follows (this goes
> before the usual function prologue):
>
> function:
> 	mrs	x9, tpidr_el0
> 	mov	x10,
> 	movk	x10, #0x0, lsl #16
> 	sub	x10, sp, x10
> 	mov	x11, sp		# if function has stacked arguments
> 	adrp	x12, main_fn_entry
> 	add	x12, x12, :lo12:.L2
> 	cmp	x9, x10
> 	b.lt
> 	b	__morestack
> main_fn_entry:
> 	[function prologue]
>
> Notes:
>
> 1. Even if a function does not allocate a stack frame, a split-stack
>    prologue is created.  This is to avoid issues with tail calls for
>    external symbols which might require linker adjustment
>    (libgo/runtime/go-varargs.c).
>
> 2. Basic-block reordering (enabled with -O2) will move the split-stack
>    TCB ldr to after the required stack calculation.
>
> 3. Similar to powerpc, when the linker detects a call from split-stack
>    to non-split-stack code, it adds 16k (or more) to the value found in
>    "allocate" instructions (so non-split-stack code gets a larger
>    stack).  The amount is tunable by a linker option.  The edit means
>    aarch64 does not need to implement __morestack_non_split, necessary
>    on x86 because insufficient space is available there to edit the
>    stack comparison code.  This feature is only implemented in the GNU
>    gold linker.
>
> 4. AArch64 does not handle >4G stacks initially, and although it is
>    possible to implement that, limiting to 4G allows materializing the
>    allocation with only 2 instructions (mov + movk), thus simplifying
>    the linker adjustments required.  Supporting multiple threads each
>    requiring more than 4G of stack is probably not that important, and
>    likely to OOM at run time.
>
> 5. The TCB support on GLIBC is meant to be included in version 2.26.
>
> 6. The continuation address materialized in x12 is done using 'adrp'
>    plus add and a static relocation.  Current code uses the
>    aarch64_expand_mov_immediate function and since a better alternative
>    would be 'adp', it could be a future optimization (not implemented
>    in this patch).
>
> libgcc/ChangeLog:
>
> 	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
> 	aarch64*-*-linux.
> 	* libgcc/config/aarch64/morestack-c.c: New file.
> 	* libgcc/config/aarch64/morestack.S: Likewise.
> 	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
> 	* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
> 	code.
>
> gcc/ChangeLog:
>
> 	* common/config/aarch64/aarch64-common.c
> 	(aarch64_supports_split_stack): New function.
> 	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
> 	* gcc/config/aarch64/aar
[PATCH v5] aarch64: Add split-stack initial support
This is an updated patch based on my previous submission [1].  The changes
from the previous version are:

- Update aarch64_supports_split_stack to return true and let the loader
  actually emit an error if it does not provide the required TCB field.
  The TCB field support is planned for GLIBC 2.27.

- Some cleanup on morestack-c.c to avoid code duplication (ss_pointer
  function).

I am still seeing some sporadic failures, but they are due to missing
linker support for split calls due to insufficient stack size (on pprof
and other go tests that call backtrace on signal handling).  This could be
mitigated by increasing the BACKOFF to a higher value, but to not deviate
from other ports with split-stack support this patch is using the default
values (initial stack size of 16k and backoff of 4k).  This should be
correctly handled with proper gold support (as for other ports).

--

This patch adds split-stack support on aarch64 (PR #67877).  As for other
ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a split-stack field
is allocated before the TCB by glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and compiler
support is adjusted (split-stack prologue, va_start for argument handling).
I also plan to send the gold support to adjust stack allocation across
split-stack and default code calls.

The current approach is to set the final stack adjustment using at most
2 instructions (mov/movk), which limits stack allocation to an upper limit
of 4GB.  The morestack call is non-standard, with x10 holding the requested
stack pointer, x11 the argument pointer (if required), and x12 the
continuation address to return to.  Unwinding is handled by a personality
routine that knows how to find stack segments.
The split-stack prologue on function entry is as follows (this goes before
the usual function prologue):

function:
	mrs	x9, tpidr_el0
	mov	x10,
	movk	x10, #0x0, lsl #16
	sub	x10, sp, x10
	mov	x11, sp		# if function has stacked arguments
	adrp	x12, main_fn_entry
	add	x12, x12, :lo12:.L2
	cmp	x9, x10
	b.lt
	b	__morestack
main_fn_entry:
	[function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack
   prologue is created.  This is to avoid issues with tail calls for
   external symbols which might require linker adjustment
   (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack TCB
   ldr to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in
   "allocate" instructions (so non-split-stack code gets a larger stack).
   The amount is tunable by a linker option.  The edit means aarch64 does
   not need to implement __morestack_non_split, necessary on x86 because
   insufficient space is available there to edit the stack comparison
   code.  This feature is only implemented in the GNU gold linker.

4. AArch64 does not handle >4G stacks initially, and although it is
   possible to implement that, limiting to 4G allows materializing the
   allocation with only 2 instructions (mov + movk), thus simplifying the
   linker adjustments required.  Supporting multiple threads each requiring
   more than 4G of stack is probably not that important, and likely to OOM
   at run time.

5. The TCB support on GLIBC is meant to be included in version 2.26.

6. The continuation address materialized in x12 is done using 'adrp' plus
   add and a static relocation.  Current code uses the
   aarch64_expand_mov_immediate function and since a better alternative
   would be 'adp', it could be a future optimization (not implemented in
   this patch).

libgcc/ChangeLog:

	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
	aarch64*-*-linux.
	* libgcc/config/aarch64/morestack-c.c: New file.
	* libgcc/config/aarch64/morestack.S: Likewise.
	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
	* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
	code.

gcc/ChangeLog:

	* common/config/aarch64/aarch64-common.c
	(aarch64_supports_split_stack): New function.
	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
	macro.
	* gcc/config/aarch64/aarch64-protos.h: Add
	aarch64_expand_split_stack_prologue and
	aarch64_split_stack_space_check.
	* gcc/config/aarch64/aarch64.c (aarch64_gen_far_branch): Add support
	to emit 'b' instruction to rtx different than LABEL_REF.
	(aarch64_expand_builtin_va_start): Use internal argument pointer
	instead of virtual_incoming_args_rtx.
	(morestack_ref): New symbol.
	(aarch64_load_split_stack_value): New
[PATCH] sparc: Set noexecstack on mulsi3, divsi3, and modsi3
A recent GLIBC fix for sparc [1] caused some configurations to show an
executable stack for ld.so (shown by the elf/check-execstack testcase
failure).  This is because, with a sparc toolchain generated from
build-many-glibcs.py (a GLIBC script to produce cross-compiling
toolchains), the minimum supported sparc32 version is pre-v9, which
requires a software implementation of '.udiv'.  Since we are now using the
libgcc.a one instead, it must carry the '.note.GNU-stack' note so the
linker can properly mark the stack non-executable.

From a build using a toolchain from build-many-glibcs.py:

elf/librtld.os.map
[...]
.../sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_divsi3.o)
.../sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_udivdi3.o) (.udiv)
.../sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_clz.o)
.../lib/gcc/sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_udivdi3.o) (__clz_tab)
[...]

And dumping the _udivdi3.o section headers (note that no .note.GNU-stack
section is present, so the linker assumes an executable stack):

  [Nr] Name              Type      Off     Size    ES  Flg Lk Inf Al
  [ 0]                   NULL      000000  000000  00       0   0  0
  [ 1] .text             PROGBITS  000034  0002b0  00   AX  0   0  4
  [ 2] .data             PROGBITS  0002e4  000000  00   WA  0   0  1
  [ 3] .bss              NOBITS    0002e4  000000  00   WA  0   0  1
  [ 4] .debug_line       PROGBITS  0002e4  00010d  00       0   0  1
  [ 5] .rela.debug_line  RELA      0007c0  00000c  0c    I 12   4  4
  [ 6] .debug_info       PROGBITS  0003f1  0000ab  00       0   0  1
  [ 7] .rela.debug_info  RELA      0007cc  000030  0c    I 12   6  4
  [ 8] .debug_abbrev     PROGBITS  00049c  000014  00       0   0  1
  [ 9] .debug_aranges    PROGBITS  0004b0  000020  00       0   0  8
  [10] .rela.debug_arang RELA      0007fc  000018  0c    I 12   9  4
  [11] .shstrtab         STRTAB    000814  000070  00       0   0  1
  [12] .symtab           SYMTAB    0004d0  000220  10      13  32  4
  [13] .strtab           STRTAB    0006f0  0000cf  00       0   0  1

I am not seeing this on a native gcc build which I configured with
'--with-arch-directory=sparc64 --enable-multiarch --enable-targets=all
--with-cpu-32=ultrasparc --with-long-double-128 --enable-multilib'.
Both libgcc's __udivdi3 and __umoddi3 do not pull in .udiv, since for that
libgcc build both use hardware instructions:

elf/librtld.os.map
/home/azanella/gcc/install/lib/gcc/sparc64-linux-gnu/6.3.1/32/libgcc.a(_udivdi3.o)
/home/azanella/glibc/glibc-git-build-sparcv9/elf/dl-allobjs.os (__udivdi3)
/home/azanella/gcc/install/lib/gcc/sparc64-linux-gnu/6.3.1/32/libgcc.a(_umoddi3.o)
/home/azanella/glibc/glibc-git-build-sparcv9/elf/dl-allobjs.os (__umoddi3)

This patch adds the missing noexecstack marking to the sparc assembly
implementation.  I saw no regression in the gcc testsuite, and it fixes
the regression on the GLIBC side.

libgcc/
	* config/sparc/lb1spc.S [__ELF__ && __linux__]: Emit
	.note.GNU-stack section for a non-executable stack.

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=bdc543e338281da051b3dc06eae96c330a485ce6
---
 libgcc/ChangeLog             | 5 +++++
 libgcc/config/sparc/lb1spc.S | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/libgcc/config/sparc/lb1spc.S b/libgcc/config/sparc/lb1spc.S
index b60bd57..e693864 100644
--- a/libgcc/config/sparc/lb1spc.S
+++ b/libgcc/config/sparc/lb1spc.S
@@ -5,6 +5,12 @@
    slightly edited to match the desired calling convention, and also to
    optimize them for our purposes.  */
 
+/* An executable stack is *not* required for these functions.  */
+#if defined(__ELF__) && defined(__linux__)
+.section .note.GNU-stack,"",%progbits
+.previous
+#endif
+
 #ifdef L_mulsi3
 .text
 .align 4
-- 
2.7.4
Re: [PATCH v4] aarch64: Add split-stack initial support
Ping.

On 15/02/2017 21:23, Adhemerval Zanella wrote:
> This is an updated patch from my previous version (v3), based mainly on
> glibc comments on the exported loader symbol [1].  The changes from the
> previous version are:
>
> - Remove the __tcb_private_ss call in libgcc and emit a data usage of
>   the glibc symbol when split-stack is used.  This removes the runtime
>   errors when running on older glibc and instead makes the loader fail
>   with a missing symbol.
>
> - Add a glibc version check to the split-stack support check.
>
> - Some comment fixes in morestack.S.
>
> - Remove some compile warnings in morestack-c.c.
>
> --
>
> This patch adds split-stack support on aarch64 (PR #67877).  As for
> other ports this patch should be used along with glibc and gold support.
>
> The support is done similarly to other architectures: a split-stack
> field is allocated before the TCB by glibc, a target-specific
> __morestack implementation and helper functions are added in libgcc, and
> compiler support is adjusted (split-stack prologue, va_start for
> argument handling).  I also plan to send the gold support to adjust
> stack allocation across split-stack and default code calls.
>
> The current approach is to set the final stack adjustment using at most
> 2 instructions (mov/movk), which limits stack allocation to an upper
> limit of 4GB.  The morestack call is non-standard, with x10 holding the
> requested stack pointer, x11 the argument pointer (if required), and x12
> the continuation address to return to.  Unwinding is handled by a
> personality routine that knows how to find stack segments.
>
> The split-stack prologue on function entry is as follows (this goes
> before the usual function prologue):
>
> function:
> 	mrs	x9, tpidr_el0
> 	mov	x10,
> 	movk	x10, #0x0, lsl #16
> 	sub	x10, sp, x10
> 	mov	x11, sp		# if function has stacked arguments
> 	adrp	x12, main_fn_entry
> 	add	x12, x12, :lo12:.L2
> 	cmp	x9, x10
> 	b.lt
> 	b	__morestack
> main_fn_entry:
> 	[function prologue]
>
> Notes:
>
> 1.
Even if a function does not allocate a stack frame, a split-stack
>    prologue is created.  This is to avoid issues with tail calls for
>    external symbols which might require linker adjustment
>    (libgo/runtime/go-varargs.c).
>
> 2. Basic-block reordering (enabled with -O2) will move the split-stack
>    TCB ldr to after the required stack calculation.
>
> 3. Similar to powerpc, when the linker detects a call from split-stack
>    to non-split-stack code, it adds 16k (or more) to the value found in
>    "allocate" instructions (so non-split-stack code gets a larger
>    stack).  The amount is tunable by a linker option.  The edit means
>    aarch64 does not need to implement __morestack_non_split, necessary
>    on x86 because insufficient space is available there to edit the
>    stack comparison code.  This feature is only implemented in the GNU
>    gold linker.
>
> 4. AArch64 does not handle >4G stacks initially, and although it is
>    possible to implement that, limiting to 4G allows materializing the
>    allocation with only 2 instructions (mov + movk), thus simplifying
>    the linker adjustments required.  Supporting multiple threads each
>    requiring more than 4G of stack is probably not that important, and
>    likely to OOM at run time.
>
> 5. The TCB support on GLIBC is meant to be included in version 2.26.
>
> 6. The continuation address materialized in x12 is done using 'adrp'
>    plus add and a static relocation.  Current code uses the
>    aarch64_expand_mov_immediate function and since a better alternative
>    would be 'adp', it could be a future optimization (not implemented
>    in this patch).
>
> [1] https://sourceware.org/ml/libc-alpha/2017-02/msg00272.html
>
> libgcc/ChangeLog:
>
> 	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
> 	aarch64*-*-linux.
> 	* libgcc/config/aarch64/morestack-c.c: New file.
> 	* libgcc/config/aarch64/morestack.S: Likewise.
> 	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
> 	* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
> 	code.
>
> gcc/ChangeLog:
>
> 	* common/config/aarch64/aarch64-common.c
> 	(aarch64_supports_split_stack): New function.
> 	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
> 	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
> 	macro.
> 	* gcc/config/aarch64/aarch64-protos.h: Add
> 	aarch64_expand_split_stack_prologue and
> 	aarch64_split_stack_space_check.
> 	* gcc/con
[PATCH v4] aarch64: Add split-stack initial support
This is an updated patch from my previous version (v3), based mainly on
glibc comments on the exported loader symbol [1].  The changes from the
previous version are:

- Remove the __tcb_private_ss call in libgcc and emit a data usage of the
  glibc symbol when split-stack is used.  This removes the runtime errors
  when running on older glibc and instead makes the loader fail with a
  missing symbol.

- Add a glibc version check to the split-stack support check.

- Some comment fixes in morestack.S.

- Remove some compile warnings in morestack-c.c.

--

This patch adds split-stack support on aarch64 (PR #67877).  As for other
ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a split-stack field
is allocated before the TCB by glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and compiler
support is adjusted (split-stack prologue, va_start for argument handling).
I also plan to send the gold support to adjust stack allocation across
split-stack and default code calls.

The current approach is to set the final stack adjustment using at most
2 instructions (mov/movk), which limits stack allocation to an upper limit
of 4GB.  The morestack call is non-standard, with x10 holding the requested
stack pointer, x11 the argument pointer (if required), and x12 the
continuation address to return to.  Unwinding is handled by a personality
routine that knows how to find stack segments.

The split-stack prologue on function entry is as follows (this goes before
the usual function prologue):

function:
	mrs	x9, tpidr_el0
	mov	x10,
	movk	x10, #0x0, lsl #16
	sub	x10, sp, x10
	mov	x11, sp		# if function has stacked arguments
	adrp	x12, main_fn_entry
	add	x12, x12, :lo12:.L2
	cmp	x9, x10
	b.lt
	b	__morestack
main_fn_entry:
	[function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack
   prologue is created.
   This is to avoid issues with tail calls for external symbols which
   might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack TCB
   ldr to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in
   "allocate" instructions (so non-split-stack code gets a larger stack).
   The amount is tunable by a linker option.  The edit means aarch64 does
   not need to implement __morestack_non_split, necessary on x86 because
   insufficient space is available there to edit the stack comparison
   code.  This feature is only implemented in the GNU gold linker.

4. AArch64 does not handle >4G stacks initially, and although it is
   possible to implement that, limiting to 4G allows materializing the
   allocation with only 2 instructions (mov + movk), thus simplifying the
   linker adjustments required.  Supporting multiple threads each requiring
   more than 4G of stack is probably not that important, and likely to OOM
   at run time.

5. The TCB support on GLIBC is meant to be included in version 2.26.

6. The continuation address materialized in x12 is done using 'adrp' plus
   add and a static relocation.  Current code uses the
   aarch64_expand_mov_immediate function and since a better alternative
   would be 'adp', it could be a future optimization (not implemented in
   this patch).

[1] https://sourceware.org/ml/libc-alpha/2017-02/msg00272.html

libgcc/ChangeLog:

	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
	aarch64*-*-linux.
	* libgcc/config/aarch64/morestack-c.c: New file.
	* libgcc/config/aarch64/morestack.S: Likewise.
	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
	* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
	code.

gcc/ChangeLog:

	* common/config/aarch64/aarch64-common.c
	(aarch64_supports_split_stack): New function.
	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
	macro.
	* gcc/config/aarch64/aarch64-protos.h: Add
	aarch64_expand_split_stack_prologue and
	aarch64_split_stack_space_check.
	* gcc/config/aarch64/aarch64.c (aarch64_gen_far_branch): Add support
	to emit 'b' instruction to rtx different than LABEL_REF.
	(aarch64_expand_builtin_va_start): Use internal argument pointer
	instead of virtual_incoming_args_rtx.
	(morestack_ref): New symbol.
	(aarch64_load_split_stack_value): New function.
	(aarch64_expand_split_stack_prologue): Likewise.
	(aarch64_internal_arg_pointer): Likewise.
	(aarch64_split_stack_space_check): Likewise.
	(aarch64_file_end): Emit the split-stack note sections.
[PATCH v3] aarch64: Add split-stack support
Changes from previous version:

- The split-stack area is now allocated before the thread pointer (instead
  of in tcbhead_t, which is positioned after it), as described in the
  glibc patch [1].  This has the advantage of not requiring linker changes
  to take the new tcbhead_t size into consideration, and it also fixed the
  regression I was seeing with the previous version of this patch at
  runtime/pprof: __splitstack_setcontext updates some internal TLS
  variables, and with split-stack trying to update a larger tcbhead_t
  field it clobbered the TLS variable when using a default static linker
  (which expected a default tcbhead_t structure of size 16).

- Function aarch64_split_stack_space_check is changed back to the previous
  implementation, because a wrong comment in my previous version led to
  wrong assumptions in [2].  The function will generate a jump to the
  'label' variable (which will call __morestack_allocate_stack_space), not
  the arch-specific __morestack implementation.  I corrected the comment
  as well.

- Change the stack adjustments to always use 2 instructions. [3]

I am not seeing any more issues when using split-stack, nor check-go
regressions, with this version.

[1] https://sourceware.org/ml/libc-alpha/2017-02/msg00247.html
[2] https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00093.html
[3] https://gcc.gnu.org/ml/gcc-patches/2017-02/msg00055.html

--

This patch adds split-stack support on aarch64 (PR #67877).  As for other
ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field
is added to the TCB in glibc, a target-specific __morestack implementation
and helper functions are added in libgcc, and compiler support is adjusted
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack and
default code calls.
The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2
instructions (either just a movn plus nop, or a movn followed by movk).
The morestack call is non-standard, with x10 holding the requested stack
pointer, x11 the argument pointer, and x12 the continuation address to
return to.  Unwinding is handled by a personality routine that knows how
to find stack segments.

The split-stack prologue on function entry is as follows (this goes before
the usual function prologue):

function:
	mrs	x9, tpidr_el0
	mov	x10, -
	movk	0x0
	add	x10, sp, x10
	mov	x11, sp		# if function has stacked arguments
	adrp	x12, main_fn_entry
	add	x12, x12, :lo12:.L2
	cmp	x9, x10
	b.lt
	b	__morestack
main_fn_entry:
	[function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack
   prologue is created.  This is to avoid issues with tail calls for
   external symbols which might require linker adjustment
   (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack TCB
   ldr to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in
   "allocate" instructions (so non-split-stack code gets a larger stack).
   The amount is tunable by a linker option.  The edit means aarch64 does
   not need to implement __morestack_non_split, necessary on x86 because
   insufficient space is available there to edit the stack comparison
   code.  This feature is only implemented in the GNU gold linker.

4. AArch64 does not handle >4G stacks initially, and although it is
   possible to implement that, limiting to 4G allows materializing the
   allocation with only 2 instructions (mov + movk), thus simplifying the
   linker adjustments required.  Supporting multiple threads each requiring
   more than 4G of stack is probably not that important, and likely to OOM
   at run time.

5.
The TCB support on GLIBC is meant to be included in version 2.25.

6. The continuation address materialized in x12 is done using 'adrp' plus
   add and a static relocation.  Current code uses the
   aarch64_expand_mov_immediate function and since a better alternative
   would be 'adp', it could be a future optimization (not implemented in
   this patch).

libgcc/ChangeLog:

	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
	aarch64*-*-linux.
	* libgcc/config/aarch64/morestack-c.c: New file.
	* libgcc/config/aarch64/morestack.S: Likewise.
	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
	* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
	code.

gcc/ChangeLog:

	PR 67877
	* common/config/aarch64/aarch64-common.c
	(aarch64_supports_split_stack): New function.
	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
Re: [PATCH v2] aarch64: Add split-stack initial support
On 25/01/2017 10:10, Jiong Wang wrote:
> On 24/01/17 18:05, Adhemerval Zanella wrote:
>>
>> On 03/01/2017 13:13, Wilco Dijkstra wrote:
>>
>>> +  /* If function uses stacked arguments save the old stack value so
>>> morestack
>>> +     can return it.  */
>>> +  reg11 = gen_rtx_REG (Pmode, R11_REGNUM);
>>> +  if (cfun->machine->frame.saved_regs_size
>>> +      || cfun->machine->frame.saved_varargs_size)
>>> +    emit_move_insn (reg11, stack_pointer_rtx);
>>>
>>> This doesn't look right - we could have many arguments even without
>>> varargs or saved regs.  This would need to check varargs as well as
>>> crtl->args.size (I believe that is the size of the arguments on the
>>> stack).  It's fine to omit this optimization in the first version -
>>> we already emit 2-3 extra instructions for the check anyway.
>>
>> I will check for a better solution.
>
> Hi Adhemerval
>
> My only concern on this patch is the initialization of R11 (internal
> arg pointer).  The current implementation looks to me to be generating
> wrong code for a testcase that simply returns the sum of ten int
> params.  I see the function body using R11 while there is no
> initialization of it in the split prologue, so if the execution flow is
> *not* through __morestack, then R11 is not initialized.
> As Wilco suggested, I feel using crtl->args.size instead of
> cfun->machine->frame.saved_regs_size might be the correct approach
> after checking assign_parms in function.c.

Hi Jiong,

Indeed the previous version, which used 'saved_regs_size', is wrong for
stacked parameters.  A simple 10-argument function call shows, at the
point reg11 is evaluated:

  cfun->machine->frame.saved_regs_size    = 0
  cfun->machine->frame.saved_varargs_size = 0
  crtl->args.size                         = 16

So indeed 'crtl->args.size' seems the correct value to use in this case
(which will trigger the correct reg11 set for the split-stack case).  In
this version I also removed some unused variables left in the previous
patch.
>From 30cbedc303d364dd94f0d35abee1d237a3671cdb Mon Sep 17 00:00:00 2001
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
Date: Wed, 4 May 2016 21:13:39 +
Subject: [PATCH] aarch64: Add split-stack initial support

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss
field is added on the TCB in glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and compiler
support is adjusted (split-stack prologue, va_start for argument
handling).  I also plan to send the gold support to adjust stack
allocation across split-stack and default code calls.

The current approach is similar to the powerpc one: at most 2 GB of
stack allocation is supported, so stack adjustments can be done with 2
instructions (either just a movn plus nop or a movn followed by movk).
The morestack call is non-standard, with x10 holding the requested stack
pointer, x11 the argument pointer, and x12 the return continuation
address.  Unwinding is handled by a personality routine that knows how
to find stack segments.

The split-stack prologue on function entry is as follows (this goes
before the usual function prologue):

function:
        mrs     x9, tpidr_el0
        mov     x10, -
        movk    0x0
        add     x10, sp, x10
        mov     x11, sp         # if function has stacked arguments
        adrp    x12, main_fn_entry
        add     x12, x12, :lo12:.L2
        cmp     x9, x10
        b.lt    main_fn_entry
        b       __morestack
main_fn_entry:
        [function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack
prologue is created.  This is to avoid issues with tail calls to
external symbols which might require linker adjustment
(libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack
TCB ldr to after the required stack calculation.

3.
Similar to powerpc, when the linker detects a call from split-stack to
non-split-stack code, it adds 16k (or more) to the value found in
"allocate" instructions (so non-split-stack code gets a larger stack).
The amount is tunable by a linker option.  The edit means aarch64 does
not need to implement __morestack_non_split, necessary on x86 because
insufficient space is available there to edit the stack comparison code.
This feature is only implemented in the GNU gold linker.

4. AArch64 does not initially handle >4G stacks, and although it is
possible to implement, limiting to 4G allows materializing the
allocation with only 2 instructions (mov + movk) and thus simplifies the
linker adjustments required.  Supporting multiple threads each
Re: [PATCH v2] aarch64: Add split-stack initial support
On 03/01/2017 13:13, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>
> Sorry for the late reply - but I think it's getting there.  A few more
> comments:

No worries.

> +  /* If function uses stacked arguments save the old stack value so
> morestack
> +     can return it.  */
> +  reg11 = gen_rtx_REG (Pmode, R11_REGNUM);
> +  if (cfun->machine->frame.saved_regs_size
> +      || cfun->machine->frame.saved_varargs_size)
> +    emit_move_insn (reg11, stack_pointer_rtx);
>
> This doesn't look right - we could have many arguments even without
> varargs or saved regs.  This would need to check varargs as well as
> crtl->args.size (I believe that is the size of the arguments on the
> stack).  It's fine to omit this optimization in the first version - we
> already emit 2-3 extra instructions for the check anyway.

I will check for a better solution.

> +void
> +aarch64_split_stack_space_check (rtx size, rtx label)
> {
> +  rtx mem, ssvalue, cc, cmp, jump, temp;
> +  rtx requested = gen_reg_rtx (Pmode);
> +  /* Offset from thread pointer to __private_ss.  */
> +  int psso = 0x10;
> +
> +  /* Load __private_ss from TCB.  */
> +  ssvalue = gen_rtx_REG (Pmode, R9_REGNUM);
>
> ssvalue doesn't need to be a hardcoded register.

Indeed, and it seems that this was not being triggered.  I have fixed it
in this version.

> +  emit_insn (gen_aarch64_load_tp_hard (ssvalue));
> +  mem = gen_rtx_MEM (Pmode, plus_constant (Pmode, ssvalue, psso));
> +  emit_move_insn (ssvalue, mem);
> +
> +  temp = gen_rtx_REG (Pmode, R10_REGNUM);
> +
> +  /* And compare it with frame pointer plus required stack.  */
> +  size = force_reg (Pmode, size);
> +  emit_move_insn (requested, gen_rtx_MINUS (Pmode, stack_pointer_rtx,
> size));
> +
> +  /* Jump to __morestack call if current __private_ss does not
> suffice.  */
> +  cc = aarch64_gen_compare_reg (LT, temp, ssvalue);
>
> This uses X10, but where is it set???

I fixed it in this version.
> +  cmp = gen_rtx_fmt_ee (GEU, VOIDmode, cc, const0_rtx);
> +  jump = emit_jump_insn (gen_condjump (cmp, cc, label));
> +  JUMP_LABEL (jump) = label;
> +}
>
> So neither X10 nor X12 are set before potentially calling __morestack,
> so I don't think it will work.  Could this be causing the crash you
> mentioned?

I do not think so; the issue is with the runtime/pprof libgo test that
fails with:

[Switching to LWP 18926]
0x0050c358 in runtime.sigtrampgo (sig=sig@entry=27,
    info=info@entry=0x7fb63d5da0, ctx=ctx@entry=0x7fb63d5e20)
    at ../../../gcc-git/libgo/go/runtime/signal_unix.go:221
221             setg(g.m.gsignal)

Where g.m is null.  Trying to obtain a stacktrace I am now seeing:

(gdb) bt
#0  0x0050c358 in runtime.sigtrampgo (sig=sig@entry=27,
    info=info@entry=0x7fb63d5da0, ctx=ctx@entry=0x7fb63d5e20)
    at ../../../gcc-git/libgo/go/runtime/signal_unix.go:221
#1  0x0056acb4 in runtime.sigtramp (sig=27, info=0x7fb63d5da0,
    context=0x7fb63d5e20) at ../../../gcc-git/libgo/runtime/go-signal.c:131
#2  <signal handler called>
#3  pprof_test.cpuHog1 () at pprof_test.go:52
#4  0x0040c814 in pprof_test.cpuHogger
    (f=f@entry=0x57c560 <runtime_pprof_test.cpuHog1$descriptor>, dur=)
    at pprof_test.go:37
#5  0x0040c9f8 in pprof_test.$nested1 (dur=) at pprof_test.go:75
#6  0x0040d038 in pprof_test.testCPUProfile (t=t@entry=0x420804e680,
    need=..., f=f@entry=0x57c600 <runtime_pprof_test.$nested1$descriptor>)
    at pprof_test.go:144
#7  0x0040c9a8 in runtime_pprof_test.TestCPUProfile (t=0x420804e680)
    at pprof_test.go:74
#8  0x00543bec in testing.tRunner (param=, fn=)
    at ../../../gcc-git/libgo/go/testing/testing.go:656
#9  0x00543c84 in testing.$thunk24 (__go_thunk_parameter=)
    at ../../../gcc-git/libgo/go/testing/testing.go:693
#10 0x0041a7dc in kickoff () at ../../../gcc-git/libgo/runtime/proc.c:258

/build/buildd/gdb-7.9/gdb/dwarf2-frame.c:1732: internal-error: add_cie:
Assertion `n < 1 || cie_table->entries[n - 1]->cie_pointer <
cie->cie_pointer' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)

Which may be the case if the morestack.S unwind info is not really
correct.  It could be an issue in gdb as well (I will check with a newer
gdb).

> Wilco

>From 09b51fb706a8b15ecaea4ec4b8a80d0a7903053d Mon Sep 17 00:00:00 2001
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
Date: Wed, 4 May 2016 21:13:39 +
Subject: [PATCH] aarch64: Add split-stack initial support

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss
field is added on the TCB in glibc,
Re: [PATCH v2] aarch64: Add split-stack initial support
On 15/11/2016 16:38, Wilco Dijkstra wrote:
>
> On 07/11/2016 16:59, Adhemerval Zanella wrote:
>> On 14/10/2016 15:59, Wilco Dijkstra wrote:
>
>> There is no limit afaik on gold split-stack allocation handling,
>> and I think one could be added for each backend (in the method
>> override required to implement it).
>>
>> In fact it is not really required to tie the nop generation with the
>> instruction generated by 'aarch64_internal_mov_immediate', it is
>> just a matter of simplifying the linker code.
>
> If there is no easy limit and you'll still require a nop, I think it is
> best then to emit mov N+movk #0.  Then the scheduler won't be able to
> reorder them with the add/sub.

Good call, I have changed the patch to emit a mov N+movk #0 instead of
relying on emit_nop.

>>> Is there any need to detect underflow of x10 or is there a guarantee
>>> that stacks are never allocated in the low 2GB (given the maximum
>>> adjustment is 2GB)?  It's safe to do a signed comparison.
>>
>> I do not think so, at least none of the current backends that
>> implement split stack do so.
>
> OK, well a signed comparison like in your new version works for
> underflow.
>
> Now to the patch:
>
> @@ -3316,6 +3339,28 @@ aarch64_expand_prologue (void)
>    aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM,
> V31_REGNUM,
>    callee_adjust != 0 || frame_pointer_needed);
>    aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
> +
> +  if (split_stack_arg_pointer_used_p ())
> +    {
> +      /* Setup the argument pointer (x10) for -fsplit-stack code.  If
> +	 __morestack was called, it will have left the arg pointer to
> +	 the old stack in x28.  Otherwise, the argument pointer is the
> +	 top of the current frame.
	 */
> +      rtx x11 = gen_rtx_REG (Pmode, R11_REGNUM);
> +      rtx x28 = gen_rtx_REG (Pmode, R28_REGNUM);
> +      rtx cc_reg = gen_rtx_REG (CCmode, CC_REGNUM);
> +
> +      rtx not_more = gen_label_rtx ();
> +
> +      rtx cmp = gen_rtx_fmt_ee (LT, VOIDmode, cc_reg, const0_rtx);
> +      rtx jump = emit_jump_insn (gen_condjump (cmp, cc_reg, not_more));
> +      JUMP_LABEL (jump) = not_more;
> +      LABEL_NUSES (not_more) += 1;
> +
> +      emit_move_insn (x11, x28);
> +
> +      emit_label (not_more);
> +    }
>
> If you pass the old sp in x11 when called from __morestack you can
> remove the above thunk completely.

Indeed this snippet does not make sense anymore; I removed it.

> +  /* It limits total maximum stack allocation on 2G so its value can
> +     be materialized using two instructions at most (movn/movk).  It
> +     might be used by the linker to add some extra space for split
> +     calling non-split stack functions.  */
> +  allocate = cfun->machine->frame.frame_size;
> +  if (allocate > ((HOST_WIDE_INT) 1 << 31))
> +    {
> +      sorry ("Stack frame larger than 2G is not supported for
> -fsplit-stack");
> +      return;
> +    }
>
> Note a 2-instruction mov/movk can generate any immediate up to 4GB and
> if we need even larger sizes, we could round up to a multiple of 64KB
> so that 2 instructions are enough for a 48-bit stack size...

I think we can set a limit of 4GB (the powerpc64 backend limits to 2GB
and it seems fine).  I corrected the comment.

> +  int ninsn = aarch64_internal_mov_immediate (reg10, GEN_INT (-allocate),
> +	true, Pmode);
> +  gcc_assert (ninsn == 1 || ninsn == 2);
> +  if (ninsn == 1)
> +    emit_insn (gen_nop ());
>
> To avoid any issues with the nop being scheduled, it's best to emit an
> explicit movk here (0x if allocate > 0, or 0 if zero) using
> gen_insv_immdi.

Right, I changed to that.

> +void
> +aarch64_split_stack_space_check (rtx size, rtx label)
>
> Isn't very similar code used in aarch64_expand_split_stack_prologue?
> Any possibility to share/reuse?
I thought about it, but it would require splitting into two subparts
(one to load __private_ss from the TCB and another to jump to
__morestack) and both are basically 4 lines.  In the end I think the
current approach should be simpler.

> +static void
> +aarch64_live_on_entry (bitmap regs)
> +{
> +  if (flag_split_stack)
> +    bitmap_set_bit (regs, R11_REGNUM);
> +}
>
> I'm wondering whether you need extra code in aarch64_can_eliminate to
> deal with the argument pointer?  Also do we need to define a fixed
> register, or will GCC automatically allocate it to a callee-save if
> necessary?

Now that you asked, I think we can get rid of this live marking.  I used
as base for the initial patch the powerpc backend
Re: [PATCH v2] aarch64: Add split-stack initial support
Ping.

On 07/11/2016 16:59, Adhemerval Zanella wrote:
>
> On 14/10/2016 15:59, Wilco Dijkstra wrote:
>> Hi,
>
> Thanks for the thoughtful review and sorry for the late response.
>
>>> Split-stack prologue on function entry is as follows (this goes
>>> before the usual function prologue):
>>>
>>> 	mrs	x9, tpidr_el0
>>> 	mov	x10, -
>>
>> As Jiong already remarked, the nop won't work.  Do we know the maximum
>> adjustment that the linker is allowed to make?  If so, and we can
>> limit the adjustment to 16MB in most cases, emitting 2 subtracts is
>> best.  Larger offsets need mov/movk/sub but that should be extremely
>> rare.
>
> There is no limit afaik on gold split-stack allocation handling,
> and I think one could be added for each backend (in the method
> override required to implement it).
>
> In fact it is not really required to tie the nop generation with the
> instruction generated by 'aarch64_internal_mov_immediate', it is
> just a matter of simplifying the linker code.
>
> And although 16MB should be rare, the nilptr2.go test allocates
> 134217824 bytes, so this test fails with this low stack limit.  I am
> not sure how heavy the stack usage in 'go' is, but I think we should
> at least support the current testcase scenario.  So for the current
> iteration I kept my current approach, but I am open to suggestions.
>
>>> 	nop/movk
>>>
>>> 	add	x10, sp, x10
>>> 	ldr	x9, [x9, 16]
>>
>> Is there any need to detect underflow of x10 or is there a guarantee
>> that stacks are never allocated in the low 2GB (given the maximum
>> adjustment is 2GB)?  It's safe to do a signed comparison.
>
> I do not think so, at least none of the current backends that
> implement split stack do so.
>
>>> 	cmp	x10, x9
>>> 	b.cs	enough
>>
>> Why save/restore x30 and the call x30+8 trick when we could pass the
>> continuation address and use a tailcall?  That also avoids emitting
>> extra unwind info.
>>> 	stp	x30, [sp, -16]
>>> 	bl	__morestack
>>> 	ldp	x30, [sp], 16
>>> 	ret
>>
>> This part doesn't make any sense - both x28 and carry flag as an
>> input, and spread across the prolog - why???
>>
>>> enough:
>>> 	mov	x10, sp
>>> 	[prolog]
>>> 	b.cs	continue
>>> 	mov	x10, x28
>>> continue:
>>> 	[rest of function]
>>
>> Why not do this?
>>
>> function:
>> 	mrs	x9, tpidr_el0
>> 	sub	x10, sp, N & 0xfff000
>> 	sub	x10, x10, N & 0xfff
>> 	ldr	x9, [x9, 16]
>> 	adr	x12, main_fn_entry
>> 	mov	x11, sp		[if function has stacked arguments]
>> 	cmp	x10, x9
>> 	b.ge	main_fn_entry
>> 	b	__morestack
>> main_fn_entry:	[x11 is argument pointer]
>> 	[prolog]
>> 	[rest of function]
>>
>> In __morestack you need to save x8 as well (another argument
>> register!) and x12 (the continuation address).  After returning from
>> the call x8 doesn't need to be preserved.
>
> Indeed this strategy is way better and I adjusted the code to follow
> it.  The only change is I am using:
>
> 	[...]
> 	cmp	x9, x10
> 	b.lt	main_fn_entry
> 	b	__morestack
> 	[...]
>
> So I can issue a 'cmp , 0' on __morestack to indicate the function
> was called.
>
>> There are several issues with unwinding in __morestack.  x28 is not
>> described as a callee-save so will be corrupted if unwinding across
>> a __morestack call.  This won't unwind correctly after the ldp as
>> the unwinder will use the restored frame pointer to try to restore
>> x29/x30:
>>
>> +	ldp	x29, x30, [x28, STACKFRAME_BASE]
>> +	ldr	x28, [x28, STACKFRAME_BASE + 80]
>> +
>> +	.cfi_remember_state
>> +	.cfi_restore 30
>> +	.cfi_restore 29
>> +	.cfi_def_cfa 31, 0
>
> Indeed, it misses the x28 save/restore.  I think I have added the
> missing bits, but I must confess that I am not well versed in CFI
> directives.  I will appreciate it if you could help me on this new
> version.
>
>> This stores a random x30 value on the stack, what is the purpose of
>> this?  Nothing can unwind to here:
>>
>> +	# Start using new stack
>> +	stp	x29, x30, [x0, -16]!
>> +	mov	sp, x0
>>
>> Also we no longer need split_stack_arg_pointer_used_p () or any code
>> that uses it (functions that don't have any arguments passed on the
>> stack could omit the mov x11, sp).
>
> Right, with the new strategy you proposed to do a branch, this is
> indeed not really required.  I removed it in this new patch.
>
>> Wilco
>>
Re: [PATCH v2] aarch64: Add split-stack initial support
On 14/10/2016 15:59, Wilco Dijkstra wrote:
> Hi,

Thanks for the thoughtful review and sorry for the late response.

>> Split-stack prologue on function entry is as follows (this goes
>> before the usual function prologue):
>>
>> 	mrs	x9, tpidr_el0
>> 	mov	x10, -
>
> As Jiong already remarked, the nop won't work.  Do we know the maximum
> adjustment that the linker is allowed to make?  If so, and we can limit
> the adjustment to 16MB in most cases, emitting 2 subtracts is best.
> Larger offsets need mov/movk/sub but that should be extremely rare.

There is no limit afaik on gold split-stack allocation handling, and I
think one could be added for each backend (in the method override
required to implement it).

In fact it is not really required to tie the nop generation with the
instruction generated by 'aarch64_internal_mov_immediate'; it is just a
matter of simplifying the linker code.

And although 16MB should be rare, the nilptr2.go test allocates
134217824 bytes, so this test fails with this low stack limit.  I am not
sure how heavy the stack usage in 'go' is, but I think we should at
least support the current testcase scenario.  So for the current
iteration I kept my current approach, but I am open to suggestions.

>> 	nop/movk
>>
>> 	add	x10, sp, x10
>> 	ldr	x9, [x9, 16]
>
> Is there any need to detect underflow of x10 or is there a guarantee
> that stacks are never allocated in the low 2GB (given the maximum
> adjustment is 2GB)?  It's safe to do a signed comparison.

I do not think so, at least none of the current backends that implement
split stack do so.

>> 	cmp	x10, x9
>> 	b.cs	enough
>
> Why save/restore x30 and the call x30+8 trick when we could pass the
> continuation address and use a tailcall?  That also avoids emitting
> extra unwind info.
>
>> 	stp	x30, [sp, -16]
>> 	bl	__morestack
>> 	ldp	x30, [sp], 16
>> 	ret
>
> This part doesn't make any sense - both x28 and carry flag as an
> input, and spread across the prolog - why???
>> enough:
>> 	mov	x10, sp
>> 	[prolog]
>> 	b.cs	continue
>> 	mov	x10, x28
>> continue:
>> 	[rest of function]
>
> Why not do this?
>
> function:
> 	mrs	x9, tpidr_el0
> 	sub	x10, sp, N & 0xfff000
> 	sub	x10, x10, N & 0xfff
> 	ldr	x9, [x9, 16]
> 	adr	x12, main_fn_entry
> 	mov	x11, sp		[if function has stacked arguments]
> 	cmp	x10, x9
> 	b.ge	main_fn_entry
> 	b	__morestack
> main_fn_entry:	[x11 is argument pointer]
> 	[prolog]
> 	[rest of function]
>
> In __morestack you need to save x8 as well (another argument
> register!) and x12 (the continuation address).  After returning from
> the call x8 doesn't need to be preserved.

Indeed this strategy is way better and I adjusted the code to follow it.
The only change is I am using:

	[...]
	cmp	x9, x10
	b.lt	main_fn_entry
	b	__morestack
	[...]

So I can issue a 'cmp , 0' on __morestack to indicate the function was
called.

> There are several issues with unwinding in __morestack.  x28 is not
> described as a callee-save so will be corrupted if unwinding across a
> __morestack call.  This won't unwind correctly after the ldp as the
> unwinder will use the restored frame pointer to try to restore
> x29/x30:
>
> +	ldp	x29, x30, [x28, STACKFRAME_BASE]
> +	ldr	x28, [x28, STACKFRAME_BASE + 80]
> +
> +	.cfi_remember_state
> +	.cfi_restore 30
> +	.cfi_restore 29
> +	.cfi_def_cfa 31, 0

Indeed, it misses the x28 save/restore.  I think I have added the
missing bits, but I must confess that I am not well versed in CFI
directives.  I will appreciate it if you could help me on this new
version.

> This stores a random x30 value on the stack, what is the purpose of
> this?  Nothing can unwind to here:
>
> +	# Start using new stack
> +	stp	x29, x30, [x0, -16]!
> +	mov	sp, x0
>
> Also we no longer need split_stack_arg_pointer_used_p () or any code
> that uses it (functions that don't have any arguments passed on the
> stack could omit the mov x11, sp).

Right, with the new strategy you proposed to do a branch, this is indeed
not really required.  I removed it in this new patch.
> Wilco

>From dd2927aa5deb8d609c748014f3b566962fb852c5 Mon Sep 17 00:00:00 2001
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
Date: Wed, 4 May 2016 21:13:39 +
Subject: [PATCH 2/2] aarch64: Add split-stack initial support

This patch adds the split-stack support on aarch64 (PR #67877).  A
[PATCH v3 2/2] aarch64: Add split-stack initial support
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>

Changes from previous version:

  * Add missing new function comments;
  * Use gen_rtx_fmt_ee instead of gen_rtx_IF_THEN_ELSE to create
    conditional jumps;
  * Some typos and whitespace issues.

--

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss
field is added on the TCB in glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and compiler
support is adjusted (split-stack prologue, va_start for argument
handling).  I also plan to send the gold support to adjust stack
allocation across split-stack and default code calls.

The current approach is similar to the powerpc one: at most 2 GB of
stack allocation is supported, so stack adjustments can be done with 2
instructions (either just a movn plus nop or a movn followed by movk).
The morestack call is non-standard, with x10 holding the requested stack
pointer and x11 the required stack size to be copied.  Also, function
arguments on the old stack are accessed based on a value relative to the
stack pointer, so x10 is used to hold the old stack value.  Unwinding is
handled by a personality routine that knows how to find stack segments.

The split-stack prologue on function entry is as follows (this goes
before the usual function prologue):

	mrs	x9, tpidr_el0
	mov	x10, -
	nop/movk
	add	x10, sp, x10
	ldr	x9, [x9, 16]
	cmp	x10, x9
	b.cs	enough
	stp	x30, [sp, -16]
	mov	x11,
	mov	x11,
	bl	__morestack
	ldp	x30, [sp], 16
	ret
enough:
	mov	x10, sp
	[...]
	b.cs	function
	mov	x10, x28
function:

Notes:

1. Even if a function does not allocate a stack frame, a split-stack
prologue is created.  This is to avoid issues with tail calls to
external symbols which might require linker adjustment
(libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack
TCB ldr to after the required stack calculation.

3.
Similar to powerpc, when the linker detects a call from split-stack to
non-split-stack code, it adds 16k (or more) to the value found in
"allocate" instructions (so non-split-stack code gets a larger stack).
The amount is tunable by a linker option.  The edit means aarch64 does
not need to implement __morestack_non_split, necessary on x86 because
insufficient space is available there to edit the stack comparison code.
This feature is only implemented in the GNU gold linker.

4. AArch64 does not initially handle >2G stacks, and although it is
possible to implement, limiting to 2G allows materializing the
allocation with only 2 instructions (mov + movk) and thus simplifies the
linker adjustments required.  Supporting multiple threads each requiring
more than 2G of stack is probably not that important, and likely to OOM
at run time.

5. The TCB support on GLIBC is meant to be included in version 2.25.

libgcc/ChangeLog:

	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
	aarch64*-*-linux.
	* libgcc/config/aarch64/morestack-c.c: New file.
	* libgcc/config/aarch64/morestack.S: Likewise.
	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
	* libgcc/generic-morestack.c (__splitstack_find): Add
	aarch64-specific code.

gcc/ChangeLog:

	* common/config/aarch64/aarch64-common.c
	(aarch64_supports_split_stack): New function.
	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END):
	Remove macro.
	* gcc/config/aarch64/aarch64-protos.h: Add
	aarch64_expand_split_stack_prologue and
	aarch64_split_stack_space_check.
	* gcc/config/aarch64/aarch64.c (split_stack_arg_pointer_used_p):
	New function.
	(aarch64_expand_prologue): Setup the argument pointer (x10) for
	split-stack.
	(aarch64_expand_builtin_va_start): Use internal argument pointer
	instead of virtual_incoming_args_rtx.
	(morestack_ref): New symbol.
	(aarch64_expand_split_stack_prologue): New function.
	(aarch64_file_end): Emit the split-stack note sections.
	(aarch64_internal_arg_pointer): Likewise.
	(aarch64_live_on_entry): Set the argument pointer for
	split-stack.
	(aarch64_split_stack_space_check): Likewise.
	(TARGET_ASM_FILE_END): New macro.
	(TARGET_EXTRA_LIVE_ON_ENTRY): Likewise.
	(TARGET_INTERNAL_ARG_POINTER): Likewise.
	* gcc/config/aarch64/aarch64.h (aarch64_frame): Add
	split_stack_arg_pointer to setup the argument pointer when using
	split-stack.
	* gcc/config/aarch64/aarch64.md (UNSPEC_STACK_CHECK): New
	unspec.
	(UNSPECV_SPLIT_STAC
[PATCH 1/2] Fix GCC split-stack variadic argument tests
This patch adds the expected va_end after va_start on the split-stack
tests.

gcc/ChangeLog:

	* gcc/testsuite/gcc.dg/split-3.c (down): Call va_end after
	va_start.
	* gcc/testsuite/gcc.dg/split-6.c (down): Likewise.
---
 gcc/ChangeLog                  | 5 +++++
 gcc/testsuite/gcc.dg/split-3.c | 1 +
 gcc/testsuite/gcc.dg/split-6.c | 1 +
 3 files changed, 7 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/split-3.c b/gcc/testsuite/gcc.dg/split-3.c
index 64bbb8c..5ba7616 100644
--- a/gcc/testsuite/gcc.dg/split-3.c
+++ b/gcc/testsuite/gcc.dg/split-3.c
@@ -40,6 +40,7 @@ down (int i, ...)
       || va_arg (ap, int) != 9
       || va_arg (ap, int) != 10)
     abort ();
+  va_end (ap);
 
   if (i > 0)
     {
diff --git a/gcc/testsuite/gcc.dg/split-6.c b/gcc/testsuite/gcc.dg/split-6.c
index b32cf8d..b3016ba 100644
--- a/gcc/testsuite/gcc.dg/split-6.c
+++ b/gcc/testsuite/gcc.dg/split-6.c
@@ -37,6 +37,7 @@ down (int i, ...)
       || va_arg (ap, int) != 9
       || va_arg (ap, int) != 10)
     abort ();
+  va_end (ap);
 
   if (i > 0)
     {
-- 
2.7.4
Re: [PATCH v2] aarch64: Add split-stack initial support
On 07/10/2016 05:28, Kyrill Tkachov wrote:
> Hi Adhemerval,
>
> CC'ing the aarch64 maintainers/reviewers.
> I have some comments inline, mostly about the GCC coding conventions.

Thanks for the review.

>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index df6514d..c689440 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -3149,6 +3149,49 @@ aarch64_restore_callee_saves (machine_mode mode,
>>      }
>>  }
>>
>> +static bool
>> +split_stack_arg_pointer_used_p (void)
>> +{
>
> New function needs a function comment.

Ack, I will add one.

>> +  bool using_split_stack = (flag_split_stack
>> 			    && (lookup_attribute ("no_split_stack",
>> 				  DECL_ATTRIBUTES (cfun->decl))
>> 				== NULL));
>> +  if (using_split_stack == false)
>> +    return false;
>> +
>> +  /* If the pseudo holding the arg pointer is no longer a pseudo,
>> +     then the arg pointer is used.  */
>> +  if (cfun->machine->frame.split_stack_arg_pointer != NULL_RTX
>> +      && (!REG_P (cfun->machine->frame.split_stack_arg_pointer)
>> 	  || (REGNO (cfun->machine->frame.split_stack_arg_pointer)
>> 	      < FIRST_PSEUDO_REGISTER)))
>> +    return true;
>> +
>> +  /* Unfortunately we also need to do some code scanning, since
>> +     x10 may have been substituted for the pseudo.
     */
>> +  rtx_insn *insn;
>> +  basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
>> +  FOR_BB_INSNS (bb, insn)
>> +    if (NONDEBUG_INSN_P (insn))
>> +      {
>> +	df_ref use;
>> +	FOR_EACH_INSN_USE (use, insn)
>> +	  {
>> +	    rtx x = DF_REF_REG (use);
>> +	    if (REG_P (x) && REGNO (x) == R10_REGNUM)
>> +	      return true;
>> +	  }
>> +	df_ref def;
>> +	FOR_EACH_INSN_DEF (def, insn)
>> +	  {
>> +	    rtx x = DF_REF_REG (def);
>> +	    if (REG_P (x) && REGNO (x) == R10_REGNUM)
>> +	      return false;
>> +	  }
>> +      }
>> +  return bitmap_bit_p (DF_LR_OUT (bb), R10_REGNUM);
>> +}
>> +
>>  /* AArch64 stack frames generated by this compiler look like:
>>
>>     +---+
>>
>> @@ -3204,6 +3247,7 @@ aarch64_expand_prologue (void)
>>    unsigned reg1 = cfun->machine->frame.wb_candidate1;
>>    unsigned reg2 = cfun->machine->frame.wb_candidate2;
>>    rtx_insn *insn;
>> +  bool split_stack_arg_pointer_used = split_stack_arg_pointer_used_p ();
>>
>>    if (flag_stack_usage_info)
>>      current_function_static_stack_size = frame_size;
>>
>> @@ -3220,6 +3264,10 @@ aarch64_expand_prologue (void)
>>        aarch64_emit_probe_stack_range (STACK_CHECK_PROTECT, frame_size);
>>      }
>>
>> +  /* Save split-stack argument pointer before stack adjustment.  */
>> +  if (split_stack_arg_pointer_used)
>> +    emit_move_insn (gen_rtx_REG (Pmode, R10_REGNUM), stack_pointer_rtx);
>> +
>>    aarch64_add_constant (Pmode, SP_REGNUM, IP0_REGNUM, -initial_adjust,
>> 			true);
>>
>>    if (callee_adjust != 0)
>>
>> @@ -3243,6 +3291,30 @@ aarch64_expand_prologue (void)
>>    callee_adjust != 0 || frame_pointer_needed);
>>    aarch64_add_constant (Pmode, SP_REGNUM, IP1_REGNUM, -final_adjust,
>> 			!frame_pointer_needed);
>> +
>> +  if (split_stack_arg_pointer_used_p ())
>> +    {
>> +      /* Setup the argument pointer (x10) for -fsplit-stack code.  If
>> +	 __morestack was called, it will have left the arg pointer to
>> +	 the old stack in x28.  Otherwise, the argument pointer is the
>> +	 top of the current frame.
	 */
>> +      rtx x10 = gen_rtx_REG (Pmode, R10_REGNUM);
>> +      rtx x28 = gen_rtx_REG (Pmode, R28_REGNUM);
>> +      rtx not_more = gen_label_rtx ();
>> +      rtx cc_reg = gen_rtx_REG (CCmode, CC_REGNUM);
>> +      rtx jump;
>> +
>> +      jump = gen_rtx_IF_THEN_ELSE (VOIDmode,
>> 				   gen_rtx_GEU (VOIDmode, cc_reg,
>> 						const0_rtx),
>> 				   gen_rtx_LABEL_REF (VOIDmode, not_more),
>> 				   pc_rtx);
>> +      jump = emit_jump_insn (gen_rtx_SET (pc_rtx, jump));
>
> Are you trying to emit a conditional jump here?
> If so it would be cleaner to use the pattern name you have in mind
> directly like so (barring typos ;):
>   rtx cmp = gen_rtx_fmt_ee (GEU, VOIDmode, cc_reg, const0_rtx);
>   jump = emit_jump_insn (gen_condjump (cmp, cc_reg, not_more));
> This will avoid constructing the SET and IF_THEN_ELSE rtxes manually.

Yes, the idea is to emit a conditional jump.  I have no preference here;
I used the IF_THEN_ELSE functions mainly because the other arches are
using them, but I think your suggestion should be simpler.

>> +      JUMP_LABEL (jump) = not_more;
>> +      LABEL_NUSES (not_more) += 1;
>> +      emit_move_insn (x10, x28);
>> +      emit_label (not_more);
>> +    }
>>  }
>>
>>  /* Return TRUE if we can use a simple_return insn.
>>
>> @@ -9677,7 +9749,7 @@ aarch64_expand_builtin_va_start (tree valist, rtx
[PATCH v2] aarch64: Add split-stack initial support
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>

Changes from previous version:

- Rewrite how the variadic argument pointer is set up: instead of using
  the hard_fp_offset value to set up x10, save the frame pointer value
  before the stack allocation.  This allows the linker/gold to avoid
  scanning and changing code in the case of a split to non-split call
  (only the split prologue will require adjustments).

--

This patch adds split-stack support on aarch64 (PR #67877).  As for other
ports, this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field
is added to the TCB in glibc, a target-specific __morestack implementation
and helper functions are added in libgcc, and compiler adjustments are made
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack and
default code calls.

The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2
instructions (either just a movn plus nop or a movn followed by movk).
The morestack call is non-standard, with x10 holding the requested stack
pointer and x11 the required stack size to be copied.  Also, function
arguments on the old stack are accessed based on a value relative to the
stack pointer, so x10 is used to hold the old stack value.  Unwinding is
handled by a personality routine that knows how to find stack segments.

The split-stack prologue on function entry is as follows (this goes before
the usual function prologue):

	mrs	x9, tpidr_el0
	mov	x10, -
	nop/movk
	add	x10, sp, x10
	ldr	x9, [x9, 16]
	cmp	x10, x9
	b.cs	enough
	stp	x30, [sp, -16]
	bl	__morestack
	ldp	x30, [sp], 16
	ret
enough:
	mov	x10, sp
	[...]
	b.cs	function
	mov	x10, x28
function:

I checked with a bootstrap build on aarch64 and saw no regression.  The go
tests pass with the exception of cplx2.go and runtime/pprof/check (both
unrelated to the patch afaiu).

Notes:

1.
Even if a function does not allocate a stack frame, a split-stack prologue
is created.  This is to avoid issues with tail calls to external symbols
which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack TCB
   ldr to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in
   "allocate" instructions (so non-split-stack code gets a larger stack).
   The amount is tunable by a linker option.  The edit means aarch64 does
   not need to implement __morestack_non_split, necessary on x86 because
   insufficient space is available there to edit the stack comparison
   code.  This feature is only implemented in the GNU gold linker.

4. AArch64 does not handle >2G stacks initially, and although it is
   possible to implement that, limiting to 2G allows materializing the
   allocation with only 2 instructions (mov + movk) and thus simplifies
   the linker adjustments required.  Supporting multiple threads each
   requiring more than 2G of stack is probably not that important, and
   likely to OOM at run time.

5. The TCB support on GLIBC is meant to be included in version 2.25.

libgcc/ChangeLog:

	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
	aarch64*-*-linux.
	* libgcc/config/aarch64/morestack-c.c: New file.
	* libgcc/config/aarch64/morestack.S: Likewise.
	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
	* libgcc/generic-morestack.c (__splitstack_find): Add
	aarch64-specific code.

gcc/ChangeLog:

	* common/config/aarch64/aarch64-common.c
	(aarch64_supports_split_stack): New function.
	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
	macro.
	* gcc/config/aarch64/aarch64-protos.h: Add
	aarch64_expand_split_stack_prologue and
	aarch64_split_stack_space_check.
	* gcc/config/aarch64/aarch64.c (split_stack_arg_pointer_used_p):
	New function.
	(aarch64_expand_prologue): Setup the argument pointer (x10) for
	split-stack.
	(aarch64_expand_builtin_va_start): Use internal argument pointer
	instead of virtual_incoming_args_rtx.
	(morestack_ref): New symbol.
	(aarch64_expand_split_stack_prologue): New function.
	(aarch64_file_end): Emit the split-stack note sections.
	(aarch64_internal_arg_pointer): Likewise.
	(aarch64_live_on_entry): Set the argument pointer for split-stack.
	(aarch64_split_stack_space_check): Likewise.
	(TARGET_ASM_FILE_END): New macro.
	(TARGET_EXTRA_LIVE_
Re: [PATCH] aarch64: Add split-stack initial support
On 08/08/2016 07:58, Jiong Wang wrote:
>
> Adhemerval Zanella writes:
>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index e56398a..2cf239f 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -3227,6 +3227,34 @@ aarch64_expand_prologue (void)
>>> +      emit_move_insn (x11, GEN_INT (hard_fp_offset));
>>> +      emit_insn (gen_add3_insn (x10, x29, x11));
>>> +      jump = gen_rtx_IF_THEN_ELSE (VOIDmode,
>>> +                                   gen_rtx_GEU (VOIDmode, cc_reg,
>>> +                                                const0_rtx),
>>> +                                   gen_rtx_LABEL_REF (VOIDmode, not_more),
>>> +                                   pc_rtx);
>>> +      jump = emit_jump_insn (gen_rtx_SET (pc_rtx, jump));
>>> +      JUMP_LABEL (jump) = not_more;
>>> +      LABEL_NUSES (not_more) += 1;
>>> +      emit_move_insn (x10, x28);
>>> +      emit_label (not_more);
>>> +    }
>>> }
>
> This part needs a rebase, there are major changes in the AArch64 prologue
> code recently.

Right, I see that 'hard_fp_offset' is not defined locally anymore.

>>>
>>>  /* Return TRUE if we can use a simple_return insn.
>>> @@ -3303,6 +3331,7 @@ aarch64_expand_epilogue (bool for_sibcall)
>>>        offset = offset - fp_offset;
>>>      }
>>>
>>> +
>
> Unnecessary new line.

Ack.

>>> +
>>> +  /* Load __private_ss from TCB.  */
>>> +  ssvalue = gen_rtx_REG (Pmode, R9_REGNUM);
>>> +  emit_insn (gen_aarch64_load_tp_hard (ssvalue));
>>> +  mem = gen_rtx_MEM (Pmode, plus_constant (Pmode, ssvalue, psso));
>>> +  emit_move_insn (ssvalue, mem);
>>> +
>>> +  temp = gen_rtx_REG (Pmode, R10_REGNUM);
>>> +
>>> +  /* Always emit two insns to calculate the requested stack, so the linker
>>> +     can edit them when adjusting size for calling non-split-stack code.  */
>>> +  ninsn = aarch64_internal_mov_immediate (temp, GEN_INT (-frame_size),
>>> +                                          true, Pmode);
>>> +  gcc_assert (ninsn == 1 || ninsn == 2);
>>> +  if (ninsn == 1)
>>> +    emit_insn (gen_nop ());
>
> Will there be trouble for the linker if the following add is scheduled
> before the nop?

In theory yes, although I haven't seen gcc splitting it.
What would be the correct way to tie the nop generation so it is emitted
right after the mov immediate?

>>> +
>>> +	# Set up for a call to the target function.
>>> +	#ldp	x29, x30, [x28, STACKFRAME_BASE]
>>> +	ldr	x30, [x28, STACKFRAME_BASE + 8]
>>> +	ldp	x0, x1, [x28, STACKFRAME_BASE + 16]
>>> +	ldp	x2, x3, [x28, STACKFRAME_BASE + 32]
>>> +	ldp	x4, x5, [x28, STACKFRAME_BASE + 48]
>>> +	ldp	x6, x7, [x28, STACKFRAME_BASE + 64]
>>> +	add	x9, x30, 8
>>> +	cmp	x30, x9
>
> Can you explain why we need this "cmp" before jumping to the target
> function?

This is due to the function prologue addition for var args handling:

	mov	x11,
	sub	sp, sp,
	add	x10, x29, x11
	b.cs	function
	mov	x10, x28
function:

If __morestack is called, it will use the 'b.cs' to set up the correct
var arg pointer.

Below is the last iteration of the patch; however, I am now seeing an
issue similar to one s390 hit when building libgo:

../../../gcc-git/libgo/go/syscall/socket_linux.go:90:1: error: flow control insn inside a basic block
(jump_insn 90 89 91 14 (set (pc)
        (if_then_else (geu (reg:CC 66 cc)
                (const_int 0 [0]))
            (label_ref 92)
            (pc))) ../../../gcc-git/libgo/go/syscall/socket_linux.go:90 -1
     (nil)
 -> 92)
../../../gcc-git/libgo/go/syscall/socket_linux.go:90:1: internal compiler error: in rtl_verify_bb_insns, at cfgrtl.c:2658
0xac35af _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)

It shows only with -O2, which I think is due to how the block is
reorganized internally with regard to the pseudo-return instruction
inserted by split-stack.  I am still debugging the issue and how to
properly fix it, so if you have any advice I am open to suggestions.

diff --git a/gcc/common/config/aarch64/aarch64-common.c b/gcc/common/config/aarch64/aarch64-common.c
index 08e7959..01c3239 100644
--- a/gcc/common/config/aarch64/aarch64-common.c
+++ b/gcc/common/config/aarch64/aarch64-common.c
@@ -106,6 +106,21 @@ aarch64_handle_option (struct gcc_options *opts,
     }
 }

+/* -fsplit-stack uses a TCB field available on glibc-2.25.
GLIBC also
+   exports symbol, __tcb_private_ss, to signal
Re: [PATCH] aarch64: Add split-stack initial support
Ping.

On 28/07/2016 17:36, Adhemerval Zanella wrote:
> From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
>
> This patch adds split-stack support on aarch64 (PR #67877).  As for
> other ports, this patch should be used along with glibc and gold support.
>
> The support is done similarly to other architectures: a __private_ss
> field is added to the TCB in glibc, a target-specific __morestack
> implementation and helper functions are added in libgcc, and compiler
> adjustments are made (split-stack prologue, va_start for argument
> handling).  I also plan to send the gold support to adjust stack
> allocation across split-stack and default code calls.
>
> The current approach is similar to the powerpc one: at most 2 GB of stack
> allocation is supported, so stack adjustments can be done with 2
> instructions (either just a movn plus nop or a movn followed by movk).
> The morestack call is non-standard, with x10 holding the requested stack
> pointer and x11 the required stack size to be copied.  Also, function
> arguments on the old stack are accessed based on a value relative to the
> stack pointer, so x10 is used to hold the old stack value.  Unwinding is
> handled by a personality routine that knows how to find stack segments.
>
> The split-stack prologue on function entry is as follows (this goes
> before the usual function prologue):
>
> 	mrs	x9, tpidr_el0
> 	mov	x10, -
> 	nop/movk
> 	add	x10, sp, x10
> 	ldr	x9, [x9, 16]
> 	cmp	x10, x9
> 	b.cs	enough
> 	stp	x30, [sp, -16]
> 	mov	x11,
> 	mov	x11,
> 	bl	__morestack
> 	ldp	x30, [sp], 16
> 	ret
> enough:
> 	# usual function prologue, modified a little at the end to set up the
> 	# arg_pointer in x10, starts here.  The arg_pointer is initialized,
> 	# if it is used, with
> 	mov	x11,
> 	add	x10, x29, x11
> 	b.cs	function
> 	mov	x10, x28
> function:
>
> Notes:
>
> 1. Even if a function does not allocate a stack frame, a split-stack
>    prologue is created.
>    This is to avoid issues with tail calls to external symbols
>    which might require linker adjustment (libgo/runtime/go-varargs.c).
>
> 2. Basic-block reordering (enabled with -O2) will move the split-stack
>    TCB ldr to after the required stack calculation.
>
> 3. Similar to powerpc, when the linker detects a call from split-stack to
>    non-split-stack code, it adds 16k (or more) to the value found in
>    "allocate" instructions (so non-split-stack code gets a larger stack).
>    The amount is tunable by a linker option.  The edit means aarch64 does
>    not need to implement __morestack_non_split, necessary on x86 because
>    insufficient space is available there to edit the stack comparison
>    code.  This feature is only implemented in the GNU gold linker.
>
> 4. AArch64 does not handle >2G stacks initially, and although it is
>    possible to implement that, limiting to 2G allows materializing the
>    allocation with only 2 instructions (mov + movk) and thus simplifies
>    the linker adjustments required.  Supporting multiple threads each
>    requiring more than 2G of stack is probably not that important, and
>    likely to OOM at run time.
>
> 5. The TCB support on GLIBC is meant to be included in version 2.25 [1].
>
> I tested bootstrapping on aarch64-linux-gnu and although still digesting
> the results I saw no regression.  All cgo tests are passing, although
> based on previous reports on other archs, gold support should be present
> to avoid issues with split code calling non-split code.
>
> libgcc/ChangeLog:
>
> 	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
> 	aarch64*-*-linux.
> 	* libgcc/config/aarch64/morestack-c.c: New file.
> 	* libgcc/config/aarch64/morestack.S: Likewise.
> 	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
> 	* libgcc/generic-morestack.c (__splitstack_find): Add
> 	aarch64-specific code.
>
> gcc/ChangeLog:
>
> 	* common/config/aarch64/aarch64-common.c
> 	(aarch64_supports_split_stack): New function.
> 	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
> 	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
> 	macro.
> 	* gcc/config/aarch64/aarch64-protos.h: Add
> 	aarch64_expand_split_stack_prologue and
> 	aarch64_split_stack_space_check.
> 	* gcc/config/aarch64/aarch64.c (aarch64_expand_prologue): Setup the
> 	argument pointer (x10) for split-stack.
> 	(aarch64_expand_builtin_va_start): Use internal argument pointer
> 	instead of virtual_incoming_args_rtx.
> 	(aarch64_e
[PATCH] aarch64: Add split-stack initial support
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>

This patch adds split-stack support on aarch64 (PR #67877).  As for other
ports, this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field
is added to the TCB in glibc, a target-specific __morestack implementation
and helper functions are added in libgcc, and compiler adjustments are made
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack and
default code calls.

The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2
instructions (either just a movn plus nop or a movn followed by movk).
The morestack call is non-standard, with x10 holding the requested stack
pointer and x11 the required stack size to be copied.  Also, function
arguments on the old stack are accessed based on a value relative to the
stack pointer, so x10 is used to hold the old stack value.  Unwinding is
handled by a personality routine that knows how to find stack segments.

The split-stack prologue on function entry is as follows (this goes before
the usual function prologue):

	mrs	x9, tpidr_el0
	mov	x10, -
	nop/movk
	add	x10, sp, x10
	ldr	x9, [x9, 16]
	cmp	x10, x9
	b.cs	enough
	stp	x30, [sp, -16]
	mov	x11,
	mov	x11,
	bl	__morestack
	ldp	x30, [sp], 16
	ret
enough:
	# usual function prologue, modified a little at the end to set up the
	# arg_pointer in x10, starts here.  The arg_pointer is initialized,
	# if it is used, with
	mov	x11,
	add	x10, x29, x11
	b.cs	function
	mov	x10, x28
function:

Notes:

1. Even if a function does not allocate a stack frame, a split-stack
   prologue is created.  This is to avoid issues with tail calls to
   external symbols which might require linker adjustment
   (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move the split-stack TCB
   ldr to after the required stack calculation.

3.
Similar to powerpc, when the linker detects a call from split-stack to
non-split-stack code, it adds 16k (or more) to the value found in
"allocate" instructions (so non-split-stack code gets a larger stack).
The amount is tunable by a linker option.  The edit means aarch64 does not
need to implement __morestack_non_split, necessary on x86 because
insufficient space is available there to edit the stack comparison code.
This feature is only implemented in the GNU gold linker.

4. AArch64 does not handle >2G stacks initially, and although it is
   possible to implement that, limiting to 2G allows materializing the
   allocation with only 2 instructions (mov + movk) and thus simplifies
   the linker adjustments required.  Supporting multiple threads each
   requiring more than 2G of stack is probably not that important, and
   likely to OOM at run time.

5. The TCB support on GLIBC is meant to be included in version 2.25 [1].

I tested bootstrapping on aarch64-linux-gnu and although still digesting
the results I saw no regression.  All cgo tests are passing, although
based on previous reports on other archs, gold support should be present
to avoid issues with split code calling non-split code.

libgcc/ChangeLog:

	* libgcc/config.host: Use t-stack and t-stack-aarch64 for
	aarch64*-*-linux.
	* libgcc/config/aarch64/morestack-c.c: New file.
	* libgcc/config/aarch64/morestack.S: Likewise.
	* libgcc/config/aarch64/t-stack-aarch64: Likewise.
	* libgcc/generic-morestack.c (__splitstack_find): Add
	aarch64-specific code.

gcc/ChangeLog:

	* common/config/aarch64/aarch64-common.c
	(aarch64_supports_split_stack): New function.
	(TARGET_SUPPORTS_SPLIT_STACK): New macro.
	* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
	macro.
	* gcc/config/aarch64/aarch64-protos.h: Add
	aarch64_expand_split_stack_prologue and
	aarch64_split_stack_space_check.
	* gcc/config/aarch64/aarch64.c (aarch64_expand_prologue): Setup the
	argument pointer (x10) for split-stack.
	(aarch64_expand_builtin_va_start): Use internal argument pointer
	instead of virtual_incoming_args_rtx.
	(aarch64_expand_split_stack_prologue): New function.
	(aarch64_file_end): Emit the split-stack note sections.
	(aarch64_internal_arg_pointer): Likewise.
	(aarch64_live_on_entry): Set the argument pointer for split-stack.
	(aarch64_split_stack_space_check): Likewise.
	(TARGET_ASM_FILE_END): New macro.
	(TARGET_INTERNAL_ARG_POINTER): Likewise.
	* gcc/config/aarch64/aarch64.h (aarch64_frame): Add
	split_stack_arg_pointer to setup the argument
Split-stack support for aarch64
Hi all,

Since BZ#67877 [1] does not have much information on it, I would like to
ask for some input on what the requirements are for implementing
split-stack on aarch64 besides 'feature parity'.  I am asking because the
PR states the main use is gccgo, and afaik there is some usage in other
programs, but the widespread use is not really convincing (for instance,
rust now allows targets to be built without it [2]).  I also noted that Go
from 1.4 onward does not use split-stack anymore, stating it suffers from
a performance issue ("hot stack split"), and some searching on the
internet suggests that for 64-bit targets split-stack is not really an
efficient way to manage stack growth.

So is this only for gccgo support?  Is gccgo stuck on a pre-1.4 design
and/or not willing to remove split-stack usage (as go itself did)?

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67877
[2] https://github.com/rust-lang/rust/issues/16980
Re: [PATCH 1/7] Libsanitizer merge from upstream r249633.
On 14-10-2015 04:54, Jakub Jelinek wrote:
> On Tue, Oct 13, 2015 at 07:54:33PM +0300, Maxim Ostapenko wrote:
>> On 13/10/15 14:15, Maxim Ostapenko wrote:
>>> This is the raw merge itself. I'm bumping SONAME to libasan.so.3.
>>>
>>> -Maxim
>>
>> I have just noticed that I've misused autoconf stuff (used wrong version).
>> Here a fixed version of the same patch. Sorry for inconvenience.
>
> Is libubsan, libtsan backwards compatible, or do we want to change SONAME
> there too?
>
> The aarch64 changes are terrible, not just because it doesn't yet have
> runtime decision on what VA to use or that it doesn't support 48-bit VA,
> but also that for the 42-bit VA it uses a different shadow offset from
> 39-bit VA.  But on the compiler side we have just one...
> Though, only the 39-bit VA is enabled right now by default, so out of the
> box the state is as bad as we had in 5.x - users wanting 42-bit VA or
> 48-bit VA have to patch libsanitizer.

Yes, we are aware of the current deficiencies for aarch64 with a 39-bit
and 42-bit VMA and the lack of support for a 48-bit VMA.  On the LLVM
side, the current approach is to build compiler support for either 39 or
42 bit (and again, we are aware this is not the ideal approach).  This
approach was used mainly for validation and buildbot enablement.

I am currently working on a way to make the sanitizers (asan and msan)
VMA-independent on aarch64 (aiming at the currently supported 39/42 and
later a 48-bit VMA), and the approach we decided to use is to change the
instrumentation to use a parametrized value to compute the shadow memory
regions (based on the VMA address), which will be initialized externally
by libsanitizer.  TSAN is somewhat easier since the instrumentation does
not take the VMA into consideration (only the libsanitizer itself does).
The idea is to avoid compiler switches and make it transparent to run the
binary regardless of the VMA on the system.
The downside is the instrumentation will require more steps (to load the
parametrized value to compute the shadow memory) and thus be slower.

> Have you verified libbacktrace sanitization still works properly (that is
> something upstream does not test)?
>
> Do you plan to update the asan tests we have to reflect the changes in
> upstream?
>
> 	Jakub
>
[PATCH] rs6000: Fix HTM tcheck assembly encoding
gcc/ChangeLog:

	* config/rs6000/htm.md (tcheck): Fix assembly encoding.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/htm-builtin-1.c: Fix tcheck expected value.

---
diff --git a/gcc/config/rs6000/htm.md b/gcc/config/rs6000/htm.md
index 2c4689f..79fb740 100644
--- a/gcc/config/rs6000/htm.md
+++ b/gcc/config/rs6000/htm.md
@@ -252,7 +252,7 @@
 	(unspec_volatile:CC [(match_operand 0 "u3bit_cint_operand" "n")]
 			    UNSPECV_HTM_TCHECK))]
   "TARGET_HTM"
-  "tcheck. %0"
+  "tcheck %0"
   [(set_attr "type" "htm")
    (set_attr "length" "4")])
diff --git a/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c
index e58816a..62d64e6 100644
--- a/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c
@@ -10,7 +10,7 @@
 /* { dg-final { scan-assembler-times "tabortdci\\." 1 } } */
 /* { dg-final { scan-assembler-times "tabortwc\\." 1 } } */
 /* { dg-final { scan-assembler-times "tabortwci\\." 2 } } */
-/* { dg-final { scan-assembler-times "tcheck\\." 1 } } */
+/* { dg-final { scan-assembler-times "tcheck" 1 } } */
 /* { dg-final { scan-assembler-times "trechkpt\\." 1 } } */
 /* { dg-final { scan-assembler-times "treclaim\\." 1 } } */
 /* { dg-final { scan-assembler-times "tsr\\." 3 } } */
Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV
On 03-09-2014 11:01, Maciej W. Rozycki wrote:

On Tue, 2 Sep 2014, Adhemerval Zanella wrote:

Ping.

On 19-08-2014 13:54, Adhemerval Zanella wrote:

Ping.

On 06-08-2014 17:21, Adhemerval Zanella wrote:

On 01-08-2014 12:31, Joseph S. Myers wrote:

On Thu, 31 Jul 2014, David Edelsohn wrote:

Thanks for implementing the FENV support.  The patch generally looks good
to me.  My one concern is a detail in the implementation of update.  I do
not have enough experience with GENERIC to verify the details, and it
seems like it is missing building an outer COMPOUND_EXPR containing
update_mffs and the CALL_EXPR for the update mtfsf.

I suppose what's actually odd there is that you have

+  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, call_mffs);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);

so you build a MODIFY_EXPR in void_type_node but then convert it with a
VIEW_CONVERT_EXPR.  If you'd built the MODIFY_EXPR in double_type_node
then the VIEW_CONVERT_EXPR would be meaningful (the value of an assignment
a = b being the new value of a), but reinterpreting a void value doesn't
make sense.  Or you could probably just use call_mffs directly in the
VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

Thanks for the review, Joseph.  I have changed it to avoid the void
reinterpretation and use call_mffs directly.  I have also removed the mask
generation in 'clear' from your previous message; it is now reusing the
mask used in feholdexcept.  The testcase patch is the same as before.

Checked on both linux-powerpc64/powerpc64le and no regressions found.

--

2014-08-06  Adhemerval Zanella  <azane...@linux.vnet.ibm.com>

gcc:
	* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
	function.

gcc/testsuite:
	* gcc.dg/atomic/c11-atomic-exec-5.c
	(test_main_long_double_add_overflow): Define and run only for
	LDBL_MANT_DIG != 106.
	(test_main_complex_long_double_add_overflow): Likewise.
	(test_main_long_double_sub_overflow): Likewise.
	(test_main_complex_long_double_sub_overflow): Likewise.

FWIW I pushed it through regression testing across my usual set of
powerpc-linux-gnu multilibs with the results (for c11-atomic-exec-5.c) as
follows:

	-mcpu=603e						PASS
	-mcpu=603e -msoft-float					UNSUPPORTED
	-mcpu=8540 -mfloat-gprs=single -mspe=yes -mabi=spe	UNSUPPORTED
	-mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe	UNSUPPORTED
	-mcpu=7400 -maltivec -mabi=altivec			PASS
	-mcpu=e6500 -maltivec -mabi=altivec			PASS
	-mcpu=e5500 -m64					PASS
	-mcpu=e6500 -m64 -maltivec -mabi=altivec		PASS

Thanks for testing it, I'll try to add these permutations to my own
testbench.

(Floating-point environment is of course unsupported for soft-float
targets, and for the SPE FPU another change is required to implement
floating-point environment handling to complement the one proposed here.)
No regressions otherwise.

While at it, may I propose another change on top of this?  I've noticed
the test case is rather slow; it certainly takes much more time than the
average one.  I've seen elapsed times of well over a minute on reasonably
fast hardware and occasionally a timeout midway through even though the
test case was otherwise progressing just fine.  I think lock contention or
unrelated system activity such as hardware interrupts (think a busy
network!) may contribute to it for systems using LL/SC loops for
atomicity.

So I think the default timeout that's used for really quick tests should
be extended a bit.  I propose a factor of 2, just not to make it too
excessive, at least for the beginning (maybe it'll have to be higher
eventually).  Do you mind if I incorporate this change in my patchset?

OK?

2014-09-03  Maciej W. Rozycki  <ma...@codesourcery.com>

	gcc/testsuite/
	* gcc.dg/atomic/c11-atomic-exec-5.c (dg-timeout-factor): New
	setting.
  Maciej

gcc-test-c11-atomic-exec-5-timeout-factor.diff
Index: gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-5.c
===
--- gcc-fsf-trunk-quilt.orig/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-5.c	2014-09-02 17:34:06.718927043 +0100
+++ gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-5.c	2014-09-03 14:51:12.958927233 +0100
@@ -9,6 +9,7 @@
 /* { dg-additional-options "-D_XOPEN_SOURCE=600" { target *-*-solaris2.1[0-9]* } } */
 /* { dg-require-effective-target fenv_exceptions } */
 /* { dg-require-effective-target pthread } */
+/* { dg-timeout-factor 2 } */
 #include <fenv.h>
 #include <float.h>
Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV
Ping.

On 19-08-2014 13:54, Adhemerval Zanella wrote:

Ping.

On 06-08-2014 17:21, Adhemerval Zanella wrote:

On 01-08-2014 12:31, Joseph S. Myers wrote:

On Thu, 31 Jul 2014, David Edelsohn wrote:

Thanks for implementing the FENV support.  The patch generally looks good
to me.  My one concern is a detail in the implementation of update.  I do
not have enough experience with GENERIC to verify the details, and it
seems like it is missing building an outer COMPOUND_EXPR containing
update_mffs and the CALL_EXPR for the update mtfsf.

I suppose what's actually odd there is that you have

+  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, call_mffs);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);

so you build a MODIFY_EXPR in void_type_node but then convert it with a
VIEW_CONVERT_EXPR.  If you'd built the MODIFY_EXPR in double_type_node
then the VIEW_CONVERT_EXPR would be meaningful (the value of an assignment
a = b being the new value of a), but reinterpreting a void value doesn't
make sense.  Or you could probably just use call_mffs directly in the
VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

Thanks for the review, Joseph.  I have changed it to avoid the void
reinterpretation and use call_mffs directly.  I have also removed the mask
generation in 'clear' from your previous message; it is now reusing the
mask used in feholdexcept.  The testcase patch is the same as before.

Checked on both linux-powerpc64/powerpc64le and no regressions found.

--

2014-08-06  Adhemerval Zanella  <azane...@linux.vnet.ibm.com>

gcc:
	* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
	function.

gcc/testsuite:
	* gcc.dg/atomic/c11-atomic-exec-5.c
	(test_main_long_double_add_overflow): Define and run only for
	LDBL_MANT_DIG != 106.
	(test_main_complex_long_double_add_overflow): Likewise.
	(test_main_long_double_sub_overflow): Likewise.
	(test_main_complex_long_double_sub_overflow): Likewise.
---
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index d088ff6..7d66eb1 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1631,6 +1631,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_CAN_USE_DOLOOP_P
 #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
+#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
+#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv

 /* Processor table.  */
@@ -7,6 +33340,80 @@ emit_fusion_gpr_load (rtx *operands)
   return "";
 }

+/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
+
+static void
+rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
+{
+  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
+    return;
+
+  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
+  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
+  tree call_mffs = build_call_expr (mffs, 0);
+
+  /* Generates the equivalent of feholdexcept (&fenv_var)
+
+       fenv_var = __builtin_mffs ();
+       double fenv_hold;
+       *(uint64_t*)&fenv_hold = *(uint64_t*)&fenv_var & 0x0007LL;
+       __builtin_mtfsf (0xff, fenv_hold);  */
+
+  /* Mask to clear everything except for the rounding modes and non-IEEE
+     arithmetic flag.  */
+  const unsigned HOST_WIDE_INT hold_exception_mask =
+    HOST_WIDE_INT_UC (0x0007);
+
+  tree fenv_var = create_tmp_var (double_type_node, NULL);
+
+  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, call_mffs);
+
+  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
+  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+                              build_int_cst (uint64_type_node,
+                                             hold_exception_mask));
+
+  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node,
+                            fenv_llu_and);
+
+  tree hold_mtfsf = build_call_expr (mtfsf, 2,
+                                     build_int_cst (unsigned_type_node, 0xff),
+                                     fenv_mtfsf);
+
+  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
+
+  /* Reload the value of fenv_hold to clear the exceptions.
*/
+
+  *clear = build_call_expr (mtfsf, 2,
+                            build_int_cst (unsigned_type_node, 0xff),
+                            fenv_mtfsf);
+
+  /* Generates the equivalent of feupdateenv (&fenv_var)
+
+       double old_fenv = __builtin_mffs ();
+       double fenv_update;
+       *(uint64_t*)&fenv_update = (*(uint64_t*)&old_fenv & 0x1f00LL) |
+                                  (*(uint64_t*)&fenv_var & 0x1ff80fff);
+       __builtin_mtfsf (0xff, fenv_update);  */
+
+  const unsigned HOST_WIDE_INT update_exception_mask =
+    HOST_WIDE_INT_UC (0x1f00);
+  const unsigned HOST_WIDE_INT new_exception_mask =
+    HOST_WIDE_INT_UC (0x1ff80fff);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, call_mffs);
+  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
+                             old_llu
Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV
Ping.

On 06-08-2014 17:21, Adhemerval Zanella wrote:

On 01-08-2014 12:31, Joseph S. Myers wrote:

On Thu, 31 Jul 2014, David Edelsohn wrote:

Thanks for implementing the FENV support.  The patch generally looks good
to me.  My one concern is a detail in the implementation of update.  I do
not have enough experience with GENERIC to verify the details, and it
seems like it is missing building an outer COMPOUND_EXPR containing
update_mffs and the CALL_EXPR for the update mtfsf.

I suppose what's actually odd there is that you have

+  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, call_mffs);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);

so you build a MODIFY_EXPR in void_type_node but then convert it with a
VIEW_CONVERT_EXPR.  If you'd built the MODIFY_EXPR in double_type_node
then the VIEW_CONVERT_EXPR would be meaningful (the value of an assignment
a = b being the new value of a), but reinterpreting a void value doesn't
make sense.  Or you could probably just use call_mffs directly in the
VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

Thanks for the review, Joseph.  I have changed it to avoid the void
reinterpretation and use call_mffs directly.  I have also removed the mask
generation in 'clear' from your previous message; it is now reusing the
mask used in feholdexcept.  The testcase patch is the same as before.

Checked on both linux-powerpc64/powerpc64le and no regressions found.

--

2014-08-06  Adhemerval Zanella  <azane...@linux.vnet.ibm.com>

gcc:
	* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
	function.

gcc/testsuite:
	* gcc.dg/atomic/c11-atomic-exec-5.c
	(test_main_long_double_add_overflow): Define and run only for
	LDBL_MANT_DIG != 106.
	(test_main_complex_long_double_add_overflow): Likewise.
	(test_main_long_double_sub_overflow): Likewise.
	(test_main_complex_long_double_sub_overflow): Likewise.
Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV
On 01-08-2014 12:31, Joseph S. Myers wrote: On Thu, 31 Jul 2014, David Edelsohn wrote: Thanks for implementing the FENV support. The patch generally looks good to me. My one concern is a detail in the implementation of update. I do not have enough experience with GENERIC to verify the details, and it seems like it is missing building an outer COMPOUND_EXPR containing update_mffs and the CALL_EXPR for the update mtfsf.

I suppose what's actually odd there is that you have

+ tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, call_mffs);
+
+ tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);

so you build a MODIFY_EXPR in void_type_node but then convert it with a VIEW_CONVERT_EXPR. If you'd built the MODIFY_EXPR in double_type_node then the VIEW_CONVERT_EXPR would be meaningful (the value of an assignment a = b being the new value of a), but reinterpreting a void value doesn't make sense. Or you could probably just use call_mffs directly in the VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

Thanks for the review, Joseph. I have changed it to avoid the void reinterpretation and to use call_mffs directly. I have also removed the mask generation in 'clear' from your previous message; it now reuses the mask used in feholdexcept. The testcase patch is the same as before. Checked on both linux-powerpc64 and powerpc64le with no regressions found.

--

2014-08-06 Adhemerval Zanella azane...@linux.vnet.ibm.com

gcc:
* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New function.

gcc/testsuite:
* gcc.dg/atomic/c11-atomic-exec-5.c (test_main_long_double_add_overflow):
Define and run only for LDBL_MANT_DIG != 106.
(test_main_complex_long_double_add_overflow): Likewise.
(test_main_long_double_sub_overflow): Likewise.
(test_main_complex_long_double_sub_overflow): Likewise.
---
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index d088ff6..7d66eb1 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1631,6 +1631,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_CAN_USE_DOLOOP_P
 #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
+#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
+#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv

 /* Processor table.  */
@@ -33337,6 +33340,80 @@ emit_fusion_gpr_load (rtx *operands)
   return "";
 }

+/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
+
+static void
+rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
+{
+  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
+    return;
+
+  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
+  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
+  tree call_mffs = build_call_expr (mffs, 0);
+
+  /* Generates the equivalent of feholdexcept (&fenv_var)
+
+     *fenv_var = __builtin_mffs ();
+     double fenv_hold;
+     *(uint64_t*)&fenv_hold = *(uint64_t*)&fenv_var & 0xffffffff00000007LL;
+     __builtin_mtfsf (0xff, fenv_hold);  */
+
+  /* Mask to clear everything except for the rounding modes and non-IEEE
+     arithmetic flag.  */
+  const unsigned HOST_WIDE_INT hold_exception_mask =
+    HOST_WIDE_INT_UC (0xffffffff00000007);
+
+  tree fenv_var = create_tmp_var (double_type_node, NULL);
+
+  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, call_mffs);
+
+  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
+  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+    build_int_cst (uint64_type_node, hold_exception_mask));
+
+  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node, fenv_llu_and);
+
+  tree hold_mtfsf = build_call_expr (mtfsf, 2,
+    build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
+
+  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
+
+  /* Reload the value of fenv_hold to clear the exceptions.  */
+
+  *clear = build_call_expr (mtfsf, 2,
+    build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
+
+  /* Generates the equivalent of feupdateenv (&fenv_var)
+
+     double old_fenv = __builtin_mffs ();
+     double fenv_update;
+     *(uint64_t*)&fenv_update = (*(uint64_t*)&old & 0xffffffff1fffff00LL) |
+       (*(uint64_t*)&fenv_var & 0x1ff80fff);
+     __builtin_mtfsf (0xff, fenv_update);  */
+
+  const unsigned HOST_WIDE_INT update_exception_mask =
+    HOST_WIDE_INT_UC (0xffffffff1fffff00);
+  const unsigned HOST_WIDE_INT new_exception_mask =
+    HOST_WIDE_INT_UC (0x1ff80fff);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, call_mffs);
+  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
+    old_llu, build_int_cst (uint64_type_node, update_exception_mask));
+
+  tree new_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+    build_int_cst (uint64_type_node, new_exception_mask
[COMMITTED] Add myself to MAINTAINERS file
Hi,

This patch adds myself to the MAINTAINERS file. Committed as revision 212652.

--

2014-07-16 Adhemerval Zanella azane...@linux.vnet.ibm.com

* MAINTAINERS (Write After Approval): Add myself.

---
--- trunk/MAINTAINERS 2014/07/16 14:21:34 212651
+++ trunk/MAINTAINERS 2014/07/16 14:23:03 212652
@@ -583,6 +583,7 @@
 David Yuste        david.yu...@gmail.com
 Kirill Yukhin      kirill.yuk...@gmail.com
 Kenneth Zadeck     zad...@naturalbridge.com
+Adhemerval Zanella azane...@linux.vnet.ibm.com
 Yufeng Zhang       yufeng.zh...@arm.com
 Shujing Zhao       pearly.z...@oracle.com
 Jon Ziegler        j...@apple.com
Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV
Ping.

On 03-07-2014 18:08, Adhemerval Zanella wrote:

This patch implements the TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook for powerpc-fpu. I had to adjust the current c11-atomic-exec-5 testcase because for IBM long double, 0 += LDBL_MAX might generate overflow/underflow in the internal __gcc_qadd calculations. The c11-atomic-exec-5 test now passes for linux/powerpc; checked on powerpc32-linux-fpu, powerpc64-linux, and powerpc64le-linux.

--

2014-07-03 Adhemerval Zanella azane...@linux.vnet.ibm.com

gcc:
* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New function.

gcc/testsuite:
* gcc.dg/atomic/c11-atomic-exec-5.c (test_main_long_double_add_overflow):
Define and run only for LDBL_MANT_DIG != 106.
(test_main_complex_long_double_add_overflow): Likewise.
(test_main_long_double_sub_overflow): Likewise.
(test_main_complex_long_double_sub_overflow): Likewise.
[PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV
This patch implements the TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook for powerpc-fpu. I had to adjust the current c11-atomic-exec-5 testcase because for IBM long double, 0 += LDBL_MAX might generate overflow/underflow in the internal __gcc_qadd calculations. The c11-atomic-exec-5 test now passes for linux/powerpc; checked on powerpc32-linux-fpu, powerpc64-linux, and powerpc64le-linux.

--

2014-07-03 Adhemerval Zanella azane...@linux.vnet.ibm.com

gcc:
* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New function.

gcc/testsuite:
* gcc.dg/atomic/c11-atomic-exec-5.c (test_main_long_double_add_overflow):
Define and run only for LDBL_MANT_DIG != 106.
(test_main_complex_long_double_add_overflow): Likewise.
(test_main_long_double_sub_overflow): Likewise.
(test_main_complex_long_double_sub_overflow): Likewise.

---
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index bf67e72..75a2a45 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1621,6 +1621,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_CAN_USE_DOLOOP_P
 #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
+#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
+#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv

 /* Processor table.  */
@@ -32991,6 +32994,105 @@ emit_fusion_gpr_load (rtx *operands)
   return "";
 }

+/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
+
+static void
+rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
+{
+  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
+    return;
+
+  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
+  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
+  tree call_mffs = build_call_expr (mffs, 0);
+
+  /* Generates the equivalent of feholdexcept (&fenv_var)
+
+     *fenv_var = __builtin_mffs ();
+     double fenv_hold;
+     *(uint64_t*)&fenv_hold = *(uint64_t*)&fenv_var & 0xffffffff00000007LL;
+     __builtin_mtfsf (0xff, fenv_hold);  */
+
+  /* Mask to clear everything except for the rounding modes and non-IEEE
+     arithmetic flag.  */
+  const unsigned HOST_WIDE_INT hold_exception_mask =
+    HOST_WIDE_INT_C (0xffffffff00000007);
+
+  tree fenv_var = create_tmp_var (double_type_node, NULL);
+
+  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, call_mffs);
+
+  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
+  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+    build_int_cst (uint64_type_node, hold_exception_mask));
+
+  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node, fenv_llu_and);
+
+  tree hold_mtfsf = build_call_expr (mtfsf, 2,
+    build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
+
+  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
+
+  /* Generates the equivalent of feclearexcept (FE_ALL_EXCEPT):
+
+     double fenv_clear = __builtin_mffs ();
+     *(uint64_t)&fenv_clear &= 0xffffffff00000000LL;
+     __builtin_mtfsf (0xff, fenv_clear);  */
+
+  /* Mask to clear everything except for the rounding modes and non-IEEE
+     arithmetic flag.  */
+  const unsigned HOST_WIDE_INT clear_exception_mask =
+    HOST_WIDE_INT_UC (0xffffffff00000000);
+
+  tree fenv_clear = create_tmp_var (double_type_node, NULL);
+
+  tree clear_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_clear, call_mffs);
+
+  tree fenv_clean_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
+  tree fenv_clear_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
+    fenv_clean_llu, build_int_cst (uint64_type_node, clear_exception_mask));
+
+  tree fenv_clear_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node,
+    fenv_clear_llu_and);
+
+  tree clear_mtfsf = build_call_expr (mtfsf, 2,
+    build_int_cst (unsigned_type_node, 0xff), fenv_clear_mtfsf);
+
+  *clear = build2 (COMPOUND_EXPR, void_type_node, clear_mffs, clear_mtfsf);
+
+  /* Generates the equivalent of feupdateenv (&fenv_var)
+
+     double old_fenv = __builtin_mffs ();
+     double fenv_update;
+     *(uint64_t*)&fenv_update = (*(uint64_t*)&old & 0xffffffff1fffff00LL) |
+       (*(uint64_t*)&fenv_var & 0x1ff80fff);
+     __builtin_mtfsf (0xff, fenv_update);  */
+
+  const unsigned HOST_WIDE_INT update_exception_mask =
+    HOST_WIDE_INT_UC (0xffffffff1fffff00);
+  const unsigned HOST_WIDE_INT new_exception_mask =
+    HOST_WIDE_INT_UC (0x1ff80fff);
+
+  tree old_fenv = create_tmp_var (double_type_node, NULL);
+  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, call_mffs);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);
+  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
+    old_llu, build_int_cst (uint64_type_node, update_exception_mask));
+
+  tree new_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+    build_int_cst (uint64_type_node, new_exception_mask));
+
+  tree
Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception
Hi,

This is a reformatted patch after Ulrich's review:

---
libgcc/ChangeLog:

2013-11-20 Adhemerval Zanella azane...@linux.vnet.ibm.com

* config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal number
and qNaN to not raise an inexact exception.

gcc/testsuite/ChangeLog:

2013-11-20 Adhemerval Zanella azane...@linux.vnet.ibm.com

* gcc.target/powerpc/pr57363.c: New test.

--
diff --git a/gcc/testsuite/gcc.target/powerpc/pr57363.c b/gcc/testsuite/gcc.target/powerpc/pr57363.c
new file mode 100644
index 000..45ea3f3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr57363.c
@@ -0,0 +1,19 @@
+/* { dg-do run { target { powerpc*-*-linux* } } } */
+/* { dg-options "-mlong-double-128" } */
+
+/* Check if adding a qNAN and a normal long double does not generate a
+   inexact exception.  */
+
+#define _GNU_SOURCE
+#include <fenv.h>
+
+int main(void)
+{
+  double x = __builtin_nan ("");
+  long double y = 1.1L;
+
+  feenableexcept (FE_INEXACT);
+  feclearexcept (FE_ALL_EXCEPT);
+  x = x + y;
+  return fetestexcept(FE_INEXACT);
+}
diff --git a/libgcc/config/rs6000/ibm-ldouble.c b/libgcc/config/rs6000/ibm-ldouble.c
index 28e02e9..7ca900c 100644
--- a/libgcc/config/rs6000/ibm-ldouble.c
+++ b/libgcc/config/rs6000/ibm-ldouble.c
@@ -104,6 +104,8 @@ __gcc_qadd (double a, double aa, double c, double cc)

   if (nonfinite (z))
     {
+      if (fabs (z) != inf())
+        return z;
       z = cc + aa + c + a;
       if (nonfinite (z))
         return z;
Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception
On 20-11-2013 14:23, David Edelsohn wrote: On Wed, Nov 20, 2013 at 9:11 AM, Adhemerval Zanella azane...@linux.vnet.ibm.com wrote:

libgcc/ChangeLog:

2013-11-20 Adhemerval Zanella azane...@linux.vnet.ibm.com

* config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal number
and qNaN to not raise an inexact exception.

gcc/testsuite/ChangeLog:

2013-11-20 Adhemerval Zanella azane...@linux.vnet.ibm.com

* gcc.target/powerpc/pr57363.c: New test.

This is okay. And the patch is small enough that it does not need a copyright assignment for GCC. Do you need someone to commit the patch for you? Thanks, David

Yes, I do need that, thanks.
Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception
On 15-11-2013 19:54, Ulrich Weigand wrote: Should this be qNaN instead of sNaN here?

Yes, indeed.

Also, since you already have a test case, I think it would be good to add it to the GCC test suite ... Otherwise, this looks reasonable to me (but I cannot approve the patch): Bye, Ulrich

Here it is with an updated ChangeLog. Tested on PPC32 and PPC64. The testcase fails at runtime without the fix with a Floating-Point Exception.

---

2013-11-15 Adhemerval Zanella azane...@linux.vnet.ibm.com

* libgcc/config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal
number and qNaN to not raise an inexact exception.
* gcc/testsuite/gcc.target/powerpc/pr57363.c: New file: check for
PR#57363.

--
diff --git a/gcc/testsuite/gcc.target/powerpc/pr57363.c b/gcc/testsuite/gcc.target/powerpc/pr57363.c
new file mode 100644
index 000..cc9b1d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr57363.c
@@ -0,0 +1,19 @@
+/* { dg-do run { target { powerpc*-*-linux* } } } */
+/* { dg-options "-mlong-double-128" } */
+
+/* Check if adding a sNAN and a normal long double does not generate a
+   inexact exception.  */
+
+#define _GNU_SOURCE
+#include <fenv.h>
+
+int main(void)
+{
+  double x = __builtin_nan ("");
+  long double y = 1.1L;
+
+  feenableexcept (FE_INEXACT);
+  feclearexcept (FE_ALL_EXCEPT);
+  x = x + y;
+  return fetestexcept(FE_INEXACT);
+}
diff --git a/libgcc/config/rs6000/ibm-ldouble.c b/libgcc/config/rs6000/ibm-ldouble.c
index 28e02e9..7ca900c 100644
--- a/libgcc/config/rs6000/ibm-ldouble.c
+++ b/libgcc/config/rs6000/ibm-ldouble.c
@@ -104,6 +104,8 @@ __gcc_qadd (double a, double aa, double c, double cc)

   if (nonfinite (z))
     {
+      if (fabs (z) != inf())
+        return z;
       z = cc + aa + c + a;
       if (nonfinite (z))
         return z;
Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception
On 18-11-2013 18:10, Ulrich Weigand wrote: Adhemerval Zanella wrote:

2013-11-15 Adhemerval Zanella azane...@linux.vnet.ibm.com

* libgcc/config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal
number and qNaN to not raise an inexact exception.
* gcc/testsuite/gcc.target/powerpc/pr57363.c: New file: check for
PR#57363.

This should go into two separate ChangeLog files, one in libgcc/ChangeLog:

* config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal number
and qNaN to not raise an inexact exception.

and one in gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr57363.c: New test.

Ok, thanks for spotting it.

+/* Check if adding a sNAN and a normal long double does not generate a
+   inexact exception.  */

qNaN again :-)

Yeah... :(

Thanks, Ulrich
[PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception
Hi all,

For IBM long double, adding a normal number to a qNaN raises an inexact exception. Adding any number to a qNaN should not raise any exception. The following testcase triggers the issue (the testcase is meant to run against a GNU-compatible libc):

--
$ cat gcc_testcase.c
#include <math.h>
#include <fenv.h>
#include <stdio.h>

double sum (double x, long double y)
{
  return x + y;
}

int main ()
{
  feenableexcept (FE_INEXACT);

  double x = __builtin_nan ("");
  long double y = 1.1L;

  printf ("%e\n", sum (x, y));
  return 0;
}

$ gcc -O3 -m64 -fno-inline gcc_testcase.c -o gcc_testcase -lm
$ ./gcc_testcase
Floating point exception (core dumped)
--

The issue is in the __gcc_qadd implementation at libgcc/config/rs6000/ibm-ldouble.c: if the number is non-finite, there is no check whether it is a NaN before actually summing all the components. A possible solution would be to add an extra test and return the first sum if the number is not infinity.

This also fixes two GLIBC math issues I'm seeing:

TEST_ff_f (nexttoward, qnan_value, 1.1L, qnan_value, NO_INEXACT_EXCEPTION),
TEST_ff_f (nexttoward, 1.1L, qnan_value, qnan_value, NO_INEXACT_EXCEPTION),

Fix tested on PPC32 and PPC64.

---

2013-11-15 Adhemerval Zanella azane...@linux.vnet.ibm.com

* libgcc/config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal
number and sNaN to not raise an inexact exception.

---
diff --git a/libgcc/config/rs6000/ibm-ldouble.c b/libgcc/config/rs6000/ibm-ldouble.c
index 28e02e9..7ca900c 100644
--- a/libgcc/config/rs6000/ibm-ldouble.c
+++ b/libgcc/config/rs6000/ibm-ldouble.c
@@ -104,6 +104,8 @@ __gcc_qadd (double a, double aa, double c, double cc)

   if (nonfinite (z))
     {
+      if (fabs (z) != inf())
+        return z;
       z = cc + aa + c + a;
       if (nonfinite (z))
         return z;