Mailing list unsubscribe not working?

2019-10-30 Thread Steve Ellcey
I am not sure if this is the correct mailing list but I did not see a
better one to use.

I have been trying to unsubscribe from some mailing lists and the process
does not seem to be working.  As an example I sent an unsubscribe request
for libstdc++-digest, got a reply asking me to confirm, when I did a reply
in order to confirm I got:

| Acknowledgment: The address
| 
|prvs=82067f4a26=sell...@marvell.com
| 
| was not on the libstdc++-digest mailing list when I received
| your request and is not a subscriber of this list.
|
| If you unsubscribe, but continue to receive mail, you're subscribed
| under a different address than you currently use. Please look at the
| header for:
| 
| 'Return-Path: '


I looked at the email sent to me and I see:

| Return-Path: libstdc++-digest-return-14453-sellcey=marvell@gcc.gnu.org

So that would seem to imply that I am in fact subscribed as sell...@marvell.com,
but the unsubscribe still failed.  Has anyone else had this issue, or does
anyone have any idea what is going on?

Steve Ellcey


Re: RFC: Extending --with-advance-toolchain to aarch64

2019-10-10 Thread Steve Ellcey
On Thu, 2019-10-10 at 15:38 -0300, Tulio Magno Quites Machado Filho wrote:
> 
> > Let me first describe what I do now:
> > 
> > configure/build BINUTILS with --prefix=${X} --with-sysroot=${X}
> > configure/build an initial GCC (all-gcc all-target-libgcc) with
> > --prefix=${X} --with-sysroot=${X}
> > configure/build GLIBC, using that GCC, with --prefix=/usr,
> > followed by install with DESTDIR=${X}
> 
> Can you use --prefix=${X}?

I can.  I would rather not, because when you don't have prefix set to
/usr you get a different glibc build.  For example, on aarch64 building
with --prefix=/usr means that libraries are put in lib64 (or libilp32)
instead of just lib.  The glibc folks are always preaching against 
building with a prefix of anything other than /usr.

> 
> Florian already explained why glibc has this test.
> But the Advance Toolchain carries the following patch:
> 
> https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=9ca2648e2aa7094e022d5150281b2575f866259f
> 

Ah, I see.  I was hoping that using --with-advance-toolchain would give
me a way to build a toolchain without needing any local/non-standard
patches.

Steve Ellcey
sell...@marvell.com


Re: RFC: Extending --with-advance-toolchain to aarch64

2019-10-10 Thread Steve Ellcey
On Thu, 2019-10-10 at 18:41 +0200, Florian Weimer wrote:
> 
> * Steve Ellcey:
> 
> > I would like these used by default so I took some ideas from
> > --with-advance-toolchain and used that to automatically add these options
> > to LINK_SPEC (see attached patch).  I can compile and link a program with
> > this setup, but when I run the program I get:
> > 
> > % ./x
> > Inconsistency detected by ld.so: get-dynamic-info.h: 147: 
> > elf_get_dynamic_info: 
> > Assertion `info[DT_RPATH] == NULL' failed!
> > 
> > I am not sure why this doesn't work.  Can anyone help me understand
> > why this doesn't work or help me figure out how else I might be able to
> > get the functionality I want. That is: to use shared libraries and a dynamic
> > linker (at run time) that are in a non-standard location without needing
> > to compile or link with special flags.
> 
> An argument could be made that if ld.so has DT_RPATH set,
> LD_LIBRARY_PATH would stop working, which would be a bug.  Hence the
> assert.  It's probably less an issue for DT_RUNPATH.
> 
> The real fix would be to make sure that ld.so isn't built with those
> dynamic tags.  If ld.so wants to use an alternative search path, that
> should be baked into the loader itself, explicitly.
> 
> Do you know where those dynamic tags originate?  Is there some wrapper
> script involved that sets them unconditionally?

I am not sure, but my guess is that it is because I am building
binutils (including ld) using --with-sysroot.  I build both GCC and
binutils with the sysroot directory where I put the glibc that I am
building.  Maybe I should try building GCC with --with-sysroot but
build binutils without it.
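One way to check Florian's theory is to look at the dynamic section of the freshly built ld.so with readelf -d and see whether RPATH/RUNPATH tags are present.  A minimal sketch, grepping a canned sample of readelf output so it is self-contained (the library path shown is an illustrative assumption, not from the actual build):

```shell
#!/bin/sh
# On a real tree you would run:  readelf -d ${X}/lib/ld-linux-aarch64.so.1
# Here a canned sample of 'readelf -d' output stands in for the real thing,
# so the check is self-contained; the rpath value is an illustrative assumption.
sample=' 0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]
 0x000000000000000f (RPATH)   Library rpath: [/opt/mytoolchain/lib64]'

if printf '%s\n' "$sample" | grep -Eq '\((RPATH|RUNPATH)\)'; then
  verdict="ld.so carries RPATH/RUNPATH - expect the DT_RPATH assertion"
else
  verdict="no RPATH/RUNPATH tags found"
fi
echo "$verdict"
```

If the tags show up on ld.so itself, that points at the link flags used while building glibc, not at the application link.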

Steve Ellcey
sell...@marvell.com


Re: RFC: Extending --with-advance-toolchain to aarch64

2019-10-10 Thread Steve Ellcey
On Thu, 2019-10-10 at 10:49 +1030, Alan Modra wrote:
> On Wed, Oct 09, 2019 at 10:29:48PM +0000, Steve Ellcey wrote:
> > I have a question about building a toolchain that uses (at run
> > time) a
> > dynamic linker and system libraries and headers that are in a non-
> > standard
> > place.
> 
> I had scripts a long time ago to build a complete toolchain including
> glibc that could be installed in a non-standard location and co-exist
> with other system libraries.  I worked around..
> 
> > Inconsistency detected by ld.so: get-dynamic-info.h: 147:
> > elf_get_dynamic_info: 
> > Assertion `info[DT_RPATH] == NULL' failed!
> 
> ..this by patching glibc.

Yes, I have something working by patching glibc (and gcc) but when I
saw the IBM --with-advance-toolchain option I was hoping I might be
able to come up with a build process that worked and did not need any
patching.

Steve Ellcey
sell...@marvell.com


RFC: Extending --with-advance-toolchain to aarch64

2019-10-09 Thread Steve Ellcey
I have a question about building a toolchain that uses (at run time) a
dynamic linker and system libraries and headers that are in a non-standard
place.

I just noticed the IBM --with-advance-toolchain option and I would
like to replicate it for aarch64.

Let me first describe what I do now:

configure/build BINUTILS with --prefix=${X} --with-sysroot=${X}
configure/build an initial GCC (all-gcc all-target-libgcc) with
--prefix=${X} --with-sysroot=${X}
configure/build GLIBC, using that GCC, with --prefix=/usr,
followed by install with DESTDIR=${X}
configure/build final GCC with --prefix=${X} --with-sysroot=${X}
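The four stages above can be sketched as a dry-run script.  The install root and target triplet below are placeholder assumptions (the post only calls the directory ${X}), and each stage is echoed rather than executed:

```shell
#!/bin/sh
# Dry-run sketch of the four-stage build sequence described above.
# X and TGT are placeholder values, and every stage is only echoed.
X=/opt/mytoolchain
TGT=aarch64-linux-gnu
nl='
'
plan=""
step() { plan="${plan}${1}${nl}"; echo "$1"; }

step "binutils: configure --prefix=$X --with-sysroot=$X && make && make install"
step "initial gcc: configure --prefix=$X --with-sysroot=$X && make all-gcc all-target-libgcc && make install-gcc install-target-libgcc"
step "glibc: configure --host=$TGT --prefix=/usr && make && make install DESTDIR=$X"
step "final gcc: configure --prefix=$X --with-sysroot=$X && make && make install"
```

The key asymmetry is the glibc stage: it is configured with --prefix=/usr but installed under ${X} via DESTDIR, which is why the libraries land in the multilib directories described below.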

This all works, but if I want my executables to find the shared libraries and
dynamic linker from ${X} when they are running, I need to compile things with:

   -Wl,--rpath=${X}/lib64 -Wl,--dynamic-linker=${X}/lib/ld-linux-aarch64.so.1

I would like these used by default so I took some ideas from
--with-advance-toolchain and used that to automatically add these options
to LINK_SPEC (see attached patch).  I can compile and link a program with
this setup, but when I run the program I get:

% ./x
Inconsistency detected by ld.so: get-dynamic-info.h: 147: elf_get_dynamic_info: 
Assertion `info[DT_RPATH] == NULL' failed!

I am not sure why this doesn't work.  Can anyone help me understand
why this doesn't work or help me figure out how else I might be able to
get the functionality I want. That is: to use shared libraries and a dynamic
linker (at run time) that are in a non-standard location without needing
to compile or link with special flags.

Steve Ellcey
sell...@marvell.com


Here is the patch I am trying, I use the --with-advance-toolchain option as
an absolute pathname instead of relative to /opt like IBM does and I set it
to ${X} in a build that otherwise looks like what I describe above.  Everything
works until I start the final GCC build which is when I get the assertion.


diff --git a/gcc/config.gcc b/gcc/config.gcc
index 481bc9586a7..0532139b0b1 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -3879,7 +3879,7 @@ fi
 supported_defaults=
 case "${target}" in
aarch64*-*-*)
-   supported_defaults="abi cpu arch"
+   supported_defaults="abi cpu arch advance_toolchain"
for which in cpu arch; do
 
eval "val=\$with_$which"
@@ -3981,6 +3981,23 @@ case "${target}" in
  exit 1
fi
done
+   if test "x$with_advance_toolchain" != x; then
+   at=$with_advance_toolchain
+   if test -d "$at/." -a -d "$at/include/."; then
+   tm_file="$tm_file ./advance-toolchain.h"
+   (
+echo "/* Use Advance Toolchain $at */"
+echo "#undef  LINK_ADVANCE_SPEC"
+echo "#define LINK_ADVANCE_SPEC" \
+  "\"--rpath=$at/lib%{mabi=ilp32:ilp32}%{mabi=lp64:64} \
+  "--rpath=$at/usr/lib%{mabi=ilp32:ilp32}%{mabi=lp64:64} \
+  "--dynamic-linker=$at/lib/ld-linux-aarch64%{mbig-endian:_be}%{mabi=ilp32:_ilp32}.so.1\""
+   ) > advance-toolchain.h
+   else
+   echo "Unknown advance-toolchain $at"
+   exit 1
+   fi
+   fi
;;
 
alpha*-*-*)
diff --git a/gcc/config/aarch64/aarch64-linux.h b/gcc/config/aarch64/aarch64-linux.h
index 6ff2163b633..d76fa56c73e 100644
--- a/gcc/config/aarch64/aarch64-linux.h
+++ b/gcc/config/aarch64/aarch64-linux.h
@@ -47,7 +47,10 @@
-maarch64linux%{mabi=ilp32:32}%{mbig-endian:b}"
 
 
-#define LINK_SPEC LINUX_TARGET_LINK_SPEC AARCH64_ERRATA_LINK_SPEC
+#ifndef LINK_ADVANCE_SPEC
+#define LINK_ADVANCE_SPEC
+#endif
+#define LINK_SPEC LINUX_TARGET_LINK_SPEC AARCH64_ERRATA_LINK_SPEC LINK_ADVANCE_SPEC
 
 #define GNU_USER_TARGET_MATHFILE_SPEC \
   "%{Ofast|ffast-math|funsafe-math-optimizations:crtfastmath.o%s}"



Re: SPEC 2017 profiling question (502.gcc_r and 505.mcf_r fail)

2019-10-04 Thread Steve Ellcey
On Fri, 2019-10-04 at 15:58 -0500, Bill Schmidt wrote:
> 
> > Has anyone else seen these failures?
> 
> 
> Have you tried -fno-strict-aliasing?  There is a known issue with 
> spec_qsort() that affects both of these benchmarks.  See 
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201 .
> 
> Hope this helps,
> 
> Bill

Ah, of course, thank you.  I verified that this fixes my mcf failure,
gcc is still running.  I already had -fno-strict-aliasing for
perlbench, I should have figured out that it could be affecting other
tests too.

Steve Ellcey
sell...@marvell.com


SPEC 2017 profiling question (502.gcc_r and 505.mcf_r fail)

2019-10-04 Thread Steve Ellcey
I am curious if anyone has tried running 'peak' SPEC 2017 numbers using
profiling.  Now that the cactus lto bug has been fixed I can run all
the SPEC intrate and fprate benchmarks with '-Ofast -flto -march=native'
on my aarch64 box and get accurate results but when I try to use these
options along with -fprofile-generate/-fprofile-use I get two
verification errors: 502.gcc_r and 505.mcf_r. The gcc benchmark is
generating different assembly language for some of its tests and mcf is
generating different numbers that look too large to just be due to
unsafe math optimizations.

Has anyone else seen these failures?

Steve Ellcey
sell...@marvell.com


Boost build broken due to recent C++ change?

2019-09-24 Thread Steve Ellcey
A recent g++ change (I haven't tracked down exactly which one, but in
the last day or two) seems to have broken my boost build.  It is dying
with lots of errors like:

./boost/intrusive/list.hpp:1448:7:   required from here
./boost/intrusive/detail/list_iterator.hpp:93:41: error: call of overloaded
'get_next(boost::intrusive::list_node*&)' is ambiguous
   93 |   node_ptr p = node_traits::get_next(members_.nodeptr_);
In file included from ./boost/intrusive/list_hook.hpp:20,
 from ./boost/intrusive/list.hpp:20,
 from ./boost/fiber/context.hpp:29,
 from libs/fiber/src/algo/algorithm.cpp:9:

Has anyone else run into this?  I will try to create a cutdown test
case.

Steve Ellcey
sell...@marvell.com


ICE when compiling SPEC 526.blender_r benchmark (profiling)

2019-09-23 Thread Steve Ellcey


Before I submit a Bugzilla report or try to cut down a test case, has
anyone seen this problem when compiling the 526.blender_r benchmark from
SPEC 2017:

Compiling with '-Ofast -flto -march=native -fprofile-generate' on Aarch64:

during GIMPLE pass: vect
blender/source/blender/imbuf/intern/indexer.c: In function 'IMB_indexer_open':
blender/source/blender/imbuf/intern/indexer.c:157:20: internal compiler error: in execute_todo, at passes.c:2012
  157 | struct anim_index *IMB_indexer_open(const char *name)
  |^
0xa5ee2b execute_todo
../../gcc/gcc/passes.c:2012
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.


Re: Help with bug in GCC garbage collector

2019-08-19 Thread Steve Ellcey
On Mon, 2019-08-19 at 17:05 -0600, Jeff Law wrote:
> 
> There's a real good chance Martin Liska has already fixed this.  He's
> made a couple fixes in the last week or so with the interactions
> between
> the GC system and the symbol tables.
> 
> 
> 2019-08-15  Martin Liska  
> 
> PR ipa/91404
> * passes.c (order): Remove.
> (uid_hash_t): Likewise).
> (remove_cgraph_node_from_order): Remove from set
> of pointers (cgraph_node *).
> (insert_cgraph_node_to_order): New.
> (duplicate_cgraph_node_to_order): New.
> (do_per_function_toporder): Register all 3 cgraph hooks.
> Skip removed_nodes now as we know about all of them.
> 
> 
> The way I'd approach it would be to configure a compiler with
> --enable-checking=gc,gcac, just build it through stage1.  Then run your
> test through that compiler which should fail.  Then apply Martin's patch
> (or update to the head of the trunk), rebuild the stage1 compiler and
> verify it works.

I had already built a compiler with --enable-checking=gc,gcac, that did
not catch the bug (I still got a segfault).  I did update my sources
though and the bug does not happen at ToT so it looks like Martin's
patch did fix my bug.

Steve Ellcey
sell...@marvell.com


Help with bug in GCC garbage collector

2019-08-19 Thread Steve Ellcey
I was wondering if anyone could help me investigate a bug I am seeing
in the GCC garbage collector.  This bug (which may or may not be PR
89179) is causing a segfault in GCC, but when I try to create a
preprocessed source file, the bug doesn't trigger.  The problem is with
the garbage collector trying to mark some memory that has already been
freed.  I have tracked down the initial allocation to:

symbol_table::allocate_cgraph_symbol

It has:

node = ggc_cleared_alloc<cgraph_node> ();

to allocate a cgraph node.  With the GGC debugging on I see this
allocated:

Allocating object, requested size=360, actual=360 at 0x7029c210 on 0x41b148c0

then freed:

Freeing object, actual size=360, at 0x7029c210 on 0x41b148c0

And then later, while the garbage collector is marking nodes, I see:

Marking 0x7029c210

The garbage collector shouldn't be marking this node if it has already
been freed.

So I guess my main question is how do I figure out how the garbage
collector got to this memory location?  I am guessing some GTY pointer
is still pointing to it and hadn't got nulled out when the memory was
freed.  Does that seem like the most likely cause?

I am not sure why I am only running into this with one particular
application on my Aarch64 platform.  I am building it with -fopenmp,
which could have something to do with it (though there are no simd functions in 
the application).  The application is not that large as C++ programs go.

Steve Ellcey
sell...@marvell.com


Re: [EXT] Re: GCC missing -flto optimizations? SPEC lbm benchmark

2019-02-15 Thread Steve Ellcey
On Fri, 2019-02-15 at 17:48 +0800, Jun Ma wrote:
> 
> ICC is doing much more than GCC in ipo, especially memory layout
> optimizations.  See https://software.intel.com/en-us/node/522667.
> ICC is more aggressive in array transposition/structure splitting
> /field reordering.  However, these optimizations were removed from
> GCC a long time ago.
> As for the lbm_r case, IIRC a loop with memory accesses whose stride is
> 20 is the most time-consuming.  ICC will optimize the array (maybe
> structure?) and vectorize the loop under ipo.
>  
> Thanks
> Jun

Interesting.  I tried using '-qno-opt-mem-layout-trans' on ICC
along with '-Ofast -ipo' and that had no effect on the speed.  I also
tried '-no-vec' and that had no effect either.  The only thing that
slowed down ICC was '-ip-no-inlining' or '-fno-inline'.  I see that
'-Ofast -ipo' resulted in everything (except libc functions) getting
inlined into the main program when using ICC.  GCC did not do that, but
if I forced it to by using the always_inline attribute, GCC could
inline everything into main the way ICC does.  But that did not speed
up the GCC executable.

Steve Ellcey
sell...@marvell.com


GCC missing -flto optimizations? SPEC lbm benchmark

2019-02-14 Thread Steve Ellcey
I have a question about SPEC CPU 2017 and what GCC can and cannot do
with -flto.  As part of some SPEC analysis I am doing I found that with
-Ofast, ICC and GCC were not that far apart (especially on spec int rate;
spec fp rate showed a slightly larger difference).

But when I added -ipo to the ICC command and -flto to the GCC command,
the difference got larger.  In particular the 519.lbm_r was more than
twice as fast with ICC and -ipo, but -flto did not help GCC at all.

There are other tests that also show this type of improvement with -ipo
like 538.imagick_r, 544.nab_r, 525.x264_r, 531.deepsjeng_r, and
548.exchange2_r, but none are as dramatic as 519.lbm_r.  Anyone have
any idea on what ICC is doing that GCC is missing?  Is GCC just not
aggressive enough with its inlining?

Steve Ellcey
sell...@marvell.com


Failing aarch64 tests (PR 87763), no longer combining instructions with hard registers

2019-01-14 Thread Steve Ellcey
I have a question about PR87763, these are aarch64 specific tests
that are failing after r265398 (combine: Do not combine moves from hard
registers).

These tests are all failing when the assembler scan looks for
specific instructions and these instructions are no longer being
generated.  In some cases the new code is no worse than the old code
(just different) but in most cases the new code is a performance
regression from the old code.

Note that these tests are generally *very* small functions where the
body of the function consists of only 1 to 4 instructions so if we
do not combine instructions involving hard registers there isn't much,
if any, combining that can be done.  In larger functions this probably
would not be an issue and I think those cases are where the incentive
for this patch came from.  So my question is, what do we want to
do about these failures?

Find a GCC patch to generate the better code?  If it isn't done by
combine, how would we do it?  Peephole optimizations?

Modify the tests to pass with the current output?  In my opinion that
would leave the tests with little value.

Remove the tests?  Tests that search for specific assembly language
output are rather brittle to begin with and if they are no longer
serving a purpose after the combine patch, maybe we don't need them.

The tests in question are:

gcc.target/aarch64/combine_bfi_1.c
gcc.target/aarch64/insv_1.c
gcc.target/aarch64/lsl_asr_sbfiz.c
gcc.target/aarch64/sve/tls_preserve_1.c
gcc.target/aarch64/tst_5.c
gcc.target/aarch64/tst_6.c
gcc.dg/vect/vect-nop-move.c # Scanning combine dump file, not asm file


ISL tiling question (gcc.dg/graphite/interchange-3.c)

2019-01-11 Thread Steve Ellcey
Someone here was asking about GCC, ISL, and tiling and we looked at
the test gcc.dg/graphite/interchange-3.c on Aarch64.  When this
test is run the graphite pass output file contains the string 'not
tiled' and since the dg-final scan-tree-dump is just looking for
the string 'tiled', it matches and the test passes.

Is this intentional?  It seems like if we wanted to check that it was
not tiled we should grep for 'not tiled', not just 'tiled'.  If we
want grep to verify that the loop is tiled, then the current check for
tiling happening is wrong.

Steve Ellcey
sell...@marvell.com



Re: Bootstrap problem with genatautomata and sysroot

2018-11-26 Thread Steve Ellcey
On Mon, 2018-11-26 at 22:47 +0100, Andreas Schwab wrote:
> External Email
> 
> On Nov 26 2018, Steve Ellcey  wrote:
> 
> > I looked through the patches for the last couple of weeks to see if
> > I could identify
> > what changed here but I haven't found anything.  Maybe it was
> > something in
> > glibc that changed.
> 
> Most likely it only worked by accident so far.  Last week the first
> GLIBC_2.29 symbol has been added to libm.
> 
> Andreas.

Yup, I backed off those glibc changes and I could build, so that seems
to be the problem.  I guess if I want to build a complete toolchain
with bootstrap I will need to update the libm that is in /lib.

Steve Ellcey



Bootstrap problem with genatautomata and sysroot

2018-11-26 Thread Steve Ellcey
I am trying to do a bootstrap build of GCC using a newly built glibc in
a non-standard location on my aarch64 platform (thunderx).  This was working
up until a week or so ago but now I am running into a problem I haven't seen
before:

build/genautomata /home/sellcey/test-tot/src/gcc/gcc/common.md 
/home/sellcey/test-tot/src/gcc/gcc/config/aarch64/aarch64.md \
  insn-conditions.md > tmp-automata.c
build/genautomata: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by build/genautomata)
Makefile:2326: recipe for target 's-automata' failed

Has anyone else seen this?

I am building binutils and an initial GCC into a sysroot location, then I build 
glibc using that GCC and install it into that sysroot location and finally do
a full GCC build with bootstrap.  It is the final bootstrap build that fails.
If I do a non-bootstrap build of the final GCC then it works.

I looked through the patches for the last couple of weeks to see if I
could identify what changed here but I haven't found anything.  Maybe it
was something in glibc that changed.

Steve Ellcey
sell...@cavium.com



Re: Running the C++ library tests in the GCC testsuite

2018-11-07 Thread Steve Ellcey
On Wed, 2018-11-07 at 17:39 +, Joseph Myers wrote:
> External Email
> 
> On Wed, 7 Nov 2018, Steve Ellcey wrote:
> 
> > 
> > I have a question about the C++ library testsuite.  I built and
> > installed
> > a complete toolchain with GCC, binutils, and glibc in a directory
> > ($T) and
> > then I run the GCC testsuite with this command:
> > 
> > # cd to GCC object directory
> > make -j50 check RUNTESTFLAGS="--tool_opts  '--sysroot=$T -Wl,
> > --dynamic-linker=$T/lib/ld-linux-aarch64.so.1 -Wl,-rpath=$T/lib64
> > -Wl,-rpath=$T/usr/lib64'"
> I advise instead putting those options in your board file.
> 
> set_board_info ldflags "-Wl,whatever"
> 
> Note that you also need to make your board file set LOCPATH and GCONV_PATH
> appropriately (pointing the $sysroot/usr/lib/locale and
> $sysroot/usr/lib64/gconv respectively) for libstdc++ locale tests to work
> correctly with such a non-default glibc.  That would be code in your
> _load procedure in the board file (or in such a procedure in a
> file it loads via load_generic_config, etc.).

I copied unix.exp to unix-sysroot.exp and added this to it:

if {[info exists env(DEJAGNU_UNIX_SYSROOT_FLAGS)]} {
set_board_info ldflags "$env(DEJAGNU_UNIX_SYSROOT_FLAGS)"
}

I figured I would deal with LOCPATH and GCONV_PATH later.  When
I do a partial testrun, I don't get any failures but I do get some
new unresolved tests like this:

Download of ./2108-1.exe to unix-sysroot failed.
UNRESOLVED: gcc.dg/2108-1.c execution test

Have you ever seen this error?

Steve Ellcey
sell...@cavium.com



Running the C++ library tests in the GCC testsuite

2018-11-07 Thread Steve Ellcey


I have a question about the C++ library testsuite.  I built and installed
a complete toolchain with GCC, binutils, and glibc in a directory ($T) and
then I run the GCC testsuite with this command:

# cd to GCC object directory
make -j50 check RUNTESTFLAGS="--tool_opts  '--sysroot=$T 
-Wl,--dynamic-linker=$T/lib/ld-linux-aarch64.so.1 -Wl,-rpath=$T/lib64 
-Wl,-rpath=$T/usr/lib64'"

When I look at the gcc.log, g++.log, gfortran.log files I see the -Wl options
that I specified being used when the tests are compiled, but when I look at
the C++ library test log file
(aarch64-linux-gnu/libstdc++-v3/testsuite/libstdc++.log) I do not see
the --rpath or other flags getting used.  Is this expected?  I have a
few tests that fail because of this and die with:

./check_nan.exe: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.27' not found (required by ./check_nan.exe)

If I rerun by hand and add the --rpath, etc. flags the test works but I
am not sure why the test harness did not add them itself.
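When the harness omits the flags, the same effect can be had by invoking the non-default dynamic loader directly instead of relying on --rpath/--dynamic-linker baked in at link time.  A hedged sketch: the toolchain root below is a placeholder for the $T directory in the post, and the command is only echoed, not run:

```shell
#!/bin/sh
# Invoke the non-default ld.so by hand, pointing it at the toolchain's
# library directories.  T is a placeholder for the install root from the
# post; the invocation is an illustrative sketch and is only echoed here.
T=/opt/mytoolchain
cmd="$T/lib/ld-linux-aarch64.so.1 --library-path $T/lib64:$T/usr/lib64 ./check_nan.exe"
echo "$cmd"
```

Running the loader explicitly like this bypasses the executable's PT_INTERP and DT_RPATH entirely, which is handy for checking whether a failure is really a search-path problem.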

Steve Ellcey
sell...@cavium.com


Re: GCC segfault while compiling SPEC 2017 fprate tests

2018-11-06 Thread Steve Ellcey
On Wed, 2018-11-07 at 00:16 +0700, Arseny Solokha wrote:
> 
> This is probably PR87889, already fixed on trunk.

Yup, that was the problem.  I have updated my sources and things are
building now.  Thanks for the info.

Steve Ellcey


GCC segfault while compiling SPEC 2017 fprate tests

2018-11-06 Thread Steve Ellcey
I was doing some benchmarking with SPEC 2017 fprate on aarch64
(Thunderx2) and I am getting some segfaults from GCC while compiling.

I am working with delta to try and cut down one of the test cases
but I was wondering if anyone else has seen this problem.  The
three tests that segfault while compiling are 510.parest_r, 511.povray_r,
and 521.wrf_r.

If I compile 510.parest_r with -Ofast -std=gnu++17 -fpermissive
I get this segfault:

during GIMPLE pass: vect
source/numerics/histogram.cc: In member function 'void 
dealii::Histogram::evaluate(const std::vector >&, const 
std::vector&, unsigned int, dealii::Histogram::IntervalSpacing) [with 
number = float]':
source/numerics/histogram.cc:54:6: internal compiler error: Segmentation fault
   54 | void Histogram::evaluate (const std::vector > &values,
  |  ^
0xdfd48f crash_signal
/home/sellcey/gcc-tot/src/gcc/gcc/toplev.c:325
0x108a6e4 contains_struct_check(tree_node*, tree_node_structure_enum, char 
const*, int, char const*)
/home/sellcey/gcc-tot/src/gcc/gcc/tree.h:3231
0x108a6e4 slpeel_duplicate_current_defs_from_edges
/home/sellcey/gcc-tot/src/gcc/gcc/tree-vect-loop-manip.c:984
0x108c87b slpeel_tree_duplicate_loop_to_edge_cfg(loop*, loop*, edge_def*)
/home/sellcey/gcc-tot/src/gcc/gcc/tree-vect-loop-manip.c:1074
0x1090ba3 vect_do_peeling(_loop_vec_info*, tree_node*, tree_node*, tree_node**, 
tree_node**, tree_node**, int, bool, bool)
/home/sellcey/gcc-tot/src/gcc/gcc/tree-vect-loop-manip.c:2580
0x108071b vect_transform_loop(_loop_vec_info*)
/home/sellcey/gcc-tot/src/gcc/gcc/tree-vect-loop.c:8243
0x10a311f try_vectorize_loop_1
/home/sellcey/gcc-tot/src/gcc/gcc/tree-vectorizer.c:965
0x10a3adb vectorize_loops()
/home/sellcey/gcc-tot/src/gcc/gcc/tree-vectorizer.c:1097
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.


Register allocation question (process_bb_lives/check_pseudos_live_through_calls/hard_regno_call_part_clobbered)

2018-10-25 Thread Steve Ellcey
I have a question about process_bb_lives and check_pseudos_live_through_calls.

I am trying to optimize aarch64 vector functions, which do not partially
clobber vector registers the way that 'normal' functions do.  To do this
I am looking at modifying the hard_regno_call_part_clobbered target
function to take an instruction as an argument so that it could look
at what function is being called and return a different value based on
an attribute of that function.  If the instruction is NULL it defaults
to the standard conservative behavour, but if it is a vector call it
will return false where the existing function (may) return true.

The problem I am having is that check_pseudos_live_through_calls calls
targetm.hard_regno_call_part_clobbered.  To pass the call instruction
into targetm.hard_regno_call_part_clobbered, I need to pass it in to
check_pseudos_live_through_calls first.  This works for three of the
four check_pseudos_live_through_calls that process_bb_lives makes.

The problem I am having is with the fourth (and last) call to
check_pseudos_live_through_calls from process_bb_lives.  It is
not in the loop that is processing each instruction in a basic
block so I do not have a specific call instruction to pass in.

I do not understand the purpose of the loop that contains this
particular call to check_pseudos_live_through_calls so I am not
sure what (if anything) I can do to address my problem.

Can anyone help me understand what the loop below in process_bb_lives
is doing and why it is needed?

  EXECUTE_IF_SET_IN_BITMAP (df_get_live_in (bb), FIRST_PSEUDO_REGISTER, j, bi)
{
  if (sparseset_cardinality (pseudos_live_through_calls) == 0)
break;
  if (sparseset_bit_p (pseudos_live_through_calls, j))
check_pseudos_live_through_calls (j, last_call_used_reg_set);
    }


Steve Ellcey
sell...@cavium.com


LP64, unsigned int, vectorization, and PR 61247

2018-10-04 Thread Steve Ellcey
I was looking at PR tree-optimization/61247, where a loop with an unsigned
int index on an LP64 platform was not getting vectorized and I noticed an
odd thing.  In the function below, if I define N as 1000 or 1, the
loop does get vectorized, even in LP64 mode.  But if I define N as 10,
the loop does not get vectorized in LP64 mode.  I have not been able to
figure out why this is or where the decision to vectorize (or not) is
getting made.  Does anyone have an idea?  10 is not a large enough value
to hit the limit of a 32 bit int or unsigned int value so why can't it be
vectorized like the other two cases?

In the original test case that I added to this PR, N is an argument and
we don't know what value it has.  It seems like this could be vectorized
by including a test to make sure that the value is not larger than MAXINT
and thus could not wrap when doing the array indexing.

Steve Ellcey
sell...@cavium.com



/* define N as 1000 - gets vectorized  */
/* define N as 1 - gets vectorized  */
/* define N as 10 - does not get vectorized  */

#define N 10

typedef unsigned int TYPE;
void f(int *C, int *A, int val)
{
TYPE i,j;
for (i=0; i

GCC regression question (pr77445-2.c & ssa-dom-thread-7.c)

2018-06-28 Thread Steve Ellcey
Does anyone know anything about these failures that I see on my aarch64
build & test?

FAIL: gcc.dg/tree-ssa/pr77445-2.c scan-tree-dump-not thread3 "not considered"
FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps threaded"

The both seem to have started showing up on May 20th and I don't see any
bugzilla report on them.  Before I try and track down what checkin caused
them and whether or not they were caused by the same checkin I thought I
would see if anyone had already done that.

Steve Ellcey
sell...@cavium.com


Wabi warnings during GCC build

2018-06-27 Thread Steve Ellcey
Are other people building GCC seeing these messages during the build:

cc1plus: warning: -Wabi won't warn about anything [-Wabi]
cc1plus: note: -Wabi warns about differences from the most up-to-date ABI, 
which is also used by default
cc1plus: note: use e.g. -Wabi=11 to warn about changes from GCC 7

It doesn't seem to be causing any problems in the build (even bootstrap)
but I am wondering why it is there.  It seems to be happening when
using the latest (just built) g++ to build libstdc++ so it shouldn't
be related to the system GCC that I am using to build with.

I didn't find any mention of it in the gcc or libstdc++ mailing lists
when I looked or find any bugzilla report.

Steve Ellcey


Re: How to get GCC on par with ICC?

2018-06-21 Thread Steve Ellcey
On Wed, 2018-06-20 at 17:11 -0400, NightStrike wrote:
> 
> If I could perhaps jump in here for a moment...  Just today I hit upon
> a series of small (in lines of code) loops that gcc can't vectorize,
> and intel vectorizes like a madman.  They all involve a lot of heavy
> use of std::vector>.  Comparisons were with gcc
> 8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
> as sched_FIFO, mlockall, affinity set to its own core, and all
> interrupts vectored off that core.  So, as close to not-noisy as
> possible.

There are a quite a number of bugzilla reports with examples where GCC
does not vectorize a loop.  I wonder if this example is related to PR
61247.

Steve Ellcey


Re: Aarch64 / simd builtin question

2018-06-08 Thread Steve Ellcey
On Fri, 2018-06-08 at 22:34 +0100, James Greenhalgh wrote:
> 
> Are you in an environment where you can use arm_neon.h ? If so, that
> would
> be the best approach:
> 
>   float32x4_t in;
>   float64x2_t low = vcvt_f64_f32 (vget_low_f64 (in));
>   float64x2_t high = vcvt_high_f64_f32 (in);
> 
> If you can't use arm_neon.h for some reason, you can look there for
> inspiration of how to write your own versions of these intrinsics.
> 
> Thanks,
> James

Thanks, that is helpful though I think you meant vget_low_f32 in
the first line instead of vget_low_f64.  With that change I get the
code I want/expect.  I hadn't seen the __GETLOW macro in the neon
header file.

Steve Ellcey


Re: How to get GCC on par with ICC?

2018-06-08 Thread Steve Ellcey
On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
> 
> When we do our own comparisons of GCC vs. ICC on benchmarks
> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
> (in fact it even trails in some benchmarks) unless you get to
> "SPEC tricks" like data structure re-organization optimizations that
> probably never apply in practice on real-world code (and people
> should fix such things at the source level being pointed at them
> via actually profiling their codes).

Richard,

I was wondering if you have any more details about these comparisions
you have done that you can share?  Compiler versions, options used,
hardware, etc.  Also, were there any tests that stood out in terms of
icc outperforming GCC?

I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
for gcc.

The int rate numbers (running 1 copy only) were not too bad, GCC was
only about 2% slower and only 525.x264_r seemed way slower with GCC.
The fp rate numbers (again only 1 copy) showed a larger difference, 
around 20%.  521.wrf_r was more than twice as slow when compiled with
GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
significant slowdowns when compiled with GCC vs. ICC.

Steve Ellcey
sell...@cavium.com


Aarch64 / simd builtin question

2018-06-08 Thread Steve Ellcey
I have a question about the Aarch64 simd instructions and builtins.

I want to unpack a __Float32x4 (V4SF) variable into two __Float64x2
variables.  I can get the upper part with:

__Float64x2_t a = __builtin_aarch64_vec_unpacks_hi_v4sf (x);

But I can't seem to find a builtin that would get me the lower half.
I assume this is due to the issue in aarch64-simd.md around the
vec_unpacks_lo_ instruction:

;; ??? Note that the vectorizer usage of the vec_unpacks_[lo/hi] patterns
;; is inconsistent with vector ordering elsewhere in the compiler, in that
;; the meaning of HI and LO changes depending on the target endianness.
;; While elsewhere we map the higher numbered elements of a vector to
;; the lower architectural lanes of the vector, for these patterns we want
;; to always treat "hi" as referring to the higher architectural lanes.
;; Consequently, while the patterns below look inconsistent with our
;; other big-endian patterns their behavior is as required.

Does this mean we can't have a __builtin_aarch64_vec_unpacks_lo_v4sf
builtin that will work in big endian and little endian modes?
It seems like it should be possible but I don't really understand 
the details of the implementation enough to follow the comment and
all its implications.

Right now, as a workaround, I use:

static inline __Float64x2_t __vec_unpacks_lo_v4sf (__Float32x4_t x)
{
  __Float64x2_t result;
  __asm__ ("fcvtl %0.2d,%1.2s" : "=w"(result) : "w"(x) : /* No clobbers */);
  return result;
}

But a builtin would be cleaner.

Steve Ellcey
sell...@cavium.com


__builtin_isnormal question

2018-06-04 Thread Steve Ellcey
Is there a bug in __builtin_isnormal or am I just confused as to what it
means?  There doesn't seem to be any actual definition/documentation for
the function.  __builtin_isnormal(0.0) is returning false.  That seems
wrong to me, 0.0 is a normal (as opposed to a denormalized) number isn't
it?  Or is zero special?

Steve Ellcey
sell...@cavium.com

#include <stdio.h>
#include <float.h>
#include <math.h>
int main()
{
double x;
x = 0.0;
printf("%e %e %e\n", x, DBL_MIN, DBL_MAX);
printf("normal is %s\n", __builtin_isnormal(x) ? "TRUE" : "FALSE");
x = 1.0;
printf("%e %e %e\n", x, DBL_MIN, DBL_MAX);
printf("normal is %s\n", __builtin_isnormal(x) ? "TRUE" : "FALSE");
return 0;
}

% gcc x.c -o x
% ./x
0.00e+00 2.225074e-308 1.797693e+308
normal is FALSE
1.00e+00 2.225074e-308 1.797693e+308
normal is TRUE


Why is REG_ALLOC_ORDER not defined on Aarch64

2018-05-25 Thread Steve Ellcey
I was curious if there was any reason that REG_ALLOC_ORDER is not
defined for Aarch64.  Has anyone tried this to see if it could help
performance?  It is defined for many other platforms.

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-24 Thread Steve Ellcey
On Wed, 2018-05-16 at 22:11 +0100, Richard Sandiford wrote:
> 
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED is the only current way
> of saying that an rtl instruction preserves the low part of a
> register but clobbers the high part.  We would need something like
> Alan H's CLOBBER_HIGH patches to do it using explicit clobbers.
> 
> Another approach would be to piggy-back on the -fipa-ra
> infrastructure
> and record that vector PCS functions only clobber Q0-Q7.  If -fipa-ra
> knows that a function doesn't clobber Q8-Q15 then that should
> override
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  (I'm not sure whether it does
> in practice, but it should :-)  And if it doesn't that's a bug that's
> worth fixing for its own sake.)
> 
> Thanks,
> Richard

Alan,

I have been looking at your CLOBBER_HIGH patches to see if they
might be helpful in implementing the ARM SIMD Vector ABI in GCC.
I have also been looking at the -fipa-ra flag and how it works.

I was wondering if you considered using the ipa-ra infrastructure
for the SVE work that you are currently trying to support with 
the CLOBBER_HIGH macro?

My current thought for the ABI work is to mark all the floating
point / vector registers as caller saved (the lower half of V8-V15
are currently callee saved) and remove
TARGET_HARD_REGNO_CALL_PART_CLOBBERED.
This should work but would be inefficient.

The next step would be to split get_call_reg_set_usage up into
two functions so that I don't have to pass in a default set of
registers.  One function would return call_used_reg_set by
default (but could return a smaller set if it had actual used
register information) and the other would return regs_invalidated_by_call
by default (but could also return a smaller set).

Next I would add a 'largest mode used' array to call_cgraph_rtl_info
structure in addition to the current function_used_regs register
set.

Then I could turn the get_call_reg_set_usage replacement functions
into target specific functions and with the information in the
call_cgraph_rtl_info structure and any simd attribute information on
a function I could modify what registers are really being used/invalidated
without being saved.

If the called function only uses the bottom half of a register it would not
be marked as used/invalidated.  If it uses the entire register and the
function is not marked as simd, then the register would marked as
used/invalidated.  If the function was marked as simd the register would not
be marked because a simd function would save both the upper and lower halves
of a callee saved register (whereas a non simd function would only save the
lower half).

Does this sound like something that could be used in place of your 
CLOBBER_HIGH patch?

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-16 Thread Steve Ellcey
On Wed, 2018-05-16 at 17:30 +0100, Richard Earnshaw (lists) wrote:
> On 16/05/18 17:21, Steve Ellcey wrote:
> > 
> > It doesn't look like GCC has any existing mechanism for having different
> > sets of caller saved/callee saved registers depending on the function
> > attributes of the calling or called function.
> > 
> > Changing what registers a callee function saves and restores shouldn't
> > be too difficult since that can be done when generating the prologue
> > and epilogue code but changing what registers a caller saves/restores
> > when doing the call seems trickier.  The macro
> > TARGET_HARD_REGNO_CALL_PART_CLOBBERED doesn't know anything about the
> > function being called.  It returns true/false depending on just the
> > register number and mode.
> > 
> > Steve Ellcey
> > sell...@cavium.com
> > 
> 
> Actually, we can.  See, for example, the attribute((pcs)) for the ARM
> port.  I think we could probably handle this automagically for the SVE
> vector calling convention in AArch64.
> 
> R.

Interesting, it looks like one could use aarch64_emit_call to emit
extra use_reg / clobber_reg instructions but in this case we want to
tell the caller that some registers are not being clobbered by the
callee.  The ARM port does not
define TARGET_HARD_REGNO_CALL_PART_CLOBBERED and that seemed like one
of the most problematic issues with Aarch64.  Maybe we would have to
undefine this for aarch64 and use explicit clobbers to say what
floating point registers / vector registers are clobbered for each
call?  I wonder how that would affect register allocation.

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-16 Thread Steve Ellcey
On Tue, 2018-05-15 at 18:29 +, Francesco Petrogalli wrote:

> Hi Steve,
> 
> I am happy to let you know that the Vector Function ABI for AArch64
> is now public and available via the link at [1].
> 
> Don’t hesitate to contact me in case you have any questions.
> 
> Kind regards,
> 
> Francesco
> 
> [1] https://developer.arm.com/products/software-development-tools/hpc
> /arm-compiler-for-hpc/vector-function-abi
> 
> > 
> > Steve Ellcey
> > sell...@cavium.com

Thanks for publishing this Francesco, it looks like the main issue for
GCC is that the Vector Function ABI has different caller saved / callee
saved register conventions than the standard ARM calling convention.

If I understand things correctly, in the standard calling convention
the callee will only save the bottom 64 bits of V8-V15 and so the
caller needs to save those registers if it is using the top half.  In
the Vector calling convention the callee will save all 128 bits of
these registers (and possibly more registers) so the caller does not
have to save these registers at all, even if it is using all 128 bits
of them.

It doesn't look like GCC has any existing mechanism for having different
sets of caller saved/callee saved registers depending on the function
attributes of the calling or called function.

Changing what registers a callee function saves and restores shouldn't
be too difficult since that can be done when generating the prologue
and epilogue code but changing what registers a caller saves/restores
when doing the call seems trickier.  The macro
TARGET_HARD_REGNO_CALL_PART_CLOBBERED doesn't know anything about the
function being called.  It returns true/false depending on just the
register number and mode.

Steve Ellcey
sell...@cavium.com


Another libmvect question (vectorizing sincos calls)

2018-03-27 Thread Steve Ellcey
Here is another libmvec question.  While testing on x86 I could easily make
a loop with calls to sin, cos, log, exp, or pow and see them vectorized when
I compile with -Ofast.  But if I try making a loop with sincos, it does not
get vectorized.  Is that to be expected?

I compiled:

#define _GNU_SOURCE
#include <math.h>
#define SIZE 1
double x[SIZE], y[SIZE], z[SIZE];
void doit(void) { for (int i = 0; i < SIZE; i++) sincos(x[i],&(y[i]),&(z[i])); }

I see the 'simd' attribute on sincos but I do not get any calls to
_ZGVcN4vvv_sincos, only to sincos.  When I look at the tree dump
files I see a call to __builtin_cexpi and not to sincos or __builtin_sincos,
is that confusing the vectorizer?  When expanded into rtl the call to sincos
does show up, but it never shows up in the tree dumps.

I also tried this program:

#define _GNU_SOURCE
#include <math.h>
#define SIZE 1
double x[SIZE], y[SIZE];
void doit(void) { for (int i = 0; i < SIZE; i++) x[i] = sin(y[i]) + cos(y[i]); }

Which generated a sincos call, but also did not vectorize it.

Is there any way to get GCC to vectorize a loop with sincos in it?

Steve Ellcey
sell...@cavium.com


Re: Can I use -Ofast without libmvec

2018-03-22 Thread Steve Ellcey
On Thu, 2018-03-22 at 11:42 -0700, H.J. Lu wrote:
> On Thu, Mar 22, 2018 at 11:08 AM, Steve Ellcey wrote:
> > 
> > I have a question about the math vector library routines in
> > libmvec.
> > If I compile a program on x86 with -Ofast, something like:
> > 
> > void foo(double * __restrict x, double * __restrict y, double *
> > __restrict z)
> > {
> > for (int i = 0; i < 1000; i++) x[i] = sin(y[i]);
> > }
> > 
> > I get a call to the vector sin routine _ZGVbN2v_sin.  That is fine, but
> > is there some way to compile with -Ofast and not use the libmvec vector
> > routines?  I have tried -fopenmp, -fopenmp-simd, -fno-openmp, and -fno-
> > openmp-simd and I always get a call to _ZGVbN2v_sin.  Is there any way
> > to stop the use of the vectorized calls (without turning off -Ofast)?
> Have you tried -lm?

It isn't a question of not working.  Everything works and links and
runs, but I would just like to know if there is any way to compile my
program in such a way that GCC does not generate calls to the libmvec
routines.

I am doing some performance analysis and would like to know how much
(or little) having these vectorized routines help in various
benchmarks.

Steve Ellcey


Can I use -Ofast without libmvec

2018-03-22 Thread Steve Ellcey
I have a question about the math vector library routines in libmvec.
If I compile a program on x86 with -Ofast, something like:

void foo(double * __restrict x, double * __restrict y, double * __restrict z)
{
for (int i = 0; i < 1000; i++) x[i] = sin(y[i]);
}

I get a call to the vector sin routine _ZGVbN2v_sin.  That is fine, but
is there some way to compile with -Ofast and not use the libmvec vector
routines?  I have tried -fopenmp, -fopenmp-simd, -fno-openmp, and -fno-
openmp-simd and I always get a call to _ZGVbN2v_sin.  Is there any way
to stop the use of the vectorized calls (without turning off -Ofast)?

Steve Ellcey
sell...@cavium.com


Vectorization / libmvec / libgomp question

2018-02-23 Thread Steve Ellcey
I have a question about loop vectorization, OpenMP, and libmvec.  I am
experimenting with this on Aarch64 and looking at what exists on x86
and trying to understand the relationship (if there is one) between the
vector library (libmvec) and OpenMP (libgomp).

On x86, an OpenMP loop with a sin() call will call a function whose
name is defined by the OpenMP x86 ABI (_ZGVbN2v_sin) but if I just
vectorize a normal, non-OpenMP loop containing a sin() call and specify
'-mveclibabi=[svml|acml]' then I get a call to some thing else
(vmldSin2 or __vrd2_sin).  If I do not specify '-mveclibabi' then a
loop with a sin() in it doesn't get vectorized at all.

On Aarch64, where there is no existing vector library (at least not one
recognized by GCC), what do I want to do?  Obviously, for OpenMP, I
would call the name specified by an Aarch64 OpenMP ABI (not yet
publicly defined).  It would be something like '_ZGVbN2v_sin'.  But
what about the case of a vectorized loop that does not use OpenMP.  Is
it reasonable (desirable?) in that case to call the OpenMP routine
(_ZGVbN2v_sin) that is defined in libgomp or should I call a different
routine with a different name that is in libmvec?  Or should I put
'_ZGVbN2v_sin' in libmvec and have libgomp be dependent on libmvec?  Do
I need a -mveclibabi flag for GCC if there is only one vector ABI for
Aarch64?  I might still want to control whether vector functions are
called while vectorizing a loop in the absence of OpenMP.

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-02-09 Thread Steve Ellcey
James,

This is a follow-up to https://gcc.gnu.org/ml/gcc/2017-03/msg00109.html
 where you said:

| Hi Ashwin,
| 
| Thanks for the question. ARM has defined a vector function ABI, based
| on the Vector Function ABI Specification you linked below, which
| is designed to be suitable for both the Advanced SIMD and Scalable
| Vector Extensions. There has not yet been a release of this document
| which I can point you at, nor can I give you an estimate of when the
| document will be published.

I was wondering if the function vector ABI has been published yet and
if so, where I could find it.

Steve Ellcey
sell...@cavium.com


Re: poly_uint64 / TYPE_VECTOR_SUBPARTS question

2018-02-09 Thread Steve Ellcey
On Fri, 2018-02-09 at 17:58 +, Richard Sandiford wrote:
> 
> OK, so is this in aarch64_builtin_vectorized_function?  to_constant
> isn't valid there because the code isn't specific to Advanced SIMD.
> The way to check for V2xx is:
> 
>   known_eq (TYPE_VECTOR_SUBPARTS (...), 2U)
> 
> Thanks,
> Richard

It is in aarch64_builtin_vectorized (actually in a new function called
from there).  What I am using right now to limit myself to V2DF is:

/* We only handle single argument V2DF functions for now.  */

  el_mode = TYPE_MODE (TREE_TYPE (type_out));
  in_mode = TYPE_MODE (TREE_TYPE (type_in));
  if (el_mode != in_mode || el_mode != DFmode)
return NULL_TREE;

  if (!TYPE_VECTOR_SUBPARTS (type_out).is_constant (&n)
  || !TYPE_VECTOR_SUBPARTS (type_in).is_constant (&in_n))
return NULL_TREE;

  if (n != in_n || n != 2)
return NULL_TREE;



Steve Ellcey
sell...@cavium.com


Re: poly_uint64 / TYPE_VECTOR_SUBPARTS question

2018-02-09 Thread Steve Ellcey
On Fri, 2018-02-09 at 17:15 +, Richard Sandiford wrote:
> 
> 
> If the code you're adding is inherently specific to Advanced SIMD
> (not sure, but guessing yes based on this being aarch64-builtins.c)
> then like Kugan says, using to_constant is OK.
> 
> Which code are you copying over?
> 
> Thanks,
> Richard

OK, I found the is_constant member function and used that.  I was
looking at the i386 code that generates calls to libmvec.  Someone
here wrote vector sin/cos functions for V2DF and I want to test them
out to see if they would work with GCC/libmvec on aarch64.

Steve Ellcey
sell...@cavium.com


poly_uint64 / TYPE_VECTOR_SUBPARTS question

2018-02-08 Thread Steve Ellcey
I have a question about the poly_uint64 type and the TYPE_VECTOR_SUBPARTS
macro.  I am trying to copy some code from i386.c into my aarch64 build
that is basically:

int n;
n = TYPE_VECTOR_SUBPARTS (type_out);

And it is not compiling for me, I get:

/home/sellcey/gcc-vectmath/src/gcc/gcc/config/aarch64/aarch64-builtins.c:1504:37:
 error: cannot convert ‘poly_uint64’ {aka ‘poly_int<2, long unsigned int>’} to 
‘int’ in assignment
   n = TYPE_VECTOR_SUBPARTS (type_out);

My first thought was that I was missing a header file but I put
all the header includes that are in i386.c into aarch64-builtins.c
and it still does not compile.  It works on the i386 side.  It looks
like poly-int.h and poly-int-types.h are included by coretypes.h
and I include that header file so I don't understand why this isn't
compiling and what I am missing.  Any help?

Steve Ellcey
sell...@cavium.com


nexttoward/nextafter attribute question

2017-12-11 Thread Steve Ellcey

I have a question about the attributes that GCC is putting on the 
nexttoward/nextafter builtin functions.  This issue started when
ToT glibc ran into a problem when building with ToT GCC after Martin
Sebor added a patch to GCC that tightened up the attribute checking
that GCC does.

The line below no longer compiles cleanly with -Wall -frounding-math 
-fno-math-errno:

extern double nexttoward (double __x, long double __y) __attribute__ 
((__nothrow__ )) __attribute__ ((__const__));


When I look at the GCC sources it appears that the attribute on this function
(and on the other nexttoward and nextafter functions) changes based on the
GCC arguments:

-frounding-math -fno-math-errno: __attribute__ ((__pure__))
-fno-math-errno: __attribute__ ((__const__))
-frounding-math: __attribute__ ((__const__))
:    __attribute__ ((__const__))

I have several questions about this; one is why rounding-math affects
the attribute.  The other is why the function would be pure with
-fno-math-errno but const otherwise.  I would think that the -fno-math-errno
version would be const (stricter than pure) since it is not setting
errno.  Finally, how can I check if -frounding-math is set in the compiler
or not.  If I want the glibc sources to match the GCC attributes I need
to know if -frounding-math is set but there does not seem to be a way
to do that.  Using -fno-math-errno will set __NO_MATH_ERRNO__ but there
does not seem to be an equivalent macro for -frounding-math.

Steve Ellcey
sell...@cavium.com


GCC testing, precompiled headers, and CFLAGS_FOR_TARGET question

2017-11-03 Thread Steve Ellcey
I have a question about gcc testing, precompiled header tests and the
CFLAGS_FOR_TARGET option to RUNTESTFLAGS.

I am building a complete native aarch64 toolchain (binutils, gcc, glibc) in
a non-standard location, I configure binutils and gcc with
--sysroot=/mylocation, and I want to run the gcc testsuite.

Everything mostly works, but when gcc tests are actually run, some tests
fail because the default dynamic linker, libc, and libm are used instead
of the ones I have in /mylocation/lib64.

I can work around this with:

make check RUNTESTFLAGS="CFLAGS_FOR_TARGET='-Wl,--dynamic-linker=/mylocation/lib/ld-linux-aarch64.so.1 -Wl,-rpath=/mylocation/lib64'"

But when I do this, I noticed that a number of pch tests fail.  What I found
is that when I run the pch testsuite, it executes:

/home/sellcey/tot/obj/gcc/gcc/xgcc -B/home/sellcey/tot/obj/gcc/gcc/ ./common-1.h
  -fno-diagnostics-show-caret -fdiagnostics-color=never -O0 -g
  -Wl,--dynamic-linker=/mylocation/lib/ld-linux-aarch64.so.1
  -Wl,-rpath=/mylocation/lib64 -o common-1.h.gch

And this tries to create an executable instead of a pre-compiled header.
If I run the same command without the -Wl flags then GCC creates the
pre-compiled header that I need for testing.

Is it expected that GCC changes from creating a pch to creating an executable
when it sees -Wl flags?  Is there a flag that we can use to explicitly tell GCC
that we want to create a precompiled header in this instance?

Steve Ellcey
sell...@cavium.com



Re: [RFC] type promotion pass

2017-09-19 Thread Steve Ellcey
On Tue, 2017-09-19 at 11:13 +1000, Kugan Vivekanandarajah wrote:

> > https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00929.html
> I tried the testases you have in the patch with type promotion. Looks
> like forwprop is reversing the promotion there. I haven't looked in
> detail yet but -fno-tree-forwprop seems to remove 6 "and" from the
> test case. I have a slightly different version to what Prathamseh has
> posted and hope that there isn't any difference here.
> 
> Thanks,
> Kugan

I don't think there is any way the type promotion pass can help with
the test case I have for pr77729. The 'and' operations go away in
forwprop but there are still type conversions like 'unsigned_int_var =
(unsigned int) char_var' and that is going to generate an 'and'
instruction (or an extend instruction) in RTL unless we know that
char_var is stored in a register whose upper bits have already been
zeroed out somehow.  In my test case the only way to know that is to
know that the load byte instruction zeroed them out.

Steve Ellcey
sell...@cavium.com


Re: [RFC] type promotion pass

2017-09-18 Thread Steve Ellcey
On Mon, 2017-09-18 at 23:29 +0530, Prathamesh Kulkarni wrote:
> 
> Hi Steve,
> The patch is currently based on r249469. I will rebase it on ToT and
> look into the build failure.
> Thanks for pointing it out.
> 
> Regards,
> Prathamesh

OK, I applied it to that version successfully.  The thing I wanted to
check was to see if this helped with PR target/77729.  It does not,
so I think even with this patch we would need my patch to address the
issue of having GCC recognize that ldrb/ldrh zero out the top of a
register and thus we do not need to mask it out later.

https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00929.html

Steve Ellcey
sell...@cavium.com





Re: [RFC] type promotion pass

2017-09-18 Thread Steve Ellcey
On Fri, 2017-09-15 at 12:22 +, Wilco Dijkstra wrote:

Wilco or Prathamesh,

I could not apply this patch (cleanly) to ToT.  match.pd did not apply,
I think I fixed that.  The cfgexpand.c patch applied but will not
build.  I get this error:

../../../src/gcc/gcc/cfgexpand.c: In function ‘rtx_def*
expand_debug_expr(tree)’:
../../../src/gcc/gcc/cfgexpand.c:5130:18: error: cannot convert
‘opt_machine_mode {aka opt_mode}’ to ‘machine_mode’ in
assignment
   inner_mode = mode_for_size (INTVAL (op1), MODE_INT, 0);


I can't quite figure out what change needs to be made to this line to
make it compile.  I do see that mode_for_size has been changed.
I tried using int_mode_for_size but that doesn't work and I tried
using '.require ()' but that didn't work either.

inner_mode = int_mode_for_size (INTVAL (op1), 0);  /* This did not work.  */
inner_mode = mode_for_size (INTVAL (op1), MODE_INT, 0).require (); /* This did 
not work */

Steve Ellcey
sell...@cavium.com




What to do about all the gcc.dg/guality test failures?

2017-08-08 Thread Steve Ellcey
I was wondering if something needs to be done about the gcc.dg/guality tests.

There are two main issues I see with these tests, one is that they are often
not run during testing and so failures do not show up.  I looked into this
and found that, at least on my ubuntu 16.04 system, the kernel parameter
kernel.yama.ptrace_scope is set to 1 by default.  This limits the use of
ptrace to direct child processes and causes the guality tests to not run
on my system.  They also don't show up as failures, all you get is a message
in the test log that says 'gdb: took too long to attach'.  After changing this
to 0, the guality tests do get run.

The second problem is that many of the tests fail when they are run.
For example, looking at some August test runs:

x86_64 failures:  https://gcc.gnu.org/ml/gcc-testresults/2017-08/msg00651.html
aarch64 failures: https://gcc.gnu.org/ml/gcc-testresults/2017-08/msg00603.html
mips64 failures:  https://gcc.gnu.org/ml/gcc-testresults/2017-08/msg00527.html
s390x failures:   https://gcc.gnu.org/ml/gcc-testresults/2017-08/msg00509.html

These all show many failures in gcc.dg/guality.  Most of the failures
are related to using the '-O2 -flto' or '-O3' options.  If I remove those
option runs I get 15 failures involving 5 tests on my aarch64 system:

gcc.dg/guality/pr36728-1.c
gcc.dg/guality/pr41447-1.c
gcc.dg/guality/pr54200.c
gcc.dg/guality/pr54693-2.c
gcc.dg/guality/vla-1.c

So I guess there are a number of questions:  Are these tests worth running?
Do they make sense with -O3 and/or -O2 -flto?   If they make sense and 
should be run do we need to fix GCC to clean up the failures?  Or should
we continue to just ignore them?

Steve Ellcey
sell...@cavium.com


Re: libatomic IFUNC question (arm & libat_have_strexbhd)

2017-06-07 Thread Steve Ellcey
On Wed, 2017-06-07 at 12:21 -0700, Richard Henderson wrote:

> > Setting the variable in the constructor wouldn't influence IFUNC
> > resolver behavior because those can run before ELF constructors
> > (even with lazy binding).
> > With lazy binding, the constructors of libraries should run in graph
> > dependency order, which means this constructor should run before any users.
> 
> Without lazy binding, you're right that ifunc resolvers can run earlier, and 
> this would be largely useless.  Suggestions for a better organization
> welcome.

Would defining  __builtin_cpu_init, __builtin_cpu_is,
and __builtin_cpu_supports for ARM help with this?  X86 seems to have
some special code to call __builtin_cpu_init
from dispatch_function_versions.  Is that early enough?  Then
__builtin_cpu_is or __builtin_cpu_supports could be used in the IFUNC
resolvers instead of checking the libat_have_strexbhd variable.

Steve Ellcey
sell...@cavium.com


Re: libatomic IFUNC question (arm & libat_have_strexbhd)

2017-06-06 Thread Steve Ellcey
On Tue, 2017-06-06 at 07:50 +0200, Florian Weimer wrote:
> * Steve Ellcey:
> 
> > 
> > I have a question about the use of IFUNCs in libatomic.  I was
> > looking at the
> > arm implementation and in gcc/libatomic/config/linux/arm/host-
> > config.h I see:
> > 
> > extern bool libat_have_strexbhd HIDDEN;
> > # define IFUNC_COND_1   libat_have_strexbhd
> > 
> > I also see that gcc/libatomic/config/linux/arm/init.c has:
> > 
> > bool libat_have_strexbhd;
> > static void __attribute__((constructor))
> > init_cpu_revision (void)
> > {
> > }
> > 
> > What I don't see is any place that libat_have_strexbhd would ever get 
> > set.  What am I missing here?  init_cpu_revision is going to get called
> > when libatomic is first loaded since it is a constructor but it doesn't
> > seem to do anything and it isn't going to set libat_have_strexbhd as far
> > as I can see.
> Setting the variable in the constructor wouldn't influence IFUNC
> resolver behavior because those can run before ELF constructors
> (even with lazy binding).

So the question remains, where is libat_have_strexbhd set?  As near as
I can tell it isn't set, which would make the libatomic IFUNC pointless
on arm.

Steve Ellcey


libatomic IFUNC question (arm & libat_have_strexbhd)

2017-06-05 Thread Steve Ellcey
I have a question about the use of IFUNCs in libatomic.  I was looking at the
arm implementation and in gcc/libatomic/config/linux/arm/host-config.h I see:

extern bool libat_have_strexbhd HIDDEN;
# define IFUNC_COND_1   libat_have_strexbhd

I also see that gcc/libatomic/config/linux/arm/init.c has:

bool libat_have_strexbhd;
static void __attribute__((constructor))
init_cpu_revision (void)
{
}

What I don't see is any place that libat_have_strexbhd would ever get 
set.  What am I missing here?  init_cpu_revision is going to get called
when libatomic is first loaded since it is a constructor but it doesn't
seem to do anything and it isn't going to set libat_have_strexbhd as far
as I can see.

Steve Ellcey
sell...@cavium.com


Re: Duplicating loops and virtual phis

2017-05-17 Thread Steve Ellcey
On Wed, 2017-05-17 at 10:41 +0100, Bin.Cheng wrote:

> I happen to be working on loop distribution now (If guess correctly,
> to get hmmer fixed).  So far my idea is to fuse the finest
> distributed
> loop in two passes, in the first pass, we merge all SCCs due to
> "true"
> data dependence; in the second one we identify all SCCs and breaks
> them on dependent edges due to possible alias.  Breaking SCCs with
> minimal edge set can be modeled as Feedback arc set problem which is
> NP-hard. Fortunately the problem is small in our case and there are
> approximation algorithms.  OTOH, we should also improve loop
> distribution/fusion to maximize parallelism / minimize
> synchronization, as well as maximize data locality, but I think this
> is not needed to get hmmer vectorized.

Vectorizing hmmer is what I am interested in so I am glad to hear you
are looking into that.  You are obviously more knowledgable about the
GCC loop infrastructure then I am so I look forward to what you come up
with.

Steve Ellcey
sell...@cavium.com



Re: Duplicating loops and virtual phis

2017-05-15 Thread Steve Ellcey
On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey wrote:
> > 
> > (Short version of this email, is there a way to recalculate/rebuild
> > virtual
> > phi nodes after modifying the CFG.)
> > 
> > I have a question about duplicating loops and virtual phi nodes.
> > I am trying to implement the following optimization as a pass:
> > 
> > Transform:
> > 
> >   for (i = 0; i < n; i++) {
> >     A[i] = A[i] + B[i];
> >     C[i] = C[i-1] + D[i];
> >   }
> > 
> > Into:
> > 
> >   if (noalias between A&B, A&C, A&D)
> >     for (i = 0; i < 100; i++)
> >       A[i] = A[i] + B[i];
> >     for (i = 0; i < 100; i++)
> >       C[i] = C[i-1] + D[i];
> >   else
> >     for (i = 0; i < 100; i++) {
> >       A[i] = A[i] + B[i];
> >       C[i] = C[i-1] + D[i];
> >     }
> > 
> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
> > vectorized so it gives up and does not vectorize the loop.  If we split
> > up the loop into two loops then the vector add with A[i] could be
> > vectorized
> > even if the one with C[i] could not.
> Loop distribution does this transform but it doesn't know about
> versioning for unknown dependences.
> 

Yes, I looked at loop distribution.  But it only works with global
arrays and not with pointer arguments where it doesn't know the size of
the array being pointed at.  I would like to be able to have it work
with pointer arguments.  If I call a function with 2 or
more integer pointers, and I have a loop that accesses them with
offsets between 0 and N where N is loop invariant then I should have
enough information (at runtime) to determine if there are overlapping
memory accesses through the pointers and determine whether or not I can
distribute the loop.
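The runtime disambiguation test itself is cheap. Here is a minimal sketch of the overlap check for two int arrays of length n (my own illustration, not code from any GCC pass); the comparisons go through uintptr_t so the check stays valid even for pointers into unrelated objects:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a runtime "noalias" test: the byte ranges [p, p+n*4) and
   [q, q+n*4) are disjoint iff one ends before the other begins.  */
static bool ranges_disjoint (const int *p, const int *q, int n)
{
  uintptr_t pb = (uintptr_t) p;
  uintptr_t qb = (uintptr_t) q;
  uintptr_t bytes = (uintptr_t) n * sizeof (int);
  return pb + bytes <= qb || qb + bytes <= pb;
}
```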

The loop splitting code seemed like a better template since it already
knows how to split a loop based on a runtime determined condition. That
part seems to be working for me, it is when I try to
distribute/duplicate one of those loops (under the unaliased condition)
that I am running into the problem with virtual PHIs.

Steve Ellcey
sell...@cavium.com




Duplicating loops and virtual phis

2017-05-12 Thread Steve Ellcey
(Short version of this email, is there a way to recalculate/rebuild virtual
phi nodes after modifying the CFG.)

I have a question about duplicating loops and virtual phi nodes.
I am trying to implement the following optimization as a pass:

Transform:

   for (i = 0; i < n; i++) {
     A[i] = A[i] + B[i];
     C[i] = C[i-1] + D[i];
   }

Into:

   if (noalias between A&B, A&C, A&D)
     for (i = 0; i < 100; i++)
       A[i] = A[i] + B[i];
     for (i = 0; i < 100; i++)
       C[i] = C[i-1] + D[i];
   else
     for (i = 0; i < 100; i++) {
       A[i] = A[i] + B[i];
       C[i] = C[i-1] + D[i];
     }

Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
vectorized so it gives up and does not vectorize the loop.  If we split
up the loop into two loops then the vector add with A[i] could be vectorized
even if the one with C[i] could not.

Currently I can introduce the first 'if' that checks for aliasing by
using loop_version() and that seems to work OK.  (My actual compare
for aliasing is actually just an approximation for now.)

Where I am running into problems is with splitting up the single loop
under the noalias if condition into two sequential loops (which I then
intend to 'thin out' by removing one or the other set of instructions).
I am using slpeel_tree_duplicate_loop_to_edge_cfg() for that loop duplication
and while I get the CFG I want, the pass ends with verify_ssa failing due
to bad virtual/MEM PHI nodes.  Perhaps there is a different function that
I should use to duplicate the loop.

a.c: In function ‘foo’:
a.c:2:5: error: PHI node with wrong VUSE on edge from BB 13
 int foo(int *a, int *b, int *c, int *d, int n)
 ^~~
.MEM_40 = PHI <.MEM_15(D)(13), .MEM_34(9)>
expected .MEM_58
a.c:2:5: internal compiler error: verify_ssa failed

I have tried to fix up the PHI node by hand using SET_PHI_ARG_DEF but
have not had any luck.  I was wondering if there was any kind of
'update all the phi nodes' function or just a 'update the virtual phi
nodes' function.  The non-virtual PHI nodes seem to be OK, it is just
the virtual ones that seem wrong after I duplicate the loop into two
consecutive loops.

Steve Ellcey
sell...@cavium.com


Question about dump_printf/dump_printf_loc

2017-05-05 Thread Steve Ellcey
I have a simple question about dump_printf and dump_printf_loc.  I notice
that most (all?) of the uses of these functions are of the form:

if (dump_enabled_p ())
dump_printf_loc (MSG_*, ..);

Since dump_enabled_p() is just checking to see if dump_file or alt_dump_file
is set and since dump_printf_loc has checks for these as well, is there
any reason why we shouldn't or couldn't just use:

dump_printf_loc (MSG_*, ..);

without the call to dump_enabled_p and have the dump function do nothing
when there is no dump file set?  I suppose the first version would have
some performance advantage since dump_enabled_p is an inlined function,
but is that enough of a reason to do it?  The second version seems like
it would look cleaner in the code where we are making these calls.
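To make the performance point concrete, here is a hypothetical analog (names and behavior are my stand-ins, not GCC's dump machinery): the guard saves more than an early return, because it also skips evaluating the arguments of the dump call.

```c
#include <stdbool.h>

/* Hypothetical analog of dump_enabled_p()/dump_printf_loc().  */
static bool dump_enabled_flag = false;
static int expensive_evals = 0;

static int expensive_cost (void)
{
  ++expensive_evals;             /* count how often the argument is computed */
  return 42;
}

static void my_dump (int v)
{
  if (dump_enabled_flag)
    (void) v;                    /* a real version would fprintf here */
}

void guarded (void)
{
  if (dump_enabled_flag)         /* argument never evaluated when off */
    my_dump (expensive_cost ());
}

void unguarded (void)
{
  my_dump (expensive_cost ());   /* expensive_cost runs regardless */
}
```

With dumping disabled, guarded() computes nothing while unguarded() still pays for expensive_cost(), which is presumably why the inline predicate check is kept at every call site.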

Steve Ellcey
sell...@cavium.com


Question about -fopt-info output (-fopt-info vs. -fopt-info-all)

2017-05-02 Thread Steve Ellcey
I have a question about -fopt-info.  According to the GCC documentation at:

  https://gcc.gnu.org/onlinedocs/gccint/Dump-examples.html


| If options is omitted, it defaults to all-all, which means dump all 
| available optimization info from all the passes. In the following example,
| all optimization info is output on to stderr.
|
|   gcc -O3 -fopt-info

But when I use the '-fopt-info' flag, I get less output about vectorization
than when I use '-fopt-info-all' or '-fopt-info-all-all'.

For example if I compile:

int foo(int *a, int *b, int *c, int n) {
  int i;
  for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}

with '-O3 -fopt-info' I get 6 lines of output.  '-O3 -fopt-info-all'
or '-O3 -fopt-info-all-all' gives me 453 lines of output.

Is the documentation wrong, the implementation wrong, or my understanding
of what the documentation is saying wrong?

Steve Ellcey
sell...@cavium.com


GCC loop structure question

2017-04-27 Thread Steve Ellcey
I have a question about the GCC loop structure.  I am trying to identify the
induction variable for a for loop and I don't see how to do that.

For example, if I have:

int foo(int *a, int *b, int *c, int *d, int *e, int *f, int n)
{
  int i;
  for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    d[i] = e[i] * f[i];
  }
}

I basically want to identify 'i' as the IV and look at its uses inside the
loop.

I can look at the control_ivs in the loop structure and I see:

base is:
  <integer_cst ... constant 1>
step is:
  <integer_cst ... constant 1>

But it doesn't say what is being set to base or what is being increased by
step in each loop iteration.  (I am also not sure why base is 1 and not 0.)

Does this refer to a pseudo-IV or something instead of a real SSA variable
that appears in the tree?   How would I identify 'i' as the IV for this
loop?  Do I need to look at the loop header and latch and see what the
header sets and what the latch checks to identify the variable?

Steve Ellcey
sell...@cavium.com


Re: Alias analysis and zero-sized arrays vs. flexible arrays

2017-04-25 Thread Steve Ellcey
On Tue, 2017-04-25 at 12:53 +0200, Richard Biener wrote:

> > int foo() {
> >    int i,j;
> >    for (i = 0; i < m; i++) {
> > a->o[i] = sizeof(*a);
> > b = ((struct r *)(((char *)a) + a->o[a->n]));
> > for (j = 0; j < 10; j++) {
> > b->slot[j].b = 0;
> in case b->slot[j].b aliases a->o[i] or a->o[a->n]
> you invoke undefined behavior becuase you violate
> strict aliasing rules.  I don't know why there's a
> difference between -DFLEX and -UFLEX but your
> code is buggy even if it works in one case.
> 
> Richard.

Should this work if I use -fno-strict-aliasing?  Even with that option I
get different code with a zero-sized array vs. a flexible array.
I have a patch to get_ref_base_and_extent that changes the behaviour
for zero-length arrays and I will submit it after I have tested it.

Steve Ellcey
sell...@cavium.com


Alias analysis and zero-sized arrays vs. flexible arrays

2017-04-24 Thread Steve Ellcey
I was wondering if someone could help me understand a bug involving
aliasing; this is happening on aarch64 but I don't think it is architecture
specific.  The problem involves flexible arrays vs. zero sized arrays at
the end of a structure.

In the original code, a zero size array is used and the program does not
behave correctly, if the zero sized array is changed to a C99 flexible
array it does work.  Are there any reasons why a zero-size array and
a flexible array should behave differently?

I was able to cut the test case down into the attached (non-runnable) test
case which when compiled with -O2 for aarch64 generates different code
with -DFLEX and -UFLEX.  In the code for the main loop GCC generates a
ldr/str/ldr/str sequence (with other instructions) when using a flexible
array and a ldr/ldr/str/str sequence when using a zero size array.  Moving
the second ldr ahead of the first str is what is causing the problem in the
original test case.

I have tracked the change in behaviour to differences in alias analysis
and to the get_ref_base_and_extent routine in tree-dfa.c and it looks
like the 'tree exp' argument is different between the two versions but
I am not sure if it should be different and, if the difference is OK,
should that affect how get_ref_base_and_extent behaves, as it apparently
does.

Steve Ellcey
sell...@cavium.com



Test case, compiling with '-O2 -DFLEX' generates different code than
'-O2 -UFLEX' on aarch64 using ToT GCC.  A cross compiler built on x86
can reproduce the problem too.

---


struct q {
  int b;
};
struct r {
  int n;
  struct q slot[0];
};
struct s {
  int n;
#ifdef FLEX
  long int o[];
#else
  long int o[0];
#endif
};
extern int x, y, m;
extern struct s *a;
extern struct r *b;
extern void bar();
int foo() {
  int i, j;
  for (i = 0; i < m; i++) {
    a->o[i] = sizeof(*a);
    b = ((struct r *)(((char *)a) + a->o[a->n]));
    for (j = 0; j < 10; j++) {
      b->slot[j].b = 0;
    }
    bar();
  }
}


Machine problems at gcc.gnu.org?

2017-04-21 Thread Steve Ellcey

I am having problems getting to https://gcc.gnu.org this morning and
I have also had problems getting to the glibc mail archives though the
main web page for glibc seem available.  Anyone else having problems?
Of course if this email goes through the machines that are having problems
it may not get anywhere

Steve Ellcey
sell...@cavium.com


Re: SPEC 456.hmmer vectorization question

2017-03-08 Thread Steve Ellcey
On Tue, 2017-03-07 at 14:45 +0100, Michael Matz wrote:
> Hi Steve,
> 
> On Mon, 6 Mar 2017, Steve Ellcey wrote:
> 
> > 
> > I was looking at the spec 456.hmmer benchmark and this email string
> > from Jeff Law and Michael Matz:
> > 
> >   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html
> > 
> > and was wondering if anyone was looking at what more it would take
> > for GCC to vectorize the loop in P7Viterbi.

> It takes what I wrote in there.  There are two important things that need 
> to happen to get the best performance (at least from an analysis I did in 
> 2011, but nothing material should have changed since then):

I guess I was hoping that some progress had been made since then, but
it sounds like it hasn't.

> (1) loop distribution to make some memory streams vectorizable (and leave 
> the others in non-vectorized form).
> (1a) loop splitting based on conditional (to remove the k
> (2) a predictive commoning (or loop carried store reuse) on the dc[] 
> stream
> 
> None of these is valid if the loop streams can't be disambiguated, and as 
> this is C only adding explicit restrict qualifiers would give you that, or 
> runtime disambiguation, like ICC is doing, that's part (0).

So it sounds like the loop would have to be split up using runtime
disambiguation before we could do any of the optimizations.  Would that
check and split be something that could or should be done using the
graphite framework or would it be a separate pass done before the
graphite phase is called?  I am not sure how one would determine what
loops would be worth splitting and which ones would not during such a
phase.

Steve Ellcey
sell...@cavium.com


SPEC 456.hmmer vectorization question

2017-03-06 Thread Steve Ellcey

I was looking at the spec 456.hmmer benchmark and this email string
from Jeff Law and Michael Matz:

  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html

and was wondering if anyone was looking at what more it would take
for GCC to vectorize the loop in P7Viterbi.  There is a big performance
win to be had here if it can be done but the alias checking needed
seems rather extensive.

Steve Ellcey
sell...@cavium.com


Question about PR preprocessor/60723

2016-11-30 Thread Steve Ellcey
I am trying to understand the status of this bug and the patch
that fixes it.  It looks like a patch was submitted and checked
in for 5.0 to fix the problem reported and I see the new 
behavior caused by the patch in GCC 5.X compilers.  This behavior
caused a number of issues with configure scripts and tools that examined
preprocessed output as is mentioned in the bug report for PR 60723.
There was a later bug, 64864, complaining about the behavior and
that was closed as invalid.

But when I look at GCC 6.X or ToT compilers I do not see the same
behavior as 5.X.  Was this patch reverted or was a new patch submitted
that undid some of this patches behavior?  I couldn't find any revert or
new patch to replace the original one so I am not sure when or why
the code changed back after the 5.X releases.

Here is a test case that I am preprocessing with g++ -E:

#include 
class foo {
void operator= ( bool bit);
operator bool() const;
};

GCC 5.4 breaks up the operator declarations with line markers and GCC 6.2
does not.

Steve Ellcey
sell...@caviumnetworks.com


Re: glibc test tst-thread_local1.cc fails to compile with latest GCC

2016-10-21 Thread Steve Ellcey
On Fri, 2016-10-21 at 17:03 +0100, Jonathan Wakely wrote:
> 
> > Is there some C++ standard change that I am not aware of or some
> > other header file I need to include?
> No, what probably happened is GCC didn't detect a usable Pthreads
> > implementation and so doesn't define std::thread. The <thread> header
> uses this condition around the definition of std::thread:
> 
> #if defined(_GLIBCXX_HAS_GTHREADS) &&
> defined(_GLIBCXX_USE_C99_STDINT_TR1)

Yes, I finally realized I had built a GCC with '--enable-threads=no'
and was using that GCC to build GLIBC.  Once I rebuilt GCC with threads
I could build GLIBC and not get this error.

Steve Ellcey


GCC compat testing and simulator question

2016-02-01 Thread Steve Ellcey

I have a question about the compatibility tests (gcc.dg/compat and
g++.dg/compat).  Do they work with remote/simulator testing?  I was
trying to run them with qemu and even though I am setting ALT_CC_UNDER_TEST
and ALT_CXX_UNDER_TEST it doesn't look like my alternative compiler
is ever getting run.

The README.compat file contains a line about 'make sure they work for
testing with a simulator'; does that mean they are known not to work
with cross-testing and using a simulator?

I don't get any errors or warnings, and tests are being compiled with
GCC and run under qemu but it doesn't look like the second compiler is
ever run to compile anything.  I am using the multi-sim dejagnu board.

Steve Ellcey
sell...@imgtec.com


Re: __builtin_memcpy and alignment assumptions

2016-01-08 Thread Steve Ellcey
On Fri, 2016-01-08 at 12:56 +0100, Richard Biener wrote:
> On Fri, Jan 8, 2016 at 12:40 PM, Eric Botcazou wrote:
> >> I think we only assume it if the pointer is actually dereferenced, 
> >> otherwise
> >> it just breaks too much code in the wild.  And while memcpy dereferences,
> >> it dereferences it through a char * cast, and thus only the minimum
> >> alignment is assumed.
> >
> > Yet the compiler was generating the expected code for Steve's testcase on
> > strict-alignment architectures until very recently (GCC 4.5 IIUC) and this
> > worked perfectly.

Yes, I just checked and I did get the better code in GCC 4.5 and I get
the current slower code in GCC 4.6.

> Consider
> 
> int a[256];
> int
> main()
> {
>   void *p = (char *)a + 1;
>   void *q = (char *)a + 5;
>   __builtin_memcpy (p, q, 4);
>   return 0;
> }
> 
> where the ME would be entitled to "drop" the char */void * conversions
> and use &a typed temps.

I am not sure how this works but I tweaked get_pointer_alignment_1 so
that if there was no align info or if get_ptr_info_alignment returned
false then the routine would return type based alignment information
instead of default 'void *' alignment.  In that case and using your
example, GCC still accessed p & q as pointers to unaligned data.

In fact if I used int pointers:

int a[256];
int main()
{
  int *p = (int *)((char *)a + 1);
  int *q = (int *)((char *)a + 5);
  __builtin_memcpy (p, q, 4);
  return 0;
}

GCC did unaligned accesses when optimizing, but when unoptimized (and
with my change) GCC did aligned accesses, which would not work on a
strict alignment machine like MIPS.  This seems to match what happens
with:

int a[256];
int main()
{
  int *p = (int *)((char *)a + 1);
  int *q = (int *)((char *)a + 5);
  *p = *q;
  return 0;
}

When I optimize it, GCC does unaligned accesses and when unoptimized
GCC does aligned accesses which will not work on MIPS.
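For what it's worth, the usual way to express an unaligned access that stays safe on strict-alignment targets like MIPS is to go through memcpy and let the compiler lower it. This is my own example, not code from the thread:

```c
#include <stdint.h>
#include <string.h>

/* Portable read of a 32-bit value from a possibly misaligned address.
   The memcpy dereferences through char, so the compiler may not assume
   int alignment; on targets with unaligned loads it typically still
   compiles down to a single load instruction.  */
static int32_t load_unaligned (const void *p)
{
  int32_t v;
  memcpy (&v, p, sizeof v);
  return v;
}
```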

Steve Ellcey
sell...@imgtec.com






libstdc++ / uclibc question

2015-12-21 Thread Steve Ellcey
Is anyone building GCC (and libstdc++ specifically) with uclibc?  I haven't
done this in a while and when I do it now I get this build failure:

/scratch/sellcey/repos/uclibc-ng/src/gcc/libstdc++-v3/include/ext/random.tcc: 
In member function '__gnu_cxx::{anonymous}::uniform_on_sphere_helper<_Dimen, 
_RealType>::result_type 
__gnu_cxx::{anonymous}::uniform_on_sphere_helper<_Dimen, 
_RealType>::operator()(_NormalDistribution&, _UniformRandomNumberGenerator&)':
/scratch/sellcey/repos/uclibc-ng/src/gcc/libstdc++-v3/include/ext/random.tcc:1573:44:
 error: expected unqualified-id before '(' token
while (__norm == _RealType(0) || ! std::isfinite(__norm));

I am thinking the issue may be isfinite, but I am not sure.  I notice there
are some tests like 26_numerics/headers/cmath/c99_classification_macros_c++.cc
that are xfailed for uclibc and I wonder if this is a related problem.

I could not find any uses of isfinite in other C++ files (except cmath)
and the tests that use it are the same ones that are xfailed for uclibc.
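In case it helps anyone hitting the same build failure: my guess at the failure mode (not a verified uclibc fix) is that the C library leaks the C99 isfinite() function-like macro past <cmath>, so the qualified call std::isfinite(__norm) expands into nonsense and fails to parse. The usual portability workaround in user code looks like:

```cpp
#include <cmath>

// If the C library leaks the C99 isfinite() classification macro,
// drop it before using the qualified name (assumption: std::isfinite
// itself is available once the macro is out of the way).
#ifdef isfinite
#undef isfinite
#endif

bool finite_norm(double x) { return std::isfinite(x); }
```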

Steve Ellcey
sell...@imgtec.com


Re: Question about PR 48814 and ivopts and post-increment

2015-12-04 Thread Steve Ellcey
On Fri, 2015-12-04 at 16:22 +0800, Bin.Cheng wrote:

> Dump before IVO is as below:
> 
>   :
>   # s1_1 = PHI 
>   # s2_2 = PHI 
>   s1_6 = s1_1 + 1;
>   c1_8 = *s1_1;
>   s2_9 = s2_2 + 1;
>   c2_10 = *s2_2;
>   if (c1_8 == 0)
> goto ;
>   else
> goto ;
> 
> And the iv candidates are as:
> candidate 1 (important)
>   var_before ivtmp.6
>   var_after ivtmp.6
>   incremented before exit test
>   type unsigned int
>   base (unsigned int) p1_4(D)
>   step 1
>   base object (void *) p1_4(D)
> candidate 2 (important)
>   original biv
>   type const unsigned char *
>   base (const unsigned char *) p1_4(D)
>   step 1
>   base object (void *) p1_4(D)
> candidate 3 (important)
>   var_before ivtmp.7
>   var_after ivtmp.7
>   incremented before exit test
>   type unsigned int
>   base (unsigned int) p2_5(D)
>   step 1
>   base object (void *) p2_5(D)
> candidate 4 (important)
>   original biv
>   type const unsigned char *
>   base (const unsigned char *) p2_5(D)
>   step 1
>   base object (void *) p2_5(D)
> 
> Generally GCC would choose normal candidates {1, 3} and insert
> increment before exit condition.  This is expected in this case.  But
> when there is applicable original candidates {2, 4}, GCC would prefer
> these in order to achieve better debugging.  Also as I suspected,
> [reg] and [reg-1] have same address cost on mips, that's why GCC makes
> current decision.
> 
> Thanks,
> bin

Yes, I agree that [reg] and [reg-1] have the same address cost, but
using [reg-1] means that the increment of reg happens before the access
and that puts the load of [reg-1] closer to the use of the value loaded
and that causes a stall.  If we used [reg] and incremented it after the
load then we would have at least one instruction in between the load and
the use and either no stall or a shorter stall.

I don't know if ivopts has anyway to do this type of analysis when
picking the IV.

Steve Ellcey
sell...@imgtec.com



Re: Instruction scheduler rewriting instructions?

2015-12-03 Thread Steve Ellcey
On Thu, 2015-12-03 at 19:56 +, Ramana Radhakrishnan wrote:

> IIRC it's because the scheduler *thinks* it can get a tighter schedule
> - probably because it thinks it can dual issue the lbu from $4 and the
> addiu to $5. Can it think so ? This may be related -
> https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00155.html
> 
> regards
> Ramana

No, the system I am tuning for (MIPS 24k) is single issue according to
its description.  At least I do see now where the instruction is getting
rewritten in the instruction scheduler, so that is helpful.  I am no
longer sure the scheduler is where the problem lies though.  If I
compile with -O2 -mtune=24kc I get this loop:

addiu   $4,$4,1
$L8:
addiu   $5,$5,1
lbu $3,-1($4)
beq $3,$0,$L7
lbu $2,-1($5)

beq $3,$2,$L8
addiu   $4,$4,1

If I use -O2 -fno-ivopts -mtune=24kc I get:

lbu $3,0($4)
$L8:
lbu $2,0($5)
addiu   $4,$4,1
beq $3,$0,$L7
addiu   $5,$5,1

beql$3,$2,$L8
lbu $3,0($4)

This second loop is better because there is more time between the loads
and where the loaded values are used in the beq instructions.  So I
think there is something missing or wrong in the cost analysis that
ivopts is doing that it decides to do the adds before the loads instead
of visa versa.

I have tried tweaking the cost of loads in mips_rtx_costs and in the
instruction descriptions in 24k.md but that didn't seem to have any
affect on the ivopts code.

Steve Ellcey
sell...@imgtec.com




Instruction scheduler rewriting instructions?

2015-12-03 Thread Steve Ellcey
Can the instruction scheduler actually rewrite instructions?  I didn't
think so but when I compile some code on MIPS with:

-O2 -fno-ivopts -fno-peephole2 -fno-schedule-insns2

I get:

$L4:
lbu $3,0($4)
addiu   $4,$4,1
lbu $2,0($5)
beq $3,$0,$L7
addiu   $5,$5,1

beq $3,$2,$L4
subu$2,$3,$2

When I changed -fno-schedule-insns2 to -fschedule-insns2, I get:

$L4:
lbu $3,0($4)
addiu   $5,$5,1
lbu $2,-1($5)
beq $3,$0,$L7
addiu   $4,$4,1

beq $3,$2,$L4
subu$2,$3,$2

I.e. The addiu of $5 and the load using $5 have been swapped around
and the load uses a different offset to compensate.  I can't see
where in the instruction scheduler that this would happen.  Any 
help?  This is on MIPS if that matters, though I didn't see any
MIPS specific code for this.  This issue is related to my earlier
question about PR 48814 and ivopts (thus the -fno-ivopts option).

The C code I am looking at is the strcmp function from glibc:

int
strcmp (const char *p1, const char *p2)
{
  const unsigned char *s1 = (const unsigned char *) p1;
  const unsigned char *s2 = (const unsigned char *) p2;
  unsigned char c1, c2;

  do
{
  c1 = (unsigned char) *s1++;
  c2 = (unsigned char) *s2++;
  if (c1 == '\0')
return c1 - c2;
}
  while (c1 == c2);

  return c1 - c2;
}


Steve Ellcey
sell...@imgtec.com


Question about PR 48814 and ivopts and post-increment

2015-12-01 Thread Steve Ellcey

I have a question involving ivopts and PR 48814, which was a fix for
the post increment operation.  Prior to the fix for PR 48814, MIPS
would generate this loop for strcmp (C code from glibc):

$L4:
lbu $3,0($4)
lbu $2,0($5)
addiu   $4,$4,1
beq $3,$0,$L7
addiu   $5,$5,1# This is a branch delay slot
beq $3,$2,$L4
subu$2,$3,$2   # This is a branch delay slot (only used after loop)


With the current top-of-tree we now generate:

addiu   $4,$4,1
$L8:
lbu $3,-1($4)
addiu   $5,$5,1
beq $3,$0,$L7
lbu $2,-1($5)  # This is a branch delay slot
beq $3,$2,$L8
addiu   $4,$4,1# This is a branch delay slot

subu$2,$3,$2   # Done only once now after exiting loop.

The main problem with the new loop is that the beq comparing $2 and $3
is right before the load of $2 so there can be a delay due to the time
that the load takes.  The ideal code would probably be:

addiu   $4,$4,1
$L8:
lbu $3,-1($4)
lbu $2,0($5)  # This is a branch delay slot
beq $3,$0,$L7
addiu   $5,$5,1
beq $3,$2,$L8
addiu   $4,$4,1# This is a branch delay slot

subu$2,$3,$2   # Done only once now after exiting loop.

Where we load $2 earlier (using a 0 offset instead of a -1 offset) and
then do the increment of $5 after using it in the load.  The problem
is that this isn't something that can just be done in the instruction
scheduler because we are changing one of the instructions (to modify the
offset) in addition to rearranging them and I don't think the instruction
scheduler supports that.

It looks like is the ivopts code that decided to increment the registers
first and use the -1 offsets in the loads after instead of using 0 offsets
and then incrementing the offsets after the loads but I can't figure out
how or why ivopts made that decision.

Does anyone have any ideas on how I could 'fix' GCC to make it generate
the ideal code?  Is there some way to do it in the instruction scheduler?
Is there some way to modify ivopts to fix this by modifying the cost
analysis somehow?  Could I (partially) undo the fix for PR 48814?
According to the final comment in that bugzilla report the change is
really only needed for C11 and that the change does degrade the optimizer
so could we go back to the old behaviour for C89/C99?  The code in ivopts
has changed enough since the patch was applied I couldn't immediately see
how to do that in the ToT sources.

Steve Ellcey
sell...@imgtec.com


Re: _Fract types and conversion routines

2015-10-29 Thread Steve Ellcey

OK, I think I understand what is happening with the MIPS failure when
converting 'signed char' to '_Sat unsigned _Fract' after I removed
the TARGET_PROMOTE_PROTOTYPES macro.

This bug is a combination of two factors, one is that calls to library
functions (like __satfractqiuhq) don't necessarily get the right type
promotion (specifically with regards to signedness) of their arguments
and the other is that __satfractqiuhq doesn't deal with that problem
correctly, though I think it is supposed to.

Reading emit_library_call_value_1 I see comments like:

  /* Todo, choose the correct decl type of orgfun. Sadly this information
 isn't present here, so we default to native calling abi here.  */

So I think that when calling a library function like '__satfractqiuhq'
which takes a signed char argument or calling a library function like
__satfractunsqiuhq which takes an unsigned char argument
emit_library_call_value_1 cannot ensure that the right type of extension
(signed vs unsigned) is done on the argument when it is put in the
argument register.  Does this sound like a correct understanding of the
limitation in emit_library_call_value_1?

I don't see this issue on regular non-library calls, presumably because
the compiler has all the information needed to do correct explicit
conversions.
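The extension mismatch is easy to demonstrate in plain C (my own illustration, not from fixed-bit.c): the same signed char argument widens to different values depending on whether sign- or zero-extension is applied, which is exactly what the caller and the libcall can disagree about.

```c
/* The two possible widenings of a negative signed char.  */
int widen_signed (signed char a)
{
  return a;                       /* sign-extended */
}

unsigned int widen_unsigned (signed char a)
{
  return (unsigned char) a;       /* zero-extended */
}
```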

When I look at the preprocessed __satfractqiuhq code I see:

unsigned short _Fract
__satfractqiuhq (signed char a) {

signed char x = a;
low = (short) x;

When TARGET_PROMOTE_PROTOTYPES was defined this triggered explicit
truncate/sign-extend code that took care of the problem I am
seeing but when I removed it, GCC assumed the caller had taken care
of the truncate/sign extension and, because this is a library function,
that wasn't done correctly and I don't think it can be done correctly
because emit_library_call_value_1 doesn't have the necessary
information.

So should __satfractqiuhq be dealing with the fact that the argument 'a'
may not have been sign extend in the correct way?

I have tried a few code changes in fixed-bit.c (to no avail) but this
code is so heavily macro-ized it is tough to figure out what it should
be doing.

Steve Ellcey
sell...@imgtec.com




Re: _Fract types and conversion routines

2015-10-28 Thread Steve Ellcey

You can ignore that last email.  I think I finally found where the
problem is.  In the main program:

extern void abort (void);
int main ()
{
  signed char a = -1;
  _Sat unsigned _Fract b = a;
  if (b != 0.0ur)
abort();
  return 0;
}

If I compile with -O0, I see:

li  $2,-1   # 0x
sb  $2,24($fp)
lbu $4,24($fp)
jal __satfractqiuhq

We put -1 in register $2, store the byte, then load the byte as an
unsigned char instead of a signed char.  When TARGET_PROMOTE_PROTOTYPES
was defined it didn't matter because __satfractqiuhq did another sign
extend before using the value.  When I got rid of
TARGET_PROMOTE_PROTOTYPES, that extra sign extend went away and the fact
that we are doing a 'lbu' unsigned load instead of a 'lb' signed byte
load triggered the bug.  Now I just need to find out why we are doing an
lbu instead of an lb.

Steve Ellcey
sell...@imgtec.com




Re: _Fract types and conversion routines

2015-10-28 Thread Steve Ellcey
On Wed, 2015-10-28 at 13:42 +0100, Richard Biener wrote:
> On Wed, Oct 28, 2015 at 12:23 AM, Steve Ellcey wrote:
> >
> > I have a question about the _Fract types and their conversion routines.
> > If I compile this program:
> >
> > extern void abort (void);
> > int main ()
> > {
> >   signed char a = -1;
> >   _Sat unsigned _Fract b = a;
> >   if (b != 0.0ur)
> > abort();
> >   return 0;
> > }
> >
> > with -O0 and on a MIPS32 system where char is 1 byte and unsigned (int)
> > is 4 bytes I see a call to '__satfractqiuhq' for the conversion.
> >
> > Now I think the 'qi' part of the name is for the 'from type' of the
> > conversion, a 1 byte signed type (signed char), and the 'uhq' part is
> > for the 'to' part of the conversion.  But 'uhq' would be a 2 byte
> > unsigned fract, and the unsigned fract type on MIPS should be 4 bytes
> > (unsigned int is 4 bytes).  So shouldn't GCC have generated a call to
> > __satfractqiusq instead?  Or am I confused?
> 
> did it eventually narrow the comparison?  Just check some of the tree/RTL 
> dumps.
> 
> > Steve Ellcey
> > sell...@imgtec.com

Hm, it looks like it optimized this in expand.  In the last tree dump it
still looks like:

b_2 = (_Sat unsigned _Fract) a_1;

But in the expand phase it becomes:

(call_insn/u 13 12 14 2 (parallel [
(set (reg:UHQ 2 $2)
(call (mem:SI (symbol_ref:SI ("__satfractqiuhq") [flags 0x41]) 
[0  S4 A32])
(const_int 16 [0x10])))
(clobber (reg:SI 31 $31))
])

I think this is a legitimate optimization (though I am compiling at -O0
so I wonder if it should really be doing this).  The problem I am
looking at is that I want to remove 'TARGET_PROMOTE_PROTOTYPES' because
it is causing us to promote/sign extend types in the caller and the callee.
The MIPS ABI requires it be done in the caller so it should not need to
be done in the callee as well.

See https://gcc.gnu.org/ml/gcc/2015-10/msg00149.html

When I ran the testsuite, I got one regression: 
gcc.dg/fixed-point/convert-sat.c.

When looking at that failure I thought the problem might be that I was
calling __satfractqiuhq instead of __satfractqiusq, but that does not seem
to be the issue.  The call to __satfractqiuhq is correct, and the difference
that I see when I don't define TARGET_PROMOTE_PROTOTYPES is that the result
of __satfractqiuhq is not truncated/sign-extended to UHQ mode inside of
__satfractqiuhq.  I am looking to see if I need to do something with
TARGET_PROMOTE_FUNCTION_MODE to handle _Fract types differently than what
default_promote_function_mode_always_promote does.

I tried updating PROMOTE_MODE to handle _Fract modes (by promoting UHQ to
USQ or SQ) but that caused more failures than before.  It seems to be only
the return of partial-word _Fract types that is causing me a problem.

Steve Ellcey
sell...@imgtec.com



_Fract types and conversion routines

2015-10-27 Thread Steve Ellcey

I have a question about the _Fract types and their conversion routines.
If I compile this program:

extern void abort (void);
int main ()
{
  signed char a = -1;
  _Sat unsigned _Fract b = a;
  if (b != 0.0ur)
abort();
  return 0;
}

with -O0 and on a MIPS32 system where char is 1 byte and unsigned (int)
is 4 bytes I see a call to '__satfractqiuhq' for the conversion.

Now I think the 'qi' part of the name is for the 'from type' of the
conversion, a 1 byte signed type (signed char), and the 'uhq' part is
for the 'to' part of the conversion.  But 'uhq' would be a 2 byte
unsigned fract, and the unsigned fract type on MIPS should be 4 bytes
(unsigned int is 4 bytes).  So shouldn't GCC have generated a call to
__satfractqiusq instead?  Or am I confused?

Steve Ellcey
sell...@imgtec.com


TARGET_PROMOTE_PROTOTYPES question

2015-10-20 Thread Steve Ellcey
I have a question about the TARGET_PROMOTE_PROTOTYPES macro.  This macro
says that types like short or char should be promoted to ints when
passed as arguments, even if there is a prototype for the argument.

Now when I look at the code generated on MIPS or x86 it looks like there
is conversion code in both the caller and the callee.  For example:

int foo(char a, short b) { return a+b; }
int bar (int a) { return foo(a,a); }


In the rtl expand dump (on MIPS) I see this in bar:

(insn 6 3 7 2 (set (reg:SI 200)
(sign_extend:SI (subreg:HI (reg/v:SI 199 [ a ]) 2))) x.c:2 -1
 (nil))
(insn 7 6 8 2 (set (reg:SI 201)
(sign_extend:SI (subreg:QI (reg/v:SI 199 [ a ]) 3))) x.c:2 -1
 (nil))

Which ensures that we pass the arguments as ints.
And in foo we have:

(insn 8 9 10 2 (set (reg/v:SI 197 [ a+-3 ])
(sign_extend:SI (subreg:QI (reg:SI 198) 3))) x.c:1 -1
 (nil))
(insn 10 8 11 2 (set (reg/v:SI 199 [ b+-2 ])
(sign_extend:SI (subreg:HI (reg:SI 200) 2))) x.c:1 -1
 (nil))

Which makes sure we do a truncate/extend before using the values.

Now I know that we can't get rid of these truncations/extensions 
entirely, but do we need both?  It seems like foo could say that
if the original registers (198 and 200) are argument registers
that were extended to SImode due to TARGET_PROMOTE_PROTOTYPES
then we don't need to do the truncation/extension in the callee
and could just use the SImode values directly.  Am I missing
something?  Or are we doing both just to have belts and suspenders
and want to keep it that way?

Steve Ellcey
sell...@imgtec.com


Build problem with libgomp on ToT?

2015-09-10 Thread Steve Ellcey
I just ran into this build failure last night:

/usr/bin/install: cannot create regular file 
`/scratch/sellcey/repos/nightly/install-mips-mti-linux-gnu/lib/gcc/mips-mti-linux-gnu/6.0.0/finclude/omp_lib_kinds.mod':
 File exists

This is on a parallel make install (-j 7) with multilibs.  I don't see an
obvious patch that could have caused this new failure, has anyone else run
into this?  I couldn't find anything in the bug database or in the mailing
lists.

Steve Ellcey
sell...@imgtec.com


Re: GTY / gengtype question - adding a new header file

2015-09-01 Thread Steve Ellcey
On Tue, 2015-09-01 at 10:13 +0200, Georg-Johann Lay wrote:

> 
> I'd have a look at what BEs are using non-default target_gtfiles.
> 
> Johann

There are a few BEs that add a .c file to target_gtfiles, but no
platforms that add a .h file to target_gtfiles.  I do see a number
of platforms that define the machine_function structure in their header
file (aarch64.h, pa.h, i386.h) instead of their .c file though.

Maybe that is a better way to go for MIPS instead of doing something
completely new.  If I move machine_function, mips_frame_info,
mips_int_mask, and mips_shadow_set from mips.c to mips.h then I could
put my new machine specific pass in a separate .c file from mips.c and
not need to do anything with target_gtfiles.  The only reason I didn't
want to do this was so that machine_function wasn't visible to the rest
of GCC but that doesn't seem to have been an issue for other targets.

Steve Ellcey
sell...@imgtec.com




Re: GTY / gengtype question - adding a new header file

2015-09-01 Thread Steve Ellcey
On Tue, 2015-09-01 at 08:11 +0100, Richard Sandiford wrote:

> config.gcc would need to add mips-private.h to target_gtfiles.

OK, that was what I missed.

> I'm not sure splitting the file is a good idea though.  At the moment
> the definitions of all target hooks must be visible to a single TU.
> Either you'd need to keep all the hooks in one .c file (leading
> to an artificial split IMO) or you'd need declare some of them
> in the private header.  Declaring them in the header file would only be
> consistent if the targetm definition was in its own file (so that _every_
> hook had a prototype in the private header).  That seems like unnecessary
> work though.

The code I want to add is actually a separate GCC pass so it breaks out
fairly cleanly.  It just needs access to the machine_function structure
and the types and structures included in that structure
(mips_frame_info, mips_int_mask, and mips_shadow_set).  It sets a couple
of new boolean variables in the machine_function structure which are
then used during mips_compute_frame_info.

I see what you mean about much of mips.c probably not being splittable
due to the target hook structure but machine specific passes may be the
exception to that rule.  We already have one pass in mips.c
(pass_mips_machine_reorg2), that might be something else that could be
broken out, though I haven't looked in detail to see what types or
structures it would need access to.

Steve Ellcey
sell...@imgtec.com



GTY / gengtype question - adding a new header file

2015-08-31 Thread Steve Ellcey

I have a question about gengtype and GTY.  I was looking at adding some
code to mips.c and it occurred to me that that file was getting very
large (19873 lines).  So I wanted to add a new .c file instead but that
file needed some types that were defined in mips.c and not in a header file.
Specifically it needed the MIPS specific machine_function structure that
is defined in mips.c with:

struct GTY(())  machine_function {

I think I could just move this to mips.h and things would be fine but
I didn't want to do that because mips.h is included in tm.h and is visible
to the generic GCC code.  Currently machine_function is not visible to the
generic GCC code and so I wanted to put machine_function in a header file
that could only be seen/used by mips specific code.  So I created
mips-private.h and added it to extra_headers in config.gcc.

The problem is that if I include mips-private.h in mips.c instead of
having the actual definition of machine_function in mips.c then my
build fails and I think it is due to how and where gengtype scans for GTY
uses.

I couldn't find an example of a platform that has a machine specific header
file that was not visible to the generic GCC code and that has GTY types
in it so I am not sure what I need to do to get gengtype to scan
mips-private.h or if this is even possible (or wise).

Steve Ellcey
sell...@imgtec.com


Re: fake/abnormal/eh edge question

2015-08-25 Thread Steve Ellcey
On Tue, 2015-08-25 at 14:44 -0600, Jeff Law wrote:

> > I want to preserve the copy of $sp to $12 and I also want to preserve the
> > .cfi psuedo-ops (and code) in the exit block and epilogue in order for
> > exception handling to work correctly.  One way I thought of doing this
> > is to create an edge from the entry block to the exit block but I am
> > unsure of all the implications of creating a fake/eh/abnormal edge to
> > do this and which I would want to use.
> Presumably it's the RTL DCE pass that's eliminating this stuff?

Actually, it looks like it is peephole2 that is eliminating the
instructions (and .cfi pseudo-ops).

> 
> Do you have the FRAME_RELATED bit set of those insns?
> 
> But what I don't understand is why preserving the code is useful if it 
> can't be reached.  Maybe there's something about the dwarf2 unwinding 
> that I simply don't understand -- I've managed to avoid learning about 
> it for years.

I am not entirely sure whether I need the code itself or just the .cfi
pseudo-ops, with the code needed only to generate the .cfi stuff.

I wish I could avoid the dwarf unwinder but that seems to be the main
problem I am having with stack realignment.  Getting the cfi stuff right
so that the unwinder works properly is proving very hard.

Steve Ellcey
sell...@imgtec.com




fake/abnormal/eh edge question

2015-08-25 Thread Steve Ellcey
I have a question about FAKE, EH, and ABNORMAL edges.  I am not sure I 
understand all the implications of each type of edge from the description
in cfg-flags.def.

I am trying to implement dynamic stack alignment for MIPS and I have code
that does the following:

prologue
copy incoming $sp to $12 (temp reg)
align $sp
copy $sp to $fp (after alignment so that $fp is also aligned)
entry block
copy $12 to virtual reg (DRAP) for accessing args and for
restoring $sp

exit block
copy virtual reg (DRAP) back to $12
epilogue
copy $12 to $sp to restore stack pointer


This works fine as long as there as a path from the entry block to the
exit block but in some cases (like gcc.dg/cleanup-8.c) we have a function
that always calls abort (a non-returning function) and so there is no 
path from entry to exit and the exit block and epilogue get removed and
the copy of $sp to $12 also gets removed because GCC sees no uses of $12.

I want to preserve the copy of $sp to $12 and I also want to preserve the
.cfi pseudo-ops (and code) in the exit block and epilogue in order for
exception handling to work correctly.  One way I thought of doing this
is to create an edge from the entry block to the exit block but I am
unsure of all the implications of creating a fake/eh/abnormal edge to
do this and which I would want to use.

Steve Ellcey
sell...@imgtec.com


Re: CFI directives and dynamic stack alignment

2015-08-24 Thread Steve Ellcey
On Tue, 2015-08-18 at 09:23 +0930, Alan Modra wrote:
> On Mon, Aug 17, 2015 at 10:38:22AM -0700, Steve Ellcey wrote:

> OK, then you need to emit a .cfi directive to say the frame top is
> given by the temp hard reg sometime after that assignment and before
> sp is aligned in the prologue, and another .cfi directive when copying
> to the pseudo.  It's a while since I looked at the CFI code in gcc,
> but arranging this might be as simple as setting RTX_FRAME_RELATED_P
> on the insns involved.
> 
> If -fasynchronous-unwind-tables, then you'll also need to track the
> frame in the epilogue.
> 
> > This function (fn2) ends with a call to abort, which is noreturn, so the
> > optimizer sees that the epilogue is dead code and GCC determines that
> > there is no need to save the old stack pointer since it will never get
> > restored.   I guess I need to tell GCC to save the stack pointer in
> > expand_prologue even if it never sees a use for it.  I guess I need to
> > make the temporary register where I save $sp volatile or do something
> > else so that the assignment (and its associated .cfi) is not deleted by
> > the optimizer.
> 
> Ah, I see.  Yes, the temp and pseudo are not really dead if they are
> needed for unwinding.

Yes, I was originally thinking I just had to make the temp and pseudo
regs volatile so that the assignments would not get removed but it
appears that I need the epilogue code too (even if I never get there
because of a call to abort which GCC knows is non-returning) so that I
have the needed .cfi directives there.  I am thinking I should add an
edge from the entry_block to the exit_block so that the exit block is
never removed by the optimizer.  I assume this edge would need to be
abnormal and/or fake but I am not sure which (if either) of these edges
would be appropriate for this.

Steve Ellcey
sell...@imgtec.com



Re: Adding an IPA pass question (pass names)

2015-08-19 Thread Steve Ellcey
On Wed, 2015-08-19 at 13:40 -0400, David Malcolm wrote:

> Is your pass of the correct type?  (presumably IPA_PASS).  I've run into
> this a few times with custom passes (which seems to be a "gotcha");
> position_pass can fail here:
> 
>   /* Check if the current pass is of the same type as the new pass and
>  matches the name and the instance number of the reference pass.  */
>   if (pass->type == new_pass_info->pass->type
> 
> 
> Hope this is helpful
> Dave

That seems to have been the problem.   I made my pass SIMPLE_IPA_PASS
and the comdats pass is just IPA_PASS.  I changed mine to IPA_PASS and
it now registers the pass.

Steve Ellcey
sell...@imgtec.com



Adding an IPA pass question (pass names)

2015-08-19 Thread Steve Ellcey

I am trying to create a new IPA pass to scan the routines being compiled
by GCC and I thought I would put it in after the last IPA pass (comdats)
so I tried to register it with:

  opt_pass *p = make_pass_ipa_frame_header_opt (g);
  static struct register_pass_info f = 
{p, "comdats", 1, PASS_POS_INSERT_AFTER };
  register_pass (&f);

But when I build GCC I get:

/scratch/sellcey/repos/header2/src/gcc/libgcc/libgcc2.c:1:0: fatal error: pass 
'comdats' not found but is referenced by new pass 'frame-header-opt'

Does anyone know why this is the case?  "comdats" is what is used for
the name of pass_ipa_comdats in ipa-comdats.c.

Steve Ellcey
sell...@imgtec.com


Re: CFI directives and dynamic stack alignment

2015-08-17 Thread Steve Ellcey
On Tue, 2015-08-11 at 10:05 +0930, Alan Modra wrote:

> > The 'and' instruction is where the stack gets aligned and if I remove that
> > one instruction, everything works.  I think I need to put out some new CFI
> > pseudo-ops to handle this but I am not sure what they should be.  I am just
> > not very familiar with the CFI directives.
> 
> I don't speak mips assembly very well, but it looks to me that you
> have more than just CFI problems.  How do you restore sp on return
> from the function, assuming sp wasn't 16-byte aligned to begin with?
> Past that "and $sp,$sp,$3" you don't have any means of calculating
> the original value of sp!  (Which of course is why you also can't find
> a way of representing the frame address.)

I have code in expand_prologue that copies the incoming stack pointer to
a temporary hard register and then I have code in the entry_block to
copy that register into a virtual register.  In the exit block that
virtual register is copied back to a temporary hard register and
expand_epilogue copies it back to $sp to restore the stack pointer.

This function (fn2) ends with a call to abort, which is noreturn, so the
optimizer sees that the epilogue is dead code and GCC determines that
there is no need to save the old stack pointer since it will never get
restored.   I guess I need to tell GCC to save the stack pointer in
expand_prologue even if it never sees a use for it.  I guess I need to
make the temporary register where I save $sp volatile or do something
else so that the assignment (and its associated .cfi) is not deleted by
the optimizer.

Steve Ellcey
sell...@imgtec.com



CFI directives and dynamic stack alignment

2015-08-03 Thread Steve Ellcey

I don't know if there are any CFI experts out there but I am working on
dynamic stack alignment for MIPS.  I think I have it working in the 'normal'
case but when I try to do stack unwinding through a routine with an aligned
stack, then I have problems.  I was wondering if someone can help me understand
what CFI directives to generate to allow stack unwinding.  Using
gcc.dg/cleanup-8.c as an example (because it fails with my stack alignment
code), if I generate code with no dynamic stack alignment (but forcing the
use of the frame pointer), the routine fn2 looks like this on MIPS:

fn2:
	.frame	$fp,32,$31		# vars= 0, regs= 2/0, args= 16, gp= 8
	.mask	0xc0000000,-4
	.fmask	0x00000000,0
	.set	noreorder
	.set	nomacro
	lui	$2,%hi(null)
	addiu	$sp,$sp,-32
	.cfi_def_cfa_offset 32
	lw	$2,%lo(null)($2)
	sw	$fp,24($sp)
	.cfi_offset 30, -8
	move	$fp,$sp
	.cfi_def_cfa_register 30
	sw	$31,28($sp)
	.cfi_offset 31, -4
	jal	abort
	sb	$0,0($2)

There are .cfi directives when incrementing the stack pointer, saving the
frame pointer, and copying the stack pointer to the frame pointer.

When I generate code to dynamically align the stack my code looks like
this:

fn2:
	.frame	$fp,32,$31		# vars= 0, regs= 2/0, args= 16, gp= 8
	.mask	0xc0000000,-4
	.fmask	0x00000000,0
	.set	noreorder
	.set	nomacro
	lui	$2,%hi(null)
	li	$3,-16			# 0xfffffff0
	lw	$2,%lo(null)($2)
	and	$sp,$sp,$3
	addiu	$sp,$sp,-32
	.cfi_def_cfa_offset 32
	sw	$fp,24($sp)
	.cfi_offset 30, -8
	move	$fp,$sp
	.cfi_def_cfa_register 30
	sw	$31,28($sp)
	.cfi_offset 31, -4
	jal	abort
	sb	$0,0($2)

The 'and' instruction is where the stack gets aligned and if I remove that
one instruction, everything works.  I think I need to put out some new CFI
pseudo-ops to handle this but I am not sure what they should be.  I am just
not very familiar with the CFI directives.

I looked at ix86_emit_save_reg_using_mov where there is some special
code for handling the drap register and for saving registers on a 
realigned stack but I don't really understand what they are trying 
to do.

Any help?

Steve Ellcey
sell...@imgtec.com

P.S. For completeness sake I have attached my current dynamic
 alignment changes in case anyone wants to see them.

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 4f9a31d..386c2ce 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -5737,6 +5737,29 @@ expand_stack_alignment (void)
   gcc_assert (targetm.calls.get_drap_rtx != NULL);
   drap_rtx = targetm.calls.get_drap_rtx ();
 
+  /* I am not doing this in get_drap_rtx because we are also calling
+ that from expand_function_end in order to get/set the drap_reg
+ and vdrap_reg variables and doing these instructions at that
+ point is not working.   */
+
+  if (drap_rtx != NULL_RTX)
+{
+  rtx_insn *insn, *seq;
+
+  start_sequence ();
+  emit_move_insn (crtl->vdrap_reg, crtl->drap_reg);
+  seq = get_insns ();
+  insn = get_last_insn ();
+  end_sequence ();
+  emit_insn_at_entry (seq);
+  if (!optimize)
+{
+  add_reg_note (insn, REG_CFA_SET_VDRAP, crtl->vdrap_reg);
+  RTX_FRAME_RELATED_P (insn) = 1;
+}
+}
+
+
   /* stack_realign_drap and drap_rtx must match.  */
   gcc_assert ((stack_realign_drap != 0) == (drap_rtx != NULL));
 
diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c
index ce21a0f..b6ab30a 100644
--- a/gcc/config/mips/mips.c
+++ b/gcc/config/mips/mips.c
@@ -746,6 +746,8 @@ static const struct attribute_spec mips_attribute_table[] = {
   { "use_shadow_register_set",	0, 0, false, true,  true, NULL, false },
   { "keep_interrupts_masked",	0, 0, false, true,  true, NULL, false },
   { "use_debug_exception_return", 0, 0, false, true,  true, NULL, false },
+  { "align_stack", 0, 0, true, false, false, NULL, false },
+  { "no_align_stack", 0, 0, true, false, false, NULL, false },
   { NULL,	   0, 0, false, false, false, NULL, false }
 };
 
@@ -1528,6 +1530,61 @@ mips_merge_decl_attributes (tree olddecl, tree newdecl)
 			   DECL_ATTRIBUTES (newdecl));
 }
 
+static bool
+mips_cfun_has_msa_p (void)
+{
+  /* For now, for testing, assume all functions use MSA
+ (and thus need alignment).  */
+#if 0
+  if (!cfun || !TARGET_MSA)
+return FALSE;
+
+  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
+{
+  if (MSA_SUPPORTED_MODE_P (GET_MODE (insn)))
+	return TRUE;
+}
+
+  return FALSE;
+#else
+  return TRUE;
+#endif
+}
+
+bool
+mips_align_stack_p (void)
+{
+  bool want_alignment = TARGET_ALIGN_STACK && 

Re: Basic GCC testing question

2015-07-10 Thread Steve Ellcey
On Fri, 2015-07-10 at 14:27 -0500, Segher Boessenkool wrote:
> On Fri, Jul 10, 2015 at 10:43:43AM -0700, Steve Ellcey  wrote:
> > 
> > I have a basic GCC testing question.  I built a native GCC and ran:
> > 
> > make RUNTESTFLAGS='dg.exp' check
> > 
> > Everything passed and according to the log file it used the unix.exp
> > as the target-board.  But if I try running:
> > 
> > make RUNTESTFLAGS='dg.exp --target-board=unix' check
> 
> Does it work better if you spell --target_board ?
> 
> 
> Segher


Argh, I hate it when I do something stupid like that.  It would be nice
if runtest gave an error message when it had a bad/unknown argument, but
if it does I didn't see it anywhere.

Steve Ellcey



Basic GCC testing question

2015-07-10 Thread Steve Ellcey

I have a basic GCC testing question.  I built a native GCC and ran:

make RUNTESTFLAGS='dg.exp' check

Everything passed and according to the log file it used the unix.exp
as the target-board.  But if I try running:

make RUNTESTFLAGS='dg.exp --target-board=unix' check

Then I get failures.  They both say they are running target unix.
If I diff the two log files I see:

1,2c1,3
< Test Run By sellcey on Fri Jul 10 10:13:21 2015
< Native configuration is x86_64-unknown-linux-gnu
---
> Test Run By sellcey on Fri Jul 10 09:52:41 2015
> Target is unix
> Host   is x86_64-unknown-linux-gnu
12a14,15
> WARNING: Assuming target board is the local machine (which is 
probably wrong).
> You may need to set your DEJAGNU environment variable.

The reason I want to specify a target-board is so I can then modify it with
something like '--target-board=unix/-m32', but I think I need to specify a
board before I add any options, don't I?
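For reference, the spellings that work (as the reply further up this thread points out, the option is --target_board with an underscore, and extra options are appended to the board name after a slash):

```shell
# select the default board explicitly (underscore, not hyphen)
make RUNTESTFLAGS='dg.exp --target_board=unix' check

# same board with an extra option appended after a slash
make RUNTESTFLAGS='dg.exp --target_board=unix/-m32' check
```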

Steve Ellcey
sell...@imgtec.com


Re: Question about DRAP register and reserving hard registers

2015-07-09 Thread Steve Ellcey
On Mon, 2015-06-29 at 11:10 +0100, Richard Henderson wrote:

> > OK, I think I have this part of the code working on MIPS but
> > crtl->drap_reg is used in the epilogue as well as the prologue even if
> > it is not 'live' in between.  If I understand the code correctly the x86
> > prologue pushes the drap register on to the stack so that the epilogue
> > can pop it off and use it to restore the stack pointer.  Is my
> > understanding correct?
> 
> Yes.  Although that saved copy is also used by unwind info.

Do you know how and where this saved copy is used by the unwind info?
I don't see any indication that the unwind library knows whether a stack
has been dynamically realigned, and I don't see where the unwinder makes
use of this value.

Steve Ellcey
sell...@imgtec.com



Re: Question about DRAP register and reserving hard registers

2015-07-07 Thread Steve Ellcey
On Mon, 2015-06-29 at 11:10 +0100, Richard Henderson wrote:

> > I also need the drap pointer in the MIPS epilogue but I would like to
> > avoid having to get it from memory.  Ideally I would like to restore it
> > from the virtual register that the prologue code / get_drap_rtx code put
> > it into.  I tried just doing a move from the virtual drap register to
> > the real one in expand_epilogue but that didn't work because it looks
> > like you can't access virtual registers from expand_prologue or
> > expand_epilogue.  I guess that is why the code to copy the hard drap reg
> > to the virtual drap_reg is done in get_drap_reg and not in
> > expand_prologue.  I thought about putting code in get_drap_reg to do
> > this copying but I don't see how to access the end of a function.  The
> > hard drap reg to virtual drap reg copy is inserted into the beginning of
> > a function with:
> >
> > insn = emit_insn_before (seq, NEXT_INSN (entry_of_function ()));
> >
> > Is there an equivalent method to insert code to the end of a function?
> > I don't see an 'end_of_function ()' routine anywhere.
> 
> Because, while generating initial rtl for a function, the beginning of a 
> function has already been emitted, while the end of the function hasn't.
> 
> You'd need to hook into expand_function_end, right at the bottom, before the 
> call to use_return_register.
> 
> 
> r~

I ran into an interesting issue while doing this.  Right now the expand
pass calls construct_exit_block (which calls expand_function_end) before
it calls expand_stack_alignment.  That means that crtl->drap_reg, etc.
are not yet set up in expand_function_end.  I moved the
expand_stack_alignment call up before construct_exit_block to fix that.
I hope moving it up doesn't break anything.

Steve Ellcey
sell...@imgtec.com



Re: Question about DRAP register and reserving hard registers

2015-06-22 Thread Steve Ellcey
On Fri, 2015-06-19 at 09:09 -0400, Richard Henderson wrote:
> On 06/16/2015 07:05 PM, Steve Ellcey  wrote:
> >
> > I have a question about the DRAP register (used for dynamic stack alignment)
> > and about reserving/using hard registers in general.  I am trying to 
> > understand
> > where, if a drap register is allocated, GCC is told not to use it during
> > general register allocation.  There must be some code somewhere for this
> > but I cannot find it.
> 
> There isn't.  Because the vDRAP register is a pseudo.  The DRAP register is 
> only live from somewhere in the middle of the prologue to the end of the 
> prologue.
> 
> See ix86_get_drap_rtx, wherein we coordinate with the to-be-generated 
> prologue 
> (crtl->drap_reg), allocate the pseudo, and emit the hard-reg-to-pseudo copy 
> at 
> entry_of_function.
> 
> 
> r~

OK, I think I have this part of the code working on MIPS but
crtl->drap_reg is used in the epilogue as well as the prologue even if
it is not 'live' in between.  If I understand the code correctly the x86
prologue pushes the drap register on to the stack so that the epilogue
can pop it off and use it to restore the stack pointer.  Is my
understanding correct?

I also need the drap pointer in the MIPS epilogue but I would like to
avoid having to get it from memory.  Ideally I would like to restore it
from the virtual register that the prologue code / get_drap_rtx code put
it into.  I tried just doing a move from the virtual drap register to
the real one in expand_epilogue but that didn't work because it looks
like you can't access virtual registers from expand_prologue or
expand_epilogue.  I guess that is why the code to copy the hard drap reg
to the virtual drap_reg is done in get_drap_reg and not in
expand_prologue.  I thought about putting code in get_drap_reg to do
this copying but I don't see how to access the end of a function.  The
hard drap reg to virtual drap reg copy is inserted into the beginning of
a function with:

insn = emit_insn_before (seq, NEXT_INSN (entry_of_function ()));

Is there an equivalent method to insert code to the end of a function?
I don't see an 'end_of_function ()' routine anywhere.

Steve Ellcey
sell...@imgtec.com







Re: Question about DRAP register and reserving hard registers

2015-06-19 Thread Steve Ellcey
On Fri, 2015-06-19 at 09:09 -0400, Richard Henderson wrote:
> On 06/16/2015 07:05 PM, Steve Ellcey  wrote:
> >
> > I have a question about the DRAP register (used for dynamic stack alignment)
> > and about reserving/using hard registers in general.  I am trying to 
> > understand
> > where, if a drap register is allocated, GCC is told not to use it during
> > general register allocation.  There must be some code somewhere for this
> > but I cannot find it.
> 
> There isn't.  Because the vDRAP register is a pseudo.  The DRAP register is 
> only live from somewhere in the middle of the prologue to the end of the 
> prologue.
> 
> See ix86_get_drap_rtx, wherein we coordinate with the to-be-generated 
> prologue 
> (crtl->drap_reg), allocate the pseudo, and emit the hard-reg-to-pseudo copy 
> at 
> entry_of_function.
> 
> 
> r~

OK, that makes more sense now.  In my work on MIPS I was trying to cut
out some of the complexity of the x86 implementation and just use a hard
register as my DRAP register.  One of the issues I ran into, and perhaps
the one that caused x86 to use a virtual register, was saving and
restoring the register during setjmp/longjmp and C++ exception handling
usage.  I will try switching to a virtual register and see if that
works better.

Other than exceptions, the main complexity in dynamic stack alignment
seems to involve the debug information.  I am still trying to understand
the handling of the drap register and dynamic stack alignment in
dwarf2out.c and dwarf2cfi.c.

Steve Ellcey
sell...@imgtec.com



Question about DRAP register and reserving hard registers

2015-06-16 Thread Steve Ellcey

I have a question about the DRAP register (used for dynamic stack alignment)
and about reserving/using hard registers in general.  I am trying to understand
where, if a drap register is allocated, GCC is told not to use it during
general register allocation.  There must be some code somewhere for this
but I cannot find it.

I am trying to implement dynamic stack alignment on MIPS and because there
is so much code for the x86 dynamic stack alignment I am trying to incorporate
bits of it as I understand what I need instead of just turning it all on
at once and getting completely lost.

Right now I am using register 16 on MIPS to access incoming arguments
in a function that needs dynamic alignment, so it is my drap register if
my understanding of the x86 code and its use of a DRAP register is correct.
I copy the stack pointer into reg 16 before I align the stack pointer
(during expand_prologue).  So far the only way I have found to stop the
register allocator from also using reg 16 and thus messing up its value is to
set fixed_regs[16].  But I don't see the x86 doing this for its DRAP register
and I was wondering how it is handled there.

I think setting fixed_regs[16] is why C++ tests with exception handling are
not working for me because this register is not getting set and restored
(since it is thought to be fixed) during code that uses throw and catch.

Steve Ellcey
sell...@imgtec.com


Build oddity (Mode = sf\|df messages in output)

2015-04-30 Thread Steve Ellcey

I am curious, has anyone started seeing these messages in their GCC build
output:

Mode = sf\|df
Suffix = si\|2\|3

I think they come from the libgcc build but I can't figure out exactly where.
I am not sure if they are intentional messages or accidental debug statements
that escaped.  They do not seem to be causing any problems during the build,
they just got me curious.

Steve Ellcey
sell...@imgtec.com


Re: Running GCC testsuite with --param option (requires space in argument)

2015-04-28 Thread Steve Ellcey
On Tue, 2015-04-28 at 22:58 +0200, Jakub Jelinek wrote:

> > I tried:
> > 
> > export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
> > export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'
> 
> Have you tried
> export RUNTESTFLAGS='--target_board=multi-sim/--param=foo=1'
> ?
> 
>   Jakub

Nope, but it seems to work.  That syntax is not documented in
invoke.texi.  I will see about submitting a patch (or at least a
documentation bug report).

Steve Ellcey



Running GCC testsuite with --param option (requires space in argument)

2015-04-28 Thread Steve Ellcey
Has anyone run the GCC testsuite using a --param option?  I am trying
to do something like:

export RUNTESTFLAGS='--target_board=multi-sim/--param foo=1'
make check

But the space in the '--param foo=1' option is causing dejagnu to fail.
Perhaps there is a way to specify a param value without a space in the
option?  If there is I could not find it.

I tried:

export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'

But neither of those worked either.
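(As Jakub's reply further up in this archive notes, the working spelling avoids the space entirely by using an '=' between --param and its argument:)

```shell
export RUNTESTFLAGS='--target_board=multi-sim/--param=foo=1'
make check
```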

Steve Ellcey
sell...@imgtec.com


Re: How do I set a hard register in gimple

2015-04-22 Thread Steve Ellcey
Following up to my own email, I think I found the missing magic.  I
needed to set global_regs[16] to 1.  Once global_regs was set for the
register, the assignment stopped getting optimized out.

Steve Ellcey
sell...@imgtec.com


On Wed, 2015-04-22 at 12:27 -0700, Steve Ellcey wrote:
> On Wed, 2015-04-22 at 12:28 +0200, Steven Bosscher wrote:
> 
> > This is wrong for sure. You can't have DECL_RTL in GIMPLE.
> > 
> > You will want to set has_local_explicit_reg_vars, DECL_HARD_REGISTER,
> > and DECL_ASSEMBLER_NAME, and leave it to the middle end to take care
> > of everything else.
> > 
> > Ciao!
> > Steven
> 
> Thanks for the advice, I switched to DECL_HARD_REGISTER and
> DECL_ASSEMBLER_NAME but I am still having the same problem, which is
> that the assignment to this global (register) variable is getting
> optimized away.
> 
> If I have:
> 
>   ptr_type = build_pointer_type (char_type_node);
>   id = get_identifier ("X");
>   ptr_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, id, ptr_type);
>   TREE_PUBLIC (ptr_var) = 1;
>   DECL_EXTERNAL (ptr_var) = 1;
>   varpool_node::finalize_decl (ptr_var);
> 
> Then the assignment to this global variable is not removed by the
> optimizer, which makes sense because someone outside the function could
> access the value of the global variable.
> 
> But if I change it to:
> 
>   ptr_type = build_pointer_type (char_type_node);
>   id = get_identifier ("*$16");
>   ptr_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, id, ptr_type);
>   TREE_PUBLIC (ptr_var) = 1;
>   DECL_EXTERNAL (ptr_var) = 1;
>   DECL_REGISTER (ptr_var) = 1;
>   DECL_HARD_REGISTER (ptr_var) = 1;
>   SET_DECL_ASSEMBLER_NAME (ptr_var, id);
>   varpool_node::finalize_decl (ptr_var);
> 
> Then the assignment to this variable is optimized away by the cse1
> optimization phase.
> 
> Steve Ellcey
> sell...@imgtec.com





Re: How do I set a hard register in gimple

2015-04-22 Thread Steve Ellcey
On Wed, 2015-04-22 at 12:28 +0200, Steven Bosscher wrote:

> This is wrong for sure. You can't have DECL_RTL in GIMPLE.
> 
> You will want to set has_local_explicit_reg_vars, DECL_HARD_REGISTER,
> and DECL_ASSEMBLER_NAME, and leave it to the middle end to take care
> of everything else.
> 
> Ciao!
> Steven

Thanks for the advice, I switched to DECL_HARD_REGISTER and
DECL_ASSEMBLER_NAME but I am still having the same problem, which is
that the assignment to this global (register) variable is getting
optimized away.

If I have:

  ptr_type = build_pointer_type (char_type_node);
  id = get_identifier ("X");
  ptr_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, id, ptr_type);
  TREE_PUBLIC (ptr_var) = 1;
  DECL_EXTERNAL (ptr_var) = 1;
  varpool_node::finalize_decl (ptr_var);

Then the assignment to this global variable is not removed by the
optimizer, which makes sense because someone outside the function could
access the value of the global variable.

But if I change it to:

  ptr_type = build_pointer_type (char_type_node);
  id = get_identifier ("*$16");
  ptr_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, id, ptr_type);
  TREE_PUBLIC (ptr_var) = 1;
  DECL_EXTERNAL (ptr_var) = 1;
  DECL_REGISTER (ptr_var) = 1;
  DECL_HARD_REGISTER (ptr_var) = 1;
  SET_DECL_ASSEMBLER_NAME (ptr_var, id);
  varpool_node::finalize_decl (ptr_var);

Then the assignment to this variable is optimized away by the cse1
optimization phase.

Steve Ellcey
sell...@imgtec.com



How do I set a hard register in gimple

2015-04-21 Thread Steve Ellcey
I have a question about inserting code into a function being compiled by
GCC.  Basically I want to set a hard register at the beginning of a
function, as is done below.  If I compile the program below on MIPS,
the $16 register gets set to the result of alloca; even if I optimize
the routine and nothing else uses p ($16), the set of $16 still happens.

register void *p asm ("$16");
void *foo(void *a)
{
p = alloca(64);
/* Rest of function.  */
}

But if I try to insert this code myself from inside GCC the setting of $16
keeps getting optimized away and I cannot figure out how to stop it.
My code to set the register does this:

  ptr_var = build_decl (UNKNOWN_LOCATION, VAR_DECL,
get_identifier ("__alloca_reg"), ptr_type);
  TREE_PUBLIC (ptr_var) = 1;
  DECL_EXTERNAL (ptr_var) = 1;
  SET_DECL_RTL (ptr_var, gen_raw_REG (Pmode, 16));
  DECL_REGISTER (ptr_var) = 1;
  DECL_HARD_REGISTER (ptr_var) = 1;
  TREE_THIS_VOLATILE (ptr_var) = 1;
  TREE_USED (ptr_var) = 1;
  varpool_node::finalize_decl (ptr_var);

  stmt = gimple_build_assign (ptr_var, build_fold_addr_expr (array_var));
  e = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (fun));
  gsi_insert_on_edge_immediate (e, stmt);

And I see the code during the rtl expansion, but during the first CSE
pass the set of $16 goes away.  How do I mark this variable as 'volatile'
so that the assignment to it does not go away?  It must be possible, because
the set is not removed in my small example program, but I cannot figure
out what that program sets that I am not setting.

Steve Ellcey
sell...@imgtec.com

