Re: CREL relocation format for ELF (was: RELLEB)

2024-03-28 Thread Alan Modra via Gcc
On Fri, Mar 22, 2024 at 06:51:41PM -0700, Fangrui Song wrote:
> On Thu, Mar 14, 2024 at 5:16 PM Fangrui Song wrote:
> > I propose RELLEB, a new format offering significant file size
> > reductions: 17.2% (x86-64), 16.5% (aarch64), and even 32.4% (riscv64)!
> >
> > Your thoughts on RELLEB are welcome!

Does anyone really care about relocatable object file size?  If they
do, wouldn't they be better off using a compressed file system?

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC][top-level] Add configure test-case

2022-11-07 Thread Alan Modra via Gcc
On Mon, Nov 07, 2022 at 06:23:45PM +0000, Joseph Myers wrote:
> On Mon, 7 Nov 2022, Alan Modra via Binutils wrote:
> 
> > a) that top-level binutils/gdb patches don't get applied to the gcc
> >    git repository in a timely manner, or
> 
> If a toplevel patch is approved for either repository, I think you should 
> treat it as approved for the other one without needing separate review.

Thanks Joseph, that's how I see it too.  Of course with the
understanding that binutils-gdb can't be used as a back door way of
sneaking in a gcc-specific change.

Can I get agreement among the gcc build maintainers that such a
policy is acceptable?

-- 
Alan Modra
Australia Development Lab, IBM


Re: Transitioning to a new Xtensa Port Maintainer

2020-05-28 Thread Alan Modra via Gcc
On Thu, May 28, 2020 at 10:12:04AM -0700, augustine.sterling--- via Gcc wrote:
> After a long run as the Xtensa port maintainer, it is time for me to move
> on.
> 
> I nominate Max Filippov [cc'd] as the new maintainer. He has contributed
> numerous patches over the years which have required minimal comments from
> me.

Sign him up by adding his name in binutils/MAINTAINERS.

> If there is anything I need to do to facilitate this, let me know. I'm
> happy to still review patches when needed.

That sounds like you're not abandoning binutils entirely, so you could
simply leave your name in binutils/MAINTAINERS where it is but put Max
on the line before your entry.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Modifying RTL cost model to know about long-latency loads

2020-04-11 Thread Alan Modra via Gcc
On Sat, Apr 11, 2020 at 04:27:07PM -0700, Sasha Krassovsky via Gcc wrote:
> However, in the following example, the load does get the cost applied to it 
> but the store to B does not. 
> 
> void bar(__attribute__((remote(5))) int *A, int *B)
> {
>     if (*A > 5)
>         *A = 10;
>     *B = *A;
> }
> 
> I was wondering if this is the correct way to approach this problem, and also 
> why the attribute sometimes gets applied and sometimes not.

There are many places in the compiler that only consider the cost of
the source of a SET insn, that is, given something like

 (set (mem (address)) (op (reg) (const_int)))

will only pass (op (reg) (const_int)) to rtx_cost, and from there to
TARGET_RTX_COSTS.  Those places are generally looking at costs to see
whether variations in the SET source are profitable, for example
whether it is better to put the (const_int) in another reg first then
change the expression to (op (reg) (reg2)).

Other places, eg. rtlanal.c:seq_cost do pass the entire SET.
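
To make the distinction concrete, here is a toy model in C.  It is not
GCC code: struct toy_rtx, toy_cost, cost_src_only and cost_whole_set
are names invented for illustration, and the cost numbers are
arbitrary assumptions.

```c
#include <stddef.h>

/* Toy stand-in for RTL; none of this is GCC's real data structure.  */
enum toy_code { TOY_REG, TOY_CONST_INT, TOY_MEM, TOY_OP, TOY_SET };

struct toy_rtx
{
  enum toy_code code;
  const struct toy_rtx *op0;	/* first operand, or NULL */
  const struct toy_rtx *op1;	/* second operand, or NULL */
  int remote;			/* nonzero for a long-latency MEM */
};

/* Arbitrary illustrative costs: a remote MEM is expensive.  */
static int
toy_cost (const struct toy_rtx *x)
{
  int cost;

  if (x == NULL)
    return 0;
  switch (x->code)
    {
    case TOY_REG:       cost = 0; break;
    case TOY_CONST_INT: cost = 1; break;
    case TOY_MEM:       cost = x->remote ? 20 : 4; break;
    case TOY_OP:        cost = 1; break;
    default:            cost = 0; break;	/* the SET itself is free */
    }
  return cost + toy_cost (x->op0) + toy_cost (x->op1);
}

/* Many callers: only the SET source (op1) is costed, so a costly
   MEM destination is never seen.  */
int
cost_src_only (const struct toy_rtx *set)
{
  return toy_cost (set->op1);
}

/* A seq_cost-like caller: the whole SET is costed, destination
   included.  */
int
cost_whole_set (const struct toy_rtx *set)
{
  return toy_cost (set);
}
```

For a SET whose destination is a remote MEM, cost_src_only never
visits the MEM node, which matches the symptom of the store not
getting the extra cost.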

-- 
Alan Modra
Australia Development Lab, IBM


Re: Git ChangeLog policy for GCC Testsuite inquiry

2020-02-07 Thread Alan Modra
On Fri, Feb 07, 2020 at 10:08:25AM +, Jonathan Wakely wrote:
> With Git you can't really have unwanted local commits present in a
> tree if you use a sensible workflow, so if you tested in a tree that
> was at commit 1234abcd and you push from another machine that is at
> the same commit, you know there are no unintended differences.

Maybe I don't have a sensible workflow, but often with lots of tiddly
little binutils patches I don't bother with branches for everything.
(I do use branches for larger work.)  I also like to test my patches.
I'll test individually on a few relevant targets but do a test over
a large number of targets (162 currently) for a bunch of patches.
Some of those patches tested might not be ready for commit upstream
(lacking comments, changelogs, even lacking that vital self review),
so I'll "git rebase -i" to put the ones that are ready first, then
"git push origin <commit>:master"
just to push up to the relevant commit.  That works quite well for me.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Silly GIT related question

2020-01-14 Thread Alan Modra
On Wed, Jan 15, 2020 at 03:11:13AM +0000, Gary Oblock wrote:
> If you just do a clone and don't checkout a branch, is this equivalent
> to the top of the trunk in the old scheme?

Yes.  More details in "git help clone".

-- 
Alan Modra
Australia Development Lab, IBM


Re: RFC: Extending --with-advance-toolchain to aarch64

2019-10-09 Thread Alan Modra
On Wed, Oct 09, 2019 at 10:29:48PM +, Steve Ellcey wrote:
> I have a question about building a toolchain that uses (at run time) a
> dynamic linker and system libraries and headers that are in a non-standard
> place.

I had scripts a long time ago to build a complete toolchain including
glibc that could be installed in a non-standard location and co-exist
with other system libraries.  I worked around..

> Inconsistency detected by ld.so: get-dynamic-info.h: 147: elf_get_dynamic_info:
> Assertion `info[DT_RPATH] == NULL' failed!

..this by patching glibc.

-- 
Alan Modra
Australia Development Lab, IBM


Re: POWER PC-relative addressing and new text relocations

2019-09-23 Thread Alan Modra
On Mon, Sep 23, 2019 at 11:14:12AM +0200, Florian Weimer wrote:
> * Alan Modra:
> 
> > On Mon, Sep 23, 2019 at 10:37:29AM +0200, Florian Weimer wrote:
> >> * Alan Modra:
> >> 
> >> > On Mon, Sep 23, 2019 at 09:42:52AM +0200, Florian Weimer wrote:
> >> > We've been discussing this inside IBM too.  The conclusion is that
> >> > only one of the new relocs makes any possible sense as a dynamic
> >> > reloc, R_PPC64_TPREL34, and that one only if you allow
> >> > -ftls-model=local-exec when building shared libraries and accept that
> >> > DF_STATIC_TLS shared libraries that can't be dlopen'd are OK.
> >> 
> >> Is this still a text relocation?
> >
> > Yes.  I should have mentioned that too.
> 
> Yuck.  Is this *really* necessary?

The idea was to allow lusers to do the same as they can on other
architectures, to minimise the number of bug reports saying "but I can
do this on x86".

Hmm, I just checked.
$ gcc -shared -fPIC -ftls-model=local-exec -o thread.so ~/src/tmp/thread.c
/usr/bin/ld: /tmp/ccoXMrxD.o: relocation R_X86_64_TPOFF32 against symbol `p' 
can not be used when making a shared object; recompile with -fPIC

So I'm not fussed if we drop the idea of supporting R_PPC64_TPREL34 as
a dynamic reloc.
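
A minimal sketch of a source file that exercises this case.  This is
an assumption: the original thread.c is not shown in the thread, only
the symbol name `p' comes from the error message above, and bump_p is
a hypothetical name.

```c
/* Hypothetical stand-in for thread.c.  */
__thread int p;

/* With -ftls-model=local-exec the compiler addresses p as a fixed
   offset from the thread pointer.  In a shared library that offset
   is not known at link time, hence the TPOFF/TPREL style relocation
   the linker complains about.  */
int
bump_p (void)
{
  return ++p;
}
```

Built as part of an ordinary executable this is fine; the complaint
only arises when the object is linked with -shared.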

-- 
Alan Modra
Australia Development Lab, IBM


Re: POWER PC-relative addressing and new text relocations

2019-09-23 Thread Alan Modra
On Mon, Sep 23, 2019 at 10:37:29AM +0200, Florian Weimer wrote:
> * Alan Modra:
> 
> > On Mon, Sep 23, 2019 at 09:42:52AM +0200, Florian Weimer wrote:
> >> At Cauldron, the question came up whether the dynamic loader needs to
> >> be taught about the new relocations for PC-relative addressing.
> >> 
> >> I think they would only matter if we supported PC-relative addressing
> >> *and* text relocations.  Is that really necessary?
> >> 
> >> These text relocations would not work reliably anyway because the
> >> maximum displacement is not large enough.  For example, with the
> >> current process layout, it's impossible to reach shared objects from
> >> the main program and vice versa.  And some systems might want to add
> >> additional randomization, so that shared objects are not mapped close
> >> together anymore.
> >
> > We've been discussing this inside IBM too.  The conclusion is that
> > only one of the new relocs makes any possible sense as a dynamic
> > reloc, R_PPC64_TPREL34, and that one only if you allow
> > -ftls-model=local-exec when building shared libraries and accept that
> > DF_STATIC_TLS shared libraries that can't be dlopen'd are OK.
> 
> Is this still a text relocation?

Yes.  I should have mentioned that too.

>  The displacement relative to the
> thread pointer is (usually) small, so I can see how this could work
> reliably.
> 
> What's the restriction on dlopen?  Wouldn't it be the same as regular
> initial-exec TLS memory, which also uses static TLS, but without a
> text relocation and an additional indirection to load the TLS offset
> from a place where a regular relocation has put it?

I thought you can't dlopen libraries with static TLS, except when the
amount of TLS storage needed fits within a certain limit, but it's a
while since I looked at glibc code in this area so things may have
changed.

-- 
Alan Modra
Australia Development Lab, IBM


Re: POWER PC-relative addressing and new text relocations

2019-09-23 Thread Alan Modra
On Mon, Sep 23, 2019 at 09:42:52AM +0200, Florian Weimer wrote:
> At Cauldron, the question came up whether the dynamic loader needs to
> be taught about the new relocations for PC-relative addressing.
> 
> I think they would only matter if we supported PC-relative addressing
> *and* text relocations.  Is that really necessary?
> 
> These text relocations would not work reliably anyway because the
> maximum displacement is not large enough.  For example, with the
> current process layout, it's impossible to reach shared objects from
> the main program and vice versa.  And some systems might want to add
> additional randomization, so that shared objects are not mapped close
> together anymore.

We've been discussing this inside IBM too.  The conclusion is that
only one of the new relocs makes any possible sense as a dynamic
reloc, R_PPC64_TPREL34, and that one only if you allow
-ftls-model=local-exec when building shared libraries and accept that
DF_STATIC_TLS shared libraries that can't be dlopen'd are OK.

See https://sourceware.org/ml/binutils/2019-09/msg00164.html, which
doesn't allow even R_PPC64_TPREL34.  I haven't put this patch on the
binutils 2.33 branch.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [PowerPC 64]r12 is not updated to GEP when control transferred from virtual thunk function .

2019-05-20 Thread Alan Modra
On Mon, May 20, 2019 at 03:39:50AM -0500, Segher Boessenkool wrote:
> But it means it needs to make a stub for every global entry point that
> is used?

Mostly.  Calls via function pointer don't (*), nor do you need stubs
when generating inline PLT calls.  I'll note that use of the global
entry point for direct calls is closely associated with needing a PLT
entry, and the stubs we're talking about here are similar to the code
other architectures put in their .plt section.

*) The exception is when a non-PIC executable initialises a function
pointer in read-only memory to a function defined outside the
executable.  This case requires a special stub in the executable to
serve as the address of the function.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [PowerPC 64]r12 is not updated to GEP when control transferred from virtual thunk function .

2019-05-20 Thread Alan Modra
On Mon, May 20, 2019 at 02:55:33AM -0500, Segher Boessenkool wrote:
> On Mon, May 20, 2019 at 04:19:54PM +0930, Alan Modra wrote:
> > On Thu, May 16, 2019 at 05:52:42PM -0500, Segher Boessenkool wrote:
> > > Hi Umesh,
> > > 
> > > On Thu, May 16, 2019 at 06:12:48PM +0530, Umesh Kalappa wrote:
> > > > We are very new to the Power ABI and are thinking of handling this case
> > > > in the loader: go through relocations like R_PPC64_REL24 and, when the
> > > > symbol has a localentry, compute the delta (GEP - LEP) and patch the
> > > > caller address to (sym.value - delta).
> > > 
> > > I wonder if you have found a bug in the compiler after all.  Most things
> > > are supposed to work without the linker/loader having to do special
> > > things; e.g. using the global entry point should always work, using the
> > > local entry point is just an optimisation.
> > 
> > That isn't true for direct calls.  If using the global entry point,
> > the linker must provide stub code to load up r12 with the global entry
> > address and modify the nop after the bl.  The linker must also adjust
> > calls using the local entry point; the call instruction (and
> > relocation) specify the function symbol not the function plus local
> > entry offset.
> > 
> > So I don't think there is any compiler bug here, just a broken kernel
> > module loader.  Incidentally, if thunks are broken then it's very
> > likely local function calls are broken too.
> 
> The ABI says
> 
> "When a linker causes control to transfer to a global entry point, it
> must insert a glue code sequence that loads r12 with the global
> entry-point address. Code at the global entry point can assume that
> register r12 points to the GEP."
> 
> But in the testcase the jump *already* was to the global entry point:

So?  We never add the local entry offset to the call assembly.
Compile this to assembly (without -fPIC) and note "bl print" in main.

#include <stdio.h>

void __attribute__ ((noclone, noinline))
print (const char *str)
{
  puts (str);
}

int
main ()
{
  print ("Hello");
  return 0;
}

Now if the thunk code produced a branch to a local label that *wasn't*
a function symbol, I'd agree that gcc was wrong.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [PowerPC 64]r12 is not updated to GEP when control transferred from virtual thunk function .

2019-05-19 Thread Alan Modra
On Thu, May 16, 2019 at 05:52:42PM -0500, Segher Boessenkool wrote:
> Hi Umesh,
> 
> On Thu, May 16, 2019 at 06:12:48PM +0530, Umesh Kalappa wrote:
> > We are very new to the Power ABI and are thinking of handling this case
> > in the loader: go through relocations like R_PPC64_REL24 and, when the
> > symbol has a localentry, compute the delta (GEP - LEP) and patch the
> > caller address to (sym.value - delta).
> 
> I wonder if you have found a bug in the compiler after all.  Most things
> are supposed to work without the linker/loader having to do special
> things; e.g. using the global entry point should always work, using the
> local entry point is just an optimisation.

That isn't true for direct calls.  If using the global entry point,
the linker must provide stub code to load up r12 with the global entry
address and modify the nop after the bl.  The linker must also adjust
calls using the local entry point; the call instruction (and
relocation) specify the function symbol not the function plus local
entry offset.

So I don't think there is any compiler bug here, just a broken kernel
module loader.  Incidentally, if thunks are broken then it's very
likely local function calls are broken too.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Question regarding constraint usage within inline asm

2019-02-20 Thread Alan Modra
On Wed, Feb 20, 2019 at 08:57:52PM -0600, Peter Bergner wrote:
> On 2/20/19 4:19 PM, Alan Modra wrote:
> > I forgot to say, gcc-6, gcc-7 and gcc-8 handle your original testcase
> > with the register asm just fine.
> 
> Yes, because they don't have my IRA and LRA patches that exposed this
> problem. I would say they were buggy for not complaining and silently
> spilling a hard register in the case where we used asm reg("...").

I don't follow your reasoning.  It seems to me that giving some
variable a register asm doesn't mean that the value of that variable
can't appear in some other register.  An obvious example is when
passing that variable to a function.

So why shouldn't a hard reg be reloaded in order to satisfy
incompatible constraints?

-- 
Alan Modra
Australia Development Lab, IBM


Re: Question regarding constraint usage within inline asm

2019-02-20 Thread Alan Modra
I forgot to say, gcc-6, gcc-7 and gcc-8 handle your original testcase
with the register asm just fine.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Question regarding constraint usage within inline asm

2019-02-20 Thread Alan Modra
On Wed, Feb 20, 2019 at 10:08:07AM -0600, Peter Bergner wrote:
> On 2/19/19 9:09 PM, Alan Modra wrote:
> > On Mon, Feb 18, 2019 at 01:13:31PM -0600, Peter Bergner wrote:
> >> long input;
> >> long
> >> bug (void)
> >> {
> >>   register long output asm ("r3");
> >>   asm ("blah %0, %1, %2" : "=&r" (output) : "r" (input), "0" (input));
> >>   return output;
> >> }
> >>
> >> I know an input operand can have a matching constraint associated with
> >> an early clobber operand, as there seems to be code that explicitly
> >> mentions this scenario.  In this case, the user has to manually ensure
> >> that the input operand is not clobbered by the early clobber operand.
> >> In the case that the input operand uses an "r" constraint, we just
> >> ensure that the early clobber operand and the input operand are assigned
> >> different registers.  My question is, what about the case above where
> >> we have the same variable being used for two different inputs with
> >> constraints that seem to be incompatible?
> > 
> > Without the asm("r3") gcc will provide your "blah" instruction with
> > one register for %0 and %2, and another register for %1.  Both
> > registers will be initialised with the value of "input".
> 
> That's not what I'm seeing.  I see one pseudo (123) used for the output
> operand and one pseudo (121) used for both input operands.  Like so:

I meant by the time you get to assembly.

blah 3, 9, 3

> That said, talking with Segher and Uli offline, they both think the
> inline asm usage in the test case should be legal

Good, it seems we are in agreement.  Incidentally, the single pseudo
for the inputs happens even for testcases like

long input;
long
bug (void)
{
  register long output /* asm ("r3") */;
  asm ("blah %0, %1, %2" : "=r" (output) : "wi" (input), "0" (input));
  return output;
}

-- 
Alan Modra
Australia Development Lab, IBM


Re: Question regarding constraint usage within inline asm

2019-02-19 Thread Alan Modra
On Mon, Feb 18, 2019 at 01:13:31PM -0600, Peter Bergner wrote:
> I have a question about constraint usage in inline asm when we have
> an early clobber output operand.  The test case is from PR89313 and
> looks like the code below (I'm using "r3" for the reg on ppc, but
> you could also use "rax" on x86_64, etc.).
> 
> long input;
> long
> bug (void)
> {
>   register long output asm ("r3");
>   asm ("blah %0, %1, %2" : "=&r" (output) : "r" (input), "0" (input));
>   return output;
> }
> 
> I know an input operand can have a matching constraint associated with
> an early clobber operand, as there seems to be code that explicitly
> mentions this scenario.  In this case, the user has to manually ensure
> that the input operand is not clobbered by the early clobber operand.
> In the case that the input operand uses an "r" constraint, we just
> ensure that the early clobber operand and the input operand are assigned
> different registers.  My question is, what about the case above where
> we have the same variable being used for two different inputs with
> constraints that seem to be incompatible?

Without the asm("r3") gcc will provide your "blah" instruction with
one register for %0 and %2, and another register for %1.  Both
registers will be initialised with the value of "input".

>  Clearly, we cannot assign
> a register to the "input" variable that is both the same and different
> to the register that is assigned to "output".

No, you certainly can do that.  I think you have found a bug in lra.
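
A portable sketch of the same operand pattern, without the register
asm.  The empty asm template and the name tied_copy are assumptions
made so that it builds on any GCC target; the testcase above uses a
placeholder "blah" instruction instead.

```c
long input = 42;

/* %0 is an early-clobber output, %1 is a plain register input and
   %2 is tied to %0 by the matching constraint "0".  The compiler is
   expected to load the value of input into two registers: one shared
   by %0 and %2, and a different one for %1.  */
long
tied_copy (void)
{
  long output;

  /* Empty template: the register shared by %0 and %2 is left
     untouched, so output ends up holding the value of input.  */
  __asm__ ("" : "=&r" (output) : "r" (input), "0" (input));
  return output;
}
```

The same variable satisfies both input constraints simply by being
copied into two registers, which is why the constraints are not
really incompatible.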

-- 
Alan Modra
Australia Development Lab, IBM


Re: RS6000 emitting sign extention for unsigned type

2019-01-18 Thread Alan Modra
On Tue, Jan 15, 2019 at 04:48:27PM +0530, kamlesh kumar wrote:
> Hi all,
> 
> Analysed it further and found out that
> function ' rs6000_promote_function_mode ' (rs6000.c) needs modification.
> """
> static machine_mode
> rs6000_promote_function_mode (const_tree type ATTRIBUTE_UNUSED,
>   machine_mode mode,
>   int *punsignedp ATTRIBUTE_UNUSED,
>   const_tree, int)
> {
>   PROMOTE_MODE (mode, *punsignedp, type);
>   return mode;
> }
> """
> Here, this function is promoting the mode but
> it is not even touching 'punsignedp', which is always initialized to zero
> by default.
> So in all cases 'punsignedp' remains zero even for an unsigned type,
> which causes the sign extension to happen even for an unsigned type.
> 
> Is there any way to set 'punsignedp' appropriately here?

No.  The call to promote_function_mode in emit_library_call_value_1
does not pass type info (because it isn't available for libcalls).

-- 
Alan Modra
Australia Development Lab, IBM


Re: strlen optimizations based on whether stpcpy is declared?

2017-10-02 Thread Alan Modra
On Mon, Oct 02, 2017 at 09:11:53AM +0200, Jakub Jelinek wrote:
> On Sun, Oct 01, 2017 at 03:52:39PM -0600, Martin Sebor wrote:
> > While debugging some of my tests I noticed unexpected differences
> > between the results depending on whether or not the stpcpy function
> > is declared.  It turns out that the differences are caused by
> > the handle_builtin_strcpy function in tree-ssa-strlen.c testing
> > for stpcpy having been declared:
> > 
> >   if (srclen == NULL_TREE)
> > switch (bcode)
> >   {
> >   case BUILT_IN_STRCPY:
> >   case BUILT_IN_STRCPY_CHK:
> >   case BUILT_IN_STRCPY_CHKP:
> >   case BUILT_IN_STRCPY_CHK_CHKP:
> > if (lhs != NULL_TREE || !builtin_decl_implicit_p (BUILT_IN_STPCPY))
> >   return;
> > 
> > and taking different paths depending on whether or not the test
> > succeeds.
> > 
> > As far as I can see, the tests have been there since the pass was
> > added, but I don't understand from the comments in the file what
> > their purpose is or why optimization decisions involving one set
> > of functions (I think strcpy and strcat at a minimum) are based
> > on whether another function has been declared or not.
> > 
> > Can you explain what they're for?
> 
> The reason is that stpcpy is not a standard C function, so in non-POSIX
> environments one could have stpcpy with completely unrelated prototype
> used for something else.  In such case we don't want to introduce stpcpy
> into a TU that didn't have such a call.  So, we use the existence of
> a matching prototype as a sign that stpcpy can be synthesized.

Why is the test for stpcpy being declared done for the strcpy cases
rather than the stpcpy cases?
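
For readers following along, the transformation at issue is roughly
the following.  The function names are made up, and this shows only
the source-level equivalence, not what tree-ssa-strlen.c literally
emits.

```c
#define _POSIX_C_SOURCE 200809L	/* for the stpcpy prototype */
#include <string.h>

/* What the user wrote: copy, then measure the copy.  */
size_t
copy_then_strlen (char *dst, const char *src)
{
  strcpy (dst, src);
  return strlen (dst);
}

/* What could be synthesized when a matching stpcpy prototype is
   visible: stpcpy returns a pointer to the terminating NUL, so the
   length falls out of a pointer subtraction.  */
size_t
copy_with_stpcpy (char *dst, const char *src)
{
  return (size_t) (stpcpy (dst, src) - dst);
}
```

Note that the second form introduces a call to stpcpy into code that
only called strcpy, which is the situation the prototype check is
guarding.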

-- 
Alan Modra
Australia Development Lab, IBM


Re: Optimization breaks inline asm code w/ptrs

2017-08-17 Thread Alan Modra
On Thu, Aug 17, 2017 at 04:27:12PM +0200, Michael Matz wrote:
> Hi,
> 
> On Mon, 14 Aug 2017, Alan Modra wrote:
> 
> > I've opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81845 to track 
> > the lack of documentation.
> 
> You mean like in this paragraph discussing memory clobbers and uses in 
> extended asms that we have since 2004? :

The paragraph you show below (from gcc-4 sources) disappeared with git
commit 3aabc45f2.  We currently have this:

--
Flushing registers to memory has performance implications and may be an issue 
for time-sensitive code.  You can use a trick to avoid this if the size of 
the memory being accessed is known at compile time. For example, if accessing 
ten bytes of a string, use a memory input like: 

@code{@{"m"( (@{ struct @{ char x[10]; @} *p = (void *)ptr ; *p; @}) )@}}.
--

So, no example even of the simplest "m" (*y) type memory input.  This
lack was part of the reason I submitted
https://gcc.gnu.org/ml/gcc-patches/2017-03/msg01562.html
which died in the review process, mostly due to the example being
rather large, and partly I fear, due to not being x86.  I didn't push
the patch for a number of reasons.  Then later realized that the
constraints I was using for arrays, while they work for OpenBLAS, were
not strict enough.  "m" (*y) for an array y only makes the asm depend
on y[0].

I have a couple of documentation patches prepared, and have been
poking around in the source to verify that what I'm proposing for
indeterminate length arrays, "m" (*(const T (*)[]) ptr) and
"=m" (*(T (*)[]) ptr) is reasonable.  One obvious problem is that the
cast expression isn't a proper lvalue, but I'm encouraged to find
comments in the source complaining that such things need to be
tolerated in asm.  :)
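
A sketch of how such an operand might look in use.  It assumes GCC;
the empty asm template, the int element type and the name touch_array
are illustrative choices, not taken from the documentation patches
mentioned above.

```c
/* Tell the compiler that the asm reads the whole array starting at
   ptr, not just ptr[0], without resorting to a "memory" clobber.
   The cast turns the pointer to int into a pointer to an array of
   int of indeterminate length before dereferencing it.  */
int
touch_array (const int *ptr)
{
  int seen = 0;

  /* The empty template does nothing; only the operand list matters
     here.  A real asm would read the array through %1.  */
  __asm__ ("" : "+r" (seen) : "r" (ptr), "m" (*(const int (*)[]) ptr));
  return seen;
}
```

With a plain "m" (*ptr) the asm would only depend on ptr[0], so
stores to later elements could be moved past it or dropped.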

> 
> 
>  If your assembler instructions access memory in an unpredictable
> fashion, add `memory' to the list of clobbered registers.  This will
> cause GCC to not keep memory values cached in registers across the
> assembler instruction and not optimize stores or loads to that memory.
> You will also want to add the `volatile' keyword if the memory affected
> is not listed in the inputs or outputs of the `asm', as the `memory'
> clobber does not count as a side-effect of the `asm'.  If you know how
> large the accessed memory is, you can add it as input or output but if
> this is not known, you should add `memory'.  As an example, if you
> access ten bytes of a string, you can use a memory input like:
> 
>  {"m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )}.
> 
>  Note that in the following example the memory input is necessary,
> otherwise GCC might optimize the store to `x' away:
>  int foo ()
>  {
>    int x = 42;
>    int *y = &x;
>    int result;
>    asm ("magic stuff accessing an 'int' pointed to by '%1'"
>         : "=&d" (result) : "a" (y), "m" (*y));
>    return result;
>  }
> 
> 
> 
> Ciao,
> Michael.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Optimization breaks inline asm code w/ptrs

2017-08-16 Thread Alan Modra
On Tue, Aug 15, 2017 at 03:09:15PM +0800, Liu Hao wrote:
> On 2017/8/14 20:41, Alan Modra wrote:
> >On Sun, Aug 13, 2017 at 10:25:14PM +0930, Alan Modra wrote:
> >>On Sun, Aug 13, 2017 at 03:35:15AM -0700, David Wohlferd wrote:
> >>>Using "m"(*pStr) as an (unused) input parameter has no effect.
> >>
> >>Use "m" (*(const void *)pStr) and ignore the warning, or use
> >>"m" (*(const struct {char a; char x[];} *) pStr).
> >
> >or even better "m" (*(const char (*)[]) pStr).
> >
> 
> This should work in the sense that GCC now thinks bytes adjacent to `pStr`
> are subject to modification by the asm statement.
> 
> But I just tried GCC 7.2 and it seems that even if such a "+m" constraint is
> the only output parameter of an asm statement and there is no `volatile` or
> the "memory" clobber, GCC optimizer will not optimize the asm statement
> away, which is the case if a plain `"+m"(*pStr)` is used.

I wasn't advocating a "+m" constraint in this case.  Obviously it's
wrong to say scasb modifies memory.  That aside though, I'm mainly
interested in gcc-8 and see "+m"(*p) preventing dead code removal,
even when all outputs of the asm are unused (including of course the
array pointed at by p).  Probably a bug.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Optimization breaks inline asm code w/ptrs

2017-08-14 Thread Alan Modra
On Sun, Aug 13, 2017 at 10:25:14PM +0930, Alan Modra wrote:
> On Sun, Aug 13, 2017 at 03:35:15AM -0700, David Wohlferd wrote:
> > Using "m"(*pStr) as an (unused) input parameter has no effect.
> 
> Use "m" (*(const void *)pStr) and ignore the warning, or use
> "m" (*(const struct {char a; char x[];} *) pStr).

or even better "m" (*(const char (*)[]) pStr).

> The issue is one of letting gcc know what memory is accessed by the
> asm, if you don't want to use a "memory" clobber.  And there are very
> good reasons to avoid clobbering all memory.
> 
> "m"(*pStr) ought to work IMO, but apparently just tells gcc you are
> only interested in the first character.  Of course that is exactly
> what *pStr is, but in this context it would be nicer if it meant the
> entire array.

I take that back.  The relatively simple cast to differentiate a
pointer to a char from a pointer to an indeterminate length char array
makes it quite unnecessary for "m"(*pStr) to be treated as an array
reference.

I've opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81845 to
track the lack of documentation.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Optimization breaks inline asm code w/ptrs

2017-08-13 Thread Alan Modra
On Sun, Aug 13, 2017 at 03:35:15AM -0700, David Wohlferd wrote:
> Using "m"(*pStr) as an (unused) input parameter has no effect.

Use "m" (*(const void *)pStr) and ignore the warning, or use
"m" (*(const struct {char a; char x[];} *) pStr).

The issue is one of letting gcc know what memory is accessed by the
asm, if you don't want to use a "memory" clobber.  And there are very
good reasons to avoid clobbering all memory.

"m"(*pStr) ought to work IMO, but apparently just tells gcc you are
only interested in the first character.  Of course that is exactly
what *pStr is, but in this context it would be nicer if it meant the
entire array.

-- 
Alan Modra
Australia Development Lab, IBM


Re: PowerPC -many

2017-02-14 Thread Alan Modra
On Tue, Feb 14, 2017 at 06:38:40PM -0600, Segher Boessenkool wrote:
> On Wed, Feb 15, 2017 at 10:36:02AM +1030, Alan Modra wrote:
> > Since we've been talking about obsoleting cpu support, how about
> > getting rid of -many in ASM_CPU_SPEC for gcc-8?
> 
> Sure, but that doesn't need advance warning to the users, does it?

Probably not.

> Things worked before and stay working, nothing user-visible?

Except for bad user asm() that ought to be true.  Oh, and gcc bugs
like emitting power9 insns when -mcpu=power8.  You'd have some chance
that the assembler would complain rather than getting sigill at
run-time.

-- 
Alan Modra
Australia Development Lab, IBM


PowerPC -many

2017-02-14 Thread Alan Modra
Since we've been talking about obsoleting cpu support, how about
getting rid of -many in ASM_CPU_SPEC for gcc-8?

It's a horrible hack of mine to work around gcc -mcpu option handling
bugs which I think have been fixed, and to silence complaints from gas
about asm() written for multiple cpus (with presumably run-time
selection of which block of code gets executed depending on cpu).

It used to be just a linux hack, but I see David uses it in aix61.h
and aix71.h too?

-- 
Alan Modra
Australia Development Lab, IBM


Re: GNU indirect functions vs. symbol visibility

2016-08-25 Thread Alan Modra
On Thu, Aug 25, 2016 at 01:36:53PM +0200, Florian Weimer wrote:
> * Alan Modra:
> 
> > glibc people: As the main user of ifuncs, how do you feel about not
> > declaring functions hidden that are implemented in glibc by ifuncs?
> 
> We have run into this before, I think:
> 
>   <https://sourceware.org/ml/libc-alpha/2016-07/msg00089.html>

Yes, this is exactly the same problem, a hidden visibility prototype
with an ifunc definition.  Don't add the visibility attribute to the
prototype and the problem will no longer occur.

Also note that adding hidden visibility to a prototype that has an
ifunc definition in glibc gives no benefit on targets that can handle
this situation.

The difficulty of course is that where glibc does not provide an ifunc
implementation you *do* want the hidden visibility attribute, and
whether or not ifuncs are used varies from target to target.

> > It's fine to make them hidden via a version script, or even define
> > them as hidden (which requires just the rs6000_elf_encode_section_info
> > part of my gcc patch to make ppc32 behave).
> 
> If it doesn't work, we'd certainly prefer an early diagnostic.

Right.  https://sourceware.org/bugzilla/show_bug.cgi?id=20515 opened.

-- 
Alan Modra
Australia Development Lab, IBM


GNU indirect functions vs. symbol visibility

2016-08-24 Thread Alan Modra
Discussion started here:
https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01678.html

On Wed, Aug 24, 2016 at 08:51:16PM +0300, Alexander Monakov wrote:
> On Wed, 24 Aug 2016, Alan Modra wrote:
> > Given a hidden visibility function declaration, the toolchain can say
> > that the function is local to the module.  This generally means that a
> > call to the function can be direct, ie. doesn't need to go via the PLT
> > even in a shared library.  However, ifunc breaks this promise.  GNU
> > indirect functions may resolve non-locally, and are implemented by
> > always using a PLT call.
> > 
> > This causes trouble for targets like ppc32 where the -msecure-plt PIC
> > PLT call stub needs a valid GOT pointer.  Any call that potentially
> > might be to an ifunc therefore requires a GOT pointer, and can't be a
> > sibling call (because the GOT pointer on ppc32 is a caller saved reg).
> 
> The same issue exists on 32-bit x86: PLT calls require that %ebx holds the
> address of GOT (and the sibcall issue arises as well).  I've just confirmed
> using a simple testcase that the scenario you describe leads to a runtime
> error on i386, and even LD_BIND_NOW=1 doesn't help, as it doesn't trigger
> early resolution of ifuncs.

I'm happy to see that ppc32 isn't alone.  ;-)

> > So unless we require that all ifuncs are declared as ifunc,
> 
> (note, that would be impossible with today's GCC because the ifunc attribute
> requires designating the resolver, and the resolver cannot be extern -- so
> ultimately you cannot declare an extern-ifunc symbol)
> 
> > it seems that ppc32 can't assume extern or weak functions are local.
> 
> It doesn't seem nice to penalize all normal calls due to this issue.

I whole-heartedly agree.

> I think a
> solution without this cost is possible: have ld synthesize a forwarder
> function when it sees a non-plt call to an ifunc symbol. The forwarder can
> push the GOT register, load the GOT address, call the ifunc symbol, pop the
> GOT register and return. Does this sound right?

I'd considered this idea too.  It should work, but isn't ideal.  The
resulting code will be slower than if the ifuncs were simply not
declared hidden.  The idea also isn't quite as simple to implement as
it might seem, since frame unwinding must work through any such stub,
and gdb probably would need to know about them too.

I prefer to simply make ld error on seeing calls to ifuncs where it
detects that such a stub would be needed.  ppc32 GNU ld should do that
reliably as of git commit 888a7fc3.

glibc people: As the main user of ifuncs, how do you feel about not
declaring functions hidden that are implemented in glibc by ifuncs?
It's fine to make them hidden via a version script, or even define
them as hidden (which requires just the rs6000_elf_encode_section_info
part of my gcc patch to make ppc32 behave).

-- 
Alan Modra
Australia Development Lab, IBM


Re: Preventing preemption of 'protected' symbols in GNU ld 2.26 [aka should we revert the fix for 65248]

2016-04-25 Thread Alan Modra
On Mon, Apr 25, 2016 at 11:35:46AM -0600, Jeff Law wrote:
> No, we revert to the gcc-4.9 behavior WRT protected visibility and ensure
> that we're getting a proper diagnostic from the linker.
> 
> That direction is consistent with the intent of protected visibility, fixes
> the problem with preemption of protected symbols and gives us a diagnostic
> for the case that can't be reasonably handled.

I agree that this is the correct solution.  Unfortunately there is a
complication.  PIE + shared lib using protected visibility worked fine
with gcc-4.9, but since then code generated by gcc for PIEs on x86_64
has been optimized to rely on the horrible old hack of .dynbss and
copy relocations.  That means you'll have regressions from 4.9 if just
reverting the protected visibility change..

The PIE optimization will need reverting too, and I imagine you'll see
some resistance to that idea due to the fact that it delivers quite a
nice performance improvement for PIEs.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Preventing preemption of 'protected' symbols in GNU ld 2.26 [aka should we revert the fix for 65248]

2016-04-19 Thread Alan Modra
On Tue, Apr 19, 2016 at 10:20:23AM +0200, Richard Biener wrote:
> On Tue, Apr 19, 2016 at 7:08 AM, Alan Modra  wrote:
> > On Mon, Apr 18, 2016 at 07:59:50AM -0700, H.J. Lu wrote:
> >> On Mon, Apr 18, 2016 at 7:49 AM, Alan Modra  wrote:
> >> > On Mon, Apr 18, 2016 at 11:01:48AM +0200, Richard Biener wrote:
> >> >> To summarize: there is currently no testcase for a wrong-code issue
> >> >> because there is no wrong-code issue.
> >
> > I've added a testcase at
> > https://sourceware.org/bugzilla/show_bug.cgi?id=19965#c3
> > that shows the address problem (&x != x) with older gcc *or* older
> > glibc, and shows the program behaviour problem with current
> > binutils+gcc+glibc.
> 
> Thanks.
> 
> So with all this it sounds that current protected visibility is just broken
> and we should forgo with it, making it equal to default visibility?

Well, using protected visibility variables makes no sense in
executables.  They really are only useful in shared libraries, but
have been of limited use on architectures like x86 for a long time
due to non-PIC executables copying shared library variables into
.dynbss.  The concepts of copying variables into .dynbss, and
protected visibility, are fundamentally incompatible.

HJ's changes addressed the program level semantic issues, but in the
process lost the main reason to use protected visibility variables,
which is to tell a compiler that a global variable cannot be preempted
(and therefore can use faster code for access, typically pc or GOT
pointer relative rather than GOT indirect.)  So IMO, "of limited use"
has now become "not much use at all" on x86_64 and other architectures
that have blindly followed suit.

> At least I couldn't decipher a solution that solves all of the issues
> with protected visibility apart from trying to error at link-time
> (or runtime?) for the cases that are tricky (impossible?) to solve.

I described the problem and solutions in
https://sourceware.org/ml/binutils/2016-03/msg00431.html.  A followup
by Cary pointed out that one of the solutions, emitting text dynamic
relocations, won't work on some architectures (of which x86_64 is
one).

> glibc uses "protected visibility" via its using of local aliases, correct?

Yes, glibc defines a hidden visibility symbol for internal use, with
an exported alias.

> But it doesn't use anything like that for data symbols?

I believe it does.  See occurrences of libc_hidden_data_def.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Preventing preemption of 'protected' symbols in GNU ld 2.26 [aka should we revert the fix for 65248]

2016-04-18 Thread Alan Modra
On Mon, Apr 18, 2016 at 07:59:50AM -0700, H.J. Lu wrote:
> On Mon, Apr 18, 2016 at 7:49 AM, Alan Modra  wrote:
> > On Mon, Apr 18, 2016 at 11:01:48AM +0200, Richard Biener wrote:
> >> To summarize: there is currently no testcase for a wrong-code issue
> >> because there is no wrong-code issue.

I've added a testcase at
https://sourceware.org/bugzilla/show_bug.cgi?id=19965#c3
that shows the address problem (&x != x) with older gcc *or* older
glibc, and shows the program behaviour problem with current
binutils+gcc+glibc.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Preventing preemption of 'protected' symbols in GNU ld 2.26 [aka should we revert the fix for 65248]

2016-04-18 Thread Alan Modra
On Mon, Apr 18, 2016 at 11:01:48AM +0200, Richard Biener wrote:
> To summarize: there is currently no testcase for a wrong-code issue
> because there is no wrong-code issue.

That depends entirely on how far you are willing to bend the ELF gABI.

Any testcase that takes the address of a protected visibility variable
defined in a shared library now can get the wrong answer, since you
can argue that any address outside the shared library is wrong
according to the gABI.

I expect you can also write a testcase using a const protected var in
a shared library that ought to segfault on writing to the var from
code within the shared library, that now merrily writes to a .dynbss
copy.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Some aliasing questions

2016-04-11 Thread Alan Modra
On Fri, Apr 08, 2016 at 01:41:05PM -0700, Richard Henderson wrote:
> On 04/08/2016 11:10 AM, Bill Schmidt wrote:
> > The first is an issue with TOC-relative addresses on PowerPC.  These are
> > symbolic addresses that are to be loaded from a fixed slot in the table
> > of contents, as addressed by the TOC pointer (r2).  In the RTL phases
> > prior to register allocation, these are described in an UNSPEC that
> > looks like this for an example store:
> > 
> > (set (mem/c:DI (unspec:DI [
> >(symbol_ref:DI ("*.LANCHOR0") [flags 0x182])
> >(reg:DI 2 2)
> >   ] UNSPEC_TOCREL) [1 svul+0 S8 A128])
> >  (reg:DI 178))
> > 
> > The UNSPEC helps keep track of the r2 reference until this is split into
> > two or more insns depending on the memory model.
> 
> 
> That's why Alpha uses LO_SUM for pre-reload tracking of such things.
> 
> Even though that's a bit of a liberty, since there's no HIGH to go along with
> the LO_SUM.  But at least it allows the middle-end to continue to find the
> symbol.

I wish I'd been made aware of the problem with alias analysis when I
invented this scheme for -mcmodel=medium code..

Back in gcc-4.3 days, when small-model code was the only option, we
used to generate
  (mem (plus (reg 2)
             (const (minus (symbol_ref)
                           (symbol_ref toc_base)))))
for a toc mem reference, which accurately reflects the addressing.

The problem is that when splitting this to a high/lo_sum you lose the
r2 reference in the lo_sum, and that allows r2 to die prematurely,
breaking an important linker code editing optimisation.

Hmm.  Maybe if we rewrote the mem to
  (mem (plus (symbol_ref toc_base)
             (const (minus (symbol_ref)
                           (reg 2)))))
It might look odd, but is no lie.  r2 is equal to toc_base.  Or
perhaps we could lie a little and simply omit the plus and toc_base
reference?

Either way, when we split to
  (set (reg tmp) (high (const (minus (symbol_ref) (reg 2)))))
  .. (mem (lo_sum (reg tmp) (const (minus (symbol_ref) (reg 2)))))
both high and lo_sum reference r2 and the linker could happily replace
rtmp in the lo_sum insn with r2 when the high address is known to be
zero.

Bill, do you have test cases for the alias problem?  Is this something
that needs fixing for gcc-6?

-- 
Alan Modra
Australia Development Lab, IBM


Re: September 2015 GNU Toolchain Update

2015-09-27 Thread Alan Modra
On Fri, Sep 25, 2015 at 01:33:34PM +0100, Nick Clifton wrote:
>   * The new PowerPC64 specific linker command line option
> --no-save-restore-funcs  tells the linker not to provide the
> out-of-line register save and restore functions used by -Os compiled
> code.  The default is to provide any such referenced function for
> a normal final link, but not do so for a relocatable link.

Actually, --save-restore-funcs and --no-save-restore-funcs have been
around since 2014-02.  The recent new PowerPC64 option is
--tls-get-addr-optimize, a complement to --no-tls-get-addr-optimize.

-- 
Alan Modra
Australia Development Lab, IBM


Re: ppc eabi float arguments

2015-09-23 Thread Alan Modra
On Wed, Sep 23, 2015 at 07:09:43PM -0400, Michael Meissner wrote:
> On Tue, Sep 22, 2015 at 01:43:55PM -0400, David Edelsohn wrote:
> > On Tue, Sep 22, 2015 at 1:39 PM, Bernhard Schommer
> >  wrote:
> > > Hi,
> > >
> > > I've been working with the windriver Diab c compiler for 32bit ppc and
> > > encountered an incompatibility with the eabi version of the gcc 4.83. When
> > > calling functions with more than 8 float arguments the gcc stores the 9th
> > > float argument (and so on) as a float whereas the diab compiler stores
> > > the argument as a double using 8 bytes.
> > >
> > > I checked the EABI document and it seems to support the way the diab
> > > compiler passes the arguments:
> > >
> > > "Arguments not otherwise handled above [i.e. not passed in registers]
> > > are passed in the parameter words of the caller's stack frame. [...]
> > > float, long long (where implemented), and double arguments are
> > > considered to have 8-byte size and alignment, *with float arguments
> > > converted to double representation*. "
> > >
> > > Does anyone know the reason why the gcc passes the argument as single
> > > float?
> > 
> > Hi, Bernhard
> > 
> > First, are you certain that you have the final version of the 32 bit
> > PPC eABI? There were a few versions in circulation.
> > 
> > Mike may remember the history of this.
> 
> Well I worked on it around 1980 or so. I don't remember the details (nor do I
> have the original manuals I was working from).  From this distance, it sure
> looks like a bug, but I'm not sure whether it should be fixed or
> grand-fathered in (and updating the stdargs.h support, if this is the
> official calling sequence).

I recall this question coming up before, and we decided to leave gcc
as is so that new ppc32 gcc code stayed compatible with old ppc32 gcc
code.  Also, even if we were starting with a clean slate, we might
want to pass floats without promoting to double:  Stack frames are
potentially smaller.  Against that is the fact that we promote to
double when calling an unprototyped function, so you'll run into
trouble trying to define a function with more than eight float args if
writing K&R code.  Old programmers tend to know about such issues, and
don't use float function parameters in K&R code.  :)
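
To put a number on "stack frames are potentially smaller", here is a toy calculation in C.  It is a hypothetical model, not GCC code: spilled_float_bytes and FLOAT_ARGS_NOT_ON_STACK are invented for the example, and it only uses what the thread states, namely that the difference starts at the ninth float argument and that the slot is either a 4-byte float (gcc) or an 8-byte double (the quoted EABI text).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not GCC code.  Per the thread, the first eight
   float arguments are unaffected; the ninth and later go on the stack.  */
enum { FLOAT_ARGS_NOT_ON_STACK = 8 };

/* Bytes of caller stack used by the float arguments that spill, for a
   function taking NFLOATS float parameters and nothing else.
   SLOT_SIZE is 4 for gcc's behaviour (stored as float) or 8 for the
   behaviour quoted from the EABI document (stored as double).  */
static size_t
spilled_float_bytes (unsigned nfloats, size_t slot_size)
{
  assert (slot_size == 4 || slot_size == 8);
  if (nfloats <= FLOAT_ARGS_NOT_ON_STACK)
    return 0;
  return (nfloats - FLOAT_ARGS_NOT_ON_STACK) * slot_size;
}
```

With ten float parameters the two conventions differ by 8 bytes, and caller and callee disagree about where the tenth argument lives, which is the incompatibility Bernhard ran into.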

Incidentally, there are other rather more nasty parameter passing
problems with ppc32, ones I would have liked to fix.  For instance,
"complex double" is passed in 4 gprs.

-- 
Alan Modra
Australia Development Lab, IBM


Re: ppc eabi float arguments

2015-09-22 Thread Alan Modra
On Tue, Sep 22, 2015 at 07:39:43PM +0200, Bernhard Schommer wrote:
> Does anyone know the reason why the gcc passes the argument as single float?

That's how the first powerpc gcc implementation behaved.  Once gcc
compiled code is out in the field, you need to ask everyone to
recompile their code in order to fix an ABI problem.  That may be more
disruptive than just leaving gcc incompatible with other compilers.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Adding static-PIE support to binutils

2015-08-18 Thread Alan Modra
On Tue, Aug 18, 2015 at 08:58:43PM -0400, Rich Felker wrote:
> I've updated the patch to cover the changes needed for all the
> elf??-*.c target files (lots of code duplication already there), skip
> the clearing of command_line.interpreter, and based it on current git
> master with your output_type changes.

This is OK to commit with a suitable ChangeLog.  I think a separate ld
option is best too, because historically -static and its aliases
-Bstatic, -dn, -non_shared really are about what type of libraries are
accepted rather than choosing linker output type.

-- 
Alan Modra
Australia Development Lab, IBM


Re: CFI directives and dynamic stack alignment

2015-08-17 Thread Alan Modra
On Mon, Aug 17, 2015 at 10:38:22AM -0700, Steve Ellcey wrote:
> On Tue, 2015-08-11 at 10:05 +0930, Alan Modra wrote:
> 
> > > The 'and' instruction is where the stack gets aligned and if I remove that
> > > one instruction, everything works.  I think I need to put out some new CFI
> > > pseudo-ops to handle this but I am not sure what they should be.  I am
> > > just not very familiar with the CFI directives.
> > 
> > I don't speak mips assembly very well, but it looks to me that you
> > have more than just CFI problems.  How do you restore sp on return
> > from the function, assuming sp wasn't 16-byte aligned to begin with?
> > Past that "and $sp,$sp,$3" you don't have any means of calculating
> > the original value of sp!  (Which of course is why you also can't find
> > a way of representing the frame address.)
> 
> I have code in expand_prologue that copies the incoming stack pointer to
> a temporary hard register and then I have code to the entry_block to
> copy that register into a virtual register.  In the exit block that
> virtual register is copied back to a temporary hard register and
> expand_epilogue copies it back to $sp to restore the stack pointer.

OK, then you need to emit a .cfi directive to say the frame top is
given by the temp hard reg sometime after that assignment and before
sp is aligned in the prologue, and another .cfi directive when copying
to the pseudo.  It's a while since I looked at the CFI code in gcc,
but arranging this might be as simple as setting RTX_FRAME_RELATED_P
on the insns involved.

If -fasynchronous-unwind-tables is in effect, then you'll also need to
track the frame in the epilogue.

> This function (fn2) ends with a call to abort, which is noreturn, so the
> optimizer sees that the epilogue is dead code and GCC determines that
> there is no need to save the old stack pointer since it will never get
> restored.   I guess I need to tell GCC to save the stack pointer in
> expand_prologue even if it never sees a use for it.  I guess I need to
> make the temporary register where I save $sp volatile or do something
> else so that the assignment (and its associated .cfi) is not deleted by
> the optimizer.

Ah, I see.  Yes, the temp and pseudo are not really dead if they are
needed for unwinding.

-- 
Alan Modra
Australia Development Lab, IBM


Re: CFI directives and dynamic stack alignment

2015-08-10 Thread Alan Modra
On Mon, Aug 03, 2015 at 02:48:09PM -0700, Steve Ellcey wrote:
> When I generate code to dynamically align the stack my code looks like
> this:
> 
> fn2:
>   .frame  $fp,32,$31      # vars= 0, regs= 2/0, args= 16, gp= 8
>   .mask   0xc0000000,-4
>   .fmask  0x00000000,0
>   .set    noreorder
>   .set    nomacro
>   lui     $2,%hi(null)
>   li      $3,-16          # 0xfffffffffffffff0
>   lw      $2,%lo(null)($2)
>   and     $sp,$sp,$3
>   addiu   $sp,$sp,-32
>   .cfi_def_cfa_offset 32
>   sw      $fp,24($sp)
>   .cfi_offset 30, -8
>   move    $fp,$sp
>   .cfi_def_cfa_register 30
>   sw      $31,28($sp)
>   .cfi_offset 31, -4
>   jal     abort
>   sb      $0,0($2)
> 
> The 'and' instruction is where the stack gets aligned and if I remove that
> one instruction, everything works.  I think I need to put out some new CFI
> pseudo-ops to handle this but I am not sure what they should be.  I am just
> not very familiar with the CFI directives.

I don't speak mips assembly very well, but it looks to me that you
have more than just CFI problems.  How do you restore sp on return
from the function, assuming sp wasn't 16-byte aligned to begin with?
Past that "and $sp,$sp,$3" you don't have any means of calculating
the original value of sp!  (Which of course is why you also can't find
a way of representing the frame address.)
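
The point about losing the original sp can be checked with plain integer arithmetic.  The C sketch below is only a model: align_down_16, struct frame_model and enter_frame are invented names, and the addresses in the note after it are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Round SP down to a 16-byte boundary, as "and $sp,$sp,$3" does when
   $3 holds -16.  */
static uintptr_t
align_down_16 (uintptr_t sp)
{
  return sp & ~(uintptr_t) 15;
}

/* Hypothetical prologue model: keep the incoming value somewhere so an
   epilogue (or an unwinder) can get back to it.  */
struct frame_model
{
  uintptr_t incoming_sp;  /* must be saved before aligning */
  uintptr_t aligned_sp;   /* what $sp holds afterwards */
};

static struct frame_model
enter_frame (uintptr_t sp)
{
  struct frame_model f = { sp, align_down_16 (sp) };
  assert (f.aligned_sp <= f.incoming_sp);
  return f;
}
```

Both 0x7ff8 and 0x7ff0 align down to 0x7ff0, so the aligned value alone cannot say which one the caller had; only the saved copy can.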

-- 
Alan Modra
Australia Development Lab, IBM


Re: configure.{in -> ac} rename (commit 35eafcc71b) broke in-tree binutils building of gcc

2015-07-14 Thread Alan Modra
On Tue, Jul 14, 2015 at 10:13:06AM +0100, Jan Beulich wrote:
> Alan, gcc maintainers,
> 
> I was quite surprised for my gcc 4.9.3 build (using binutils 2.25 instead
> of 2.24 as I had in use with 4.9.2) to fail in rather obscure ways. Quite
> a bit of digging resulted in me finding that gcc/configure.ac looks for
> configure.in in a number of binutils subtrees.

I haven't used combined tree builds of binutils+gcc for a very long
time, so this issue wasn't on my radar at all, sorry.

> Globally replacing
> configure.in by configure.[ai][cn] appears to address this, but I'm not
> sure whether that would be an acceptable change

Certainly sounds reasonable.

> (there doesn't seem
> to be a fix for this in gcc trunk either, which I originally expected I could
> simply backport).

The configure.in->configure.ac rename happened over a year ago so I
guess this shows that not too many people use combined binutils+gcc
builds nowadays.  I've always found combined binutils+gcc builds not
worth the bother compared to simply building and installing binutils
first, as Jim suggests.
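
As an aside on the configure.[ai][cn] pattern Jan mentions: it is an ordinary glob, and its behaviour can be checked from C with POSIX fnmatch().  The helper name is_configure_input is made up for this sketch.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* True if NAME matches the glob suggested in the thread.  */
static bool
is_configure_input (const char *name)
{
  assert (name != NULL);
  return fnmatch ("configure.[ai][cn]", name, 0) == 0;
}
```

The pattern accepts both configure.ac and configure.in.  It is slightly looser than needed, since it would also accept configure.an and configure.ic, but files with those names are not expected in the tree.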

-- 
Alan Modra
Australia Development Lab, IBM


Re: rtx_cost of insns

2015-06-29 Thread Alan Modra
On Mon, Jun 29, 2015 at 09:34:40AM -0500, Segher Boessenkool wrote:
> On Mon, Jun 29, 2015 at 05:16:39PM +0930, Alan Modra wrote:
> > Note that we already have insn_rtx_cost, and it returns a minimum cost
> > for a SET, so register move insns get a cost of 1 insn.  However,
> > despite insn_rtx_cost starting life in combine.c, even combine doesn't
> > use it in all whole insn cases.  :-(
> 
> In what cases does it not?

Practically all of the occurrences of set_src_cost in combine.c can be
called on whole insns.  By "whole insn" I mean of course the right
hand side of a set, or a single set inside a parallel.  I'm not saying
that this causes trouble, since I haven't seen a register move there
(but I haven't looked very hard either).

-- 
Alan Modra
Australia Development Lab, IBM


rtx_cost of insns

2015-06-29 Thread Alan Modra
On Thu, Jun 25, 2015 at 01:28:39PM +0100, Richard Earnshaw wrote:
> Perhaps the best thing to do is to use the OUTER code to spot the
> specific case where you've got a SET and return non-zero in that case.

That's exactly the path I've been following.  It's not as easy as it
sounds..

First, some backends call rtx_cost from their targetm.rtx_costs.
ix86_rtx_costs for instance has this

    case PLUS:
      ...
          if (val == 2 || val == 4 || val == 8)
            {
              *total = cost->lea;
              *total += rtx_cost (XEXP (XEXP (x, 0), 1),
                                  outer_code, opno, speed);
              *total += rtx_cost (XEXP (XEXP (XEXP (x, 0), 0), 0),
                                  outer_code, opno, speed);
              *total += rtx_cost (XEXP (x, 1), outer_code, opno, speed);
              return true;
            }
which, when using a non-zero register move cost, results in

Successfully matched this instruction:
(set (reg:DI 198 [ D.74663 ])
    (plus:DI (plus:DI (reg/v/f:DI 172 [ use_entry ])
            (reg:DI 196 [ D.74662 ]))
        (const_int -32 [0xffffffffffffffe0])))
rejecting combination of insns 179 and 180
original costs 6 + 4 = 10
replacement cost 15

So here the x86 backend is calculating the cost of an lea, plus the
cost of (reg:DI 196), plus the cost of (reg/v/f:DI 172), plus the cost
of (const_int -32).  outer_code is SET.  That means we add two
register moves, increasing the overall cost from 7 to 15.
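
The arithmetic behind "from 7 to 15" can be reproduced with a toy model in C.  This is not the x86 backend: lea_pattern_cost is an invented helper, and base_cost simply stands for everything in the total other than the two register operands.  The COSTS_N_INSNS scaling of four units per insn is the one GCC uses.

```c
#include <assert.h>

/* Same scaling as GCC's COSTS_N_INSNS: one insn is four cost units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Toy model of the sum above.  BASE_COST covers the lea and the
   constant; REG_COST is what rtx_cost returns for each of the two
   register operands when outer_code is SET.  */
static int
lea_pattern_cost (int base_cost, int reg_cost)
{
  assert (base_cost >= 0 && reg_cost >= 0);
  return base_cost + reg_cost + reg_cost;
}
```

With reg_cost 0 the result is 7; with reg_cost COSTS_N_INSNS (1) it is 15, which exceeds the original 10 and so combine rejects the replacement.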

The second problem I've hit is that fwprop.c:should_replace_address
has this:

  /* If the addresses have equivalent cost, prefer the new address
     if it has the highest `set_src_cost'.  That has the potential of
     eliminating the most insns without additional costs, and it
     is the same that cse.c used to do.  */
  if (gain == 0)
    gain = (set_src_cost (new_rtx, VOIDmode, speed)
            - set_src_cost (old_rtx, VOIDmode, speed));

  return (gain > 0);

If register moves have the same cost as adding a small constant to a
register, then this code no longer replaces a pseudo with its value as
an offset from a base.  I think this particular problem can be fixed
quite simply by "return gain >= 0;", but really, this code, like the
x86 code, is expecting the cost of a register move to be zero.
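
A small C model of that decision shows the effect of "gain > 0" versus "gain >= 0".  It is a sketch under assumptions: replace_address_p is an invented name, the costs are passed in as plain integers, and taking the first gain as the difference of the two address costs is my reading of the part of the function not quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy version of the decision.  The tie-break on the set_src costs
   follows the quoted snippet; ACCEPT_TIES selects "gain >= 0" instead
   of "gain > 0".  */
static bool
replace_address_p (int old_addr_cost, int new_addr_cost,
                   int old_src_cost, int new_src_cost, bool accept_ties)
{
  assert (old_addr_cost >= 0 && new_addr_cost >= 0);
  int gain = old_addr_cost - new_addr_cost;
  if (gain == 0)
    gain = new_src_cost - old_src_cost;
  return accept_ties ? gain >= 0 : gain > 0;
}
```

With a register costed at 0 and reg+constant at 4 the replacement is made.  Once both cost 4, the gain is 0 and only the ">=" variant still replaces.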

You'll notice that these example problems are not trying to cost a
whole instruction.  In both cases they want the cost of just a piece
of an instruction, but rtx_cost is called in a way that is
indistinguishable from other code that calls rtx_cost on whole
register move instructions.

The real difficulty is in separating out the whole insn cases from the
partial insn cases.

Note that we already have insn_rtx_cost, and it returns a minimum cost
for a SET, so register move insns get a cost of 1 insn.  However,
despite insn_rtx_cost starting life in combine.c, even combine doesn't
use it in all whole insn cases.  :-(

-- 
Alan Modra
Australia Development Lab, IBM


Re: set_src_cost lying comment

2015-06-24 Thread Alan Modra
On Tue, Jun 23, 2015 at 11:05:45PM -0600, Jeff Law wrote:
> I certainly agree that the cost of a move, logicals and arithmetic is
> essentially the same at the chip level for many processors.  But a copy has
> other properties that make it "cheaper" -- namely we can often propagate it
> away or arrange for the source & dest of the copy to have the same hard
> register which achieves the same effect.
> 
> So one could argue that a copy should have cost 0 as it has a reasonable
> chance of just going away, while logicals, alu operations on the appropriate
> chips should have a cost of 1.

That's an interesting point, and perhaps true for rtl expansion.  I'm
not so sure it is correct for later rtl passes where you'd like to
discourage register moves..

Case in point:  The rs6000 backend happens to use zero for the cost of
setting registers to simple constants.  That might be an accident, but
when I fixed this by making (set (reg) (const_int)) cost one insn as
it actually does for a range of constants, I found some call sequences
regressed.  A call like foo(0,0) is better as
  (set (reg r3) (const_int 0))  li 3,0
  (set (reg r4) (const_int 0))  li 4,0
  (call ...)                    bl foo
rather than
  (set (reg r3) (const_int 0))  li 3,0
  (set (reg r4) (reg r3))       mr 4,3
  (call ...)                    bl foo
CSE will say the second sequence is cheaper if loading a constant is
more expensive than a copy.  In reality the second sequence is less
preferable since you have a register dependency.

A similar problem happens with foo(x+1,x+1) which currently emits
  (set (reg r3) (plus (reg x) (const_int 1)))
  (set (reg r4) (reg r3))
for the arg setup insns.  On modern processors it would be better as
  (set (reg r3) (plus (reg x) (const_int 1)))
  (set (reg r4) (plus (reg x) (const_int 1)))

So in these examples we'd really like register moves to cost one
insn.  Hmm, at least, moves from hard regs ought to cost something.

-- 
Alan Modra
Australia Development Lab, IBM


set_src_cost lying comment

2015-06-21 Thread Alan Modra
set_src_cost says it is supposed to
/* Return the cost of moving X into a register, relative to the cost
   of a register move.  SPEED_P is true if optimizing for speed rather
   than size.  */

Now, set_src_cost of a register move (set (reg1) (reg2)), is zero.
Why?  Well, set_src_cost is used just on the right hand side of a SET,
so the cost is that of (reg2), which is zero according to rtlanal.c
rtx_cost.  targetm.rtx_costs doesn't get a chance to modify this.

Now consider (set (reg1) (ior (reg2) (reg3))), for which set_src_cost
on rs6000 currently returns COSTS_N_INSNS(1).  It seems to me that
this also ought to return zero, if the set_src_cost comment is to be
believed.  I'd claim the right hand side of this expression costs the
same as a register move.  A register move machine insn "mr reg1,reg2"
is encoded as "or reg1,reg2,reg2" on rs6000!

Continuing in the same vein, an AND is no more expensive than an IOR,
and similarly for other ALU operations.  So they all ought to cost
zero??  But this is ridiculous since set_src_cost is used in many
places as the cost of an entire insn, eg. synth_mult compares the cost
of implementing a multiply as a series of adds and shifts against the
cost of a multiply.  If all those adds and shifts are costed at zero,
then synth_mult can't do its job.
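
To make the synth_mult point concrete, here is a toy comparison in C for multiplying by 10.  It is not synth_mult itself: mul10_shift_add and prefer_shift_add are invented for the example, and the cost values in the note after it are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* x * 10 as two shifts and one add: (x << 3) + (x << 1).  */
static unsigned
mul10_shift_add (unsigned x)
{
  return (x << 3) + (x << 1);
}

/* Would the shift/add sequence be chosen over a single multiply insn?
   All three costs are supplied by the caller.  */
static bool
prefer_shift_add (int shift_cost, int add_cost, int mult_cost)
{
  assert (shift_cost >= 0 && add_cost >= 0 && mult_cost >= 0);
  return 2 * shift_cost + add_cost < mult_cost;
}
```

With shifts and adds at 4 units each the sequence costs 12, so it wins against a multiply costed at 16 and loses against one costed at 8.  If shifts and adds were costed at zero, the sequence would win against any multiply of non-zero cost, however long it got.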

So what should that comment say?

-- 
Alan Modra
Australia Development Lab, IBM


Re: Better info for combine results in worse code generated

2015-06-02 Thread Alan Modra
On Tue, Jun 02, 2015 at 11:28:09AM -0500, Segher Boessenkool wrote:
> On Tue, Jun 02, 2015 at 08:49:37AM +0930, Alan Modra wrote:
> > but and64_2_operand doesn't include all of and_operand!
> 
> Maybe I'm slow today, but I don't see it?  Do you have an example?

I need to get new glasses.  That's the best excuse I can come up with
at short notice. :)  mask64_2_operand, used by and64_2_operand,
does indeed cover all of mask_operand and mask64_operand.  Even so,
the predicate deserves to die.

> > > > get rid of WORD_REGISTER_OPERATIONS,
> > > 
> > > rs6000 should not define it.  What e.g. does it mean for mullw?  Or,
> > > worse, mulhw?  Pretty much anything with "w" in its name is problematic.
> > 
> > In many places WORD_REGISTER_OPERATIONS is used, it is saying "don't
> > trust the high bits".  At the moment we definitely do need it defined!
> 
> I don't see that either; do you have a pointer for me?

The first occurrence in combine.c looks like such a place to me.  Also
the first one in rtlanal.c:nonzero_bits1.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Better info for combine results in worse code generated

2015-06-01 Thread Alan Modra
On Mon, Jun 01, 2015 at 08:39:05AM -0500, Segher Boessenkool wrote:
> On Mon, Jun 01, 2015 at 11:33:18AM +0930, Alan Modra wrote:
> > Unifying andsi_mask with anddi_mask, and the fact that constraints for
> > const_int see VOIDmode rather than the operand mode is why we get
> > rldicr rather than rlwinm.  Easily fixed by separating the si/di
> > patterns, and with a little more work I may even be able to keep them
> > together.
> 
> Maybe just swapping T to be before S will do what you want, already?

Nope.

Index: gcc/config/rs6000/predicates.md
===================================================================
--- gcc/config/rs6000/predicates.md (revision 223878)
+++ gcc/config/rs6000/predicates.md (working copy)
@@ -764,7 +764,11 @@
 
   if (TARGET_POWERPC64)
 {
-  /* Fail if the mask is not 32-bit.  */
+  /* Fail if the mask is not 32-bit.  Note: If constraints are
+implemented using mask_operand then they will never fail this
+test.  const_ints are VOIDmode, which is what is seen here
+when called from a constraint.  When called as a predicate,
+the match_operand mode is seen.  */
   if (mode == DImode && (c & ~(unsigned HOST_WIDE_INT) 0xffffffff) != 0)
return 0;
 
The above, part of a patch I was writing to fix these problems, is why
putting T before S doesn't work.  eg. "T" matches 0xffffffff80000000,
which is good for SImode where you're really masking with 0x80000000,
but using rlwinm for the same constant in DImode would of course mask
off the top 32 bits.
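
The mode problem can be modelled in a few lines of C.  This is a sketch, not the rs6000 predicate: the toy_mode enum and all function names are invented, and it assumes the constant in the "not 32-bit" test is 0xffffffff, as the comment in the patch suggests.  It relies on the usual two's complement behaviour when converting to int32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_mode { TOY_VOIDmode, TOY_SImode, TOY_DImode };

/* The DImode test from the patch: is any bit above the low 32 set?  */
static bool
fits_low_32_bits (uint64_t c)
{
  return (c & ~(uint64_t) 0xffffffff) == 0;
}

/* Sign-extend a 32-bit mask to 64 bits, the way an SImode constant is
   held in a const_int.  */
static uint64_t
simode_const (uint32_t mask)
{
  return (uint64_t) (int64_t) (int32_t) mask;
}

/* The "fail if the mask is not 32-bit" step only fires when the
   caller passes DImode.  A constraint sees VOIDmode, so it never
   fails here.  */
static bool
passes_32bit_test (uint64_t c, enum toy_mode mode)
{
  assert (mode == TOY_VOIDmode || mode == TOY_SImode || mode == TOY_DImode);
  if (mode == TOY_DImode && !fits_low_32_bits (c))
    return false;
  return true;
}
```

A mask with the top SImode bit set is sign-extended, so it fails the test when the mode is known to be DImode but slips through when the mode is VOIDmode.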

> > In and3 expander I think you want the following since
> > and64_2_operand covers the extra double-rotate cases, not all DImode.
> > 
> > -  if ((mode == DImode && !and64_2_operand (operands[2], mode))
> > -  || (mode != DImode && !and_operand (operands[2], mode)))
> > +  if (!and_operand (operands[2], mode)
> > +  && (mode != DImode || !and64_2_operand (operands[2], mode)))
> 
> and64_2_operand includes all of and_operand.  I agree it is a mess.

but and64_2_operand doesn't include all of and_operand!

> > In and3_imm_mask_dot and and3_imm_mask_dot2.  Typo?
> > -   && any_mask_operand (operands[2], mode)"
> > +   && !any_mask_operand (operands[2], mode)"
> 
> Thinko; that whole line should just be removed.  We prefer e.g. "rlwinm"
> over "andi.", but "andi." over "rlwinm.".  I'll do a patch.

OK, I have a patch too..

> > get rid of WORD_REGISTER_OPERATIONS,
> 
> rs6000 should not define it.  What e.g. does it mean for mullw?  Or,
> worse, mulhw?  Pretty much anything with "w" in its name is problematic.

In many places WORD_REGISTER_OPERATIONS is used, it is saying "don't
trust the high bits".  At the moment we definitely do need it defined!

> and gets rid of 2rld completely.

That's a good idea.  If we want it still, and I think we do, just
implement the two rldicl/r in the expander.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Better info for combine results in worse code generated

2015-05-31 Thread Alan Modra
On Sat, May 30, 2015 at 08:02:20AM -0500, Segher Boessenkool wrote:
> On Sat, May 30, 2015 at 10:47:27AM +0930, Alan Modra wrote:
> > Huh, that does look like you've destroyed my claim about SImode AND.
> 
> Carefully worded :-)

Yes, I meant it in the sense of refuting an argument, but it also fits
the culprit who broke the AND patterns.  :-)

Unifying andsi_mask with anddi_mask, and the fact that constraints for
const_int see VOIDmode rather than the operand mode is why we get
rldicr rather than rlwinm.  Easily fixed by separating the si/di
patterns, and with a little more work I may even be able to keep them
together.

There are some other problems too.

In and3 expander I think you want the following since
and64_2_operand covers the extra double-rotate cases, not all DImode.

-  if ((mode == DImode && !and64_2_operand (operands[2], mode))
-  || (mode != DImode && !and_operand (operands[2], mode)))
+  if (!and_operand (operands[2], mode)
+  && (mode != DImode || !and64_2_operand (operands[2], mode)))

In and3_imm_mask_dot and and3_imm_mask_dot2.  Typo?
-   && any_mask_operand (operands[2], mode)"
+   && !any_mask_operand (operands[2], mode)"

And that calls into question the !logical_const_operand in the insn
predicates for and3_mask_dot and and3_mask_dot2.  Certain
masks satisfy both any_mask_operand and logical_const_operand..  After
fixing the typo, neither the andi./andis. patterns nor the
rlwinm./rldic[rl]. patterns will be enabled for those masks.  Seems to
me we should omit !logical_const_operand from those insn predicates.
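
As a rough illustration of the overlap, the C sketch below uses simplified stand-ins for the two predicates.  These are assumptions for the sake of the example, not the real rs6000 definitions: is_contiguous_mask accepts any single run of 1 bits, and is_logical_const accepts a constant whose set bits all lie in bits 0..15 (andi.) or all in bits 16..31 (andis.).

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for any_mask_operand: one contiguous run of 1 bits.  */
bool
is_contiguous_mask (uint64_t v)
{
  uint64_t filled = v | (v - 1);   /* fill in the zeros below the run */
  return v != 0 && (filled & (filled + 1)) == 0;
}

/* Simplified stand-in for logical_const_operand: only bits 0..15 set
   (andi.) or only bits 16..31 set (andis.).  */
bool
is_logical_const (uint64_t v)
{
  return (v & ~UINT64_C (0xffff)) == 0
         || (v & ~UINT64_C (0xffff0000)) == 0;
}
```

With these stand-ins 0xff and 0xff00 satisfy both tests, 0x1234 only the second, and 0xfffff only the first, so a predicate of the form "mask and not logical constant" rejects the first two.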

> I don't think it is a good idea to optimise code based on assumptions
> of what SImode SETs will do to the dest seen as DImode, without making
> those assumptions explicit in the RTL.

I agree.  Do you intend to get rid of WORD_REGISTER_OPERATIONS,
POINTERS_EXTEND_UNSIGNED, PUSH_ROUNDING, SHORT_IMMEDIATES_SIGN_EXTEND,
and LOAD_EXTEND_OP?  ;-)

-- 
Alan Modra
Australia Development Lab, IBM


Re: Better info for combine results in worse code generated

2015-05-29 Thread Alan Modra
On Fri, May 29, 2015 at 10:00:04AM -0500, Segher Boessenkool wrote:
> On Fri, May 29, 2015 at 11:20:08PM +0930, Alan Modra wrote:
> > On Fri, May 29, 2015 at 07:58:38AM -0500, Segher Boessenkool wrote:
> > > On Fri, May 29, 2015 at 12:41:20PM +0930, Alan Modra wrote:
> > > > +/* Describe how rtl operations on registers behave on this target when
> > > > +   operating on less than the entire register.  */
> > > > +#define EXTEND_OP(OP) \
> > > > +  (GET_MODE (OP) != SImode \
> > > > +   || !TARGET_POWERPC64\
> > > > +   ? UNKNOWN   \
> > > > +   : (GET_CODE (OP) == AND \
> > > > +  || GET_CODE (OP) == ZERO_EXTEND  \
> > > > +  || GET_CODE (OP) == ASHIFT   \
> > > > +  || GET_CODE (OP) == ROTATE   \
> > > > +  || GET_CODE (OP) == LSHIFTRT)\
> > > > +   ? ZERO_EXTEND   \
> > > > +   : (GET_CODE (OP) == SIGN_EXTEND \
> > > > +  || GET_CODE (OP) == ASHIFTRT)\
> > > > +   ? SIGN_EXTEND   \
> > > > +   : UNKNOWN)
> > > 
> > > I think this is too simplistic though.  For example, AND with -7 is not
> > > zero-extended (rlwinm rD,rA,0,31,28 sets the high 32 bits of rD to the low
> > > 32 bits of rA).
> > 
> > We take some pains in rs6000.md to ensure that the wrap-around case
> > for rlwinm does not occur for TARGET_POWERPC64.
> 
> I consider that a bug; it pessimises code.

At the time I added the checks for wrap-around, I recall that gcc
generated wrong code without the fix.

> > You'll find that an
> > SImode AND with any value is in fact zero extending.
> 
> int f(int x) { return x & 0xc000; }
> 
> is a counter-example with current trunk (it does a rldicr).

Huh, that does look like you've destroyed my claim about SImode AND.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Better info for combine results in worse code generated

2015-05-29 Thread Alan Modra
On Fri, May 29, 2015 at 07:58:38AM -0500, Segher Boessenkool wrote:
> On Fri, May 29, 2015 at 12:41:20PM +0930, Alan Modra wrote:
> > +/* Describe how rtl operations on registers behave on this target when
> > +   operating on less than the entire register.  */
> > +#define EXTEND_OP(OP) \
> > +  (GET_MODE (OP) != SImode \
> > +   || !TARGET_POWERPC64\
> > +   ? UNKNOWN   \
> > +   : (GET_CODE (OP) == AND \
> > +  || GET_CODE (OP) == ZERO_EXTEND  \
> > +  || GET_CODE (OP) == ASHIFT   \
> > +  || GET_CODE (OP) == ROTATE   \
> > +  || GET_CODE (OP) == LSHIFTRT)\
> > +   ? ZERO_EXTEND   \
> > +   : (GET_CODE (OP) == SIGN_EXTEND \
> > +  || GET_CODE (OP) == ASHIFTRT)\
> > +   ? SIGN_EXTEND   \
> > +   : UNKNOWN)
> 
> I think this is too simplistic though.  For example, AND with -7 is not
> zero-extended (rlwinm rD,rA,0,31,28 sets the high 32 bits of rD to the low
> 32 bits of rA).

We take some pains in rs6000.md to ensure that the wrap-around case
for rlwinm does not occur for TARGET_POWERPC64.  You'll find that an
SImode AND with any value is in fact zero extending.
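
For reference, the wrap-around case under discussion can be modelled in C.  This is a minimal sketch assuming the Power ISA description of rlwinm on a 64-bit implementation (rotate the low word, replicate it into both halves, then AND with MASK(MB+32, ME+32)); the names mask64 and rlwinm64 are invented here and this is not GCC code.

```c
#include <assert.h>
#include <stdint.h>

/* MASK(mb, me) with big-endian bit numbering (bit 0 is the MSB):
   ones from bit mb through bit me, wrapping around when mb > me.  */
uint64_t
mask64 (unsigned mb, unsigned me)
{
  assert (mb < 64 && me < 64);
  uint64_t from_mb = ~UINT64_C (0) >> mb;          /* bits mb..63 */
  uint64_t to_me = ~UINT64_C (0) << (63 - me);     /* bits 0..me */
  return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* Model of rlwinm rA,rS,SH,MB,ME as seen in a 64-bit register.  */
uint64_t
rlwinm64 (uint64_t rs, unsigned sh, unsigned mb, unsigned me)
{
  assert (sh < 32 && mb < 32 && me < 32);
  uint32_t lo = (uint32_t) rs;
  uint32_t rot = sh ? (lo << sh) | (lo >> (32 - sh)) : lo;
  uint64_t r = ((uint64_t) rot << 32) | rot;       /* ROTL32: word doubled */
  return r & mask64 (mb + 32, me + 32);
}
```

In this model rlwinm64 (x, 0, 31, 28), the AND with -7, leaves a copy of the low word of x in the high word, while any non-wrapping mask (MB <= ME) leaves the high 32 bits zero.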

-- 
Alan Modra
Australia Development Lab, IBM


Re: Better info for combine results in worse code generated

2015-05-28 Thread Alan Modra
   REGNO (x)
 {
-  *result = rsp->last_set_sign_bit_copies;
+  int signbits = rsp->last_set_sign_bit_copies;
+  signbits -= (GET_MODE_PRECISION (rsp->last_set_mode)
+  - GET_MODE_PRECISION (mode));
+  if (signbits <= 0)
+   signbits = 1;
+  *result = signbits;
   return NULL;
 }
 
@@ -12716,9 +12723,26 @@ record_value_for_reg (rtx reg, rtx_insn *insn, rtx
   if (GET_MODE_CLASS (mode) == MODE_INT
  && HWI_COMPUTABLE_MODE_P (mode))
mode = nonzero_bits_mode;
-  rsp->last_set_nonzero_bits = nonzero_bits (value, mode);
-  rsp->last_set_sign_bit_copies
-   = num_sign_bit_copies (value, GET_MODE (reg));
+  unsigned HOST_WIDE_INT nonzero = nonzero_bits (value, mode);
+#if defined (WORD_REGISTER_OPERATIONS) && defined (EXTEND_OP)
+  /* Some operations might be known to zero extend to a wider mode.  */
+  if (GET_MODE_PRECISION (GET_MODE (reg)) < BITS_PER_WORD
+ && EXTEND_OP (value) == ZERO_EXTEND)
+   nonzero &= GET_MODE_MASK (GET_MODE (reg));
+#endif
+  rsp->last_set_nonzero_bits = nonzero;
+  unsigned int signbits = num_sign_bit_copies (value, GET_MODE (reg));
+#if defined (WORD_REGISTER_OPERATIONS) && defined (EXTEND_OP)
+  /* Some operations might be known to sign extend to a wider mode.  */
+  if (GET_MODE_PRECISION (GET_MODE (reg)) < BITS_PER_WORD
+ && GET_MODE_CLASS (GET_MODE (reg)) == MODE_INT
+ && EXTEND_OP (value) == SIGN_EXTEND)
+   {
+ rsp->last_set_mode = word_mode;
+ signbits += BITS_PER_WORD - GET_MODE_PRECISION (GET_MODE (reg));
+   }
+#endif
+  rsp->last_set_sign_bit_copies = signbits;
 }
 }
 
Index: config/rs6000/rs6000.h
===
--- config/rs6000/rs6000.h  (revision 223725)
+++ config/rs6000/rs6000.h  (working copy)
@@ -2043,6 +2043,23 @@ do { 
 \
on the full register even if a narrower mode is specified.  */
 #define WORD_REGISTER_OPERATIONS
 
+/* Describe how rtl operations on registers behave on this target when
+   operating on less than the entire register.  */
+#define EXTEND_OP(OP) \
+  (GET_MODE (OP) != SImode \
+   || !TARGET_POWERPC64\
+   ? UNKNOWN   \
+   : (GET_CODE (OP) == AND \
+  || GET_CODE (OP) == ZERO_EXTEND  \
+  || GET_CODE (OP) == ASHIFT   \
+  || GET_CODE (OP) == ROTATE   \
+  || GET_CODE (OP) == LSHIFTRT)\
+   ? ZERO_EXTEND   \
+   : (GET_CODE (OP) == SIGN_EXTEND \
+  || GET_CODE (OP) == ASHIFTRT)\
+   ? SIGN_EXTEND   \
+   : UNKNOWN)
+
 /* Define if loading in MODE, an integral mode narrower than BITS_PER_WORD
will either zero-extend or sign-extend.  The value of this macro should
be the code that says which one of the two operations is implicitly
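
The sign-bit adjustment in the combine.c hunk above can be read as the following stand-alone C function.  This is only a sketch of the arithmetic: narrow_sign_bit_copies is a hypothetical name, and the plain int precisions stand in for GET_MODE_PRECISION of rsp->last_set_mode and of the mode being asked about.

```c
/* Given that a value last set in a mode of SET_PREC bits is known to have
   SIGNBITS copies of its sign bit, return how many copies are known when
   the register is read in a narrower mode of PREC bits.  At least one
   copy (the sign bit itself) is always known.  */
int
narrow_sign_bit_copies (int signbits, int set_prec, int prec)
{
  signbits -= set_prec - prec;
  return signbits <= 0 ? 1 : signbits;
}
```

For example, 40 known sign-bit copies in a 64-bit value become 8 when the low 32 bits are read, and 20 copies become just 1, since nothing beyond the sign bit itself is then known.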

-- 
Alan Modra
Australia Development Lab, IBM


Re: Better info for combine results in worse code generated

2015-05-28 Thread Alan Modra
On Thu, May 28, 2015 at 10:47:53AM -0400, David Edelsohn wrote:
> This seems like a problem with the cost model.  Rc instructions are
> more expensive and should be represented as such in rtx_costs.

The record instructions do have a higher cost (8 vs. 4 for normal
insns).  If the cost is increased I don't think you'll see them
generated at all, which would fix my testcase but probably regress
others.

-- 
Alan Modra
Australia Development Lab, IBM


Better info for combine results in worse code generated

2015-05-28 Thread Alan Modra
 opportunity to try this three insn
combination, because we've already reduced down to two insns.

Does anyone have any clues as to how I might fix this?  I'm not keen
on adding an insn_and_split to rs6000.md to recognize the 6 -> 8
combination, because one of the aims of the wider patch I was working
on was to remove patterns like rotlsi3_64, ashlsi3_64, lshrsi3_64 and
ashrsi3_64.  Adding patterns in order to remove others doesn't sound
like much of a win.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-22 Thread Alan Modra
On Thu, May 21, 2015 at 01:44:31PM -0500, Segher Boessenkool wrote:
> Let's wait for Alan's patch that makes combine not reorder things
> unnecessarily, that should take care of it all as far as I see.

Patch here https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02055.html
It doesn't do anything fancy, just stops gratuitous register
reordering.  If simplification or canonicalization occurs, then
registers may well be reordered.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Alan Modra
On Thu, May 21, 2015 at 07:39:16AM -0500, Segher Boessenkool wrote:
> On Thu, May 21, 2015 at 08:06:04PM +0930, Alan Modra wrote:
> > FAIL: gcc.target/powerpc/ti_math1.c scan-assembler-times adde 1
> 
> It doesn't trigger on big-endian; what is different?

Register dependencies.  One of the arguments is in r4,r5, the return
value in r3,r4.  We calculate the low 64 bits first, which goes to r4
on big-endian, overlapping the argument.

> > Trying 18, 9 -> 24:
> > Failed to match this instruction:
> > (set (reg:DI 4 4 [+8 ])
> > (plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
> > (reg:DI 76 ca))
> > (reg:DI 169 [+8 ])))
> 
> For some reason it has the CA reg not last.

simplify-rtx.c:simplify_plus_minus_op_data_cmp

>  I think we should add to
> the canonicalisation rules so that fixed regs sort after other regs.
> That requires a lot of testing.

What if you have two hard regs as above?  Which of reg 5 and reg 76
sorts first?  If they are sorted by register number, then ca appears
in the wrong place.  Reverse sorting hard regs might work for this
pattern on powerpc, but that seems an odd choice.  And if you say hard
regs ought to keep their original order in rtl like the above, then it
is no more difficult to keep all regs in their original order.

> > original costs 4 + 8 + 4 = 16
> > replacement costs 4 + 4 = 8
> 
> Still need to fix the costs as well (but they work as-is; well enough
> that is).

Yes, I noticed that too.

> Are these copies guaranteed to (still) be in this basic block,
> after the passes before combine?  Did those passes do anything to
> prevent moving it?  I'm asking because it would be good to use the
> same conditions in that case.

Something I need to investigate.  As I said, the patch was just a
quick hack.

-- 
Alan Modra
Australia Development Lab, IBM


[RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Alan Modra
FAIL: gcc.target/powerpc/ti_math1.c scan-assembler-times adde 1
is seen on powerpc64le-linux since somewhere between revision 218587
and 218616.  See
https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01287.html and
https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01325.html

A regression hunt fingers one of Segher's 2014-12-10 patches to the
rs6000 backend, git commit 0f1bedb4 or svn rev 218595.  Segher might
argue that generated code is better after this commit, and I'd agree
that his change is a good one in general, but even so it would be nice
to generate the ideal code.  Curiously, the ideal code is generated at
-O1, but we regress at -O2..

before            after             ideal (-O1)
add_128:          add_128:          add_128:
    ld 10,0(3)        ld 9,0(3)         ld 9,0(3)
    ld 11,8(3)        ld 10,8(3)        ld 10,8(3)
    addc 8,4,10       addc 3,4,9        addc 3,4,9
    adde 9,5,11       addze 5,5         adde 4,5,10
    mr 3,8            add 4,5,10        blr
    mr 4,9            blr
    blr
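
For readers without the testsuite to hand, the listings correspond to a function of roughly the following shape.  This is an assumption inferred from the register usage (pointer in r3, 128-bit value in r4/r5), not a copy of ti_math1.c; add_128_by_halves is a made-up helper showing the carry chain that addc/adde implement, with index 0 holding the low doubleword.

```c
#include <stdint.h>

/* Assumed shape of the testcase function.  __int128 is a GCC extension
   available on 64-bit targets.  */
unsigned __int128
add_128 (unsigned __int128 *ptr, unsigned __int128 val)
{
  return *ptr + val;
}

/* The same sum written out the way the ideal two-insn sequence does it.  */
void
add_128_by_halves (const uint64_t a[2], const uint64_t b[2], uint64_t sum[2])
{
  uint64_t lo = a[0] + b[0];          /* addc: low half, sets CA */
  uint64_t carry = lo < a[0];         /* CA is 1 if the low add wrapped */
  sum[1] = a[1] + b[1] + carry;       /* adde: high half plus CA */
  sum[0] = lo;
}
```

The "after" column instead folds CA into one operand first (addze) and then needs a separate add, which is the extra instruction the testcase counts.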

I went looking into where the addze appeared, and found combine.

Trying 18, 9 -> 24:
Failed to match this instruction:
(set (reg:DI 4 4 [+8 ])
(plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
(reg:DI 76 ca))
(reg:DI 169 [+8 ])))
Successfully matched this instruction:
(set (reg:DI 167 [ D.2366+8 ])
(plus:DI (reg:DI 5 5 [ val+8 ])
(reg:DI 76 ca)))
Successfully matched this instruction:
(set (reg:DI 4 4 [+8 ])
(plus:DI (reg:DI 167 [ D.2366+8 ])
(reg:DI 169 [+8 ])))
allowing combination of insns 18, 9 and 24
original costs 4 + 8 + 4 = 16
replacement costs 4 + 4 = 8

Here are the three insns involved, sans source line numbers and notes.

(insn 18 17 4 2 (set (reg:DI 165 [ val+8 ])
(reg:DI 5 5 [ val+8 ])) {*movdi_internal64})
...
(insn 9 8 23 2 (parallel [
(set (reg:DI 167 [ D.2366+8 ])
(plus:DI (plus:DI (reg:DI 165 [ val+8 ])
(reg:DI 169 [+8 ]))
(reg:DI 76 ca)))
(clobber (reg:DI 76 ca))
]) {*adddi3_carry_in_internal})
...
(insn 24 23 15 2 (set (reg:DI 4 4 [+8 ])
(reg:DI 167 [ D.2366+8 ])) {*movdi_internal64})

So, a move copying an argument register to a pseudo, one insn from the
body of the function, and a move copying a pseudo to a result
register.  The thought I had was: Is it really combine's business to
look at copies from/to ABI mandated hard registers?  Isn't removing
the copies something that register allocation can do better?  If so,
then combine is doing unnecessary work.

As a quick hack, I tried the following.

Index: gcc/combine.c
===
--- gcc/combine.c   (revision 223431)
+++ gcc/combine.c   (working copy)
@@ -1281,6 +1281,16 @@ combine_instructions (rtx_insn *f, unsigned int nr
  if (!NONDEBUG_INSN_P (insn))
continue;
 
+ if (this_basic_block == EXIT_BLOCK_PTR_FOR_FN (cfun)->prev_bb)
+   {
+ rtx set = single_set (insn);
+ if (set
+ && REG_P (SET_DEST (set))
+ && HARD_REGISTER_P (SET_DEST (set))
+ && REG_P (SET_SRC (set)))
+   continue;
+   }
+
  while (last_combined_insn
 && last_combined_insn->deleted ())
last_combined_insn = PREV_INSN (last_combined_insn);

This cures the powerpc64le testcase failure, but Segher said on irc I
was risking breaking x86 and other targets.  Perhaps that was trying
to push me to fix the underlying combine problem.  :)  In any case, I
didn't believe him, and tested the patch on powerpc64le-linux and
x86_64-linux.  No regressions with --languages=all,go, and an objdump -d
comparison of gcc/*.o against virgin source shows no unexpected
changes.  powerpc64le-linux actually shows no changes at all apart
from combine.o while x86_64-linux shows some changes in register
allocation and cmove arg swapping with inversion of the condition.
There were no extra instructions.

So, is this worth pursuing in order to speed up combine?  I'd be
inclined to patch create_log_links instead for a proper patch.


Incidentally the underlying problem in combine (well the first one I
spotted, there might be more), is that 
  if (flag_expensive_optimizations)
{
  /* Pass pc_rtx so no substitutions are done, just
 simplifications.  */
"simplifies" this i2src

(plus:DI (plus:DI (reg:DI 165 [ val+8 ])
(reg:DI 169 [+8 ]))
(reg:DI 76 ca))

to this

(plus:DI (plus:DI (reg:DI 76 ca)
(reg:DI 165 [ val+8 ]))
(reg:DI 169 [+8 ]))

and the latter has the ca register in the wrong place.  So a split is
tried and you get addze.  I'm working on this.  The reordering happens
inside simplify_plus_minus.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Undefined Local Symbol on PowerPC

2015-04-15 Thread Alan Modra
On Wed, Apr 15, 2015 at 04:10:33PM -0500, Joel Sherrill wrote:
> Based on the grep, the .4byte directives are referencing a bogus symbol.
> 
> Does this look like a GCC bug?

Yes, unless you have some horrible asm there referencing the symbol.

-- 
Alan Modra
Australia Development Lab, IBM


--disable-shared bootstrap dies building libcc1

2015-02-13 Thread Alan Modra
On both x86_64-linux and powerpc64-linux, a --disable-shared bootstrap
dies with linker errors when building libcc1.so.  You can't build a
shared library using objects from the static libstdc++ (or any other
library built without -fpic/-fPIC).

OK, so there is a workaround, specify --disable-plugin too, but
shouldn't this be automatic if --disable-shared is given?

-- 
Alan Modra
Australia Development Lab, IBM


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Alan Modra
On Thu, Feb 12, 2015 at 06:55:30PM -0500, Rich Felker wrote:
> On Fri, Feb 13, 2015 at 10:12:11AM +1030, Alan Modra wrote:
> > I posted support for TLSDESC on powerpc back in 2009 (search for
> > powerpc _tls_get_addr call optimization).  The patch wasn't reviewed,
> > and I didn't push it because my benchmark tests didn't show much of
> > a gain.  Quite possibly I wasn't using the right benchmark.
> 
> Were you measuring static-allocated TLSDESC vs non-TLSDESC GD model?
> That's the case where there should be a "big" difference, though I'm
> still somewhat skeptical of the benefits in real-world usage cases.

I can't remember, sorry, it was too long ago.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Alan Modra
On Thu, Feb 12, 2015 at 12:07:24PM -0500, Rich Felker wrote:
> On Thu, Feb 12, 2015 at 08:56:26AM -0800, H.J. Lu wrote:
> > On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek  wrote:
> > > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
> > >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> > >> > Hello,
> > >> >
> > >> > we're running into a problem related to use of initial-exec access to
> > >> > TLS variables in dynamically-loaded libraries.  Now, in general, this
> > >> > is actually not supported.  However, there seems to be an "unofficial"
> > >> > extension that allows selected system libraries to use small amounts
> > >> > of static TLS space to allow critical variables to be defined to use
> > >> > the initial-exec model even in dynamically-loaded libraries.
> > >>
> > >> This usage is supposed to be deprecated. Why isn't libgomp using
> > >> TLSDESC/gnu2 model?
> > >
> > > Because it is significantly slower.
> > 
> > And TLSDESC/gnu2 model isn't implemented for x32.
> > There are no tests for TLSDESC/gnu2 model in glibc.
> > I have no idea whether it works in glibc master on x86-32 or
> > x86-64 today.
> 
> Then fixing this should be a priority, IMO. Broken libraries using IE
> model "for performance" are a problem that's not going to go away
> until TLSDESC gets properly adopted.

I posted support for TLSDESC on powerpc back in 2009 (search for
powerpc _tls_get_addr call optimization).  The patch wasn't reviewed,
and I didn't push it because my benchmark tests didn't show much of
a gain.  Quite possibly I wasn't using the right benchmark.

-- 
Alan Modra
Australia Development Lab, IBM


Re: LTO and version scripts

2014-08-06 Thread Alan Modra
On Tue, Aug 05, 2014 at 08:18:06PM -0400, Ulrich Drepper wrote:
> On Tue, Aug 5, 2014 at 12:57 AM, Alan Modra  wrote:
> > What version linker?  In particular, do you have the fix for PR12975?
> 
> The Fedora 19 version.  I think it hasn't changed since then which
> means it is 2.23.88.0.1-13 (from the RPM version number).  No idea
> whether that fix is included and unfortunately won't have time to try
> before the weekend.

Both Fedora 19 and 20 have the patch needed for this to work.  Hmm, I
suppose the other thing necessary is a gcc that implements
LDPT_GET_SYMBOLS_V2.  You may be lacking that.  Here's what I see with
mainline gcc and ld.

cat > ltoshare.c <<\EOF
int
cond (void)
{
  return 0;
}

extern void something (void);

int
main (void)
{
  if (cond ())
    something ();
  return 0;
}
EOF
cat > ltoshare.ver <<\EOF
{
  global: main;
  local: *;
};
EOF
~/build/gcc-current/gcc/xgcc -B ~/build/gcc-current/gcc/ -B ld/tmpdir/ld -O2 \
  -fPIC -flto -c ltoshare.c
~/build/gcc-current/gcc/xgcc -B ~/build/gcc-current/gcc/ -B ld/tmpdir/ld \
  -shared -flto -o ltoshare.so ltoshare.o
nm -D ltoshare.so | grep something
 U something
~/build/gcc-current/gcc/xgcc -B ~/build/gcc-current/gcc/ -B ld/tmpdir/ld \
  -shared -flto -o ltoshare.so ltoshare.o -Wl,--version-script=ltoshare.ver
nm -D ltoshare.so | grep something


-- 
Alan Modra
Australia Development Lab, IBM


Re: LTO and version scripts

2014-08-04 Thread Alan Modra
On Mon, Jul 07, 2014 at 11:04:17AM +0200, Richard Biener wrote:
> On Mon, Jun 30, 2014 at 2:35 PM, Ulrich Drepper  wrote:
> > Using LTO to create a DSO works fine (i.e., it performs the expected
> > optimizations) for symbols which are marked with visibility
> > attributes.  It does not work, though, when the symbol is not
> > restricted in its visibility in the source file but instead is
> > prevented from being exported from the DSO by a version script (ld
> > --version-script=FILE).
> >
> > Is this known?  I only found general problems related to linker
> > scripts although version script parameters do not cause any other
> > failures.
> 
> Yes, I've run into this as well.  IMHO the issue is that the linker(s)
> do not process the linker script "properly" when handing off
> the resolution data to the linker plugin.  So it's a linker bug AFAIU.

What version linker?  In particular, do you have the fix for PR12975?

-- 
Alan Modra
Australia Development Lab, IBM


Re: Reload generate invalid instruction on ppc64

2014-08-04 Thread Alan Modra
On Mon, Aug 04, 2014 at 05:54:04PM -0700, Carrot Wei wrote:
> Another problem is in the definition of insn pattern "*movdi_internal64".
> 
> (define_insn "*movdi_internal64"
>   [(set (match_operand:DI 0 "nonimmediate_operand"
> "=Y,r,r,r,r,r,?m,?*d,?*d,r,*h,*h,r,?*wg,r,?*wm")
> (match_operand:DI 1 "input_operand"
> "r,Y,r,I,L,nF,d,m,d,*h,r,0,*wg,r,*wm,r"))]
>   "TARGET_POWERPC64
>&& (gpc_reg_operand (operands[0], DImode)
>|| gpc_reg_operand (operands[1], DImode))"
> 
> The predicates of this insn pattern allow the moving of an integer to
> VSX register, but there is no constraint allow this case. Can this
> cause problem in reload?

Probably, just as you found with fprs.  The underlying issue is that
the operand predicates don't match the operand constraints.  What's
more, you can't make them match without breaking up the insn, or
adding a whole lot of extra alternatives.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC] PR61300 K&R incoming args

2014-06-05 Thread Alan Modra
On Thu, Jun 05, 2014 at 01:19:19PM -0600, Jeff Law wrote:
> And so the problem you're trying to solve is that, when compiling the
> callee, you incorrectly assumed that if there was not a prototype
> for the callee's definition that the caller had set up the save area
> and that you could flush arguments to it.  That's not true in the
> case where the caller had a prototype for the callee in-scope (and
> the callee was not a varargs function).
> 
> Right?  Just want to make sure I understand the problem.

Exactly correct.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC] PR61300 K&R incoming args

2014-06-02 Thread Alan Modra
On Mon, Jun 02, 2014 at 12:00:41PM +0200, Florian Weimer wrote:
> On 05/31/2014 08:56 AM, Alan Modra wrote:
> 
> >>It's fine to change ABI when compiling an old-style function
> >>definition for which a prototype exists (relative to the
> >>non-prototype case).  It happens on i386, too.
> >
> >That might be so, but when compiling the function body you must assume
> >the worst case, whatever that might be, at the call site.  For K&R
> >code, our error was to assume the call was unprototyped (which
> >paradoxically is the best case) when compiling the function body.
> 
> Is this really a supported use case?

Of course!  We still have K&R code lying around, as evidenced by the
PR.

>  I think I remember tracking
> down a bug which was related to a lack of float -> double promotion
> because the call was prototyped, and the old-style function
> definition wasn't.  This would have been on, ugh, SPARC.  I think
> this happened only in certain cases (float arguments, probably).

Yes, there are some limitations on parameter types that may be used
with unprototyped functions.

> Does this trigger more often on ppc64 ELFv2, to the extent it
> becomes a quality-of-implementation issue?  I'm pretty sure the
> standards do not require a particular behavior in such cases.

The PR isn't about the sort of parameter mismatch that you seem to be
thinking about.  The code in question is perfectly legal old-style
K&R where there is no float/double or int/long/void * trouble.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC] PR61300 K&R incoming args

2014-05-30 Thread Alan Modra
On Fri, May 30, 2014 at 09:22:30PM +0200, Florian Weimer wrote:
> On 05/26/2014 09:38 AM, Alan Modra wrote:
> 
> >Background: The ELFv2 ABI requires a parameter save area only when
> >stack is actually used to pass parameters, and since varargs are
> >passed on the stack, unprototyped calls must pass both on the stack
> >and in registers.  OK, easy you say, !prototype_p(fun) means a
> >parameter save area is needed.  However, a prototype might not be in
> >scope when compiling an old K&R style C function body, but this does
> >*not* mean a parameter save area has necessarily been allocated.
> 
> It's fine to change ABI when compiling an old-style function
> definition for which a prototype exists (relative to the
> non-prototype case).  It happens on i386, too.

That might be so, but when compiling the function body you must assume
the worst case, whatever that might be, at the call site.  For K&R
code, our error was to assume the call was unprototyped (which
paradoxically is the best case) when compiling the function body.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC] PR61300 K&R incoming args

2014-05-30 Thread Alan Modra
On Fri, May 30, 2014 at 11:27:52AM -0600, Jeff Law wrote:
> On 05/26/14 01:38, Alan Modra wrote:
> >PR61300 shows a need to differentiate between incoming and outgoing
> >REG_PARM_STACK_SPACE for the PowerPC64 ELFv2 ABI, due to code like
> >function.c:assign_parm_is_stack_parm determining that a stack home
> >is available for incoming args if REG_PARM_STACK_SPACE is non-zero.
> >
> >Background: The ELFv2 ABI requires a parameter save area only when
> >stack is actually used to pass parameters, and since varargs are
> >passed on the stack, unprototyped calls must pass both on the stack
> >and in registers.  OK, easy you say, !prototype_p(fun) means a
> >parameter save area is needed.  However, a prototype might not be in
> >scope when compiling an old K&R style C function body, but this does
> >*not* mean a parameter save area has necessarily been allocated.  A
> >caller may well have a prototype in scope at the point of the call.
> Ugh.  This reminds me a lot of the braindamage we had to deal with
> in the original PA abi's handling of FP values.
> 
> In the general case, how can any function ever be sure as to whether
> or not its prototype was in scope at a call site?  Yea, we can know
> for things with restricted scope, but if it's externally visible, I
> don't see how we're going to know the calling context with absolute
> certainty.
> 
> What am I missing here?

When compiling the function body you don't need to know whether a
prototype was in scope at the call site.  You just need to know the
rules.  :)  For functions with variable argument lists, you'll always
have a parameter save area.  For other functions, whether or not you
have a parameter save area just depends on the number of arguments and
their types (ie. whether you run out of registers for parameter
passing), and you have that whether or not the function is
prototyped.

A simple example might help clear up any confusion.

Given
 void fun1(int a, int b, double c);
 void fun2(int a, ...);
  ...
 fun1 (1, 2, 3.0);
 fun2 (1, 2, 3.0);

A call to fun1 with a prototype in scope won't allocate a parameter
save area, and will pass the first arg in r3, the second in r4, and
the third in f1.

A call to fun2 with a prototype in scope will allocate a parameter
save area of 64 bytes (the minimum size of a parameter save area), and
will pass the first arg in r3, the second in the second slot of the
parameter save area, and the third in the third slot of the parameter
save area.  Now the first eight slots/double-words of the parameter
save area are passed in r3 thru r10, so this means the second arg is
actually passed in r4 and the third in r5, not the stack!

A call to fun1 or fun2 without a prototype in scope will allocate a
parameter save area, and pass the first arg in r3, the second in r4,
and the third in both f1 and r5.

When compiling fun1 body, the first arg is known to be in r3, the
second in r4, and the third in f1, and we don't use the parameter save
area for storing incoming args to a stack slot.  (At least, after
PR61300 is fixed..)  It doesn't matter if the parameter save area was
allocated or not, we just don't use it.

When compiling fun2 body, the first arg is known to be in r3, the
second in r4 and the third in r5.  Since the function has a variable
argument list, registers r4 thru r10 are saved to the parameter
save area stack, and we set up our va_list pointer to the second
double-word of the parameter save area stack.  Of course, code
optimisation might lead to removing the saves and using the args
in their incoming regs, but this is conceptually what happens.
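
The asymmetry between caller and callee can be put into a few lines of C.  This is a deliberately simplified sketch: it models only integer (one doubleword) arguments, ignores floating point and aggregate parameters, and the function names are invented for illustration rather than taken from the rs6000 backend.

```c
#include <stdbool.h>

enum { NUM_GPR_ARG_REGS = 8 };   /* r3 thru r10 */

/* What the callee may rely on when compiling the function body.  It
   depends only on the function's own parameter list, so it is the same
   whether or not the definition is prototyped.  */
bool
callee_has_save_area (bool variadic, int n_int_args)
{
  return variadic || n_int_args > NUM_GPR_ARG_REGS;
}

/* What a caller does at one particular call site.  An unprototyped call
   must allow for the callee being variadic, so it always allocates.  */
bool
caller_allocates_save_area (bool prototype_in_scope, bool variadic,
                            int n_int_args)
{
  return !prototype_in_scope || callee_has_save_area (variadic, n_int_args);
}
```

In this model an unprototyped caller may allocate a save area that the callee never uses, which is harmless.  The PR61300 bug was the reverse: a callee assuming a save area existed merely because its own definition lacked a prototype.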

-- 
Alan Modra
Australia Development Lab, IBM


[RFC] PR61300 K&R incoming args

2014-05-26 Thread Alan Modra
g on the value of ACCUMULATE_OUTGOING_ARGS,
REG_PARM_STACK_SPACE, and OUTGOING_REG_PARM_STACK_SPACE.  */


-- 
Alan Modra
Australia Development Lab, IBM


Re: gcc's obvious patch policy

2013-11-26 Thread Alan Modra
On Tue, Nov 26, 2013 at 10:01:23AM +0100, Steven Bosscher wrote:
> On Tue, Nov 26, 2013 at 6:17 AM, Alan Modra wrote:
> > Was Re: [buildrobot] [PATCH] mips: Really remove ENTRY_BLOCK_PTR
> > On Wed, Nov 20, 2013 at 10:08:45AM +0100, Steven Bosscher wrote:
> >> This patch is obvious and it fixes breakage. Please go ahead and commit it.
> >
> > Sorry to pick on you here Steven, but this doesn't meet gcc's
> > definition of an obvious patch.  Don't believe me?  See
> > http://gcc.gnu.org/svnwrite.html#policies
> 
> Hmm I guess the patch will have to be reverted, then :-)
> 
> Or maybe this would be under the banner of "We don't want to get
> overly anal-retentive about checkin policies."

We are not amused.  Some lack-wit adviser told us it would be wise to
not seem anal-retentive, whatever that means, and thus we allowed
comment fixes against our better judgement.  Don't you dare extend our
magnanimous dispensation.  :-)

> In any case, it's not unprecedented that obviously obvious patches get
> checked in even if they're not obvious according to that policy. To
> list a few from just this month:
> 
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02989.html
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02975.html
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02970.html
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02972.html
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02496.html
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02331.html

Oh no!  There I am, listed with a bunch of other sinners.

-- 
Alan Modra
Australia Development Lab, IBM


Re: rs6000: load_multiple code

2013-11-22 Thread Alan Modra
On Fri, Nov 22, 2013 at 09:31:18AM +, Paulo Matos wrote:
> > From: Alan Modra [mailto:amo...@gmail.com]
> > On Wed, Nov 20, 2013 at 05:06:13PM +, Paulo Matos wrote:
> > > I am looking into how rs6000 implements load multiple code
> > [snip]
> > 
> > No pseudos are involved for the destination.  See the FAIL in
> > rs6000.md load_multiple.
> 
> Right, I missed that bit:
> if (...
> || REGNO (operands[0]) >= 32)
>   FAIL;
> 
> This will basically never match at expand time then, and will have
> little, if any, use before register allocation.  Right?

Right.  You'll find store_multiple used in function prologues and
load_multiple in epilogues, with -Os if the target supports the string
insns.  movmemsi is of more interest in code elsewhere, and you'll see
a comment there about the register allocator.  :)

-- 
Alan Modra
Australia Development Lab, IBM


Re: g++ -Wl,--as-needed -pthread behaviour

2013-09-24 Thread Alan Modra
On Tue, Sep 24, 2013 at 01:13:53PM +0100, Jonathan Wakely wrote:
> On 24 September 2013 02:22, Alan Modra wrote:
> >
> > Try compiling that testcase with -static rather than -Wl,--as-needed.
> > You'll hit std::system_error just like you do here.  I believe that is
> > a libstdc++ bug, and can be solved by making libstdc++.a use strong
> > references to pthread symbols from std::thread::join() (and perhaps
> > other objects that provide c++ thread support, if there are such).
> 
> It's the std::thread constructor template that needs pthread_create.
> std::thread::join() needs pthread_join.

Ah, OK, I commented without looking at the source first, and it
shows.  :)

> > Otherwise we'd need to solve the transitive reference somehow.
> > ie. Teach the linker that a reference to std::thread::join() means
> > that pthread_create is required.  One obvious way to do that is have
> > the compiler reference pthread_create in objects that use
> > std::thread::join().
> 
> How would we have the compiler do that?

The std::thread constructor needs to emit a strong reference to
pthread_create in the user object file.  Probably the best way to do
that is as Jakub said, inline it and modify __gthread_create to only
use weak references to pthread symbols when compiling for a shared
library.  (When I wrote the above comment I was thinking more along the
lines of a dummy reference via a pointer variable initialised to
pthread_create, but inlining the lot is simpler.)
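
For reference, the "is this program threaded" style of weak test
looks roughly like the sketch below.  This is only an illustration,
not the libstdc++/gthr code; program_is_threaded is a name made up
for the example, and the weak redeclaration relies on GCC extensions.

```c
#include <pthread.h>
#include <stddef.h>

/* Redeclare pthread_create as a weak symbol (GCC extension).  A weak
   undefined reference does not make the linker pull in a library,
   and resolves to a null address if nothing defines the symbol.  */
extern __typeof (pthread_create) pthread_create __attribute__ ((weak));

/* Hypothetical helper: nonzero if pthread_create was resolved at
   link or load time, zero if the weak reference stayed null.  */
static int
program_is_threaded (void)
{
  return pthread_create != NULL;
}
```

An ordinary call to pthread_create from a file without the weak
declaration gives a strong reference, which is the kind the user
object needs so that libpthread survives --as-needed.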

-- 
Alan Modra
Australia Development Lab, IBM


Re: g++ -Wl,--as-needed -pthread behaviour

2013-09-23 Thread Alan Modra
On Mon, Sep 23, 2013 at 02:08:03PM +0200, Matthias Klose wrote:
> With binutils from the 2.24 branch or trunk, the behaviour of --as-needed did
> change, and what worked with binutils 2.23, now fails with 2.24:
> 
> $ cat thread.cpp
> #include <thread>
> 
> void factorial(int n, unsigned long long int *result) {
>     if (n==1) {
>         *result=1;
>         return;
>     }
>     *result=1;
>     for (; n!=0; n--) *result=*result*n;
> }
> 
> int main() {
>     unsigned long long int a;
>     unsigned long long int *c=&a;
>     std::thread t1(factorial,15,c);
>     t1.join();
>     return 0;
> }
> $ ld --version
> GNU ld (GNU Binutils for Ubuntu) 2.23.2
> $ g++ -Wl,--as-needed -pthread thread.cpp -std=c++11 -o thread && ./thread
> 
> $ ld --version
> GNU ld (GNU Binutils for Debian) 2.23.52.20130828
> $ g++ -Wl,--as-needed -pthread thread.cpp -std=c++11 -o thread && ./thread
> terminate called after throwing an instance of 'std::system_error'
>   what():  Enable multithreading to use std::thread: Operation not permitted
> Aborted
> 
> So the test program doesn't have any direct references to symbols in
> libpthread, so libpthread isn't linked, and the program fails to run.
> 
> According to the binutils maintainers, this behaviour is expected:
> 
>   https://sourceware.org/ml/binutils/2013-08/msg00286.html
>   https://sourceware.org/ml/binutils/2013-09/msg0.html
> 
> but it seems a bit odd that g++ -Wl,--as-needed -pthread isn't working
> anymore.

Try compiling that testcase with -static rather than -Wl,--as-needed.
You'll hit std::system_error just like you do here.  I believe that is
a libstdc++ bug, and can be solved by making libstdc++.a use strong
references to pthread symbols from std::thread::join() (and perhaps
other objects that provide c++ thread support, if there are such).
libstdc++.a objects that are just testing "is this program threaded"
should continue to use weak references.

Solving the problem with --as-needed and libstdc++.so isn't so easy.
One solution might be to split off thread support from libstdc++.so.6.
Otherwise we'd need to solve the transitive reference somehow.
ie. Teach the linker that a reference to std::thread::join() means
that pthread_create is required.  One obvious way to do that is have
the compiler reference pthread_create in objects that use
std::thread::join().

-- 
Alan Modra
Australia Development Lab, IBM


Re: lower-subreg and IBM long double

2013-06-10 Thread Alan Modra
On Mon, Jun 10, 2013 at 06:31:55PM -0700, Andrew Pinski wrote:
> On Mon, Jun 10, 2013 at 6:00 PM, David Edelsohn wrote:
> > On Mon, Jun 10, 2013 at 8:26 PM, Alan Modra wrote:
> >
> >> The following patch disables lower-subreg for double double TFmode,
> >> bootstrap and regression tests are OK, but I'm a little unsure whether
> >> this is the right thing to do.
> >>
> >> * rs6000.c (TARGET_INIT_LOWER_SUBREG): Define.
> >> (rs6000_init_lower_subreg): New function.
> >> * lower-subreg.c (init_lower_subreg): Call 
> >> targetm.init_lower_subreg.
> >> * target.def (init_lower_subreg): New.
> >> * doc/tm.texi.in (TARGET_INIT_LOWER_SUBREG): Document.
> >> * doc/tm.texi: Regenerate.
> >
> > I agree with the rs6000 bits.  You need someone else to approve the
> > common bits.  This also needs a testcase.
> 
> I thought there was already a way to disable lower subreg for
> some modes.

There is, via rtx_costs.  In fact that was my first approach, with the
following in rs6000_rtx_costs, but this potentially affects other
areas of the compiler.

    case SET:
      if (GET_MODE (SET_DEST (x)) == TFmode
          && !TARGET_IEEEQUAD
          && TARGET_HARD_FLOAT
          && (TARGET_FPRS || TARGET_E500_DOUBLE)
          && TARGET_LONG_DOUBLE_128)
        /* This hack is to persuade lower_subreg to not lower
           TFmode regs to DImode.  */
        *total = COSTS_N_INSNS (2) - 1;
      break;


-- 
Alan Modra
Australia Development Lab, IBM


lower-subreg and IBM long double

2013-06-10 Thread Alan Modra
===\n\n");
   compute_costs (true, &rtxes);
+
+  if (targetm.init_lower_subreg)
+    targetm.init_lower_subreg (this_target_lower_subreg);
 }
 
 static bool
Index: gcc/target.def
===
--- gcc/target.def  (revision 199781)
+++ gcc/target.def  (working copy)
@@ -2926,6 +2926,12 @@
  void, (int *code, rtx *op0, rtx *op1, bool op0_preserve_value),
  default_canonicalize_comparison)
 
+/* Allow modification of subreg choices.  */
+DEFHOOK
+(init_lower_subreg,
+ "",
+ void, (void *data), NULL)
+
 DEFHOOKPOD
 (atomic_test_and_set_trueval,
  "This value should be set if the result written by\
Index: gcc/doc/tm.texi.in
===
--- gcc/doc/tm.texi.in  (revision 199781)
+++ gcc/doc/tm.texi.in  (working copy)
@@ -6375,6 +6375,12 @@
 registers on machines with lots of registers.
 @end deftypefn
 
+@hook TARGET_INIT_LOWER_SUBREG
+This hook allows modification of the choices the lower_subreg pass
+will make for particular subreg modes.  @var{data} is a pointer to a
+@code{struct target_lower_subreg}.
+@end deftypefn
+
 @node Scheduling
 @section Adjusting the Instruction Scheduler
 

-- 
Alan Modra
Australia Development Lab, IBM


[RS6000] strict alignment for little-endian

2013-06-06 Thread Alan Modra
I'd like to remove -mstrict-align for little-endian powerpc, because
the assumption that mis-aligned accesses are massively slow isn't true
for current powerpc processors.  However I also don't want to break
old machines, so probably should add -mstrict-align back for some set
of cpus.  Can anyone tell me the set?

Index: gcc/config/rs6000/sysv4.h
===
--- gcc/config/rs6000/sysv4.h   (revision 199718)
+++ gcc/config/rs6000/sysv4.h   (working copy)
@@ -538,12 +538,7 @@
 
 #define CC1_ENDIAN_BIG_SPEC ""
 
-#define CC1_ENDIAN_LITTLE_SPEC "\
-%{!mstrict-align: %{!mno-strict-align: \
-%{!mcall-i960-old: \
-   -mstrict-align \
-} \
-}}"
+#define CC1_ENDIAN_LITTLE_SPEC ""
 
 #define CC1_ENDIAN_DEFAULT_SPEC "%(cc1_endian_big)"
 

-- 
Alan Modra
Australia Development Lab, IBM


Re: Excessive calls to iterate_phdr during exception handling

2013-05-28 Thread Alan Modra
On Tue, May 28, 2013 at 09:19:48PM -0400, Ryan Johnson wrote:
> On 28/05/2013 8:47 PM, Ian Lance Taylor wrote:
> >On Mon, May 27, 2013 at 3:20 PM, Ryan Johnson
> > wrote:
> >>I'm bringing the issue up here, rather than filing a bug, because I'm not
> >>sure whether this is an oversight, a known problem that's hard to fix, or a
> >>feature (e.g. somehow required for reliable unwinding). I suspect the
> >>former, because _Unwind_Find_FDE tries a call to _Unwind_Find_registered_FDE
> >>before falling back to dl_iterate_phdr, but the former never succeeds in my
> >>trace (iterate_phdr is always called).
> >The issue is dlclose followed by dlopen.  If we had a cache ahead of
> >dl_iterate_phdr, we would need some way to clear out any information
> >cached from a dlclose'd library.  Otherwise we might pick up the old
> >information when looking up an address from a new dlopen.  So 1)
> >locking will always be required; 2) any caching system to reduce the
> >number of locks will require support for dlclose, somehow.  It's worth
> >working on but there isn't going to be a simple solution.
> I have mixed feelings on this... on the one hand it would be bad to
> risk sending unwind off to la-la land because somebody did a quick
> dlclose/dlopen pair on code we're about to unwind through... but on
> the other hand anybody who does a dlclose/dlopen pair on code we're
> about to unwind through (a) is asking for trouble and (b) is
> perfectly free to do so in spite of the mutex [1].

Yes of course you can shoot yourself in the foot.  The mutex is there
to stop the glibc dl_iterate_phdr list traversal running awry when
dlopen/dlclose happens in another thread.  To be clear, I'm talking
about a dlclose on an object that your thread doesn't access.  Such a
dlclose shouldn't affect your thread in any way.  But if glibc's
list of loaded objects was allowed to change while your thread was
running dl_iterate_phdr, then dl_iterate_phdr could potentially read
freed list entries.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Excessive calls to iterate_phdr during exception handling

2013-05-27 Thread Alan Modra
On Mon, May 27, 2013 at 06:20:21PM -0400, Ryan Johnson wrote:
> I'm not sure whether this is an oversight, a known problem that's
> hard to fix, or a feature (e.g. somehow required for reliable
> unwinding). I suspect the former, because _Unwind_Find_FDE tries a
> call to _Unwind_Find_registered_FDE before falling back to
> dl_iterate_phdr, but the former never succeeds in my trace
> (iterate_phdr is always called).

Your suspicion is unfounded.  The locking is required to support
dlopen (or at least, you need some sort of thread synchronisation
here).  _Unwind_Find_registered_FDE is to support an older method of
finding FDEs.  Newer executables and shared libraries on linux will
use PT_GNU_EH_FRAME, so don't expect _Unwind_Find_registered_FDE
to do anything except waste time!  google eh_frame_hdr for more info.
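
If you want to see this for yourself, something along the lines of
the sketch below walks the loaded objects with dl_iterate_phdr and
counts those carrying a PT_GNU_EH_FRAME segment.  It is only an
illustration, not the libgcc unwinder code; struct eh_count, count_cb
and count_eh_frame_hdrs are names invented for the example, and it
assumes glibc's <link.h>.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* dl_iterate_phdr is a GNU extension.  */
#endif
#include <link.h>
#include <stddef.h>

/* Hypothetical result type for the example.  */
struct eh_count
{
  int objects;			/* loaded objects visited */
  int with_eh_frame_hdr;	/* those having PT_GNU_EH_FRAME */
};

/* Callback run once per loaded object.  Returning 0 continues the
   walk.  */
static int
count_cb (struct dl_phdr_info *info, size_t size, void *data)
{
  struct eh_count *c = data;
  int i;

  (void) size;
  c->objects++;
  for (i = 0; i < info->dlpi_phnum; i++)
    if (info->dlpi_phdr[i].p_type == PT_GNU_EH_FRAME)
      {
	c->with_eh_frame_hdr++;
	break;
      }
  return 0;
}

static struct eh_count
count_eh_frame_hdrs (void)
{
  struct eh_count c = { 0, 0 };

  dl_iterate_phdr (count_cb, &c);
  return c;
}
```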

C++ and threading is a minefield.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Is this power gcc bug?

2013-03-29 Thread Alan Modra
On Fri, Mar 29, 2013 at 04:58:50PM -0700, Carrot Wei wrote:
> /trunkbin/bin/gcc -c -o rtl.o -DSPEC_CPU -DNDEBUG -I.  -O2
> -DSPEC_CPU_LP64 -m32 rtl.c

You've given contradictory options.  -m32 is *not* LP64.

> The left shift count is 32, it is actually less than the width of
> unsigned long 64.

Nope, unsigned long is 32 bits for -m32.
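
A shift by a count that is not less than the width of the type is
undefined in C, so code that must build as both -m32 and -m64 can
guard the count against the real width.  A minimal sketch follows;
shift_count_ok and checked_shl are hypothetical helpers, not anything
from rtl.c.

```c
#include <limits.h>
#include <stdbool.h>

/* True if COUNT is a valid shift count for unsigned long on this
   target: 0..31 with -m32, 0..63 with -m64.  */
static bool
shift_count_ok (unsigned int count)
{
  return count < sizeof (unsigned long) * CHAR_BIT;
}

/* Shift left, returning 0 rather than invoking undefined behaviour
   when COUNT is too large (an arbitrary choice for this example).  */
static unsigned long
checked_shl (unsigned long value, unsigned int count)
{
  return shift_count_ok (count) ? value << count : 0;
}
```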

-- 
Alan Modra
Australia Development Lab, IBM


Re: [PATCH] rs6000: Disable generation of lwa in 32-bit mode

2012-10-27 Thread Alan Modra
On Sat, Oct 27, 2012 at 06:33:34AM +0200, Segher Boessenkool wrote:
> some (20040709-2.c, etc.) fail with a linker error now, instead of

Hmm, packed structs.  If gcc is generating mis-aligned accesses using
lwa or ld, that would be another TARGET_64BIT vs TARGET_POWERPC64
bug, wouldn't it?

-- 
Alan Modra
Australia Development Lab, IBM


Re: PR53914, rs6000 constraints and reload queries

2012-08-01 Thread Alan Modra
On Wed, Aug 01, 2012 at 10:26:50AM +0200, Olivier Hainque wrote:
> I had made a proposal to help the rs6000_mode_dependent_address
> issue, http://gcc.gnu.org/ml/gcc-patches/2012-04/msg01668.html.
> 
> Seems to me that the general idea is still valid:
> 
> << a number of places in the compiler use the
>mode_dependent_address_p predicate to actually check for weaker necessary
>conditions
> >>
> 
> Opinion on the proposal ?

I like the idea.  It is worth pursuing for code improvement we'll see
even if we avoid the "o" constraint everywhere.  For example,
long long llo (long long *x) { return x[4095]; }
will generate better powerpc -m32 -O2 code with your patch applied, I
think.

-- 
Alan Modra
Australia Development Lab, IBM


PR53914, rs6000 constraints and reload queries

2012-07-17 Thread Alan Modra
e_operand" "=wY,Y,d,&d,r")
+   (float_extend:TF (match_operand:DF 1 "input_operand" 
"d,r,md,md,rmGHF")))
+   (use (match_operand:DF 2 "zero_reg_mem_operand" "d,r,m,d,n"))]
   "!TARGET_IEEEQUAD
&& TARGET_HARD_FLOAT && TARGET_FPRS && TARGET_DOUBLE_FLOAT 
&& TARGET_LONG_DOUBLE_128"
@@ -10145,11 +10145,11 @@
 ;; Next come the multi-word integer load and store and the load and store
 ;; multiple insns.
 
-; List r->r after r->"o<>", otherwise reload will try to reload a
-; non-offsettable address by using r->r which won't make progress.
+;; List r->r after r->Y, otherwise reload will try to reload a
+;; non-offsettable address by using r->r which won't make progress.
 (define_insn "*movdi_internal32"
-  [(set (match_operand:DI 0 "rs6000_nonimmediate_operand" 
"=o<>,r,r,*d,*d,m,r,?wa")
-   (match_operand:DI 1 "input_operand" "r,r,m,d,m,d,IJKnGHF,O"))]
+  [(set (match_operand:DI 0 "rs6000_nonimmediate_operand" 
"=Y,r,r,*?d,*?d,?m,r,?wa")
+   (match_operand:DI 1 "input_operand" "r,Y,r,d,m,d,IJKnGHF,O"))]
   "! TARGET_POWERPC64
&& (gpc_reg_operand (operands[0], DImode)
|| gpc_reg_operand (operands[1], DImode))"
@@ -10162,7 +10162,7 @@
stfd%U0%X0 %1,%0
#
xxlxor %x0,%x0,%x0"
-  [(set_attr "type" "load,*,store,fp,fpload,fpstore,*,vecsimple")])
+  [(set_attr "type" "load,store,*,fp,fpload,fpstore,*,vecsimple")])
 
 (define_split
   [(set (match_operand:DI 0 "gpc_reg_operand" "")
@@ -10195,15 +10195,15 @@
 { rs6000_split_multireg_move (operands[0], operands[1]); DONE; })
 
 (define_insn "*movdi_mfpgpr"
-  [(set (match_operand:DI 0 "nonimmediate_operand" 
"=r,r,m,r,r,r,*d,*d,m,r,*h,*h,r,*d")
-   (match_operand:DI 1 "input_operand" "r,m,r,I,L,nF,d,m,d,*h,r,0,*d,r"))]
+  [(set (match_operand:DI 0 "nonimmediate_operand" 
"=Y,r,r,r,r,r,*?d,*?d,?m,r,*h,*h,r,*?d")
+   (match_operand:DI 1 "input_operand" "r,Y,r,I,L,nF,d,m,d,*h,r,0,*d,r"))]
   "TARGET_POWERPC64 && TARGET_MFPGPR && TARGET_HARD_FLOAT && TARGET_FPRS
&& (gpc_reg_operand (operands[0], DImode)
|| gpc_reg_operand (operands[1], DImode))"
   "@
+   std%U0%X0 %1,%0
+   ld%U1%X1 %0,%1
mr %0,%1
-   ld%U1%X1 %0,%1
-   std%U0%X0 %1,%0
li %0,%1
lis %0,%v1
#
@@ -10215,19 +10215,19 @@
{cror 0,0,0|nop}
mftgpr %0,%1
mffgpr %0,%1"
-  [(set_attr "type" 
"*,load,store,*,*,*,fp,fpload,fpstore,mfjmpr,mtjmpr,*,mftgpr,mffgpr")
+  [(set_attr "type" 
"store,load,*,*,*,*,fp,fpload,fpstore,mfjmpr,mtjmpr,*,mftgpr,mffgpr")
(set_attr "length" "4,4,4,4,4,20,4,4,4,4,4,4,4,4")])
 
 (define_insn "*movdi_internal64"
-  [(set (match_operand:DI 0 "nonimmediate_operand" 
"=r,r,m,r,r,r,*d,*d,m,r,*h,*h,?wa")
-   (match_operand:DI 1 "input_operand" "r,m,r,I,L,nF,d,m,d,*h,r,0,O"))]
+  [(set (match_operand:DI 0 "nonimmediate_operand" 
"=Y,r,r,r,r,r,*?d,*?d,?m,r,*h,*h,?wa")
+   (match_operand:DI 1 "input_operand" "r,Y,r,I,L,nF,d,m,d,*h,r,0,O"))]
   "TARGET_POWERPC64 && (!TARGET_MFPGPR || !TARGET_HARD_FLOAT || !TARGET_FPRS)
&& (gpc_reg_operand (operands[0], DImode)
|| gpc_reg_operand (operands[1], DImode))"
   "@
+   std%U0%X0 %1,%0
+   ld%U1%X1 %0,%1
mr %0,%1
-   ld%U1%X1 %0,%1
-   std%U0%X0 %1,%0
li %0,%1
lis %0,%v1
#
@@ -10238,7 +10238,7 @@
mt%0 %1
{cror 0,0,0|nop}
xxlxor %x0,%x0,%x0"
-  [(set_attr "type" 
"*,load,store,*,*,*,fp,fpload,fpstore,mfjmpr,mtjmpr,*,vecsimple")
+  [(set_attr "type" 
"store,load,*,*,*,*,fp,fpload,fpstore,mfjmpr,mtjmpr,*,vecsimple")
(set_attr "length" "4,4,4,4,4,20,4,4,4,4,4,4,4")])
 
 ;; immediate value valid for a single instruction hiding in a const_double
@@ -10313,8 +10313,8 @@
 ;; giving the SCRATCH mq.
 
 (define_insn "*movti_power"
-  [(set (match_operand:TI 0 "reg_or_mem_operand" "=Q,m,r,r,r,r")
-   (match_operand:TI 1 "input_operand" "r,r,r,Q,m,n"))
+  [(set (match_operand:TI 0 "reg_or_mem_operand" "=Q,Y,r,r,r,r")
+   (match_operand:TI 1 "input_operand" "r,r,r,Q,Y,n"))
(clobber (match_scratch:SI 2 "=q,q#X,X,X,X,X"))]
   "TARGET_POWER && ! TARGET_POWERPC64
&& (gpc_reg_operand (operands[0], TImode) || gpc_reg_operand (operands[1], 
TImode))"
@@ -10346,8 +10346,8 

Re: A case where PHI-OPT pessimizes the code

2012-04-23 Thread Alan Modra
On Mon, Apr 23, 2012 at 06:07:52PM +0200, Steven Bosscher wrote:
> On Mon, Apr 23, 2012 at 4:43 PM, Alan Modra wrote:
> > On Mon, Apr 23, 2012 at 02:50:13PM +0200, Steven Bosscher wrote:
> >>   csui = (ONEUL << a);
> >>   b = ((csui & cst) != 0);
> >>   if (b)
> >>     return 1;
> >>   else
> >>     return 0;
> >
> > We (powerpc) would be much better if this were
> >
> >   csui = (ONEUL << a);
> >   return (csui & cst) >> a;
> >
> > Other targets would probably benefit too.
> 
> Yes, this has been discussed before. See here:
> 
>   http://gcc.gnu.org/ml/gcc-patches/2003-01/msg01791.html
>   http://gcc.gnu.org/ml/gcc-patches/2003-01/msg01950.html

I'm suggesting something slightly different to either of these.  I
realize it's probably not that easy to implement, and is really
outside the scope of the switch statement code you're working on, but
it would be nice if we could avoid the comparison.  On high end
powerpc machines, int -> cc -> int costs the equivalent of many
operations just on int.

(In the powerpc code you showed, the comparison is folded into the
AND, emitted as "and.", the move from cc is "mfcr; rlwinm; xori".
"and." isn't cheap and "mfcr" is relatively expensive.)

-- 
Alan Modra
Australia Development Lab, IBM


Re: A case where PHI-OPT pessimizes the code

2012-04-23 Thread Alan Modra
On Mon, Apr 23, 2012 at 02:50:13PM +0200, Steven Bosscher wrote:
>   csui = (ONEUL << a);
>   b = ((csui & cst) != 0);
>   if (b)
>     return 1;
>   else
>     return 0;

We (powerpc) would be much better if this were

   csui = (ONEUL << a);
   return (csui & cst) >> a;

Other targets would probably benefit too.
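
To spell out why the two forms agree: csui has only bit A set, so
csui & cst is either 0 or 1 << A, and shifting right by A gives 0 or
1 with no comparison.  A small sketch, assuming A is less than the
width of unsigned long; the function names are made up for the
example.

```c
#define ONEUL 1UL

/* The form from the testcase: AND, then compare against zero.  */
static int
bit_test_compare (unsigned long cst, unsigned int a)
{
  unsigned long csui = ONEUL << a;

  return (csui & cst) != 0;
}

/* The suggested form: AND, then shift the single surviving bit down
   to bit 0.  No int -> cc -> int round trip is needed.  */
static int
bit_test_shift (unsigned long cst, unsigned int a)
{
  unsigned long csui = ONEUL << a;

  return (int) ((csui & cst) >> a);
}
```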

-- 
Alan Modra
Australia Development Lab, IBM


powerpc compare_and_swap fails

2011-11-17 Thread Alan Modra
I'm seeing a lot of testsuite failures on powerpc-linux, some of
which are locking related.  For example:
WARNING: Program timed out.
FAIL: libgomp.c/atomic-10.c execution test

This one fails in f3() here:
  #pragma omp atomic
z4 *= 3;

z4 is an unsigned char, so we hit the QImode case in
rs6000_expand_atomic_compare_and_swap.  operands[3] is modified.
The rather horrible piece of code below corresponds with z4 *= 3;
At 1c60 you can see operands[3], oldval, being shifted.  At
1c90 and 1c94, the newly loaded value from the z4 word is
shifted to the low byte position and masked.  Then in 1c98 this is
compared against oldval.  The comparison never succeeds, because r9
has the value 00yy0000 (the shift happens to be 16 for z4) while r8
has 000000yy.

1c34:   57 c6 1e f8     rlwinm  r6,r30,3,27,28
1c38:   38 a0 00 ff     li      r5,255
1c3c:   89 39 13 d1     lbz     r9,5073(r25)
1c40:   68 c6 00 18     xori    r6,r6,24
1c44:   57 de 00 3a     rlwinm  r30,r30,0,0,29
1c48:   7c a5 30 30     slw     r5,r5,r6
1c4c:   48 00 00 08     b       1c54
1c50:   7d 09 43 78     mr      r9,r8
1c54:   7c 00 04 ac     sync
1c58:   55 2a 08 3c     rlwinm  r10,r9,1,0,30
1c5c:   7d 4a 4a 14     add     r10,r10,r9
1c60:   7d 29 30 30     slw     r9,r9,r6
1c64:   55 4a 06 3e     clrlwi  r10,r10,24
1c68:   7d 4a 30 30     slw     r10,r10,r6
1c6c:   7d 00 f0 28     lwarx   r8,0,r30
1c70:   7d 07 28 38     and     r7,r8,r5
1c74:   7f 87 48 00     cmpw    cr7,r7,r9
1c78:   7d 07 28 78     andc    r7,r8,r5
1c7c:   7c e7 53 78     or      r7,r7,r10
1c80:   40 9e 00 0c     bne-    cr7,1c8c
1c84:   7c e0 f1 2d     stwcx.  r7,0,r30
1c88:   40 a2 ff e4     bne-    1c6c
1c8c:   4c 00 01 2c     isync
1c90:   7d 08 34 30     srw     r8,r8,r6
1c94:   55 08 06 3e     clrlwi  r8,r8,24
1c98:   7f 89 40 00     cmpw    cr7,r9,r8
1c9c:   40 9e ff b4     bne+    cr7,1c50
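
For comparison, here is what a byte-sized compare and swap built on a
word-sized one is meant to do, written as a C sketch with the GCC
__atomic builtins.  It is an illustration only, not what the rs6000
expander emits; cas_byte_in_word is a hypothetical name and the caller
is assumed to supply the byte's shift within the word.  The point is
that the shifted old value lives in its own temporary, so the caller's
value is never modified and a retry compares like with like.

```c
#include <stdbool.h>
#include <stdint.h>

/* Atomically replace the byte at bit offset SHIFT of *WORD with
   DESIRED if it currently equals EXPECTED.  Returns true on success.  */
static bool
cas_byte_in_word (uint32_t *word, unsigned int shift,
		  uint8_t expected, uint8_t desired)
{
  const uint32_t mask = (uint32_t) 0xff << shift;
  /* Shift copies into position; EXPECTED itself stays untouched.  */
  const uint32_t want_old = (uint32_t) expected << shift;
  const uint32_t want_new = (uint32_t) desired << shift;
  uint32_t cur = __atomic_load_n (word, __ATOMIC_RELAXED);

  for (;;)
    {
      uint32_t next;

      /* Compare only the byte of interest, both sides in word
	 position.  */
      if ((cur & mask) != want_old)
	return false;
      next = (cur & ~mask) | want_new;
      /* On failure CUR is refreshed with the current word and we go
	 round again, since another byte in the word may have changed.  */
      if (__atomic_compare_exchange_n (word, &cur, next, true,
				       __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
	return true;
    }
}
```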

I suspect the fix to this problem doesn't belong in rs6000.c,
but the following does seem to cure this failure.

Index: gcc/config/rs6000/rs6000.c
===
--- gcc/config/rs6000/rs6000.c  (revision 181400)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -17334,10 +17366,13 @@ rs6000_expand_atomic_compare_and_swap (r
   mask = shift = NULL_RTX;
   if (mode == QImode || mode == HImode)
 {
+  rtx orig = oldval;
+
   mem = rs6000_adjust_atomic_subword (mem, &shift, &mask);
 
   /* Shift and mask OLDVAL into position with the word.  */
-  oldval = convert_modes (SImode, mode, oldval, 1);
+  oldval = gen_reg_rtx (SImode);
+  convert_move (oldval, orig, 1);
   oldval = expand_simple_binop (SImode, ASHIFT, oldval, shift,
oldval, 1, OPTAB_LIB_WIDEN);
 

-- 
Alan Modra
Australia Development Lab, IBM


Re: Shrink wrapping issues

2011-11-05 Thread Alan Modra
On Sat, Nov 05, 2011 at 10:50:44AM +0100, Jakub Jelinek wrote:
> From quick look, f1 isn't shrink-wrapped probably because of the set
> of bb's that need prologue/epilogue around it doesn't end in a return,
> but in a tail call.  Can't we just add a prologue before the bar call
> and throw the epilogue away (normally the epilogue in a function that
> ends only in a tail call is just emitted after the barrier and
> optimized away I think, we could do the same?).

http://gcc.gnu.org/ml/gcc-patches/2011-11/msg00046.html ought to cure
this particular problem.  With that patch, similar code on
powerpc-linux does result in shrink wrapping.

> And f2 is something that IMHO with especially AVX/AVX2 code happens very
> often, the prologue is expensive as it realigns the stack.  The reason
> for that is that until reload we don't know whether something won't be
> spilled on the stack and we need/want 32-byte aligned stack slots
> for that spilling.

Huh?  thread_prologue_and_epilogue is after reload.  So your backend
ought to be able to figure out whether an aligned stack is needed.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Need help resolving PR target/50906

2011-11-01 Thread Alan Modra
On Mon, Oct 31, 2011 at 10:58:03AM -0500, Moffett, Kyle D wrote:
> I have not yet been able to figure out if it's a libgcc issue or an
> actual compiler issue.

It is a gcc bug.  I've added a comment to the PR.

-- 
Alan Modra
Australia Development Lab, IBM


Re: PowerPC shrink-wrap support 0 of 3

2011-09-22 Thread Alan Modra
On Thu, Sep 22, 2011 at 12:58:51AM +0930, Alan Modra wrote:
> I spent a little time today looking at why shrink wrap is failing to
> help on PowerPC, and it turns out that the optimization simply doesn't
> trigger that often due to prologue clobbered regs.  PowerPC uses r0 as
> a temp in the prologue to save LR to the stack, and unfortunately r0
> seems to often be live across the candidate edge chosen for
> shrink-wrapping, ie. where the prologue will be inserted.  I suppose
> it's no surprise that r0 is often live; rs6000.h:REG_ALLOC_ORDER makes
> r0 the first gpr to be used.
> 
> As a quick hack, I'm going to try a different REG_ALLOC_ORDER but I
> suspect the real fix will require register renaming in the prologue.

Hi Bernd,
Rearranging the rs6000 register allocation order did in fact help a
lot as far as making more opportunities available for shrink-wrap.  So
did your http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01499.html
patch.  The two together worked so well that gcc won't bootstrap now..

The problem is that shrink wrapping followed by basic block reordering
breaks dwarf unwind info, triggering "internal compiler error: in
maybe_record_trace_start at dwarf2cfi.c:2243".  From your emails on
the list, I gather you've seen this yourself.

The bootstrap breakage happens on libmudflap/mf-hooks1.c, compiling
__wrap_malloc.  Eliding some detail, this function starts off as

void *__wrap_malloc (size_t c)
{
  if (__mf_starting_p)
    return __real_malloc (c);

The "if" is bb2, the sibling call bb3, and shrink wrap rather nicely
puts the prologue for the rest of the function in bb4.  A great
example of shrink wrap doing as it should, if you ignore the fact that
optimizing for startup isn't so clever.  However, bb-reorder inverts
the "if" and moves the sibling call past other blocks in the function.
That's wrong, because the dwarf unwind info for the prologue is not
applicable for the sibling call block:  The prologue hasn't been
executed for that block.  (The unwinder sequentially executes all
unwind opcodes from the start of the function to find the unwind state
at any instruction address.)  Exactly the same sort of problem is
generated by your "unconverted_simple_returns" code.

What should I do here?  bb-reorder could be disabled for these blocks,
but that won't help unconverted_simple_returns.  I'm willing to spend
some time fixing this, but don't want to start if you already have
partial or full solutions.  Another thing I'd like to work on is
stopping ifcvt transformations from killing shrink wrap opportunities.
We have one in CPU2006 povray Ray_In_Bound that ought to give 5%
(figure from shrink wrap by hand), but currently only gets shrink
wrapping there with -fno-if-conversion.

-- 
Alan Modra
Australia Development Lab, IBM


Re: PATCH: Support mixing .init_array.* and .ctors.* input sections

2010-12-14 Thread Alan Modra
On Tue, Dec 14, 2010 at 09:55:42AM -0800, H.J. Lu wrote:
> bfd/
> 
> 2010-12-14  H.J. Lu  
> 
>   * elf.c (_bfd_elf_new_section_hook): Special handling for
>   .init_array/.fini_array output sections.
> 
> ld/
> 
> 2010-12-13  H.J. Lu  
> 
>   * Makefile.am (GENSCRIPTS): Add @enable_initfini_array@.
> 
>   * NEWS: Mention SORT_BY_INIT_PRIORITY.
> 
>   * configure.in: Add AC_CANONICAL_BUILD.
>   Add --enable-initfini-array.
> 
>   * genscripts.sh (ENABLE_INITFINI_ARRAY): New.
> 
>   * ld.h (sort_type): Add by_init_priority.
> 
>   * ld.texinfo: Document SORT_BY_INIT_PRIORITY.
> 
>   * ldgram.y (SORT_BY_INIT_PRIORITY): New.
>   (wildcard_spec): Handle SORT_BY_INIT_PRIORITY.
> 
>   * ldlang.c (get_init_priority): New.
>   (compare_section): Use get_init_priority for by_init_priority.
> 
>   * ldlex.l (SORT_BY_INIT_PRIORITY): New.
> 
>   * scripttempl/elf.sc: Support ENABLE_INITFINI_ARRAY.
> 
>   * Makefile.in: Regenerated.
>   * aclocal.m4: Regenerated.
>   * config.in: Likewise.
>   * configure: Likewise.
> 
> ld/testsuite/
> 
> 2010-12-13  H.J. Lu  
> 
>   * ld-elf/elf.exp (array_tests): Add init-mixed.
>   (array_tests_static): Likewise.
>   Also delete tmpdir/init-mixed.
> 
>   * ld-elf/init-mixed.c: New.
>   * ld-elf/init-mixed.out: Likewise.

OK.  Except

> +static long int

unsigned long

> +get_init_priority (const char *name)
> +{
> +  char *end;
> +  long int init_priority;

unsigned long

> +
> +  /* GCC uses the following section names for the init_priority
> + attribute with numerical values 101 and 65535 inclusive:
> +
> + 1: .init_array.NNNN/.fini_array.NNNN: Where NNNN is the
> + decimal numerical value of the init_priority attribute.
> + 2: .ctors.NNNN/.dtors.NNNN: Where NNNN is 65535 minus the
> + decimal numerical value of the init_priority attribute.
> +   */

I would like to see this comment expanded.  Specify what the
init_priority values mean, ie. a lower value means a higher priority.
Also specify the order of execution in .init_array and .fini_array.
From memory .init_array is forward, .fini_array reverse, and just to
make things interesting .ctors/.dtors goes the other way, .ctors
reverse and .dtors forward.

> +  if (strncmp (name, ".init_array.", 12) == 0
> +  || strncmp (name, ".fini_array.", 12) == 0)
> +{
> +  init_priority = strtoul (name + 12, &end, 10);
> +  return *end ? 0 : init_priority;
> +}
> +  else if (strncmp (name, ".ctors.", 7) == 0
> +|| strncmp (name, ".dtors.", 7) == 0)
> +{
> +  init_priority = strtoul (name + 7, &end, 10);
> +  return *end ? 0 : 65535 - init_priority;
> +}
> +
> +  return 0;
> +}
> +
>  /* Compare sections ASEC and BSEC according to SORT.  */
>  
>  static int
>  compare_section (sort_type sort, asection *asec, asection *bsec)
>  {
>int ret;
> +  long int ainit_priority, binit_priority;

unsigned long
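
To be concrete about the type change, here is the quoted function
pulled out as a standalone sketch with unsigned long throughout.  It
is lifted from the patch text above with only the types changed, so
treat it as an illustration rather than what finally goes into
ldlang.c.

```c
#include <stdlib.h>
#include <string.h>

/* Return the init_priority encoded in section NAME, or 0 if NAME
   carries no well-formed numeric suffix.  strtoul returns unsigned
   long, so the result and the local use the same type.  */
static unsigned long
get_init_priority (const char *name)
{
  char *end;
  unsigned long init_priority;

  if (strncmp (name, ".init_array.", 12) == 0
      || strncmp (name, ".fini_array.", 12) == 0)
    {
      /* The suffix is the priority itself.  */
      init_priority = strtoul (name + 12, &end, 10);
      return *end ? 0 : init_priority;
    }
  else if (strncmp (name, ".ctors.", 7) == 0
	   || strncmp (name, ".dtors.", 7) == 0)
    {
      /* The suffix is 65535 minus the priority.  */
      init_priority = strtoul (name + 7, &end, 10);
      return *end ? 0 : 65535 - init_priority;
    }

  return 0;
}
```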


> @@ -247,19 +274,16 @@ CTOR=".ctors${CONSTRUCTING-0} :
> linker won't look for a file to match a
> wildcard.  The wildcard also means that it
> doesn't matter which directory crtbegin.o
> -   is in.  */
> +   is in. 
>  
> -KEEP (*crtbegin.o(.ctors))
> -KEEP (*crtbegin?.o(.ctors))
> -
> -/* We don't want to include the .ctor section from
> +   We don't want to include the .ctor section from
> the crtend.o file until after the sorted ctors.
> The .ctor section from the crtend file contains the
> end of ctors marker and it must be last */
>  
> -KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o $OTHER_EXCLUDE_FILES) .ctors))
> -KEEP (*(SORT(.ctors.*)))
> -KEEP (*(.ctors))
> +KEEP (*crtbegin.o(.ctors))
> +KEEP (*crtbegin?.o(.ctors))
> +${CTORS}
>  ${CONSTRUCTING+${CTOR_END}}
>}"
>  DTOR=".dtors${CONSTRUCTING-0} :
> @@ -267,9 +291,7 @@ DTOR=".dtors${CONSTRUCTING-0} :
>  ${CONSTRUCTING+${DTOR_START}}
>  KEEP (*crtbegin.o(.dtors))
>  KEEP (*crtbegin?.o(.dtors))
> -KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o $OTHER_EXCLUDE_FILES) .dtors))
> -KEEP (*(SORT(.dtors.*)))
> -KEEP (*(.dtors))
> +${DTORS}
>  ${CONSTRUCTING+${DTOR_END}}
>}"

No need to make any changes to .ctors or .dtors.  If .init_array
and .fini_array match input .ctors or .dtors sections, then any later
match will simply be ignored.

-- 
Alan Modra
Australia Development Lab, IBM


Re: PATCH: 2 stage BFD linker for LTO plugin

2010-12-06 Thread Alan Modra
On Mon, Dec 06, 2010 at 09:57:14AM -0800, H.J. Lu wrote:
> Personally, I think 2 stage linking is one way to fix this issue.

Ian has stated that he thinks this is a really bad idea.  I haven't
approved the patch because I value Ian's opinion, and can see why he
thinks it is the wrong way to go.  On the other hand, BFD is full of
bad ideas..  I'm not strongly opposed to your patch myself.

HJ, you showed that link times for gcc did not regress too much with
your 2 stage lto link patch.  It would be more interesting to see the
results for a large C++ project, mozilla for example.

-- 
Alan Modra
Australia Development Lab, IBM


Re: RFC: Add zlib source to src CVS resposity

2010-11-01 Thread Alan Modra
On Mon, Nov 01, 2010 at 05:13:44PM +, Nick Clifton wrote:
>   * We have to make sure that zlib will build on all of the
> hosts that we care about.  Should the situation arise
> where the zlib does not build on a particular host, and
> the zlib maintainers are not interested in making it
> build there, then it will be down to us to fix it.  Or
> else abandon compression support on that host.

This would mean we need to keep machinery to conditionally compile
in compressed debug support, removal of said support being HJ's stated
reason for importing zlib.

I'm against importing zlib into binutils, and I think we should keep
support of compressed debug sections conditional, to avoid potential
bootstrap problems or circular dependencies.

-- 
Alan Modra
Australia Development Lab, IBM


Re: %pc relative addressing of string literals/const data

2010-10-26 Thread Alan Modra
On Wed, Oct 27, 2010 at 12:53:00AM +0100, Dave Korn wrote:
> On 26/10/2010 23:37, Joakim Tjernlund wrote:
> 
> > Everything went dead quiet the minute I started to send patches, what did
> > I do wrong?
> 
>   Nothing, you just ran into the lack-of-manpower problem.  Sorry!  And I
> can't even help, I'm not a ppc maintainer.

I also cannot approve gcc patches.

-- 
Alan Modra
Australia Development Lab, IBM


Re: %pc relative addressing of string literals/const data

2010-10-11 Thread Alan Modra
On Sun, Oct 10, 2010 at 11:20:06AM +0200, Joakim Tjernlund wrote:
> Now I have had a closer look at this and it looks much like -fpic
> on ppc32, you still use the GOT/TOC to load the address where the data is.

No, with ppc64 -mcmodel=medium you use the GOT/TOC pointer plus an
offset to address local data.

> I was looking for true %pc relative addressing of data. I guess this is really
> hard on PowerPC?

Yes, PowerPC lacks pc-relative instructions.

> I am not sure this is all it takes to make -fpic to work with -mrelocatable,
> any ideas?

You might be lucky.  With -mrelocatable, .got2 only contains
addresses.  No other constants.  So a simple run-time loader can
relocate the entire .got2 section, plus those locations specified in
.fixup.  You'll have to make sure gcc does the same for .got, and your
run-time loader will need to be modified to handle .got (watch out for
the .got header!).
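
The core of such a loader is just adding the load offset to every
word of an address-only table.  A rough sketch, with made-up names
(relocate_address_table, load_delta) and assuming the caller passes
the bounds of a table that really does hold nothing but link-time
addresses; any header words such as the .got header would have to be
stepped over by the caller, and the .fixup handling is not shown.

```c
#include <stdint.h>

/* Add LOAD_DELTA (run-time base minus link-time base) to each entry
   in [START, END).  Every entry is assumed to be a link-time address,
   as is the case for .got2 under -mrelocatable.  */
static void
relocate_address_table (uintptr_t *start, uintptr_t *end,
			uintptr_t load_delta)
{
  uintptr_t *p;

  for (p = start; p < end; p++)
    *p += load_delta;
}
```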

-- 
Alan Modra
Australia Development Lab, IBM


Re: %pc relative addressing of string literals/const data

2010-10-05 Thread Alan Modra
On Tue, Oct 05, 2010 at 11:40:11PM +0200, Joakim Tjernlund wrote:
> yes, but this could be a new PIC mode that uses a new better
> PIC mode for everything. Especially one that doesn't require each function
> to calculate the GOT address in the function prologue(why is that so?)

The ppc32 ABI is old, much like x86.  cf. x86 -O2 -fPIC (without
hidden pragma).

foo:
        call    __i686.get_pc_thunk.cx
        addl    $_GLOBAL_OFFSET_TABLE_, %ecx
        pushl   %ebp
        movl    %esp, %ebp
        popl    %ebp
        movl    y@GOT(%ecx), %eax
        movl    x@GOT(%ecx), %edx
        movl    (%eax), %eax
        addl    (%edx), %eax
        ret
[snip]
__i686.get_pc_thunk.cx:
        movl    (%esp), %ecx
        ret

The new ppc64 -mcmodel=medium support does give you pic access to
locals.

-fPIC -O2 without hidden
.LC0:
.tc x[TC],x   <-- compiler managed GOT entries
.LC1:
.tc y[TC],y
[snip]
.L.foo:
addis 11,2,@toc@ha
addis 9,2,@toc@ha
ld 11,@toc@l(11)
ld 9,@toc@l(9)
lwz 3,0(11)
lwz 0,0(9)
add 3,3,0
extsw 3,3
blr

-fPIC -O2 with hidden pragma
.L.foo:
addis 11,2,x@toc@ha
addis 9,2,y@toc@ha
lwz 3,x@toc@l(11)  <-- TOC/GOT pointer relative
lwz 0,y@toc@l(9)
add 3,3,0
extsw 3,3
blr

x@toc is equivalent to @GOTOFF on other processors.
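
For reference, a C source along the following lines would give code of the shape shown above.  This is a reconstruction for illustration, not the file actually compiled, and USE_HIDDEN is a hypothetical macro added so one file covers both the "without hidden" and "with hidden pragma" cases.

```c
/* Reconstruction for illustration only; compile with -O2 -fPIC.
   Define USE_HIDDEN (a hypothetical macro) for the hidden variant.  */
#ifdef USE_HIDDEN
#pragma GCC visibility push(hidden)
#endif

int x = 1;
int y = 2;

#ifdef USE_HIDDEN
#pragma GCC visibility pop
#endif

/* With default visibility x and y may be preempted by another module,
   so their addresses are loaded from the GOT/TOC.  With hidden
   visibility they are known to be local and can be addressed relative
   to the GOT/TOC pointer.  */
int foo(void)
{
    return x + y;
}
```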

-- 
Alan Modra
Australia Development Lab, IBM


PowerPC64, optimization too aggressive?

2010-06-23 Thread Alan Modra
On Tue, Jun 08, 2010 at 10:27:03PM +0930, Alan Modra wrote:
> PowerPC64 gcc support for a larger TOC via -mcmodel option.
[snip]

I'm having second thoughts about the optimization I added to PowerPC64
gcc with the patch hunk below.  Its effect is to use a more efficient
TOC/GOT pointer relative address calculation on references known to be
local, rather than loading an address out of the TOC/GOT.  ie.

  addis rx,2,s...@toc@ha
  addi ry,rx,s...@toc@l

instead of

  addis rx,2,s...@got@ha
  ld ry,s...@got@l(rx)

This saves a word in the TOC/GOT and is a little faster too.  However,
there is a problem:  If people build PowerPC64 shared libraries
without -fpic/-fPIC then gcc will emit code that requires text relocs
to properly support ELF shared library semantics, and I don't intend
to change ld and ld.so to do that.
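
To make the addis/addi pair concrete: @ha and @l split a 32-bit offset from the TOC pointer into a high-adjusted half and a low half, the adjustment compensating for addi sign-extending its 16-bit immediate.  A small sketch of that arithmetic follows, with toc_ha, toc_l and toc_relative_address as names invented for this example.

```c
#include <stdint.h>

/* High-adjusted half: rounded up by one when the low half will be
   sign-extended to a negative value.  */
uint16_t toc_ha(int32_t offset)
{
    return (uint16_t)(((uint32_t)offset + 0x8000u) >> 16);
}

/* Low 16 bits of the offset.  */
uint16_t toc_l(int32_t offset)
{
    return (uint16_t)((uint32_t)offset & 0xffffu);
}

/* Model of:  addis rx,2,off@toc@ha ; addi ry,rx,off@toc@l
   Valid for offsets up to 0x7fff7fff; beyond that the high half no
   longer fits a signed 16-bit immediate.  */
uint64_t toc_relative_address(uint64_t r2, int32_t offset)
{
    /* addis adds the sign-extended immediate shifted left 16 bits.  */
    uint64_t rx = r2 + (uint64_t)((int64_t)(int16_t)toc_ha(offset) * 65536);
    /* addi adds the sign-extended low half.  */
    uint64_t ry = rx + (uint64_t)(int64_t)(int16_t)toc_l(offset);
    return ry;
}
```

The ld form differs only in that the second instruction loads a pointer from the computed GOT slot instead of producing the address directly.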

It may be better to not do this optimization in gcc at all, especially
since we can do the same transformation in ld.  This would mean
PowerPC64 gcc would lose -mcmodel=medium, retaining -mcmodel=small and
-mcmodel=large.  If there are no dissenting opinions I'll prepare a
gcc patch to do that.

> +   || (TARGET_CMODEL == CMODEL_MEDIUM
> +   && GET_CODE (operands[1]) == SYMBOL_REF
> +   && !CONSTANT_POOL_ADDRESS_P (operands[1])
> +   && SYMBOL_REF_LOCAL_P (operands[1])
> +   && offsettable_ok_by_alignment (SYMBOL_REF_DECL (operands[1]

-- 
Alan Modra
Australia Development Lab, IBM


Re: powerpc-eabi-gcc no implicit FPU usage

2010-05-20 Thread Alan Modra
On Thu, May 20, 2010 at 09:40:47AM -0700, Mark Mitchell wrote:
> It is of course a feature much
> less valuable on a workstation/server class operating system than on the
> VxWorks/RTEMS class of RTOS systems.

Even on servers this option may be quite valuable.  I recall seeing
figures that showed using fp regs for something like structure copies
could cost thousands of cpu cycles.

Why?  With lazy fpu save and restore, the first use of the fpu in a
given time slice takes an interrupt.  So if your task is only using
the fpu occasionally it is a severe misoptimization to choose to use
fp regs rather than gp regs.
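
The trade-off can be put as a rough break-even calculation.  The sketch below is only a cost model: trap_cycles and cycles_saved_per_use are hypothetical inputs that would have to be measured, and no figures from real hardware are implied.

```c
/* Net cycles gained (positive) or lost (negative) in one time slice by
   using fp regs instead of gp regs, assuming the task would not
   otherwise touch the FPU in that slice.  The first fp use takes the
   lazy save/restore interrupt, costing trap_cycles once.  */
long fp_net_gain(long trap_cycles, long cycles_saved_per_use,
                 long uses_in_slice)
{
    if (uses_in_slice <= 0)
        return 0;               /* FPU never touched: no trap, no gain */
    return cycles_saved_per_use * uses_in_slice - trap_cycles;
}

/* Smallest number of uses per slice at which fp regs stop being a
   loss, or -1 if they never pay off.  */
long fp_break_even_uses(long trap_cycles, long cycles_saved_per_use)
{
    if (cycles_saved_per_use <= 0)
        return -1;
    return (trap_cycles + cycles_saved_per_use - 1) / cycles_saved_per_use;
}
```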

-- 
Alan Modra
Australia Development Lab, IBM


Re: Gprof can account for less than 1/3 of execution time?!?!

2010-02-21 Thread Alan Modra
On Sun, Feb 21, 2010 at 12:27:04PM -0600, Jon Turner wrote:
> The program in question has been compiled with -pg for all
> source code files.

Linked statically too?  If not, the missing time is probably spent in
libc.so or other shared libraries.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [LTO merge][0/15] Description of the final 15 patches

2009-09-29 Thread Alan Modra
On Mon, Sep 28, 2009 at 10:46:29PM -0400, DJ Delorie wrote:
> 
> > gets from the linker.  Since the linker plugin is a shared
> > object, and it uses libiberty functions, it needs to use a
> > shared libiberty.
> 
> Why can't they just link a static libiberty?

This comment from opcodes/configure.in is relevant:

# When building a shared libopcodes, link against the pic version of libiberty
# so that apps that use libopcodes won't need libiberty just to satisfy any
# libopcodes references.
# We can't do that if a pic libiberty is unavailable since including non-pic
# code would insert text relocations into libopcodes.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Should -Wjump-misses-init be in -Wall?

2009-06-22 Thread Alan Modra
On Mon, Jun 22, 2009 at 09:45:52PM -0400, Robert Dewar wrote:
> Joe Buck wrote:
>> I think that this should be the standard: a warning belongs in -Wall if
>> it tends to expose bugs.  If it doesn't, then it's just somebody's idea
>> of proper coding style but with no evidence in support of its correctness.
>>
>> A -Wall warning should expose bugs, and should be easy to silence in
>> correct code.
>
> To understand what you are saying, we need to know what bug means, since
> it can have two meanings:
>
> 1. An actual error, that could show up right now in certain circumstances
>
> 2. An error resulting in undefined behavior in the standard, but
> for the current version of gcc, it cannot actually cause any real
> misbehavior, but some future version of gcc might take advantage
> of this error status and do something weird.
>
> For me it is enough if warnings expose case 2 situations, even if
> they find few if any case 1 situations.

I agree, but I think this warning should be in -Wc++-compat, not -Wall
or even -Wextra.  Why?  I'd argue the warning is useless for C code,
unless you care about C++ style.

There were five places in binutils that triggered -Wjump-misses-init
warnings.  Not one of them was a real bug, even using Robert's case 2
definition.  I believe the same is true of the three places in gcc
where the warning triggered.

So far, no one has generated a C testcase having undefined behaviour
where -Wjump-misses-init warns but -Wuninitialized (already in -Wall)
doesn't, when optimizing.  If such a testcase is found, I'm guessing
it probably should be filed as a -Wuninitialized bug.

In C, an auto variable initialization is just an assignment.  (I'm of
course aware that arrays can be initialized and their size set,
structs and unions initialized, but by and large, in C, an
initialization is simply an assignment.)  So, why single out the
initial assignment?  If skipping it deserves a warning then skipping
other assignments deserves a warning too, which would be ridiculous.
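
A small illustration of the point, with a function made up for this example.  The goto jumps forward past the initialization of total into its scope, which is the pattern -Wjump-misses-init is documented to flag, yet total is never read on that path, so the C code is well defined.  The same jump is ill-formed in C++.

```c
/* Returns the sum of v[0..n-1], or -1 if v is a null pointer.  */
int sum_or_error(const int *v, int n)
{
    if (v == 0)
        goto fail;          /* jumps across the initialization below */

    int total = 0;          /* in C this is just an assignment */
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;

fail:
    return -1;              /* total is in scope here but never read */
}
```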

-- 
Alan Modra
Australia Development Lab, IBM


Re: Linking very large application with GCC trunk on powerpc-linux leads to relocation error of crtbegin/end.o

2008-08-06 Thread Alan Modra
Found it, finally.  powerpc ld --relax had a bug affecting -fPIC
code.  See http://sourceware.org/ml/binutils/2008-08/msg00040.html

-- 
Alan Modra
Australia Development Lab, IBM


Re: Linking very large application with GCC trunk on powerpc-linux leads to relocation error of crtbegin/end.o

2008-06-30 Thread Alan Modra
On Mon, Jun 30, 2008 at 11:04:43AM -0400, David Edelsohn wrote:
> >>>>> Andreas Jaeger writes:
> 
> Andreas> So, it means that --relax is not the right solution for the problem.
> 
> Andreas> I'll continue with the STAGE1_CFLAG flag but if anybody else wants me to
> Andreas> test something, please tell me,
> 
>   Maybe Alan will have some insight about --relax not working.

I'm definitely interested in trying to reproduce the problem as it
sounds like there might be a linker bug.  Andreas, can you send me
your configure options?

>   Otherwise, in the past Alan has had some suggestions for swapping
> around the crt file order or using linker scripts to place those sections
> more effectively.

These tricks would only help the particular case of reloc overflow in
branches to __do_global_{c,d}tors_aux.  Something like the following
(untested!) in place of the .text output section description in
ld/scripttempl/elf.sc ought to work.

  .text ${RELOCATING-0} :
  {
${RELOCATING+${TEXT_START_SYMBOLS}}
*crtend.o(.text .stub${RELOCATING+ .text.* .gnu.linkonce.t.*})
*crtend?.o(.text .stub${RELOCATING+ .text.* .gnu.linkonce.t.*})
*(EXCLUDE_FILE (*crtbegin.o *crtbegin?.o) .text .stub${RELOCATING+ .text.* .gnu.linkonce.t.*})
*crtbegin.o(.text .stub${RELOCATING+ .text.* .gnu.linkonce.t.*})
*crtbegin?.o(.text .stub${RELOCATING+ .text.* .gnu.linkonce.t.*})
/* .gnu.warning sections are handled specially by elf32.em.  */
*(.gnu.warning)
${RELOCATING+${OTHER_TEXT_SECTIONS}}
  } =${NOP-0}

-- 
Alan Modra
Australia Development Lab, IBM


Re: Linking very large application with GCC trunk on powerpc-linux leads to relocation error of crtbegin/end.o

2008-06-16 Thread Alan Modra
On Mon, Jun 16, 2008 at 01:27:58PM +0200, Laurent GUERBY wrote:
> Hi,
> 
> When linking a very large (> 100MB executable) application on
> powerpc-linux with trunk we get linker errors:
> 
> .../lib/gcc/powerpc-unknown-linux-gnu/4.4.0/crtbegin.o:(.fini+0x0):
> relocation truncated to fit: R_PPC_REL24 against `.text'
> .../lib/gcc/powerpc-unknown-linux-gnu/4.4.0/crtend.o:(.init+0x0):
> relocation truncated to fit: R_PPC_REL24 against `.text' 
> 
> The application itself is compiled with -mlongcall,
> would adding -mlongcall to crtstuff.c Makefile rule help here?

It ought to.  You could also try GNU ld's --relax option, which might
also allow you to dispense with -mlongcall for your app.

> If no, what is the proper solution GCC-wise?

I'll note that this problem is exacerbated by the fact that crtbegin.o
defines the destructor function run from .fini and crtend.o defines
the constructor run from .init.  It really should be the other way
around, since this arrangement results in maximum offset branches,
from the .init section located before .text to a function at the end
of .text, and from the .fini section located after .text to a function
at the beginning of .text.
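
For a sense of scale: the b/bl instructions that R_PPC_REL24 applies to hold a 24-bit word offset, so they reach roughly +/-32MB, and a branch from .init to the far end of a >100MB .text cannot fit.  A quick check, with rel24_fits as a name invented for this example:

```c
#include <stdint.h>

/* A PowerPC b/bl instruction encodes a signed 24-bit word offset, i.e.
   a signed 26-bit byte displacement that must be a multiple of 4.
   Returns nonzero if a branch at branch_addr can reach target_addr.  */
int rel24_fits(uint64_t branch_addr, uint64_t target_addr)
{
    int64_t disp = (int64_t)(target_addr - branch_addr);
    return disp >= -0x2000000 && disp <= 0x1fffffc && (disp & 3) == 0;
}
```

With the crt files swapped as suggested, each branch would target the nearer end of .text, so the displacement would no longer grow with the size of the application.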

-- 
Alan Modra
Australia Development Lab, IBM


Re: Possible GCC 4.3 driver regression caused by your patch

2008-03-02 Thread Alan Modra
On Mon, Mar 03, 2008 at 09:29:18AM +1100, Greg Schafer wrote:
> The following patch restores the old behaviour and fixes my build.

I for one would not like to see us go back to the old broken
behaviour.  One rather nice result of Carlos' fix is that you can now
build a sysrooted compiler on a native host without too much trouble.

-- 
Alan Modra
Australia Development Lab, IBM

