Re: [i386] Scalar DImode instructions on XMM registers

2015-05-26 Thread Jeff Law

On 05/25/2015 09:27 AM, Ilya Enkovich wrote:

2015-05-22 15:01 GMT+03:00 Ilya Enkovich :

2015-05-22 11:53 GMT+03:00 Ilya Enkovich :

2015-05-21 22:08 GMT+03:00 Vladimir Makarov :

So, Ilya, to solve the problem you need to avoid sharing subregs for the
correct LRA/reload work.




Thanks a lot for your help! I'll fix it.

Ilya


I've fixed SUBREG sharing and got a missing spill. I added
--enable-checking=rtl to check other possible bugs. Spill/fill code
still seems incorrect because different sizes are used.  Shouldn't
block me though.

.L5:
 movl    16(%esp), %eax
 addl    $8, %esi
 movl    20(%esp), %edx
 movl    %eax, (%esp)
 movl    %edx, 4(%esp)
 call    counter@PLT
 movq    -8(%esi), %xmm0
 **movdqa  16(%esp), %xmm2**
 pand    %xmm0, %xmm2
 movdqa  %xmm2, %xmm0
 movd    %xmm2, %edx
 **movq    %xmm2, 16(%esp)**
 psrlq   $32, %xmm0
 movd    %xmm0, %eax
 orl     %edx, %eax
 jne     .L5
 jne .L5

Thanks,
Ilya


I was wrong to assume that reloads with the wrong size wouldn't block
me. These reloads require memory to be aligned, which is not always
true. Here is what I have in RTL now:

(insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
 (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89
{*movdi_internal}
  (nil))
...
(insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
 (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
 (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489 {*iorv2di3}
  (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
 (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
 (nil))))

After reload I get:

(insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
 (mem/c:DI (plus:SI (reg/f:SI 7 sp)
 (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89
{*movdi_internal}
  (nil))
(insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
 (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
  (nil))
...
(insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
 (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
 (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64])))
test.c:11 3489 {*iorv2di3}


The 'por' instruction requires its memory operand to be aligned and
fails in a bigger testcase. Reload also generates a movdqa for the
esp-based slot. Might this mean I still have some inconsistencies in
the produced RTL? Should I somehow transform the loads and stores?
I'd start by looking at the AP->SP elimination step: what the defined 
stack alignment is, and whether or not a dynamic stack realignment is 
needed.  If you don't have all that set up properly prior to the 
allocators, then they're not going to know what objects to align nor 
how to align them.


jeff



Re: [c++std-parallel-1651] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Paul E. McKenney
On Tue, May 26, 2015 at 07:08:35PM +0200, Torvald Riegel wrote:
> On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote: 
> > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
> 
> I have been discussing Section 7.9 with Paul during last week.
> 
> While I think that 7.9 helps narrow down the problem somewhat, I'm still
> concerned that it effectively requires compilers to either track
> dependencies or conservatively prevent optimizations like value
> speculation and specialization based on that.  Neither is a good thing
> for a compiler.

I do believe that we can find some middle ground.

> 7.9 adds requirements that dependency chains stop if the program itself
> informs the compiler about the value of something in the dependency
> chain (e.g., as shown in Figure 33).  Also, if a correct program that
> does not have undefined behavior must use a particular value, this is
> also considered as "informing" the compiler about that value.  For
> example:
>   int arr[2];
>   int* x = foo.load(mo_consume);
>   if (x > arr)   // implies same object/array, so x is in arr[]
> int r1 = *x; // compiler knows x == arr + 1
> The program, assuming no undefined behavior, first tells the compiler
> that x should be within arr, and then the comparison tells the compiler
> that x!=arr, so x==arr+1 must hold because there are just two elements
> in the array.

The updated version of Section 7.9 says that if undefined behavior
allows the compiler to deduce the exact pointer value, as in the
case you show above, the dependency chain is broken.

> Having these requirements is generally good, but we don't yet know how
> to specify this properly.  For example, I suppose we'd need to also say
> that the compiler cannot assume to know anything about a value returned
> from an mo_consume load; otherwise, nothing prevents a compiler from
> using knowledge about the stores that the mo_consume load can read from
> (see Section 7.2).

I expect that the Linux kernel's rcu_dereference() primitives would
map to volatile memory_order_consume loads for this reason.

> Also, a compiler is still required to not do value-speculation or
> optimizations based on that.  For example, this program:
> 
> void op(type *p)
> {
>   foo /= p->a;
>   bar = p->b;
> }
> void bar()
> {
>   pointer = ppp.load(mo_consume);
>   op(pointer);
> }
> 
> ... must not be transformed into this program, even if the compiler
> knows that global_var->a == 1:
> 
> void op(type *p) { /* unchanged */ }
> void bar()
> {
>   pointer = ppp.load(mo_consume);
>   if (pointer != global_var) {
>     op(pointer);
>   } else { // specialization for global_var
>     // compiler knows global_var->a==1;
>     // compiler uses global_var directly, inlines, optimizes:
>     bar = global_var->b;
>   }
> }
> 
> The compiler could optimize out the division if pointer==global_var but
> it must not access field b directly through global_var.  This would be
> pretty awkward; the compiler may work based on an optimized expression
> in the specialization (ie, create code that assumes global_var instead
> of pointer) but it would still have to carry around and use the
> non-optimized expression.

Exactly how valuable is this sort of optimization in real code?  And
how likely is the compiler to actually be able to apply it?

(I nevertheless will take another pass through the Linux kernel looking
for global variables being added to RCU-protected linked structures.
Putting a global into an RCU-protected structure seems more likely than
is an RCU-protected pointer into a two-element array.)

> This wouldn't be as bad if it were easily constrained to code sequences
> that really need the dependencies.  However, 7.9 does not effectively
> contain dependencies to only the code that really needs them, IMO.
> Unless the compiler proves otherwise, it has to assume that a load from
> a pointer carries a dependency.  Proving that is often very hard because
> it requires points-to analysis; 7.9 restricts this to intra-thread
> analysis but that is still nontrivial.
> Michael Matz had a similar concern (in terms of what it results in).

Again, I will be looking through the Linux kernel for vulnerabilities to
this sort of transformation.  However, I am having a very hard time seeing
how the compiler is going to know that much about the vast majority of
the Linux-kernel use cases.  The linked structures are allocated on the
heap, not in arrays or in globals.

> Given that mo_consume is useful but a very specialized feature, I
> wouldn't be optimistic that 7.9 would actually be supported by many
> compilers.  The trade-off between having to track dependencies or having
> to disallow optimizations is a bad one to make.  The simple way out for
> a compiler would be to just emit mo_acquire instead of mo_consume and be
> done with all -- and this might be the most practical decision overall,
> or the default general-purpose implementation.  At least I haven't heard
> any compiler implementer say that they think it's obviously worth
> implementing.

gcc-5-20150526 is now available

2015-05-26 Thread gccadmin
Snapshot gcc-5-20150526 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/5-20150526/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 5 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-5-branch 
revision 223720

You'll find:

 gcc-5-20150526.tar.bz2   Complete GCC

  MD5=3dfffc8efcbfc069d41239cb7578b054
  SHA1=bae3ce5b6c8f61bd1b273ae6908032b08aa416d7

Diffs from 5-20150519 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-5
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: [c++std-parallel-1641] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Torvald Riegel
On Thu, 2015-05-21 at 13:42 -0700, Linus Torvalds wrote:
> On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
>  wrote:
> >
> > The compiler can (and does) speculate non-atomic non-volatile writes
> > in some cases, but I do not believe that it is permitted to speculate
> > either volatile or atomic writes.
> 
> I do *not* believe that a compiler is ever allowed to speculate *any*
> writes - volatile or not - unless the compiler can prove that the end
> result is either single-threaded, or the write in question is
> guaranteed to only be visible in that thread (ie local stack variable
> etc).

It must not speculate volatile accesses.  It could speculate
non-volatiles even if those are atomic and observable by other threads
but that would require further work/checks on all potential observers of
those (ie, to still satisfy as-if).  Thus, compilers are unlikely to do
such speculation, I'd say.

The as-if rule (ie, equality of observable behavior (ie, volatiles, ...)
to the abstract machine) makes all this clear.

> Also, I do think that the whole "consume" read should be explained
> better to compiler writers. Right now the language (including very
> much in the "restricted dependency" model) is described in very
> abstract terms. Yet those abstract terms are actually very subtle and
> complex, and very opaque to a compiler writer.

I believe the issues around the existing specification of mo_consume
were pointed out by compiler folks.  It's a complex problem, and I'm
all for more explanations, but I did get the impression that the
compiler writers in ISO C++ Study Group 1 do have a good understanding
of the problem.

> I personally think the whole "abstract machine" model of the C
> language is a mistake. It would be much better to talk about things in
> terms of actual code generation and actual issues. Make all the
> problems much more concrete, with actual examples of how memory
> ordering matters on different architectures.

As someone working for a toolchain team, I don't see how the
abstract-machine-based specification is a problem at all, nor have I
seen compiler writers struggling with it.  It does give precise rules
for code generation.

The level of abstraction is a good thing for most programs because for
those, the primary concern is that the observable behavior and end
result are computed -- it's secondary and QoI how that happens.  In
contrast, if you specify on the level of code generation, you'd have to
foresee how code generation might look in the future, including predict
future optimizations and all that.  That doesn't look future-proof to
me.

I do realize that this may be less than ideal for cases when one would
want to use a C compiler more like a convenient assembler.  But that
case isn't the 99%, I believe.



Re: [c++std-parallel-1611] Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Torvald Riegel
On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote: 
>   http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

I have been discussing Section 7.9 with Paul during last week.

While I think that 7.9 helps narrow down the problem somewhat, I'm still
concerned that it effectively requires compilers to either track
dependencies or conservatively prevent optimizations like value
speculation and specialization based on that.  Neither is a good thing
for a compiler.


7.9 adds requirements that dependency chains stop if the program itself
informs the compiler about the value of something in the dependency
chain (e.g., as shown in Figure 33).  Also, if a correct program that
does not have undefined behavior must use a particular value, this is
also considered as "informing" the compiler about that value.  For
example:
  int arr[2];
  int* x = foo.load(mo_consume);
  if (x > arr)   // implies same object/array, so x is in arr[]
    int r1 = *x; // compiler knows x == arr + 1
The program, assuming no undefined behavior, first tells the compiler
that x should be within arr, and then the comparison tells the compiler
that x!=arr, so x==arr+1 must hold because there are just two elements
in the array.

Having these requirements is generally good, but we don't yet know how
to specify this properly.  For example, I suppose we'd need to also say
that the compiler cannot assume to know anything about a value returned
from an mo_consume load; otherwise, nothing prevents a compiler from
using knowledge about the stores that the mo_consume load can read from
(see Section 7.2).

Also, a compiler is still required to not do value-speculation or
optimizations based on that.  For example, this program:

void op(type *p)
{
  foo /= p->a;
  bar = p->b;
}
void bar()
{
  pointer = ppp.load(mo_consume);
  op(pointer);
}

... must not be transformed into this program, even if the compiler
knows that global_var->a == 1:

void op(type *p) { /* unchanged */ }
void bar()
{
  pointer = ppp.load(mo_consume);
  if (pointer != global_var) {
    op(pointer);
  } else { // specialization for global_var
    // compiler knows global_var->a==1;
    // compiler uses global_var directly, inlines, optimizes:
    bar = global_var->b;
  }
}

The compiler could optimize out the division if pointer==global_var but
it must not access field b directly through global_var.  This would be
pretty awkward; the compiler may work based on an optimized expression
in the specialization (ie, create code that assumes global_var instead
of pointer) but it would still have to carry around and use the
non-optimized expression.


This wouldn't be as bad if it were easily constrained to code sequences
that really need the dependencies.  However, 7.9 does not effectively
contain dependencies to only the code that really needs them, IMO.
Unless the compiler proves otherwise, it has to assume that a load from
a pointer carries a dependency.  Proving that is often very hard because
it requires points-to analysis; 7.9 restricts this to intra-thread
analysis but that is still nontrivial.
Michael Matz had a similar concern (in terms of what it results in).


Given that mo_consume is useful but a very specialized feature, I
wouldn't be optimistic that 7.9 would actually be supported by many
compilers.  The trade-off between having to track dependencies or having
to disallow optimizations is a bad one to make.  The simple way out for
a compiler would be to just emit mo_acquire instead of mo_consume and be
done with all -- and this might be the most practical decision overall,
or the default general-purpose implementation.  At least I haven't heard
any compiler implementer say that they think it's obviously worth
implementing.

I also don't think 7.9 is ready for ISO standardization yet (or any of
the other alternatives mentioned in the paper).  Standardizing a feature
that we're not sure whether it will actually be implemented is not a
good thing to do; it's too costly for all involved parties (compiler
writers *and* users).


IMO, the approach outlined in Section 7.7 is still the most promising
contender in the long run.  It avoids the perhaps more pervasive changes
that a type-system-based approach as the one in Section 7.2 might result
in, yet still informs the compiler where dependencies are actually used
and which chain of expressions would be involved in that.  Tracking is
probably simplified, as dependencies are never open-ended and
potentially leaking into various other regions of code.  It seems easier
to specify in a standard because we just need the programmer to annotate
the intent and the rest is compiler QoI.  It would require users to
annotate their use of dependencies, but they don't need to follow
further rules; performance tuning of the code so it actually makes use
of dependencies is mostly a compiler QoI thing, and if the compiler
can't maintain a dependency, it can issue warnings and thus make the
tuning interactive for the user.

Of 

Re: Balanced partition map for Firefox

2015-05-26 Thread Richard Biener
On Tue, 19 May 2015, Martin Liška wrote:

> Hello.
> 
> I've just noticed that we, for default configuration, produce just 30
> partitions.
> I'm wondering whether that's fine, or it would be necessary to re-tune
> partitioning
> algorithm to produce better balanced map?
> 
> Attached patch is used to produce following dump:
> 
> Partition sizes:
> partition 0 contains 9806 (5.42)% symbols and 232445 (2.37)% insns
> partition 1 contains 15004 (8.30)% symbols and 389297 (3.96)% insns
> partition 2 contains 13954 (7.71)% symbols and 390076 (3.97)% insns
> partition 3 contains 14349 (7.93)% symbols and 390476 (3.97)% insns
> partition 4 contains 13852 (7.66)% symbols and 391346 (3.98)% insns
> partition 5 contains 10766 (5.95)% symbols and 278110 (2.83)% insns
> partition 6 contains 11465 (6.34)% symbols and 396298 (4.03)% insns
> partition 7 contains 16467 (9.10)% symbols and 396043 (4.03)% insns
> partition 8 contains 12959 (7.16)% symbols and 316753 (3.22)% insns
> partition 9 contains 17422 (9.63)% symbols and 402809 (4.10)% insns
> partition 10 contains 15431 (8.53)% symbols and 404822 (4.12)% insns
> partition 11 contains 15967 (8.83)% symbols and 342655 (3.49)% insns
> partition 12 contains 12325 (6.81)% symbols and 409573 (4.17)% insns
> partition 13 contains 11876 (6.57)% symbols and 411484 (4.19)% insns
> partition 14 contains 20902 (11.56)% symbols and 391188 (3.98)% insns
> partition 15 contains 18894 (10.45)% symbols and 339148 (3.45)% insns
> partition 16 contains 27028 (14.94)% symbols and 426811 (4.34)% insns
> partition 17 contains 19626 (10.85)% symbols and 431548 (4.39)% insns
> partition 18 contains 23864 (13.19)% symbols and 437657 (4.45)% insns
> partition 19 contains 28677 (15.86)% symbols and 445054 (4.53)% insns
> partition 20 contains 32558 (18.00)% symbols and 457975 (4.66)% insns
> partition 21 contains 37598 (20.79)% symbols and 470463 (4.79)% insns
> partition 22 contains 21612 (11.95)% symbols and 488384 (4.97)% insns
> partition 23 contains 18981 (10.49)% symbols and 493152 (5.02)% insns
> partition 24 contains 20591 (11.38)% symbols and 493380 (5.02)% insns
> partition 25 contains 20721 (11.46)% symbols and 496018 (5.05)% insns
> partition 26 contains 26171 (14.47)% symbols and 479232 (4.88)% insns
> partition 27 contains 29242 (16.17)% symbols and 530613 (5.40)% insns
> partition 28 contains 35817 (19.80)% symbols and 563768 (5.74)% insns
> partition 29 contains 42662 (23.59)% symbols and 741133 (7.54)% insns
> 
> As seen, some partitions are about 3x bigger than others.
> What do you think about installing the patch to trunk? If yes, I'll test the
> patch
> and write a ChangeLog entry.

I think a patch like this is fine (lto-partition.h parts are missing).

Thanks,
Richard.

> Thanks,
> Martin
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nuernberg)