Re: [PATCH] RFC: -fasm-show-source

2016-08-11 Thread Prathamesh Kulkarni
On 12 August 2016 at 02:04, David Malcolm  wrote:
> I sometimes find myself scouring assembler output from the compiler
> and trying to figure out which instructions correspond to which
> lines of source code; I believe this is a common activity for some
> end-users.
Hi David,
I usually use gcc -g -Wa,-adhln to trace C source from the assembly.
I tried your example and it gave more or less similar output (attached).
However, the source mapping with your patch looks better.

Thanks,
Prathamesh
>
> The following patch adds a new -fasm-show-source option, which
> emits comments into the generated asm showing the pertinent
> line of source code, whenever it changes.  It uses the same logic
> as debug_hooks->source_line for tracking this (for handling
> line-based breakpoints).
>
> An example can be seen in the invoke.texi part of the patch.  As
> noted there, it's aimed at end-users, rather than gcc developers.
> The example shows a relatively short function; the option is
> likely to be much more useful for longer functions.
>
> I think it would further improve usability if this option were enabled
> by default when the final output is .s (either via -S, or by "-o foo.s").
> Ideas on how to implement that (in the driver) would be welcome - I
> started looking at the spec-handling code, but thought I'd post the
> idea here first, before diving in too deeply.
>
> Successfully bootstrapped on x86_64-pc-linux-gnu; adds
> 2 PASS results to gcc.sum.
>
> Thoughts?  OK for trunk as-is?
>
> gcc/ChangeLog:
> * common.opt (fasm-show-source): New option.
> * doc/invoke.texi (Code Generation Options): Add
> -fasm-show-source.
> (-fasm-show-source): New item.
> * final.c (asm_show_source): New function.
> (final_scan_insn): Call asm_show_source.
>
> gcc/testsuite/ChangeLog:
> * gcc.dg/fasm-show-source-1.c: New test case.
> ---
>  gcc/common.opt|  5 +++
>  gcc/doc/invoke.texi   | 71 
> ++-
>  gcc/final.c   | 29 -
>  gcc/testsuite/gcc.dg/fasm-show-source-1.c | 15 +++
>  4 files changed, 117 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/fasm-show-source-1.c
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 8a292ed..56ce513 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -939,6 +939,11 @@ fargument-noalias-anything
>  Common Ignore
>  Does nothing. Preserved for backward compatibility.
>
> +fasm-show-source
> +Common Var(flag_asm_show_source)
> +Emit comments in the generated assembly code to show the source code
> +lines associated with the assembly instructions.
> +
>  fsanitize=
>  Common Driver Report Joined
>  Select what to sanitize.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 22001f9..dc3d3ad 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -486,7 +486,8 @@ Objective-C and Objective-C++ Dialects}.
>
>  @item Code Generation Options
>  @xref{Code Gen Options,,Options for Code Generation Conventions}.
> -@gccoptlist{-fcall-saved-@var{reg}  -fcall-used-@var{reg} @gol
> +@gccoptlist{-fasm-show-source @gol
> +-fcall-saved-@var{reg}  -fcall-used-@var{reg} @gol
>  -ffixed-@var{reg}  -fexceptions @gol
>  -fnon-call-exceptions  -fdelete-dead-exceptions  -funwind-tables @gol
>  -fasynchronous-unwind-tables @gol
> @@ -11153,6 +11154,74 @@ can figure out the other form by either removing 
> @samp{no-} or adding
>  it.
>
>  @table @gcctabopt
> +@item -fasm-show-source
> +@opindex fasm-show-source
> +Emit comments in the generated assembly code to show the source code
> +lines associated with the assembly instructions.  This option is
> +aimed at end-users who wish to better understand the relationship
> +between their source code and the generated machine code.
> +
> +The comments are of the form FILENAME:LINENUMBER:CONTENT OF LINE.
> +
> +For example, given this C source file:
> +
> +@smallexample
> +int test (int n)
> +@{
> +  int i;
> +  int total = 0;
> +
> +  for (i = 0; i < n; i++)
> +total += i * i;
> +
> +  return total;
> +@}
> +@end smallexample
> +
> +compiling to (x86_64) assembly via @option{-S} and emitting the result
> +direct to stdout via @option{-o} @option{-}
> +
> +@smallexample
> +gcc -S test.c -fasm-show-source -Os -o -
> +@end smallexample
> +
> +gives output similar to this, highlighting which instructions correspond
> +to the various parts of the loop:
> +
> +@smallexample
> +   .file   "test.c"
> +   .text
> +   .globl  test
> +   .type   test, @@function
> +test:
> +.LFB0:
> +   .cfi_startproc
> +# test.c:4:   int total = 0;
> +   xorl%eax, %eax
> +# test.c:6:   for (i = 0; i < n; i++)
> +   xorl%edx, %edx
> +.L2:
> +# test.c:6:   for (i = 0; i < n; i++)
> +   cmpl%edi, %edx
> +   jge .L5
> +# test.c:7: total += i * i;
> +   movl%edx, %ecx
> +   imull   

Re: [PATCH] Indicate minimum in-tree MPFR version handled

2016-08-11 Thread Bernd Edlinger
On 08/12/16, Maciej W. Rozycki wrote:
> Hi,
> 
>  Commit 235763 removed support for versions of MPFR below 3.1.0 which have
> a flat directory structure, however it did not introduce any safety check
> for such an unhandled library version present in the tree.  Consequently
> the system-installed version is silently chosen, which if too old, causes
> a confusing configuration failure in mpc/ stating a misleading version
> requirement:
> 
> checking for MPFR... yes
> checking for recent GMP... yes
> checking for recent MPFR... no
> configure: error: MPFR version >= 2.4.2 required
> make[1]: *** [configure-mpc] Error 1
> 
>  I propose the check below to make people's life just a little bit easier
> and indicate right away that an incorrect version of MPFR has been found
> in the source tree.  This is especially helpful when you just sync your
> build tree from upstream and may easily miss the updated requirement.  I
> carefully chose to use "handled" rather than "supported" in the message as
> the commit referred to clearly indicates you are on your own with versions of
> the libraries different from those stated in `download_prerequisites'.
> 
> +  # MPFR v3.1.0 moved the sources into a src sub-directory.
> +  if ! test -d ${srcdir}/mpfr/src; then
> +as_fn_error "Building GCC with MPFR in the source tree is only handled 
> for MPFR 3.1.0+." "$LINENO" 5
> +  fi

I think it is a good idea to detect this situation, but you should advise
the user that they have to use exactly the same version(s) that
contrib/download_prerequisites installs.


Bernd.

Re: [PATCH] Fix warning breaking profiled bootstrap

2016-08-11 Thread Andi Kleen
> 
> If sym1 results in a return value that is some useful tree and inv1
> is true and cst1 is true via this call:

The only way for get_single_symbol to return a non NULL tree
is to hit the return at the end -- and that always initializes
inv and neg.

And when the return is NULL the && prevents evaluating 
inv or neg.

So I still think it's a false positive.

-Andi


Re: [RFC][PR61839]Convert CST BINOP COND_EXPR to COND_EXPR ? (CST BINOP 1) : (CST BINOP 0)

2016-08-11 Thread kugan

Hi Richard,


On 11/08/16 20:04, Richard Biener wrote:

On Thu, Aug 11, 2016 at 6:11 AM, kugan
 wrote:


[SNIP]



+two_valued_val_range_p (tree var, tree *a, tree *b)
+{
+  value_range *vr = get_value_range (var);
+  if ((vr->type != VR_RANGE
+   && vr->type != VR_ANTI_RANGE)
+  || !range_int_cst_p (vr))
+return false;

range_int_cst_p checks for vr->type == VR_RANGE so the anti-range handling
doesn't ever trigger - which means you should add a testcase for it as well.


Fixed it. I have also added a testcase.




+{
+  *a = TYPE_MIN_VALUE (TREE_TYPE (var));
+  *b = TYPE_MAX_VALUE (TREE_TYPE (var));

note that for pointer types this doesn't work, please also use
vrp_val_min/max for
consistency.  I think you want to add a INTEGRAL_TYPE_P (TREE_TYPE (var))
to the guard of two_valued_val_range_p.

+  /* First canonicalize to simplify tests.  */
+  if (commutative_tree_code (rhs_code)
+ && TREE_CODE (rhs2) == INTEGER_CST)
+   std::swap (rhs1, rhs2);

note that this doesn't really address my comment - if you just want to handle
commutative ops then simply look for the constant in the second place rather
than the first which is the canonical operand order.  But even for
non-commutative
operations we might want to apply this optimization - and then for both cases,
rhs1 or rhs2 being constant.  Like x / 5 and 5 / x.

Note that you can rely on int_const_binop returning NULL_TREE for "invalid"
ops like x % 0 or x / 0, so no need to explicitely guard this here.


Sorry, I misunderstood you. I have changed it now. I also added a
test case to check this.


Bootstrapped and regression tested on x86_64-linux-gnu with no new 
regressions. Is this OK for trunk now?


Thanks,
Kugan

gcc/testsuite/ChangeLog:

2016-08-12  Kugan Vivekanandarajah  

PR tree-optimization/61839
* gcc.dg/tree-ssa/pr61839_1.c: New test.
* gcc.dg/tree-ssa/pr61839_2.c: New test.
* gcc.dg/tree-ssa/pr61839_3.c: New test.
* gcc.dg/tree-ssa/pr61839_4.c: New test.

gcc/ChangeLog:

2016-08-12  Kugan Vivekanandarajah  

PR tree-optimization/61839
* tree-vrp.c (two_valued_val_range_p): New.
(simplify_stmt_using_ranges): Convert CST BINOP VAR where VAR is
two-valued to VAR == VAL1 ? (CST BINOP VAL1) : (CST BINOP VAL2).
Also Convert VAR BINOP CST where VAR is two-valued to
VAR == VAL1 ? (VAL1 BINOP CST) : (VAL2 BINOP CST).
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c
index e69de29..9f8168a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c
@@ -0,0 +1,44 @@
+/* PR tree-optimization/61839.  */
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-optimized" } */
+/* { dg-require-effective-target int32plus } */
+
+__attribute__ ((noinline))
+int foo ()
+{
+  int a = -1;
+  volatile unsigned b = 1U;
+  int c = 1;
+  c = (a + 972195718) >> (1LU <= b);
+  if (c == 486097858)
+;
+  else
+__builtin_abort ();
+  return 0;
+}
+
+__attribute__ ((noinline))
+int bar ()
+{
+  int a = -1;
+  volatile unsigned b = 1U;
+  int c = 1;
+  c = (a + 972195718) >> (b ? 2 : 3);
+  if (c == 243048929)
+;
+  else
+__builtin_abort ();
+  return 0;
+}
+
+int main ()
+{
+  foo ();
+  bar ();
+}
+
+/* Scan for c = 972195717) >> [0, 1] in function foo.  */
+/* { dg-final { scan-tree-dump-times "486097858 : 972195717" 1  "vrp1" } } */
+/* Scan for c = 972195717) >> [2, 3] in function bar.  */
+/* { dg-final { scan-tree-dump-times "243048929 : 121524464" 2  "vrp1" } } */
+/* { dg-final { scan-tree-dump-times "486097858" 0  "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_2.c
index e69de29..ffa00a7 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_2.c
@@ -0,0 +1,54 @@
+/* PR tree-optimization/61839.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-vrp1" } */
+/* { dg-require-effective-target int32plus } */
+
+__attribute__ ((noinline))
+int foo ()
+{
+  int a = -1;
+  volatile unsigned b = 1U;
+  int c = 1;
+  c = (a + 972195718) / (b ? 1 : 0);
+  if (c == 972195717)
+;
+  else
+__builtin_abort ();
+  return 0;
+}
+
+__attribute__ ((noinline))
+int bar ()
+{
+  int a = -1;
+  volatile unsigned b = 1U;
+  int c = 1;
+  c = (a + 972195718) % (b ? 1 : 0);
+  if (c == 972195717)
+;
+  else
+__builtin_abort ();
+  return 0;
+}
+
+__attribute__ ((noinline))
+int bar2 ()
+{
+  int a = -1;
+  volatile unsigned b = 1U;
+  int c = 1;
+  c = (a + 972195716) % (b ? 1 : 2);
+  if (c == 972195715)
+;
+  else
+__builtin_abort ();
+  return 0;
+}
+
+
+/* Don't optimize 972195717 / 0 in function foo.  */
+/* { dg-final { scan-tree-dump-times "972195717 / _" 1  "vrp1" } } */
+/* Dont optimize 972195717 % 0 in function 

Re: [PATCH] RFC: -fasm-show-source

2016-08-11 Thread Sandra Loosemore

On 08/11/2016 02:34 PM, David Malcolm wrote:

I sometimes find myself scouring assembler output from the compiler
and trying to figure out which instructions correspond to which
lines of source code; I believe this is a common activity for some
end-users.

The following patch adds a new -fasm-show-source option, which
emits comments into the generated asm showing the pertinent
line of source code, whenever it changes.  It uses the same logic
as debug_hooks->source_line for tracking this (for handling
line-based breakpoints).

An example can be seen in the invoke.texi part of the patch.  As
noted there, it's aimed at end-users, rather than gcc developers.
The example shows a relatively short function; the option is
likely to be much more useful for longer functions.

I think it would further improve usability if this option were enabled
by default when the final output is .s (either via -S, or by "-o foo.s").
Ideas on how to implement that (in the driver) would be welcome - I
started looking at the spec-handling code, but thought I'd post the
idea here first, before diving in too deeply.

Successfully bootstrapped on x86_64-pc-linux-gnu; adds
2 PASS results to gcc.sum.

Thoughts?  OK for trunk as-is?


Why not extend the existing -fverbose-asm to do this?  E.g. 
-fverbose-asm=source, or something like that.  Otherwise you need to 
cross-reference the documentation for the two options and explain how 
they interact (or don't interact, as the case may be).


-Sandra



Re: [PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding

2016-08-11 Thread Andrew Pinski
On Thu, Aug 11, 2016 at 5:19 PM, Denys Vlasenko  wrote:
>
>
> On 08/11/2016 10:59 PM, Andrew Pinski wrote:
>>
>> On Thu, Aug 11, 2016 at 1:49 PM, Denys Vlasenko 
>> wrote:
>>>
>>> falign-functions=N is too simplistic.
>>>
>>> Ingo Molnar ran some tests and it looks like on latest x86 CPUs,
>>> 64-byte alignment runs fastest (he tried many other possibilities).
>>>
>>> However, developers are less than thrilled by the idea of a slam-dunk
>>> 64-byte
>>> aligning everything. Too much waste:
>>> On 05/20/2015 02:47 AM, Linus Torvalds wrote:
>>> > At the same time, I have to admit that I abhor a 64-byte
>>> function
>>> > alignment, when we have a fair number of functions that are
>>> (much)
>>> > smaller than that.
>>> >
>>> > Is there some way to get gcc to take the size of the function
>>> into
>>> > account? Because aligning a 16-byte or 32-byte function on a
>>> 64-byte
>>> > alignment is just criminally nasty and wasteful.
>>>
>>> This change makes it possible to align functions to 64-byte boundaries
>>> *IF* this does not introduce a huge amount of padding.
>>>
>>> Patch drops forced alignment to 8 if requested alignment is higher than
>>> 8:
>>> before the patch, -falign-functions=9 was generating
>>>
>>> .p2align 4,,8
>>> .p2align 3
>>>
>>> which means: "align to 16 if the skip is 8 bytes or less; else align to
>>> 8".
>>> After this change, ".p2align 3" is not emitted.
>>>
>>> It is dropped because I ultimately want to do something
>>> like -falign-functions=64,8 - IOW, I want to align functions to 64 bytes,
>>> but only if that generates padding of less than 8 bytes - otherwise I
>>> want
>>> *no alignment at all*. The forced ".p2align 3" interferes with that
>>> intention.
>>>
>>> Testing:
>>> tested that with -falign-functions=N (tried 8, 15, 16, 17...) the
>>> alignment
>>> directives are the same before and after the patch.
>>> Tested that -falign-functions=N,N (two equal parameters) works exactly
>>> like -falign-functions=N.
>>
>>
>> Two things I noticed you missed:
>> 1) ChangeLog entries (this is required as per the GNU coding style)
>
>
> Is this form okay? -

This is getting there but needs a little improvement.

>
> ...
> Testing:
> tested that with -falign-functions=N (tried 8, 15, 16, 17...) the alignment
> directives are the same before and after the patch.
> Tested that -falign-functions=N,N (two equal parameters) works exactly
> like -falign-functions=N.

Oh, also maybe adding testcases would be useful; just scanning the
output assembly should be good enough.

>
> 2016-08-11  Denys Vlasenko  
>
> * common.opt: Change definitions so that -falign_FOO= accepts a string.

I think this should be:
* common.opt (-falign-functions): Accept a string instead of an integer.
(falign-loops): Likewise.


> * config/i386/freebsd.h: Remove "If N is large, do at least 8 byte
> alignment" code.
This should be something more like:
* config/i386/freebsd.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Remove "If N is
large, do at least 8 byte alignment" code.

> * config/i386/gnu-user.h: Likewise.
* config/i386/gnu-user.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

> * config/i386/iamcu.h: Likewise.
> * config/i386/openbsdelf.h: Likewise.
> * config/i386/x86-64.h: Likewise.
> * flags.h (struct target_flag_state): Add x_align_functions_max_skip
> member.
> * toplev.c (parse_N_M): New function.
> (init_alignments): Set align_FOO_log, align_FOO, align_FOO_max_skip
> from specified --align_FOO=N{,M} option
> * varasm.c (assemble_start_function): Use align_functions_max_skip
> instead of align_functions - 1.
>
> Index: gcc/common.opt
> ===
> 
>
>
>> 2) Documentation changes.

You did not add this, or are you doing this in a separate patch now?
The current documentation is done in doc/invoke.texi:
@item -falign-functions
@itemx -falign-functions=@var{n}
@opindex falign-functions
Align the start of functions to the next power-of-two greater than
@var{n}, skipping up to @var{n} bytes.  For instance,
@option{-falign-functions=32} aligns functions to the next 32-byte
boundary, but @option{-falign-functions=24} aligns to the next
32-byte boundary only if this can be done by skipping 23 bytes or less.

@option{-fno-align-functions} and @option{-falign-functions=1} are
equivalent and mean that functions are not aligned.

Some assemblers only support this flag when @var{n} is a power of two;
in that case, it is rounded up.

If @var{n} is not specified or is zero, use a machine-dependent default.

 CUT 
You should add your documentation there (and for the rest of the
-falign-* options).


>>
>>>
>>> Index: gcc/common.opt
>>> ===
>
> ...
>
>>>  #define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP)   

Re: [PATCH], Patch #4, Improve vector int/long initialization on PowerPC

2016-08-11 Thread Segher Boessenkool
On Thu, Aug 11, 2016 at 07:15:17PM -0400, Michael Meissner wrote:
> This patch was originally part of patch #3, but I separated it out as I rework
> what used to be part of patch #3 to fix some issues.
> 
> This patch adds support for using the ISA 3.0 MTVSRDD instruction when
> initializing vector long vectors with variables.  I also changed the CPU type
> of the other use of MTVSRDD to be vecperm as Pat suggested.
> 
> I added two general tests (vec-init-1.c and vec-init-2.c) that test various
> forms of vector initialization to make sure the compiler generates the correct
> code.  These tests will test optimizations that the future patches will
> enhance.  I also added a third test (vec-init-3.c) to specifically test 
> whether
> MTVSRDD is generated.
> 
> I did a bootstrap and make check on a little endian power8 system, and there
> were no regressions.  Can I install this patch to the trunk?

Yes, please do.  Thanks,


Segher


> 2016-08-11  Michael Meissner  
> 
>   * config/rs6000/vsx.md (vsx_concat_): Add support for the
>   ISA 3.0 MTVSRDD instruction.
>   (vsx_splat_): Change cpu type of MTVSRDD instruction to
>   vecperm.
> 
> [gcc/testsuite]
> 2016-08-11  Michael Meissner  
> 
>   * gcc.target/powerpc/vec-init-1.c: New tests to test various
>   vector initialization options.
>   * gcc.target/powerpc/vec-init-2.c: Likewise.
>   * gcc.target/powerpc/vec-init-3.c: New test to make sure MTVSRDD
>   is generated on ISA 3.0.


Re: [PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding

2016-08-11 Thread Denys Vlasenko



On 08/11/2016 10:59 PM, Andrew Pinski wrote:

On Thu, Aug 11, 2016 at 1:49 PM, Denys Vlasenko  wrote:

falign-functions=N is too simplistic.

Ingo Molnar ran some tests and it looks like on latest x86 CPUs, 64-byte
alignment runs fastest (he tried many other possibilities).

However, developers are less than thrilled by the idea of a slam-dunk 64-byte
aligning everything. Too much waste:
On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.

This change makes it possible to align functions to 64-byte boundaries *IF*
this does not introduce a huge amount of padding.

Patch drops forced alignment to 8 if requested alignment is higher than 8:
before the patch, -falign-functions=9 was generating

.p2align 4,,8
.p2align 3

which means: "align to 16 if the skip is 8 bytes or less; else align to 8".
After this change, ".p2align 3" is not emitted.

It is dropped because I ultimately want to do something
like -falign-functions=64,8 - IOW, I want to align functions to 64 bytes,
but only if that generates padding of less than 8 bytes - otherwise I want
*no alignment at all*. The forced ".p2align 3" interferes with that intention.

Testing:
tested that with -falign-functions=N (tried 8, 15, 16, 17...) the alignment
directives are the same before and after the patch.
Tested that -falign-functions=N,N (two equal parameters) works exactly
like -falign-functions=N.


Two things I noticed you missed:
1) ChangeLog entries (this is required as per the GNU coding style)


Is this form okay? -

...
Testing:
tested that with -falign-functions=N (tried 8, 15, 16, 17...) the alignment
directives are the same before and after the patch.
Tested that -falign-functions=N,N (two equal parameters) works exactly
like -falign-functions=N.

2016-08-11  Denys Vlasenko  

* common.opt: Change definitions so that -falign_FOO= accepts a string.
* config/i386/freebsd.h: Remove "If N is large, do at least 8 byte
alignment" code.
* config/i386/gnu-user.h: Likewise.
* config/i386/iamcu.h: Likewise.
* config/i386/openbsdelf.h: Likewise.
* config/i386/x86-64.h: Likewise.
* flags.h (struct target_flag_state): Add x_align_functions_max_skip
member.
* toplev.c (parse_N_M): New function.
(init_alignments): Set align_FOO_log, align_FOO, align_FOO_max_skip
from specified --align_FOO=N{,M} option
* varasm.c (assemble_start_function): Use align_functions_max_skip
instead of align_functions - 1.

Index: gcc/common.opt
===




2) Documentation changes.



Index: gcc/common.opt
===

...


 #define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP)   \
   do { \
 if ((LOG) != 0) {  \
-  if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG)); \
-  else {   \
+  if ((MAX_SKIP) == 0 || (MAX_SKIP) >= (1<<(LOG))) \
+fprintf ((FILE), "\t.p2align %d\n", (LOG));\
+  else \
fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP)); \
-   /* Make sure that we have at least 8 byte alignment if > 8 byte \
-  alignment is preferred.  */  \
-   if ((LOG) > 3   \
-   && (1 << (LOG)) > ((MAX_SKIP) + 1)  \
-   && (MAX_SKIP) >= 7) \
- fputs ("\t.p2align 3\n", (FILE)); \
-  }
\
 }  \
   } while (0)
 #undef  ASM_OUTPUT_MAX_SKIP_PAD


These looks like a bug fix, it most likely should be sent separately;
maybe even first.


ok


Also can you make sure the generic ASM_OUTPUT_MAX_SKIP_ALIGN is done correctly?


If by generic you mean "non-arch-specific", the situation is as follows:
if arch does not define ASM_OUTPUT_MAX_SKIP_ALIGN(file, align, max_skip), then
ASM_OUTPUT_ALIGN_WITH_NOP(file, align) or
ASM_OUTPUT_ALIGN (file, align) macros are used.
As you see from their arguments, those macros don't take a "max_skip" argument.
IOW: even in existing code, they ignore max_skip, and 

Re: [PATCH] - improve sprintf buffer overflow detection (middle-end/49905)

2016-08-11 Thread Martin Sebor

On 07/22/2016 04:12 PM, Jeff Law wrote:

Working through the new pass...  Overall it looks pretty good.  There's
a certain level of trust I'll extend WRT getting the low level details
right -- a thorough testsuite obviously helps there.


In the latest patch where I add the return value optimization I also
add a new test that compares the return value computed by the pass
to the value returned by the library call.  That greatly increases
my confidence in the correctness of the pass (though the test still
could stand to be enhanced in a number of ways).


+const pass_data pass_data_sprintf_length = {
+  GIMPLE_PASS,// pass type
+  "sprintf_length",   // pass name
+  OPTGROUP_NONE,  // optinfo_flags
+  TV_NONE,// tv_id
+  PROP_cfg,   // properties_required
+  0, // properties_provided
+  0, // properties_destroyed
+  0, // properties_start
+  0, // properties_finish

So the worry here is that you need PROP_ssa in the late pass.  I'm not
sure if we have the infrastructure to allow you to distinguish the two.
I guess we could literally make it two passes with two distinct
pass_data structures.


I'm not sure I understand the concern.  It sounds like you are
suggesting to add another pass_data struct with PROP_ssa among
the required properties and specify it as an argument to
the gimple_opt pass constructor.  What will it do that what
I have doesn't (make sure the ssa pass runs before this pass?)
and how can I test it?



In handle_gimple_call, you don't handle anti ranges -- I haven't looked
closely at canonicalization rules in VRP, so I don't know if you're
likely to see a range like ![MININT, -1] which would be a synonym for
[0..MAXINT].  It might be worth doing some instrumentation to see if
you're getting useful anti-ranges with any kind of consistency.  ISTM
the most interesting ones are going to allow you to drop or insist on a
"-" sign.


I don't recall seeing any inverted ranges when I was developing
the range tests.  I had high hopes for the range information but
it turned out to be so poor that I decided not to spend time
completing this part.  Once Andrew's new VRP pass is available
I'll come back to it.  If it's as generic as it sounds this pass
will not only be a consumer but also be able to contribute to
improving it.


+  /* As a special case, bounded function allow the explicitly
+specified destination size argument to be zero as a request
+to determine the number of bytes on output without actually
+writing any output.  Disable length checking for those.  */

This doesn't parse well.



+  /* Try to use __builtin_object_size although it rarely returns
+ a useful result even for straightforward cases.  */

Hopefully you've stashed away some tests for this so that we can work on
them independently of the sprintf checking itself?  ISTM that the
recursive handling of POINTER_PLUS_EXPR really ought to move into the
generic builtin_object_size handling itself as an independent change.


Bug 71831 tracks this limitation/enhancement.  I would expect it to
be relatively easy to enhance the compute_builtin_object_size
function to return more useful results for some basic expressions
even without optimization, without running the whole objsize pass.
It seems that a good number of other features in GCC (not just
warnings) would benefit from it.


+  /* Bail early if format length checking is disabled, either
+ via a command line option, or as a result of the size of
+ the destination object not being available, or due to
+ format directives or arguments encountered during processing
+ that prevent length tracking with any reliability.  */
+
+  if (HOST_WIDE_INT_MAX <= info.objsize)
+  return;

I think the return is indented too deep.


+  if (*pf == 0)
+   {
+ /* Incomplete directive.  */
+ return;
+   }

For this and the other early returns from compute_format_length, do we
know if -Wformat will catch these errors?  Might that be a low priority
follow-up project if it doesn't?


Being invoked by the front end, -Wformat is limited to checking
constant strings, so it misses problems that a pass that runs
later could detect.  With a bit of effort (to avoid duplicate
warnings) it would be possible to enhance this pass to catch some
of these as well.  I chose not to spend time on it in this version
since non-constant format strings are rare, but I agree that it's
something to look into after this patch is in.


In format_floating, we end up using actual host floating point
arithmetic.  That's generally been frowned upon.  We're not doing
anything in terms of code generation here, so ultimately that might be
what allows us to still be safe from a cross compilation standpoint.

It's also not clear if that routine handles the other target floating
point formats.  For things like VAX FP, I'm comfortable issuing a
warning that this stuff isn't 

[PATCH] Indicate minimum in-tree MPFR version handled

2016-08-11 Thread Maciej W. Rozycki
Hi,

 Commit 235763 removed support for versions of MPFR below 3.1.0 which have 
a flat directory structure, however it did not introduce any safety check 
for such an unhandled library version present in the tree.  Consequently 
the system-installed version is silently chosen, which if too old, causes 
a confusing configuration failure in mpc/ stating a misleading version 
requirement:

checking for MPFR... yes
checking for recent GMP... yes
checking for recent MPFR... no
configure: error: MPFR version >= 2.4.2 required
make[1]: *** [configure-mpc] Error 1

 I propose the check below to make people's life just a little bit easier 
and indicate right away that an incorrect version of MPFR has been found 
in the source tree.  This is especially helpful when you just sync your 
build tree from upstream and may easily miss the updated requirement.  I 
carefully chose to use "handled" rather than "supported" in the message as 
the commit referred to clearly indicates you are on your own with versions of 
the libraries different from those stated in `download_prerequisites'.

2016-08-12  Maciej W. Rozycki  

* configure.ac: Check for the minimum in-tree MPFR version 
handled.
* configure: Regenerate.

 OK to apply?

  Maciej

gcc-mpfr-version.diff
Index: gcc/configure
===
--- gcc.orig/configure  2016-08-11 23:23:44.104635061 +0100
+++ gcc/configure   2016-08-11 23:24:02.933019031 +0100
@@ -5566,6 +5566,10 @@ if test "x$with_mpfr_lib" != x; then
   gmplibs="-L$with_mpfr_lib $gmplibs"
 fi
 if test "x$with_mpfr$with_mpfr_include$with_mpfr_lib" = x && test -d 
${srcdir}/mpfr; then
+  # MPFR v3.1.0 moved the sources into a src sub-directory.
+  if ! test -d ${srcdir}/mpfr/src; then
+as_fn_error "Building GCC with MPFR in the source tree is only handled for 
MPFR 3.1.0+." "$LINENO" 5
+  fi
   gmplibs='-L$$r/$(HOST_SUBDIR)/mpfr/src/'"$lt_cv_objdir $gmplibs"
   gmpinc='-I$$r/$(HOST_SUBDIR)/mpfr/src -I$$s/mpfr/src '"$gmpinc"
   extra_mpc_mpfr_configure_flags='--with-mpfr-include=$$s/mpfr/src 
--with-mpfr-lib=$$r/$(HOST_SUBDIR)/mpfr/src/'"$lt_cv_objdir"
Index: gcc/configure.ac
===
--- gcc.orig/configure.ac   2016-08-11 23:23:44.117834819 +0100
+++ gcc/configure.ac2016-08-11 23:24:00.198745307 +0100
@@ -1546,6 +1546,11 @@ if test "x$with_mpfr_lib" != x; then
   gmplibs="-L$with_mpfr_lib $gmplibs"
 fi
 if test "x$with_mpfr$with_mpfr_include$with_mpfr_lib" = x && test -d 
${srcdir}/mpfr; then
+  # MPFR v3.1.0 moved the sources into a src sub-directory.
+  if ! test -d ${srcdir}/mpfr/src; then
+AC_MSG_ERROR([dnl
+Building GCC with MPFR in the source tree is only handled for MPFR 3.1.0+.])
+  fi
   gmplibs='-L$$r/$(HOST_SUBDIR)/mpfr/src/'"$lt_cv_objdir $gmplibs"
   gmpinc='-I$$r/$(HOST_SUBDIR)/mpfr/src -I$$s/mpfr/src '"$gmpinc"
   extra_mpc_mpfr_configure_flags='--with-mpfr-include=$$s/mpfr/src 
--with-mpfr-lib=$$r/$(HOST_SUBDIR)/mpfr/src/'"$lt_cv_objdir"


[PATCH], Patch #4, Improve vector int/long initialization on PowerPC

2016-08-11 Thread Michael Meissner
This patch was originally part of patch #3, but I separated it out as I rework
what used to be part of patch #3 to fix some issues.

This patch adds support for using the ISA 3.0 MTVSRDD instruction when
initializing vector long vectors with variables.  I also changed the CPU type
of the other use of MTVSRDD to be vecperm as Pat suggested.

I added two general tests (vec-init-1.c and vec-init-2.c) that test various
forms of vector initialization to make sure the compiler generates the correct
code.  These tests will test optimizations that future patches will
enhance.  I also added a third test (vec-init-3.c) to specifically test whether
MTVSRDD is generated.

I did a bootstrap and make check on a little endian power8 system, and there
were no regressions.  Can I install this patch to the trunk?

[gcc]
2016-08-11  Michael Meissner  

* config/rs6000/vsx.md (vsx_concat_<mode>): Add support for the
ISA 3.0 MTVSRDD instruction.
(vsx_splat_<mode>): Change cpu type of MTVSRDD instruction to
vecperm.

[gcc/testsuite]
2016-08-11  Michael Meissner  

* gcc.target/powerpc/vec-init-1.c: New tests to test various
vector initialization options.
* gcc.target/powerpc/vec-init-2.c: Likewise.
* gcc.target/powerpc/vec-init-3.c: New test to make sure MTVSRDD
is generated on ISA 3.0.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Index: gcc/config/rs6000/rs6000.c
===
Index: gcc/config/rs6000/vsx.md
===
--- gcc/config/rs6000/vsx.md	(.../svn+ssh://meiss...@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 239334)
+++ gcc/config/rs6000/vsx.md(.../gcc/config/rs6000) (working copy)
@@ -1911,16 +1911,24 @@ (define_insn "*vsx_float_fix_v2df2"
 
 ;; Build a V2DF/V2DI vector from two scalars
 (define_insn "vsx_concat_<mode>"
-  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=,?")
+  [(set (match_operand:VSX_D 0 "gpc_reg_operand" "=,we")
(vec_concat:VSX_D
-(match_operand: 1 "vsx_register_operand" ",")
-(match_operand: 2 "vsx_register_operand" ",")))]
+(match_operand: 1 "gpc_reg_operand" ",r")
+(match_operand: 2 "gpc_reg_operand" ",r")))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
 {
-  if (BYTES_BIG_ENDIAN)
-return "xxpermdi %x0,%x1,%x2,0";
+  if (which_alternative == 0)
+return (BYTES_BIG_ENDIAN
+   ? "xxpermdi %x0,%x1,%x2,0"
+   : "xxpermdi %x0,%x2,%x1,0");
+
+  else if (which_alternative == 1)
+return (BYTES_BIG_ENDIAN
+   ? "mtvsrdd %x0,%1,%2"
+   : "mtvsrdd %x0,%2,%1");
+
   else
-return "xxpermdi %x0,%x2,%x1,0";
+gcc_unreachable ();
 }
   [(set_attr "type" "vecperm")])
 
@@ -2650,7 +2658,7 @@ (define_insn "vsx_splat_<mode>"
xxpermdi %x0,%x1,%x1,0
lxvdsx %x0,%y1
mtvsrdd %x0,%1,%1"
-  [(set_attr "type" "vecperm,vecload,mftgpr")])
+  [(set_attr "type" "vecperm,vecload,vecperm")])
 
 ;; V4SI splat (ISA 3.0)
 ;; When SI's are allowed in VSX registers, add XXSPLTW support
Index: gcc/testsuite/gcc.target/powerpc/vec-init-1.c
===
--- gcc/testsuite/gcc.target/powerpc/vec-init-1.c	(.../svn+ssh://meiss...@gcc.gnu.org/svn/gcc/trunk/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-init-1.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 239334)
@@ -0,0 +1,169 @@
+/* { dg-do run { target { powerpc*-*-linux* } } } */
+/* { dg-require-effective-target vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#include 
+#include 
+#include 
+
+#define ELEMENTS -1, 2, 0, -123456
+#define SPLAT 0x01234567
+
+vector int sv = (vector int) { ELEMENTS };
+vector int splat = (vector int) { SPLAT, SPLAT, SPLAT, SPLAT };
+vector int sv_global, sp_global;
+static vector int sv_static, sp_static;
+static const int expected[] = { ELEMENTS };
+
+extern void check (vector int a)
+  __attribute__((__noinline__));
+
+extern void check_splat (vector int a)
+  __attribute__((__noinline__));
+
+extern vector int pack_reg (int a, int b, int c, int d)
+  __attribute__((__noinline__));
+
+extern vector int pack_const (void)
+  __attribute__((__noinline__));
+
+extern void pack_ptr (vector int *p, int a, int b, int c, int d)
+  __attribute__((__noinline__));
+
+extern void pack_static (int a, int b, int c, int d)
+  __attribute__((__noinline__));
+
+extern void pack_global (int a, int b, int c, int d)
+  __attribute__((__noinline__));
+
+extern vector int splat_reg (int a)
+  __attribute__((__noinline__));
+
+extern vector int splat_const (void)
+  __attribute__((__noinline__));
+
+extern void splat_ptr (vector int *p, int a)
+  __attribute__((__noinline__));
+
+extern void splat_static (int a)
+  

Re: [PATCH] Update or add fall through comments in switches

2016-08-11 Thread Jeff Law

On 08/11/2016 07:43 AM, Marek Polacek wrote:

This patch either updates fall through comments so that our warning
machinery recognizes them, or annotates code with new FALLTHRU comments
so that intended fall through cases are marked and the compiler wouldn't
warn.

I'd like to get this in before I post other patches to keep this mess
more manageable.

Bootstrapped/regtested on x86_64-linux, ok for trunk?

2016-08-11  Marek Polacek  

PR c/7652
gcc/
* alias.c (find_base_value): Adjust fall through comment.
* cfgexpand.c (expand_debug_expr): Likewise.
* combine.c (find_split_point): Likewise.
(expand_compound_operation): Likewise.  Add FALLTHRU.
(make_compound_operation): Adjust fall through comment.
(canon_reg_for_combine): Add FALLTHRU.
(force_to_mode): Adjust fall through comment.
(simplify_shift_const_1): Likewise.
(simplify_comparison): Likewise.
* config/aarch64/aarch64-builtins.c (aarch64_simd_expand_args): Add
FALLTHRU.
* config/aarch64/predicates.md: Likewise.
* config/i386/i386.c (function_arg_advance_32): Likewise.
(ix86_gimplify_va_arg): Likewise.
(print_reg): Likewise.
(ix86_print_operand): Likewise.
(ix86_build_const_vector): Likewise.
(ix86_expand_branch): Likewise.
(ix86_sched_init_global): Adjust fall through comment.
(ix86_expand_args_builtin): Add FALLTHRU.
(ix86_expand_builtin): Likewise.
(ix86_expand_vector_init_one_var): Likewise.
* config/rs6000/rs6000.c (rs6000_emit_vector_compare_inner): Likewise.
(rs6000_adjust_cost): Likewise.
(insn_must_be_first_in_group): Likewise.
* config/rs6000/rs6000.md: Likewise.  Adjust fall through comment.
* dbxout.c (dbxout_symbol): Adjust fall through comment.
* df-scan.c (df_uses_record): Likewise.
* dojump.c (do_jump): Add FALLTHRU.
* dwarf2out.c (mem_loc_descriptor): Likewise.  Adjust fall through
comment.
(resolve_args_picking_1): Adjust fall through comment.
(loc_list_from_tree_1): Likewise.
* expmed.c (make_tree): Likewise.
* expr.c (expand_expr_real_2): Add FALLTHRU.
(expand_expr_real_1): Likewise.  Adjust fall through comment.
* fold-const.c (const_binop): Adjust fall through comment.
(fold_truth_not_expr): Likewise.
(fold_cond_expr_with_comparison): Add FALLTHRU.
(fold_binary_loc): Likewise.
(contains_label_1): Adjust fall through comment.
(multiple_of_p): Likewise.
* gcov-tool.c (process_args): Add FALLTHRU.
* genattrtab.c (check_attr_test): Likewise.
(write_test_expr): Likewise.
* genconfig.c (walk_insn_part): Likewise.
* genpreds.c (validate_exp): Adjust fall through comment.
(needs_variable): Likewise.
* gensupport.c (get_alternatives_number): Add FALLTHRU.
(subst_dup): Likewise.
* gimple-pretty-print.c (dump_gimple_assign): Likewise.
* gimplify.c (gimplify_addr_expr): Adjust fall through comment.
(gimplify_scan_omp_clauses): Add FALLTHRU.
(goa_stabilize_expr): Likewise.
* graphite-isl-ast-to-gimple.c (substitute_ssa_name): Adjust fall
through comment.
* hsa-gen.c (get_address_from_value): Likewise.
* ipa-icf.c (sem_function::hash_stmt): Likewise.
* ira.c (ira_setup_alts): Add FALLTHRU.
* lra-eliminations.c (lra_eliminate_regs_1): Adjust fall through
comment.
* lto-streamer-out.c (lto_output_tree_ref): Add FALLTHRU.
* opts.c (common_handle_option): Likewise.
* read-rtl.c (read_rtx_code): Likewise.
* real.c (round_for_format): Likewise.
* recog.c (asm_operand_ok): Likewise.
* reginfo.c (reg_scan_mark_refs): Adjust fall through comment.
* reload1.c (set_label_offsets): Likewise.
(eliminate_regs_1): Likewise.
(reload_reg_reaches_end_p): Likewise.
* rtlanal.c (commutative_operand_precedence): Add FALLTHRU.
(rtx_cost): Likewise.
* sched-rgn.c (is_exception_free): Likewise.
* simplify-rtx.c (simplify_rtx): Adjust fall through comment.
* stor-layout.c (int_mode_for_mode): Likewise.
* toplev.c (print_to_asm_out_file): Likewise.
(print_to_stderr): Likewise.
* tree-cfg.c (gimple_verify_flow_info): Likewise.
* tree-chrec.c (chrec_fold_plus_1): Add FALLTHRU.
(chrec_fold_multiply): Likewise.
(evolution_function_is_invariant_rec_p): Likewise.
(for_each_scev_op): Likewise.
* tree-data-ref.c (siv_subscript_p): Likewise.
(get_references_in_stmt): Likewise.
* tree.c (find_placeholder_in_expr): Adjust fall through comment.
(substitute_in_expr): Likewise.
(type_cache_hasher::equal): Likewise.
(walk_type_fields): Likewise.
* 

Re: [PATCH, rs6000] Fix vec_construct vectorization cost to be somewhat more accurate

2016-08-11 Thread Segher Boessenkool
On Wed, Aug 10, 2016 at 11:52:06AM -0500, Bill Schmidt wrote:
> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no
> regressions.  Ok for trunk?

Okay.  Thanks,


Segher


> 2016-08-10  Bill Schmidt  
> 
>   * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost):
>   Correct costs for vec_construct.


[gomp4] Extend libgomp's fortran test coverage of host_data

2016-08-11 Thread Cesar Philippidis
This patch ports libgomp.oacc-c-c++-common/host_data-1.c to fortran.
Fortunately, the existing fortran host_data infrastructure was already
in place, so all I had to do was port over the calls to Nvidia's CUDA BLAS
library.

There are a couple of details that one needs to consider when using CUDA
BLAS in gfortran. First, if you want to use Nvidia's wrapper functions
written in C to set up the appropriate cuda device contexts, then use
the thunking variants of the functions described here
.
Otherwise, it's much easier to let gfortran's OpenACC runtime manage the
data mappings and use the host_data clause to pass those data pointers
to the CUDA BLAS library calls.

In terms of calling the actual CUDA BLAS functions, there's already good
documentation for that here
.
Basically, those library calls need a function interface with a special
C binding. The function I tested in host_data-2.f90 is cublasSaxpy.
Other function interfaces will need to be created as necessary.

I've applied this patch to gomp-4_0-branch.

Cesar
2016-08-11  Cesar Philippidis  

	libgomp/
	* testsuite/libgomp.oacc-fortran/host_data-1.f90: Remove stale xfail.
	* testsuite/libgomp.oacc-fortran/host_data-2.f90: New test.


diff --git a/libgomp/testsuite/libgomp.oacc-fortran/host_data-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/host_data-1.f90
index 497b0f7..69a491d 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/host_data-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/host_data-1.f90
@@ -1,9 +1,6 @@
 ! { dg-do run }
 ! { dg-additional-options "-cpp" }
 
-! { dg-xfail-if "TODO" { *-*-* } }
-! { dg-excess-errors "TODO" }
-
 program test
   implicit none
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/host_data-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/host_data-2.f90
new file mode 100644
index 000..68e14e3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/host_data-2.f90
@@ -0,0 +1,85 @@
+! Test host_data interoperability with CUDA blas.
+
+! { dg-do run { target openacc_nvidia_accel_selected } }
+! { dg-additional-options "-lcublas" }
+
+program test
+  implicit none
+
+  integer, parameter :: N = 10
+  integer :: i
+  real*4 :: x_ref(N), y_ref(N), x(N), y(N), a
+  
+  interface
+ subroutine cublassaxpy(N, alpha, x, incx, y, incy) bind(c, name="cublasSaxpy")
+   use iso_c_binding
+   integer(kind=c_int), value :: N
+   real*4, value :: alpha
+   type(*), dimension(*) :: x
+   integer(kind=c_int), value :: incx
+   type(*), dimension(*) :: y
+   integer(kind=c_int), value :: incy
+ end subroutine cublassaxpy
+  end interface
+
+  a = 2.0
+
+  do i = 1, N
+ x(i) = 4.0 * i
+ y(i) = 3.0
+ x_ref(i) = x(i)
+ y_ref(i) = y(i)
+  end do
+
+  call saxpy (N, a, x_ref, y_ref)
+  
+  !$acc data copyin (x) copy (y)
+  !$acc host_data use_device (x, y)
+  call cublassaxpy(N, a, x, 1, y, 1)
+  !$acc end host_data
+  !$acc end data
+  
+  do i = 1, N
+ if (y(i) .ne. y_ref(i)) call abort
+  end do
+
+  !$acc data create (x) copyout (y)
+  !$acc parallel loop
+  do i = 1, N
+ y(i) = 3.0
+  end do
+  !$acc end parallel loop
+
+  !$acc host_data use_device (x, y)
+  call cublassaxpy(N, a, x, 1, y, 1)
+  !$acc end host_data
+  !$acc end data
+
+  do i = 1, N
+ if (y(i) .ne. y_ref(i)) call abort
+  end do
+
+  y(:) = 3.0
+  
+  !$acc data copyin (x) copyin (a) copy (y)
+  !$acc parallel present (x) pcopy (y) present (a)
+  call saxpy (N, a, x, y)
+  !$acc end parallel
+  !$acc end data
+
+  do i = 1, N
+ if (y(i) .ne. y_ref(i)) call abort
+  end do
+end program test
+
+subroutine saxpy (nn, aa, xx, yy)
+  integer :: nn
+  real*4 :: aa, xx(nn), yy(nn)
+  integer i
+  real*4 :: t
+  !$acc routine
+
+  do i = 1, nn
+yy(i) = yy(i) + aa * xx(i)
+  end do
+end subroutine saxpy


Re: [PATCH] Add test coverage for PR gcov-profile/35590

2016-08-11 Thread Jeff Law

On 08/10/2016 06:50 AM, Martin Liška wrote:

Hi.

The following patch adds better test coverage for the cases mentioned in the PR.
The PR is resolved; I'll close it as soon as the patch is accepted.

Survives make check -k -j10 RUNTESTFLAGS="tree-prof.exp" on x86_64-linux-gnu.

Ready for trunk?

OK.
jeff


Re: Init df for split pass since some target use REG_NOTE in split pattern

2016-08-11 Thread Jeff Law

On 08/09/2016 08:44 PM, Kito Cheng wrote:

Hi Steven:

You are right: I built an arm toolchain today, and it crashes when building
libgcc since there is no valid CFG in split_all_insns_noflow.

Hi Jeff:

Maybe we need split_reg_dead_p and split_reg_unused_p, like peephole2
has peep2_reg_dead_p?

Seems pretty reasonable.

I'm still curious which of the 4 passes of split_all_insns was
triggering the failure, though.



jeff



Re: [PATCH] Add selftests to selftest.c

2016-08-11 Thread Jeff Law

On 08/09/2016 07:25 PM, David Malcolm wrote:

This patch adds some initial selftesting of selftest.c/h itself.
This may seem like overdoing it, but I have some followup patches
that add non-trivial logic to selftest.c, which we should verify, so it
makes sense to test the existing functionality and then build on that.

Successfully bootstrapped on x86_64-pc-linux-gnu.

OK for trunk?

gcc/ChangeLog:
* selftest-run-tests.c (selftest::run_tests): Call selftest_c_tests.
* selftest.c (selftest::test_assertions): New function.
(selftest::selftest_c_tests): New function.
* selftest.h (selftest::selftest_c_tests): New declaration.

OK.
jeff



Re: [PATCH] Fix warning breaking profiled bootstrap

2016-08-11 Thread Jeff Law

On 08/08/2016 08:43 PM, Andi Kleen wrote:

Ideally we'd have a test that we could more deeply analyze for paths
through the CFG that can't be executed.  Finding those paths usually
both fixes the warning *and* results in better code.  Even if we
can't fix it now, we can file it away for future work


It's multiple variables that depend on each other, with complex
control flow in between.  It's not surprising that the analysis loses
track of all the combinations.

You can take a look yourself here:

/home/andi/gcc/git/gcc/gcc/tree-vrp.c: In function 'int 
compare_values_warnv(tree, tree, bool*)':
/home/andi/gcc/git/gcc/gcc/tree-vrp.c:1251:12: error: 'inv2' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
   tree inv = cst1 ? inv2 : inv1;
^~~
/home/andi/gcc/git/gcc/gcc/tree-vrp.c:1222:4: error: 'inv1' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
   if (strict_overflow_p != NULL
   ~
&& (!inv1 || !TREE_NO_WARNING (val1))
^

From my reading it looks to me like we could use inv2 uninitialized.

  tree sym1 = get_single_symbol (val1, , );
  tree sym2 = get_single_symbol (val2, , );


If sym1 results in a return value that is some useful tree and inv1 is 
true and cst1 is true via this call:


  const bool cst1 = is_gimple_min_invariant (val1);


We then need sym2 to result in a return value from get_single_symbol to 
be NULL, but cst2 still be true via this statement:


  const bool cst2 = is_gimple_min_invariant (val2);

If all those conditions can hold, then we'll use inv2 without 
initializing it AFAICT.


The key here is can there be a value for val2 where val2 is an invariant 
not built from PLUS_EXPR, POINTER_PLUS_EXPR, MINUS_EXPR or NEGATE_EXPR. 
I think we can make that happen if val2 is an ADDR_EXPR with an 
invariant address.


So it actually seems to me that unless ADDR_EXPR filtered out elsewhere 
that we've got a real bug here rather than a false positive.  There may 
be others, that's just the first one that stood out.


Or did I miss something?

jeff





[PATCH, preapproved] Disable DF_VERIFY_SCHEDULED at end of df_verify (PR72855)

2016-08-11 Thread Bill Schmidt
Hi,

PR72855 reports a compile-time problem with df_verify being called once per 
loop.
The compile time drops from about 2 hours to 9 minutes for the reported test 
case
when we disable the DF_VERIFY_SCHEDULED flag after df_verify runs.  That's all
this patch does.

Bin Cheng has a separate patch to factor some code in loop-doloop.c that will 
also
provide benefit.

There is a separate issue that this compile-time problem should not occur at all
with release checking, and yet it does.  That is still under investigation.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
Preapproved, committed.

Thanks,
Bill


2016-08-11  Richard Biener  
Bill Schmidt  

PR rtl-optimization/72855
* df-core.c (df_verify): Turn off DF_VERIFY_SCHEDULED at end.


Index: gcc/df-core.c
===
--- gcc/df-core.c   (revision 239394)
+++ gcc/df-core.c   (working copy)
@@ -1833,6 +1833,7 @@ df_verify (void)
   if (df_live)
 df_live_verify_transfer_functions ();
 #endif
+  df->changeable_flags &= ~DF_VERIFY_SCHEDULED;
 }
 
 #ifdef DF_DEBUG_CFG



Re: divmod transform: add test-cases

2016-08-11 Thread Jeff Law

On 08/09/2016 04:58 AM, Prathamesh Kulkarni wrote:

ping https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01869.html
This seems to be dependent upon other patches, so it is not OK until all
its prereqs are resolved.


You're using SI/DI in the descriptions, but then using more traditional 
C types like int, unsigned, long long in the actual test.


If your intent is really to nail down SI/DI mode testing, then the way to
go will be to define typedefs that correspond directly to SI and DI modes:


typedef unsigned int u32 __attribute__((mode(SI)));
typedef signed int s32 __attribute__((mode(SI)));

You can do something similar for DImode.

It may not matter because of your effective-target 
{divmod,divmod_simode} selectors, but it still seems cleaner.


With that change, this is OK when its prereqs are resolved.

jeff




Re: [PATCH] - improve sprintf buffer overflow detection (middle-end/49905)

2016-08-11 Thread Martin Sebor

On 07/29/2016 04:50 PM, Joseph Myers wrote:

On Mon, 18 Jul 2016, Martin Sebor wrote:


+  /* Number of exponent digits or -1 when unknown.  */
+  int expdigs = -1;
+  /* 1 when ARG < 0, 0 when ARG >= 0, -1 when unknown.  */
+  int negative = -1;
+  /* Log10 of EXPDIGS.  */
+  int logexpdigs = 2;
+
+  const double log10_2 = .30102999566398119521;
+
+  if (arg && TREE_CODE (arg) == REAL_CST)
+{
+  expdigs = real_exponent (TREE_REAL_CST_PTR (arg)) * log10_2;
+  negative = real_isneg (TREE_REAL_CST_PTR (arg));
+  logexpdigs = ilog (expdigs, 10);


Thanks for looking at it!  The floating conversion wasn't 100%
complete and the bits that were there I had the least confidence
in so I especially appreciate a second pair of eyes on it.



This looks wrong for the case of a constant with a negative exponent.


You're right -- great catch!  I've fixed it.



Also, what if the constant has a decimal floating-point type?


Decimal floating point isn't handled in this initial implementation.
The checker gives up after the first directive it doesn't recognize
and stops checking the rest of the format string.  Eventually, I'd
like to enhance it to handle decimal floats and other directives
and types.




+}
+  else if (REAL_MODE_FORMAT (TYPE_MODE (type))->b == 2)
+{
+  /* Compute T_MAX_EXP for base 2.  */
+  expdigs = REAL_MODE_FORMAT (TYPE_MODE (type))->emax * log10_2;


Shouldn't you also compute logexpdigs here?  The comment on logexpdigs
implies it's always meant to have a given relation to expdigs.  Or maybe
those variables need to be split into min/max cases, since you're
computing both min/max values below.


I have reworked this so logexpdigs is computed here as well.




+case 'E':
+case 'e':
+  /* The minimum output is "[-+]1.234567e+00" for an IEEE double
+regardless of the value of the actual argument. */
+  res.min = ((0 < negative || spec.get_flag ('+') || spec.get_flag (' '))
++ 1 /* unit */ + (prec < 0 ? 7 : prec ? prec + 1 : 0)
++ 2 /* e+ */ + (logexpdigs < 2 ? 2 : logexpdigs));


As I understand your logic, for the case of a constant expdigs and so
logexpdigs may sometimes be 1 bigger than the correct value for that
constant, say for 9.999e+99, so this may not actually be the minimum.


Yes, %e is an approximation though I hadn't considered this case.
I reworked this part of the implementation to compute an accurate
result.


+case 'G':
+case 'g':
+  /* Treat this the same as '%F' for now even though that's
+inaccurate.  */
+  res.min = 2 + (prec < 0 ? 6 : prec);
+  res.max = ((spec.get_flag ('L') ? 4934 : 310)
++ (prec < 0 ? 6 : prec ? prec + 1 : 0));


Again, avoid embedding properties of given formats.


I've removed the IEEE 754 assumptions from both the code and the
comments.

I will post the latest version of the patch shortly.

Martin


Re: libgo patch committed: Change build procedure to use build tags

2016-08-11 Thread Ian Lance Taylor
On Thu, Aug 11, 2016 at 8:15 AM, Rainer Orth
 wrote:
>
>> Go packages use build tags (see the section on Build Constraints at
>> https://golang.org/pkg/go/build/) to select which files to build on
>> specific systems.
>>
>> Previously the libgo Makefile explicitly listed the set of files to
>> compile for each package.  For packages that use build tags, this
>> required a lot of awkward automake conditionals in the Makefile.
>>
>> This patch changes the build to look at the build tags in the files.
>> The new shell script libgo/match.sh does the matching.  This required
>> adjusting a lot of build tags, and removing some files that are never
>> used.  I verified that the exact same sets of files are compiled on
>> x86_64-pc-linux-gnu.  I also tested the build on i386-sun-solaris
>> (building for both 32-bit and 64-bit).
>>
>> Writing match.sh revealed some bugs in the build tag handling that
>> already exists, in a slightly different form, in the gotest shell
>> script.  This patch fixes those problems as well.
>>
>> The old code used automake conditionals to handle systems that were
>> missing strerror_r and wait4.  Rather than deal with those in Go,
>> those functions are now implemented in runtime/go-nosys.c when
>> necessary, so the Go code can simply assume that they exist.
>>
>> The os testsuite looked for dir_unix.go, which was never built for
>> gccgo and has now been removed.  I changed the testsuite to look for
>> dir.go instead.
>>
>> Note that if you have an existing build directory, you will have to
>> remove all the .dep files in TARGET/libgo after updating to this
>> patch.  There isn't anything that will force them to update
>> automatically.
>>
>> Bootstrapped on x86_64-pc-linux-gnu and i386-sun-solaris.  Ran Go
>> testsuite on x86_64-pc-linux-gnu.  Committed to mainline.
>
> this patch broke i386-pc-solaris2.12 and sparc-sun-solaris2.12
> bootstrap, however: in both cases, the 64-bit build of os.lo fails like this:
>
> /vol/gcc/src/hg/trunk/local/libgo/go/os/dir.go:82:8: error: reference to 
> undefined name 'libc_readdir_r'
>i := libc_readdir_r(file.dirinfo.dir, entryDirent, pr)
> ^
>
> Neither dir_largefile.go (which is correctly omitted, being 32-bit only)
> nor dir_regfile.go (which is needed here) is included in the
> compilation.

Sorry, I don't know what I messed up in my testing.  I committed the
appended patch, which should fix the problem.

Ian
Index: gcc/go/gofrontend/MERGE
===
--- gcc/go/gofrontend/MERGE (revision 239332)
+++ gcc/go/gofrontend/MERGE (working copy)
@@ -1,4 +1,4 @@
-5e05b7bc947231b4d5a8327bf63e2fa648e51dc7
+fe1e77f843220503f1f8d5ea7dd5e307580e1d38
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
Index: libgo/go/os/dir_regfile.go
===
--- libgo/go/os/dir_regfile.go  (revision 239189)
+++ libgo/go/os/dir_regfile.go  (working copy)
@@ -6,8 +6,8 @@
 // license that can be found in the LICENSE file.
 
 // +build !linux
-// +build !solaris,386
-// +build !solaris,sparc
+// +build !solaris !386
+// +build !solaris !sparc
 
 package os
 


Re: [PATCH, c++ testsuite]: Fix UNRESOLVED: g++.dg/cpp1z/constexpr-lambda6.C testsuite failure

2016-08-11 Thread Uros Bizjak
On Thu, Aug 11, 2016 at 11:25 PM, Jason Merrill  wrote:
> There's no need to rename main, but removing the dg-do is OK.

Thanks, I have committed only the removal of dg-do directive.

Uros.

Index: g++.dg/cpp1z/constexpr-lambda6.C
===
--- g++.dg/cpp1z/constexpr-lambda6.C(revision 239384)
+++ g++.dg/cpp1z/constexpr-lambda6.C(working copy)
@@ -1,5 +1,4 @@
 // Testcase from P0170R1
-// { dg-do run }
 // { dg-options -std=c++1z }

 auto monoid = [](auto v) { return [=] { return v; }; };


Re: [PATCH, c++ testsuite]: Fix UNRESOLVED: g++.dg/cpp1z/constexpr-lambda6.C testsuite failure

2016-08-11 Thread Jason Merrill
There's no need to rename main, but removing the dg-do is OK.

On Thu, Aug 11, 2016 at 3:03 PM, Uros Bizjak  wrote:
> The compilation, where error is expected, will not produce an executable.
>
> 2016-08-11  Uros Bizjak  
>
> * g++.dg/cpp1z/constexpr-lambda6.C (dg-do run): Remove.
> (g): Rename from main.
>
> Tested on x86_64-linux-gnu.
>
> OK for mainline?
>
> Uros.


Re: [PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding

2016-08-11 Thread Denys Vlasenko

On 08/11/2016 10:49 PM, Denys Vlasenko wrote:

falign-functions=N is too simplistic.

This change makes it possible to align functions to 64-byte boundaries *IF*
this does not introduce a huge amount of padding.


This is my first submission to gcc-patches@gcc.gnu.org.
Please let me know if my patch formatting is wrong.

The patch is for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

Regarding copyright:
I believe my company, Red Hat, has a blanket copyright assignement
for gcc for patches from our employees.


[PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding

2016-08-11 Thread Denys Vlasenko
falign-functions=N is too simplistic.

Ingo Molnar ran some tests and it looks like, on the latest x86 CPUs, 64-byte
alignment runs fastest (he tried many other possibilities).

However, developers are less than thrilled by the idea of slam-dunk 64-byte
alignment of everything.  Too much waste:
On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.

This change makes it possible to align functions to 64-byte boundaries *IF*
this does not introduce a huge amount of padding.

The patch drops the forced alignment to 8 if the requested alignment is higher than 8:
before the patch, -falign-functions=9 was generating

.p2align 4,,8
.p2align 3

which means: "align to 16 if the skip is 8 bytes or less; else align to 8".
After this change, ".p2align 3" is not emitted.

It is dropped because I ultimately want to do something
like -falign-functions=64,8 - IOW, I want to align functions to 64 bytes,
but only if that generates padding of less than 8 bytes - otherwise I want
*no alignment at all*. The forced ".p2align 3" interferes with that intention.

Testing:
Tested that with -falign-functions=N (tried 8, 15, 16, 17...) the alignment
directives are the same before and after the patch.
Tested that -falign-functions=N,N (two equal parameters) works exactly
like -falign-functions=N.

Index: gcc/common.opt
===
--- gcc/common.opt  (revision 239390)
+++ gcc/common.opt  (working copy)
@@ -900,7 +900,7 @@ Common Report Var(align_functions,0) Optimization
 Align the start of functions.
 
 falign-functions=
-Common RejectNegative Joined UInteger Var(align_functions)
+Common RejectNegative Joined Var(flag_align_functions)
 
 falign-jumps
 Common Report Var(align_jumps,0) Optimization UInteger
@@ -907,7 +907,7 @@ Common Report Var(align_jumps,0) Optimization UInt
 Align labels which are only reached by jumping.
 
 falign-jumps=
-Common RejectNegative Joined UInteger Var(align_jumps)
+Common RejectNegative Joined Var(flag_align_jumps)
 
 falign-labels
 Common Report Var(align_labels,0) Optimization UInteger
@@ -914,7 +914,7 @@ Common Report Var(align_labels,0) Optimization UIn
 Align all labels.
 
 falign-labels=
-Common RejectNegative Joined UInteger Var(align_labels)
+Common RejectNegative Joined Var(flag_align_labels)
 
 falign-loops
 Common Report Var(align_loops,0) Optimization UInteger
@@ -921,7 +921,7 @@ Common Report Var(align_loops,0) Optimization UInt
 Align the start of loops.
 
 falign-loops=
-Common RejectNegative Joined UInteger Var(align_loops)
+Common RejectNegative Joined Var(flag_align_loops)
 
 fargument-alias
 Common Ignore
Index: gcc/config/i386/freebsd.h
===
--- gcc/config/i386/freebsd.h   (revision 239390)
+++ gcc/config/i386/freebsd.h   (working copy)
@@ -92,25 +92,17 @@ along with GCC; see the file COPYING3.  If not see
 
 /* A C statement to output to the stdio stream FILE an assembler
    command to advance the location counter to a multiple of 1<<LOG
    bytes if it is within MAX_SKIP bytes.  */
 #ifdef HAVE_GAS_MAX_SKIP_P2ALIGN
 #define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE, LOG, MAX_SKIP) \
   do { \
 if ((LOG) != 0) { \
-  if ((MAX_SKIP) == 0) \
+  if ((MAX_SKIP) == 0 || (MAX_SKIP) >= (1<<(LOG))) \
+fprintf ((FILE), "\t.p2align %d\n", (LOG));\
+  else \
fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP)); \
-   /* Make sure that we have at least 8 byte alignment if > 8 byte \
-  alignment is preferred.  */  \
-   if ((LOG) > 3   \
-   && (1 << (LOG)) > ((MAX_SKIP) + 1)  \
-   && (MAX_SKIP) >= 7) \
- fputs ("\t.p2align 3\n", (FILE)); \
-  }
\
 }  \
   } while (0)
 #endif
Index: gcc/config/i386/gnu-user.h
===
--- gcc/config/i386/gnu-user.h  (revision 239390)
+++ gcc/config/i386/gnu-user.h  (working copy)
@@ -94,24 +94,16 @@ along with GCC; see the file COPYING3.  If not see
 
 /* A C statement to output to the stdio stream FILE an assembler
    command to advance the location counter to a multiple of 1<<LOG
    bytes if it is within MAX_SKIP bytes.  */
 #ifdef HAVE_GAS_MAX_SKIP_P2ALIGN
 #define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE, LOG, MAX_SKIP) \
   do { \
 if ((LOG) != 0) { \
-  if ((MAX_SKIP) == 0) \
+  if ((MAX_SKIP) == 0 || (MAX_SKIP) >= (1<<(LOG))) \
+fprintf ((FILE), "\t.p2align %d\n", (LOG));\
+  else \
fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP)); \
-   /* Make sure that we have 

[PATCH] PR middle-end/74113: Don't limit piecewise move to MAX_FIXED_MODE_SIZE

2016-08-11 Thread H.J. Lu
alignment_for_piecewise_move is called only with MOVE_MAX_PIECES or
STORE_MAX_PIECES, which are the number of bytes at a time that we
can move or store efficiently.  We should call mode_for_size without
the limit to MAX_FIXED_MODE_SIZE.  MAX_FIXED_MODE_SIZE, an integer
expression for the size in bits of the largest integer machine mode
that should actually be used for computation, may be smaller than
MOVE_MAX_PIECES or STORE_MAX_PIECES, which may use vector registers.

MAX_BITSIZE_MODE_ANY_INT is only defined for i386 and everyone else uses
the default.  The widest mode for integer computation is determined by
MAX_FIXED_MODE_SIZE, not by MAX_BITSIZE_MODE_ANY_INT.  OImode and XImode
can be used for load and store, including constant integers.  Remove
MAX_BITSIZE_MODE_ANY_INT from i386 to avoid any potential problems for
constant integers wider than TImode.

Tested on i686 and x86-64.  OK for trunk?


H.J.
---
PR middle-end/74113
* expr.c (alignment_for_piecewise_move): Call mode_for_size
without limit to MAX_FIXED_MODE_SIZE.
* config/i386/i386-modes.def (MAX_BITSIZE_MODE_ANY_INT): Removed.
---
 gcc/config/i386/i386-modes.def | 5 -
 gcc/expr.c | 2 +-
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/gcc/config/i386/i386-modes.def b/gcc/config/i386/i386-modes.def
index d524313..61a1f08 100644
--- a/gcc/config/i386/i386-modes.def
+++ b/gcc/config/i386/i386-modes.def
@@ -98,10 +98,5 @@ POINTER_BOUNDS_MODE (BND64, 16);
 INT_MODE (OI, 32);
 INT_MODE (XI, 64);
 
-/* Keep the OI and XI modes from confusing the compiler into thinking
-   that these modes could actually be used for computation.  They are
-   only holders for vectors during data movement.  */
-#define MAX_BITSIZE_MODE_ANY_INT (128)
-
 /* The symbol Pmode stands for one of the above machine modes (usually SImode).
The tm.h file specifies which one.  It is not a distinct mode.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index 46de35f..826fd9b 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -692,7 +692,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, 
unsigned int align)
 {
   machine_mode tmode;
 
-  tmode = mode_for_size (max_pieces * BITS_PER_UNIT, MODE_INT, 1);
+  tmode = mode_for_size (max_pieces * BITS_PER_UNIT, MODE_INT, 0);
   if (align >= GET_MODE_ALIGNMENT (tmode))
 align = GET_MODE_ALIGNMENT (tmode);
   else
-- 
2.7.4



[PATCH] RFC: -fasm-show-source

2016-08-11 Thread David Malcolm
I sometimes find myself scouring assembler output from the compiler
and trying to figure out which instructions correspond to which
lines of source code; I believe this is a common activity for some
end-users.

The following patch adds a new -fasm-show-source option, which
emits comments into the generated asm showing the pertinent
line of source code, whenever it changes.  It uses the same logic
as debug_hooks->source_line for tracking this (for handling
line-based breakpoints).

An example can be seen in the invoke.texi part of the patch.  As
noted there, it's aimed at end-users, rather than gcc developers.
The example shows a relatively short function; the option is
likely to be much more useful for longer functions.

I think it would further improve usability if this option were enabled
by default when the final output is .s (either via -S, or by "-o foo.s").
Ideas on how to implement that (in the driver) would be welcome - I
started looking at the spec-handling code, but thought I'd post the
idea here first, before diving in too deeply.

Successfully bootstrapped on x86_64-pc-linux-gnu; adds
2 PASS results to gcc.sum.

Thoughts?  OK for trunk as-is?

gcc/ChangeLog:
* common.opt (fasm-show-source): New option.
* doc/invoke.texi (Code Generation Options): Add
-fasm-show-source.
(-fasm-show-source): New item.
* final.c (asm_show_source): New function.
(final_scan_insn): Call asm_show_source.

gcc/testsuite/ChangeLog:
* gcc.dg/fasm-show-source-1.c: New test case.
---
 gcc/common.opt|  5 +++
 gcc/doc/invoke.texi   | 71 ++-
 gcc/final.c   | 29 -
 gcc/testsuite/gcc.dg/fasm-show-source-1.c | 15 +++
 4 files changed, 117 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/fasm-show-source-1.c

diff --git a/gcc/common.opt b/gcc/common.opt
index 8a292ed..56ce513 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -939,6 +939,11 @@ fargument-noalias-anything
 Common Ignore
 Does nothing. Preserved for backward compatibility.
 
+fasm-show-source
+Common Var(flag_asm_show_source)
+Emit comments in the generated assembly code to show the source code
+lines associated with the assembly instructions.
+
 fsanitize=
 Common Driver Report Joined
 Select what to sanitize.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 22001f9..dc3d3ad 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -486,7 +486,8 @@ Objective-C and Objective-C++ Dialects}.
 
 @item Code Generation Options
 @xref{Code Gen Options,,Options for Code Generation Conventions}.
-@gccoptlist{-fcall-saved-@var{reg}  -fcall-used-@var{reg} @gol
+@gccoptlist{-fasm-show-source @gol
+-fcall-saved-@var{reg}  -fcall-used-@var{reg} @gol
 -ffixed-@var{reg}  -fexceptions @gol
 -fnon-call-exceptions  -fdelete-dead-exceptions  -funwind-tables @gol
 -fasynchronous-unwind-tables @gol
@@ -11153,6 +11154,74 @@ can figure out the other form by either removing 
@samp{no-} or adding
 it.
 
 @table @gcctabopt
+@item -fasm-show-source
+@opindex fasm-show-source
+Emit comments in the generated assembly code to show the source code
+lines associated with the assembly instructions.  This option is
+aimed at end-users who wish to better understand the relationship
+between their source code and the generated machine code.
+
+The comments are of the form FILENAME:LINENUMBER:CONTENT OF LINE.
+
+For example, given this C source file:
+
+@smallexample
+int test (int n)
+@{
+  int i;
+  int total = 0;
+
+  for (i = 0; i < n; i++)
+total += i * i;
+
+  return total;
+@}
+@end smallexample
+
+compiling to (x86_64) assembly via @option{-S} and emitting the result
+direct to stdout via @option{-o} @option{-}
+
+@smallexample
+gcc -S test.c -fasm-show-source -Os -o -
+@end smallexample
+
+gives output similar to this, highlighting which instructions correspond
+to the various parts of the loop:
+
+@smallexample
+   .file   "test.c"
+   .text
+   .globl  test
+   .type   test, @@function
+test:
+.LFB0:
+   .cfi_startproc
+# test.c:4:   int total = 0;
+   xorl%eax, %eax
+# test.c:6:   for (i = 0; i < n; i++)
+   xorl%edx, %edx
+.L2:
+# test.c:6:   for (i = 0; i < n; i++)
+   cmpl%edi, %edx
+   jge .L5
+# test.c:7: total += i * i;
+   movl%edx, %ecx
+   imull   %edx, %ecx
+# test.c:6:   for (i = 0; i < n; i++)
+   incl%edx
+# test.c:7: total += i * i;
+   addl%ecx, %eax
+   jmp .L2
+.L5:
+# test.c:10: @}
+   ret
+   .cfi_endproc
+.LFE0:
+   .size   test, .-test
+   .ident  "GCC: (GNU) 7.0.0 20160809 (experimental)"
+   .section.note.GNU-stack,"",@@progbits
+@end smallexample
+
 @item -fstack-reuse=@var{reuse-level}
 @opindex fstack_reuse
 This option controls stack space reuse for user declared local/auto variables
diff --git a/gcc/final.c b/gcc/final.c

[PATCH] Increase alignment for bit-field access when predictive commoning (PR 71083)

2016-08-11 Thread Bernd Edlinger
On 08/11/16 09:07, Richard Biener wrote:
> The patch looks mostly ok, but
>
> +  else
> +   {
> + boff >>= LOG2_BITS_PER_UNIT;
> + boff += tree_to_uhwi (component_ref_field_offset (ref));
> + coff = size_binop (MINUS_EXPR, coff, ssize_int (boff));
>
> how can we be sure that component_ref_field_offset is an INTEGER_CST?
> At least Ada can have variably-length fields before a bitfield.  We'd
> need to apply component_ref_field_offset to off in that case.  Which
> makes me wonder if we can simply avoid the COMPONENT_REF path in
> a first iteration of the patch and always build a BIT_FIELD_REF?
> It should solve the correctness issues as well and be more applicable
> for branches.
>

I believe that it will be necessary to check for tree_fits_uhwi_p here,
but while it happens quite often that a normal COMPONENT_REF has a
variable offset, it happens rarely that a bit-field COMPONENT_REF has a
variable offset: in the whole Ada test suite it was only in the pack16
test case.  However, I was not able to use that trick in the
loop_optimization23 test case.  When I tried, it no longer got
optimized by pcom.  So there will be no new test case at this time.

Therefore I left the comment as-is, because it is not clear, if a
variable offset will ever happen here; but if it happens, it will be
in Ada, and it will be safe to create a BIT_FIELD_REF instead.

So this is the follow-up patch that tries to create a more aligned
access using the COMPONENT_REF.


Boot-strap and reg-test on x86_64-pc-linux-gnu.
Is it OK for trunk?


Thanks
Bernd.
2016-08-11  Bernd Edlinger  

PR tree-optimization/71083
* tree-predcom.c (ref_at_iteration): Use a COMPONENT_REF for the
bitfield access when possible.
Index: gcc/tree-predcom.c
===
--- gcc/tree-predcom.c	(revision 239367)
+++ gcc/tree-predcom.c	(working copy)
@@ -1365,11 +1365,16 @@ replace_ref_with (gimple *stmt, tree new_tree, boo
 /* Returns a memory reference to DR in the ITER-th iteration of
the loop it was analyzed in.  Append init stmts to STMTS.  */
 
-static tree 
+static tree
 ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts)
 {
   tree off = DR_OFFSET (dr);
   tree coff = DR_INIT (dr);
+  tree ref = DR_REF (dr);
+  enum tree_code ref_code = ERROR_MARK;
+  tree ref_type = NULL_TREE;
+  tree ref_op1 = NULL_TREE;
+  tree ref_op2 = NULL_TREE;
   if (iter == 0)
 ;
   else if (TREE_CODE (DR_STEP (dr)) == INTEGER_CST)
@@ -1378,28 +1383,50 @@ ref_at_iteration (data_reference_p dr, int iter, g
   else
 off = size_binop (PLUS_EXPR, off,
 		  size_binop (MULT_EXPR, DR_STEP (dr), ssize_int (iter)));
-  tree addr = fold_build_pointer_plus (DR_BASE_ADDRESS (dr), off);
-  addr = force_gimple_operand_1 (unshare_expr (addr), stmts,
- is_gimple_mem_ref_addr, NULL_TREE);
-  tree alias_ptr = fold_convert (reference_alias_ptr_type (DR_REF (dr)), coff);
-  tree type = build_aligned_type (TREE_TYPE (DR_REF (dr)),
-  get_object_alignment (DR_REF (dr)));
   /* While data-ref analysis punts on bit offsets it still handles
  bitfield accesses at byte boundaries.  Cope with that.  Note that
- we cannot simply re-apply the outer COMPONENT_REF because the
- byte-granular portion of it is already applied via DR_INIT and
- DR_OFFSET, so simply build a BIT_FIELD_REF knowing that the bits
+ if the bitfield object also starts at a byte-boundary we can simply
+ replicate the COMPONENT_REF, but we have to subtract the component's
+ byte-offset from the MEM_REF address first.
+ Otherwise we simply build a BIT_FIELD_REF knowing that the bits
  start at offset zero.  */
-  if (TREE_CODE (DR_REF (dr)) == COMPONENT_REF
-  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (dr), 1)))
+  if (TREE_CODE (ref) == COMPONENT_REF
+  && DECL_BIT_FIELD (TREE_OPERAND (ref, 1)))
 {
-  tree field = TREE_OPERAND (DR_REF (dr), 1);
-  return build3 (BIT_FIELD_REF, TREE_TYPE (DR_REF (dr)),
-		 build2 (MEM_REF, type, addr, alias_ptr),
-		 DECL_SIZE (field), bitsize_zero_node);
+  unsigned HOST_WIDE_INT boff;
+  tree field = TREE_OPERAND (ref, 1);
+  tree offset = component_ref_field_offset (ref);
+  ref_type = TREE_TYPE (ref);
+  boff = tree_to_uhwi (DECL_FIELD_BIT_OFFSET (field));
+  /* This can occur in Ada.  See the comment in get_bit_range.  */
+  if (boff % BITS_PER_UNIT != 0
+	  || !tree_fits_uhwi_p (offset))
+	{
+	  ref_code = BIT_FIELD_REF;
+	  ref_op1 = DECL_SIZE (field);
+	  ref_op2 = bitsize_zero_node;
+	}
+  else
+	{
+	  boff >>= LOG2_BITS_PER_UNIT;
+	  boff += tree_to_uhwi (offset);
+	  coff = size_binop (MINUS_EXPR, coff, ssize_int (boff));
+	  ref_code = COMPONENT_REF;
+	  ref_op1 = field;
+	  ref_op2 = TREE_OPERAND (ref, 2);
+	  ref = TREE_OPERAND (ref, 0);
+	}
 }
-  else
-return fold_build2 (MEM_REF, type, addr, alias_ptr);
+  tree addr = 

[PATCH, libstdc++v3]: Fallback to read/write ops in case sendfile fails with ENOSYS or EINVAL.

2016-08-11 Thread Uros Bizjak
Hello!

Attached patch implements note from sendfile manpage:


Applications may wish to fall back to read(2)/write(2) in the case
where sendfile() fails with EINVAL or ENOSYS.


Also, the patch fixes a small inconsistency in how
_GLIBCXX_USE_FCHMODAT config flag is handled in do_copy_file function.

2016-08-11  Uros Bizjak  

* src/filesystem/ops.cc: Always include ostream and
ext/stdio_filebuf.h.
(do_copy_file): Check if _GLIBCXX_USE_FCHMODAT is defined.
[_GLIBCXX_USE_SENDFILE]: Fallback to read/write operations in case
sendfile fails with ENOSYS or EINVAL.

Patch was bootstrapped and regression tested on x86_64-linux-gnu
{,-m32} on CentOS 5.11 (where sendfile returns EINVAL for file->file
copy) and Fedora 24. In addition, the patch was bootstrapped and
regression tested with _GLIBCXX_USE_SENDFILE manually disabled after
configure.

OK for mainline?

Uros.
Index: src/filesystem/ops.cc
===
--- src/filesystem/ops.cc   (revision 239384)
+++ src/filesystem/ops.cc   (working copy)
@@ -28,7 +28,9 @@
 
 #include 
 #include 
+#include <ostream>
 #include 
+#include <ext/stdio_filebuf.h>
 #include 
 #include 
 #include 
@@ -48,9 +50,6 @@
 #endif
 #ifdef _GLIBCXX_USE_SENDFILE
# include <sys/sendfile.h>
-#else
-# include <ostream>
-# include <ext/stdio_filebuf.h>
 #endif
 #if _GLIBCXX_HAVE_UTIME_H
# include <utime.h>
@@ -416,7 +415,7 @@ namespace
 
 #ifdef _GLIBCXX_USE_FCHMOD
 if (::fchmod(out.fd, from_st->st_mode))
-#elif _GLIBCXX_USE_FCHMODAT
+#elif defined _GLIBCXX_USE_FCHMODAT
 if (::fchmodat(AT_FDCWD, to.c_str(), from_st->st_mode, 0))
 #else
 if (::chmod(to.c_str(), from_st->st_mode))
@@ -428,6 +427,31 @@ namespace
 
 #ifdef _GLIBCXX_USE_SENDFILE
 const auto n = ::sendfile(out.fd, in.fd, nullptr, from_st->st_size);
+if (n < 0 && (errno == ENOSYS || errno == EINVAL))
+  {
+#endif
+   __gnu_cxx::stdio_filebuf<char> sbin(in.fd, std::ios::in);
+   __gnu_cxx::stdio_filebuf<char> sbout(out.fd, std::ios::out);
+   if (sbin.is_open())
+ in.fd = -1;
+   if (sbout.is_open())
+ out.fd = -1;
+   if (from_st->st_size && !(std::ostream(&sbout) << &sbin))
+ {
+   ec = std::make_error_code(std::errc::io_error);
+   return false;
+ }
+   if (!sbout.close() || !sbin.close())
+ {
+   ec.assign(errno, std::generic_category());
+   return false;
+ }
+
+   ec.clear();
+   return true;
+
+#ifdef _GLIBCXX_USE_SENDFILE
+  }
 if (n != from_st->st_size)
   {
ec.assign(errno, std::generic_category());
@@ -438,27 +462,10 @@ namespace
ec.assign(errno, std::generic_category());
return false;
   }
-#else
-__gnu_cxx::stdio_filebuf<char> sbin(in.fd, std::ios::in);
-__gnu_cxx::stdio_filebuf<char> sbout(out.fd, std::ios::out);
-if (sbin.is_open())
-  in.fd = -1;
-if (sbout.is_open())
-  out.fd = -1;
-if (from_st->st_size && !(std::ostream(&sbout) << &sbin))
-  {
-   ec = std::make_error_code(std::errc::io_error);
-   return false;
-  }
-if (!sbout.close() || !sbin.close())
-  {
-   ec.assign(errno, std::generic_category());
-   return false;
-  }
-#endif
 
 ec.clear();
 return true;
+#endif
   }
 }
 #endif


Re: [PATCH, rs6000] Fix PR72863 (swap optimization misses swaps generated from intrinsics)

2016-08-11 Thread David Edelsohn
On Thu, Aug 11, 2016 at 2:39 PM, Bill Schmidt
 wrote:
> Hi,
>
> Anton reports in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72863 that use 
> of
> vec_vsx_ld and vec_vsx_st intrinsics leaves the endian swaps in the generated
> code, even for very simple computations.  This turns out to be because we 
> don't
> generate the swaps at expand time as we do with other vector moves; rather, 
> they
> don't get generated until split time.  This patch fixes the problem in the
> obvious way.
>
> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
> One new test case added.  Is this ok for trunk?
>
> I would also like to backport this to 6 and 5 branches after some burn-in 
> time.
> I do not plan to rush this into 6.2; we'll have to wait for 6.3 as this is 
> only
> a performance issue, albeit an important one.
>
> Thanks,
> Bill
>
>
> [gcc]
>
> 2016-08-11  Bill Schmidt  
>
> PR target/72863
> * vsx.md (vsx_load_<mode>): For P8LE, emit swaps at expand time.
> (vsx_store_<mode>): Likewise.
>
> [gcc/testsuite]
>
> 2016-08-11  Bill Schmidt  
>
> PR target/72863
> * gcc.target/powerpc/pr72863.c: New test.

Okay for trunk, and 5/6 branches after a few weeks.

Thanks, David


[PATCH, c++ testsuite]: Fix UNRESOLVED: g++.dg/cpp1z/constexpr-lambda6.C testsuite failure

2016-08-11 Thread Uros Bizjak
The compilation, where an error is expected, will not produce an executable.

2016-08-11  Uros Bizjak  

* g++.dg/cpp1z/constexpr-lambda6.C (dg-do run): Remove.
(g): Rename from main.

Tested on x86_64-linux-gnu.

OK for mainline?

Uros.
Index: g++.dg/cpp1z/constexpr-lambda6.C
===
--- g++.dg/cpp1z/constexpr-lambda6.C(revision 239384)
+++ g++.dg/cpp1z/constexpr-lambda6.C(working copy)
@@ -1,5 +1,4 @@
 // Testcase from P0170R1
-// { dg-do run }
 // { dg-options -std=c++1z }
 
 auto monoid = [](auto v) { return [=] { return v; }; };
@@ -14,7 +13,7 @@
   };
 };
 
-int main()
+void g()
 {
   constexpr auto zero = monoid(0);
   constexpr auto one = monoid(1);


Re: [PATCH] Improve backwards threading

2016-08-11 Thread Jeff Law

On 08/09/2016 07:13 AM, Richard Biener wrote:

On Fri, 5 Aug 2016, Jeff Law wrote:


On 08/05/2016 07:36 AM, Richard Biener wrote:

@@ -560,6 +562,14 @@ fsm_find_control_statement_thread_paths
  edge taken_edge = profitable_jump_thread_path (path, bbi, name,
arg);
  if (taken_edge)
{
+ /* If the taken edge is a loop entry avoid mashing two
+loops into one with multiple latches by splitting
+the edge.  This only works if that block won't become
+a latch of this loop.  */
+ if ((bb_loop_depth (taken_edge->src)
+  < bb_loop_depth (taken_edge->dest))
+ && ! single_succ_p (bbi))
+   split_edge (taken_edge);
  if (bb_loop_depth (taken_edge->src)
  >= bb_loop_depth (taken_edge->dest))
convert_and_register_jump_thread_path (path, taken_edge);

note you need the PR72772 fix to trigger all this.

I'm a little confused here.  In the case where the taken edge goes into a
deeper loop nest you're splitting the edge -- to what end?  The backwards
threader isn't going to register that jump thread.  So if this is fixing
something, then we've got the fix in the wrong place.


Ok, so I've figured that splitting the edge is indeed pointless unless
it does exactly the same as creating a forwarder.  We may not blindly
thread backwards to a loop header because of correctness issues in
re-using old loop meta-data for that loop (and in the
ubsan.exp=pr71403-2.c case miscompiling the testcase).  What we need
is a forwarder block we can thread to which eventually becomes the
new loop header.  Note this is also what we'd achieve with properly
initializing loops in the threader - something we should do anyway when
looking at loop metadata.  This is likely also why the old threader has
(in DOM):

  /* We need to know loop structures in order to avoid destroying them
 in jump threading.  Note that we still can e.g. thread through loop
 headers to an exit edge, or through loop header to the loop body,
assuming
 that we update the loop info.

 TODO: We don't need to set LOOPS_HAVE_PREHEADERS generally, but due
 to several overly conservative bail-outs in jump threading, case
 gcc.dg/tree-ssa/pr21417.c can't be threaded if loop preheader is
 missing.  We should improve jump threading in future then
 LOOPS_HAVE_PREHEADERS won't be needed here.  */
  loop_optimizer_init (LOOPS_HAVE_PREHEADERS | LOOPS_HAVE_SIMPLE_LATCHES);

thus we run into exactly those cases now in the FSM threader.  Thus
the following patch fixes the two testcases with the PR72772 fix
applied as well.

Right.  I'm pretty sure there's a BZ around this issue in my queue :-)



It's also possible to create a forwarder on demand at the place I
split the edge and avoid creating preheaders for all loops, but as
we should initialize the loop optimizer to be able to look at loop
metadata anyway, there's not much point in doing that.

Bootstrap / regtest pending on x86_64-unknown-linux-gnu, ok for trunk?

Thanks,
Richard.

2016-08-09  Richard Biener  

* tree-ssa-threadbackward.c (pass_data_thread_jumps): Remove
unconditional TODO_cleanup_cfg.
(pass_thread_jumps::execute): Initialize loops, perform a CFG
cleanup only if we threaded a jump.

OK.
jeff



Re: Early jump threading

2016-08-11 Thread Jeff Law

On 08/11/2016 09:50 AM, Richard Biener wrote:



Ah, I thought it was exclusively dealing with threading through back
edges, which is something I'd avoid doing early?
No, that's one of the fundamental changes we made for gcc-6, namely 
using the FSM threader for general purpose threading rather than just 
using it for threading across loop backedges.


I've got a few things to do first, but the direction I want to go is to 
use the FSM threader exclusively.


Jeff


Re: Early jump threading

2016-08-11 Thread Jeff Law

On 08/11/2016 08:27 AM, Jan Hubicka wrote:


On tramp3d all VRP passes together thread 730 branches, all DOM passes 393, so
FSM threading (with 1957 branches) is the most effective one.  Perhaps eventually
early VRP can also do a bit of work.
That's roughly consistent with what I've seen.  I have some thoughts on 
what DOM's threading pass is catching that we're missing in the 
backward/FSM threader, but I haven't had time to see how big those 
effects really are.




I am not 100% sure where "backward" comes from.  I guess it means that the
analysis goes backward from conditionals to definitions: it looks for a
conditional driven by a PHI statement that has a constant value on some paths
and duplicates for those.  This seems a cheap and rather effective way of getting
a good part of the threading opportunities (the most expensive part is probably
identifying and walking paths that will not be threaded in the end).
Correct, forward/backward is based on the direction of analysis.  I.e., do
we start at the conditional and work backwards through the USE-DEF chain,
or do we build an equivalence table as we walk forward through the entire
function and use the equivalence tables to simplify the conditional.





BTW I wonder if the same analysis can't be done for other instructions where 
constant
operands are very profitable, like division or multiplication.
Yes.  There's all kinds of problems that can be solved using a backwards 
walk to identify useful properties on paths, then path duplication to 
isolate the use point and modify it.


So you could (for example) walk backwards from an indirect call and 
identify paths through the CFG where we know the target.  We then can 
use path duplication to isolate that path from the rest and call the 
target function directly.



Jeff



[PATCH, rs6000] Fix PR72863 (swap optimization misses swaps generated from intrinsics)

2016-08-11 Thread Bill Schmidt
Hi,

Anton reports in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72863 that use of
vec_vsx_ld and vec_vsx_st intrinsics leaves the endian swaps in the generated
code, even for very simple computations.  This turns out to be because we don't
generate the swaps at expand time as we do with other vector moves; rather, they
don't get generated until split time.  This patch fixes the problem in the
obvious way.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
One new test case added.  Is this ok for trunk?

I would also like to backport this to 6 and 5 branches after some burn-in time.
I do not plan to rush this into 6.2; we'll have to wait for 6.3 as this is only
a performance issue, albeit an important one.

Thanks,
Bill


[gcc]

2016-08-11  Bill Schmidt  

PR target/72863
* vsx.md (vsx_load_<mode>): For P8LE, emit swaps at expand time.
(vsx_store_<mode>): Likewise.

[gcc/testsuite]

2016-08-11  Bill Schmidt  

PR target/72863
* gcc.target/powerpc/pr72863.c: New test.


Index: gcc/config/rs6000/vsx.md
===
--- gcc/config/rs6000/vsx.md(revision 239310)
+++ gcc/config/rs6000/vsx.md(working copy)
@@ -922,13 +922,27 @@
   [(set (match_operand:VSX_M 0 "vsx_register_operand" "")
(match_operand:VSX_M 1 "memory_operand" ""))]
  "VECTOR_MEM_VSX_P (<MODE>mode)"
-  "")
+{
+  /* Expand to swaps if needed, prior to swap optimization.  */
+  if (!BYTES_BIG_ENDIAN && !TARGET_P9_VECTOR)
+{
+  rs6000_emit_le_vsx_move (operands[0], operands[1], <MODE>mode);
+  DONE;
+}
+})
 
 (define_expand "vsx_store_<mode>"
   [(set (match_operand:VSX_M 0 "memory_operand" "")
(match_operand:VSX_M 1 "vsx_register_operand" ""))]
  "VECTOR_MEM_VSX_P (<MODE>mode)"
-  "")
+{
+  /* Expand to swaps if needed, prior to swap optimization.  */
+  if (!BYTES_BIG_ENDIAN && !TARGET_P9_VECTOR)
+{
+  rs6000_emit_le_vsx_move (operands[0], operands[1], <MODE>mode);
+  DONE;
+}
+})
 
 ;; Explicit load/store expanders for the builtin functions for lxvd2x, etc.,
 ;; when you really want their element-reversing behavior.
Index: gcc/testsuite/gcc.target/powerpc/pr72863.c
===
--- gcc/testsuite/gcc.target/powerpc/pr72863.c  (revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr72863.c  (working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile { target { powerpc64le-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { 
"-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O3" } */
+/* { dg-final { scan-assembler "lxvd2x" } } */
+/* { dg-final { scan-assembler "stxvd2x" } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
+
+#include <altivec.h>
+
+extern unsigned char *src, *dst;
+
+void b(void)
+{
+  int i;
+
+  unsigned char *s8 = src;
+  unsigned char *d8 = dst;
+
+  for (i = 0; i < 100; i++) {
+vector unsigned char vs = vec_vsx_ld(0, s8);
+vector unsigned char vd = vec_vsx_ld(0, d8);
+vector unsigned char vr = vec_xor(vs, vd);
+vec_vsx_st(vr, 0, d8);
+s8 += 16;
+d8 += 16;
+  }
+}



Re: Early jump threading

2016-08-11 Thread Jeff Law

On 08/11/2016 08:06 AM, Richard Biener wrote:

On Thu, 11 Aug 2016, Jan Hubicka wrote:


Hi,
this patch adds an early jump threading pass.  Jump threading is one of the most
common cases where the estimated profile becomes corrupted, because the branches
are predicted independently beforehand.  This patch simply adds an early mode to
the jump threader that does not permit code growth, and thus only win/win threads
are done before the profile is constructed.

Naturally this also improves IPA decisions because code size/time is estimated
more accurately.

It is not very cool to add another pass, and the jump threader is already
run 5 times.  I think incrementally we could drop one of the late threaders at least.
I tried to measure compile-time effects but they are in the wash.  Moreover, the patch
pays for itself in cc1plus code size:

Before patch to tweak size estimates: 27779964
Current mainline: 27748900
With patch applied:   27716173

So despite adding a few functions the code size effect is positive, which I think
is quite nice.

Given the fact that jump threading seems quite lightweight, I wonder why it is
guarded by flag_expensive_optimizations? Is it expensive in terms of code
size?

The effectiveness of individual threading passes is as follows (for tramp3d)

                mainline                      with patch
pass   thread count  profile mismatches   thread count  profile mismatch
early                                     525
1      1853          1900                 316           337
2      4             812                  4             288
3      24            1450                 32            947
4      76            1457                 75            975

So at least the tramp3d data seems to suggest that we can drop the second occurrence
of jump threading and that most of the job thread1 does can be done by the
size-restricted early version (the lower absolute counts are caused by the
fact that threadable paths get duplicated by the inliner).  A 50% drop in
profile distortion is not bad.  I wonder why basically every threaded path seems
to introduce a mismatch.

The patch distorts the testsuite somewhat; in most cases we only want to update
dump files or disable early threading:

+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
+FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
+FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 
1
+FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around 
loop and would copy too many statements"
+FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps 
threaded: 1" 1
+FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
+FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 
1" 1
+FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
+FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: [1, 1]"

This testcase is for the now probably unnecessary heuristics (it may still be
relevant in cases we do not thread because of size bounds, but its effectiveness
may not be worth the maintenance cost):

+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  

Re: Early jump threading

2016-08-11 Thread Jeff Law

On 08/11/2016 08:02 AM, Jan Hubicka wrote:

Hi,
this patch adds an early jump threading pass.  Jump threading is one of the most
common cases where the estimated profile becomes corrupted, because the branches
are predicted independently beforehand.  This patch simply adds an early mode to
the jump threader that does not permit code growth, and thus only win/win threads
are done before the profile is constructed.

Naturally this also improves IPA decisions because code size/time is estimated
more accurately.
Excellent.  One of the goals here was to enable you to run it early, so 
I'm glad to see it's working out.




It is not very cool to add another pass, and the jump threader is already
run 5 times.  I think incrementally we could drop one of the late threaders at least.
I tried to measure compile-time effects but they are in the wash.  Moreover, the patch
pays for itself in cc1plus code size:
Most definitely we want to be dropping calls into the later passes. 
There's analysis to do, but the goal is to drop the old style threading 
from DOM/VRP completely.   It may also be the case that one or more 
passes of the backwards/FSM threader can be avoided, but we should look 
at removing the DOM/VRP threading first.




Before patch to tweak size estimates: 27779964
Current mainline: 27748900
With patch applied:   27716173

So despite adding a few functions the code size effect is positive, which I think
is quite nice.

Given the fact that jump threading seems quite lightweight, I wonder why it is
guarded by flag_expensive_optimizations? Is it expensive in terms of code
size?
The DOM/VRP jump threading used to be very expensive because of 
iteration and the ssa updates.  Both of those issues have since been 
addressed.   The backwards/FSM threader is based on an algorithm that 
IIRC is cubic, and we also have some implementation issues that cause it 
to use far more time than it should.


However, in an early mode, we don't want to walk backwards through the 
CFG very far (if at all) and for an early mode we may not want to guard 
on flag_expensive_optimizations.




The effectiveness of individual threading passes is as follows (for tramp3d)

                mainline                      with patch
pass   thread count  profile mismatches   thread count  profile mismatch
early                                     525
1      1853          1900                 316           337
2      4             812                  4             288
3      24            1450                 32            947
4      76            1457                 75            975

So at least the tramp3d data seems to suggest that we can drop the second occurrence
of jump threading and that most of the job thread1 does can be done by the
size-restricted early version (the lower absolute counts are caused by the
fact that threadable paths get duplicated by the inliner).  A 50% drop in
profile distortion is not bad.  I wonder why basically every threaded path seems
to introduce a mismatch.
I don't think the backwards/FSM threader tries to update the profile 
data at all right now.




The patch distorts the testsuite somewhat; in most cases we only want to
update dump files or disable early threading:

+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
+FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
+FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 
1
+FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around 
loop and would copy too many statements"
+FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps 
threaded: 1" 1
+FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
+FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 
1" 1
+FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
+FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: [1, 1]"

This testcase exercises the now probably unnecessary heuristics (they may
still be relevant in cases we do not thread because of size bounds, but
their effectiveness may not be worth the maintenance cost):

+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3

Re: [v3 PATCH] Implement C++17 make_from_tuple.

2016-08-11 Thread Ville Voutilainen
On 11 August 2016 at 20:45, Ed Smith-Rowland <3dw...@verizon.net> wrote:
> On 08/10/2016 08:04 PM, Ville Voutilainen wrote:
>
> I was in the middle of doing this, so here's the patch before I commit
> the string_view one.
>
> Tested on Linux-x64.
>
> 2016-08-11  Ville Voutilainen  
>
> Implement C++17 make_from_tuple.
> * include/std/tuple (__make_from_tuple_impl, make_from_tuple): New.
> * testsuite/20_util/tuple/make_from_tuple/1.cc: Likewise.
>
> The latest SD-6 has a feature test macro:
>
> #define __cpp_lib_make_from_tuple  201606
>
>
> Also the test should get:
>
> #ifndef __cpp_lib_make_from_tuple
> # error "Feature-test macro for make_from_tuple missing"
> #elif __cpp_lib_make_from_tuple != 201606
> # error "Feature-test macro for make_from_tuple has wrong value"
> #endif


Thanks, I tried to find a recent SD-6 when looking for a feature macro
for make_from_tuple, but couldn't find one.
I was not looking in the right place. :)
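For reference, a sketch of the usual implementation technique behind make_from_tuple (names here are hypothetical; per the ChangeLog above, libstdc++'s actual helper is __make_from_tuple_impl): expand the tuple into a constructor call through an index_sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

// Sketch (C++14): my_make_from_tuple stands in for std::make_from_tuple.
template <typename T, typename Tuple, std::size_t... I>
T make_from_tuple_impl (Tuple&& t, std::index_sequence<I...>)
{
  // Unpack every tuple element into T's constructor.
  return T (std::get<I> (std::forward<Tuple> (t))...);
}

template <typename T, typename Tuple>
T my_make_from_tuple (Tuple&& t)
{
  return make_from_tuple_impl<T> (
    std::forward<Tuple> (t),
    std::make_index_sequence<
      std::tuple_size<typename std::decay<Tuple>::type>::value> ());
}

struct point
{
  int x, y;
  point (int a, int b) : x (a), y (b) {}
};
```

So `my_make_from_tuple<point> (std::make_tuple (3, 4))` constructs `point{3, 4}`.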


Re: protected alloca class for malloc fallback

2016-08-11 Thread Trevor Saunders
On Thu, Aug 11, 2016 at 09:18:34PM +0900, Oleg Endo wrote:
> On Wed, 2016-08-10 at 21:31 -0400, Trevor Saunders wrote:
> 
> > 
> > Well, I'd say the compromises made in std::string make it pretty
> > terrible for general-purpose use, except where perf and memory don't
> > matter at all.
> > 
> > std::vector isn't very great size-wise either; imho the size /
> > capacity fields would be much better put in the buffer than in the
> > struct itself, to save 2 * sizeof (void *) in sizeof std::vector.
> 
> There's nothing in the standard that says those fields have to go into
> the vector struct/class itself and not into the data buffer.  It just
> happens to be the way it's implemented in libstdc++.  E.g. libc++
> implements some things in different ways to save a few bytes here and
> there.
> 
> It looks more practical to put those fields into the container itself,
> otherwise you'd have to do a nullptr check every time you access the
> size field etc.  Although newer compilers optimize away the redundant
> nullptr checks, older compilers (which GCC wants to be good with) might
> not do it.  In any case, it seems to generate slightly bigger code, at
> least in one instance (see attachment).

If you are clever you can have an empty global vector that you point all
the empty vectors at, and so eliminate the null check.  Well, I guess
that doesn't work quite as well if the global needs to be accessed
through a PLT, sigh, but at least you only need to check against it on
reallocation, which is rather heavyweight anyway.
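A minimal sketch of that trick (hypothetical names; not GCC's or libstdc++'s actual code): pointing empty vectors at a shared static header makes size ()/capacity () branch-free, and only the already-heavyweight reallocation path compares against it.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Size/capacity live in the heap buffer; empty vectors share this header.
struct buffer_header { unsigned size; unsigned capacity; };
static buffer_header empty_header = { 0, 0 };

template <typename T>
class small_vector
{
public:
  small_vector () : m_buffer (&empty_header) {}
  ~small_vector () { if (m_buffer != &empty_header) free (m_buffer); }

  unsigned size () const { return m_buffer->size; }          // no null check
  unsigned capacity () const { return m_buffer->capacity; }  // no null check
  T *data () { return reinterpret_cast<T *> (m_buffer + 1); }
  T &operator[] (unsigned i) { return data ()[i]; }

  void push_back (const T &v)
  {
    // Only the reallocation slow path needs to know about empty_header.
    if (m_buffer->size == m_buffer->capacity)
      grow ();
    data ()[m_buffer->size++] = v;
  }

private:
  void grow ()
  {
    unsigned new_cap = capacity () ? 2 * capacity () : 4;
    buffer_header *nb = static_cast<buffer_header *>
      (malloc (sizeof (buffer_header) + new_cap * sizeof (T)));
    nb->size = size ();
    nb->capacity = new_cap;
    if (m_buffer != &empty_header)
      {
        memcpy (nb + 1, m_buffer + 1, size () * sizeof (T));
        free (m_buffer);
      }
    m_buffer = nb;
  }
  buffer_header *m_buffer;  // sizeof (small_vector) == sizeof (void *)
};
```

The payoff is the footprint Trevor mentions: an empty vector is a single pointer.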

> Anyway, normally there is more data in the vectors than there are
> vectors around, so whether sizeof (vector) is 8 or 24 bytes doesn't
> matter.

It matters if you have any empty vectors around.  The only numbers I can
find are https://bugzilla.mozilla.org/show_bug.cgi?id=1007846 but I seem
to remember seeing that for Firefox at least most vectors contained 0
elements.

> > 
> > or just having your stack_string type convert to the normal string 
> > type so that you can pass mutable references.
> 
> But that would imply copying, wouldn't it?

Not if you are clever: you can use the same trick gcc uses in vec /
auto_vec with strings.
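A sketch of how that vec/auto_vec trick could look for strings (hypothetical names, simplified to omit the malloc fallback): the derived class embeds the inline buffer, and callees take a mutable reference to the base class, so no copying happens.

```cpp
#include <cassert>
#include <cstring>

// Base class holds pointer/length/capacity only; storage comes from the
// derived stack_string's inline buffer (or, in real code, the heap).
class str
{
public:
  const char *c_str () const { return m_data; }
  unsigned length () const { return m_len; }
  void append (const char *s)
  {
    unsigned n = strlen (s);
    if (m_len + n + 1 > m_cap)
      return;  // real code would fall back to malloc here
    memcpy (m_data + m_len, s, n + 1);
    m_len += n;
  }
protected:
  str (char *buf, unsigned cap) : m_data (buf), m_len (0), m_cap (cap)
  { m_data[0] = '\0'; }
  char *m_data;
  unsigned m_len, m_cap;
};

// Stack-allocated variant with N bytes of inline storage.
template <unsigned N>
class stack_string : public str
{
public:
  stack_string () : str (m_inline, N) {}
private:
  char m_inline[N];
};

// Callees need neither a template parameter nor a copy:
static void add_suffix (str &s) { s.append (".c"); }
```

This mirrors how an auto_vec<T, N> converts to vec<T>& in GCC: the caller picks the storage, the callee sees only the base type.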

Trev

> 
> Cheers,
> Oleg

> 
> #if 0
> template <typename T>
> class vector
> {
> public:
>   unsigned int size (void) const { return m_buffer != nullptr ? 
> m_buffer->size : 0; }
>   bool empty (void) const { return size () == 0; }
> 
>   unsigned int capacity (void) const { return m_buffer != nullptr ? 
> m_buffer->capacity : 0; }
> 
>   T* data (void) { return (T*)(m_buffer + 1); }
>   const T* data (void) const { return (const T*)(m_buffer + 1); }
> 
>   T& operator [] (unsigned int i) { return data ()[i]; }
>   const T& operator [] (unsigned int i) const { return data ()[i]; }
> 
> private:
>   struct buffer_header
>   {
> unsigned int size;
> unsigned int capacity;
>   };
> 
>   buffer_header* m_buffer;
> };
> 
> 
> int foo (const vector<int>& x)
> {
>   if (x.empty ())
> return 0;
> 
>   int r = 0;
>   for (unsigned int i = 0; i < x.size (); ++i)
> r += x[i];
> 
>   return r;
> }
> 
> /*
> -O2 -m2a
>   mov.l   @r4,r2
>   tst r2,r2
>   bt  .L5
>   mov.l   @r2,r1
>   tst r1,r1
>   bt.s.L5
>   shll2   r1
>   add #-4,r1
>   shlr2   r1
>   add #8,r2
>   mov #0,r0
>   add #1,r1
>   .align 2
> .L3:
>   mov.l   @r2+,r3
>   dt  r1
>   bf.s.L3
>   add r3,r0
>   rts/n
>   .align 1
> .L5:
>   rts
>   mov #0,r0
> */
> #endif
> 
> 
> #if 1
> 
> template <typename T>
> class vector
> {
> public:
>   unsigned int size (void) const { return m_size; }
>   bool empty (void) const { return size () == 0; }
> 
>   unsigned int capacity (void) const { return m_capacity; }
> 
>   T* data (void) { return (T*)(m_buffer); }
>   const T* data (void) const { return (const T*)(m_buffer); }
> 
>   T& operator [] (unsigned int i) { return data ()[i]; }
>   const T& operator [] (unsigned int i) const { return data ()[i]; }
> 
> private:
>   unsigned int m_size;
>   unsigned int m_capacity;
>   T* m_buffer;
> };
> 
> 
> int foo (const vector<int>& x)
> {
>   if (x.empty ())
> return 0;
> 
>   int r = 0;
>   for (unsigned int i = 0; i < x.size (); ++i)
> r += x[i];
> 
>   return r;
> }
> 
> /*
> -O2 -m2a
>   mov.l   @r4,r1
>   tst r1,r1
>   bt.s.L7
>   mov #0,r0
>   shll2   r1
>   mov.l   @(8,r4),r2
>   add #-4,r1
>   shlr2   r1
>   add #1,r1
>   .align 2
> .L3:
>   mov.l   @r2+,r3
>   dt  r1
>   bf.s.L3
>   add r3,r0
> .L7:
>   rts/n
> */
> 
> #endif
> 



Re: [PATCH] Support TImode CONST_WIDE_INT store in 64-bit STV

2016-08-11 Thread Uros Bizjak
On Thu, Aug 11, 2016 at 6:54 PM, H.J. Lu  wrote:
> Support TImode CONST_WIDE_INT store generated from piecewise store.
> Need to verify performance impact before enabling TImode CONST_INT
> store for __int128.
>
> Tested on x86-64.  OK for trunk?

OK.

Thanks,
Uros.

> H.J.
> ---
> gcc/
>
> * config/i386/i386.c (timode_scalar_to_vector_candidate_p): Allow
> TImode CONST_WIDE_INT store.
> (timode_scalar_chain::convert_insn): Handle CONST_WIDE_INT store.
>
> gcc/testsuite/
>
> * gcc.target/i386/pieces-strcpy-1.c: New test.
> * gcc.target/i386/pieces-strcpy-2.c: Likewise.
> ---
>  gcc/config/i386/i386.c  | 23 ---
>  gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c | 15 +++
>  gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c | 15 +++
>  3 files changed, 50 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 93eaab1..d086ede 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -2862,9 +2862,12 @@ timode_scalar_to_vector_candidate_p (rtx_insn *insn)
>
>if (MEM_P (dst))
>  {
> -  /* Check for store.  Only support store from register or standard
> -SSE constants.  Memory must be aligned or unaligned store is
> -optimal.  */
> +  /* Check for store.  Memory must be aligned or unaligned store
> +is optimal.  Only support store from register, standard SSE
> +constant or CONST_WIDE_INT generated from piecewise store.
> +
> +??? Verify performance impact before enabling CONST_INT for
> +__int128 store.  */
>if (misaligned_operand (dst, TImode)
>   && !TARGET_SSE_UNALIGNED_STORE_OPTIMAL)
> return false;
> @@ -2875,6 +2878,7 @@ timode_scalar_to_vector_candidate_p (rtx_insn *insn)
>   return false;
>
> case REG:
> +   case CONST_WIDE_INT:
>   return true;
>
> case CONST_INT:
> @@ -3868,6 +3872,19 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)
>PUT_MODE (src, V1TImode);
>break;
>
> +case CONST_WIDE_INT:
> +  if (NONDEBUG_INSN_P (insn))
> +   {
> + /* Since there are no instructions to store 128-bit constant,
> +temporary register usage is required.  */
> + rtx tmp = gen_reg_rtx (V1TImode);
> + src = gen_rtx_CONST_VECTOR (V1TImode, gen_rtvec (1, src));
> + src = validize_mem (force_const_mem (V1TImode, src));
> + emit_conversion_insns (gen_rtx_SET (dst, tmp), insn);
> + dst = tmp;
> +   }
> +  break;
> +
>  case CONST_INT:
>switch (standard_sse_constant_p (src, TImode))
> {
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c 
> b/gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c
> new file mode 100644
> index 000..64b7329
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
> +
> +extern char *strcpy (char *, const char *);
> +
> +void
> +foo (char *s)
> +{
> +  strcpy (s,
> + "1234567890abcdef123456abcdef5678123456abcdef567abcdef678"
> + "1234567");
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqa\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> +/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c 
> b/gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c
> new file mode 100644
> index 000..7421255
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */
> +
> +extern char *strcpy (char *, const char *);
> +
> +void
> +foo (char *s)
> +{
> +  strcpy (s,
> + "1234567890abcdef123456abcdef5678123456abcdef567abcdef678"
> + "1234567");
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqa\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> +/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> --
> 2.7.4
>


[PATCH] Support TImode CONST_WIDE_INT store in 64-bit STV

2016-08-11 Thread H.J. Lu
Support TImode CONST_WIDE_INT store generated from piecewise store.
Need to verify performance impact before enabling TImode CONST_INT
store for __int128.

Tested on x86-64.  OK for trunk?
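For context, the shape of source the patch targets is essentially the new testcases: inline expansion of strcpy from a long literal is done piecewise, and each 16-byte chunk becomes a TImode store whose constant operand is a CONST_WIDE_INT that STV can now convert to a V1TImode (SSE) store.

```cpp
#include <cassert>
#include <cstring>

// With -O2 on x86-64, this copy is expanded inline by the by-pieces
// machinery; the 64 bytes (63 chars + NUL) become four 16-byte stores
// of CONST_WIDE_INT constants, which the patch lets STV vectorize.
void foo (char *s)
{
  strcpy (s,
          "1234567890abcdef123456abcdef5678123456abcdef567abcdef678"
          "1234567");
}
```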

H.J.
---
gcc/

* config/i386/i386.c (timode_scalar_to_vector_candidate_p): Allow
TImode CONST_WIDE_INT store.
(timode_scalar_chain::convert_insn): Handle CONST_WIDE_INT store.

gcc/testsuite/

* gcc.target/i386/pieces-strcpy-1.c: New test.
* gcc.target/i386/pieces-strcpy-2.c: Likewise.
---
 gcc/config/i386/i386.c  | 23 ---
 gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c | 15 +++
 gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c | 15 +++
 3 files changed, 50 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 93eaab1..d086ede 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2862,9 +2862,12 @@ timode_scalar_to_vector_candidate_p (rtx_insn *insn)
 
   if (MEM_P (dst))
 {
-  /* Check for store.  Only support store from register or standard
-SSE constants.  Memory must be aligned or unaligned store is
-optimal.  */
+  /* Check for store.  Memory must be aligned or unaligned store
+is optimal.  Only support store from register, standard SSE
+constant or CONST_WIDE_INT generated from piecewise store.
+
+??? Verify performance impact before enabling CONST_INT for
+__int128 store.  */
   if (misaligned_operand (dst, TImode)
  && !TARGET_SSE_UNALIGNED_STORE_OPTIMAL)
return false;
@@ -2875,6 +2878,7 @@ timode_scalar_to_vector_candidate_p (rtx_insn *insn)
  return false;
 
case REG:
+   case CONST_WIDE_INT:
  return true;
 
case CONST_INT:
@@ -3868,6 +3872,19 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)
   PUT_MODE (src, V1TImode);
   break;
 
+case CONST_WIDE_INT:
+  if (NONDEBUG_INSN_P (insn))
+   {
+ /* Since there are no instructions to store 128-bit constant,
+temporary register usage is required.  */
+ rtx tmp = gen_reg_rtx (V1TImode);
+ src = gen_rtx_CONST_VECTOR (V1TImode, gen_rtvec (1, src));
+ src = validize_mem (force_const_mem (V1TImode, src));
+ emit_conversion_insns (gen_rtx_SET (dst, tmp), insn);
+ dst = tmp;
+   }
+  break;
+
 case CONST_INT:
   switch (standard_sse_constant_p (src, TImode))
{
diff --git a/gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c 
b/gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c
new file mode 100644
index 000..64b7329
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-strcpy-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *strcpy (char *, const char *);
+
+void
+foo (char *s)
+{
+  strcpy (s,
+ "1234567890abcdef123456abcdef5678123456abcdef567abcdef678"
+ "1234567");
+}
+
+/* { dg-final { scan-assembler-times "movdqa\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c 
b/gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c
new file mode 100644
index 000..7421255
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */
+
+extern char *strcpy (char *, const char *);
+
+void
+foo (char *s)
+{
+  strcpy (s,
+ "1234567890abcdef123456abcdef5678123456abcdef567abcdef678"
+ "1234567");
+}
+
+/* { dg-final { scan-assembler-times "vmovdqa\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
-- 
2.7.4



Re: [WIP] [PR fortran/72741] Rework Fortran OpenACC routine clause handling

2016-08-11 Thread Cesar Philippidis
On 08/11/2016 08:18 AM, Thomas Schwinge wrote:

>> --- a/gcc/fortran/module.c
>> +++ b/gcc/fortran/module.c
> 
>>  [...]
>> +DECL_MIO_NAME (oacc_function)
>>  [...]
> 
> As discussed between Cesar and Tobias, these module.c/symbol.c changes
> introduce an incompatibility in the Fortran module file format, which
> we'll want to avoid.  Reverting that to use individual bit flags instead
> of the "enum oacc_function", I think that we're safe (but I have not
> verified that).  On the other hand, given that I'm not at all handling in
> module.c/symbol.c the new "locus omp_clauses_locus" and "struct
> symbol_attribute *next" members that I'm adding to "symbol_attribute",
> I'm not sure whether I'm actually testing this properly.  ;-) I guess I'm
> not.

How are you testing it?  Basically, what you need to do is create two
source files, one containing a module and another with the program unit.
Then compile one of those files with an older gfortran, say GCC 6's, and
the other with trunk gfortran, and try to link the .o files together.

I've attached some test cases for you to experiment with. Each
driver file corresponds to a test file, with the exception of
test-driver which uses both test-interface.f90 and test-module.f90.

>> --- a/gcc/fortran/openmp.c
>> +++ b/gcc/fortran/openmp.c
> 
>> @@ -1814,7 +1824,10 @@ gfc_match_oacc_routine (void)
>>!= MATCH_YES))
>>  return MATCH_ERROR;
>>  
>> -  if (sym != NULL)
>> +  if (isym != NULL)
>> +/* There is nothing to do for intrinsic procedures.  */
>> +;
> 
> We will want to check that no incompatible clauses are being specified,
> for example (but, low priority).  I'm adding a hacky implementation of
> that.

So this is what I was overlooking in PR72741. For some reason I was only
considering invalid clauses of the form

  !$acc routine gang worker

and not actually checking for compatible parallelism at the call sites.
The title "Fortran OpenACC routine directive doesn't properly handle
clauses specifying the level of parallelism" was kind of misleading.

Shouldn't the oaccdevlow pass already catch these types of errors?

>> --- /dev/null
>> +++ b/gcc/testsuite/gfortran.dg/goacc/routine-7.f90
>> @@ -0,0 +1,69 @@
>> +! Test acc routines inside modules.
>> +
>> +! { dg-additional-options "-O0" }
> 
> -O0 to prevent inlining of functions tagged with OpenACC routine
> directives, or another reason?

I'm not sure why, but that's probably it.

>> --- a/libgomp/testsuite/libgomp.oacc-fortran/routine-7.f90
>> +++ b/libgomp/testsuite/libgomp.oacc-fortran/routine-7.f90
>> @@ -1,121 +1,95 @@
>> +! Test acc routines inside modules.
>>  
>>  ! { dg-do run }
>> -! { dg-additional-options "-cpp" }
>>  
>> -#define M 8
>> -#define N 32
>> +module routines
>> +  integer, parameter :: N = 32
>>  
>> -program main
>> -  integer :: i
>> -  integer :: a(N)
>> -  integer :: b(M * N)
>> -
>> -  do i = 1, N
>> -a(i) = 0
>> -  end do
>> +contains
>> +  subroutine vector (a)
>> +implicit none
>> +!$acc routine vector
>> +integer, intent (inout) :: a(N)
>> +integer :: i
>> [...]
> 
> This seems to completely rewrite the test case.  Is that intentional, or
> should the original test case be preserved, and the changed/new/rewritten
> one be added as a new test case?

The original test was completely bogus because it had a lot of race
conditions when writing to variables. I could restore it, but then you'd
need to remove all of the gang, worker, vector clauses and force it to
run in seq. But that would defeat the intent behind the patch.

> Now, my hacky WIP patch.
> 
> One big chunk of the gcc/fortran/gfortran.h changes is just to move some
> stuff around, without any changes, so that I can use "locus" in
> "symbol_attribute".

I agree with Jakub about creating a new gfc_acc_routine struct to
contain the locus and clauses for acc routines. That way, you can also
link them together for device_type.

But at the same time, since device_type isn't a priority in the near
term, we might be better off using the existing oacc_function and nohost
attribute bits instead of introducing a new struct.

> I very much "cargo cult"ed all that "oacc_routine*" bit flag stuff in
> module.c/symbol.c, replicating what's being done for "omp target
> declare", without really knowing what I'm doing there.  I will appreciate
> test cases actually exercising this code -- which doesn't currently at
> all handle the new "locus omp_clauses_locus" and "struct symbol_attribute
> *next" members that I'm adding to "symbol_attribute", as I've mentioned
> before.  (But I suppose it should?)
> 
> We're not implementing the OpenACC device_type clause at the moment, so
> the "TODO: handle device_type clauses" comment in
> gcc/fortran/openmp.c:gfc_match_oacc_routine is not a concern right now.
> 
> With these changes, we're now actually also paying attention the clauses
> specified with the OpenACC routine directive with a name -- one of the
> things mentioned as missing in 

Re: [WIP] [PR fortran/72741] Rework Fortran OpenACC routine clause handling (was: [PATCH] OpenACC routines in fortran modules)

2016-08-11 Thread Jakub Jelinek
On Thu, Aug 11, 2016 at 06:26:50PM +0200, Thomas Schwinge wrote:
> > > --- gcc/fortran/gfortran.h
> > > +++ gcc/fortran/gfortran.h
> 
> > >  /* Symbol attribute structure.  */
> > > -typedef struct
> > > +typedef struct symbol_attribute
> > >  {
> 
> > While symbol_attribute is already bloated, I don't like bloating it this
> > much further.  Do you really need it for all symbols, or just all 
> > subroutines?
> 
> Certainly not for all symbols; just for what is valid to be used with the
> OpenACC routine directive, which per OpenACC 2.0a, 2.13.1 Routine
> Directive is:
> 
> In Fortran the syntax of the routine directive is:
> !$acc routine clause-list
> !$acc routine( name ) clause-list
> In Fortran, the routine directive without a name may appear within the 
> specification part of a subroutine or function definition, or within an 
> interface body for a subroutine or function in an interface block, and 
> applies to the containing subroutine or function. The routine directive with 
> a name may appear in the specification part of a subroutine, function or 
> module, and applies to the named subroutine or function.
> 
> (Pasting that in full just in case that contains some additional Fortran
> lingo, meaning more than "subroutines".)

By "subroutines" I've meant of course also functions, those have their own
namespace structure too.

> > omp_clauses_locus makes no sense, symbol_attribute contains parsed info from
> > many different clauses, which one it is?
> 
> Well, it makes some sense -- it works no worse than the existing code ;-)
> -- but I agree that it's not exactly pretty.  To the best of my
> knowledge, in Fortran OpenACC/OpenMP clauses parsing, we're currently not
> tracking (saving) specific location information for individual clauses
> (at least, that's what a casual scan through the code, and
> gfc_match_oacc_routine or gfc_match_omp_declare_target in particular make
> me think: gfc_omp_clauses collects all clause data, but only contains a
> single "locus loc" member (which maybe I should have used instead of
> "old_loc", the location information for the directive itself?).  Maybe I
> misunderstood, and we do have more precise location information available
> for individual clauses?  In that case, I'll happily use that, of course.

The Fortran FE generally doesn't track locations of any of the attributes
symbols have; attributes as well as OpenMP clauses are represented just as
bits (for boolean stuff), etc.  Only if you have some expression do you
have a location for that expression.
I don't see what is so special on these clauses that they need to have
location tracked compared to say CONTIGUOUS or whatever other attribute, just
use the location of the function.  Unless of course you want to rewrite all
the Fortran FE data structures and track detailed locations for everything.
But just treating selected OpenACC clauses specially, ignoring how the FE is
structured, is at least inconsistent with the rest.

Jakub


[PATCH PR72817/PR73450]Fix wrong code caused by niter analyzer for NE_EXPR exit cond.

2016-08-11 Thread Bin Cheng
Hi,
I made a mistake when improving loop niter analysis for NE_EXPR exit
conditions.  It can result in wrong code, as reported in PR72817/PR73450.
The previous patch simplifies the condition:
base <= FINAL ; step > 0
base >= FINAL ; step < 0
into:
base - step < FINAL ; step > 0 && base - step doesn't underflow
base - step > FINAL ; step < 0 && base - step doesn't overflow
But this is not enough.  Since we adjusted IV.base to "new_base = IV.base
- IV.step" in the condition, we need to check that |FINAL - new_base| is a
multiple of |step| for the adjusted base.
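A hedged sketch of the loop shape involved (a reduction for illustration, not the PR testcase): with a != exit condition, the trip count is only well defined when the iv actually hits FINAL, i.e. when the distance is an exact multiple of the step; otherwise the iv steps over FINAL.

```cpp
#include <cassert>

// Counts iterations of a loop with exit condition (i != final).  After
// loop-ch such a loop is guarded by (base != final), so niter analysis
// reasons about new_base = base - step; the fix checks that
// |final - new_base| is an exact multiple of |step| before relying on
// the relaxed condition.
unsigned count_iters (unsigned base, unsigned final, unsigned step)
{
  unsigned n = 0;
  for (unsigned i = base; i != final; i += step)
    {
      ++n;
      if (n > 1000)  // guard for the sketch: the != loop may never exit
        break;
    }
  return n;
}
```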

This patch fixes the issue and revises the comment.  Bootstrapped and
tested on x86_64.  Is it OK?

Thanks,
bin

2016-08-11  Bin Cheng  

PR tree-optimization/72817
PR tree-optimization/73450
* tree-ssa-loop-niter.c (number_of_iterations_ne): Check
multiple_of_p for adjusted IV.base.

gcc/testsuite/ChangeLog
2016-08-11  Bin Cheng  

PR tree-optimization/72817
PR tree-optimization/73450
* gcc.dg/tree-ssa/pr72817.c: New test.
* gcc.dg/tree-ssa/pr73450.c: New test.
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index c740ffa..8851862 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -999,33 +999,36 @@ number_of_iterations_ne (struct loop *loop, tree type, 
affine_iv *iv,
   mpz_clear (max);
 
   /* Compute no-overflow information for the control iv.  This can be
- proven when below two conditions hold.
-
-   1) |FINAL - base| is an exact multiple of step.
-   2) IV evaluates toward FINAL at beginning, i.e:
+ proven when below two conditions are satisfied:
 
+   1) IV evaluates toward FINAL at beginning, i.e:
base <= FINAL ; step > 0
base >= FINAL ; step < 0
 
- Note the first condition holds, the second can be then relaxed
- to below condition.
+   2) |FINAL - base| is an exact multiple of step.
+
+ Unfortunately, it's hard to prove above conditions after pass loop-ch
+ because loop with exit condition (IV != FINAL) usually will be guarded
+ by initial-condition (IV.base - IV.step != FINAL).  In this case, we
+ can alternatively try to prove below conditions:
+
+   1') IV evaluates toward FINAL at beginning, i.e:
+   new_base = base - step < FINAL ; step > 0
+&& base - step doesn't underflow
+   new_base = base - step > FINAL ; step < 0
+&& base - step doesn't overflow
 
-   base - step < FINAL ; step > 0
- && base - step doesn't underflow
-   base - step > FINAL ; step < 0
- && base - step doesn't overflow
+   2') |FINAL - new_base| is an exact multiple of step.
 
- The relaxation is important because after pass loop-ch, loop
- with exit condition (IV != FINAL) will usually be guarded by
- pre-condition (IV.base - IV.step != FINAL).  Please refer to
- PR34114 as an example.
+ Please refer to PR34114 as an example of loop-ch's impact, also refer
+ to PR72817 as an example why condition 2') is necessary.
 
- Also note, for NE_EXPR, base equals to FINAL is a special case, in
+ Note, for NE_EXPR, base equals to FINAL is a special case, in
  which the loop exits immediately, and the iv does not overflow.  */
   if (!niter->control.no_overflow
   && (integer_onep (s) || multiple_of_p (type, c, s)))
 {
-  tree t, cond, relaxed_cond = boolean_false_node;
+  tree t, cond, new_c, relaxed_cond = boolean_false_node;
 
   if (tree_int_cst_sign_bit (iv->step))
{
@@ -1039,8 +1042,12 @@ number_of_iterations_ne (struct loop *loop, tree type, 
affine_iv *iv,
  if (integer_nonzerop (t))
{
  t = fold_build2 (MINUS_EXPR, type, iv->base, iv->step);
- relaxed_cond = fold_build2 (GT_EXPR, boolean_type_node,
- t, final);
+ new_c = fold_build2 (MINUS_EXPR, niter_type,
+  fold_convert (niter_type, t),
+  fold_convert (niter_type, final));
+ if (multiple_of_p (type, new_c, s))
+   relaxed_cond = fold_build2 (GT_EXPR, boolean_type_node,
+   t, final);
}
}
}
@@ -1056,8 +1063,12 @@ number_of_iterations_ne (struct loop *loop, tree type, 
affine_iv *iv,
  if (integer_nonzerop (t))
{
  t = fold_build2 (MINUS_EXPR, type, iv->base, iv->step);
- relaxed_cond = fold_build2 (LT_EXPR, boolean_type_node,
- t, final);
+ new_c = 

Re: [WIP] [PR fortran/72741] Rework Fortran OpenACC routine clause handling (was: [PATCH] OpenACC routines in fortran modules)

2016-08-11 Thread Thomas Schwinge
Hi!

As Cesar asked for it, there is now a Git branch
tschwinge/omp/pr72741-wip containing these changes (plus some other
pending changes that I didn't single out at this time), at
.
(I expect it does, but I didn't verify that this actually builds; I have
further changes on top of that.)  Cesar, please tell me if you'd like me
to push this to GitHub, in case you want to use their review/commentary
functions, or the like.


On Thu, 11 Aug 2016 17:40:26 +0200, Jakub Jelinek  wrote:
> On Thu, Aug 11, 2016 at 05:18:43PM +0200, Thomas Schwinge wrote:
> > --- gcc/fortran/gfortran.h
> > +++ gcc/fortran/gfortran.h

> >  /* Symbol attribute structure.  */
> > -typedef struct
> > +typedef struct symbol_attribute
> >  {

> While symbol_attribute is already bloated, I don't like bloating it this
> much further.  Do you really need it for all symbols, or just all subroutines?

Certainly not for all symbols; just for what is valid to be used with the
OpenACC routine directive, which per OpenACC 2.0a, 2.13.1 Routine
Directive is:

In Fortran the syntax of the routine directive is:
!$acc routine clause-list
!$acc routine( name ) clause-list
In Fortran, the routine directive without a name may appear within the 
specification part of a subroutine or function definition, or within an 
interface body for a subroutine or function in an interface block, and applies 
to the containing subroutine or function. The routine directive with a name may 
appear in the specification part of a subroutine, function or module, and 
applies to the named subroutine or function.

(Pasting that in full just in case that contains some additional Fortran
lingo, meaning more than "subroutines".)

> In the latter case, it is much better to add some openacc specific pointer
> into the namespace structure and stick everything you need into some custom
> structure it will refer to.  E.g. look at gfc_omp_declare_simd struct
> in ns->omp_declare_simd.

Thanks for the suggestion, I'll look into that.


> omp_clauses_locus makes no sense, symbol_attribute contains parsed info from
> many different clauses, which one it is?

Well, it makes some sense -- it works no worse than the existing code ;-)
-- but I agree that it's not exactly pretty.  To the best of my
knowledge, in Fortran OpenACC/OpenMP clauses parsing, we're currently not
tracking (saving) specific location information for individual clauses
(at least, that's what a casual scan through the code, and
gfc_match_oacc_routine or gfc_match_omp_declare_target in particular make
me think: gfc_omp_clauses collects all clause data, but only contains a
single "locus loc" member (which maybe I should have used instead of
"old_loc", the location information for the directive itself?).  Maybe I
misunderstood, and we do have more precise location information available
for individual clauses?  In that case, I'll happily use that, of course.


Grüße
 Thomas


Re: [PATCH] Use TImode for piecewise move in 64-bit mode

2016-08-11 Thread Uros Bizjak
On Thu, Aug 11, 2016 at 6:12 PM, Uros Bizjak  wrote:
> On Thu, Aug 11, 2016 at 5:51 PM, H.J. Lu  wrote:
>
> Use TImode for piecewise move in 64-bit mode.  When vector register
> is used for piecewise move, we don't increase stack_alignment_needed
> since vector register spill isn't required for piecewise move.  Since
> stack_realign_needed is set to true by checking 
> stack_alignment_estimated
> set by pseudo vector register usage, we also need to check
> stack_realign_needed to eliminate frame pointer.

 Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.
>>>
>>> I will extend it to 32-bit mode.
>>
>> It doesn't work in 32-bit mode due to
>>
>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : 
>> DImode):
>>
>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
>> x.i: In function ‘foo’:
>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
>>return __builtin_mempcpy (dst, src, 32);
>>   ^~~~
>
> This happens since by_pieces_ninsns determines widest mode by calling
> widest_*INT*_mode_for_size, while moves can also use vector-mode
> moves. This is an infrastructure problem, and will bite you on 64bit
> targets when MOVE_MAX_PIECES returns OImode or XImode size.

 I opened:

 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113

> +#define MOVE_MAX_PIECES \
> +  ((TARGET_64BIT \
> +&& TARGET_SSE2 \
> +&& TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> +&& TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)
>
> The above part is OK with an appropriate ??? comment, describing the
> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
> instead of magic constant.
>
> Can you please submit the realignment patch as a separate follow-up
> patch? Let's keep two issues separate.
>
> Uros.

 Here is the updated patch.  OK for trunk?
>>>
>>> OK, but please do not yet introduce:
>>>
>>> +/* No need to dynamically realign the stack here.  */
>>> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
>>> +/* Nor use a frame pointer.  */
>>> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
>>>
>>> in the testcases. This should be part of a followup patch.
>>
>> This is what I checked in.
>
> Playing a bit with a patched gcc, I found no stack realignment insns
> in the assembly of the provided testcases. However, if
> -mincoming-stack-boundary=3 is added, then no vector instructions are
> generated (and also no realignment insns).

Ah yes, the STV pass is disabled for -mincoming-stack-boundary={2,3}.

It looks like we don't need the extra realignment patch.

Uros.


Re: [PATCH] Use TImode for piecewise move in 64-bit mode

2016-08-11 Thread Uros Bizjak
On Thu, Aug 11, 2016 at 5:51 PM, H.J. Lu  wrote:

 Use TImode for piecewise move in 64-bit mode.  When vector register
 is used for piecewise move, we don't increase stack_alignment_needed
 since vector register spill isn't required for piecewise move.  Since
 stack_realign_needed is set to true by checking 
 stack_alignment_estimated
 set by pseudo vector register usage, we also need to check
 stack_realign_needed to eliminate frame pointer.
>>>
>>> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.
>>
>> I will extend it to 32-bit mode.
>
> It doesn't work in 32-bit mode due to
>
> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : 
> DImode):
>
> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
> x.i: In function ‘foo’:
> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
>return __builtin_mempcpy (dst, src, 32);
>   ^~~~

 This happens since by_pieces_ninsns determines widest mode by calling
 widest_*INT*_mode_for_size, while moves can also use vector-mode
 moves. This is an infrastructure problem, and will bite you on 64bit
 targets when MOVE_MAX_PIECES returns OImode or XImode size.
>>>
>>> I opened:
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113
>>>
 +#define MOVE_MAX_PIECES \
 +  ((TARGET_64BIT \
 +&& TARGET_SSE2 \
 +&& TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
 +&& TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)

 The above part is OK with an appropriate ??? comment, describing the
 infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
 instead of magic constant.

 Can you please submit the realignment patch as a separate follow-up
 patch? Let's keep two issues separate.

 Uros.
>>>
>>> Here is the updated patch.  OK for trunk?
>>
>> OK, but please do not yet introduce:
>>
>> +/* No need to dynamically realign the stack here.  */
>> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
>> +/* Nor use a frame pointer.  */
>> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
>>
>> in the testcases. This should be part of a followup patch.
>
> This is what I checked in.

Playing a bit with a patched gcc, I found no stack realignment insns
in the assembly of the provided testcases. However, if
-mincoming-stack-boundary=3 is added, then no vector instructions are
generated (and also no realignment insns).

Uros.


Re: [PATCH] Use TImode for piecewise move in 64-bit mode

2016-08-11 Thread H.J. Lu
On Thu, Aug 11, 2016 at 8:41 AM, Uros Bizjak  wrote:
> On Thu, Aug 11, 2016 at 5:26 PM, H.J. Lu  wrote:
>> On Thu, Aug 11, 2016 at 1:16 AM, Uros Bizjak  wrote:
>>> On Wed, Aug 10, 2016 at 6:44 PM, H.J. Lu  wrote:
>>>
>>> Use TImode for piecewise move in 64-bit mode.  When vector register
>>> is used for piecewise move, we don't increase stack_alignment_needed
>>> since vector register spill isn't required for piecewise move.  Since
>>> stack_realign_needed is set to true by checking 
>>> stack_alignment_estimated
>>> set by pseudo vector register usage, we also need to check
>>> stack_realign_needed to eliminate frame pointer.
>>
>> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.
>
> I will extend it to 32-bit mode.

 It doesn't work in 32-bit mode due to

 #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : 
 DImode):

 /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
 -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
 -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
 x.i: In function ‘foo’:
 x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
return __builtin_mempcpy (dst, src, 32);
   ^~~~
>>>
>>> This happens since by_pieces_ninsns determines widest mode by calling
>>> widest_*INT*_mode_for_size, while moves can also use vector-mode
>>> moves. This is an infrastructure problem, and will bite you on 64bit
>>> targets when MOVE_MAX_PIECES returns OImode or XImode size.
>>
>> I opened:
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113
>>
>>> +#define MOVE_MAX_PIECES \
>>> +  ((TARGET_64BIT \
>>> +&& TARGET_SSE2 \
>>> +&& TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>>> +&& TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)
>>>
>>> The above part is OK with an appropriate ??? comment, describing the
>>> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
>>> instead of magic constant.
>>>
>>> Can you please submit the realignment patch as a separate follow-up
>>> patch? Let's keep two issues separate.
>>>
>>> Uros.
>>
>> Here is the updated patch.  OK for trunk?
>
> OK, but please do not yet introduce:
>
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
>
> in the testcases. This should be part of a followup patch.
>
> Thanks,
> Uros.

This is what I checked in.

Thanks.


-- 
H.J.
From e52d6c1aed5a688fd28b9f6e005c6d9f3857b9cf Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 8 Aug 2016 09:08:01 -0700
Subject: [PATCH] Use TImode for piecewise move in 64-bit mode

Use TImode for piecewise move in 64-bit mode.  We should use TImode in
32-bit mode and use OImode or XImode if they are available.  But since
by_pieces_ninsns determines the widest mode with MAX_FIXED_MODE_SIZE,
we can only use TImode in 64-bit mode.

gcc/

	* config/i386/i386.h (MOVE_MAX_PIECES): Use TImode in 64-bit
	mode if unaligned SSE load and store are optimal.

gcc/testsuite/

	* gcc.target/i386/pieces-memcpy-1.c: New test.
	* gcc.target/i386/pieces-memcpy-2.c: Likewise.
	* gcc.target/i386/pieces-memcpy-3.c: Likewise.
	* gcc.target/i386/pieces-memcpy-4.c: Likewise.
	* gcc.target/i386/pieces-memcpy-5.c: Likewise.
	* gcc.target/i386/pieces-memcpy-6.c: Likewise.
---
 gcc/config/i386/i386.h  | 14 --
 gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c | 13 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c | 13 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c | 13 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c | 13 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c | 13 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c | 13 +
 7 files changed, 90 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 9b66264..8751143 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1950,8 +1950,18 @@ typedef struct ix86_args {
 
 /* MOVE_MAX_PIECES is the number of bytes at a time which we can
move efficiently, as opposed to  MOVE_MAX which is the maximum
-   number of bytes we can move with a single instruction.  */
-#define MOVE_MAX_PIECES UNITS_PER_WORD
+   number of bytes we can move with 

Re: Early jump threading

2016-08-11 Thread Richard Biener
On August 11, 2016 4:27:00 PM GMT+02:00, Jan Hubicka  wrote:
>> On Thu, 11 Aug 2016, Jan Hubicka wrote:
>> 
>> > Hi,
>> > this patch adds early jump threading pass.  Jump threading is one
>of most
>> > common cases where estimated profile becomes corrupted, because the
>branches
>> > are predicted independently beforehand. This patch simply adds
>early mode to
>> > jump threader that does not permit code growth and thus only
>win/win threads
>> > are done before profile is constructed.
>> > 
>> > Naturally this also improves IPA decisions because code size/time
>is estimated
>> > more accurately.
>> > 
>> > It is not very cool to add another pass and the jump threader is
>already
>> > run 5 times. I think incrementally we could drop one of the late
>threaders at least.
>> > I tried to measure compile time effects but they are in the wash.
>Moreover the patch
>> > pays for itself in cc1plus code size:
>> > 
>> > Before patch to tweak size estimates: 27779964  
>> > Current mainline: 27748900  
>> > With patch applied:   27716173  
>> > 
>> > So despite adding a few functions the code size effect is positive
>which I think
>> > is quite nice.
>> > 
>> > Given the fact that jump threading seems quite lightweight, I wonder
>why it is
>> > guarded by flag_expensive_optimizations? Is it expensive in terms
>of code
>> > size?
>> > 
>> > The effectiveness of individual threading passes is as follows (for
>tramp3d)
>> > 
>> >   mainline  with patch
>> > pass  thread count profile mismatches   thread countprofile
>mismatch
>> > early   525
>> > 1 1853 1900 316 337
>> > 2 4812  4   288
>> > 3 24   1450 32  947
>> > 4 76   1457 75  975
>> > 
>> > So at least tramp3d data seems to suggest that we can drop the
>second occurrence
>> > of jump threading and that most of the job thread1 does can be done
>by the
>> > size restricted early version (the lower absolute counts are caused
>by the
>> > fact that threadable paths gets duplicated by the inliner). 50%
>drop in
>> > profile distortion is not bad. I wonder why basically every
>threaded paths seems
>> > to introduce a mismatch.
>> > 
>> > The patch distorts the testsuite somewhat; in most cases we only want
>to update
>> > dump files or disable early threading:
>> > 
>> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
>> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
>> > +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
>> > +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1
>"Registering FSM" 1
>> > +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did
>not thread around loop and would copy too many statements"
>> > +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times
>thread1 "Jumps threaded: 1" 1
>> > +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
>> > +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding
>predicate p_.*to 1" 1
>> > +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps
>threaded: 1"
>> > +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: [1,
>1]"
>> > 
>> > These testcases cover the now probably unnecessary heuristics (they may
>still be
>> > relevant in cases we do not thread because of size bounds, but their
>effectiveness
>> > may not be worth the maintenance cost):
>> > 
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: 
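As an aside for readers, the transformation under discussion can be sketched with a minimal C function. This is my illustration, not code from the patch: the second test of `flag` is fully determined along each incoming path, so the threader duplicates the block and wires each predecessor straight to the right successor.

```c
/* A classic jump-threading opportunity: once control reaches the
   second "if (flag)", its outcome is already known on both incoming
   paths, so the compiler can thread the jump instead of re-testing.  */
int f(int flag, int a, int b)
{
  int x;
  if (flag)
    x = a;
  else
    x = b;
  if (flag)   /* redundant test along every path: threadable */
    return x + 1;
  return x - 1;
}
```

Threading this early, before profile estimation, avoids the mismatches discussed above, because the two branches are no longer predicted independently.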

Re: [C++ PATCH] Handle CASE_HIGH in constexpr evaluation (PR c++/72868)

2016-08-11 Thread Jason Merrill
OK for trunk and 6.

On Thu, Aug 11, 2016 at 10:55 AM, Jakub Jelinek  wrote:
> Hi!
>
> As mentioned in the PR, constexpr.c has been handling cases with ranges
> just as the lowest value of the range.
>
> Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux, ok for
> trunk?  What about 6.2 (not a regression, but low risk fix for wrong-code)?
>
> 2016-08-11  Jakub Jelinek  
>
> PR c++/72868
> * constexpr.c (label_matches): Handle case range expressions.
>
> * g++.dg/cpp1y/constexpr-switch4.C: New test.
>
> --- gcc/cp/constexpr.c.jj   2016-08-10 00:21:07.0 +0200
> +++ gcc/cp/constexpr.c  2016-08-10 22:17:16.577041975 +0200
> @@ -3448,6 +3448,12 @@ label_matches (tree *jump_target, tree_s
> {
>   if (!CASE_LOW (stmt))
> default_label = i;
> + else if (CASE_HIGH (stmt))
> +   {
> + if (tree_int_cst_le (CASE_LOW (stmt), *jump_target)
> + && tree_int_cst_le (*jump_target, CASE_HIGH (stmt)))
> +   return true;
> +   }
>   else if (tree_int_cst_equal (*jump_target, CASE_LOW (stmt)))
> return true;
> }
> --- gcc/testsuite/g++.dg/cpp1y/constexpr-switch4.C.jj   2016-08-10 
> 22:22:29.567129868 +0200
> +++ gcc/testsuite/g++.dg/cpp1y/constexpr-switch4.C  2016-08-10 
> 22:23:25.104435699 +0200
> @@ -0,0 +1,27 @@
> +// PR c++/72868
> +// { dg-do compile }
> +// { dg-options "-std=gnu++14" }
> +
> +constexpr int
> +foo (int i)
> +{
> +  switch (i)
> +{
> +case 11 ... 12:
> +  return 4;
> +case 0 ... 9:
> +  return 3;
> +default:
> +  return 7;
> +}
> +}
> +
> +#define SA(X) static_assert((X),#X)
> +SA (foo (-1) == 7);
> +SA (foo (0) == 3);
> +SA (foo (3) == 3);
> +SA (foo (9) == 3);
> +SA (foo (10) == 7);
> +SA (foo (11) == 4);
> +SA (foo (12) == 4);
> +SA (foo (13) == 7);
>
> Jakub


C++ PATCH for c++/73456 (concepts ICE with constrained parameter pack)

2016-08-11 Thread Jason Merrill
In trying to decide if the constraints on the partial specialization
subsume those on the primary template, we consider whether any of the
constraints on the primary template imply the type constraint typename
list.  We don't have any matching constraints on the primary
template, so we look to see if there are any non-atomic constraints
that we might be able to decompose in order to produce a match.  This
function doesn't know how to handle the pack expansion
Sequence... and so it crashes.

A pack expansion isn't really atomic, but it also won't decompose any
further, so treating it as non-atomic here leads to an infinite loop;
treating it as atomic, as 6.1 did, fixes the testcase.

Tested x86_64-pc-linux-gnu, applying to trunk and 6.
commit 5b8341e2a06c9dead76cba2485a2e0633badbac0
Author: Jason Merrill 
Date:   Thu Aug 11 11:15:13 2016 -0400

PR c++/73456 - ICE with constrained parameter pack.

* logic.cc (non_atomic_constraint_p): Handle EXPR_PACK_EXPANSION.

diff --git a/gcc/cp/logic.cc b/gcc/cp/logic.cc
index dda98df..b86e740 100644
--- a/gcc/cp/logic.cc
+++ b/gcc/cp/logic.cc
@@ -305,6 +305,9 @@ non_atomic_constraint_p (tree t)
 case ICONV_CONSTR:
 case DEDUCT_CONSTR:
 case EXCEPT_CONSTR:
+  /* A pack expansion isn't atomic, but it can't decompose to prove an
+atom, so it shouldn't cause analyze_atom to return undecided.  */
+case EXPR_PACK_EXPANSION:
   return false;
 case CHECK_CONSTR:
 case PARM_CONSTR:
diff --git a/gcc/testsuite/g++.dg/concepts/variadic4.C 
b/gcc/testsuite/g++.dg/concepts/variadic4.C
new file mode 100644
index 000..d20fa7d
--- /dev/null
+++ b/gcc/testsuite/g++.dg/concepts/variadic4.C
@@ -0,0 +1,20 @@
+// PR c++/73456
+// { dg-options "-std=c++1z -fconcepts" }
+
+template struct list {};
+
+template
+concept bool Sequence = true;
+
+template
+struct zip;
+
+template
+requires requires { typename list; }
+// main.cpp:12:8: internal compiler error: in non_atomic_constraint_p, at 
cp/logic.cc:315
+struct zip {};
+
+int main()
+{
+zip, list> {};
+}


Re: [PATCH] Use TImode for piecewise move in 64-bit mode

2016-08-11 Thread Uros Bizjak
On Thu, Aug 11, 2016 at 5:26 PM, H.J. Lu  wrote:
> On Thu, Aug 11, 2016 at 1:16 AM, Uros Bizjak  wrote:
>> On Wed, Aug 10, 2016 at 6:44 PM, H.J. Lu  wrote:
>>
>> Use TImode for piecewise move in 64-bit mode.  When vector register
>> is used for piecewise move, we don't increase stack_alignment_needed
>> since vector register spill isn't required for piecewise move.  Since
>> stack_realign_needed is set to true by checking stack_alignment_estimated
>> set by pseudo vector register usage, we also need to check
>> stack_realign_needed to eliminate frame pointer.
>
> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.

 I will extend it to 32-bit mode.
>>>
>>> It doesn't work in 32-bit mode due to
>>>
>>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : 
>>> DImode):
>>>
>>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
>>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
>>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
>>> x.i: In function ‘foo’:
>>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
>>>return __builtin_mempcpy (dst, src, 32);
>>>   ^~~~
>>
>> This happens since by_pieces_ninsns determines widest mode by calling
>> widest_*INT*_mode_for_size, while moves can also use vector-mode
>> moves. This is an infrastructure problem, and will bite you on 64bit
>> targets when MOVE_MAX_PIECES returns OImode or XImode size.
>
> I opened:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113
>
>> +#define MOVE_MAX_PIECES \
>> +  ((TARGET_64BIT \
>> +&& TARGET_SSE2 \
>> +&& TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>> +&& TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)
>>
>> The above part is OK with an appropriate ??? comment, describing the
>> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
>> instead of magic constant.
>>
>> Can you please submit the realignment patch as a separate follow-up
>> patch? Let's keep two issues separate.
>>
>> Uros.
>
> Here is the updated patch.  OK for trunk?

OK, but please do not yet introduce:

+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */

in the testcases. This should be part of a followup patch.

Thanks,
Uros.


Re: [WIP] [PR fortran/72741] Rework Fortran OpenACC routine clause handling (was: [PATCH] OpenACC routines in fortran modules)

2016-08-11 Thread Jakub Jelinek
On Thu, Aug 11, 2016 at 05:18:43PM +0200, Thomas Schwinge wrote:
> --- gcc/fortran/gfortran.h
> +++ gcc/fortran/gfortran.h
> @@ -729,7 +839,7 @@ ext_attr_t;
>  extern const ext_attr_t ext_attr_list[];
>  
>  /* Symbol attribute structure.  */
> -typedef struct
> +typedef struct symbol_attribute
>  {
>/* Variable attributes.  */
>unsigned allocatable:1, dimension:1, codimension:1, external:1, 
> intrinsic:1,
> @@ -864,6 +974,13 @@ typedef struct
>/* Mentioned in OMP DECLARE TARGET.  */
>unsigned omp_declare_target:1;
>  
> +  /* OpenACC routine.  */
> +  unsigned oacc_routine:1;
> +  unsigned oacc_routine_gang:1;
> +  unsigned oacc_routine_worker:1;
> +  unsigned oacc_routine_vector:1;
> +  unsigned oacc_routine_seq:1;
> +
>/* Mentioned in OACC DECLARE.  */
>unsigned oacc_declare_create:1;
>unsigned oacc_declare_copyin:1;
> @@ -871,137 +988,24 @@ typedef struct
>unsigned oacc_declare_device_resident:1;
>unsigned oacc_declare_link:1;
>  
> -  /* This is an OpenACC acclerator function at level N - 1  */
> -  ENUM_BITFIELD (oacc_function) oacc_function:3;
> -
>/* Attributes set by compiler extensions (!GCC$ ATTRIBUTES).  */
>unsigned ext_attr:EXT_ATTR_NUM;
>  
> +  /* Location information for OMP clauses.  */
> +  //TODO: how to handle in module.c/symbol.c?
> +  locus omp_clauses_locus;
> +
>/* The namespace where the attribute has been set.  */
>struct gfc_namespace *volatile_ns, *asynchronous_ns;
> +
> +  /* Chain to another set of symbol attributes.  Currently only used for
> + OpenACC routine.  */
> +  //TODO: how to handle in module.c/symbol.c?
> +  struct symbol_attribute *next;

While symbol_attribute is already bloated, I don't like bloating it this
much further.  Do you really need it for all symbols, or just all subroutines?
In the latter case, it is much better to add some openacc specific pointer
into the namespace structure and stick everything you need into some custom
structure it will refer to.  E.g. look at gfc_omp_declare_simd struct
in ns->omp_declare_simd.
omp_clauses_locus makes no sense; symbol_attribute contains parsed info
from many different clauses, so which one is it?

Jakub
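For illustration, the side-structure pattern Jakub points at (compare gfc_omp_declare_simd hanging off ns->omp_declare_simd) looks roughly like the sketch below. All names here are invented for the example; this is not actual gfortran code.

```c
#include <stdlib.h>

/* Hypothetical sketch: instead of widening symbol_attribute for every
   symbol, keep one small chained record per "acc routine" directive,
   hung off the namespace that contains it.  */
enum oacc_level { OACC_GANG, OACC_WORKER, OACC_VECTOR, OACC_SEQ };

struct oacc_routine_info {
  struct oacc_routine_info *next;  /* one record per directive */
  const char *proc_name;           /* routine the directive names */
  enum oacc_level level;           /* level-of-parallelism clause */
};

struct namespace_stub {
  struct oacc_routine_info *oacc_routines;  /* head of the chain */
};

/* Prepend a record, mirroring how ns->omp_declare_simd entries chain.  */
void add_oacc_routine(struct namespace_stub *ns,
                      const char *name, enum oacc_level lvl)
{
  struct oacc_routine_info *r = malloc(sizeof *r);
  r->next = ns->oacc_routines;
  r->proc_name = name;
  r->level = lvl;
  ns->oacc_routines = r;
}
```

A per-directive record also gives a natural home for the clause locus that doesn't fit in symbol_attribute.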


Re: [Patch, libfortran] Multi-threaded random_number

2016-08-11 Thread Janne Blomqvist
On Thu, Aug 11, 2016 at 5:54 PM, Rainer Orth
 wrote:
> Hi Janne,
>
>> committed a slightly modified patch as r239356. Changes from the
>> submitted patch attached. To my surprise, it turned out that my
>> fallback code using __gthread_key_{create,delete} and
>> __ghtread_{get,set}_specific was faster than my TLS code, so I removed
>> the TLS configure magic and the TLS code and just left the __gthread
>> stuff in.
>
> this patch broke Solaris bootstrap:
>
> /vol/gcc/src/hg/trunk/local/libgfortran/intrinsics/random.c: In function 
> 'constructor_random':
> /vol/gcc/src/hg/trunk/local/libgfortran/intrinsics/
> random.c:915:45: error: 'free' undeclared (first use in this function); did 
> you mean 'frexp'?
>  __gthread_key_create (_state_key, );
>  ^~~~
>  frexp
> /vol/gcc/src/hg/trunk/local/libgfortran/intrinsics/random.c:915:45: note: 
> each undeclared identifier is reported only once for each function it appears 
> in
>
> You need to include  for the free declaration, as this patch
> does.  Allowed i386-pc-solaris2.12 and sparc-sun-solaris2.12 to
> continue.  I'm going to install this as obvious.

Oh, I (incorrectly, obviously!) remembered that stdlib.h would be
included via libgfortran.h. Thanks for the quick fix!


-- 
Janne Blomqvist


Re: [PATCH] Use TImode for piecewise move in 64-bit mode

2016-08-11 Thread H.J. Lu
On Thu, Aug 11, 2016 at 1:16 AM, Uros Bizjak  wrote:
> On Wed, Aug 10, 2016 at 6:44 PM, H.J. Lu  wrote:
>
> Use TImode for piecewise move in 64-bit mode.  When vector register
> is used for piecewise move, we don't increase stack_alignment_needed
> since vector register spill isn't required for piecewise move.  Since
> stack_realign_needed is set to true by checking stack_alignment_estimated
> set by pseudo vector register usage, we also need to check
> stack_realign_needed to eliminate frame pointer.

 Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.
>>>
>>> I will extend it to 32-bit mode.
>>
>> It doesn't work in 32-bit mode due to
>>
>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : 
>> DImode):
>>
>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
>> x.i: In function ‘foo’:
>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
>>return __builtin_mempcpy (dst, src, 32);
>>   ^~~~
>
> This happens since by_pieces_ninsns determines widest mode by calling
> widest_*INT*_mode_for_size, while moves can also use vector-mode
> moves. This is an infrastructure problem, and will bite you on 64bit
> targets when MOVE_MAX_PIECES returns OImode or XImode size.

I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113

> +#define MOVE_MAX_PIECES \
> +  ((TARGET_64BIT \
> +&& TARGET_SSE2 \
> +&& TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> +&& TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)
>
> The above part is OK with an appropriate ??? comment, describing the
> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
> instead of magic constant.
>
> Can you please submit the realignment patch as a separate follow-up
> patch? Let's keep two issues separate.
>
> Uros.

Here is the updated patch.  OK for trunk?

-- 
H.J.
From b3caa00a2164b44148552bb82bf616e0aeba5ea7 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 8 Aug 2016 09:08:01 -0700
Subject: [PATCH] Use TImode for piecewise move in 64-bit mode

Use TImode for piecewise move in 64-bit mode.  We should use TImode in
32-bit mode and use OImode or XImode if they are available.  But since
by_pieces_ninsns determines the widest mode with MAX_FIXED_MODE_SIZE,
we can only use TImode in 64-bit mode.

gcc/

	* config/i386/i386.h (MOVE_MAX_PIECES): Use TImode in 64-bit
	mode if unaligned SSE load and store are optimal.

gcc/testsuite/

	* gcc.target/i386/pieces-memcpy-1.c: New test.
	* gcc.target/i386/pieces-memcpy-2.c: Likewise.
	* gcc.target/i386/pieces-memcpy-3.c: Likewise.
	* gcc.target/i386/pieces-memcpy-4.c: Likewise.
	* gcc.target/i386/pieces-memcpy-5.c: Likewise.
	* gcc.target/i386/pieces-memcpy-6.c: Likewise.
---
 gcc/config/i386/i386.h  | 14 --
 gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c | 17 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c | 17 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c | 17 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c | 17 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c | 17 +
 gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c | 17 +
 7 files changed, 114 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 9b66264..8751143 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1950,8 +1950,18 @@ typedef struct ix86_args {
 
 /* MOVE_MAX_PIECES is the number of bytes at a time which we can
move efficiently, as opposed to  MOVE_MAX which is the maximum
-   number of bytes we can move with a single instruction.  */
-#define MOVE_MAX_PIECES UNITS_PER_WORD
+   number of bytes we can move with a single instruction.
+
+   ??? We should use TImode in 32-bit mode and use OImode or XImode
+   if they are available.  But since by_pieces_ninsns determines the
+   widest mode with MAX_FIXED_MODE_SIZE, we can only use TImode in
+   64-bit mode.  */
+#define MOVE_MAX_PIECES \
+  ((TARGET_64BIT \
+&& TARGET_SSE2 \
+&& TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
+&& TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
+   ? GET_MODE_SIZE (TImode) : UNITS_PER_WORD)
 
 /* If a memory-to-memory move would take MOVE_RATIO or more simple
move-instruction pairs, we will do a movmem or libcall instead.
diff --git 
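To see what MOVE_MAX_PIECES controls, consider a small fixed-size copy. With the 16-byte (TImode) limit above, a 32-byte memcpy can be expanded inline as two unaligned 16-byte SSE loads/stores instead of four word-sized pairs. A minimal example of my own, matching the shape of the pieces-memcpy tests:

```c
#include <string.h>

/* A small, fixed-size copy that GCC expands inline "by pieces";
   MOVE_MAX_PIECES bounds the widest chunk used for the expansion.  */
void copy32(char *dst, const char *src)
{
  memcpy(dst, src, 32);
}
```

With the patch, compiling this at -O2 on an x86-64 target whose tuning marks unaligned SSE loads and stores as optimal should emit two movdqu pairs rather than a sequence of 8-byte moves.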

Re: libgo patch committed: Change build procedure to use build tags

2016-08-11 Thread Rainer Orth
Hi Ian,

> Go packages use build tags (see the section on Build Constraints at
> https://golang.org/pkg/go/build/) to select which files to build on
> specific systems.
>
> Previously the libgo Makefile explicitly listed the set of files to
> compile for each package.  For packages that use build tags, this
> required a lot of awkward automake conditionals in the Makefile.
>
> This patch changes the build to look at the build tags in the files.
> The new shell script libgo/match.sh does the matching.  This required
> adjusting a lot of build tags, and removing some files that are never
> used.  I verified that the exact same sets of files are compiled on
> x86_64-pc-linux-gnu.  I also tested the build on i386-sun-solaris
> (building for both 32-bit and 64-bit).
>
> Writing match.sh revealed some bugs in the build tag handling that
> already exists, in a slightly different form, in the gotest shell
> script.  This patch fixes those problems as well.
>
> The old code used automake conditionals to handle systems that were
> missing strerror_r and wait4.  Rather than deal with those in Go,
> those functions are now implemented in runtime/go-nosys.c when
> necessary, so the Go code can simply assume that they exist.
>
> The os testsuite looked for dir_unix.go, which was never built for
> gccgo and has now been removed.  I changed the testsuite to look for
> dir.go instead.
>
> Note that if you have an existing build directory, you will have to
> remove all the .dep files in TARGET/libgo after updating to this
> patch.  There isn't anything that will force them to update
> automatically.
>
> Bootstrapped on x86_64-pc-linux-gnu and i386-sun-solaris.  Ran Go
> testsuite on x86_64-pc-linux-gnu.  Committed to mainline.

this patch broke i386-pc-solaris2.12 and sparc-sun-solaris2.12
bootstrap, however: in both cases, the 64-bit build of os.lo fails like this:

/vol/gcc/src/hg/trunk/local/libgo/go/os/dir.go:82:8: error: reference to 
undefined name 'libc_readdir_r'
   i := libc_readdir_r(file.dirinfo.dir, entryDirent, pr)
^

Neither dir_largefile.go (which is correctly omitted, being 32-bit only)
nor dir_regfile.go (which is needed here) is included in the
compilation.

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


[WIP] [PR fortran/72741] Rework Fortran OpenACC routine clause handling (was: [PATCH] OpenACC routines in fortran modules)

2016-08-11 Thread Thomas Schwinge
Hi!

This is still hacky and WIP; posting for Cesar and Tobias to have a look.
I'm still not too much of a Fortran person.  ;-)

On Fri, 1 Jul 2016 13:40:58 -0700, Cesar Philippidis  
wrote:
> It turns out that the acc routine parallelism isn't being recorded in
> fortran .mod files. This is a problem because then the ME can't validate
> if a routine has compatible parallelism with the call site. This patch
> does two things:
> 
>  1. Encode gang, worker, vector and seq level parallelism in module
> files. This introduces a new oacc_function enum, which I ended
> up using to record the parallelism of standalone acc routines too.

Building on top of this patch, and on top of
 "[gomp4] Fix
PR72741", I reworked these patches (effectively reverting a lot of
Cesar's earlier changes, which nevertheless gave good guidance to me,
about which code I needed to touch).  With this patch, we now handle more
Fortran OpenACC routine directive use/misuse (see the test case changes),
much in spirit of what I discussed in 
"Fortran OpenACC routine directive doesn't properly handle clauses
specifying the level of parallelism", minus items that Cesar already
clarified for me, where Fortran is different from what I expected,
different from the C/C++ environment I'm more used to.  This now also
paves the way for adding Fortran support to my recent patch
 "Use
verify_oacc_routine_clauses", and then ultimately
 "Repeated use
of the OpenACC routine directive".

However, my changes are still hacky and WIP, and still contain a bunch of
TODOs.  Can you, Cesar and/or Tobias, please advise on these?

>  2. Extends gfc_match_oacc_routine to add acc routine directive support
> for intrinsic procedures such as abort.
> 
> Is this patch OK for trunk? I included support for intrinsic procedures
> because it was necessary with my previous patch which treated all calls
> to non-acc routines from within an OpenACC offloaded region as errors.
> Now that it has been determined that those patches should be link time
> errors, we technically don't need to add acc routine support for
> intrinsic procedures. So I can drop that part of the patch if necessary.

That could've been a patch separate from the others, as it's doing a
separate thing.  We will want to handle intrinsics used with the OpenACC
routine directive with a name (but it certainly isn't a priority).  I
left in these changes, and also extended them a bit.

First some comments on Cesar's patch:

> --- a/gcc/fortran/module.c
> +++ b/gcc/fortran/module.c

>  [...]
> +DECL_MIO_NAME (oacc_function)
>  [...]

As discussed between Cesar and Tobias, these module.c/symbol.c changes
introduce an incompatibility in the Fortran module file format, which
we'll want to avoid.  Reverting that to use individual bit flags instead
of the "enum oacc_function", I think that we're safe (but I have not
verified that).  On the other hand, given that I'm not at all handling in
module.c/symbol.c the new "locus omp_clauses_locus" and "struct
symbol_attribute *next" members that I'm adding to "symbol_attribute",
I'm not sure whether I'm actually testing this properly.  ;-) I guess I'm
not.

> --- a/gcc/fortran/openmp.c
> +++ b/gcc/fortran/openmp.c

> @@ -1814,7 +1824,10 @@ gfc_match_oacc_routine (void)
> != MATCH_YES))
>  return MATCH_ERROR;
>  
> -  if (sym != NULL)
> +  if (isym != NULL)
> +/* There is nothing to do for intrinsic procedures.  */
> +;

We will want to check that no incompatible clauses are being specified,
for example (but that's low priority).  I'm adding a hacky implementation of
that.

> +  else if (sym != NULL)
>  {
>n = gfc_get_oacc_routine_name ();
>n->sym = sym;

> --- a/gcc/fortran/trans-decl.c
> +++ b/gcc/fortran/trans-decl.c
> @@ -1327,11 +1327,26 @@ add_attributes_to_decl (symbol_attribute sym_attr, 
> tree list)
>  list = tree_cons (get_identifier ("omp declare target"),
> NULL_TREE, list);
>  
> -  if (sym_attr.oacc_function)
> +  if (sym_attr.oacc_function != OACC_FUNCTION_NONE)
>  {
>tree dims = NULL_TREE;
>int ix;
> -  int level = sym_attr.oacc_function - 1;
> +  int level = GOMP_DIM_MAX;
> +
> +  switch (sym_attr.oacc_function)
> + {
> + case OACC_FUNCTION_GANG:
> +   level = GOMP_DIM_GANG;
> +   break;
> + case OACC_FUNCTION_WORKER:
> +   level = GOMP_DIM_WORKER;
> +   break;
> + case OACC_FUNCTION_VECTOR:
> +   level = GOMP_DIM_VECTOR;
> +   break;
> + case OACC_FUNCTION_SEQ:
> + default:;
> + }
>  
>for (ix = GOMP_DIM_MAX; ix--;)
>   dims = tree_cons (build_int_cst (boolean_type_node, ix >= level),

As discussed before, this should use the generic omp-low.c functions,
which I've implemented.


Re: fix crash on 64-bit mingw-w64 hosted compiler using more than 4 gb of ram

2016-08-11 Thread Jakub Jelinek
On Thu, Aug 11, 2016 at 04:55:34PM +0200, Stanislaw Halik wrote:
> On 2016-08-11 Thu 16:33, Kai Tietz wrote:
> >Hello Stanislaw.
> >
> >patch is ok.  Nevertheless there is a ChangeLog entry missing for it.
> >It is mandatory to be provided for submissions to gcc.
> 
> ChangeLog:
> 
> PR target/66488
> * gcc/config/i386/xm-mingw32.h (HOST_BITS_PER_PTR): specify pointer
> size as distinct from sizeof long, for the use in gcc/ggc-page.c

The gcc/ directory has its own ChangeLog; the filename is relative to that.
The description should start after the "):" with a capital letter and end with a period.
And the description should say what has changed, not why.
So
PR target/66488
* config/i386/xm-mingw32.h (HOST_BITS_PER_PTR): Define if __x86_64__.
or so would be better.

Jakub


Re: fix crash on 64-bit mingw-w64 hosted compiler using more than 4 gb of ram

2016-08-11 Thread Jakub Jelinek
On Thu, Aug 11, 2016 at 11:31:49AM +0200, Stanislaw Halik wrote:
> The host configuration across platforms wrongly assumes that
> sizeof(long) == sizeof(intptr_t) which is incorrect on amd64-hosted compiler
> hosting mingw-w64.
> 
> Here's a patch fixing
> 
> 
> cheers,
> sh

> diff --git a/gcc/config/i386/xm-mingw32.h b/gcc/config/i386/xm-mingw32.h
> index 501cebd..1b17263 100644
> --- a/gcc/config/i386/xm-mingw32.h
> +++ b/gcc/config/i386/xm-mingw32.h
> @@ -38,3 +38,7 @@ along with GCC; see the file COPYING3.  If not see
>  #define HOST_LONG_LONG_FORMAT "I64"
>  #endif
>  
> +/* this is to prevent ggc-heap.c from assuming sizeof(long) == 
> sizeof(intptr_t) */
> +#ifdef __x86_64__
> +#define HOST_BITS_PER_PTR 64

I think you should follow the indentation style of the rest of the file.
So #define rather than #\tdefine.  Generally, tab between # and define isn't
used anywhere in GCC, sometimes one space is used to indent per level.

> +#endif
> \ No newline at end of file

Please try to avoid this, terminate the last line with a newline.

Jakub


Re: fix crash on 64-bit mingw-w64 hosted compiler using more than 4 gb of ram

2016-08-11 Thread Stanislaw Halik

On 2016-08-11 Thu 16:33, Kai Tietz wrote:

Hello Stanislaw.

patch is ok.  Nevertheless there is a ChangeLog entry missing for it.
It is mandatory to be provided for submissions to gcc.


ChangeLog:

PR target/66488
* gcc/config/i386/xm-mingw32.h (HOST_BITS_PER_PTR): specify pointer
size as distinct from sizeof long, for the use in gcc/ggc-page.c



[C++ PATCH] Handle CASE_HIGH in constexpr evaluation (PR c++/72868)

2016-08-11 Thread Jakub Jelinek
Hi!

As mentioned in the PR, constexpr.c has been handling cases with ranges
just as the lowest value of the range.

Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux, ok for
trunk?  What about 6.2 (not a regression, but low risk fix for wrong-code)?

2016-08-11  Jakub Jelinek  

PR c++/72868
* constexpr.c (label_matches): Handle case range expressions.

* g++.dg/cpp1y/constexpr-switch4.C: New test.

--- gcc/cp/constexpr.c.jj   2016-08-10 00:21:07.0 +0200
+++ gcc/cp/constexpr.c  2016-08-10 22:17:16.577041975 +0200
@@ -3448,6 +3448,12 @@ label_matches (tree *jump_target, tree_s
{
  if (!CASE_LOW (stmt))
default_label = i;
+ else if (CASE_HIGH (stmt))
+   {
+ if (tree_int_cst_le (CASE_LOW (stmt), *jump_target)
+ && tree_int_cst_le (*jump_target, CASE_HIGH (stmt)))
+   return true;
+   }
  else if (tree_int_cst_equal (*jump_target, CASE_LOW (stmt)))
return true;
}
--- gcc/testsuite/g++.dg/cpp1y/constexpr-switch4.C.jj   2016-08-10 
22:22:29.567129868 +0200
+++ gcc/testsuite/g++.dg/cpp1y/constexpr-switch4.C  2016-08-10 
22:23:25.104435699 +0200
@@ -0,0 +1,27 @@
+// PR c++/72868
+// { dg-do compile }
+// { dg-options "-std=gnu++14" }
+
+constexpr int
+foo (int i)
+{
+  switch (i)
+{
+case 11 ... 12:
+  return 4;
+case 0 ... 9:
+  return 3;
+default:
+  return 7;
+}
+}
+
+#define SA(X) static_assert((X),#X)
+SA (foo (-1) == 7);
+SA (foo (0) == 3);
+SA (foo (3) == 3);
+SA (foo (9) == 3);
+SA (foo (10) == 7);
+SA (foo (11) == 4);
+SA (foo (12) == 4);
+SA (foo (13) == 7);

Jakub


Re: [Patch, libfortran] Multi-threaded random_number

2016-08-11 Thread Rainer Orth
Hi Janne,

> committed a slightly modified patch as r239356. Changes from the
> submitted patch attached. To my surprise, it turned out that my
> fallback code using __gthread_key_{create,delete} and
> __ghtread_{get,set}_specific was faster than my TLS code, so I removed
> the TLS configure magic and the TLS code and just left the __gthread
> stuff in.

this patch broke Solaris bootstrap:

/vol/gcc/src/hg/trunk/local/libgfortran/intrinsics/random.c: In function 
'constructor_random':
/vol/gcc/src/hg/trunk/local/libgfortran/intrinsics/
random.c:915:45: error: 'free' undeclared (first use in this function); did you 
mean 'frexp'?
 __gthread_key_create (_state_key, );
 ^~~~
 frexp
/vol/gcc/src/hg/trunk/local/libgfortran/intrinsics/random.c:915:45: note: each 
undeclared identifier is reported only once for each function it appears in

You need to include <stdlib.h> for the free declaration, as this patch
does.  This allowed i386-pc-solaris2.12 and sparc-sun-solaris2.12 bootstraps
to continue.  I'm going to install this as obvious.

Rainer


2016-08-11  Rainer Orth  

libgfortran:
* intrinsics/random.c: Include <stdlib.h>.

diff --git a/libgfortran/intrinsics/random.c b/libgfortran/intrinsics/random.c
--- a/libgfortran/intrinsics/random.c
+++ b/libgfortran/intrinsics/random.c
@@ -27,6 +27,7 @@ see the files COPYING3 and COPYING.RUNTI
 #include "libgfortran.h"
 #include 
 #include 
+#include 
 
 /* For getosrandom.  */
 #ifdef HAVE_SYS_TYPES_H

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


RFC: A few more fallthrus

2016-08-11 Thread Marek Polacek
A few more cases where I'm unsure whether the fall through is intended.
Jason, can you please look at the cp/ part?
Richi, would you mind looking at the tree-complex.c bit?
What 'bout the pch.c?

Thanks,

2016-08-11  Marek Polacek  

PR c/7652
gcc/
* tree-complex.c (expand_complex_division): Likewise.
gcc/cp/
* call.c (add_builtin_candidate): Add gcc_fallthrough.
* cxx-pretty-print.c (pp_cxx_unqualified_id): Likewise.
* parser.c (cp_parser_skip_to_end_of_statement): Likewise.
(cp_parser_cache_defarg): Likewise.
libcpp/
* pch.c (write_macdef): Add CPP_FALLTHRU.

--- gcc/gcc/cp/call.c
+++ gcc/gcc/cp/call.c
@@ -2542,6 +2544,8 @@ add_builtin_candidate (struct z_candidate **candidates, 
enum tree_code code,
  type2 = ptrdiff_type_node;
  break;
}
+  /* XXX Really fallthru?  */
+  /* FALLTHRU */
 case MULT_EXPR:
 case TRUNC_DIV_EXPR:
   if (ARITHMETIC_TYPE_P (type1) && ARITHMETIC_TYPE_P (type2))
--- gcc/gcc/cp/cxx-pretty-print.c
+++ gcc/gcc/cp/cxx-pretty-print.c
@@ -142,6 +142,8 @@ pp_cxx_unqualified_id (cxx_pretty_printer *pp, tree t)
 
 case OVERLOAD:
   t = OVL_CURRENT (t);
+  /* XXX Really fallthru?  */
+  /* FALLTHRU */
 case VAR_DECL:
 case PARM_DECL:
 case CONST_DECL:
--- gcc/gcc/cp/parser.c
+++ gcc/gcc/cp/parser.c
@@ -3488,6 +3488,8 @@ cp_parser_skip_to_end_of_statement (cp_parser* parser)
  cp_lexer_consume_token (parser->lexer);
  return;
}
+ /* XXX Really fallthru?  */
+ /* FALLTHRU */
 
case CPP_OPEN_BRACE:
  ++nesting_depth;
@@ -27640,6 +27646,8 @@ cp_parser_cache_defarg (cp_parser *parser, bool nsdmi)
  parser->in_template_argument_list_p = saved_italp;
  break;
}
+ /* XXX Really fallthru?  */
+ /* FALLTHRU */
case CPP_CLOSE_PAREN:
case CPP_ELLIPSIS:
  /* If we run into a non-nested `;', `}', or `]',
--- gcc/gcc/tree-complex.c
+++ gcc/gcc/tree-complex.c
@@ -1336,6 +1336,8 @@ expand_complex_division (gimple_stmt_iterator *gsi, tree 
inner_type,
   rr = gimplify_build2 (gsi, code, inner_type, ai, bi);
   ri = gimplify_build2 (gsi, code, inner_type, ar, bi);
   ri = gimplify_build1 (gsi, NEGATE_EXPR, inner_type, ri);
+  /* XXX Really fallthru?  */
+  /* FALLTHRU */
 
 case PAIR (ONLY_REAL, VARYING):
 case PAIR (ONLY_IMAG, VARYING):
--- gcc/libcpp/pch.c
+++ gcc/libcpp/pch.c
@@ -55,6 +55,8 @@ write_macdef (cpp_reader *pfile, cpp_hashnode *hn, void 
*file_p)
 case NT_VOID:
   if (! (hn->flags & NODE_POISONED))
return 1;
+  /* XXX Really fallthru?  */
+  /* FALLTHRU */
 
 case NT_MACRO:
   if ((hn->flags & NODE_BUILTIN)

Marek


Re: fix crash on 64-bit mingw-w64 hosted compiler using more than 4 gb of ram

2016-08-11 Thread Kai Tietz
Hello Stanislaw.

patch is ok.  Nevertheless there is a ChangeLog entry missing for it.
It is mandatory to be provided for submissions to gcc.

Thanks,
Kai

2016-08-11 11:31 GMT+02:00 Stanislaw Halik :
> The host configuration across platforms wrongly assumes that
> sizeof(long) == sizeof(intptr_t) which is incorrect on amd64-hosted compiler
> hosting mingw-w64.
>
> Here's a patch fixing
> 
>
> cheers,
> sh


Re: [PATCHv2, PING 2][ARM] -mpure-code option for ARM

2016-08-11 Thread Andre Vieira (lists)
On 25/07/16 11:52, Andre Vieira (lists) wrote:
> On 11/07/16 17:56, Andre Vieira (lists) wrote:
>> On 07/07/16 13:30, mickael guene wrote:
>>> Hi Andre,
>>>
>>>  Another feedback on your purecode patch.
>>>  You have to disable casesi pattern since then it will
>>> generate wrong code with -mpure-code option.
>>>  Indeed it will generate an 'adr rx, .Lx' (aka
>>> 'subs rx, PC, #offset') which will not work in our
>>> case since 'Lx' label is put in an .rodata section.
>>> So offset value is unknown and can be impossible
>>> to encode correctly.
>>>
>>> Regards
>>> Mickael
>>>
>>> On 06/30/2016 04:32 PM, Andre Vieira (lists) wrote:
 Hello,

 This patch adds the -mpure-code option for ARM. This option ensures
 functions are put into sections that contain only code and no data. To
 ensure this throughout compilation we give these sections the ARM
 processor-specific ELF section attribute "SHF_ARM_PURECODE". This option
 is only supported for non-pic code for armv7-m targets.

 This patch introduces a new target hook 'TARGET_ASM_ELF_FLAGS_NUMERIC'.
 This target hook enables a target to use the numeric value for elf
 section attributes rather than their alphabetical representation. If
 TARGET_ASM_ELF_FLAGS_NUMERIC returns TRUE, the existing
 'default_elf_asm_named_section', will print the numeric value of the
 section attributes for the current section. This target hook has two
 parameters:
 unsigned int FLAGS, the input parameter that tells the function the
 current section's attributes;
 unsigned int *NUM, used to pass down the numerical representation of the
 section's attributes.

 The default implementation for TARGET_ASM_ELF_FLAGS_NUMERIC will return
 false, so existing behavior is not changed.

 Bootstrapped and tested for arm-none-linux-gnueabihf. Further tested for
 arm-none-eabi with a Cortex-M3 target.


 gcc/ChangeLog:
 2016-06-30  Andre Vieira  
 Terry Guo  

 * target.def (elf_flags_numeric): New target hook.
 * targhooks.h (default_asm_elf_flags_numeric): New.
 * varasm.c (default_asm_elf_flags_numeric): New.
   (default_elf_asm_named_section): Use new target hook.
 * config/arm/arm.opt (mpure-code): New.
 * config/arm/arm.h (SECTION_ARM_PURECODE): New.
 * config/arm/arm.c (arm_asm_init_sections): Add section
   attribute to default text section if -mpure-code.
   (arm_option_check_internal): Diagnose use of option with
   non supported targets and/or options.
   (arm_asm_elf_flags_numeric): New.
   (arm_function_section): New.
   (arm_elf_section_type_flags): New.
 * config/arm/elf.h (JUMP_TABLES_IN_TEXT_SECTION): Disable
   for -mpure-code.
 * gcc/doc/texi (TARGET_ASM_ELF_FLAGS_NUMERIC): New.
 * gcc/doc/texi.in (TARGET_ASM_ELF_FLAGS_NUMERIC): Likewise.



 gcc/testsuite/ChangeLog:
 2016-06-30  Andre Vieira  
 Terry Guo  

 * gcc.target/arm/pure-code/ffunction-sections.c: New.
 * gcc.target/arm/pure-code/no-literal-pool.c: New.
 * gcc.target/arm/pure-code/pure-code.exp: New.

>>>
>> Hi Sandra, Mickael,
>>
>> Thank you for your comments. I changed the description of -mpure-code in
>> invoke.texi to better reflect the error message you get wrt supported
>> targets.
>>
>> As for the target hook description, I hope the text is clearer now. Let
>> me know if you think it needs further explanation.
>>
>> I also fixed the double '%' in the text string for unnamed text sections
>> and disabled the casesi pattern.
>>
>> I duplicated the original casesi test
>> 'gcc/testsuite/gcc.c-torture/compile/pr46934.c' for pure-code to make
>> sure the casesi was disabled and other patterns were selected instead.
>>
>> Reran regressions for pure-code.exp for Cortex-M3.
>>
>> Cheers,
>> Andre
>>
>>
>> gcc/ChangeLog:
>> 2016-07-11  Andre Vieira  
>> Terry Guo  
>>
>> * target.def (elf_flags_numeric): New target hook.
>> * hooks.c (hook_uint_uintp_false): New generic hook.
>> * varasm.c (default_elf_asm_named_section): Use new target hook.
>> * config/arm/arm.opt (mpure-code): New.
>> * config/arm/arm.h (SECTION_ARM_PURECODE): New.
>> * config/arm/arm.c (arm_asm_init_sections): Add section
>> attribute to default text section if -mpure-code.
>> (arm_option_check_internal): Diagnose use of option with
>> non supported targets and/or options.
>> (arm_asm_elf_flags_numeric): New.
>> (arm_function_section): New.
>> (arm_elf_section_type_flags): New.
>>   

Re: Early jump threading

2016-08-11 Thread Jan Hubicka
> On Thu, 11 Aug 2016, Jan Hubicka wrote:
> 
> > Hi,
> > this patch adds early jump threading pass.  Jump threading is one of most
> > common cases where estimated profile becomes corrupted, because the branches
> > are predicted independently beforehand. This patch simply adds early mode to
> > jump threader that does not permit code growth and thus only win/win threads
> > are done before profile is constructed.
> > 
> > Naturally this also improves IPA decisions because code size/time is 
> > estimated
> > more acurately.
> > 
> > It is not very cool to add another pass and the jump threader is already
> > run 5 times. I think incrementally we could drop one of late threaders at 
> > least.
> > I tried to measure compile time effects but they are in wash. Moreover the 
> > patch
> > pays for itself in cc1plus code size:
> > 
> > Before patch to tweak size estimates: 27779964  
> > Current mainline: 27748900  
> > With patch applied:   27716173  
> > 
> > So despite adding few functions the code size effect is positive which I 
> > think
> > is quite nice.
> > 
> > Given the fact that jump threading seems quite lightweit, I wonder why it is
> > guarded by flag_expensive_optimizations? Is it expensive in terms of code
> > size?
> > 
> > The effectivity of individual threading passes is as follows (for tramp3d)
> > 
> >   mainline  with patch
> > pass  thread count profile mismatches   thread countprofile mismatch
> > early   525
> > 1 1853 1900 316 337
> > 2 4812  4   288
> > 3 24   1450 32  947
> > 4 76   1457 75  975
> > 
> > So at least tramp3d data seems to suggest that we can drop the second 
> > occurence
> > of jump threading and that most of the job thread1 does can be done by the
> > size restricted early version (the lower absolute counts are caused by the
> > fact that threadable paths gets duplicated by the inliner). 50% drop in
> > profile distortion is not bad. I wonder why basically every threaded paths 
> > seems
> > to introduce a mismatch.
> > 
> > The patch distorts testusite somewhat, in most cases we only want to update
> > dump files or disable early threading:
> > 
> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> > +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
> > +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering 
> > FSM" 1
> > +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not 
> > thread around loop and would copy too many statements"
> > +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 
> > "Jumps threaded: 1" 1
> > +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
> > +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate 
> > p_.*to 1" 1
> > +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
> > +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: [1, 1]"
> > 
> > This testcase is the now probably unnecesary heuristics (it may still be
> > relevant in cases we do not thread because of size bounds but its 
> > effectivity
> > may not be worth the maintenance cost):
> > 
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times 
> > profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times 
> > profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times 
> > profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times 
> > profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times 
> > profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times 
> > profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times 
> > profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times 
> > profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times 
> > profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times 
> > profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times 
> > profile_estimate "extra loop exit heuristics 

Re: [v3 PATCH] Implement C++17 make_from_tuple.

2016-08-11 Thread Paolo Carlini

Hi,

On 11/08/2016 02:04, Ville Voutilainen wrote:

I was in the middle of doing this, so here's the patch before I commit
the string_view one.
Thanks. This one also looks straightforward enough to me. I'm still 
working on your largish contributions...


Thanks again,
Paolo.


Re: [PATCH] Fix early debug regression with DW_AT_string_length (PR debug/71906)

2016-08-11 Thread Richard Biener
On Thu, Aug 11, 2016 at 2:36 PM, Jakub Jelinek  wrote:
> Hi!
>
> On Wed, Aug 10, 2016 at 12:23:06PM +0200, Richard Biener wrote:
>> > Introducing another attribute that does the same thing as existing 
>> > attribute
>> > would be way too ugly.  In theory the reference class could be added to
>> > DW_AT_string_length, I can ask on DWARF workgroup (but it might be too late
>> > for DWARF 5), but that still doesn't solve the issue of the indirect 
>> > params.
>> >
>> > How do you want to handle the debug info without ammending the 
>> > early-generated
>> > DWARF though?  Just by using it as abstract origins of everything and
>> > ammending in copies?
>>
>> Yes.
>
> I've filed an enhancement request and got one positive feedback, but it is
> going to be multiple months at best.
> So, at least for non-LTO I'd strongly prefer to go with what the current
> patch does, and for LTO we'd then have to ask GDB to implement the reference
> class for DW_AT_string_length and then just use that instead, depending on
> flag_lto or similar, or perhaps for dwarf4 and earlier don't emit for LTO
> variable string length and keep it only for dwarf5+ (if the change is
> approved).  For the indirect parms and LTO, I guess we'd need to create some
> artificial VAR_DECL at function scope with DECL_VALUE_EXPR of *parm,
> DECL_ARTIFICIAL, DECL_NAMELESS and reference that instead of the parm
> itself.
> With this, do you object to the current patch?

No, I still hope it avoids generating references to possibly optimized-out stuff
early - I didn't thoroughly review it.  I guess I'll find out soon ;)

Richard.

> Jakub


Re: Early jump threading

2016-08-11 Thread Richard Biener
On Thu, 11 Aug 2016, Jan Hubicka wrote:

> Hi,
> this patch adds early jump threading pass.  Jump threading is one of most
> common cases where estimated profile becomes corrupted, because the branches
> are predicted independently beforehand. This patch simply adds early mode to
> jump threader that does not permit code growth and thus only win/win threads
> are done before profile is constructed.
> 
> Naturally this also improves IPA decisions because code size/time is estimated
> more acurately.
> 
> It is not very cool to add another pass and the jump threader is already
> run 5 times. I think incrementally we could drop one of late threaders at 
> least.
> I tried to measure compile time effects but they are in wash. Moreover the 
> patch
> pays for itself in cc1plus code size:
> 
> Before patch to tweak size estimates: 27779964  
> Current mainline: 27748900  
> With patch applied:   27716173  
> 
> So despite adding few functions the code size effect is positive which I think
> is quite nice.
> 
> Given the fact that jump threading seems quite lightweit, I wonder why it is
> guarded by flag_expensive_optimizations? Is it expensive in terms of code
> size?
> 
> The effectivity of individual threading passes is as follows (for tramp3d)
> 
>   mainline  with patch
> pass  thread count profile mismatches   thread countprofile mismatch
> early   525
> 1 1853 1900 316 337
> 2 4812  4   288
> 3 24   1450 32  947
> 4 76   1457 75  975
> 
> So at least tramp3d data seems to suggest that we can drop the second 
> occurence
> of jump threading and that most of the job thread1 does can be done by the
> size restricted early version (the lower absolute counts are caused by the
> fact that threadable paths gets duplicated by the inliner). 50% drop in
> profile distortion is not bad. I wonder why basically every threaded paths 
> seems
> to introduce a mismatch.
> 
> The patch distorts testusite somewhat, in most cases we only want to update
> dump files or disable early threading:
> 
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
> +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering 
> FSM" 1
> +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread 
> around loop and would copy too many statements"
> +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 
> "Jumps threaded: 1" 1
> +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
> +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate 
> p_.*to 1" 1
> +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
> +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: [1, 1]"
> 
> This testcase is the now probably unnecesary heuristics (it may still be
> relevant in cases we do not thread because of size bounds but its effectivity
> may not be worth the maintenance cost):
> 
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times 
> profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times 
> profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times 
> profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times 
> profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times 
> profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times 
> profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times 
> profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times 
> profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times 
> profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times 
> profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times 
> profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times 
> profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  

[PATCH] Improve VRP PHI handling

2016-08-11 Thread Richard Biener

The patch for PR72851 exposes an issue with PHI handling in VRP that
is easily fixed if we allow iterating once more when we are sure this is
the last iteration with the current set of executable edges.  It also
makes sure we always print the lattice change, even if it is to VARYING.

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

Richard.

2016-08-11  Richard Biener  

* tree-vrp.c (vrp_visit_phi_node): Allow a last iteration if
the currently executable edges have fixed ranges.  Always
go through update_value_range.

Index: gcc/tree-vrp.c
===
--- gcc/tree-vrp.c  (revision 239361)
+++ gcc/tree-vrp.c  (working copy)
@@ -8725,6 +8703,7 @@ vrp_visit_phi_node (gphi *phi)
   print_gimple_stmt (dump_file, phi, 0, dump_flags);
 }
 
+  bool may_simulate_again = false;
   edges = 0;
   for (i = 0; i < gimple_phi_num_args (phi); i++)
 {
@@ -8747,6 +8726,12 @@ vrp_visit_phi_node (gphi *phi)
 
  if (TREE_CODE (arg) == SSA_NAME)
{
+ /* See if we are eventually going to change one of the args.  */
+ gimple *def_stmt = SSA_NAME_DEF_STMT (arg);
+ if (! gimple_nop_p (def_stmt)
+ && prop_simulate_again_p (def_stmt))
+   may_simulate_again = true;
+
  vr_arg = *(get_value_range (arg));
  /* Do not allow equivalences or symbolic ranges to leak in from
 backedges.  That creates invalid equivalencies.
@@ -8822,11 +8807,14 @@ vrp_visit_phi_node (gphi *phi)
  previous one.  We don't do this if we have seen a new executable
  edge; this helps us avoid an overflow infinity for conditionals
  which are not in a loop.  If the old value-range was VR_UNDEFINED
- use the updated range and iterate one more time.  */
+ use the updated range and iterate one more time.  If we will not
+ simulate this PHI again with the same number of edges then iterate
+ one more time.  */
   if (edges > 0
   && gimple_phi_num_args (phi) > 1
   && edges == old_edges
-  && lhs_vr->type != VR_UNDEFINED)
+  && lhs_vr->type != VR_UNDEFINED
+  && may_simulate_again)
 {
   /* Compare old and new ranges, fall back to varying if the
  values are not comparable.  */
@@ -8880,6 +8868,31 @@ vrp_visit_phi_node (gphi *phi)
   goto infinite_check;
 }
 
+  goto update_range;
+
+varying:
+  set_value_range_to_varying (_result);
+
+scev_check:
+  /* If this is a loop PHI node SCEV may known more about its value-range.
+ scev_check can be reached from two paths, one is a fall through from above
+ "varying" label, the other is direct goto from code block which tries to
+ avoid infinite simulation.  */
+  if ((l = loop_containing_stmt (phi))
+  && l->header == gimple_bb (phi))
+adjust_range_with_scev (_result, l, phi, lhs);
+
+infinite_check:
+  /* If we will end up with a (-INF, +INF) range, set it to
+ VARYING.  Same if the previous max value was invalid for
+ the type and we end up with vr_result.min > vr_result.max.  */
+  if ((vr_result.type == VR_RANGE || vr_result.type == VR_ANTI_RANGE)
+  && !((vrp_val_is_max (vr_result.max) && vrp_val_is_min (vr_result.min))
+  || compare_values (vr_result.min, vr_result.max) > 0))
+;
+  else
+set_value_range_to_varying (_result);
+
   /* If the new range is different than the previous value, keep
  iterating.  */
 update_range:
@@ -8902,31 +8915,6 @@ update_range:
 
   /* Nothing changed, don't add outgoing edges.  */
   return SSA_PROP_NOT_INTERESTING;
-
-varying:
-  set_value_range_to_varying (_result);
-
-scev_check:
-  /* If this is a loop PHI node SCEV may known more about its value-range.
- scev_check can be reached from two paths, one is a fall through from above
- "varying" label, the other is direct goto from code block which tries to
- avoid infinite simulation.  */
-  if ((l = loop_containing_stmt (phi))
-  && l->header == gimple_bb (phi))
-adjust_range_with_scev (_result, l, phi, lhs);
-
-infinite_check:
-  /* If we will end up with a (-INF, +INF) range, set it to
- VARYING.  Same if the previous max value was invalid for
- the type and we end up with vr_result.min > vr_result.max.  */
-  if ((vr_result.type == VR_RANGE || vr_result.type == VR_ANTI_RANGE)
-  && !((vrp_val_is_max (vr_result.max) && vrp_val_is_min (vr_result.min))
-  || compare_values (vr_result.min, vr_result.max) > 0))
-goto update_range;
-
-  /* No match found.  Set the LHS to VARYING.  */
-  set_value_range_to_varying (lhs_vr);
-  return SSA_PROP_VARYING;
 }
 
 /* Simplify boolean operations if the source is known



Early jump threading

2016-08-11 Thread Jan Hubicka
Hi,
this patch adds an early jump threading pass.  Jump threading is one of the most
common cases where the estimated profile becomes corrupted, because the branches
are predicted independently beforehand.  This patch simply adds an early mode to
the jump threader that does not permit code growth, so only win/win threads
are done before the profile is constructed.
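As an illustration (not the patch's code; just a sketch of the situation the pass targets), a "win/win" thread is one where following a known condition lets a path skip a branch without any code growth:

```cpp
#include <cassert>

// Sketch of a threadable branch: on the path where 'a > 0' was just
// tested, the value of 'x' at the second test is already known, so a
// jump threader can route that path directly to 'return 1' without
// duplicating any real work (hence "win/win": no code growth).
int f (int a)
{
  int x = (a > 0);   // first branch decides x
  if (x)             // second branch is fully determined by the first
    return 1;
  return 2;
}
```

After threading, the a > 0 path jumps straight to the return; the two branches are no longer independent, which is exactly why predicting them independently corrupts the estimated profile.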

Naturally this also improves IPA decisions because code size/time is estimated
more accurately.

It is not very cool to add another pass when the jump threader is already
run 5 times. I think incrementally we could drop one of the late threaders at
least.  I tried to measure compile time effects but they are in the wash.
Moreover the patch pays for itself in cc1plus code size:

Before patch to tweak size estimates: 27779964
Current mainline:                     27748900
With patch applied:                   27716173

So despite adding a few functions the code size effect is positive, which I
think is quite nice.

Given the fact that jump threading seems quite lightweight, I wonder why it is
guarded by flag_expensive_optimizations? Is it expensive in terms of code
size?

The effectiveness of the individual threading passes is as follows (for tramp3d):

                mainline                        with patch
pass    thread count  profile mismatches    thread count  profile mismatches
early                                       525
1       1853          1900                  316           337
2       48            12                    4             288
3       24            1450                  32            947
4       76            1457                  75            975

So at least the tramp3d data seems to suggest that we can drop the second
occurrence of jump threading and that most of the job thread1 does can be done
by the size-restricted early version (the lower absolute counts are caused by
the fact that threadable paths get duplicated by the inliner). A 50% drop in
profile distortion is not bad. I wonder why basically every threaded path seems
to introduce a mismatch.

The patch distorts the testsuite somewhat; in most cases we only want to update
dump files or disable early threading:

+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
+FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
+FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
+FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
+FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
+FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
+FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
+FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
+FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: [1, 1]"

This testcase covers the now probably unnecessary heuristics (it may still be
relevant in cases we do not thread because of size bounds, but its effectiveness
may not be worth the maintenance cost):

+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: 

[PATCH] Update or add fall through comments in switches

2016-08-11 Thread Marek Polacek
This patch either updates fall through comments so that our warning
machinery recognizes them, or annotates code with new FALLTHRU comments
so that intended fall through cases are marked and the compiler doesn't
warn.
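As a hedged illustration (not taken from the patch; `/* FALLTHRU */` is one of the comment forms such warning machinery recognizes), an intentional fall through looks like:

```cpp
#include <cassert>

// A case that deliberately runs into the next one is marked so the
// implicit-fallthrough warning stays quiet for it.
int classify (int x)
{
  int r = 0;
  switch (x)
    {
    case 0:
      r += 1;
      /* FALLTHRU */  /* intentional: case 0 also does case 1's work */
    case 1:
      r += 2;
      break;
    default:
      r = -1;
      break;
    }
  return r;
}
```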

I'd like to get this in before I post more patches, to keep this mess
more manageable.

Bootstrapped/regtested on x86_64-linux, ok for trunk?

2016-08-11  Marek Polacek  

PR c/7652
gcc/
* alias.c (find_base_value): Adjust fall through comment.
* cfgexpand.c (expand_debug_expr): Likewise.
* combine.c (find_split_point): Likewise.
(expand_compound_operation): Likewise.  Add FALLTHRU.
(make_compound_operation): Adjust fall through comment.
(canon_reg_for_combine): Add FALLTHRU.
(force_to_mode): Adjust fall through comment.
(simplify_shift_const_1): Likewise.
(simplify_comparison): Likewise.
* config/aarch64/aarch64-builtins.c (aarch64_simd_expand_args): Add
FALLTHRU.
* config/aarch64/predicates.md: Likewise.
* config/i386/i386.c (function_arg_advance_32): Likewise.
(ix86_gimplify_va_arg): Likewise.
(print_reg): Likewise.
(ix86_print_operand): Likewise.
(ix86_build_const_vector): Likewise.
(ix86_expand_branch): Likewise.
(ix86_sched_init_global): Adjust fall through comment.
(ix86_expand_args_builtin): Add FALLTHRU.
(ix86_expand_builtin): Likewise.
(ix86_expand_vector_init_one_var): Likewise.
* config/rs6000/rs6000.c (rs6000_emit_vector_compare_inner): Likewise.
(rs6000_adjust_cost): Likewise.
(insn_must_be_first_in_group): Likewise.
* config/rs6000/rs6000.md: Likewise.  Adjust fall through comment.
* dbxout.c (dbxout_symbol): Adjust fall through comment.
* df-scan.c (df_uses_record): Likewise.
* dojump.c (do_jump): Add FALLTHRU.
* dwarf2out.c (mem_loc_descriptor): Likewise.  Adjust fall through
comment.
(resolve_args_picking_1): Adjust fall through comment.
(loc_list_from_tree_1): Likewise.
* expmed.c (make_tree): Likewise.
* expr.c (expand_expr_real_2): Add FALLTHRU.
(expand_expr_real_1): Likewise.  Adjust fall through comment.
* fold-const.c (const_binop): Adjust fall through comment.
(fold_truth_not_expr): Likewise.
(fold_cond_expr_with_comparison): Add FALLTHRU.
(fold_binary_loc): Likewise.
(contains_label_1): Adjust fall through comment.
(multiple_of_p): Likewise.
* gcov-tool.c (process_args): Add FALLTHRU.
* genattrtab.c (check_attr_test): Likewise.
(write_test_expr): Likewise.
* genconfig.c (walk_insn_part): Likewise.
* genpreds.c (validate_exp): Adjust fall through comment.
(needs_variable): Likewise.
* gensupport.c (get_alternatives_number): Add FALLTHRU.
(subst_dup): Likewise.
* gimple-pretty-print.c (dump_gimple_assign): Likewise.
* gimplify.c (gimplify_addr_expr): Adjust fall through comment.
(gimplify_scan_omp_clauses): Add FALLTHRU.
(goa_stabilize_expr): Likewise.
* graphite-isl-ast-to-gimple.c (substitute_ssa_name): Adjust fall
through comment.
* hsa-gen.c (get_address_from_value): Likewise.
* ipa-icf.c (sem_function::hash_stmt): Likewise.
* ira.c (ira_setup_alts): Add FALLTHRU.
* lra-eliminations.c (lra_eliminate_regs_1): Adjust fall through
comment.
* lto-streamer-out.c (lto_output_tree_ref): Add FALLTHRU.
* opts.c (common_handle_option): Likewise.
* read-rtl.c (read_rtx_code): Likewise.
* real.c (round_for_format): Likewise.
* recog.c (asm_operand_ok): Likewise.
* reginfo.c (reg_scan_mark_refs): Adjust fall through comment.
* reload1.c (set_label_offsets): Likewise.
(eliminate_regs_1): Likewise.
(reload_reg_reaches_end_p): Likewise.
* rtlanal.c (commutative_operand_precedence): Add FALLTHRU.
(rtx_cost): Likewise.
* sched-rgn.c (is_exception_free): Likewise.
* simplify-rtx.c (simplify_rtx): Adjust fall through comment.
* stor-layout.c (int_mode_for_mode): Likewise.
* toplev.c (print_to_asm_out_file): Likewise.
(print_to_stderr): Likewise.
* tree-cfg.c (gimple_verify_flow_info): Likewise.
* tree-chrec.c (chrec_fold_plus_1): Add FALLTHRU.
(chrec_fold_multiply): Likewise.
(evolution_function_is_invariant_rec_p): Likewise.
(for_each_scev_op): Likewise.
* tree-data-ref.c (siv_subscript_p): Likewise.
(get_references_in_stmt): Likewise.
* tree.c (find_placeholder_in_expr): Adjust fall through comment.
(substitute_in_expr): Likewise.
(type_cache_hasher::equal): Likewise.
(walk_type_fields): Likewise.
* var-tracking.c (adjust_mems): Add FALLTHRU.

Re: [RFC] ipa bitwise constant propagation

2016-08-11 Thread Jan Hubicka
> @@ -266,6 +267,38 @@ private:
>bool meet_with_1 (unsigned new_align, unsigned new_misalign);
>  };
>  
> +/* Lattice of known bits, only capable of holding one value.
> +   Similar to ccp_lattice_t, mask represents which bits of value are 
> constant.
> +   If a bit in mask is set to 0, then the corresponding bit in
> +   value is known to be constant.  */
> +
> +class ipcp_bits_lattice
> +{
> +public:
> +  bool bottom_p () { return m_lattice_val == IPA_BITS_VARYING; }
> +  bool top_p () { return m_lattice_val == IPA_BITS_UNDEFINED; }
> +  bool constant_p () { return m_lattice_val == IPA_BITS_CONSTANT; }
> +  bool set_to_bottom ();
> +  bool set_to_constant (widest_int, widest_int); 
> + 
> +  widest_int get_value () { return m_value; }
> +  widest_int get_mask () { return m_mask; }
> +
> +  bool meet_with (ipcp_bits_lattice& other, unsigned, signop,
> +   enum tree_code, tree);
> +
> +  bool meet_with (widest_int, widest_int, unsigned);
> +
> +  void print (FILE *);
> +
> +private:
> +  enum { IPA_BITS_UNDEFINED, IPA_BITS_CONSTANT, IPA_BITS_VARYING } 
> m_lattice_val;
> +  widest_int m_value, m_mask;

Please add a comment for these, like the one in tree-ssa-ccp, and mention they
are the same values.
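For reference, the value/mask convention described in the quoted comment implies a meet operation like the following sketch (names are mine, not the patch's):

```cpp
#include <cassert>
#include <cstdint>

// Known-bits pair: if a bit of 'mask' is 0, the corresponding bit of
// 'value' is known constant; mask bits set to 1 mean "unknown".
struct bits { uint64_t value, mask; };

// Meet of two lattice values: a bit stays known only if it is known in
// both inputs and the known values agree.
static bits
meet (bits a, bits b)
{
  bits r;
  r.mask = a.mask | b.mask | (a.value ^ b.value);
  r.value = a.value & ~r.mask;   // canonical form: unknown bits are 0
  return r;
}
```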

> +
> +  /* For K&R C programs, ipa_get_type() could return NULL_TREE.
> + Avoid the transform for these cases.  */
> +  if (!parm_type)
> +return dest_lattice->set_to_bottom ();

Please add TDF_DETAILS dump for this so we notice if we drop useful info for no
good reasons.  It also happens for variadic functions but hopefully not much 
more.

The patch is OK with those changes.

thanks,
Honza


Re: [PATCH] Fix early debug regression with DW_AT_string_length (PR debug/71906)

2016-08-11 Thread Jakub Jelinek
Hi!

On Wed, Aug 10, 2016 at 12:23:06PM +0200, Richard Biener wrote:
> > Introducing another attribute that does the same thing as existing attribute
> > would be way too ugly.  In theory the reference class could be added to
> > DW_AT_string_length, I can ask on DWARF workgroup (but it might be too late
> > for DWARF 5), but that still doesn't solve the issue of the indirect params.
> >
> > How do you want to handle the debug info without amending the
> > early-generated DWARF though?  Just by using it as abstract origins of
> > everything and amending in copies?
> 
> Yes.

I've filed an enhancement request and got one positive feedback, but it is
going to be multiple months at best.
So, at least for non-LTO I'd strongly prefer to go with what the current
patch does, and for LTO we'd then have to ask GDB to implement the reference
class for DW_AT_string_length and then just use that instead, depending on
flag_lto or similar, or perhaps for dwarf4 and earlier don't emit for LTO
variable string length and keep it only for dwarf5+ (if the change is
approved).  For the indirect parms and LTO, I guess we'd need to create some
artificial VAR_DECL at function scope with DECL_VALUE_EXPR of *parm,
DECL_ARTIFICIAL, DECL_NAMELESS and reference that instead of the parm
itself.
With this, do you object to the current patch?

Jakub


Re: backward threading heuristics tweek

2016-08-11 Thread Jan Hubicka
> This also caused:
> 
> FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4
> 
>   Failures:
>   gcc.dg/tree-ssa/pr69270-3.c
>   
>   Bisected to: 
> 
>   Author: hubicka
>   Date:   Sun Aug 7 10:50:16 2016 +
> 
>   * tree-ssa-threadbackward.c: Include tree-inline.h
>   (profitable_jump_thread_path): Use estimate_num_insns to estimate
>   size of copied block; for cold paths reduce duplication.
>   (find_jump_threads_backwards): Remove redundant tests.
>   (pass_thread_jumps::gate): Enable for -Os.
>   * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update testcase.
> 
>   svn+ssh://gcc.gnu.org/svn/gcc/trunk@239219 
> 
> On my aarch64-none-linux-gnu and x86_64-none-linux-gnu automatic bisect
> robots.

Sorry for that - it seems I have missed this failure.  The reason is that FSM
now stops on:
  /* We avoid creating irreducible inner loops unless we thread through
 a multiway branch, in which case we have deemed it worth losing
 other loop optimizations later.

 We also consider it worth creating an irreducible inner loop if
 the number of copied statement is low relative to the length of
 the path -- in that case there's little the traditional loop
 optimizer would have done anyway, so an irreducible loop is not
 so bad.  */
  if (!threaded_multiway_branch && creates_irreducible_loop
  && (n_insns * PARAM_VALUE (PARAM_FSM_SCALE_PATH_STMTS)
  > path_length * PARAM_VALUE (PARAM_FSM_SCALE_PATH_BLOCKS)))

{
  if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file,
 "FSM would create irreducible loop without threading "
 "multiway branch.\n");
  path->pop ();
  return NULL;
}

The path threaded now gets n_insns==13 and path_length==6. I guess the difference
is that the path consists of several calls that are considered heavy by the
new code size estimate, which is correct. It is definitely heavier than a path
consisting of a few increments.
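Plugging those numbers into the quoted cut-off (a sketch; 2 and 3 are assumed here as the usual defaults of --param fsm-scale-path-stmts and --param fsm-scale-path-blocks, which may differ on other configurations):

```cpp
#include <cassert>

// Model of the quoted check: the path is rejected (no irreducible loop
// created) when the scaled statement count exceeds the scaled path length.
static bool
rejects_path (int n_insns, int path_length,
              int scale_path_stmts = 2,   // assumed default
              int scale_path_blocks = 3)  // assumed default
{
  return n_insns * scale_path_stmts > path_length * scale_path_blocks;
}
```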

:
if (phi_inserted_5 == 0)
  goto ;
else
  goto ;
:
_2 = boo ();
if (_2 != 20)
  goto ;
else
  goto ;

:
_1 = arf ();
if (_1 != 10)
  goto ;
else
  goto ;

:
# phi_inserted_5 = PHI <0(2), phi_inserted_4(8)>
_3 = end_imm_use_stmt_p ();
if (_3 == 0)
  goto ;
else
  goto ;

loop latch.
: 
# phi_inserted_4 = PHI 
next_imm_use_stmt ();

:
if (phi_inserted_5 == 0)
  goto ;
else
  goto ;


I would say that optimizing this path to dead code is not the most important
thing.  The question is whether there is really a problem with an irreducible
loop. There are two loops in the function body prior to threading:

;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4 6 7 8 9 10
;;
;; Loop 1
;;  header 9, latch 8
;;  depth 1, outer 0
;;  nodes: 9 8 4 6 7 3
;; 2 succs { 9 }
;; 3 succs { 8 4 }
;; 4 succs { 8 6 }
;; 6 succs { 8 7 }
;; 7 succs { 8 }
;; 8 succs { 9 }
;; 9 succs { 3 10 }
;; 10 succs { 1 }

So the threaded path lives fully inside loop 1: 6->8->9->3->4->6, propagating
that phi_inserted is 0 after the first iteration of the loop.  This looks like
a useful loop peeling opportunity which does not garble the loop structure. So
perhaps threading paths starting at and passing the loop latch (i.e. peeling)
is sane? Perhaps all paths fully captured in the loop in question are?

;; 2 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
;;
;; Loop 1
;;  header 12, latch 11
;;  depth 1, outer 0
;;  nodes: 12 11 4 9 10 3 8 7 6 5
;; 2 succs { 12 }
;; 3 succs { 11 4 }
;; 4 succs { 11 5 }
;; 5 succs { 6 10 }
;; 6 succs { 7 }
;; 7 succs { 8 13 }
;; 8 succs { 11 9 }
;; 9 succs { 11 10 }
;; 10 succs { 11 }
;; 11 succs { 12 }
;; 12 succs { 3 13 }
;; 13 succs { 1 }

Honza


Re: protected alloca class for malloc fallback

2016-08-11 Thread Oleg Endo
On Wed, 2016-08-10 at 21:31 -0400, Trevor Saunders wrote:

> 
> Well, I'd say the compromises made in std::string make it pretty
> terrible for general purpose use except where perf and memory don't
> matter at all.
> 
> std::vector isn't very great size wise either, imho the size / 
> capacity fields would be much better put in the buffer than in the 
> struct itself, to save 2 * sizeof (void *) in sizeof std::vector.

There's nothing in the standard that says those fields have to go into
the vector struct/class itself and not into the data buffer.  It just
happens to be the way it's implemented in libstdc++.  E.g. libc++
implements some things in different ways to save a few bytes here and
there.

It looks more practical to put those fields into the container itself,
otherwise you'd have to do a nullptr check every time you access the
size field etc.  Although newer compilers optimize away the redundant
nullptr checks, older compilers (which GCC wants to be good with) might
not do it.  In any case, it seems to generate slightly bigger code, at
least in one instance (see attachment).

Anyway, normally there is more data in the vectors than there are
vectors around, so whether sizeof (vector) is 8 or 24 bytes doesn't
matter.
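The size trade-off under discussion can be made concrete with a sketch (names are made up; exact sizes are platform dependent, hence only relative assertions):

```cpp
#include <cassert>
#include <cstddef>

// Layout A: size/capacity live in the container object itself
// (libstdc++ style, three words per vector object).
struct vec_inline_fields
{
  std::size_t size;
  std::size_t capacity;
  int *buffer;
};

// Layout B: size/capacity live in a header at the start of the heap
// buffer, so the container object is a single pointer -- but every
// size() call needs a null check.
struct vec_header_in_buffer
{
  struct header { std::size_t size; std::size_t capacity; };
  header *buffer;
  std::size_t size () const { return buffer ? buffer->size : 0; }
};
```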

> 
> or just having your stack_string type convert to the normal string 
> type so that you can pass mutable references.

But that would imply copying, wouldn't it?

Cheers,
Oleg
#if 0
template <typename T>
class vector
{
public:
  unsigned int size (void) const { return m_buffer != nullptr ? m_buffer->size : 0; }
  bool empty (void) const { return size () == 0; }

  unsigned int capacity (void) const { return m_buffer != nullptr ? m_buffer->capacity : 0; }

  T* data (void) { return (T*)(m_buffer + 1); }
  const T* data (void) const { return (const T*)(m_buffer + 1); }

  T& operator [] (unsigned int i) { return data ()[i]; }
  const T& operator [] (unsigned int i) const { return data ()[i]; }

private:
  struct buffer_header
  {
unsigned int size;
unsigned int capacity;
  };

  buffer_header* m_buffer;
};


int foo (const vector<int>& x)
{
  if (x.empty ())
return 0;

  int r = 0;
  for (unsigned int i = 0; i < x.size (); ++i)
r += x[i];

  return r;
}

/*
-O2 -m2a
	mov.l	@r4,r2
	tst	r2,r2
	bt	.L5
	mov.l	@r2,r1
	tst	r1,r1
	bt.s	.L5
	shll2	r1
	add	#-4,r1
	shlr2	r1
	add	#8,r2
	mov	#0,r0
	add	#1,r1
	.align 2
.L3:
	mov.l	@r2+,r3
	dt	r1
	bf.s	.L3
	add	r3,r0
	rts/n
	.align 1
.L5:
	rts
	mov	#0,r0
*/
#endif


#if 1

template <typename T>
class vector
{
public:
  unsigned int size (void) const { return m_size; }
  bool empty (void) const { return size () == 0; }

  unsigned int capacity (void) const { return m_capacity; }

  T* data (void) { return (T*)(m_buffer); }
  const T* data (void) const { return (const T*)(m_buffer); }

  T& operator [] (unsigned int i) { return data ()[i]; }
  const T& operator [] (unsigned int i) const { return data ()[i]; }

private:
  unsigned int m_size;
  unsigned int m_capacity;
  T* m_buffer;
};


int foo (const vector<int>& x)
{
  if (x.empty ())
return 0;

  int r = 0;
  for (unsigned int i = 0; i < x.size (); ++i)
r += x[i];

  return r;
}

/*
-O2 -m2a
	mov.l	@r4,r1
	tst	r1,r1
	bt.s	.L7
	mov	#0,r0
	shll2	r1
	mov.l	@(8,r4),r2
	add	#-4,r1
	shlr2	r1
	add	#1,r1
	.align 2
.L3:
	mov.l	@r2+,r3
	dt	r1
	bf.s	.L3
	add	r3,r0
.L7:
	rts/n
*/

#endif



Re: backward threading heuristics tweek

2016-08-11 Thread Jan Hubicka
> On Mon, Jun 6, 2016 at 3:19 AM, Jan Hubicka  wrote:
> > Hi,
> > while looking into profile mismatches introduced by the backward threading 
> > pass
> > I noticed that the heuristics seems quite simplistic.  First it should be
> > profile sensitive and disallow duplication when optimizing cold paths. 
> > Second
> > it should use estimate_num_insns because gimple statement count is not 
> > really
> > very realistic estimate of final code size effect and third there seems to 
> > be
> > no reason to disable the pass for functions optimized for size.
> >
> > If we block duplication for more than 1 insns for size optimized paths the 
> > pass
> > is able to do majority of threading decisions that are for free and improve 
> > codegen.
> > The code size benefit was between 0.5% to 2.7% on testcases I tried 
> > (tramp3d,
> > GCC modules, xalancbmk and some other stuff around my hd).
> >
> > Bootstrapped/regtested x86_64-linux, seems sane?
> >
> > The pass should also avoid calling cleanup_cfg when no threading was done,
> > and I do not see why it is guarded by expensive_optimizations. What are the
> > main compile time complexity limitations?
> 
> This patch caused a huge regression (~11%) on coremarks on ThunderX.
> I assume other targets too.
> Basically it looks like the path is no longer thread jumped.

Sorry for the late reply. I checked our periodic testers and the patch seems
more or less performance neutral, with some code size improvements. Can you
point me to the path that is no longer crossjumped? I added diag output, so
you should see the reason why the path was considered unprofitable - either
it was cold or we exceeded the maximal size.
The size limit is largely untuned, so perhaps we can just adjust it.

Honza


Re: [PATCH PR71956] Add missed check on MASK_STORE builtin.

2016-08-11 Thread Jakub Jelinek
On Thu, Aug 11, 2016 at 02:03:37PM +0300, Yuri Rumyantsev wrote:
> Hi All,
> 
> Jakub introduced regression after r235764 and we got RF for
> spec2000/176.gcc on HSW if loop vectorization is on (32-bit only).
> Here is a simple fix which cures the issue.
> 
> Is it OK for trunk?
> ChangeLog:
> 2016-08-11  Yuri Rumyantsev  
> 
> PR rtl-optimization/71956
> * ipa-pure-const.c (check_call): Add missed check on MASK_STORE
> builtin which cannot be a pure function.

This is weird.  Even for MASK_STORE the comment IMHO still holds,
there are no cgraph edges for internal calls, so it won't be processed as a
call during ipa mode.  Is the problem that the ipa-pure-const pass marks a
function containing MASK_STORE still as const or pure, or doesn't mark
something you want to mark, or something else?

At least in the non-ipa mode, I'd expect that a MASK_STORE to a (potential)
non-local decl should be treated in this pass similarly to normal stores
and similarly MASK_LOAD.
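A scalar model of what MASK_STORE does (a sketch, to show why it must count as a store for purity analysis - it conditionally writes through a pointer that may be non-local):

```cpp
#include <cassert>

// Element i of dst is written only where mask[i] is set.  A function
// containing such an operation writes memory and therefore cannot be
// classified as 'pure' or 'const'.
static void
mask_store (int *dst, const int *src, const bool *mask, int n)
{
  for (int i = 0; i < n; ++i)
    if (mask[i])
      dst[i] = src[i];
}
```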

I'm really surprised the ipa-pure-const pass sees these internal calls
though, I thought they are introduced late (tree-if-conv) while
ipa-pure-const runs during IPA opts far earlier, isn't that the case?

Jakub


[Committed 0/2] Vector builtin improvements

2016-08-11 Thread Andreas Krebbel
These two patches implement 2 minor improvements to the vector builtins.

Committed after successful regression testing with -march=z13.  The
patches will be applied to GCC 6 branch as well.

Andreas Krebbel (2):
  S/390: Fix vlvg vlgv low-level builtins.
  S/390: Provide low-level builtins with __int128 ops.

 gcc/config/s390/s390-builtin-types.def |   3 +
 gcc/config/s390/s390-builtins.def  |  69 
 gcc/config/s390/s390.c |  15 +
 gcc/config/s390/s390.md|   5 +-
 gcc/config/s390/vecintrin.h|  16 ++---
 gcc/config/s390/vx-builtins.md | 113 +
 6 files changed, 101 insertions(+), 120 deletions(-)

-- 
2.9.1



[PATCH 2/2] S/390: Provide low-level builtins with __int128 ops.

2016-08-11 Thread Andreas Krebbel
gcc/ChangeLog:

2016-08-11  Andreas Krebbel  

* config/s390/s390-builtin-types.def: Add INT128 types.
* config/s390/s390-builtins.def: Add INT128 variants for the add
sub low-level builtins dealing with TImode.
* config/s390/s390.c (s390_expand_builtin): Allow mode conversions
via subreg when expanding a builtin.
* config/s390/s390.md: Remove UNSPEC_VEC_ADDC_U128,
UNSPEC_VEC_SUB_U128, and UNSPEC_VEC_SUBC_U128 constants.
Fix comment.
* config/s390/vecintrin.h: Adjust builtin names accordingly.
* config/s390/vx-builtins.md ("vec_add_u128"): Remove expander.
("vec_addc", "vec_addc_u128"): Merge to
"vacc_".
("vec_adde_u128"): Rename to "vacq". Change mode to TImode.
("vec_addec_u128"): Rename to "vacccq". Change mode to TImode.
("vec_subc", "vec_subc_u128"): Merge to
"vscbi_".
("vec_sube_u128"): Rename to "vsbiq". Change mode to TImode.
("vec_subec_u128"): Rename to "vsbcbiq". Change mode to TImode.
---
 gcc/config/s390/s390-builtin-types.def |   3 +
 gcc/config/s390/s390-builtins.def  |  49 +-
 gcc/config/s390/s390.c |  15 +
 gcc/config/s390/s390.md|   5 +-
 gcc/config/s390/vecintrin.h|  16 ++---
 gcc/config/s390/vx-builtins.md | 113 +
 6 files changed, 91 insertions(+), 110 deletions(-)

diff --git a/gcc/config/s390/s390-builtin-types.def 
b/gcc/config/s390/s390-builtin-types.def
index 3d90d41..f5fcf98 100644
--- a/gcc/config/s390/s390-builtin-types.def
+++ b/gcc/config/s390/s390-builtin-types.def
@@ -66,6 +66,7 @@ DEF_TYPE (BT_FLT, B_VX, float_type_node, 0)
 DEF_TYPE (BT_UINT, 0, unsigned_type_node, 0)
 DEF_TYPE (BT_VOIDCONST, B_VX, void_type_node, 1)
 DEF_TYPE (BT_ULONG, B_VX, long_unsigned_type_node, 0)
+DEF_TYPE (BT_INT128, B_VX, intTI_type_node, 0)
 DEF_TYPE (BT_USHORTCONST, B_VX, short_unsigned_type_node, 1)
 DEF_TYPE (BT_SHORTCONST, B_VX, short_integer_type_node, 1)
 DEF_TYPE (BT_INTCONST, B_VX, integer_type_node, 1)
@@ -171,6 +172,7 @@ DEF_FN_TYPE_1 (BT_FN_V8HI_V8HI, B_VX, BT_V8HI, BT_V8HI)
 DEF_FN_TYPE_1 (BT_FN_VOID_INT, B_HTM, BT_VOID, BT_INT)
 DEF_FN_TYPE_1 (BT_FN_VOID_UINT, 0, BT_VOID, BT_UINT)
 DEF_FN_TYPE_2 (BT_FN_DBL_V2DF_INT, B_VX, BT_DBL, BT_V2DF, BT_INT)
+DEF_FN_TYPE_2 (BT_FN_INT128_INT128_INT128, B_VX, BT_INT128, BT_INT128, 
BT_INT128)
 DEF_FN_TYPE_2 (BT_FN_INT_OV4SI_INT, B_VX, BT_INT, BT_OV4SI, BT_INT)
 DEF_FN_TYPE_2 (BT_FN_INT_OV4SI_OV4SI, B_VX, BT_INT, BT_OV4SI, BT_OV4SI)
 DEF_FN_TYPE_2 (BT_FN_INT_UV16QI_UV16QI, B_VX, BT_INT, BT_UV16QI, BT_UV16QI)
@@ -260,6 +262,7 @@ DEF_FN_TYPE_2 (BT_FN_V8HI_V4SI_V4SI, B_VX, BT_V8HI, 
BT_V4SI, BT_V4SI)
 DEF_FN_TYPE_2 (BT_FN_V8HI_V8HI_V8HI, B_VX, BT_V8HI, BT_V8HI, BT_V8HI)
 DEF_FN_TYPE_2 (BT_FN_VOID_UINT64PTR_UINT64, B_HTM, BT_VOID, BT_UINT64PTR, 
BT_UINT64)
 DEF_FN_TYPE_2 (BT_FN_VOID_V2DF_FLTPTR, B_VX, BT_VOID, BT_V2DF, BT_FLTPTR)
+DEF_FN_TYPE_3 (BT_FN_INT128_INT128_INT128_INT128, B_VX, BT_INT128, BT_INT128, 
BT_INT128, BT_INT128)
 DEF_FN_TYPE_3 (BT_FN_INT_OV4SI_OV4SI_INTPTR, B_VX, BT_INT, BT_OV4SI, BT_OV4SI, 
BT_INTPTR)
 DEF_FN_TYPE_3 (BT_FN_OV4SI_INT_OV4SI_INT, B_VX, BT_OV4SI, BT_INT, BT_OV4SI, 
BT_INT)
 DEF_FN_TYPE_3 (BT_FN_OV4SI_OV4SI_OV4SI_INT, B_VX, BT_OV4SI, BT_OV4SI, 
BT_OV4SI, BT_INT)
diff --git a/gcc/config/s390/s390-builtins.def 
b/gcc/config/s390/s390-builtins.def
index ead0afb..4bcdb22 100644
--- a/gcc/config/s390/s390-builtins.def
+++ b/gcc/config/s390/s390-builtins.def
@@ -741,7 +741,6 @@ B_DEF  (s390_vuplhw,vec_unpacklv8hi,
0,
 B_DEF  (s390_vupllh,vec_unpackl_lv8hi,  0, 
 B_VX,   0,  BT_FN_UV4SI_UV8HI)
 B_DEF  (s390_vuplf, vec_unpacklv4si,0, 
 B_VX,   0,  BT_FN_V2DI_V4SI)
 B_DEF  (s390_vupllf,vec_unpackl_lv4si,  0, 
 B_VX,   0,  BT_FN_UV2DI_UV4SI)
-B_DEF  (s390_vaq,   vec_add_u128,   0, 
 B_VX,   0,  BT_FN_UV16QI_UV16QI_UV16QI)
 
 OB_DEF (s390_vec_addc,  s390_vec_addc_u8,   s390_vec_addc_u64, 
 B_VX,   BT_FN_OV4SI_OV4SI_OV4SI)
 OB_DEF_VAR (s390_vec_addc_u8,   s390_vaccb, 0, 
 BT_OV_UV16QI_UV16QI_UV16QI)
@@ -749,13 +748,20 @@ OB_DEF_VAR (s390_vec_addc_u16,  s390_vacch,   
  0,
 OB_DEF_VAR (s390_vec_addc_u32,  s390_vaccf, 0, 
 BT_OV_UV4SI_UV4SI_UV4SI)
 OB_DEF_VAR (s390_vec_addc_u64,  s390_vaccg, 0, 
 BT_OV_UV2DI_UV2DI_UV2DI)
 
-B_DEF  (s390_vaccb, vec_addcv16qi,  0, 
 B_VX,   0,  BT_FN_UV16QI_UV16QI_UV16QI)
-B_DEF  (s390_vacch, vec_addcv8hi,   0,   

[PATCH 1/2] S/390: Fix vlvg vlgv low-level builtins.

2016-08-11 Thread Andreas Krebbel
gcc/ChangeLog:

2016-08-11  Andreas Krebbel  

* config/s390/s390-builtins.def: Mark last operand of s390_vlvg*
and s390_vlgv* builtins as element selector.
---
 gcc/config/s390/s390-builtins.def | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/gcc/config/s390/s390-builtins.def 
b/gcc/config/s390/s390-builtins.def
index 408c519..ead0afb 100644
--- a/gcc/config/s390/s390-builtins.def
+++ b/gcc/config/s390/s390-builtins.def
@@ -386,11 +386,11 @@ OB_DEF_VAR (s390_vec_insert_u64,s390_vlvgg,   
  O3_ELEM,
 OB_DEF_VAR (s390_vec_insert_b64,s390_vlvgg, O3_ELEM,   
 BT_OV_UV2DI_ULONGLONG_BV2DI_INT)
 OB_DEF_VAR (s390_vec_insert_dbl,s390_vlvgg_dbl, O3_ELEM,   
 BT_OV_V2DF_DBL_V2DF_INT)
 
-B_DEF  (s390_vlvgb, vec_insertv16qi,0, 
 B_VX,   0,  BT_FN_UV16QI_UV16QI_UCHAR_INT)
-B_DEF  (s390_vlvgh, vec_insertv8hi, 0, 
 B_VX,   0,  BT_FN_UV8HI_UV8HI_USHORT_INT)
-B_DEF  (s390_vlvgf, vec_insertv4si, 0, 
 B_VX,   0,  BT_FN_UV4SI_UV4SI_UINT_INT)
-B_DEF  (s390_vlvgg, vec_insertv2di, 0, 
 B_VX,   0,  BT_FN_UV2DI_UV2DI_ULONGLONG_INT)
-B_DEF  (s390_vlvgg_dbl, vec_insertv2df, 0, 
 B_VX | B_INT,   0,  BT_FN_V2DF_V2DF_DBL_INT)
+B_DEF  (s390_vlvgb, vec_insertv16qi,0, 
 B_VX,   O3_ELEM,BT_FN_UV16QI_UV16QI_UCHAR_INT)
+B_DEF  (s390_vlvgh, vec_insertv8hi, 0, 
 B_VX,   O3_ELEM,BT_FN_UV8HI_UV8HI_USHORT_INT)
+B_DEF  (s390_vlvgf, vec_insertv4si, 0, 
 B_VX,   O3_ELEM,BT_FN_UV4SI_UV4SI_UINT_INT)
+B_DEF  (s390_vlvgg, vec_insertv2di, 0, 
 B_VX,   O3_ELEM,BT_FN_UV2DI_UV2DI_ULONGLONG_INT)
+B_DEF  (s390_vlvgg_dbl, vec_insertv2df, 0, 
 B_VX | B_INT,   O3_ELEM,BT_FN_V2DF_V2DF_DBL_INT)
 
 OB_DEF (s390_vec_promote,   
s390_vec_promote_s8,s390_vec_promote_dbl,B_VX,  BT_FN_OV4SI_INT_INT)
 OB_DEF_VAR (s390_vec_promote_s8,s390_vlvgb_noin,O2_ELEM,   
 BT_OV_V16QI_SCHAR_INT)   /* vlvgb */
@@ -424,11 +424,11 @@ OB_DEF_VAR (s390_vec_extract_u64,   s390_vlgvg,   
  O2_ELEM,
 OB_DEF_VAR (s390_vec_extract_b64,   s390_vlgvg, O2_ELEM,   
 BT_OV_ULONGLONG_BV2DI_INT)
 OB_DEF_VAR (s390_vec_extract_dbl,   s390_vlgvg_dbl, O2_ELEM,   
 BT_OV_DBL_V2DF_INT)  /* vlgvg */
 
-B_DEF  (s390_vlgvb, vec_extractv16qi,   0, 
 B_VX,   0,  BT_FN_UCHAR_UV16QI_INT)
-B_DEF  (s390_vlgvh, vec_extractv8hi,0, 
 B_VX,   0,  BT_FN_USHORT_UV8HI_INT)
-B_DEF  (s390_vlgvf, vec_extractv4si,0, 
 B_VX,   0,  BT_FN_UINT_UV4SI_INT)
-B_DEF  (s390_vlgvg, vec_extractv2di,0, 
 B_VX,   0,  BT_FN_ULONGLONG_UV2DI_INT)
-B_DEF  (s390_vlgvg_dbl, vec_extractv2df,0, 
 B_VX | B_INT,   0,  BT_FN_DBL_V2DF_INT)
+B_DEF  (s390_vlgvb, vec_extractv16qi,   0, 
 B_VX,   O2_ELEM,BT_FN_UCHAR_UV16QI_INT)
+B_DEF  (s390_vlgvh, vec_extractv8hi,0, 
 B_VX,   O2_ELEM,BT_FN_USHORT_UV8HI_INT)
+B_DEF  (s390_vlgvf, vec_extractv4si,0, 
 B_VX,   O2_ELEM,BT_FN_UINT_UV4SI_INT)
+B_DEF  (s390_vlgvg, vec_extractv2di,0, 
 B_VX,   O2_ELEM,BT_FN_ULONGLONG_UV2DI_INT)
+B_DEF  (s390_vlgvg_dbl, vec_extractv2df,0, 
 B_VX | B_INT,   O2_ELEM,BT_FN_DBL_V2DF_INT)
 
 OB_DEF (s390_vec_insert_and_zero,   
s390_vec_insert_and_zero_s8,s390_vec_insert_and_zero_dbl,B_VX,BT_FN_OV4SI_INTCONSTPTR)
 OB_DEF_VAR (s390_vec_insert_and_zero_s8,s390_vllezb,0, 
 BT_OV_V16QI_SCHARCONSTPTR)
-- 
2.9.1



Re: [PATCH testsuite]Require vect_cond_mixed for test case gcc.dg/vect/pr56541.c

2016-08-11 Thread Bin.Cheng
On Thu, Aug 11, 2016 at 11:38 AM, Richard Biener
 wrote:
> On Thu, Aug 11, 2016 at 11:56 AM, Bin.Cheng  wrote:
>> On Thu, Aug 11, 2016 at 10:50 AM, Richard Biener
>>  wrote:
>>> On Wed, Aug 10, 2016 at 5:58 PM, Bin Cheng  wrote:
 Hi,
For some reason, tree-if-conv.c now factors the floating point comparison
out of cond_expr, resulting in mixed types in it.  This does help CSE on
common comparison operations.
The only problem is that the test gcc.dg/vect/pr56541.c now requires
vect_cond_mixed to be vectorized.  This patch changes the test accordingly.
 Test result checked.  Is it OK?
>>>
>>> Hmm, I think the fix is to fix if-conversion not doing that.  Can you
>>> track down why this happens?
>> Hmm, but there are several common floating comparison operations in
>> this case; by doing this, we could do CSE on GIMPLE, otherwise we
>> depend on RTL optimizers.
>
> I see.
>
>> I thought we prefer GIMPLE level
>> transforms?
>
> Yes, but the vectorizer is happier with the conditions present in the 
> COND_EXPR
> and thus we concluded we always want to have them there.  forwprop will
> also aggressively put them back.  Note that we cannot put back
> tree_could_throw_p
> conditions (FP compares with signalling nans for example) to properly model EH
> (though for VEC_COND_EXPRs we don't really care here).
Yes, and keeping the comparison together could also enable target
optimizations.  I will check the if-conv behavior later.

Thanks,
bin
>
> Note that nothing between if-conversion and vectorization will perform
> the desired
> CSE anyway.
>
> Richard.
>
>
>> Thanks,
>> bin
>>>
>>> Richard.
>>>
 Thanks,
 bin

 gcc/testsuite/ChangeLog
 2016-08-09  Bin Cheng  

 * gcc.dg/vect/pr56541.c: Require vect_cond_mixed.


[PATCH PR71956] Add missed check on MASK_STORE builtin.

2016-08-11 Thread Yuri Rumyantsev
Hi All,

Jakub introduced a regression with r235764 and we got an RF for
spec2000/176.gcc on HSW if loop vectorization is on (32-bit only).
Here is a simple fix which cures the issue.

Is it OK for trunk?
ChangeLog:
2016-08-11  Yuri Rumyantsev  

PR rtl-optimization/71956
* ipa-pure-const.c (check_call): Add missed check on MASK_STORE
builtin which can not be pure function.


patch
Description: Binary data


Re: fold strlen (s) eq/ne 0 to *s eq/ne 0 on GIMPLE

2016-08-11 Thread Jakub Jelinek
On Thu, Aug 11, 2016 at 04:06:10PM +0530, Prathamesh Kulkarni wrote:
> +static void
> +replace_call_by_cmp (gimple_stmt_iterator *gsi, location_t loc, 
> +  tree type, tree arg1, tree arg2,
> +  tree res_type, enum tree_code cmp_code)
> +{
> +  tree ptrtype = build_pointer_type_for_mode (char_type_node, ptr_mode, 
> true);
> +  tree off = build_int_cst (ptrtype, 0);
> +  arg1 = build2_loc (loc, MEM_REF, type, arg1, off);
> +  tree tem1 = fold_const_aggregate_ref (arg1);
> +  if (tem1)
> +arg1 = tem1;
> +
> +  if (POINTER_TYPE_P (TREE_TYPE (arg2)))
> +{
> +  arg2 = build2_loc (loc, MEM_REF, type, arg2, off);
> +  tree tem2 = fold_const_aggregate_ref (arg2);
> +  if (tem2)
> + arg2 = tem2;
> +}
> +  else
> +gcc_assert (INTEGRAL_TYPE_P (TREE_TYPE (arg2)));
> +
> +  tree res = fold_convert_loc (loc, res_type,
> +fold_build2_loc (loc, cmp_code,
> + boolean_type_node,
> + arg1, arg2));
> +
> +  gimplify_and_update_call_from_tree (gsi, res);

I know it is pre-existing, but do you really need to create everything as
trees first and then gimplify?  Can't you emit GIMPLE right away?

> @@ -2505,7 +2558,7 @@ const pass_data pass_data_strlen =
>0, /* properties_provided */
>0, /* properties_destroyed */
>0, /* todo_flags_start */
> -  0, /* todo_flags_finish */
> +  TODO_update_ssa_only_virtuals, /* todo_flags_finish */
>  };
>  
>  class pass_strlen : public gimple_opt_pass

And if you do, I'd hope you can avoid this.  The strlen call has
a vuse, just use the same virtual operand in vuse on the mem read.

Jakub


Re: [PATCH testsuite]Require vect_cond_mixed for test case gcc.dg/vect/pr56541.c

2016-08-11 Thread Richard Biener
On Thu, Aug 11, 2016 at 11:56 AM, Bin.Cheng  wrote:
> On Thu, Aug 11, 2016 at 10:50 AM, Richard Biener
>  wrote:
>> On Wed, Aug 10, 2016 at 5:58 PM, Bin Cheng  wrote:
>>> Hi,
>>> Due to some reasons, tree-if-conv.c now factors floating point comparison 
>>> out of cond_expr,
>>> resulting in mixed types in it.  This does help CSE on common comparison 
>>> operations.
>>> Only problem is that test gcc.dg/vect/pr56541.c now requires 
>>> vect_cond_mixed to be
>>> vectorized.  This patch changes the test in that way.
>>> Test result checked.  Is it OK?
>>
>> Hmm, I think the fix is to fix if-conversion not doing that.  Can you
>> track down why this happens?
> Hmm, but there are several common floating comparison operations in
> this case; by doing this, we could do CSE on GIMPLE, otherwise we
> depend on RTL optimizers.

I see.

> I thought we prefer GIMPLE level
> transforms?

Yes, but the vectorizer is happier with the conditions present in the COND_EXPR
and thus we concluded we always want to have them there.  forwprop will
also aggressively put them back.  Note that we cannot put back
tree_could_throw_p
conditions (FP compares with signalling nans for example) to properly model EH
(though for VEC_COND_EXPRs we don't really care here).

Note that nothing between if-conversion and vectorization will perform
the desired
CSE anyway.

Richard.


> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> gcc/testsuite/ChangeLog
>>> 2016-08-09  Bin Cheng  
>>>
>>> * gcc.dg/vect/pr56541.c: Require vect_cond_mixed.


Re: fold strlen (s) eq/ne 0 to *s eq/ne 0 on GIMPLE

2016-08-11 Thread Prathamesh Kulkarni
On 1 August 2016 at 17:03, Richard Biener  wrote:
> On Mon, 1 Aug 2016, Prathamesh Kulkarni wrote:
>
>> Hi Richard,
>> The attached patch tries to fold strlen (s) eq/ne 0 to *s eq/ne 0 on GIMPLE.
>> I am not sure where the ideal place to put this transform was, and ended up
>> adding it to strlen_optimize_stmt().
>> Does that look OK?
>>
>> I needed to add TODO_update_ssa to strlen pass, otherwise we hit the
>> following assert in execute_todo():
>> if (flag_checking
>>   && cfun
>>   && need_ssa_update_p (cfun))
>> gcc_assert (flags & TODO_update_ssa_any);
>>
>> Bootstrap+test in progress on x86_64-unknown-linux-gnu.
>
> I believe you should factor out the small-size part of handle_builtin_memcmp
> and re-use that for the code generation part.
>
> You should also remove the corresponding fold-const.c code I think.
Hi,
In the attached version, I removed the code from fold-const.c and
factored the code-gen part into replace_call_by_cmp.  The previous
patch incorrectly applied the transform when the result of strlen()
had multiple uses, so it is now restricted to has_single_use.
However, I suppose we could do better by checking whether each
immediate use of the result is a comparison against 0 and performing
the transform in that case?
Bootstrap + test in progress on x86_64-unknown-linux-gnu.

Thanks,
Prathamesh
>
> Richard.
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index c5d9a79..309ef38 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -10676,30 +10676,6 @@ fold_binary_loc (location_t loc,
return t1;
}
 
-  /* Optimize comparisons of strlen vs zero to a compare of the
-first character of the string vs zero.  To wit,
-   strlen(ptr) == 0   =>  *ptr == 0
-   strlen(ptr) != 0   =>  *ptr != 0
-Other cases should reduce to one of these two (or a constant)
-due to the return value of strlen being unsigned.  */
-  if (TREE_CODE (arg0) == CALL_EXPR
- && integer_zerop (arg1))
-   {
- tree fndecl = get_callee_fndecl (arg0);
-
- if (fndecl
- && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL
- && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_STRLEN
- && call_expr_nargs (arg0) == 1
- && TREE_CODE (TREE_TYPE (CALL_EXPR_ARG (arg0, 0))) == 
POINTER_TYPE)
-   {
- tree iref = build_fold_indirect_ref_loc (loc,
-  CALL_EXPR_ARG (arg0, 0));
- return fold_build2_loc (loc, code, type, iref,
- build_int_cst (TREE_TYPE (iref), 0));
-   }
-   }
-
   /* Fold (X >> C) != 0 into X < 0 if C is one less than the width
 of X.  Similarly fold (X >> C) == 0 into X >= 0.  */
   if (TREE_CODE (arg0) == RSHIFT_EXPR
diff --git a/gcc/testsuite/gcc.dg/strlenopt-30.c 
b/gcc/testsuite/gcc.dg/strlenopt-30.c
new file mode 100644
index 000..b041d86
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/strlenopt-30.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-strlen" } */
+
+__attribute__((noinline, no_icf))
+_Bool f1(const char *s)
+{
+  unsigned long len = __builtin_strlen (s);
+  _Bool ret = (len == 0);
+  return ret;
+}
+
+/* { dg-final { scan-tree-dump-times "strlen \\(" 0 "strlen" } } */
diff --git a/gcc/tree-ssa-strlen.c b/gcc/tree-ssa-strlen.c
index 9d7b4df..4ada2ee 100644
--- a/gcc/tree-ssa-strlen.c
+++ b/gcc/tree-ssa-strlen.c
@@ -45,6 +45,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-chkp.h"
 #include "tree-hash-traits.h"
 #include "builtins.h"
+#include "tree-pretty-print.h"
+#include "tree-cfg.h"
 
 /* A vector indexed by SSA_NAME_VERSION.  0 means unknown, positive value
is an index into strinfo vector, negative value stands for
@@ -1937,6 +1939,36 @@ handle_builtin_memset (gimple_stmt_iterator *gsi)
   return false;
 }
 
+static void
+replace_call_by_cmp (gimple_stmt_iterator *gsi, location_t loc, 
+tree type, tree arg1, tree arg2,
+tree res_type, enum tree_code cmp_code)
+{
+  tree ptrtype = build_pointer_type_for_mode (char_type_node, ptr_mode, true);
+  tree off = build_int_cst (ptrtype, 0);
+  arg1 = build2_loc (loc, MEM_REF, type, arg1, off);
+  tree tem1 = fold_const_aggregate_ref (arg1);
+  if (tem1)
+arg1 = tem1;
+
+  if (POINTER_TYPE_P (TREE_TYPE (arg2)))
+{
+  arg2 = build2_loc (loc, MEM_REF, type, arg2, off);
+  tree tem2 = fold_const_aggregate_ref (arg2);
+  if (tem2)
+   arg2 = tem2;
+}
+  else
+gcc_assert (INTEGRAL_TYPE_P (TREE_TYPE (arg2)));
+
+  tree res = fold_convert_loc (loc, res_type,
+  fold_build2_loc (loc, cmp_code,
+   boolean_type_node,
+   arg1, arg2));
+
+  gimplify_and_update_call_from_tree (gsi, res);
+}
+
 /* Handle a call to memcmp.  We try to handle small comparisons by

Re: Fwd: [PATCH, doc/ARM] Remove false affirmation that Thumb cannot use an FPU

2016-08-11 Thread Thomas Preudhomme

Hi Sandra,

Thanks for your feedback. Please find an updated version attached to this email. 
ChangeLog entry is unchanged:


*** gcc/ChangeLog ***

2016-08-02  Thomas Preud'homme  

* doc/fragments.texi (MULTILIB_EXCEPTIONS): Replace example with
preventing combination of -mfloat-abi=soft with any -mfpu option.

Best regards,

Thomas

On 11/08/16 04:09, Sandra Loosemore wrote:

On 08/10/2016 09:51 AM, Thomas Preudhomme wrote:


diff --git a/gcc/doc/fragments.texi b/gcc/doc/fragments.texi
index
b6d8541c8ca820fa732363a05221e2cd4d1251c2..a060635c9cee7374d9d187858ac87acdd08860f2
100644
--- a/gcc/doc/fragments.texi
+++ b/gcc/doc/fragments.texi
@@ -117,12 +117,15 @@ specified, there are combinations that should not be
built.  In that
 case, set @code{MULTILIB_EXCEPTIONS} to be all of the switch exceptions
 in shell case syntax that should not be built.

-For example the ARM processor cannot execute both hardware floating
-point instructions and the reduced size THUMB instructions at the same
-time, so there is no need to build libraries with both of these
-options enabled.  Therefore @code{MULTILIB_EXCEPTIONS} is set to:
+For example on ARM targets @code{-mfloat-abi=soft} requests to use a
+softfloat implementation for floating-point operations.  Therefore, it


For example, on ARM targets @option{-mfloat-abi=soft} requests use of software
floating-point operations.  Therefore, it


+does not make sense to find both @code{-mfloat-abi=soft} and an


@option here too


+@code{mfpu} option on the command line so @code{MULTILIB_EXCEPTIONS}


and here @option{-mfpu}


+could contain the following exception (assuming that @code{-mfloat-abi}


@option


+comes after in MULTILIB_OPTIONS and given that soft is the default


@code markup on MULTILIB_OPTIONS?
@samp markup on soft?


+value):
 @smallexample
-*mthumb/*mhard-float*
+*mfpu=*
 @end smallexample

 @findex MULTILIB_REQUIRED


-Sandra

diff --git a/gcc/doc/fragments.texi b/gcc/doc/fragments.texi
index b6d8541c8ca820fa732363a05221e2cd4d1251c2..abf4e128671bb4751d21f24bb69625593d3c839e 100644
--- a/gcc/doc/fragments.texi
+++ b/gcc/doc/fragments.texi
@@ -117,12 +117,15 @@ specified, there are combinations that should not be built.  In that
 case, set @code{MULTILIB_EXCEPTIONS} to be all of the switch exceptions
 in shell case syntax that should not be built.
 
-For example the ARM processor cannot execute both hardware floating
-point instructions and the reduced size THUMB instructions at the same
-time, so there is no need to build libraries with both of these
-options enabled.  Therefore @code{MULTILIB_EXCEPTIONS} is set to:
+For example on ARM targets @option{-mfloat-abi=soft} requests to use a
+softfloat implementation for floating-point operations.  Therefore, it
+does not make sense to find both @option{-mfloat-abi=soft} and an
+@option{mfpu} option on the command line so @code{MULTILIB_EXCEPTIONS}
+could contain the following exception (assuming that
+@option{-mfloat-abi} comes after in @code{MULTILIB_OPTIONS} and given
+that @option{-mfloat-abi=soft} is the default value):
 @smallexample
-*mthumb/*mhard-float*
+*mfpu=*
 @end smallexample
 
 @findex MULTILIB_REQUIRED


Re: [PATCH] Fix unaligned access when predictive commoning (PR 71083)

2016-08-11 Thread Richard Biener
On Thu, 11 Aug 2016, Bernd Edlinger wrote:

> On 08/11/16, Richard Biener wrote:
> > 
> > The patch looks mostly ok, but
> > 
> > +  else
> > +   {
> > + boff >>= LOG2_BITS_PER_UNIT;
> > + boff += tree_to_uhwi (component_ref_field_offset (ref));
> > + coff = size_binop (MINUS_EXPR, coff, ssize_int (boff));
> > 
> > how can we be sure that component_ref_field_offset is an INTEGER_CST?
> > At least Ada can have variably-length fields before a bitfield.  We'd
> > need to apply component_ref_field_offset to off in that case.  Which
> > makes me wonder if we can simply avoid the COMPONENT_REF path in
> > a first iteration of the patch and always build a BIT_FIELD_REF?
> > It should solve the correctness issues as well and be more applicable
> > for branches.
> > 
> 
> Oh yes, thanks for catching that!
> 
> If that information is true, that ought to go into the comment before
> the if, that would certainly be an interesting comment :-)
> 
> Are there any test cases for these non-constant field offsets?
> 
> I see many checks whether TREE_TYPE of
> component_ref_field_offset is INTEGER_CST, but with very little
> background on why it could be otherwise.
> 
> I think we should simply fall back to the BIT_FIELD_REF in that case,
> that would mean, the if should be something like:
> 
> tree offset = component_ref_field_offset (ref);
> if (boff % BITS_PER_UNIT != 0
> || !tree_fits_uhwi_p (offset))
> 
> And yes, the correctness issue can certainly be solved with the
> BIT_FIELD_REF alone.
> 
> So, as requested, here is a first iteration of my patch that always builds
> a BIT_FIELD_REF, together with the test cases.
> 
> 
> Boot-strap & regression testing was done on x86_64-pc-linux-gnu.
> Is it OK for trunk (and active branches)?

Yes.

Thanks,
Richard.


Re: [Patch, tentative, reload] Make push_reload work with more types of subregs?

2016-08-11 Thread Eric Botcazou
> Committed patch to trunk. Ok for backport to 6.x and 5.x branches as
> well?

No, we really avoid touching reload on release branches, unless there is a 
very good reason to do it, like a regression on a primary platform.

-- 
Eric Botcazou


[PATCH AArch64]Delete useless var declaration as obvious

2016-08-11 Thread Bin Cheng
Hi,
The useless variable declaration was introduced when I was separating patches, 
and it breaks bootstrap on AArch64.  This patch fixes it as obvious.
Bootstrap checked.  Will apply as obvious.

Thanks,
bin

2016-08-11  Bin Cheng  

* config/aarch64/aarch64-simd.md (vcond): Delete
unused declaration.
(vcond): Ditto.
(vcondu, vcondu): Ditto.
Index: gcc/config/aarch64/aarch64-simd.md
===
--- gcc/config/aarch64/aarch64-simd.md  (revision 239358)
+++ gcc/config/aarch64/aarch64-simd.md  (working copy)
@@ -2573,7 +2573,6 @@
   "TARGET_SIMD"
 {
   rtx mask = gen_reg_rtx (mode);
-  enum rtx_code code = GET_CODE (operands[3]);
 
   emit_insn (gen_vec_cmp (mask, operands[3],
  operands[4], operands[5]));
@@ -2594,7 +2593,6 @@
   "TARGET_SIMD"
 {
   rtx mask = gen_reg_rtx (mode);
-  enum rtx_code code = GET_CODE (operands[3]);
 
   emit_insn (gen_vec_cmp (mask, operands[3],
  operands[4], operands[5]));
@@ -2616,7 +2614,6 @@
   "TARGET_SIMD"
 {
   rtx mask = gen_reg_rtx (mode);
-  enum rtx_code code = GET_CODE (operands[3]);
 
   emit_insn (gen_vec_cmp (mask, operands[3],
  operands[4], operands[5]));
@@ -2636,7 +2633,6 @@
   "TARGET_SIMD"
 {
   rtx mask = gen_reg_rtx (mode);
-  enum rtx_code code = GET_CODE (operands[3]);
 
   emit_insn (gen_vec_cmp (
  mask, operands[3],


Re: [PATCH] Fix unaligned access when predictive commoning (PR 71083)

2016-08-11 Thread Bernd Edlinger
On 08/11/16, Richard Biener wrote:
> 
> The patch looks mostly ok, but
> 
> +  else
> +   {
> + boff >>= LOG2_BITS_PER_UNIT;
> + boff += tree_to_uhwi (component_ref_field_offset (ref));
> + coff = size_binop (MINUS_EXPR, coff, ssize_int (boff));
> 
> how can we be sure that component_ref_field_offset is an INTEGER_CST?
> At least Ada can have variably-length fields before a bitfield.  We'd
> need to apply component_ref_field_offset to off in that case.  Which
> makes me wonder if we can simply avoid the COMPONENT_REF path in
> a first iteration of the patch and always build a BIT_FIELD_REF?
> It should solve the correctness issues as well and be more applicable
> for branches.
> 

Oh yes, thanks for catching that!

If that information is true, that ought to go into the comment before
the if, that would certainly be an interesting comment :-)

Are there any test cases for these non-constant field offsets?

I see many checks whether TREE_TYPE of
component_ref_field_offset is INTEGER_CST, but with very little
background on why it could be otherwise.

I think we should simply fall back to the BIT_FIELD_REF in that case,
that would mean, the if should be something like:

tree offset = component_ref_field_offset (ref);
if (boff % BITS_PER_UNIT != 0
|| !tree_fits_uhwi_p (offset))

And yes, the correctness issue can certainly be solved with the
BIT_FIELD_REF alone.

So, as requested, here is a first iteration of my patch that always builds
a BIT_FIELD_REF, together with the test cases.


Boot-strap & regression testing was done on x86_64-pc-linux-gnu.
Is it OK for trunk (and active branches)?


Thanks
Bernd.

2016-08-11  Bernd Edlinger  

PR tree-optimization/71083
* tree-predcom.c (ref_at_iteration): Correctly align the
reference type.

testsuite:
2016-08-11  Bernd Edlinger  

PR tree-optimization/71083
* gcc.c-torture/execute/pr71083.c: New test.
* gnat.dg/loop_optimization23.adb: New test.
* gnat.dg/loop_optimization23_pkg.ads: New test.
* gnat.dg/loop_optimization23_pkg.adb: New test.
Index: gcc/tree-predcom.c
===
--- gcc/tree-predcom.c	(revision 239193)
+++ gcc/tree-predcom.c	(working copy)
@@ -213,6 +213,7 @@ along with GCC; see the file COPYING3.
 #include "tree-scalar-evolution.h"
 #include "params.h"
 #include "tree-affine.h"
+#include "builtins.h"
 
 /* The maximum number of iterations between the considered memory
references.  */
@@ -1381,6 +1382,8 @@ ref_at_iteration (data_reference_p dr, i
   addr = force_gimple_operand_1 (unshare_expr (addr), stmts,
  is_gimple_mem_ref_addr, NULL_TREE);
   tree alias_ptr = fold_convert (reference_alias_ptr_type (DR_REF (dr)), coff);
+  tree type = build_aligned_type (TREE_TYPE (DR_REF (dr)),
+  get_object_alignment (DR_REF (dr)));
   /* While data-ref analysis punts on bit offsets it still handles
  bitfield accesses at byte boundaries.  Cope with that.  Note that
  we cannot simply re-apply the outer COMPONENT_REF because the
@@ -1392,12 +1395,11 @@ ref_at_iteration (data_reference_p dr, i
 {
   tree field = TREE_OPERAND (DR_REF (dr), 1);
   return build3 (BIT_FIELD_REF, TREE_TYPE (DR_REF (dr)),
-		 build2 (MEM_REF, DECL_BIT_FIELD_TYPE (field),
-			 addr, alias_ptr),
+		 build2 (MEM_REF, type, addr, alias_ptr),
 		 DECL_SIZE (field), bitsize_zero_node);
 }
   else
-return fold_build2 (MEM_REF, TREE_TYPE (DR_REF (dr)), addr, alias_ptr);
+return fold_build2 (MEM_REF, type, addr, alias_ptr);
 }
 
 /* Get the initialization expression for the INDEX-th temporary variable
Index: gcc/testsuite/gcc.c-torture/execute/pr71083.c
===
--- gcc/testsuite/gcc.c-torture/execute/pr71083.c	(revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/pr71083.c	(working copy)
@@ -0,0 +1,43 @@
+struct lock_chain {
+  unsigned int irq_context: 2,
+depth: 6,
+base: 24;
+};
+
+__attribute__((noinline, noclone))
+struct lock_chain * foo (struct lock_chain *chain)
+{
+  int i;
+  for (i = 0; i < 100; i++)
+{
+  chain[i+1].base = chain[i].base;
+}
+  return chain;
+}
+
+struct lock_chain1 {
+  char x;
+  unsigned short base;
+} __attribute__((packed));
+
+__attribute__((noinline, noclone))
+struct lock_chain1 * bar (struct lock_chain1 *chain)
+{
+  int i;
+  for (i = 0; i < 100; i++)
+{
+  chain[i+1].base = chain[i].base;
+}
+  return chain;
+}
+
+struct lock_chain test [101];
+struct lock_chain1 test1 [101];
+
+int
+main ()
+{
+  foo (test);
+  bar (test1);
+  return 0;
+}
Index: gcc/testsuite/gnat.dg/loop_optimization23.adb
===
--- gcc/testsuite/gnat.dg/loop_optimization23.adb	(revision 0)
+++ gcc/testsuite/gnat.dg/loop_optimization23.adb	(working copy)
@@ -0,0 +1,14 @@
+-- { 

Re: [RFC][PR61839]Convert CST BINOP COND_EXPR to COND_EXPR ? (CST BINOP 1) : (CST BINOP 0)

2016-08-11 Thread Richard Biener
On Thu, Aug 11, 2016 at 6:11 AM, kugan
 wrote:
> Hi Richard,
>
>
> On 09/08/16 20:22, Richard Biener wrote:
>>
>> On Tue, Aug 9, 2016 at 4:51 AM, Kugan Vivekanandarajah
>>  wrote:
>>>
>>> Hi Richard,
>>>
>>> Thanks for the review.
>>>
>>> On 29 April 2016 at 20:47, Richard Biener 
>>> wrote:

 On Sun, Apr 17, 2016 at 1:14 AM, kugan
  wrote:
>
> As explained in PR61839,
>
> Following difference results in extra instructions:
> -  c = b != 0 ? 486097858 : 972195717;
> +  c = a + 972195718 >> (b != 0);
>
> As suggested in PR, attached patch converts CST BINOP COND_EXPR to
> COND_EXPR
> ? (CST BINOP 1) : (CST BINOP 0).
>
> Bootstrapped and regression tested for x86-64-linux-gnu with no new
> regressions.  Is this OK for stage 1?


 You are missing a testcase.

 I think the transform can be generalized to any two-value value-range by
 instead of

   lhs = cond_res ? (cst binop 1) : (cst binop 0)

 emitting

   lhs = tmp == val1 ? (cst binop val1) : (cst binop val2);

 In the PR I asked the transform to be only carried out if cond_res and
 tmp have a single use (and thus they'd eventually vanish).

 I'm not sure if a general two-value "constant" propagation is profitable
 which is why I was originally asking for the pattern to only apply
 if the resulting value is used in a comparison which we could then
 in turn simplify by substituting COND_RES (or ! COND_RES) for it.
 For the general two-value case we'd substitute it with tmp [=!]= val[12]
 dependent on which constant is cheaper to test for.

 So I think this needs some exploring work on which way to go
 and which transform is profitable in the end.  I think the general
 two-value case feeding a condition will be always profitable.
>>>
>>>
>>>
>>> Please find a modified version which checks for a two-valued variable
>>> and uses this to optimize.  In the attached test case (in function
>>> bar), we end up doing the conversion twice.
>>>
>>> Bootstrapped and regression tested on x86_64-linux-gnu with no new
>>> regressions.  Is this OK for trunk?
>>
>>
>> +/* Return true if VAR is a two-valued variable.  Set MIN and MAX when it
>> is
>> +   true.  Return false otherwise.  */
>> +
>> +static bool
>> +two_valued_val_range_p (tree var, tree *min, tree *max)
>> +{
>>
>> I'd use A and B, not MIN/MAX given it's two values, not necessarily
>> a two-valued range (for example for ~[1, UINT_MAX-1] which you
>
> I have changed this. I don't  think this would be the only VR_ANTI_RANGE.
> Others like TYPE_MIN + 1, TYPE_MIN + 2 should come as VR_RANGE.
>
>> don't handle).  In theory VRP might get a more sophisticated range
>> representation to also allow a range consisting of just 3 and 7 for
>> example.
>>
> I am not sure how this would be represented as a value range.  Is this for enum
> types where there are no valid values between 3 and 7?
>
>> +  tree tmp
>> += int_const_binop (PLUS_EXPR,
>> +  vr->min,
>> +  build_int_cst_type (TREE_TYPE (var), 1));
>> +  if (0 != compare_values (tmp, vr->max))
>> +return false;
>>
>> I think simply
>>
>>if (wi::sub (vr->max, vr->min) == 1)
>
> I have changed this.
>
>>
>> might work as well and avoid building a tree node.
>>
>> +  /* Convert:
>> +LHS = CST BINOP VAR
>> +where VAR is two-valued.
>> +
>> +To:
>> +LHS = VAR == VAL1 ? (CST BINOP VAL1) : (CST BINOP VAL2) */
>> +
>> +  if (TREE_CODE_CLASS (rhs_code) == tcc_binary
>> + && TREE_CODE (rhs1) == INTEGER_CST
>> + && TREE_CODE (rhs2) == SSA_NAME
>>
>> Note that for all commutative tcc_binary operators the constant will be on
>> the
>> other operand.  I think you need to handle the constant appearing in both
>> places
>> (and for division for example watch out for a zero divisor).
>
>
> I have now canonicalized it in the beginning.  I don't think it will affect
> other simplifications that come after this transformation.
>
>>
>> + && has_single_use (rhs2)
>> + && two_valued_val_range_p (rhs2, , ))
>> +
>> +   {
>> + tree cond = build2 (EQ_EXPR, TREE_TYPE (rhs2), rhs2, min);
>> + tree new_rhs1 =  int_const_binop (rhs_code, rhs1, min);
>> + tree new_rhs2 =  int_const_binop (rhs_code, rhs1, max);
>>
>> too many spaces after '='.
>
> Done.
>
>>
>> +
>> + if (new_rhs1 && new_rhs2)
>>
>> You didn't address my point about profitability - you test for a single
>> use
>> but not for the kind of use.  Please instead use
>
> I checked with some simple test cases.  In those cases it either improves
> the code or makes no difference.
>
>>
>> && single_imm_use (rhs2, _p, _stmt)
>> && gimple_code (use_stmt) == GIMPLE_COND
>
> Done.
>
>>
>> The testcase 

[PATCH AArch64]Fix spurious warning with explicit initialization

2016-08-11 Thread Bin Cheng
Hi,
GCC gives a spurious -Wmaybe-uninitialized message which breaks bootstrap.  This 
patch fixes it by explicitly initializing the variable.  Also I filed PR72355 to 
track the wrong warning message.
Compilation log checked.  Applied as obvious.

Thanks,
bin

2016-08-11  Bin Cheng  

* config/aarch64/aarch64-simd.md (vec_cmp): Init
variable explicitly, also assert on it before use.

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 37d397c..3817895 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2413,7 +2413,7 @@
   enum rtx_code code = GET_CODE (operands[1]);
   rtx tmp = gen_reg_rtx (mode);
 
-  rtx (*comparison) (rtx, rtx, rtx);
+  rtx (*comparison) (rtx, rtx, rtx) = NULL;
 
   switch (code)
 {
@@ -2495,6 +2495,7 @@
 a UNLE b -> !(a GT b)
 a UNLT b -> !(a GE b)
 a   NE b -> !(a EQ b)  */
+  gcc_assert (comparison != NULL);
   emit_insn (comparison (operands[0], operands[2], operands[3]));
   emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
   break;
@@ -2511,6 +2512,7 @@
 a LE b -> b GE a
 a LT b -> b GT a
 a EQ b -> a EQ b  */
+  gcc_assert (comparison != NULL);
   emit_insn (comparison (operands[0], operands[2], operands[3]));
   break;
 


Re: [PATCH testsuite]Require vect_cond_mixed for test case gcc.dg/vect/pr56541.c

2016-08-11 Thread Bin.Cheng
On Thu, Aug 11, 2016 at 10:50 AM, Richard Biener
 wrote:
> On Wed, Aug 10, 2016 at 5:58 PM, Bin Cheng  wrote:
>> Hi,
>> Due to some reasons, tree-if-conv.c now factors floating point comparison 
>> out of cond_expr,
>> resulting in mixed types in it.  This does help CSE on common comparison 
>> operations.
>> Only problem is that test gcc.dg/vect/pr56541.c now requires vect_cond_mixed 
>> to be
>> vectorized.  This patch changes the test in that way.
>> Test result checked.  Is it OK?
>
> Hmm, I think the fix is to fix if-conversion not doing that.  Can you
> track down why this happens?
Hmm, but there are several common floating comparison operations in
this case; by doing this, we could do CSE on GIMPLE, otherwise we
depend on RTL optimizers.  I thought we preferred GIMPLE-level
transforms?

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>
>> gcc/testsuite/ChangeLog
>> 2016-08-09  Bin Cheng  
>>
>> * gcc.dg/vect/pr56541.c: Require vect_cond_mixed.


Re: [PATCH PR69848]Introduce special conditional reduction CONST_COND_REDUCTION

2016-08-11 Thread Richard Biener
On Wed, Aug 10, 2016 at 6:01 PM, Bin Cheng  wrote:
> Hi,
> This patch fixes the inefficient code generated by vectorizer as reported by 
> PR69848.
> It introduces new conditional reduction type CONST_COND_REDUCTION.  As a 
> result,
> we don't need to compute index vector in loop; also the epilog reduction code 
> can be
> simplified using single reduc_max/reduc_min operation.  Together with AArch64 
> vcond
> patches, the # of insns in loop body is reduced from 10 to 4 on AArch64.  
> Note, this one
> doesn't handle cases in which reduction values are loop invariants because it 
> requires
> quite different code to current implementation, and I failed to work out a 
> "clean" patch at
> the moment.
>
> Bootstrap and test on x86_64 and AArch64.  Is it OK?

Ok.

Thanks,
Richard.

> Thanks,
> bin
>
> 2016-08-08  Bin Cheng  
>
> PR tree-optimization/69848
> * tree-vectorizer.h (enum vect_def_type): New condition reduction
> type CONST_COND_REDUCTION.
> * tree-vect-loop.c (vectorizable_reduction): Support new condition
> reudction type CONST_COND_REDUCTION.
>
> gcc/testsuite/ChangeLog
> 2016-08-08  Bin Cheng  
>
> PR tree-optimization/69848
> * gcc.dg/vect/vect-pr69848.c: New test.


Re: [PATCH testsuite]Require vect_cond_mixed for test case gcc.dg/vect/pr56541.c

2016-08-11 Thread Richard Biener
On Wed, Aug 10, 2016 at 5:58 PM, Bin Cheng  wrote:
> Hi,
> Due to some reasons, tree-if-conv.c now factors floating point comparison out 
> of cond_expr,
> resulting in mixed types in it.  This does help CSE on common comparison 
> operations.
> Only problem is that test gcc.dg/vect/pr56541.c now requires vect_cond_mixed 
> to be
> vectorized.  This patch changes the test in that way.
> Test result checked.  Is it OK?

Hmm, I think the fix is to fix if-conversion not doing that.  Can you
track down why this happens?

Richard.

> Thanks,
> bin
>
> gcc/testsuite/ChangeLog
> 2016-08-09  Bin Cheng  
>
> * gcc.dg/vect/pr56541.c: Require vect_cond_mixed.


fix crash on 64-bit mingw-w64 hosted compiler using more than 4 gb of ram

2016-08-11 Thread Stanislaw Halik

The host configuration across platforms wrongly assumes that
sizeof(long) == sizeof(intptr_t), which is incorrect for an amd64-hosted
compiler built for mingw-w64.


Here's a patch fixing it.


cheers,
sh
diff --git a/gcc/config/i386/xm-mingw32.h b/gcc/config/i386/xm-mingw32.h
index 501cebd..1b17263 100644
--- a/gcc/config/i386/xm-mingw32.h
+++ b/gcc/config/i386/xm-mingw32.h
@@ -38,3 +38,7 @@ along with GCC; see the file COPYING3.  If not see
 #define HOST_LONG_LONG_FORMAT "I64"
 #endif
 
+/* This is to prevent ggc-heap.c from assuming sizeof(long) == sizeof(intptr_t).  */
+#ifdef __x86_64__
+#  define HOST_BITS_PER_PTR 64
+#endif
\ No newline at end of file


Re: [Patch, libfortran] Multi-threaded random_number

2016-08-11 Thread Janne Blomqvist
Hi,

committed a slightly modified patch as r239356.  Changes from the
submitted patch are attached.  To my surprise, it turned out that my
fallback code using __gthread_key_{create,delete} and
__gthread_{get,set}_specific was faster than my TLS code, so I removed
the TLS configure magic and the TLS code and just left the __gthread
stuff in.

On Wed, Aug 10, 2016 at 2:11 PM, Janne Blomqvist
 wrote:
> Hi,
>
> thanks for the Ok. However, moments before committing I got cold feet
> and started digging around; it unfortunately seems that TLS
> (_Thread_local) is not supported on all targets. So I'll have to
> copy-paste some configure magic from libgomp/libjava/etc., and provide
> workarounds for such systems. :(
>
> On Tue, Aug 9, 2016 at 8:01 AM, Jerry DeLisle  wrote:
>> On 08/08/2016 04:01 AM, Janne Blomqvist wrote:
>>>
>>> PING**2
>>>
>>
>> OK, thanks for patch.
>>
>> Jerry
>>
>>> On Sun, Jul 24, 2016 at 4:45 PM, Janne Blomqvist
>>>  wrote:

 Hi,

 the attached patch replaces the current random_number / random_seed
 implementations with an implementation that better supports threads.
 It's an improved version of the RFC patch I posted earlier at
 https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02110.html . Please see
 that earlier message for a longer-winded explanation of what's wrong
 with the current implementation and how the patch addresses this.

 In short, with the patch the random number generator state is now
 per-thread and stored in a per-thread (TLS) variable, enabling a
 lockless fast-path. This provides up to 2 orders of magnitude better
 performance on a synthetic benchmark using 4 threads, and provides a
 more deterministic result as the order that threads are scheduled does
 not affect the random number streams for each thread.

 Compared to the RFC patch, a number of minor and not-so-minor bugs
 have been fixed, so the patch now passes the testsuite (with a few
 modifications to the suite, part of the patch). Also, for REAL kinds
 4, 8, 10 the generated streams are identical (except precision, of
 course) (like the current implementation), enabling precision
 comparisons, as requested by Steve Kargl. However, this does not
 extend to REAL(16) as that would have necessitated doubling the size
 of the state, along with potential issues of slower escape from a
 low-entropy state, for a feature that I believe is not used by
 particularly many users in the end. So if one wants to do precision
 comparisons with REAL(16) one must employ a wrapper routine.

 Regtested on x86_64-pc-linux-gnu, Ok for trunk?

 frontend ChangeLog:

 2016-07-27  Janne Blomqvist  

 * check.c (gfc_check_random_seed): Use new seed size in check.
 * intrinsic.texi (RANDOM_NUMBER): Updated documentation.
 (RANDOM_SEED): Likewise.


 testsuite:

 2016-07-27  Janne Blomqvist  

 * gfortran.dg/random_7.f90: Take into account that the last seed
 value is the special p value.
 * gfortran.dg/random_seed_1.f90: Seed size is now constant.


 libgfortran:
 2016-07-27  Janne Blomqvist  

 * intrinsics/random.c: Replace KISS with xorshift1024* with
 per-thread (TLS) state.
 * runtime/main.c (init): Don't call random_seed_i4.


 --
 Janne Blomqvist
>>>
>>>
>>>
>>>
>>
>
>
>
> --
> Janne Blomqvist



-- 
Janne Blomqvist
diff --git a/libgfortran/intrinsics/random.c b/libgfortran/intrinsics/random.c
index 516b640..6084ebe 100644
--- a/libgfortran/intrinsics/random.c
+++ b/libgfortran/intrinsics/random.c
@@ -205,20 +205,39 @@ static uint64_t master_state[] = {
 };
 
 
-/* The actual random number state, as a TLS variable.  */
-static _Thread_local xorshift1024star_state rand_state;
+static __gthread_key_t rand_state_key;
+
+static xorshift1024star_state*
+get_rand_state (void)
+{
+  /* For single threaded apps.  */
+  static xorshift1024star_state rand_state;
+
+  if (__gthread_active_p ())
+    {
+      void *p = __gthread_getspecific (rand_state_key);
+      if (!p)
+        {
+          p = xcalloc (1, sizeof (xorshift1024star_state));
+          __gthread_setspecific (rand_state_key, p);
+        }
+      return p;
+    }
+  else
+    return &rand_state;
+}
 
 
 static uint64_t
-xorshift1024star (void)
+xorshift1024star (xorshift1024star_state* rs)
 {
-  int p = rand_state.p;
-  const uint64_t s0 = rand_state.s[p];
-  uint64_t s1 = rand_state.s[p = (p + 1) & 15];
+  int p = rs->p;
+  const uint64_t s0 = rs->s[p];
+  uint64_t s1 = rs->s[p = (p + 1) & 15];
   s1 ^= s1 << 31;
-  rand_state.s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30);
-  rand_state.p = p;
-  return rand_state.s[p] * UINT64_C(1181783497276652981);
+  rs->s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30);

Re: [PATCH][RFC] Fix PR72851 by "randomizing" SSA propagator worklist extraction

2016-08-11 Thread Richard Biener
On Wed, 10 Aug 2016, Richard Biener wrote:

> 
> The following randomizes SSA propagator worklist processing to make the
> processing order less likely hit the worst-case as in the PR.  Ideally
> we'd process it in stmt order but that adds overhead to the idea of a
> SSA propagator that makes it very much not appealing.  Using a queue
> rather than a stack would also be a possibility (but also not really
> fix the underlying issue).  So this patch uses a very simple approach
> to fix the specific testcase.
> 
> Opinion?  Any great ideas on how to avoid O (n log n) sorting or
> O (log n) insertion?  [mark blocks to be visited and simply iterate
> over all stmts in to-be-visited BBs]

Ok, I've looked at the propagator again and discovered "interesting"
code trying to do BB evaluation in a good order.  So I rewrote that
to use PRE order which conveniently allows us to compute stmt UIDs
in that order.  Throwing memory on the sorting issue (a uid-to-stmt
array) and using a bitmap for the worklist should solve the issue fully.

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

Richard.

2016-08-11  Richard Biener  

PR tree-optimization/72851
* tree-ssa-propagate.c: Include cfganal.h.  Rewrite block and stmt
worklists to use bitmaps indexed in execution order.
(executable_blocks, cfg_blocks_num, cfg_blocks_tail, cfg_blocks_head,
bb_in_list, interesting_ssa_edges, varying_ssa_edges): Remove.
(cfg_blocks): Make a bitmap.
(bb_to_cfg_order, cfg_order_to_bb, ssa_edge_worklist, uid_to_stmt):
New globals.
(cfg_blocks_empty_p): Adjust.
(cfg_blocks_add): Likewise.
(cfg_blocks_get): Likewise.
(add_ssa_edge): Likewise.
(add_control_edge): Likewise.
(simulate_stmt): Likewise.
(process_ssa_edge_worklist): Likewise.
(simulate_block): Likewise.
(ssa_prop_init): Compute PRE order and stmt UIDs.
(ssa_prop_fini): Adjust.
(ssa_propagate): Adjust.

* gcc.dg/torture/pr72851.c: New testcase.

Index: gcc/tree-ssa-propagate.c
===
*** gcc/tree-ssa-propagate.c(revision 239317)
--- gcc/tree-ssa-propagate.c(working copy)
***
*** 37,42 
--- 37,43 
  #include "domwalk.h"
  #include "cfgloop.h"
  #include "tree-cfgcleanup.h"
+ #include "cfganal.h"
  
  /* This file implements a generic value propagation engine based on
 the same propagation used by the SSA-CCP algorithm [1].
***
*** 83,95 
Blocks are added to this list if their incoming edges are
found executable.
  
!   VARYING_SSA_EDGES contains the list of statements that feed
!   from statements that produce an SSA_PROP_VARYING result.
!   These are simulated first to speed up processing.
! 
!   INTERESTING_SSA_EDGES contains the list of statements that
!   feed from statements that produce an SSA_PROP_INTERESTING
!   result.
  
 5- Simulation terminates when all three work lists are drained.
  
--- 84,91 
Blocks are added to this list if their incoming edges are
found executable.
  
!   SSA_EDGE_WORKLIST contains the list of statements that we 
!   need to revisit.
  
 5- Simulation terminates when all three work lists are drained.
  
***
*** 116,224 
  static ssa_prop_visit_stmt_fn ssa_prop_visit_stmt;
  static ssa_prop_visit_phi_fn ssa_prop_visit_phi;
  
! /* Keep track of statements that have been added to one of the SSA
!edges worklists.  This flag is used to avoid visiting statements
!unnecessarily when draining an SSA edge worklist.  If while
!simulating a basic block, we find a statement with
!STMT_IN_SSA_EDGE_WORKLIST set, we clear it to prevent SSA edge
!processing from visiting it again.
! 
!NOTE: users of the propagation engine are not allowed to use
!the GF_PLF_1 flag.  */
! #define STMT_IN_SSA_EDGE_WORKLIST GF_PLF_1
! 
! /* A bitmap to keep track of executable blocks in the CFG.  */
! static sbitmap executable_blocks;
! 
! /* Array of control flow edges on the worklist.  */
! static vec cfg_blocks;
! 
! static unsigned int cfg_blocks_num = 0;
! static int cfg_blocks_tail;
! static int cfg_blocks_head;
! 
! static sbitmap bb_in_list;
  
  /* Worklist of SSA edges which will need reexamination as their
 definition has changed.  SSA edges are def-use edges in the SSA
 web.  For each D-U edge, we store the target statement or PHI node
!U.  */
! static vec interesting_ssa_edges;
! 
! /* Identical to INTERESTING_SSA_EDGES.  For performance reasons, the
!list of SSA edges is split into two.  One contains all SSA edges
!who need to be reexamined because their lattice value changed to
!varying (this worklist), and the other contains all other SSA edges
!to be reexamined (INTERESTING_SSA_EDGES).
! 
!