Re: PATCH: PR target/59794: [4.7/4.8/4.9 Regression] i386 backend fails to detect MMX/SSE/AVX ABI changes
On Tue, Jan 14, 2014 at 06:18:22AM -0800, H.J. Lu wrote:
> 2014-01-14  H.J. Lu
>
> 	PR target/59794
> 	* config/i386/i386.c (type_natural_mode): Add a bool parameter
> 	to indicate if type is used for function return value.  Warn
> 	ABI change if the vector mode isn't available for function
> 	return value.
> 	(ix86_function_arg_advance): Pass false to type_natural_mode.
> 	(ix86_function_arg): Likewise.
> 	(ix86_gimplify_va_arg): Likewise.
> 	(function_arg_32): Don't warn ABI change.
> 	(ix86_function_value): Pass true to type_natural_mode.
> 	(ix86_return_in_memory): Likewise.
> 	(ix86_struct_value_rtx): Removed.
> 	(TARGET_STRUCT_VALUE_RTX): Likewise.

This has added many FAILs on i686-linux (make sure to configure for a CPU that doesn't automatically turn on -msse or -mmmx, say i686):

+FAIL: gcc.dg/Wstrict-aliasing-bogus-ref-all-2.c (test for excess errors)
+FAIL: gcc.dg/pr53060.c (test for excess errors)
+FAIL: c-c++-common/convert-vec-1.c -Wc++-compat (test for excess errors)
+FAIL: c-c++-common/scal-to-vec2.c -Wc++-compat (test for excess errors)
+FAIL: c-c++-common/vector-compare-2.c -Wc++-compat (test for excess errors)
+FAIL: g++.dg/conversion/simd1.C -std=c++98 (test for excess errors)
+FAIL: g++.dg/conversion/simd1.C -std=c++11 (test for excess errors)
+FAIL: g++.dg/cpp0x/constexpr-53094-2.C (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-1.C -std=gnu++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-1.C -std=gnu++11 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-2.C -std=gnu++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-2.C -std=gnu++11 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-3.C -std=c++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-3.C -std=c++11 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-4.C -std=c++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-4.C -std=c++11 (test for excess errors)
+FAIL: g++.dg/ext/pr56790-1.C -std=gnu++98 (test for excess errors)
+FAIL: g++.dg/ext/pr56790-1.C -std=gnu++11 (test for excess errors)
+FAIL: c-c++-common/convert-vec-1.c -std=c++98 (test for excess errors)
+FAIL: c-c++-common/convert-vec-1.c -std=c++11 (test for excess errors)
+FAIL: c-c++-common/scal-to-vec2.c -std=gnu++98 (test for excess errors)
+FAIL: c-c++-common/scal-to-vec2.c -std=gnu++11 (test for excess errors)
+FAIL: c-c++-common/vector-compare-2.c -std=gnu++98 (test for excess errors)
+FAIL: c-c++-common/vector-compare-2.c -std=gnu++11 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O0 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O1 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O2 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O3 -fomit-frame-pointer (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O3 -g (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -Os (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O2 -flto -flto-partition=none (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O2 -flto (test for excess errors)

Excess errors:
/usr/src/gcc/gcc/testsuite/gcc.dg/Wstrict-aliasing-bogus-ref-all-2.c:9:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/gcc.dg/pr53060.c:13:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/c-c++-common/convert-vec-1.c:3:1: warning: MMX vector return without MMX enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/c-c++-common/scal-to-vec2.c:19:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/c-c++-common/vector-compare-2.c:20:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/conversion/simd1.C:8:59: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/cpp0x/constexpr-53094-2.C:7:46: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-1.C:10:52: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-2.C:14:59: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-3.C:26:26: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-4.C:26:24: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/pr56790-1.C:6:12: warning: MMX vector return without MMX enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/torture/pr38565.C:5:
[wide-int] Fixed several regressions on the branch
This patch fixes what appears to have been a long-standing failure in the conversion of tree-vect-generic.c:build_replicated_const. This failure caused several regressions on the branch. Committed as revision 206616.

Index: gcc/tree-vect-generic.c
===
--- gcc/tree-vect-generic.c	(revision 206609)
+++ gcc/tree-vect-generic.c	(working copy)
@@ -57,7 +57,8 @@
 static tree
 build_replicated_const (tree type, tree inner_type, HOST_WIDE_INT value)
 {
   int width = tree_to_uhwi (TYPE_SIZE (inner_type));
-  int n = TYPE_PRECISION (type) / width;
+  int n = (TYPE_PRECISION (type) + HOST_BITS_PER_WIDE_INT - 1)
+	  / HOST_BITS_PER_WIDE_INT;
   unsigned HOST_WIDE_INT low, mask;
   HOST_WIDE_INT a[WIDE_INT_MAX_ELTS];
   int i;
Go patch committed: Use backend interface for interface expressions
This Go frontend patch from Chris Manghane uses the backend interface for interface info and field expressions. Bootstrapped and ran Go testsuite on x86_64-unknown-linux-gnu. Committed to mainline. Ian 2014-01-14 Chris Manghane * go-gcc.cc (Gcc_backend::compound_expression): New function. (Gcc_backend::conditional_expression): New function. Index: gcc/go/gofrontend/expressions.cc === --- gcc/go/gofrontend/expressions.cc (revision 206509) +++ gcc/go/gofrontend/expressions.cc (working copy) @@ -6473,11 +6473,11 @@ Expression::make_binary(Operator op, Exp tree Expression::comparison_tree(Translate_context* context, Type* result_type, - Operator op, Expression* left_expr, - Expression* right_expr, Location location) + Operator op, Expression* left, Expression* right, + Location location) { - Type* left_type = left_expr->type(); - Type* right_type = right_expr->type(); + Type* left_type = left->type(); + Type* right_type = right->type(); mpz_t zval; mpz_init_set_ui(zval, 0UL); @@ -6509,17 +6509,11 @@ Expression::comparison_tree(Translate_co go_unreachable(); } - // FIXME: Computing the tree here means it will be computed multiple times, - // which is wasteful. This is a temporary modification until all tree code - // here can be replaced with frontend expressions. 
- tree left_tree = left_expr->get_tree(context); - tree right_tree = right_expr->get_tree(context); if (left_type->is_string_type() && right_type->is_string_type()) { - Expression* strcmp_call = Runtime::make_call(Runtime::STRCMP, location, 2, - left_expr, right_expr); - left_tree = strcmp_call->get_tree(context); - right_tree = zexpr->get_tree(context); + left = Runtime::make_call(Runtime::STRCMP, location, 2, +left, right); + right = zexpr; } else if ((left_type->interface_type() != NULL && right_type->interface_type() == NULL @@ -6532,31 +6526,30 @@ Expression::comparison_tree(Translate_co if (left_type->interface_type() == NULL) { std::swap(left_type, right_type); - std::swap(left_expr, right_expr); + std::swap(left, right); } // The right operand is not an interface. We need to take its // address if it is not a pointer. Expression* pointer_arg = NULL; if (right_type->points_to() != NULL) -pointer_arg = right_expr; +pointer_arg = right; else { - go_assert(right_expr->is_addressable()); - pointer_arg = Expression::make_unary(OPERATOR_AND, right_expr, + go_assert(right->is_addressable()); + pointer_arg = Expression::make_unary(OPERATOR_AND, right, location); } - Expression* descriptor_expr = Expression::make_type_descriptor(right_type, - location); - Call_expression* iface_valcmp = + Expression* descriptor = + Expression::make_type_descriptor(right_type, location); + left = Runtime::make_call((left_type->interface_type()->is_empty() ? 
Runtime::EMPTY_INTERFACE_VALUE_COMPARE : Runtime::INTERFACE_VALUE_COMPARE), - location, 3, left_expr, descriptor_expr, + location, 3, left, descriptor, pointer_arg); - left_tree = iface_valcmp->get_tree(context); - right_tree = zexpr->get_tree(context); + right = zexpr; } else if (left_type->interface_type() != NULL && right_type->interface_type() != NULL) @@ -6574,56 +6567,42 @@ Expression::comparison_tree(Translate_co { go_assert(op == OPERATOR_EQEQ || op == OPERATOR_NOTEQ); std::swap(left_type, right_type); - std::swap(left_expr, right_expr); + std::swap(left, right); } go_assert(!left_type->interface_type()->is_empty()); go_assert(right_type->interface_type()->is_empty()); compare_function = Runtime::INTERFACE_EMPTY_COMPARE; } - Call_expression* ifacecmp_call = - Runtime::make_call(compare_function, location, 2, - left_expr, right_expr); - - left_tree = ifacecmp_call->get_tree(context); - right_tree = zexpr->get_tree(context); + left = Runtime::make_call(compare_function, location, 2, left, right); + right = zexpr; } if (left_type->is_nil_type() && (op == OPERATOR_EQEQ || op == OPERATOR_NOTEQ)) { std::swap(left_type, right_type); - std::swap(left_tree, right_tree); - std::swap(left_expr, right_expr); + std::swap(left, right); } if (right_type->is_nil_type()) { + right = Expression::make_nil(location); if (left_type
[PATCH/AARCH64] Fix register cost for moving to/from stack registers
While writing the Thunder tunings, I got an internal compiler error while building glibc. The reduced testcase is:

typedef unsigned int size_t;
typedef unsigned int wchar_t;
extern __thread int __libc_errno;
extern __thread int *t;
int
_IO_vfprintf_internal (char *string, char *workend, char *f)
{
  int save_errno = __libc_errno;
  do
    {
      int prec = (workend - string);
      string = (char *) __strerror_r (save_errno);
      if (*t == 1)
	{
	  size_t ignore_size = (unsigned) prec > 1024 ? 1024 : prec;
	  wchar_t ignore[ignore_size];
	  const char *str2 = string;
	  const char *strend = string + prec;
	  int ps;
	  while (str2 != ((void *) 0) && str2 < strend)
	    __mbsnrtowcs (ignore, &str2, &ps);
	}
    }
  while (*f != '\0');
}
--- CUT ---

I changed the cost of moving between two NEON registers (the FP_REGS register class) to be the same as the cost of moving between GENERAL_REGS and FP_REGS. This made the cost of moving between STACK_REG and FP_REGS the same as between FP_REGS and GENERAL_REGS, which is incorrect, since such a move has to go through GENERAL_REGS. This patch fixes the problem by changing the cost of a move between STACK_REG and FP_REGS to the cost of moving via GENERAL_REGS. OK?

Built and tested on aarch64-elf with no regressions.

Thanks,
Andrew Pinski

ChangeLog:
	* config/aarch64/aarch64.c (aarch64_register_move_cost): Correct
	cost of moving from/to the STACK_REG register class.

Index: config/aarch64/aarch64.c
===
--- config/aarch64/aarch64.c	(revision 206611)
+++ config/aarch64/aarch64.c	(working copy)
@@ -4870,6 +4870,16 @@ aarch64_register_move_cost (enum machine
   const struct cpu_regmove_cost *regmove_cost
     = aarch64_tune_params->regmove_cost;
 
+  /* Moving between GPR and stack cost is the same as GP2GP.  */
+  if ((from == GENERAL_REGS && to == STACK_REG)
+      || (to == GENERAL_REGS && from == STACK_REG))
+    return regmove_cost->GP2GP;
+
+  /* To/from the stack register, the move is via the GPRs.  */
+  if (to == STACK_REG || from == STACK_REG)
+    return aarch64_register_move_cost (mode, from, GENERAL_REGS)
+	   + aarch64_register_move_cost (mode, GENERAL_REGS, to);
+
   if (from == GENERAL_REGS && to == GENERAL_REGS)
     return regmove_cost->GP2GP;
   else if (from == GENERAL_REGS)
Go patch committed: Define Backend_function_type
This patch from Chris Manghane adds a Backend_function_type: a function type that is implemented as a function pointer rather than as the pointer to a struct that is the implementation of a Go function. This will be used in other backend conversions. Bootstrapped and ran Go testsuite on x86_64-unknown-linux-gnu. Committed to mainline. Ian diff -r 3515c11d7573 go/types.cc --- a/go/types.cc Thu Jan 09 22:40:50 2014 -0800 +++ b/go/types.cc Tue Jan 14 15:16:02 2014 -0800 @@ -4066,6 +4066,17 @@ return new Function_type(receiver, parameters, results, location); } +// Make a backend function type. + +Backend_function_type* +Type::make_backend_function_type(Typed_identifier* receiver, + Typed_identifier_list* parameters, + Typed_identifier_list* results, + Location location) +{ + return new Backend_function_type(receiver, parameters, results, location); +} + // Class Pointer_type. // Traversal. diff -r 3515c11d7573 go/types.h --- a/go/types.h Thu Jan 09 22:40:50 2014 -0800 +++ b/go/types.h Tue Jan 14 15:16:02 2014 -0800 @@ -19,6 +19,7 @@ class Complex_type; class String_type; class Function_type; +class Backend_function_type; class Struct_field; class Struct_field_list; class Struct_type; @@ -484,6 +485,12 @@ Typed_identifier_list* results, Location); + static Backend_function_type* + make_backend_function_type(Typed_identifier* receiver, + Typed_identifier_list* parameters, + Typed_identifier_list* results, + Location); + static Pointer_type* make_pointer_type(Type*); @@ -1896,6 +1903,23 @@ Btype* fnbtype_; }; +// The type of a function's backend representation. + +class Backend_function_type : public Function_type +{ + public: + Backend_function_type(Typed_identifier* receiver, +Typed_identifier_list* parameters, +Typed_identifier_list* results, Location location) + : Function_type(receiver, parameters, results, location) + { } + + protected: + Btype* + do_get_backend(Gogo* gogo) + { return this->get_backend_fntype(gogo); } +}; + // The type of a pointer. 
class Pointer_type : public Type
Re: [PATCH, rs6000] Don't emit profile code for procedures marked no_instrument_function
On Tue, Jan 14, 2014 at 3:42 PM, Pat Haugen wrote:
> This patch fixes a problem where the attribute no_instrument_function was
> being ignored and profile code was emitted. Testcase
> gcc.target/powerpc/ppc64-abi-2.c exposed the issue.
>
> Bootstrap/regtest with no new regressions, ok for trunk?
>
> -Pat
>
> 2014-01-13  Pat Haugen
>
> 	* config/rs6000/rs6000.c (rs6000_output_function_prologue): Check if
> 	current procedure should be profiled.

Okay.

Thanks, David
Re: [C PATCH] Disallow subtracting pointers to empty structs (PR c/58346)
On Mon, 13 Jan 2014, Marek Polacek wrote: > +/* Return true if T is a pointer to a zero-sized struct/union. */ > + > +bool > +pointer_to_zero_sized_aggr_p (tree t) > +{ > + t = strip_pointer_operator (t); > + if (RECORD_OR_UNION_TYPE_P (t) > + && TYPE_SIZE (t) > + && integer_zerop (TYPE_SIZE (t))) > +return true; > + return false; Given that GNU C also allows arrays of constant size 0, shouldn't the errors also apply in that case? (I don't know whether the original bug can appear for such arrays, but I'd think the errors should apply to anything with constant size 0 - not of course for VLAs where it just so happens that the compiler can tell at compile time that the size is always 0.) -- Joseph S. Myers jos...@codesourcery.com
Re: [C PATCH] Preevaluate rhs for lhs op= rhs in C (PR c/58943)
On Mon, 13 Jan 2014, Jakub Jelinek wrote: > This patch fixes the following testcase by preevaluating rhs if it has > (can have) side-effects in lhs op= rhs expressions. Bootstrapped/regtested > on x86_64-linux and i686-linux, ok for trunk? OK. -- Joseph S. Myers jos...@codesourcery.com
[PATCH, rs6000] Don't emit profile code for procedures marked no_instrument_function
This patch fixes a problem where the attribute no_instrument_function was being ignored and profile code was emitted. Testcase gcc.target/powerpc/ppc64-abi-2.c exposed the issue.

Bootstrap/regtest with no new regressions, ok for trunk?

-Pat

2014-01-13  Pat Haugen

	* config/rs6000/rs6000.c (rs6000_output_function_prologue): Check if
	current procedure should be profiled.

Index: config/rs6000/rs6000.c
===
--- config/rs6000/rs6000.c	(revision 206602)
+++ config/rs6000/rs6000.c	(working copy)
@@ -23198,7 +23198,7 @@ rs6000_output_function_prologue (FILE *f
   /* Output -mprofile-kernel code.  This needs to be done here instead of
      in output_function_profile since it must go after the ELFv2 ABI
      local entry point.  */
-  if (TARGET_PROFILE_KERNEL)
+  if (TARGET_PROFILE_KERNEL && crtl->profile)
     {
       gcc_assert (DEFAULT_ABI == ABI_AIX || DEFAULT_ABI == ABI_ELFv2);
       gcc_assert (!TARGET_32BIT);
Re: [PATCH,rs6000] Implement -maltivec=be for vec_mule and vec_mulo Altivec intrinsics
On Mon, Jan 13, 2014 at 6:37 PM, Bill Schmidt wrote: > This patch provides for interpreting parity of element numbers for the > Altivec vec_mule and vec_mulo intrinsics as big-endian (left to right in > a vector register) when targeting a little endian machine and specifying > -maltivec=be. New test cases are added to test this functionality on > all supported vector types. > > The main change is in the altivec.md define_insns for > vec_widen_{su}mult_{even,odd}_{v8hi,v16qi}, where we now test for > VECTOR_ELT_ORDER_BIG rather than BYTES_BIG_ENDIAN in order to treat the > element order as big-endian. However, this necessitates changes to > other places in altivec.md where we previously called > gen_vec_widen_{su}mult_*. The semantics of these internal uses are not > affected by -maltivec=be, so these are now replaced with direct > generation of the underlying instructions that were previously > generated. > > Bootstrapped and tested with no new regressions on > powerpc64{,le}-unknown-linux-gnu. Ok for trunk? > > Thanks, > Bill > > > gcc: > > 2014-01-13 Bill Schmidt > > * config/rs6000/altivec.md (mulv8hi3): Explicitly generate vmulesh > and vmulosh rather than call gen_vec_widen_smult_*. > (vec_widen_umult_even_v16qi): Test VECTOR_ELT_ORDER_BIG rather > than BYTES_BIG_ENDIAN to determine use of even or odd instruction. > (vec_widen_smult_even_v16qi): Likewise. > (vec_widen_umult_even_v8hi): Likewise. > (vec_widen_smult_even_v8hi): Likewise. > (vec_widen_umult_odd_v16qi): Likewise. > (vec_widen_smult_odd_v16qi): Likewise. > (vec_widen_umult_odd_v8hi): Likewise. > (vec_widen_smult_odd_v8hi): Likewise. > (vec_widen_umult_hi_v16qi): Explicitly generate vmuleub and > vmuloub rather than call gen_vec_widen_umult_*. > (vec_widen_umult_lo_v16qi): Likewise. > (vec_widen_smult_hi_v16qi): Explicitly generate vmulesb and > vmulosb rather than call gen_vec_widen_smult_*. > (vec_widen_smult_lo_v16qi): Likewise. 
> (vec_widen_umult_hi_v8hi): Explicitly generate vmuleuh and vmulouh
> rather than call gen_vec_widen_umult_*.
> (vec_widen_umult_lo_v8hi): Likewise.
> (vec_widen_smult_hi_v8hi): Explicitly generate vmulesh and vmulosh
> rather than call gen_vec_widen_smult_*.
> (vec_widen_smult_lo_v8hi): Likewise.
>
> gcc/testsuite:
>
> 2014-01-13  Bill Schmidt
>
> 	* gcc.dg/vmx/mult-even-odd.c: New.
> 	* gcc.dg/vmx/mult-even-odd-be-order.c: New.

Okay. The less said the better.

Thanks, David
Re: [Patch] Avoid gcc_assert in libgcov
On Thu, Jan 9, 2014 at 6:56 AM, Jan Hubicka wrote: >> As suggested by Honza, avoid bloating libgcov from gcc_assert by using >> a new macro gcov_nonruntime_assert in gcov-io.c that is only mapped to >> gcc_assert when not in libgcov. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2014-01-09 Teresa Johnson >> >> * gcov-io.c (gcov_position): Use gcov_nonruntime_assert. >> (gcov_is_error): Ditto. >> (gcov_rewrite): Ditto. >> (gcov_open): Ditto. >> (gcov_write_words): Ditto. >> (gcov_write_length): Ditto. >> (gcov_read_words): Ditto. >> (gcov_read_summary): Ditto. >> (gcov_sync): Ditto. >> (gcov_seek): Ditto. >> (gcov_histo_index): Ditto. >> (static void gcov_histogram_merge): Ditto. >> (compute_working_sets): Ditto. >> * gcov-io.h (gcov_nonruntime_assert): Define. >> > >> @@ -481,14 +481,14 @@ gcov_read_words (unsigned words) >>const gcov_unsigned_t *result; >>unsigned excess = gcov_var.length - gcov_var.offset; >> >> - gcc_assert (gcov_var.mode > 0); >> + gcov_nonruntime_assert (gcov_var.mode > 0); >>if (excess < words) >> { >>gcov_var.start += gcov_var.offset; >> #if IN_LIBGCOV >>if (excess) >> { >> - gcc_assert (excess == 1); >> + gcov_nonruntime_assert (excess == 1); > > It probably makes no sense to put nonruntime access into IN_LIBGCOV defines. You are right - there were several that were in IN_LIBGCOV defines that I can just remove. 
> >> memcpy (gcov_var.buffer, gcov_var.buffer + gcov_var.offset, 4); >> } >> #else >> @@ -497,7 +497,7 @@ gcov_read_words (unsigned words) >>gcov_var.offset = 0; >>gcov_var.length = excess; >> #if IN_LIBGCOV >> - gcc_assert (!gcov_var.length || gcov_var.length == 1); >> + gcov_nonruntime_assert (!gcov_var.length || gcov_var.length == 1); >>excess = GCOV_BLOCK_SIZE; >> #else >>if (gcov_var.length + words > gcov_var.alloc) >> @@ -614,7 +614,7 @@ gcov_read_summary (struct gcov_summary *summary) >>while (!cur_bitvector) >> { >>h_ix = bv_ix * 32; >> - gcc_assert (bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE); >> + gcov_nonruntime_assert (bv_ix < >> GCOV_HISTOGRAM_BITVECTOR_SIZE); >>cur_bitvector = histo_bitvector[bv_ix++]; >> } >>while (!(cur_bitvector & 0x1)) >> @@ -622,7 +622,7 @@ gcov_read_summary (struct gcov_summary *summary) >>h_ix++; >>cur_bitvector >>= 1; >> } >> - gcc_assert (h_ix < GCOV_HISTOGRAM_SIZE); >> + gcov_nonruntime_assert (h_ix < GCOV_HISTOGRAM_SIZE); > > How many of those asserts can be triggered by a corrupted gcda file? > I would like to make libgcov more safe WRT file corruptions, too, so in that > case we should produce an error message. In that case should we call gcov_error when IN_LIBGCOV? One possibility would be to simply make gcov_nonruntime_assert be defined as if (!EXPR) gcov_error in the IN_LIBGCOV case. But I think what you wanted here was to reduce libgcov bloat by removing calls altogether, which this wouldn't solve. But if we want to call gcov_error in some cases, I think I need to add another macro that will either do gcc_assert when !IN_LIBGCOV and "if (!EXPR) gcov_error" when IN_LIBGCOV. Is that what you had in mind? Thanks, Teresa > > The rest of changes seems OK. > > Honza -- Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
Re: [PATCH][IRA] Analysis of register usage of functions for usage by IRA.
On 12/05/2013 07:47 PM, Tom de Vries wrote: > On 14-03-13 10:34, Tom de Vries wrote: >>> I thought about implementing your optimization for LRA by myself. >>> But it >>> >is ok if you decide to work on it. At least, I am not going to start >>> >this work for a month. >>I'm also currently looking at how to use the analysis in LRA. >>AFAIU, in lra-constraints.c we do a backward scan over the insns, and keep track >>of how many calls we've seen (calls_num), and mark insns with that number. Then >>when looking at a live-range segment consisting of a def or use insn a and a >>following use insn b, we can compare the number of calls seen for each insn, and >>if they're not equal there is at least one call between the 2 insns, and if the >>corresponding hard register is clobbered by calls, we spill after insn a and >>restore before insn b. >> >>That is too coarse-grained to use with our analysis, since we need to know which >>calls occur in between insn a and insn b, and more precisely which registers >>those calls clobbered. >>> > >>I wonder though if we can do something similar: we keep an array >>call_clobbers_num[FIRST_PSEUDO_REG], initialized at 0 when we start scanning. >>When encountering a call, we increase the call_clobbers_num entries for the hard >>registers clobbered by the call. >>When encountering a use, we set the call_clobbers_num field of the use to >>call_clobbers_num[reg_renumber[original_regno]]. >>And when looking at a live-range segment, we compare the clobbers_num field of >>insn a and insn b, and if it is not equal, the hard register was clobbered by at >>least one call between insn a and insn b. >>Would that work? WDYT? >> >>> >As I understand you looked at live-range splitting code in >>> >lra-constraints.c. To get necessary info you should look at >>> ira-lives.c. >> Unfortunately I haven't been able to find time to work further on the >> LRA part. >> So if you're still willing to pick up that part, that would be great. 
>
> Vladimir,
>
> I gave this a try. The attached patch works for the included test-case
> for x86_64.
>
> I've bootstrapped and reg-tested the patch (in combination with the
> other patches from the series) on x86_64.
>
> OK for stage1?
>
Yes, it is OK for stage1. Thanks for not forgetting LRA, and sorry for the delay with the answer (it is not a high-priority patch for me right now). I believe this patch also helps improve code because of better spilling into SSE regs. Spilling into SSE regs instead of memory is rare right now, since all SSE regs are call-clobbered.

Thanks again, Tom.
Re: RFA: patch to fix PR59787 (arm target)
Thanks for the hint Vladimir, I'll pass some validation on arm.c and arm.md/aarch64.md separately. On 14 January 2014 20:09, Vladimir Makarov wrote: > On 01/14/2014 01:41 PM, Yvan Roux wrote: >>> A quick grep of the arm backend shows 11 instances of reload_in_progress: >>> >>> arm.c: && !(reload_in_progress || reload_completed) >>> arm.c: if (! (reload_in_progress || reload_completed) >>> arm.c: if (! (reload_in_progress || reload_completed) >>> arm.c: if (! (reload_in_progress || reload_completed) >>> arm.c: reload_in_progress || reload_completed)) >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> predicates.md:"offsettable_address_p (reload_completed | >>> reload_in_progress, >>> predicates.md: (and (match_test "reload_in_progress || >>> reload_completed") >>> >>> and aarch64 has five more: >>> >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> >>> Yvan, could you do a quick audit on these to see if they are also likely >>> to need fixing? >> Yes, I'll check all of them. > I checked these places too. I'd do analogous change for only arm.c in > thumb1_legitimate_address_p, neon_vector_mem_operand, and > neon_struct_mem_operand. I guess it is a a bad idea to do it in > predicates.md. Changes arm.md and aarch64.md is worth to try but I > believe LRA will work without the changes.
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Tue, Jan 14, 2014 at 07:37:33PM +0100, Uros Bizjak wrote:
> OK, let's play safe. I'll revert these two changes (modulo size of
> nocona prefetch block).

Thanks.

> > opt we never return a smaller number from ix86_data_alignment than
> > we did in 4.8 and earlier, because otherwise if you have 4.8 compiled
> > code that assumes the alignment 4.8 would use for something that is defined
> > in a compilation unit built by gcc 4.9+, if we don't align it at least
> > as much as we did in the past, the linked mix of 4.8 user and 4.9 definition
> > could misbehave.
>
> From 4.9 onwards, we would like to align >= 64byte structures on
> 64byte boundary. Should we add a compatibility rule to align >= 32byte
> structures to 32 bytes?
>
> Please also note that in 4.7 and 4.8, we have
>
> int max_align = optimize_size ? BITS_PER_WORD : MIN (256,
> MAX_OFILE_ALIGNMENT);
>
> so, in effect -Os code will be incompatible with other optimization levels.

Well, the max_align is only one of the several possibilities of alignment increases, but yes, there is an ABI issue, see e.g. PR56564 for details.

> I guess that for 4.7 and 4.8, we should revert to this anyway, but
> what to do with 4.9?

For 4.9, if what you've added is what you want to do for performance reasons, then I'd do something like:

  /* GCC 4.8 and earlier used to incorrectly assume this alignment even
     for symbols from other compilation units or symbols that don't need
     to bind locally.  In order to preserve some ABI compatibility with
     those compilers, ensure we don't decrease alignment from what we
     used to assume.  */
  int max_align_compat
    = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);

  /* A data structure, equal or greater than the size of a cache line
     (64 bytes in the Pentium 4 and other recent Intel processors,
     including processors based on Intel Core microarchitecture) should
     be aligned so that its base address is a multiple of a cache line
     size.  */
  int max_align
    = MIN ((unsigned) ix86_tune_cost->prefetch_block * 8, MAX_OFILE_ALIGNMENT);
  if (max_align < BITS_PER_WORD)
    max_align = BITS_PER_WORD;

  if (opt
      && AGGREGATE_TYPE_P (type)
      && TYPE_SIZE (type)
      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST)
    {
      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align_compat
	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  && align < max_align_compat)
	align = max_align_compat;
      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  && align < max_align)
	align = max_align;
    }

That way, max_align will be purely optimization and can be changed as anyone wishes in the future, max_align_compat compatibility with pre-4.9 (beyond ABI) assumptions and !opt stuff the ABI mandated alignment.

	Jakub
Re: [PATCH, reginfo.c, i386.c] Backport fix for PR58139 to 4.8
On 01/14/2014 12:22 PM, Peter Bergner wrote: > The mainline fix for PR58139 which is a wrong code gen bug was > submitted here: > > http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00910.html > > and approved for mainline and 4.8 (after a few weeks) here: > > http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00134.html > > However, my fix exposed a latent x86 bug, so this patch was never > committed to 4.8. The latent x86 bug was fixed by Honza and I'd > like to now ask to be able to backport my fix for PR58139 along > with Honza's fix to 4.8. > > This passed bootstrap and regtesting on powerpc64-linux and > I bootstrapped this on x86_64 and verified that the ICE seen > when compiling the test case with only my patch in: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58269#c2 > > is fixed when we add Honza's patch. Ok for 4.8? > > reginfo.c change is ok for me. Thanks, Peter.
Re: RFA: patch to fix PR59787 (arm target)
On 01/14/2014 01:41 PM, Yvan Roux wrote:
>> A quick grep of the arm backend shows 11 instances of reload_in_progress:
>>
>> arm.c: && !(reload_in_progress || reload_completed)
>> arm.c: if (! (reload_in_progress || reload_completed)
>> arm.c: if (! (reload_in_progress || reload_completed)
>> arm.c: if (! (reload_in_progress || reload_completed)
>> arm.c: reload_in_progress || reload_completed))
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> predicates.md: "offsettable_address_p (reload_completed | reload_in_progress,
>> predicates.md: (and (match_test "reload_in_progress || reload_completed")
>>
>> and aarch64 has five more:
>>
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>>
>> Yvan, could you do a quick audit on these to see if they are also likely
>> to need fixing?
> Yes, I'll check all of them.

I checked these places too. I'd make the analogous change only in arm.c, in thumb1_legitimate_address_p, neon_vector_mem_operand, and neon_struct_mem_operand. I guess it is a bad idea to do it in predicates.md. Changing arm.md and aarch64.md is worth a try, but I believe LRA will work without the changes.
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Tue, Jan 14, 2014 at 10:37 AM, Uros Bizjak wrote: > On Tue, Jan 14, 2014 at 6:09 PM, Jakub Jelinek wrote: > >>> On a second thought, the crossing of 16-byte boundaries is mentioned >>> for the data *access* (the instruction itself) if it is not naturally >>> aligned (please see example 3-40 and fig 3-2), which is *NOT* in our >>> case. >>> >>> So, we don't have to align 32 byte structures in any way for newer >>> processors, since this optimization applies to 64+ byte (larger or >>> equal to cache line size) structures only. Older processors are >>> handled correctly, modulo nocona, where its cache line size value has >>> to be corrected. >>> >>> Following that, my original patch implements this optimization in the >>> correct way. >> >> Sorry for catching this late, but on the 4.8 and earlier branches >> there is no opt argument and thus any ix86_data_alignment change is >> unfortunately an ABI change. So I'd think we should revert >> r206433 and r206436. And for the trunk we need to ensure even for > > OK, let's play safe. I'll revert these two changes (modulo size of > nocona prefetch block). > >> opt we never return a smaller number from ix86_data_alignment than >> we did in 4.8 and earlier, because otherwise if you have 4.8 compiled >> code that assumes the alignment 4.8 would use for something that is defined >> in a compilation unit built by gcc 4.9+, if we don't align it at least >> as much as we did in the past, the linked mix of 4.8 user and 4.9 definition >> could misbehave. > > From 4.9 onwards, we would like to align >= 64byte structures on > 64byte boundary. Should we add a compatibility rule to align >= 32byte > structures to 32 bytes? 
That is why we issue a warning when alignment was changed with AVX support: [hjl@gnu-6 tmp]$ cat a1.i typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__)); extern __m256i y; void f1(__m256i x) { y = x; } [hjl@gnu-6 tmp]$ gcc -S a1.i a1.i: In function ‘f1’: a1.i:4:1: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6 f1(__m256i x) ^ a1.i:4:1: warning: AVX vector argument without AVX enabled changes the ABI [enabled by default] [hjl@gnu-6 tmp]$ > Please also note that in 4.7 and 4.8, we have > > int max_align = optimize_size ? BITS_PER_WORD : MIN (256, > MAX_OFILE_ALIGNMENT); > > so, in effect -Os code will be incompatible with other optimization levels. > > I guess that for 4.7 and 4.8, we should revert to this anyway, but > what to do with 4.9? > > Uros. -- H.J.
Re: RFA: patch to fix PR59787 (arm target)
> A quick grep of the arm backend shows 11 instances of reload_in_progress: > > arm.c: && !(reload_in_progress || reload_completed) > arm.c: if (! (reload_in_progress || reload_completed) > arm.c: if (! (reload_in_progress || reload_completed) > arm.c: if (! (reload_in_progress || reload_completed) > arm.c: reload_in_progress || reload_completed)) > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > predicates.md:"offsettable_address_p (reload_completed | > reload_in_progress, > predicates.md: (and (match_test "reload_in_progress || > reload_completed") > > and aarch64 has five more: > > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > > Yvan, could you do a quick audit on these to see if they are also likely > to need fixing? Yes, I'll check all of them.
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Tue, Jan 14, 2014 at 6:09 PM, Jakub Jelinek wrote: >> On a second thought, the crossing of 16-byte boundaries is mentioned >> for the data *access* (the instruction itself) if it is not naturally >> aligned (please see example 3-40 and fig 3-2), which is *NOT* in our >> case. >> >> So, we don't have to align 32 byte structures in any way for newer >> processors, since this optimization applies to 64+ byte (larger or >> equal to cache line size) structures only. Older processors are >> handled correctly, modulo nocona, where its cache line size value has >> to be corrected. >> >> Following that, my original patch implements this optimization in the >> correct way. > > Sorry for catching this late, but on the 4.8 and earlier branches > there is no opt argument and thus any ix86_data_alignment change is > unfortunately an ABI change. So I'd think we should revert > r206433 and r206436. And for the trunk we need to ensure even for OK, let's play safe. I'll revert these two changes (modulo size of nocona prefetch block). > opt we never return a smaller number from ix86_data_alignment than > we did in 4.8 and earlier, because otherwise if you have 4.8 compiled > code that assumes the alignment 4.8 would use for something that is defined > in a compilation unit built by gcc 4.9+, if we don't align it at least > as much as we did in the past, the linked mix of 4.8 user and 4.9 definition > could misbehave. From 4.9 onwards, we would like to align >= 64byte structures on 64byte boundary. Should we add a compatibility rule to align >= 32byte structures to 32 bytes? Please also note that in 4.7 and 4.8, we have int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT); so, in effect -Os code will be incompatible with other optimization levels. I guess that for 4.7 and 4.8, we should revert to this anyway, but what to do with 4.9? Uros.
Re: RFA: patch to fix PR59787 (arm target)
On 14/01/14 16:48, Vladimir Makarov wrote: > The following patch fixes > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 > > The problem of LRA looping was in that address with soft frame pointer > was rejected as valid constraint in iwmmxt_amd_movdi insn. > > Ok to commit? > > 2014-01-14 Vladimir Makarov > > PR target/59787 > * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress. A quick grep of the arm backend shows 11 instances of reload_in_progress: arm.c: && !(reload_in_progress || reload_completed) arm.c: if (! (reload_in_progress || reload_completed) arm.c: if (! (reload_in_progress || reload_completed) arm.c: if (! (reload_in_progress || reload_completed) arm.c: reload_in_progress || reload_completed)) arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" predicates.md:"offsettable_address_p (reload_completed | reload_in_progress, predicates.md: (and (match_test "reload_in_progress || reload_completed") and aarch64 has five more: aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" Yvan, could you do a quick audit on these to see if they are also likely to need fixing? R.
Re: RFA: patch to fix PR59787 (arm target)
On 14/01/14 16:48, Vladimir Makarov wrote: > The following patch fixes > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 > > The problem of LRA looping was in that address with soft frame pointer > was rejected as valid constraint in iwmmxt_amd_movdi insn. > > Ok to commit? > > 2014-01-14 Vladimir Makarov > > PR target/59787 > * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress. > > OK. R.
Re: [C PATCH] Disallow subtracting pointers to empty structs (PR c/58346)
On 01/13/2014 09:48 PM, Marek Polacek wrote: +bool +pointer_to_zero_sized_aggr_p (tree t) +{ + t = strip_pointer_operator (t); + if (RECORD_OR_UNION_TYPE_P (t) + && TYPE_SIZE (t) + && integer_zerop (TYPE_SIZE (t))) +return true; + return false; +} I think you can just return the value of the condition; there's no need for the if statement. -- Florian Weimer / Red Hat Product Security Team
Re: RFA: patch to fix PR59787 (arm target)
Hi Vladimir, I have tested your patch with the following configuration, and it indeed eliminates the ICE: --target=arm-none-linux-gnueabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=softfp Thank you! Kind regards, Renlin Li On 14/01/14 16:48, Vladimir Makarov wrote: The following patch fixes http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 The problem of LRA looping was in that address with soft frame pointer was rejected as valid constraint in iwmmxt_amd_movdi insn. Ok to commit? 2014-01-14 Vladimir Makarov PR target/59787 * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress.
[PATCH, reginfo.c, i386.c] Backport fix for PR58139 to 4.8
The mainline fix for PR58139 which is a wrong code gen bug was submitted here: http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00910.html and approved for mainline and 4.8 (after a few weeks) here: http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00134.html However, my fix exposed a latent x86 bug, so this patch was never committed to 4.8. The latent x86 bug was fixed by Honza and I'd like to now ask to be able to backport my fix for PR58139 along with Honza's fix to 4.8. This passed bootstrap and regtesting on powerpc64-linux and I bootstrapped this on x86_64 and verified that the ICE seen when compiling the test case with only my patch in: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58269#c2 is fixed when we add Honza's patch. Ok for 4.8? Peter Backport from mainline 2013-09-06 Jan Hubicka * config/i386/i386.c (ix86_hard_regno_mode_ok): AVX modes are valid only when AVX is enabled. 2013-09-05 Peter Bergner PR target/58139 * reginfo.c (choose_hard_reg_mode): Scan through all mode classes looking for widest mode. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206582) +++ gcc/config/i386/i386.c (working copy) @@ -33944,7 +33944,7 @@ ix86_hard_regno_mode_ok (int regno, enum are available. OImode move is available only when AVX is enabled. */ return ((TARGET_AVX && mode == OImode) - || VALID_AVX256_REG_MODE (mode) + || (TARGET_AVX && VALID_AVX256_REG_MODE (mode)) || VALID_SSE_REG_MODE (mode) || VALID_SSE2_REG_MODE (mode) || VALID_MMX_REG_MODE (mode) Index: gcc/reginfo.c === --- gcc/reginfo.c (revision 206582) +++ gcc/reginfo.c (working copy) @@ -620,40 +620,35 @@ choose_hard_reg_mode (unsigned int regno mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! 
HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; - if (found_mode != VOIDmode) -return found_mode; - for (mode = GET_CLASS_NARROWEST_MODE (MODE_FLOAT); mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; - if (found_mode != VOIDmode) -return found_mode; - for (mode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_FLOAT); mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; - if (found_mode != VOIDmode) -return found_mode; - for (mode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT); mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; if (found_mode != VOIDmode)
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Fri, Jan 03, 2014 at 05:04:39PM +0100, Uros Bizjak wrote: > On a second thought, the crossing of 16-byte boundaries is mentioned > for the data *access* (the instruction itself) if it is not naturally > aligned (please see example 3-40 and fig 3-2), which is *NOT* in our > case. > > So, we don't have to align 32 byte structures in any way for newer > processors, since this optimization applies to 64+ byte (larger or > equal to cache line size) structures only. Older processors are > handled correctly, modulo nocona, where its cache line size value has > to be corrected. > > Following that, my original patch implements this optimization in the > correct way. Sorry for catching this late, but on the 4.8 and earlier branches there is no opt argument and thus any ix86_data_alignment change is unfortunately an ABI change. So I'd think we should revert r206433 and r206436. And for the trunk we need to ensure even for opt we never return a smaller number from ix86_data_alignment than we did in 4.8 and earlier, because otherwise if you have 4.8 compiled code that assumes the alignment 4.8 would use for something that is defined in a compilation unit built by gcc 4.9+, if we don't align it at least as much as we did in the past, the linked mix of 4.8 user and 4.9 definition could misbehave. Jakub
[PATCH][buildrobot] PR59496: Fix unused variable warning
Hi! In the buildrobot's logs for building with config-list.mk, I noticed this warning: g++ -c -DIN_GCC_FRONTEND -DIN_GCC_FRONTEND -g -O2 -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common -DHAVE_CONFIG_H -I. -Ic-family -I../../../gcc/gcc -I../../../gcc/gcc/c-family -I../../../gcc/gcc/../include -I../../../gcc/gcc/../libcpp/include -I/opt/cfarm/mpc/include -I../../../gcc/gcc/../libdecnumber -I../../../gcc/gcc/../libdecnumber/dpd -I../libdecnumber -I../../../gcc/gcc/../libbacktrace-o c-family/c-common.o -MT c-family/c-common.o -MMD -MP -MF c-family/.deps/c-common.TPo ../../../gcc/gcc/c-family/c-common.c ../../../gcc/gcc/c-family/c-common.c: In function ‘tree_node* c_sizeof_or_alignof_type(location_t, tree, bool, bool, int)’: ../../../gcc/gcc/c-family/c-common.c:5007:9: error: unused variable ‘field’ [-Werror=unused-variable] tree field = build_decl (UNKNOWN_LOCATION, FIELD_DECL, NULL_TREE, ^ cc1plus: all warnings being treated as errors make[2]: *** [c-family/c-common.o] Error 1 There were additional comments on the PR recently, I suggest this patch to plug it: 2014-01-14 Jan-Benedict Glaw PR bootstrap/59496 * config/rs6000/darwin.h (ADJUST_FIELD_ALIGN): Fix unused variable warning. diff --git a/gcc/config/rs6000/darwin.h b/gcc/config/rs6000/darwin.h index 43a2ab5..fc1f862 100644 --- a/gcc/config/rs6000/darwin.h +++ b/gcc/config/rs6000/darwin.h @@ -328,9 +328,10 @@ extern int darwin_emit_branch_islands; behavior is dealt with by darwin_rs6000_special_round_type_align. */ #define ADJUST_FIELD_ALIGN(FIELD, COMPUTED)\ - (TARGET_ALIGN_NATURAL ? (COMPUTED) \ - : (COMPUTED) == 128 ? 128 \ - : MIN ((COMPUTED), 32)) + ((void) (FIELD), \ + (TARGET_ALIGN_NATURAL ? (COMPUTED) \ +: (COMPUTED) == 128 ? 
128 \ +: MIN ((COMPUTED), 32))) /* Darwin increases natural record alignment to doubleword if the first field is an FP double while the FP fields remain word aligned. */ -- Jan-Benedict Glaw jbg...@lug-owl.de +49-172-7608481 Signature of: GDB has a 'break' feature; why doesn't it have 'fix' too?
Re: [Patch, xtensa] Add section anchor support for the xtensa backend.
On Tue, Jan 14, 2014 at 7:20 AM, Felix Yang wrote: > Hi Sterling, > > I found that we can avoid emitting excessive literal loading > instructions with section anchors. > This patch also passed the cases in testsuite/gcc.c-torture/execute/ dir. > Please apply it if OK for trunk. Hi Felix, It's been a while since I dealt with it, but literals are produced by gcc carefully such that they are arranged by pages and such to make certain that certain linux loader operations are convenient. (Marc, I believe, knows the details.) Marc, does this rearrangement of the literals make a difference on that side? Sterling > > > Index: gcc/ChangeLog > === > --- gcc/ChangeLog(revision 206599) > +++ gcc/ChangeLog(working copy) > @@ -1,3 +1,11 @@ > +2014-01-14 Felix Yang > + > +* common/config/xtensa/xtensa-common.c > +(xtensa_option_optimization_table): Enable -fsection-anchors under -O1 > +or plus, and disable -fcommon by default. > +* config/xtensa/xtensa.c (TARGET_MAX_ANCHOR_OFFSET): New. > +(TARGET_MIN_ANCHOR_OFFSET): Ditto. > + > 2014-01-14 Richard Biener > > PR tree-optimization/58921 > Index: gcc/common/config/xtensa/xtensa-common.c > === > --- gcc/common/config/xtensa/xtensa-common.c(revision 206599) > +++ gcc/common/config/xtensa/xtensa-common.c(working copy) > @@ -35,6 +35,13 @@ static const struct default_options xtensa_option_ > assembler, so GCC cannot do a good job of reordering blocks. > Do not enable reordering unless it is explicitly requested. */ > { OPT_LEVELS_ALL, OPT_freorder_blocks, NULL, 0 }, +/* Enable section anchors under -O1 or plus. This can avoid generating + excessive literal loading instructions to load addresses of globals. > */ +{ OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 2 }, +/* Allocate uninitialized global variables in the data section of object + file, rather than generating them as common blocks. This is required + for section anchors to work on uninitialized globals.
*/ > +{ OPT_LEVELS_ALL, OPT_fcommon, NULL, 0 }, > { OPT_LEVELS_NONE, 0, NULL, 0 } >}; > > Index: gcc/config/xtensa/xtensa.c > === > --- gcc/config/xtensa/xtensa.c(revision 206599) > +++ gcc/config/xtensa/xtensa.c(working copy) > @@ -290,6 +290,12 @@ static const int reg_nonleaf_alloc_order[FIRST_PSE > #undef TARGET_CANNOT_FORCE_CONST_MEM > #define TARGET_CANNOT_FORCE_CONST_MEM xtensa_cannot_force_const_mem > > +#undef TARGET_MAX_ANCHOR_OFFSET > +#define TARGET_MAX_ANCHOR_OFFSET 255 > + > +#undef TARGET_MIN_ANCHOR_OFFSET > +#define TARGET_MIN_ANCHOR_OFFSET 0 > + > #undef TARGET_LEGITIMATE_ADDRESS_P > #define TARGET_LEGITIMATE_ADDRESS_Pxtensa_legitimate_address_p > > > Cheers, > Felix
RFA: patch to fix PR59787 (arm target)
The following patch fixes http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 The problem of LRA looping was that an address with the soft frame pointer was rejected as a valid constraint in the iwmmxt_amd_movdi insn. Ok to commit? 2014-01-14 Vladimir Makarov PR target/59787 * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress. Index: config/arm/arm.c === --- config/arm/arm.c (revision 206579) +++ config/arm/arm.c (working copy) @@ -12439,7 +12439,7 @@ arm_coproc_mem_operand (rtx op, bool wb) rtx ind; /* Reject eliminable registers. */ - if (! (reload_in_progress || reload_completed) + if (! (reload_in_progress || reload_completed || lra_in_progress) && ( reg_mentioned_p (frame_pointer_rtx, op) || reg_mentioned_p (arg_pointer_rtx, op) || reg_mentioned_p (virtual_incoming_args_rtx, op)
Re: [Patch, xtensa] Add LOCAL_REGNO to the xtensa backend.
On Tue, Jan 14, 2014 at 7:14 AM, Felix Yang wrote: > Hi Sterling, > > The xtensa backend uses register windows, and we need to define > LOCAL_REGNO for it. Hi Felix, How does this change the produced code? In particular, please identify a problem this patch is solving. I know that the documentation for the macro says to define it if you have register windows, but things aren't that straightforward. The original register windowing support in GCC went in to support Sparc and, iirc, Intel i960. There are subtle differences between the two ISAs that aren't entirely captured in the GCC infrastructure. Note that in Sparc the window isn't modified until the save and restore instructions are executed, whereas in Xtensa, the save and restores happen combined with call and return. All of this makes for very subtle data-flow issues around certain registers--in particular, a8 (which functions as both an argument register and the frame pointer.) Passing gcc/torture/... isn't enough to establish that this works. It needs a full set of ABI tests to be established as correct. What we have there works perfectly well. Your patch may indeed improve the generated code, or be more correct, but please identify the problem you are solving and show how this patch improves things. Thanks, Sterling
RFA: Fix assembler data directives emitted for variable length structures
Hi Guys, Several PRs (28865, 57180 and 59719) have reported that the assembler directives emitted by gcc to describe a structure with a variable length field occupy too much space. That is a serious problem for targets which use section anchors as they rely upon objects in the data area being exactly the size that they are supposed to be. The attached patch fixes the problem by updating the output_constant() function so that it returns the number of bytes that it really did emit, which may be larger than the number of bytes that it was requested to emit. It also adds a couple of tests to the testsuite to check that the desired behaviour is achieved. Tested without regressions on i686-pc-linux-gnu and aarch64-elf toolchains, as well as bootstrapping and regression testing a powerpc64-linux toolchain. OK to apply ? Cheers Nick gcc/ChangeLog 2014-01-13 Nick Clifton PR middle-end/28865 * varasm.c (output_constant): Return the number of bytes actually emitted. (output_constructor_array_range): Update the field size with the number of bytes emitted by output_constant. (output_constructor_regular_field): Likewise. Also do not complain if the total number of bytes emitted is now greater than the expected fieldpos. * output.h (output_constant): Update prototype and descriptive comment. gcc/testsuite/ChangeLog 2014-01-13 Nick Clifton PR middle-end/28865 * gcc.c-torture/compile/pr28865.c: New. * gcc.c-torture/execute/pr28865.c: New. Index: gcc/output.h === --- gcc/output.h (revision 206572) +++ gcc/output.h (working copy) @@ -294,11 +294,13 @@ This includes the pseudo-op such as ".int" or ".byte", and a newline. Assumes output_addressed_constants has been done on EXP already. - Generate exactly SIZE bytes of assembler data, padding at the end - with zeros if necessary. SIZE must always be specified. + Generate at least SIZE bytes of assembler data, padding at the end + with zeros if necessary. SIZE must always be specified. 
The returned + value is the actual number of bytes of assembler data generated, which + may be bigger than SIZE if the object contains a variable length field. ALIGN is the alignment in bits that may be assumed for the data. */ -extern void output_constant (tree, unsigned HOST_WIDE_INT, unsigned int); +extern unsigned HOST_WIDE_INT output_constant (tree, unsigned HOST_WIDE_INT, unsigned int); /* When outputting delayed branch sequences, this rtx holds the sequence being output. It is null when no delayed branch Index: gcc/varasm.c === --- gcc/varasm.c (revision 206572) +++ gcc/varasm.c (working copy) @@ -4584,8 +4584,10 @@ This includes the pseudo-op such as ".int" or ".byte", and a newline. Assumes output_addressed_constants has been done on EXP already. - Generate exactly SIZE bytes of assembler data, padding at the end - with zeros if necessary. SIZE must always be specified. + Generate at least SIZE bytes of assembler data, padding at the end + with zeros if necessary. SIZE must always be specified. The returned + value is the actual number of bytes of assembler data generated, which + may be bigger than SIZE if the object contains a variable length field. SIZE is important for structure constructors, since trailing members may have been omitted from the constructor. @@ -4600,14 +4602,14 @@ ALIGN is the alignment of the data in bits. */ -void +unsigned HOST_WIDE_INT output_constant (tree exp, unsigned HOST_WIDE_INT size, unsigned int align) { enum tree_code code; unsigned HOST_WIDE_INT thissize; if (size == 0 || flag_syntax_only) -return; +return size; /* See if we're trying to initialize a pointer in a non-default mode to the address of some declaration somewhere. If the target says @@ -4672,7 +4674,7 @@ && vec_safe_is_empty (CONSTRUCTOR_ELTS (exp))) { assemble_zeros (size); - return; + return size; } if (TREE_CODE (exp) == FDESC_EXPR) @@ -4684,7 +4686,7 @@ #else gcc_unreachable (); #endif - return; + return size; } /* Now output the underlying data. 
If we've handling the padding, return. @@ -4723,8 +4725,7 @@ switch (TREE_CODE (exp)) { case CONSTRUCTOR: - output_constructor (exp, size, align, NULL); - return; + return output_constructor (exp, size, align, NULL); case STRING_CST: thissize = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size); @@ -4752,11 +4753,10 @@ case RECORD_TYPE: case UNION_TYPE: gcc_assert (TREE_CODE (exp) == CONSTRUCTOR); - output_constructor (exp, size, align, NULL); - return; + return output_constructor (exp, size, align, NULL); case
Re: PATCH: PR target/59794: [4.7/4.8/4.9 Regression] i386 backend fails to detect MMX/SSE/AVX ABI changes
On Tue, Jan 14, 2014 at 3:18 PM, H.J. Lu wrote: > There are several problems with i386 MMX/SSE/AVX ABI change detection: > > 1. MMX/SSE return value isn't checked for -m32 since revision 83533: > > http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=83533 > > which added ix86_struct_value_rtx. Since MMX/SSE condition is always > false, the MMX/SSE return value ABI change is disabled. > 2. For -m32, the same warning on MMX/SSE argument is issued twice, one from > type_natural_mode and one from function_arg_32. > 3. AVX return value ABI change isn't checked. > > This patch does followings: > > 1. Remove the ineffective ix86_struct_value_rtx. > 2. Add a bool parameter to indicate if type is used for function return > value. Warn ABI change if the vector mode isn't available for function > return value. Add AVX function return value ABI change warning. > 3. Consolidate ABI change warning into type_natural_mode. > 4. Update g++.dg/ext/vector23.C to prune ABI change for Linux/x86 > added by the AVX function return value ABI change warning. > 5. Update gcc.target/i386/pr39162.c to avoid the AVX function return > value ABI change warning. > 6. Add testcases for warning MMX/SSE/AVX ABI changes in parameter > passing and function return. > > Tested on Linux/x86-64 with -m32/-m64 for "make check". OK to install? > > Thanks. > > H.J. > --- > gcc/ > > 2014-01-14 H.J. Lu > > PR target/59794 > * config/i386/i386.c (type_natural_mode): Add a bool parameter > to indicate if type is used for function return value. Warn > ABI change if the vector mode isn't available for function > return value. > (ix86_function_arg_advance): Pass false to type_natural_mode. > (ix86_function_arg): Likewise. > (ix86_gimplify_va_arg): Likewise. > (function_arg_32): Don't warn ABI change. > (ix86_function_value): Pass true to type_natural_mode. > (ix86_return_in_memory): Likewise. > (ix86_struct_value_rtx): Removed. > (TARGET_STRUCT_VALUE_RTX): Likewise. > > gcc/testsuite/ > > 2014-01-14 H.J. 
Lu > > PR target/59794 > * g++.dg/ext/vector23.C: Also prune ABI change for Linux/x86. > * gcc.target/i386/pr39162.c (y): New __m256i variable. > (bar): Change return type to void. Set y to x. > * gcc.target/i386/pr59794-1.c: New testcase. > * gcc.target/i386/pr59794-2.c: Likewise. > * gcc.target/i386/pr59794-3.c: Likewise. > * gcc.target/i386/pr59794-4.c: Likewise. > * gcc.target/i386/pr59794-5.c: Likewise. > * gcc.target/i386/pr59794-6.c: Likewise. > * gcc.target/i386/pr59794-7.c: Likewise. OK for mainline and release branches after a couple of days. Thanks, Uros.
Re: [PATCH i386 10/8] [AVX512] Add missing AVX-512ER patterns, intrinsics, tests.
On Sun, Jan 12, 2014 at 10:02 PM, Kirill Yukhin wrote: > Hello, > On 11 Jan 12:42, Uros Bizjak wrote: >> On Fri, Jan 10, 2014 at 5:24 PM, Jakub Jelinek wrote: >> > This means you should ensure aligned_mem will be set for >> > CODE_FOR_avx512f_movntdqa in ix86_expand_special_args_builtin. > Fixed. Updated patch in the bottom. > >> > Leaving the rest of review to Uros/Richard. >> >> The rest is OK. > Thanks! I'll check it in tomorrow if no more issues! > It caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59808 H.J.
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 02:49:30PM +, Richard Earnshaw wrote: > On 14/01/14 14:32, Jakub Jelinek wrote: > > Anyway, the above is really a simple case, and I'd call it a > > backend bug if it isn't able to generate good code out of that. > > Exactly which back-end pass are you expecting to simplify > > (set (subreg:SI (reg:HI 1) 0) (and:SI (subreg:SI (reg:HI 0) 0) > (const_int 2))) Well, already at the expansion time you know here that & 2 is zero-extended, so you can/should emit here (set (reg:SI 1)) (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2))) instead. And/or combiner, and/or what Kugan has been working on, to use the remembered SSA_NAME range info during expansion. And/or, as I said earlier, we can have a type promotion pass driven by backend properties. But not narrowing types anywhere will mean say the various issues mentioned in PR45397. For vectorizations we want to decrease the number of different type widths in a loop as much as possible, etc. Jakub
Re: [Patch AArch64] Implement Vector Permute Support
> On Jan 14, 2014, at 7:19 AM, Alex Velenko wrote: > > Hi, > > This patch turns off the vec_perm patterns for aarch64_be, this should resolve > the issue highlighted here > http://gcc.gnu.org/ml/gcc-patches/2014-01/msg00321.html > With this patch applied, the test case provided in that link compiles without > an ICE. > > However, the Big-Endian port is still in development. This patch exposes > another known but unrelated issue with Big-Endian Large-Int modes. > > The patch has been tested on aarch64-none-elf and aarch64_be-none-elf > resulting in five > further regression due to the broken implementation of Big-Endian Large-Int > modes. > > Kind regards, > Alex Velenko > > gcc/ > > 2014-01-14 Alex Velenko > >* config/aarch64/aarch64-simd.md (vec_perm): Add BE check. >* config/aarch64/aarch64.c (aarch64_expand_vec_perm): Add comment. > > gcc/testsuite/ > > 2014-01-14 Alex Velenko > >* lib/target-supports.exp >(check_effective_target_vect_perm): Exclude aarch64_be. >(check_effective_target_vect_perm_byte): Likewise. >(check_effective_target_vect_perm_short): Likewise. I think you want to use a function to check if the target is effectively big-endian instead. Internally at Cavium, our elf compiler has big-endian multi-lib. Thanks, Andrew > >
Re: [wide-int] resolve bootstrap issue
Mike Stump writes: > diff --git a/gcc/expmed.c b/gcc/expmed.c > index ce063eb..720d8c1 100644 > --- a/gcc/expmed.c > +++ b/gcc/expmed.c > @@ -4963,6 +4963,7 @@ make_tree (tree type, rtx x) >return t; > > case CONST_DOUBLE: > + gcc_assert (HOST_BITS_PER_WIDE_INT * 2 <= MAX_BITSIZE_MODE_ANY_INT); >if (TARGET_SUPPORTS_WIDE_INT == 0 && GET_MODE (x) == VOIDmode) > t = wide_int_to_tree (type, > wide_int::from_array (&CONST_DOUBLE_LOW (x), 2, I think this would be better as a STATIC_ASSERT. > @@ -1440,10 +1442,10 @@ real_to_integer (const REAL_VALUE_TYPE *r, bool > *fail, int precision) > } > #endif >w = SIGSZ * HOST_BITS_PER_LONG + words * HOST_BITS_PER_WIDE_INT; > - result = wide_int::from_array > + tmp = real_int::from_array > (val, (w + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT, w); > - result = wi::lrshift (result, (words * HOST_BITS_PER_WIDE_INT) - exp); > - result = wide_int::from (result, precision, UNSIGNED); > + tmp = wi::lrshift (tmp, (words * HOST_BITS_PER_WIDE_INT) - > exp); > + result = wide_int::from (tmp, precision, UNSIGNED); Why did you need the real_int? It was supposed to work without. > diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c > index 00b5439..7c21afa 100644 > --- a/gcc/simplify-rtx.c > +++ b/gcc/simplify-rtx.c > @@ -5384,6 +5384,7 @@ simplify_immed_subreg (enum machine_mode outermode, rtx > op, > tmp[u] = buf; > base += HOST_BITS_PER_WIDE_INT; > } > + gcc_assert (GET_MODE_PRECISION (outer_submode) <= > MAX_BITSIZE_MODE_ANY_INT); > r = wide_int::from_array (tmp, units, > GET_MODE_PRECISION (outer_submode)); > elems[elem] = immed_wide_int_const (r, outer_submode); Long line. Looks good to me otherwise FWIW. Thanks, Richard
[Patch, xtensa] Add section anchor support for the xtensa backend.
Hi Sterling, I found that we can avoid emitting excessive literal loading instructions with section anchors. This patch also passed the cases in testsuite/gcc.c-torture/execute/ dir. Please apply it if OK for trunk. Index: gcc/ChangeLog === --- gcc/ChangeLog(revision 206599) +++ gcc/ChangeLog(working copy) @@ -1,3 +1,11 @@ +2014-01-14 Felix Yang + +* common/config/xtensa/xtensa-common.c +(xtensa_option_optimization_table): Enable -fsection-anchors under -O1 +or plus, and disable -fcommon by default. +* config/xtensa/xtensa.c (TARGET_MAX_ANCHOR_OFFSET): New. +(TARGET_MIN_ANCHOR_OFFSET): Ditto. + 2014-01-14 Richard Biener PR tree-optimization/58921 Index: gcc/common/config/xtensa/xtensa-common.c === --- gcc/common/config/xtensa/xtensa-common.c(revision 206599) +++ gcc/common/config/xtensa/xtensa-common.c(working copy) @@ -35,6 +35,13 @@ static const struct default_options xtensa_option_ assembler, so GCC cannot do a good job of reordering blocks. Do not enable reordering unless it is explicitly requested. */ { OPT_LEVELS_ALL, OPT_freorder_blocks, NULL, 0 }, +/* Enable section anchors under -O1 or plus. This can avoid generating + excessive literal loading instructions to load addresses of globals. */ +{ OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 2 }, +/* Allocate uninitialized global variables in the data section of object + file, rather than generating them as common blocks. This is required + for section anchors to work on uninitialized globals. 
*/ +{ OPT_LEVELS_ALL, OPT_fcommon, NULL, 0 }, { OPT_LEVELS_NONE, 0, NULL, 0 } }; Index: gcc/config/xtensa/xtensa.c === --- gcc/config/xtensa/xtensa.c(revision 206599) +++ gcc/config/xtensa/xtensa.c(working copy) @@ -290,6 +290,12 @@ static const int reg_nonleaf_alloc_order[FIRST_PSE #undef TARGET_CANNOT_FORCE_CONST_MEM #define TARGET_CANNOT_FORCE_CONST_MEM xtensa_cannot_force_const_mem +#undef TARGET_MAX_ANCHOR_OFFSET +#define TARGET_MAX_ANCHOR_OFFSET 255 + +#undef TARGET_MIN_ANCHOR_OFFSET +#define TARGET_MIN_ANCHOR_OFFSET 0 + #undef TARGET_LEGITIMATE_ADDRESS_P #define TARGET_LEGITIMATE_ADDRESS_Pxtensa_legitimate_address_p Cheers, Felix
Re: [Patch AArch64] Implement Vector Permute Support
Hi, This patch turns off the vec_perm patterns for aarch64_be; this should resolve the issue highlighted here http://gcc.gnu.org/ml/gcc-patches/2014-01/msg00321.html With this patch applied, the test case provided in that link compiles without an ICE. However, the Big-Endian port is still in development. This patch exposes another known but unrelated issue with Big-Endian Large-Int modes. The patch has been tested on aarch64-none-elf and aarch64_be-none-elf resulting in five further regressions due to the broken implementation of Big-Endian Large-Int modes. Kind regards, Alex Velenko gcc/ 2014-01-14 Alex Velenko * config/aarch64/aarch64-simd.md (vec_perm): Add BE check. * config/aarch64/aarch64.c (aarch64_expand_vec_perm): Add comment. gcc/testsuite/ 2014-01-14 Alex Velenko * lib/target-supports.exp (check_effective_target_vect_perm): Exclude aarch64_be. (check_effective_target_vect_perm_byte): Likewise. (check_effective_target_vect_perm_short): Likewise. diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index bc47a291de4b9b24d829e4dbf060fff7a321558f..43a9c5b27d78a47cf965636a03232005a4c8e7c3 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -3840,7 +3840,7 @@ (match_operand:VB 1 "register_operand") (match_operand:VB 2 "register_operand") (match_operand:VB 3 "register_operand")] - "TARGET_SIMD" + "TARGET_SIMD && !BYTES_BIG_ENDIAN" { aarch64_expand_vec_perm (operands[0], operands[1], operands[2], operands[3]); diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index 95360089b89d5fef2997dc6dbe7f47a6864143ea..084668af5124aa1c4a7f25495cf44b52811d0e62 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -3417,7 +3417,8 @@ proc check_effective_target_vect_perm { } { } else { set et_vect_perm_saved 0 if { [is-effective-target arm_neon_ok] - || [istarget aarch64*-*-*] + || ([istarget aarch64*-*-*] + && ![istarget 
aarch64_be*-*-*]) || [istarget powerpc*-*-*] || [istarget spu-*-*] || [istarget i?86-*-*] @@ -3445,7 +3446,8 @@ proc check_effective_target_vect_perm_byte { } { set et_vect_perm_byte_saved 0 if { ([is-effective-target arm_neon_ok] && [is-effective-target arm_little_endian]) - || [istarget aarch64*-*-*] + || ([istarget aarch64*-*-*] + && ![istarget aarch64_be*-*-*]) || [istarget powerpc*-*-*] || [istarget spu-*-*] } { set et_vect_perm_byte_saved 1 @@ -3469,7 +3471,8 @@ proc check_effective_target_vect_perm_short { } { set et_vect_perm_short_saved 0 if { ([is-effective-target arm_neon_ok] && [is-effective-target arm_little_endian]) - || [istarget aarch64*-*-*] + || ([istarget aarch64*-*-*] + && ![istarget aarch64_be*-*-*]) || [istarget powerpc*-*-*] || [istarget spu-*-*] } { set et_vect_perm_short_saved 1
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On 14/01/14 14:49, Richard Earnshaw wrote: > On 14/01/14 14:32, Jakub Jelinek wrote: >> Anyway, the above is really a simple case, and I'd call it a >> backend bug if it isn't able to generate good code out of that. > > Exactly which back-end pass are you expecting to simplify > > (set (subreg:SI (reg:HI 1) 0) (and:SI (subreg:SI (reg:HI 0) 0) > (const_int 2))) > > (set (reg:SI 2) (zero_extend:SI (reg:HI 1))) > > (set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0))) > > into > > (set (reg:SI 2) (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2))) > > (set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0))) > > Combine is about the only pass that does this sort of thing, and that's > far too often confused by extraneous information that it thinks might be > helpful, but isn't, or by the fact that the intermediate result is used > more than once. > > R. > Consider this case: struct b2Body { unsigned short flags; int type; }; _Bool IsAwake(short *a, struct b2Body *b) { int c; c = b->flags & 2; *a = c; return c == 2; } There's a redundant extend operation left in here on ARM, MIPS & PPC. ARM: ldrh r3, [r1] and r3, r3, #2 uxth r3, r3 // Redundant strh r3, [r0] adds r0, r3, #0 movne r0, #1 bx lr MIPS: lhu $2,0($5) nop andi $2,$2,0x2 andi $2,$2,0x // Redundant sh $2,0($4) j $31 sltu $2,$0,$2 PPC: lhz 9,0(4) rlwinm 9,9,0,30,30 rlwinm 9,9,0,0x // Redundant sth 9,0(3) addic 10,9,-1 subfe 3,10,9 blr So if this is a backend issue, it's widespread on word-based machines. R.
[Patch, xtensa] Add LOCAL_REGNO to the xtensa backend.
Hi Sterling, The xtensa backend uses register windows, and we need to define LOCAL_REGNO for it. Without this macro, the dataflow information may not be accurate. This patch passed the cases in testsuite/gcc.c-torture/execute dir. Please apply it if OK for trunk. Thanks. Index: gcc/ChangeLog === --- gcc/ChangeLog(revision 206599) +++ gcc/ChangeLog(working copy) @@ -1,3 +1,7 @@ +2014-01-14 Felix Yang + +* config/xtensa/xtensa.h (LOCAL_REGNO): New. + 2014-01-14 Richard Biener PR tree-optimization/58921 Index: gcc/config/xtensa/xtensa.h === --- gcc/config/xtensa/xtensa.h(revision 206599) +++ gcc/config/xtensa/xtensa.h(working copy) @@ -369,7 +369,14 @@ extern char xtensa_hard_regno_mode_ok[][FIRST_PSEU ((unsigned) ((IN) - GP_REG_FIRST) < WINDOW_SIZE)) ?\ (IN) + WINDOW_SIZE : (IN)) +/* Define this macro if the target machine has register windows. This + C expression returns true if the register is call-saved but is in the + register window. */ +#define LOCAL_REGNO(REGNO)\ + (GP_REG_P (REGNO) && ((unsigned) (REGNO - GP_REG_FIRST) < WINDOW_SIZE)) + + /* Define the classes of registers for register constraints in the machine description. */ enum reg_class Cheers, Felix
[gomp4 6/6] Enable initial support in the C front end for OpenACC data clauses.
From: Thomas Schwinge gcc/c/ * c-parser.c (OACC_PARALLEL_CLAUSE_MASK): Add PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYIN, PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_PRESENT, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, and PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE. gcc/testsuite/ * c-c++-common/goacc/data-clause-duplicate-1.c: New file. * c-c++-common/goacc/deviceptr-1.c: New file. libgomp/ * testsuite/libgomp.oacc-c/parallel-1.c: Extend. --- gcc/c/c-parser.c | 14 +- .../c-c++-common/goacc/data-clause-duplicate-1.c | 13 ++ gcc/testsuite/c-c++-common/goacc/deviceptr-1.c | 64 + libgomp/testsuite/libgomp.oacc-c/parallel-1.c | 150 +++-- 4 files changed, 228 insertions(+), 13 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c create mode 100644 gcc/testsuite/c-c++-common/goacc/deviceptr-1.c diff --git gcc/c/c-parser.c gcc/c/c-parser.c index 48c55e6..d6a2af0 100644 --- gcc/c/c-parser.c +++ gcc/c/c-parser.c @@ -11225,8 +11225,17 @@ c_parser_omp_structured_block (c_parser *parser) LOC is the location of the #pragma token. 
*/ -#define OACC_PARALLEL_CLAUSE_MASK \ - (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_NONE) +#define OACC_PARALLEL_CLAUSE_MASK \ + ( (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_COPY) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_COPYIN) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_COPYOUT) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_CREATE) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_DEVICEPTR)\ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN)\ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE) ) static tree c_parser_oacc_parallel (location_t loc, c_parser *parser) @@ -11235,7 +11244,6 @@ c_parser_oacc_parallel (location_t loc, c_parser *parser) clauses = c_parser_oacc_all_clauses (parser, OACC_PARALLEL_CLAUSE_MASK, "#pragma acc parallel"); - gcc_assert (clauses == NULL); block = c_begin_omp_parallel (); add_stmt (c_parser_omp_structured_block (parser)); diff --git gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c new file mode 100644 index 000..1bcf5be --- /dev/null +++ gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c @@ -0,0 +1,13 @@ +void +fun (void) +{ + float *fp; +#pragma acc parallel copy(fp[0:2],fp[0:2]) /* { dg-error "'fp' appears more than once in map clauses" } */ + ; +#pragma acc parallel present_or_copyin(fp[3]) present_or_copyout(fp[7:4]) /* { dg-error "'fp' appears more than once in map clauses" } */ + ; +#pragma acc parallel create(fp[:10]) deviceptr(fp) + /* { dg-error "'fp' appears more than once in map clauses" "" { target *-*-* } 9 } */ + /* { dg-message "sorry, unimplemented: data clause not yet implemented" "" { target *-*-* } 9 } */ + ; +} diff --git gcc/testsuite/c-c++-common/goacc/deviceptr-1.c gcc/testsuite/c-c++-common/goacc/deviceptr-1.c new file mode 100644 
index 000..0f0cf0c --- /dev/null +++ gcc/testsuite/c-c++-common/goacc/deviceptr-1.c @@ -0,0 +1,64 @@ +void +fun1 (void) +{ +#pragma acc parallel deviceptr(u) /* { dg-error "'u' undeclared" } */ + ; +#pragma acc parallel deviceptr(u[0:4]) /* { dg-error "expected '\\\)' before '\\\[' token" } */ + ; + +#pragma acc parallel deviceptr(fun1) /* { dg-error "'fun1' is not a variable" } */ + ; +#pragma acc parallel deviceptr(fun1[2:5]) + /* { dg-error "'fun1' is not a variable" "not a variable" { target *-*-* } 11 } */ + /* { dg-error "expected '\\\)' before '\\\[' token" "array" { target *-*-* } 11 } */ + ; + + int i; +#pragma acc parallel deviceptr(i) /* { dg-error "'i' is not a pointer variable" } */ + ; +#pragma acc parallel deviceptr(i[0:4]) + /* { dg-error "'i' is not a pointer variable" "not a pointer variable" { target *-*-* } 19 } */ + /* { dg-error "expected '\\\)' before '\\\[' token" "array" { target *-*-* } 19 } */ + ; + + float fa[10]; +#pragma acc parallel deviceptr(fa) /* { dg-error "'fa' is not a pointer variable" } */ + ; +#pragma acc parallel deviceptr(fa[1:5]) + /* { dg-error "'fa' is not a pointer variable" "not a pointer variable" { target *-*-* } 27 } */ + /* { dg-error "expected '\\\)' before '\\\[' token" "array" { target *-*-* } 27 } */ + ; + + fl
[gomp4 3/6] Initial support for OpenACC memory mapping semantics.
From: Thomas Schwinge gcc/ * tree-core.h (omp_clause_map_kind): Add OMP_CLAUSE_MAP_FORCE, OMP_CLAUSE_MAP_FORCE_ALLOC, OMP_CLAUSE_MAP_FORCE_TO, OMP_CLAUSE_MAP_FORCE_FROM, OMP_CLAUSE_MAP_FORCE_TOFROM, OMP_CLAUSE_MAP_FORCE_PRESENT, OMP_CLAUSE_MAP_FORCE_DEALLOC, and OMP_CLAUSE_MAP_FORCE_DEVICEPTR. * tree-pretty-print.c (dump_omp_clause): Handle these. * gimplify.c (gimplify_omp_var_data): Add GOVD_MAP_FORCE. (omp_region_type): Add ORT_TARGET_MAP_FORCE. (omp_add_variable, omp_notice_threadprivate_variable) (omp_notice_variable, gimplify_scan_omp_clauses) (gimplify_adjust_omp_clauses_1): Extend accordingly. (gimplify_oacc_parallel): Add ORT_TARGET_MAP_FORCE to ORT_TARGET usage. * omp-low.c (install_var_field, scan_sharing_clauses) (lower_oacc_parallel, lower_omp_target): Extend accordingly. --- gcc/gimplify.c | 92 ++--- gcc/omp-low.c | 33 +++--- gcc/tree-core.h | 19 +- gcc/tree-pretty-print.c | 21 +++ 4 files changed, 140 insertions(+), 25 deletions(-) diff --git gcc/gimplify.c gcc/gimplify.c index 90507c2..633784f 100644 --- gcc/gimplify.c +++ gcc/gimplify.c @@ -69,7 +69,13 @@ enum gimplify_omp_var_data GOVD_PRIVATE_OUTER_REF = 1024, GOVD_LINEAR = 2048, GOVD_ALIGNED = 4096, + + /* Flags for GOVD_MAP. */ + /* Don't copy back. */ GOVD_MAP_TO_ONLY = 8192, + /* Force a specific behavior (or else, a run-time error). */ + GOVD_MAP_FORCE = 16384, + GOVD_DATA_SHARE_CLASS = (GOVD_SHARED | GOVD_PRIVATE | GOVD_FIRSTPRIVATE | GOVD_LASTPRIVATE | GOVD_REDUCTION | GOVD_LINEAR | GOVD_LOCAL) @@ -86,7 +92,11 @@ enum omp_region_type ORT_UNTIED_TASK = 5, ORT_TEAMS = 8, ORT_TARGET_DATA = 16, - ORT_TARGET = 32 + ORT_TARGET = 32, + + /* Flags for ORT_TARGET. */ + /* Default to GOVD_MAP_FORCE for implicit mappings in this region. */ + ORT_TARGET_MAP_FORCE = 64 }; /* Gimplify hashtable helper. */ @@ -5430,9 +5440,20 @@ omp_add_variable (struct gimplify_omp_ctx *ctx, tree decl, unsigned int flags) copy into or out of the context. */ if (!(flags & GOVD_LOCAL)) { - nflags = flags & GOVD_MAP - ? 
GOVD_MAP | GOVD_MAP_TO_ONLY | GOVD_EXPLICIT - : flags & GOVD_PRIVATE ? GOVD_PRIVATE : GOVD_FIRSTPRIVATE; + if (flags & GOVD_MAP) + { + nflags = GOVD_MAP | GOVD_MAP_TO_ONLY | GOVD_EXPLICIT; +#if 0 + /* Not sure if this is actually needed; haven't found a case +where this would change anything; TODO. */ + if (flags & GOVD_MAP_FORCE) + nflags |= OMP_CLAUSE_MAP_FORCE; +#endif + } + else if (flags & GOVD_PRIVATE) + nflags = GOVD_PRIVATE; + else + nflags = GOVD_FIRSTPRIVATE; nflags |= flags & GOVD_SEEN; t = DECL_VALUE_EXPR (decl); gcc_assert (TREE_CODE (t) == INDIRECT_REF); @@ -5501,6 +5522,8 @@ omp_notice_threadprivate_variable (struct gimplify_omp_ctx *ctx, tree decl, for (octx = ctx; octx; octx = octx->outer_context) if (octx->region_type & ORT_TARGET) { + gcc_assert (!(octx->region_type & ORT_TARGET_MAP_FORCE)); + n = splay_tree_lookup (octx->variables, (splay_tree_key)decl); if (n == NULL) { @@ -5562,19 +5585,45 @@ omp_notice_variable (struct gimplify_omp_ctx *ctx, tree decl, bool in_code) n = splay_tree_lookup (ctx->variables, (splay_tree_key)decl); if (ctx->region_type & ORT_TARGET) { + unsigned map_force; + if (ctx->region_type & ORT_TARGET_MAP_FORCE) + map_force = GOVD_MAP_FORCE; + else + map_force = 0; if (n == NULL) { if (!lang_hooks.types.omp_mappable_type (TREE_TYPE (decl))) { error ("%qD referenced in target region does not have " "a mappable type", decl); - omp_add_variable (ctx, decl, GOVD_MAP | GOVD_EXPLICIT | flags); + omp_add_variable (ctx, decl, GOVD_MAP | map_force | GOVD_EXPLICIT | flags); } else - omp_add_variable (ctx, decl, GOVD_MAP | flags); + omp_add_variable (ctx, decl, GOVD_MAP | map_force | flags); } else - n->value |= flags; + { +#if 0 + /* The following fails for: + +int l = 10; +float c[l]; +#pragma acc parallel copy(c[2:4]) + { +#pragma acc parallel +{ + int t = sizeof c; +} + } + +..., which we currently don't have to care about (nesting +disabled), but eventually will have to; TODO. 
*/ + if ((n->value & GOVD_MAP) && !(n->value & GOVD_EXPLICIT)) + gcc
[gomp4 1/6] During gimplification, allow additional flags next to ORT_TARGET.
From: Thomas Schwinge gcc/ * gimplify.c (gimplify_call_expr, gimplify_modify_expr) (omp_firstprivatize_variable, omp_notice_threadprivate_variable) (omp_notice_variable, gimplify_adjust_omp_clauses) (gimplify_omp_workshare): Treat ORT_TARGET as a flag, not as a value. --- gcc/gimplify.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git gcc/gimplify.c gcc/gimplify.c index e45bed2..90507c2 100644 --- gcc/gimplify.c +++ gcc/gimplify.c @@ -2363,7 +2363,7 @@ gimplify_call_expr (tree *expr_p, gimple_seq *pre_p, bool want_value) during omplower pass instead. */ struct gimplify_omp_ctx *ctx; for (ctx = gimplify_omp_ctxp; ctx; ctx = ctx->outer_context) - if (ctx->region_type == ORT_TARGET) + if (ctx->region_type & ORT_TARGET) break; if (ctx == NULL) fold_stmt (&gsi); @@ -4534,7 +4534,7 @@ gimplify_modify_expr (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p, during omplower pass instead. */ struct gimplify_omp_ctx *ctx; for (ctx = gimplify_omp_ctxp; ctx; ctx = ctx->outer_context) -if (ctx->region_type == ORT_TARGET) +if (ctx->region_type & ORT_TARGET) break; if (ctx == NULL) fold_stmt (&gsi); @@ -5317,7 +5317,7 @@ omp_firstprivatize_variable (struct gimplify_omp_ctx *ctx, tree decl) else return; } - else if (ctx->region_type == ORT_TARGET) + else if (ctx->region_type & ORT_TARGET) omp_add_variable (ctx, decl, GOVD_MAP | GOVD_MAP_TO_ONLY); else if (ctx->region_type != ORT_WORKSHARE && ctx->region_type != ORT_SIMD @@ -5499,7 +5499,7 @@ omp_notice_threadprivate_variable (struct gimplify_omp_ctx *ctx, tree decl, struct gimplify_omp_ctx *octx; for (octx = ctx; octx; octx = octx->outer_context) -if (octx->region_type == ORT_TARGET) +if (octx->region_type & ORT_TARGET) { n = splay_tree_lookup (octx->variables, (splay_tree_key)decl); if (n == NULL) @@ -5560,7 +5560,7 @@ omp_notice_variable (struct gimplify_omp_ctx *ctx, tree decl, bool in_code) } n = splay_tree_lookup (ctx->variables, (splay_tree_key)decl); - if (ctx->region_type == ORT_TARGET) + if 
(ctx->region_type & ORT_TARGET) { if (n == NULL) { @@ -6285,7 +6285,7 @@ gimplify_adjust_omp_clauses (tree *list_p) if (!DECL_P (decl)) break; n = splay_tree_lookup (ctx->variables, (splay_tree_key) decl); - if (ctx->region_type == ORT_TARGET && !(n->value & GOVD_SEEN)) + if ((ctx->region_type & ORT_TARGET) && !(n->value & GOVD_SEEN)) remove = true; else if (DECL_SIZE (decl) && TREE_CODE (DECL_SIZE (decl)) != INTEGER_CST @@ -6857,7 +6857,7 @@ gimplify_omp_workshare (tree *expr_p, gimple_seq *pre_p) gcc_unreachable (); } gimplify_scan_omp_clauses (&OMP_CLAUSES (expr), pre_p, ort); - if (ort == ORT_TARGET || ort == ORT_TARGET_DATA) + if ((ort & ORT_TARGET) || ort == ORT_TARGET_DATA) { push_gimplify_context (); gimple g = gimplify_and_return_first (OMP_BODY (expr), &body); -- 1.8.1.1
[gomp4 2/6] Prepare for extending omp_clause_map_kind.
From: Thomas Schwinge gcc/ * tree-core.h (omp_clause_map_kind): Make the identifiers' bit patterns more obvious. Add comments. * omp-low.c (lower_oacc_parallel, lower_omp_target): Test for omp_clause_map_kind flags set instead of for values. --- gcc/omp-low.c | 22 ++ gcc/tree-core.h | 16 +++- 2 files changed, 25 insertions(+), 13 deletions(-) diff --git gcc/omp-low.c gcc/omp-low.c index eb755c3..899e970 100644 --- gcc/omp-low.c +++ gcc/omp-low.c @@ -8855,13 +8855,16 @@ lower_oacc_parallel (gimple_stmt_iterator *gsi_p, omp_context *ctx) { tree avar = create_tmp_var (TREE_TYPE (var), NULL); mark_addressable (avar); - if (OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_ALLOC - && OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_FROM) + enum omp_clause_map_kind map_kind + = OMP_CLAUSE_MAP_KIND (c); + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_TO)) + || map_kind == OMP_CLAUSE_MAP_POINTER) gimplify_assign (avar, var, &ilist); avar = build_fold_addr_expr (avar); gimplify_assign (x, avar, &ilist); - if ((OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_FROM -|| OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_TOFROM) + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_FROM)) && !TYPE_READONLY (TREE_TYPE (var))) { x = build_sender_ref (ovar, ctx); @@ -10331,13 +10334,16 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) gcc_assert (kind == GF_OMP_TARGET_KIND_REGION); tree avar = create_tmp_var (TREE_TYPE (var), NULL); mark_addressable (avar); - if (OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_ALLOC - && OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_FROM) + enum omp_clause_map_kind map_kind + = OMP_CLAUSE_MAP_KIND (c); + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_TO)) + || map_kind == OMP_CLAUSE_MAP_POINTER) gimplify_assign (avar, var, &ilist); avar = build_fold_addr_expr (avar); gimplify_assign (x, avar, &ilist); - if ((OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_FROM -|| OMP_CLAUSE_MAP_KIND (c) == 
OMP_CLAUSE_MAP_TOFROM) + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_FROM)) && !TYPE_READONLY (TREE_TYPE (var))) { x = build_sender_ref (ovar, ctx); diff --git gcc/tree-core.h gcc/tree-core.h index e2750e0..3602b5f 100644 --- gcc/tree-core.h +++ gcc/tree-core.h @@ -1112,14 +1112,20 @@ enum omp_clause_depend_kind enum omp_clause_map_kind { - OMP_CLAUSE_MAP_ALLOC, - OMP_CLAUSE_MAP_TO, - OMP_CLAUSE_MAP_FROM, - OMP_CLAUSE_MAP_TOFROM, + /* If not already present, allocate. */ + OMP_CLAUSE_MAP_ALLOC = 0, + /* ..., and copy to device. */ + OMP_CLAUSE_MAP_TO = 1 << 0, + /* ..., and copy from device. */ + OMP_CLAUSE_MAP_FROM = 1 << 1, + /* ..., and copy to and from device. */ + OMP_CLAUSE_MAP_TOFROM = OMP_CLAUSE_MAP_TO | OMP_CLAUSE_MAP_FROM, + /* Special map kinds. */ + OMP_CLAUSE_MAP_SPECIAL = 1 << 2, /* The following kind is an internal only map kind, used for pointer based array sections. OMP_CLAUSE_SIZE for these is not the pointer size, which is implicitly POINTER_SIZE / BITS_PER_UNIT, but the bias. */ - OMP_CLAUSE_MAP_POINTER + OMP_CLAUSE_MAP_POINTER = OMP_CLAUSE_MAP_SPECIAL }; enum omp_clause_proc_bind_kind -- 1.8.1.1
[gomp4 4/6] C front end infrastructure for OpenACC clauses parsing.
From: Thomas Schwinge gcc/c/ * c-parser.c (c_parser_oacc_all_clauses): New function. (c_parser_oacc_parallel): Use it. * c-typeck.c (c_finish_omp_clauses): Update comment. Remove duplicated variable initialization. --- gcc/c/c-parser.c | 59 +++- gcc/c/c-typeck.c | 4 ++-- 2 files changed, 56 insertions(+), 7 deletions(-) diff --git gcc/c/c-parser.c gcc/c/c-parser.c index ce46f31..c8b80db 100644 --- gcc/c/c-parser.c +++ gcc/c/c-parser.c @@ -9465,7 +9465,7 @@ c_parser_pragma_pch_preprocess (c_parser *parser) c_common_pch_pragma (parse_in, TREE_STRING_POINTER (name)); } -/* OpenMP 2.5 / 3.0 / 3.1 / 4.0 parsing routines. */ +/* OpenACC and OpenMP parsing routines. */ /* Returns name of the next clause. If the clause is not recognized PRAGMA_OMP_CLAUSE_NONE is returned and @@ -10767,9 +10767,58 @@ c_parser_omp_clause_uniform (c_parser *parser, tree list) return list; } +/* Parse all OpenACC clauses. The set clauses allowed by the directive + is a bitmask in MASK. Return the list of clauses found. */ + +static tree +c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask, + const char *where, bool finish_p = true) +{ + tree clauses = NULL; + bool first = true; + + while (c_parser_next_token_is_not (parser, CPP_PRAGMA_EOL)) +{ + location_t here; + pragma_omp_clause c_kind; + const char *c_name; + tree prev = clauses; + + if (!first && c_parser_next_token_is (parser, CPP_COMMA)) + c_parser_consume_token (parser); + + here = c_parser_peek_token (parser)->location; + c_kind = c_parser_omp_clause_name (parser); + + switch (c_kind) + { + default: + c_parser_error (parser, "expected clause"); + goto saw_error; + } + + first = false; + + if (((mask >> c_kind) & 1) == 0 && !parser->error) + { + /* Remove the invalid clause(s) from the list to avoid +confusing the rest of the compiler. 
*/ + clauses = prev; + error_at (here, "%qs is not valid for %qs", c_name, where); + } +} + + saw_error: + c_parser_skip_to_pragma_eol (parser); + + if (finish_p) +return c_finish_omp_clauses (clauses); + + return clauses; +} + /* Parse all OpenMP clauses. The set clauses allowed by the directive - is a bitmask in MASK. Return the list of clauses found; the result - of clause default goes in *pdefault. */ + is a bitmask in MASK. Return the list of clauses found. */ static tree c_parser_omp_all_clauses (c_parser *parser, omp_clause_mask mask, @@ -11019,8 +11068,8 @@ c_parser_oacc_parallel (location_t loc, c_parser *parser) { tree stmt, clauses, block; - clauses = c_parser_omp_all_clauses (parser, OACC_PARALLEL_CLAUSE_MASK, - "#pragma acc parallel"); + clauses = c_parser_oacc_all_clauses (parser, OACC_PARALLEL_CLAUSE_MASK, + "#pragma acc parallel"); gcc_assert (clauses == NULL); block = c_begin_omp_parallel (); diff --git gcc/c/c-typeck.c gcc/c/c-typeck.c index 854e149..81f0c5c 100644 --- gcc/c/c-typeck.c +++ gcc/c/c-typeck.c @@ -11661,7 +11661,7 @@ c_find_omp_placeholder_r (tree *tp, int *, void *data) return NULL_TREE; } -/* For all elements of CLAUSES, validate them vs OpenMP constraints. +/* For all elements of CLAUSES, validate them against their constraints. Remove any elements from the list that are invalid. */ tree @@ -11669,7 +11669,7 @@ c_finish_omp_clauses (tree clauses) { bitmap_head generic_head, firstprivate_head, lastprivate_head; bitmap_head aligned_head; - tree c, t, *pc = &clauses; + tree c, t, *pc; bool branch_seen = false; bool copyprivate_seen = false; tree *nowait_clause = NULL; -- 1.8.1.1
[gomp4 5/6] Initial support in the C front end for OpenACC data clauses.
From: Thomas Schwinge gcc/c-family/ * c-pragma.h (pragma_omp_clause): Add PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DELETE, PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_PRESENT, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, and PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE. gcc/c/ * c-parser.c (c_parser_omp_clause_name): Handle these. (c_parser_oacc_data_clause, c_parser_oacc_data_clause_deviceptr): New functions. (c_parser_oacc_all_clauses): Handle PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYIN, PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DELETE, PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_PRESENT, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, and PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE. gcc/ * tree-core.h (omp_clause_code): Update description for OMP_CLAUSE_MAP. --- gcc/c-family/c-pragma.h | 12 +++- gcc/c/c-parser.c| 171 +++- gcc/tree-core.h | 6 +- 3 files changed, 184 insertions(+), 5 deletions(-) diff --git gcc/c-family/c-pragma.h gcc/c-family/c-pragma.h index 64eed11..2c8af67 100644 --- gcc/c-family/c-pragma.h +++ gcc/c-family/c-pragma.h @@ -63,18 +63,23 @@ typedef enum pragma_kind { } pragma_kind; -/* All clauses defined by OpenMP 2.5, 3.0, 3.1 and 4.0. +/* All clauses defined by OpenACC 2.0, and OpenMP 2.5, 3.0, 3.1, and 4.0. Used internally by both C and C++ parsers. 
*/ typedef enum pragma_omp_clause { PRAGMA_OMP_CLAUSE_NONE = 0, PRAGMA_OMP_CLAUSE_ALIGNED, PRAGMA_OMP_CLAUSE_COLLAPSE, + PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYIN, + PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_COPYPRIVATE, + PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DEFAULT, + PRAGMA_OMP_CLAUSE_DELETE, PRAGMA_OMP_CLAUSE_DEPEND, PRAGMA_OMP_CLAUSE_DEVICE, + PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_DIST_SCHEDULE, PRAGMA_OMP_CLAUSE_FINAL, PRAGMA_OMP_CLAUSE_FIRSTPRIVATE, @@ -92,6 +97,11 @@ typedef enum pragma_omp_clause { PRAGMA_OMP_CLAUSE_NUM_THREADS, PRAGMA_OMP_CLAUSE_ORDERED, PRAGMA_OMP_CLAUSE_PARALLEL, + PRAGMA_OMP_CLAUSE_PRESENT, + PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, + PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, + PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, + PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE, PRAGMA_OMP_CLAUSE_PRIVATE, PRAGMA_OMP_CLAUSE_PROC_BIND, PRAGMA_OMP_CLAUSE_REDUCTION, diff --git gcc/c/c-parser.c gcc/c/c-parser.c index c8b80db..48c55e6 100644 --- gcc/c/c-parser.c +++ gcc/c/c-parser.c @@ -9496,16 +9496,26 @@ c_parser_omp_clause_name (c_parser *parser) case 'c': if (!strcmp ("collapse", p)) result = PRAGMA_OMP_CLAUSE_COLLAPSE; + else if (!strcmp ("copy", p)) + result = PRAGMA_OMP_CLAUSE_COPY; else if (!strcmp ("copyin", p)) result = PRAGMA_OMP_CLAUSE_COPYIN; + else if (!strcmp ("copyout", p)) + result = PRAGMA_OMP_CLAUSE_COPYOUT; else if (!strcmp ("copyprivate", p)) result = PRAGMA_OMP_CLAUSE_COPYPRIVATE; + else if (!strcmp ("create", p)) + result = PRAGMA_OMP_CLAUSE_CREATE; break; case 'd': - if (!strcmp ("depend", p)) + if (!strcmp ("delete", p)) + result = PRAGMA_OMP_CLAUSE_DELETE; + else if (!strcmp ("depend", p)) result = PRAGMA_OMP_CLAUSE_DEPEND; else if (!strcmp ("device", p)) result = PRAGMA_OMP_CLAUSE_DEVICE; + else if (!strcmp ("deviceptr", p)) + result = PRAGMA_OMP_CLAUSE_DEVICEPTR; else if (!strcmp ("dist_schedule", p)) result = PRAGMA_OMP_CLAUSE_DIST_SCHEDULE; break; @@ -9550,6 +9560,16 @@ c_parser_omp_clause_name (c_parser *parser) case 'p': 
if (!strcmp ("parallel", p)) result = PRAGMA_OMP_CLAUSE_PARALLEL; + else if (!strcmp ("present", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT; + else if (!strcmp ("present_or_copy", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY; + else if (!strcmp ("present_or_copyin", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN; + else if (!strcmp ("present_or_copyout", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT; + else if (!strcmp ("present_or_create", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE; else if (!strcmp ("private", p)) result = PRAGMA_OMP_CLAUSE_PRIVATE; else if (!strcmp ("proc_bind", p)) @@ -9611,7 +9631,7 @@ check_no_duplicate_clause (tree clauses, enum omp_clause_code code, } } -/* OpenMP 2.5: +/* OpenACC 2.0, OpenMP
[gomp4] Initial support for OpenACC data clauses
Hi! Here is a patch series that adds initial support for OpenACC data clauses. It is not yet complete, but I thought I might as well strive already now to get this integrated upstream instead of "hoarding" the patches locally. Would it be a good idea to also commit to trunk the (portions of the) patches that don't directly relate to OpenACC stuff? That way, trunk and gomp-4_0-branch would diverge a little less? Or, would you first like to see all of this stabilize on gomp-4_0-branch? Regards, Thomas
[wide-int] resolve bootstrap issue
This resolves a bootstrap issue found after reducing the size of the maximal wide_int; the real code really does want a slightly larger type so we create one just for real. Ok? diff --git a/gcc/builtins.c b/gcc/builtins.c index 81bb407..f4ffdb0 100644 --- a/gcc/builtins.c +++ b/gcc/builtins.c @@ -685,10 +685,10 @@ c_readstr (const char *str, enum machine_mode mode) HOST_WIDE_INT tmp[MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT]; gcc_assert (GET_MODE_CLASS (mode) == MODE_INT); - unsigned int len = (GET_MODE_PRECISION (mode) + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT; + gcc_assert (len <= MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT); for (i = 0; i < len; i++) tmp[i] = 0; diff --git a/gcc/expmed.c b/gcc/expmed.c index ce063eb..720d8c1 100644 --- a/gcc/expmed.c +++ b/gcc/expmed.c @@ -4963,6 +4963,7 @@ make_tree (tree type, rtx x) return t; case CONST_DOUBLE: + gcc_assert (HOST_BITS_PER_WIDE_INT * 2 <= MAX_BITSIZE_MODE_ANY_INT); if (TARGET_SUPPORTS_WIDE_INT == 0 && GET_MODE (x) == VOIDmode) t = wide_int_to_tree (type, wide_int::from_array (&CONST_DOUBLE_LOW (x), 2, diff --git a/gcc/lto-streamer-in.c b/gcc/lto-streamer-in.c index d058307..08eba48 100644 --- a/gcc/lto-streamer-in.c +++ b/gcc/lto-streamer-in.c @@ -1312,6 +1312,7 @@ lto_input_tree_1 (struct lto_input_block *ib, struct data_in *data_in, for (i = 0; i < len; i++) a[i] = streamer_read_hwi (ib); + gcc_assert (TYPE_PRECISION (type) <= MAX_BITSIZE_MODE_ANY_INT); result = wide_int_to_tree (type, wide_int::from_array (a, len, TYPE_PRECISION (type))); streamer_tree_cache_append (data_in->reader_cache, result, hash); diff --git a/gcc/real.c b/gcc/real.c index b060497..51d1868 100644 --- a/gcc/real.c +++ b/gcc/real.c @@ -1377,10 +1377,12 @@ real_to_integer (const REAL_VALUE_TYPE *r) wide_int real_to_integer (const REAL_VALUE_TYPE *r, bool *fail, int precision) { + typedef FIXED_WIDE_INT (WIDE_INT_MAX_PRECISION * 2) real_int; HOST_WIDE_INT val[2 * MAX_BITSIZE_MODE_ANY_INT / 
HOST_BITS_PER_WIDE_INT]; int exp; int words; wide_int result; + real_int tmp; int w; switch (r->cl) @@ -1440,10 +1442,10 @@ real_to_integer (const REAL_VALUE_TYPE *r, bool *fail, int precision) } #endif w = SIGSZ * HOST_BITS_PER_LONG + words * HOST_BITS_PER_WIDE_INT; - result = wide_int::from_array + tmp = real_int::from_array (val, (w + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT, w); - result = wi::lrshift (result, (words * HOST_BITS_PER_WIDE_INT) - exp); - result = wide_int::from (result, precision, UNSIGNED); + tmp = wi::lrshift (tmp, (words * HOST_BITS_PER_WIDE_INT) - exp); + result = wide_int::from (tmp, precision, UNSIGNED); if (r->sign) return -result;

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c index 00b5439..7c21afa 100644 --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -5384,6 +5384,7 @@ simplify_immed_subreg (enum machine_mode outermode, rtx op, tmp[u] = buf; base += HOST_BITS_PER_WIDE_INT; } + gcc_assert (GET_MODE_PRECISION (outer_submode) <= MAX_BITSIZE_MODE_ANY_INT); r = wide_int::from_array (tmp, units, GET_MODE_PRECISION (outer_submode)); elems[elem] = immed_wide_int_const (r, outer_submode);

diff --git a/gcc/tree-vect-generic.c b/gcc/tree-vect-generic.c index 50453b4..74d29d2 100644 --- a/gcc/tree-vect-generic.c +++ b/gcc/tree-vect-generic.c @@ -62,7 +62,7 @@ build_replicated_const (tree type, tree inner_type, HOST_WIDE_INT value) HOST_WIDE_INT a[WIDE_INT_MAX_ELTS]; int i; - gcc_assert (n); + gcc_assert (n && n <= WIDE_INT_MAX_ELTS); if (width == HOST_BITS_PER_WIDE_INT) low = value; @@ -75,6 +75,7 @@ build_replicated_const (tree type, tree inner_type, HOST_WIDE_INT value) for (i = 0; i < n; i++) a[i] = low; + gcc_assert (TYPE_PRECISION (type) <= MAX_BITSIZE_MODE_ANY_INT); return wide_int_to_tree (type, wide_int::from_array (a, n, TYPE_PRECISION (type))); }
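The asserts this patch adds all follow one pattern: compute the number of HOST_WIDE_INT words a given precision needs and check that it fits the fixed-size buffer before filling it. A minimal standalone sketch of that guard (the constants are stand-ins, not GCC's real host-dependent values):

```c
#include <assert.h>

/* Stand-ins for HOST_BITS_PER_WIDE_INT and MAX_BITSIZE_MODE_ANY_INT;
   the real values depend on the host and the configured targets.  */
#define HOST_BITS 64
#define MAX_BITS  160

/* Number of host words needed to hold PREC bits (ceiling division),
   with the overflow guard the patch adds before writing the buffer.  */
static unsigned int
words_for_precision (unsigned int prec)
{
  unsigned int len = (prec + HOST_BITS - 1) / HOST_BITS;
  assert (len <= MAX_BITS / HOST_BITS);   /* gcc_assert in the real code */
  return len;
}
```

The point of the assert is that after shrinking the maximal wide_int, a mode wider than the buffer would silently overrun `tmp[]`/`val[]`/`a[]`; the guard turns that into an ICE instead.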
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On 14/01/14 14:32, Jakub Jelinek wrote: > Anyway, the above is really a simple case, and I'd call it a > backend bug if it isn't able to generate good code out of that. Exactly which back-end pass are you expecting to simplify

(set (subreg:SI (reg:HI 1) 0)
     (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2)))
(set (reg:SI 2) (zero_extend:SI (reg:HI 1)))
(set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0)))

into

(set (reg:SI 2) (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2)))
(set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0)))

Combine is about the only pass that does this sort of thing, and that's far too often confused by extraneous information that it thinks might be helpful, but isn't, or by the fact that the intermediate result is used more than once. R.
Re: PR 59712 patch
On 9 January 2014 21:55, François Dumont wrote: > Hi > > Here is a patch for this small problem with clang. It is not a blocking > issue for the 4.9 release but at the same time it is a rather safe fix so > just tell me if I can commit it. > > All unordered_* tests run under Linux x86_64. I don't have clang installed > at the moment so clang feedback would be appreciated. I'm going to test this with Clang to be sure it actually fixes the bug before we change anything, but it might have to wait for the weekend. N.B. please change the comment at the top of the patch to say "pool of nodes" instead of "pool of node"
Re: [C PATCH] Disallow subtracting pointers to empty structs (PR c/58346)
The C++ part is OK. Jason
Re: [RFC] Using function clones for Pointer Bounds Checker
2014/1/14 Richard Biener : > On Tue, Jan 14, 2014 at 1:47 PM, Ilya Enkovich wrote: >> 2014/1/14 Richard Biener : >>> On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich >>> wrote: Hi, I've been working for some time on the prototype of the Pointer Bounds Checker which uses function clones for instrumentation (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After several experiments with this approach I want to share my results and ask for some feedback to make a decision about the future steps. Firstly I want to recall the reasons for digging in this direction. In the original approach bounds of call arguments and input parameters are associated with arguments via special built-in calls. It creates implicit data flow that the compiler is not aware of, which confuses some optimizations, resulting in mis-optimization, and breaks bounds data flow. Thus optimizations have to be fixed to get better pointer protection. The clones approach does not use special built-in function calls to associate bounds with call arguments and input parameters. Each function which should be instrumented gets an additional version and only this special version will be instrumented. This new version gets additional bound arguments to express input bounds. When a function call is instrumented, it is redirected to the instrumented version and all bounds are passed as explicit call arguments. Thus we have explicit pointer bounds flow similar to regular function parameters. It should make it possible to avoid changes in optimizations, avoid mis-optimizations, and allow existing IPA optimizations to work with bound args (e.g. propagate constant bounds value and remove checks in called function). I made a prototype implementation of this approach in the following way: - Add new IPA pass before early local passes to produce versions for all functions to be instrumented. - Put instrumentation pass after SSA pass. - Add new pass after IPA passes to remove bodies of functions which have instrumented versions.
Function nodes may still be required for calls in non-instrumented code. But we do not emit this code and therefore function bodies are not needed. Positive changes are: - IPA optimizations are not confused by bound parameters - bounds are now more like regular arguments; it makes their processing in expand easier - functions with bounds not attached to any pointer are allowed >>> >>> First of all thanks for trying to work in this direction. Comments on the >>> issues you encountered below (also CCed Honza as he should be more >>> familiar with reachability and clone issues). >>> On simple codes this approach worked well but on bigger tests some issues were revealed. 1. Nodes reachability. Instrumented version is actually always reachable when original function is reachable because it is always emitted instead of the original. Thus I had to fix reachability analysis to achieve it. Another similar problem is checking whether a node can be removed after inline when inlining an instrumented function. Not hard to fix but probably other similar problems exist. >>> >>> I suppose you do not update the callgraph / the call stmts when >>> creating the clones? Btw, is it desirable to inline the uninstrumented >>> function and then instrument the result (thus run cloning and >>> instrumentation after early inlining?)? Especially handling always_inlines >>> before cloning/instrumentation looks very sensible. >> >> Right. Created clones have the same code as the original function and >> therefore same cgraph edges. I suppose instrumentation after early >> inlining is OK and may be preferred because inlining shouldn't lead to >> any losses of bounds information. I tried a variant where instrumentation >> works right after early inlining but with cloning still before early >> local passes. In general it looked OK. >> >>> 2. Function processing order. Function processing order is determined before early local passes.
But during function instrumentation the call graph is modified significantly and the topological order used becomes outdated. That causes some trouble. E.g. a function marked as 'always inline' cannot be inlined because it is not in SSA form yet. Surely the inlining problem may be solved by just putting instrumentation after early inlining, but similar problems may exist in other passes too. To resolve this problem I tried to split early local passes into three parts. The first one builds SSA, the second one performs instrumentation, the last one does the rest. Each part is performed on all functions before the next one starts. Thus I get all functions in SSA form and all instrumentation performed before starting early optimizations. Unfortunately such passes ord
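In C terms, the cloning transformation described in this thread amounts to something like the following sketch (the `_chkp` naming and the exact check are illustrative, not the checker's real ABI):

```c
#include <assert.h>
#include <stdlib.h>

/* Original, uninstrumented function.  */
static int
first (int *p)
{
  return p[0];
}

/* Hypothetical instrumented clone: the bounds arrive as ordinary extra
   parameters, so they are visible to IPA passes (constant propagation,
   inlining, ...) just like any other argument -- the key point of the
   clones approach, as opposed to associating bounds via builtins.  */
static int
first_chkp (int *p, int *lb, int *ub)
{
  /* Check that the access p[0] lies within [lb, ub].  */
  if (p < lb || p > ub)
    abort ();
  return p[0];
}
```

Instrumented callers would then be redirected from `first` to `first_chkp`, with the bounds passed explicitly, e.g. `first_chkp (a, a, a + 3)` for `int a[4]`.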
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On 14/01/14 14:25, Richard Biener wrote: > On Tue, Jan 14, 2014 at 3:21 PM, Kyrill Tkachov > wrote: >> Moving to gcc, I accidentally sent it to gcc-patches previously... >> >> >> On 14/01/14 14:09, Richard Biener wrote: >>> >>> On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov >>> wrote: Hi all, I'm looking into PR 54168 where we end up generating an unnecessary extend operation on arm. Given code: struct b2Body { unsigned short flags; int type; }; static _Bool IsAwake(struct b2Body *b) { return (b->flags & 2) == 2; } int foo(struct b2Body *bA, struct b2Body *bB) { int typeA = bA->type; int typeB = bB->type; _Bool activeA = IsAwake(bA) && typeA != 0; _Bool activeB = IsAwake(bB) && typeB != 0; if (!activeA && !activeB) { return 1; } return 0; } Compiled for arm-none-eabi with -O3 -march=armv7-a The inlined generated code for IsAwake contains the fragment: ldrh r0, [r1] and r0, r0, #2 uxth r0, r0 cmp r0, #0 which contains a redundant extend, which also confuses combine and prevents the whole thing from being optimised into an ldrh ; ands sequence. Looking at the tree dumps I notice that after the forwprop pass the types of the operands in the _7 = _3 & 2; statement turn into short unsigned where before that pass they were just ints: IsAwake (struct b2Body * b) { short unsigned int _3; int _4; _Bool _6; short unsigned int _7; : _3 = b_2(D)->flags; _4 = (int) _3; _7 = _3 & 2; _6 = _7 != 0; return _6; } I believe the C standard expects the operation to be performed in int mode. Now, since this is a bitwise and operation with a known constant 2, the operation can be safely performed in unsigned short mode. However on word-based machines like arm this would introduce unnecessary extend operations down the line, as I believe is the case here. >>> >>> Though the variant using shorts is clearly shorter (in number of stmts) >>> and thus easier to optimize. Am I correct that the issue in the end >>> is that _7 != 0 requires to extend _7?
& 2 is trivially performed without >>> any extension, no? >> >> >> If I look at the dump before forwprop, the number of statements is exactly >> the same, so it's not any shorter in that sense. > > Well, it is - _4 = (int) _3; is dead, thus a zero-extension instruction > is removed. > That's a rather short-sighted definition of removed. You've removed an extension that: a) can be merged with the preceding load b) will have to be put back again anyway, when the _4 & 2 is expanded. And finally, you end up with a second one when _5 != 0 is later expanded. R. >> >> >> IsAwake (struct b2Body * b) >> { >> short unsigned int _3; >> int _4; >> int _5; >> _Bool _6; >> >> >> : >> _3 = b_2(D)->flags; >> _4 = (int) _3; >> _5 = _4 & 2; >> _6 = _5 != 0; >> return _6; >> >> } >> >> Using shorts is not cheaper on an architecture like arm which is word-based. >> Just the fact that we're introducing shorts already implies we're going to >> have to extend somewhere. > > But given the bit-and with '2' the extension can be removed, no? > > Richard. > >> Kyrill >> >> >>> >>> Richard. >>> Anyone have any insight on how to resolve this one? Thanks, Kyrill >> >> >
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 02:21:52PM +, Kyrill Tkachov wrote: > If I look at the dump before forwprop, the number of statements is > exactly the same, so it's not any shorter in that sense. > > > IsAwake (struct b2Body * b) > { > short unsigned int _3; > int _4; > int _5; > _Bool _6; > > : > _3 = b_2(D)->flags; > _4 = (int) _3; > _5 = _4 & 2; > _6 = _5 != 0; > return _6; > > } > > Using shorts is not cheaper on an architecture like arm which is > word-based. Just the fact that we're introducing shorts already > implies we're going to have to extend somewhere. As discussed multiple times in the various type demotion/promotion threads, for GIMPLE optimizations it is desirable to use as narrow types as possible, because that e.g. means greater possibilities of finding redundancies, optimizing away useless computations in the upper bits, canonicalize computations done in different types etc. Already the FEs right now narrow types, but only within the same folded expressions (get_narrower etc.), of course we want to move that away from FEs if possible and do it during (early?) GIMPLE passes. Generally, targets should be able to expand good code for the narrower types, there is no reason why they can't, but perhaps their tasks would be easier if we did some target hooks/optabs or something driven type promotion. Anyway, the above is really a simple case, and I'd call it a backend bug if it isn't able to generate good code out of that. Jakub
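Jakub's point that the narrowing is semantically safe (the dispute is only about target codegen) can be checked directly: for a bitwise AND with a constant that fits the narrow type, doing the operation in int (as C's promotion rules specify) and in unsigned short (as forwprop rewrites it) yields the same truth value. A small sketch using the IsAwake example from this thread:

```c
/* The int-typed form the front end produces (after integer promotion).  */
static int
is_awake_int (unsigned short flags)
{
  return ((int) flags & 2) != 0;
}

/* The narrowed form after forwprop: the AND is done in unsigned short.
   Since the mask constant 2 fits in the narrow type, no information that
   affects the comparison can be lost.  */
static int
is_awake_short (unsigned short flags)
{
  return (unsigned short) (flags & 2) != 0;
}
```

Both functions compute the same predicate for every input; the thread's question is only whether the narrow form costs an extra `uxth` on word-based targets like ARM.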
Re: Drop REG_CROSSING_JUMP when converting to a conditional return
On 01/14/14 03:37, Richard Sandiford wrote: While experimenting with a patch to use conditional returns, I hit a case where a conditional jump between hot and cold sections was being converted into a conditional return. The new jump still had the REG_CROSSING_JUMP and later passes were confused by the combination of that and a return JUMP_LABEL. The jump is now returning directly from the current section rather than crossing sections within the function, so I think we should just drop the note. Tested on s390x-linux-gnu with some local patches. OK to install? Thanks, Richard gcc/ * jump.c (redirect_jump_2): Remove REG_CROSSING_JUMP notes when converting a conditional jump into a conditional return. OK. jeff
Re: [PATCH] Fix PR59802, LCM compile-time slowness
On 01/14/14 06:42, Richard Biener wrote: This fixes the slowness seen in LCM compute_available accounted to RTL cprop. Currently the dataflow problem uses a "random" basic-block order to seed the initial worklist (it wants to visit predecessors before successors) - the following patch makes it use inverted postorder (similar to tree PRE antic computation). This reduces the compile-time for the testcase in PR59802 at -O3 from

CPROP : 54.53 (55%) usr 0.04 ( 6%) sys 54.57 (55%) wall kB ( 2%) ggc
PRE : 4.47 ( 5%) usr 0.03 ( 5%) sys 4.48 ( 5%) wall 1833 kB ( 1%) ggc
TOTAL : 98.51 0.63 99.13 220195 kB

to

CPROP : 2.13 ( 5%) usr 0.06 (10%) sys 2.20 ( 5%) wall kB ( 2%) ggc
PRE : 0.52 ( 1%) usr 0.02 ( 3%) sys 0.54 ( 1%) wall 1833 kB ( 1%) ggc
TOTAL : 42.22 0.60 42.81 220195 kB

which is nice. I checked for other compile-time hog PRs with CPROP but didn't find one, have yet to check for PRE (three-letter, likely too much noise). Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk (and branch?) Thanks, Richard. 2014-01-14 Richard Biener PR rtl-optimization/59802 * lcm.c (compute_available): Use inverted postorder to seed the initial worklist. Looks good. Thanks for taking care of this. Jeff
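The effect of the seeding order can be illustrated on a toy version of the same problem: a one-bit availability computation over a straight-line CFG with optimistic initialization, as in compute_available. Seeding the FIFO worklist predecessors-first (which is what inverted postorder yields for this chain) converges in a single sweep; the opposite order roughly doubles the pops, and on real CFGs with large bitmaps that difference is what the timings above reflect. This is only a sketch, not GCC's code:

```c
#include <assert.h>

#define N 8

/* One-bit "available" dataflow over the chain CFG 0 -> 1 -> ... -> N-1.
   Block 0 kills the expression; avout is initialized optimistically to 1,
   so blocks must be revisited until a fixpoint is reached.  SEED_ORDER
   must be a permutation of 0..N-1.  Returns the number of worklist pops
   needed to reach the fixpoint.  */
static int
solve (const int *seed_order)
{
  int avout[N], in_list[N];
  int queue[4 * N], qhead = 0, qtail = 0;   /* FIFO, generously sized */
  int pops = 0;

  for (int i = 0; i < N; i++)
    {
      avout[i] = 1;                     /* optimistic initialization */
      queue[qtail++] = seed_order[i];   /* seed every block */
      in_list[seed_order[i]] = 1;
    }

  while (qhead < qtail)
    {
      int b = queue[qhead++];
      in_list[b] = 0;
      pops++;
      int navout = (b == 0) ? 0 : avout[b - 1];   /* transfer function */
      if (navout != avout[b])
        {
          avout[b] = navout;
          int succ = b + 1;             /* re-queue the changed successor */
          if (succ < N && !in_list[succ])
            {
              queue[qtail++] = succ;
              in_list[succ] = 1;
            }
        }
    }
  for (int i = 0; i < N; i++)
    assert (avout[i] == 0);             /* fixpoint: nothing available */
  return pops;
}
```

Seeding 0,1,...,N-1 pops each block exactly once; seeding N-1,...,0 wastes the whole first sweep because every block reads a still-optimistic predecessor.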
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 3:21 PM, Kyrill Tkachov wrote: > Moving to gcc, I accidentally sent it to gcc-patches previously... > > > On 14/01/14 14:09, Richard Biener wrote: >> >> On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov >> wrote: >>> >>> Hi all, >>> >>> I'm looking into PR 54168 where we end up generating an unnecessary >>> extend >>> operation on arm. >>> >>> Given code: >>> >>> struct b2Body { >>> unsigned short flags; >>> int type; >>> }; >>> >>> static _Bool IsAwake(struct b2Body *b) >>> { >>> return (b->flags & 2) == 2; >>> } >>> >>> >>> int foo(struct b2Body *bA, struct b2Body *bB) >>> { >>> int typeA = bA->type; >>> int typeB = bB->type; >>> _Bool activeA = IsAwake(bA) && typeA != 0; >>> _Bool activeB = IsAwake(bB) && typeB != 0; >>> >>> if (!activeA && !activeB) >>> { >>> return 1; >>> } >>> >>> return 0; >>> } >>> >>> Compiled for arm-none-eabi with -O3 -march=armv7-a >>> >>> The inlined generated code for IsAwake contains the fragment: >>> >>> ldrh r0, [r1] >>> and r0, r0, #2 >>> uxth r0, r0 >>> cmp r0, #0 >>> >>> which contains a redundant extend, which also confuses combine and >>> prevents >>> the whole thing from being optimised into an ldrh ; ands sequence. >>> >>> Looking at the tree dumps I notice that after the forwprop pass the types >>> of >>> the operands in the _7 = _3 & 2; statement turn into short unsigned where >>> before that pass they were just ints: >>> >>> IsAwake (struct b2Body * b) >>> { >>> short unsigned int _3; >>> int _4; >>> _Bool _6; >>> short unsigned int _7; >>> >>> : >>> _3 = b_2(D)->flags; >>> _4 = (int) _3; >>> _7 = _3 & 2; >>> _6 = _7 != 0; >>> return _6; >>> >>> } >>> >>> >>> I believe the C standard expects the operation to be performed in int >>> mode. >>> Now, since this is a bitwise and operation with a known constant 2, the >>> operation can be safely performed in unsigned short mode.
However on >>> word-based machines like arm this would introduce unnecessary extend >>> operations down the line, as I believe is the case here. >> >> Though the variant using shorts is clearly shorter (in number of stmts) >> and thus easier to optimize. Am I correct that the issue in the end >> is that _7 != 0 requires to extend _7? & 2 is trivially performed without >> any extension, no? > > > If I look at the dump before forwprop, the number of statements is exactly > the same, so it's not any shorter in that sense. Well, it is - _4 = (int) _3; is dead, thus a zero-extension instruction is removed. > > > IsAwake (struct b2Body * b) > { > short unsigned int _3; > int _4; > int _5; > _Bool _6; > > > : > _3 = b_2(D)->flags; > _4 = (int) _3; > _5 = _4 & 2; > _6 = _5 != 0; > return _6; > > } > > Using shorts is not cheaper on an architecture like arm which is word-based. > Just the fact that we're introducing shorts already implies we're going to > have to extend somewhere. But given the bit-and with '2' the extension can be removed, no? Richard. > Kyrill > > >> >> Richard. >> >>> Anyone have any insight on how to resolve this one? >>> >>> Thanks, >>> Kyrill >>> > >
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
Moving to gcc, I accidentally sent it to gcc-patches previously... On 14/01/14 14:09, Richard Biener wrote: On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov wrote: Hi all, I'm looking into PR 54168 where we end up generating an unnecessary extend operation on arm. Given code: struct b2Body { unsigned short flags; int type; }; static _Bool IsAwake(struct b2Body *b) { return (b->flags & 2) == 2; } int foo(struct b2Body *bA, struct b2Body *bB) { int typeA = bA->type; int typeB = bB->type; _Bool activeA = IsAwake(bA) && typeA != 0; _Bool activeB = IsAwake(bB) && typeB != 0; if (!activeA && !activeB) { return 1; } return 0; } Compiled for arm-none-eabi with -O3 -march=armv7-a The inlined generated code for IsAwake contains the fragment: ldrh r0, [r1] and r0, r0, #2 uxth r0, r0 cmp r0, #0 which contains a redundant extend, which also confuses combine and prevents the whole thing from being optimised into an ldrh ; ands sequence. Looking at the tree dumps I notice that after the forwprop pass the types of the operands in the _7 = _3 & 2; statement turn into short unsigned where before that pass they were just ints: IsAwake (struct b2Body * b) { short unsigned int _3; int _4; _Bool _6; short unsigned int _7; : _3 = b_2(D)->flags; _4 = (int) _3; _7 = _3 & 2; _6 = _7 != 0; return _6; } I believe the C standard expects the operation to be performed in int mode. Now, since this is a bitwise and operation with a known constant 2, the operation can be safely performed in unsigned short mode. However on word-based machines like arm this would introduce unnecessary extend operations down the line, as I believe is the case here. Though the variant using shorts is clearly shorter (in number of stmts) and thus easier to optimize. Am I correct that the issue in the end is that _7 != 0 requires to extend _7? & 2 is trivially performed without any extension, no? If I look at the dump before forwprop, the number of statements is exactly the same, so it's not any shorter in that sense.
IsAwake (struct b2Body * b) { short unsigned int _3; int _4; int _5; _Bool _6; : _3 = b_2(D)->flags; _4 = (int) _3; _5 = _4 & 2; _6 = _5 != 0; return _6; } Using shorts is not cheaper on an architecture like arm which is word-based. Just the fact that we're introducing shorts already implies we're going to have to extend somewhere. Kyrill Richard. Anyone have any insight on how to resolve this one? Thanks, Kyrill
PATCH: PR target/59794: [4.7/4.8/4.9 Regression] i386 backend fails to detect MMX/SSE/AVX ABI changes
Hi, There are several problems with i386 MMX/SSE/AVX ABI change detection: 1. MMX/SSE return value isn't checked for -m32 since revision 83533: http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=83533 which added ix86_struct_value_rtx. Since the MMX/SSE condition is always false, the MMX/SSE return value ABI change is disabled. 2. For -m32, the same warning on MMX/SSE argument is issued twice, one from type_natural_mode and one from function_arg_32. 3. AVX return value ABI change isn't checked. This patch does the following: 1. Remove the ineffective ix86_struct_value_rtx. 2. Add a bool parameter to indicate if type is used for function return value. Warn ABI change if the vector mode isn't available for function return value. Add AVX function return value ABI change warning. 3. Consolidate ABI change warning into type_natural_mode. 4. Update g++.dg/ext/vector23.C to prune ABI change for Linux/x86 added by the AVX function return value ABI change warning. 5. Update gcc.target/i386/pr39162.c to avoid the AVX function return value ABI change warning. 6. Add testcases for warning MMX/SSE/AVX ABI changes in parameter passing and function return. Tested on Linux/x86-64 with -m32/-m64 for "make check". OK to install? Thanks. H.J. --- gcc/ 2014-01-14 H.J. Lu PR target/59794 * config/i386/i386.c (type_natural_mode): Add a bool parameter to indicate if type is used for function return value. Warn ABI change if the vector mode isn't available for function return value. (ix86_function_arg_advance): Pass false to type_natural_mode. (ix86_function_arg): Likewise. (ix86_gimplify_va_arg): Likewise. (function_arg_32): Don't warn ABI change. (ix86_function_value): Pass true to type_natural_mode. (ix86_return_in_memory): Likewise. (ix86_struct_value_rtx): Removed. (TARGET_STRUCT_VALUE_RTX): Likewise. gcc/testsuite/ 2014-01-14 H.J. Lu PR target/59794 * g++.dg/ext/vector23.C: Also prune ABI change for Linux/x86. * gcc.target/i386/pr39162.c (y): New __m256i variable.
(bar): Change return type to void. Set y to x. * gcc.target/i386/pr59794-1.c: New testcase. * gcc.target/i386/pr59794-2.c: Likewise. * gcc.target/i386/pr59794-3.c: Likewise. * gcc.target/i386/pr59794-4.c: Likewise. * gcc.target/i386/pr59794-5.c: Likewise. * gcc.target/i386/pr59794-6.c: Likewise. * gcc.target/i386/pr59794-7.c: Likewise.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ad48fc8..70181c3 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -6104,10 +6104,14 @@ init_cumulative_args (CUMULATIVE_ARGS *cum, /* Argument info to initialize */ The midde-end can't deal with the vector types > 16 bytes. In this case, we return the original mode and warn ABI change if CUM isn't - NULL. */ + NULL. + + If IN_RETURN is true, warn ABI change if the vector mode isn't + available for function return value. */ static enum machine_mode -type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) +type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum, + bool in_return) { enum machine_mode mode = TYPE_MODE (type); @@ -6133,6 +6137,7 @@ type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) if (size == 32 && !TARGET_AVX) { static bool warnedavx; + static bool warnedavx_ret; if (cum && !warnedavx @@ -6142,12 +6147,20 @@ type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) warning (0, "AVX vector argument without AVX " "enabled changes the ABI"); } + else if (in_return & !warnedavx_ret) + { + warnedavx_ret = true; + warning (0, "AVX vector return without AVX " +"enabled changes the ABI"); + } + return TYPE_MODE (type); } else if (((size == 8 && TARGET_64BIT) || size == 16) && !TARGET_SSE) { static bool warnedsse; + static bool warnedsse_ret; if (cum && !warnedsse @@ -6157,10 +6170,19 @@ type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) warning (0, "SSE vector argument without SSE " "enabled changes the ABI"); } + else if (!TARGET_64BIT +&& in_return +& !warnedsse_ret) + { + warnedsse_ret = true;
+ warning (0, "SSE ve
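A reduced example of the case the patch starts diagnosing (a sketch in the spirit of the new pr59794-* testcases, not one of them): a function returning a 16-byte vector. With the patched compiler at -m32 without -msse, returning such a value should now trigger "SSE vector return without SSE enabled changes the ABI"; built normally (64-bit, SSE available) it simply runs.

```c
/* GCC/Clang generic vector type; 16 bytes, so it would be passed and
   returned in SSE registers when SSE is enabled.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Returning a 16-byte vector by value: with -m32 and no -msse this is
   exactly the return-value case the patched type_natural_mode warns
   about, which previously went undetected.  */
static v4si
make_vec (int a)
{
  return (v4si) { a, a + 1, a + 2, a + 3 };
}
```

The argument-passing direction was already warned about; the point of the patch is that the return direction now gets the same treatment.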
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov wrote: > Hi all, > > I'm looking into PR 54168 where we end up generating an unnecessary extend > operation on arm. > > Given code: > > struct b2Body { > unsigned short flags; > int type; > }; > > static _Bool IsAwake(struct b2Body *b) > { > return (b->flags & 2) == 2; > } > > > int foo(struct b2Body *bA, struct b2Body *bB) > { > int typeA = bA->type; > int typeB = bB->type; > _Bool activeA = IsAwake(bA) && typeA != 0; > _Bool activeB = IsAwake(bB) && typeB != 0; > > if (!activeA && !activeB) > { > return 1; > } > > return 0; > } > > Compiled for arm-none-eabi with -O3 -march=armv7-a > > The inlined generated code for IsAwake contains the fragment: > > ldrh r0, [r1] > and r0, r0, #2 > uxth r0, r0 > cmp r0, #0 > > which contains a redundant extend, which also confuses combine and prevents > the whole thing from being optimised into an ldrh ; ands sequence. > > Looking at the tree dumps I notice that after the forwprop pass the types of > the operands in the _7 = _3 & 2; statement turn into short unsigned where > before that pass they were just ints: > > IsAwake (struct b2Body * b) > { > short unsigned int _3; > int _4; > _Bool _6; > short unsigned int _7; > > : > _3 = b_2(D)->flags; > _4 = (int) _3; > _7 = _3 & 2; > _6 = _7 != 0; > return _6; > > } > > > I believe the C standard expects the operation to be performed in int mode. > Now, since this is a bitwise and operation with a known constant 2, the > operation can be safely performed in unsigned short mode. However on > word-based machines like arm this would introduce unnecessary extend > operations down the line, as I believe is the case here. Though the variant using shorts is clearly shorter (in number of stmts) and thus easier to optimize. Am I correct that the issue in the end is that _7 != 0 requires to extend _7? & 2 is trivially performed without any extension, no? Richard. > Anyone have any insight on how to resolve this one? > > Thanks, > Kyrill >
PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
Hi all, I'm looking into PR 54168 where we end up generating an unnecessary extend operation on arm. Given code: struct b2Body { unsigned short flags; int type; }; static _Bool IsAwake(struct b2Body *b) { return (b->flags & 2) == 2; } int foo(struct b2Body *bA, struct b2Body *bB) { int typeA = bA->type; int typeB = bB->type; _Bool activeA = IsAwake(bA) && typeA != 0; _Bool activeB = IsAwake(bB) && typeB != 0; if (!activeA && !activeB) { return 1; } return 0; } Compiled for arm-none-eabi with -O3 -march=armv7-a The inlined generated code for IsAwake contains the fragment: ldrh r0, [r1] and r0, r0, #2 uxth r0, r0 cmp r0, #0 which contains a redundant extend, which also confuses combine and prevents the whole thing from being optimised into an ldrh ; ands sequence. Looking at the tree dumps I notice that after the forwprop pass the types of the operands in the _7 = _3 & 2; statement turn into short unsigned where before that pass they were just ints: IsAwake (struct b2Body * b) { short unsigned int _3; int _4; _Bool _6; short unsigned int _7; : _3 = b_2(D)->flags; _4 = (int) _3; _7 = _3 & 2; _6 = _7 != 0; return _6; } I believe the C standard expects the operation to be performed in int mode. Now, since this is a bitwise and operation with a known constant 2, the operation can be safely performed in unsigned short mode. However on word-based machines like arm this would introduce unnecessary extend operations down the line, as I believe is the case here. Anyone have any insight on how to resolve this one? Thanks, Kyrill
Re: [RFC] Using function clones for Pointer Bounds Checker
On Tue, Jan 14, 2014 at 1:47 PM, Ilya Enkovich wrote: > 2014/1/14 Richard Biener : >> On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich >> wrote: >>> Hi, >>> >>> I've been working for some time on the prototype of the Pointer Bounds >>> Checker which uses function clones for instrumentation >>> (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After >>> several experiments with this approach I want to share my results and >>> ask for some feedback to make a decision about the future steps. >>> >>> Firstly I want to recall the reasons for digging in this direction. In >>> the original approach bounds of call arguments and input parameters >>> are associated with arguments via special built-in calls. It creates >>> implicit data flow that the compiler is not aware of, which confuses some >>> optimizations, resulting in mis-optimization, and breaks bounds data >>> flow. Thus optimizations have to be fixed to get better pointer >>> protection. >>> >>> The clones approach does not use special built-in function calls to >>> associate bounds with call arguments and input parameters. Each >>> function which should be instrumented gets an additional version and >>> only this special version will be instrumented. This new version gets >>> additional bound arguments to express input bounds. When a function call >>> is instrumented, it is redirected to the instrumented version and all >>> bounds are passed as explicit call arguments. Thus we have explicit >>> pointer bounds flow similar to regular function parameters. It should >>> make it possible to avoid changes in optimizations, avoid mis-optimizations, >>> and allow existing IPA optimizations to work with bound args (e.g. >>> propagate constant bounds value and remove checks in called function). >>> >>> I made a prototype implementation of this approach in the following way: >>> >>> - Add new IPA pass before early local passes to produce versions for >>> all functions to be instrumented. >>> - Put instrumentation pass after SSA pass.
>>> - Add new pass after IPA passes to remove bodies of functions which >>> have instrumented versions. Function nodes may still be required for >>> calls in non-instrumented code. But we do not emit this code and >>> therefore function bodies are not needed. >>> >>> Positive changes are: >>> >>> - IPA optimizations are not confused by bound parameters >>> - bounds are now more like regular arguments; it makes their >>> processing in expand easier >>> - functions with bounds not attached to any pointer are allowed >> >> First of all thanks for trying to work in this direction. Comments on the >> issues you encountered below (also CCed Honza as he should be more >> familiar with reachability and clone issues). >> >>> On simple codes this approach worked well but on bigger tests some >>> issues were revealed. >>> >>> 1. Nodes reachability. Instrumented version is actually always >>> reachable when original function is reachable because it is always >>> emitted instead of the original. Thus I had to fix reachability >>> analysis to achieve it. Another similar problem is checking whether a node >>> can be removed after inline when inlining an instrumented function. Not >>> hard to fix but probably other similar problems exist. >> >> I suppose you do not update the callgraph / the call stmts when >> creating the clones? Btw, is it desirable to inline the uninstrumented >> function and then instrument the result (thus run cloning and >> instrumentation after early inlining?)? Especially handling always_inlines >> before cloning/instrumentation looks very sensible. > > Right. Created clones have the same code as the original function and > therefore same cgraph edges. I suppose instrumentation after early > inlining is OK and may be preferred because inlining shouldn't lead to > any losses of bounds information. I tried a variant where instrumentation > works right after early inlining but with cloning still before early > local passes. In general it looked OK. > >> >>> 2.
Function processing order. The function processing order is determined >>> before the early local passes. But during function instrumentation the >>> call graph is modified significantly and the topological order used >>> becomes outdated. That causes some trouble. E.g. a function marked as >>> 'always inline' cannot be inlined because it is not in SSA form yet. >>> Surely the inlining problem may be solved by just putting >>> instrumentation after early inlining, but similar problems may exist >>> in other passes too. To resolve this problem I tried to split the >>> early local passes into three parts. The first one builds SSA, the >>> second one performs instrumentation, the last one does the rest. Each >>> part is performed on all functions before the next one starts. Thus I >>> get all functions in SSA form and all instrumentation performed before >>> starting the early optimizations. Unfortunately such a pass order >>> leads to invalid SSA because of the local_pure_const optimization >>> affecting callers' correctness (in case the caller's SSA was built >>> before the optimization revealed the 'pure' or 'const' flag).
[PATCH] Fix PR59802, LCM compile-time slowness
This fixes the slowness seen in LCM compute_available accounted to RTL cprop. Currently the dataflow problem uses a "random" basic-block order to seed the initial worklist (it wants to visit predecessors before successors) - the following patch makes it use inverted postorder (similar to tree PRE antic computation).

This reduces the compile-time for the testcase in PR59802 at -O3 from

 CPROP   : 54.53 (55%) usr  0.04 ( 6%) sys 54.57 (55%) wall      kB ( 2%) ggc
 PRE     :  4.47 ( 5%) usr  0.03 ( 5%) sys  4.48 ( 5%) wall 1833 kB ( 1%) ggc
 TOTAL   : 98.51          0.63          99.13             220195 kB

to

 CPROP   :  2.13 ( 5%) usr  0.06 (10%) sys  2.20 ( 5%) wall      kB ( 2%) ggc
 PRE     :  0.52 ( 1%) usr  0.02 ( 3%) sys  0.54 ( 1%) wall 1833 kB ( 1%) ggc
 TOTAL   : 42.22          0.60          42.81             220195 kB

which is nice. I checked for other compile-time hog PRs with CPROP but didn't find one, have yet to check for PRE (three-letter, likely too much noise).

Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk (and branch?)

Thanks,
Richard.

2014-01-14  Richard Biener

	PR rtl-optimization/59802
	* lcm.c (compute_available): Use inverted postorder to seed
	the initial worklist.

Index: gcc/lcm.c
===================================================================
*** gcc/lcm.c	(revision 206599)
--- gcc/lcm.c	(working copy)
*** compute_available (sbitmap *avloc, sbitm
*** 496,507 ****
    bitmap_vector_ones (avout, last_basic_block_for_fn (cfun));
  
    /* Put every block on the worklist; this is necessary because of the
!      optimistic initialization of AVOUT above.  */
!   FOR_EACH_BB_FN (bb, cfun)
      {
        *qin++ = bb;
        bb->aux = bb;
      }
  
    qin = worklist;
    qend = &worklist[n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS];
--- 496,515 ----
    bitmap_vector_ones (avout, last_basic_block_for_fn (cfun));
  
    /* Put every block on the worklist; this is necessary because of the
!      optimistic initialization of AVOUT above.  Use inverted postorder
!      to make the dataflow problem require less iterations.  */
!   int *postorder = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
!   int postorder_num = inverted_post_order_compute (postorder);
!   for (int i = 0; i < postorder_num; ++i)
      {
+       bb = BASIC_BLOCK_FOR_FN (cfun, postorder[i]);
+       if (bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
+ 	  || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+ 	continue;
        *qin++ = bb;
        bb->aux = bb;
      }
+   free (postorder);
  
    qin = worklist;
    qend = &worklist[n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS];
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Tue, 14 Jan 2014, Jakub Jelinek wrote: > On Tue, Jan 14, 2014 at 10:01:06AM +0100, Richard Biener wrote: > > Jakub, adding the new flag is ok with me. > > So like this? Ok if it passes testing. Thanks, Richard. > 2014-01-14 Jakub Jelinek > > * tree-vectorizer.h (struct _loop_vec_info): Add no_data_dependencies > field. > (LOOP_VINFO_NO_DATA_DEPENDENCIES): Define. > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): Clear it > when not giving up or versioning for alias only because of > loop->safelen. > (vect_analyze_data_ref_dependences): Set to true. > * tree-vect-stmts.c (vectorizable_load): Use > LOOP_VINFO_NO_DATA_DEPENDENCIES instead of > LOOP_REQUIRES_VERSIONING_FOR_ALIAS. > > --- gcc/tree-vectorizer.h.jj 2014-01-03 11:40:57.0 +0100 > +++ gcc/tree-vectorizer.h 2014-01-14 13:10:00.477989924 +0100 > @@ -347,6 +347,25 @@ typedef struct _loop_vec_info { > fix it up. */ >bool operands_swapped; > > + /* True if there are no loop carried data dependencies in the loop. > + If loop->safelen <= 1, then this is always true, either the loop > + didn't have any loop carried data dependencies, or the loop is being > + vectorized guarded with some runtime alias checks, or couldn't > + be vectorized at all, but then this field shouldn't be used. > + For loop->safelen >= 2, the user has asserted that there are no > + backward dependencies, but there still could be loop carried forward > + dependencies in such loops. This flag will be false if normal > + vectorizer data dependency analysis would fail or require versioning > + for alias, but because of loop->safelen >= 2 it has been vectorized > + even without versioning for alias. E.g. in: > + #pragma omp simd > + for (int i = 0; i < m; i++) > + a[i] = a[i + k] * c; > + (or #pragma simd or #pragma ivdep) we can vectorize this and it will > + DTRT even for k > 0 && k < m, but without safelen we would not > + vectorize this, so this field would be false. 
*/ > + bool no_data_dependencies; > + >/* If if-conversion versioned this loop before conversion, this is the > loop version without if-conversion. */ >struct loop *scalar_loop; > @@ -385,6 +404,7 @@ typedef struct _loop_vec_info { > #define LOOP_VINFO_PEELING_FOR_GAPS(L) (L)->peeling_for_gaps > #define LOOP_VINFO_OPERANDS_SWAPPED(L) (L)->operands_swapped > #define LOOP_VINFO_PEELING_FOR_NITER(L)(L)->peeling_for_niter > +#define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies > #define LOOP_VINFO_SCALAR_LOOP(L) (L)->scalar_loop > > #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \ > --- gcc/tree-vect-data-refs.c.jj 2014-01-10 00:38:26.0 +0100 > +++ gcc/tree-vect-data-refs.c 2014-01-14 13:12:06.056342116 +0100 > @@ -244,6 +244,7 @@ vect_analyze_data_ref_dependence (struct > { > if (loop->safelen < *max_vf) > *max_vf = loop->safelen; > + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; > return false; > } > > @@ -291,6 +292,7 @@ vect_analyze_data_ref_dependence (struct > { > if (loop->safelen < *max_vf) > *max_vf = loop->safelen; > + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; > return false; > } > > @@ -447,6 +449,7 @@ vect_analyze_data_ref_dependences (loop_ > dump_printf_loc (MSG_NOTE, vect_location, > "=== vect_analyze_data_ref_dependences ===\n"); > > + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = true; >if (!compute_all_dependences (LOOP_VINFO_DATAREFS (loop_vinfo), > &LOOP_VINFO_DDRS (loop_vinfo), > LOOP_VINFO_LOOP_NEST (loop_vinfo), true)) > --- gcc/tree-vect-stmts.c.jj 2014-01-14 10:33:21.0 +0100 > +++ gcc/tree-vect-stmts.c 2014-01-14 13:14:15.157677243 +0100 > @@ -6381,10 +6381,11 @@ vectorizable_load (gimple stmt, gimple_s > if (inv_p && !bb_vinfo) > { > gcc_assert (!grouped_load); > - /* If we have versioned for aliasing then we are sure > - this is a loop invariant load and thus we can insert > - it on the preheader edge. 
*/ > - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) > + /* If we have versioned for aliasing or the loop doesn't > + have any data dependencies that would preclude this, > + then we are sure this is a loop invariant load and > + thus we can insert it on the preheader edge. */ > + if (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)) > { > if (dump_enabled_p ()) > { > > > Jakub > > -- Richard Biener SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Tue, Jan 14, 2014 at 10:01:06AM +0100, Richard Biener wrote: > Jakub, adding the new flag is ok with me. So like this? 2014-01-14 Jakub Jelinek * tree-vectorizer.h (struct _loop_vec_info): Add no_data_dependencies field. (LOOP_VINFO_NO_DATA_DEPENDENCIES): Define. * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): Clear it when not giving up or versioning for alias only because of loop->safelen. (vect_analyze_data_ref_dependences): Set to true. * tree-vect-stmts.c (vectorizable_load): Use LOOP_VINFO_NO_DATA_DEPENDENCIES instead of LOOP_REQUIRES_VERSIONING_FOR_ALIAS. --- gcc/tree-vectorizer.h.jj2014-01-03 11:40:57.0 +0100 +++ gcc/tree-vectorizer.h 2014-01-14 13:10:00.477989924 +0100 @@ -347,6 +347,25 @@ typedef struct _loop_vec_info { fix it up. */ bool operands_swapped; + /* True if there are no loop carried data dependencies in the loop. + If loop->safelen <= 1, then this is always true, either the loop + didn't have any loop carried data dependencies, or the loop is being + vectorized guarded with some runtime alias checks, or couldn't + be vectorized at all, but then this field shouldn't be used. + For loop->safelen >= 2, the user has asserted that there are no + backward dependencies, but there still could be loop carried forward + dependencies in such loops. This flag will be false if normal + vectorizer data dependency analysis would fail or require versioning + for alias, but because of loop->safelen >= 2 it has been vectorized + even without versioning for alias. E.g. in: + #pragma omp simd + for (int i = 0; i < m; i++) + a[i] = a[i + k] * c; + (or #pragma simd or #pragma ivdep) we can vectorize this and it will + DTRT even for k > 0 && k < m, but without safelen we would not + vectorize this, so this field would be false. */ + bool no_data_dependencies; + /* If if-conversion versioned this loop before conversion, this is the loop version without if-conversion. 
*/ struct loop *scalar_loop; @@ -385,6 +404,7 @@ typedef struct _loop_vec_info { #define LOOP_VINFO_PEELING_FOR_GAPS(L) (L)->peeling_for_gaps #define LOOP_VINFO_OPERANDS_SWAPPED(L) (L)->operands_swapped #define LOOP_VINFO_PEELING_FOR_NITER(L)(L)->peeling_for_niter +#define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies #define LOOP_VINFO_SCALAR_LOOP(L) (L)->scalar_loop #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \ --- gcc/tree-vect-data-refs.c.jj2014-01-10 00:38:26.0 +0100 +++ gcc/tree-vect-data-refs.c 2014-01-14 13:12:06.056342116 +0100 @@ -244,6 +244,7 @@ vect_analyze_data_ref_dependence (struct { if (loop->safelen < *max_vf) *max_vf = loop->safelen; + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; return false; } @@ -291,6 +292,7 @@ vect_analyze_data_ref_dependence (struct { if (loop->safelen < *max_vf) *max_vf = loop->safelen; + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; return false; } @@ -447,6 +449,7 @@ vect_analyze_data_ref_dependences (loop_ dump_printf_loc (MSG_NOTE, vect_location, "=== vect_analyze_data_ref_dependences ===\n"); + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = true; if (!compute_all_dependences (LOOP_VINFO_DATAREFS (loop_vinfo), &LOOP_VINFO_DDRS (loop_vinfo), LOOP_VINFO_LOOP_NEST (loop_vinfo), true)) --- gcc/tree-vect-stmts.c.jj2014-01-14 10:33:21.0 +0100 +++ gcc/tree-vect-stmts.c 2014-01-14 13:14:15.157677243 +0100 @@ -6381,10 +6381,11 @@ vectorizable_load (gimple stmt, gimple_s if (inv_p && !bb_vinfo) { gcc_assert (!grouped_load); - /* If we have versioned for aliasing then we are sure -this is a loop invariant load and thus we can insert -it on the preheader edge. */ - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) + /* If we have versioned for aliasing or the loop doesn't +have any data dependencies that would preclude this, +then we are sure this is a loop invariant load and +thus we can insert it on the preheader edge. 
*/ + if (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)) { if (dump_enabled_p ()) { Jakub
Re: [RFC] Using function clones for Pointer Bounds Checker
2014/1/14 Richard Biener : > On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich > wrote: >> Hi, >> >> I've been working for some time on the prototype of the Pointer Bounds >> Checker which uses function clones for instrumentation >> (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After >> several experiments with this approach I want to share my results and >> ask for some feedback to make a decision about the future steps. >> >> Firstly I want to remind the reasons for digging in this direction. In >> the original approach bounds of call arguments and input parameters >> are associated with arguments via special built-in calls. It creates >> implicit data flow compiler is not aware about which confuses some >> optimizations resulting in miss-optimization and breaks bounds data >> flow. Thus optimizations have to be fixed to get better pointers >> protection. >> >> Clones approach does not use special built-in function calls to >> associate bounds with call arguments and input parameters. Each >> function which should be instrumented gets an additional version and >> only this special version will be instrumented.This new version gets >> additional bound arguments to express input bounds. When function call >> is instrumented, it is redirected to instrumented version and all >> bounds are passed as explicit call arguments. Thus we have explicit >> pointer bounds flow similar to regular function parameters. It should >> allow to avoid changes in optimization, avoid miss-optimizations, >> allow existing IPA optimizations to work with bound args (e.g. >> propagate constant bounds value and remove checks in called function). >> >> I made a prototype implementation of this approach in the following way: >> >> - Add new IPA pass before early local passes to produce versions for >> all functions to be instrumented. >> - Put instrumentation pass after SSA pass. >> - Add new pass after IPA passes to remove bodies of functions which >> have instrumented versions. 
Function nodes may still be required for >> calls in not instrumented code. But we do not emit this code and >> therefore function bodies are not needed. >> >> Positive changes are: >> >> - IPA optimizations are not confused by bound parameters >> - bounds are now more like regular arguments; it makes their >> processing in expand easier >> - functions with bounds not attached to any pointer are allowed > > First of all thanks for trying to work in this direction. Comments on the > issues you encountered below (also CCed Honza as he should be more > familiar with reachability and clone issues). > >> On simple codes this approach worked well but on a bigger tests some >> issues were revealed. >> >> 1. Nodes reachability. Instrumented version is actually always >> reachable when original function is reachable because it is always >> emitted instead of the original. Thus I had to fix reachability >> analysis to achieve it. Another similar problem is check whether node >> can be removed after inline when inlining instrumented function. Not >> hard to fix but probably other similar problems exist. > > I suppose you do not update the callgraph / the call stmts when > creating the clones? Btw, is it desirable to inline the uninstrumented > function and then instrument the result (thus run cloning and > instrumentation after early inlining?)? Especially handling always_inlines > before cloning/isntrumentation looks very sensible. Right. Created clones have the same code as the original function and therefore same cgraph edges. I suppose instrumentation after early inlining is OK and may be preferred because inline shouldn't lead to any losses of bounds information. I tried variant when instrumentation works right after early inlining but with cloning still before early local passes. In general it looked OK. > >> 2. Function processing order. Function processing order is determined >> before early local passes. 
But during function instrumentation call >> graph is modified significantly and used topological order becomes >> outdated. That causes some troubles. E.g. function marked as 'always >> inline' cannot be inlined because it is not in SSA form yet. Surely >> inlining problem may be solved by just putting instrumentation after >> early inline, but similar problem may exist in other passes too. To >> resolve this problem I tried to split early local passes into three >> parts. The first one builds SSA, the second one performs >> instrumentation, the last one does the rest. Each part is performed on >> all functions before the next one starts. Thus I get all functions in >> SSA form and all instrumentation performed before starting early >> optimizations. Unfortunately such passes order leads to invalid SSA >> because of local_pure_const optimization affecting callers correctness >> (in case caller SSA was built before optimization revealed 'pure' or >> 'const' flag). > > Generally the processing order of early_local_passes is chosen > to get bette
Re: [PATCH, AArch64] Use llfloor and llceil for vcvtmd_s64_f64 and vcvtpd_s64_f64 in arm_neon.h
On 6 January 2014 12:30, Yufeng Zhang wrote: > This patch fixes the implementation of vcvtmd_s64_f64 and vcvtpd_s64_f64 in > arm_neon.h to use llfloor and llceil instead, which are ILP32-friendly. > > This patch will fix the following test failure in the ILP32 mode: > > FAIL: gcc.target/aarch64/vect-vcvt.c scan-assembler fcvtms\\tx[0-9]+, > d[0-9]+ > > OK for the trunk? OK, but we should wait for stage-1 now. Thanks /Marcus
[PING] Re: [PATCH, AArch64] Use llfloor and llceil for vcvtmd_s64_f64 and vcvtpd_s64_f64 in arm_neon.h
Ping~ Originally posted here: http://gcc.gnu.org/ml/gcc-patches/2014-01/msg00185.html Thanks, Yufeng On 01/06/14 12:30, Yufeng Zhang wrote: This patch fixes the implementation of vcvtmd_s64_f64 and vcvtpd_s64_f64 in arm_neon.h to use llfloor and llceil instead, which are ILP32-friendly. This patch will fix the following test failure in the ILP32 mode: FAIL: gcc.target/aarch64/vect-vcvt.c scan-assembler fcvtms\\tx[0-9]+, d[0-9]+ OK for the trunk? Thanks, Yufeng gcc/ * config/aarch64/aarch64-builtins.c (aarch64_builtin_vectorized_function): Add BUILT_IN_LFLOORF, BUILT_IN_LLFLOOR, BUILT_IN_LCEILF and BUILT_IN_LLCEIL. * config/aarch64/arm_neon.h (vcvtaq_u64_f64): Call __builtin_llfloor instead of __builtin_lfloor. (vcvtnq_u64_f64): Call __builtin_llceil instead of __builtin_lceil.
Re: [RFC] Using function clones for Pointer Bounds Checker
On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich wrote:
> Hi,
>
> I've been working for some time on the prototype of the Pointer Bounds
> Checker which uses function clones for instrumentation
> (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After
> several experiments with this approach I want to share my results and
> ask for some feedback to make a decision about the future steps.
>
> Firstly I want to recall the reasons for digging in this direction. In
> the original approach, bounds of call arguments and input parameters
> are associated with arguments via special built-in calls. That creates
> implicit data flow the compiler is not aware of, which confuses some
> optimizations, resulting in mis-optimization, and breaks the bounds
> data flow. Thus optimizations have to be fixed to get better pointer
> protection.
>
> The clones approach does not use special built-in function calls to
> associate bounds with call arguments and input parameters. Each
> function which should be instrumented gets an additional version, and
> only this special version is instrumented. This new version gets
> additional bound arguments to express input bounds. When a function
> call is instrumented, it is redirected to the instrumented version and
> all bounds are passed as explicit call arguments. Thus we have explicit
> pointer bounds flow similar to regular function parameters. It should
> make it possible to avoid changes in optimizations, avoid
> mis-optimizations, and allow existing IPA optimizations to work with
> bound args (e.g. propagate constant bounds values and remove checks in
> the called function).
>
> I made a prototype implementation of this approach in the following way:
>
> - Add a new IPA pass before the early local passes to produce versions
> for all functions to be instrumented.
> - Put the instrumentation pass after the SSA pass.
> - Add a new pass after the IPA passes to remove bodies of functions
> which have instrumented versions. Function nodes may still be required
> for calls in non-instrumented code. But we do not emit this code and
> therefore the function bodies are not needed.
>
> Positive changes are:
>
> - IPA optimizations are not confused by bound parameters
> - bounds are now more like regular arguments; it makes their
> processing in expand easier
> - functions with bounds not attached to any pointer are allowed

First of all thanks for trying to work in this direction. Comments on the issues you encountered below (also CCed Honza as he should be more familiar with reachability and clone issues).

> On simple code this approach worked well, but on bigger tests some
> issues were revealed.
>
> 1. Node reachability. The instrumented version is actually always
> reachable when the original function is reachable because it is always
> emitted instead of the original. Thus I had to fix the reachability
> analysis to achieve this. Another similar problem is the check whether
> a node can be removed after inlining an instrumented function. Not
> hard to fix, but probably other similar problems exist.

I suppose you do not update the callgraph / the call stmts when creating the clones? Btw, is it desirable to inline the uninstrumented function and then instrument the result (thus run cloning and instrumentation after early inlining?)? Especially handling always_inlines before cloning/instrumentation looks very sensible.

> 2. Function processing order. The function processing order is
> determined before the early local passes. But during function
> instrumentation the call graph is modified significantly and the
> topological order used becomes outdated. That causes some trouble.
> E.g. a function marked as 'always inline' cannot be inlined because it
> is not in SSA form yet. Surely the inlining problem may be solved by
> just putting instrumentation after early inlining, but similar
> problems may exist in other passes too. To resolve this problem I
> tried to split the early local passes into three parts. The first one
> builds SSA, the second one performs instrumentation, the last one does
> the rest. Each part is performed on all functions before the next one
> starts. Thus I get all functions in SSA form and all instrumentation
> performed before starting the early optimizations. Unfortunately such
> a pass order leads to invalid SSA because of the local_pure_const
> optimization affecting callers' correctness (in case the caller's SSA
> was built before the optimization revealed the 'pure' or 'const'
> flag).

Generally the processing order of early_local_passes is chosen to get better optimization - changing it shouldn't affect correctness, and thus the issues you observe demand fixing anyway. (I've noted in the past that the early_local_passes processing order should more explicitly honor callgraph SCCs, eventually even iterating). Moving the SSA build out of the early_local_passes and into a separate lowering stage is possible; the pure-const stuff is handled by keeping pass_fixup_cfg where it is now, I think. In theory you can go into SSA form in all_lowering_passes already (but you have
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Tue, 14 Jan 2014, Richard Biener wrote: > On Mon, 13 Jan 2014, Cong Hou wrote: > > > I noticed that LIM could not hoist vector invariant, and that is why > > my first implementation tries to hoist them all. > > Yes, I filed PR59786 for this. I'll see if I can come up with > a fix suitable for stage3. > > > In addition, there are two disadvantages of hoisting invariant load + > > lim method: > > > > First, for some instructions the scalar version is faster than the > > vector version, and in this case hoisting scalar instructions before > > vectorization is better. Those instructions include data > > packing/unpacking, integer multiplication with SSE2, etc.. > > > > Second, it may use more SIMD registers. > > > > The following code shows a simple example: > > > > char *a, *b, *c; > > for (int i = 0; i < N; ++i) > > a[i] = b[0] * c[0] + a[i]; > > > > Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0] > > into a vector. > > Yes. I've tried with adjusting the vec_def_type as in the prototype > patch I sent before christmas but that's quite intrusive for this > stage. You could argue that performing invariant motion is not > really the vectorizers main task and that a combination of hoisting > only the load, later LIM hoisting the rest and then tree-vect-generic.c > demoting vector ops to scalar ops (unimplemented, but also a useful > general optimization) would work as well. For example with the untested following. Not sure if the LIM change is appropriate at this stage (it's handling of "cost" is weird, and in other places of the compiler we simply aggressively hoist invariants and expect RTL to fixup register pressure issues). The lowering change looks more like sth for forwprop but that runs quite late after vectorization. tree-vect-generic could at least factor in whether the target has a scalar op of that kind and whether that is maybe more expensive (though trading two vector splats for one is very likely offsetting that). 
It also would need to consider the case where this moves a vector splat inside a loop when handling the testcases we talk about without improved invariant motion.

Any comments? Anything we want to fix before 4.9? The testcases are optimized by RTL invariant motion but they perform a vector addition. For example

void test1 (int* a, int* b)
{
  int i;
  for (i = 0; i < 10; ++i)
    a[i] = *b + 1;
}

gets

.L7:
	movd	(%rsi), %xmm1
	leaq	(%rdi,%rdx,4), %rdx
	xorl	%eax, %eax
	pshufd	$0, %xmm1, %xmm0
	paddd	.LC0(%rip), %xmm0
	.p2align 4,,10
	.p2align 3
.L4:
	addl	$1, %eax
	addq	$16, %rdx
	movaps	%xmm0, -16(%rdx)
	cmpl	%eax, %ecx
	ja	.L4

instead of

.L7:
	movl	(%rsi), %eax
	leaq	(%rdi,%rdx,4), %rdx
	addl	$1, %eax
	movl	%eax, -12(%rsp)
	xorl	%eax, %eax
	movd	-12(%rsp), %xmm1
	pshufd	$0, %xmm1, %xmm0
	.p2align 4,,10
	.p2align 3
.L4:
	addl	$1, %eax
	addq	$16, %rdx
	movaps	%xmm0, -16(%rdx)
	cmpl	%eax, %ecx
	ja	.L4

which because of the by default disabled inter-unit moves looks even more expensive. With inter-unit moves we get

.L7:
	movl	(%rsi), %eax
	leaq	(%rdi,%rdx,4), %rdx
	addl	$1, %eax
	movd	%eax, %xmm0
	xorl	%eax, %eax
	pshufd	$0, %xmm0, %xmm0
	.p2align 4,,10
	.p2align 3
.L4:
	addl	$1, %eax
	movaps	%xmm0, (%rdx)
	addq	$16, %rdx
	cmpl	%eax, %ecx
	ja	.L4

not sure if the avoided constant pool load offsets the inter-unit move here (depends on the kind of pipeline constraints that has, the above is with corei7 tuning).

It looks to me that demoting vector to scalar ops might be better performed at RTL level? Plus the reverse op as seen from the above example where it isn't all clear which variant is better (which probably depends quite some bit on the CPU architecture).

Thanks,
Richard.

Index: gcc/tree-ssa-loop-im.c
===================================================================
*** gcc/tree-ssa-loop-im.c	(revision 206599)
--- gcc/tree-ssa-loop-im.c	(working copy)
*** stmt_cost (gimple stmt)
*** 533,538 ****
--- 533,541 ----
        return 0;
      default:
+       /* All vector operations are expensive.  */
+       if (VECTOR_TYPE_P (gimple_expr_type (stmt)))
+ 	return LIM_EXPENSIVE;
        return 1;
      }
  }
Index: gcc/tree-vect-generic.c
===================================================================
*** gcc/tree-vect-generic.c	(revision 206599)
--- gcc/tree-vect-generic.c	(working copy)
*** lower_vec_perm (gimple_stmt_iterator *gs
*** 1335,1340 ****
--- 1335,1357 ----
    update_stmt (gsi_stmt (*gsi));
  }
  
+ /* If OP is a uniform vector return the element it is a splat from.  */
+ 
+ static tree
+ ssa_uniform_vecto
Drop REG_CROSSING_JUMP when converting to a conditional return
While experimenting with a patch to use conditional returns, I hit a case where a conditional jump between hot and cold sections was being converted into a conditional return. The new jump still had the REG_CROSSING_JUMP and later passes were confused by the combination of that and a return JUMP_LABEL. The jump is now returning directly from the current section rather than crossing sections within the function, so I think we should just drop the note.

Tested on s390x-linux-gnu with some local patches. OK to install?

Thanks,
Richard

gcc/
	* jump.c (redirect_jump_2): Remove REG_CROSSING_JUMP notes when
	converting a conditional jump into a conditional return.

Index: gcc/jump.c
===================================================================
--- gcc/jump.c	2014-01-03 15:06:10.516727719 +
+++ gcc/jump.c	2014-01-14 10:27:13.224173269 +
@@ -1580,6 +1580,16 @@ redirect_jump_2 (rtx jump, rtx olabel, r
 	}
     }
 
+  /* Handle the case where we had a conditional crossing jump to a return
+     label and are now changing it into a direct conditional return.
+     The jump is no longer crossing in that case.  */
+  if (ANY_RETURN_P (nlabel))
+    {
+      note = find_reg_note (jump, REG_CROSSING_JUMP, NULL_RTX);
+      if (note)
+	remove_note (jump, note);
+    }
+
   if (!ANY_RETURN_P (olabel)
       && --LABEL_NUSES (olabel) == 0 && delete_unused > 0
       /* Undefined labels will remain outside the insn stream.  */
Re: [PATCH/AARCH64] Add issue_rate tuning field
On 14/01/14 01:49, Andrew Pinski wrote: > Hi, > While writing a scheduler for Cavium's aarch64 processor (Thunder), > I found there was no way currently to change the issue rate in > back-end. This patch adds a field (issue_rate) to tune_params and > creates a new function that the middle-end calls. I updated the > current two tuning variables (generic_tunings and cortexa53_tunings) > to be 1 which was the default before. > > OK? Built and tested for aarch64-elf with no regressions. > > Thanks, > Andrew Pinski > > ChangeLog: > * config/aarch64/aarch64-protos.h (tune_params): Add issue_rate. > * config/aarch64/aarch64.c (generic_tunings): Add issue rate of 1. > (cortexa53_tunings): Likewise. > (aarch64_sched_issue_rate): New function. > (TARGET_SCHED_ISSUE_RATE): Define. > > Ug, I'd missed that we weren't setting this. I think the value should be 2 for both generic and Cortex-A53 (I can't really envisage single-issue AArch64 systems). OK with that change (yes I know that wasn't the default before)... R.
Re: libsanitizer merge from upstream r196090
Uros Bizjak wrote: > The same tsan failures are reported in one of HJ's testers [2], with message: Can this be duplicate of http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59410 ? -Y
Re: [PATCH][IRA] Analysis of register usage of functions for usage by IRA.
On 13/01/14 16:16, Tom de Vries wrote: > On 10-01-14 12:39, Richard Earnshaw wrote: >> Consequently, you'll need to add a patch for AArch64 which has two >> registers clobbered by PLT-based calls. >> Thanks for pointing that out. That's r16 and r17, right? I can propose the hook for AArch64, once we all agree on how the hook should look. >> Yes; and thanks! > > Hi Richard, > > I'm posting this patch that implements the TARGET_FN_OTHER_HARD_REG_USAGE > hook > for aarch64. It uses the conservative hook format for now. > > I've build gcc and cc1 with the patch, and observed the impact on this code > snippet: > ... > static int > bar (int x) > { >return x + 3; > } > > int > foo (int y) > { >return y + bar (y); > } > ... > > AFAICT, that looks as expected: > ... > $ gcc fuse-caller-save.c -mno-lra -fno-use-caller-save -O2 -S -o- > 1 > $ gcc fuse-caller-save.c -mno-lra -fuse-caller-save -O2 -S -o- > 2 > $ diff -u 1 2 > --- 1 2014-01-13 16:51:24.0 +0100 > +++ 2 2014-01-13 16:51:19.0 +0100 > @@ -11,14 +11,12 @@ > .global foo > .type foo, %function > foo: > - stp x29, x30, [sp, -32]! > + stp x29, x30, [sp, -16]! > + mov w1, w0 > add x29, sp, 0 > - str x19, [sp,16] > - mov w19, w0 > bl bar > - add w0, w0, w19 > - ldr x19, [sp,16] > - ldp x29, x30, [sp], 32 > + ldp x29, x30, [sp], 16 > + add w0, w0, w1 > ret > .size foo, .-foo > .section.text.startup,"ax",%progbits > ... > > Btw, the results are the same for -mno-lra and -mlra. I'm just using the > -mno-lra version here because the -mlra version of -fuse-caller-save is still > in > review ( http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00586.html ). > > Thanks, > - Tom > > > fuse-caller-save-aarch64-hook.patch > > > 2014-01-11 Tom de Vries > > * config/aarch64/aarch64.c (TARGET_FN_OTHER_HARD_REG_USAGE): Redefine as > aarch64_fn_other_hard_reg_usage. > (aarch64_fn_other_hard_reg_usage): New function. 
> --- > gcc/config/aarch64/aarch64.c | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index 3b1f6b5..295fd5d 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -3287,6 +3287,16 @@ aarch64_fixed_condition_code_regs (unsigned int *p1, > unsigned int *p2) > return true; > } > > +/* Implement TARGET_FN_OTHER_HARD_REG_USAGE. */ > + > +static bool > +aarch64_fn_other_hard_reg_usage (struct hard_reg_set_container *regs) > +{ > + SET_HARD_REG_BIT (regs->set, R16_REGNUM); > + SET_HARD_REG_BIT (regs->set, R17_REGNUM); > + return true; > +} I think that in this context using IP0_REGNUM and IP1_REGNUM would be slightly clearer, since it is because these registers are the inter-procedure-call scratch registers that they aren't safe to use in this context. Otherwise, this is OK. R.
Re: libsanitizer merge from upstream r196090
> FAIL: g++.dg/tsan/default_options.C -O2 execution test This one looks plain wrong to me. It should be checked for success instead of failure. -Y
Re: libsanitizer merge from upstream r196090
I've started a separate topic about flaky tsan tests here: https://groups.google.com/forum/#!topic/thread-sanitizer/KIok3F_b1oI --kcc On Fri, Jan 10, 2014 at 7:20 PM, Jakub Jelinek wrote: > On Fri, Jan 10, 2014 at 03:50:33PM +0400, Maxim Ostapenko wrote: >> On Fri, Jan 10, 2014 at 10:39 AM, Jakub Jelinek wrote: >> >> > Some of the tsan tests seem to FAIL randomly for quite a while >> > (since they were added), didn't have time to look if it is just >> bugs in the test or >> > some compiler issue or library issue. >> >> When I committed these tsan tests, all of them passed on my >> x86_64-unknown-linux-gnu 64bit system. >> >> Should I review them more carefully? > > That would be nice. The FAILs I'm seeing include e.g. > FAIL: c-c++-common/tsan/simple_race.c -O2 execution test > > FAIL: c-c++-common/tsan/simple_race.c -O0 execution test > FAIL: g++.dg/tsan/default_options.C -O2 execution test > > (and various others in the past, though not sure if any of the > pattern-related failures could have something to do with libbacktrace > symbolization not being there yet (note, I do not have > /usr/bin/llvm-symbolizer installed on the testing box)). > > Both these tests don't report anything (well, default_options.C prints DONE), > simply succeed (with dg-shouldfail that is a failure). > I don't see anything wrong in the compiler output, the Thread? > routines just call __tsan_func_entry, then __tsan_write4 (&Global); > then __tsan_func_exit, so I don't see how it could be related to anything > but the library. Note the box is sometimes quite loaded (two make -j48 > regtests going on at the same time), but there is enough memory. > > Is the library perhaps timing sensitive, e.g. that it would track > issues only if the two threads are actually concurrent rather than could be > concurrent? 
Say if the first pthread_create creates the thread immediately, > the second pthread_create returns but the clone-started thread isn't up yet, > then pthread_join on the first thread finishes and the first thread is > destroyed, then the second thread actually starts? > > Jakub
Re: [PATCH, ARM] Fix two IT issues
On 14/01/14 09:06, Zhenqiang Chen wrote: > Hi, > > The patch fixes two IT issues: > 1) IT block is split for A15 (and Cortex-M). > > For A15 tune, max cond insns (max_insns_skipped) is 2, which is set as the > maximum allowed insns in an IT block (see thumb2_final_prescan_insn). So an IT block > which has 3 or 4 instructions will be split. Take the first if-then-else > in the test case of the patch as an example; it will generate ITs like: > cmp r0, #10 > ite gt > subgt r0, r0, r1 > suble r0, r1, r0 > ite gt > addgt r0, r0, #10 > suble r0, r0, #7 > It does not make sense to split the IT. For cortex-m, the IT cannot be > folded if the previous insn is 4 BYTES. > > 2) For arm_v7m_tune, max cond insns is not aligned with its branch cost. In > ifcvt.c, the relation between MAX_CONDITIONAL_EXECUTE (max cond insns) and > branch cost is: > > #ifndef MAX_CONDITIONAL_EXECUTE > #define MAX_CONDITIONAL_EXECUTE \ > (BRANCH_COST (optimize_function_for_speed_p (cfun), false) \ > + 1) > #endif > > So the best value of max cond insns for v7m_tune should be 2. > > Bootstrap and no make check regression on ARM Chrome book. No make check > regression for Cortex-M3. > > Cortex-M4 O2 performance changes on coremark, dhrystone and eembc-v1: > coremark: -0.11% > dhrystone: 1.26%. > a2time01_lite: 2.63% > canrdr01_lite: 4.27% > iirflt01_lite: 6.51% > rspeed01_lite: 6.51% > dither01_lite: 7.36% > > The biggest regression in eembc-v1 is pntrch01_lite: -0.51% > All other regressions < 0.1% > > Cortex-M4 O3 performance changes are similar to O2, except one regression > due to loop alignment change. > > OK for trunk? > > Thanks! > -Zhenqiang > > 2014-01-14 Zhenqiang Chen > > * config/arm/arm.c (arm_v7m_tune): Set max_insns_skipped to 2. > (thumb2_final_prescan_insn): Set max to MAX_INSN_PER_IT_BLOCK. > > testsuite/ChangeLog: > 2014-01-14 Zhenqiang Chen > > * gcc.target/arm/its.c: New test. > OK. R.
[RFC] Using function clones for Pointer Bounds Checker
Hi, I've been working for some time on a prototype of the Pointer Bounds Checker which uses function clones for instrumentation (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After several experiments with this approach I want to share my results and ask for some feedback before deciding on the future steps. Firstly, I want to recall the reasons for digging in this direction. In the original approach, bounds of call arguments and input parameters are associated with arguments via special built-in calls. This creates an implicit data flow the compiler is not aware of, which confuses some optimizations, resulting in mis-optimization, and breaks the bounds data flow. Thus optimizations have to be fixed to get better pointer protection. The clones approach does not use special built-in function calls to associate bounds with call arguments and input parameters. Each function which should be instrumented gets an additional version, and only this special version is instrumented. This new version gets additional bound arguments to express input bounds. When a function call is instrumented, it is redirected to the instrumented version and all bounds are passed as explicit call arguments. Thus we have an explicit pointer bounds flow similar to that of regular function parameters. This should allow us to avoid changes in optimizations, avoid mis-optimizations, and allow existing IPA optimizations to work with bound args (e.g. propagate constant bounds values and remove checks in the called function). I made a prototype implementation of this approach in the following way: - Add a new IPA pass before the early local passes to produce versions for all functions to be instrumented. - Put the instrumentation pass after the SSA pass. - Add a new pass after the IPA passes to remove the bodies of functions which have instrumented versions. Function nodes may still be required for calls in non-instrumented code, but we do not emit this code and therefore the function bodies are not needed. 
Positive changes are: - IPA optimizations are not confused by bound parameters - bounds are now more like regular arguments; this makes their processing in expand easier - functions with bounds not attached to any pointer are allowed On simple codes this approach worked well, but on bigger tests some issues were revealed. 1. Node reachability. The instrumented version is actually always reachable whenever the original function is reachable, because it is always emitted instead of the original. Thus I had to fix the reachability analysis to achieve this. Another similar problem is the check of whether a node can be removed after inlining, when inlining an instrumented function. Not hard to fix, but probably other similar problems exist. 2. Function processing order. The function processing order is determined before the early local passes. But during function instrumentation the call graph is modified significantly, and the topological order used becomes outdated. That causes some trouble. E.g. a function marked as 'always inline' cannot be inlined because it is not in SSA form yet. Surely the inlining problem may be solved by just putting instrumentation after early inlining, but similar problems may exist in other passes too. To resolve this problem I tried to split the early local passes into three parts. The first one builds SSA, the second one performs instrumentation, and the last one does the rest. Each part is performed on all functions before the next one starts. Thus I get all functions in SSA form and all instrumentation performed before the early optimizations start. Unfortunately, such a pass order leads to invalid SSA because the local_pure_const optimization affects callers' correctness (in case a caller's SSA was built before the optimization revealed the 'pure' or 'const' flag). In general I feel that having a special function version for instrumentation has better potential, should lead to less intrusive changes in the compiler, and should give better code quality. 
But before continuing with this implementation I would like to get some feedback and probably some advice on how to order the passes to get the best result. Currently I am inclined to have all functions instrumented before any local optimizations and to solve the pure_const problem by modifying all callers when the attribute is computed for some function. Thanks, Ilya
[PATCH, ARM] Fix two IT issues
Hi, The patch fixes two IT issues: 1) IT block is split for A15 (and Cortex-M). For A15 tune, max cond insns (max_insns_skipped) is 2, which is set as the maximum allowed insns in an IT block (see thumb2_final_prescan_insn). So an IT block which has 3 or 4 instructions will be split. Take the first if-then-else in the test case of the patch as an example; it will generate ITs like: cmp r0, #10 ite gt subgt r0, r0, r1 suble r0, r1, r0 ite gt addgt r0, r0, #10 suble r0, r0, #7 It does not make sense to split the IT. For cortex-m, the IT cannot be folded if the previous insn is 4 BYTES. 2) For arm_v7m_tune, max cond insns is not aligned with its branch cost. In ifcvt.c, the relation between MAX_CONDITIONAL_EXECUTE (max cond insns) and branch cost is: #ifndef MAX_CONDITIONAL_EXECUTE #define MAX_CONDITIONAL_EXECUTE \ (BRANCH_COST (optimize_function_for_speed_p (cfun), false) \ + 1) #endif So the best value of max cond insns for v7m_tune should be 2. Bootstrap and no make check regression on ARM Chrome book. No make check regression for Cortex-M3. Cortex-M4 O2 performance changes on coremark, dhrystone and eembc-v1: coremark: -0.11% dhrystone: 1.26%. a2time01_lite: 2.63% canrdr01_lite: 4.27% iirflt01_lite: 6.51% rspeed01_lite: 6.51% dither01_lite: 7.36% The biggest regression in eembc-v1 is pntrch01_lite: -0.51% All other regressions < 0.1% Cortex-M4 O3 performance changes are similar to O2, except one regression due to loop alignment change. OK for trunk? Thanks! -Zhenqiang 2014-01-14 Zhenqiang Chen * config/arm/arm.c (arm_v7m_tune): Set max_insns_skipped to 2. (thumb2_final_prescan_insn): Set max to MAX_INSN_PER_IT_BLOCK. testsuite/ChangeLog: 2014-01-14 Zhenqiang Chen * gcc.target/arm/its.c: New test. diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 39d23cc..8751675 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -1696,7 +1696,7 @@ const struct tune_params arm_v7m_tune = &v7m_extra_costs, NULL, /* Sched adj cost. */ 1, /* Constant limit. 
*/ - 5, /* Max cond insns. */ + 2, /* Max cond insns. */ ARM_PREFETCH_NOT_BENEFICIAL, true,/* Prefer constant pool. */ arm_cortex_m_branch_cost, @@ -22131,11 +22131,11 @@ thumb2_final_prescan_insn (rtx insn) int mask; int max; - /* Maximum number of conditionally executed instructions in a block - is minimum of the two max values: maximum allowed in an IT block - and maximum that is beneficial according to the cost model and tune. */ - max = (max_insns_skipped < MAX_INSN_PER_IT_BLOCK) ? -max_insns_skipped : MAX_INSN_PER_IT_BLOCK; + /* max_insns_skipped in the tune was already taken into account in the + cost model of ifcvt pass when generating COND_EXEC insns. At this stage + just emit the IT blocks as we can. It does not make sense to split + the IT blocks. */ + max = MAX_INSN_PER_IT_BLOCK; /* Remove the previous insn from the count of insns to be output. */ if (arm_condexec_count) diff --git a/gcc/testsuite/gcc.target/arm/its.c b/gcc/testsuite/gcc.target/arm/its.c new file mode 100644 index 000..8eecdf4 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/its.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ +int test (int a, int b) +{ + int r; + if (a > 10) +{ + r = a - b; + r += 10; +} + else +{ + r = b - a; + r -= 7; +} + if (r > 0) +r -= 3; + return r; +} +/* { dg-final { scan-assembler-times "\tit" 2 { target arm_thumb2 } } } */
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Mon, 13 Jan 2014, Cong Hou wrote: > I noticed that LIM could not hoist vector invariants, and that is why > my first implementation tries to hoist them all. Yes, I filed PR59786 for this. I'll see if I can come up with a fix suitable for stage3. > In addition, there are two disadvantages of the hoisting invariant load + > LIM method: > > First, for some instructions the scalar version is faster than the > vector version, and in this case hoisting scalar instructions before > vectorization is better. Those instructions include data > packing/unpacking, integer multiplication with SSE2, etc.. > > Second, it may use more SIMD registers. > > The following code shows a simple example: > > char *a, *b, *c; > for (int i = 0; i < N; ++i) > a[i] = b[0] * c[0] + a[i]; > > Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0] > into a vector. Yes. I've tried adjusting the vec_def_type as in the prototype patch I sent before Christmas but that's quite intrusive for this stage. You could argue that performing invariant motion is not really the vectorizer's main task and that a combination of hoisting only the load, later LIM hoisting the rest, and then tree-vect-generic.c demoting vector ops to scalar ops (unimplemented, but also a useful general optimization) would work as well. That said, we should definitely have a second look for 4.10. For now hoisting the load is an improvement over 4.8 (at least I hope so ;)) which needs to be good enough for 4.9. I had to fix a latent bug to cure some testsuite fallout, so the following is what I ended up committing. Jakub, adding the new flag is ok with me. Thanks, Richard. 2014-01-14 Richard Biener PR tree-optimization/58921 PR tree-optimization/59006 * tree-vect-loop-manip.c (vect_loop_versioning): Remove code hoisting invariant stmts. * tree-vect-stmts.c (vectorizable_load): Insert the splat of invariant loads on the preheader edge if possible. * gcc.dg/torture/pr58921.c: New testcase. 
* gcc.dg/torture/pr59006.c: Likewise. * gcc.dg/vect/pr58508.c: XFAIL no longer handled cases. Index: gcc/tree-vect-loop-manip.c === *** gcc/tree-vect-loop-manip.c (revision 206576) --- gcc/tree-vect-loop-manip.c (working copy) *** vect_loop_versioning (loop_vec_info loop *** 2435,2507 } } - - /* Extract load statements on memrefs with zero-stride accesses. */ - - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) - { - /* In the loop body, we iterate each statement to check if it is a load. -Then we check the DR_STEP of the data reference. If DR_STEP is zero, -then we will hoist the load statement to the loop preheader. */ - - basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); - int nbbs = loop->num_nodes; - - for (int i = 0; i < nbbs; ++i) - { - for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]); - !gsi_end_p (si);) - { - gimple stmt = gsi_stmt (si); - stmt_vec_info stmt_info = vinfo_for_stmt (stmt); - struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); - - if (is_gimple_assign (stmt) - && (!dr - || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr) - { - bool hoist = true; - ssa_op_iter iter; - tree var; - - /* We hoist a statement if all SSA uses in it are defined -outside of the loop. */ - FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE) - { - gimple def = SSA_NAME_DEF_STMT (var); - if (!gimple_nop_p (def) - && flow_bb_inside_loop_p (loop, gimple_bb (def))) - { - hoist = false; - break; - } - } - - if (hoist) - { - if (dr) - gimple_set_vuse (stmt, NULL); - - gsi_remove (&si, false); - gsi_insert_on_edge_immediate (loop_preheader_edge (loop), - stmt); - - if (dump_enabled_p ()) - { - dump_printf_loc - (MSG_NOTE, vect_location, - "hoisting out of the vectorized loop: "); - dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); - dump_printf (MSG_NOTE, "\n"); - } - continue; - } - } - gsi_next (&si); - } - } - } - /* End loop-exit-fixes after versioning.
Re: [PATCH] Fix up vect/fast-math-mgrid-resid.f testcase (PR testsuite/59494)
On Mon, 13 Jan 2014, Jakub Jelinek wrote: > Hi! > > As discussed in the PR and on IRC, this testcase is very fragile, counting > additions with vect_ named SSA_NAME on lhs works only for some tunings, > for other tunings reassoc width etc. affect it and we can e.g. have > anonymous SSA_NAMEs on the lhs in the optimized dump instead. > > These alternate regexps seems to match regardless of the tunings (at least > what I've tried), starting with the corresponding fix onwards, and FAIL > before the fix. > > Regtested on x86_64-linux and i686-linux, ok for trunk? Ok. Thanks, Richard. > 2014-01-13 Jakub Jelinek > > PR testsuite/59494 > * gfortran.dg/vect/fast-math-mgrid-resid.f: Change > -fdump-tree-optimized to -fdump-tree-pcom-details in dg-options and > cleanup-tree-dump from optimized to pcom. Remove scan-tree-dump-times > for vect_\[^\\n\]*\\+, add scan-tree-dump-times for no suitable chains > and > Executing predictive commoning without unrolling. > > --- gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f.jj 2013-04-08 > 15:38:21.0 +0200 > +++ gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f2014-01-13 > 13:18:39.904315828 +0100 > @@ -1,7 +1,7 @@ > ! { dg-do compile { target i?86-*-* x86_64-*-* } } > ! { dg-require-effective-target vect_double } > ! { dg-require-effective-target sse2 } > -! { dg-options "-O3 -ffast-math -msse2 -fpredictive-commoning > -ftree-vectorize -fdump-tree-optimized" } > +! { dg-options "-O3 -ffast-math -msse2 -fpredictive-commoning > -ftree-vectorize -fdump-tree-pcom-details" } > > > *** RESID COMPUTES THE RESIDUAL: R = V - AU > @@ -39,8 +39,9 @@ C >RETURN >END > ! we want to check that predictive commoning did something on the > -! vectorized loop, which means we have to have exactly 13 vector > -! additions. > -! { dg-final { scan-tree-dump-times "vect_\[^\\n\]*\\+ " 13 "optimized" } } > +! vectorized loop. > +! 
{ dg-final { scan-tree-dump-times "Executing predictive commoning without > unrolling" 1 "pcom" { target lp64 } } } > +! { dg-final { scan-tree-dump-times "Executing predictive commoning without > unrolling" 2 "pcom" { target ia32 } } } > +! { dg-final { scan-tree-dump-times "Predictive commoning failed: no > suitable chains" 0 "pcom" } } > ! { dg-final { cleanup-tree-dump "vect" } } > -! { dg-final { cleanup-tree-dump "optimized" } } > +! { dg-final { cleanup-tree-dump "pcom" } } > > Jakub > > -- Richard Biener SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer