Re: PATCH: PR target/59794: [4.7/4.8/4.9 Regression] i386 backend fails to detect MMX/SSE/AVX ABI changes
On Tue, Jan 14, 2014 at 06:18:22AM -0800, H.J. Lu wrote:
> 2014-01-14  H.J. Lu
>
> 	PR target/59794
> 	* config/i386/i386.c (type_natural_mode): Add a bool parameter
> 	to indicate if type is used for function return value.  Warn
> 	ABI change if the vector mode isn't available for function
> 	return value.
> 	(ix86_function_arg_advance): Pass false to type_natural_mode.
> 	(ix86_function_arg): Likewise.
> 	(ix86_gimplify_va_arg): Likewise.
> 	(function_arg_32): Don't warn ABI change.
> 	(ix86_function_value): Pass true to type_natural_mode.
> 	(ix86_return_in_memory): Likewise.
> 	(ix86_struct_value_rtx): Removed.
> 	(TARGET_STRUCT_VALUE_RTX): Likewise.

This has added many FAILs on i686-linux (make sure to configure for a CPU that doesn't automatically turn on -msse or -mmmx, say i686):

+FAIL: gcc.dg/Wstrict-aliasing-bogus-ref-all-2.c (test for excess errors)
+FAIL: gcc.dg/pr53060.c (test for excess errors)
+FAIL: c-c++-common/convert-vec-1.c -Wc++-compat (test for excess errors)
+FAIL: c-c++-common/scal-to-vec2.c -Wc++-compat (test for excess errors)
+FAIL: c-c++-common/vector-compare-2.c -Wc++-compat (test for excess errors)
+FAIL: g++.dg/conversion/simd1.C -std=c++98 (test for excess errors)
+FAIL: g++.dg/conversion/simd1.C -std=c++11 (test for excess errors)
+FAIL: g++.dg/cpp0x/constexpr-53094-2.C (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-1.C -std=gnu++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-1.C -std=gnu++11 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-2.C -std=gnu++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-2.C -std=gnu++11 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-3.C -std=c++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-3.C -std=c++11 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-4.C -std=c++98 (test for excess errors)
+FAIL: g++.dg/ext/attribute-test-4.C -std=c++11 (test for excess errors)
+FAIL: g++.dg/ext/pr56790-1.C -std=gnu++98 (test for excess errors)
+FAIL: g++.dg/ext/pr56790-1.C -std=gnu++11 (test for excess errors)
+FAIL: c-c++-common/convert-vec-1.c -std=c++98 (test for excess errors)
+FAIL: c-c++-common/convert-vec-1.c -std=c++11 (test for excess errors)
+FAIL: c-c++-common/scal-to-vec2.c -std=gnu++98 (test for excess errors)
+FAIL: c-c++-common/scal-to-vec2.c -std=gnu++11 (test for excess errors)
+FAIL: c-c++-common/vector-compare-2.c -std=gnu++98 (test for excess errors)
+FAIL: c-c++-common/vector-compare-2.c -std=gnu++11 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O0 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O1 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O2 (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O3 -fomit-frame-pointer (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O3 -g (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -Os (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O2 -flto -flto-partition=none (test for excess errors)
+FAIL: g++.dg/torture/pr38565.C -O2 -flto (test for excess errors)

Excess errors:
/usr/src/gcc/gcc/testsuite/gcc.dg/Wstrict-aliasing-bogus-ref-all-2.c:9:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/gcc.dg/pr53060.c:13:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/c-c++-common/convert-vec-1.c:3:1: warning: MMX vector return without MMX enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/c-c++-common/scal-to-vec2.c:19:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/c-c++-common/vector-compare-2.c:20:1: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/conversion/simd1.C:8:59: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/cpp0x/constexpr-53094-2.C:7:46: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-1.C:10:52: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-2.C:14:59: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-3.C:26:26: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/attribute-test-4.C:26:24: warning: SSE vector return without SSE enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/ext/pr56790-1.C:6:12: warning: MMX vector return without MMX enabled changes the ABI [enabled by default]
/usr/src/gcc/gcc/testsuite/g++.dg/torture/pr38565.C:5:
[wide-int] Fixed several regressions on the branch
This patch fixes what appears to have been a long-standing failure in the conversion of tree-vect-generic.c:build_replicated_const. This failure caused several regressions on the branch. Committed as revision 206616.

Index: gcc/tree-vect-generic.c
===
--- gcc/tree-vect-generic.c	(revision 206609)
+++ gcc/tree-vect-generic.c	(working copy)
@@ -57,7 +57,8 @@
 static tree
 build_replicated_const (tree type, tree inner_type, HOST_WIDE_INT value)
 {
   int width = tree_to_uhwi (TYPE_SIZE (inner_type));
-  int n = TYPE_PRECISION (type) / width;
+  int n = (TYPE_PRECISION (type) + HOST_BITS_PER_WIDE_INT - 1)
+	  / HOST_BITS_PER_WIDE_INT;
   unsigned HOST_WIDE_INT low, mask;
   HOST_WIDE_INT a[WIDE_INT_MAX_ELTS];
   int i;
Go patch committed: Use backend interface for interface expressions
This Go frontend patch from Chris Manghane uses the backend interface for interface info and field expressions. Bootstrapped and ran Go testsuite on x86_64-unknown-linux-gnu. Committed to mainline. Ian 2014-01-14 Chris Manghane * go-gcc.cc (Gcc_backend::compound_expression): New function. (Gcc_backend::conditional_expression): New function. Index: gcc/go/gofrontend/expressions.cc === --- gcc/go/gofrontend/expressions.cc (revision 206509) +++ gcc/go/gofrontend/expressions.cc (working copy) @@ -6473,11 +6473,11 @@ Expression::make_binary(Operator op, Exp tree Expression::comparison_tree(Translate_context* context, Type* result_type, - Operator op, Expression* left_expr, - Expression* right_expr, Location location) + Operator op, Expression* left, Expression* right, + Location location) { - Type* left_type = left_expr->type(); - Type* right_type = right_expr->type(); + Type* left_type = left->type(); + Type* right_type = right->type(); mpz_t zval; mpz_init_set_ui(zval, 0UL); @@ -6509,17 +6509,11 @@ Expression::comparison_tree(Translate_co go_unreachable(); } - // FIXME: Computing the tree here means it will be computed multiple times, - // which is wasteful. This is a temporary modification until all tree code - // here can be replaced with frontend expressions. 
- tree left_tree = left_expr->get_tree(context); - tree right_tree = right_expr->get_tree(context); if (left_type->is_string_type() && right_type->is_string_type()) { - Expression* strcmp_call = Runtime::make_call(Runtime::STRCMP, location, 2, - left_expr, right_expr); - left_tree = strcmp_call->get_tree(context); - right_tree = zexpr->get_tree(context); + left = Runtime::make_call(Runtime::STRCMP, location, 2, +left, right); + right = zexpr; } else if ((left_type->interface_type() != NULL && right_type->interface_type() == NULL @@ -6532,31 +6526,30 @@ Expression::comparison_tree(Translate_co if (left_type->interface_type() == NULL) { std::swap(left_type, right_type); - std::swap(left_expr, right_expr); + std::swap(left, right); } // The right operand is not an interface. We need to take its // address if it is not a pointer. Expression* pointer_arg = NULL; if (right_type->points_to() != NULL) -pointer_arg = right_expr; +pointer_arg = right; else { - go_assert(right_expr->is_addressable()); - pointer_arg = Expression::make_unary(OPERATOR_AND, right_expr, + go_assert(right->is_addressable()); + pointer_arg = Expression::make_unary(OPERATOR_AND, right, location); } - Expression* descriptor_expr = Expression::make_type_descriptor(right_type, - location); - Call_expression* iface_valcmp = + Expression* descriptor = + Expression::make_type_descriptor(right_type, location); + left = Runtime::make_call((left_type->interface_type()->is_empty() ? 
Runtime::EMPTY_INTERFACE_VALUE_COMPARE : Runtime::INTERFACE_VALUE_COMPARE), - location, 3, left_expr, descriptor_expr, + location, 3, left, descriptor, pointer_arg); - left_tree = iface_valcmp->get_tree(context); - right_tree = zexpr->get_tree(context); + right = zexpr; } else if (left_type->interface_type() != NULL && right_type->interface_type() != NULL) @@ -6574,56 +6567,42 @@ Expression::comparison_tree(Translate_co { go_assert(op == OPERATOR_EQEQ || op == OPERATOR_NOTEQ); std::swap(left_type, right_type); - std::swap(left_expr, right_expr); + std::swap(left, right); } go_assert(!left_type->interface_type()->is_empty()); go_assert(right_type->interface_type()->is_empty()); compare_function = Runtime::INTERFACE_EMPTY_COMPARE; } - Call_expression* ifacecmp_call = - Runtime::make_call(compare_function, location, 2, - left_expr, right_expr); - - left_tree = ifacecmp_call->get_tree(context); - right_tree = zexpr->get_tree(context); + left = Runtime::make_call(compare_function, location, 2, left, right); + right = zexpr; } if (left_type->is_nil_type() && (op == OPERATOR_EQEQ || op == OPERATOR_NOTEQ)) { std::swap(left_type, right_type); - std::swap(left_tree, right_tree); - std::swap(left_expr, right_expr); + std::swap(left, right); } if (right_type->is_nil_type()) { + right = Expression::make_nil(location); if (left_type
[PATCH/AARCH64] Fix register cost for moving to/from stack registers
While writing the Thunder tunings, I got an internal compiler error while building glibc. The reduced testcase is:

typedef unsigned int size_t;
typedef unsigned int wchar_t;
extern __thread int __libc_errno;
extern __thread int *t;
int
_IO_vfprintf_internal (char *string, char *workend, char *f)
{
  int save_errno = __libc_errno;
  do
    {
      int prec = (workend - string);
      string = (char *) __strerror_r (save_errno);
      if (*t == 1)
	{
	  size_t ignore_size = (unsigned) prec > 1024 ? 1024 : prec;
	  wchar_t ignore[ignore_size];
	  const char *str2 = string;
	  const char *strend = string + prec;
	  int ps;
	  while (str2 != ((void *) 0) && str2 < strend)
	    __mbsnrtowcs (ignore, &str2, &ps);
	}
    }
  while (*f != '\0');
}
--- CUT ---

I changed the cost of moving between two NEON registers (the FP_REGS register class) to be the same as the cost of moving between GENERAL_REGS and FP_REGS. This made the cost of moving between STACK_REG and FP_REGS the same as between FP_REGS and GENERAL_REGS, which is incorrect, since such a move has to go through GENERAL_REGS. This patch fixes the problem by changing the cost of a move between STACK_REG and FP_REGS to the cost of moving via GENERAL_REGS. OK?

Built and tested on aarch64-elf with no regressions.

Thanks,
Andrew Pinski

ChangeLog:
	* config/aarch64/aarch64.c (aarch64_register_move_cost): Correct
	cost of moving from/to the STACK_REG register class.

Index: config/aarch64/aarch64.c
===
--- config/aarch64/aarch64.c	(revision 206611)
+++ config/aarch64/aarch64.c	(working copy)
@@ -4870,6 +4870,16 @@ aarch64_register_move_cost (enum machine
   const struct cpu_regmove_cost *regmove_cost
     = aarch64_tune_params->regmove_cost;
 
+  /* Moving between GPR and stack cost is the same as GP2GP.  */
+  if ((from == GENERAL_REGS && to == STACK_REG)
+      || (to == GENERAL_REGS && from == STACK_REG))
+    return regmove_cost->GP2GP;
+
+  /* To/from the stack register, the move is via the GPRs.  */
+  if (to == STACK_REG || from == STACK_REG)
+    return aarch64_register_move_cost (mode, from, GENERAL_REGS)
+	   + aarch64_register_move_cost (mode, GENERAL_REGS, to);
+
   if (from == GENERAL_REGS && to == GENERAL_REGS)
     return regmove_cost->GP2GP;
   else if (from == GENERAL_REGS)
Go patch committed: Define Backend_function_type
This patch from Chris Manghane adds a Backend_function_type: a function type that is implemented as a function pointer rather than as the pointer to a struct that is the implementation of a Go function. This will be used in other backend conversions. Bootstrapped and ran Go testsuite on x86_64-unknown-linux-gnu. Committed to mainline. Ian diff -r 3515c11d7573 go/types.cc --- a/go/types.cc Thu Jan 09 22:40:50 2014 -0800 +++ b/go/types.cc Tue Jan 14 15:16:02 2014 -0800 @@ -4066,6 +4066,17 @@ return new Function_type(receiver, parameters, results, location); } +// Make a backend function type. + +Backend_function_type* +Type::make_backend_function_type(Typed_identifier* receiver, + Typed_identifier_list* parameters, + Typed_identifier_list* results, + Location location) +{ + return new Backend_function_type(receiver, parameters, results, location); +} + // Class Pointer_type. // Traversal. diff -r 3515c11d7573 go/types.h --- a/go/types.h Thu Jan 09 22:40:50 2014 -0800 +++ b/go/types.h Tue Jan 14 15:16:02 2014 -0800 @@ -19,6 +19,7 @@ class Complex_type; class String_type; class Function_type; +class Backend_function_type; class Struct_field; class Struct_field_list; class Struct_type; @@ -484,6 +485,12 @@ Typed_identifier_list* results, Location); + static Backend_function_type* + make_backend_function_type(Typed_identifier* receiver, + Typed_identifier_list* parameters, + Typed_identifier_list* results, + Location); + static Pointer_type* make_pointer_type(Type*); @@ -1896,6 +1903,23 @@ Btype* fnbtype_; }; +// The type of a function's backend representation. + +class Backend_function_type : public Function_type +{ + public: + Backend_function_type(Typed_identifier* receiver, +Typed_identifier_list* parameters, +Typed_identifier_list* results, Location location) + : Function_type(receiver, parameters, results, location) + { } + + protected: + Btype* + do_get_backend(Gogo* gogo) + { return this->get_backend_fntype(gogo); } +}; + // The type of a pointer. 
class Pointer_type : public Type
Re: [PATCH, rs6000] Don't emit profile code for procedures marked no_instrument_function
On Tue, Jan 14, 2014 at 3:42 PM, Pat Haugen wrote:
> This patch fixes a problem where the attribute no_instrument_function was
> being ignored and profile code was emitted. Testcase
> gcc.target/powerpc/ppc64-abi-2.c exposed the issue.
>
> Bootstrap/regtest with no new regressions, ok for trunk?
>
> -Pat
>
> 2014-01-13  Pat Haugen
>
> 	* config/rs6000/rs6000.c (rs6000_output_function_prologue): Check if
> 	current procedure should be profiled.

Okay.

Thanks, David
Re: [C PATCH] Disallow subtracting pointers to empty structs (PR c/58346)
On Mon, 13 Jan 2014, Marek Polacek wrote: > +/* Return true if T is a pointer to a zero-sized struct/union. */ > + > +bool > +pointer_to_zero_sized_aggr_p (tree t) > +{ > + t = strip_pointer_operator (t); > + if (RECORD_OR_UNION_TYPE_P (t) > + && TYPE_SIZE (t) > + && integer_zerop (TYPE_SIZE (t))) > +return true; > + return false; Given that GNU C also allows arrays of constant size 0, shouldn't the errors also apply in that case? (I don't know whether the original bug can appear for such arrays, but I'd think the errors should apply to anything with constant size 0 - not of course for VLAs where it just so happens that the compiler can tell at compile time that the size is always 0.) -- Joseph S. Myers jos...@codesourcery.com
Re: [C PATCH] Preevaluate rhs for lhs op= rhs in C (PR c/58943)
On Mon, 13 Jan 2014, Jakub Jelinek wrote: > This patch fixes the following testcase by preevaluating rhs if it has > (can have) side-effects in lhs op= rhs expressions. Bootstrapped/regtested > on x86_64-linux and i686-linux, ok for trunk? OK. -- Joseph S. Myers jos...@codesourcery.com
[PATCH, rs6000] Don't emit profile code for procedures marked no_instrument_function
This patch fixes a problem where the attribute no_instrument_function was being ignored and profile code was emitted. Testcase gcc.target/powerpc/ppc64-abi-2.c exposed the issue.

Bootstrap/regtest with no new regressions, ok for trunk?

-Pat

2014-01-13  Pat Haugen

	* config/rs6000/rs6000.c (rs6000_output_function_prologue): Check if
	current procedure should be profiled.

Index: config/rs6000/rs6000.c
===
--- config/rs6000/rs6000.c	(revision 206602)
+++ config/rs6000/rs6000.c	(working copy)
@@ -23198,7 +23198,7 @@ rs6000_output_function_prologue (FILE *f
   /* Output -mprofile-kernel code.  This needs to be done here instead of
      in output_function_profile since it must go after the ELFv2 ABI
      local entry point.  */
-  if (TARGET_PROFILE_KERNEL)
+  if (TARGET_PROFILE_KERNEL && crtl->profile)
     {
       gcc_assert (DEFAULT_ABI == ABI_AIX || DEFAULT_ABI == ABI_ELFv2);
       gcc_assert (!TARGET_32BIT);
Re: [PATCH,rs6000] Implement -maltivec=be for vec_mule and vec_mulo Altivec intrinsics
On Mon, Jan 13, 2014 at 6:37 PM, Bill Schmidt wrote: > This patch provides for interpreting parity of element numbers for the > Altivec vec_mule and vec_mulo intrinsics as big-endian (left to right in > a vector register) when targeting a little endian machine and specifying > -maltivec=be. New test cases are added to test this functionality on > all supported vector types. > > The main change is in the altivec.md define_insns for > vec_widen_{su}mult_{even,odd}_{v8hi,v16qi}, where we now test for > VECTOR_ELT_ORDER_BIG rather than BYTES_BIG_ENDIAN in order to treat the > element order as big-endian. However, this necessitates changes to > other places in altivec.md where we previously called > gen_vec_widen_{su}mult_*. The semantics of these internal uses are not > affected by -maltivec=be, so these are now replaced with direct > generation of the underlying instructions that were previously > generated. > > Bootstrapped and tested with no new regressions on > powerpc64{,le}-unknown-linux-gnu. Ok for trunk? > > Thanks, > Bill > > > gcc: > > 2014-01-13 Bill Schmidt > > * config/rs6000/altivec.md (mulv8hi3): Explicitly generate vmulesh > and vmulosh rather than call gen_vec_widen_smult_*. > (vec_widen_umult_even_v16qi): Test VECTOR_ELT_ORDER_BIG rather > than BYTES_BIG_ENDIAN to determine use of even or odd instruction. > (vec_widen_smult_even_v16qi): Likewise. > (vec_widen_umult_even_v8hi): Likewise. > (vec_widen_smult_even_v8hi): Likewise. > (vec_widen_umult_odd_v16qi): Likewise. > (vec_widen_smult_odd_v16qi): Likewise. > (vec_widen_umult_odd_v8hi): Likewise. > (vec_widen_smult_odd_v8hi): Likewise. > (vec_widen_umult_hi_v16qi): Explicitly generate vmuleub and > vmuloub rather than call gen_vec_widen_umult_*. > (vec_widen_umult_lo_v16qi): Likewise. > (vec_widen_smult_hi_v16qi): Explicitly generate vmulesb and > vmulosb rather than call gen_vec_widen_smult_*. > (vec_widen_smult_lo_v16qi): Likewise. 
> (vec_widen_umult_hi_v8hi): Explicitly generate vmuleuh and vmulouh
> rather than call gen_vec_widen_umult_*.
> (vec_widen_umult_lo_v8hi): Likewise.
> (vec_widen_smult_hi_v8hi): Explicitly generate vmulesh and vmulosh
> rather than call gen_vec_widen_smult_*.
> (vec_widen_smult_lo_v8hi): Likewise.
>
> gcc/testsuite:
>
> 2014-01-13  Bill Schmidt
>
> 	* gcc.dg/vmx/mult-even-odd.c: New.
> 	* gcc.dg/vmx/mult-even-odd-be-order.c: New.

Okay. The less said the better.

Thanks, David
Re: [Patch] Avoid gcc_assert in libgcov
On Thu, Jan 9, 2014 at 6:56 AM, Jan Hubicka wrote: >> As suggested by Honza, avoid bloating libgcov from gcc_assert by using >> a new macro gcov_nonruntime_assert in gcov-io.c that is only mapped to >> gcc_assert when not in libgcov. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2014-01-09 Teresa Johnson >> >> * gcov-io.c (gcov_position): Use gcov_nonruntime_assert. >> (gcov_is_error): Ditto. >> (gcov_rewrite): Ditto. >> (gcov_open): Ditto. >> (gcov_write_words): Ditto. >> (gcov_write_length): Ditto. >> (gcov_read_words): Ditto. >> (gcov_read_summary): Ditto. >> (gcov_sync): Ditto. >> (gcov_seek): Ditto. >> (gcov_histo_index): Ditto. >> (static void gcov_histogram_merge): Ditto. >> (compute_working_sets): Ditto. >> * gcov-io.h (gcov_nonruntime_assert): Define. >> > >> @@ -481,14 +481,14 @@ gcov_read_words (unsigned words) >>const gcov_unsigned_t *result; >>unsigned excess = gcov_var.length - gcov_var.offset; >> >> - gcc_assert (gcov_var.mode > 0); >> + gcov_nonruntime_assert (gcov_var.mode > 0); >>if (excess < words) >> { >>gcov_var.start += gcov_var.offset; >> #if IN_LIBGCOV >>if (excess) >> { >> - gcc_assert (excess == 1); >> + gcov_nonruntime_assert (excess == 1); > > It probably makes no sense to put nonruntime access into IN_LIBGCOV defines. You are right - there were several that were in IN_LIBGCOV defines that I can just remove. 
> >> memcpy (gcov_var.buffer, gcov_var.buffer + gcov_var.offset, 4); >> } >> #else >> @@ -497,7 +497,7 @@ gcov_read_words (unsigned words) >>gcov_var.offset = 0; >>gcov_var.length = excess; >> #if IN_LIBGCOV >> - gcc_assert (!gcov_var.length || gcov_var.length == 1); >> + gcov_nonruntime_assert (!gcov_var.length || gcov_var.length == 1); >>excess = GCOV_BLOCK_SIZE; >> #else >>if (gcov_var.length + words > gcov_var.alloc) >> @@ -614,7 +614,7 @@ gcov_read_summary (struct gcov_summary *summary) >>while (!cur_bitvector) >> { >>h_ix = bv_ix * 32; >> - gcc_assert (bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE); >> + gcov_nonruntime_assert (bv_ix < >> GCOV_HISTOGRAM_BITVECTOR_SIZE); >>cur_bitvector = histo_bitvector[bv_ix++]; >> } >>while (!(cur_bitvector & 0x1)) >> @@ -622,7 +622,7 @@ gcov_read_summary (struct gcov_summary *summary) >>h_ix++; >>cur_bitvector >>= 1; >> } >> - gcc_assert (h_ix < GCOV_HISTOGRAM_SIZE); >> + gcov_nonruntime_assert (h_ix < GCOV_HISTOGRAM_SIZE); > > How many of those asserts can be triggered by a corrupted gcda file? > I would like to make libgcov more safe WRT file corruptions, too, so in that > case we should produce an error message. In that case should we call gcov_error when IN_LIBGCOV? One possibility would be to simply make gcov_nonruntime_assert be defined as if (!EXPR) gcov_error in the IN_LIBGCOV case. But I think what you wanted here was to reduce libgcov bloat by removing calls altogether, which this wouldn't solve. But if we want to call gcov_error in some cases, I think I need to add another macro that will either do gcc_assert when !IN_LIBGCOV and "if (!EXPR) gcov_error" when IN_LIBGCOV. Is that what you had in mind? Thanks, Teresa > > The rest of changes seems OK. > > Honza -- Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
Re: [PATCH][IRA] Analysis of register usage of functions for usage by IRA.
On 12/05/2013 07:47 PM, Tom de Vries wrote: > On 14-03-13 10:34, Tom de Vries wrote: >>> I thought about implementing your optimization for LRA by myself. >>> But it >>> >is ok if you decide to work on it. At least, I am not going to start >>> >this work for a month. >>I'm also currently looking at how to use the analysis in LRA. >>AFAIU, in lra-constraints.c we do a backward scan over the insns, and keep track >>of how many calls we've seen (calls_num), and mark insns with that number. Then >>when looking at a live-range segment consisting of a def or use insn a and a >>following use insn b, we can compare the number of calls seen for each insn, and >>if they're not equal there is at least one call between the 2 insns, and if the >>corresponding hard register is clobbered by calls, we spill after insn a and >>restore before insn b. >> >>That is too coarse-grained to use with our analysis, since we need to know which >>calls occur in between insn a and insn b, and more precisely which registers >>those calls clobbered. >>> > >>I wonder though if we can do something similar: we keep an array >>call_clobbers_num[FIRST_PSEUDO_REG], initialized at 0 when we start scanning. >>When encountering a call, we increase the call_clobbers_num entries for the hard >>registers clobbered by the call. >>When encountering a use, we set the call_clobbers_num field of the use to >>call_clobbers_num[reg_renumber[original_regno]]. >>And when looking at a live-range segment, we compare the clobbers_num field of >>insn a and insn b, and if it is not equal, the hard register was clobbered by at >>least one call between insn a and insn b. >>Would that work? WDYT? >> >>> >As I understand you looked at live-range splitting code in >>> >lra-constraints.c. To get necessary info you should look at >>> ira-lives.c. >> Unfortunately I haven't been able to find time to work further on the >> LRA part. >> So if you're still willing to pick up that part, that would be great. 
>
> Vladimir,
>
> I gave this a try. The attached patch works for the included test-case
> for x86_64.
>
> I've bootstrapped and reg-tested the patch (in combination with the
> other patches from the series) on x86_64.
>
> OK for stage1?
>
Yes, it is OK for stage1. Thanks for not forgetting LRA, and sorry for the delay with the answer (it is not a high-priority patch for me right now). I believe this patch also helps improve code because of better spilling into SSE regs. Spilling into SSE regs instead of memory is rare right now, since all SSE regs are call-clobbered.

Thanks again, Tom.
Re: RFA: patch to fix PR59787 (arm target)
Thanks for the hint Vladimir, I'll pass some validation on arm.c and arm.md/aarch64.md separately. On 14 January 2014 20:09, Vladimir Makarov wrote: > On 01/14/2014 01:41 PM, Yvan Roux wrote: >>> A quick grep of the arm backend shows 11 instances of reload_in_progress: >>> >>> arm.c: && !(reload_in_progress || reload_completed) >>> arm.c: if (! (reload_in_progress || reload_completed) >>> arm.c: if (! (reload_in_progress || reload_completed) >>> arm.c: if (! (reload_in_progress || reload_completed) >>> arm.c: reload_in_progress || reload_completed)) >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" >>> predicates.md:"offsettable_address_p (reload_completed | >>> reload_in_progress, >>> predicates.md: (and (match_test "reload_in_progress || >>> reload_completed") >>> >>> and aarch64 has five more: >>> >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> aarch64.md: "reload_completed || reload_in_progress" >>> >>> Yvan, could you do a quick audit on these to see if they are also likely >>> to need fixing? >> Yes, I'll check all of them. > I checked these places too. I'd do analogous change for only arm.c in > thumb1_legitimate_address_p, neon_vector_mem_operand, and > neon_struct_mem_operand. I guess it is a a bad idea to do it in > predicates.md. Changes arm.md and aarch64.md is worth to try but I > believe LRA will work without the changes.
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Tue, Jan 14, 2014 at 07:37:33PM +0100, Uros Bizjak wrote:
> OK, let's play safe. I'll revert these two changes (modulo size of
> nocona prefetch block).

Thanks.

> > opt we never return a smaller number from ix86_data_alignment than
> > we did in 4.8 and earlier, because otherwise if you have 4.8 compiled
> > code that assumes the alignment 4.8 would use for something that is defined
> > in a compilation unit built by gcc 4.9+, if we don't align it at least
> > as much as we did in the past, the linked mix of 4.8 user and 4.9 definition
> > could misbehave.
>
> From 4.9 onwards, we would like to align >= 64byte structures on
> 64byte boundary. Should we add a compatibility rule to align >= 32byte
> structures to 32 bytes?
>
> Please also note that in 4.7 and 4.8, we have
>
> int max_align = optimize_size ? BITS_PER_WORD : MIN (256,
> MAX_OFILE_ALIGNMENT);
>
> so, in effect -Os code will be incompatible with other optimization levels.

Well, the max_align is only one of the several possibilities of alignment increases, but yes, there is an ABI issue, see e.g. PR56564 for details.

> I guess that for 4.7 and 4.8, we should revert to this anyway, but
> what to do with 4.9?

For 4.9, if what you've added is what you want to do for performance reasons, then I'd do something like:

  /* GCC 4.8 and earlier used to incorrectly assume this alignment even
     for symbols from other compilation units or symbols that don't need
     to bind locally.  In order to preserve some ABI compatibility with
     those compilers, ensure we don't decrease alignment from what we
     used to assume.  */
  int max_align_compat
    = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);

  /* A data structure, equal or greater than the size of a cache line
     (64 bytes in the Pentium 4 and other recent Intel processors,
     including processors based on Intel Core microarchitecture) should
     be aligned so that its base address is a multiple of a cache line
     size.  */
  int max_align
    = MIN ((unsigned) ix86_tune_cost->prefetch_block * 8, MAX_OFILE_ALIGNMENT);
  if (max_align < BITS_PER_WORD)
    max_align = BITS_PER_WORD;

  if (opt
      && AGGREGATE_TYPE_P (type)
      && TYPE_SIZE (type)
      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST)
    {
      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align_compat
	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  && align < max_align_compat)
	align = max_align_compat;
      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  && align < max_align)
	align = max_align;
    }

That way, max_align will be purely optimization and can be changed as anyone wishes in the future, max_align_compat compatibility with pre-4.9 (beyond ABI) assumptions and !opt stuff the ABI mandated alignment.

	Jakub
Re: [PATCH, reginfo.c, i386.c] Backport fix for PR58139 to 4.8
On 01/14/2014 12:22 PM, Peter Bergner wrote: > The mainline fix for PR58139 which is a wrong code gen bug was > submitted here: > > http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00910.html > > and approved for mainline and 4.8 (after a few weeks) here: > > http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00134.html > > However, my fix exposed a latent x86 bug, so this patch was never > committed to 4.8. The latent x86 bug was fixed by Honza and I'd > like to now ask to be able to backport my fix for PR58139 along > with Honza's fix to 4.8. > > This passed bootstrap and regtesting on powerpc64-linux and > I bootstrapped this on x86_64 and verified that the ICE seen > when compiling the test case with only my patch in: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58269#c2 > > is fixed when we add Honza's patch. Ok for 4.8? > > reginfo.c change is ok for me. Thanks, Peter.
Re: RFA: patch to fix PR59787 (arm target)
On 01/14/2014 01:41 PM, Yvan Roux wrote:
>> A quick grep of the arm backend shows 11 instances of reload_in_progress:
>>
>> arm.c: && !(reload_in_progress || reload_completed)
>> arm.c: if (! (reload_in_progress || reload_completed)
>> arm.c: if (! (reload_in_progress || reload_completed)
>> arm.c: if (! (reload_in_progress || reload_completed)
>> arm.c: reload_in_progress || reload_completed))
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)"
>> predicates.md: "offsettable_address_p (reload_completed | reload_in_progress,
>> predicates.md: (and (match_test "reload_in_progress || reload_completed")
>>
>> and aarch64 has five more:
>>
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>> aarch64.md: "reload_completed || reload_in_progress"
>>
>> Yvan, could you do a quick audit on these to see if they are also likely
>> to need fixing?
> Yes, I'll check all of them.

I checked these places too. I'd make the analogous change only in arm.c, in thumb1_legitimate_address_p, neon_vector_mem_operand, and neon_struct_mem_operand. I guess it is a bad idea to do it in predicates.md. Changing arm.md and aarch64.md is worth a try, but I believe LRA will work without the changes.
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Tue, Jan 14, 2014 at 10:37 AM, Uros Bizjak wrote: > On Tue, Jan 14, 2014 at 6:09 PM, Jakub Jelinek wrote: > >>> On a second thought, the crossing of 16-byte boundaries is mentioned >>> for the data *access* (the instruction itself) if it is not naturally >>> aligned (please see example 3-40 and fig 3-2), which is *NOT* in our >>> case. >>> >>> So, we don't have to align 32 byte structures in any way for newer >>> processors, since this optimization applies to 64+ byte (larger or >>> equal to cache line size) structures only. Older processors are >>> handled correctly, modulo nocona, where its cache line size value has >>> to be corrected. >>> >>> Following that, my original patch implements this optimization in the >>> correct way. >> >> Sorry for catching this late, but on the 4.8 and earlier branches >> there is no opt argument and thus any ix86_data_alignment change is >> unfortunately an ABI change. So I'd think we should revert >> r206433 and r206436. And for the trunk we need to ensure even for > > OK, let's play safe. I'll revert these two changes (modulo size of > nocona prefetch block). > >> opt we never return a smaller number from ix86_data_alignment than >> we did in 4.8 and earlier, because otherwise if you have 4.8 compiled >> code that assumes the alignment 4.8 would use for something that is defined >> in a compilation unit built by gcc 4.9+, if we don't align it at least >> as much as we did in the past, the linked mix of 4.8 user and 4.9 definition >> could misbehave. > > From 4.9 onwards, we would like to align >= 64byte structures on > 64byte boundary. Should we add a compatibility rule to align >= 32byte > structures to 32 bytes? 
That is why we issue a warning when alignment was changed with AVX support: [hjl@gnu-6 tmp]$ cat a1.i typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__)); extern __m256i y; void f1(__m256i x) { y = x; } [hjl@gnu-6 tmp]$ gcc -S a1.i a1.i: In function ‘f1’: a1.i:4:1: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6 f1(__m256i x) ^ a1.i:4:1: warning: AVX vector argument without AVX enabled changes the ABI [enabled by default] [hjl@gnu-6 tmp]$ > Please also note that in 4.7 and 4.8, we have > > int max_align = optimize_size ? BITS_PER_WORD : MIN (256, > MAX_OFILE_ALIGNMENT); > > so, in effect -Os code will be incompatible with other optimization levels. > > I guess that for 4.7 and 4.8, we should revert to this anyway, but > what to do with 4.9? > > Uros. -- H.J.
Re: RFA: patch to fix PR59787 (arm target)
> A quick grep of the arm backend shows 11 instances of reload_in_progress: > > arm.c: && !(reload_in_progress || reload_completed) > arm.c: if (! (reload_in_progress || reload_completed) > arm.c: if (! (reload_in_progress || reload_completed) > arm.c: if (! (reload_in_progress || reload_completed) > arm.c: reload_in_progress || reload_completed)) > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" > predicates.md:"offsettable_address_p (reload_completed | > reload_in_progress, > predicates.md: (and (match_test "reload_in_progress || > reload_completed") > > and aarch64 has five more: > > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > aarch64.md: "reload_completed || reload_in_progress" > > Yvan, could you do a quick audit on these to see if they are also likely > to need fixing? Yes, I'll check all of them.
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Tue, Jan 14, 2014 at 6:09 PM, Jakub Jelinek wrote: >> On a second thought, the crossing of 16-byte boundaries is mentioned >> for the data *access* (the instruction itself) if it is not naturally >> aligned (please see example 3-40 and fig 3-2), which is *NOT* in our >> case. >> >> So, we don't have to align 32 byte structures in any way for newer >> processors, since this optimization applies to 64+ byte (larger or >> equal to cache line size) structures only. Older processors are >> handled correctly, modulo nocona, where its cache line size value has >> to be corrected. >> >> Following that, my original patch implements this optimization in the >> correct way. > > Sorry for catching this late, but on the 4.8 and earlier branches > there is no opt argument and thus any ix86_data_alignment change is > unfortunately an ABI change. So I'd think we should revert > r206433 and r206436. And for the trunk we need to ensure even for OK, let's play safe. I'll revert these two changes (modulo size of nocona prefetch block). > opt we never return a smaller number from ix86_data_alignment than > we did in 4.8 and earlier, because otherwise if you have 4.8 compiled > code that assumes the alignment 4.8 would use for something that is defined > in a compilation unit built by gcc 4.9+, if we don't align it at least > as much as we did in the past, the linked mix of 4.8 user and 4.9 definition > could misbehave. From 4.9 onwards, we would like to align >= 64byte structures on 64byte boundary. Should we add a compatibility rule to align >= 32byte structures to 32 bytes? Please also note that in 4.7 and 4.8, we have int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT); so, in effect -Os code will be incompatible with other optimization levels. I guess that for 4.7 and 4.8, we should revert to this anyway, but what to do with 4.9? Uros.
Re: RFA: patch to fix PR59787 (arm target)
On 14/01/14 16:48, Vladimir Makarov wrote: > The following patch fixes > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 > > The problem of LRA looping was in that address with soft frame pointer > was rejected as valid constraint in iwmmxt_amd_movdi insn. > > Ok to commit? > > 2014-01-14 Vladimir Makarov > > PR target/59787 > * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress. A quick grep of the arm backend shows 11 instances of reload_in_progress: arm.c: && !(reload_in_progress || reload_completed) arm.c: if (! (reload_in_progress || reload_completed) arm.c: if (! (reload_in_progress || reload_completed) arm.c: if (! (reload_in_progress || reload_completed) arm.c: reload_in_progress || reload_completed)) arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" arm.md: "TARGET_32BIT && (reload_in_progress || reload_completed)" predicates.md:"offsettable_address_p (reload_completed | reload_in_progress, predicates.md: (and (match_test "reload_in_progress || reload_completed") and aarch64 has five more: aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" aarch64.md: "reload_completed || reload_in_progress" Yvan, could you do a quick audit on these to see if they are also likely to need fixing? R.
Re: RFA: patch to fix PR59787 (arm target)
On 14/01/14 16:48, Vladimir Makarov wrote: > The following patch fixes > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 > > The problem of LRA looping was in that address with soft frame pointer > was rejected as valid constraint in iwmmxt_amd_movdi insn. > > Ok to commit? > > 2014-01-14 Vladimir Makarov > > PR target/59787 > * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress. > > OK. R.
Re: [C PATCH] Disallow subtracting pointers to empty structs (PR c/58346)
On 01/13/2014 09:48 PM, Marek Polacek wrote: +bool +pointer_to_zero_sized_aggr_p (tree t) +{ + t = strip_pointer_operator (t); + if (RECORD_OR_UNION_TYPE_P (t) + && TYPE_SIZE (t) + && integer_zerop (TYPE_SIZE (t))) +return true; + return false; +} I think you can just return the value of the condition; there's no need for the if statement. -- Florian Weimer / Red Hat Product Security Team
Re: RFA: patch to fix PR59787 (arm target)
Hi Vladimir, I have tested your patch with the following configuration, and it indeed eliminates the ICE: --target=arm-none-linux-gnueabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=softfp Thank you! Kind regards, Renlin Li On 14/01/14 16:48, Vladimir Makarov wrote: The following patch fixes http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 The problem of LRA looping was in that address with soft frame pointer was rejected as valid constraint in iwmmxt_amd_movdi insn. Ok to commit? 2014-01-14 Vladimir Makarov PR target/59787 * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress.
[PATCH, reginfo.c, i386.c] Backport fix for PR58139 to 4.8
The mainline fix for PR58139 which is a wrong code gen bug was submitted here: http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00910.html and approved for mainline and 4.8 (after a few weeks) here: http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00134.html However, my fix exposed a latent x86 bug, so this patch was never committed to 4.8. The latent x86 bug was fixed by Honza and I'd like to now ask to be able to backport my fix for PR58139 along with Honza's fix to 4.8. This passed bootstrap and regtesting on powerpc64-linux and I bootstrapped this on x86_64 and verified that the ICE seen when compiling the test case with only my patch in: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58269#c2 is fixed when we add Honza's patch. Ok for 4.8? Peter Backport from mainline 2013-09-06 Jan Hubicka * config/i386/i386.c (ix86_hard_regno_mode_ok): AVX modes are valid only when AVX is enabled. 2013-09-05 Peter Bergner PR target/58139 * reginfo.c (choose_hard_reg_mode): Scan through all mode classes looking for widest mode. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206582) +++ gcc/config/i386/i386.c (working copy) @@ -33944,7 +33944,7 @@ ix86_hard_regno_mode_ok (int regno, enum are available. OImode move is available only when AVX is enabled. */ return ((TARGET_AVX && mode == OImode) - || VALID_AVX256_REG_MODE (mode) + || (TARGET_AVX && VALID_AVX256_REG_MODE (mode)) || VALID_SSE_REG_MODE (mode) || VALID_SSE2_REG_MODE (mode) || VALID_MMX_REG_MODE (mode) Index: gcc/reginfo.c === --- gcc/reginfo.c (revision 206582) +++ gcc/reginfo.c (working copy) @@ -620,40 +620,35 @@ choose_hard_reg_mode (unsigned int regno mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! 
HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; - if (found_mode != VOIDmode) -return found_mode; - for (mode = GET_CLASS_NARROWEST_MODE (MODE_FLOAT); mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; - if (found_mode != VOIDmode) -return found_mode; - for (mode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_FLOAT); mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; - if (found_mode != VOIDmode) -return found_mode; - for (mode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT); mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode)) if ((unsigned) hard_regno_nregs[regno][mode] == nregs && HARD_REGNO_MODE_OK (regno, mode) - && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode))) + && (! call_saved || ! HARD_REGNO_CALL_PART_CLOBBERED (regno, mode)) + && GET_MODE_SIZE (mode) > GET_MODE_SIZE (found_mode)) found_mode = mode; if (found_mode != VOIDmode)
Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
On Fri, Jan 03, 2014 at 05:04:39PM +0100, Uros Bizjak wrote: > On a second thought, the crossing of 16-byte boundaries is mentioned > for the data *access* (the instruction itself) if it is not naturally > aligned (please see example 3-40 and fig 3-2), which is *NOT* in our > case. > > So, we don't have to align 32 byte structures in any way for newer > processors, since this optimization applies to 64+ byte (larger or > equal to cache line size) structures only. Older processors are > handled correctly, modulo nocona, where its cache line size value has > to be corrected. > > Following that, my original patch implements this optimization in the > correct way. Sorry for catching this late, but on the 4.8 and earlier branches there is no opt argument and thus any ix86_data_alignment change is unfortunately an ABI change. So I'd think we should revert r206433 and r206436. And for the trunk we need to ensure even for opt we never return a smaller number from ix86_data_alignment than we did in 4.8 and earlier, because otherwise if you have 4.8 compiled code that assumes the alignment 4.8 would use for something that is defined in a compilation unit built by gcc 4.9+, if we don't align it at least as much as we did in the past, the linked mix of 4.8 user and 4.9 definition could misbehave. Jakub
[PATCH][buildrobot] PR59496: Fix unused variable warning
Hi! In the buildrobot's logs for building with config-list.mk, I noticed this warning: g++ -c -DIN_GCC_FRONTEND -DIN_GCC_FRONTEND -g -O2 -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common -DHAVE_CONFIG_H -I. -Ic-family -I../../../gcc/gcc -I../../../gcc/gcc/c-family -I../../../gcc/gcc/../include -I../../../gcc/gcc/../libcpp/include -I/opt/cfarm/mpc/include -I../../../gcc/gcc/../libdecnumber -I../../../gcc/gcc/../libdecnumber/dpd -I../libdecnumber -I../../../gcc/gcc/../libbacktrace-o c-family/c-common.o -MT c-family/c-common.o -MMD -MP -MF c-family/.deps/c-common.TPo ../../../gcc/gcc/c-family/c-common.c ../../../gcc/gcc/c-family/c-common.c: In function ‘tree_node* c_sizeof_or_alignof_type(location_t, tree, bool, bool, int)’: ../../../gcc/gcc/c-family/c-common.c:5007:9: error: unused variable ‘field’ [-Werror=unused-variable] tree field = build_decl (UNKNOWN_LOCATION, FIELD_DECL, NULL_TREE, ^ cc1plus: all warnings being treated as errors make[2]: *** [c-family/c-common.o] Error 1 There were additional comments on the PR recently, I suggest this patch to plug it: 2014-01-14 Jan-Benedict Glaw PR bootstrap/59496 * config/rs6000/darwin.h (ADJUST_FIELD_ALIGN): Fix unused variable warning. diff --git a/gcc/config/rs6000/darwin.h b/gcc/config/rs6000/darwin.h index 43a2ab5..fc1f862 100644 --- a/gcc/config/rs6000/darwin.h +++ b/gcc/config/rs6000/darwin.h @@ -328,9 +328,10 @@ extern int darwin_emit_branch_islands; behavior is dealt with by darwin_rs6000_special_round_type_align. */ #define ADJUST_FIELD_ALIGN(FIELD, COMPUTED)\ - (TARGET_ALIGN_NATURAL ? (COMPUTED) \ - : (COMPUTED) == 128 ? 128 \ - : MIN ((COMPUTED), 32)) + ((void) (FIELD), \ + (TARGET_ALIGN_NATURAL ? (COMPUTED) \ +: (COMPUTED) == 128 ? 
128 \ +: MIN ((COMPUTED), 32))) /* Darwin increases natural record alignment to doubleword if the first field is an FP double while the FP fields remain word aligned. */ -- Jan-Benedict Glaw jbg...@lug-owl.de +49-172-7608481 Signature of: GDB has a 'break' feature; why doesn't it have 'fix' too?
Re: [Patch, xtensa] Add section anchor support for the xtensa backend.
On Tue, Jan 14, 2014 at 7:20 AM, Felix Yang wrote: > Hi Sterling, > > I found that we can avoid emitting excessive literal loading > instructions with section anchors. > This patch also passed the cases in testsuite/gcc.c-torture/execute/ dir. > Please apply it if OK for trunk. Hi Felix, It's been a while since I dealt with it, but literals are produced by gcc carefully such that they are arranged by pages and such to make certain that certain linux loader operations are convenient. (Marc, I believe, knows the details.) Marc, does this rearrangement of the literals make a difference on that side? Sterling > > > Index: gcc/ChangeLog > === > --- gcc/ChangeLog(revision 206599) > +++ gcc/ChangeLog(working copy) > @@ -1,3 +1,11 @@ > +2014-01-14 Felix Yang > + > +* common/config/xtensa/xtensa-common.c > +(xtensa_option_optimization_table): Enable -fsection-anchors under -O1 > +or plus, and disable -fcommon by default. > +* config/xtensa/xtensa.c (TARGET_MAX_ANCHOR_OFFSET): New. > +(TARGET_MIN_ANCHOR_OFFSET): Ditto. > + > 2014-01-14 Richard Biener > > PR tree-optimization/58921 > Index: gcc/common/config/xtensa/xtensa-common.c > === > --- gcc/common/config/xtensa/xtensa-common.c(revision 206599) > +++ gcc/common/config/xtensa/xtensa-common.c(working copy) > @@ -35,6 +35,13 @@ static const struct default_options xtensa_option_ > assembler, so GCC cannot do a good job of reordering blocks. > Do not enable reordering unless it is explicitly requested. */ > { OPT_LEVELS_ALL, OPT_freorder_blocks, NULL, 0 }, +/* Enable section anchors under -O1 or plus. This can avoid generating + excessive literal loading instructions to load addresses of globals. > */ +{ OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 2 }, +/* Allocate uninitialized global variables in the data section of object + file, rather than generating them as common blocks. This is required + for section anchors to work on uninitialized globals.
*/ > +{ OPT_LEVELS_ALL, OPT_fcommon, NULL, 0 }, > { OPT_LEVELS_NONE, 0, NULL, 0 } >}; > > Index: gcc/config/xtensa/xtensa.c > === > --- gcc/config/xtensa/xtensa.c(revision 206599) > +++ gcc/config/xtensa/xtensa.c(working copy) > @@ -290,6 +290,12 @@ static const int reg_nonleaf_alloc_order[FIRST_PSE > #undef TARGET_CANNOT_FORCE_CONST_MEM > #define TARGET_CANNOT_FORCE_CONST_MEM xtensa_cannot_force_const_mem > > +#undef TARGET_MAX_ANCHOR_OFFSET > +#define TARGET_MAX_ANCHOR_OFFSET 255 > + > +#undef TARGET_MIN_ANCHOR_OFFSET > +#define TARGET_MIN_ANCHOR_OFFSET 0 > + > #undef TARGET_LEGITIMATE_ADDRESS_P > #define TARGET_LEGITIMATE_ADDRESS_Pxtensa_legitimate_address_p > > > Cheers, > Felix
RFA: patch to fix PR59787 (arm target)
The following patch fixes http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59787 The problem of LRA looping was that an address with the soft frame pointer was rejected as a valid constraint in the iwmmxt_amd_movdi insn. Ok to commit? 2014-01-14 Vladimir Makarov PR target/59787 * config/arm/arm.c (arm_coproc_mem_operand): Add lra_in_progress. Index: config/arm/arm.c === --- config/arm/arm.c (revision 206579) +++ config/arm/arm.c (working copy) @@ -12439,7 +12439,7 @@ arm_coproc_mem_operand (rtx op, bool wb) rtx ind; /* Reject eliminable registers. */ - if (! (reload_in_progress || reload_completed) + if (! (reload_in_progress || reload_completed || lra_in_progress) && ( reg_mentioned_p (frame_pointer_rtx, op) || reg_mentioned_p (arg_pointer_rtx, op) || reg_mentioned_p (virtual_incoming_args_rtx, op)
Re: [Patch, xtensa] Add LOCAL_REGNO to the xtensa backend.
On Tue, Jan 14, 2014 at 7:14 AM, Felix Yang wrote: > Hi Sterling, > > The xtensa backend uses register windows, and we need to define > LOCAL_REGNO for it. Hi Felix, How does this change the produced code? In particular, please identify a problem this patch is solving. I know that the documentation for the macro says to define it if you have register windows, but things aren't that straightforward. The original register windowing support in GCC went in to support Sparc and, iirc, Intel i960. There are subtle differences between the two ISAs that aren't entirely captured in the GCC infrastructure. Note that in Sparc the window isn't modified until the save and restore instructions are executed, whereas in Xtensa, the save and restores happen combined with call and return. All of this makes for very subtle data-flow issues around certain registers--in particular, a8 (which functions as both an argument register and the frame pointer.) Passing gcc/torture/... isn't enough to establish that this works. It needs a full set of ABI tests to be established as correct. What we have there works perfectly well. Your patch may indeed improve the generated code, or be more correct, but please identify the problem you are solving and show how this patch improves things. Thanks, Sterling
RFA: Fix assembler data directives emitted for variable length structures
Hi Guys, Several PRs (28865, 57180 and 59719) have reported that the assembler directives emitted by gcc to describe a structure with a variable length field occupy too much space. That is a serious problem for targets which use section anchors as they rely upon objects in the data area being exactly the size that they are supposed to be. The attached patch fixes the problem by updating the output_constant() function so that it returns the number of bytes that it really did emit, which may be larger than the number of bytes that it was requested to emit. It also adds a couple of tests to the testsuite to check that the desired behaviour is achieved. Tested without regressions on i686-pc-linux-gnu and aarch64-elf toolchains, as well as bootstrapping and regression testing a powerpc64-linux toolchain. OK to apply ? Cheers Nick gcc/ChangeLog 2014-01-13 Nick Clifton PR middle-end/28865 * varasm.c (output_constant): Return the number of bytes actually emitted. (output_constructor_array_range): Update the field size with the number of bytes emitted by output_constant. (output_constructor_regular_field): Likewise. Also do not complain if the total number of bytes emitted is now greater than the expected fieldpos. * output.h (output_constant): Update prototype and descriptive comment. gcc/testsuite/ChangeLog 2014-01-13 Nick Clifton PR middle-end/28865 * gcc.c-torture/compile/pr28865.c: New. * gcc.c-torture/execute/pr28865.c: New. Index: gcc/output.h === --- gcc/output.h (revision 206572) +++ gcc/output.h (working copy) @@ -294,11 +294,13 @@ This includes the pseudo-op such as ".int" or ".byte", and a newline. Assumes output_addressed_constants has been done on EXP already. - Generate exactly SIZE bytes of assembler data, padding at the end - with zeros if necessary. SIZE must always be specified. + Generate at least SIZE bytes of assembler data, padding at the end + with zeros if necessary. SIZE must always be specified. 
The returned + value is the actual number of bytes of assembler data generated, which + may be bigger than SIZE if the object contains a variable length field. ALIGN is the alignment in bits that may be assumed for the data. */ -extern void output_constant (tree, unsigned HOST_WIDE_INT, unsigned int); +extern unsigned HOST_WIDE_INT output_constant (tree, unsigned HOST_WIDE_INT, unsigned int); /* When outputting delayed branch sequences, this rtx holds the sequence being output. It is null when no delayed branch Index: gcc/varasm.c === --- gcc/varasm.c (revision 206572) +++ gcc/varasm.c (working copy) @@ -4584,8 +4584,10 @@ This includes the pseudo-op such as ".int" or ".byte", and a newline. Assumes output_addressed_constants has been done on EXP already. - Generate exactly SIZE bytes of assembler data, padding at the end - with zeros if necessary. SIZE must always be specified. + Generate at least SIZE bytes of assembler data, padding at the end + with zeros if necessary. SIZE must always be specified. The returned + value is the actual number of bytes of assembler data generated, which + may be bigger than SIZE if the object contains a variable length field. SIZE is important for structure constructors, since trailing members may have been omitted from the constructor. @@ -4600,14 +4602,14 @@ ALIGN is the alignment of the data in bits. */ -void +unsigned HOST_WIDE_INT output_constant (tree exp, unsigned HOST_WIDE_INT size, unsigned int align) { enum tree_code code; unsigned HOST_WIDE_INT thissize; if (size == 0 || flag_syntax_only) -return; +return size; /* See if we're trying to initialize a pointer in a non-default mode to the address of some declaration somewhere. If the target says @@ -4672,7 +4674,7 @@ && vec_safe_is_empty (CONSTRUCTOR_ELTS (exp))) { assemble_zeros (size); - return; + return size; } if (TREE_CODE (exp) == FDESC_EXPR) @@ -4684,7 +4686,7 @@ #else gcc_unreachable (); #endif - return; + return size; } /* Now output the underlying data. 
If we've handling the padding, return. @@ -4723,8 +4725,7 @@ switch (TREE_CODE (exp)) { case CONSTRUCTOR: - output_constructor (exp, size, align, NULL); - return; + return output_constructor (exp, size, align, NULL); case STRING_CST: thissize = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size); @@ -4752,11 +4753,10 @@ case RECORD_TYPE: case UNION_TYPE: gcc_assert (TREE_CODE (exp) == CONSTRUCTOR); - output_constructor (exp, size, align, NULL); - return; + return output_constructor (exp, size, align, NULL); case
Re: PATCH: PR target/59794: [4.7/4.8/4.9 Regression] i386 backend fails to detect MMX/SSE/AVX ABI changes
On Tue, Jan 14, 2014 at 3:18 PM, H.J. Lu wrote: > There are several problems with i386 MMX/SSE/AVX ABI change detection: > > 1. MMX/SSE return value isn't checked for -m32 since revision 83533: > > http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=83533 > > which added ix86_struct_value_rtx. Since MMX/SSE condition is always > false, the MMX/SSE return value ABI change is disabled. > 2. For -m32, the same warning on MMX/SSE argument is issued twice, one from > type_natural_mode and one from function_arg_32. > 3. AVX return value ABI change isn't checked. > > This patch does followings: > > 1. Remove the ineffective ix86_struct_value_rtx. > 2. Add a bool parameter to indicate if type is used for function return > value. Warn ABI change if the vector mode isn't available for function > return value. Add AVX function return value ABI change warning. > 3. Consolidate ABI change warning into type_natural_mode. > 4. Update g++.dg/ext/vector23.C to prune ABI change for Linux/x86 > added by the AVX function return value ABI change warning. > 5. Update gcc.target/i386/pr39162.c to avoid the AVX function return > value ABI change warning. > 6. Add testcases for warning MMX/SSE/AVX ABI changes in parameter > passing and function return. > > Tested on Linux/x86-64 with -m32/-m64 for "make check". OK to install? > > Thanks. > > H.J. > --- > gcc/ > > 2014-01-14 H.J. Lu > > PR target/59794 > * config/i386/i386.c (type_natural_mode): Add a bool parameter > to indicate if type is used for function return value. Warn > ABI change if the vector mode isn't available for function > return value. > (ix86_function_arg_advance): Pass false to type_natural_mode. > (ix86_function_arg): Likewise. > (ix86_gimplify_va_arg): Likewise. > (function_arg_32): Don't warn ABI change. > (ix86_function_value): Pass true to type_natural_mode. > (ix86_return_in_memory): Likewise. > (ix86_struct_value_rtx): Removed. > (TARGET_STRUCT_VALUE_RTX): Likewise. > > gcc/testsuite/ > > 2014-01-14 H.J. 
Lu > > PR target/59794 > * g++.dg/ext/vector23.C: Also prune ABI change for Linux/x86. > * gcc.target/i386/pr39162.c (y): New __m256i variable. > (bar): Change return type to void. Set y to x. > * gcc.target/i386/pr59794-1.c: New testcase. > * gcc.target/i386/pr59794-2.c: Likewise. > * gcc.target/i386/pr59794-3.c: Likewise. > * gcc.target/i386/pr59794-4.c: Likewise. > * gcc.target/i386/pr59794-5.c: Likewise. > * gcc.target/i386/pr59794-6.c: Likewise. > * gcc.target/i386/pr59794-7.c: Likewise. OK for mainline and release branches after a couple of days. Thanks, Uros.
Re: [PATCH i386 10/8] [AVX512] Add missing AVX-512ER patterns, intrinsics, tests.
On Sun, Jan 12, 2014 at 10:02 PM, Kirill Yukhin wrote: > Hello, > On 11 Jan 12:42, Uros Bizjak wrote: >> On Fri, Jan 10, 2014 at 5:24 PM, Jakub Jelinek wrote: >> > This means you should ensure aligned_mem will be set for >> > CODE_FOR_avx512f_movntdqa in ix86_expand_special_args_builtin. > Fixed. Updated patch in the bottom. > >> > Leaving the rest of review to Uros/Richard. >> >> The rest is OK. > Thanks! I'll check it in tomorrow if no more issues! > It caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59808 H.J.
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 02:49:30PM +, Richard Earnshaw wrote: > On 14/01/14 14:32, Jakub Jelinek wrote: > > Anyway, the above is really a simple case, and I'd call it a > > backend bug if it isn't able to generate good code out of that. > > Exactly which back-end pass are you expecting to simplify > > (set (subreg:SI (reg:HI 1) 0) (and:SI (subreg:SI (reg:HI 0) 0) > (const_int 2))) Well, already at the expansion time you know here that & 2 is zero-extended, so you can/should emit here (set (reg:SI 1)) (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2))) instead. And/or combiner, and/or what Kugan has been working on, to use the remembered SSA_NAME range info during expansion. And/or, as I said earlier, we can have a type promotion pass driven by backend properties. But not narrowing types anywhere will mean say the various issues mentioned in PR45397. For vectorizations we want to decrease the number of different type widths in a loop as much as possible, etc. Jakub
Re: [Patch AArch64] Implement Vector Permute Support
> On Jan 14, 2014, at 7:19 AM, Alex Velenko wrote: > > Hi, > > This patch turns off the vec_perm patterns for aarch64_be, this should resolve > the issue highlighted here > http://gcc.gnu.org/ml/gcc-patches/2014-01/msg00321.html > With this patch applied, the test case provided in that link compiles without > an ICE. > > However, the Big-Endian port is still in development. This patch exposes > another known but unrelated issue with Big-Endian Large-Int modes. > > The patch has been tested on aarch64-none-elf and aarch64_be-none-elf > resulting in five > further regression due to the broken implementation of Big-Endian Large-Int > modes. > > Kind regards, > Alex Velenko > > gcc/ > > 2014-01-14 Alex Velenko > >* config/aarch64/aarch64-simd.md (vec_perm): Add BE check. >* config/aarch64/aarch64.c (aarch64_expand_vec_perm): Add comment. > > gcc/testsuite/ > > 2014-01-14 Alex Velenko > >* lib/target-supports.exp >(check_effective_target_vect_perm): Exclude aarch64_be. >(check_effective_target_vect_perm_byte): Likewise. >(check_effective_target_vect_perm_short): Likewise. I think you want to use a function to check if the target is effectively big-endian instead. Internally at Cavium, our elf compiler has big-endian multi-lib. Thanks, Andrew > >
Re: [wide-int] resolve bootstrap issue
Mike Stump writes: > diff --git a/gcc/expmed.c b/gcc/expmed.c > index ce063eb..720d8c1 100644 > --- a/gcc/expmed.c > +++ b/gcc/expmed.c > @@ -4963,6 +4963,7 @@ make_tree (tree type, rtx x) >return t; > > case CONST_DOUBLE: > + gcc_assert (HOST_BITS_PER_WIDE_INT * 2 <= MAX_BITSIZE_MODE_ANY_INT); >if (TARGET_SUPPORTS_WIDE_INT == 0 && GET_MODE (x) == VOIDmode) > t = wide_int_to_tree (type, > wide_int::from_array (&CONST_DOUBLE_LOW (x), 2, I think this would be better as a STATIC_ASSERT. > @@ -1440,10 +1442,10 @@ real_to_integer (const REAL_VALUE_TYPE *r, bool > *fail, int precision) > } > #endif >w = SIGSZ * HOST_BITS_PER_LONG + words * HOST_BITS_PER_WIDE_INT; > - result = wide_int::from_array > + tmp = real_int::from_array > (val, (w + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT, w); > - result = wi::lrshift (result, (words * HOST_BITS_PER_WIDE_INT) - exp); > - result = wide_int::from (result, precision, UNSIGNED); > + tmp = wi::lrshift (tmp, (words * HOST_BITS_PER_WIDE_INT) - > exp); > + result = wide_int::from (tmp, precision, UNSIGNED); Why did you need the real_int? It was supposed to work without. > diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c > index 00b5439..7c21afa 100644 > --- a/gcc/simplify-rtx.c > +++ b/gcc/simplify-rtx.c > @@ -5384,6 +5384,7 @@ simplify_immed_subreg (enum machine_mode outermode, rtx > op, > tmp[u] = buf; > base += HOST_BITS_PER_WIDE_INT; > } > + gcc_assert (GET_MODE_PRECISION (outer_submode) <= > MAX_BITSIZE_MODE_ANY_INT); > r = wide_int::from_array (tmp, units, > GET_MODE_PRECISION (outer_submode)); > elems[elem] = immed_wide_int_const (r, outer_submode); Long line. Looks good to me otherwise FWIW. Thanks, Richard
[Patch, xtensa] Add section anchor support for the xtensa backend.
Hi Sterling, I found that we can avoid emitting excessive literal loading instructions with section anchors. This patch also passed the cases in testsuite/gcc.c-torture/execute/ dir. Please apply it if OK for trunk. Index: gcc/ChangeLog === --- gcc/ChangeLog(revision 206599) +++ gcc/ChangeLog(working copy) @@ -1,3 +1,11 @@ +2014-01-14 Felix Yang + +* common/config/xtensa/xtensa-common.c +(xtensa_option_optimization_table): Enable -fsection-anchors under -O1 +or plus, and disable -fcommon by default. +* config/xtensa/xtensa.c (TARGET_MAX_ANCHOR_OFFSET): New. +(TARGET_MIN_ANCHOR_OFFSET): Ditto. + 2014-01-14 Richard Biener PR tree-optimization/58921 Index: gcc/common/config/xtensa/xtensa-common.c === --- gcc/common/config/xtensa/xtensa-common.c(revision 206599) +++ gcc/common/config/xtensa/xtensa-common.c(working copy) @@ -35,6 +35,13 @@ static const struct default_options xtensa_option_ assembler, so GCC cannot do a good job of reordering blocks. Do not enable reordering unless it is explicitly requested. */ { OPT_LEVELS_ALL, OPT_freorder_blocks, NULL, 0 }, +/* Enable section anchors under -O1 or plus. This can avoid generating + excessive literal loading instructions to load addresses of globals. */ +{ OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 2 }, +/* Allocate uninitialized global variables in the data section of object + file, rather than generating them as common blocks. This is required + for section anchors to work on uninitialized globals. 
*/ +{ OPT_LEVELS_ALL, OPT_fcommon, NULL, 0 }, { OPT_LEVELS_NONE, 0, NULL, 0 } }; Index: gcc/config/xtensa/xtensa.c === --- gcc/config/xtensa/xtensa.c(revision 206599) +++ gcc/config/xtensa/xtensa.c(working copy) @@ -290,6 +290,12 @@ static const int reg_nonleaf_alloc_order[FIRST_PSE #undef TARGET_CANNOT_FORCE_CONST_MEM #define TARGET_CANNOT_FORCE_CONST_MEM xtensa_cannot_force_const_mem +#undef TARGET_MAX_ANCHOR_OFFSET +#define TARGET_MAX_ANCHOR_OFFSET 255 + +#undef TARGET_MIN_ANCHOR_OFFSET +#define TARGET_MIN_ANCHOR_OFFSET 0 + #undef TARGET_LEGITIMATE_ADDRESS_P #define TARGET_LEGITIMATE_ADDRESS_Pxtensa_legitimate_address_p Cheers, Felix
Re: [Patch AArch64] Implement Vector Permute Support
Hi, This patch turns off the vec_perm patterns for aarch64_be; this should resolve the issue highlighted here http://gcc.gnu.org/ml/gcc-patches/2014-01/msg00321.html With this patch applied, the test case provided in that link compiles without an ICE. However, the Big-Endian port is still in development. This patch exposes another known but unrelated issue with Big-Endian Large-Int modes. The patch has been tested on aarch64-none-elf and aarch64_be-none-elf resulting in five further regressions due to the broken implementation of Big-Endian Large-Int modes. Kind regards, Alex Velenko gcc/ 2014-01-14 Alex Velenko * config/aarch64/aarch64-simd.md (vec_perm): Add BE check. * config/aarch64/aarch64.c (aarch64_expand_vec_perm): Add comment. gcc/testsuite/ 2014-01-14 Alex Velenko * lib/target-supports.exp (check_effective_target_vect_perm): Exclude aarch64_be. (check_effective_target_vect_perm_byte): Likewise. (check_effective_target_vect_perm_short): Likewise. diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index bc47a291de4b9b24d829e4dbf060fff7a321558f..43a9c5b27d78a47cf965636a03232005a4c8e7c3 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -3840,7 +3840,7 @@ (match_operand:VB 1 "register_operand") (match_operand:VB 2 "register_operand") (match_operand:VB 3 "register_operand")] - "TARGET_SIMD" + "TARGET_SIMD && !BYTES_BIG_ENDIAN" { aarch64_expand_vec_perm (operands[0], operands[1], operands[2], operands[3]); diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index 95360089b89d5fef2997dc6dbe7f47a6864143ea..084668af5124aa1c4a7f25495cf44b52811d0e62 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -3417,7 +3417,8 @@ proc check_effective_target_vect_perm { } { } else { set et_vect_perm_saved 0 if { [is-effective-target arm_neon_ok] - || [istarget aarch64*-*-*] + || ([istarget aarch64*-*-*] + && ![istarget 
aarch64_be*-*-*]) || [istarget powerpc*-*-*] || [istarget spu-*-*] || [istarget i?86-*-*] @@ -3445,7 +3446,8 @@ proc check_effective_target_vect_perm_byte { } { set et_vect_perm_byte_saved 0 if { ([is-effective-target arm_neon_ok] && [is-effective-target arm_little_endian]) - || [istarget aarch64*-*-*] + || ([istarget aarch64*-*-*] + && ![istarget aarch64_be*-*-*]) || [istarget powerpc*-*-*] || [istarget spu-*-*] } { set et_vect_perm_byte_saved 1 @@ -3469,7 +3471,8 @@ proc check_effective_target_vect_perm_short { } { set et_vect_perm_short_saved 0 if { ([is-effective-target arm_neon_ok] && [is-effective-target arm_little_endian]) - || [istarget aarch64*-*-*] + || ([istarget aarch64*-*-*] + && ![istarget aarch64_be*-*-*]) || [istarget powerpc*-*-*] || [istarget spu-*-*] } { set et_vect_perm_short_saved 1
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On 14/01/14 14:49, Richard Earnshaw wrote: > On 14/01/14 14:32, Jakub Jelinek wrote: >> Anyway, the above is really a simple case, and I'd call it a >> backend bug if it isn't able to generate good code out of that. > > Exactly which back-end pass are you expecting to simplify > > (set (subreg:SI (reg:HI 1) 0) (and:SI (subreg:SI (reg:HI 0) 0) > (const_int 2))) > > (set (reg:SI 2) (zero_extend:SI (reg:HI 1))) > > (set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0))) > > into > > (set (reg:SI 2) (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2))) > > (set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0))) > > Combine is about the only pass that does this sort of thing, and that's > far too often confused by extraneous information that it thinks might be > helpful, but isn't, or by the fact that the intermediate result is used > more than once. > > R. > Consider this case: struct b2Body { unsigned short flags; int type; }; _Bool IsAwake(short *a, struct b2Body *b) { int c; c = b->flags & 2; *a = c; return c == 2; } There's a redundant extend operation left in here on ARM, MIPS & PPC. ARM: ldrh r3, [r1] and r3, r3, #2 uxth r3, r3 // Redundant strh r3, [r0] adds r0, r3, #0 movne r0, #1 bx lr MIPS: lhu $2,0($5) nop andi $2,$2,0x2 andi $2,$2,0x // Redundant sh $2,0($4) j $31 sltu $2,$0,$2 PPC: lhz 9,0(4) rlwinm 9,9,0,30,30 rlwinm 9,9,0,0x // Redundant sth 9,0(3) addic 10,9,-1 subfe 3,10,9 blr So if this is a backend issue, it's widespread on word-based machines. R.
[Patch, xtensa] Add LOCAL_REGNO to the xtensa backend.
Hi Sterling, The xtensa backend uses register windows, and we need to define LOCAL_REGNO for it. Without this macro, the dataflow information may not be accurate. This patch passed the cases in testsuite/gcc.c-torture/execute dir. Please apply it if OK for trunk. Thanks. Index: gcc/ChangeLog === --- gcc/ChangeLog(revision 206599) +++ gcc/ChangeLog(working copy) @@ -1,3 +1,7 @@ +2014-01-14 Felix Yang + +* config/xtensa/xtensa.h (LOCAL_REGNO): New. + 2014-01-14 Richard Biener PR tree-optimization/58921 Index: gcc/config/xtensa/xtensa.h === --- gcc/config/xtensa/xtensa.h(revision 206599) +++ gcc/config/xtensa/xtensa.h(working copy) @@ -369,7 +369,14 @@ extern char xtensa_hard_regno_mode_ok[][FIRST_PSEU ((unsigned) ((IN) - GP_REG_FIRST) < WINDOW_SIZE)) ?\ (IN) + WINDOW_SIZE : (IN)) +/* Define this macro if the target machine has register windows. This + C expression returns true if the register is call-saved but is in the + register window. */ +#define LOCAL_REGNO(REGNO)\ + (GP_REG_P (REGNO) && ((unsigned) (REGNO - GP_REG_FIRST) < WINDOW_SIZE)) + + /* Define the classes of registers for register constraints in the machine description. */ enum reg_class Cheers, Felix
[gomp4 6/6] Enable initial support in the C front end for OpenACC data clauses.
From: Thomas Schwinge gcc/c/ * c-parser.c (OACC_PARALLEL_CLAUSE_MASK): Add PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYIN, PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_PRESENT, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, and PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE. gcc/testsuite/ * c-c++-common/goacc/data-clause-duplicate-1.c: New file. * c-c++-common/goacc/deviceptr-1.c: New file. libgomp/ * testsuite/libgomp.oacc-c/parallel-1.c: Extend. --- gcc/c/c-parser.c | 14 +- .../c-c++-common/goacc/data-clause-duplicate-1.c | 13 ++ gcc/testsuite/c-c++-common/goacc/deviceptr-1.c | 64 + libgomp/testsuite/libgomp.oacc-c/parallel-1.c | 150 +++-- 4 files changed, 228 insertions(+), 13 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c create mode 100644 gcc/testsuite/c-c++-common/goacc/deviceptr-1.c diff --git gcc/c/c-parser.c gcc/c/c-parser.c index 48c55e6..d6a2af0 100644 --- gcc/c/c-parser.c +++ gcc/c/c-parser.c @@ -11225,8 +11225,17 @@ c_parser_omp_structured_block (c_parser *parser) LOC is the location of the #pragma token. 
*/ -#define OACC_PARALLEL_CLAUSE_MASK \ - (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_NONE) +#define OACC_PARALLEL_CLAUSE_MASK \ + ( (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_COPY) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_COPYIN) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_COPYOUT) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_CREATE) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_DEVICEPTR)\ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN)\ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE) ) static tree c_parser_oacc_parallel (location_t loc, c_parser *parser) @@ -11235,7 +11244,6 @@ c_parser_oacc_parallel (location_t loc, c_parser *parser) clauses = c_parser_oacc_all_clauses (parser, OACC_PARALLEL_CLAUSE_MASK, "#pragma acc parallel"); - gcc_assert (clauses == NULL); block = c_begin_omp_parallel (); add_stmt (c_parser_omp_structured_block (parser)); diff --git gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c new file mode 100644 index 000..1bcf5be --- /dev/null +++ gcc/testsuite/c-c++-common/goacc/data-clause-duplicate-1.c @@ -0,0 +1,13 @@ +void +fun (void) +{ + float *fp; +#pragma acc parallel copy(fp[0:2],fp[0:2]) /* { dg-error "'fp' appears more than once in map clauses" } */ + ; +#pragma acc parallel present_or_copyin(fp[3]) present_or_copyout(fp[7:4]) /* { dg-error "'fp' appears more than once in map clauses" } */ + ; +#pragma acc parallel create(fp[:10]) deviceptr(fp) + /* { dg-error "'fp' appears more than once in map clauses" "" { target *-*-* } 9 } */ + /* { dg-message "sorry, unimplemented: data clause not yet implemented" "" { target *-*-* } 9 } */ + ; +} diff --git gcc/testsuite/c-c++-common/goacc/deviceptr-1.c gcc/testsuite/c-c++-common/goacc/deviceptr-1.c new file mode 100644 
index 000..0f0cf0c --- /dev/null +++ gcc/testsuite/c-c++-common/goacc/deviceptr-1.c @@ -0,0 +1,64 @@ +void +fun1 (void) +{ +#pragma acc parallel deviceptr(u) /* { dg-error "'u' undeclared" } */ + ; +#pragma acc parallel deviceptr(u[0:4]) /* { dg-error "expected '\\\)' before '\\\[' token" } */ + ; + +#pragma acc parallel deviceptr(fun1) /* { dg-error "'fun1' is not a variable" } */ + ; +#pragma acc parallel deviceptr(fun1[2:5]) + /* { dg-error "'fun1' is not a variable" "not a variable" { target *-*-* } 11 } */ + /* { dg-error "expected '\\\)' before '\\\[' token" "array" { target *-*-* } 11 } */ + ; + + int i; +#pragma acc parallel deviceptr(i) /* { dg-error "'i' is not a pointer variable" } */ + ; +#pragma acc parallel deviceptr(i[0:4]) + /* { dg-error "'i' is not a pointer variable" "not a pointer variable" { target *-*-* } 19 } */ + /* { dg-error "expected '\\\)' before '\\\[' token" "array" { target *-*-* } 19 } */ + ; + + float fa[10]; +#pragma acc parallel deviceptr(fa) /* { dg-error "'fa' is not a pointer variable" } */ + ; +#pragma acc parallel deviceptr(fa[1:5]) + /* { dg-error "'fa' is not a pointer variable" "not a pointer variable" { target *-*-* } 27 } */ + /* { dg-error "expected '\\\)' before '\\\[' token" "array" { target *-*-* } 27 } */ + ; + + fl
[gomp4 3/6] Initial support for OpenACC memory mapping semantics.
From: Thomas Schwinge gcc/ * tree-core.h (omp_clause_map_kind): Add OMP_CLAUSE_MAP_FORCE, OMP_CLAUSE_MAP_FORCE_ALLOC, OMP_CLAUSE_MAP_FORCE_TO, OMP_CLAUSE_MAP_FORCE_FROM, OMP_CLAUSE_MAP_FORCE_TOFROM, OMP_CLAUSE_MAP_FORCE_PRESENT, OMP_CLAUSE_MAP_FORCE_DEALLOC, and OMP_CLAUSE_MAP_FORCE_DEVICEPTR. * tree-pretty-print.c (dump_omp_clause): Handle these. * gimplify.c (gimplify_omp_var_data): Add GOVD_MAP_FORCE. (omp_region_type): Add ORT_TARGET_MAP_FORCE. (omp_add_variable, omp_notice_threadprivate_variable) (omp_notice_variable, gimplify_scan_omp_clauses) (gimplify_adjust_omp_clauses_1): Extend accordingly. (gimplify_oacc_parallel): Add ORT_TARGET_MAP_FORCE to ORT_TARGET usage. * omp-low.c (install_var_field, scan_sharing_clauses) (lower_oacc_parallel, lower_omp_target): Extend accordingly. --- gcc/gimplify.c | 92 ++--- gcc/omp-low.c | 33 +++--- gcc/tree-core.h | 19 +- gcc/tree-pretty-print.c | 21 +++ 4 files changed, 140 insertions(+), 25 deletions(-) diff --git gcc/gimplify.c gcc/gimplify.c index 90507c2..633784f 100644 --- gcc/gimplify.c +++ gcc/gimplify.c @@ -69,7 +69,13 @@ enum gimplify_omp_var_data GOVD_PRIVATE_OUTER_REF = 1024, GOVD_LINEAR = 2048, GOVD_ALIGNED = 4096, + + /* Flags for GOVD_MAP. */ + /* Don't copy back. */ GOVD_MAP_TO_ONLY = 8192, + /* Force a specific behavior (or else, a run-time error). */ + GOVD_MAP_FORCE = 16384, + GOVD_DATA_SHARE_CLASS = (GOVD_SHARED | GOVD_PRIVATE | GOVD_FIRSTPRIVATE | GOVD_LASTPRIVATE | GOVD_REDUCTION | GOVD_LINEAR | GOVD_LOCAL) @@ -86,7 +92,11 @@ enum omp_region_type ORT_UNTIED_TASK = 5, ORT_TEAMS = 8, ORT_TARGET_DATA = 16, - ORT_TARGET = 32 + ORT_TARGET = 32, + + /* Flags for ORT_TARGET. */ + /* Default to GOVD_MAP_FORCE for implicit mappings in this region. */ + ORT_TARGET_MAP_FORCE = 64 }; /* Gimplify hashtable helper. */ @@ -5430,9 +5440,20 @@ omp_add_variable (struct gimplify_omp_ctx *ctx, tree decl, unsigned int flags) copy into or out of the context. */ if (!(flags & GOVD_LOCAL)) { - nflags = flags & GOVD_MAP - ? 
GOVD_MAP | GOVD_MAP_TO_ONLY | GOVD_EXPLICIT - : flags & GOVD_PRIVATE ? GOVD_PRIVATE : GOVD_FIRSTPRIVATE; + if (flags & GOVD_MAP) + { + nflags = GOVD_MAP | GOVD_MAP_TO_ONLY | GOVD_EXPLICIT; +#if 0 + /* Not sure if this is actually needed; haven't found a case +where this would change anything; TODO. */ + if (flags & GOVD_MAP_FORCE) + nflags |= OMP_CLAUSE_MAP_FORCE; +#endif + } + else if (flags & GOVD_PRIVATE) + nflags = GOVD_PRIVATE; + else + nflags = GOVD_FIRSTPRIVATE; nflags |= flags & GOVD_SEEN; t = DECL_VALUE_EXPR (decl); gcc_assert (TREE_CODE (t) == INDIRECT_REF); @@ -5501,6 +5522,8 @@ omp_notice_threadprivate_variable (struct gimplify_omp_ctx *ctx, tree decl, for (octx = ctx; octx; octx = octx->outer_context) if (octx->region_type & ORT_TARGET) { + gcc_assert (!(octx->region_type & ORT_TARGET_MAP_FORCE)); + n = splay_tree_lookup (octx->variables, (splay_tree_key)decl); if (n == NULL) { @@ -5562,19 +5585,45 @@ omp_notice_variable (struct gimplify_omp_ctx *ctx, tree decl, bool in_code) n = splay_tree_lookup (ctx->variables, (splay_tree_key)decl); if (ctx->region_type & ORT_TARGET) { + unsigned map_force; + if (ctx->region_type & ORT_TARGET_MAP_FORCE) + map_force = GOVD_MAP_FORCE; + else + map_force = 0; if (n == NULL) { if (!lang_hooks.types.omp_mappable_type (TREE_TYPE (decl))) { error ("%qD referenced in target region does not have " "a mappable type", decl); - omp_add_variable (ctx, decl, GOVD_MAP | GOVD_EXPLICIT | flags); + omp_add_variable (ctx, decl, GOVD_MAP | map_force | GOVD_EXPLICIT | flags); } else - omp_add_variable (ctx, decl, GOVD_MAP | flags); + omp_add_variable (ctx, decl, GOVD_MAP | map_force | flags); } else - n->value |= flags; + { +#if 0 + /* The following fails for: + +int l = 10; +float c[l]; +#pragma acc parallel copy(c[2:4]) + { +#pragma acc parallel +{ + int t = sizeof c; +} + } + +..., which we currently don't have to care about (nesting +disabled), but eventually will have to; TODO. 
*/ + if ((n->value & GOVD_MAP) && !(n->value & GOVD_EXPLICIT)) + gcc
[gomp4 1/6] During gimplification, allow additional flags next to ORT_TARGET.
From: Thomas Schwinge gcc/ * gimplify.c (gimplify_call_expr, gimplify_modify_expr) (omp_firstprivatize_variable, omp_notice_threadprivate_variable) (omp_notice_variable, gimplify_adjust_omp_clauses) (gimplify_omp_workshare): Treat ORT_TARGET as a flag, not as a value. --- gcc/gimplify.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git gcc/gimplify.c gcc/gimplify.c index e45bed2..90507c2 100644 --- gcc/gimplify.c +++ gcc/gimplify.c @@ -2363,7 +2363,7 @@ gimplify_call_expr (tree *expr_p, gimple_seq *pre_p, bool want_value) during omplower pass instead. */ struct gimplify_omp_ctx *ctx; for (ctx = gimplify_omp_ctxp; ctx; ctx = ctx->outer_context) - if (ctx->region_type == ORT_TARGET) + if (ctx->region_type & ORT_TARGET) break; if (ctx == NULL) fold_stmt (&gsi); @@ -4534,7 +4534,7 @@ gimplify_modify_expr (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p, during omplower pass instead. */ struct gimplify_omp_ctx *ctx; for (ctx = gimplify_omp_ctxp; ctx; ctx = ctx->outer_context) -if (ctx->region_type == ORT_TARGET) +if (ctx->region_type & ORT_TARGET) break; if (ctx == NULL) fold_stmt (&gsi); @@ -5317,7 +5317,7 @@ omp_firstprivatize_variable (struct gimplify_omp_ctx *ctx, tree decl) else return; } - else if (ctx->region_type == ORT_TARGET) + else if (ctx->region_type & ORT_TARGET) omp_add_variable (ctx, decl, GOVD_MAP | GOVD_MAP_TO_ONLY); else if (ctx->region_type != ORT_WORKSHARE && ctx->region_type != ORT_SIMD @@ -5499,7 +5499,7 @@ omp_notice_threadprivate_variable (struct gimplify_omp_ctx *ctx, tree decl, struct gimplify_omp_ctx *octx; for (octx = ctx; octx; octx = octx->outer_context) -if (octx->region_type == ORT_TARGET) +if (octx->region_type & ORT_TARGET) { n = splay_tree_lookup (octx->variables, (splay_tree_key)decl); if (n == NULL) @@ -5560,7 +5560,7 @@ omp_notice_variable (struct gimplify_omp_ctx *ctx, tree decl, bool in_code) } n = splay_tree_lookup (ctx->variables, (splay_tree_key)decl); - if (ctx->region_type == ORT_TARGET) + if 
(ctx->region_type & ORT_TARGET) { if (n == NULL) { @@ -6285,7 +6285,7 @@ gimplify_adjust_omp_clauses (tree *list_p) if (!DECL_P (decl)) break; n = splay_tree_lookup (ctx->variables, (splay_tree_key) decl); - if (ctx->region_type == ORT_TARGET && !(n->value & GOVD_SEEN)) + if ((ctx->region_type & ORT_TARGET) && !(n->value & GOVD_SEEN)) remove = true; else if (DECL_SIZE (decl) && TREE_CODE (DECL_SIZE (decl)) != INTEGER_CST @@ -6857,7 +6857,7 @@ gimplify_omp_workshare (tree *expr_p, gimple_seq *pre_p) gcc_unreachable (); } gimplify_scan_omp_clauses (&OMP_CLAUSES (expr), pre_p, ort); - if (ort == ORT_TARGET || ort == ORT_TARGET_DATA) + if ((ort & ORT_TARGET) || ort == ORT_TARGET_DATA) { push_gimplify_context (); gimple g = gimplify_and_return_first (OMP_BODY (expr), &body); -- 1.8.1.1
[gomp4 2/6] Prepare for extending omp_clause_map_kind.
From: Thomas Schwinge gcc/ * tree-core.h (omp_clause_map_kind): Make the identifiers' bit patterns more obvious. Add comments. * omp-low.c (lower_oacc_parallel, lower_omp_target): Test for omp_clause_map_kind flags set instead of for values. --- gcc/omp-low.c | 22 ++ gcc/tree-core.h | 16 +++- 2 files changed, 25 insertions(+), 13 deletions(-) diff --git gcc/omp-low.c gcc/omp-low.c index eb755c3..899e970 100644 --- gcc/omp-low.c +++ gcc/omp-low.c @@ -8855,13 +8855,16 @@ lower_oacc_parallel (gimple_stmt_iterator *gsi_p, omp_context *ctx) { tree avar = create_tmp_var (TREE_TYPE (var), NULL); mark_addressable (avar); - if (OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_ALLOC - && OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_FROM) + enum omp_clause_map_kind map_kind + = OMP_CLAUSE_MAP_KIND (c); + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_TO)) + || map_kind == OMP_CLAUSE_MAP_POINTER) gimplify_assign (avar, var, &ilist); avar = build_fold_addr_expr (avar); gimplify_assign (x, avar, &ilist); - if ((OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_FROM -|| OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_TOFROM) + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_FROM)) && !TYPE_READONLY (TREE_TYPE (var))) { x = build_sender_ref (ovar, ctx); @@ -10331,13 +10334,16 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) gcc_assert (kind == GF_OMP_TARGET_KIND_REGION); tree avar = create_tmp_var (TREE_TYPE (var), NULL); mark_addressable (avar); - if (OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_ALLOC - && OMP_CLAUSE_MAP_KIND (c) != OMP_CLAUSE_MAP_FROM) + enum omp_clause_map_kind map_kind + = OMP_CLAUSE_MAP_KIND (c); + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_TO)) + || map_kind == OMP_CLAUSE_MAP_POINTER) gimplify_assign (avar, var, &ilist); avar = build_fold_addr_expr (avar); gimplify_assign (x, avar, &ilist); - if ((OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_FROM -|| OMP_CLAUSE_MAP_KIND (c) == 
OMP_CLAUSE_MAP_TOFROM) + if ((!(map_kind & OMP_CLAUSE_MAP_SPECIAL) +&& (map_kind & OMP_CLAUSE_MAP_FROM)) && !TYPE_READONLY (TREE_TYPE (var))) { x = build_sender_ref (ovar, ctx); diff --git gcc/tree-core.h gcc/tree-core.h index e2750e0..3602b5f 100644 --- gcc/tree-core.h +++ gcc/tree-core.h @@ -1112,14 +1112,20 @@ enum omp_clause_depend_kind enum omp_clause_map_kind { - OMP_CLAUSE_MAP_ALLOC, - OMP_CLAUSE_MAP_TO, - OMP_CLAUSE_MAP_FROM, - OMP_CLAUSE_MAP_TOFROM, + /* If not already present, allocate. */ + OMP_CLAUSE_MAP_ALLOC = 0, + /* ..., and copy to device. */ + OMP_CLAUSE_MAP_TO = 1 << 0, + /* ..., and copy from device. */ + OMP_CLAUSE_MAP_FROM = 1 << 1, + /* ..., and copy to and from device. */ + OMP_CLAUSE_MAP_TOFROM = OMP_CLAUSE_MAP_TO | OMP_CLAUSE_MAP_FROM, + /* Special map kinds. */ + OMP_CLAUSE_MAP_SPECIAL = 1 << 2, /* The following kind is an internal only map kind, used for pointer based array sections. OMP_CLAUSE_SIZE for these is not the pointer size, which is implicitly POINTER_SIZE / BITS_PER_UNIT, but the bias. */ - OMP_CLAUSE_MAP_POINTER + OMP_CLAUSE_MAP_POINTER = OMP_CLAUSE_MAP_SPECIAL }; enum omp_clause_proc_bind_kind -- 1.8.1.1
[gomp4 4/6] C front end infrastructure for OpenACC clauses parsing.
From: Thomas Schwinge gcc/c/ * c-parser.c (c_parser_oacc_all_clauses): New function. (c_parser_oacc_parallel): Use it. * c-typeck.c (c_finish_omp_clauses): Update comment. Remove duplicated variable initialization. --- gcc/c/c-parser.c | 59 +++- gcc/c/c-typeck.c | 4 ++-- 2 files changed, 56 insertions(+), 7 deletions(-) diff --git gcc/c/c-parser.c gcc/c/c-parser.c index ce46f31..c8b80db 100644 --- gcc/c/c-parser.c +++ gcc/c/c-parser.c @@ -9465,7 +9465,7 @@ c_parser_pragma_pch_preprocess (c_parser *parser) c_common_pch_pragma (parse_in, TREE_STRING_POINTER (name)); } -/* OpenMP 2.5 / 3.0 / 3.1 / 4.0 parsing routines. */ +/* OpenACC and OpenMP parsing routines. */ /* Returns name of the next clause. If the clause is not recognized PRAGMA_OMP_CLAUSE_NONE is returned and @@ -10767,9 +10767,58 @@ c_parser_omp_clause_uniform (c_parser *parser, tree list) return list; } +/* Parse all OpenACC clauses. The set clauses allowed by the directive + is a bitmask in MASK. Return the list of clauses found. */ + +static tree +c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask, + const char *where, bool finish_p = true) +{ + tree clauses = NULL; + bool first = true; + + while (c_parser_next_token_is_not (parser, CPP_PRAGMA_EOL)) +{ + location_t here; + pragma_omp_clause c_kind; + const char *c_name; + tree prev = clauses; + + if (!first && c_parser_next_token_is (parser, CPP_COMMA)) + c_parser_consume_token (parser); + + here = c_parser_peek_token (parser)->location; + c_kind = c_parser_omp_clause_name (parser); + + switch (c_kind) + { + default: + c_parser_error (parser, "expected clause"); + goto saw_error; + } + + first = false; + + if (((mask >> c_kind) & 1) == 0 && !parser->error) + { + /* Remove the invalid clause(s) from the list to avoid +confusing the rest of the compiler. 
*/ + clauses = prev; + error_at (here, "%qs is not valid for %qs", c_name, where); + } +} + + saw_error: + c_parser_skip_to_pragma_eol (parser); + + if (finish_p) +return c_finish_omp_clauses (clauses); + + return clauses; +} + /* Parse all OpenMP clauses. The set clauses allowed by the directive - is a bitmask in MASK. Return the list of clauses found; the result - of clause default goes in *pdefault. */ + is a bitmask in MASK. Return the list of clauses found. */ static tree c_parser_omp_all_clauses (c_parser *parser, omp_clause_mask mask, @@ -11019,8 +11068,8 @@ c_parser_oacc_parallel (location_t loc, c_parser *parser) { tree stmt, clauses, block; - clauses = c_parser_omp_all_clauses (parser, OACC_PARALLEL_CLAUSE_MASK, - "#pragma acc parallel"); + clauses = c_parser_oacc_all_clauses (parser, OACC_PARALLEL_CLAUSE_MASK, + "#pragma acc parallel"); gcc_assert (clauses == NULL); block = c_begin_omp_parallel (); diff --git gcc/c/c-typeck.c gcc/c/c-typeck.c index 854e149..81f0c5c 100644 --- gcc/c/c-typeck.c +++ gcc/c/c-typeck.c @@ -11661,7 +11661,7 @@ c_find_omp_placeholder_r (tree *tp, int *, void *data) return NULL_TREE; } -/* For all elements of CLAUSES, validate them vs OpenMP constraints. +/* For all elements of CLAUSES, validate them against their constraints. Remove any elements from the list that are invalid. */ tree @@ -11669,7 +11669,7 @@ c_finish_omp_clauses (tree clauses) { bitmap_head generic_head, firstprivate_head, lastprivate_head; bitmap_head aligned_head; - tree c, t, *pc = &clauses; + tree c, t, *pc; bool branch_seen = false; bool copyprivate_seen = false; tree *nowait_clause = NULL; -- 1.8.1.1
[gomp4 5/6] Initial support in the C front end for OpenACC data clauses.
From: Thomas Schwinge gcc/c-family/ * c-pragma.h (pragma_omp_clause): Add PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DELETE, PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_PRESENT, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, and PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE. gcc/c/ * c-parser.c (c_parser_omp_clause_name): Handle these. (c_parser_oacc_data_clause, c_parser_oacc_data_clause_deviceptr): New functions. (c_parser_oacc_all_clauses): Handle PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYIN, PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DELETE, PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_PRESENT, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, and PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE. gcc/ * tree-core.h (omp_clause_code): Update description for OMP_CLAUSE_MAP. --- gcc/c-family/c-pragma.h | 12 +++- gcc/c/c-parser.c| 171 +++- gcc/tree-core.h | 6 +- 3 files changed, 184 insertions(+), 5 deletions(-) diff --git gcc/c-family/c-pragma.h gcc/c-family/c-pragma.h index 64eed11..2c8af67 100644 --- gcc/c-family/c-pragma.h +++ gcc/c-family/c-pragma.h @@ -63,18 +63,23 @@ typedef enum pragma_kind { } pragma_kind; -/* All clauses defined by OpenMP 2.5, 3.0, 3.1 and 4.0. +/* All clauses defined by OpenACC 2.0, and OpenMP 2.5, 3.0, 3.1, and 4.0. Used internally by both C and C++ parsers. 
*/ typedef enum pragma_omp_clause { PRAGMA_OMP_CLAUSE_NONE = 0, PRAGMA_OMP_CLAUSE_ALIGNED, PRAGMA_OMP_CLAUSE_COLLAPSE, + PRAGMA_OMP_CLAUSE_COPY, PRAGMA_OMP_CLAUSE_COPYIN, + PRAGMA_OMP_CLAUSE_COPYOUT, PRAGMA_OMP_CLAUSE_COPYPRIVATE, + PRAGMA_OMP_CLAUSE_CREATE, PRAGMA_OMP_CLAUSE_DEFAULT, + PRAGMA_OMP_CLAUSE_DELETE, PRAGMA_OMP_CLAUSE_DEPEND, PRAGMA_OMP_CLAUSE_DEVICE, + PRAGMA_OMP_CLAUSE_DEVICEPTR, PRAGMA_OMP_CLAUSE_DIST_SCHEDULE, PRAGMA_OMP_CLAUSE_FINAL, PRAGMA_OMP_CLAUSE_FIRSTPRIVATE, @@ -92,6 +97,11 @@ typedef enum pragma_omp_clause { PRAGMA_OMP_CLAUSE_NUM_THREADS, PRAGMA_OMP_CLAUSE_ORDERED, PRAGMA_OMP_CLAUSE_PARALLEL, + PRAGMA_OMP_CLAUSE_PRESENT, + PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY, + PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN, + PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT, + PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE, PRAGMA_OMP_CLAUSE_PRIVATE, PRAGMA_OMP_CLAUSE_PROC_BIND, PRAGMA_OMP_CLAUSE_REDUCTION, diff --git gcc/c/c-parser.c gcc/c/c-parser.c index c8b80db..48c55e6 100644 --- gcc/c/c-parser.c +++ gcc/c/c-parser.c @@ -9496,16 +9496,26 @@ c_parser_omp_clause_name (c_parser *parser) case 'c': if (!strcmp ("collapse", p)) result = PRAGMA_OMP_CLAUSE_COLLAPSE; + else if (!strcmp ("copy", p)) + result = PRAGMA_OMP_CLAUSE_COPY; else if (!strcmp ("copyin", p)) result = PRAGMA_OMP_CLAUSE_COPYIN; + else if (!strcmp ("copyout", p)) + result = PRAGMA_OMP_CLAUSE_COPYOUT; else if (!strcmp ("copyprivate", p)) result = PRAGMA_OMP_CLAUSE_COPYPRIVATE; + else if (!strcmp ("create", p)) + result = PRAGMA_OMP_CLAUSE_CREATE; break; case 'd': - if (!strcmp ("depend", p)) + if (!strcmp ("delete", p)) + result = PRAGMA_OMP_CLAUSE_DELETE; + else if (!strcmp ("depend", p)) result = PRAGMA_OMP_CLAUSE_DEPEND; else if (!strcmp ("device", p)) result = PRAGMA_OMP_CLAUSE_DEVICE; + else if (!strcmp ("deviceptr", p)) + result = PRAGMA_OMP_CLAUSE_DEVICEPTR; else if (!strcmp ("dist_schedule", p)) result = PRAGMA_OMP_CLAUSE_DIST_SCHEDULE; break; @@ -9550,6 +9560,16 @@ c_parser_omp_clause_name (c_parser *parser) case 'p': 
if (!strcmp ("parallel", p)) result = PRAGMA_OMP_CLAUSE_PARALLEL; + else if (!strcmp ("present", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT; + else if (!strcmp ("present_or_copy", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_COPY; + else if (!strcmp ("present_or_copyin", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYIN; + else if (!strcmp ("present_or_copyout", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_COPYOUT; + else if (!strcmp ("present_or_create", p)) + result = PRAGMA_OMP_CLAUSE_PRESENT_OR_CREATE; else if (!strcmp ("private", p)) result = PRAGMA_OMP_CLAUSE_PRIVATE; else if (!strcmp ("proc_bind", p)) @@ -9611,7 +9631,7 @@ check_no_duplicate_clause (tree clauses, enum omp_clause_code code, } } -/* OpenMP 2.5: +/* OpenACC 2.0, OpenMP
[gomp4] Initial support for OpenACC data clauses
Hi! Here is a patch series that adds initial support for OpenACC data clauses. It is not yet complete, but I thought I might as well strive already now to get this integrated upstream instead of "hoarding" the patches locally. Would it be a good idea to also commit to trunk the (portions of the) patches that don't directly relate to OpenACC stuff? That way, trunk and gomp-4_0-branch would diverge a little less? Or, would you first like to see all of this stabilize on gomp-4_0-branch? Regards, Thomas
[wide-int] resolve bootstrap issue
This resolves a bootstrap issue found after reducing the size of the maximal wide_int; the real code really does want a slightly larger type so we create one just for real. Ok? diff --git a/gcc/builtins.c b/gcc/builtins.c index 81bb407..f4ffdb0 100644 --- a/gcc/builtins.c +++ b/gcc/builtins.c @@ -685,10 +685,10 @@ c_readstr (const char *str, enum machine_mode mode) HOST_WIDE_INT tmp[MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT]; gcc_assert (GET_MODE_CLASS (mode) == MODE_INT); - unsigned int len = (GET_MODE_PRECISION (mode) + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT; + gcc_assert (len <= MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT); for (i = 0; i < len; i++) tmp[i] = 0; diff --git a/gcc/expmed.c b/gcc/expmed.c index ce063eb..720d8c1 100644 --- a/gcc/expmed.c +++ b/gcc/expmed.c @@ -4963,6 +4963,7 @@ make_tree (tree type, rtx x) return t; case CONST_DOUBLE: + gcc_assert (HOST_BITS_PER_WIDE_INT * 2 <= MAX_BITSIZE_MODE_ANY_INT); if (TARGET_SUPPORTS_WIDE_INT == 0 && GET_MODE (x) == VOIDmode) t = wide_int_to_tree (type, wide_int::from_array (&CONST_DOUBLE_LOW (x), 2, diff --git a/gcc/lto-streamer-in.c b/gcc/lto-streamer-in.c index d058307..08eba48 100644 --- a/gcc/lto-streamer-in.c +++ b/gcc/lto-streamer-in.c @@ -1312,6 +1312,7 @@ lto_input_tree_1 (struct lto_input_block *ib, struct data_in *data_in, for (i = 0; i < len; i++) a[i] = streamer_read_hwi (ib); + gcc_assert (TYPE_PRECISION (type) <= MAX_BITSIZE_MODE_ANY_INT); result = wide_int_to_tree (type, wide_int::from_array (a, len, TYPE_PRECISION (type))); streamer_tree_cache_append (data_in->reader_cache, result, hash); diff --git a/gcc/real.c b/gcc/real.c index b060497..51d1868 100644 --- a/gcc/real.c +++ b/gcc/real.c @@ -1377,10 +1377,12 @@ real_to_integer (const REAL_VALUE_TYPE *r) wide_int real_to_integer (const REAL_VALUE_TYPE *r, bool *fail, int precision) { + typedef FIXED_WIDE_INT (WIDE_INT_MAX_PRECISION * 2) real_int; HOST_WIDE_INT val[2 * MAX_BITSIZE_MODE_ANY_INT / 
HOST_BITS_PER_WIDE_INT]; int exp; int words; wide_int result; + real_int tmp; int w; switch (r->cl) @@ -1440,10 +1442,10 @@ real_to_integer (const REAL_VALUE_TYPE *r, bool *fail, int precision) } #endif w = SIGSZ * HOST_BITS_PER_LONG + words * HOST_BITS_PER_WIDE_INT; - result = wide_int::from_array + tmp = real_int::from_array (val, (w + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT, w); - result = wi::lrshift (result, (words * HOST_BITS_PER_WIDE_INT) - exp); - result = wide_int::from (result, precision, UNSIGNED); + tmp = wi::lrshift (tmp, (words * HOST_BITS_PER_WIDE_INT) - exp); + result = wide_int::from (tmp, precision, UNSIGNED); if (r->sign) return -result;

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c index 00b5439..7c21afa 100644 --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -5384,6 +5384,7 @@ simplify_immed_subreg (enum machine_mode outermode, rtx op, tmp[u] = buf; base += HOST_BITS_PER_WIDE_INT; } + gcc_assert (GET_MODE_PRECISION (outer_submode) <= MAX_BITSIZE_MODE_ANY_INT); r = wide_int::from_array (tmp, units, GET_MODE_PRECISION (outer_submode)); elems[elem] = immed_wide_int_const (r, outer_submode);

diff --git a/gcc/tree-vect-generic.c b/gcc/tree-vect-generic.c index 50453b4..74d29d2 100644 --- a/gcc/tree-vect-generic.c +++ b/gcc/tree-vect-generic.c @@ -62,7 +62,7 @@ build_replicated_const (tree type, tree inner_type, HOST_WIDE_INT value) HOST_WIDE_INT a[WIDE_INT_MAX_ELTS]; int i; - gcc_assert (n); + gcc_assert (n && n <= WIDE_INT_MAX_ELTS); if (width == HOST_BITS_PER_WIDE_INT) low = value; @@ -75,6 +75,7 @@ build_replicated_const (tree type, tree inner_type, HOST_WIDE_INT value) for (i = 0; i < n; i++) a[i] = low; + gcc_assert (TYPE_PRECISION (type) <= MAX_BITSIZE_MODE_ANY_INT); return wide_int_to_tree (type, wide_int::from_array (a, n, TYPE_PRECISION (type))); }
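The asserts this patch adds all follow one pattern: compute the number of HOST_WIDE_INT words a given precision needs and check that it fits the fixed-size buffer before filling it. A minimal standalone sketch of that guard (the constants are stand-ins, not GCC's real host-dependent values):

```c
#include <assert.h>

/* Stand-ins for HOST_BITS_PER_WIDE_INT and MAX_BITSIZE_MODE_ANY_INT;
   the real values depend on the host and the configured targets.  */
#define HOST_BITS 64
#define MAX_BITS  160

/* Number of host words needed to hold PREC bits (ceiling division),
   with the overflow guard the patch adds before writing the buffer.  */
static unsigned int
words_for_precision (unsigned int prec)
{
  unsigned int len = (prec + HOST_BITS - 1) / HOST_BITS;
  assert (len <= MAX_BITS / HOST_BITS);   /* gcc_assert in the real code */
  return len;
}
```

The point of the assert is that after shrinking the maximal wide_int, a mode wider than the buffer would silently overrun `tmp[]`/`val[]`/`a[]`; the guard turns that into an ICE instead.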
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On 14/01/14 14:32, Jakub Jelinek wrote: > Anyway, the above is really a simple case, and I'd call it a > backend bug if it isn't able to generate good code out of that. Exactly which back-end pass are you expecting to simplify

(set (subreg:SI (reg:HI 1) 0)
     (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2)))
(set (reg:SI 2) (zero_extend:SI (reg:HI 1)))
(set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0)))

into

(set (reg:SI 2) (and:SI (subreg:SI (reg:HI 0) 0) (const_int 2)))
(set (reg:SI 3) (ne:SI (reg:SI 2) (const_int 0)))

Combine is about the only pass that does this sort of thing, and that's far too often confused by extraneous information that it thinks might be helpful, but isn't, or by the fact that the intermediate result is used more than once. R.
Re: PR 59712 patch
On 9 January 2014 21:55, François Dumont wrote: > Hi > > Here is a patch for this small problem with clang. It is not a blocking > issue for the 4.9 release but at the same time it is a rather safe fix so > just tell me if I can commit it. > > All unordered_* tests run under Linux x86_64. I don't have clang installed > at the moment so clang feedback would be appreciated. I'm going to test this with Clang to be sure it actually fixes the bug before we change anything, but it might have to wait for the weekend. N.B. please change the comment at the top of the patch to say "pool of nodes" instead of "pool of node"
Re: [C PATCH] Disallow subtracting pointers to empty structs (PR c/58346)
The C++ part is OK. Jason
Re: [RFC] Using function clones for Pointer Bounds Checker
2014/1/14 Richard Biener : > On Tue, Jan 14, 2014 at 1:47 PM, Ilya Enkovich wrote: >> 2014/1/14 Richard Biener : >>> On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich >>> wrote: Hi, I've been working for some time on the prototype of the Pointer Bounds Checker which uses function clones for instrumentation (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After several experiments with this approach I want to share my results and ask for some feedback to make a decision about the future steps. Firstly I want to recall the reasons for digging in this direction. In the original approach bounds of call arguments and input parameters are associated with arguments via special built-in calls. It creates implicit data flow that the compiler is not aware of, which confuses some optimizations, resulting in mis-optimization, and breaks bounds data flow. Thus optimizations have to be fixed to get better pointer protection. The clones approach does not use special built-in function calls to associate bounds with call arguments and input parameters. Each function which should be instrumented gets an additional version and only this special version will be instrumented. This new version gets additional bound arguments to express input bounds. When a function call is instrumented, it is redirected to the instrumented version and all bounds are passed as explicit call arguments. Thus we have explicit pointer bounds flow similar to regular function parameters. It should make it possible to avoid changes in optimizations, avoid mis-optimizations, and allow existing IPA optimizations to work with bound args (e.g. propagate constant bounds value and remove checks in called function). I made a prototype implementation of this approach in the following way: - Add new IPA pass before early local passes to produce versions for all functions to be instrumented. - Put instrumentation pass after SSA pass. - Add new pass after IPA passes to remove bodies of functions which have instrumented versions.
Function nodes may still be required for calls in non-instrumented code. But we do not emit this code and therefore function bodies are not needed. Positive changes are: - IPA optimizations are not confused by bound parameters - bounds are now more like regular arguments; it makes their processing in expand easier - functions with bounds not attached to any pointer are allowed >>> >>> First of all thanks for trying to work in this direction. Comments on the >>> issues you encountered below (also CCed Honza as he should be more >>> familiar with reachability and clone issues). >>> On simple codes this approach worked well but on bigger tests some issues were revealed. 1. Nodes reachability. Instrumented version is actually always reachable when original function is reachable because it is always emitted instead of the original. Thus I had to fix reachability analysis to achieve it. Another similar problem is checking whether a node can be removed after inline when inlining an instrumented function. Not hard to fix but probably other similar problems exist. >>> >>> I suppose you do not update the callgraph / the call stmts when >>> creating the clones? Btw, is it desirable to inline the uninstrumented >>> function and then instrument the result (thus run cloning and >>> instrumentation after early inlining?)? Especially handling always_inlines >>> before cloning/instrumentation looks very sensible. >> >> Right. Created clones have the same code as the original function and >> therefore same cgraph edges. I suppose instrumentation after early >> inlining is OK and may be preferred because inlining shouldn't lead to >> any losses of bounds information. I tried a variant where instrumentation >> works right after early inlining but with cloning still before early >> local passes. In general it looked OK. >> >>> 2. Function processing order. Function processing order is determined before early local passes.
But during function instrumentation the call graph is modified significantly and the topological order used becomes outdated. That causes some trouble. E.g. a function marked as 'always inline' cannot be inlined because it is not in SSA form yet. Surely the inlining problem may be solved by just putting instrumentation after early inlining, but similar problems may exist in other passes too. To resolve this problem I tried to split early local passes into three parts. The first one builds SSA, the second one performs instrumentation, the last one does the rest. Each part is performed on all functions before the next one starts. Thus I get all functions in SSA form and all instrumentation performed before starting early optimizations. Unfortunately such passes ord
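In C terms, the cloning transformation described in this thread amounts to something like the following sketch (the `_chkp` naming and the exact check are illustrative, not the checker's real ABI):

```c
#include <assert.h>
#include <stdlib.h>

/* Original, uninstrumented function.  */
static int
first (int *p)
{
  return p[0];
}

/* Hypothetical instrumented clone: the bounds arrive as ordinary extra
   parameters, so they are visible to IPA passes (constant propagation,
   inlining, ...) just like any other argument -- the key point of the
   clones approach, as opposed to associating bounds via builtins.  */
static int
first_chkp (int *p, int *lb, int *ub)
{
  /* Check that the access p[0] lies within [lb, ub].  */
  if (p < lb || p > ub)
    abort ();
  return p[0];
}
```

Instrumented callers would then be redirected from `first` to `first_chkp`, with the bounds passed explicitly, e.g. `first_chkp (a, a, a + 3)` for `int a[4]`.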
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On 14/01/14 14:25, Richard Biener wrote: > On Tue, Jan 14, 2014 at 3:21 PM, Kyrill Tkachov > wrote: >> Moving to gcc, I accidentally sent it to gcc-patches previously... >> >> >> On 14/01/14 14:09, Richard Biener wrote: >>> >>> On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov >>> wrote: Hi all, I'm looking into PR 54168 where we end up generating an unnecessary extend operation on arm. Given code: struct b2Body { unsigned short flags; int type; }; static _Bool IsAwake(struct b2Body *b) { return (b->flags & 2) == 2; } int foo(struct b2Body *bA, struct b2Body *bB) { int typeA = bA->type; int typeB = bB->type; _Bool activeA = IsAwake(bA) && typeA != 0; _Bool activeB = IsAwake(bB) && typeB != 0; if (!activeA && !activeB) { return 1; } return 0; } Compiled for arm-none-eabi with -O3 -march=armv7-a The inlined generated code for IsAwake contains the fragment: ldrh r0, [r1] and r0, r0, #2 uxth r0, r0 cmp r0, #0 which contains a redundant extend, which also confuses combine and prevents the whole thing from being optimised into an ldrh ; ands sequence. Looking at the tree dumps I notice that after the forwprop pass the types of the operands in the _7 = _3 & 2; statement turn into short unsigned where before that pass they were just ints: IsAwake (struct b2Body * b) { short unsigned int _3; int _4; _Bool _6; short unsigned int _7; : _3 = b_2(D)->flags; _4 = (int) _3; _7 = _3 & 2; _6 = _7 != 0; return _6; } I believe the C standard expects the operation to be performed in int mode. Now, since this is a bitwise and operation with a known constant 2, the operation can be safely performed in unsigned short mode. However on word-based machines like arm this would introduce unnecessary extend operations down the line, as I believe is the case here. >>> >>> Though the variant using shorts is clearly shorter (in number of stmts) >>> and thus easier to optimize. Am I correct that the issue in the end >>> is that _7 != 0 requires to extend _7?
& 2 is trivially performed without >>> any extension, no? >> >> >> If I look at the dump before forwprop, the number of statements is exactly >> the same, so it's not any shorter in that sense. > > Well, it is - _4 = (int) _3; is dead, thus a zero-extension instruction > is removed. > That's a rather short-sighted definition of removed. You've removed an extension that: a) can be merged with the preceding load b) will have to be put back again anyway, when the _4 & 2 is expanded. And finally, you end up with a second one when _5 != 0 is later expanded. R. >> >> >> IsAwake (struct b2Body * b) >> { >> short unsigned int _3; >> int _4; >> int _5; >> _Bool _6; >> >> >> : >> _3 = b_2(D)->flags; >> _4 = (int) _3; >> _5 = _4 & 2; >> _6 = _5 != 0; >> return _6; >> >> } >> >> Using shorts is not cheaper on an architecture like arm which is word-based. >> Just the fact that we're introducing shorts already implies we're going to >> have to extend somewhere. > > But given the bit-and with '2' the extension can be removed, no? > > Richard. > >> Kyrill >> >> >>> >>> Richard. >>> Anyone have any insight on how to resolve this one? Thanks, Kyrill >> >> >
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 02:21:52PM +, Kyrill Tkachov wrote: > If I look at the dump before forwprop, the number of statements is > exactly the same, so it's not any shorter in that sense. > > > IsAwake (struct b2Body * b) > { > short unsigned int _3; > int _4; > int _5; > _Bool _6; > > : > _3 = b_2(D)->flags; > _4 = (int) _3; > _5 = _4 & 2; > _6 = _5 != 0; > return _6; > > } > > Using shorts is not cheaper on an architecture like arm which is > word-based. Just the fact that we're introducing shorts already > implies we're going to have to extend somewhere. As discussed multiple times in the various type demotion/promotion threads, for GIMPLE optimizations it is desirable to use as narrow types as possible, because that e.g. means greater possibilities of finding redundancies, optimizing away useless computations in the upper bits, canonicalize computations done in different types etc. Already the FEs right now narrow types, but only within the same folded expressions (get_narrower etc.), of course we want to move that away from FEs if possible and do it during (early?) GIMPLE passes. Generally, targets should be able to expand good code for the narrower types, there is no reason why they can't, but perhaps their tasks would be easier if we did some target hooks/optabs or something driven type promotion. Anyway, the above is really a simple case, and I'd call it a backend bug if it isn't able to generate good code out of that. Jakub
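Jakub's point that the narrowing is semantically safe (the dispute is only about target codegen) can be checked directly: for a bitwise AND with a constant that fits the narrow type, doing the operation in int (as C's promotion rules specify) and in unsigned short (as forwprop rewrites it) yields the same truth value. A small sketch using the IsAwake example from this thread:

```c
/* The int-typed form the front end produces (after integer promotion).  */
static int
is_awake_int (unsigned short flags)
{
  return ((int) flags & 2) != 0;
}

/* The narrowed form after forwprop: the AND is done in unsigned short.
   Since the mask constant 2 fits in the narrow type, no information that
   affects the comparison can be lost.  */
static int
is_awake_short (unsigned short flags)
{
  return (unsigned short) (flags & 2) != 0;
}
```

Both functions compute the same predicate for every input; the thread's question is only whether the narrow form costs an extra `uxth` on word-based targets like ARM.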
Re: Drop REG_CROSSING_JUMP when converting to a conditional return
On 01/14/14 03:37, Richard Sandiford wrote: While experimenting with a patch to use conditional returns, I hit a case where a conditional jump between hot and cold sections was being converted into a conditional return. The new jump still had the REG_CROSSING_JUMP and later passes were confused by the combination of that and a return JUMP_LABEL. The jump is now returning directly from the current section rather than crossing sections within the function, so I think we should just drop the note. Tested on s390x-linux-gnu with some local patches. OK to install? Thanks, Richard gcc/ * jump.c (redirect_jump_2): Remove REG_CROSSING_JUMP notes when converting a conditional jump into a conditional return. OK. jeff
Re: [PATCH] Fix PR59802, LCM compile-time slowness
On 01/14/14 06:42, Richard Biener wrote: This fixes the slowness seen in LCM compute_available accounted to RTL cprop. Currently the dataflow problem uses a "random" basic-block order to seed the initial worklist (it wants to visit predecessors before successors) - the following patch makes it use inverted postorder (similar to tree PRE antic computation). This reduces the compile-time for the testcase in PR59802 at -O3 from

CPROP : 54.53 (55%) usr 0.04 ( 6%) sys 54.57 (55%) wall kB ( 2%) ggc
PRE : 4.47 ( 5%) usr 0.03 ( 5%) sys 4.48 ( 5%) wall 1833 kB ( 1%) ggc
TOTAL : 98.51 0.63 99.13 220195 kB

to

CPROP : 2.13 ( 5%) usr 0.06 (10%) sys 2.20 ( 5%) wall kB ( 2%) ggc
PRE : 0.52 ( 1%) usr 0.02 ( 3%) sys 0.54 ( 1%) wall 1833 kB ( 1%) ggc
TOTAL : 42.22 0.60 42.81 220195 kB

which is nice. I checked for other compile-time hog PRs with CPROP but didn't find one, have yet to check for PRE (three-letter, likely too much noise). Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk (and branch?) Thanks, Richard. 2014-01-14 Richard Biener PR rtl-optimization/59802 * lcm.c (compute_available): Use inverted postorder to seed the initial worklist. Looks good. Thanks for taking care of this. Jeff
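The effect of the seeding order can be illustrated on a toy version of the same problem: a one-bit availability computation over a straight-line CFG with optimistic initialization, as in compute_available. Seeding the FIFO worklist predecessors-first (which is what inverted postorder yields for this chain) converges in a single sweep; the opposite order roughly doubles the pops, and on real CFGs with large bitmaps that difference is what the timings above reflect. This is only a sketch, not GCC's code:

```c
#include <assert.h>

#define N 8

/* One-bit "available" dataflow over the chain CFG 0 -> 1 -> ... -> N-1.
   Block 0 kills the expression; avout is initialized optimistically to 1,
   so blocks must be revisited until a fixpoint is reached.  SEED_ORDER
   must be a permutation of 0..N-1.  Returns the number of worklist pops
   needed to reach the fixpoint.  */
static int
solve (const int *seed_order)
{
  int avout[N], in_list[N];
  int queue[4 * N], qhead = 0, qtail = 0;   /* FIFO, generously sized */
  int pops = 0;

  for (int i = 0; i < N; i++)
    {
      avout[i] = 1;                     /* optimistic initialization */
      queue[qtail++] = seed_order[i];   /* seed every block */
      in_list[seed_order[i]] = 1;
    }

  while (qhead < qtail)
    {
      int b = queue[qhead++];
      in_list[b] = 0;
      pops++;
      int navout = (b == 0) ? 0 : avout[b - 1];   /* transfer function */
      if (navout != avout[b])
        {
          avout[b] = navout;
          int succ = b + 1;             /* re-queue the changed successor */
          if (succ < N && !in_list[succ])
            {
              queue[qtail++] = succ;
              in_list[succ] = 1;
            }
        }
    }
  for (int i = 0; i < N; i++)
    assert (avout[i] == 0);             /* fixpoint: nothing available */
  return pops;
}
```

Seeding 0,1,...,N-1 pops each block exactly once; seeding N-1,...,0 wastes the whole first sweep because every block reads a still-optimistic predecessor.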
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 3:21 PM, Kyrill Tkachov wrote: > Moving to gcc, I accidentally sent it to gcc-patches previously... > > > On 14/01/14 14:09, Richard Biener wrote: >> >> On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov >> wrote: >>> >>> Hi all, >>> >>> I'm looking into PR 54168 where we end up generating an unnecessary >>> extend >>> operation on arm. >>> >>> Given code: >>> >>> struct b2Body { >>> unsigned short flags; >>> int type; >>> }; >>> >>> static _Bool IsAwake(struct b2Body *b) >>> { >>> return (b->flags & 2) == 2; >>> } >>> >>> >>> int foo(struct b2Body *bA, struct b2Body *bB) >>> { >>> int typeA = bA->type; >>> int typeB = bB->type; >>> _Bool activeA = IsAwake(bA) && typeA != 0; >>> _Bool activeB = IsAwake(bB) && typeB != 0; >>> >>> if (!activeA && !activeB) >>> { >>> return 1; >>> } >>> >>> return 0; >>> } >>> >>> Compiled for arm-none-eabi with -O3 -march=armv7-a >>> >>> The inlined generated code for IsAwake contains the fragment: >>> >>> ldrh r0, [r1] >>> and r0, r0, #2 >>> uxth r0, r0 >>> cmp r0, #0 >>> >>> which contains a redundant extend, which also confuses combine and >>> prevents >>> the whole thing from being optimised into an ldrh ; ands sequence. >>> >>> Looking at the tree dumps I notice that after the forwprop pass the types >>> of >>> the operands in the _7 = _3 & 2; statement turn into short unsigned where >>> before that pass they were just ints: >>> >>> IsAwake (struct b2Body * b) >>> { >>> short unsigned int _3; >>> int _4; >>> _Bool _6; >>> short unsigned int _7; >>> >>> : >>> _3 = b_2(D)->flags; >>> _4 = (int) _3; >>> _7 = _3 & 2; >>> _6 = _7 != 0; >>> return _6; >>> >>> } >>> >>> >>> I believe the C standard expects the operation to be performed in int >>> mode. >>> Now, since this is a bitwise and operation with a known constant 2, the >>> operation can be safely performed in unsigned short mode.
However on >>> word-based machines like arm this would introduce unnecessary extend >>> operations down the line, as I believe is the case here. >> >> Though the variant using shorts is clearly shorter (in number of stmts) >> and thus easier to optimize. Am I correct that the issue in the end >> is that _7 != 0 requires to extend _7? & 2 is trivially performed without >> any extension, no? > > > If I look at the dump before forwprop, the number of statements is exactly > the same, so it's not any shorter in that sense. Well, it is - _4 = (int) _3; is dead, thus a zero-extension instruction is removed. > > > IsAwake (struct b2Body * b) > { > short unsigned int _3; > int _4; > int _5; > _Bool _6; > > > : > _3 = b_2(D)->flags; > _4 = (int) _3; > _5 = _4 & 2; > _6 = _5 != 0; > return _6; > > } > > Using shorts is not cheaper on an architecture like arm which is word-based. > Just the fact that we're introducing shorts already implies we're going to > have to extend somewhere. But given the bit-and with '2' the extension can be removed, no? Richard. > Kyrill > > >> >> Richard. >> >>> Anyone have any insight on how to resolve this one? >>> >>> Thanks, >>> Kyrill >>> > >
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
Moving to gcc, I accidentally sent it to gcc-patches previously... On 14/01/14 14:09, Richard Biener wrote: On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov wrote: Hi all, I'm looking into PR 54168 where we end up generating an unnecessary extend operation on arm. Given code: struct b2Body { unsigned short flags; int type; }; static _Bool IsAwake(struct b2Body *b) { return (b->flags & 2) == 2; } int foo(struct b2Body *bA, struct b2Body *bB) { int typeA = bA->type; int typeB = bB->type; _Bool activeA = IsAwake(bA) && typeA != 0; _Bool activeB = IsAwake(bB) && typeB != 0; if (!activeA && !activeB) { return 1; } return 0; } Compiled for arm-none-eabi with -O3 -march=armv7-a The inlined generated code for IsAwake contains the fragment: ldrh r0, [r1] and r0, r0, #2 uxth r0, r0 cmp r0, #0 which contains a redundant extend, which also confuses combine and prevents the whole thing from being optimised into an ldrh ; ands sequence. Looking at the tree dumps I notice that after the forwprop pass the types of the operands in the _7 = _3 & 2; statement turn into short unsigned where before that pass they were just ints: IsAwake (struct b2Body * b) { short unsigned int _3; int _4; _Bool _6; short unsigned int _7; : _3 = b_2(D)->flags; _4 = (int) _3; _7 = _3 & 2; _6 = _7 != 0; return _6; } I believe the C standard expects the operation to be performed in int mode. Now, since this is a bitwise and operation with a known constant 2, the operation can be safely performed in unsigned short mode. However on word-based machines like arm this would introduce unnecessary extend operations down the line, as I believe is the case here. Though the variant using shorts is clearly shorter (in number of stmts) and thus easier to optimize. Am I correct that the issue in the end is that _7 != 0 requires to extend _7? & 2 is trivially performed without any extension, no? If I look at the dump before forwprop, the number of statements is exactly the same, so it's not any shorter in that sense.
IsAwake (struct b2Body * b) { short unsigned int _3; int _4; int _5; _Bool _6; : _3 = b_2(D)->flags; _4 = (int) _3; _5 = _4 & 2; _6 = _5 != 0; return _6; } Using shorts is not cheaper on an architecture like arm which is word-based. Just the fact that we're introducing shorts already implies we're going to have to extend somewhere. Kyrill Richard. Anyone have any insight on how to resolve this one? Thanks, Kyrill
PATCH: PR target/59794: [4.7/4.8/4.9 Regression] i386 backend fails to detect MMX/SSE/AVX ABI changes
Hi, There are several problems with i386 MMX/SSE/AVX ABI change detection: 1. MMX/SSE return value isn't checked for -m32 since revision 83533: http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=83533 which added ix86_struct_value_rtx. Since the MMX/SSE condition is always false, the MMX/SSE return value ABI change is disabled. 2. For -m32, the same warning on MMX/SSE argument is issued twice, one from type_natural_mode and one from function_arg_32. 3. AVX return value ABI change isn't checked. This patch does the following: 1. Remove the ineffective ix86_struct_value_rtx. 2. Add a bool parameter to indicate if type is used for function return value. Warn ABI change if the vector mode isn't available for function return value. Add AVX function return value ABI change warning. 3. Consolidate ABI change warning into type_natural_mode. 4. Update g++.dg/ext/vector23.C to prune ABI change for Linux/x86 added by the AVX function return value ABI change warning. 5. Update gcc.target/i386/pr39162.c to avoid the AVX function return value ABI change warning. 6. Add testcases for warning MMX/SSE/AVX ABI changes in parameter passing and function return. Tested on Linux/x86-64 with -m32/-m64 for "make check". OK to install? Thanks. H.J. --- gcc/ 2014-01-14 H.J. Lu PR target/59794 * config/i386/i386.c (type_natural_mode): Add a bool parameter to indicate if type is used for function return value. Warn ABI change if the vector mode isn't available for function return value. (ix86_function_arg_advance): Pass false to type_natural_mode. (ix86_function_arg): Likewise. (ix86_gimplify_va_arg): Likewise. (function_arg_32): Don't warn ABI change. (ix86_function_value): Pass true to type_natural_mode. (ix86_return_in_memory): Likewise. (ix86_struct_value_rtx): Removed. (TARGET_STRUCT_VALUE_RTX): Likewise. gcc/testsuite/ 2014-01-14 H.J. Lu PR target/59794 * g++.dg/ext/vector23.C: Also prune ABI change for Linux/x86. * gcc.target/i386/pr39162.c (y): New __m256i variable.
(bar): Change return type to void. Set y to x. * gcc.target/i386/pr59794-1.c: New testcase. * gcc.target/i386/pr59794-2.c: Likewise. * gcc.target/i386/pr59794-3.c: Likewise. * gcc.target/i386/pr59794-4.c: Likewise. * gcc.target/i386/pr59794-5.c: Likewise. * gcc.target/i386/pr59794-6.c: Likewise. * gcc.target/i386/pr59794-7.c: Likewise.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ad48fc8..70181c3 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -6104,10 +6104,14 @@ init_cumulative_args (CUMULATIVE_ARGS *cum, /* Argument info to initialize */ The midde-end can't deal with the vector types > 16 bytes. In this case, we return the original mode and warn ABI change if CUM isn't - NULL. */ + NULL. + + If IN_RETURN is true, warn ABI change if the vector mode isn't + available for function return value. */ static enum machine_mode -type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) +type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum, + bool in_return) { enum machine_mode mode = TYPE_MODE (type); @@ -6133,6 +6137,7 @@ type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) if (size == 32 && !TARGET_AVX) { static bool warnedavx; + static bool warnedavx_ret; if (cum && !warnedavx @@ -6142,12 +6147,20 @@ type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) warning (0, "AVX vector argument without AVX " "enabled changes the ABI"); } + else if (in_return & !warnedavx_ret) + { + warnedavx_ret = true; + warning (0, "AVX vector return without AVX " +"enabled changes the ABI"); + } + return TYPE_MODE (type); } else if (((size == 8 && TARGET_64BIT) || size == 16) && !TARGET_SSE) { static bool warnedsse; + static bool warnedsse_ret; if (cum && !warnedsse @@ -6157,10 +6170,19 @@ type_natural_mode (const_tree type, const CUMULATIVE_ARGS *cum) warning (0, "SSE vector argument without SSE " "enabled changes the ABI"); } + else if (!TARGET_64BIT +&& in_return +& !warnedsse_ret) + { + warnedsse_ret = true;
+ warning (0, "SSE ve
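A reduced example of the case the patch starts diagnosing (a sketch in the spirit of the new pr59794-* testcases, not one of them): a function returning a 16-byte vector. With the patched compiler at -m32 without -msse, returning such a value should now trigger "SSE vector return without SSE enabled changes the ABI"; built normally (64-bit, SSE available) it simply runs.

```c
/* GCC/Clang generic vector type; 16 bytes, so it would be passed and
   returned in SSE registers when SSE is enabled.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Returning a 16-byte vector by value: with -m32 and no -msse this is
   exactly the return-value case the patched type_natural_mode warns
   about, which previously went undetected.  */
static v4si
make_vec (int a)
{
  return (v4si) { a, a + 1, a + 2, a + 3 };
}
```

The argument-passing direction was already warned about; the point of the patch is that the return direction now gets the same treatment.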
Re: PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
On Tue, Jan 14, 2014 at 3:03 PM, Kyrill Tkachov wrote: > Hi all, > > I'm looking into PR 54168 where we end up generating an unnecessary extend > operation on arm. > > Given code: > > struct b2Body { > unsigned short flags; > int type; > }; > > static _Bool IsAwake(struct b2Body *b) > { > return (b->flags & 2) == 2; > } > > > int foo(struct b2Body *bA, struct b2Body *bB) > { > int typeA = bA->type; > int typeB = bB->type; > _Bool activeA = IsAwake(bA) && typeA != 0; > _Bool activeB = IsAwake(bB) && typeB != 0; > > if (!activeA && !activeB) > { > return 1; > } > > return 0; > } > > Compiled for arm-none-eabi with -O3 -march=armv7-a > > The inlined generated code for IsAwake contains the fragment: > > ldrh r0, [r1] > and r0, r0, #2 > uxth r0, r0 > cmp r0, #0 > > which contains a redundant extend, which also confuses combine and prevents > the whole thing from being optimised into an ldrh ; ands sequence. > > Looking at the tree dumps I notice that after the forwprop pass the types of > the operands in the _7 = _3 & 2; statement turn into short unsigned where > before that pass they were just ints: > > IsAwake (struct b2Body * b) > { > short unsigned int _3; > int _4; > _Bool _6; > short unsigned int _7; > > : > _3 = b_2(D)->flags; > _4 = (int) _3; > _7 = _3 & 2; > _6 = _7 != 0; > return _6; > > } > > > I believe the C standard expects the operation to be performed in int mode. > Now, since this is a bitwise and operation with a known constant 2, the > operation can be safely performed in unsigned short mode. However on > word-based machines like arm this would introduce unnecessary extend > operations down the line, as I believe is the case here. Though the variant using shorts is clearly shorter (in number of stmts) and thus easier to optimize. Am I correct that the issue in the end is that _7 != 0 requires to extend _7? & 2 is trivially performed without any extension, no? Richard. > Anyone have any insight on how to resolve this one? > > Thanks, > Kyrill >
PR 54168: Unnecessary narrowing in tree-ssa-forwprop pass?
Hi all, I'm looking into PR 54168 where we end up generating an unnecessary extend operation on arm. Given code: struct b2Body { unsigned short flags; int type; }; static _Bool IsAwake(struct b2Body *b) { return (b->flags & 2) == 2; } int foo(struct b2Body *bA, struct b2Body *bB) { int typeA = bA->type; int typeB = bB->type; _Bool activeA = IsAwake(bA) && typeA != 0; _Bool activeB = IsAwake(bB) && typeB != 0; if (!activeA && !activeB) { return 1; } return 0; } Compiled for arm-none-eabi with -O3 -march=armv7-a The inlined generated code for IsAwake contains the fragment: ldrh r0, [r1] and r0, r0, #2 uxth r0, r0 cmp r0, #0 which contains a redundant extend, which also confuses combine and prevents the whole thing from being optimised into an ldrh ; ands sequence. Looking at the tree dumps I notice that after the forwprop pass the types of the operands in the _7 = _3 & 2; statement turn into short unsigned where before that pass they were just ints: IsAwake (struct b2Body * b) { short unsigned int _3; int _4; _Bool _6; short unsigned int _7; : _3 = b_2(D)->flags; _4 = (int) _3; _7 = _3 & 2; _6 = _7 != 0; return _6; } I believe the C standard expects the operation to be performed in int mode. Now, since this is a bitwise and operation with a known constant 2, the operation can be safely performed in unsigned short mode. However on word-based machines like arm this would introduce unnecessary extend operations down the line, as I believe is the case here. Anyone have any insight on how to resolve this one? Thanks, Kyrill
Re: [RFC] Using function clones for Pointer Bounds Checker
On Tue, Jan 14, 2014 at 1:47 PM, Ilya Enkovich wrote: > 2014/1/14 Richard Biener : >> On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich >> wrote: >>> Hi, >>> >>> I've been working for some time on the prototype of the Pointer Bounds >>> Checker which uses function clones for instrumentation >>> (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After >>> several experiments with this approach I want to share my results and >>> ask for some feedback to make a decision about the future steps. >>> >>> Firstly I want to recall the reasons for digging in this direction. In >>> the original approach bounds of call arguments and input parameters >>> are associated with arguments via special built-in calls. It creates >>> implicit data flow that the compiler is not aware of, which confuses some >>> optimizations, resulting in mis-optimization, and breaks bounds data >>> flow. Thus optimizations have to be fixed to get better pointer >>> protection. >>> >>> The clones approach does not use special built-in function calls to >>> associate bounds with call arguments and input parameters. Each >>> function which should be instrumented gets an additional version and >>> only this special version will be instrumented. This new version gets >>> additional bound arguments to express input bounds. When a function call >>> is instrumented, it is redirected to the instrumented version and all >>> bounds are passed as explicit call arguments. Thus we have explicit >>> pointer bounds flow similar to regular function parameters. It should >>> make it possible to avoid changes in optimizations, avoid mis-optimizations, >>> and allow existing IPA optimizations to work with bound args (e.g. >>> propagate constant bounds value and remove checks in called function). >>> >>> I made a prototype implementation of this approach in the following way: >>> >>> - Add new IPA pass before early local passes to produce versions for >>> all functions to be instrumented. >>> - Put instrumentation pass after SSA pass.
>>> - Add new pass after IPA passes to remove bodies of functions which >>> have instrumented versions. Function nodes may still be required for >>> calls in non-instrumented code. But we do not emit this code and >>> therefore function bodies are not needed. >>> >>> Positive changes are: >>> >>> - IPA optimizations are not confused by bound parameters >>> - bounds are now more like regular arguments; it makes their >>> processing in expand easier >>> - functions with bounds not attached to any pointer are allowed >> >> First of all thanks for trying to work in this direction. Comments on the >> issues you encountered below (also CCed Honza as he should be more >> familiar with reachability and clone issues). >> >>> On simple codes this approach worked well but on bigger tests some >>> issues were revealed. >>> >>> 1. Nodes reachability. Instrumented version is actually always >>> reachable when original function is reachable because it is always >>> emitted instead of the original. Thus I had to fix reachability >>> analysis to achieve it. Another similar problem is checking whether a node >>> can be removed after inline when inlining an instrumented function. Not >>> hard to fix but probably other similar problems exist. >> >> I suppose you do not update the callgraph / the call stmts when >> creating the clones? Btw, is it desirable to inline the uninstrumented >> function and then instrument the result (thus run cloning and >> instrumentation after early inlining?)? Especially handling always_inlines >> before cloning/instrumentation looks very sensible. > > Right. Created clones have the same code as the original function and > therefore same cgraph edges. I suppose instrumentation after early > inlining is OK and may be preferred because inlining shouldn't lead to > any losses of bounds information. I tried a variant where instrumentation > works right after early inlining but with cloning still before early > local passes. In general it looked OK. > >> >>> 2.
Function processing order. The function processing order is determined >>> before the early local passes. But during function instrumentation the >>> call graph is modified significantly and the topological order used >>> becomes outdated. That causes some trouble. E.g. a function marked as >>> 'always inline' cannot be inlined because it is not in SSA form yet. >>> Surely the inlining problem may be solved by just putting >>> instrumentation after early inlining, but similar problems may exist >>> in other passes too. To resolve this problem I tried to split the >>> early local passes into three parts. The first one builds SSA, the >>> second one performs instrumentation, the last one does the rest. Each >>> part is performed on all functions before the next one starts. Thus I >>> get all functions in SSA form and all instrumentation performed before >>> starting the early optimizations. Unfortunately such a pass order >>> leads to invalid SSA because of the local_pure_const optimization >>> affecting callers' correctness (in case the caller's SSA was built >>> before the optimization revealed the 'pure' or 'const' flag).
[PATCH] Fix PR59802, LCM compile-time slowness
This fixes the slowness seen in LCM compute_available accounted to RTL cprop. Currently the dataflow problem uses a "random" basic-block order to seed the initial worklist (it wants to visit predecessors before successors) - the following patch makes it use inverted postorder (similar to tree PRE antic computation).

This reduces the compile-time for the testcase in PR59802 at -O3 from

 CPROP   : 54.53 (55%) usr  0.04 ( 6%) sys 54.57 (55%) wall      kB ( 2%) ggc
 PRE     :  4.47 ( 5%) usr  0.03 ( 5%) sys  4.48 ( 5%) wall 1833 kB ( 1%) ggc
 TOTAL   : 98.51          0.63          99.13             220195 kB

to

 CPROP   :  2.13 ( 5%) usr  0.06 (10%) sys  2.20 ( 5%) wall      kB ( 2%) ggc
 PRE     :  0.52 ( 1%) usr  0.02 ( 3%) sys  0.54 ( 1%) wall 1833 kB ( 1%) ggc
 TOTAL   : 42.22          0.60          42.81             220195 kB

which is nice. I checked for other compile-time hog PRs with CPROP but didn't find one, have yet to check for PRE (three-letter, likely too much noise).

Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk (and branch?)

Thanks,
Richard.

2014-01-14  Richard Biener

	PR rtl-optimization/59802
	* lcm.c (compute_available): Use inverted postorder to seed
	the initial worklist.

Index: gcc/lcm.c
===================================================================
*** gcc/lcm.c	(revision 206599)
--- gcc/lcm.c	(working copy)
*** compute_available (sbitmap *avloc, sbitm
*** 496,507 ****
    bitmap_vector_ones (avout, last_basic_block_for_fn (cfun));
  
    /* Put every block on the worklist; this is necessary because of the
!      optimistic initialization of AVOUT above.  */
!   FOR_EACH_BB_FN (bb, cfun)
      {
        *qin++ = bb;
        bb->aux = bb;
      }
  
    qin = worklist;
    qend = &worklist[n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS];
--- 496,515 ----
    bitmap_vector_ones (avout, last_basic_block_for_fn (cfun));
  
    /* Put every block on the worklist; this is necessary because of the
!      optimistic initialization of AVOUT above.  Use inverted postorder
!      to make the dataflow problem require less iterations.  */
!   int *postorder = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
!   int postorder_num = inverted_post_order_compute (postorder);
!   for (int i = 0; i < postorder_num; ++i)
      {
+       bb = BASIC_BLOCK_FOR_FN (cfun, postorder[i]);
+       if (bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
+ 	  || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+ 	continue;
        *qin++ = bb;
        bb->aux = bb;
      }
+   free (postorder);
  
    qin = worklist;
    qend = &worklist[n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS];
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Tue, 14 Jan 2014, Jakub Jelinek wrote: > On Tue, Jan 14, 2014 at 10:01:06AM +0100, Richard Biener wrote: > > Jakub, adding the new flag is ok with me. > > So like this? Ok if it passes testing. Thanks, Richard. > 2014-01-14 Jakub Jelinek > > * tree-vectorizer.h (struct _loop_vec_info): Add no_data_dependencies > field. > (LOOP_VINFO_NO_DATA_DEPENDENCIES): Define. > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): Clear it > when not giving up or versioning for alias only because of > loop->safelen. > (vect_analyze_data_ref_dependences): Set to true. > * tree-vect-stmts.c (vectorizable_load): Use > LOOP_VINFO_NO_DATA_DEPENDENCIES instead of > LOOP_REQUIRES_VERSIONING_FOR_ALIAS. > > --- gcc/tree-vectorizer.h.jj 2014-01-03 11:40:57.0 +0100 > +++ gcc/tree-vectorizer.h 2014-01-14 13:10:00.477989924 +0100 > @@ -347,6 +347,25 @@ typedef struct _loop_vec_info { > fix it up. */ >bool operands_swapped; > > + /* True if there are no loop carried data dependencies in the loop. > + If loop->safelen <= 1, then this is always true, either the loop > + didn't have any loop carried data dependencies, or the loop is being > + vectorized guarded with some runtime alias checks, or couldn't > + be vectorized at all, but then this field shouldn't be used. > + For loop->safelen >= 2, the user has asserted that there are no > + backward dependencies, but there still could be loop carried forward > + dependencies in such loops. This flag will be false if normal > + vectorizer data dependency analysis would fail or require versioning > + for alias, but because of loop->safelen >= 2 it has been vectorized > + even without versioning for alias. E.g. in: > + #pragma omp simd > + for (int i = 0; i < m; i++) > + a[i] = a[i + k] * c; > + (or #pragma simd or #pragma ivdep) we can vectorize this and it will > + DTRT even for k > 0 && k < m, but without safelen we would not > + vectorize this, so this field would be false. 
*/ > + bool no_data_dependencies; > + >/* If if-conversion versioned this loop before conversion, this is the > loop version without if-conversion. */ >struct loop *scalar_loop; > @@ -385,6 +404,7 @@ typedef struct _loop_vec_info { > #define LOOP_VINFO_PEELING_FOR_GAPS(L) (L)->peeling_for_gaps > #define LOOP_VINFO_OPERANDS_SWAPPED(L) (L)->operands_swapped > #define LOOP_VINFO_PEELING_FOR_NITER(L)(L)->peeling_for_niter > +#define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies > #define LOOP_VINFO_SCALAR_LOOP(L) (L)->scalar_loop > > #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \ > --- gcc/tree-vect-data-refs.c.jj 2014-01-10 00:38:26.0 +0100 > +++ gcc/tree-vect-data-refs.c 2014-01-14 13:12:06.056342116 +0100 > @@ -244,6 +244,7 @@ vect_analyze_data_ref_dependence (struct > { > if (loop->safelen < *max_vf) > *max_vf = loop->safelen; > + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; > return false; > } > > @@ -291,6 +292,7 @@ vect_analyze_data_ref_dependence (struct > { > if (loop->safelen < *max_vf) > *max_vf = loop->safelen; > + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; > return false; > } > > @@ -447,6 +449,7 @@ vect_analyze_data_ref_dependences (loop_ > dump_printf_loc (MSG_NOTE, vect_location, > "=== vect_analyze_data_ref_dependences ===\n"); > > + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = true; >if (!compute_all_dependences (LOOP_VINFO_DATAREFS (loop_vinfo), > &LOOP_VINFO_DDRS (loop_vinfo), > LOOP_VINFO_LOOP_NEST (loop_vinfo), true)) > --- gcc/tree-vect-stmts.c.jj 2014-01-14 10:33:21.0 +0100 > +++ gcc/tree-vect-stmts.c 2014-01-14 13:14:15.157677243 +0100 > @@ -6381,10 +6381,11 @@ vectorizable_load (gimple stmt, gimple_s > if (inv_p && !bb_vinfo) > { > gcc_assert (!grouped_load); > - /* If we have versioned for aliasing then we are sure > - this is a loop invariant load and thus we can insert > - it on the preheader edge. 
*/ > - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) > + /* If we have versioned for aliasing or the loop doesn't > + have any data dependencies that would preclude this, > + then we are sure this is a loop invariant load and > + thus we can insert it on the preheader edge. */ > + if (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)) > { > if (dump_enabled_p ()) > { > > > Jakub > > -- Richard Biener SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Tue, Jan 14, 2014 at 10:01:06AM +0100, Richard Biener wrote: > Jakub, adding the new flag is ok with me. So like this? 2014-01-14 Jakub Jelinek * tree-vectorizer.h (struct _loop_vec_info): Add no_data_dependencies field. (LOOP_VINFO_NO_DATA_DEPENDENCIES): Define. * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): Clear it when not giving up or versioning for alias only because of loop->safelen. (vect_analyze_data_ref_dependences): Set to true. * tree-vect-stmts.c (vectorizable_load): Use LOOP_VINFO_NO_DATA_DEPENDENCIES instead of LOOP_REQUIRES_VERSIONING_FOR_ALIAS. --- gcc/tree-vectorizer.h.jj2014-01-03 11:40:57.0 +0100 +++ gcc/tree-vectorizer.h 2014-01-14 13:10:00.477989924 +0100 @@ -347,6 +347,25 @@ typedef struct _loop_vec_info { fix it up. */ bool operands_swapped; + /* True if there are no loop carried data dependencies in the loop. + If loop->safelen <= 1, then this is always true, either the loop + didn't have any loop carried data dependencies, or the loop is being + vectorized guarded with some runtime alias checks, or couldn't + be vectorized at all, but then this field shouldn't be used. + For loop->safelen >= 2, the user has asserted that there are no + backward dependencies, but there still could be loop carried forward + dependencies in such loops. This flag will be false if normal + vectorizer data dependency analysis would fail or require versioning + for alias, but because of loop->safelen >= 2 it has been vectorized + even without versioning for alias. E.g. in: + #pragma omp simd + for (int i = 0; i < m; i++) + a[i] = a[i + k] * c; + (or #pragma simd or #pragma ivdep) we can vectorize this and it will + DTRT even for k > 0 && k < m, but without safelen we would not + vectorize this, so this field would be false. */ + bool no_data_dependencies; + /* If if-conversion versioned this loop before conversion, this is the loop version without if-conversion. 
*/ struct loop *scalar_loop; @@ -385,6 +404,7 @@ typedef struct _loop_vec_info { #define LOOP_VINFO_PEELING_FOR_GAPS(L) (L)->peeling_for_gaps #define LOOP_VINFO_OPERANDS_SWAPPED(L) (L)->operands_swapped #define LOOP_VINFO_PEELING_FOR_NITER(L)(L)->peeling_for_niter +#define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies #define LOOP_VINFO_SCALAR_LOOP(L) (L)->scalar_loop #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \ --- gcc/tree-vect-data-refs.c.jj2014-01-10 00:38:26.0 +0100 +++ gcc/tree-vect-data-refs.c 2014-01-14 13:12:06.056342116 +0100 @@ -244,6 +244,7 @@ vect_analyze_data_ref_dependence (struct { if (loop->safelen < *max_vf) *max_vf = loop->safelen; + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; return false; } @@ -291,6 +292,7 @@ vect_analyze_data_ref_dependence (struct { if (loop->safelen < *max_vf) *max_vf = loop->safelen; + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = false; return false; } @@ -447,6 +449,7 @@ vect_analyze_data_ref_dependences (loop_ dump_printf_loc (MSG_NOTE, vect_location, "=== vect_analyze_data_ref_dependences ===\n"); + LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo) = true; if (!compute_all_dependences (LOOP_VINFO_DATAREFS (loop_vinfo), &LOOP_VINFO_DDRS (loop_vinfo), LOOP_VINFO_LOOP_NEST (loop_vinfo), true)) --- gcc/tree-vect-stmts.c.jj2014-01-14 10:33:21.0 +0100 +++ gcc/tree-vect-stmts.c 2014-01-14 13:14:15.157677243 +0100 @@ -6381,10 +6381,11 @@ vectorizable_load (gimple stmt, gimple_s if (inv_p && !bb_vinfo) { gcc_assert (!grouped_load); - /* If we have versioned for aliasing then we are sure -this is a loop invariant load and thus we can insert -it on the preheader edge. */ - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) + /* If we have versioned for aliasing or the loop doesn't +have any data dependencies that would preclude this, +then we are sure this is a loop invariant load and +thus we can insert it on the preheader edge. 
*/ + if (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)) { if (dump_enabled_p ()) { Jakub
Re: [RFC] Using function clones for Pointer Bounds Checker
2014/1/14 Richard Biener : > On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich > wrote: >> Hi, >> >> I've been working for some time on the prototype of the Pointer Bounds >> Checker which uses function clones for instrumentation >> (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After >> several experiments with this approach I want to share my results and >> ask for some feedback to make a decision about the future steps. >> >> Firstly I want to remind the reasons for digging in this direction. In >> the original approach bounds of call arguments and input parameters >> are associated with arguments via special built-in calls. It creates >> implicit data flow compiler is not aware about which confuses some >> optimizations resulting in miss-optimization and breaks bounds data >> flow. Thus optimizations have to be fixed to get better pointers >> protection. >> >> Clones approach does not use special built-in function calls to >> associate bounds with call arguments and input parameters. Each >> function which should be instrumented gets an additional version and >> only this special version will be instrumented.This new version gets >> additional bound arguments to express input bounds. When function call >> is instrumented, it is redirected to instrumented version and all >> bounds are passed as explicit call arguments. Thus we have explicit >> pointer bounds flow similar to regular function parameters. It should >> allow to avoid changes in optimization, avoid miss-optimizations, >> allow existing IPA optimizations to work with bound args (e.g. >> propagate constant bounds value and remove checks in called function). >> >> I made a prototype implementation of this approach in the following way: >> >> - Add new IPA pass before early local passes to produce versions for >> all functions to be instrumented. >> - Put instrumentation pass after SSA pass. >> - Add new pass after IPA passes to remove bodies of functions which >> have instrumented versions. 
Function nodes may still be required for >> calls in not instrumented code. But we do not emit this code and >> therefore function bodies are not needed. >> >> Positive changes are: >> >> - IPA optimizations are not confused by bound parameters >> - bounds are now more like regular arguments; it makes their >> processing in expand easier >> - functions with bounds not attached to any pointer are allowed > > First of all thanks for trying to work in this direction. Comments on the > issues you encountered below (also CCed Honza as he should be more > familiar with reachability and clone issues). > >> On simple codes this approach worked well but on a bigger tests some >> issues were revealed. >> >> 1. Nodes reachability. Instrumented version is actually always >> reachable when original function is reachable because it is always >> emitted instead of the original. Thus I had to fix reachability >> analysis to achieve it. Another similar problem is check whether node >> can be removed after inline when inlining instrumented function. Not >> hard to fix but probably other similar problems exist. > > I suppose you do not update the callgraph / the call stmts when > creating the clones? Btw, is it desirable to inline the uninstrumented > function and then instrument the result (thus run cloning and > instrumentation after early inlining?)? Especially handling always_inlines > before cloning/isntrumentation looks very sensible. Right. Created clones have the same code as the original function and therefore same cgraph edges. I suppose instrumentation after early inlining is OK and may be preferred because inline shouldn't lead to any losses of bounds information. I tried variant when instrumentation works right after early inlining but with cloning still before early local passes. In general it looked OK. > >> 2. Function processing order. Function processing order is determined >> before early local passes. 
But during function instrumentation call >> graph is modified significantly and used topological order becomes >> outdated. That causes some troubles. E.g. function marked as 'always >> inline' cannot be inlined because it is not in SSA form yet. Surely >> inlining problem may be solved by just putting instrumentation after >> early inline, but similar problem may exist in other passes too. To >> resolve this problem I tried to split early local passes into three >> parts. The first one builds SSA, the second one performs >> instrumentation, the last one does the rest. Each part is performed on >> all functions before the next one starts. Thus I get all functions in >> SSA form and all instrumentation performed before starting early >> optimizations. Unfortunately such passes order leads to invalid SSA >> because of local_pure_const optimization affecting callers correctness >> (in case caller SSA was built before optimization revealed 'pure' or >> 'const' flag). > > Generally the processing order of early_local_passes is chosen > to get bette
Re: [PATCH, AArch64] Use llfloor and llceil for vcvtmd_s64_f64 and vcvtpd_s64_f64 in arm_neon.h
On 6 January 2014 12:30, Yufeng Zhang wrote: > This patch fixes the implementation of vcvtmd_s64_f64 and vcvtpd_s64_f64 in > arm_neon.h to use llfloor and llceil instead, which are ILP32-friendly. > > This patch will fix the following test failure in the ILP32 mode: > > FAIL: gcc.target/aarch64/vect-vcvt.c scan-assembler fcvtms\\tx[0-9]+, > d[0-9]+ > > OK for the trunk? OK, but we should wait for stage-1 now. Thanks /Marcus
[PING] Re: [PATCH, AArch64] Use llfloor and llceil for vcvtmd_s64_f64 and vcvtpd_s64_f64 in arm_neon.h
Ping~ Originally posted here: http://gcc.gnu.org/ml/gcc-patches/2014-01/msg00185.html Thanks, Yufeng On 01/06/14 12:30, Yufeng Zhang wrote: This patch fixes the implementation of vcvtmd_s64_f64 and vcvtpd_s64_f64 in arm_neon.h to use llfloor and llceil instead, which are ILP32-friendly. This patch will fix the following test failure in the ILP32 mode: FAIL: gcc.target/aarch64/vect-vcvt.c scan-assembler fcvtms\\tx[0-9]+, d[0-9]+ OK for the trunk? Thanks, Yufeng gcc/ * config/aarch64/aarch64-builtins.c (aarch64_builtin_vectorized_function): Add BUILT_IN_LFLOORF, BUILT_IN_LLFLOOR, BUILT_IN_LCEILF and BUILT_IN_LLCEIL. * config/aarch64/arm_neon.h (vcvtaq_u64_f64): Call __builtin_llfloor instead of __builtin_lfloor. (vcvtnq_u64_f64): Call __builtin_llceil instead of __builtin_lceil.
Re: [RFC] Using function clones for Pointer Bounds Checker
On Tue, Jan 14, 2014 at 10:15 AM, Ilya Enkovich wrote:
> Hi,
>
> I've been working for some time on the prototype of the Pointer Bounds
> Checker which uses function clones for instrumentation
> (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After
> several experiments with this approach I want to share my results and
> ask for some feedback to make a decision about the future steps.
>
> Firstly I want to recall the reasons for digging in this direction. In
> the original approach, bounds of call arguments and input parameters
> are associated with arguments via special built-in calls. That creates
> implicit data flow the compiler is not aware of, which confuses some
> optimizations, resulting in mis-optimization, and breaks the bounds
> data flow. Thus optimizations have to be fixed to get better pointer
> protection.
>
> The clones approach does not use special built-in function calls to
> associate bounds with call arguments and input parameters. Each
> function which should be instrumented gets an additional version, and
> only this special version is instrumented. This new version gets
> additional bound arguments to express input bounds. When a function
> call is instrumented, it is redirected to the instrumented version and
> all bounds are passed as explicit call arguments. Thus we have explicit
> pointer bounds flow similar to regular function parameters. It should
> make it possible to avoid changes in optimizations, avoid
> mis-optimizations, and allow existing IPA optimizations to work with
> bound args (e.g. propagate constant bounds values and remove checks in
> the called function).
>
> I made a prototype implementation of this approach in the following way:
>
> - Add a new IPA pass before the early local passes to produce versions
> for all functions to be instrumented.
> - Put the instrumentation pass after the SSA pass.
> - Add a new pass after the IPA passes to remove bodies of functions
> which have instrumented versions. Function nodes may still be required
> for calls in non-instrumented code. But we do not emit this code and
> therefore the function bodies are not needed.
>
> Positive changes are:
>
> - IPA optimizations are not confused by bound parameters
> - bounds are now more like regular arguments; it makes their
> processing in expand easier
> - functions with bounds not attached to any pointer are allowed

First of all thanks for trying to work in this direction. Comments on the issues you encountered below (also CCed Honza as he should be more familiar with reachability and clone issues).

> On simple code this approach worked well, but on bigger tests some
> issues were revealed.
>
> 1. Node reachability. The instrumented version is actually always
> reachable when the original function is reachable because it is always
> emitted instead of the original. Thus I had to fix the reachability
> analysis to achieve this. Another similar problem is the check whether
> a node can be removed after inlining an instrumented function. Not
> hard to fix, but probably other similar problems exist.

I suppose you do not update the callgraph / the call stmts when creating the clones? Btw, is it desirable to inline the uninstrumented function and then instrument the result (thus run cloning and instrumentation after early inlining?)? Especially handling always_inlines before cloning/instrumentation looks very sensible.

> 2. Function processing order. The function processing order is
> determined before the early local passes. But during function
> instrumentation the call graph is modified significantly and the
> topological order used becomes outdated. That causes some trouble.
> E.g. a function marked as 'always inline' cannot be inlined because it
> is not in SSA form yet. Surely the inlining problem may be solved by
> just putting instrumentation after early inlining, but similar
> problems may exist in other passes too. To resolve this problem I
> tried to split the early local passes into three parts. The first one
> builds SSA, the second one performs instrumentation, the last one does
> the rest. Each part is performed on all functions before the next one
> starts. Thus I get all functions in SSA form and all instrumentation
> performed before starting the early optimizations. Unfortunately such
> a pass order leads to invalid SSA because of the local_pure_const
> optimization affecting callers' correctness (in case the caller's SSA
> was built before the optimization revealed the 'pure' or 'const'
> flag).

Generally the processing order of early_local_passes is chosen to get better optimization - changing it shouldn't affect correctness, and thus the issues you observe demand fixing anyway. (I've noted in the past that the early_local_passes processing order should more explicitly honor callgraph SCCs, eventually even iterating). Moving the SSA build out of the early_local_passes and into a separate lowering stage is possible; the pure-const stuff is handled by keeping pass_fixup_cfg where it is now, I think. In theory you can go into SSA form in all_lowering_passes already (but you have
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Tue, 14 Jan 2014, Richard Biener wrote: > On Mon, 13 Jan 2014, Cong Hou wrote: > > > I noticed that LIM could not hoist vector invariant, and that is why > > my first implementation tries to hoist them all. > > Yes, I filed PR59786 for this. I'll see if I can come up with > a fix suitable for stage3. > > > In addition, there are two disadvantages of hoisting invariant load + > > lim method: > > > > First, for some instructions the scalar version is faster than the > > vector version, and in this case hoisting scalar instructions before > > vectorization is better. Those instructions include data > > packing/unpacking, integer multiplication with SSE2, etc.. > > > > Second, it may use more SIMD registers. > > > > The following code shows a simple example: > > > > char *a, *b, *c; > > for (int i = 0; i < N; ++i) > > a[i] = b[0] * c[0] + a[i]; > > > > Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0] > > into a vector. > > Yes. I've tried with adjusting the vec_def_type as in the prototype > patch I sent before christmas but that's quite intrusive for this > stage. You could argue that performing invariant motion is not > really the vectorizers main task and that a combination of hoisting > only the load, later LIM hoisting the rest and then tree-vect-generic.c > demoting vector ops to scalar ops (unimplemented, but also a useful > general optimization) would work as well. For example with the untested following. Not sure if the LIM change is appropriate at this stage (it's handling of "cost" is weird, and in other places of the compiler we simply aggressively hoist invariants and expect RTL to fixup register pressure issues). The lowering change looks more like sth for forwprop but that runs quite late after vectorization. tree-vect-generic could at least factor in whether the target has a scalar op of that kind and whether that is maybe more expensive (though trading two vector splats for one is very likely offsetting that). 
It also would need to consider the case where this moves a vector splat inside a loop when handling the testcases we talk about without improved invariant motion.

Any comments? Anything we want to fix before 4.9? The testcases are optimized by RTL invariant motion but they perform a vector addition. For example

void test1 (int* a, int* b)
{
  int i;
  for (i = 0; i < 10; ++i)
    a[i] = *b + 1;
}

gets

.L7:
	movd	(%rsi), %xmm1
	leaq	(%rdi,%rdx,4), %rdx
	xorl	%eax, %eax
	pshufd	$0, %xmm1, %xmm0
	paddd	.LC0(%rip), %xmm0
	.p2align 4,,10
	.p2align 3
.L4:
	addl	$1, %eax
	addq	$16, %rdx
	movaps	%xmm0, -16(%rdx)
	cmpl	%eax, %ecx
	ja	.L4

instead of

.L7:
	movl	(%rsi), %eax
	leaq	(%rdi,%rdx,4), %rdx
	addl	$1, %eax
	movl	%eax, -12(%rsp)
	xorl	%eax, %eax
	movd	-12(%rsp), %xmm1
	pshufd	$0, %xmm1, %xmm0
	.p2align 4,,10
	.p2align 3
.L4:
	addl	$1, %eax
	addq	$16, %rdx
	movaps	%xmm0, -16(%rdx)
	cmpl	%eax, %ecx
	ja	.L4

which because of the by default disabled inter-unit moves looks even more expensive. With inter-unit moves we get

.L7:
	movl	(%rsi), %eax
	leaq	(%rdi,%rdx,4), %rdx
	addl	$1, %eax
	movd	%eax, %xmm0
	xorl	%eax, %eax
	pshufd	$0, %xmm0, %xmm0
	.p2align 4,,10
	.p2align 3
.L4:
	addl	$1, %eax
	movaps	%xmm0, (%rdx)
	addq	$16, %rdx
	cmpl	%eax, %ecx
	ja	.L4

not sure if the avoided constant pool load offsets the inter-unit move here (depends on the kind of pipeline constraints that has, the above is with corei7 tuning).

It looks to me that demoting vector to scalar ops might be better performed at RTL level? Plus the reverse op as seen from the above example where it isn't all clear which variant is better (which probably depends quite some bit on the CPU architecture).

Thanks,
Richard.

Index: gcc/tree-ssa-loop-im.c
===================================================================
*** gcc/tree-ssa-loop-im.c	(revision 206599)
--- gcc/tree-ssa-loop-im.c	(working copy)
*** stmt_cost (gimple stmt)
*** 533,538 ****
--- 533,541 ----
        return 0;
      default:
+       /* All vector operations are expensive.  */
+       if (VECTOR_TYPE_P (gimple_expr_type (stmt)))
+ 	return LIM_EXPENSIVE;
        return 1;
      }
  }
Index: gcc/tree-vect-generic.c
===================================================================
*** gcc/tree-vect-generic.c	(revision 206599)
--- gcc/tree-vect-generic.c	(working copy)
*** lower_vec_perm (gimple_stmt_iterator *gs
*** 1335,1340 ****
--- 1335,1357 ----
    update_stmt (gsi_stmt (*gsi));
  }
  
+ /* If OP is a uniform vector return the element it is a splat from.  */
+ 
+ static tree
+ ssa_uniform_vecto
Drop REG_CROSSING_JUMP when converting to a conditional return
While experimenting with a patch to use conditional returns, I hit a case where a conditional jump between hot and cold sections was being converted into a conditional return. The new jump still had the REG_CROSSING_JUMP and later passes were confused by the combination of that and a return JUMP_LABEL. The jump is now returning directly from the current section rather than crossing sections within the function, so I think we should just drop the note.

Tested on s390x-linux-gnu with some local patches. OK to install?

Thanks,
Richard

gcc/
	* jump.c (redirect_jump_2): Remove REG_CROSSING_JUMP notes when
	converting a conditional jump into a conditional return.

Index: gcc/jump.c
===================================================================
--- gcc/jump.c	2014-01-03 15:06:10.516727719 +
+++ gcc/jump.c	2014-01-14 10:27:13.224173269 +
@@ -1580,6 +1580,16 @@ redirect_jump_2 (rtx jump, rtx olabel, r
 	}
     }
 
+  /* Handle the case where we had a conditional crossing jump to a return
+     label and are now changing it into a direct conditional return.
+     The jump is no longer crossing in that case.  */
+  if (ANY_RETURN_P (nlabel))
+    {
+      note = find_reg_note (jump, REG_CROSSING_JUMP, NULL_RTX);
+      if (note)
+	remove_note (jump, note);
+    }
+
   if (!ANY_RETURN_P (olabel)
       && --LABEL_NUSES (olabel) == 0 && delete_unused > 0
       /* Undefined labels will remain outside the insn stream.  */
Re: [PATCH/AARCH64] Add issue_rate tuning field
On 14/01/14 01:49, Andrew Pinski wrote: > Hi, > While writing a scheduler for Cavium's aarch64 processor (Thunder), > I found there was no way currently to change the issue rate in > back-end. This patch adds a field (issue_rate) to tune_params and > creates a new function that the middle-end calls. I updated the > current two tuning variables (generic_tunings and cortexa53_tunings) > to be 1 which was the default before. > > OK? Built and tested for aarch64-elf with no regressions. > > Thanks, > Andrew Pinski > > ChangeLog: > * config/aarch64/aarch64-protos.h (tune_params): Add issue_rate. > * config/aarch64/aarch64.c (generic_tunings): Add issue rate of 1. > (cortexa53_tunings): Likewise. > (aarch64_sched_issue_rate): New function. > (TARGET_SCHED_ISSUE_RATE): Define. > > Ug, I'd missed that we weren't setting this. I think the value should be 2 for both generic and Cortex-A53 (I can't really envisage single-issue AArch64 systems). OK with that change (yes I know that wasn't the default before)... R.
Re: libsanitizer merge from upstream r196090
Uros Bizjak wrote: > The same tsan failures are reported in one of HJ's testers [2], with message: Can this be duplicate of http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59410 ? -Y
Re: [PATCH][IRA] Analysis of register usage of functions for usage by IRA.
On 13/01/14 16:16, Tom de Vries wrote: > On 10-01-14 12:39, Richard Earnshaw wrote: >> Consequently, you'll need to add a patch for AArch64 which has two >> registers clobbered by PLT-based calls. >> Thanks for pointing that out. That's r16 and r17, right? I can propose the hook for AArch64, once we all agree on how the hook should look. >> Yes; and thanks! > > Hi Richard, > > I'm posting this patch that implements the TARGET_FN_OTHER_HARD_REG_USAGE > hook > for aarch64. It uses the conservative hook format for now. > > I've build gcc and cc1 with the patch, and observed the impact on this code > snippet: > ... > static int > bar (int x) > { >return x + 3; > } > > int > foo (int y) > { >return y + bar (y); > } > ... > > AFAICT, that looks as expected: > ... > $ gcc fuse-caller-save.c -mno-lra -fno-use-caller-save -O2 -S -o- > 1 > $ gcc fuse-caller-save.c -mno-lra -fuse-caller-save -O2 -S -o- > 2 > $ diff -u 1 2 > --- 1 2014-01-13 16:51:24.0 +0100 > +++ 2 2014-01-13 16:51:19.0 +0100 > @@ -11,14 +11,12 @@ > .global foo > .type foo, %function > foo: > - stp x29, x30, [sp, -32]! > + stp x29, x30, [sp, -16]! > + mov w1, w0 > add x29, sp, 0 > - str x19, [sp,16] > - mov w19, w0 > bl bar > - add w0, w0, w19 > - ldr x19, [sp,16] > - ldp x29, x30, [sp], 32 > + ldp x29, x30, [sp], 16 > + add w0, w0, w1 > ret > .size foo, .-foo > .section.text.startup,"ax",%progbits > ... > > Btw, the results are the same for -mno-lra and -mlra. I'm just using the > -mno-lra version here because the -mlra version of -fuse-caller-save is still > in > review ( http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00586.html ). > > Thanks, > - Tom > > > fuse-caller-save-aarch64-hook.patch > > > 2014-01-11 Tom de Vries > > * config/aarch64/aarch64.c (TARGET_FN_OTHER_HARD_REG_USAGE): Redefine as > aarch64_fn_other_hard_reg_usage. > (aarch64_fn_other_hard_reg_usage): New function. 
> --- > gcc/config/aarch64/aarch64.c | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index 3b1f6b5..295fd5d 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -3287,6 +3287,16 @@ aarch64_fixed_condition_code_regs (unsigned int *p1, > unsigned int *p2) > return true; > } > > +/* Implement TARGET_FN_OTHER_HARD_REG_USAGE. */ > + > +static bool > +aarch64_fn_other_hard_reg_usage (struct hard_reg_set_container *regs) > +{ > + SET_HARD_REG_BIT (regs->set, R16_REGNUM); > + SET_HARD_REG_BIT (regs->set, R17_REGNUM); > + return true; > +} I think that in this context using IP0_REGNUM and IP1_REGNUM would be slightly clearer, since it is because these registers are the inter-procedure-call scratch registers that they aren't safe to use in this context. Otherwise, this is OK. R.
Re: libsanitizer merge from upstream r196090
> FAIL: g++.dg/tsan/default_options.C -O2 execution test This one looks plain wrong to me. It should be checked for success instead of failure. -Y
Re: libsanitizer merge from upstream r196090
I've started a separate topic about flaky tsan tests here: https://groups.google.com/forum/#!topic/thread-sanitizer/KIok3F_b1oI --kcc On Fri, Jan 10, 2014 at 7:20 PM, Jakub Jelinek wrote: > On Fri, Jan 10, 2014 at 03:50:33PM +0400, Maxim Ostapenko wrote: >> On Fri, Jan 10, 2014 at 10:39 AM, Jakub Jelinek wrote: >> >> > Some of the tsan tests seem to FAIL randomly for quite a while >> > (since they were added), didn't have time to look if it is just >> bugs in the test or >> > some compiler issue or library issue. >> >> When I committed these tsan tests, all of them passed on my >> x86_64-unknown-linux-gnu 64bit system. >> >> Should I review them more carefully? > > That would be nice. The FAILs I'm seeing include e.g. > FAIL: c-c++-common/tsan/simple_race.c -O2 execution test > > FAIL: c-c++-common/tsan/simple_race.c -O0 execution test > FAIL: g++.dg/tsan/default_options.C -O2 execution test > > (and various others in the past, though not sure if any of the > pattern-related failures could have something to do with libbacktrace > symbolization not being there yet (note, I do not have > /usr/bin/llvm-symbolizer installed on the testing box)). > > Both these tests don't report anything (well, default_options.C prints DONE), > simply succeed (with dg-shouldfail that is a failure). > I don't see anything wrong in the compiler output, the Thread? > routines just call __tsan_func_entry, then __tsan_write4 (&Global); > then __tsan_func_exit, so I don't see how it could be related to anything > but the library. Note the box is sometimes quite loaded (two make -j48 > regtests going on at the same time), but there is enough memory. > > Is the library perhaps timing sensitive, e.g. that it would track > issues only if the two threads are actually concurrent rather than could be > concurrent? 
Say if the first pthread_create creates the thread immediately, > the second pthread_create returns but the clone-started thread isn't up yet, > then pthread_join on the first thread finishes and the first thread is > destroyed, then the second thread actually starts? > > Jakub
Re: [PATCH, ARM] Fix two IT issues
On 14/01/14 09:06, Zhenqiang Chen wrote: > Hi, > > The patch fixes two IT issues: > 1) IT block is split for A15 (and Cortex-M). > > For A15 tune, max cond insns (max_insns_skipped) is 2, which is set as the > maximum allowed insns in an IT block (see thumb2_final_prescan_insn). So an IT block > which has 3 or 4 instructions will be split. Take the first if-then-else > in the test case of the patch as an example; it will generate ITs like: > cmp r0, #10 > ite gt > subgt r0, r0, r1 > suble r0, r1, r0 > ite gt > addgt r0, r0, #10 > suble r0, r0, #7 > It does not make sense to split the IT. For cortex-m, the IT cannot be > folded if the previous insn is 4 BYTES. > > 2) For arm_v7m_tune, max cond insns is not aligned with its branch cost. In > ifcvt.c, the relation between MAX_CONDITIONAL_EXECUTE (max cond insns) and > branch cost is: > > #ifndef MAX_CONDITIONAL_EXECUTE > #define MAX_CONDITIONAL_EXECUTE \ > (BRANCH_COST (optimize_function_for_speed_p (cfun), false) \ > + 1) > #endif > > So the best value of max cond insns for v7m_tune should be 2. > > Bootstrap and no make check regression on ARM Chrome book. No make check > regression for Cortex-M3. > > Cortex-M4 O2 performance changes on coremark, dhrystone and eembc-v1: > coremark: -0.11% > dhrystone: 1.26%. > a2time01_lite: 2.63% > canrdr01_lite: 4.27% > iirflt01_lite: 6.51% > rspeed01_lite: 6.51% > dither01_lite: 7.36% > > The biggest regression in eembc-v1 is pntrch01_lite: -0.51% > All other regressions < 0.1% > > Cortex-M4 O3 performance changes are similar to O2, except one regression > due to loop alignment change. > > OK for trunk? > > Thanks! > -Zhenqiang > > 2014-01-14 Zhenqiang Chen > > * config/arm/arm.c (arm_v7m_tune): Set max_insns_skipped to 2. > (thumb2_final_prescan_insn): Set max to MAX_INSN_PER_IT_BLOCK. > > testsuite/ChangeLog: > 2014-01-14 Zhenqiang Chen > > * gcc.target/arm/its.c: New test. > OK. R.
[RFC] Using function clones for Pointer Bounds Checker
Hi, I've been working for some time on a prototype of the Pointer Bounds Checker which uses function clones for instrumentation (http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03327.html). After several experiments with this approach I want to share my results and ask for some feedback before deciding on the future steps. Firstly, I want to recall the reasons for digging in this direction. In the original approach, bounds of call arguments and input parameters are associated with arguments via special built-in calls. This creates an implicit data flow the compiler is not aware of, which confuses some optimizations, resulting in mis-optimization, and breaks the bounds data flow. Thus optimizations have to be fixed to get better pointer protection. The clones approach does not use special built-in function calls to associate bounds with call arguments and input parameters. Each function which should be instrumented gets an additional version, and only this special version is instrumented. This new version gets additional bound arguments to express input bounds. When a function call is instrumented, it is redirected to the instrumented version and all bounds are passed as explicit call arguments. Thus we have an explicit pointer bounds flow similar to that of regular function parameters. This should allow us to avoid changes in optimizations, avoid mis-optimizations, and allow existing IPA optimizations to work with bound args (e.g. propagate constant bounds values and remove checks in the called function). I made a prototype implementation of this approach in the following way: - Add a new IPA pass before the early local passes to produce versions for all functions to be instrumented. - Put the instrumentation pass after the SSA pass. - Add a new pass after the IPA passes to remove the bodies of functions which have instrumented versions. Function nodes may still be required for calls in non-instrumented code, but we do not emit this code and therefore the function bodies are not needed. 
Positive changes are: - IPA optimizations are not confused by bound parameters - bounds are now more like regular arguments; this makes their processing in expand easier - functions with bounds not attached to any pointer are allowed On simple codes this approach worked well, but on bigger tests some issues were revealed. 1. Node reachability. The instrumented version is actually always reachable whenever the original function is reachable, because it is always emitted instead of the original. Thus I had to fix the reachability analysis to achieve this. Another similar problem is the check of whether a node can be removed after inlining, when inlining an instrumented function. Not hard to fix, but probably other similar problems exist. 2. Function processing order. The function processing order is determined before the early local passes. But during function instrumentation the call graph is modified significantly, and the topological order used becomes outdated. That causes some trouble. E.g. a function marked as 'always inline' cannot be inlined because it is not in SSA form yet. Surely the inlining problem may be solved by just putting instrumentation after early inlining, but similar problems may exist in other passes too. To resolve this problem I tried to split the early local passes into three parts. The first one builds SSA, the second one performs instrumentation, and the last one does the rest. Each part is performed on all functions before the next one starts. Thus I get all functions in SSA form and all instrumentation performed before the early optimizations start. Unfortunately, such a pass order leads to invalid SSA because the local_pure_const optimization affects callers' correctness (in case a caller's SSA was built before the optimization revealed the 'pure' or 'const' flag). In general I feel that having a special function version for instrumentation has better potential, should lead to less intrusive changes in the compiler, and should give better code quality. 
But before continuing with this implementation I would like to get some feedback and probably some advice on how to order the passes to get the best result. Currently I am inclined to have all functions instrumented before any local optimizations and to solve the pure_const problem by modifying all callers when the attribute is computed for some function. Thanks, Ilya
[PATCH, ARM] Fix two IT issues
Hi, The patch fixes two IT issues: 1) IT block is split for A15 (and Cortex-M). For A15 tune, max cond insns (max_insns_skipped) is 2, which is set as the maximum allowed insns in an IT block (see thumb2_final_prescan_insn). So an IT block which has 3 or 4 instructions will be split. Take the first if-then-else in the test case of the patch as an example; it will generate ITs like: cmp r0, #10 ite gt subgt r0, r0, r1 suble r0, r1, r0 ite gt addgt r0, r0, #10 suble r0, r0, #7 It does not make sense to split the IT. For cortex-m, the IT cannot be folded if the previous insn is 4 BYTES. 2) For arm_v7m_tune, max cond insns is not aligned with its branch cost. In ifcvt.c, the relation between MAX_CONDITIONAL_EXECUTE (max cond insns) and branch cost is: #ifndef MAX_CONDITIONAL_EXECUTE #define MAX_CONDITIONAL_EXECUTE \ (BRANCH_COST (optimize_function_for_speed_p (cfun), false) \ + 1) #endif So the best value of max cond insns for v7m_tune should be 2. Bootstrap and no make check regression on ARM Chrome book. No make check regression for Cortex-M3. Cortex-M4 O2 performance changes on coremark, dhrystone and eembc-v1: coremark: -0.11% dhrystone: 1.26%. a2time01_lite: 2.63% canrdr01_lite: 4.27% iirflt01_lite: 6.51% rspeed01_lite: 6.51% dither01_lite: 7.36% The biggest regression in eembc-v1 is pntrch01_lite: -0.51% All other regressions < 0.1% Cortex-M4 O3 performance changes are similar to O2, except one regression due to loop alignment change. OK for trunk? Thanks! -Zhenqiang 2014-01-14 Zhenqiang Chen * config/arm/arm.c (arm_v7m_tune): Set max_insns_skipped to 2. (thumb2_final_prescan_insn): Set max to MAX_INSN_PER_IT_BLOCK. testsuite/ChangeLog: 2014-01-14 Zhenqiang Chen * gcc.target/arm/its.c: New test. diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 39d23cc..8751675 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -1696,7 +1696,7 @@ const struct tune_params arm_v7m_tune = &v7m_extra_costs, NULL, /* Sched adj cost. */ 1, /* Constant limit. 
*/ - 5, /* Max cond insns. */ + 2, /* Max cond insns. */ ARM_PREFETCH_NOT_BENEFICIAL, true,/* Prefer constant pool. */ arm_cortex_m_branch_cost, @@ -22131,11 +22131,11 @@ thumb2_final_prescan_insn (rtx insn) int mask; int max; - /* Maximum number of conditionally executed instructions in a block - is minimum of the two max values: maximum allowed in an IT block - and maximum that is beneficial according to the cost model and tune. */ - max = (max_insns_skipped < MAX_INSN_PER_IT_BLOCK) ? -max_insns_skipped : MAX_INSN_PER_IT_BLOCK; + /* max_insns_skipped in the tune was already taken into account in the + cost model of ifcvt pass when generating COND_EXEC insns. At this stage + just emit the IT blocks as we can. It does not make sense to split + the IT blocks. */ + max = MAX_INSN_PER_IT_BLOCK; /* Remove the previous insn from the count of insns to be output. */ if (arm_condexec_count) diff --git a/gcc/testsuite/gcc.target/arm/its.c b/gcc/testsuite/gcc.target/arm/its.c new file mode 100644 index 000..8eecdf4 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/its.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ +int test (int a, int b) +{ + int r; + if (a > 10) +{ + r = a - b; + r += 10; +} + else +{ + r = b - a; + r -= 7; +} + if (r > 0) +r -= 3; + return r; +} +/* { dg-final { scan-assembler-times "\tit" 2 { target arm_thumb2 } } } */
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Mon, 13 Jan 2014, Cong Hou wrote: > I noticed that LIM could not hoist vector invariants, and that is why > my first implementation tries to hoist them all. Yes, I filed PR59786 for this. I'll see if I can come up with a fix suitable for stage3. > In addition, there are two disadvantages of the hoisting invariant load + > LIM method: > > First, for some instructions the scalar version is faster than the > vector version, and in this case hoisting scalar instructions before > vectorization is better. Those instructions include data > packing/unpacking, integer multiplication with SSE2, etc.. > > Second, it may use more SIMD registers. > > The following code shows a simple example: > > char *a, *b, *c; > for (int i = 0; i < N; ++i) > a[i] = b[0] * c[0] + a[i]; > > Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0] > into a vector. Yes. I've tried adjusting the vec_def_type as in the prototype patch I sent before Christmas but that's quite intrusive for this stage. You could argue that performing invariant motion is not really the vectorizer's main task and that a combination of hoisting only the load, later LIM hoisting the rest, and then tree-vect-generic.c demoting vector ops to scalar ops (unimplemented, but also a useful general optimization) would work as well. That said, we should definitely have a second look for 4.10. For now hoisting the load is an improvement over 4.8 (at least I hope so ;)) which needs to be good enough for 4.9. I had to fix a latent bug to cure some testsuite fallout, so the following is what I ended up committing. Jakub, adding the new flag is ok with me. Thanks, Richard. 2014-01-14 Richard Biener PR tree-optimization/58921 PR tree-optimization/59006 * tree-vect-loop-manip.c (vect_loop_versioning): Remove code hoisting invariant stmts. * tree-vect-stmts.c (vectorizable_load): Insert the splat of invariant loads on the preheader edge if possible. * gcc.dg/torture/pr58921.c: New testcase. 
* gcc.dg/torture/pr59006.c: Likewise. * gcc.dg/vect/pr58508.c: XFAIL no longer handled cases. Index: gcc/tree-vect-loop-manip.c === *** gcc/tree-vect-loop-manip.c (revision 206576) --- gcc/tree-vect-loop-manip.c (working copy) *** vect_loop_versioning (loop_vec_info loop *** 2435,2507 } } - - /* Extract load statements on memrefs with zero-stride accesses. */ - - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) - { - /* In the loop body, we iterate each statement to check if it is a load. -Then we check the DR_STEP of the data reference. If DR_STEP is zero, -then we will hoist the load statement to the loop preheader. */ - - basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); - int nbbs = loop->num_nodes; - - for (int i = 0; i < nbbs; ++i) - { - for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]); - !gsi_end_p (si);) - { - gimple stmt = gsi_stmt (si); - stmt_vec_info stmt_info = vinfo_for_stmt (stmt); - struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); - - if (is_gimple_assign (stmt) - && (!dr - || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr) - { - bool hoist = true; - ssa_op_iter iter; - tree var; - - /* We hoist a statement if all SSA uses in it are defined -outside of the loop. */ - FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE) - { - gimple def = SSA_NAME_DEF_STMT (var); - if (!gimple_nop_p (def) - && flow_bb_inside_loop_p (loop, gimple_bb (def))) - { - hoist = false; - break; - } - } - - if (hoist) - { - if (dr) - gimple_set_vuse (stmt, NULL); - - gsi_remove (&si, false); - gsi_insert_on_edge_immediate (loop_preheader_edge (loop), - stmt); - - if (dump_enabled_p ()) - { - dump_printf_loc - (MSG_NOTE, vect_location, - "hoisting out of the vectorized loop: "); - dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); - dump_printf (MSG_NOTE, "\n"); - } - continue; - } - } - gsi_next (&si); - } - } - } - /* End loop-exit-fixes after versioning.
Re: [PATCH] Fix up vect/fast-math-mgrid-resid.f testcase (PR testsuite/59494)
On Mon, 13 Jan 2014, Jakub Jelinek wrote: > Hi! > > As discussed in the PR and on IRC, this testcase is very fragile, counting > additions with vect_ named SSA_NAME on lhs works only for some tunings, > for other tunings reassoc width etc. affect it and we can e.g. have > anonymous SSA_NAMEs on the lhs in the optimized dump instead. > > These alternate regexps seems to match regardless of the tunings (at least > what I've tried), starting with the corresponding fix onwards, and FAIL > before the fix. > > Regtested on x86_64-linux and i686-linux, ok for trunk? Ok. Thanks, Richard. > 2014-01-13 Jakub Jelinek > > PR testsuite/59494 > * gfortran.dg/vect/fast-math-mgrid-resid.f: Change > -fdump-tree-optimized to -fdump-tree-pcom-details in dg-options and > cleanup-tree-dump from optimized to pcom. Remove scan-tree-dump-times > for vect_\[^\\n\]*\\+, add scan-tree-dump-times for no suitable chains > and > Executing predictive commoning without unrolling. > > --- gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f.jj 2013-04-08 > 15:38:21.0 +0200 > +++ gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f2014-01-13 > 13:18:39.904315828 +0100 > @@ -1,7 +1,7 @@ > ! { dg-do compile { target i?86-*-* x86_64-*-* } } > ! { dg-require-effective-target vect_double } > ! { dg-require-effective-target sse2 } > -! { dg-options "-O3 -ffast-math -msse2 -fpredictive-commoning > -ftree-vectorize -fdump-tree-optimized" } > +! { dg-options "-O3 -ffast-math -msse2 -fpredictive-commoning > -ftree-vectorize -fdump-tree-pcom-details" } > > > *** RESID COMPUTES THE RESIDUAL: R = V - AU > @@ -39,8 +39,9 @@ C >RETURN >END > ! we want to check that predictive commoning did something on the > -! vectorized loop, which means we have to have exactly 13 vector > -! additions. > -! { dg-final { scan-tree-dump-times "vect_\[^\\n\]*\\+ " 13 "optimized" } } > +! vectorized loop. > +! 
{ dg-final { scan-tree-dump-times "Executing predictive commoning without > unrolling" 1 "pcom" { target lp64 } } } > +! { dg-final { scan-tree-dump-times "Executing predictive commoning without > unrolling" 2 "pcom" { target ia32 } } } > +! { dg-final { scan-tree-dump-times "Predictive commoning failed: no > suitable chains" 0 "pcom" } } > ! { dg-final { cleanup-tree-dump "vect" } } > -! { dg-final { cleanup-tree-dump "optimized" } } > +! { dg-final { cleanup-tree-dump "pcom" } } > > Jakub > > -- Richard Biener SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer