Re: s390: Avoid CAS boolean output inefficiency

2012-08-09 Thread Eric Botcazou
> This was caused (or perhaps abetted) by the representation of EQ
> as NE ^ 1.  With the subsequent truncation and zero-extend, I
> think combine reached its insn limit of 3 before seeing everything
> it needed to see.

This can be 4 now, if you tweak the initial heuristic.

-- 
Eric Botcazou


Re: s390: Avoid CAS boolean output inefficiency

2012-08-08 Thread Ulrich Weigand
Richard Henderson wrote:
> On 08/07/2012 10:02 AM, Ulrich Weigand wrote:
> > The following patch changes the builtin expander to pass a MEM oldval
> > as-is to the back-end expander, so that the back-end can move the
> > store to before the CC operation.  With that patch I'm also seeing
> > all the IPMs disappear.
> ...
> > What do you think about this solution?  It has the advantage that
> > we still get the same xor code if we actually do need the ipm ...
> 
> I'm ok with that patch.

Thanks!  I've checked in the following version.
Tested on s390x-ibm-linux with no regressions.

Bye,
Ulrich

ChangeLog:

* builtins.c (expand_builtin_atomic_compare_exchange): Pass old
value operand as MEM to expand_atomic_compare_and_swap.

* config/s390/s390.md (atomic_compare_and_swapmode): Accept
nonimmediate_operand for old value; generate load and store if
needed.
* config/s390/s390.c (s390_expand_cs_hqi): Accept any operand
as vtarget.

Index: gcc/builtins.c
===
*** gcc/builtins.c  (revision 190226)
--- gcc/builtins.c  (working copy)
*** expand_builtin_atomic_compare_exchange (
*** 5376,5381 
--- 5376,5382 
  
expect = expand_normal (CALL_EXPR_ARG (exp, 1));
expect = convert_memory_address (Pmode, expect);
+   expect = gen_rtx_MEM (mode, expect);
desired = expand_expr_force_mode (CALL_EXPR_ARG (exp, 2), mode);
  
weak = CALL_EXPR_ARG (exp, 3);
*** expand_builtin_atomic_compare_exchange (
*** 5383,5396 
if (host_integerp (weak, 0) && tree_low_cst (weak, 0) != 0)
  is_weak = true;
  
!   oldval = copy_to_reg (gen_rtx_MEM (mode, expect));
! 
if (!expand_atomic_compare_and_swap ((target == const0_rtx ? NULL : target),
   oldval, mem, oldval, desired,
   is_weak, success, failure))
  return NULL_RTX;
  
!   emit_move_insn (gen_rtx_MEM (mode, expect), oldval);
return target;
  }
  
--- 5384,5398 
if (host_integerp (weak, 0) && tree_low_cst (weak, 0) != 0)
  is_weak = true;
  
!   oldval = expect;
if (!expand_atomic_compare_and_swap ((target == const0_rtx ? NULL : target),
   oldval, mem, oldval, desired,
   is_weak, success, failure))
  return NULL_RTX;
  
!   if (oldval != expect)
! emit_move_insn (expect, oldval);
! 
return target;
  }
  
Index: gcc/config/s390/s390.c
===
*** gcc/config/s390/s390.c  (revision 190226)
--- gcc/config/s390/s390.c  (working copy)
*** s390_expand_cs_hqi (enum machine_mode mo
*** 4825,4831 
rtx res = gen_reg_rtx (SImode);
rtx csloop = NULL, csend = NULL;
  
-   gcc_assert (register_operand (vtarget, VOIDmode));
gcc_assert (MEM_P (mem));
  
init_alignment_context (ac, mem, mode);
--- 4825,4830 
Index: gcc/config/s390/s390.md
===
*** gcc/config/s390/s390.md (revision 190226)
--- gcc/config/s390/s390.md (working copy)
***
*** 8870,8876 
  
  (define_expand "atomic_compare_and_swap<mode>"
    [(match_operand:SI 0 "register_operand")    ;; bool success output
!    (match_operand:DGPR 1 "register_operand")  ;; oldval output
     (match_operand:DGPR 2 "memory_operand")    ;; memory
     (match_operand:DGPR 3 "register_operand")  ;; expected input
     (match_operand:DGPR 4 "register_operand")  ;; newval input
--- 8870,8876 
  
  (define_expand "atomic_compare_and_swap<mode>"
    [(match_operand:SI 0 "register_operand")    ;; bool success output
!    (match_operand:DGPR 1 "nonimmediate_operand") ;; oldval output
     (match_operand:DGPR 2 "memory_operand")    ;; memory
     (match_operand:DGPR 3 "register_operand")  ;; expected input
     (match_operand:DGPR 4 "register_operand")  ;; newval input
***
*** 8879,8887 
 (match_operand:SI 7 "const_int_operand")]  ;; failure model
   ""
  {
!   rtx cc, cmp;
   emit_insn (gen_atomic_compare_and_swap<mode>_internal
!(operands[1], operands[2], operands[3], operands[4]));
cc = gen_rtx_REG (CCZ1mode, CC_REGNUM);
cmp = gen_rtx_EQ (SImode, cc, const0_rtx);
emit_insn (gen_cstorecc4 (operands[0], cmp, cc, const0_rtx));
--- 8879,8900 
 (match_operand:SI 7 "const_int_operand")]  ;; failure model
   ""
  {
!   rtx cc, cmp, output = operands[1];
! 
!   if (!register_operand (output, <MODE>mode))
!     output = gen_reg_rtx (<MODE>mode);
! 
   emit_insn (gen_atomic_compare_and_swap<mode>_internal
!(output, operands[2], operands[3], operands[4]));
! 
!   /* We deliberately accept non-register operands in the predicate
!  to ensure the write back to the output operand happens *before*
!  the store-flags code below.  This makes it easier for combine
!  to merge the store-flags code with a 

Re: s390: Avoid CAS boolean output inefficiency

2012-08-07 Thread Ulrich Weigand
Richard Henderson wrote:
> On 08/06/2012 11:34 AM, Ulrich Weigand wrote:
> > There is one particular inefficiency I have noticed.  This code:
> > 
> >   if (!__atomic_compare_exchange_n (&v, &expected, max, 0 , 0, 0))
> >     abort ();
> > 
> > from atomic-compare-exchange-3.c gets compiled into:
> > 
> >         l       %r3,0(%r2)
> >         larl    %r1,v
> >         cs      %r3,%r4,0(%r1)
> >         ipm     %r1
> >         sra     %r1,28
> >         st      %r3,0(%r2)
> >         ltr     %r1,%r1
> >         jne     .L3
> > 
> > which is extremely inefficient; it converts the condition code into
> > an integer using the slow ipm, sra sequence, just so that it can
> > convert the integer back into a condition code via ltr and branch
> > on it ...
> 
> This was caused (or perhaps abetted) by the representation of EQ
> as NE ^ 1.  With the subsequent truncation and zero-extend, I
> think combine reached its insn limit of 3 before seeing everything
> it needed to see.

Yes, combine isn't able to handle everything in one go, but it finds
an intermediate insn.  With the old __sync_compare_and_swap, it is
then able to optimize everything away in a second step; the only
reason this doesn't work any more is the intervening store.

The following patch changes the builtin expander to pass a MEM oldval
as-is to the back-end expander, so that the back-end can move the
store to before the CC operation.  With that patch I'm also seeing
all the IPMs disappear.

[ Well, all except for this one:

> This happens because CSE notices the cbranch vs 0, and sets r116
> to zero along the path to
> 
>  32   if (!__atomic_compare_exchange_n (&v, &expected, 0, STRONG,
>                                         __ATOMIC_RELEASE,
>                                         __ATOMIC_ACQUIRE))
> 
> at which point CSE decides that it would be cheaper to re-use
> the zero already in r116 instead of loading another constant 0 here.
> After that, combine is hamstrung because r116 is not dead.

As you point out, this does indeed fix this problem as well:

> A short-term possibility is to have the CAS insns accept general_operand,
> so that the 0 gets merged.
]


What do you think about this solution?  It has the advantage that
we still get the same xor code if we actually do need the ipm ...


Bye,
Ulrich


diff -ur gcc/builtins.c gcc.new/builtins.c
--- gcc/builtins.c  2012-08-07 16:04:45.054348099 +0200
+++ gcc.new/builtins.c  2012-08-07 15:44:01.304349225 +0200
@@ -5376,6 +5376,7 @@
 
   expect = expand_normal (CALL_EXPR_ARG (exp, 1));
   expect = convert_memory_address (Pmode, expect);
+  expect = gen_rtx_MEM (mode, expect);
   desired = expand_expr_force_mode (CALL_EXPR_ARG (exp, 2), mode);
 
   weak = CALL_EXPR_ARG (exp, 3);
@@ -5383,14 +5384,15 @@
   if (host_integerp (weak, 0) && tree_low_cst (weak, 0) != 0)
 is_weak = true;
 
-  oldval = copy_to_reg (gen_rtx_MEM (mode, expect));
-
+  oldval = expect;
   if (!expand_atomic_compare_and_swap ((target == const0_rtx ? NULL : target),
   is_weak, success, failure))
 return NULL_RTX;
 
-  emit_move_insn (gen_rtx_MEM (mode, expect), oldval);
+  if (oldval != expect)
+emit_move_insn (expect, oldval);
+
   return target;
 }
 
diff -ur gcc/config/s390/s390.md gcc.new/config/s390/s390.md
--- gcc/config/s390/s390.md 2012-08-07 16:04:54.204348621 +0200
+++ gcc.new/config/s390/s390.md 2012-08-07 16:00:21.934348628 +0200
@@ -8870,7 +8870,7 @@
 
 (define_expand "atomic_compare_and_swap<mode>"
   [(match_operand:SI 0 "register_operand")    ;; bool success output
-   (match_operand:DGPR 1 "register_operand")  ;; oldval output
+   (match_operand:DGPR 1 "nonimmediate_operand") ;; oldval output
    (match_operand:DGPR 2 "memory_operand")    ;; memory
    (match_operand:DGPR 3 "register_operand")  ;; expected input
    (match_operand:DGPR 4 "register_operand")  ;; newval input
@@ -8879,9 +8879,17 @@
    (match_operand:SI 7 "const_int_operand")]  ;; failure model
   ""
 {
-  rtx cc, cmp;
+  rtx cc, cmp, output = operands[1];
+
+  if (!register_operand (output, <MODE>mode))
+    output = gen_reg_rtx (<MODE>mode);
+
   emit_insn (gen_atomic_compare_and_swap<mode>_internal
-(operands[1], operands[2], operands[3], operands[4]));
+(output, operands[2], operands[3], operands[4]));
+
+  if (output != operands[1])
+emit_move_insn (operands[1], output);
+
   cc = gen_rtx_REG (CCZ1mode, CC_REGNUM);
   cmp = gen_rtx_EQ (SImode, cc, const0_rtx);
   emit_insn (gen_cstorecc4 (operands[0], cmp, cc, const0_rtx));


-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com



Re: s390: Avoid CAS boolean output inefficiency

2012-08-07 Thread Richard Henderson
On 08/07/2012 10:02 AM, Ulrich Weigand wrote:
> The following patch changes the builtin expander to pass a MEM oldval
> as-is to the back-end expander, so that the back-end can move the
> store to before the CC operation.  With that patch I'm also seeing
> all the IPMs disappear.
...
> What do you think about this solution?  It has the advantage that
> we still get the same xor code if we actually do need the ipm ...

I'm ok with that patch.


r~


s390: Avoid CAS boolean output inefficiency

2012-08-06 Thread Richard Henderson
On 08/06/2012 11:34 AM, Ulrich Weigand wrote:
> There is one particular inefficiency I have noticed.  This code:
> 
>   if (!__atomic_compare_exchange_n (&v, &expected, max, 0 , 0, 0))
>     abort ();
> 
> from atomic-compare-exchange-3.c gets compiled into:
> 
>         l       %r3,0(%r2)
>         larl    %r1,v
>         cs      %r3,%r4,0(%r1)
>         ipm     %r1
>         sra     %r1,28
>         st      %r3,0(%r2)
>         ltr     %r1,%r1
>         jne     .L3
> 
> which is extremely inefficient; it converts the condition code into
> an integer using the slow ipm, sra sequence, just so that it can
> convert the integer back into a condition code via ltr and branch
> on it ...

This was caused (or perhaps abetted) by the representation of EQ
as NE ^ 1.  With the subsequent truncation and zero-extend, I
think combine reached its insn limit of 3 before seeing everything
it needed to see.

I'm able to fix this problem by representing EQ as EQ before reload.
For extimm targets this results in identical code; for older targets
it requires avoidance of the constant pool, i.e. LHI+XR instead of X.

l   %r2,0(%r3)
larl%r1,v
cs  %r2,%r5,0(%r1)
st  %r2,0(%r3)
jne .L3

That fixed, we see the second CAS in that file:

.loc 1 27 0
cs  %r2,%r2,0(%r1)
ipm %r5
sll %r5,28
lhi %r0,1
xr  %r5,%r0
st  %r2,0(%r3)
ltr %r5,%r5
je  .L20

This happens because CSE notices the cbranch vs 0, and sets r116
to zero along the path to

 32   if (!__atomic_compare_exchange_n (&v, &expected, 0, STRONG,
                                        __ATOMIC_RELEASE, __ATOMIC_ACQUIRE))

at which point CSE decides that it would be cheaper to re-use
the zero already in r116 instead of loading another constant 0 here.
After that, combine is hamstrung because r116 is not dead.

I'm not quite sure the best way to fix this, since rtx_costs already
has all constants cost 0.  CSE ought not believe that r116 is better
than a plain constant.  CSE also shouldn't be extending the life of
pseudos this way.

A short-term possibility is to have the CAS insns accept general_operand,
so that the 0 gets merged.  With reload inheritance and post-reload cse,
that might produce code that is good enough.  Certainly it's effective
for the atomic-compare-exchange-3.c testcase.  I'm less than happy with
that, since the non-optimization of CAS depends on following code that
is totally unrelated.

This patch ought to be independent of any other patch so far.


r~
diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 0e43e51..bed6b79 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -5325,12 +5325,15 @@
           (match_operand 3 "const0_operand")]))
  (clobber (reg:CC CC_REGNUM))])]
   ""
-  "emit_insn (gen_sne (operands[0], operands[2]));
-   if (GET_CODE (operands[1]) == EQ)
-     emit_insn (gen_xorsi3 (operands[0], operands[0], const1_rtx));
-   DONE;")
+{
+  if (!TARGET_EXTIMM && GET_CODE (operands[1]) == EQ)
+{
+  emit_insn (gen_seq_neimm (operands[0], operands[2]));
+  DONE;
+}
+})
 
-(define_insn_and_split "sne"
+(define_insn_and_split "*sne"
   [(set (match_operand:SI 0 "register_operand" "=d")
	(ne:SI (match_operand:CCZ1 1 "register_operand" "0")
   (const_int 0)))
@@ -5342,6 +5345,48 @@
 [(set (match_dup 0) (ashiftrt:SI (match_dup 0) (const_int 28)))
  (clobber (reg:CC CC_REGNUM))])])
 
+(define_insn_and_split "*seq"
+  [(set (match_operand:SI 0 "register_operand" "=d")
+	(eq:SI (match_operand:CCZ1 1 "register_operand" "0")
+	       (const_int 0)))
+   (clobber (reg:CC CC_REGNUM))]
+  "TARGET_EXTIMM"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  emit_insn (gen_lshrsi3 (op0, op0, GEN_INT (28)));
+  emit_insn (gen_xorsi3 (op0, op0, const1_rtx));
+  DONE;
+})
+
+;; ??? Ideally we'd USE a const1_rtx, properly reloaded, but that makes
+;; things more difficult for combine (which can only insert clobbers).
+;; But perhaps it would be better still to have simply used a branch around
+;; constant load instead of beginning with the IPM?
+;;
+;; What about LOCR for Z196?  That's a more general question about cstore
+;; being decomposed into movcc...
+
+(define_insn_and_split "seq_neimm"
+  [(set (match_operand:SI 0 "register_operand" "=d")
+	(eq:SI (match_operand:CCZ1 1 "register_operand" "0")
+	       (const_int 0)))
+   (clobber (match_scratch:SI 2 "=d"))
+   (clobber (reg:CC CC_REGNUM))]
+  "!TARGET_EXTIMM"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  rtx op2 = operands[2];
+  emit_insn (gen_ashlsi3 (op0, op0, GEN_INT (28)));
+  emit_move_insn (op2, const1_rtx);
+  emit_insn (gen_xorsi3 (op0, op0, op2));
+  DONE;
+})
 
 ;;
 ;; - Conditional move instructions (introduced with z196)