The atomic_load/storedi_1 patterns are fixed to use LM, STM.

I've had a go at generating better code in the HQImode CAS
loop for aligned memory, but I don't know that I'd call it
the most efficient thing ever.  Some of this is due to 
deficiencies in other parts of the compiler (including the
s390 backend):

  (1) MEM_ALIGN can't pass down full align+ofs data that we had
      during cfgexpand.  This means the opportunities for using
      the "aligned" path are less than they ought to be.
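
      As an illustration (assuming I have the failure mode right),
      given

        struct T { int a; unsigned short b; unsigned short c; } g_t;

      a halfword CAS on g_t.c reaches the backend with only
      MEM_ALIGN == 16.  At cfgexpand time we knew base alignment 4
      plus constant offset 6, which pins down the containing word and
      the halfword's position within it at compile time; by the time
      the backend looks, that data is gone.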

  (2) In get_pointer_alignment (used by get_builtin_sync_mem),
      we don't consider an ADDR_EXPR to return the full alignment
      that the type is due.  I'm sure this is to work around some
      other sort of usage via the <string.h> builtins, but it's
      less-than-handy in this case.

      I wonder if in get_builtin_sync_mem we ought to be using
      get_object_alignment (build_fold_indirect_ref (addr)) instead?

      Consider

        struct S { int x; unsigned short y; } g_s;
        unsigned short o, n;
        void good() {
          __atomic_compare_exchange (&g_s.y, &o, n, 0, 0, 0);
        }
        void bad(struct S *p_s) {
          __atomic_compare_exchange (&p_s->y, &o, n, 0, 0, 0);
        }

      where GOOD produces the aligned MEM that we need, and BAD doesn't.
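
      If that suggestion holds up, the change would be something like
      the following (completely untested; "addr" and "mem" just name
      the address tree and the MEM that get_builtin_sync_mem already
      has in hand):

        /* Sketch only: take the alignment from the pointed-to object
           rather than from what we can prove about the pointer value.  */
        unsigned int align
          = get_object_alignment (build_fold_indirect_ref (addr));
        set_mem_align (mem, align);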

  (3) Support for IC and ICM via the insv pattern is lacking.
      I've added a tiny bit of support here, in the form of using
      the existing strict_low_part patterns, but most definitely we
      could do better.
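
      A trivial illustration of the sort of thing I mean (not from the
      patches):

        void set_low (unsigned int *w, const unsigned short *h)
        {
          *w = (*w & 0xffff0000u) | *h;
        }

      which wants to be load + ICM (mask 3) + store, with the AND and
      the separate halfword load folded into the ICM.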

  (4) The *sethighpartsi and *sethighpartdi_64 patterns ought to be
      more different.  As is, we can't insert into bits 48-56 of a
      DImode quantity, because we don't generate ICM for DImode,
      only ICMH.

  (5) Missing support for RISBGZ in the form of an extv/z expander.
      The existing *extv/z splitters probably ought to be conditionalized
      on !Z10.
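
      The kind of extraction I have in mind (illustration only):

        unsigned long extract_field (unsigned long x)
        {
          return (x >> 10) & 0xff;
        }

      which on z10 ought to be a single RISBG with the zero bit set,
      rather than a shift plus a mask.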

  (6) The strict_low_part patterns should allow registers for at
      least Z10.  The SImode strict_low_part can use LR everywhere.

  (7) RISBGZ could be used for a 3-address constant lshrsi3 before
      srlk is available.
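
      I.e. (illustration only) something like

        unsigned int shift_live (unsigned int x, unsigned int *p)
        {
          *p = x >> 5;
          return x;
        }

      where x stays live across the shift, so the 2-address SRL costs
      an extra LR, while RISBGZ could put the shifted value directly
      into the destination register.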

With this patch set, the GOOD function above compiles at -O3 -march=z10 to:

        larl    %r3,g_s+4
        lhrl    %r0,o
        lhi     %r2,1
        l       %r1,0(%r3)
        nilh    %r1,0
.L2:
        lr      %r5,%r1
        larl    %r12,n
        lr      %r4,%r1
        risbg   %r4,%r0,32,47,16
        icm     %r5,3,0(%r12)
        cs      %r4,%r5,0(%r3)
        je      .L3
        lr      %r5,%r4
        nilh    %r5,0
        cr      %r5,%r1
        lr      %r1,%r5
        jne     .L2
        lhi     %r2,0
.L3:
        srl     %r4,16
        sthrl   %r4,o

Odd things:

   * O is forced into a register before reaching the expander, so we
     get the RISBG for that.  N is left in memory, so we commit to
     using ICM for it; and because of how strict_low_part is
     implemented, we're then committed to leaving it in memory.

   * We don't hoist the LARL of N out of the loop.

   * Given that we're having to zap the mask in %r1 for the second
     compare anyway, I wonder if RISBG is really beneficial over OR.
     Is RISBG (or ICM for that matter) any faster (or even smaller)?


r~


Richard Henderson (2):
  s390: Reorg s390_expand_insv
  s390: Convert from sync to atomic optabs

 gcc/config/s390/s390-protos.h |    3 +-
 gcc/config/s390/s390.c        |  270 ++++++++++++++++++----------
 gcc/config/s390/s390.md       |  401 +++++++++++++++++++++++++++++------------
 3 files changed, 465 insertions(+), 209 deletions(-)

-- 
1.7.7.6
