The atomic_loaddi_1 and atomic_storedi_1 patterns are fixed to use LM and STM.  I've had a go at generating better code in the HQImode CAS loop for aligned memory, but I wouldn't call it the most efficient code ever.  Some of this is due to deficiencies in other parts of the compiler (including the s390 backend):
(1) MEM_ALIGN can't pass down the full align+ofs data that we had during
    cfgexpand, so the opportunities for using the "aligned" path are
    fewer than they ought to be.

(2) In get_pointer_alignment (used by get_builtin_sync_mem), we don't
    consider an ADDR_EXPR to return the full alignment that the type is
    due.  I'm sure this is to work around some other sort of usage via
    the <string.h> builtins, but it's less than handy in this case.  I
    wonder if in get_builtin_sync_mem we ought to be using

	get_object_alignment (build_fold_indirect_ref (addr))

    instead?  Consider

	struct S { int x; unsigned short y; } g_s;
	unsigned short o, n;

	void good (void)
	{
	  __atomic_compare_exchange_n (&g_s.y, &o, n, 0, 0, 0);
	}

	void bad (struct S *p_s)
	{
	  __atomic_compare_exchange_n (&p_s->y, &o, n, 0, 0, 0);
	}

    where GOOD produces the aligned MEM that we need, and BAD doesn't.

(3) Support for IC, and ICM via the insv pattern, is lacking.  I've
    added a tiny bit of support here, in the form of using the existing
    strict_low_part patterns, but we could most definitely do better.

(4) The *sethighpartsi and *sethighpartdi_64 patterns ought to be more
    different.  As is, we can't insert into bits 48-56 of a DImode
    quantity, because we don't generate ICM for DImode, only ICMH.

(5) Missing support for RISBGZ in the form of an extv/z expander.  The
    existing *extv/z splitters probably ought to be conditionalized on
    !Z10.

(6) The strict_low_part patterns should allow registers for at least
    Z10.  The SImode strict_low_part can use LR everywhere.

(7) RISBGZ could be used for a 3-address constant lshrsi3 before SRLK
    is available.
For the GOOD function above, and this patch set, at -O3 -march=z10:

	larl	%r3,s+4
	lhrl	%r0,o
	lhi	%r2,1
	l	%r1,0(%r3)
	nilh	%r1,0
.L2:
	lr	%r5,%r1
	larl	%r12,n
	lr	%r4,%r1
	risbg	%r4,%r0,32,47,16
	icm	%r5,3,0(%r12)
	cs	%r4,%r5,0(%r3)
	je	.L3
	lr	%r5,%r4
	nilh	%r5,0
	cr	%r5,%r1
	lr	%r1,%r5
	jne	.L2
	lhi	%r2,0
.L3:
	srl	%r4,16
	sthrl	%r4,o

Odd things:

* O is forced into a register before reaching the expander, so we get
  the RISBG for it.  N is left in memory, so we commit to using ICM for
  it.  Further, because of how strict_low_part is implemented, we're
  committed to leaving it in memory.

* We don't optimize the loop to hoist the LARL of N outside the loop.

* Given that we're having to zap the mask in %r1 for the second compare
  anyway, I wonder whether RISBG is really beneficial over OR.  Is RISBG
  (or ICM, for that matter) any faster, or even smaller?

r~

Richard Henderson (2):
  s390: Reorg s390_expand_insv
  s390: Convert from sync to atomic optabs

 gcc/config/s390/s390-protos.h |    3 +-
 gcc/config/s390/s390.c        |  270 ++++++++++++++++++----------
 gcc/config/s390/s390.md       |  401 +++++++++++++++++++++++++++++------------
 3 files changed, 465 insertions(+), 209 deletions(-)

-- 
1.7.7.6