Hi all,

In order to reference constant pool objects on s390, we set up a constant
pool pointer such that multiple objects can be addressed relative to
this pointer.  Register r13 is allocated for this pointer; it is set up
in the function prologue and stays live across the whole function--
effectively behaving like a fixed register, although it is not one.

I'm currently experimenting with a new implementation where the address
for each constant pool access is computed right before its use.  This
means a GPR like r13 is not blocked throughout the whole function but
only allocated locally wherever a constant pool object is referenced.
In certain cases multiple constant pool objects are accessed in quick
succession, which results in unnecessary address computations, e.g.:

int f (double x)
{
  return x == 123. || x == 321.;
}

is compiled with my experimental patch into

f:
.LFB0:
        .cfi_startproc
        larl    %r1,.LC0      <-- load address for 321.
        cdb     %f0,0(%r1)
        larl    %r2,.LC1      <-- load address for 123.
        lhi     %r0,0
        lochie  %r0,1
        cdb     %f0,0(%r2)
        lghi    %r2,0
        locghie %r2,1
        rosbg   %r2,%r0,63,63,0
        br      %r14
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .section        .rodata.cst8,"aM",@progbits,8
        .align  8
.LC0:
        .long   1081348096
        .long   0
        .align  8
.LC1:
        .long   1079951360
        .long   0

Ideally we would emit a single LARL (LOAD ADDRESS RELATIVE LONG)
instruction and make use of different offsets (displacements in s390
terminology) in the subsequent CDB (COMPARE) instructions.  This would
be pretty similar to section anchors, which makes me wonder whether
something similar already exists for constant pool entries, too?
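For comparison, this is what section anchors already do for global
variables: with -fsection-anchors GCC places objects in a block behind a
single .LANCHOR label and addresses each of them as anchor plus offset.
A tiny illustration (the variable and function names are made up by me):

```c
/* With -fsection-anchors, targets that enable it (e.g. AArch64 at -O2)
   materialize one anchor address and reach both variables via
   displacements from it, roughly:
       adrp  x0, .LANCHOR0
       ldr   w1, [x0, #:lo12:.LANCHOR0]        // a at offset 0
       ldr   w0, [x0, #:lo12:.LANCHOR0 + 4]    // b at offset 4
   instead of computing two independent addresses.  */
int a = 1;
int b = 2;

int sum(void)
{
    return a + b;
}
```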

Conceptually, I guess one of the biggest differences between section
anchors and constant pool anchors is that constants might be pushed into
the constant pool pretty late, during RA, whereas for the former this
already happens during expand, where addresses are computed via
use_anchored_address() and where it is trivial to load anchors into
registers multiple times, since the redundant loads get folded by later
passes and so on.  I came up with an experimental patch for LRA where
constant pool anchors are emitted.  For the example from above I get:

f:
.LFB0:
        .cfi_startproc
        larl    %r1,.LANCHOR0
        cdb     %f0,0(%r1)         <-- offset 0
        larl    %r2,.LANCHOR0
        lhi     %r1,0              <-- clobbers r1
        lochie  %r1,1
        cdb     %f0,8(%r2)         <-- offset 8
        lghi    %r2,0
        locghie %r2,1
        rosbg   %r2,%r1,63,63,0
        br      %r14
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .section        .rodata.cst8,"aM",@progbits,8
        .align  8
        .set    .LANCHOR0,. + 0
.LC0:
        .long   1081348096
        .long   0
.LC1:
        .long   1079951360
        .long   0

This is still not optimal since LRA used r1 for the first anchor load,
and r1 is clobbered before the second CDB instruction, which means we
cannot remove the second "redundant" LARL.  If LRA had used e.g. r3 for
the first LARL, then postreload and late_combine would have done their
magic and an optimal version would have been emitted in the end:

f:
.LFB0:
        .cfi_startproc
        larl    %r3,.LANCHOR0      <-- single LARL
        cdb     %f0,0(%r3)         <-- offset 0
        lhi     %r1,0
        lochie  %r1,1
        cdb     %f0,8(%r3)         <-- offset 8
        lghi    %r2,0
        locghie %r2,1
        rosbg   %r2,%r1,63,63,0
        br      %r14
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .section        .rodata.cst8,"aM",@progbits,8
        .align  8
        .set    .LANCHOR0,. + 0
.LC0:
        .long   1081348096
        .long   0
.LC1:
        .long   1079951360
        .long   0

You could argue that the conflict with r1 might have been just bad luck
and is not a big deal in practice (I haven't done extensive testing so
far).  However, this also reveals that the problem is not easily solved
during RA (I never thought I would say something like that in my life,
but here we are).  What immediately comes to my mind is loop-invariant
code motion of address computations, which cannot happen in an optimal
way after RA anymore.
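A concrete case where this matters, assuming per-use address
computation: in a loop like the following (a made-up example of mine)
the pool address of the constant should be loaded once outside the loop,
which pre-RA LICM could arrange but which is hard to guarantee once the
address computation only materializes during RA:

```c
/* On s390, 2.5 ends up in the constant pool; with per-use address
   computation each iteration would recompute the pool address via
   LARL unless the computation is hoisted out of the loop.  */
double scale_sum(const double *v, int n)
{
    double s = 0.;
    for (int i = 0; i < n; ++i)
        s += v[i] * 2.5;
    return s;
}
```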

Another idea would be to push constants into the constant pool as early
as possible, at least where it is clear that they will end up there
anyway, as e.g. FP constants on s390.  However, first of all the
implementation is not straightforward (naively rejecting constants via
TARGET_LEGITIMATE_CONSTANT_P results in those constants being pushed
into the literal pool and then loaded into registers instead of being
used as MEMs, which affects alternative selection negatively), and
second of all it might influence subsequent RTL optimizations, since
those would need to look through MEMs/REGs instead of dealing with
constants directly.  Especially the consequences of the latter are hard
to predict for me.

Long story short: before going further down the rabbit hole I wanted to
make sure that this hasn't been solved already.  It would be great to
hear how other targets solved this.  Any comments are highly
appreciated.

Cheers,
Stefan

PS: On s390 we only have a few instructions which accept PC-relative
operands, which means that address loading is somewhat important.
