Hi,

To continue the review of the AArch64 frame code I tried a few examples
to figure out what it does now. For initial_adjust <= 63*1024 and final_adjust <
1024 there are no probes inserted as expected, ie. the vast majority of
functions are unaffected. So that works perfectly.

For larger frames the first oddity is that there are now 2 separate params
controlling how probes are generated:

stack-clash-protection-guard-size (default 12, but set to 16 on AArch64)
stack-clash-protection-probe-interval (default 12)

I don't see how this makes sense. These values are closely related, so if
one is different from the other, probing becomes ineffective/incorrect. 
For example we generate code that trivially bypasses the guard despite
all the probing:

--param=stack-clash-protection-probe-interval=13
--param=stack-clash-protection-guard-size=12

So if there is a good reason to continue with 2 separate values, we must
force probe interval <= guard size!

Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
crashes due to the offsets used in the probes - we don't need large offsets
as we want to probe close to the bottom of the stack.

Functions with a large stack emit like alloca a lot of code, here I used
--param=stack-clash-protection-probe-interval=15:

int f1(int x)
{
  char arr[128*1024];
  return arr[x];
}

f1:
        mov     x16, 64512
        sub     sp, sp, x16
        .cfi_def_cfa_offset 64512
        mov     x16, -32768
        add     sp, sp, x16
        .cfi_def_cfa_offset -1024
        str     xzr, [sp, 32760]
        add     sp, sp, x16
        .cfi_def_cfa_offset -66560
        str     xzr, [sp, 32760]
        sub     sp, sp, #1024
        .cfi_def_cfa_offset -65536
        str     xzr, [sp, 1016]
        ldrb    w0, [sp, w0, sxtw]
        .cfi_def_cfa_offset 131072
        add     sp, sp, 131072
        .cfi_def_cfa_offset 0
        ret

Note the cfa offsets are wrong.

There is an odd mix of a big initial adjustment, then some probes+adjustments 
and
then a final adjustment and probe for the remainder. I can't see the point of 
having
both an initial and remainder adjustment. I would expect this:

        sub     sp, sp, 65536
        str     xzr, [sp, 1024]
        sub     sp, sp, 65536
        str     xzr, [sp, 1024]
        ldrb    w0, [sp, w0, sxtw]
        add     sp, sp, 131072
        ret


int f2(int x)
{
  char arr[128*1024];
  return arr[x];
}

f2:
        mov     x16, 64512
        sub     sp, sp, x16
        mov     x16, -65536
        movk    x16, 0xfffd, lsl 16
        add     x16, sp, x16
.LPSRL0:
        sub     sp, sp, 4096
        str     xzr, [sp, 4088]
        cmp     sp, x16
        b.ne    .LPSRL0
        sub     sp, sp, #1024
        str     xzr, [sp, 1016]
        ldrb    w0, [sp, w0, sxtw]
        add     sp, sp, 262144
        ret

The cfa entries are OK for this case. There is a mix of positive/negative 
offsets which
makes things confusing. Again there are 3 kinds of adjustments when for this 
size we
only need the loop.

Reusing the existing gen_probe_stack_range code appears a bad idea since
it ignores the probe interval and just defaults to 4KB. I don't see why it 
should be
any more complex than this:

        sub     x16, sp, 262144  // only need temporary if > 1MB
.LPSRL0:
        sub     sp, sp, 65536
        str     xzr, [sp, 1024]
        cmp     sp, x16
        b.ne    .LPSRL0
        ldrb    w0, [sp, w0, sxtw]
        add     sp, sp, 262144
        ret

Probe insertion if final adjustment >= 1024 also generates a lot of redundant
code - although this is more a theoretical issue given this is so rare.

Wilco

Reply via email to