On 20/06/17 22:39, Jeff Law wrote:
> On 06/20/2017 03:27 AM, Richard Earnshaw (lists) wrote:
>> On 19/06/17 18:07, Jeff Law wrote:
>>> As some of you are likely aware, Qualys has just published fairly
>>> detailed information on using stack/heap clashes as an attack vector.
>>> Eric B, Michael M -- sorry I couldn't say more when I contact you about
>>> -fstack-check and some PPC specific stuff.  This has been under embargo
>>> for the last month.
>>>
>>>
>>> --
>>>
>>>
>>> http://www.openwall.com/lists/oss-security/2017/06/19/1
>>>
>> [...]
>>> aarch64 is significantly worse.  There are no implicit probes we can
>>> exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
>>> So we have the track the distance to the most recent probe and when that
>>> distance grows too large, we have to emit a probe.  Of course we have to
>>> make worst case assumptions at function entry.
>>>
>>
>> I'm not sure I understand what you're saying here.  According to the
>> comment above aarch64_expand_prologue, the stack frame looks like:
>>
>> +-------------------------------+
>> |                               |
>> |  incoming stack arguments     |
>> |                               |
>> +-------------------------------+
>> |                               | <-- incoming stack pointer (aligned)
>> |  callee-allocated save area   |
>> |  for register varargs         |
>> |                               |
>> +-------------------------------+
>> |  local variables              | <-- frame_pointer_rtx
>> |                               |
>> +-------------------------------+
>> |  padding0                     | \
>> +-------------------------------+  |
>> |  callee-saved registers       |  | frame.saved_regs_size
>> +-------------------------------+  |
>> |  LR'                          |  |
>> +-------------------------------+  |
>> |  FP'                          | / <- hard_frame_pointer_rtx (aligned)
>> +-------------------------------+
>> |  dynamic allocation           |
>> +-------------------------------+
>> |  padding                      |
>> +-------------------------------+
>> |  outgoing stack arguments     | <-- arg_pointer
>> |                               |
>> +-------------------------------+
>> |                               | <-- stack_pointer_rtx (aligned)
>>
>> Now for the majority of frames the amount of local variables is small
>> and there is neither dynamic allocation nor the need for outgoing local
>> variables.  In this case the first instruction in the function is
>>
>>      stp     fp, lr, [sp, #-FrameSize
> But the stack pointer might have already been advanced into the guard
> page by the caller.   For the sake of argument assume the guard page is
> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
> the caller hasn't touched the 0xf1000 page.

Then make sure the caller does touch the 0xf1000 page.  If it's
allocated that much stack it should be forced to do the probe and not
rely on all it's children having to do it because it can't be bothered.


> 
> If FrameSize >= 32, then the stores are going to hit the 0xf0000 page
> rather than the 0xf1000 page.   That's jumping the guard.  Thus we have
> to emit a probe prior to this stack allocation.
> 
> Now because this instruction stores at *new_sp, it does allow us to
> eliminate future probes and I do take advantage of that in my code.
> 
> The implementation is actually rather simple.  We keep a conservative
> estimate of the offset of the last known probe relative to the stack
> pointer.  At entry we have to assume the offset is:
> 
> PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT)
> 
> 
> A stack allocation increases the offset.  A store into the stack
> decreases the offset.
>  i
> A probe is required before an allocation that increases the offset to >=
> PROBE_INTERVAL.
> 
> An allocation + store instruction such as shown does both, but can (and
> is) easily modeled.  THe only tricky case here is that you can't naively
> break it up into an allocation and store as that can force an
> unnecessary probe (say if the allocated space is just enough to hold the
> stored objects).
> 

It's better to pay the (relatively small) price when doing large
allocations than to repeatedly pay the price on every small allocation.

> 
> 
> 
>>
>>
>> If the locals area gets slightly larger (>= 512 bytes) then the sequence
>> becomes
>>      sub     sp, sp, #FrameSize
>>      stp     fp, lr, [sp]
>>
>> But again this acts as a sufficient implicit probe provided that
>> FrameSize does not exceed the probe interval.
> And again, the store acts as a probe which can eliminate potential
> probes that might occur later in the instruction stream.  But if the
> allocation by the "sub" instruction causes our running offset to cross
> PROBE_BOUNDARY, then we must emit a probe prior to the "sub" instruction.
> 
> Hopefully it'll be clearer when I post the code :-)  aarch64 is one that
> will need updating as all work to-date has been with Red Hat's 4.8
> compiler with the aarch64 code generator bolted onto the side.
> 
> So perhaps "no implicit probes" was too strong.  It would probably be
> better stated "no implicit probes in the caller".  We certainly use
> stores in the prologue to try and eliminate probes.  In fact, we try
> harder on aarch64 than any other target.
> 
> Jeff
> 

Reply via email to