Re: [PING][PATCH, AArch64] Disable reg offset in quad-word store for Falkor

Siddhesh Poyarekar Wed, 17 Jan 2018 07:44:10 -0800

On Wednesday 17 January 2018 08:31 PM, Wilco Dijkstra wrote:
> Why is that a bad thing? With the patch as is, the testcase generates:
> 
> .L4:
>       ldr     q0, [x2, x3]
>       add     x5, x1, x3
>       add     x3, x3, 16
>       cmp     x3, x4
>       str     q0, [x5]
>       bne     .L4
> 
> With a change in address cost (for loads and stores) we would get:
> 
> .L4:
>       ldr     q0, [x3], 16
>       str     q0, [x4], 16
>       cmp     x3, x5
>       bne     .L4
> 
> This looks better to me, especially if there are more loads and stores and
> some have offsets as well (the writeback is once per stream while the extra
> add happens for every store). It may be worth trying both possibilities
> on a large body of code and see which comes out smallest/fastest.


This is great for the load because of the way the Falkor prefetcher
works, but it is terrible for the store because of the way the pipeline
works.  The only performant store for Falkor is an indirect store with a
constant or zero offset.  Everything else has hidden costs.
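
For reference, a loop of roughly the following shape is the kind of code
under discussion.  This is only an illustrative sketch, not the testcase
from the patch; the names quad_t and copy_quads are made up here, and
which addressing modes the compiler actually picks depends on the
tuning and optimization level.

```c
#include <stddef.h>
#include <stdint.h>

/* One 16-byte (quad-word) element; on AArch64 it fits a single q register.  */
typedef struct { uint64_t lo, hi; } quad_t;

/* Copy n quad-words from src to dst.  Depending on the address costs,
   the compiler may index both streams with a shared register offset,
   or give each stream its own pointer with writeback.  */
void copy_quads(quad_t *restrict dst, const quad_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```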

> Note using the cost model as intended means the compiler tries to use the
> lowest cost possibility rather than never emitting the instruction, not even
> when optimizing for size. I think it's wrong to always block a valid 
> instruction.
<snip>
> It's not clear whether it is easy to split out the costs today (it could be
> done in aarch64_rtx_costs but not aarch64_address_cost, and the latter is what
> IVOpt uses).

I briefly looked at the possibility of splitting the register_offset
cost into load and store, but I realized that I'd have to modify the
target hook for it to be useful, which is way too much work for this
single quirk.

>> Further, it seems like worthwhile work only if there are other parts
>> that actually have the same quirk and can use this split.  Do you know
>> of any such cores?
> 
> Currently there are several supported CPUs which use a much higher cost
> for TImode and for register offsets. So it's a common thing to want, however
> I don't know whether splitting load/store address costs helps for those.

It wouldn't.  This ought to be expressed already using the addr_scale_costs.

> I think a special case for Falkor in aarch64_address_cost would be acceptable
> in GCC8 - that would be much smaller and cleaner than the current patch. 
> If required we could improve upon this in GCC9 and add a way to differentiate
> between loads and stores.

I can't do this in address_cost since I can't determine whether the
address is a load or a store location.  The most minimal way seems to be
using the patterns in the md file.
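
To make the distinction concrete, here is a toy model in C.  It is not
GCC code: every name and cost value below is hypothetical, and it only
mirrors the subject of the patch (register offset in a quad-word store).
The first function stands in for an address-cost query that sees just
the address shape and access size; the second stands in for a
per-instruction check that also knows the direction of the access.

```c
#include <stdbool.h>

/* Hypothetical address shapes, named after the forms in this thread.  */
enum toy_addr { TOY_BASE_IMM, TOY_REG_OFFSET, TOY_POST_INDEX };

/* Toy address-cost query: it is handed only the address shape and the
   access size, so it must return the same cost for a load and a store.
   The cost values are made up.  */
int toy_address_cost(enum toy_addr addr, int size_bytes)
{
    (void) size_bytes;
    return addr == TOY_REG_OFFSET ? 1 : 0;
}

/* Toy per-instruction check: this one is told the direction, so it can
   reject only the quad-word store with a register offset.  */
bool toy_addr_allowed(enum toy_addr addr, int size_bytes, bool is_store)
{
    return !(is_store && size_bytes == 16 && addr == TOY_REG_OFFSET);
}
```

With only the first kind of query, penalizing the register offset would
hit the load as well, which is not what we want here.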

Siddhesh
