On Wednesday 17 January 2018 08:31 PM, Wilco Dijkstra wrote:
> Why is that a bad thing? With the patch as is, the testcase generates:
>
> .L4:
>         ldr     q0, [x2, x3]
>         add     x5, x1, x3
>         add     x3, x3, 16
>         cmp     x3, x4
>         str     q0, [x5]
>         bne     .L4
>
> With a change in address cost (for loads and stores) we would get:
>
> .L4:
>         ldr     q0, [x3], 16
>         str     q0, [x4], 16
>         cmp     x3, x5
>         bne     .L4
>
> This looks better to me, especially if there are more loads and stores and
> some have offsets as well (the writeback is once per stream while the extra
> add happens for every store). It may be worth trying both possibilities
> on a large body of code and see which comes out smallest/fastest.

This is great for the load because of the way the Falkor prefetcher
works, but it is terrible for the store because of the way the pipeline
works. The only performant store for Falkor is an indirect store with a
constant or zero offset. Everything else has hidden costs.

> Note using the cost model as intended means the compiler tries to use the
> lowest cost possibility rather than never emitting the instruction, not even
> when optimizing for size. I think it's wrong to always block a valid
> instruction.
<snip>
> It's not clear whether it is easy to split out the costs today (it could be
> done in aarch64_rtx_costs but not aarch64_address_cost, and the latter is
> what IVOpt uses).

I briefly looked at the possibility of splitting the register_offset
cost into load and store, but I realized that I'd have to modify the
target hook for it to be useful, which is way too much work for this
single quirk.

>> Further, it seems like worthwhile work only if there are other parts
>> that actually have the same quirk and can use this split. Do you know
>> of any such cores?
>
> Currently there are several supported CPUs which use a much higher cost
> for TImode and for register offsets. So it's a common thing to want, however
> I don't know whether splitting load/store address costs helps for those.

It wouldn't. This ought to be expressed already using the
addr_scale_costs.

> I think a special case for Falkor in aarch64_address_cost would be acceptable
> in GCC8 - that would be much smaller and cleaner than the current patch.
> If required we could improve upon this in GCC9 and add a way to differentiate
> between loads and stores.

I can't do this in address_cost since I can't determine whether the
address is a load or a store location. The most minimal way seems to be
using the patterns in the md file.

Siddhesh
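
For readers following along without the patch: the testcase itself is not shown in this message, so the following is only a guess at its general shape. A plain copy loop over 16-byte blocks is the kind of source that typically produces a q-register ldr/str loop like the ones quoted above. The type block16 and the function copy_blocks are hypothetical names chosen for illustration, not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the testcase (an assumption, not from the
   patch): a 16-byte block that fits one AArch64 q register.  */
typedef struct { unsigned char bytes[16]; } block16;

/* Copy n blocks from src to dst.  Each iteration is one 16-byte load
   followed by one 16-byte store, so there is one load stream and one
   store stream, matching the two addressing choices discussed above.  */
void copy_blocks(block16 *restrict dst, const block16 *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Whether a given compiler version emits exactly the quoted sequences for this loop depends on the optimization level and tuning options, so treat it as a starting point for experiments rather than a reproduction of the testcase.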