On Tue, Mar 15, 2016 at 03:31:30PM +0000, James Greenhalgh wrote: > On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote: > > On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix) <felix.y...@huawei.com> > > wrote: > > >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) <felix.y...@huawei.com> > > >> wrote: > > >> > Hi, > > >> > > > >> > As discussed in LKML: > > >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, > > >> the > > >> cost of changing a cache line > > >> > from shared to exclusive state can be significant on aarch64 cores, > > >> especially when this is triggered by an exclusive store, since it may > > >> > result in having to retry the transaction. > > >> > This patch makes use of the "prfm PSTL1STRM" instruction to > > >> > prefetch > > >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc > > >> atomic > > >> routines. > > >> > Bootstrapped on AArch64 server, is it OK? > > >> > > >> > > >> I don't think this is a good thing in general. For an example on > > >> ThunderX, the > > >> prefetch just adds a cycle for no benefit. This really depends on the > > >> micro-architecture of the core and how LDXR/STXR are > > >> implemented. So after this patch, it will slow down ThunderX. > > >> > > >> Thanks, > > >> Andrew Pinski > > >> > > > > > > Hi Andrew, > > > > > > I am not quite clear about the ThunderX micro-arch. But, Yes, I agree > > > it depends on the micro-architecture of the core. As the mentioned > > > kernel patch is merged upstream, I think the added prefetch instruction > > > in atomic routines is good for most of AArch64 cores in the market. If > > > it does nothing good for ThunderX, then how about adding some checking > > > here? I mean disabling the the generation of the prfm if we are tuning > > > for ThunderX. > > > > No it is not just not do any good, it actually causes worse > > performance for ThunderX. How about only doing it for the > > micro-architecture where it helps and also not do it for generic since > > it hurts ThunderX so much. > > This should be a GCC 7 patch at this point, which should give us some time > to talk through whether we want this patch or not. > > How bad is this for ThunderX - upthread you said one cycle penalty, but here > you suggest it hurts ThunderX more? Note that the prefetch is outside of > the LDXR/STXR loop.
Hi Andrew, Did you have any further thoughts on the magnitude of the penalty you would face on ThunderX? Felix, I think if this is going to be expensive for some AArch64 platforms, then the best way to progress this patch would be to add a new flag to tune_flags. Something like AARCH64_EXTRA_TUNE_PREFETCH_LDREX. This would allow targets which want the explicit prefetch to enable it. Thanks, James