Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

James Greenhalgh Fri, 27 May 2016 04:29:48 -0700

On Tue, Mar 15, 2016 at 03:31:30PM +0000, James Greenhalgh wrote:
> On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote:
> > On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix) <felix.y...@huawei.com> 
> > wrote:
> > >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) <felix.y...@huawei.com> 
> > >> wrote:
> > >> > Hi,
> > >> >
> > >> >     As discussed in LKML:
> > >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html,
> > >>  the
> > >> cost of changing a cache line
> > >> >     from shared to exclusive state can be significant on aarch64 cores,
> > >> especially when this is triggered by an exclusive store, since it may
> > >> >     result in having to retry the transaction.
> > >> >     This patch makes use of the "prfm PSTL1STRM" instruction to 
> > >> > prefetch
> > >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc 
> > >> atomic
> > >> routines.
> > >> >     Bootstrapped on AArch64 server, is it OK?
> > >>
> > >>
> > >> I don't think this is a good thing in general.  For an example on 
> > >> ThunderX, the
> > >> prefetch just adds a cycle for no benefit.  This really depends on the
> > >> micro-architecture of the core and how LDXR/STXR are
> > >> implemented.   So after this patch, it will slow down ThunderX.
> > >>
> > >> Thanks,
> > >> Andrew Pinski
> > >>
> > >
> > > Hi Andrew,
> > >
> > >    I am not quite clear about the ThunderX micro-arch.  But, Yes, I agree
> > >    it depends on the micro-architecture of the core.  As the mentioned
> > >    kernel patch is merged upstream, I think the added prefetch instruction
> > >    in atomic routines is good for most of AArch64 cores in the market.  If
> > >    it does nothing good for ThunderX, then how about adding some checking
> > >    here?  I mean disabling the the generation of the prfm if we are tuning
> > >    for ThunderX.
> > 
> > No it is not just not do any good, it actually causes worse
> > performance for ThunderX.  How about only doing it for the
> > micro-architecture where it helps and also not do it for generic since
> > it hurts ThunderX so much.
> 
> This should be a GCC 7 patch at this point, which should give us some time
> to talk through whether we want this patch or not.
> 
> How bad is this for ThunderX - upthread you said one cycle penalty, but here
> you suggest it hurts ThunderX more? Note that the prefetch is outside of
> the LDXR/STXR loop.


Hi Andrew,

Did you have any further thoughts on the magnitude of the penalty you
would face on ThunderX? 

Felix, I think if this is going to be expensive for some AArch64 platforms,
then the best way to progress this patch would be to add a new flag to
tune_flags. Something like AARCH64_EXTRA_TUNE_PREFETCH_LDREX. This would
allow targets which want the explicit prefetch to enable it.

Thanks,
James

Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

Reply via email to