Hi!

> > On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > > * Thomas Gleixner <t...@linutronix.de> wrote:
> > > 
> > > > > So I do think we could do more in this area to improve driver 
> > > > > performance, if the 
> > > > > code is correct and if there's actual benchmarks that are showing 
> > > > > real benefits.
> > > > 
> > > > If it's about hotpath performance I'm all for it, but the use case here 
> > > > is
> > > > a debug facility...
> > > > 
> > > > And if we go down that road then we want a AVX based memcpy()
> > > > implementation which is runtime conditional on the feature bit(s) and
> > > > length dependent. Just slapping a readqq() at it and use it in a loop 
> > > > does
> > > > not make any sense.
> > > 
> > > Yeah, so generic memcpy() replacement is only feasible I think if the 
> > > most 
> > > optimistic implementation is actually correct:
> > > 
> > >  - if no preempt disable()/enable() is required
> > > 
> > >  - if direct access to the AVX[2] registers does not disturb legacy FPU 
> > > state in 
> > >    any fashion
> > > 
> > >  - if direct access to the AVX[2] registers cannot raise weird exceptions 
> > > or have
> > >    weird behavior if the FPU control word is modified to non-standard 
> > > values by 
> > >    untrusted user-space
> > > 
> > > If we have to touch the FPU tag or control words then it's probably only 
> > > good for 
> > > a specialized API.
> > 
> > I did not mean to have a general memcpy replacement. Rather something like
> > magic_memcpy() which falls back to memcpy when AVX is not usable or the
> > length does not justify the AVX stuff at all.
> 
> OK, fair enough.
> 
> Note that a generic version might still be worth trying out, if and only if 
> it's 
> safe to access those vector registers directly: modern x86 CPUs will do their 
> non-constant memcpy()s via the common memcpy_erms() function - which could in 
> theory be an easy common point to be (cpufeatures-) patched to an AVX2 
> variant, if 
> size (and alignment, perhaps) is a multiple of 32 bytes or so.

How is AVX2 supposed to help the memcpy speed?

If the copy is small, constant overhead will dominate, and I don't
think AVX2 is going to be win there.

If the copy is big, well, the copy loop will likely run out of L1 and
maybe even out of L2, and at that point speed of the loop does not
matter because memory is slow...?

Best regards,
                                                                        Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Attachment: signature.asc
Description: Digital signature

Reply via email to