Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Pavel Machek wrote:

> > > > Yeah, so generic memcpy() replacement is only feasible I think if the
> > > > most optimistic implementation is actually correct:
> > > >
> > > >  - if no preempt disable()/enable() is required
> > > >
> > > >  - if direct access to the AVX[2] registers does not disturb legacy
> > > >    FPU state in any fashion
> > > >
> > > >  - if direct access to the AVX[2] registers cannot raise weird
> > > >    exceptions or have weird behavior if the FPU control word is
> > > >    modified to non-standard values by untrusted user-space
> > > >
> > > > If we have to touch the FPU tag or control words then it's probably
> > > > only good for a specialized API.
> > >
> > > I did not mean to have a general memcpy replacement. Rather something
> > > like magic_memcpy() which falls back to memcpy when AVX is not usable
> > > or the length does not justify the AVX stuff at all.
> >
> > OK, fair enough.
> >
> > Note that a generic version might still be worth trying out, if and only
> > if it's safe to access those vector registers directly: modern x86 CPUs
> > will do their non-constant memcpy()s via the common memcpy_erms()
> > function - which could in theory be an easy common point to be
> > (cpufeatures-) patched to an AVX2 variant, if size (and alignment,
> > perhaps) is a multiple of 32 bytes or so.
>
> How is AVX2 supposed to help the memcpy speed?
>
> If the copy is small, constant overhead will dominate, and I don't
> think AVX2 is going to be a win there.

There are several advantages:

1) "REP; MOVS" (also called ERMS) has a significant constant "setup cost".
   In the scheme I suggested (if it's possible), single-register AVX2 access
   on the other hand has a setup cost on the order of a few cycles.

2) AVX2 has various non-temporal load and store behavioral variants - while
   "REP; MOVS" doesn't (or rather, any such caching optimizations, to the
   extent they exist, are hidden in the microcode).
> If the copy is big, well, the copy loop will likely run out of L1 and maybe
> even out of L2, and at that point speed of the loop does not matter because
> memory is slow...?

In many cases "memory" will be something very fast, such as another level of
cache. Also, on NUMA "memory" can also be something locally wired to the
CPU - again accessible at ridiculous bandwidths.

Nevertheless ERMS probably wins for the regular bulk memcpy by a few
percentage points, so I don't think AVX2 is a win in the generic large-memcpy
case, as long as continued caching of both the loads and the stores is
beneficial.

Thanks,

	Ingo
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
Hi!

> > On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > > * Thomas Gleixner wrote:
> > > >
> > > > > So I do think we could do more in this area to improve driver
> > > > > performance, if the code is correct and if there's actual
> > > > > benchmarks that are showing real benefits.
> > > >
> > > > If it's about hotpath performance I'm all for it, but the use case
> > > > here is a debug facility...
> > > >
> > > > And if we go down that road then we want a AVX based memcpy()
> > > > implementation which is runtime conditional on the feature bit(s) and
> > > > length dependent. Just slapping a readqq() at it and use it in a loop
> > > > does not make any sense.
> > >
> > > Yeah, so generic memcpy() replacement is only feasible I think if the
> > > most optimistic implementation is actually correct:
> > >
> > >  - if no preempt disable()/enable() is required
> > >
> > >  - if direct access to the AVX[2] registers does not disturb legacy
> > >    FPU state in any fashion
> > >
> > >  - if direct access to the AVX[2] registers cannot raise weird
> > >    exceptions or have weird behavior if the FPU control word is
> > >    modified to non-standard values by untrusted user-space
> > >
> > > If we have to touch the FPU tag or control words then it's probably
> > > only good for a specialized API.
> >
> > I did not mean to have a general memcpy replacement. Rather something
> > like magic_memcpy() which falls back to memcpy when AVX is not usable or
> > the length does not justify the AVX stuff at all.
>
> OK, fair enough.
>
> Note that a generic version might still be worth trying out, if and only
> if it's safe to access those vector registers directly: modern x86 CPUs
> will do their non-constant memcpy()s via the common memcpy_erms() function
> - which could in theory be an easy common point to be (cpufeatures-)
> patched to an AVX2 variant, if size (and alignment, perhaps) is a multiple
> of 32 bytes or so.
How is AVX2 supposed to help the memcpy speed?

If the copy is small, constant overhead will dominate, and I don't think
AVX2 is going to be a win there.

If the copy is big, well, the copy loop will likely run out of L1 and maybe
even out of L2, and at that point speed of the loop does not matter because
memory is slow...?

Best regards,

	Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Thu, Mar 22, 2018 at 5:40 PM, Alexei Starovoitov wrote:
> On Thu, Mar 22, 2018 at 10:33:43AM +0100, Ingo Molnar wrote:
>>
>> - I think the BPF JIT, whose byte code machine language is used by an
>>   increasing number of kernel subsystems, could benefit from having
>>   vector ops. It would possibly allow the handling of floating point
>>   types.
>
> this is on our todo list already.
> To process certain traffic inside BPF in XDP we'd like to have access to
> floating point. The current workaround is to precompute the math and do
> bpf map lookup instead.
> Since XDP processing of packets is batched (typically up to napi budget
> of 64 packets at a time), we can, in theory, wrap the loop with
> kernel_fpu_begin/end and it will be cleaner and faster,
> but the work hasn't started yet.
> The microbenchmark numbers you quoted for xsave/xrestore look promising,
> so we probably will focus on it soon.
>
> Another use case for vector insns is to accelerate fib/lpm lookups
> which is likely beneficial for kernel overall regardless of bpf usage.

This is a further argument for the deferred restore approach IMO. With
deferred restore, kernel_fpu_begin() + kernel_fpu_end() has very low
amortized cost as long as we do lots of work in the kernel before
re-entering userspace. For XDP workloads, this could be a pretty big win, I
imagine. Someone just needs to do the nasty x86 work.
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Thu, Mar 22, 2018 at 10:33:43AM +0100, Ingo Molnar wrote:
>
> - I think the BPF JIT, whose byte code machine language is used by an
>   increasing number of kernel subsystems, could benefit from having
>   vector ops. It would possibly allow the handling of floating point
>   types.

this is on our todo list already.
To process certain traffic inside BPF in XDP we'd like to have access to
floating point. The current workaround is to precompute the math and do
bpf map lookup instead.
Since XDP processing of packets is batched (typically up to napi budget
of 64 packets at a time), we can, in theory, wrap the loop with
kernel_fpu_begin/end and it will be cleaner and faster,
but the work hasn't started yet.
The microbenchmark numbers you quoted for xsave/xrestore look promising,
so we probably will focus on it soon.

Another use case for vector insns is to accelerate fib/lpm lookups
which is likely beneficial for kernel overall regardless of bpf usage.
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Thu, Mar 22, 2018 at 5:48 AM, David Laight wrote:
>
> So if we needed to do PIO reads, using the AVX2 (or better AVX-512)
> registers would make a significant difference.
> Fortunately we can 'dma' most of the data we need to transfer.

I think this is the really fundamental issue. A device that expects PIO to
do some kind of high-performance transaction is a *broken* device. It really
is that simple. We don't bend over for misdesigned hardware crap unless it
is really common.

> I've traced writes before, they are a lot faster and are limited
> by things in the fpga fabric (they appear back to back).

The write combine buffer really should be much more effective than any AVX
or similar can ever be.

		Linus
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: David Laight
> Sent: 22 March 2018 10:36
...
> Any code would need to be in memcpy_fromio(), not in every driver that
> might benefit.
> Then fallback code can be used if the registers aren't available.
>
> > (b) we can't guarantee that %ymm register write will show up on any
> > bus as a single write transaction anyway
>
> Misaligned 8 byte accesses generate a single PCIe TLP.
> I can look at what happens for AVX2 transfers later.
> I've got code that mmap()s PCIe addresses into user space, and can
> look at the TLP (indirectly through tracing on an fpga target).
> Just need to set something up that uses AVX copies.

On my i7-7700 reads into xmm registers generate 16 byte TLPs and reads into
ymm registers 32 byte TLPs. I don't think I've a system that supports
AVX-512.

With my lethargic fpga slave, 32 byte reads happen every 144 clocks and 16
byte ones every 126 (+/- the odd clock). Single byte ones happen every 108,
8 byte every 117. So I have (about) 110 clocks of overhead on every read
cycle. These clocks are the 62.5MHz clock on the fpga.

So if we needed to do PIO reads, using the AVX2 (or better AVX-512)
registers would make a significant difference. Fortunately we can 'dma' most
of the data we need to transfer.

I've traced writes before, they are a lot faster and are limited by things
in the fpga fabric (they appear back to back).

	David
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From:
> Sent: 21 March 2018 18:16
> To: Ingo Molnar
...
> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?
>
> Yes, yes, I can find an Intel white-paper that talks about setting WC
> and then using xmm and ymm instructions to write a single 64-byte
> burst over PCIe, and I assume that is where the code and idea came
> from. But I don't actually see any reason why a burst of 8 regular
> quad-word writes wouldn't cause a 64-byte burst write too.

The big gain is from wide PCIe reads, not writes.
Writes to uncached locations (without WC) are almost certainly required to
generate the 'obvious' PCIe TLP, otherwise things are likely to break.

> So right now this is for _one_ odd rdma controller, with absolutely
> _zero_ performance numbers, and a very high likelihood that it does
> not matter in the least.
>
> And if there are some atomicity concerns ("we need to guarantee a
> single atomic access for race conditions with the hardware"), they are
> probably bogus and misplaced anyway, since
>
> (a) we can't guarantee that AVX2 exists in the first place

Any code would need to be in memcpy_fromio(), not in every driver that might
benefit. Then fallback code can be used if the registers aren't available.

> (b) we can't guarantee that %ymm register write will show up on any
> bus as a single write transaction anyway

Misaligned 8 byte accesses generate a single PCIe TLP.
I can look at what happens for AVX2 transfers later.
I've got code that mmap()s PCIe addresses into user space, and can look at
the TLP (indirectly through tracing on an fpga target). Just need to set
something up that uses AVX copies.

> So as far as I can tell, there are basically *zero* upsides, and a lot
> of potential downsides.

There are some upsides. I'll do a performance measurement for reads.

	David
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Andy Lutomirski wrote:

> On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar wrote:
> >
> > * Linus Torvalds wrote:
> >
> >> And even if you ignore that "maintenance problems down the line" issue
> >> ("we can fix them when they happen") I don't want to see games like
> >> this, because I'm pretty sure it breaks the optimized xsave by tagging
> >> the state as being dirty.
> >
> > That's true - and it would penalize the context switch cost of the
> > affected task for the rest of its lifetime, as I don't think there's
> > much that clears XINUSE other than a FINIT, which is rarely done by
> > user-space.
> >
> >> So no. Don't use vector stuff in the kernel. It's not worth the pain.
> >
> > I agree, but:
> >
> >> The *only* valid use is pretty much crypto, and even there it has had
> >> issues. Benchmarks use big arrays and/or dense working sets etc to
> >> "prove" how good the vector version is, and then you end up in
> >> situations where it's used once per fairly small packet for an
> >> interrupt, and it's actually much worse than doing it by hand.
> >
> > That's mainly because the XSAVE/XRSTOR done by kernel_fpu_begin()/end()
> > is so expensive, so this argument is somewhat circular.
>
> If we do the deferred restore, then the XSAVE/XRSTOR happens at most
> once per kernel entry, which isn't so bad IMO. Also, with PTI, kernel
> entries are already so slow that this will be mostly in the noise :(

For performance/scalability work we should just ignore the PTI overhead: it
doesn't exist on AMD CPUs and Intel has announced Meltdown-fixed CPUs, to be
released later this year:

  https://www.anandtech.com/show/12533/intel-spectre-meltdown

By the time any kernel changes we are talking about today get to distros and
users the newest hardware won't have the Meltdown bug.

Thanks,

	Ingo
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Linus Torvalds wrote:

> And the real worry is things like AVX-512 etc, which is exactly when
> things like "save and restore one ymm register" will quite likely
> clear the upper bits of the zmm register.

Yeah, I think the only valid save/restore pattern is to 100% correctly
enumerate the width of the vector registers, and use full width instructions.

Using partial registers, even though it's possible in some cases, is
probably a bad idea not just due to most instructions auto-zeroing the upper
portion to reduce false dependencies, but also because 'mixed' use of
partial and full register access is known to result in penalties on a wide
range of Intel CPUs, at least according to the Agner PDFs. On AMD CPUs
there's no penalty.

So what I think could be done at best is to define a full register
save/restore API, which falls back to XSAVE*/XRSTOR* if we don't have the
routines for the native vector register width. (I.e. if an old kernel is
used on a very new CPU.)

Note that the actual AVX code could still use partial width, it's the
save/restore primitives that have to handle full width registers.

> And yes, we can have some statically patched code that takes that into
> account, and saves the whole zmm register when AVX512 is on, but the whole
> *point* of the dynamic XSAVES thing is actually that Intel wants to be
> able to enable new user-space features without having to wait for OS
> support. Literally. That's why and how it was designed.

This aspect wouldn't be hurt AFAICS: to me it appears that due to glibc
using vector instructions in its memset() the AVX bits get used early on and
to the maximum, so the XINUSE for them is set for every task.

The optionality of other XSAVE based features like MPX wouldn't be hurt if
the kernel only uses vector registers.

> And saving a couple of zmm registers is actually pretty hard. They're big.
> Do you want to allocate 128 bytes of stack space, preferably 64-byte
> aligned, for a save area? No.
> So now it needs to be some kind of per-thread (or maybe per-CPU, if we're
> willing to continue to not preempt) special save area too.

Hm, that's indeed a nasty complication:

- While a single 128 bytes slot might work - in practice at least two vector
  registers are needed to have enough parallelism and hide latencies.

- ->thread.fpu.state.xsave is available almost all the time: with our
  current 'direct' FPU context switching code the only time there's live
  data in ->thread.fpu is when the task is not running. But it's not
  IRQ-safe.

  We could probably allow irq save/restore sections to use it, as
  local_irq_save()/restore() is still *much* faster than a 1-1.5K FPU
  context save/restore pattern. But I was hoping for a less restrictive
  model ... :-/

To have a better model and avoid the local_irq_save()/restore we could
perhaps change the IRQ model to have a per IRQ 'current' value (we have
separate IRQ stacks already), but that's quite a bit of work to transform
all code that operates on the interrupted task (scheduler and timer code).

But it's work that would be useful for other reasons as well. With such a
separation in place ->thread.fpu.state.xsave would become a generic, natural
vector register save area.

> And even then, it doesn't solve the real worry of "maybe there will be odd
> interactions with future extensions that we don't even know of".

Yes, that's true, but I think we could avoid these dangers by using CPU
model based enumeration. The cost would be that vector ops would only be
available on new CPU models after an explicit opt-in. In many cases it will
be a single new constant to an existing switch() statement, easily
backported as well.

> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?

Ok, so that's not what I'd use it for, I'd use it:

- Speed up existing AVX (crypto, RAID) routines for smaller buffer sizes.
  Right now the XSAVE*+XRSTOR* cost is significant:

     x86/fpu: Cost of: XSAVE   insn:  104 cycles
     x86/fpu: Cost of: XRSTOR  insn:   80 cycles

  ... and that's with just 128-bit AVX and a ~0.8K XSAVE area. The Agner PDF
  lists Skylake XSAVE+XRSTOR costs at 107+122 cycles, plus there's probably
  a significant amount of L1 cache churn caused by XSAVE/XRSTOR.

  Most of the relevant vector instructions have a single cycle cost on the
  other hand.

- To use vector ops in bulk, well-aligned memcpy(), which in many workloads
  is a fair chunk of all memset() activity. A usage profile on a typical
  system:

    galatea:~> cat /proc/sched_debug | grep hist | grep -E '[[:digit:]]{4,}$' | grep '0\]'
    hist[0x]:       1514272
    hist[0x0010]:   1905248
    hist[0x0020]:     99471
    hist[0x0030]:    343309
    hist[0x0040]:    177874
    hist[0x0080]:    190052
    hist[0x00a0]:      5258
    hist[0x00b0]:      2387
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Linus Torvalds wrote:

> And the real worry is things like AVX-512 etc, which is exactly when
> things like "save and restore one ymm register" will quite likely
> clear the upper bits of the zmm register.

Yeah, I think the only valid save/restore pattern is to 100% correctly enumerate
the width of the vector registers and use full-width instructions. Using partial
registers, even though it's possible in some cases, is probably a bad idea: not
just because most instructions auto-zero the upper portion to reduce false
dependencies, but also because 'mixed' use of partial and full register access
is known to result in penalties on a wide range of Intel CPUs, at least
according to the Agner PDFs. On AMD CPUs there's no penalty.

So what I think could be done, at best, is to define a full-register
save/restore API which falls back to XSAVE*/XRSTOR* if we don't have the
routines for the native vector register width (i.e. if an old kernel is used on
a very new CPU).

Note that the actual AVX code could still use partial width; it's the
save/restore primitives that have to handle full-width registers.

> And yes, we can have some statically patched code that takes that into
> account, and saves the whole zmm register when AVX512 is on, but the whole
> *point* of the dynamic XSAVES thing is actually that Intel wants to be able
> to enable new user-space features without having to wait for OS support.
> Literally. That's why and how it was designed.

This aspect wouldn't be hurt AFAICS: to me it appears that due to glibc using
vector instructions in its memset() the AVX bits get used early on and to the
maximum, so the XINUSE for them is set for every task. The optionality of other
XSAVE based features like MPX wouldn't be hurt if the kernel only uses vector
registers.

> And saving a couple of zmm registers is actually pretty hard. They're big.
> Do you want to allocate 128 bytes of stack space, preferably 64-byte
> aligned, for a save area? No.
> So now it needs to be some kind of per-thread (or maybe per-CPU,
> if we're willing to continue to not preempt) special save area too.

Hm, that's indeed a nasty complication:

- While a single 128-byte slot might work, in practice at least two vector
  registers are needed to have enough parallelism and hide latencies.

- ->thread.fpu.state.xsave is available almost all the time: with our current
  'direct' FPU context switching code the only time there's live data in
  ->thread.fpu is when the task is not running. But it's not IRQ-safe. We could
  probably allow irq save/restore sections to use it, as
  local_irq_save()/restore() is still *much* faster than a 1-1.5K FPU context
  save/restore pattern. But I was hoping for a less restrictive model ... :-/

To have a better model and avoid the local_irq_save()/restore() we could
perhaps change the IRQ model to have a per-IRQ 'current' value (we have
separate IRQ stacks already), but that's quite a bit of work to transform all
code that operates on the interrupted task (scheduler and timer code). But it's
work that would be useful for other reasons as well. With such a separation in
place ->thread.fpu.state.xsave would become a generic, natural vector register
save area.

> And even then, it doesn't solve the real worry of "maybe there will be odd
> interactions with future extensions that we don't even know of".

Yes, that's true, but I think we could avoid these dangers by using CPU model
based enumeration. The cost would be that vector ops would only be available
on new CPU models after an explicit opt-in. In many cases it will be a single
new constant in an existing switch() statement, easily backported as well.

> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?

Ok, so that's not what I'd use it for. I'd use it:

- To speed up existing AVX (crypto, RAID) routines for smaller buffer sizes.
  Right now the XSAVE*+XRSTOR* cost is significant:

      x86/fpu: Cost of: XSAVE insn:  104 cycles
      x86/fpu: Cost of: XRSTOR insn:  80 cycles

  ... and that's with just 128-bit AVX and a ~0.8K XSAVE area. The Agner PDF
  lists Skylake XSAVE+XRSTOR costs at 107+122 cycles, plus there's probably a
  significant amount of L1 cache churn caused by XSAVE/XRSTOR. Most of the
  relevant vector instructions have a single-cycle cost on the other hand.

- To use vector ops in bulk, well-aligned memcpy(), which in many workloads is
  a fair chunk of all memcpy() activity. A usage profile on a typical system:

      galatea:~> cat /proc/sched_debug | grep hist | grep -E '[[:digit:]]{4,}$' | grep '0\]'
      hist[0x0000]: 1514272
      hist[0x0010]: 1905248
      hist[0x0020]:   99471
      hist[0x0030]:  343309
      hist[0x0040]:  177874
      hist[0x0080]:  190052
      hist[0x00a0]:    5258
      hist[0x00b0]:    2387
      hist[0x00c0]:
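[Editorial sketch] The CPU-model based opt-in described above ("a single new constant in an existing switch() statement") could look roughly like the following userspace sketch; the helper name is hypothetical and the model constants are only an illustrative subset, not a kernel header:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative Intel family-6 model numbers - a subset, for the sketch. */
#define INTEL_FAM6_HASWELL_CORE     0x3c
#define INTEL_FAM6_SKYLAKE_DESKTOP  0x5e

/* Hypothetical opt-in check: vector ops are only enabled for CPU models
 * somebody has explicitly audited, so any unknown (future) model safely
 * falls back to the XSAVE / XRSTOR path by default. */
static bool kernel_vector_ops_safe(unsigned int fam6_model)
{
    switch (fam6_model) {
    case INTEL_FAM6_HASWELL_CORE:
    case INTEL_FAM6_SKYLAKE_DESKTOP:
        return true;   /* audited: register width and XSAVE layout known */
    default:
        return false;  /* unknown model: use XSAVE/XRSTOR instead */
    }
}
```

Enabling a newly audited model is then a one-line, easily backported change: add its constant as another `case` label.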
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Wed, Mar 21, 2018 at 12:46 AM, Ingo Molnar wrote:
>
> So I added a bit of instrumentation and the current state of things is that on
> 64-bit x86 every single task has an initialized FPU, every task has the exact
> same, fully filled in xfeatures (XINUSE) value:

Bah. Your CPU is apparently some old crud that barely has AVX. Of course all
those features are enabled.

> Note that this is with an AVX (128-bit) supporting CPU:

That's weak even by modern standards. I have MPX bounds and MPX CSR on my old
Skylake.

And the real worry is things like AVX-512 etc, which is exactly when things
like "save and restore one ymm register" will quite likely clear the upper
bits of the zmm register.

And yes, we can have some statically patched code that takes that into
account, and saves the whole zmm register when AVX512 is on, but the whole
*point* of the dynamic XSAVES thing is actually that Intel wants to be able to
enable new user-space features without having to wait for OS support.
Literally. That's why and how it was designed.

And saving a couple of zmm registers is actually pretty hard. They're big. Do
you want to allocate 128 bytes of stack space, preferably 64-byte aligned, for
a save area? No. So now it needs to be some kind of per-thread (or maybe
per-CPU, if we're willing to continue to not preempt) special save area too.

And even then, it doesn't solve the real worry of "maybe there will be odd
interactions with future extensions that we don't even know of".

All this to do a 32-byte PIO access, with absolutely zero data right now on
what the win is?

Yes, yes, I can find an Intel white-paper that talks about setting WC and then
using xmm and ymm instructions to write a single 64-byte burst over PCIe, and
I assume that is where the code and idea came from. But I don't actually see
any reason why a burst of 8 regular quad-word writes wouldn't cause a 64-byte
burst write too.
So right now this is for _one_ odd RDMA controller, with absolutely _zero_
performance numbers, and a very high likelihood that it does not matter in the
least.

And if there are some atomicity concerns ("we need to guarantee a single
atomic access for race conditions with the hardware"), they are probably bogus
and misplaced anyway, since

 (a) we can't guarantee that AVX2 exists in the first place

 (b) we can't guarantee that a %ymm register write will show up on any bus as
     a single write transaction anyway

So as far as I can tell, there are basically *zero* upsides, and a lot of
potential downsides.

            Linus
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar wrote:
>
> * Linus Torvalds wrote:
>
>> And even if you ignore that "maintenance problems down the line" issue
>> ("we can fix them when they happen") I don't want to see games like
>> this, because I'm pretty sure it breaks the optimized xsave by tagging
>> the state as being dirty.
>
> That's true - and it would penalize the context switch cost of the affected
> task for the rest of its lifetime, as I don't think there's much that clears
> XINUSE other than a FINIT, which is rarely done by user-space.
>
>> So no. Don't use vector stuff in the kernel. It's not worth the pain.
>
> I agree, but:
>
>> The *only* valid use is pretty much crypto, and even there it has had
>> issues. Benchmarks use big arrays and/or dense working sets etc to "prove"
>> how good the vector version is, and then you end up in situations where
>> it's used once per fairly small packet for an interrupt, and it's actually
>> much worse than doing it by hand.
>
> That's mainly because the XSAVE/XRSTOR done by kernel_fpu_begin()/end() is
> so expensive, so this argument is somewhat circular.

If we do the deferred restore, then the XSAVE/XRSTOR happens at most once per
kernel entry, which isn't so bad IMO.

Also, with PTI, kernel entries are already so slow that this will be mostly in
the noise :(

--Andy
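[Editorial sketch] The deferred-restore flow Andy describes can be simulated in userspace; the flag name echoes the pseudocode quoted later in the thread, the function names are hypothetical, and a counter stands in for the actual XSAVES/XRSTORS:

```c
#include <assert.h>
#include <stdbool.h>

static bool tif_need_fpu_restore; /* per-thread flag in the real scheme */
static int  xsaves_executed;      /* how many state saves actually ran  */

/* First call in a kernel entry pays for one save of user FPU state;
 * every later call in the same entry is free. */
static void kernel_fpu_begin_deferred(void)
{
    if (tif_need_fpu_restore)
        return;                   /* user state already saved: no cost */
    xsaves_executed++;            /* stand-in for XSAVES of user state */
    tif_need_fpu_restore = true;
}

/* kernel_fpu_end() does nothing at all in this scheme. */
static void kernel_fpu_end_deferred(void)
{
}

/* The matching XRSTORS is deferred to the return to user mode,
 * so it too happens at most once per kernel entry. */
static void prepare_exit_to_usermode_fpu(void)
{
    if (tif_need_fpu_restore)
        tif_need_fpu_restore = false; /* stand-in for XRSTORS */
}
```

Repeated kernel_fpu_begin()/end() pairs within one entry then cost one save plus one restore in total, which is the point of the scheme.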
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
So I poked around a bit and I'm having second thoughts:

* Linus Torvalds wrote:

> On Tue, Mar 20, 2018 at 1:26 AM, Ingo Molnar wrote:
> >
> > So assuming the target driver will only load on modern FPUs I *think* it
> > should actually be possible to do something like (pseudocode):
> >
> >     vmovdqa %ymm0, 40(%rsp)
> >     vmovdqa %ymm1, 80(%rsp)
> >
> >     ...
> >     # use ymm0 and ymm1
> >     ...
> >
> >     vmovdqa 80(%rsp), %ymm1
> >     vmovdqa 40(%rsp), %ymm0
> >
> > ... without using the heavy XSAVE/XRSTOR instructions.
>
> No. The above is buggy. It may *work*, but it won't work in the long run.
>
> Pretty much every single vector extension has traditionally meant that
> touching "old" registers breaks any new register use. Even if you save
> the registers carefully like in your example code, it will do magic
> and random things to the "future extended" version.

This should be relatively straightforward to solve via a proper CPU features
check: for example by only patching in the AVX routines for 'known compatible'
fpu_kernel_xstate_size values. Future extensions of register width will extend
the XSAVE area.

It's not fool-proof: in theory there could be semantic extensions to the
vector unit that do not increase the size of the context - but the normal
pattern is to increase the number of XINUSE bits and bump up the maximum
context area size.

If that's a worry then an even safer compatibility check would be to
explicitly list CPU models - we do track them pretty accurately anyway these
days, mostly due to perf PMU support defaulting to a safe but dumb variant if
a CPU model is not specifically listed. That method, although more
maintenance-intensive, should be pretty fool-proof AFAICS.

> So I absolutely *refuse* to have anything to do with the vector unit.
> You can only touch it in the kernel if you own it entirely (ie that
> "kernel_fpu_begin()/_end()" thing). Anything else is absolutely
> guaranteed to cause problems down the line.
> And even if you ignore that "maintenance problems down the line" issue
> ("we can fix them when they happen") I don't want to see games like
> this, because I'm pretty sure it breaks the optimized xsave by tagging
> the state as being dirty.

So I added a bit of instrumentation and the current state of things is that on
64-bit x86 every single task has an initialized FPU, and every task has the
exact same, fully filled in xfeatures (XINUSE) value:

    [root@galatea ~]# grep -h fpu /proc/*/task/*/fpu | sort | uniq -c
        504 x86/fpu: initialized     : 1
        504 x86/fpu: xfeatures_mask  : 7

So our latest FPU model is *really* simple and user-space should not be able
to observe any changes in the XINUSE bits of the XSAVE header, because (at
least for the basic vector CPU features) all bits are maxed out all the time.

Note that this is with an AVX (128-bit) supporting CPU:

    [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
    [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
    [0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
    [0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
    [0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.

But note that it probably wouldn't make sense to make use of XINUSE
optimizations on most systems for the AVX space, as glibc will use the
highest-bitness vector ops for its regular memcpy(), and every user task makes
use of memcpy(). It does make sense for some of the more optional XSAVE based
features such as pkeys. But I don't have any newer Intel system with a wider
xsave feature set to check.

> So no. Don't use vector stuff in the kernel. It's not worth the pain.

That might still be true, but still I'm torn:

- Broad areas of user-space have seamlessly integrated vector ops and use
  them whenever they can find an excuse to.

- The vector registers are fundamentally callee-saved, so in most synchronous
  calls the vector unit registers are unused.
Asynchronous interruptions of context (interrupts, faults, preemption, etc.)
can still use them as well, as long as they save/restore register contents.

So other than Intel not making it particularly easy to build a
forwards-compatible, vector-register-granular save/restore pattern (but see
above for how we could handle that) for asynchronous contexts, I don't see too
many other complications.

Thanks,

	Ingo
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Linus Torvalds wrote:

> And even if you ignore that "maintenance problems down the line" issue
> ("we can fix them when they happen") I don't want to see games like
> this, because I'm pretty sure it breaks the optimized xsave by tagging
> the state as being dirty.

That's true - and it would penalize the context switch cost of the affected
task for the rest of its lifetime, as I don't think there's much that clears
XINUSE other than a FINIT, which is rarely done by user-space.

> So no. Don't use vector stuff in the kernel. It's not worth the pain.

I agree, but:

> The *only* valid use is pretty much crypto, and even there it has had
> issues. Benchmarks use big arrays and/or dense working sets etc to "prove"
> how good the vector version is, and then you end up in situations where
> it's used once per fairly small packet for an interrupt, and it's actually
> much worse than doing it by hand.

That's mainly because the XSAVE/XRSTOR done by kernel_fpu_begin()/end() is so
expensive, so this argument is somewhat circular. IFF it was safe to just use
the vector unit, then vector-unit based crypto would be very fast for small
buffers, and even faster for larger buffer sizes. Saving and restoring up to
~1.5K of context is not cheap.

Thanks,

	Ingo
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Tue, Mar 20, 2018 at 3:10 PM, David Laight wrote:
> From: Andy Lutomirski
>> Sent: 20 March 2018 14:57
> ...
>> I'd rather see us finally finish the work that Rik started to rework
>> this differently. I'd like kernel_fpu_begin() to look like:
>>
>>     if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
>>         return; // we're already okay. maybe we need to check
>>                 // in_interrupt() or something, though?
>>     } else {
>>         XSAVES/XSAVEOPT/XSAVE;
>>         set_thread_flag(TIF_NEED_FPU_RESTORE);
>>     }
>>
>> and kernel_fpu_end() does nothing at all.
>
> I guess it might need to set (clear?) the CFLAGS bit for a process
> that isn't using the fpu at all - which seems a sensible feature.

What do you mean "CFLAGS"? But we no longer have any concept of "isn't using
the fpu at all" -- we got rid of that.

>> We take the full performance hit for a *single* kernel_fpu_begin() on
>> an otherwise short syscall or interrupt, but there's no additional
>> cost for more of them or for long-enough-running things that we
>> schedule in the middle.
>
> It might be worth adding a parameter to kernel_fpu_begin() to indicate
> which registers are needed, and a return value to say what has been
> granted. Then a driver could request AVX2 (for example) and use a fallback
> path if the register set isn't available (for any reason). A call from an
> ISR could always fail.

Last time I benchmarked it, XSAVEC on none of the state wasn't a whole lot
faster than XSAVEC for all of it.

>> As I remember, the main hangup was that this interacts a bit oddly
>> with PKRU, but that's manageable.
>
> WTF PKRU ??

PKRU is uniquely demented. All the rest of the XSAVEC state only affects code
that explicitly references that state. PKRU affects every single access to
user pages, so we need PKRU to match the current task at all times in the
kernel. This means that, if we start deferring XRSTORS until
prepare_exit_to_usermode(), we need to start restoring PKRU using WRPKRU in
__switch_to().
Of course, *that* interacts a bit oddly with XINUSE, but maybe we don't care. Phooey on you, Intel, for putting PKRU into xstate and not giving a fast instruction to control XINUSE.
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Tue, Mar 20, 2018 at 1:26 AM, Ingo Molnar wrote:
>
> So assuming the target driver will only load on modern FPUs I *think* it
> should actually be possible to do something like (pseudocode):
>
>     vmovdqa %ymm0, 40(%rsp)
>     vmovdqa %ymm1, 80(%rsp)
>
>     ...
>     # use ymm0 and ymm1
>     ...
>
>     vmovdqa 80(%rsp), %ymm1
>     vmovdqa 40(%rsp), %ymm0
>
> ... without using the heavy XSAVE/XRSTOR instructions.

No. The above is buggy. It may *work*, but it won't work in the long run.

Pretty much every single vector extension has traditionally meant that
touching "old" registers breaks any new register use. Even if you save the
registers carefully like in your example code, it will do magic and random
things to the "future extended" version.

So I absolutely *refuse* to have anything to do with the vector unit. You can
only touch it in the kernel if you own it entirely (ie that
"kernel_fpu_begin()/_end()" thing). Anything else is absolutely guaranteed to
cause problems down the line.

And even if you ignore that "maintenance problems down the line" issue ("we
can fix them when they happen") I don't want to see games like this, because
I'm pretty sure it breaks the optimized xsave by tagging the state as being
dirty.

So no. Don't use vector stuff in the kernel. It's not worth the pain.

The *only* valid use is pretty much crypto, and even there it has had issues.
Benchmarks use big arrays and/or dense working sets etc to "prove" how good
the vector version is, and then you end up in situations where it's used once
per fairly small packet for an interrupt, and it's actually much worse than
doing it by hand.

            Linus
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Andy Lutomirski
> Sent: 20 March 2018 14:57
...
> I'd rather see us finally finish the work that Rik started to rework
> this differently. I'd like kernel_fpu_begin() to look like:
>
> if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
>     return; // we're already okay. maybe we need to check
>             // in_interrupt() or something, though?
> } else {
>     XSAVES/XSAVEOPT/XSAVE;
>     set_thread_flag(TIF_NEED_FPU_RESTORE):
> }
>
> and kernel_fpu_end() does nothing at all.

I guess it might need to set (clear?) the CFLAGS bit for a process that isn't using the fpu at all - which seems a sensible feature.

> We take the full performance hit for a *single* kernel_fpu_begin() on
> an otherwise short syscall or interrupt, but there's no additional
> cost for more of them or for long-enough-running things that we
> schedule in the middle.

It might be worth adding a parameter to kernel_fpu_begin() to indicate which registers are needed, and a return value to say what has been granted. Then a driver could request AVX2 (for example) and use a fallback path if the register set isn't available (for any reason). A call from an ISR could always fail.

> As I remember, the main hangup was that this interacts a bit oddly
> with PKRU, but that's manageable.

WTF PKRU ??

	David
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Tue, Mar 20, 2018 at 8:26 AM, Ingo Molnar wrote:
>
> * Thomas Gleixner wrote:
>
>> > Useful also for code that needs AVX-like registers to do things like CRCs.
>>
>> x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger blocks
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().
>
> kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full FPU
> save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily copy a
> thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter for
> something small, like a single 256-bit or 512-bit word access.
>
> But there's actually a new thing in modern kernels: we got rid of (most of) lazy
> save/restore FPU code, our new x86 FPU model is very "direct" with no FPU faults
> taken normally.
>
> So assuming the target driver will only load on modern FPUs I *think* it should
> actually be possible to do something like (pseudocode):
>
>        vmovdqa %ymm0, 40(%rsp)
>        vmovdqa %ymm1, 80(%rsp)
>
>        ...
>        # use ymm0 and ymm1
>        ...
>
>        vmovdqa 80(%rsp), %ymm1
>        vmovdqa 40(%rsp), %ymm0
>

I think this kinda sorta works, but only kinda sorta:

 - I'm a bit worried about newer CPUs where, say, a 256-bit vector operation will implicitly clear the high 768 bits of the regs. (IIRC that's how it works for the new VEX stuff.)

 - This code will cause XINUSE to be set, which is user-visible and may have performance and correctness effects. I think the kernel may already have some bug where it occasionally sets XINUSE on its own, and this caused problems for rr in the past. This might not be a total showstopper, but it's odd.

I'd rather see us finally finish the work that Rik started to rework this differently. I'd like kernel_fpu_begin() to look like:

if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
    return; // we're already okay. maybe we need to check
            // in_interrupt() or something, though?
} else {
    XSAVES/XSAVEOPT/XSAVE;
    set_thread_flag(TIF_NEED_FPU_RESTORE):
}

and kernel_fpu_end() does nothing at all.

We take the full performance hit for a *single* kernel_fpu_begin() on an otherwise short syscall or interrupt, but there's no additional cost for more of them or for long-enough-running things that we schedule in the middle.

As I remember, the main hangup was that this interacts a bit oddly with PKRU, but that's manageable.

--Andy
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Monday, March 03/19/18, 2018 at 20:57:22 +0530, Christoph Hellwig wrote:
> On Mon, Mar 19, 2018 at 07:50:33PM +0530, Rahul Lakkireddy wrote:
> > This series of patches add support for 256-bit IO read and write.
> > The APIs are readqq and writeqq (quad quadword - 4 x 64), that read
> > and write 256-bits at a time from IO, respectively.
>
> What a horrible name. please encode the actual number of bits instead.

Ok. Will change it to read256() and write256().

Thanks,
Rahul
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Ingo Molnar
> Sent: 20 March 2018 10:54
...
> Note that a generic version might still be worth trying out, if and only if it's
> safe to access those vector registers directly: modern x86 CPUs will do their
> non-constant memcpy()s via the common memcpy_erms() function - which could in
> theory be an easy common point to be (cpufeatures-) patched to an AVX2 variant, if
> size (and alignment, perhaps) is a multiple of 32 bytes or so.
>
> Assuming it's correct with arbitrary user-space FPU state and if it results in any
> measurable speedups, which might not be the case: ERMS is supposed to be very
> fast.
>
> So even if it's possible (which it might not be), it could end up being slower
> than the ERMS version.

Last I checked memcpy() was implemented as 'rep movsb' on the latest Intel cpus. Since memcpy_to/fromio() get aliased to memcpy() this generates byte copies. The previous 'fastest' version of memcpy() was ok for uncached locations.

For PCIe I suspect that the actual instructions don't make a massive difference. I'm not even sure interleaving two transfers makes any difference. What makes a huge difference for memcpy_fromio() is the size of the register. The time taken for a read will be largely independent of the width of the register used.

	David
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Thomas Gleixner wrote:

> On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
> >
> > > > So I do think we could do more in this area to improve driver performance, if the
> > > > code is correct and if there's actual benchmarks that are showing real benefits.
> > >
> > > If it's about hotpath performance I'm all for it, but the use case here is
> > > a debug facility...
> > >
> > > And if we go down that road then we want a AVX based memcpy()
> > > implementation which is runtime conditional on the feature bit(s) and
> > > length dependent. Just slapping a readqq() at it and use it in a loop does
> > > not make any sense.
> >
> > Yeah, so generic memcpy() replacement is only feasible I think if the most
> > optimistic implementation is actually correct:
> >
> >  - if no preempt disable()/enable() is required
> >
> >  - if direct access to the AVX[2] registers does not disturb legacy FPU state in
> >    any fashion
> >
> >  - if direct access to the AVX[2] registers cannot raise weird exceptions or have
> >    weird behavior if the FPU control word is modified to non-standard values by
> >    untrusted user-space
> >
> > If we have to touch the FPU tag or control words then it's probably only good for
> > a specialized API.
>
> I did not mean to have a general memcpy replacement. Rather something like
> magic_memcpy() which falls back to memcpy when AVX is not usable or the
> length does not justify the AVX stuff at all.

OK, fair enough.

Note that a generic version might still be worth trying out, if and only if it's safe to access those vector registers directly: modern x86 CPUs will do their non-constant memcpy()s via the common memcpy_erms() function - which could in theory be an easy common point to be (cpufeatures-) patched to an AVX2 variant, if size (and alignment, perhaps) is a multiple of 32 bytes or so.

Assuming it's correct with arbitrary user-space FPU state and if it results in any measurable speedups, which might not be the case: ERMS is supposed to be very fast.

So even if it's possible (which it might not be), it could end up being slower than the ERMS version.

Thanks,

	Ingo
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Thomas Gleixner
> Sent: 20 March 2018 09:41
> On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
...
> > > And if we go down that road then we want a AVX based memcpy()
> > > implementation which is runtime conditional on the feature bit(s) and
> > > length dependent. Just slapping a readqq() at it and use it in a loop does
> > > not make any sense.
> >
> > Yeah, so generic memcpy() replacement is only feasible I think if the most
> > optimistic implementation is actually correct:
> >
> >  - if no preempt disable()/enable() is required
> >
> >  - if direct access to the AVX[2] registers does not disturb legacy FPU state in
> >    any fashion
> >
> >  - if direct access to the AVX[2] registers cannot raise weird exceptions or have
> >    weird behavior if the FPU control word is modified to non-standard values by
> >    untrusted user-space
> >
> > If we have to touch the FPU tag or control words then it's probably only good for
> > a specialized API.
>
> I did not mean to have a general memcpy replacement. Rather something like
> magic_memcpy() which falls back to memcpy when AVX is not usable or the
> length does not justify the AVX stuff at all.

There is probably no point for memcpy(). Where it would make a big difference is memcpy_fromio() for PCIe devices (where longer TLP make a big difference). But any code belongs in its implementation not in every driver. The implementation of memcpy_toio() is nothing like as critical.

It might be the code would need to fall back to 64bit accesses if the AVX(2) registers can't currently be accessed - maybe some obscure state. However memcpy_to/fromio() are both horrid at the moment because they result in byte copies!

	David
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Tue, 20 Mar 2018, Ingo Molnar wrote:

> * Thomas Gleixner wrote:
>
> > > So I do think we could do more in this area to improve driver performance, if the
> > > code is correct and if there's actual benchmarks that are showing real benefits.
> >
> > If it's about hotpath performance I'm all for it, but the use case here is
> > a debug facility...
> >
> > And if we go down that road then we want a AVX based memcpy()
> > implementation which is runtime conditional on the feature bit(s) and
> > length dependent. Just slapping a readqq() at it and use it in a loop does
> > not make any sense.
>
> Yeah, so generic memcpy() replacement is only feasible I think if the most
> optimistic implementation is actually correct:
>
>  - if no preempt disable()/enable() is required
>
>  - if direct access to the AVX[2] registers does not disturb legacy FPU state in
>    any fashion
>
>  - if direct access to the AVX[2] registers cannot raise weird exceptions or have
>    weird behavior if the FPU control word is modified to non-standard values by
>    untrusted user-space
>
> If we have to touch the FPU tag or control words then it's probably only good for
> a specialized API.

I did not mean to have a general memcpy replacement. Rather something like magic_memcpy() which falls back to memcpy when AVX is not usable or the length does not justify the AVX stuff at all.

Thanks,

	tglx
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Thomas Gleixner wrote:

> > So I do think we could do more in this area to improve driver performance, if the
> > code is correct and if there's actual benchmarks that are showing real benefits.
>
> If it's about hotpath performance I'm all for it, but the use case here is
> a debug facility...
>
> And if we go down that road then we want a AVX based memcpy()
> implementation which is runtime conditional on the feature bit(s) and
> length dependent. Just slapping a readqq() at it and use it in a loop does
> not make any sense.

Yeah, so generic memcpy() replacement is only feasible I think if the most optimistic implementation is actually correct:

 - if no preempt disable()/enable() is required

 - if direct access to the AVX[2] registers does not disturb legacy FPU state in any fashion

 - if direct access to the AVX[2] registers cannot raise weird exceptions or have weird behavior if the FPU control word is modified to non-standard values by untrusted user-space

If we have to touch the FPU tag or control words then it's probably only good for a specialized API.

Thanks,

	Ingo
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Tue, 20 Mar 2018, Ingo Molnar wrote:

> * Thomas Gleixner wrote:
>
> > > Useful also for code that needs AVX-like registers to do things like CRCs.
> >
> > x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger blocks
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().

Correct.

> So assuming the target driver will only load on modern FPUs I *think* it should
> actually be possible to do something like (pseudocode):
>
>        vmovdqa %ymm0, 40(%rsp)
>        vmovdqa %ymm1, 80(%rsp)
>
>        ...
>        # use ymm0 and ymm1
>        ...
>
>        vmovdqa 80(%rsp), %ymm1
>        vmovdqa 40(%rsp), %ymm0
>
> ... without using the heavy XSAVE/XRSTOR instructions.
>
> Note that preemption probably still needs to be disabled and possibly there are
> other details as well, but there should be no 'heavy' FPU operations.

Emphasis on should :)

> I think this should still preserve all user-space FPU state and shouldn't muck up
> any 'weird' user-space FPU state (such as pending exceptions, legacy x87 running
> code, NaN registers or weird FPU control word settings) we might have interrupted
> either.
>
> But I could be wrong, it should be checked whether this sequence is safe.
> Worst-case we might have to save/restore the FPU control and tag words - but those
> operations should still be much faster than a full XSAVE/XRSTOR pair.

Fair enough.

> So I do think we could do more in this area to improve driver performance, if the
> code is correct and if there's actual benchmarks that are showing real benefits.

If it's about hotpath performance I'm all for it, but the use case here is a debug facility...

And if we go down that road then we want a AVX based memcpy() implementation which is runtime conditional on the feature bit(s) and length dependent. Just slapping a readqq() at it and use it in a loop does not make any sense.

Thanks,

	tglx
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Thomas Gleixner wrote:

> > Useful also for code that needs AVX-like registers to do things like CRCs.
>
> x86/crypto/ has a lot of AVX optimized code.

Yeah, that's true, but the crypto code is processing fundamentally bigger blocks of data, which amortizes the cost of using kernel_fpu_begin()/_end().

kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full FPU save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily copy a thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter for something small, like a single 256-bit or 512-bit word access.

But there's actually a new thing in modern kernels: we got rid of (most of) lazy save/restore FPU code, our new x86 FPU model is very "direct" with no FPU faults taken normally.

So assuming the target driver will only load on modern FPUs I *think* it should actually be possible to do something like (pseudocode):

        vmovdqa %ymm0, 40(%rsp)
        vmovdqa %ymm1, 80(%rsp)

        ...
        # use ymm0 and ymm1
        ...

        vmovdqa 80(%rsp), %ymm1
        vmovdqa 40(%rsp), %ymm0

... without using the heavy XSAVE/XRSTOR instructions.

Note that preemption probably still needs to be disabled and possibly there are other details as well, but there should be no 'heavy' FPU operations.

I think this should still preserve all user-space FPU state and shouldn't muck up any 'weird' user-space FPU state (such as pending exceptions, legacy x87 running code, NaN registers or weird FPU control word settings) we might have interrupted either.

But I could be wrong, it should be checked whether this sequence is safe. Worst-case we might have to save/restore the FPU control and tag words - but those operations should still be much faster than a full XSAVE/XRSTOR pair.

So I do think we could do more in this area to improve driver performance, if the code is correct and if there's actual benchmarks that are showing real benefits.

Thanks,

	Ingo
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Mon, Mar 19, 2018 at 8:53 AM, David Laight wrote:

> The x87 and SSE registers can't be changed - they can contain callee-saved
> registers.
> But (IIRC) the AVX and AVX2 registers are all caller-saved.

No. The kernel entry is not the usual function call. On kernel entry, *all*
registers are callee-saved.

Of course, some may be return values, and I have a slight hope that I can
trash %eflags. But basically, a system call is simply not a function call.
We have different calling conventions on the argument side too. In fact, the
arguments are in different registers depending on just *which* system call
you take.

	Linus
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Thomas Gleixner
> Sent: 19 March 2018 15:37
...
> > If system call entry reset the AVX registers then any FP save/restore
> > would be faster because the AVX registers wouldn't need to be saved
> > (and the cpu won't save them).
> > I believe the instruction to reset the AVX registers is fast.
> > The AVX registers only ever need saving if the process enters the
> > kernel through an interrupt.
>
> Wrong. The x86_64 ABI clearly states:
>
>    Linux Kernel code is not allowed to change the x87 and SSE units. If
>    those are changed by kernel code, they have to be restored properly
>    before sleeping or leaving the kernel.
>
> That means the syscall interface relies on FPU state being not changed by
> the kernel. So if you want to clear AVX on syscall entry you need to save
> it first and then restore before returning. That would be a huge
> performance hit.

The x87 and SSE registers can't be changed - they can contain callee-saved
registers.
But (IIRC) the AVX and AVX2 registers are all caller-saved, so the system
call entry stub functions are allowed to change them. Which means that the
syscall entry code can also change them. Of course, it must not leak kernel
values back to userspace.

It has been a few years since I looked at the AVX and FPU save code.

	David
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Mon, 19 Mar 2018, David Laight wrote:

> From: Thomas Gleixner
> > Sent: 19 March 2018 15:05
> >
> > On Mon, 19 Mar 2018, David Laight wrote:
> > > From: Rahul Lakkireddy
> > > In principle it ought to be possible to get access to one or two
> > > (eg) AVX registers by saving them to stack and telling the fpu
> > > save code where you've put them.
> >
> > No. We have functions for this and we are not adding new ad hoc magic.
>
> I was thinking that a real API might do this...

We have a real API and that's good enough for the stuff we have using AVX
in the kernel.

> Useful also for code that needs AVX-like registers to do things like CRCs.

x86/crypto/ has a lot of AVX optimized code.

> > > OTOH, for x86, if the code always runs in process context (eg from a
> > > system call) then, since the ABI defines the AVX(2) registers as
> > > caller-saved, it is only necessary to ensure that the current
> > > FPU registers belong to the current process once.
> > > The registers can be set to zero by an 'invalidate' instruction on
> > > system call entry (hope this is done) and after use.
> >
> > Why would a system call touch the FPU registers? The kernel normally
> > does not use FPU instructions, and the code which explicitly does has
> > to take care of save/restore. It would be performance madness to
> > fiddle with the FPU stuff unconditionally if nothing uses it.
>
> If system call entry reset the AVX registers then any FP save/restore
> would be faster because the AVX registers wouldn't need to be saved
> (and the cpu won't save them).
> I believe the instruction to reset the AVX registers is fast.
> The AVX registers only ever need saving if the process enters the
> kernel through an interrupt.

Wrong. The x86_64 ABI clearly states:

    Linux Kernel code is not allowed to change the x87 and SSE units. If
    those are changed by kernel code, they have to be restored properly
    before sleeping or leaving the kernel.

That means the syscall interface relies on FPU state being not changed by
the kernel. So if you want to clear AVX on syscall entry you need to save
it first and then restore before returning. That would be a huge
performance hit.

Thanks,

	tglx
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Mon, Mar 19, 2018 at 07:50:33PM +0530, Rahul Lakkireddy wrote:
> This series of patches add support for 256-bit IO read and write.
> The APIs are readqq and writeqq (quad quadword - 4 x 64), that read
> and write 256-bits at a time from IO, respectively.

What a horrible name. Please encode the actual number of bits instead.
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Thomas Gleixner
> Sent: 19 March 2018 15:05
>
> On Mon, 19 Mar 2018, David Laight wrote:
> > From: Rahul Lakkireddy
> > In principle it ought to be possible to get access to one or two
> > (eg) AVX registers by saving them to stack and telling the fpu
> > save code where you've put them.
>
> No. We have functions for this and we are not adding new ad hoc magic.

I was thinking that a real API might do this...
Useful also for code that needs AVX-like registers to do things like CRCs.

> > OTOH, for x86, if the code always runs in process context (eg from a
> > system call) then, since the ABI defines the AVX(2) registers as
> > caller-saved, it is only necessary to ensure that the current
> > FPU registers belong to the current process once.
> > The registers can be set to zero by an 'invalidate' instruction on
> > system call entry (hope this is done) and after use.
>
> Why would a system call touch the FPU registers? The kernel normally does
> not use FPU instructions, and the code which explicitly does has to take
> care of save/restore. It would be performance madness to fiddle with the
> FPU stuff unconditionally if nothing uses it.

If system call entry reset the AVX registers then any FP save/restore would
be faster, because the AVX registers wouldn't need to be saved (and the cpu
won't save them).
I believe the instruction to reset the AVX registers is fast.
The AVX registers only ever need saving if the process enters the kernel
through an interrupt.

	David
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Mon, 19 Mar 2018, David Laight wrote:

> From: Rahul Lakkireddy
> In principle it ought to be possible to get access to one or two
> (eg) AVX registers by saving them to stack and telling the fpu
> save code where you've put them.

No. We have functions for this and we are not adding new ad hoc magic.

> OTOH, for x86, if the code always runs in process context (eg from a
> system call) then, since the ABI defines the AVX(2) registers as
> caller-saved, it is only necessary to ensure that the current
> FPU registers belong to the current process once.
> The registers can be set to zero by an 'invalidate' instruction on
> system call entry (hope this is done) and after use.

Why would a system call touch the FPU registers? The kernel normally does
not use FPU instructions, and the code which explicitly does has to take
care of save/restore. It would be performance madness to fiddle with the
FPU stuff unconditionally if nothing uses it.

Thanks,

	tglx
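The existing functions tglx is referring to are presumably the
kernel_fpu_begin()/kernel_fpu_end() pair; a kernel-context sketch (not
standalone code - the function name is made up, and it is shown only to
illustrate the usual bracketing) of AVX-using code under that API:

```c
#include <asm/fpu/api.h>

/* Kernel-context sketch, illustrative only.
 * kernel_fpu_begin() saves the user FPU state and disables preemption;
 * kernel_fpu_end() undoes both. Vector instructions are only legal
 * in between the two calls. */
static void avx_copy_chunk(void *dst, const void *src, size_t len)
{
	kernel_fpu_begin();
	/* ... VMOVDQU-based copy loop over dst/src ... */
	kernel_fpu_end();
}
```

This per-call bracketing is exactly the cost being debated in the thread:
it is cheap enough when amortized over a large block, and too heavy for a
single 256-bit access.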
RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Rahul Lakkireddy
> Sent: 19 March 2018 14:21
>
> This series of patches add support for 256-bit IO read and write.
> The APIs are readqq and writeqq (quad quadword - 4 x 64), that read
> and write 256-bits at a time from IO, respectively.

Why not use the AVX2 registers to get 512-bit accesses?

> Patch 1 adds u256 type and adds necessary non-atomic accessors. Also
> adds byteorder conversion APIs.
>
> Patch 2 adds 256-bit read and write to x86 via VMOVDQU AVX CPU
> instructions.
>
> Patch 3 updates cxgb4 driver to use the readqq API to speed up
> reading on-chip memory 256-bits at a time.

Calling kernel_fpu_begin() is likely to be slow. I doubt you want to do it
every time around a loop of accesses.

In principle it ought to be possible to get access to one or two (eg) AVX
registers by saving them to stack and telling the fpu save code where
you've put them. Then the IPI fp save code could copy the saved values over
the current values if asked to save the fp state for a process. This should
be reasonably cheap - especially if there isn't an fp save IPI.

OTOH, for x86, if the code always runs in process context (eg from a system
call) then, since the ABI defines the AVX(2) registers as caller-saved, it
is only necessary to ensure that the current FPU registers belong to the
current process once.
The registers can be set to zero by an 'invalidate' instruction on system
call entry (hope this is done) and after use.

	David
[RFC PATCH 0/3] kernel: add support for 256-bit IO access
This series of patches add support for 256-bit IO read and write. The APIs
are readqq and writeqq (quad quadword - 4 x 64), that read and write
256-bits at a time from IO, respectively.

Patch 1 adds u256 type and adds necessary non-atomic accessors. Also adds
byteorder conversion APIs.

Patch 2 adds 256-bit read and write to x86 via VMOVDQU AVX CPU
instructions.

Patch 3 updates cxgb4 driver to use the readqq API to speed up reading
on-chip memory 256-bits at a time.

Feedback and suggestions will be much appreciated.

Thanks,
Rahul

Rahul Lakkireddy (3):
  include/linux: add 256-bit IO accessors
  x86/io: implement 256-bit IO read and write
  cxgb4: read on-chip memory 256-bits at a time

 arch/x86/include/asm/io.h                      | 57 -
 drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c | 16 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h     |  6 +++
 include/linux/byteorder/generic.h              | 48 +
 include/linux/io-64-nonatomic-hi-lo.h          | 59 ++
 include/linux/io-64-nonatomic-lo-hi.h          | 59 ++
 include/linux/types.h                          |  7 +++
 7 files changed, 243 insertions(+), 9 deletions(-)

--
2.14.1