Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap

2024-10-09 Thread Andrew Cooper
On 09/10/2024 4:51 am, Manwaring, Derek wrote:
> On 2024-10-08 at 19:56+ Sean Christopherson wrote:
>> Another (slightly crazy) approach would be to use protection keys to provide
>> the security properties that you want, while giving KVM (and userspace) a
>> quick-and-easy override to access guest memory.
>>
>>   1. mmap() guest_memfd into userspace with RW protections
>>   2. Configure PKRU to make guest_memfd memory inaccessible by default
>>   3. Swizzle PKRU on-demand when intentionally accessing guest memory
>>
>> It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest
>> memory instead of to userspace memory.
>>
>> The benefit of the PKRU approach is that there are no PTE modifications, and
>> thus no TLB flushes, and only the CPU that is accessing guest memory gains
>> temporary access.  The big downside is that it would be limited to modern
>> hardware, but that might be acceptable, especially if it simplifies KVM's
>> implementation.
> Yeah this might be worth it if it simplifies significantly. Jenkins et
> al. showed MPK worked for stopping in-process Spectre V1 [1]. While
> future hardware bugs are always possible, the host kernel would still
> offer better protection overall since discovery of additional Spectre
> approaches and gadgets in the kernel is more likely (I think it's a
> bigger surface area than hardware-specific MPK transient execution
> issues).
>
> Patrick, we talked about this a couple weeks ago and ended up focusing
> on within-userspace protection, but I see keys can also be used to stop
> kernel access like Andrew's project he mentioned during Dave's MPK
> session at LPC [2]. Andrew, could you share that here?

This was in reference to PKS specifically (so Sapphire Rapids and
later), and also for Xen but the technique is general.

Allocate one supervisor key for the directmap (and other ranges wanting
protecting), and configure MSR_PKS[key]=AD by default.

Protection Keys were identified as being safe as a defence against
Meltdown.  At the time, only PKRU existed, and PKS would have been less
overhead than KPTI on Skylake, which was all the more frustrating for
those of us who'd begged for a supervisor form back then.  What's done is
done.


The changes needed in the main code would be accessors for directmap
pointers, because there needs to be a temporary AD-disable.  This would
take the form of 2x WRMSR, as opposed to a STAC/CLAC pair.
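
Roughly, such an accessor might look like the following (sketch only; the
PKRS_* masks are invented for illustration, MSR_IA32_PKRS is assumed to be
the msr-index.h name for IA32_PKRS, and preemption/IRQ concerns are
ignored):

    /*
     * Sketch: temporarily clear Access-Disable for the directmap's key
     * around an access, then restore the default-deny value.
     */
    static inline u64 directmap_read_u64(const u64 *ptr)
    {
        u64 val;

        wrmsrl(MSR_IA32_PKRS, PKRS_ALLOW_DIRECTMAP);    /* AD=0 for the key */
        val = READ_ONCE(*ptr);
        wrmsrl(MSR_IA32_PKRS, PKRS_DENY_ALL);           /* back to AD=1 default */

        return val;
    }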

An area of concern is the overhead of the WRMSRs.  MSR_PKS is defined as
not-architecturally-serialising, but like STAC/CLAC probably comes with
model-dependent dispatch-serialising properties to prevent memory
accesses executing speculatively under the wrong protection key.

Also, for this strategy to be effective, you need to PKEY-tag all
aliases of the memory.

~Andrew



[tip: x86/core] x86/cpu: Comment Skylake server stepping too

2021-04-10 Thread tip-bot2 for Andrew Cooper
The following commit has been merged into the x86/core branch of tip:

Commit-ID: 99cb64de36d5c9397a664808b92943e35bdce25e
Gitweb:
https://git.kernel.org/tip/99cb64de36d5c9397a664808b92943e35bdce25e
Author:        Andrew Cooper 
AuthorDate:    Fri, 09 Apr 2021 13:10:27 +01:00
Committer: Borislav Petkov 
CommitterDate: Sat, 10 Apr 2021 11:14:33 +02:00

x86/cpu: Comment Skylake server stepping too

Further to

  53375a5a218e ("x86/cpu: Resort and comment Intel models"),

CascadeLake and CooperLake are steppings of Skylake, and make up the 1st
to 3rd generation "Xeon Scalable Processor" line.

Signed-off-by: Andrew Cooper 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20210409121027.16437-1-andrew.coop...@citrix.com
---
 arch/x86/include/asm/intel-family.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h
index b15262f..955b06d 100644
--- a/arch/x86/include/asm/intel-family.h
+++ b/arch/x86/include/asm/intel-family.h
@@ -33,7 +33,7 @@
  * _EX - 4+ socket server parts
  *
  * The #define line may optionally include a comment including platform or core
- * names. An exception is made for kabylake where steppings seem to have gotten
+ * names. An exception is made for skylake/kabylake where steppings seem to have gotten
  * their own names :-(
  */
 
@@ -74,6 +74,8 @@
 #define INTEL_FAM6_SKYLAKE_L   0x4E/* Sky Lake */
 #define INTEL_FAM6_SKYLAKE 0x5E/* Sky Lake */
 #define INTEL_FAM6_SKYLAKE_X   0x55/* Sky Lake */
+/* CASCADELAKE_X   0x55   Sky Lake -- s: 7  */
+/* COOPERLAKE_X    0x55   Sky Lake -- s: 11 */
 
 #define INTEL_FAM6_KABYLAKE_L  0x8E/* Sky Lake */
 /* AMBERLAKE_L 0x8E   Sky Lake -- s: 9 */


[PATCH] x86/cpu: Comment Skylake server stepping too

2021-04-09 Thread Andrew Cooper
Further to commit 53375a5a218e ("x86/cpu: Resort and comment Intel
models"), CascadeLake and CooperLake are steppings of Skylake, and make up
the 1st to 3rd generation "Xeon Scalable Processor" line.

Signed-off-by: Andrew Cooper 
---
CC: Peter Zijlstra 
CC: Tony Luck 
CC: x...@kernel.org
CC: linux-kernel@vger.kernel.org

It is left as an exercise to the reader to ponder why the 3rd generation Xeon
Scalable brand is formed of both CooperLake and IceLake parts.
---
 arch/x86/include/asm/intel-family.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h
index b15262f1f645..955b06d6325a 100644
--- a/arch/x86/include/asm/intel-family.h
+++ b/arch/x86/include/asm/intel-family.h
@@ -33,7 +33,7 @@
  * _EX - 4+ socket server parts
  *
  * The #define line may optionally include a comment including platform or core
- * names. An exception is made for kabylake where steppings seem to have gotten
+ * names. An exception is made for skylake/kabylake where steppings seem to have gotten
  * their own names :-(
  */
 
@@ -74,6 +74,8 @@
 #define INTEL_FAM6_SKYLAKE_L   0x4E/* Sky Lake */
 #define INTEL_FAM6_SKYLAKE 0x5E/* Sky Lake */
 #define INTEL_FAM6_SKYLAKE_X   0x55/* Sky Lake */
+/* CASCADELAKE_X   0x55   Sky Lake -- s: 7  */
+/* COOPERLAKE_X    0x55   Sky Lake -- s: 11 */
 
 #define INTEL_FAM6_KABYLAKE_L  0x8E/* Sky Lake */
 /* AMBERLAKE_L 0x8E   Sky Lake -- s: 9 */
-- 
2.11.0



Re: Why does glibc use AVX-512?

2021-03-26 Thread Andrew Cooper
On 26/03/2021 19:47, Andy Lutomirski wrote:
> On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer  wrote:
>> * Andy Lutomirski:
>>
> AVX-512 cleared, and programs need to explicitly request enablement.
> This would allow programs to opt into not saving/restoring across
> signals or to save/restore in buffers supplied when the feature is
> enabled.
 Isn't XSAVEOPT already able to handle that?

>>> Yes, but we need a place to put the data, and we need to acknowledge
>>> that, with the current save-everything-on-signal model, the amount of
>>> time and memory used is essentially unbounded.  This isn't great.
>> The size has to have a known upper bound, but the save amount can be
>> dynamic, right?
>>
>> How was the old lazy FPU initialization support for i386 implemented?
>>
 Assuming you can make XSAVEOPT work for you on the kernel side, my
 instincts tell me that we should have markup for RTM, not for AVX-512.
 This way, we could avoid use of the AVX-512 registers and keep using
 VZEROUPPER, without run-time transaction checks, and deal with other
 idiosyncrasies needed for transaction support that users might
 encounter once this feature sees more use.  But the VZEROUPPER vs RTM
 issues is currently stuck in some internal process issue on my end (or
 two, come to think of it), which I hope to untangle next month.
>>> Can you elaborate on the issue?
>> This is the bug:
>>
>>   vzeroupper use in AVX2 multiarch string functions cause HTM aborts
>>   
>>
>> Unfortunately we have a bug (outside of glibc) that makes me wonder if
>> we can actually roll out RTM transaction checks (or any RTM
>> instruction) on a large scale:
>>
>>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>>   
> It's worth noting that recent microcode updates have made RTM
> considerably less likely to actually work on many parts.  It's
> possible you should just disable it. :(

For a variety of errata and speculative security reasons, hypervisors
now have the ability to hide/show the HLE/RTM CPUID bits, independently
of letting TSX actually work or not.

For migration compatibility reasons, you might quite possibly find
yourself in a VM which advertises the HLE/RTM bits but will
unconditionally abort any transaction.

Honestly, if I were you, I'd just leave it to the user to explicitly opt
in if they want transactions.

~Andrew



Re: [RFC PATCH] x86/retpolines: Prevent speculation after RET

2021-02-19 Thread Andrew Cooper
On 19/02/2021 08:15, Peter Zijlstra wrote:
> On Thu, Feb 18, 2021 at 08:11:38PM +0100, Borislav Petkov wrote:
>> On Thu, Feb 18, 2021 at 08:02:31PM +0100, Peter Zijlstra wrote:
>>> On Thu, Feb 18, 2021 at 07:46:39PM +0100, Borislav Petkov wrote:
 Both vendors speculate after a near RET in some way:

 Intel:

 "Unlike near indirect CALL and near indirect JMP, the processor will not
 speculatively execute the next sequential instruction after a near RET
 unless that instruction is also the target of a jump or is a target in a
 branch predictor."
>>> Right, the way I read that means it's not a problem for us here.
>> Look at that other thread: the instruction *after* the RET can be
>> speculatively executed if that instruction is the target of a jump or it
>> is in a branch predictor.
> Right, but that has nothing to do with the RET instruction itself. You
> can speculatively execute any random instruction by training the BTB,
> which is I suppose the entire point of things :-)
>
> So the way I read it is that: RET does not 'leak' speculation, but if
> you target the instruction after RET with any other speculation crud,
> ofcourse you can get it to 'run'.
>
> And until further clarified, I'll stick with that :-)

https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf
Final page, Mitigation G-5

Some parts (pre-Milan, I believe, is what that CPUID rule translates to) may
speculatively execute the instructions sequentially following an indirect
call/jmp or a ret instruction.

For Intel, it's just call/jmp instructions.  From SDM Vol2 for CALL (and
similar for JMP)

"Certain situations may lead to the next sequential instruction after a
near indirect CALL being speculatively executed. If software needs to
prevent this (e.g., in order to prevent a speculative execution side
channel), then an LFENCE instruction opcode can be placed after the near
indirect CALL in order to block speculative execution."


In both cases, the reason LFENCE is given is for the CALL case, where
there is sequential architectural execution.  JMP and RET do not have
architectural execution following them, so can use a shorter speculation
blocker.

When compiling with retpoline, all CALL/JMP indirects are removed, other
than within the __x86_indirect_thunk_%reg blocks, and those can be fixed
by hand.  That just leaves RET speculation, which has no following
architectural execution, at which point `ret; int3` is the shortest way
of halting speculation, at half the size of `ret; lfence`.

With a gcc toolchain, it does actually work if you macro 'ret' (and
retl/q) to be .byte 0xc3, 0xcc, but this doesn't work for Clang IAS
which refuses to macro real instructions.
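
i.e. feeding the assembler something along these lines (a sketch of the GAS
macro bodge described above, not something Clang IAS will accept):

    .macro ret
    .byte 0xc3, 0xcc    /* ret; int3 - halt straight-line speculation */
    .endm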

What would be massively helpful is if the toolchains could have their
existing ARM straight-line-speculation support hooked up appropriately so
we get some new codegen options on x86, and don't have to resort to the
macro bodges above.

~Andrew


Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 08/02/2021 10:36, Jan Beulich wrote:
> On 08.02.2021 11:23, Andrew Cooper wrote:
>> On 08/02/2021 09:50, Jan Beulich wrote:
>>> On 08.02.2021 10:44, Andrew Cooper wrote:
>>>> On 06/02/2021 10:49, Juergen Gross wrote:
>>>>> The ring buffer for user events is used in the local system only, so
>>>>> smp barriers are fine for ensuring consistency.
>>>>>
>>>>> Reported-by: Andrew Cooper 
>>>>> Signed-off-by: Juergen Gross 
>>>> These need to be virt_* to not break in UP builds (on non-x86).
>>> Initially I thought so, too, but isn't the sole vCPU of such a
>>> VM getting re-scheduled to a different pCPU in the hypervisor
>>> an implied barrier anyway?
>> Yes, but that isn't relevant to why UP builds break.
>>
>> smp_*() degrade to compiler barriers in UP builds, and while that's
>> mostly fine for x86 read/write, it's not fine for ARM barriers.
> Hmm, I may not know enough of Arm's memory model - are you saying
> Arm CPUs aren't even self-coherent, i.e. later operations (e.g.
> the consuming of ring contents) won't observe earlier ones (the
> updating of ring contents) when only a single physical CPU is
> involved in all of this? (I did mention the hypervisor level
> context switch simply because that's the only way multiple CPUs
> can get involved.)

In this case, no - see my later reply.  I'd mistaken the two ends of
this ring.  As they're both inside the same guest, it's fine.

For cases such as the xenstore/console ring, the semantics required
really are SMP, even if the guest is built UP.  These cases really will
break when smp_rmb() etc degrade to just a compiler barrier on
architectures with weaker semantics than x86.
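
As a rough illustration (the ring/request types below are invented, not the
real xenstore interface), the producer side of such a cross-domain ring
wants something like:

    static void ring_produce(struct ring *r, const struct req *req)
    {
        r->slot[r->prod & RING_MASK] = *req;    /* fill the slot first...    */
        virt_wmb();                             /* ...then publish the index */
        r->prod++;
    }

virt_wmb() stays a real barrier even in a UP build, which is exactly the
property smp_wmb() loses.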

~Andrew


Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 08/02/2021 10:25, Jürgen Groß wrote:
> On 08.02.21 11:23, Andrew Cooper wrote:
>> On 08/02/2021 09:50, Jan Beulich wrote:
>>> On 08.02.2021 10:44, Andrew Cooper wrote:
>>>> On 06/02/2021 10:49, Juergen Gross wrote:
>>>>> The ring buffer for user events is used in the local system only, so
>>>>> smp barriers are fine for ensuring consistency.
>>>>>
>>>>> Reported-by: Andrew Cooper 
>>>>> Signed-off-by: Juergen Gross 
>>>> These need to be virt_* to not break in UP builds (on non-x86).
>>> Initially I thought so, too, but isn't the sole vCPU of such a
>>> VM getting re-scheduled to a different pCPU in the hypervisor
>>> an implied barrier anyway?
>>
>> Yes, but that isn't relevant to why UP builds break.
>>
>> smp_*() degrade to compiler barriers in UP builds, and while that's
>> mostly fine for x86 read/write, it's not fine for ARM barriers.
>>
>> virt_*() exist specifically to be smp_*() which don't degrade to broken
>> in UP builds.
>
> But the barrier is really only necessary to serialize accesses within
> the guest against each other. There is no guest outside party involved.
>
> In case you are right this would mean that UP guests are all broken on
> Arm.

Oh - right.  This is a ring between the interrupt handler and a task. 
Not a ring between the guest and something else.

In which case smp_*() are correct.  Sorry for the noise.

~Andrew


Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 08/02/2021 09:50, Jan Beulich wrote:
> On 08.02.2021 10:44, Andrew Cooper wrote:
>> On 06/02/2021 10:49, Juergen Gross wrote:
>>> The ring buffer for user events is used in the local system only, so
>>> smp barriers are fine for ensuring consistency.
>>>
>>> Reported-by: Andrew Cooper 
>>> Signed-off-by: Juergen Gross 
>> These need to be virt_* to not break in UP builds (on non-x86).
> Initially I thought so, too, but isn't the sole vCPU of such a
> VM getting re-scheduled to a different pCPU in the hypervisor
> an implied barrier anyway?

Yes, but that isn't relevant to why UP builds break.

smp_*() degrade to compiler barriers in UP builds, and while that's
mostly fine for x86 read/write, it's not fine for ARM barriers.

virt_*() exist specifically to be smp_*() which don't degrade to broken
in UP builds.

~Andrew


Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 06/02/2021 10:49, Juergen Gross wrote:
> The ring buffer for user events is used in the local system only, so
> smp barriers are fine for ensuring consistency.
>
> Reported-by: Andrew Cooper 
> Signed-off-by: Juergen Gross 

These need to be virt_* to not break in UP builds (on non-x86).

~Andrew


Re: [RFC][PATCH 2/2] x86: add extra serialization for non-serializing MSRs

2021-02-05 Thread Andrew Cooper
On 05/02/2021 10:02, Peter Zijlstra wrote:
> On Thu, Feb 04, 2021 at 04:11:12PM -0800, Andy Lutomirski wrote:
>> I'm wondering if a more mild violation is possible:
>>
>> Initialize *addr = 0.
>>
>> mov $1, (addr)
>> wrmsr
>>
>> remote cpu's IDT vector:
>>
>> mov (addr), %rax
>> %rax == 0!
>>
>> There's no speculative-execution-becoming-visible-even-if-it-doesn't-retire
>> here -- there's just an ordering violation.  For Linux, this would
>> presumably only manifest as a potential deadlock or confusion if the
>> IPI vector code looks at the list of pending work and doesn't find the
>> expected work in it.
>>
>> Dave?  hpa?  What is the SDM trying to tell us?
> [ Big caveat, I've not spoken to any hardware people about this. The
> below is purely my own understanding. ]
>
> This is my interpretation as well. Without the MFENCE+LFENCE there is no
> guarantee the store is out of the store-buffer and the remote load isn't
> guaranteed to observe it.
>
> What I think the SDM is trying to tell us, is that the IPI, even if it
> goes on the same regular coherency fabric as memory transfers, is not
> subject to the regular memory ordering rules.
>
> Normal TSO rules tells us that when:
>
> P1() {
>   x = 1;
>   y = 1;
> }
>
> P2() {
>   r1 = y;
>   r2 = x;
> }
>
> r2 must not be 0 when r1 is 1. Because if we see store to y, we must
> also see store to x. But the IPI thing doesn't behave like a store. The
> (fast) wrmsr isn't even considered a memop.
>
> The thing is, the above ordering does not guarantee we have r2 != 0.
> r2==0 is allowed when r1==0. And that's an entirely sane outcome even if
> we run the instructions like:
>
>   CPU1CPU2
>
> cycle-1   mov $1, ([x])
> cycle-2   mov $1, ([y])
> cycle-3   mov ([y]), rax
> cycle-4   mov ([x]), rbx
>
> There is no guarantee _any_ of the stores will have made it out. And
> that's exactly the issue. The IPI might make it out of the core before
> any of the stores will.
>
> Furthermore, since there is no dependency between:
>
>   mov $1, ([x])
>   wrmsr
>
> The CPU is allowed to reorder the execution and retire the wrmsr before
> the store. Very much like it would for normal non-dependent
> instructions.

Execution, sure (for details which don't escape the core, just like any
other speculative execution).  Retirement, surely not - it is inherently
tied to the program order of things.

Causality would also be broken if the WRMSR retired ahead of the MOV,
and an interrupt were to hit the boundary between them.

> And presumably it is still allowed to do that when we write it like:
>
>   mov $1, ([x])
>   mfence
>   wrmsr
>
> because, mfence only has dependencies to memops and (fast) wrmsr is not
> a memop.
>
> Which then brings us to:
>
>   mov $1, ([x])
>   mfence
>   lfence
>   wrmsr
>
> In this case, the lfence acts like the newly minted ifence (see
> spectre), and will block execution of (any) later instructions until
> completion of all prior instructions. This, and only this ensures the
> wrmsr happens after the mfence, which in turn ensures the store to x is
> globally visible.

I understand that "what the architecture guarantees" differs from "how
parts behave in practice".

And I also understand the reasoning behind declaring that MFENCE;LFENCE is
the only architecturally guaranteed way of ensuring all stores are
globally visible before the WRMSR starts.
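
(Concretely, that's the sequence the SDM asks for and which the patch puts
in front of the WRMSR - a sketch, minus the kernel plumbing:

    static inline void wrmsr_fence_sketch(void)
    {
        /* Drain the store buffer, then stall dispatch until it has drained. */
        asm volatile("mfence; lfence" : : : "memory");
    }

followed immediately by the WRMSR to the ICR.)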


However, what is missing is an explanation of how it is possible to build
a causality-preserving part where those fences (for plain stores) are
necessary in practice.

That sequence is a large set of pipeline stalls in practice, for what
appears to be a problem in theory only.

~Andrew


Re: [RFC][PATCH 2/2] x86: add extra serialization for non-serializing MSRs

2021-02-04 Thread Andrew Cooper
On 05/03/2020 17:47, Dave Hansen wrote:
> Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a
> plain "mfence" while the Intel SDM (10.12.3 MSR Access in x2APIC
> Mode) calls for "mfence;lfence".
>
> Short summary: we have special MSRs that have weaker ordering
> than all the rest.  Add fencing consistent with current SDM
> recommendations.
>
> This is not known to cause any issues in practice, only in
> theory.

So, I accept that Intel have their own reasons for what is written in
the SDM, but "not ordered with stores" is at best misleading.

The x2APIC (and other) MSRs aren't serialising.  That's fine, as is the
fact that the WRMSR to trigger them doesn't have memory operands, and is
therefore not explicitly ordered with other loads and stores.

Consider:
    xor %edi, %edi
    movb (%rdi), %dl
    wrmsr

It is fine for a non-serialising wrmsr here to execute speculatively in
terms of internal calculations, but nothing it does can escape the local
core until the movb has fully retired, and is therefore globally visible.

Otherwise, I can send IPIs from non-architectural paths (in this case,
behind a page fault), and causality is broken.

IPIs are (at minimum) a write-like-thing leaving the core, even if they
don't interact with the regular memory path, and their effects cannot
become visible until the effects of older instructions are visible.

What the SDM is trying to say is that this potentially matters for
writes queued in the WC buffers.

If some code is using WC memory, and wants to send an IPI, and wants to
have the remote IPI handler read said data, then yes - there is a
problem - but the problem is the lack of SFENCE required to make the WC
buffer visible in the first place.

WC code is already responsible for its own memory ordering, and the
x2APIC IPIs can't execute early even in the absence of architectural
ordering guarantees.

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-02-01 Thread Andrew Cooper
On 01/02/2021 05:58, Nadav Amit wrote:
>> On Jan 31, 2021, at 4:10 AM, Andrew Cooper  wrote:
>>
>> On 31/01/2021 01:07, Andy Lutomirski wrote:
>>> Adding Andrew Cooper, who has a distressingly extensive understanding
>>> of the x86 PTE magic.
>> Pretty sure it is all learning things the hard way...
>>
>>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 632d5a677d3f..b7473d2c9a1f 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct 
>>>> mmu_gather *tlb,
>>>>ptent = pte_mkwrite(ptent);
>>>>}
>>>>ptep_modify_prot_commit(vma, addr, pte, oldpte, 
>>>> ptent);
>>>> -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>> +   if (pte_may_need_flush(oldpte, ptent))
>>>> +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> You're choosing to avoid the flush, based on A/D bits read ahead of the
>> actual modification of the PTE.
>>
>> In this example, another thread can write into the range (sets A and D),
>> and get a suitable TLB entry which goes unflushed while the rest of the
>> kernel thinks the memory is write-protected and clean.
>>
>> The only safe way to do this is to use XCHG/etc to modify the PTE, and
>> base flush calculations on the results.  Atomic operations are ordered
>> with A/D updates from pagewalks on other CPUs, even on AMD where A
>> updates are explicitly not ordered with regular memory reads, for
>> performance reasons.
> Thanks Andrew for the feedback, but I think the patch does it exactly in
> this safe manner that you describe (at least on native x86, but I see a
> similar path elsewhere as well):
>
> oldpte = ptep_modify_prot_start()
> -> __ptep_modify_prot_start()
> -> ptep_get_and_clear
> -> native_ptep_get_and_clear()
> -> xchg()
>
> Note that the xchg() will clear the PTE (i.e., making it non-present), and
> no further updates of A/D are possible until ptep_modify_prot_commit() is
> called.
>
> On non-SMP setups this is not atomic (no xchg), but since we hold the lock,
> we should be safe.
>
> I guess you are right and a pte_may_need_flush() deserves a comment to
> clarify that oldpte must be obtained by an atomic operation to ensure no A/D
> bits are lost (as you say).
>
> Yet, I do not see a correctness problem. Am I missing something?

No(ish) - I failed to spot that path.

But native_ptep_get_and_clear() is broken on !SMP builds.  It needs to
be an XCHG even in that case, to spot A/D updates from prefetch or
shared-virtual-memory DMA.
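
i.e. a sketch of the point being made (x86-64 pte layout assumed; this
mirrors what the SMP variant already does):

    static inline pte_t ptep_get_and_clear_sketch(pte_t *ptep)
    {
        /* A single atomic xchg, so a concurrent hardware pagewalk
         * (prefetch, shared-virtual-memory DMA) can't set A/D between
         * the read and the write, even in a !SMP build. */
        return native_make_pte(xchg(&ptep->pte, 0));
    }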

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Andrew Cooper
On 31/01/2021 01:07, Andy Lutomirski wrote:
> Adding Andrew Cooper, who has a distressingly extensive understanding
> of the x86 PTE magic.

Pretty sure it is all learning things the hard way...

> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 632d5a677d3f..b7473d2c9a1f 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather 
>> *tlb,
>> ptent = pte_mkwrite(ptent);
>> }
>> ptep_modify_prot_commit(vma, addr, pte, oldpte, 
>> ptent);
>> -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> +   if (pte_may_need_flush(oldpte, ptent))
>> +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);

You're choosing to avoid the flush, based on A/D bits read ahead of the
actual modification of the PTE.

In this example, another thread can write into the range (sets A and D),
and get a suitable TLB entry which goes unflushed while the rest of the
kernel thinks the memory is write-protected and clean.

The only safe way to do this is to use XCHG/etc to modify the PTE, and
base flush calculations on the results.  Atomic operations are ordered
with A/D updates from pagewalks on other CPUs, even on AMD where A
updates are explicitly not ordered with regular memory reads, for
performance reasons.
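
A sketch of the safe pattern (using the ptep_modify_prot_start()/commit()
pair that change_pte_range() already has, with pte_may_need_flush() being
the helper this RFC adds):

    static void change_one_pte(struct mmu_gather *tlb, struct vm_area_struct *vma,
                               pte_t *pte, unsigned long addr, pgprot_t newprot)
    {
        pte_t oldpte = ptep_modify_prot_start(vma, addr, pte); /* xchg underneath */
        pte_t ptent  = pte_modify(oldpte, newprot);

        ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);

        /* Base the flush decision on the value the xchg actually returned,
         * so concurrent A/D updates can't be lost. */
        if (pte_may_need_flush(oldpte, ptent))
                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
    }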

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Andrew Cooper
On 31/01/2021 02:59, Andy Lutomirski wrote:
 diff --git a/arch/x86/include/asm/tlbflush.h 
 b/arch/x86/include/asm/tlbflush.h
 index 8c87a2e0b660..a617dc0a9b06 100644
 --- a/arch/x86/include/asm/tlbflush.h
 +++ b/arch/x86/include/asm/tlbflush.h
 @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct 
 arch_tlbflush_unmap_batch *batch,

 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

 +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
 +{
 +   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
 +_PAGE_SOFTW3 | _PAGE_ACCESSED;
>>> Why is accessed ignored?  Surely clearing the accessed bit needs a
>>> flush if the old PTE is present.
>> I am just following the current scheme in the kernel (x86):
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>unsigned long address, pte_t *ptep)
>> {
>> /*
>>  * On x86 CPUs, clearing the accessed bit without a TLB flush
>>  * doesn't cause data corruption. [ It could cause incorrect
>>  * page aging and the (mistaken) reclaim of hot pages, but the
>>  * chance of that should be relatively low. ]
>>  *
> If anyone ever implements the optimization of skipping the flush when
> unmapping a !accessed page, then this will cause data corruption.
>
> If nothing else, this deserves a nice comment in the new code.

Architecturally, access bits get set as part of a pagewalk, and before the
TLB entry is made.  Once an access bit has been set, you know for
certain that a mapping for the address is, or was, present in a TLB
somewhere.

I can't help but feel that this "optimisation" is shooting itself in the
foot.  It does cause incorrect page aging, and TLBs are getting bigger,
so the chance of reclaiming hot pages is getting worse and worse.

~Andrew


Re: [PATCH] x86/debug: 'Fix' ptrace dr6 output

2021-01-28 Thread Andrew Cooper
On 28/01/2021 18:20, Peter Zijlstra wrote:
> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -496,9 +496,32 @@ static int hw_breakpoint_handler(struct
>   dr6_p = (unsigned long *)ERR_PTR(args->err);
>   dr6 = *dr6_p;
>  
> - /* If it's a single step, TRAP bits are random */
> - if (dr6 & DR_STEP)
> + /*
> +  * If DR_STEP is set, TRAP bits might also be set, but we must not
> +  * process them since another exception (without DR_STEP) will follow.

:) How lucky are you feeling?

Data breakpoints will in principle merge with DR_STEP (as will all other
#DB's with trap semantics), other than the many errata about breakpoints
not being recognised correctly.

Under VT-x, there is a still-unfixed vmexit microcode bug which loses all
breakpoint information if DR_STEP is set.  %dr6 handling works fine when
#DB isn't intercepted, but then you get to keep the pipeline livelock
vulnerability as a consequence.

Instruction breakpoints on the following instruction will be delivered
as a second #DB, because fault style #DBs are raised at a separate
position in the instruction execution cycle.

~Andrew


Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Andrew Cooper
On 27/01/2021 13:59, Jürgen Groß wrote:
> On 27.01.21 12:23, Andrew Cooper wrote:
>> On 27/01/2021 10:26, Jürgen Groß wrote:
>>> On 27.01.21 10:43, Andrew Cooper wrote:
>>>> On 27/01/2021 09:38, Juergen Gross wrote:
>>>>> diff --git a/arch/x86/xen/enlighten_pv.c
>>>>> b/arch/x86/xen/enlighten_pv.c
>>>>> index 4409306364dc..ca5ac10fcbf7 100644
>>>>> --- a/arch/x86/xen/enlighten_pv.c
>>>>> +++ b/arch/x86/xen/enlighten_pv.c
>>>>> @@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>>>>>    exc_debug(regs);
>>>>>    }
>>>>>    +DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
>>>>> +{
>>>>> +    /* This should never happen and there is no way to handle it. */
>>>>> +    panic("Unknown trap in Xen PV mode.");
>>>>
>>>> Looks much better.  How about including regs->entry_vector here,
>>>> just to
>>>> short circuit the inevitable swearing which will accompany
>>>> encountering
>>>> this panic() ?
>>>
>>> You are aware the regs parameter is struct pt_regs *, not the Xen
>>> struct cpu_user_regs *?
>>
>> Yes, but I was assuming that they both contained the same information.
>>
>>>
>>> So I have no idea how I should get this information without creating
>>> a per-vector handler.
>>
>> Oh - that's dull.
>>
>> Fine then.  Reviewed-by: Andrew Cooper 
>>
>
> I think I'll switch the panic() to printk(); BUG(); in order to have
> more diagnostic data. Can I keep your R-b: with that modification?

Yeah.  Sounds good.

~Andrew


Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Andrew Cooper
On 27/01/2021 10:26, Jürgen Groß wrote:
> On 27.01.21 10:43, Andrew Cooper wrote:
>> On 27/01/2021 09:38, Juergen Gross wrote:
>>> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
>>> index 4409306364dc..ca5ac10fcbf7 100644
>>> --- a/arch/x86/xen/enlighten_pv.c
>>> +++ b/arch/x86/xen/enlighten_pv.c
>>> @@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>>>   exc_debug(regs);
>>>   }
>>>   +DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
>>> +{
>>> +    /* This should never happen and there is no way to handle it. */
>>> +    panic("Unknown trap in Xen PV mode.");
>>
>> Looks much better.  How about including regs->entry_vector here, just to
>> short circuit the inevitable swearing which will accompany encountering
>> this panic() ?
>
> You are aware the regs parameter is struct pt_regs *, not the Xen
> struct cpu_user_regs *?

Yes, but I was assuming that they both contained the same information.

>
> So I have no idea how I should get this information without creating
> a per-vector handler.

Oh - that's dull.

Fine then.  Reviewed-by: Andrew Cooper 


Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Andrew Cooper
On 27/01/2021 09:38, Juergen Gross wrote:
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 4409306364dc..ca5ac10fcbf7 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>   exc_debug(regs);
>  }
>  
> +DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
> +{
> + /* This should never happen and there is no way to handle it. */
> + panic("Unknown trap in Xen PV mode.");

Looks much better.  How about including regs->entry_vector here, just to
short circuit the inevitable swearing which will accompany encountering
this panic() ?

~Andrew


Re: [PATCH] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-25 Thread Andrew Cooper
On 25/01/2021 14:00, Juergen Gross wrote:
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 4409306364dc..82948251f57b 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -583,6 +583,14 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>   exc_debug(regs);
>  }
>  
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +DEFINE_IDTENTRY_RAW(xenpv_exc_vmm_communication)
> +{
> + /* This should never happen and there is no way to handle it. */
> + panic("X86_TRAP_VC in Xen PV mode.");

Honestly, exactly the same is true of #VE, #HV and #SX.

What we do in the hypervisor is wire up one handler for all unknown
exceptions (to avoid potential future #DF issues) leading to a panic. 
Wouldn't it be better to do this unconditionally, especially as #GP/#NP
doesn't work for PV guests for unregistered callbacks, rather than
fixing up piecewise like this?

~Andrew


Re: [PATCH 17/21] x86/acpi: Convert indirect jump to retpoline

2021-01-14 Thread Andrew Cooper
On 14/01/2021 23:47, Josh Poimboeuf wrote:
> On Thu, Jan 14, 2021 at 10:59:39PM +0000, Andrew Cooper wrote:
>> On 14/01/2021 19:40, Josh Poimboeuf wrote:
>>> It's kernel policy to not have (unannotated) indirect jumps because of
>>> Spectre v2.  This one's probably harmless, but better safe than sorry.
>>> Convert it to a retpoline.
>>>
>>> Cc: "Rafael J. Wysocki" 
>>> Cc: Len Brown 
>>> Cc: Pavel Machek 
>>> Signed-off-by: Josh Poimboeuf 
>>> ---
>>>  arch/x86/kernel/acpi/wakeup_64.S | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
>>> index 5d3a0b8fd379..0b371580e620 100644
>>> --- a/arch/x86/kernel/acpi/wakeup_64.S
>>> +++ b/arch/x86/kernel/acpi/wakeup_64.S
>>> @@ -7,6 +7,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  # Copyright 2003 Pavel Machek >>  
>>> @@ -39,7 +40,7 @@ SYM_FUNC_START(wakeup_long64)
>>> movqsaved_rbp, %rbp
>>>  
>>> movqsaved_rip, %rax
>>> -   jmp *%rax
>>> +   JMP_NOSPEC rax
>>>  SYM_FUNC_END(wakeup_long64)
>> I suspect this won't work as you intend.
>>
>> wakeup_long64() still executes on the low mappings, not the high
>> mappings, so the `jmp __x86_indirect_thunk_rax` under this JMP_NOSPEC
>> will wander off into the weeds.
>>
>> This is why none of the startup "jmps from weird contexts onto the high
>> mappings" have been retpolined-up.
> D'oh.  Of course it wouldn't be that easy.  I suppose the other two
> retpoline patches (15 and 21) are bogus as well.

If by 21 you mean 19, then most likely, yes.  They're all low=>high
jumps in weird contexts.

~Andrew


Re: [PATCH 17/21] x86/acpi: Convert indirect jump to retpoline

2021-01-14 Thread Andrew Cooper
On 14/01/2021 19:40, Josh Poimboeuf wrote:
> It's kernel policy to not have (unannotated) indirect jumps because of
> Spectre v2.  This one's probably harmless, but better safe than sorry.
> Convert it to a retpoline.
>
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Pavel Machek 
> Signed-off-by: Josh Poimboeuf 
> ---
>  arch/x86/kernel/acpi/wakeup_64.S | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
> index 5d3a0b8fd379..0b371580e620 100644
> --- a/arch/x86/kernel/acpi/wakeup_64.S
> +++ b/arch/x86/kernel/acpi/wakeup_64.S
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  # Copyright 2003 Pavel Machek   
> @@ -39,7 +40,7 @@ SYM_FUNC_START(wakeup_long64)
>   movqsaved_rbp, %rbp
>  
>   movqsaved_rip, %rax
> - jmp *%rax
> + JMP_NOSPEC rax
>  SYM_FUNC_END(wakeup_long64)

I suspect this won't work as you intend.

wakeup_long64() still executes on the low mappings, not the high
mappings, so the `jmp __x86_indirect_thunk_rax` under this JMP_NOSPEC
will wander off into the weeds.

This is why none of the startup "jmps from weird contexts onto the high
mappings" have been retpolined-up.

~Andrew


Re: [PATCH v2] xen/privcmd: allow fetching resource sizes

2021-01-12 Thread Andrew Cooper
On 12/01/2021 12:17, Jürgen Groß wrote:
> On 12.01.21 12:53, Roger Pau Monne wrote:
>> Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
>> addr = 0 in order to fetch the size of a specific resource.
>>
>> Add a shortcut to the default map resource path, since fetching the
>> size requires no address to be passed in, and thus no VMA to setup.
>>
>> This is missing from the initial implementation, and causes issues
>> when mapping resources that don't have fixed or known sizes.
>>
>> Signed-off-by: Roger Pau Monné 
>> Cc: sta...@vger.kernel.org # >= 4.18
>
> Reviewed-by: Juergen Gross 

Tested-by: Andrew Cooper 


Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-11 Thread Andrew Cooper
On 11/01/2021 22:09, boris.ostrov...@oracle.com wrote:
> On 1/11/21 10:29 AM, Roger Pau Monne wrote:
>>  
>> +xdata.domid = kdata.dom;
>> +xdata.type = kdata.type;
>> +xdata.id = kdata.id;
>> +
>> +if (!kdata.addr && !kdata.num) {
>
> I think we should not allow only one of them to be zero. If it's only 
> kdata.num then we will end up with pfns array set to ZERO_SIZE_PTR (which is 
> 0x10). We seem to be OK in that we are not derefencing pfns (either in kernel 
> or in hypervisor) if number of frames is zero but IMO we shouldn't be 
> tempting the fate.
>
>
> (And if it's only kdata.addr then we will get a vma but I am not sure it will 
> do what we want.)

Passing addr == 0 without num being 0 is already an error in Xen, and
passing num == 0 without addr being 0 is bogus and will be an error by
the time I'm finished fixing this.

FWIW, the common usecase for non-trivial examples will be:

xenforeignmem_resource_size(domid, type, id, &size);
xenforeignmem_map_resource(domid, type, id, NULL, size, ...);

which translates into:

ioctl(MAP_RESOURCE, NULL, 0) => size
mmap(NULL, size, ...) => ptr
ioctl(MAP_RESOURCE, ptr, size)

from the kernel's point of view, and two hypercalls from Xen's point of
view.  The NULLs above are expected to be the common case for letting
the kernel choose the vma, but ought to be filled in by the time the
second ioctl() occurs.

See
https://lore.kernel.org/xen-devel/20200922182444.12350-1-andrew.coop...@citrix.com/T/#u
for all the gory details.

~Andrew


Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

2020-12-17 Thread Andrew Cooper
On 16/12/2020 08:21, Jürgen Groß wrote:
> On 15.12.20 21:59, Andrew Cooper wrote:
>> On 15/12/2020 11:10, Juergen Gross wrote:
>>> In case a process waits for any Xenstore action in the xenbus driver
>>> it should be interruptible by signals.
>>>
>>> Signed-off-by: Juergen Gross 
>>> ---
>>> V2:
>>> - don't special case SIGKILL as libxenstore is handling -EINTR fine
>>> ---
>>>   drivers/xen/xenbus/xenbus_xs.c | 9 -
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/xenbus/xenbus_xs.c
>>> b/drivers/xen/xenbus/xenbus_xs.c
>>> index 3a06eb699f33..17c8f8a155fd 100644
>>> --- a/drivers/xen/xenbus/xenbus_xs.c
>>> +++ b/drivers/xen/xenbus/xenbus_xs.c
>>> @@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
>>>     static void *read_reply(struct xb_req_data *req)
>>>   {
>>> +    int ret;
>>> +
>>>   do {
>>> -    wait_event(req->wq, test_reply(req));
>>> +    ret = wait_event_interruptible(req->wq, test_reply(req));
>>> +
>>> +    if (ret == -ERESTARTSYS && signal_pending(current)) {
>>> +    req->msg.type = XS_ERROR;
>>> +    return ERR_PTR(-EINTR);
>>> +    }
>>
>> So now I can talk fully about the situations which lead to this, I think
>> there is a bit more complexity.
>>
>> It turns out there are a number of issues related to running a Xen
>> system with no xenstored.
>>
>> 1) If a xenstore-write occurs during startup before init-xenstore-domain
>> runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
>> reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
>> tell the dom0 kernel that xenstored is in dom1.  This effectively
>> deadlocks the system.
>
> This should be easy to solve: any request to /dev/xen/xenbus should
> block upfront in case xenstored isn't up yet (could e.g. wait
> interruptible until xenstored_ready is non-zero).

I'm not sure that that would fix the problem.  The problem is that
setting the ring details via /dev/xen/xenbus_backend blocks, which
prevents us launching the xenstored stubdomain, which prevents the
earlier xenbus write being completed.

So long as /dev/xen/xenbus_backend doesn't block, there's no problem
with other /dev/xen/xenbus activity being pending briefly.


Looking at the current logic, I'm not completely convinced.  Even
finding a filled-in evtchn/gfn doesn't mean that xenstored is actually
ready.

There are 3 possible cases.

1) PV guest, and details in start_info
2) HVM guest, and details in HVM_PARAMs
3) No details (expected for dom0).  Something in userspace must provide
details at a later point.

So the setup phases go from nothing, to having ring details, to finding
the ring working.

I think it would be prudent to try reading a key between having details
and declaring the xenstored_ready.  Any activity, even XS_ERROR,
indicates that the other end of the ring is listening.

>
>> 2) If xenstore-watch is running when xenstored dies, it spins at 100%
>> cpu usage making no system calls at all.  This is caused by bad error
>> handling from xs_watch(), and attempting to debug found:
>
> Can you expand on "bad error handling from xs_watch()", please?

do_watch() has

    for ( ... ) { // defaults to an infinite loop
        vec = xs_read_watch();
        if (vec == NULL)
            continue;
        ...
    }


My next plan was to experiment with break instead of continue, which
I'll get to at some point.

>
>>
>> 3) (this issue).  If anyone starts xenstore-watch with no xenstored
>> running at all, it blocks in D in the kernel.
>
> Should be handled with solution for 1).
>
>>
>> The cause is the special handling for watch/unwatch commands which,
>> instead of just queuing up the data for xenstore, explicitly waits for
>> an OK for registering the watch.  This causes a write() system call to
>> block waiting for a non-existent entity to reply.
>>
>> So while this patch does resolve the major usability issue I found (I
>> can't even SIGINT and get my terminal back), I think there are issues.
>>
>> The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
>> require special handling.  The main kernel thread for processing
>> incoming data from xenstored does need to know how to associate each
>> async XS_WATCH_EVENT to the caller who watched the path.
>>
>> Therefore, depending on when this cancellation hits, we might be in any
>> of the following states:

Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

2020-12-15 Thread Andrew Cooper
On 15/12/2020 11:10, Juergen Gross wrote:
> In case a process waits for any Xenstore action in the xenbus driver
> it should be interruptible by signals.
>
> Signed-off-by: Juergen Gross 
> ---
> V2:
> - don't special case SIGKILL as libxenstore is handling -EINTR fine
> ---
>  drivers/xen/xenbus/xenbus_xs.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index 3a06eb699f33..17c8f8a155fd 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
>  
>  static void *read_reply(struct xb_req_data *req)
>  {
> + int ret;
> +
>   do {
> - wait_event(req->wq, test_reply(req));
> + ret = wait_event_interruptible(req->wq, test_reply(req));
> +
> + if (ret == -ERESTARTSYS && signal_pending(current)) {
> + req->msg.type = XS_ERROR;
> + return ERR_PTR(-EINTR);
> + }

So now I can talk fully about the situations which lead to this, I think
there is a bit more complexity.

It turns out there are a number of issues related to running a Xen
system with no xenstored.

1) If a xenstore-write occurs during startup before init-xenstore-domain
runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
tell the dom0 kernel that xenstored is in dom1.  This effectively
deadlocks the system.

2) If xenstore-watch is running when xenstored dies, it spins at 100%
cpu usage making no system calls at all.  This is caused by bad error
handling from xs_watch(), and attempting to debug found:

3) (this issue).  If anyone starts xenstore-watch with no xenstored
running at all, it blocks in D in the kernel.

The cause is the special handling for watch/unwatch commands which,
instead of just queuing up the data for xenstore, explicitly waits for
an OK for registering the watch.  This causes a write() system call to
block waiting for a non-existent entity to reply.

So while this patch does resolve the major usability issue I found (I
can't even SIGINT and get my terminal back), I think there are issues.

The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
require special handling.  The main kernel thread for processing
incoming data from xenstored does need to know how to associate each
async XS_WATCH_EVENT to the caller who watched the path.

Therefore, depending on when this cancellation hits, we might be in any
of the following states:

1) the watch is queued in the kernel, but not even sent to xenstored yet
2) the watch is queued in the xenstored ring, but not acted upon
3) the watch is queued in the xenstored ring, and the xenstored has seen
it but not replied yet
4) the watch has been processed, but the XS_WATCH reply hasn't been
received yet
5) the watch has been processed, and the XS_WATCH reply received

State 5 (and a little bit) is the normal success path when xenstored has
acted upon the request, and the internal kernel infrastructure is set up
appropriately to handle XS_WATCH_EVENTs.

States 1 and 2 can be very common if there is no xenstored (or at least,
it hasn't started up yet).  In reality, there is either no xenstored, or
it is up and running (and for a period of time during system startup,
these cases occur in sequence).

As soon as the XS_WATCH event has been written into the xenstored ring,
it is not safe to cancel.  You've committed to xenstored processing the
request (if it is up).

If xenstored is actually up and running, it's fine and necessary to
block.  The request will be processed in due course (timing subject to
the client and server load).  If xenstored isn't up, blocking isn't ok.

Therefore, I think we need to distinguish "not yet on the ring" from "on
the ring", as our distinction as to whether cancelling is safe, and
ensure we don't queue anything on the ring before we're sure xenstored
has started up.

Does this make sense?

~Andrew


Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts

2020-12-11 Thread Andrew Cooper
On 11/12/2020 21:27, Thomas Gleixner wrote:
> On Fri, Dec 11 2020 at 09:29, boris ostrovsky wrote:
>
>> On 12/11/20 7:37 AM, Thomas Gleixner wrote:
>>> On Fri, Dec 11 2020 at 13:10, Jürgen Groß wrote:
 On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>> Change the implementation so that the channel is bound to CPU0 at the XEN
>> level and leave the affinity mask alone. At startup of the interrupt
>> affinity will be assigned out of the affinity mask and the XEN binding 
>> will
>> be updated.
> If that's the case then I wonder whether we need this call at all and 
> instead bind at startup time.
 After some discussion with Thomas on IRC and xen-devel archaeology the
 result is: this will be needed especially for systems running on a
 single vcpu (e.g. small guests), as the .irq_set_affinity() callback
 won't be called in this case when starting the irq.
>> On UP are we not then going to end up with an empty affinity mask? Or
>> are we guaranteed to have it set to 1 by interrupt generic code?
> An UP kernel does not ever look on the affinity mask. The
> chip::irq_set_affinity() callback is not invoked so the mask is
> irrelevant.
>
> A SMP kernel on a UP machine sets CPU0 in the mask so all is good.
>
>> This is actually why I brought this up in the first place --- a
>> potential mismatch between the affinity mask and Xen-specific data
>> (e.g. info->cpu and then protocol-specific data in event channel
>> code). Even if they are re-synchronized later, at startup time (for
>> SMP).
> Which is not a problem either. The affinity mask is only relevant for
> setting the affinity, but it's not relevant for delivery and never can
> be.
>
>> I don't see anything that would cause a problem right now but I worry
>> that this inconsistency may come up at some point.
> As long as the affinity mask becomes not part of the event channel magic
> this should never matter.
>
> Look at it from hardware:
>
> interrupt is affine to CPU0
>
>  CPU0 runs:
>  
>  set_affinity(CPU0 -> CPU1)
> local_irq_disable()
> 
>  --> interrupt is raised in hardware and pending on CPU0
>
> irq hardware is reconfigured to be affine to CPU1
>
> local_irq_enable()
>
>  --> interrupt is handled on CPU0
>
> the next interrupt will be raised on CPU1
>
> So info->cpu which is registered via the hypercall binds the 'hardware
> delivery' and whenever the new affinity is written it is rebound to some
> other CPU and the next interrupt is then raised on this other CPU.
>
> It's not any different from the hardware example at least not as far as
> I understood the code.

Xen's event channels do have a couple of quirks.

Binding an event channel always results in one spurious event being
delivered.  This is to cover notifications which can get lost during the
bidirectional setup, or re-setups in certain configurations.

Binding an interdomain or pirq event channel always defaults to vCPU0.
There is no way to atomically set the affinity while binding.  I believe
the API predates SMP guest support in Xen, and no one has fixed it up since.

As a consequence, the guest will observe the event raised on vCPU0 as
part of setting up the event, even if it attempts to set a different
affinity immediately afterwards.  A little bit of care needs to be taken
when binding an event channel on vCPUs other than 0, to ensure that the
callback is safe with respect to any remaining state needing initialisation.

Beyond this, there is nothing magic I'm aware of.

We have seen soft lockups before in certain scenarios, simply due to the
quantity of events hitting vCPU0 before irqbalance gets around to
spreading the load.  This is why there is an attempt to round-robin the
userspace event channel affinities by default, but I still don't see why
this would need custom affinity logic itself.

Thanks,

~Andrew


Re: [PATCH] x86/cpu: correct values for GDT_ENTRY_INIT

2020-11-26 Thread Andrew Cooper
On 26/11/2020 19:15, Andy Lutomirski wrote:
> On Thu, Nov 26, 2020 at 11:07 AM Lukas Bulwahn  
> wrote:
>> On Thu, Nov 26, 2020 at 6:16 PM Andrew Cooper  
>> wrote:
>>> On 26/11/2020 11:54, Lukas Bulwahn wrote:
>>>> Commit 1e5de18278e6 ("x86: Introduce GDT_ENTRY_INIT()") unintentionally
>>>> transformed a few 0xffff values to 0xfffff (note: five times "f" instead of
>>>> four) as part of the refactoring.
>>> The transformation in that change is correct.
>>>
>>> Segment bases are 20 bits wide in x86,

I of course meant segment limits here, rather than bases.

>>> Does:
>>>
>>> diff --git a/arch/x86/include/asm/desc_defs.h b/arch/x86/include/asm/desc_defs.h
>>> index f7e7099af595..9561f3c66e9e 100644
>>> --- a/arch/x86/include/asm/desc_defs.h
>>> +++ b/arch/x86/include/asm/desc_defs.h
>>> @@ -22,7 +22,7 @@ struct desc_struct {
>>>
>>>  #define GDT_ENTRY_INIT(flags, base, limit) \
>>> {   \
>>> -   .limit0 = (u16) (limit),\
>>> +   .limit0 = (u16) (limit) & 0xFFFF,   \
>>> .limit1 = ((limit) >> 16) & 0x0F,   \
>>> .base0  = (u16) (base), \
>>> .base1  = ((base) >> 16) & 0xFF,\
>>>
>>> fix the warning?
>>>
>> Thanks, I will try that out, and try compiling a 32-bit kernel as well.
> You should also try comparing the objdump output before and after your
> patch.  objdump -D will produce bizarre output but should work.

Expanding on this a little, if that does indeed fix the sparse warning,
then I'd make an argument for this being a bug in sparse.  Explicitly
casting to u16 is semantically and intentionally identical to & 0xFFFF.

~Andrew


Re: [PATCH] x86/cpu: correct values for GDT_ENTRY_INIT

2020-11-26 Thread Andrew Cooper
On 26/11/2020 11:54, Lukas Bulwahn wrote:
> Commit 1e5de18278e6 ("x86: Introduce GDT_ENTRY_INIT()") unintentionally
> transformed a few 0xffff values to 0xfffff (note: five times "f" instead of
> four) as part of the refactoring.

The transformation in that change is correct.

Segment bases are 20 bits wide in x86, but the top nibble is folded into
the middle of the attributes, which is why the transformation also has
xfxx => x0xx for the attributes field.
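
Worked example (sketch), using the 32-bit kernel CS entry: the old
open-coded initialiser and the GDT_ENTRY_INIT() form encode the same
20-bit limit of 0xfffff (4G with the granularity bit set):

    /* old (IIRC): .a = 0x0000ffff, .b = 0x00cf9a00
     *                       ^^^^ limit[15:0]   ^ limit[19:16] folded into flags */
    GDT_ENTRY_INIT(0xc09a, 0, 0xfffff)
    /* => .limit0 = 0xffff, .limit1 = 0xf, flags = 0xc09a - nothing truncated */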

>
> A quick check with:
>
>   git show 1e5de18278e6 | grep "fffff"
>
> reveals all those 14 occurrences:
>
> 12 in ./arch/x86/kernel/cpu/common.c, and
> 2  in ./arch/x86/include/asm/lguest.h.
>
> The two occurrences in ./arch/x86/include/asm/lguest.h were deleted with
> commit ecda85e70277 ("x86/lguest: Remove lguest support").
> Correct the remaining twelve occurrences in ./arch/x86/kernel/cpu/common.c
> back to the original values in the source code before the refactoring.
>
> Commit 866b556efa12 ("x86/head/64: Install startup GDT") probably simply
> copied the required startup gdt information from
> ./arch/x86/kernel/cpu/common.c to ./arch/x86/kernel/head64.c.
> So, correct those three occurrences in ./arch/x86/kernel/head64.c as well.
>
> As this value is truncated anyway, the object code has not changed when
> introducing the mistake and is also not changed with this correction now.
>
> This was discovered with sparse, which warns with:
>
>   warning: cast truncates bits from constant value (fffff becomes ffff)

Does:

diff --git a/arch/x86/include/asm/desc_defs.h b/arch/x86/include/asm/desc_defs.h
index f7e7099af595..9561f3c66e9e 100644
--- a/arch/x86/include/asm/desc_defs.h
+++ b/arch/x86/include/asm/desc_defs.h
@@ -22,7 +22,7 @@ struct desc_struct {
 
 #define GDT_ENTRY_INIT(flags, base, limit) \
    {   \
-   .limit0 = (u16) (limit),    \
+   .limit0 = (u16) (limit) & 0xFFFF,   \
    .limit1 = ((limit) >> 16) & 0x0F,   \
    .base0  = (u16) (base), \
    .base1  = ((base) >> 16) & 0xFF,    \

fix the warning?

Changing the limit from 4G to 128M isn't going to be compatible with a
32bit kernel trying to boot :).

~Andrew


Re: WARNING: can't access registers at asm_common_interrupt

2020-11-11 Thread Andrew Cooper
On 11/11/2020 20:15, Josh Poimboeuf wrote:
> On Wed, Nov 11, 2020 at 09:07:30PM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 11, 2020 at 01:59:00PM -0600, Josh Poimboeuf wrote:
>>> On Wed, Nov 11, 2020 at 08:42:06PM +0100, Peter Zijlstra wrote:
> Would objtool have an easier time coping if this were implemented in
> terms of a static call?
 I doubt it, the big problem is that there is no visibility into the
 actual alternative text. Runtime patching fragments into static call
 would have the exact same problem.

 Something that _might_ maybe work is trying to morph the immediate
 fragments into an alternative. That is, instead of this:

 static inline notrace unsigned long arch_local_save_flags(void)
 {
return PVOP_CALLEE0(unsigned long, irq.save_fl);
 }

 Write it something like:

 static inline notrace unsigned long arch_local_save_flags(void)
 {
PVOP_CALL_ARGS;
PVOP_TEST_NULL(irq.save_fl);
asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
"PUSHF; POP _ASM_AX",
X86_FEATURE_NATIVE)
: CLBR_RET_REG, ASM_CALL_CONSTRAINT
: paravirt_type(irq.save_fl.func),
  paravirt_clobber(PVOP_CALLEE_CLOBBERS)
: "memory", "cc");
return __eax;
 }

 And then we have to teach objtool how to deal with conflicting
 alternatives...

 That would remove most (all, if we can figure out a form that deals with
 the spinlock fragments) of paravirt_patch.c

 Hmm?
>>> I was going to suggest something similar.  Though I would try to take it
>>> further and replace paravirt_patch_default() with static calls.
>> Possible, we just need to be _really_ careful to not allow changing
>> those static_call()s. So maybe we need DEFINE_STATIC_CALL_RO() which
>> does a __ro_after_init on the whole thing.
> But what if you want to live migrate to another hypervisor ;-)

The same as what happens currently.  The user gets to keep all the
resulting pieces ;)

>>> Either way it doesn't make objtool's job much easier.  But it would be
>>> nice to consolidate runtime patching mechanisms and get rid of
>>> .parainstructions.
>> I think the above (combining alternative and paravirt/static_call) does
>> make objtool's job easier, since then we at least have the actual
>> alternative instructions available to inspect, or am I mis-understanding
>> things?
> Right, it makes objtool's job a _little_ easier, since it already knows
> how to read alternatives.  But it still has to learn to deal with the
> conflicting stack layouts.

I suppose the needed abstraction is "these blocks will start and end
with the same stack layout", while allowing the internals to diverge.

~Andrew


Re: WARNING: can't access registers at asm_common_interrupt

2020-11-11 Thread Andrew Cooper
On 11/11/2020 18:13, Josh Poimboeuf wrote:
> On Wed, Nov 11, 2020 at 06:47:36PM +0100, Peter Zijlstra wrote:
>> This is PARAVIRT_XXL only, which is a Xen special. My preference, as
>> always, is to kill it... Sadly the Xen people have a different opinion.
> That would be s nice... then we could get rid of paravirt patching
> altogether and replace it with static calls.
>
>>> Objtool doesn't know about the pushf/pop paravirt patch, so ORC gets
>>> confused by the changed stack layout.
>>>
>>> I'm thinking we either need to teach objtool how to deal with
>>> save_fl/restore_fl patches, or we need to just get rid of those nasty
>>> patches somehow.  Peter, any thoughts?
>> Don't use Xen? ;-)
>>
>> So with PARAVIRT_XXL the compiler will emit something like:
>>
>>   "CALL *pvops.save_fl"
>>
>> Which we then overwrite at runtime with "pushf; pop %[re]ax" and a few
>> NOPs.
>>
>> Now, objtool understands alternatives, and ensures they have the same
>> stack layout, it has no chance in hell of understanding this, simply
>> because paravirt_patch.c is magic.
>>
>> I don't have any immediate clever ideas, but let me ponder it a wee bit.

Well...

static_calls are a newer, and more generic, form of pvops.  Most of the
magic is to do with inlining small fragments, but static calls can do
that now too, IIRC?

>> 
>>
>> Something really disguisting we could do is recognise the indirect call
>> offset and emit an extra ORC entry for RIP+1. So the cases are:
>>
>>  CALL *pv_ops.save_fl-- 7 bytes IIRC
>>  CALL $imm;  -- 5 bytes
>>  PUSHF; POP %[RE]AX  -- 2 bytes
>>
>> so the RIP+1 (the POP insn) will only ever exist in this case. The
>> indirect and direct call cases would never land on that IP.
> I had a similar idea, and a bit of deja vu - we may have talked about
> this before.  At least I know we talked about doing something similar
> for alternatives which muck with the stack.

The main complexity with pvops is that the

    CALL *pv_ops.save_fl

form needs to be usable from extremely early in the day (pre general
patching), hence the use of function pointers and some non-standard ABIs.

For performance reasons, the end result of this pvop wants to be `pushf;
pop %[re]ax` in then native case, and `call xen_pv_save_fl` in the Xen
case, but this doesn't mean that the compiled instruction needs to be a
function pointer to begin with.

Would objtool have an easier time coping if this were implemented in
terms of a static call?
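
Roughly this shape, I mean (just a sketch - the names are illustrative, and
the early-boot ordering plus the rest of the pvops plumbing are glossed
over):

DEFINE_STATIC_CALL(pv_save_fl, native_save_fl);

static __always_inline unsigned long arch_local_save_flags(void)
{
        return static_call(pv_save_fl)();
}

/* A Xen PV guest would retarget it once during early setup, e.g.: */
static_call_update(pv_save_fl, xen_save_fl);

That way the call site is an ordinary direct call which objtool already
understands, rather than something rewritten by paravirt_patch.c.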

~Andrew


Re: [PATCH 2/2] xen: Kconfig: nest Xen guest options

2020-10-15 Thread Andrew Cooper
On 15/10/2020 13:37, boris.ostrov...@oracle.com wrote:
> On 10/14/20 1:53 PM, Jason Andryuk wrote:
>> +config XEN_512GB
>> +bool "Limit Xen pv-domain memory to 512GB"
>> +depends on XEN_PV && X86_64
>
> Why is X86_64 needed here?

512G support was implemented using a direct-mapped P2M, and is rather
beyond the virtual address capabilities of 32bit.

~Andrew


Re: [SUSPECTED SPAM][PATCH 0/2] Remove Xen PVH dependency on PCI

2020-10-14 Thread Andrew Cooper
On 14/10/2020 18:53, Jason Andryuk wrote:
> A Xen PVH domain doesn't have a PCI bus or devices,

[*] Yet.

> so it doesn't need PCI support built in.

Untangling the dependencies is a good thing, but eventually we plan to
put an optional PCI bus back in, e.g. for SRIOV use cases.

~Andrew


Re: How should we handle illegal task FPU state?

2020-10-01 Thread Andrew Cooper
On 01/10/2020 21:58, Sean Christopherson wrote:
> On Thu, Oct 01, 2020 at 01:32:04PM -0700, Yu, Yu-cheng wrote:
>> On 10/1/2020 10:43 AM, Andy Lutomirski wrote:
>>> The question is: what do we do about it?  We have two basic choices, I 
>>> think.
>>>
>>> a) Decide that the saved FPU for a task *must* be valid at all times.
>>> If there's a failure to restore state, kill the task.
>>>
>>> b) Improve our failed restoration handling and maybe even
>>> intentionally make it possible to create illegal state to allow
>>> testing.
>>>
>>> (a) sounds like a nice concept, but I'm not convinced it's practical.
>>> For example, I'm not even convinced that the set of valid SSP values
>>> is documented.
> Eh, crappy SDM writing isn't a good reason to make our lives harder.  The
> SSP MSRs are canonical MSRs and follow the same rules as the SYSCALL,
> FS/GS BASE, etc... MSRs.  I'll file an SDM bug.

Don't forget the added constraint of being 4 byte aligned. ;)

But the SDM is fine in this regard, at least as far as Vol4 goes, even
if it does have an excessively verbose way of expressing itself.

~Andrew


Re: How should we handle illegal task FPU state?

2020-10-01 Thread Andrew Cooper
On 01/10/2020 18:43, Andy Lutomirski wrote:
> Our current handling of illegal task FPU state is currently rather
> simplistic.  We basically ignore the issue with this extable code:
>
> /*
>  * Handler for when we fail to restore a task's FPU state.  We should never 
> get
>  * here because the FPU state of a task using the FPU (task->thread.fpu.state)
>  * should always be valid.  However, past bugs have allowed userspace to set
>  * reserved bits in the XSAVE area using PTRACE_SETREGSET or 
> sys_rt_sigreturn().
>  * These caused XRSTOR to fail when switching to the task, leaking the FPU
>  * registers of the task previously executing on the CPU.  Mitigate this class
>  * of vulnerability by restoring from the initial state (essentially, zeroing
>  * out all the FPU registers) if we can't restore from the task's FPU state.
>  */
> __visible bool ex_handler_fprestore(const struct exception_table_entry *fixup,
> struct pt_regs *regs, int trapnr,
> unsigned long error_code,
> unsigned long fault_addr)
> {
> regs->ip = ex_fixup_addr(fixup);
>
> WARN_ONCE(1, "Bad FPU state detected at %pB, reinitializing FPU 
> registers.",
>   (void *)instruction_pointer(regs));
>
> __copy_kernel_to_fpregs(&init_fpstate, -1);
> return true;
> }
> EXPORT_SYMBOL_GPL(ex_handler_fprestore);
>
> In other words, we mostly pretend that illegal FPU state can't happen,
> and, if it happens, we print a WARN and we blindly run the task with
> the wrong state.  This is at least an improvement from the previous
> code -- see
>
> commit d5c8028b4788f62b31fb79a331b3ad3e041fa366
> Author: Eric Biggers 
> Date:   Sat Sep 23 15:00:09 2017 +0200
>
> x86/fpu: Reinitialize FPU registers if restoring FPU state fails
>
> And we have some code that tries to sanitize user state to avoid this.
>
> IMO this all made a little bit of sense when "FPU" meant literally FPU
> or at least state that was more or less just user registers.  But now
> we have this fancy "supervisor" state, and I don't think we should be
> running user code in a context with potentially corrupted or even
> potentially incorrectly re-initialized supervisor state.  This is an
> issue for SHSTK -- if an attacker can find a straightforward way to
> corrupt a target task's FPU state, then that task will run with CET
> disabled.  Whoops!

-1 would not recommend.

> The question is: what do we do about it?  We have two basic choices, I think.
>
> a) Decide that the saved FPU for a task *must* be valid at all times.
> If there's a failure to restore state, kill the task.
>
> b) Improve our failed restoration handling and maybe even
> intentionally make it possible to create illegal state to allow
> testing.
>
> (a) sounds like a nice concept, but I'm not convinced it's practical.
> For example, I'm not even convinced that the set of valid SSP values
> is documented.

SSP is just a stack pointer, but it's not included in XSTATE.

The CET_U XSTATE component contains MSR_PL3_SSP and MSR_U_CET, both of
which have well defined validity descriptions in the manual.

The CET_S XSTATE component contains MSR_PL{2..0}_SSP, but this will be
of no interest to Linux at all.

As these can only be loaded in supervisor mode, neither operate on the
currently active SSP.

Given how broken Rings 1 and 2 are by CET-SS, I'm very surprised that an
XSTATE was allocated for this purpose.  Nothing in the architecture ever
updates these values in hardware, so they're never modified since the
last XRSTORS, unless someone played with the MSRs directly.

> So maybe (b) is the right choice.  Getting a good implementation might
> be tricky.  Right now, we restore FPU too late in
> arch_exit_to_user_mode_prepare(), and that function isn't allowed to
> fail or to send signals.  We could kill the task on failure, and I
> suppose we could consider queueing a signal, sending IPI-to-self, and
> returning with TIF_NEED_FPU_LOAD still set and bogus state.  Or we
> could rework the exit-to-usermode code to allow failure.  All of this
> becomes utterly gross for the return-from-NMI path, although I guess
> we don't restore FPU regs in that path regardless.  Or we can
> do_exit() and just bail outright.
>
> I think it would be polite to at least allow core dumping a bogus FPU
> state, and notifying ptrace() might be nice.  And, if the bogus part
> of the FPU state is non-supervisor, we could plausibly deliver a
> signal, but this is (as above) potentially quite difficult.
>
> (As an aside, our current handling of signal delivery failure sucks.
> We should *at least* log something useful.)
>
>
> Regardless of how we decide to handle this, I do think we need to do
> *something* before applying the CET patches.

Conceptually, a fault on [F]XRSTOR[S] ought to be fatal.  Something
corrupt there is definitely an exceptional circumstance.

Making it accessible in a coredump is a n

Re: [PATCH v38 21/24] x86/vdso: Implement a vDSO for Intel SGX enclave call

2020-09-29 Thread Andrew Cooper
On 29/09/2020 15:10, Dave Hansen wrote:
> On 9/28/20 4:38 PM, Andrew Cooper wrote:
>>>> CET=y, BUG_SPECTRE_V2=y: does not exist
>>>> CET=n, BUG_SPECTRE_V2=y: vulnerable, use retpoline
>>>> CET=y, BUG_SPECTRE_V2=n: no retpoline, not vulnerable
>>>> CET=n, BUG_SPECTRE_V2=n: no retpoline, not vulnerable
>>> Just to confirm: does this mean that the CPU mitigates against user
>>> code mistraining the branch predictors for CPL0?
>> If (and only if) you have eIBRS enabled.
>>
>> eIBRS should be available on all CET-capable hardware, and Linux ought
>> to use it by default.
> You're totally right, of course.  I was (wrongly) thinking about this
> VDSO retpoline as kernel code.
>
> There's another wrinkle here.  Let's say we're vulnerable to a
> Spectre-v2-style attack and we want to mitigate it on CET hardware that
> has enhanced IBRS.  I'm not sure how reliable of a mitigation retpolines
> are on enhanced IBRS hardware.  Intel has recommended _against_ using
> them in some cases:
>
>> https://software.intel.com/security-software-guidance/api-app/sites/default/files/Retpoline-A-Branch-Target-Injection-Mitigation.pdf
> "On processors that support enhanced IBRS, it should be used for
> mitigation instead of retpoline."
> I actually authored that bit of the whitepaper, and I recall that this
> was not simply a recommendation based on performance advantages of using
> enhanced IBRS.  I can dig through some old email if we decide that we
> want to explore using a retpoline on enhanced IBRS hardware.

If only life were simple.

In light of https://arxiv.org/abs/2008.02307 which managed to
demonstrate that the original KAISER was actually a speculative attack
and nothing to do with the prefetch instruction, a discussion about
same-mode training happened.

The updated recommendation given was to continue using retpoline as well
as eIBRS to prevent same-mode training of the syscall indirect branch. 
Josh (CC'd) has been doing a lot of work to find and fix other
speculative leaks in this area.

For Skylake uarch and later, even if an RSB underflow leads to a BTB
lookup, it still requires an interrupt/NMI to hit one of two instruction
boundaries to empty the RSB, and an attacker with that level of control
probably has more interesting things to be trying to do.

Without retpoline (or something even more expensive such as IRET-ing
around), an attacker can still create speculative type confusion between
different system calls, when eIBRS is active.

Once you mix CET-SS in, this breaks, unless you're prepared to update
the retpoline gadget to include a WRSS to modify the shadow stack
alongside the regular stack.  Add this to the large pile of fun for
whomever has the privileg^W chore of implementing supervisor CET support.
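
For illustration only, a shadow-stack-compatible gadget would be something
like the following (an untested sketch, assuming WRSS has been enabled via
MSR_S_CET.WR_SHSTK_EN):

        call 2f
1:      pause
        lfence
        jmp 1b
2:      mov    %rax, (%rsp)     # rewrite the return address on the data stack
        rdsspq %rcx             # fetch the shadow stack pointer
        wrssq  %rax, (%rcx)     # rewrite the matching shadow stack entry
        ret                     # both stacks now agree, so no #CP

Note it also clobbers a second register, so the real thing needs rather
more care than the existing thunks.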

>
> But, let's take a step back.  The changelog for this patch needs to at
> least have:
>
> 1. What is the attack being mitigated by the retpoline?
> 2. Do we actually want to mitigate it?
> 3. What options are there to mitigate it?
> 4. Which option does this patch use and why?
>
> Right now, there's not even a comment about this.

I agree.  The reason for using a retpoline here in the first place is
unclear.

~Andrew


Re: [PATCH v38 21/24] x86/vdso: Implement a vDSO for Intel SGX enclave call

2020-09-28 Thread Andrew Cooper
On 28/09/2020 21:42, Jarkko Sakkinen wrote:
> On Mon, Sep 28, 2020 at 05:44:35PM +0100, Andrew Cooper wrote:
>> On 28/09/2020 01:58, Jarkko Sakkinen wrote:
>>> On Fri, Sep 25, 2020 at 07:23:59PM +0100, Andrew Cooper wrote:
>>>> On 15/09/2020 12:28, Jarkko Sakkinen wrote:
>>>>> diff --git a/arch/x86/entry/vdso/vsgx_enter_enclave.S 
>>>>> b/arch/x86/entry/vdso/vsgx_enter_enclave.S
>>>>> new file mode 100644
>>>>> index ..adbd59d41517
>>>>> --- /dev/null
>>>>> +++ b/arch/x86/entry/vdso/vsgx_enter_enclave.S
>>>>> @@ -0,0 +1,157 @@
>>>>> +SYM_FUNC_START(__vdso_sgx_enter_enclave)
>>>>> 
>>>>> +.Lretpoline:
>>>>> + call 2f
>>>>> +1:   pause
>>>>> + lfence
>>>>> + jmp 1b
>>>>> +2:   mov %rax, (%rsp)
>>>>> + ret
>>>> I hate to throw further spanners in the work, but this is not compatible
>>>> with CET, and the user shadow stack work in progress.
>>> CET goes beyond my expertise. Can you describe, at least rudimentary,
>>> how this code is not compatible?
>> CET Shadow Stacks detect attacks which modify the return address on the
>> stack.
>>
>> Retpoline *is* a ROP gadget.  It really does modify the return address
>> on the stack, even if its purpose is defensive (vs Spectre v2) rather
>> than malicious.
> Aah. I get that, yes.
>
> Kernel is full of retpoline but I presume that ring-0 does not use CET.

No-one has implemented CET-SS support for Linux itself yet, but
other kernels do have it working.

~Andrew


Re: [PATCH v38 21/24] x86/vdso: Implement a vDSO for Intel SGX enclave call

2020-09-28 Thread Andrew Cooper
On 28/09/2020 23:41, Andy Lutomirski wrote:
> On Mon, Sep 28, 2020 at 3:18 PM Dave Hansen  wrote:
>> On 9/28/20 3:06 PM, H.J. Lu wrote:
 I'm open to do either solution. My thinking was to initially do things
 vsgx.S local (i.e. consider ALTERNATIVE post upstreaming) and use the
 above solution but I'm also fine doing ALTERNATIVE. Dave kindly briefed
 on details how that thing works and it should be perfectly usable for
 our use case.

>>> Since SHSTK and IBT are enabled per process, not the whole machine,
>>> are you going to patch vDSO on a per-process basis?
>> No.
>>
>> Retpolines mitigate Spectre v2 attacks.  If you're not vulnerable to
>> Spectre v2, you don't need retpolines.
>>
>> All processors which support CET *also* have hardware mitigations
>> against Spectre v2 and don't need retpolines.  Here's all of the
>> possibilities:
>>
>> CET=y, BUG_SPECTRE_V2=y: does not exist
>> CET=n, BUG_SPECTRE_V2=y: vulnerable, use retpoline
>> CET=y, BUG_SPECTRE_V2=n: no retpoline, not vulnerable
>> CET=n, BUG_SPECTRE_V2=n: no retpoline, not vulnerable
> Just to confirm: does this mean that the CPU mitigates against user
> code mistraining the branch predictors for CPL0?

If (and only if) you have eIBRS enabled.

eIBRS should be available on all CET-capable hardware, and Linux ought
to use it by default.

> Because this is the
> vDSO, and the situation we're actually concerned about is user code
> mistraining its own branch predictors.  This could happen
> cross-process or within the same process.

There is nothing (in Intel parts) which prevents same-mode training
of indirect branches, either in user or kernel space.

However, an IBPB on context switch should prevent cross-process training
attacks.

~Andrew


Re: [PATCH v38 21/24] x86/vdso: Implement a vDSO for Intel SGX enclave call

2020-09-28 Thread Andrew Cooper
On 28/09/2020 01:58, Jarkko Sakkinen wrote:
> On Fri, Sep 25, 2020 at 07:23:59PM +0100, Andrew Cooper wrote:
>> On 15/09/2020 12:28, Jarkko Sakkinen wrote:
>>> diff --git a/arch/x86/entry/vdso/vsgx_enter_enclave.S 
>>> b/arch/x86/entry/vdso/vsgx_enter_enclave.S
>>> new file mode 100644
>>> index ..adbd59d41517
>>> --- /dev/null
>>> +++ b/arch/x86/entry/vdso/vsgx_enter_enclave.S
>>> @@ -0,0 +1,157 @@
>>> +SYM_FUNC_START(__vdso_sgx_enter_enclave)
>>> 
>>> +.Lretpoline:
>>> +   call 2f
>>> +1: pause
>>> +   lfence
>>> +   jmp 1b
>>> +2: mov %rax, (%rsp)
>>> +   ret
>> I hate to throw further spanners in the work, but this is not compatible
>> with CET, and the user shadow stack work in progress.
> CET goes beyond my expertise. Can you describe, at least rudimentary,
> how this code is not compatible?

CET Shadow Stacks detect attacks which modify the return address on the
stack.

Retpoline *is* a ROP gadget.  It really does modify the return address
on the stack, even if its purpose is defensive (vs Spectre v2) rather
than malicious.

>> Whichever of these two large series lands first is going to inflict
>> fixing this problem on the other.
>>
>> As the vdso text is global (to a first approximation), it must not be a
>> retpoline if any other process is liable to want to use CET-SS.
> Why is that?

Because when CET-SS is enabled, the ret will suffer a #CP exception
(return address on the stack not matching the one recorded in the shadow
stack), which I presume/hope is wired into SIGSEGV.
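
Annotating the fragment above to show exactly where things diverge (a
sketch of the mechanics, not the vdso code verbatim):

        call 2f                 # pushes &1f onto the data *and* shadow stacks
1:      pause
        lfence
        jmp 1b
2:      mov %rax, (%rsp)        # rewrites the data stack copy only
        ret                     # the two return addresses now differ => #CP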

~Andrew


Re: [PATCH v38 21/24] x86/vdso: Implement a vDSO for Intel SGX enclave call

2020-09-25 Thread Andrew Cooper
On 15/09/2020 12:28, Jarkko Sakkinen wrote:
> diff --git a/arch/x86/entry/vdso/vsgx_enter_enclave.S 
> b/arch/x86/entry/vdso/vsgx_enter_enclave.S
> new file mode 100644
> index ..adbd59d41517
> --- /dev/null
> +++ b/arch/x86/entry/vdso/vsgx_enter_enclave.S
> @@ -0,0 +1,157 @@
> +SYM_FUNC_START(__vdso_sgx_enter_enclave)
> 
> +.Lretpoline:
> + call 2f
> +1:   pause
> + lfence
> + jmp 1b
> +2:   mov %rax, (%rsp)
> + ret

I hate to throw further spanners in the work, but this is not compatible
with CET, and the user shadow stack work in progress.

Whichever of these two large series lands first is going to inflict
fixing this problem on the other.

As the vdso text is global (to a first approximation), it must not be a
retpoline if any other process is liable to want to use CET-SS.

If the retpoline really does need to stay, then the vdso probably needs
to gain suitable __x86_indirect_thunk_%reg thunks which are patched at
boot based on the system properties.

~Andrew


Re: [PATCH v3] x86/uaccess: Use pointer masking to limit uaccess speculation

2020-09-14 Thread Andrew Cooper
On 14/09/2020 20:27, Josh Poimboeuf wrote:
> On Mon, Sep 14, 2020 at 09:21:56PM +0200, Borislav Petkov wrote:
>> On Mon, Sep 14, 2020 at 11:48:55AM -0700, Dan Williams wrote:
 Err, stupid question: can this macro then be folded into access_ok() so
 that you don't have to touch so many places and the check can happen
 automatically?
>>> I think that ends up with more changes because it changes the flow of
>>> access_ok() from returning a boolean to returning a modified user
>>> address that can be used in the speculative path.
>> I mean something like the totally untested, only to show intent hunk
>> below? (It is late here so I could very well be missing an aspect):
>>
>> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
>> index 2bffba2a1b23..c94e1589682c 100644
>> --- a/arch/x86/include/asm/uaccess.h
>> +++ b/arch/x86/include/asm/uaccess.h
>> @@ -7,6 +7,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -92,8 +93,15 @@ static inline bool pagefault_disabled(void);
>>   */
>>  #define access_ok(addr, size)   \
>>  ({  \
>> +bool range; \
>> +typeof(addr) a = addr, b;   \
>> +\
>>  WARN_ON_IN_IRQ();   \
>> -likely(!__range_not_ok(addr, size, user_addr_max()));   \
>> +\
>> +range = __range_not_ok(addr, size, user_addr_max());\
>> +b = (typeof(addr)) array_index_nospec((__force unsigned long)addr, 
>> TASK_SIZE_MAX); \
>> +\
>> +likely(!range && a == b);   \
> That's not going to work because 'a == b' can (and will) be
> misspeculated.

Correct.

array_index_nospec() only works (safety wise) when there are no
conditional jumps between it and the memory access it is protecting.

Otherwise, an attacker can just take control of that later jump and skip
the check that way.
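
i.e. the safe shape is to clamp unconditionally and feed the clamped
pointer straight into the access, roughly like this (a sketch only - the
exact helpers/flow are illustrative, not necessarily what the patch ends
up using):

        if (unlikely(__range_not_ok(ptr, size, user_addr_max())))
                return -EFAULT;

        /* No conditional branch between the clamp and the access. */
        ptr = (void __user *)array_index_nospec((__force unsigned long)ptr,
                                                TASK_SIZE_MAX);

        if (raw_copy_from_user(dst, ptr, size))
                return -EFAULT;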

~Andrew


Re: [PATCH 01/13] x86/entry: Fix AC assertion

2020-09-02 Thread Andrew Cooper
On 02/09/2020 16:58, Brian Gerst wrote:
> On Wed, Sep 2, 2020 at 9:38 AM Peter Zijlstra  wrote:
>> From: Peter Zijlstra 
>>
>> The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
>> improve user entry sanity checks") unconditionally triggers on my IVB
>> machine because it does not support SMAP.
>>
>> For !SMAP hardware we patch out CLAC/STAC instructions and thus if
>> userspace sets AC, we'll still have it set after entry.
>>
>> Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry 
>> sanity checks")
>> Signed-off-by: Peter Zijlstra (Intel) 
>> Acked-by: Andy Lutomirski 
>> ---
>>  arch/x86/include/asm/entry-common.h |   11 +--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> --- a/arch/x86/include/asm/entry-common.h
>> +++ b/arch/x86/include/asm/entry-common.h
>> @@ -18,8 +18,16 @@ static __always_inline void arch_check_u
>>  * state, not the interrupt state as imagined by Xen.
>>  */
>> unsigned long flags = native_save_fl();
>> -   WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
>> - X86_EFLAGS_NT));
>> +   unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;
>> +
>> +   /*
>> +* For !SMAP hardware we patch out CLAC on entry.
>> +*/
>> +   if (boot_cpu_has(X86_FEATURE_SMAP) ||
>> +   (IS_ENABLED(CONFIG_64_BIT) && 
>> boot_cpu_has(X86_FEATURE_XENPV)))
>> +   mask |= X86_EFLAGS_AC;
> Is the explicit Xen check necessary?  IIRC the Xen hypervisor will
> filter out the SMAP bit in the cpuid pvop.

The Xen check isn't anything to do with SMAP.

64bit PV guest kernels run in Ring3, so userspace's choice of AC for
real alignment check purposes needs to not leak into kernel context.

Xen's ABI for a user => kernel context switch should clear AC on behalf
of the kernel, but the fact still remains that if AC actually leaks into
context for whatever reason, stuff is going to break.

~Andrew


[PATCH] docs/ia64: Drop obsolete Xen documentation

2020-08-27 Thread Andrew Cooper
While the xensource.com URLs referenced still exist, neither the Xen or Linux
2.6.18 fork have been touched since 2009, 11 years ago.  Other URLs are dead.

IA64 support was removed in Xen 4.2, in 2012.  Relegate this piece of
documentation to source history.

Signed-off-by: Andrew Cooper 
---
CC: Tony Luck 
CC: Fenghua Yu 
CC: Jonathan Corbet 
CC: linux-i...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 Documentation/ia64/index.rst |   1 -
 Documentation/ia64/xen.rst   | 206 ---
 2 files changed, 207 deletions(-)
 delete mode 100644 Documentation/ia64/xen.rst

diff --git a/Documentation/ia64/index.rst b/Documentation/ia64/index.rst
index 0436e1034115..4bdfe28067ee 100644
--- a/Documentation/ia64/index.rst
+++ b/Documentation/ia64/index.rst
@@ -15,4 +15,3 @@ IA-64 Architecture
irq-redir
mca
serial
-   xen
diff --git a/Documentation/ia64/xen.rst b/Documentation/ia64/xen.rst
deleted file mode 100644
index 831339c74441..
--- a/Documentation/ia64/xen.rst
+++ /dev/null
@@ -1,206 +0,0 @@
-
-Recipe for getting/building/running Xen/ia64 with pv_ops
-
-This recipe describes how to get xen-ia64 source and build it,
-and run domU with pv_ops.
-
-Requirements
-
-
-  - python
-  - mercurial
-it (aka "hg") is an open-source source code
-management software. See the below.
-http://www.selenic.com/mercurial/wiki/
-  - git
-  - bridge-utils
-
-Getting and Building Xen and Dom0
-=
-
-  My environment is:
-
-- Machine  : Tiger4
-- Domain0 OS  : RHEL5
-- DomainU OS  : RHEL5
-
- 1. Download source::
-
-   # hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg
-   # cd xen-unstable.hg
-   # hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg
-
- 2. # make world
-
- 3. # make install-tools
-
- 4. copy kernels and xen::
-
-   # cp xen/xen.gz /boot/efi/efi/redhat/
-   # cp build-linux-2.6.18-xen_ia64/vmlinux.gz \
-   /boot/efi/efi/redhat/vmlinuz-2.6.18.8-xen
-
- 5. make initrd for Dom0/DomU::
-
-   # make -C linux-2.6.18-xen.hg ARCH=ia64 modules_install \
-  O=$(pwd)/build-linux-2.6.18-xen_ia64
-   # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6.18.8-xen.img \
- 2.6.18.8-xen --builtin mptspi --builtin mptbase \
- --builtin mptscsih --builtin uhci-hcd --builtin ohci-hcd \
- --builtin ehci-hcd
-
-Making a disk image for guest OS
-
-
- 1. make file::
-
-  # dd if=/dev/zero of=/root/rhel5.img bs=1M seek=4096 count=0
-  # mke2fs -F -j /root/rhel5.img
-  # mount -o loop /root/rhel5.img /mnt
-  # cp -ax /{dev,var,etc,usr,bin,sbin,lib} /mnt
-  # mkdir /mnt/{root,proc,sys,home,tmp}
-
-  Note: You may miss some device files. If so, please create them
-  with mknod. Or you can use tar instead of cp.
-
- 2. modify DomU's fstab::
-
-  # vi /mnt/etc/fstab
- /dev/xvda1  /ext3defaults1 1
- none/dev/pts devpts  gid=5,mode=620  0 0
- none/dev/shm tmpfs   defaults0 0
- none/procprocdefaults0 0
- none/sys sysfs   defaults0 0
-
- 3. modify inittab
-
-set runlevel to 3 to avoid X trying to start::
-
-  # vi /mnt/etc/inittab
- id:3:initdefault:
-
-Start a getty on the hvc0 console::
-
-   X0:2345:respawn:/sbin/mingetty hvc0
-
-tty1-6 mingetty can be commented out
-
- 4. add hvc0 into /etc/securetty::
-
-  # vi /mnt/etc/securetty (add hvc0)
-
- 5. umount::
-
-  # umount /mnt
-
-FYI, virt-manager can also make a disk image for guest OS.
-It's GUI tools and easy to make it.
-
-Boot Xen & Domain0
-==
-
- 1. replace elilo
-elilo of RHEL5 can boot Xen and Dom0.
-If you use old elilo (e.g RHEL4), please download from the below
-http://elilo.sourceforge.net/cgi-bin/blosxom
-and copy into /boot/efi/efi/redhat/::
-
-  # cp elilo-3.6-ia64.efi /boot/efi/efi/redhat/elilo.efi
-
- 2. modify elilo.conf (like the below)::
-
-  # vi /boot/efi/efi/redhat/elilo.conf
-  prompt
-  timeout=20
-  default=xen
-  relocatable
-
-  image=vmlinuz-2.6.18.8-xen
- label=xen
- vmm=xen.gz
- initrd=initrd-2.6.18.8-xen.img
- read-only
- append=" -- rhgb root=/dev/sda2"
-
-The append options before "--" are for xen hypervisor,
-the options after "--" are for dom0.
-
-FYI, your machine may need console options like
-"com1=19200,8n1 console=vga,com1". For example,
-append="com1=19200,8n1 console=vga,com1 -- rhgb console=tty0 \
-console=ttyS0 root=/dev/sda2"
-
-Getting and Building domU

Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Andrew Cooper
On 25/08/2020 18:35, Luck, Tony wrote:
>>> Or malicious hypervisor action, and that's a problem.
>>>
>>> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
>>> actual SYSCALL text or the first memory it accesses -- I don't have a
>>> TDX spec so I don't know the details).
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

You have to be more specific about what you mean by "malicious" hypervisor.

Nothing can protect against a hypervisor which refuses to schedule the
Trusted Domain.  The guest cannot protect against availability
maliciousness.  However, you can use market forces to fix that problem. 
(I'll take my credit card elsewhere if you don't schedule my VM, etc)

Things are more complicated when it comes to integrity or
confidentiality of the TD, but the prevailing feeling seems to be
"crashing obviously and reliably if something goes wrong is ok".

If I've read the TDX spec/whitepaper properly, the main hypervisor can
write to all the encrypted pages.  This will destroy data, break the
MAC, and yields #PF inside the SEAM hypervisor, or the TD when the cache
line is next referenced.

Cunning timing on behalf of a malicious hypervisor (hitting the SYSCALL
gap) will cause the guest's #PF handler to run on a user stack, opening
a privilege escalation hole.

Whatever you might want to say about the exact integrity/confidentiality
expectations, I think "the hypervisor can open a user=>kernel privilege
escalation hole inside the TD" is not what people would consider acceptable.

On AMD parts, this is why the #VC handler is IST, in an attempt to at
least notice this damage and crash.  There is no way TDX can get away
with requiring #PF to be IST as well.

~Andrew



Re: [PATCH] x86/entry: Fix AC assertion

2020-08-24 Thread Andrew Cooper
On 24/08/2020 16:21, pet...@infradead.org wrote:
> On Mon, Aug 24, 2020 at 03:22:06PM +0100, Andrew Cooper wrote:
>> On 24/08/2020 11:14, pet...@infradead.org wrote:
>>> The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
>>> improve user entry sanity checks") unconditionally triggers on my IVB
>>> machine because it does not support SMAP.
>>>
>>> For !SMAP hardware we patch out CLAC/STAC instructions and thus if
>>> userspace sets AC, we'll still have it set after entry.
>> Technically, you don't patch them in, rather than patch them out.
> True.
>
>>> Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry 
>>> sanity checks")
>>> Signed-off-by: Peter Zijlstra (Intel) 
>>> Acked-by: Andy Lutomirski 
>>> ---
>>>  arch/x86/include/asm/entry-common.h |   11 +--
>>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> --- a/arch/x86/include/asm/entry-common.h
>>> +++ b/arch/x86/include/asm/entry-common.h
>>> @@ -18,8 +18,15 @@ static __always_inline void arch_check_u
>>>  * state, not the interrupt state as imagined by Xen.
>>>  */
>>> unsigned long flags = native_save_fl();
>>> -   WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
>>> - X86_EFLAGS_NT));
>>> +   unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;
>>> +
>>> +   /*
>>> +* For !SMAP hardware we patch out CLAC on entry.
>>> +*/
>>> +   if (boot_cpu_has(X86_FEATURE_SMAP))
>>> +   mask |= X86_EFLAGS_AC;
>> The Xen PV ABI clears AC on entry for 64bit guests, because Linux is
>> actually running in Ring 3, and therefore susceptible to #AC's which
>> wouldn't occur natively.
> So do you then want it to be something like:
>
>   if (boot_cpu_has(X86_FEATURE_SMAP) ||
>   (IS_ENABLED(CONFIG_64_BIT) && boot_cpu_has(X86_FEATURE_XENPV)))
>
> ? Or are you fine with the proposed?

Dealer's choice, but this option would be slightly better overall.

(Are there any other cases where Linux will be running in Ring 3?  I
haven't been paying attention to recent changes in PVOps.)

~Andrew


Re: [PATCH] x86/entry: Fix AC assertion

2020-08-24 Thread Andrew Cooper
On 24/08/2020 11:14, pet...@infradead.org wrote:
> The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
> improve user entry sanity checks") unconditionally triggers on my IVB
> machine because it does not support SMAP.
>
> For !SMAP hardware we patch out CLAC/STAC instructions and thus if
> userspace sets AC, we'll still have it set after entry.

Technically, you don't patch them in, rather than patch them out.

>
> Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity 
> checks")
> Signed-off-by: Peter Zijlstra (Intel) 
> Acked-by: Andy Lutomirski 
> ---
>  arch/x86/include/asm/entry-common.h |   11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -18,8 +18,15 @@ static __always_inline void arch_check_u
>* state, not the interrupt state as imagined by Xen.
>*/
>   unsigned long flags = native_save_fl();
> - WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
> -   X86_EFLAGS_NT));
> + unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;
> +
> + /*
> +  * For !SMAP hardware we patch out CLAC on entry.
> +  */
> + if (boot_cpu_has(X86_FEATURE_SMAP))
> + mask |= X86_EFLAGS_AC;

The Xen PV ABI clears AC on entry for 64bit guests, because Linux is
actually running in Ring 3, and therefore susceptible to #AC's which
wouldn't occur natively.

~Andrew


Re: [PATCH v2 5/8] x86/debug: Simplify #DB signal code

2020-08-24 Thread Andrew Cooper
On 24/08/2020 12:05, pet...@infradead.org wrote:
> On Sun, Aug 23, 2020 at 04:09:42PM -0700, Andy Lutomirski wrote:
>> On Fri, Aug 21, 2020 at 3:21 AM Peter Zijlstra  wrote:
>>> Get rid of the two variables, avoid computing si_code when not needed
>>> and be consistent about which dr6 value is used.
>>>
>>> -   if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
>>> -   send_sigtrap(regs, 0, si_code);
>>> +   /*
>>> +* If dr6 has no reason to give us about the origin of this trap,
>>> +* then it's very likely the result of an icebp/int01 trap.
>>> +* User wants a sigtrap for that.
>>> +*/
>>> +   if (dr6 & (DR_STEP | DR_TRAP_BITS) || !dr6)
>>> +   send_sigtrap(regs, 0, get_si_code(dr6));
>> The old condition was ... || (actual DR6 value) and the new condition
>> was ... || (stupid notifier-modified DR6 value).  I think the old code
>> was more correct.
> Hurmph.. /me goes re-read the SDM.
>
> INT1 is a trap,
> instruction breakpoint is a fault
>
> So if you have:
>
>   INT1
> 1:some-instr
>
> and set an X breakpoint on 1, we'll loose the INT1, right?

You should get two.  First with a dr6 of 0 (ICEBP, RIP pointing at 1:),
and a second with dr6 indicating the X breakpoint (again, RIP pointing
at 1:).

SDM Vol3 6.9 PRIORITY AMONG SIMULTANEOUS EXCEPTIONS AND INTERRUPTS

Traps on previous instructions are at priority 4, because they are still
"part" of the previous instruction.  X breakpoints are priority 7.

The two #DB's shouldn't merge because nothing inhibits delivery[1] of
the trap at priority 4, and on return from the handler, RF isn't set so
the instruction breakpoint will trigger.

~Andrew

[1] Anyone playing with MovSS gets to keep all resulting pieces.


Re: [PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-28 Thread Andrew Cooper
On 28/07/2020 17:59, Roger Pau Monné wrote:
> On Tue, Jul 28, 2020 at 05:48:23PM +0100, Julien Grall wrote:
>> Hi,
>>
>> On 27/07/2020 10:13, Roger Pau Monne wrote:
>>> To be used in order to create foreign mappings. This is based on the
>>> ZONE_DEVICE facility which is used by persistent memory devices in
>>> order to create struct pages and kernel virtual mappings for the IOMEM
>>> areas of such devices. Note that on kernels without support for
>>> ZONE_DEVICE Xen will fallback to use ballooned pages in order to
>>> create foreign mappings.
>>>
>>> The newly added helpers use the same parameters as the existing
>>> {alloc/free}_xenballooned_pages functions, which allows for in-place
>>> replacement of the callers. Once a memory region has been added to be
>>> used as scratch mapping space it will no longer be released, and pages
>>> returned are kept in a linked list. This allows to have a buffer of
>>> pages and prevents resorting to frequent additions and removals of
>>> regions.
>>>
>>> If enabled (because ZONE_DEVICE is supported) the usage of the new
>>> functionality untangles Xen balloon and RAM hotplug from the usage of
>>> unpopulated physical memory ranges to map foreign pages, which is the
>>> correct thing to do in order to avoid mappings of foreign pages depend
>>> on memory hotplug.
>> I think this is going to break Dom0 on Arm if the kernel has been built with
>> hotplug. This is because you may end up to re-use region that will be used
>> for the 1:1 mapping of a foreign map.
>>
>> Note that I don't know whether hotplug has been tested on Xen on Arm yet. So
>> it might be possible to be already broken.
>>
>> Meanwhile, my suggestion would be to make the use of hotplug in the balloon
>> code conditional (maybe using CONFIG_ARM64 and CONFIG_ARM)?
> Right, this feature (allocation of unpopulated memory separated from
> the balloon driver) is currently gated on CONFIG_ZONE_DEVICE, which I
> think could be used on Arm.
>
> IMO the right solution seems to be to subtract the physical memory
> regions that can be used for the identity mappings of foreign pages
> (all RAM on the system AFAICT) from iomem_resource, as that would make
> this and the memory hotplug done in the balloon driver safe?

The right solution is a mechanism for translated guests to query Xen to
find regions of guest physical address space which are unused, and can
safely be used for foreign/grant/other mappings.

Please don't waste any more time applying more duct tape to a broken
system, and instead spend the time organising some proper foundations.

~Andrew


Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Andrew Cooper
On 23/07/2020 16:10, Jürgen Groß wrote:
> On 23.07.20 15:59, Roger Pau Monné wrote:
>> On Thu, Jul 23, 2020 at 03:22:49PM +0200, David Hildenbrand wrote:
>>> On 23.07.20 14:23, Roger Pau Monné wrote:
 On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> On 23.07.20 10:45, Roger Pau Monne wrote:
>> Add an extra option to add_memory_resource that overrides the memory
>> hotplug online behavior in order to force onlining of memory from
>> add_memory_resource unconditionally.
>>
>> This is required for the Xen balloon driver, that must run the
>> online page callback in order to correctly process the newly added
>> memory region, note this is an unpopulated region that is used by
>> Linux
>> to either hotplug RAM or to map foreign pages from other domains,
>> and
>> hence memory hotplug when running on Xen can be used even without
>> the
>> user explicitly requesting it, as part of the normal operations
>> of the
>> OS when attempting to map memory from a different domain.
>>
>> Setting a different default value of memhp_default_online_type when
>> attaching the balloon driver is not a robust solution, as the
>> user (or
>> distro init scripts) could still change it and thus break the Xen
>> balloon driver.
>
> I think we discussed this a couple of times before (even triggered
> by my
> request), and this is responsibility of user space to configure.
> Usually
> distros have udev rules to online memory automatically.
> Especially, user
> space should eb able to configure *how* to online memory.

 Note (as per the commit message) that in the specific case I'm
 referring to the memory hotplugged by the Xen balloon driver will be
 an unpopulated range to be used internally by certain Xen subsystems,
 like the xen-blkback or the privcmd drivers. The addition of such
 blocks of (unpopulated) memory can happen without the user explicitly
 requesting it, and hence not even aware such hotplug process is taking
 place. To be clear: no actual RAM will be added to the system.
>>>
>>> Okay, but there is also the case where XEN will actually hotplug memory
>>> using this same handler IIRC (at least I've read papers about it). Both
>>> are using the same handler, correct?
>>
>> Yes, it's used by this dual purpose, which I have to admit I don't
>> like that much either.
>>
>> One set of pages should be clearly used for RAM memory hotplug, and
>> the other to map foreign pages that are not related to memory hotplug,
>> it's just that we happen to need a physical region with backing struct
>> pages.
>>

> It's the admin/distro responsibility to configure this properly.
> In case
> this doesn't happen (or as you say, users change it), bad luck.
>
> E.g., virtio-mem takes care to not add more memory in case it is not
> getting onlined. I remember hyper-v has similar code to at least
> wait a
> bit for memory to get onlined.

 I don't think VirtIO or Hyper-V use the hotplug system in the same way
 as Xen, as said this is done to add unpopulated memory regions that
 will be used to map foreign memory (from other domains) by Xen drivers
 on the system.
>>>
>>> Indeed, if the memory is never exposed to the buddy (and all you
>>> need is
>>> struct pages +  a kernel virtual mapping), I wonder if
>>> memremap/ZONE_DEVICE is what you want?
>>
>> I'm certainly not familiar with the Linux memory subsystem, but if
>> that gets us a backing struct page and a kernel mapping then I would
>> say yes.
>>
>>> Then you won't have user-visible
>>> memory blocks created with unclear online semantics, partially
>>> involving
>>> the buddy.
>>
>> Seems like a fine solution.
>>
>> Juergen: would you be OK to use a separate page-list for
>> alloc_xenballooned_pages on HVM/PVH using the logic described by
>> David?
>>
>> I guess I would leave PV as-is, since it already has this reserved
>> region to map foreign pages.
>
> I would really like a common solution, especially as it would enable
> pv driver domains to use that feature, too.
>
> And finding a region for this memory zone in PVH dom0 should be common
> with PV dom0 after all. We don't want to collide with either PCI space
> or hotplug memory.

While I agree with the goal here, these are two very different things, due
to the completely different nature of PV and HVM/PVH guests.

HVM/PVH guests have a concrete guest physical address space.  Linux
needs to pick some gfn's to use which aren't used by anything else (and
Xen's behaviour of not providing any help here is deeply unhelpful, and
needs fixing), and get struct page_info's for them.

PV is totally different.  Linux still needs page_info's for them, but
there is no concept of a guest physical address space.  You can
literally gain access to foreign mappings or grant maps by asking Xen to
modify a PTE.  For convenience wit

Re: [PATCH entry v2 5/6] x86/ldt: Disable 16-bit segments on Xen PV

2020-07-03 Thread Andrew Cooper
On 03/07/2020 18:02, Andy Lutomirski wrote:
> Xen PV doesn't implement ESPFIX64, so they don't work right.  Disable
> them.  Also print a warning the first time anyone tries to use a
> 16-bit segment on a Xen PV guest that would otherwise allow it
> to help people diagnose this change in behavior.
>
> This gets us closer to having all x86 selftests pass on Xen PV.

Do we know exactly how much virtual address space ESPFIX64 takes up?

Are people going to scream in horror if Linux needs to donate some
virtual address space back to Xen to get this working?  Linux's current
hypervisor range (8TB, 16 pagetable slots) really isn't enough for some
systems these days.

~Andrew


Re: FSGSBASE seems to be busted on Xen PV

2020-07-03 Thread Andrew Cooper
On 03/07/2020 18:10, Andy Lutomirski wrote:
> Hi Xen folks-
>
> I did some testing of the upcoming Linux FSGSBASE support on Xen PV,
> and I found what appears to be some significant bugs in the Xen
> context switching code.  These bugs are causing Linux selftest
> failures, and they could easily cause random and hard-to-debug
> failures of user programs that use the new instructions in a Xen PV
> guest.
>
> The bugs seem to boil down to the context switching code in Xen being
> clever and trying to guess that a nonzero FS or GS means that the
> segment base must match the in-memory descriptor.  This is simply not
> true if CR4.FSGSBASE is set -- the bases can have any canonical value,
> under the full control of the guest, and Xen has absolutely no way of
> knowing whether the values are expected to be in sync with the
> selectors.  (The same is true of FSGSBASE except that guest funny
> business either requires MSR accesses or some descriptor table
> fiddling, and guests are perhaps less likely to care)
>
> Having written a bunch of the corresponding Linux code, I don't
> there's any way around just independently saving and restoring the
> selectors and the bases.  At least it's relatively fast with FSGSBASE
> enabled.
>
> If you can't get this fixed in upstream Xen reasonably quickly, we may
> need to disable FSGSBASE in a Xen PV guest in Linux.

This has come up several times before, but if it's actually breaking
userspace then Xen needs to change.

I'll see about making something which is rather more robust.

~Andrew


Re: [PATCH v2 1/4] x86/xen: remove 32-bit Xen PV guest support

2020-07-02 Thread Andrew Cooper
On 02/07/2020 23:59, Boris Ostrovsky wrote:
> On 7/1/20 7:06 AM, Juergen Gross wrote:
>>  
>> -#ifdef CONFIG_X86_PAE
>> -static void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
>> -{
>> -trace_xen_mmu_set_pte_atomic(ptep, pte);
>> -__xen_set_pte(ptep, pte);
>
> Probably not for this series but I wonder whether __xen_set_pte() should
> continue to use hypercall now that we are 64-bit only.

The hypercall path is a SYSCALL (and SYSRET out).

The "writeable" PTE path is a #PF, followed by an x86 instruction
emulation, which then reaches the same logic as the hypercall path (and
an IRET out).

~Andrew


Re: [PATCH fsgsbase v2 4/4] x86/fsgsbase: Fix Xen PV support

2020-06-29 Thread Andrew Cooper
On 29/06/2020 06:17, Jürgen Groß wrote:
> On 26.06.20 19:24, Andy Lutomirski wrote:
>> On Xen PV, SWAPGS doesn't work.  Teach __rdfsbase_inactive() and
>> __wrgsbase_inactive() to use rdmsrl()/wrmsrl() on Xen PV.  The Xen
>> pvop code will understand this and issue the correct hypercalls.
>>
>> Cc: Boris Ostrovsky 
>> Cc: Juergen Gross 
>> Cc: Stefano Stabellini 
>> Cc: xen-de...@lists.xenproject.org
>> Signed-off-by: Andy Lutomirski 
>> ---
>>   arch/x86/kernel/process_64.c | 20 ++--
>>   1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>> index cb8e37d3acaa..457d02aa10d8 100644
>> --- a/arch/x86/kernel/process_64.c
>> +++ b/arch/x86/kernel/process_64.c
>> @@ -163,9 +163,13 @@ static noinstr unsigned long
>> __rdgsbase_inactive(void)
>>     lockdep_assert_irqs_disabled();
>>   -    native_swapgs();
>> -    gsbase = rdgsbase();
>> -    native_swapgs();
>> +    if (!static_cpu_has(X86_FEATURE_XENPV)) {
>> +    native_swapgs();
>> +    gsbase = rdgsbase();
>> +    native_swapgs();
>> +    } else {
>> +    rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
>> +    }
>>     return gsbase;
>>   }
>> @@ -182,9 +186,13 @@ static noinstr void __wrgsbase_inactive(unsigned
>> long gsbase)
>>   {
>>   lockdep_assert_irqs_disabled();
>>   -    native_swapgs();
>> -    wrgsbase(gsbase);
>> -    native_swapgs();
>> +    if (!static_cpu_has(X86_FEATURE_XENPV)) {
>> +    native_swapgs();
>> +    wrgsbase(gsbase);
>> +    native_swapgs();
>> +    } else {
>> +    wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
>> +    }
>>   }
>>     /*
>>
>
> Another possibility would be to just do (I'm fine either way):
>
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index acc49fa6a097..b78dd373adbf 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -318,6 +318,8 @@ static void __init xen_init_capabilities(void)
>  setup_clear_cpu_cap(X86_FEATURE_XSAVE);
>  setup_clear_cpu_cap(X86_FEATURE_OSXSAVE);
>  }
> +
> +    setup_clear_cpu_cap(X86_FEATURE_FSGSBASE);

That will stop both userspace and Xen (side effect of the guest kernel's
CR4 choice) from using the instructions.

Even when the kernel is using the paravirt fastpath, it's still Xen
actually taking the hit.  MSR_{FS,GS}_BASE/SHADOW are thousands of
cycles to access, whereas {RD,WR}{FS,GS}BASE are a handful.

~Andrew


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 19:26, Andy Lutomirski wrote:
> On Tue, Jun 23, 2020 at 8:23 AM Andrew Cooper  
> wrote:
>> On 23/06/2020 14:03, Peter Zijlstra wrote:
>>> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
>>>> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
>>>>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
>>>>> you to only make it IST if/when you try and make SNP happen, not before.
>>>> It is not the only reason, when ES guests gain debug register support
>>>> then #VC also needs to be IST, because #DB can be promoted into #VC
>>>> then, and as #DB is IST for a reason, #VC needs to be too.
>>> Didn't I read somewhere that that is only so for Rome/Naples but not for
>>> the later chips (Milan) which have #DB pass-through?
>> I don't know about hardware timelines, but some future part can now opt
>> in to having debug registers as part of the encrypted state, and swapped
>> by VMExit, which would make debug facilities generally usable, and
>> supposedly safe to the #DB infinite loop issues, at which point the
>> hypervisor need not intercept #DB for safety reasons.
>>
>> It's worth noting that on current parts, the hypervisor can set up debug
>> facilities on behalf of the guest (or behind its back) as the DR state
>> is unencrypted, but that attempting to intercept #DB will redirect to
>> #VC inside the guest and cause fun. (Also spare a thought for 32bit
>> kernels which have to cope with userspace singlestepping the SYSENTER
>> path with every #DB turning into #VC.)
> What do you mean 32-bit?  64-bit kernels have exactly the same
> problem.  At least the stack is okay, though.

:)

AMD-like CPUs disallow SYSENTER/SYSEXIT in Long Mode, and raise #UD,
even from a compatibility mode segment.

64bit kernels only have this problem on Intel-like CPUs.

(It is a massive shame that between everyone's attempts, there are 0
"fast system call" instructions with sane semantics, but it is several
decades late to fix this problem...)

~Andrew


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 16:23, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:59:14PM +0200, Joerg Roedel wrote:
>> On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
>>> +noinstr void idtentry_validate_ist(struct pt_regs *regs)
>>> +{
>>> +   if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
>>> +   (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
>>> +   die("IST stack recursion", regs, 0);
>>> +}
>> Yes, this is a start, it doesn't cover the case where the NMI stack is
>> in-between, so I think you need to walk down regs->sp too.
> That shouldn't be possible with the current code, I think.

NMI; #MC; Anything which IRET but isn't fatal - #DB, or #BP from
patching, #GP from *_safe(), etc; NMI

Sure it's a corner case, but did you hear that IST is evil?

~Andrew

P.S. did you also hear that with Rowhammer, userspace has a nonzero
quantity of control over generating #MC, depending on how ECC is
configured on the platform.


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 16:16, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:49:40PM +0200, Joerg Roedel wrote:
>>> We're talking about the 3rd case where the only reason things 'work' is
>>> because we'll have to panic():
>>>
>>>  - #MC
>> Okay, #MC is special and can only be handled on a best-effort basis, as
>> #MC could happen anytime, also while already executing the #MC handler.
> I think the hardware has a MCE-mask bit somewhere. Flaky though because
> clearing it isn't 'atomic' with IRET, so there's a 'funny' window.

MSR_MCG_STATUS.MCIP, and yes - any #MC before that point will
immediately Shutdown.  Any #MC between that point and IRET will clobber
its IST stack and end up sad.

> It also interacts really bad with the NMI handler. If we get an #MC
> early in the NMI, where we hard-rely on the NMI-mask being set to set-up
> the recursion stack, then the #MC IRET will clear the NMI-mask, and
> we're toast.
>
> Andy has wild and crazy ideas, but I don't think we need more crazy
> here.

Want, certainly not.  Need, maybe :-/
>>>  - #DB with BUS LOCK DEBUG EXCEPTION
>> If I understand the problem correctly, this can be solved by moving off
>> the IST stack to the current task stack in the #DB handler, like I plan
>> to do for #VC, no?
> Hmm, probably. Would take a bit of care, but should be doable.

Andy and I have spent some time considering this.

Having exactly one vector move off IST and onto an in-use task-stack is
doable, I think, so long as it can sort out self-recursion protections.

Having more than one vector do this is very complicated.  You've got to
take care to walk up the list of IST-nesting to see if any interrupted
context is in the middle of trying to copy itself onto the stack, so
you don't clobber the frame it's in the middle of copying.

~Andrew


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 14:03, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
>> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
>>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
>>> you to only make it IST if/when you try and make SNP happen, not before.
>> It is not the only reason, when ES guests gain debug register support
>> then #VC also needs to be IST, because #DB can be promoted into #VC
>> then, and as #DB is IST for a reason, #VC needs to be too.
> Didn't I read somewhere that that is only so for Rome/Naples but not for
> the later chips (Milan) which have #DB pass-through?

I don't know about hardware timelines, but some future part can now opt
in to having debug registers as part of the encrypted state, and swapped
by VMExit, which would make debug facilities generally usable, and
supposedly safe against the #DB infinite loop issues, at which point the
hypervisor need not intercept #DB for safety reasons.

It's worth noting that on current parts, the hypervisor can set up debug
facilities on behalf of the guest (or behind its back) as the DR state
is unencrypted, but that attempting to intercept #DB will redirect to
#VC inside the guest and cause fun. (Also spare a thought for 32bit
kernels which have to cope with userspace singlestepping the SYSENTER
path with every #DB turning into #VC.)

>> Besides that, I am not a fan of delegating problems I already see coming
>> to future-Joerg and future-Peter, but if at all possible deal with them
>> now and be safe later.
> Well, we could just say no :-) At some point in the very near future
> this house of cards is going to implode.

What currently exists is a picture of a house of cards in front of
something which has fallen down.

> Did someone forget to pass the 'ISTs are *EVIL*' memo to the hardware
> folks? How come we're getting more and more of them?

I have tried to get this point across.  Then again - it's far easier for
the software folk in the same company as the hardware folk to make this
point.

> (/me puts fingers
> in ears and goes la-la-la-la in anticipation of Andrew mentioning CET)

I wasn't going to bring it up, but seeing as you have - while there are
prohibitively-complicating issues preventing it from working on native,
I don't see any point even considering it for the mess which is #VC, or
the even bigger mess which is #HV.

~Andrew


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 13:47, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 12:51:03PM +0100, Andrew Cooper wrote:
>
>> There are cases which are definitely non-recoverable.
>>
>> For both ES and SNP, a malicious hypervisor can mess with the guest
>> physmap to make the NMI, #VC and #DF stacks all alias.
>>
>> For ES, this had better result in the #DF handler deciding that crashing
>> is the way out, whereas for SNP, this had better escalate to Shutdown.
>> Crashing out hard if the hypervisor is misbehaving is acceptable.
> Then I'm thinking the only sensible option is to crash hard for any SNP
> #VC from kernel mode.
>
> Sadly that doesn't help with #VC needing to be IST :-( IST is such a
> frigging nightmare.

I presume you mean any #VC caused by RMP faults (i.e. something went
wrong with the memory owner/etc metadata)?

If so, then yes.  Any failure here is a bug in the kernel or hypervisor
(and needs fixing) or a malicious hypervisor and the guest should
terminate for its own safety.

~Andrew


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 12:30, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:07:06PM +0200, Peter Zijlstra wrote:
>> On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:
>> So what happens if this #VC triggers on the first access to the #VC
>> stack, because the malicious host has craftily mucked with only the #VC
>> IST stack page?
>>
>> Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
>> you up.
>>
>> AFAICT all of that is non-recoverable.
> I am not 100% sure, but I think if the #VC stack page is not validated,
> the #VC should be promoted to a #DF.
>
> Note that this is an issue only with secure nested paging (SNP), which
> is not enabled yet with this patch-set. When it gets enabled a stack
> recursion check in the #VC handler is needed which panics the VM. That
> also fixes the #VC-in-early-NMI problem.

There are cases which are definitely non-recoverable.

For both ES and SNP, a malicious hypervisor can mess with the guest
physmap to make the NMI, #VC and #DF stacks all alias.

For ES, this had better result in the #DF handler deciding that crashing
is the way out, whereas for SNP, this had better escalate to Shutdown.


What matters here is the security model in SNP.  The hypervisor is
relied upon for availability (because it could simply refuse to schedule
the VM), but market/business forces will cause it to do its best to keep
the VM running.  Therefore, the security model is simply(?) that the
hypervisor can't do anything to undermine the confidentiality or
integrity of the VM.

Crashing out hard if the hypervisor is misbehaving is acceptable.  In a
cloud, I as a customer would (threaten to?) take my credit card
elsewhere, while for enterprise, I'd shout at my virtualisation vendor
until a fix happened (also perhaps threaten to take my credit card
elsewhere).

Therefore, it is reasonable to build on the expectation that the
hypervisor won't be messing around with remapping stacks/etc.  Most
#VC's are synchronous with guest actions (they equate to actions which
would have caused a VMExit), so you can safely reason about when the
first #VC might occur, presuming no funny business with the frames
backing any memory operands touched.

~Andrew


Re: 5.7.0 / BUG: kernel NULL pointer dereference / setup_cpu_watcher

2020-06-05 Thread Andrew Cooper
On 05/06/2020 09:36, Christian Kujau wrote:
> Hi,
>
> I'm running a small Xen PVH domain and upgrading from vanilla 5.6.0 to 
> 
>
> Note: that "Xen Platform PCI: unrecognised magic value" on the top appears 
> in 5.6 kernels as well, but no ill effects so far.
>
> ---
> Xen Platform PCI: unrecognised magic value

PVH domains don't have the emulated platform device, so Linux will be
finding ~0 when it goes looking in config space.

The diagnostic should be skipped in that case, to avoid giving the false
impression that something is wrong.

~Andrew


Re: [patch V9 00/39] x86/entry: Rework leftovers (was part V)

2020-06-03 Thread Andrew Cooper
On 22/05/2020 22:17, Peter Zijlstra wrote:
> On Fri, May 22, 2020 at 08:20:15AM +0100, Andrew Cooper wrote:
>> Apologies for opening a related can of worms.
>>
>> The new debug_enter() has propagated a pre-existing issue forward,
>> ultimately caused by bad advice in the SDM.
>>
>> Because the RTM status bit in DR6 has inverted polarity, writing DR6 to
>> 0 causes RTM to appear asserted to any logic which cares, despite RTM
>> debugging not being enabled.  The same is true in principle for what is
>> handed to userspace via u_debugreg[DR_STATUS].
>>
>> On the subject of DR6, the SDM now reads:
>>
>> "Certain debug exceptions may clear bits 0-3. The remaining contents of
>> the DR6 register are never cleared by the processor. To avoid confusion
>> in identifying debug exceptions, debug handlers should clear the
>> register (except bit 16, which they should set) before returning to the
>> interrupted task."
> *URGH*
>
>> First of all, that should read "are never de-asserted by the processor"
>> rather than "cleared", but the advice has still failed to learn from its
>> first mistake.  The forward-compatible way to fix this is to set
>> DR6_DEFAULT (0xffff0ff0) which also covers future inverted polarity bits.
>>
>> As for what to do about userspace, that is harder.  One approach is to
>> express everything in terms of positive polarity (i.e. pass on dr6 ^
>> DR6_DEFAULT), so DR6_RTM only appears set when RTM debugging is
>> enabled.  This approach is already taken with the VMCS PENDING_DBG
>> field, so there is at least previous form.
>>
>> I realise that "do nothing" might be acceptable at this point, given the
>> lack of support for RTM debugging.
> This! I'm thinking "do nothing" is, at this moment, the right thing to
> do. If/when someone goes and tries to make RTM debugging work, they get
> to figure out how to deal with this mess.

Well that didn't last long...

The new ISE (rev 39, published today) introduces BUS LOCK DEBUG
EXCEPTION which is now a second inverted polarity sticky bit (bit 11) in
%dr6.

This one is liable to get more traction than RTM debugging, so something
probably does want fixing in the #DB handler.

~Andrew


Re: [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT

2020-05-30 Thread Andrew Cooper
On 29/05/2020 22:27, Peter Zijlstra wrote:
> From: Lai Jiangshan 
>
> A data breakpoint on the GDT is terrifying and should be avoided.
> The GDT on CPU entry area is already protected. The direct GDT
> should be also protected, although it is seldom used and only
> used for short time.

While I agree with the sentiment...

>
> Signed-off-by: Lai Jiangshan 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: https://lkml.kernel.org/r/20200526014221.2119-3-la...@linux.alibaba.com
> ---
>  arch/x86/kernel/hw_breakpoint.c |   30 ++
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Per cpu debug control register value */
>  DEFINE_PER_CPU(unsigned long, cpu_dr7);
> @@ -237,13 +238,26 @@ static inline bool within_area(unsigned
>  }
>  
>  /*
> - * Checks whether the range from addr to end, inclusive, overlaps the CPU
> - * entry area range.
> + * Checks whether the range from addr to end, inclusive, overlaps the fixed
> + * mapped CPU entry area range or other ranges used for CPU entry.
>   */
> -static inline bool within_cpu_entry_area(unsigned long addr, unsigned long 
> end)
> +static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
>  {
> - return within_area(addr, end, CPU_ENTRY_AREA_BASE,
> -CPU_ENTRY_AREA_TOTAL_SIZE);
> + int cpu;
> +
> + /* CPU entry erea is always used for CPU entry */
> + if (within_area(addr, end, CPU_ENTRY_AREA_BASE,
> + CPU_ENTRY_AREA_TOTAL_SIZE))
> + return true;
> +
> + for_each_possible_cpu(cpu) {
> + /* The original rw GDT is being used after load_direct_gdt() */
> + if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
> + GDT_SIZE))

... why the O(n) loop over the system?

It is only GDTs which might ever be active on this local CPU(/thread)
which are a problem, because the breakpoint registers are similarly local.

Nothing is going to go wrong if I put a breakpoint on someone else's
live GDT, because they won't interact in the "fun" ways we're trying to
avoid.
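
(For illustration only, the shape of the check I have in mind, reusing
the within_area()/get_cpu_gdt_rw() helpers already in play above - a
sketch, not a drop-in replacement for the hunk:)

	/* Only the GDT which can be live on this CPU matters, because the
	 * breakpoint registers being programmed are per-CPU state too. */
	if (within_area(addr, end,
			(unsigned long)get_cpu_gdt_rw(smp_processor_id()),
			GDT_SIZE))
		return true;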

~Andrew


Re: [PATCH 1/6] x86/entry: Introduce local_db_{save,restore}()

2020-05-28 Thread Andrew Cooper
On 28/05/2020 22:15, Peter Zijlstra wrote:
> On Thu, May 28, 2020 at 09:52:30PM +0100, Andrew Cooper wrote:
>> On 28/05/2020 21:19, Peter Zijlstra wrote:
>>> --- a/arch/x86/include/asm/debugreg.h
>>> +++ b/arch/x86/include/asm/debugreg.h
>>> @@ -113,6 +113,31 @@ static inline void debug_stack_usage_inc
>>>  static inline void debug_stack_usage_dec(void) { }
>>>  #endif /* X86_64 */
>>>  
>>> +static __always_inline void local_db_save(unsigned long *dr7)
>>> +{
>>> +   get_debugreg(*dr7, 7);
>>> +   if (*dr7)
>>> +   set_debugreg(0, 7);
>> %dr7 has an architecturally stuck bit in it.
>>
>> You want *dr7 != 0x400 to avoid writing 0 unconditionally.
> Do we have to have that bit set when writing it? Otherwise I might
> actually prefer masking it out.

Not currently.  I guess it depends on how likely %dr7 is to gain an
inverted polarity bit like %dr6 did.

~Andrew


Re: [PATCH 1/6] x86/entry: Introduce local_db_{save,restore}()

2020-05-28 Thread Andrew Cooper
On 28/05/2020 21:19, Peter Zijlstra wrote:
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -113,6 +113,31 @@ static inline void debug_stack_usage_inc
>  static inline void debug_stack_usage_dec(void) { }
>  #endif /* X86_64 */
>  
> +static __always_inline void local_db_save(unsigned long *dr7)
> +{
> + get_debugreg(*dr7, 7);
> + if (*dr7)
> + set_debugreg(0, 7);

%dr7 has an architecturally stuck bit in it.

You want *dr7 != 0x400 to avoid writing 0 unconditionally.

Also, API wise, wouldn't it be nicer to write "dr7 = local_db_save()"
rather than having a void function returning a single long via pointer?
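
Something like this, say - a sketch only, with 0x400 being the
architecturally-stuck bit 10 and get/set_debugreg() as already used above:

	static __always_inline unsigned long local_db_save(void)
	{
		unsigned long dr7;

		get_debugreg(dr7, 7);
		if (dr7 != 0x400)
			set_debugreg(0, 7);

		return dr7;
	}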

~Andrew


Re: Possibility of conflicting memory types in lazier TLB mode?

2020-05-27 Thread Andrew Cooper
On 27/05/2020 01:09, Andy Lutomirski wrote:
> [cc Andrew Cooper and Dave Hansen]
>
> On Fri, May 15, 2020 at 7:35 PM Nicholas Piggin  wrote:
>> Excerpts from Rik van Riel's message of May 16, 2020 5:24 am:
>>> On Fri, 2020-05-15 at 16:50 +1000, Nicholas Piggin wrote:
>>>> But what about if there are (real, not speculative) stores in the
>>>> store
>>>> queue still on the lazy thread from when it was switched, that have
>>>> not
>>>> yet become coherent? The page is freed by another CPU and reallocated
>>>> for something that maps it as nocache. Do you have a coherency
>>>> problem
>>>> there?
>>>>
>>>> Ensuring the store queue is drained when switching to lazy seems like
>>>> it
>>>> would fix it, maybe context switch code does that already or you
>>>> have
>>>> some other trick or reason it's not a problem. Am I way off base
>>>> here?
>>> On x86, all stores become visible in-order globally.
>>>
>>> I suspect that
>>> means any pending stores in the queue
>>> would become visible to the rest of the system before
>>> the store to the "current" cpu-local variable, as
>>> well as other writes from the context switch code
>>> become visible to the rest of the system.
>>>
>>> Is that too naive a way of preventing the scenario you
>>> describe?
>>>
>>> What am I overlooking?
>> I'm concerned if the physical address gets mapped with different
>> cacheability attributes where that ordering is not enforced by cache
>> coherency
>>
>>  "The PAT allows any memory type to be specified in the page tables, and
>>  therefore it is possible to have a single physical page mapped to two
>>  or more different linear addresses, each with different memory types.
>>  Intel does not support this practice because it may lead to undefined
>>  operations that can result in a system failure. In particular, a WC
>>  page must never be aliased to a cacheable page because WC writes may
>>  not check the processor caches." -- Vol. 3A 11-35
>>
>> Maybe I'm over thinking it, and this would never happen anyway because
>> if anyone were to map a RAM page WC, they might always have to ensure
>> all processor caches are flushed first anyway so perhaps this is just a
>> non-issue?
>>
> After talking to Andrew Cooper (hi!), I think that, on reasonably
> modern Intel machines, WC memory is still *coherent* with the whole
> system -- it's just not ordered the usual way.

So actually, on further reading, Vol 3 11.3 states "coherency is not
enforced by the processor’s bus coherency protocol" and later in 11.3.1,
"The WC buffer is not snooped and thus does not provide data coherency".


So, it would seem like it is possible to engineer a situation where the
cache line is WB according to the caches, and has pending WC data in one
or more cores/threads.  The question is whether this manifests as a
problem in practice, or not.

When changing the memory type of a mapping, you typically need to do
break/flush/make to be SMP-safe.  The IPI, as well as the TLB flush are
actions which cause WC buffers to be flushed.
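
(Roughly, as a sketch - the three helper names below are stand-ins for
whatever the pagetable code actually provides, not real kernel APIs:)

	unmap_page(vaddr);                    /* break: remove the old WB mapping */
	flush_tlb_everywhere();               /* flush: IPI + TLB flush, which also drain the WC buffers */
	map_page(vaddr, pfn, WC_ATTRIBUTES);  /* make: install the mapping with the new type */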

x86 will tolerate a make/flush sequence as well.  In this scenario, a
3rd core/thread could pick up the line via its WB property, use it, and
cause it to be written back, between the pagetable change and the IPI
hitting.

But does this matter?  WC is by definition weakly ordered writes for
more efficient bus usage.  The device at the far end can't tell whether
the incoming write was from a WC or a WB eviction, and any late WC
evictions are semantically indistinguishable from general concurrency
hazards with multiple writers.

~Andrew


Re: [PATCH v10 01/26] Documentation/x86: Add CET description

2020-05-22 Thread Andrew Cooper
On 22/05/2020 17:49, Peter Zijlstra wrote:
> On Sat, May 16, 2020 at 03:09:22PM +0100, Andrew Cooper wrote:
>
>> Sadly, the same is not true for kernel shadow stacks.
>>
>> SSP is 0 after SYSCALL, SYSENTER and CLRSSBSY, and you've got to be
>> careful to re-establish the shadow stack before a CALL, interrupt or
>> exception tries pushing a word onto the shadow stack at 0xfff8.
> Oh man, I can only imagine the joy that brings to #NM and friends :-(

Establishing a supervisor shadow stack for the first time involves a
large leap of faith, even by usual x86 standards.

You need to have prepared MSR_PL0_SSP with correct mappings and
supervisor tokens, such that when you enable CR4.CET and
MSR_S_CET.SHSTK_EN, your SETSSBSY instruction succeeds at its atomic
"check the token and set the busy bit" shadow stack access.  Any failure
here tends to be a triple fault, and I didn't get around to figuring out
why #DF wasn't taken cleanly.

You also need to have prepared MSR_IST_SSP beforehand with the IST
shadow stack pointers matching any IST configuration in the IDT, lest a
NMI ruins your day on the instruction boundary before SETSSBSY.
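
(For reference, the rough shape of that bring-up in kernel-style C.  The
MSR/CR4 names are the ones current kernels use; ssp_token and
ist_ssp_table are assumed to have been prepared beforehand, and this is a
sketch rather than the Xen or Linux implementation:)

	wrmsrl(MSR_IA32_PL0_SSP, ssp_token);         /* token which SETSSBSY will claim */
	wrmsrl(MSR_IA32_INT_SSP_TAB, ist_ssp_table); /* IST shadow stack pointers, before any IST event can hit */
	cr4_set_bits(X86_CR4_CET);
	wrmsrl(MSR_IA32_S_CET, CET_SHSTK_EN);
	asm volatile ("setssbsy");                   /* atomically check the token and set busy; any failure is fatal */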

A less obvious side effect of these "windows with an SSP of 0" is that
you're now forced to use IST for all non-maskable interrupts/exceptions,
even if you choose not to use SYSCALL, and you no longer need IST to
remove the risks of a userspace privilege escalation, and would prefer
not to use IST because of its problematic reentrancy characteristics.

For anyone counting the number of IST-necessary vectors across all
potential configurations in modern hardware, its #DB, NMI, #DF, #MC,
#VE, #HV, #VC and #SX, and an architectural limit of 7.

There are several other amusing aspects, such as iret-to-self needing to
use call-oriented-programming to keep itself shadow-stack-safe, or the
fact that IRET to user mode doesn't fault if it fails to clear the
supervisor busy bit, instead leaving you to double fault at some point
in the future at the next syscall/interrupt/exception because the stack
is still busy.

~Andrew

P.S. For anyone interested,
https://lore.kernel.org/xen-devel/20200501225838.9866-1-andrew.coop...@citrix.com/T/#u
for getting supervisor shadow stacks working on Xen, which is far
simpler to manage than Linux.  I do not envy whoever has the fun of
trying to make this work for Linux.


Re: [patch V9 00/39] x86/entry: Rework leftovers (was part V)

2020-05-22 Thread Andrew Cooper
On 21/05/2020 21:05, Thomas Gleixner wrote:
> Folks!
>
> This is V9 of the rework series. V7 and V8 were never posted but I used the
> version numbers for tags while fixing up 0day complaints. The last posted
> version was V6 which can be found here:
>
>   https://lore.kernel.org/r/20200515234547.710474...@linutronix.de
>
> The V9 leftover series is based on:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/entry
>
> That branch contains the merged part 1-4 of the original 5 part series.
>
> V9 has the following changes vs. V6:
>
>- Rebase on tip x86/entry

Apologies for opening a related can of worms.

The new debug_enter() has propagated a pre-existing issue forward,
ultimately caused by bad advice in the SDM.

Because the RTM status bit in DR6 has inverted polarity, writing DR6 to
0 causes RTM to appear asserted to any logic which cares, despite RTM
debugging not being enabled.  The same is true in principle for what is
handed to userspace via u_debugreg[DR_STATUS].

On the subject of DR6, the SDM now reads:

"Certain debug exceptions may clear bits 0-3. The remaining contents of
the DR6 register are never cleared by the processor. To avoid confusion
in identifying debug exceptions, debug handlers should clear the
register (except bit 16, which they should set) before returning to the
interrupted task."

First of all, that should read "are never de-asserted by the processor"
rather than "cleared", but the advice has still failed to learn from its
first mistake.  The forward-compatible way to fix this is to set
DR6_DEFAULT (0xffff0ff0) which also covers future inverted polarity bits.

As for what to do about userspace, that is harder.  One approach is to
express everything in terms of positive polarity (i.e. pass on dr6 ^
DR6_DEFAULT), so DR6_RTM only appears set when RTM debugging is
enabled.  This approach is already taken with the VMCS PENDING_DBG
field, so there is at least previous form.
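
(Concretely, something like the following, where DR6_DEFAULT is assumed
to be the architectural default value 0xffff0ff0:)

	unsigned long dr6;

	get_debugreg(dr6, 6);
	set_debugreg(DR6_DEFAULT, 6);  /* reset DR6 without asserting the inverted-polarity bits */
	dr6 ^= DR6_DEFAULT;            /* positive polarity: DR6_RTM reads as set only when actually asserted */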

I realise that "do nothing" might be acceptable at this point, given the
lack of support for RTM debugging.

Thanks,

~Andrew


Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY

2020-05-20 Thread Andrew Cooper
On 20/05/2020 09:06, Jürgen Groß wrote:
> On 19.05.20 21:44, Andy Lutomirski wrote:
>> On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner 
>> wrote:
>>>
>>> Andy Lutomirski  writes:
 B: Turn this thing around.  Specifically, in the one and only case we
 care about, we know pretty much exactly what context we got this entry
 in: we're running in a schedulable context doing an explicitly
 preemptible hypercall, and we have RIP pointing at a SYSCALL
 instruction (presumably, but we shouldn't bet on it) in the hypercall
 page.  Ideally we would change the Xen PV ABI so the hypercall would
 return something like EAGAIN instead of auto-restarting and we could
 ditch this mess entirely.  But the ABI seems to be set in stone or at
 least in molasses, so how about just:

 idt_entry(exit(regs));
 if (inhcall && need_resched())
    schedule();
>>>
>>> Which brings you into the situation that you call schedule() from the
>>> point where we just moved it out. If we would go there we'd need to
>>> ensure that RCU is watching as well. idtentry_exit() might have it
>>> turned off 
>>
>> I don't think this is possible.  Once you untangle all the wrappers,
>> the call sites are effectively:
>>
>> __this_cpu_write(xen_in_preemptible_hcall, true);
>> CALL_NOSPEC to the hypercall page
>> __this_cpu_write(xen_in_preemptible_hcall, false);
>>
>> I think IF=1 when this happens, but I won't swear to it.  RCU had
>> better be watching.
>
> Preemptible hypercalls are never done with interrupts off. To be more
> precise: they are only ever done during ioctl() processing.
>
> I can add an ASSERT() to xen_preemptible_hcall_begin() if you want.
>
>>
>> As I understand it, the one and only situation Xen wants to handle is
>> that an interrupt gets delivered during the hypercall.  The hypervisor
>> is too clever for its own good and deals with this by rewinding RIP to
>> the beginning of whatever instruction did the hypercall and delivers
>> the interrupt, and we end up in this handler.  So, if this happens,
>> the idea is to not only handle the interrupt but to schedule if
>> scheduling would be useful.
>
> Correct. More precise: the hypercalls in question can last very long
> (up to several seconds) and so they need to be interruptible. As said
> before: the interface how this is done is horrible. :-(

Forget seconds.  DOMCTL_domain_kill gets to ~14 minutes for a 2TB domain.

The reason for the existing logic is to be able voluntarily preempt.

It doesn't need to remain the way it is, but some adequate form of
pre-emption does need to stay.

~Andrew


Re: [PATCH v10 01/26] Documentation/x86: Add CET description

2020-05-16 Thread Andrew Cooper
On 16/05/2020 03:37, H.J. Lu wrote:
> On Fri, May 15, 2020 at 5:13 PM Andrew Cooper  
> wrote:
>> Finally seeing as the question was asked but not answered, it is
>> actually quite easy to figure out whether shadow stacks are enabled in
>> the current thread.
>>
>> mov $1, %eax
>> rdsspd  %eax
> This is for 32-bit mode.

It actually works for both, if all you need is a shstk yes/no check.

Usually, you also want SSP in the yes case, so substitute rdsspq %rax as
appropriate.

(On a tangent - binutils mandating the D/Q suffixes is very irritating
with mixed 32/64bit code because you have to #ifdef your instructions
despite the register operands being totally unambiguous.  Also, D is the
wrong suffix for AT&T syntax, and should be L.  Frankly - the Intel
manuals are wrong and should not have the operand size suffix included
in the opcode name, as they are consistent with all the other
instructions in this regard.)

>   I use
>
> /* Check if shadow stack is in use.  */
> xorl%esi, %esi
> rdsspq  %rsi
> testq   %rsi, %rsi
> /* Normal return if shadow stack isn't in use.  */
> je  L(no_shstk)

This is probably fine for user code, as I don't think it would be
legitimate for shstk to be enabled, with SSP being 0.

Sadly, the same is not true for kernel shadow stacks.

SSP is 0 after SYSCALL, SYSENTER and CLRSSBSY, and you've got to be
careful to re-establish the shadow stack before a CALL, interrupt or
exception tries pushing a word onto the shadow stack at 0xfff8.

It is a very good (lucky?) thing that frame is unmapped for other
reasons, because this corner case does not protect against multiple
threads/cores using the same shadow stack concurrently.

~Andrew


Re: [PATCH v10 01/26] Documentation/x86: Add CET description

2020-05-15 Thread Andrew Cooper
On 15/05/2020 23:43, Dave Hansen wrote:
> On 5/15/20 2:33 PM, Yu-cheng Yu wrote:
>> On Fri, 2020-05-15 at 11:39 -0700, Dave Hansen wrote:
>>> On 5/12/20 4:20 PM, Yu-cheng Yu wrote:
>>> Can a binary compiled with CET run without CET?
>> Yes, but a few details:
>>
>> - The shadow stack is transparent to the application.  A CET application does
>> not have anything different from a non-CET application.  However, if a CET
>> application uses any CET instructions (e.g. INCSSP), it must first check if 
>> CET
>> is turned on.
>> - If an application is compiled for IBT, the compiler inserts ENDBRs at 
>> branch
>> targets.  These are nops if IBT is not on.
> I appreciate the detailed response, but it wasn't quite what I was
> asking.  Let's ignore IBT for now and just talk about shadow stacks.
>
> An app compiled with the new ELF flags and running on a CET-enabled
> kernel and CPU will start off with shadow stacks allocated and enabled,
> right?  It can turn its shadow stack off per-thread with the new prctl.
>  But, otherwise, it's stuck, the only way to turn shadow stacks off at
> startup would be editing the binary.
>
> Basically, if there ends up being a bug in an app that violates the
> shadow stack rules, the app is broken, period.  The only recourse is to
> have the kernel disable CET and reboot.
>
> Is that right?

If I may interject with the experience of having got supervisor shadow
stacks working for Xen.

Turning shadow stacks off is quite easy - clear MSR_U_CET.SHSTK_EN and
the shadow stack will stay in whatever state it was in, and you can
largely forget about it.  (Of course, in a sandbox scenario, it would be
prudent to prevent the ability to disable shadow stacks.)

Turning shadow stacks on is much more tricky.  You cannot enable it in
any function you intend to return from, as the divergence between the
stack and shadow stack will constitute a control flow violation.


When it comes to binaries,  you can reasonably arrange for clone() to
start a thread on a new stack/shstk, as you can prepare both stacks
suitably before execution starts.

You cannot reasonably implement a system call for "turn shadow stacks on
for me", because you'll crash on the ret out of the VDSO from the system
call.  It would be possible to conceive of an exec()-like system call
which is "discard my current stack, turn on shstk, and start me on this
new stack/shstk".

In principle, with a pair of system calls to atomically manage the shstk
settings and stack switching, it might be possible to construct a
`run_with_shstk_enabled(func, stack, shstk)` API which executes in the
current thread's context and doesn't explode.

Fork() is a problem when shadow stacks are disabled in the parent.  The
moment shadow stacks are disabled, the regular stack diverges from the
shadow stack.  A CET-enabled app which turns off shstk and then fork()'s
must have the child inherit the shstk-off property.  If the child were
to start with shstk enabled, it would explode almost immediately due to
the parent's stack divergence which it inherited.


Finally seeing as the question was asked but not answered, it is
actually quite easy to figure out whether shadow stacks are enabled in
the current thread.

    mov $1, %eax
    rdsspd  %eax
    cmp $1, %eax
    je  no_shstk
    ...
no_shstk:

rdssp is allocated from the hint nop encoding space, and the minimum
alignment of the shadow stack pointer is 4.  On older parts, or with
shstk disabled (either at the system level, or for the thread), the $1
will be preserved in %eax, while if CET is active, it will be clobbered
with something that has the bottom two bits clear.

It turns out this is a lifesaver for codepaths (e.g. the NMI handler)
which need to use other CET instructions which aren't from the hint nop
space, and run before the BSP can set everything up.
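
(Wrapped up in GNU C, the same check might look like the sketch below,
assuming a 64-bit target:)

	static inline bool shstk_active(void)
	{
		unsigned long ssp = 1;

		/* rdssp is a hint nop when shadow stacks are off (or on older
		 * parts), so the poison value of 1 survives; otherwise it is
		 * clobbered with the 4-byte-aligned SSP. */
		asm volatile ("rdsspq %0" : "+r" (ssp));

		return ssp != 1;
	}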

~Andrew


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-04-28 Thread Andrew Cooper
On 28/04/2020 08:55, Joerg Roedel wrote:
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
>> I have a somewhat serious question: should we use IST for #VC at all?
>> As I understand it, Rome and Naples make it mandatory for hypervisors
>> to intercept #DB, which means that, due to the MOV SS mess, it's sort
>> of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
>> if we're running under a sufficiently sensible hypervisor, we don't
>> need IST for #VC.
> The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
> page ownership tracking between guest and host, so that the hypervisor
> can't remap guest pages without the guest noticing.
>
> If there is a violation of ownership, which can happen at any memory
> access, there will be a #VC exception to notify the guest. And as this
> can happen anywhere, for example on a carefully crafted stack page set
> by userspace before doing SYSCALL, the only robust choice for #VC is to
> use IST.

The kernel won't ever touch the guest stack before restoring %rsp in the
syscall path, but the (minimum 2) memory accesses required to save the
user %rsp and load the kernel stack may be subject to #VC exceptions, as
are instruction fetches at the head of the SYSCALL path.

So yes - #VC needs IST.

Sorry for the noise.  (That said, it is unfortunate that the hypervisor
messing with the memory backing the guest #VC handler results in an
infinite loop, rather than an ability to cleanly terminate.)

~Andrew


Re: [Xen-devel] [PATCH] ARM: xen: unexport HYPERVISOR_platform_op function

2019-09-06 Thread Andrew Cooper
On 06/09/2019 17:00, Arnd Bergmann wrote:
> On Fri, Sep 6, 2019 at 5:55 PM Andrew Cooper  
> wrote:
>> On 06/09/2019 16:39, Arnd Bergmann wrote:
>>> HYPERVISOR_platform_op() is an inline function and should not
>>> be exported. Since commit 15bfc2348d54 ("modpost: check for
>>> static EXPORT_SYMBOL* functions"), this causes a warning:
>>>
>>> WARNING: "HYPERVISOR_platform_op" [vmlinux] is a static EXPORT_SYMBOL_GPL
>>>
>>> Remove the extraneous export.
>>>
>>> Fixes: 15bfc2348d54 ("modpost: check for static EXPORT_SYMBOL* functions")
>>> Signed-off-by: Arnd Bergmann 
>> Something is wonky.  That symbol is (/ really ought to be) in the
>> hypercall page and most definitely not inline.
>>
>> Which tree is that changeset from?  I can't find the SHA.
> This is from linux-next, I think from the kbuild tree.

Thanks.

Julien/Stefano: Why are any of these hypercalls out-of-line?  ARM
doesn't use the hypercall page, and there is no argument translation
(not even in arm32 as there are no 5-argument hypercalls declared).

They'd surely be easier to implement with a few static inlines and some
common code, than to try and replicate the x86 side hypercall_page
interface?
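
(Something along these lines for arm64, assuming the documented Xen
calling convention of hypercall number in x16, arguments in x0..x4 and
"hvc #0xEA1" still holds - the function name is made up for illustration:)

	static inline long xen_hypercall_1(unsigned long op, unsigned long a1)
	{
		register unsigned long x16 asm("x16") = op;
		register unsigned long x0  asm("x0")  = a1;

		asm volatile ("hvc #0xEA1"
			      : "+r" (x0)
			      : "r" (x16)
			      : "memory");

		return x0;
	}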

~Andrew


Re: [Xen-devel] [PATCH] ARM: xen: unexport HYPERVISOR_platform_op function

2019-09-06 Thread Andrew Cooper
On 06/09/2019 16:39, Arnd Bergmann wrote:
> HYPERVISOR_platform_op() is an inline function and should not
> be exported. Since commit 15bfc2348d54 ("modpost: check for
> static EXPORT_SYMBOL* functions"), this causes a warning:
>
> WARNING: "HYPERVISOR_platform_op" [vmlinux] is a static EXPORT_SYMBOL_GPL
>
> Remove the extraneous export.
>
> Fixes: 15bfc2348d54 ("modpost: check for static EXPORT_SYMBOL* functions")
> Signed-off-by: Arnd Bergmann 

Something is wonky.  That symbol is (/ really ought to be) in the
hypercall page and most definitely not inline.

Which tree is that changeset from?  I can't find the SHA.

I hate to open a separate can of worms, but why are they tagged GPL? 
The Xen hypercall ABI, like the Linux syscall ABI, are specifically not
GPL.  Xen has as something very similar to (and probably derived from)
the Linux-syscall-note exception.

~Andrew


Re: [Xen-devel] [PATCH 09/11] swiotlb-xen: simplify cache maintainance

2019-09-06 Thread Andrew Cooper
On 06/09/2019 15:01, Christoph Hellwig wrote:
> On Fri, Sep 06, 2019 at 09:52:12AM -0400, Boris Ostrovsky wrote:
>> We need nop definitions of these two for x86.
>>
>> Everything builds now but that's probably because the calls are under
>> 'if (!dev_is_dma_coherent(dev))' which is always false so compiler
>> optimized is out. I don't think we should rely on that.
> That is how a lot of the kernel works.  Provide protypes only for code
> that is semantically compiled, but can't ever be called due to
> IS_ENABLED() checks.  It took me a while to get used to it, but it
> actually is pretty nice as the linker does the work for you to check
> that it really is never called.  Much better than say a BUILD_BUG_ON().

Yeah - it's a weird concept to get used to, but it results in much
clearer code.
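
(A toy example of the pattern, with made-up names: the prototype is
always visible, the definition only exists when CONFIG_FOO=y, and when
the option is off the call below is compiled away so the linker never
asks for the symbol - while any call which isn't compiled away becomes a
link error rather than a silent runtime surprise:)

	void foo_fixup(struct device *dev);     /* defined only when CONFIG_FOO=y */

	static void setup(struct device *dev)
	{
		if (IS_ENABLED(CONFIG_FOO))
			foo_fixup(dev);
	}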

~Andrew


Re: [Xen-devel] [PATCH -tip 0/2] x86: Prohibit kprobes on XEN_EMULATE_PREFIX

2019-09-05 Thread Andrew Cooper
On 05/09/2019 14:09, Masami Hiramatsu wrote:
> On Thu, 5 Sep 2019 20:32:24 +0900
> Masami Hiramatsu  wrote:
>
>> On Thu, 5 Sep 2019 08:54:17 +0100
>> Andrew Cooper  wrote:
>>
>>> On 05/09/2019 02:49, Masami Hiramatsu wrote:
>>>> On Wed, 4 Sep 2019 12:54:55 +0100
>>>> Andrew Cooper  wrote:
>>>>
>>>>> On 04/09/2019 12:45, Masami Hiramatsu wrote:
>>>>>> Hi,
>>>>>>
>>>>>> These patches allow x86 instruction decoder to decode
>>>>>> xen-cpuid which has XEN_EMULATE_PREFIX, and prohibit
>>>>>> kprobes to probe on it.
>>>>>>
>>>>>> Josh reported that the objtool can not decode such special
>>>>>> prefixed instructions, and I found that we also have to
>>>>>> prohibit kprobes to probe on such instruction.
>>>>>>
>>>>>> This series can be applied on -tip master branch which
>>>>>> has merged Josh's objtool/perf sharing common x86 insn
>>>>>> decoder series.
>>>>> The paravirtualised xen-cpuid is were you'll see it most in a regular
>>>>> kernel, but be aware that it is also used for testing purposes in other
>>>>> circumstances, and there is an equivalent KVM prefix which is used for
>>>>> KVM testing.
>>>> Good catch! I didn't notice that. Is that really same sequance or KVM uses
>>>> another sequence of instructions for KVM prefix?
>>> I don't know if you've spotted, but the prefix is a ud2a instruction
>>> followed by 'xen' in ascii.
>>>
>>> The KVM version was added in c/s 6c86eedc206dd1f9d37a2796faa8e6f2278215d2
> Hmm, I think I might misunderstand what the "emulate prefix"... that is not
> a prefix which replace actual prefix, but just works like an escape sequence.
> Thus the next instruction can have any x86 prefix, correct?

There is a bit of history here :)

Originally, 13 years ago, Xen invented the "Force Emulate Prefix", which
was the sequence:

ud2a; .ascii 'xen'; cpuid

which hit the #UD handler and was recognised as a request for
virtualised CPUID information.  This was for ring-deprivileged
virtualisation, and is needed because the CPUID instruction itself
doesn't trap to the hypervisor.

Following some security issues in our instruction emulator, I reused
this prefix with VT-x/SVM guests for testing purposes.  It behaves in a
similar manner - when enabled, it is recognised in #UD exception
intercept, and causes Xen to add 5 to the instruction pointer, then
emulate the instruction starting there.

Then various folk thought that having the same kind of ability to test
KVM's instruction emulator would be a good idea, so they borrowed the idea.

From a behaviour point of view, it is an opaque 5 bytes which means
"break into the hypervisor, then emulate the following instruction".

The name "prefix" is unfortunate.  It was named thusly because from the
programmer's point of view, it was something you put before the CPUID
instruction which wanted to be emulated.  It is not related to the x86
instruction concept of a prefix.
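
(In code terms it is just those five opaque bytes in front of the
instruction to be emulated; e.g. a guest wanting the virtualised CPUID
view might wrap it like this, with the leaf/subleaf preloaded by the
caller - a sketch, not the kernel's xen_cpuid():)

	static void xen_emulated_cpuid(unsigned int *eax, unsigned int *ebx,
				       unsigned int *ecx, unsigned int *edx)
	{
		/* ud2a (0f 0b), then 'x' 'e' 'n', then the instruction to emulate */
		asm volatile (".byte 0x0f, 0x0b, 0x78, 0x65, 0x6e; cpuid"
			      : "+a" (*eax), "=b" (*ebx), "+c" (*ecx), "=d" (*edx));
	}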

~Andrew


Re: [Xen-devel] [PATCH -tip 0/2] x86: Prohibit kprobes on XEN_EMULATE_PREFIX

2019-09-05 Thread Andrew Cooper
On 05/09/2019 10:26, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 09:53:32AM +0100, Andrew Cooper wrote:
>> On 05/09/2019 09:26, Peter Zijlstra wrote:
>>> On Thu, Sep 05, 2019 at 08:54:17AM +0100, Andrew Cooper wrote:
>>>
>>>> I don't know if you've spotted, but the prefix is a ud2a instruction
>>>> followed by 'xen' in ascii.
>>>>
>>>> The KVM version was added in c/s 6c86eedc206dd1f9d37a2796faa8e6f2278215d2
>>> While the Xen one disassebles to valid instructions, that KVM one does
>>> not:
>>>
>>> .text
>>> xen:
>>> ud2; .ascii "xen"
>>> kvm:
>>> ud2; .ascii "kvm"
>>>
>>> disassembles like:
>>>
>>>  :
>>>0:   0f 0b   ud2
>>>2:   78 65   js 69 
>>>4:   6e  outsb  %ds:(%rsi),(%dx)
>>> 0005 :
>>>5:   0f 0b   ud2
>>>7:   6b  .byte 0x6b
>>>8:   76 6d   jbe77 
>>>
>>> Which is a bit unfortunate I suppose. At least they don't appear to
>>> consume further bytes.
>> It does when you give objdump one extra byte to look at.
>>
>> 0005 :
>>    5:    0f 0b        ud2   
>>    7:    6b 76 6d 00      imul   $0x0,0x6d(%rsi),%esi
>>
>> I did try to point out that this property should have been checked
>> before settling on 'kvm' as the string.
> Bah you're right; when I write:
>
>   ud2; .ascii "kvm"; cpuid
>
> The output is gibberish :/

KVM could possibly amend this.  It is an off-by-default testing-only
interface.

Xen's is part of the ABI for ring-deprivileged guests, to force CPUID to
be the virtualised view on hardware without CPUID Faulting.

~Andrew


Re: [Xen-devel] [PATCH -tip 0/2] x86: Prohibit kprobes on XEN_EMULATE_PREFIX

2019-09-05 Thread Andrew Cooper
On 05/09/2019 09:26, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 08:54:17AM +0100, Andrew Cooper wrote:
>
>> I don't know if you've spotted, but the prefix is a ud2a instruction
>> followed by 'xen' in ascii.
>>
>> The KVM version was added in c/s 6c86eedc206dd1f9d37a2796faa8e6f2278215d2
> While the Xen one disassebles to valid instructions, that KVM one does
> not:
>
>   .text
> xen:
>   ud2; .ascii "xen"
> kvm:
>   ud2; .ascii "kvm"
>
> disassembles like:
>
>  :
>0:   0f 0b   ud2
>2:   78 65   js 69 
>4:   6e  outsb  %ds:(%rsi),(%dx)
> 0005 :
>5:   0f 0b   ud2
>7:   6b  .byte 0x6b
>8:   76 6d   jbe77 
>
> Which is a bit unfortunate I suppose. At least they don't appear to
> consume further bytes.

It does when you give objdump one extra byte to look at.

0005 <kvm>:
   5:    0f 0b        ud2   
   7:    6b 76 6d 00      imul   $0x0,0x6d(%rsi),%esi

I did try to point out that this property should have been checked
before settling on 'kvm' as the string.

As for the Xen prefix, its introduction pre-dates me substantially, and
I don't know whether the disassembly was considered, or we just got lucky.

IMO, the string 'Xen' would have been slightly nicer

0005 <xen>:
   5:    0f 0b        ud2   
   7:    58       pop    %rax
   8:    65 6e        outsb  %gs:(%rsi),(%dx)

but we're 13 years too late to amend this.

> I know it is water under the bridge at this point; but you could've used
> UD1 with a displacement with some 'unlikely' values. That way it
> would've decoded to a single instruction.
>
> Something like:
>
>   ud10x6e6578(%rax),%rax
>
> which spells out "xen\0" in the displacement:
>
>   48 0f b9 80 78 65 6e 00

:)

I seem to recall UD0 and UD1 being very late to the documentation party.

~Andrew


Re: [Xen-devel] [PATCH -tip 0/2] x86: Prohibit kprobes on XEN_EMULATE_PREFIX

2019-09-05 Thread Andrew Cooper
On 05/09/2019 02:49, Masami Hiramatsu wrote:
> On Wed, 4 Sep 2019 12:54:55 +0100
> Andrew Cooper  wrote:
>
>> On 04/09/2019 12:45, Masami Hiramatsu wrote:
>>> Hi,
>>>
>>> These patches allow x86 instruction decoder to decode
>>> xen-cpuid which has XEN_EMULATE_PREFIX, and prohibit
>>> kprobes to probe on it.
>>>
>>> Josh reported that the objtool can not decode such special
>>> prefixed instructions, and I found that we also have to
>>> prohibit kprobes to probe on such instruction.
>>>
>>> This series can be applied on -tip master branch which
>>> has merged Josh's objtool/perf sharing common x86 insn
>>> decoder series.
>> The paravirtualised xen-cpuid is where you'll see it most in a regular
>> kernel, but be aware that it is also used for testing purposes in other
>> circumstances, and there is an equivalent KVM prefix which is used for
>> KVM testing.
> Good catch! I didn't notice that. Is that really same sequance or KVM uses
> another sequence of instructions for KVM prefix?

I don't know if you've spotted, but the prefix is a ud2a instruction
followed by 'xen' in ascii.

The KVM version was added in c/s 6c86eedc206dd1f9d37a2796faa8e6f2278215d2

>
>> It might be better to generalise the decode support to "virtualisation
>> escape prefix" or something slightly more generic.
> Agreed, it is easy to expand it, we can switch the prefix template.
> Could you tell me where I should look? I will add it.

These are the only two I'm aware of.

~Andrew


Re: [Xen-devel] [PATCH -tip 0/2] x86: Prohibit kprobes on XEN_EMULATE_PREFIX

2019-09-04 Thread Andrew Cooper
On 04/09/2019 12:45, Masami Hiramatsu wrote:
> Hi,
>
> These patches allow x86 instruction decoder to decode
> xen-cpuid which has XEN_EMULATE_PREFIX, and prohibit
> kprobes to probe on it.
>
> Josh reported that the objtool can not decode such special
> prefixed instructions, and I found that we also have to
> prohibit kprobes to probe on such instruction.
>
> This series can be applied on -tip master branch which
> has merged Josh's objtool/perf sharing common x86 insn
> decoder series.

The paravirtualised xen-cpuid is where you'll see it most in a regular
kernel, but be aware that it is also used for testing purposes in other
circumstances, and there is an equivalent KVM prefix which is used for
KVM testing.

It might be better to generalise the decode support to "virtualisation
escape prefix" or something slightly more generic.

~Andrew


[tip:x86/apic] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-22 Thread tip-bot for Andrew Cooper
Commit-ID:  83b584d9c6a1494170abd3a8b24f41939b23d625
Gitweb: https://git.kernel.org/tip/83b584d9c6a1494170abd3a8b24f41939b23d625
Author: Andrew Cooper 
AuthorDate: Mon, 15 Jul 2019 16:16:41 +0100
Committer:  Thomas Gleixner 
CommitDate: Mon, 22 Jul 2019 10:12:33 +0200

x86/paravirt: Drop {read,write}_cr8() hooks

There is a lot of infrastructure for functionality which is used
exclusively in __{save,restore}_processor_state() on the suspend/resume
path.

cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
rest of the Local APIC state isn't a clever thing to be doing.

Delete the suspend/resume cr8 handling, which shrinks the size of struct
saved_context, and allows for the removal of both PVOPS.

Signed-off-by: Andrew Cooper 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Juergen Gross 
Link: https://lkml.kernel.org/r/20190715151641.29210-1-andrew.coop...@citrix.com

---
 arch/x86/include/asm/paravirt.h   | 12 
 arch/x86/include/asm/paravirt_types.h |  5 -
 arch/x86/include/asm/special_insns.h  | 24 
 arch/x86/include/asm/suspend_64.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c  |  1 -
 arch/x86/kernel/paravirt.c|  4 
 arch/x86/power/cpu.c  |  4 
 arch/x86/xen/enlighten_pv.c   | 15 ---
 8 files changed, 1 insertion(+), 66 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..69089d46f128 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -139,18 +139,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-#ifdef CONFIG_X86_64
-static inline unsigned long read_cr8(void)
-{
-   return PVOP_CALL0(unsigned long, cpu.read_cr8);
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   PVOP_VCALL1(cpu.write_cr8, x);
-}
-#endif
-
 static inline void arch_safe_halt(void)
 {
PVOP_VCALL0(irq.safe_halt);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..70b654f3ffe5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -119,11 +119,6 @@ struct pv_cpu_ops {
 
void (*write_cr4)(unsigned long);
 
-#ifdef CONFIG_X86_64
-   unsigned long (*read_cr8)(void);
-   void (*write_cr8)(unsigned long);
-#endif
-
/* Segment descriptor handling */
void (*load_tr_desc)(void);
void (*load_gdt)(const struct desc_ptr *);
diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 219be88a59d2..6d37b8fcfc77 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -73,20 +73,6 @@ static inline unsigned long native_read_cr4(void)
 
 void native_write_cr4(unsigned long val);
 
-#ifdef CONFIG_X86_64
-static inline unsigned long native_read_cr8(void)
-{
-   unsigned long cr8;
-   asm volatile("movq %%cr8,%0" : "=r" (cr8));
-   return cr8;
-}
-
-static inline void native_write_cr8(unsigned long val)
-{
-   asm volatile("movq %0,%%cr8" :: "r" (val) : "memory");
-}
-#endif
-
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline u32 rdpkru(void)
 {
@@ -200,16 +186,6 @@ static inline void wbinvd(void)
 
 #ifdef CONFIG_X86_64
 
-static inline unsigned long read_cr8(void)
-{
-   return native_read_cr8();
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   native_write_cr8(x);
-}
-
 static inline void load_gs_index(unsigned selector)
 {
native_load_gs_index(selector);
diff --git a/arch/x86/include/asm/suspend_64.h 
b/arch/x86/include/asm/suspend_64.h
index a7af9f53c0cb..35bb35d28733 100644
--- a/arch/x86/include/asm/suspend_64.h
+++ b/arch/x86/include/asm/suspend_64.h
@@ -34,7 +34,7 @@ struct saved_context {
 */
unsigned long kernelmode_gs_base, usermode_gs_base, fs_base;
 
-   unsigned long cr0, cr2, cr3, cr4, cr8;
+   unsigned long cr0, cr2, cr3, cr4;
u64 misc_enable;
bool misc_enable_saved;
struct saved_msrs saved_msrs;
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d3d075226c0a..8b54d8e3a561 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -62,7 +62,6 @@ int main(void)
ENTRY(cr2);
ENTRY(cr3);
ENTRY(cr4);
-   ENTRY(cr8);
ENTRY(gdt_desc);
BLANK();
 #undef ENTRY
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 0aa6256eedd8..59d3d2763a9e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -311,10 +311,6 @@ struct paravirt_patch_template pv_ops = {
.cpu.read_cr0   = native_read_cr0,
.cpu.write_cr0  = native_write_cr0,



Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-19 Thread Andrew Cooper
On 16/07/2019 01:05, Andy Lutomirski wrote:
> On Mon, Jul 15, 2019 at 4:30 PM Andrew Cooper  
> wrote:
>> On 15/07/2019 19:17, Nadav Amit wrote:
>>>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  
>>>> wrote:
>>>>
>>>> There is a lot of infrastructure for functionality which is used
>>>> exclusively in __{save,restore}_processor_state() on the suspend/resume
>>>> path.
>>>>
>>>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
>>>> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
>>>> rest of the Local APIC state isn't a clever thing to be doing.
>>>>
>>>> Delete the suspend/resume cr8 handling, which shrinks the size of struct
>>>> saved_context, and allows for the removal of both PVOPS.
>>> I think removing the interface for CR8 writes is also good to avoid
>>> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
>>> Priorities between CR8 and APIC”):
>>>
>>> "Operating software should implement either direct APIC TPR updates or CR8
>>> style TPR updates but not mix them. Software can use a serializing
>>> instruction (for example, CPUID) to serialize updates between MOV CR8 and
>>> stores to the APIC.”
>>>
>>> And native_write_cr8() did not even issue a serializing instruction.
>>>
>> Given its location, the one write_cr8() is bounded by two serialising
>> operations, so is safe in practice.
>>
>> However, I agree with the statement in the manual.  I could submit a v3
>> with an updated commit message, or let it be fixed on commit.  Whichever
>> is easiest.
>>
> I don't see anything wrong with the message.  If we actually used CR8
> for interrupt priorities, we wouldn't want it to serialize.  The bug
> is that the code that did the write_cr8() should have had a comment as
> to how it serialized against lapic_restore().  But that doesn't seem
> worth mentioning in the message, since, as noted, the real problem was
> that it nonsensically restored just TPR without restoring everything
> else.

Fair enough, in which case I'm happy with v2 as it is.

~Andrew


Re: [Xen-devel] [PATCH 1/2] xen/gntdev: replace global limit of mapped pages by limit per call

2019-07-18 Thread Andrew Cooper
On 18/07/2019 07:52, Juergen Gross wrote:
> Today there is a global limit of pages mapped via /dev/xen/gntdev set
> to 1 million pages per default.

The Xen default limit even for dom0 is 1024 pages * 16 entries per page,
which is far lower than this limit.

> There is no reason why that limit is
> existing, as total number of foreign mappings is limited by the

s/foreign/grant/ ?

> hypervisor anyway and preferring kernel mappings over userspace ones
> doesn't make sense.

It's probably also worth stating that this is a root-only device, which
further brings into question the user/kernel split.

>
> Additionally checking of that limit is fragile, as the number of pages
> to map via one call is specified in a 32-bit unsigned variable which
> isn't tested to stay within reasonable limits (the only test is the
> value to be <= zero, which basically excludes only calls without any
> mapping requested). So trying to map e.g. 0x pages while
> already nearly 100 pages are mapped will effectively lower the
> global number of mapped pages such that a parallel call mapping a
> reasonable amount of pages can succeed in spite of the global limit
> being violated.
>
> So drop the global limit and introduce per call limit instead.

It's probably worth talking about this new limit.  What is it trying to
protect?

~Andrew


Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andrew Cooper
On 15/07/2019 19:17, Nadav Amit wrote:
>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  wrote:
>>
>> There is a lot of infrastructure for functionality which is used
>> exclusively in __{save,restore}_processor_state() on the suspend/resume
>> path.
>>
>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
>> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
>> rest of the Local APIC state isn't a clever thing to be doing.
>>
>> Delete the suspend/resume cr8 handling, which shrinks the size of struct
>> saved_context, and allows for the removal of both PVOPS.
> I think removing the interface for CR8 writes is also good to avoid
> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
> Priorities between CR8 and APIC”):
>
> "Operating software should implement either direct APIC TPR updates or CR8
> style TPR updates but not mix them. Software can use a serializing
> instruction (for example, CPUID) to serialize updates between MOV CR8 and
> stores to the APIC.”
>
> And native_write_cr8() did not even issue a serializing instruction.
>

Given its location, the one write_cr8() is bounded by two serialising
operations, so is safe in practice.

However, I agree with the statement in the manual.  I could submit a v3
with an updated commit message, or let it be fixed on commit.  Whichever
is easiest.

~Andrew


Re: [Xen-devel] [PATCH 0/2] Remove 32-bit Xen PV guest support

2019-07-15 Thread Andrew Cooper
On 15/07/2019 18:28, Andy Lutomirski wrote:
> On Mon, Jul 15, 2019 at 9:34 AM Andi Kleen  wrote:
>> Juergen Gross  writes:
>>
>>> The long term plan has been to replace Xen PV guests by PVH. The first
>>> victim of that plan are now 32-bit PV guests, as those are used only
>>> rather seldom these days. Xen on x86 requires 64-bit support and with
>>> Grub2 now supporting PVH officially since version 2.04 there is no
>>> need to keep 32-bit PV guest support alive in the Linux kernel.
>>> Additionally Meltdown mitigation is not available in the kernel running
>>> as 32-bit PV guest, so dropping this mode makes sense from security
>>> point of view, too.
>> Normally we have a deprecation period for feature removals like this.
>> You would make the kernel print a warning for some releases, and when
>> no user complains you can then remove. If a user complains you can't.
>>
> As I understand it, the kernel rules do allow changes like this even
> if there's a complaint: this is a patch that removes what is
> effectively hardware support.  If the maintenance cost exceeds the
> value, then removal is fair game.  (Obviously we weight the value to
> preserving compatibility quite highly, but in this case, Xen dropped
> 32-bit hardware support a long time ago.  If the Xen hypervisor says
> that 32-bit PV guest support is deprecated, it's deprecated.)
>
> That being said, a warning might not be a bad idea.  What's the
> current status of this in upstream Xen?

So personally, I'd prefer to see support stay, but at the end of the day
it is Juergen's choice as the maintainer of the code.

Xen itself has been exclusively 64-bit since Xen 4.3 (released in 2013).

Over time, various features like SMEP/SMAP have been making 32bit PV
guests progressively slower, because ring 1 is supervisor rather than
user.  Things have got even worse with IBRS, to the point at which 32bit
PV guests are starting to run like treacle.

There are no current plans to remove support for 32bit PV guests from
Xen, but it is very much in the category of "you shouldn't be using this
mode any more".

~Andrew

P.S. I don't see 64bit PV guest support going anywhere, because there
are still a number of open performance questions due to the inherent
differences between syscall and vmexit, and the difference EPT/NPT
tables make on cross-domain mappings.


[PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andrew Cooper
There is a lot of infrastructure for functionality which is used
exclusively in __{save,restore}_processor_state() on the suspend/resume
path.

cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
rest of the Local APIC state isn't a clever thing to be doing.

Delete the suspend/resume cr8 handling, which shrinks the size of struct
saved_context, and allows for the removal of both PVOPS.

Signed-off-by: Andrew Cooper 
---
CC: x...@kernel.org
CC: virtualizat...@lists.linux-foundation.org
CC: Borislav Petkov 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Stephane Eranian 
CC: Feng Tang 
CC: Juergen Gross 
CC: Boris Ostrovsky 
CC: "Rafael J. Wysocki" 
CC: Pavel Machek 

Spotted while reviewing "x86/apic: Initialize TPR to block interrupts 16-31"

https://lore.kernel.org/lkml/dc04a9f8b234d7b0956a8d2560b8945bcd9c4bf7.1563117760.git.l...@kernel.org/

v2:
 * Drop saved_context.cr8 as well (Juergen)
 * Remove akata...@vmware.com from the CC list due to bounces
---
 arch/x86/include/asm/paravirt.h   | 12 
 arch/x86/include/asm/paravirt_types.h |  5 -
 arch/x86/include/asm/special_insns.h  | 24 
 arch/x86/include/asm/suspend_64.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c  |  1 -
 arch/x86/kernel/paravirt.c|  4 
 arch/x86/power/cpu.c  |  4 
 arch/x86/xen/enlighten_pv.c   | 15 ---
 8 files changed, 1 insertion(+), 66 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..0e4a0539c353 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -139,18 +139,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-#ifdef CONFIG_X86_64
-static inline unsigned long read_cr8(void)
-{
-   return PVOP_CALL0(unsigned long, cpu.read_cr8);
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   PVOP_VCALL1(cpu.write_cr8, x);
-}
-#endif
-
 static inline void arch_safe_halt(void)
 {
PVOP_VCALL0(irq.safe_halt);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3c775fb5524b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -119,11 +119,6 @@ struct pv_cpu_ops {
 
void (*write_cr4)(unsigned long);
 
-#ifdef CONFIG_X86_64
-   unsigned long (*read_cr8)(void);
-   void (*write_cr8)(unsigned long);
-#endif
-
/* Segment descriptor handling */
void (*load_tr_desc)(void);
void (*load_gdt)(const struct desc_ptr *);
diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 219be88a59d2..6d37b8fcfc77 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -73,20 +73,6 @@ static inline unsigned long native_read_cr4(void)
 
 void native_write_cr4(unsigned long val);
 
-#ifdef CONFIG_X86_64
-static inline unsigned long native_read_cr8(void)
-{
-   unsigned long cr8;
-   asm volatile("movq %%cr8,%0" : "=r" (cr8));
-   return cr8;
-}
-
-static inline void native_write_cr8(unsigned long val)
-{
-   asm volatile("movq %0,%%cr8" :: "r" (val) : "memory");
-}
-#endif
-
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline u32 rdpkru(void)
 {
@@ -200,16 +186,6 @@ static inline void wbinvd(void)
 
 #ifdef CONFIG_X86_64
 
-static inline unsigned long read_cr8(void)
-{
-   return native_read_cr8();
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   native_write_cr8(x);
-}
-
 static inline void load_gs_index(unsigned selector)
 {
native_load_gs_index(selector);
diff --git a/arch/x86/include/asm/suspend_64.h 
b/arch/x86/include/asm/suspend_64.h
index a7af9f53c0cb..35bb35d28733 100644
--- a/arch/x86/include/asm/suspend_64.h
+++ b/arch/x86/include/asm/suspend_64.h
@@ -34,7 +34,7 @@ struct saved_context {
 */
unsigned long kernelmode_gs_base, usermode_gs_base, fs_base;
 
-   unsigned long cr0, cr2, cr3, cr4, cr8;
+   unsigned long cr0, cr2, cr3, cr4;
u64 misc_enable;
bool misc_enable_saved;
struct saved_msrs saved_msrs;
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d3d075226c0a..8b54d8e3a561 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -62,7 +62,6 @@ int main(void)
ENTRY(cr2);
ENTRY(cr3);
ENTRY(cr4);
-   ENTRY(cr8);
ENTRY(gdt_desc);
BLANK();
 #undef ENTRY
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 98039d7fb998..de4d4e8a54c1 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -311,10 +311,6 @@ struct paravirt_patch_template pv_ops = {

[PATCH] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andrew Cooper
There is a lot of infrastructure for functionality which is used
exclusively in __{save,restore}_processor_state() on the suspend/resume
path.

cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored
independently by lapic_{suspend,resume}().

Delete the saving and restoration of cr8, which allows for the removal
of both PVOPS.

Signed-off-by: Andrew Cooper 
---
CC: x...@kernel.org
CC: virtualizat...@lists.linux-foundation.org
CC: Borislav Petkov 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Stephane Eranian 
CC: Feng Tang 
CC: Juergen Gross 
CC: Boris Ostrovsky 
CC: Alok Kataria 
CC: "Rafael J. Wysocki" 
CC: Pavel Machek 

Spotted while reviewing "x86/apic: Initialize TPR to block interrupts 16-31"

https://lore.kernel.org/lkml/dc04a9f8b234d7b0956a8d2560b8945bcd9c4bf7.1563117760.git.l...@kernel.org/
---
 arch/x86/include/asm/paravirt.h   | 12 
 arch/x86/include/asm/paravirt_types.h |  5 -
 arch/x86/include/asm/special_insns.h  | 24 
 arch/x86/kernel/paravirt.c|  4 
 arch/x86/power/cpu.c  |  4 
 arch/x86/xen/enlighten_pv.c   | 15 ---
 6 files changed, 64 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..0e4a0539c353 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -139,18 +139,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-#ifdef CONFIG_X86_64
-static inline unsigned long read_cr8(void)
-{
-   return PVOP_CALL0(unsigned long, cpu.read_cr8);
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   PVOP_VCALL1(cpu.write_cr8, x);
-}
-#endif
-
 static inline void arch_safe_halt(void)
 {
PVOP_VCALL0(irq.safe_halt);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3c775fb5524b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -119,11 +119,6 @@ struct pv_cpu_ops {
 
void (*write_cr4)(unsigned long);
 
-#ifdef CONFIG_X86_64
-   unsigned long (*read_cr8)(void);
-   void (*write_cr8)(unsigned long);
-#endif
-
/* Segment descriptor handling */
void (*load_tr_desc)(void);
void (*load_gdt)(const struct desc_ptr *);
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 219be88a59d2..6d37b8fcfc77 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -73,20 +73,6 @@ static inline unsigned long native_read_cr4(void)
 
 void native_write_cr4(unsigned long val);
 
-#ifdef CONFIG_X86_64
-static inline unsigned long native_read_cr8(void)
-{
-   unsigned long cr8;
-   asm volatile("movq %%cr8,%0" : "=r" (cr8));
-   return cr8;
-}
-
-static inline void native_write_cr8(unsigned long val)
-{
-   asm volatile("movq %0,%%cr8" :: "r" (val) : "memory");
-}
-#endif
-
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline u32 rdpkru(void)
 {
@@ -200,16 +186,6 @@ static inline void wbinvd(void)
 
 #ifdef CONFIG_X86_64
 
-static inline unsigned long read_cr8(void)
-{
-   return native_read_cr8();
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   native_write_cr8(x);
-}
-
 static inline void load_gs_index(unsigned selector)
 {
native_load_gs_index(selector);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 98039d7fb998..de4d4e8a54c1 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -311,10 +311,6 @@ struct paravirt_patch_template pv_ops = {
.cpu.read_cr0   = native_read_cr0,
.cpu.write_cr0  = native_write_cr0,
.cpu.write_cr4  = native_write_cr4,
-#ifdef CONFIG_X86_64
-   .cpu.read_cr8   = native_read_cr8,
-   .cpu.write_cr8  = native_write_cr8,
-#endif
.cpu.wbinvd = native_wbinvd,
.cpu.read_msr   = native_read_msr,
.cpu.write_msr  = native_write_msr,
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 24b079e94bc2..1c58d8982728 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -122,9 +122,6 @@ static void __save_processor_state(struct saved_context *ctxt)
ctxt->cr2 = read_cr2();
ctxt->cr3 = __read_cr3();
ctxt->cr4 = __read_cr4();
-#ifdef CONFIG_X86_64
-   ctxt->cr8 = read_cr8();
-#endif
ctxt->misc_enable_saved = !rdmsrl_safe(MSR_IA32_MISC_ENABLE,
   &ctxt->misc_enable);
msr_save_context(ctxt);
@@ -207,7 +204,6 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
 #else
 /* CONFIG X86_64 */
wrmsrl(MSR_EFER, ctxt->efer);
-   write_cr8(ctxt->cr8);

Re: [PATCH] x86/apic: Initialize TPR to block interrupts 16-31

2019-07-15 Thread Andrew Cooper
On 14/07/2019 18:21, Nadav Amit wrote:
>> On Jul 14, 2019, at 8:23 AM, Andy Lutomirski  wrote:
>>
>> The APIC, per spec, is fundamentally confused and thinks that
>> interrupt vectors 16-31 are valid.  This makes no sense -- the CPU
>> reserves vectors 0-31 for exceptions (faults, traps, etc).
>> Obviously, no device should actually produce an interrupt with
>> vector 16-31, but we can improve robustness by setting the APIC TPR
>> class to 1, which will prevent delivery of an interrupt with a
>> vector below 32.
>>
>> Note: this is *not* intended as a security measure against attackers
>> who control malicious hardware.  Any PCI or similar hardware that
>> can be controlled by an attacker MUST be behind a functional IOMMU
>> that remaps interrupts.  The purpose of this patch is to reduce the
>> chance that a certain class of device malfunctions crashes the
>> kernel in hard-to-debug ways.
>>
>> Cc: Nadav Amit 
>> Cc: Stephane Eranian 
>> Cc: Feng Tang 
>> Suggested-by: Andrew Cooper 
>> Signed-off-by: Andy Lutomirski 
>> ---
>> arch/x86/kernel/apic/apic.c | 7 +--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index 177aa8ef2afa..ff31322f8839 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -1531,11 +1531,14 @@ static void setup_local_APIC(void)
>> #endif
>>
>>  /*
>> - * Set Task Priority to 'accept all'. We never change this
>> - * later on.
>> + * Set Task Priority to 'accept all except vectors 0-31'.  An APIC
>> + * vector in the 16-31 range could be delivered if TPR == 0, but we
>> + * would think it's an exception and terrible things will happen.  We
>> + * never change this later on.
>>   */
>>  value = apic_read(APIC_TASKPRI);
>>  value &= ~APIC_TPRI_MASK;
>> +value |= 0x10;
>>  apic_write(APIC_TASKPRI, value);
>>
>>  apic_pending_intr_clear();
> It looks fine, and indeed it seems that writes to APIC_TASKPRI and CR8 are
> not overwriting this value.

Writes to these two registers should overwrite this value.

> Yet, the fact that if someone overwrites with zero (or does not restore) the
> TPR will not produce any warning is not that great. Not that I know what the
> right course of action is (adding checks in write_cr8()? but then what about
> if APIC_TASKPRI is not restored after some APIC reset?)

TPR is only written during boot and resume.

cr8 is only read and written during suspend/resume, which is redundant
with the TPR save/restore.  Dropping that would drop two PVops, and the
final trace of the kernel using cr8 itself.
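For anyone not steeped in the architecture: cr8 is simply the priority
class field of the TPR, i.e. bits 7:4 of APIC_TASKPRI.  A minimal sketch of
the mapping (helper names are mine, purely illustrative):

static inline unsigned long tpr_to_cr8(u32 apic_tpr)
{
        /* CR8[3:0] is the priority class, i.e. TPR bits 7:4. */
        return (apic_tpr >> 4) & 0xf;
}

static inline u32 cr8_to_tpr(unsigned long cr8)
{
        /* CR8 == 1 corresponds to APIC_TASKPRI == 0x10, masking vectors 16-31. */
        return (u32)(cr8 & 0xf) << 4;
}

which is why saving APIC_TASKPRI in lapic_{suspend,resume}() already covers
everything the cr8 save/restore was doing.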

All cr8 and TPR accesses in KVM look to be on virtual state, rather than
the real registers (as expected).

~Andrew


Re: [Xen-devel] [PATCH v2] xen/pv: Fix a boot up hang revealed by int3 self test

2019-07-15 Thread Andrew Cooper
On 15/07/2019 07:54, Jan Beulich wrote:
> On 15.07.2019 07:05, Zhenzhong Duan wrote:
>> On 2019/7/12 22:06, Andrew Cooper wrote:
>>> On 11/07/2019 03:15, Zhenzhong Duan wrote:
>>>> Commit 7457c0da024b ("x86/alternatives: Add int3_emulate_call()
>>>> selftest") is used to ensure there is a gap setup in exception stack
>>>> which could be used for inserting call return address.
>>>>
>>>> This gap is missed in XEN PV int3 exception entry path, then below panic
>>>> triggered:
>>>>
>>>> [    0.772876] general protection fault:  [#1] SMP NOPTI
>>>> [    0.772886] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0+ #11
>>>> [    0.772893] RIP: e030:int3_magic+0x0/0x7
>>>> [    0.772905] RSP: 3507:82203e98 EFLAGS: 0246
>>>> [    0.773334] Call Trace:
>>>> [    0.773334]  alternative_instructions+0x3d/0x12e
>>>> [    0.773334]  check_bugs+0x7c9/0x887
>>>> [    0.773334]  ? __get_locked_pte+0x178/0x1f0
>>>> [    0.773334]  start_kernel+0x4ff/0x535
>>>> [    0.773334]  ? set_init_arg+0x55/0x55
>>>> [    0.773334]  xen_start_kernel+0x571/0x57a
>>>>
>>>> As xenint3 and int3 entry code are same except xenint3 doesn't generate
>>>> a gap, we can fix it by using int3 and drop useless xenint3.
>>> For 64bit PV guests, Xen's ABI enters the kernel with using SYSRET, with
>>> %rcx/%r11 on the stack.
>>>
>>> To convert back to "normal" looking exceptions, the xen thunks do `pop
>>> %rcx; pop %r11; jmp do_*`...
>> I will add this to commit message.
>>>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>>>> index 0ea4831..35a66fc 100644
>>>> --- a/arch/x86/entry/entry_64.S
>>>> +++ b/arch/x86/entry/entry_64.S
>>>> @@ -1176,7 +1176,6 @@ idtentry stack_segment    do_stack_segment    has_error_code=1
>>>>   #ifdef CONFIG_XEN_PV
>>>>   idtentry xennmi    do_nmi    has_error_code=0
>>>>   idtentry xendebug    do_debug    has_error_code=0
>>>> -idtentry xenint3    do_int3    has_error_code=0
>>>>   #endif
>>> What is confusing is why there are 3 extra magic versions here.  At a
>>> guess, I'd say something to do with IST settings (given the vectors),
>>> but I don't see why #DB/#BP would need to be different in the first
>>> place.  (NMI sure, but that is more to do with the crazy hoops needing
>>> to be jumped through to keep native functioning safely.)
>> xenint3 will be removed in this patch safely as it don't use IST now.
>>
>> But debug and nmi need paranoid_entry which will read MSR_GS_BASE to 
>> determine
>>
>> if swapgs is needed. I guess PV guesting running in ring3 will #GP with 
>> swapgs?
> Not only that (Xen could trap and emulate swapgs if that was needed) - 64-bit
> PV kernel mode always gets entered with kernel GS base already set. Hence
> finding out whether to switch the GS base is specifically not something that
> any exception entry point would need to do (and it should actively try to
> avoid it, for performance reasons).

Indeed.  The SWAPGS PVOP is implemented as a nop for x86 PV, to simplify
the entry assembly (rather than doubling up all entry vectors).

~Andrew


Re: [Xen-devel] [PATCH v2] xen/pv: Fix a boot up hang revealed by int3 self test

2019-07-12 Thread Andrew Cooper
On 11/07/2019 03:15, Zhenzhong Duan wrote:
> Commit 7457c0da024b ("x86/alternatives: Add int3_emulate_call()
> selftest") is used to ensure there is a gap setup in exception stack
> which could be used for inserting call return address.
>
> This gap is missed in XEN PV int3 exception entry path, then below panic
> triggered:
>
> [0.772876] general protection fault:  [#1] SMP NOPTI
> [0.772886] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0+ #11
> [0.772893] RIP: e030:int3_magic+0x0/0x7
> [0.772905] RSP: 3507:82203e98 EFLAGS: 0246
> [0.773334] Call Trace:
> [0.773334]  alternative_instructions+0x3d/0x12e
> [0.773334]  check_bugs+0x7c9/0x887
> [0.773334]  ? __get_locked_pte+0x178/0x1f0
> [0.773334]  start_kernel+0x4ff/0x535
> [0.773334]  ? set_init_arg+0x55/0x55
> [0.773334]  xen_start_kernel+0x571/0x57a
>
> As xenint3 and int3 entry code are same except xenint3 doesn't generate
> a gap, we can fix it by using int3 and drop useless xenint3.

For 64bit PV guests, Xen's ABI enters the kernel with using SYSRET, with
%rcx/%r11 on the stack.

To convert back to "normal" looking exceptions, the xen thunks do `pop
%rcx; pop %r11; jmp do_*`...

> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 0ea4831..35a66fc 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1176,7 +1176,6 @@ idtentry stack_segment  do_stack_segment  has_error_code=1
>  #ifdef CONFIG_XEN_PV
>  idtentry xennmi  do_nmi  has_error_code=0
>  idtentry xendebugdo_debughas_error_code=0
> -idtentry xenint3 do_int3 has_error_code=0
>  #endif

What is confusing is why there are 3 extra magic versions here.  At a
guess, I'd say something to do with IST settings (given the vectors),
but I don't see why #DB/#BP would need to be different in the first
place.  (NMI sure, but that is more to do with the crazy hoops needing
to be jumped through to keep native functioning safely.)

~Andrew


Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

2019-07-05 Thread Andrew Cooper
On 05/07/2019 21:49, Paolo Bonzini wrote:
> On 05/07/19 22:25, Thomas Gleixner wrote:
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
> Oh, that brings back memories.  At the time I was working on Xen, so I
> remember that CVE.  IIRC there was some mitigation but the fix was
> basically to print a very scary error message if you used VT-d without
> interrupt remapping.  Maybe force the user to add something on the Xen
> command line too?

It was before my time.  I have no public comment on how the other
aspects of it were handled.

>> Is there any serious usage of virtualization w/o interrupt remapping left
>> or have the machines which are not capable been retired already?
> I think they were already starting to disappear in 2011, as I don't
> remember much worry about customers that were using systems without it.

ISTR Nehalem/Westmere era systems were the first to support interrupt
remapping, but were totally crippled with errata to the point of needing
to turn a prerequisite feature (Queued Invalidation) off.  I believe
later systems have it working to a first approximation.

As to the original question, whether people should be using such systems
is a different question to whether they actually are.

~Andrew


Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

2019-07-05 Thread Andrew Cooper
On 05/07/2019 20:19, Nadav Amit wrote:
>> On Jul 5, 2019, at 8:47 AM, Andrew Cooper  wrote:
>>
>> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>>  2) The loop termination logic is interesting at best.
>>>
>>> If the machine has no TSC or cpu_khz is not known yet it tries 1
>>> million times to ack stale IRR/ISR bits. What?
>>>
>>> With TSC it uses the TSC to calculate the loop termination. It takes a
>>> timestamp at entry and terminates the loop when:
>>>
>>>   (rdtsc() - start_timestamp) >= (cpu_hkz << 10)
>>>
>>> That's roughly one second.
>>>
>>> Both methods are problematic. The APIC has 256 vectors, which means
>>> that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>> impossible as the first 32 vectors are reserved and not affected and
>>> the chance that more than a few bits are set is close to zero.
>> [Disclaimer.  I talked to Thomas in private first, and he asked me to
>> post this publicly as the CVE is almost a decade old already.]
>>
>> I'm afraid that this isn't quite true.
>>
>> In terms of IDT vectors, the first 32 are reserved for exceptions, but
>> only the first 16 are reserved in the LAPIC.  Vectors 16-31 are fair
>> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>>
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
> IIRC (and from skimming the CVE again) the basic problem in Xen was that
> MSIs can be used when devices are assigned to generate IRQs with arbitrary
> vectors. The mitigation was to require interrupt remapping to be enabled in
> the IOMMU when IOMMU is used for DMA remapping (i.e., device assignment).
>
> Are you concerned about this case, additional concrete ones, or is it about
> security hardening? (or am I missing something?)

The phrase "impossible as the first 32 vectors are reserved" stuck out,
because it's not true.  That generally means that any logic derived from
it is also false. :)

In practice, I was thinking more about robustness against buggy
conditions.  Setting TPR to 1 at start of day is very easy.  Some of the
other protections, less so.

When it comes to virtualisation, security is an illusion when a guest
kernel has a real piece of hardware in its hands.  Anyone who is under
the misapprehension otherwise should try talking to a IOMMU hardware
engineer and see the reaction on their face.  IOMMUs were designed to
isolate devices when all controlling software was of the same privilege
level.  They don't magically make the system safe against a hostile
guest device driver, which in the most basic case, can still mount a DoS
attempt with deliberately bad DMA.

~Andrew


Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

2019-07-05 Thread Andrew Cooper
On 05/07/2019 20:06, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper  
> wrote:
>> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>>   2) The loop termination logic is interesting at best.
>>>
>>>  If the machine has no TSC or cpu_khz is not known yet it tries 1
>>>  million times to ack stale IRR/ISR bits. What?
>>>
>>>  With TSC it uses the TSC to calculate the loop termination. It takes a
>>>  timestamp at entry and terminates the loop when:
>>>
>>> (rdtsc() - start_timestamp) >= (cpu_hkz << 10)
>>>
>>>  That's roughly one second.
>>>
>>>  Both methods are problematic. The APIC has 256 vectors, which means
>>>  that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>>  impossible as the first 32 vectors are reserved and not affected and
>>>  the chance that more than a few bits are set is close to zero.
>> [Disclaimer.  I talked to Thomas in private first, and he asked me to
>> post this publicly as the CVE is almost a decade old already.]
>>
>> I'm afraid that this isn't quite true.
>>
>> In terms of IDT vectors, the first 32 are reserved for exceptions, but
>> only the first 16 are reserved in the LAPIC.  Vectors 16-31 are fair
>> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>>
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
>>
>> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
>> without an error code on the stack, which results in a corrupt pt_regs
>> in the exception handler, and a stack underflow on the way back out,
>> most likely with a fault on IRET.
>>
>> These can be addressed by setting TPR to 0x10, which will inhibit
>> delivery of any errant IPIs in this range, but some extra sanity logic
>> may not go amiss.  An error code on a 64bit stack can be spotted with
>> `testb $8, %spl` due to %rsp being aligned before pushing the exception
>> frame.
> Several years ago, I remember having a discussion with someone (Jan
> Beulich, maybe?) about how to efficiently make the entry code figure
> out the error code situation automatically.  I suspect it was on IRC
> and I can't find the logs.

It was on IRC, but I don't remember exactly when, either.

> I'm thinking that maybe we should just
> make Linux's idtentry code do something like this.
>
> If nothing else, we could make idtentry do:
>
> testl $8, %esp   /* shorter than testb IIRC */

Sadly not.  test (unlike cmp and the basic mutative opcodes) doesn't
have a sign-extendable imm8 encoding.  The two options are:

f7 c4 08 00 00 00        test   $0x8,%esp
40 f6 c4 08      test   $0x8,%spl

> jz 1f  /* or jnz -- too lazy to figure it out */
> pushq $-1
> 1:

It is jz, and Xen does use this sequence for reserved/unimplemented
vectors, but we expect those codepaths never to be executed.

>
> instead of the current hardcoded push.  The cost of a mispredicted
> branch here will be smallish compared to the absurdly large cost of
> the entry itself.  But I thought I had something more clever than
> this.  This sequence works, but it still feels like it should be
> possible to do better:
>
> .macro PUSH_ERROR_IF_NEEDED
> /*
>  * Before the IRET frame is pushed, RSP is aligned to a 16-byte
>  * boundary.  After SS .. RIP and the error code are pushed, RSP is
>  * once again aligned.  Pushing -1 will put -1 in the error code slot
>  * (regs->orig_ax) if there was no error code.
> */
>
> pushq$-1/* orig_ax = -1, maybe */
> /* now RSP points to orig_ax (aligned) or di (misaligned) */
> pushq$0
> /* now RSP points to di (misaligned) or si (aligned) */
> orq$8, %rsp
> /* now RSP points to di */
> addq$8, %rsp
> /* now RSP points to orig_ax, and we're in good shape */
> .endm
>
> Is there a better sequence for this?

The only aspect I can think of is whether mixing the push/pops with
explicit updates to %rsp is better or worse than a very well
predicted branch, given that various frontends have special tracking to
reduce instruction dependencies on %rsp.  I'll have to defer to the CPU
microachitects as to which of the two options is the lesser evil.

That said, both Intel and AMD's Optimisation guides have stack alignment
suggestions which mix push/sub/and on function prolog, so I expect this
is as optimised as it can reasonably be in the pipelines.


Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

2019-07-05 Thread Andrew Cooper
On 04/07/2019 16:51, Thomas Gleixner wrote:
>   2) The loop termination logic is interesting at best.
>
>  If the machine has no TSC or cpu_khz is not known yet it tries 1
>  million times to ack stale IRR/ISR bits. What?
>
>  With TSC it uses the TSC to calculate the loop termination. It takes a
>  timestamp at entry and terminates the loop when:
>
> (rdtsc() - start_timestamp) >= (cpu_hkz << 10)
>
>  That's roughly one second.
>
>  Both methods are problematic. The APIC has 256 vectors, which means
>  that in theory max. 256 IRR/ISR bits can be set. In practice this is
>  impossible as the first 32 vectors are reserved and not affected and
>  the chance that more than a few bits are set is close to zero.

[Disclaimer.  I talked to Thomas in private first, and he asked me to
post this publicly as the CVE is almost a decade old already.]

I'm afraid that this isn't quite true.

In terms of IDT vectors, the first 32 are reserved for exceptions, but
only the first 16 are reserved in the LAPIC.  Vectors 16-31 are fair
game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).

In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
I'm disappointed to see wasn't shared with other software vendors at the
time.

Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
without an error code on the stack, which results in a corrupt pt_regs
in the exception handler, and a stack underflow on the way back out,
most likely with a fault on IRET.

These can be addressed by setting TPR to 0x10, which will inhibit
delivery of any errant IPIs in this range, but some extra sanity logic
may not go amiss.  An error code on a 64bit stack can be spotted with
`testb $8, %spl` due to %rsp being aligned before pushing the exception
frame.
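
To spell out the arithmetic behind that test (kernel-style sketch; the
helper name is mine): the CPU aligns %rsp to 16 bytes before pushing the
64bit exception frame, SS/RSP/RFLAGS/CS/RIP are 5 quadwords (40 bytes),
and an error code adds a sixth, so bit 3 of %rsp distinguishes the two
layouts.

/*
 * aligned %rsp - 40 bytes  ->  %rsp % 16 == 8   (no error code)
 * aligned %rsp - 48 bytes  ->  %rsp % 16 == 0   (error code present)
 */
static inline bool frame_has_error_code(u64 entry_rsp)
{
        return (entry_rsp & 8) == 0;    /* the C shape of `testb $8, %spl` */
}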

Another interesting problem is an IPI with vector 0x80.  A cunning
attacker can use this to simulate system calls from unsuspecting
positions in userspace, or for interrupting kernel context.  At the very
least the int0x80 path does an unconditional swapgs, so will try to run
with the user gs, and I expect things will explode quickly from there.

One option here is to look at ISR and complain if it is found to be set.
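
Roughly what such a check could look like (sketch only -- the helper is
mine, not an existing kernel function).  Each APIC_ISR register covers 32
vectors and the registers sit 0x10 bytes apart; a genuinely APIC-delivered
vector 0x80 stays in-service until EOI, whereas a software `int $0x80`
never touches the APIC:

static bool vector_0x80_in_service(void)
{
        /* Vector 0x80 lives in ISR register 4, bit 0, i.e. APIC_ISR + 0x40. */
        u32 isr = apic_read(APIC_ISR + ((0x80 >> 5) << 4));

        return isr & (1u << (0x80 & 0x1f));
}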

Another option, which I've only just remembered, is that AMD hardware
has the Interrupt Enable Register in its extended APIC space, which may
or may not be good enough to prohibit delivery of 0x80.  There isn't
enough information in the APM to be clear, but the name suggests it is
worth experimenting with.

~Andrew


Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Andrew Cooper
On 03/07/2019 18:02, Nadav Amit wrote:
>> On Jul 3, 2019, at 7:04 AM, Juergen Gross  wrote:
>>
>> On 03.07.19 01:51, Nadav Amit wrote:
>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>> and hyper-v are only compile-tested).
>>> While the updated smp infrastructure is capable of running a function on
>>> a single local core, it is not optimized for this case. The multiple
>>> function calls and the indirect branch introduce some overhead, and
>>> might make local TLB flushes slower than they were before the recent
>>> changes.
>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>> is needed to restore the lost performance in this common case. This
>>> requires to check mm_cpumask() one more time, but unless this mask is
>>> updated very frequently, this should not impact performance negatively.
>>> Cc: "K. Y. Srinivasan" 
>>> Cc: Haiyang Zhang 
>>> Cc: Stephen Hemminger 
>>> Cc: Sasha Levin 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: Borislav Petkov 
>>> Cc: x...@kernel.org
>>> Cc: Juergen Gross 
>>> Cc: Paolo Bonzini 
>>> Cc: Dave Hansen 
>>> Cc: Andy Lutomirski 
>>> Cc: Peter Zijlstra 
>>> Cc: Boris Ostrovsky 
>>> Cc: linux-hyp...@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: virtualizat...@lists.linux-foundation.org
>>> Cc: k...@vger.kernel.org
>>> Cc: xen-de...@lists.xenproject.org
>>> Signed-off-by: Nadav Amit 
>>> ---
>>>  arch/x86/hyperv/mmu.c | 13 +++---
>>>  arch/x86/include/asm/paravirt.h   |  6 +--
>>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>>  arch/x86/include/asm/tlbflush.h   |  9 ++--
>>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>  arch/x86/kernel/kvm.c | 11 +++--
>>>  arch/x86/kernel/paravirt.c|  2 +-
>>>  arch/x86/mm/tlb.c | 65 ---
>>>  arch/x86/xen/mmu_pv.c | 20 ++---
>>>  include/trace/events/xen.h|  2 +-
>>>  10 files changed, 91 insertions(+), 43 deletions(-)
>> ...
>>
>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>> index beb44e22afdf..19e481e6e904 100644
>>> --- a/arch/x86/xen/mmu_pv.c
>>> +++ b/arch/x86/xen/mmu_pv.c
>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>> preempt_enable();
>>>  }
>>>  -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>> -const struct flush_tlb_info *info)
>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>> +   const struct flush_tlb_info *info)
>>>  {
>>> struct {
>>> struct mmuext_op op;
>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>> const size_t mc_entry_size = sizeof(args->op) +
>>> sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>  -  trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>> +   trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>> if (cpumask_empty(cpus))
>>> return; /* nothing to do */
>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>> args = mcs.args;
>>> args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>  -  /* Remove us, and any offline CPUS. */
>>> +   /* Flush locally if needed and remove us */
>>> +   if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>> +   local_irq_disable();
>>> +   flush_tlb_func_local(info);
>> I think this isn't the correct function for PV guests.
>>
>> In fact it should be much easier: just don't clear the own cpu from the
>> mask, that's all what's needed. The hypervisor is just fine having the
>> current cpu in the mask and it will do the right thing.
> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
> the same, unfortunately, since it would induce VM-exit on TLB flushes.

Why do you believe the vmexit matters?  You're taking one anyway for
the IPI.

Intel only have virtualised self-IPI, and while AMD do have working
non-self IPIs, you still take a vmexit anyway if any destination vcpu
isn't currently running in non-root mode (IIRC).

At that point, you might as well have the hypervisor do all the hard
work via a multi-cpu shootdown/flush hypercall, rather than trying to
arrange it locally.

~Andrew


Re: [Xen-devel] [RFC PATCH 04/16] x86/xen: hypercall support for xenhost_t

2019-06-14 Thread Andrew Cooper
On 14/06/2019 08:35, Juergen Gross wrote:
> On 14.06.19 09:20, Ankur Arora wrote:
>> On 2019-06-12 2:15 p.m., Andrew Cooper wrote:
>>> On 09/05/2019 18:25, Ankur Arora wrote:
>>>> Allow for different hypercall implementations for different xenhost
>>>> types.
>>>> Nested xenhost, which has two underlying xenhosts, can use both
>>>> simultaneously.
>>>>
>>>> The hypercall macros (HYPERVISOR_*) implicitly use the default
>>>> xenhost.x
>>>> A new macro (hypervisor_*) takes xenhost_t * as a parameter and
>>>> does the
>>>> right thing.
>>>>
>>>> TODO:
>>>>    - Multicalls for now assume the default xenhost
>>>>    - xen_hypercall_* symbols are only generated for the default
>>>> xenhost.
>>>>
>>>> Signed-off-by: Ankur Arora 
>>>
>>> Again, what is the hypervisor nesting and/or guest layout here?
>> Two hypervisors, L0 and L1, and the guest is a child of the L1
>> hypervisor but could have PV devices attached to both L0 and L1
>> hypervisors.
>>
>>>
>>> I can't think of any case where a single piece of software can
>>> legitimately have two hypercall pages, because if it has one working
>>> one, it is by definition a guest, and therefore not privileged
>>> enough to
>>> use the outer one.
>> Depending on which hypercall page is used, the hypercall would
>> (eventually) land in the corresponding hypervisor.
>>
>> Juergen elsewhere pointed out proxying hypercalls is a better approach,
>> so I'm not really considering this any more but, given this layout, and
>> assuming that the hypercall pages could be encoded differently would it
>> still not work?
>
> Hypercalls might work, but it is a bad idea and a violation of layering
> to let a L1 guest issue hypercalls to L0 hypervisor, as those hypercalls
> could influence other L1 guests and even the L1 hypervisor.
>
> Hmm, thinking more about it, I even doubt those hypercalls could work in
> all cases: when issued from a L1 PV guest the hypercalls would seem to
> be issued from user mode for the L0 hypervisor, and this is not allowed.

That is exactly the point I was trying to make.

If L2 is an HVM guest, then both its hypercall pages will be using
VMCALL/VMMCALL which will end up making hypercalls to L1, rather than
having one go to L0.

If L2 is a PV guest, then one hypercall page will be SYSCALL/INT 82
which will go to L1, and one will be VMCALL/VMMCALL which goes to L0,
but L0 will see it from ring1/ring3 and reject the hypercall.

However you nest the system, every guest only has a single occurrence of
"supervisor software", so only has a single context that will be
tolerated to make hypercalls by the next hypervisor up.

~Andrew


Re: [Xen-devel] [RFC PATCH 02/16] x86/xen: cpuid support in xenhost_t

2019-06-12 Thread Andrew Cooper
On 09/05/2019 18:25, Ankur Arora wrote:
> xen_cpuid_base() is used to probe and setup features early in a
> guest's lifetime.
>
> We want this to behave differently depending on xenhost->type: for
> instance, local xenhosts cannot intercept the cpuid instruction at all.
>
> Add op (*cpuid_base)() in xenhost_ops_t.
>
> Signed-off-by: Ankur Arora 

What is the real layout of hypervisor nesting here?

When Xen is at L0, all HVM guests get working CPUID faulting to combat
this problem, because CPUID faulting can be fully emulated even on older
Intel hardware, and AMD hardware.

It is a far cleaner way of fixing the problem.
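
For reference, a rough sketch of what turning it on involves (MSR names and
bit positions are from my recollection of the SDM, not from this thread):
support is advertised in MSR_PLATFORM_INFO bit 31, the feature is enabled
via MSR_MISC_FEATURES_ENABLES bit 0, and a faulted CPL>0 CPUID then arrives
as #GP for the kernel/hypervisor to filter:

static bool enable_cpuid_faulting(void)
{
        u64 info, ctrl;

        if (rdmsrl_safe(MSR_PLATFORM_INFO, &info) || !(info & BIT_ULL(31)))
                return false;

        rdmsrl(MSR_MISC_FEATURES_ENABLES, ctrl);
        wrmsrl(MSR_MISC_FEATURES_ENABLES, ctrl | BIT_ULL(0));
        return true;
}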

~Andrew


Re: [Xen-devel] [RFC PATCH 04/16] x86/xen: hypercall support for xenhost_t

2019-06-12 Thread Andrew Cooper
On 09/05/2019 18:25, Ankur Arora wrote:
> Allow for different hypercall implementations for different xenhost types.
> Nested xenhost, which has two underlying xenhosts, can use both
> simultaneously.
>
> The hypercall macros (HYPERVISOR_*) implicitly use the default xenhost.x
> A new macro (hypervisor_*) takes xenhost_t * as a parameter and does the
> right thing.
>
> TODO:
>   - Multicalls for now assume the default xenhost
>   - xen_hypercall_* symbols are only generated for the default xenhost.
>
> Signed-off-by: Ankur Arora 

Again, what is the hypervisor nesting and/or guest layout here?

I can't think of any case where a single piece of software can
legitimately have two hypercall pages, because if it has one working
one, it is by definition a guest, and therefore not privileged enough to
use the outer one.

~Andrew


Re: [Xen-devel] [PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries

2019-05-16 Thread Andrew Cooper
On 16/05/2019 14:50, Alexander Graf wrote:
> On 14.05.19 08:16, Filippo Sironi wrote:
>> Start populating /sys/hypervisor with KVM entries when we're running on
>> KVM. This is to replicate functionality that's available when we're
>> running on Xen.
>>
>> Start with /sys/hypervisor/uuid, which users prefer over
>> /sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual
>> machine, since it's also available when running on Xen HVM and on Xen PV
>> and, on top of that doesn't require root privileges by default.
>> Let's create arch-specific hooks so that different architectures can
>> provide different implementations.
>>
>> Signed-off-by: Filippo Sironi 
> I think this needs something akin to
>
>   https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-hypervisor-xen
>
> to document which files are available.
>
>> ---
>> v2:
>> * move the retrieval of the VM UUID out of uuid_show and into
>>   kvm_para_get_uuid, which is a weak function that can be overwritten
>>
>>  drivers/Kconfig  |  2 ++
>>  drivers/Makefile |  2 ++
>>  drivers/kvm/Kconfig  | 14 ++
>>  drivers/kvm/Makefile |  1 +
>>  drivers/kvm/sys-hypervisor.c | 30 ++
>>  5 files changed, 49 insertions(+)
>>  create mode 100644 drivers/kvm/Kconfig
>>  create mode 100644 drivers/kvm/Makefile
>>  create mode 100644 drivers/kvm/sys-hypervisor.c
>>
> [...]
>
>> +
>> +__weak const char *kvm_para_get_uuid(void)
>> +{
>> +return NULL;
>> +}
>> +
>> +static ssize_t uuid_show(struct kobject *obj,
>> + struct kobj_attribute *attr,
>> + char *buf)
>> +{
>> +const char *uuid = kvm_para_get_uuid();
>> +return sprintf(buf, "%s\n", uuid);
> The usual return value for the Xen /sys/hypervisor interface is
> "".

This string comes straight from Xen.

It was an effort to reduce the quantity of interesting fingerprintable
data accessible by default to unprivileged guests.

See
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=a2fc8d514df2b38c310d4f4432fe06520b0769ed
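
A more defensive shape for the hunk quoted above would be something like
this (sketch only, not the posted patch): don't hand sprintf() a NULL
pointer when the weak kvm_para_get_uuid() isn't overridden, and mirror
Xen's "<denied>" convention:

static ssize_t uuid_show(struct kobject *obj,
                         struct kobj_attribute *attr,
                         char *buf)
{
        const char *uuid = kvm_para_get_uuid();

        return sprintf(buf, "%s\n", uuid ? uuid : "<denied>");
}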

~Andrew


Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry

2019-03-21 Thread Andrew Cooper
On 21/03/2019 18:39, Andy Lutomirski wrote:
> On Thu, Mar 21, 2019 at 11:37 AM Peter Zijlstra  wrote:
>> On Thu, Mar 21, 2019 at 11:25:44AM -0700, Linus Torvalds wrote:
>>> On Thu, Mar 21, 2019 at 11:21 AM Andy Lutomirski  wrote:
>>>> I dunno.  Lots of people at least use to have serious commercial interest 
>>>> in it.
>>> Yes, it used to be a big deal. But full virtualization has gotten a
>>> lot more common and better.
>>>
>>>> Hey Xen folks, how close are we to being able to say "if you want to
>>>> run a new kernel, you need to switch to PVH or similar"?
>>> I'd also like to know if we could perhaps at least limit PV to just
>>> the thing that people care most deeply about.
>>>
>>> For example, maybe people notice that they really deeply care about
>>> the PV spinlocks because they help a lot for some loads, but don't
>>> care so much about the low-level CPU PV stuff any more because modern
>>> CPUs do _those_ things so well these days.
>>>
>>> So it might not be an all-or-nothing thing, but a gradual "let's stop
>>> supporting xyz under PV, because it causes pain and isn't worth it".
>> So Juergen recently introduced PARAVIRT_XXL, which are exactly those
>> bits of PV we can get rid of.
>>
>> This paravirt-me-harder config does indeed include the CR2 bits.
>>
>> I recently talked to Andrew Cooper about this, and he said Xen Dom0
>> still needs all this :/
> There were patches from last year to fix that:
>
> https://lwn.net/Articles/753982/
>
> I have no clue what the status is.

I'm sorry to say that Xen PV is not going to die any time soon (as far
as Xen is concerned).

For customer VMs, using the VT-x/SVM hardware extensions is definitely
the way to go, but for host level operations, the difference between
syscall vs vmexit, or (not) having to do an EPT flush make an
overwhelming difference in performance.

Our PVH virtualisation mode, including for dom0, is making good
progress.  With Xen 4.12 and Linux 4.19+, a lot of basic functionality
now does work.  However, we've got 15 years of legacy problems to try
and untangle while doing this, including (but not limited to) Linux
(rather than Xen) being OSPM, or the fact that using virtual addresses
for hypercalls and mappings should never have propagated beyond the PV
ABI when HVM came along.

Even with the legacy API/ABI issues fixed, the straight performance
difference between root mode operations, and vmexits, means that it is
by no means a certainty that a PVH dom0 will be the best option for
systems to use.

I'm afraid that I've not got the original context of this thread, but
I'm going to guess that something is resulting in a #PF before %cr2
gets saved on the #PF path, and using a PVOP causes things to
explode?

If it helps in the short term, Xen will trap and emulate a mov from cr2
instruction correctly.  Performance will surely suck, but it might be an
option if we need to rethink how this information is made available to
guests.
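
A sketch of the kind of raw accessor meant here (mine, modelled on the
native_read_cr8() pattern further up this archive): read %cr2 directly
rather than through the PVOP, relying on Xen trapping and emulating the
instruction for PV guests:

static inline unsigned long raw_read_cr2(void)
{
        unsigned long cr2;

        asm volatile("mov %%cr2, %0" : "=r" (cr2));
        return cr2;
}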

Thanks,

~Andrew


Re: [Xen-devel] xen: Can't insert balloon page into VM userspace (WAS Re: [linux-linus bisection] complete test-arm64-arm64-xl-xsm)

2019-03-12 Thread Andrew Cooper
On 12/03/2019 17:18, David Hildenbrand wrote:
> On 12.03.19 18:14, Matthew Wilcox wrote:
>> On Tue, Mar 12, 2019 at 05:05:39PM +, Julien Grall wrote:
>>> On 3/12/19 3:59 PM, Julien Grall wrote:
 It looks like all the arm test for linus [1] and next [2] tree
 are now failing. x86 seems to be mostly ok.

 The bisector fingered the following commit:

 commit 0ee930e6cafa048c1925893d0ca89918b2814f2c
 Author: Matthew Wilcox 
 Date:   Tue Mar 5 15:46:06 2019 -0800

  mm/memory.c: prevent mapping typed pages to userspace
  Pages which use page_type must never be mapped to userspace as it 
 would
  destroy their page type.  Add an explicit check for this instead of
  assuming that kernel drivers always get this right.
>> Oh good, it found a real problem.
>>
>>> It turns out the problem is because the balloon driver will call
>>> __SetPageOffline() on allocated page. Therefore the page has a type and
>>> vm_insert_pages will deny the insertion.
>>>
>>> My knowledge is quite limited in this area. So I am not sure how we can
>>> solve the problem.
>>>
>>> I would appreciate it if someone could provide input on how to fix the mapping.
>> I don't know the balloon driver, so I don't know why it was doing this,
>> but what it was doing was Wrong and has been since 2014 with:
>>
>> commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
>> Author: Konstantin Khlebnikov 
>> Date:   Thu Oct 9 15:29:27 2014 -0700
>>
>> mm/balloon_compaction: redesign ballooned pages management
>>
>> If ballooned pages are supposed to be mapped into userspace, you can't mark
>> them as ballooned pages using the mapcount field.
>>
> Asking myself why anybody would want to map balloon inflated pages into
> user space (this just sounds plain wrong but my understanding to what
> XEN balloon driver does might be limited), but I assume the easy fix
> would be to revert

I suspect the bug here is that the balloon driver is (ab)used for a
second purpose - to create a hole in pfn space to map some other bits of
shared memory into.
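
For reference, the check introduced by the fingered commit is roughly of
this shape (reconstructed by me, not quoted from it): any page using the
page_type field -- which __SetPageOffline() sets -- is refused on the
vm_insert_page() path:

static int page_ok_to_map(struct page *page)
{
        if (page_has_type(page))
                return -EINVAL; /* e.g. PG_offline from __SetPageOffline() */
        return 0;
}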

I think at the end of the day, what is needed is a struct page_info
which looks like normal RAM, but the backing for which can be altered by
hypercall to map other things.

~Andrew


Re: [Xen-devel] [PATCH] xen-netback: fix occasional leak of grant ref mappings under memory pressure

2019-02-28 Thread Andrew Cooper
On 28/02/2019 02:03, Igor Druzhinin wrote:
> Zero-copy callback flag is not yet set on frag list skb at the moment
> xenvif_handle_frag_list() returns -ENOMEM. This eventually results in
> leaking grant ref mappings since xenvif_zerocopy_callback() is never
> called for these fragments. Those eventually build up and cause Xen
> to kill Dom0 as the slots get reused for new mappings.

It's worth pointing out that what (debug) Xen notices is dom0 performing
an implicit grant unmap.

~Andrew

