Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 19:26, Andy Lutomirski wrote:
> On Tue, Jun 23, 2020 at 8:23 AM Andrew Cooper  
> wrote:
>> On 23/06/2020 14:03, Peter Zijlstra wrote:
>>> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
>>>> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
>>>>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
>>>>> you to only make it IST if/when you try and make SNP happen, not before.
>>>> It is not the only reason, when ES guests gain debug register support
>>>> then #VC also needs to be IST, because #DB can be promoted into #VC
>>>> then, and as #DB is IST for a reason, #VC needs to be too.
>>> Didn't I read somewhere that that is only so for Rome/Naples but not for
>>> the later chips (Milan) which have #DB pass-through?
>> I don't know about hardware timelines, but some future part can now opt
>> in to having debug registers as part of the encrypted state, and swapped
>> by VMExit, which would make debug facilities generally usable, and
>> supposedly safe from the #DB infinite loop issues, at which point the
>> hypervisor need not intercept #DB for safety reasons.
>>
>> It's worth noting that on current parts, the hypervisor can set up debug
>> facilities on behalf of the guest (or behind its back) as the DR state
>> is unencrypted, but that attempting to intercept #DB will redirect to
>> #VC inside the guest and cause fun. (Also spare a thought for 32bit
>> kernels which have to cope with userspace singlestepping the SYSENTER
>> path with every #DB turning into #VC.)
> What do you mean 32-bit?  64-bit kernels have exactly the same
> problem.  At least the stack is okay, though.

:)

AMD-like CPUs disallow SYSENTER/SYSEXIT in Long Mode, and raise #UD,
even from a compatibility mode segment.

64bit kernels only have this problem on Intel-like CPUs.

(It is a massive shame that between everyone's attempts, there are 0
"fast system call" instructions with sane semantics, but it is several
decades late to fix this problem...)

~Andrew


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 16:23, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:59:14PM +0200, Joerg Roedel wrote:
>> On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
>>> +noinstr void idtentry_validate_ist(struct pt_regs *regs)
>>> +{
>>> +   if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
>>> +   (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
>>> +   die("IST stack recursion", regs, 0);
>>> +}
>> Yes, this is a start, it doesn't cover the case where the NMI stack is
>> in-between, so I think you need to walk down regs->sp too.
> That shouldn't be possible with the current code, I think.

NMI; #MC; anything which IRETs but isn't fatal - #DB, or #BP from
patching, #GP from *_safe(), etc.; NMI

Sure, it's a corner case, but did you hear that IST is evil?
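
To illustrate the in-between case, a deeper check might look something
like this - purely a sketch, with on_ist_or_nmi_stack() and
prev_frame_sp() as assumed helpers rather than real kernel APIs:

/*
 * Sketch only: instead of comparing just the immediately interrupted
 * context, follow the chain of interrupted frames down through any
 * intervening IST/NMI stacks, and die if any of them already lives on
 * the stack we are currently executing on.
 */
noinstr void idtentry_validate_ist(struct pt_regs *regs)
{
	unsigned long mystack = _RET_IP_ & ~(EXCEPTION_STKSZ - 1);
	unsigned long sp = regs->sp;

	for (;;) {
		if ((sp & ~(EXCEPTION_STKSZ - 1)) == mystack)
			die("IST stack recursion", regs, 0);

		if (!on_ist_or_nmi_stack(sp))	/* assumed helper */
			break;

		sp = prev_frame_sp(sp);		/* assumed helper */
	}
}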

~Andrew

P.S. did you also hear that with Rowhammer, userspace has a nonzero
quantity of control over generating #MC, depending on how ECC is
configured on the platform.


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 16:16, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:49:40PM +0200, Joerg Roedel wrote:
>>> We're talking about the 3rd case where the only reason things 'work' is
>>> because we'll have to panic():
>>>
>>>  - #MC
>> Okay, #MC is special and can only be handled on a best-effort basis, as
>> #MC could happen anytime, also while already executing the #MC handler.
> I think the hardware has a MCE-mask bit somewhere. Flaky though because
> clearing it isn't 'atomic' with IRET, so there's a 'funny' window.

MSR_MCG_STATUS.MCIP, and yes - any #MC before that point will
immediately Shutdown.  Any #MC between that point and IRET will clobber
its IST stack and end up sad.
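
To make the window concrete, the tail of the handler is shaped like this
(a sketch; the real exit path is split between do_machine_check() and
its asm glue):

/*
 * Sketch of the #MC handler tail.  Clearing MCG_STATUS.MCIP re-arms
 * machine check delivery, but the IRET that unwinds the IST frame is a
 * separate instruction, so a #MC landing between the two re-enters on
 * the same IST stack and clobbers the frame we were about to IRET from.
 */
static void mce_handler_tail(void)
{
	/* ... actual #MC handling ... */

	wrmsrl(MSR_IA32_MCG_STATUS, 0);	/* MCIP cleared: the window opens */

	/* IRET happens in the asm caller - not atomic with the clear */
}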

> It also interacts really bad with the NMI handler. If we get an #MC
> early in the NMI, where we hard-rely on the NMI-mask being set to set-up
> the recursion stack, then the #MC IRET will clear the NMI-mask, and
> we're toast.
>
> Andy has wild and crazy ideas, but I don't think we need more crazy
> here.

Want, certainly not.  Need, maybe :-/
>>>  - #DB with BUS LOCK DEBUG EXCEPTION
>> If I understand the problem correctly, this can be solved by moving off
>> the IST stack to the current task stack in the #DB handler, like I plan
>> to do for #VC, no?
> Hmm, probably. Would take a bit of care, but should be doable.

Andy and I have spent some time considering this.

Having exactly one vector move off IST and onto an in-use task-stack is
doable, I think, so long as it can sort out self-recursion protections.

Having more than one vector do this is very complicated.  You've got to
take care to walk up the list of IST-nesting to see if any interrupted
context is in the middle of copying itself onto the stack, so you don't
clobber the frame it is in the middle of copying.
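
The single-vector shape is roughly the following - a sketch which
assumes the entry asm re-dispatches on the returned stack pointer, and
which elides the self-recursion protections mentioned above; the names
are illustrative:

/*
 * Sketch: called from the #VC entry asm while still on the IST stack.
 * Pick a landing point on the interrupted context's stack, copy the
 * exception frame there, and return the new stack pointer so the asm
 * can switch to it, freeing the IST stack for a nested #VC.
 */
asmlinkage unsigned long vc_switch_off_ist(struct pt_regs *regs)
{
	unsigned long sp;

	if (user_mode(regs))
		sp = current_top_of_stack();	/* fresh task stack */
	else
		sp = regs->sp;		/* continue below the live frame */

	sp = ALIGN_DOWN(sp, 8) - sizeof(*regs);
	memcpy((void *)sp, regs, sizeof(*regs));

	return sp;
}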

~Andrew

Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 14:03, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
>> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
>>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
>>> you to only make it IST if/when you try and make SNP happen, not before.
>> It is not the only reason, when ES guests gain debug register support
>> then #VC also needs to be IST, because #DB can be promoted into #VC
>> then, and as #DB is IST for a reason, #VC needs to be too.
> Didn't I read somewhere that that is only so for Rome/Naples but not for
> the later chips (Milan) which have #DB pass-through?

I don't know about hardware timelines, but some future part can now opt
in to having debug registers as part of the encrypted state, and swapped
by VMExit, which would make debug facilities generally usable, and
supposedly safe from the #DB infinite loop issues, at which point the
hypervisor need not intercept #DB for safety reasons.

It's worth noting that on current parts, the hypervisor can set up debug
facilities on behalf of the guest (or behind its back) as the DR state
is unencrypted, but that attempting to intercept #DB will redirect to
#VC inside the guest and cause fun. (Also spare a thought for 32bit
kernels which have to cope with userspace singlestepping the SYSENTER
path with every #DB turning into #VC.)

>> Besides that, I am not a fan of delegating problems I already see coming
>> to future-Joerg and future-Peter, but if at all possible deal with them
>> now and be safe later.
> Well, we could just say no :-) At some point in the very near future
> this house of cards is going to implode.

What currently exists is a picture of a house of cards in front of
something which has fallen down.

> Did someone forget to pass the 'ISTs are *EVIL*' memo to the hardware
> folks? How come we're getting more and more of them?

I have tried to get this point across.  Then again - it's far easier for
the software folk in the same company as the hardware folk to make this
point.

> (/me puts fingers
> in ears and goes la-la-la-la in anticipation of Andrew mentioning CET)

I wasn't going to bring it up, but seeing as you have - while there are
prohibitively complicated issues preventing it from working on native,
I don't see any point even considering it for the mess which is #VC, or
the even bigger mess which is #HV.

~Andrew

Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 13:47, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 12:51:03PM +0100, Andrew Cooper wrote:
>
>> There are cases which are definitely non-recoverable.
>>
>> For both ES and SNP, a malicious hypervisor can mess with the guest
>> physmap to make the NMI, #VC and #DF stacks all alias.
>>
>> For ES, this had better result in the #DF handler deciding that crashing
>> is the way out, whereas for SNP, this had better escalate to Shutdown.
>> Crashing out hard if the hypervisor is misbehaving is acceptable.
> Then I'm thinking the only sensible option is to crash hard for any SNP
> #VC from kernel mode.
>
> Sadly that doesn't help with #VC needing to be IST :-( IST is such a
> frigging nightmare.

I presume you mean any #VC caused by RMP faults (i.e. something went
wrong with the memory owner/etc metadata)?

If so, then yes.  Any failure here is a bug in the kernel or hypervisor
(and needs fixing) or a malicious hypervisor and the guest should
terminate for its own safety.

~Andrew

Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andrew Cooper
On 23/06/2020 12:30, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:07:06PM +0200, Peter Zijlstra wrote:
>> On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:
>> So what happens if this #VC triggers on the first access to the #VC
>> stack, because the malicious host has craftily mucked with only the #VC
>> IST stack page?
>>
>> Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
>> you up.
>>
>> AFAICT all of that is non-recoverable.
> I am not 100% sure, but I think if the #VC stack page is not validated,
> the #VC should be promoted to a #DF.
>
> Note that this is an issue only with secure nested paging (SNP), which
> is not enabled yet with this patch-set. When it gets enabled a stack
> recursion check in the #VC handler is needed which panics the VM. That
> also fixes the #VC-in-early-NMI problem.

There are cases which are definitely non-recoverable.

For both ES and SNP, a malicious hypervisor can mess with the guest
physmap to make the NMI, #VC and #DF stacks all alias.

For ES, this had better result in the #DF handler deciding that crashing
is the way out, whereas for SNP, this had better escalate to Shutdown.


What matters here is the security model in SNP.  The hypervisor is
relied upon for availability (because it could simply refuse to schedule
the VM), but market/business forces will cause it to do its best to keep
the VM running.  Therefore, the security model is simply(?) that the
hypervisor can't do anything to undermine the confidentiality or
integrity of the VM.

Crashing out hard if the hypervisor is misbehaving is acceptable.  In a
cloud, I as a customer would (threaten to?) take my credit card
elsewhere, while for enterprise, I'd shout at my virtualisation vendor
until a fix happened (also perhaps threaten to take my credit card
elsewhere).

Therefore, it is reasonable to build on the expectation that the
hypervisor won't be messing around with remapping stacks/etc.  Most
#VC's are synchronous with guest actions (they equate to actions which
would have caused a VMExit), so you can safely reason about when the
first #VC might occur, presuming no funny business with the frames
backing any memory operands touched.

~Andrew

Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-04-28 Thread Andrew Cooper
On 28/04/2020 08:55, Joerg Roedel wrote:
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
>> I have a somewhat serious question: should we use IST for #VC at all?
>> As I understand it, Rome and Naples make it mandatory for hypervisors
>> to intercept #DB, which means that, due to the MOV SS mess, it's sort
>> of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
>> if we're running under a sufficiently sensible hypervisor, we don't
>> need IST for #VC.
> The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
> page ownership tracking between guest and host, so that the hypervisor
> can't remap guest pages without the guest noticing.
>
> If there is a violation of ownership, which can happen at any memory
> access, there will be a #VC exception to notify the guest. And as this
> can happen anywhere, for example on a carefully crafted stack page set
> by userspace before doing SYSCALL, the only robust choice for #VC is to
> use IST.

The kernel won't ever touch the guest stack before restoring %rsp in the
syscall path, but the (minimum 2) memory accesses required to save the
user %rsp and load the kernel stack may be subject to #VC exceptions, as
are instruction fetches at the head of the SYSCALL path.
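
Concretely, the head of that path looks roughly like this (abridged from
arch/x86/entry/entry_64.S of this era; exact macro spellings from
memory, so treat as a sketch):

entry_SYSCALL_64:
	swapgs
	/* 1st memory access: save user %rsp in a per-cpu scratch slot */
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	/* (KPTI CR3 switch elided) */
	/* 2nd memory access: load the kernel stack, again per-cpu */
	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
	/* ... push the iret frame and pt_regs, then call into C ... */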

So yes - #VC needs IST.

Sorry for the noise.  (That said, it is unfortunate that the hypervisor
messing with the memory backing the guest #VC handler results in an
infinite loop, rather than an ability to cleanly terminate.)

~Andrew

Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-04-27 Thread Andrew Cooper
On 27/04/2020 18:37, Andy Lutomirski wrote:
> On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski  wrote:
>> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel  wrote:
>>> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
>>>> I assume the race you mean is:
>>>>
>>>> #VC
>>>> Immediate NMI before IST gets shifted
>>>> #VC
>>>>
>>>> Kaboom.
>>>>
>>>> How are you dealing with this?  Ultimately, I think that NMI will need
>>>> to turn off IST before engaging in any funny business. Let me ponder
>>>> this a bit.
>>> Right, I dealt with that by unconditionally shifting/unshifting the #VC IST
>>> entry in do_nmi() (thanks to David Kaplan for the idea). It might cause
>>> one of the IST stacks to be unused during nesting, but that is fine. The
>>> stack memory for #VC is only allocated when SEV-ES is active (in an
>>> SEV-ES VM).
>> Blech.  It probably works, but still, yuck.  It's a bit sad that we
>> seem to be growing more and more poorly designed happens-anywhere
>> exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
>> #HV, and #DB.  This doesn't really scale.
> I have a somewhat serious question: should we use IST for #VC at all?
> As I understand it, Rome and Naples make it mandatory for hypervisors
> to intercept #DB, which means that, due to the MOV SS mess, it's sort
> of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
> if we're running under a sufficiently sensible hypervisor, we don't
> need IST for #VC.
>
> So I think we have two choices:
>
> 1. Use IST for #VC and deal with all the mess that entails.
>
> 2. Say that our SEV-ES client support on Rome and Naples is for
> development only and do a quick boot-time check for whether #DB is
> intercepted.  (Just set TF and see what vector we get.)  If #DB is
> intercepted, print a very loud warning and refuse to boot unless some
> special sev_es.insecure_development_mode or similar option is set.
>
> #2 results in simpler and more robust entry code.  #1 is more secure.
>
> So my question is: will anyone actually use SEV-ES in production on
> Rome or Naples?  As I understand it, it's not really ready for prime
> time on those chips.  And do we care if the combination of a malicious
> hypervisor and malicious guest userspace on Milan can compromise the
> guest kernel?  I don't think SEV-ES is really meant to resist a
> concerted effort by the hypervisor to compromise the guest.

More specifically, it is mandatory for hypervisors to intercept #DB to
defend against CVE-2015-8104, unless they're willing to trust the guest
not to tickle that corner case.

This is believed fixed with SEV-SNP to allow the encrypted guest to use
debugging functionality without posing a DoS risk to the host.  In this
case, the hypervisor is expected not to intercept #DB.

If #DB is intercepted, and #VC doesn't use IST, malicious userspace can
cause problems with a movss-delayed breakpoint over SYSCALL.

The question is basically whether it is worth going to the effort of
making #VC IST, with all the problems that entails, to cover one corner
case on earlier hardware.

Ultimately, this depends on whether anyone plans to put SEV-ES into
production on pre SEV-SNP hardware, and if developers using pre-SEV-SNP
hardware are happy with "don't run malicious userspace" or "don't run
malicious kernels and skip the #DB intercept" as a fair tradeoff to
avoid the #VC IST fun.

~Andrew

Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-19 Thread Andrew Cooper
On 16/07/2019 01:05, Andy Lutomirski wrote:
> On Mon, Jul 15, 2019 at 4:30 PM Andrew Cooper  
> wrote:
>> On 15/07/2019 19:17, Nadav Amit wrote:
>>>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  
>>>> wrote:
>>>>
>>>> There is a lot of infrastructure for functionality which is used
>>>> exclusively in __{save,restore}_processor_state() on the suspend/resume
>>>> path.
>>>>
>>>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
>>>> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
>>>> rest of the Local APIC state isn't a clever thing to be doing.
>>>>
>>>> Delete the suspend/resume cr8 handling, which shrinks the size of struct
>>>> saved_context, and allows for the removal of both PVOPS.
>>> I think removing the interface for CR8 writes is also good to avoid
>>> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
>>> Priorities between CR8 and APIC”):
>>>
>>> "Operating software should implement either direct APIC TPR updates or CR8
>>> style TPR updates but not mix them. Software can use a serializing
>>> instruction (for example, CPUID) to serialize updates between MOV CR8 and
>>> stores to the APIC.”
>>>
>>> And native_write_cr8() did not even issue a serializing instruction.
>>>
>> Given its location, the one write_cr8() is bounded by two serialising
>> operations, so is safe in practice.
>>
>> However, I agree with the statement in the manual.  I could submit a v3
>> with an updated commit message, or let it be fixed on commit.  Whichever
>> is easiest.
>>
> I don't see anything wrong with the message.  If we actually used CR8
> for interrupt priorities, we wouldn't want it to serialize.  The bug
> is that the code that did the write_cr8() should have had a comment as
> to how it serialized against lapic_restore().  But that doesn't seem
> worth mentioning in the message, since, as noted, the real problem was
> that it nonsensically restored just TPR without restoring everything
> else.

Fair enough, in which case I'm happy with v2 as it is.

~Andrew

Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andrew Cooper
On 15/07/2019 19:17, Nadav Amit wrote:
>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  wrote:
>>
>> There is a lot of infrastructure for functionality which is used
>> exclusively in __{save,restore}_processor_state() on the suspend/resume
>> path.
>>
>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
>> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
>> rest of the Local APIC state isn't a clever thing to be doing.
>>
>> Delete the suspend/resume cr8 handling, which shrinks the size of struct
>> saved_context, and allows for the removal of both PVOPS.
> I think removing the interface for CR8 writes is also good to avoid
> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
> Priorities between CR8 and APIC”):
>
> "Operating software should implement either direct APIC TPR updates or CR8
> style TPR updates but not mix them. Software can use a serializing
> instruction (for example, CPUID) to serialize updates between MOV CR8 and
> stores to the APIC.”
>
> And native_write_cr8() did not even issue a serializing instruction.
>

Given its location, the one write_cr8() is bounded by two serialising
operations, so is safe in practice.

However, I agree with the statement in the manual.  I could submit a v3
with an updated commit message, or let it be fixed on commit.  Whichever
is easiest.

~Andrew

Re: [Xen-devel] [PATCH 0/2] Remove 32-bit Xen PV guest support

2019-07-15 Thread Andrew Cooper
On 15/07/2019 18:28, Andy Lutomirski wrote:
> On Mon, Jul 15, 2019 at 9:34 AM Andi Kleen  wrote:
>> Juergen Gross  writes:
>>
>>> The long term plan has been to replace Xen PV guests by PVH. The first
>>> victim of that plan are now 32-bit PV guests, as those are used only
>>> rather seldom these days. Xen on x86 requires 64-bit support and with
>>> Grub2 now supporting PVH officially since version 2.04 there is no
>>> need to keep 32-bit PV guest support alive in the Linux kernel.
>>> Additionally Meltdown mitigation is not available in the kernel running
>>> as 32-bit PV guest, so dropping this mode makes sense from security
>>> point of view, too.
>> Normally we have a deprecation period for feature removals like this.
>> You would make the kernel print a warning for some releases, and when
>> no user complains you can then remove. If a user complains you can't.
>>
> As I understand it, the kernel rules do allow changes like this even
> if there's a complaint: this is a patch that removes what is
> effectively hardware support.  If the maintenance cost exceeds the
> value, then removal is fair game.  (Obviously we weight the value to
> preserving compatibility quite highly, but in this case, Xen dropped
> 32-bit hardware support a long time ago.  If the Xen hypervisor says
> that 32-bit PV guest support is deprecated, it's deprecated.)
>
> That being said, a warning might not be a bad idea.  What's the
> current status of this in upstream Xen?

So personally, I'd prefer to see support stay, but at the end of the day
it is Juergen's choice as the maintainer of the code.

Xen itself has been exclusively 64-bit since Xen 4.3 (released in 2013).

Over time, various features like SMEP/SMAP have been making 32bit PV
guests progressively slower, because ring 1 is supervisor rather than
user.  Things have got even worse with IBRS, to the point at which 32bit
PV guests are starting to run like treacle.

There are no current plans to remove support for 32bit PV guests from
Xen, but it is very much in the category of "you shouldn't be using this
mode any more".

~Andrew

P.S. I don't see 64bit PV guest support going anywhere, because there
are still a number of open performance questions due to the inherent
differences between syscall and vmexit, and the difference EPT/NPT
tables make on cross-domain mappings.

[PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andrew Cooper
There is a lot of infrastructure for functionality which is used
exclusively in __{save,restore}_processor_state() on the suspend/resume
path.

cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
rest of the Local APIC state isn't a clever thing to be doing.

Delete the suspend/resume cr8 handling, which shrinks the size of struct
saved_context, and allows for the removal of both PVOPS.

Signed-off-by: Andrew Cooper 
---
CC: x...@kernel.org
CC: virtualization@lists.linux-foundation.org
CC: Borislav Petkov 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Stephane Eranian 
CC: Feng Tang 
CC: Juergen Gross 
CC: Boris Ostrovsky 
CC: "Rafael J. Wysocki" 
CC: Pavel Machek 

Spotted while reviewing "x86/apic: Initialize TPR to block interrupts 16-31"

https://lore.kernel.org/lkml/dc04a9f8b234d7b0956a8d2560b8945bcd9c4bf7.1563117760.git.l...@kernel.org/

v2:
 * Drop saved_context.cr8 as well (Juergen)
 * Remove akata...@vmware.com from the CC list due to bounces
---
 arch/x86/include/asm/paravirt.h   | 12 
 arch/x86/include/asm/paravirt_types.h |  5 -
 arch/x86/include/asm/special_insns.h  | 24 
 arch/x86/include/asm/suspend_64.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c  |  1 -
 arch/x86/kernel/paravirt.c|  4 
 arch/x86/power/cpu.c  |  4 
 arch/x86/xen/enlighten_pv.c   | 15 ---
 8 files changed, 1 insertion(+), 66 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..0e4a0539c353 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -139,18 +139,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-#ifdef CONFIG_X86_64
-static inline unsigned long read_cr8(void)
-{
-   return PVOP_CALL0(unsigned long, cpu.read_cr8);
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   PVOP_VCALL1(cpu.write_cr8, x);
-}
-#endif
-
 static inline void arch_safe_halt(void)
 {
PVOP_VCALL0(irq.safe_halt);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3c775fb5524b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -119,11 +119,6 @@ struct pv_cpu_ops {
 
void (*write_cr4)(unsigned long);
 
-#ifdef CONFIG_X86_64
-   unsigned long (*read_cr8)(void);
-   void (*write_cr8)(unsigned long);
-#endif
-
/* Segment descriptor handling */
void (*load_tr_desc)(void);
void (*load_gdt)(const struct desc_ptr *);
diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 219be88a59d2..6d37b8fcfc77 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -73,20 +73,6 @@ static inline unsigned long native_read_cr4(void)
 
 void native_write_cr4(unsigned long val);
 
-#ifdef CONFIG_X86_64
-static inline unsigned long native_read_cr8(void)
-{
-   unsigned long cr8;
-   asm volatile("movq %%cr8,%0" : "=r" (cr8));
-   return cr8;
-}
-
-static inline void native_write_cr8(unsigned long val)
-{
-   asm volatile("movq %0,%%cr8" :: "r" (val) : "memory");
-}
-#endif
-
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline u32 rdpkru(void)
 {
@@ -200,16 +186,6 @@ static inline void wbinvd(void)
 
 #ifdef CONFIG_X86_64
 
-static inline unsigned long read_cr8(void)
-{
-   return native_read_cr8();
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   native_write_cr8(x);
-}
-
 static inline void load_gs_index(unsigned selector)
 {
native_load_gs_index(selector);
diff --git a/arch/x86/include/asm/suspend_64.h 
b/arch/x86/include/asm/suspend_64.h
index a7af9f53c0cb..35bb35d28733 100644
--- a/arch/x86/include/asm/suspend_64.h
+++ b/arch/x86/include/asm/suspend_64.h
@@ -34,7 +34,7 @@ struct saved_context {
 */
unsigned long kernelmode_gs_base, usermode_gs_base, fs_base;
 
-   unsigned long cr0, cr2, cr3, cr4, cr8;
+   unsigned long cr0, cr2, cr3, cr4;
u64 misc_enable;
bool misc_enable_saved;
struct saved_msrs saved_msrs;
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d3d075226c0a..8b54d8e3a561 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -62,7 +62,6 @@ int main(void)
ENTRY(cr2);
ENTRY(cr3);
ENTRY(cr4);
-   ENTRY(cr8);
ENTRY(gdt_desc);
BLANK();
 #undef ENTRY
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 98039d7fb998..de4d4e8a54c1 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -311,10 +311,6 @@ struct paravirt_patch_template pv_ops = {
   

Re: [PATCH] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andrew Cooper
On 15/07/2019 15:52, Juergen Gross wrote:
> On 15.07.19 16:26, Andy Lutomirski wrote:
>> On Mon, Jul 15, 2019 at 6:23 AM Juergen Gross  wrote:
>>>
>>> On 15.07.19 15:00, Andrew Cooper wrote:
>>>> There is a lot of infrastructure for functionality which is used
>>>> exclusively in __{save,restore}_processor_state() on the
>>>> suspend/resume
>>>> path.
>>>>
>>>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored
>>>> independently by lapic_{suspend,resume}().
>>>
>>> Aren't those called only with CONFIG_PM set?
>>>
>>
>>
>> Unless I'm missing something, we only build any of the restore code
>> (including the write_cr8() call) if CONFIG_PM_SLEEP is set, and that
>> selects CONFIG_PM, so we should be fine, I think.
>>
>
> Okay, in that case I'd suggest to remove "cr8" from struct saved_context
> as it won't be used any longer.

To be honest, saving and restoring of cr8 without the rest of the Local
APIC state is conceptually broken to begin with.

If there are any bugs revealed by this, then the correct fixes are
elsewhere.

You're right about the saved_context.  I'll submit a v2 with an even
larger negative diffstat.

~Andrew

[PATCH] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andrew Cooper
There is a lot of infrastructure for functionality which is used
exclusively in __{save,restore}_processor_state() on the suspend/resume
path.

cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored
independently by lapic_{suspend,resume}().

Delete the saving and restoration of cr8, which allows for the removal
of both PVOPS.

Signed-off-by: Andrew Cooper 
---
CC: x...@kernel.org
CC: virtualization@lists.linux-foundation.org
CC: Borislav Petkov 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Stephane Eranian 
CC: Feng Tang 
CC: Juergen Gross 
CC: Boris Ostrovsky 
CC: Alok Kataria 
CC: "Rafael J. Wysocki" 
CC: Pavel Machek 

Spotted while reviewing "x86/apic: Initialize TPR to block interrupts 16-31"

https://lore.kernel.org/lkml/dc04a9f8b234d7b0956a8d2560b8945bcd9c4bf7.1563117760.git.l...@kernel.org/
---
 arch/x86/include/asm/paravirt.h   | 12 
 arch/x86/include/asm/paravirt_types.h |  5 -
 arch/x86/include/asm/special_insns.h  | 24 
 arch/x86/kernel/paravirt.c|  4 
 arch/x86/power/cpu.c  |  4 
 arch/x86/xen/enlighten_pv.c   | 15 ---
 6 files changed, 64 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..0e4a0539c353 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -139,18 +139,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-#ifdef CONFIG_X86_64
-static inline unsigned long read_cr8(void)
-{
-   return PVOP_CALL0(unsigned long, cpu.read_cr8);
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   PVOP_VCALL1(cpu.write_cr8, x);
-}
-#endif
-
 static inline void arch_safe_halt(void)
 {
PVOP_VCALL0(irq.safe_halt);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3c775fb5524b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -119,11 +119,6 @@ struct pv_cpu_ops {
 
void (*write_cr4)(unsigned long);
 
-#ifdef CONFIG_X86_64
-   unsigned long (*read_cr8)(void);
-   void (*write_cr8)(unsigned long);
-#endif
-
/* Segment descriptor handling */
void (*load_tr_desc)(void);
void (*load_gdt)(const struct desc_ptr *);
diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 219be88a59d2..6d37b8fcfc77 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -73,20 +73,6 @@ static inline unsigned long native_read_cr4(void)
 
 void native_write_cr4(unsigned long val);
 
-#ifdef CONFIG_X86_64
-static inline unsigned long native_read_cr8(void)
-{
-   unsigned long cr8;
-   asm volatile("movq %%cr8,%0" : "=r" (cr8));
-   return cr8;
-}
-
-static inline void native_write_cr8(unsigned long val)
-{
-   asm volatile("movq %0,%%cr8" :: "r" (val) : "memory");
-}
-#endif
-
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline u32 rdpkru(void)
 {
@@ -200,16 +186,6 @@ static inline void wbinvd(void)
 
 #ifdef CONFIG_X86_64
 
-static inline unsigned long read_cr8(void)
-{
-   return native_read_cr8();
-}
-
-static inline void write_cr8(unsigned long x)
-{
-   native_write_cr8(x);
-}
-
 static inline void load_gs_index(unsigned selector)
 {
native_load_gs_index(selector);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 98039d7fb998..de4d4e8a54c1 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -311,10 +311,6 @@ struct paravirt_patch_template pv_ops = {
.cpu.read_cr0   = native_read_cr0,
.cpu.write_cr0  = native_write_cr0,
.cpu.write_cr4  = native_write_cr4,
-#ifdef CONFIG_X86_64
-   .cpu.read_cr8   = native_read_cr8,
-   .cpu.write_cr8  = native_write_cr8,
-#endif
.cpu.wbinvd = native_wbinvd,
.cpu.read_msr   = native_read_msr,
.cpu.write_msr  = native_write_msr,
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 24b079e94bc2..1c58d8982728 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -122,9 +122,6 @@ static void __save_processor_state(struct saved_context 
*ctxt)
ctxt->cr2 = read_cr2();
ctxt->cr3 = __read_cr3();
ctxt->cr4 = __read_cr4();
-#ifdef CONFIG_X86_64
-   ctxt->cr8 = read_cr8();
-#endif
ctxt->misc_enable_saved = !rdmsrl_safe(MSR_IA32_MISC_ENABLE,
   >misc_enable);
msr_save_context(ctxt);
@@ -207,7 +204,6 @@ static void notrace __restore_processor_state(struct 
saved_context *ctxt)
 #else
 /* CONFIG X86_64 */
wrmsrl(MSR_EFER, ctxt->efer);
-   write_cr8(ctxt->cr8);
__write_

Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Andrew Cooper
On 03/07/2019 18:02, Nadav Amit wrote:
>> On Jul 3, 2019, at 7:04 AM, Juergen Gross  wrote:
>>
>> On 03.07.19 01:51, Nadav Amit wrote:
>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>> and hyper-v are only compile-tested).
>>> While the updated smp infrastructure is capable of running a function on
>>> a single local core, it is not optimized for this case. The multiple
>>> function calls and the indirect branch introduce some overhead, and
>>> might make local TLB flushes slower than they were before the recent
>>> changes.
>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>> is needed to restore the lost performance in this common case. This
>>> requires checking mm_cpumask() one more time, but unless this mask is
>>> updated very frequently, this should not impact performance negatively.
>>> Cc: "K. Y. Srinivasan" 
>>> Cc: Haiyang Zhang 
>>> Cc: Stephen Hemminger 
>>> Cc: Sasha Levin 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: Borislav Petkov 
>>> Cc: x...@kernel.org
>>> Cc: Juergen Gross 
>>> Cc: Paolo Bonzini 
>>> Cc: Dave Hansen 
>>> Cc: Andy Lutomirski 
>>> Cc: Peter Zijlstra 
>>> Cc: Boris Ostrovsky 
>>> Cc: linux-hyp...@vger.kernel.org
>>> Cc: linux-ker...@vger.kernel.org
>>> Cc: virtualization@lists.linux-foundation.org
>>> Cc: k...@vger.kernel.org
>>> Cc: xen-de...@lists.xenproject.org
>>> Signed-off-by: Nadav Amit 
>>> ---
>>>  arch/x86/hyperv/mmu.c | 13 +++---
>>>  arch/x86/include/asm/paravirt.h   |  6 +--
>>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>>  arch/x86/include/asm/tlbflush.h   |  9 ++--
>>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>  arch/x86/kernel/kvm.c | 11 +++--
>>>  arch/x86/kernel/paravirt.c|  2 +-
>>>  arch/x86/mm/tlb.c | 65 ---
>>>  arch/x86/xen/mmu_pv.c | 20 ++---
>>>  include/trace/events/xen.h|  2 +-
>>>  10 files changed, 91 insertions(+), 43 deletions(-)
>> ...
>>
>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>> index beb44e22afdf..19e481e6e904 100644
>>> --- a/arch/x86/xen/mmu_pv.c
>>> +++ b/arch/x86/xen/mmu_pv.c
>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>> preempt_enable();
>>>  }
>>>  -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>> -const struct flush_tlb_info *info)
>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>> +   const struct flush_tlb_info *info)
>>>  {
>>> struct {
>>> struct mmuext_op op;
>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask 
>>> *cpus,
>>> const size_t mc_entry_size = sizeof(args->op) +
>>> sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>  -  trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>> +   trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>> if (cpumask_empty(cpus))
>>> return; /* nothing to do */
>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct 
>>> cpumask *cpus,
>>> args = mcs.args;
>>> args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>  -  /* Remove us, and any offline CPUS. */
>>> +   /* Flush locally if needed and remove us */
>>> +   if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>> +   local_irq_disable();
>>> +   flush_tlb_func_local(info);
>> I think this isn't the correct function for PV guests.
>>
>> In fact it should be much easier: just don't clear your own cpu from the
>> mask, that's all what's needed. The hypervisor is just fine having the
>> current cpu in the mask and it will do the right thing.
> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
> the same, unfortunately, since it would induce VM-exit on TLB flushes.

Why do you believe the vmexit matters?  You're taking one anyway for
the IPI.

Intel only have virtualised self-IPI, and while AMD do have working
non-self IPIs, you still take a vmexit anyway if any destination vcpu
isn't currently running in non-root mode (IIRC).

At that point, you might as well have the hypervisor do all the hard
work via a multi-cpu shootdown/flush hypercall, rather than trying to
arrange it locally.
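
The hypervisor-assisted shape would be roughly this (sketch only;
hv_flush_remote_tlbs() and native_flush_tlb_multi() are stand-in names
for whatever the platform offers, not real APIs):

/*
 * Sketch: let the hypervisor do the whole shootdown in one exit, and
 * only fall back to real IPIs if no such hypercall is available.  One
 * exit for the batch beats an exit per IPI, especially when target
 * vCPUs are preempted and the IPIs can't be delivered directly.
 */
static void pv_flush_tlb_multi(const struct cpumask *cpus,
			       const struct flush_tlb_info *info)
{
	if (!hv_flush_remote_tlbs(cpus, info->start, info->end))
		return;		/* hypervisor flushed everything */

	native_flush_tlb_multi(cpus, info);	/* fall back to IPIs */
}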

~Andrew

Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-12 Thread Andrew Cooper
On 12/10/17 20:11, Boris Ostrovsky wrote:
> On 10/06/2017 10:32 AM, Josh Poimboeuf wrote:
>> On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote:
  #ifdef CONFIG_PARAVIRT
 +/*
 + * Paravirt alternatives are applied much earlier than normal 
 alternatives.
 + * They are only applied when running on a hypervisor.  They replace some
 + * native instructions with calls to pv ops.
 + */
 +void __init apply_pv_alternatives(void)
 +{
 +  setup_force_cpu_cap(X86_FEATURE_PV_OPS);
>>> Not for Xen HVM guests.
>> From what I can tell, HVM guests still use pv_time_ops and
>> pv_mmu_ops.exit_mmap, right?
>>
 +  apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end);
 +}
>>> This is a problem (at least for Xen PV guests):
>>> apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death.
>> Ah, right.
>>
>>> It might be possible not to turn off/on the interrupts in this
>>> particular case since the guest probably won't be able to handle an
>>> interrupt at this point anyway.
>> Yeah, that should work.  For Xen and for the other hypervisors, this is
>> called well before irq init, so interrupts can't be handled yet anyway.
> There is also another problem:
>
> [1.312425] general protection fault:  [#1] SMP
> [1.312901] Modules linked in:
> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> [1.313878] task: 88003e2c task.stack: c938c000
> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
> [1.315336] RAX: 000c RBX: 55f550168040 RCX:
> 7fcfc959f59a
> [1.315827] RDX:  RSI:  RDI:
> 
> [1.316315] RBP: 000a R08: 037f R09:
> 0064
> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
> 7fcfc958ad60
> [1.317300] R13:  R14: 55f550185954 R15:
> 1000
> [1.317801] FS:  () GS:88003f80()
> knlGS:
> [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
> 00042660
> [1.319235] Call Trace:
> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: c938ff50
> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> [1.345009] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x000b
>
>
> All code
> 
>0:51   push   %rcx
>1:50   push   %rax
>2:57   push   %rdi
>3:56   push   %rsi
>4:52   push   %rdx
>5:51   push   %rcx
>6:6a dapushq  $0xffda
>8:41 50push   %r8
>a:41 51push   %r9
>c:41 52push   %r10
>e:41 53push   %r11
>   10:48 83 ec 30  sub$0x30,%rsp
>   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
>   1b:00 00
>   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
>   24:0f 85 a5 00 00 00jne0xcf
>   2a:50   push   %rax
>   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
> 0xffd095cd<-- trapping instruction
>   31:58   pop%rax
>   32:48 3d 4c 01 00 00cmp$0x14c,%rax
>   38:77 0fja 0x49
>   3a:4c 89 d1 mov%r10,%rcx
>   3d:ff   .byte 0xff
>   3e:14 c5adc$0xc5,%al
>
>
> so the original 'cli' was replaced with the pv call but to me the offset
> looks a bit off, no? Shouldn't it always be positive?

callq takes a 32bit signed displacement, so jumping back by up to 2G is
perfectly legitimate.

The #GP[0] however means that whatever 8 byte value was found at
-0x2f6a64(%rip) was a non-canonical address.

One option is that the pvops structure hasn't been initialised properly,
but an alternative is that the relocation wasn't processed correctly,
and the code is trying to reference something which isn't a function
pointer.

~Andrew


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-14 Thread Andrew Cooper
On 14/02/17 14:46, Waiman Long wrote:
> On 02/14/2017 04:39 AM, Peter Zijlstra wrote:
>> On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote:
>>> It is the address of _time that will exceed the 32-bit limit.
>> That seems extremely unlikely. That would mean we have more than 4G
>> worth of per-cpu variables declared in the kernel.
> I have some doubt about if the compiler is able to properly use
> RIP-relative addressing for this. Anyway, it seems like constraints
> aren't allowed for asm() when not in the function context, at least for
> the compiler that I am using (4.8.5). So it is a moot point.

You can work around the issue of not having parameters in a plain asm()
statement by using an asm-offset, stringizing it, and having C put the
string fragments back together.

"cmpb $0, " STR(STEAL_TIME_preempted) "(%rax);"
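
Spelled out a little (a sketch: STEAL_TIME_preempted is assumed to come
from asm-offsets.c, %rax is assumed to already hold the per-cpu
steal_time pointer, and __stringify() does the pasting at preprocess
time):

#include <linux/stringify.h>

/*
 * No asm() constraints needed outside a function: the offset becomes a
 * numeric literal in the template before the assembler ever sees it.
 */
asm(".pushsection .text\n"
    ".global vcpu_is_preempted_stub\n"
    "vcpu_is_preempted_stub:\n"
    "	cmpb	$0, " __stringify(STEAL_TIME_preempted) "(%rax)\n"
    "	setne	%al\n"
    "	ret\n"
    ".popsection");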

~Andrew


Re: [Xen-devel] new barrier type for paravirt (was Re: [PATCH] virtio_ring: use smp_store_mb)

2015-12-20 Thread Andrew Cooper
On 20/12/15 09:25, Michael S. Tsirkin wrote:
> On Thu, Dec 17, 2015 at 03:39:10PM +0100, Peter Zijlstra wrote:
>> On Thu, Dec 17, 2015 at 04:33:44PM +0200, Michael S. Tsirkin wrote:
>>> On Thu, Dec 17, 2015 at 02:57:26PM +0100, Peter Zijlstra wrote:
 You could of course go fix that instead of mutilating things into
 sort-of functional state.
>>> Yes, we'd just need to touch all architectures, all for
>>> the sake of UP which almost no one uses.
>>> Basically, we need APIs that explicitly are
>>> for talking to another kernel on a different CPU on
>>> the same SMP system, and implemented identically
>>> between CONFIG_SMP and !CONFIG_SMP on all architectures.
>>>
>>> Do you think this is something of general usefulness,
>>> outside virtio?
>> I'm not aware of any other case, but if there are more parts of virt
>> that need this then I see no problem adding it.
> When I wrote this, I assumed there are no other users, and I'm still not
> sure there are other users at the moment. Do you see a problem then?
>
>> That is, virt in general is the only use-case that I can think of,
>> because this really is an artifact of interfacing with an SMP host while
>> running an UP kernel.
> Or another guest kernel on an SMP host.
>
>> But I'm really not familiar with virt, so I do not know if there's more
>> sites outside of virtio that could use this.
>> Touching all archs is a tad tedious, but its fairly straight forward.
> So I looked and I was only able to find other another possible user in Xen.
>
> Cc Xen folks.
>
> I noticed that drivers/xen/xenbus/xenbus_comms.c uses
> full memory barriers to communicate with the other side.
> For example:
>
> /* Must write data /after/ reading the consumer index.  * */
> mb();
>
> memcpy(dst, data, avail);
> data += avail;
> len -= avail;
> 
> /* Other side must not see new producer until data is * 
> there. */
> wmb();
> intf->req_prod += avail;
> 
> /* Implies mb(): other side will see the updated producer. */
> notify_remote_via_evtchn(xen_store_evtchn);
>
> To me, it looks like for guests compiled with CONFIG_SMP, smp_wmb and smp_mb
> would be sufficient, so mb() and wmb() here are only needed if
> a non-SMP guest runs on an SMP host.
>
> Is my analysis correct?

Correct.  The reason full barriers are used is so non-SMP guests still
function correctly.  It is certainly inefficient.

>
> So what I'm suggesting is something like the below patch,
> except instead of using virtio directly, a new set of barriers
> that behaves identically for SMP and non-SMP guests will be introduced.
>
> And of course the weak barriers flag is not needed for Xen -
> that's a virtio only thing.
>
> For example:
>
> smp_pv_wmb()
> smp_pv_rmb()
> smp_pv_mb()
>
> I'd like to get confirmation from Xen folks before posting
> this patchset.
>
> Comments/suggestions?

Very much +1 for fixing this.

Those names would be fine, but they do add yet another set of options in
an already-complicated area.

An alternative might be to have the regular smp_{w,r,}mb() not revert
back to nops if CONFIG_PARAVIRT, or perhaps if pvops have detected a
non-native environment.  (I don't know how feasible this suggestion is,
however.)
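
A minimal sketch of what the new set might look like, assuming the
simplest mapping - aliasing to the mandatory barriers, which is
sufficient though stronger than an SMP barrier needs to be on some
architectures:

/*
 * Barriers for talking to another kernel/CPU on the same machine:
 * identical for CONFIG_SMP and !CONFIG_SMP builds, unlike smp_*mb(),
 * which degrade to compiler barriers on UP.
 */
#define smp_pv_mb()	mb()
#define smp_pv_rmb()	rmb()
#define smp_pv_wmb()	wmb()

(Upstream later grew essentially this split as virt_mb()/virt_rmb()/
virt_wmb().)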

~Andrew


Re: [Xen-devel] [PATCH v2 0/3] Fix and cleanup for 32-bit PV sysexit

2015-12-15 Thread Andrew Cooper
On 19/11/15 22:07, Andy Lutomirski wrote:
> On Thu, Nov 19, 2015 at 1:55 PM, Boris Ostrovsky
>  wrote:
>> The first patch fixes Xen PV regression introduced by 32-bit rewrite. Unlike 
>> the
>> earlier version it uses ALTERNATIVE instruction and avoids using xen_sysexit
>> (and sysret32 in compat mode) pv ops, as suggested by Andy.
>>
>> As result of this patch irq_enable_sysexit and usergs_sysret32 pv ops are not
>> used anymore by anyone and so can be removed.
> This whole series is:
>
> Acked-by: Andy Lutomirski 
>
> Now I just have to sucker someone into getting rid of
> PARAVIRT_ADJUST_EXCEPTION_FRAME (by using stub entries) and the
> overcomplicated syscall entry stuff.  :)

Looking at this, it should be quite easy now.

ALTERNATIVE "", "pop %rcx; %pop %11", X86_FEATURE_XENPV

(Completely untested)
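
i.e. in macro form, something like (the same untested sketch; Xen PV
delivers exceptions with %rcx/%r11 pushed above the hardware-style
frame, so native needs no adjustment and Xen just discards the two
extra words):

.macro PARAVIRT_ADJUST_EXCEPTION_FRAME
	ALTERNATIVE "", "pop %rcx; pop %r11", X86_FEATURE_XENPV
.endm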

> And whoever gets rid of
> PARAVIRT_ADJUST_EXCEPTION_FRAME gets to wonder why it doesn't crash
> and burn for NMIs on Xen, since I'm reasonably confident that it can't
> possibly be correct.

The Xen PV ABI only has a single kernel stack pointer which may be
registered.  There is no equivalent of an IST, so if a second fault
occurs, it is delivered normally on the current stack.

By the looks of it, the other NMI handling is ambivalent to the fact
that it isn't really on an IST stack under Xen.

~Andrew


Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Andrew Cooper
On 29/10/14 05:19, Andy Lutomirski wrote:
 Here's a draft CommonHV spec.  It's also on github:

 https://github.com/amluto/CommonHV

 So far, this provides a two-way RNG interface, a way to detect it, and
 a way to detect other hypervisor leaves.  The latter is because, after
 both the enormous public thread and some private discussions, it seems
 that detection of existing CPUID paravirt leaves is annoying and
 inefficient.  If we're going to define some cross-vendor CPUID leaves,
 it seems like it would be useful to offer a way to quickly enumerate
 other leaves.

 I've been told the AMD intends to update their manual to match Intel's
 so that hypervisors can use the entire 0x4F?? CPUID range.  I have
 intentionally not fixed an MSR value for the RNG because the range of
 allowed MSRs is very small in both the Intel and AMD manuals.  If any
 given hypervisor wants to ignore that small range and advertise a
 higher-numbered MSR, it is welcome to, but I don't want to codify
 something that doesn't comply with the manuals.

 Here's the draft.  Comments?  To the people who work on various
 hypervisors: Would you implement this?  Do you like it?  Is there
 anything, major or minor, that you'd like to see changed?  Do you
 think that this is a good idea at all?

As a first pass, it looks like a plausible idea.  I do however have some
comments.


 I've tried to get good coverage of various hypervisors.  There are
 Hyper-V, VMWare, KVM, and Xen people on the cc list.

 Thanks,
 Andy



 CommonHV, a common hypervisor interface
 ===

 This is CommonHV draft 1.

 The CommonHV specification is Copyright (c) 2014 Andrew Lutomirski.

 Licensing will be determined soon.  The license is expected to be extremely
 liberal.  I am currently leaning towards CC-BY-SA for the specification and
 an explicit license permitting anyone to implement the specification
 with no restrictions whatsoever.

 I have not patented, nor do I intend to patent, anything required to implement
 this specification.  I am not aware of any current or future intellectual
 property rights that would prevent a royalty-free implementation of
 this specification.

 I would like to find a stable, neutral steward of this specification
 going forward.  Help with this would be much appreciated.

 Scope
 -

 CommonHV is a simple interface for communication
 between hypervisors and their guests.

 CommonHV is intended to be very simple and to avoid interfering with
 existing paravirtual interfaces.  To that end, its scope is limited.
 CommonHV does only two types of things:

   * It provides a way to enumerate other paravirtual interfaces.
   * It provides a small, extensible set of paravirtual features that do not
 modify or replace standard system functionality.

 For example, CommonHV does not and will not define anything related to
 interrupt handling or virtual CPU management.

 For now, CommonHV is only applicable to the x86 platform.

 Discovery
 -

 A CommonHV hypervisor MUST set the hypervisor bit (bit 31 in CPUID.1H.0H.ECX)
 and provide the CPUID leaf 4F00H, containing:

   * CPUID.4F00H.0H.EAX = max_commonhv_leaf
   * CPUID.4F00H.0H.EBX = 0x6D6D6F43
   * CPUID.4F00H.0H.ECX = 0x56486E6F
   * CPUID.4F00H.0H.EDX = 0x66746e49

 EBX, ECX, and EDX form the string CommonHVIntf in little-endian ASCII.

While testing various nested combinations, XenServer has found that
modern Windows Server versions must have the hypervisor bit hidden from
them before they are happy running Hyper-V, despite the fact that they
will make use of the Viridian virtual extensions also provided.

As a result, while it is certainly advisable for the hypervisor bit to
be set, CommonHV should be available to be found by paravirtualised
drivers inside an OS which can't cope with the hypervisor bit set.
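
For illustration, a guest-side discovery sketch; note I am taking the
draft's leaf 4F00H to mean 0x4F000000 in the hypervisor CPUID range,
which is my assumption rather than something the draft states:

#include <stdint.h>
#include <stdbool.h>
#include <cpuid.h>

/* Look for the "CommonHVIntf" signature; returns max_commonhv_leaf. */
static bool commonhv_detect(uint32_t *max_leaf)
{
	uint32_t eax, ebx, ecx, edx;

	__cpuid_count(0x4F000000, 0, eax, ebx, ecx, edx);

	if (ebx != 0x6D6D6F43 ||	/* "Comm" */
	    ecx != 0x56486E6F ||	/* "onHV" */
	    edx != 0x66746E49)		/* "Intf" */
		return false;

	*max_leaf = eax;		/* max_commonhv_leaf */
	return true;
}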


 max_commonhv_leaf MUST be a number between 0x4F00 and 0x4FFF.  It
 indicates the largest leaf defined in this specification that is provided.
 Any leaves described in this specification with EAX values that exceed
 max_commonhv_leaf MUST be handled by guests as though they contain
 all zeros.

 CPUID leaf 4F01H: hypervisor interface enumeration
 --

 If max_commonhv_leaf >= 0x4F01, CommonHV provides a list of tuples
 (location, signature).  Each tuple indicates the presence of another
 paravirtual interface identified by the signature at the indicated
 CPUID location.  It is expected that CPUID.location.0H will have
 (EBX, ECX, EDX) == signature, although whether this is required
 is left to the specification associated with the given signature.

 If the list contains N tuples, then, for each 0 <= i < N:

   * CPUID.4F01H.i.EBX, CPUID.4F01H.i.ECX, and CPUID.4F01H.i.EDX
 are the signature.
   * CPUID.4F01H.i.EAX is the location.

 CPUID with EAX = 0x4F01 and ECX >= N MUST return all zeros.

 To the extent that the 

Re: [Xen-devel] [GIT PULL] x86/mm changes for v3.9-rc1

2013-02-22 Thread Andrew Cooper
On 22/02/13 17:53, Yinghai Lu wrote:
 On Fri, Feb 22, 2013 at 9:24 AM, Konrad Rzeszutek Wilk
 konrad.w...@oracle.com wrote:
 On Fri, Feb 22, 2013 at 11:55:31AM -0500, Konrad Rzeszutek Wilk wrote:
 On Thu, Feb 21, 2013 at 04:34:06PM -0800, H. Peter Anvin wrote:
 Hi Linus,

 This is a huge set of several partly interrelated (and concurrently
 developed) changes, which is why the branch history is messier than
 one would like.

 The *really* big items are two humonguous patchsets mostly developed
 by Yinghai Lu at my request, which completely revamps the way we
 create initial page tables.  In particular, rather than estimating how
 much memory we will need for page tables and then build them into that
 memory -- a calculation that has shown to be incredibly fragile -- we
 now build them (on 64 bits) with the aid of a pseudo-linear mode --
 a #PF handler which creates temporary page tables on demand.

 This has several advantages:

 1. It makes it much easier to support things that need access to
data very early (a followon patchset uses this to load microcode
way early in the kernel startup).

 2. It allows the kernel and all the kernel data objects to be invoked
from above the 4 GB limit.  This allows kdump to work on very large
systems.

 3. It greatly reduces the difference between Xen and native (Xen's
equivalent of the #PF handler are the temporary page tables created
by the domain builder), eliminating a bunch of fragile hooks.

 The patch series also gets us a bit closer to W^X.

 Additional work in this pull is the 64-bit get_user() work which you
 were also involved with, and a bunch of cleanups/speedups to
 __phys_addr()/__pa().
 Looking at figuring out which of the patches in the branch did this, but
 with this merge I am getting a crash with a very simple PV guest (booted 
 with
 one 1G):

 Call Trace:
   [8103feba] xen_get_user_pgd+0x5a  --
   [8103feba] xen_get_user_pgd+0x5a
   [81042d27] xen_write_cr3+0x77
   [81ad2d21] init_mem_mapping+0x1f9
   [81ac293f] setup_arch+0x742
   [81666d71] printk+0x48
   [81abcd62] start_kernel+0x90
   [8109416b] __add_preferred_console.clone.1+0x9b
   [81abc5f7] x86_64_start_reservations+0x2a
   [81abf0c7] xen_start_kernel+0x564

 And the hypervisor says:
 (XEN) d7:v0: unhandled page fault (ec=)
 (XEN) Pagetable walk from ea05b2d0:
 (XEN)  L4[0x1d4] =  
 (XEN) domain_crash_sync called from entry.S
 (XEN) Domain 7 (vcpu#0) crashed on cpu#3:
 (XEN) [ Xen-4.2.0  x86_64  debug=n  Not tainted ]
 (XEN) CPU:3
 (XEN) RIP:e033:[8103feba]
 (XEN) RFLAGS: 0206   EM: 1   CONTEXT: pv guest
 (XEN) rax: ea00   rbx: 01a0c000   rcx: 8000
 (XEN) rdx: 0005b2a0   rsi: 01a0c000   rdi: 
 (XEN) rbp: 81a01dd8   rsp: 81a01d90   r8:  
 (XEN) r9:  1001   r10:    r11: 
 (XEN) r12:    r13: 0010   r14: 
 (XEN) r15: 0010   cr0: 8005003b   cr4: 000406f0
 (XEN) cr3: 000411165000   cr2: ea05b2d0
 (XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
 (XEN) Guest stack trace from rsp=81a01d90:
 (XEN)8000   8103feba
 (XEN)0001e030 00010006 81a01dd8 e02b
 Here is a better serial log of the crash (just booting a normal Xen 4.1 + 
 initial
 kernel with 8GB):

 PXELINUX 3.82 2009-06-09  Copyright (C) 1994-2009 H. Peter Anvin et al
 boot:
 Loading xen.gz... ok
 Loading vmlinuz... ok
 Loading initramfs.cpio.gz... ok
  [ Xen ASCII-art version banner ]
 (XEN) Xen version 4.1.5-pre (kon...@dumpdata.com) (gcc version 4.4.4 
 20100503 (Red Hat 4.4.4-2) (GCC) ) Fri Feb 22 11:37:00 EST 2013
 (XEN) Latest ChangeSet: Fri Feb 15 15:31:55 2013 +0100 23459:9f12bdd6b7f0
 (XEN) Console output is synchronous.
 (XEN) Bootloader: unknown
 (XEN) Command line: cpuinfo conring_size=1048576 sync_console 
 cpufreq=verbose com1=115200,8n1 console=com1,vga loglvl=all guest_loglvl=all
 (XEN) Video information:
 (XEN)  VGA is text mode 80x25, font 8x16
 (XEN)  VBE/DDC methods: none; EDID transfer time: 0 seconds
 (XEN)  EDID info not retrieved because no DDC retrieval method detected
 (XEN) Disc information:
 (XEN)  Found 1 MBR signatures
 (XEN)  Found 1 EDD information structures
 (XEN) Xen-e820 RAM map:
 (XEN)   - 0009ec00 (usable)
 (XEN)  0009ec00 - 000a (reserved)
 (XEN)  000e - 0010 (reserved)
 (XEN)  0010 - 2000 (usable)
 (XEN)  2000 - 2020 

Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-07 Thread Andrew Cooper

On 07/01/13 10:25, Ian Campbell wrote:

On Fri, 2013-01-04 at 19:11 +, Konrad Rzeszutek Wilk wrote:

On Fri, Jan 04, 2013 at 06:07:51PM +0100, Daniel Kiper wrote:

Because the current KEXEC_CMD_kexec_load does not load the kernel
image and other data into Xen memory, they have to live somewhere
in dom0 Linux kernel memory.

We could have a very simple hypercall which would have:

struct fancy_new_hypercall {
    xen_pfn_t payload; // IN

This would have to be XEN_GUEST_HANDLE(something) since userspace cannot
figure out what pfns back its memory.  In any case, since the hypervisor
is going to want to copy the data into the crashkernel space, a virtual
address is convenient to have.


    ssize_t len; // IN
#define DATA (11)
#define DATA_EOF (12)
#define DATA_KERNEL (13)
#define DATA_RAMDISK (14)
    unsigned int flags; // IN
    unsigned int status; // OUT
};

which would iterate over the payloads in a loop and let the hypervisor
stick them in the crashkernel space.

This is all hand-waving of course.  There would probably be a need
to figure out how much space you have in Xen's reserved
'crashkernel' memory region too.
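
Purely to illustrate the flow being hand-waved at (nothing below exists
today; the struct layout is the one sketched above, and
do_kexec_data_hypercall() is an invented stand-in):

#include <stddef.h>
#include <sys/types.h>

typedef unsigned long xen_pfn_t;

/* The struct sketched above, with the interleaved discussion removed. */
struct fancy_new_hypercall {
    xen_pfn_t payload;      /* IN */
    ssize_t len;            /* IN */
    unsigned int flags;     /* IN */
    unsigned int status;    /* OUT */
};

#define DATA_EOF     (12)
#define DATA_KERNEL  (13)
#define DATA_RAMDISK (14)

/* Hypothetical hypercall wrapper -- does not exist anywhere today. */
static int do_kexec_data_hypercall(struct fancy_new_hypercall *op)
{
    (void)op;
    return 0;   /* stand-in so the sketch compiles */
}

static int push_payload(const void *buf, size_t len, unsigned int type)
{
    struct fancy_new_hypercall op = {
        .payload = (xen_pfn_t)buf,  /* or a XEN_GUEST_HANDLE, as Ian notes */
        .len     = (ssize_t)len,
        .flags   = type,
        .status  = 0,
    };

    if (do_kexec_data_hypercall(&op))
        return -1;
    return (int)op.status;
}

/* /sbin/kexec would then just loop over what it has built in memory:
 *   push_payload(kernel_buf, kernel_len, DATA_KERNEL);
 *   push_payload(initrd_buf, initrd_len, DATA_RAMDISK);
 *   push_payload(NULL, 0, DATA_EOF);
 */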

This is probably a mad idea but it's Monday morning and I'm sleep
deprived so I'll throw it out there...

What about adding DOMID_KEXEC (similar to DOMID_IO etc.)? This would allow
dom0 to map the kexec memory space with the usual privcmd mmap
hypercalls and build things in it directly.

OK, I suspect this might not be practical for a variety of reasons (lack
of a p2m for such domains, so no way to find out the list of mfns; dom0
userspace simply doesn't have sufficient context to write sensible
things here; etc.) but maybe someone has a better head on today...

Ian.



Given that /sbin/kexec creates a binary blob in memory, surely the
simplest thing is to get it to suitably mlock() the region and give a list
of VAs to the hypervisor.
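
A minimal sketch of that suggestion; only mlock() is real, while the op
layout and do_kexec_load_hypercall() are invented for illustration:

#include <stddef.h>
#include <sys/mman.h>

struct xen_kexec_load_blob {            /* hypothetical layout */
    unsigned long va;                   /* start of the locked blob */
    unsigned long len;
};

/* Hypothetical hypercall wrapper -- a stand-in so the sketch compiles. */
static int do_kexec_load_hypercall(struct xen_kexec_load_blob *op)
{
    (void)op;
    return 0;
}

static int load_crash_blob(void *blob, size_t len)
{
    struct xen_kexec_load_blob op = {
        .va  = (unsigned long)blob,
        .len = len,
    };

    /* Pin the region so the VAs stay backed while Xen copies it out. */
    if (mlock(blob, len))
        return -1;

    return do_kexec_load_hypercall(&op);
}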


This way, Xen can properly take care of what it does with the information 
and where it puts it.  For example, at the moment, allowing dom0 to choose 
what gets overwritten in the Xen crash area is a recipe for disaster if a 
crash occurs midway through loading/reloading the crash kernel.


~Andrew



Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-02 Thread Andrew Cooper

On 27/12/12 18:02, Eric W. Biederman wrote:

Andrew Cooper andrew.coop...@citrix.com  writes:


On 27/12/2012 07:53, Eric W. Biederman wrote:

The syscall ABI still has the wrong semantics.

Aka totally unmaintainable and unmergeable.

The concept of domU support is also strange.  What does domU support even mean, 
when the dom0 support is loading a kernel to pick up Xen when Xen falls over?

There are two requirements pulling at this patch series, but I agree
that we need to clarify them.

It probably makes sense to split them apart a little even.




Thinking about this split, there might be a way to simplify it even more.

/sbin/kexec can load the Xen crash kernel itself by issuing hypercalls 
using /dev/xen/privcmd.  This would remove the need for the dom0 kernel 
to distinguish between loading a crash kernel for itself and loading a 
kernel for Xen.


Or is this just a silly idea complicating the matter?

~Andrew


Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-26 Thread Andrew Cooper
On 22/11/12 17:47, H. Peter Anvin wrote:
 The other thing that should be considered here is how utterly 
 preposterous the notion of doing in-guest crash dumping is in a system 
 that contains a hypervisor.  The reason for kdump is that on bare metal 
 there are no other options, but in a hypervisor system the right thing 
 should be for the hypervisor to do the dump (possibly spawning a clean 
 I/O domain if the I/O domain is necessary to access the media.)

 There is absolutely no reason to have a crashkernel sitting around in 
 each guest, consuming memory, and possibly getting corrupted.

   -hpa


I agree that regular guests should not be using kexec/kdump. 
However, this patch series is required to allow a pvops kernel to be
a crash kernel for Xen, which is very important from dom0/Xen's point of
view.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com



Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-26 Thread Andrew Cooper
On 22/11/2012 17:47, H. Peter Anvin wrote:
 The other thing that should be considered here is how utterly 
 preposterous the notion of doing in-guest crash dumping is in a system 
 that contains a hypervisor.  The reason for kdump is that on bare metal 
 there are no other options, but in a hypervisor system the right thing 
 should be for the hypervisor to do the dump (possibly spawning a clean 
 I/O domain if the I/O domain is necessary to access the media.)

 There is absolutely no reason to have a crashkernel sitting around in 
 each guest, consuming memory, and possibly getting corrupted.

   -hpa


(Your reply to my email, which I can see on the xen-devel archive, appears
to have gotten lost somewhere inside the Citrix email system, so
apologies for replying out of order.)

The kdump kernel loaded by dom0 is for when Xen crashes, not for when
dom0 crashes (although a dom0 crash does admittedly lead to a Xen crash).

There is no possible way it could be a separate domain; Xen completely
ceases to function as soon as it jumps to the entry point of the kdump image.

~Andrew