Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-20 Thread Ben Catterall



On 10/08/15 11:14, Andrew Cooper wrote:

On 10/08/15 10:49, Tim Deegan wrote:

Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer.
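
As a rough user-space illustration of the setjmp/longjmp analogy only, the control flow might be sketched as below. The names priv_ctx, run_deprivileged and enter_depriv_once are hypothetical and do not come from the patch series, and no privilege change or stack copy takes place here; the sketch only shows "save a context, run something, resume at the save point".

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-in for the saved privileged context. */
static jmp_buf priv_ctx;
static int depriv_runs;

/* Stand-in for the deprivileged code: it does its work, then resumes
 * the saved context instead of returning normally. */
static void run_deprivileged(void)
{
    depriv_runs++;
    longjmp(priv_ctx, 1);
}

/* Save the context, run the stand-in once, and resume at the save
 * point. Returns how many times the stand-in ran. */
int enter_depriv_once(void)
{
    depriv_runs = 0;
    if (setjmp(priv_ctx) == 0) {
        run_deprivileged();
        assert(!"not reached: run_deprivileged() never returns normally");
    }
    return depriv_runs;
}
```

Calling enter_depriv_once() runs the stand-in exactly once and comes back through the longjmp path.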

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

  - one to hold the depriv execution context;
  - one to hold the privileged execution context; and
  - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?


That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.  Conceptually, taking interrupts from
depriv mode is no different to taking them in a PV guest.

Some complications which come to mind (none insurmountable):

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.

From what I understand from entry.S's assembly:
handle_ist_exception is used for machine_check and nmi ISTs and these 
perform this copy. The double fault handler does not do this copy.

 - we take the IST on a different stack page
 - the handler copies the guest's registers from its current page to
   the bottom of the privileged stack so access routines for this still
   work as usual
 - moves its rsp to just after this structure in the privileged stack
 - calls do_nmi
 - does a ret_from_intr with the stack ptr on the privileged stack

Now, I _think_ it's sufficient to perform this copy and then just keep 
the rsp on the IST stack page (rather than moving it across as is 
currently done) so that we don't clobber the privileged stack.


Then, on the return path, move our rsp back to the privileged stack, 
just after the guest registers so that ret_from_intr can use the copied 
(and possibly modified) guest's registers.


Does that sound reasonable?

Thanks in advance!
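
The proposal above (copy the guest registers across, keep rsp on the IST page while the handler runs, then move rsp back on the return path) can be modelled roughly as follows. This is a toy model in plain C under stated assumptions: struct mock_regs, struct mock_cpu and the mock_* functions are all hypothetical, the register frame is cut down to three fields, and the real logic lives in entry.S assembly, not in C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced register frame. */
struct mock_regs {
    uint64_t rip, rsp, rax;
};

struct mock_cpu {
    struct mock_regs ist_frame;  /* frame as saved on the IST stack page */
    struct mock_regs priv_copy;  /* copy placed on the privileged stack */
    int rsp_on_ist_page;         /* 1 while the handler runs on the IST page */
};

/* Entry: copy the guest registers across, but leave rsp on the IST page
 * so the rest of the privileged stack is not clobbered. */
void mock_ist_entry(struct mock_cpu *c)
{
    memcpy(&c->priv_copy, &c->ist_frame, sizeof(c->priv_copy));
    c->rsp_on_ist_page = 1;
}

/* Stand-in for do_nmi(): works on the copied registers and may modify them. */
void mock_do_nmi(struct mock_cpu *c)
{
    assert(c->rsp_on_ist_page);
    c->priv_copy.rax += 1;
}

/* Return path: move back to the privileged stack so the (possibly
 * modified) copy is what the return-from-interrupt path consumes. */
const struct mock_regs *mock_ist_exit(struct mock_cpu *c)
{
    c->rsp_on_ist_page = 0;
    return &c->priv_copy;
}
```

The point of the model is that the handler's modifications land in the copy on the privileged stack, while the handler's own frames stay on the IST page.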

* AMD Task Register on vmexit.  (this old gem)

~Andrew



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-19 Thread Ben Catterall



On 18/08/15 17:55, Andrew Cooper wrote:



On 17/08/15 08:07, Tim Deegan wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

On 12/08/15 14:33, Andrew Cooper wrote:

On 12/08/15 14:29, Andrew Cooper wrote:

On 11/08/15 19:29, Boris Ostrovsky wrote:

Would switching TR only when we know that we need to enter this
deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.


Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.

Ahem! s/which/switch/

~Andrew


So, have we arrived at a decision for this? Thanks!


Apologies for the delay - I am currently at the Xen Developer Summit.


No worries! Hope you're enjoying the summit. :)

Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a
different SP;


Xen re-enables interrupts in most interrupt handlers, which means that
they must not have an IST set.  If an IST was set, a second interrupt
would clobber the frame of the first.

However, just adjusting tss->rsp0 and syscall top-of-stack to the
current rsp when entering depriv mode should be sufficient, and will
avoid needing to copy the stack.

Got it, thanks!



  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];
  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.


They will need adjusting, but just disabling the copy entirely should be
ok.

ok.




Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Agreed.  It is orthogonal to this problem.

~Andrew




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Andrew Cooper



On 17/08/15 08:07, Tim Deegan wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

On 12/08/15 14:33, Andrew Cooper wrote:

On 12/08/15 14:29, Andrew Cooper wrote:

On 11/08/15 19:29, Boris Ostrovsky wrote:

Would switching TR only when we know that we need to enter this
deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.


Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.

Ahem! s/which/switch/

~Andrew


So, have we arrived at a decision for this? Thanks!


Apologies for the delay - I am currently at the Xen Developer Summit.


Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a different SP;


Xen re-enables interrupts in most interrupt handlers, which means that 
they must not have an IST set.  If an IST was set, a second interrupt 
would clobber the frame of the first.
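
A toy model of why a nested interrupt clobbers the first frame when an IST is set: with an IST, delivery always starts from the same fixed top-of-stack, whereas without one the second frame is pushed below the first. Everything in this sketch (struct mock_stack, deliver, first_frame_after_nesting, the vector numbers) is hypothetical and only models the stack-pointer behaviour, not real interrupt delivery.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 16

/* The stack grows downwards; sp is the index of the lowest used slot. */
struct mock_stack {
    int words[STACK_WORDS];
    size_t sp;
};

/* Deliver one "interrupt": push a one-word frame holding the vector.
 * With use_ist set, delivery first resets sp to the fixed top. */
static void deliver(struct mock_stack *s, int vector, int use_ist)
{
    if (use_ist)
        s->sp = STACK_WORDS;
    assert(s->sp > 0);
    s->words[--s->sp] = vector;
}

/* Deliver vector 0x30, then a nested 0x31 before the first handler has
 * finished, and report what now sits in the first frame's slot. */
int first_frame_after_nesting(int use_ist)
{
    struct mock_stack s = { .sp = STACK_WORDS };

    deliver(&s, 0x30, use_ist);
    deliver(&s, 0x31, use_ist);
    return s.words[STACK_WORDS - 1];
}
```

In this model the first frame survives only when use_ist is 0, which is the reason handlers that re-enable interrupts cannot use an IST.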


However, just adjusting tss->rsp0 and syscall top-of-stack to the 
current rsp when entering depriv mode should be sufficient, and will 
avoid needing to copy the stack.
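
A minimal sketch of the rsp0 adjustment, assuming a mock per-CPU structure rather than Xen's real TSS and syscall setup. The struct mock_cpu_state type, its fields and the mock_depriv_* functions are hypothetical names; the sketch only shows the save / point-at-current-rsp / restore bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

struct mock_cpu_state {
    uint64_t tss_rsp0;           /* stands in for tss->rsp0 */
    uint64_t syscall_top;        /* stands in for the syscall top-of-stack */
    uint64_t saved_rsp0;         /* values to restore on leaving depriv mode */
    uint64_t saved_syscall_top;
};

/* Entering depriv mode: make ring-0 entries start at the current rsp,
 * i.e. below the live privileged frames, so nothing needs copying. */
void mock_depriv_enter(struct mock_cpu_state *c, uint64_t current_rsp)
{
    c->saved_rsp0 = c->tss_rsp0;
    c->saved_syscall_top = c->syscall_top;
    c->tss_rsp0 = current_rsp;
    c->syscall_top = current_rsp;
}

/* Leaving depriv mode: put the original entry points back. */
void mock_depriv_exit(struct mock_cpu_state *c)
{
    c->tss_rsp0 = c->saved_rsp0;
    c->syscall_top = c->saved_syscall_top;
}
```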



  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];
  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.


They will need adjusting, but just disabling the copy entirely should be ok.



Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Agreed.  It is orthogonal to this problem.

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Jan Beulich
>>> On 18.08.15 at 12:26,  wrote:
> On 18/08/15 11:25, Ben Catterall wrote:
>> On 17/08/15 16:17, Jan Beulich wrote:
>>>>>> On 17.08.15 at 17:07,  wrote:
>>>> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>>>>> So, have we arrived at a decision for this? Thanks!
>>>>
>>>> Seems to have stalled a bit.  OK, I propose that:
>>>>   - we use TR/IST to make Xen take interrupts/exceptions at a
>>>>     different SP;
>>>>   - we make that SP be an extension of the main stack, so that things
>>>>     like current() Just Work[tm];
>>  From Xen's cpu stack layout, page 4 is currently unused so I'll put it
>> here. Is this acceptable?
> Or, would it be better to put it at position 5, and move the optional 
> MEMORY_GUARD page down to position 4?

At this stage either would do. Once (if) the experiment yielded
positive results, we can still evaluate the pros and cons of either
placement.

Jan




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Ben Catterall



On 18/08/15 11:25, Ben Catterall wrote:



On 17/08/15 16:17, Jan Beulich wrote:

On 17.08.15 at 17:07,  wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

So, have we arrived at a decision for this? Thanks!


Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a
different SP;
  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];

 From Xen's cpu stack layout, page 4 is currently unused so I'll put it
here. Is this acceptable?
Or, would it be better to put it at position 5, and move the optional 
MEMORY_GUARD page down to position 4?

  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.

Any other proposals?


No.





I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Just like for the other one - at this point I think anything that works
should be okay. Dealing with quirks can be deferred (but it would
be nice if a respective note was added in a prominent place so it
doesn't get forgotten once/if these patches leave RFC state).

Jan


Ok, thanks all!




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Ben Catterall



On 17/08/15 16:17, Jan Beulich wrote:

On 17.08.15 at 17:07,  wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

So, have we arrived at a decision for this? Thanks!


Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];
From Xen's cpu stack layout, page 4 is currently unused so I'll put it 
here. Is this acceptable?

  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.

Any other proposals?


No.





I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Just like for the other one - at this point I think anything that works
should be okay. Dealing with quirks can be deferred (but it would
be nice if a respective note was added in a prominent place so it
doesn't get forgotten once/if these patches leave RFC state).

Jan


Ok, thanks all!



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-17 Thread Jan Beulich
>>> On 17.08.15 at 17:07,  wrote:
> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>> So, have we arrived at a decision for this? Thanks!
> 
> Seems to have stalled a bit.  OK, I propose that:
>  - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
>  - we make that SP be an extension of the main stack, so that things
>    like current() Just Work[tm];
>  - we set this up and tear it down when we enter/leave depriv mode.
>  - someone ought to look at the case where IST handlers copy
>    themselves to the main stack, and see if we need to adjust that too.
> 
> Any other proposals?

No.

> I think we can leave the question of TR switching on VMEXIT as a
> separate issue.

Just like for the other one - at this point I think anything that works
should be okay. Dealing with quirks can be deferred (but it would
be nice if a respective note was added in a prominent place so it
doesn't get forgotten once/if these patches leave RFC state).

Jan




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-17 Thread Tim Deegan
At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
> On 12/08/15 14:33, Andrew Cooper wrote:
> > On 12/08/15 14:29, Andrew Cooper wrote:
> >> On 11/08/15 19:29, Boris Ostrovsky wrote:
> >>> Would switching TR only when we know that we need to enter this
> >>> deprivileged mode help?
> >> This is an absolute must.  It is not safe to use syscall/sysexit without
> >> IST in place for NMIs and MCEs.
> >>
> >>> Assuming that it is less expensive than copying the stack.
> >> I was referring to the stack overflow issue, and whether it might be
> >> sensible to pro-actively which TR.
> >
> > Ahem! s/which/switch/
> >
> > ~Andrew
> >
> 
> So, have we arrived at a decision for this? Thanks!

Seems to have stalled a bit.  OK, I propose that:
 - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
 - we make that SP be an extension of the main stack, so that things
   like current() Just Work[tm];
 - we set this up and tear it down when we enter/leave depriv mode.
 - someone ought to look at the case where IST handlers copy
   themselves to the main stack, and see if we need to adjust that too.

Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a
separate issue.

Cheers,

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-17 Thread Ben Catterall



On 12/08/15 14:33, Andrew Cooper wrote:

On 12/08/15 14:29, Andrew Cooper wrote:

On 11/08/15 19:29, Boris Ostrovsky wrote:

On 08/11/2015 01:19 PM, Andrew Cooper wrote:

On 11/08/15 18:05, Tim Deegan wrote:

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)

Gah, this thing. :

Curious (and I can't seem to find this in the manuals): What is this
thing?

IIRC: AMD processors don't context switch TR on vmexit,

Correct


which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task
register before stgi, or IST exceptions (NMI, MCE and double fault) will
be taken with guest-supplied stack pointers.


We'd have to do the TR context switch ourselves, and that would be
expensive.

It is suspected to be expensive, but I have never actually seen any
numbers one way or another.


Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had
the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM
vcpus, which interacts catastrophically with debug builds using
MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
detect stack overflows, but without an IST double fault handler, ends in
a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect
this is probably the way to go, but have not had time to investigate
whether there is any performance impact from doing so.  Given how little
of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
as expensive as it might have been in legacy modes.

(CC'ing the AMD SVM maintainers to see if they have any information on
this subject)


I actually didn't even realize that TR is not saved on vmexit ;-/.

Would switching TR only when we know that we need to enter this
deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.


Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.


Ahem! s/which/switch/

~Andrew



So, have we arrived at a decision for this? Thanks!

Ben



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Andrew Cooper
On 12/08/15 14:29, Andrew Cooper wrote:
> On 11/08/15 19:29, Boris Ostrovsky wrote:
>> On 08/11/2015 01:19 PM, Andrew Cooper wrote:
>>> On 11/08/15 18:05, Tim Deegan wrote:
>>>>>>> * Under this model, PV exception handlers should copy themselves
>>>>>>> onto the privileged execution stack.
>>>>>>> * Currently, the IST handlers copy themselves onto the primary
>>>>>>> stack if they interrupt guest context.
>>>>>>> * AMD Task Register on vmexit.  (this old gem)
>>>>>> Gah, this thing. :
>>>>> Curious (and I can't seem to find this in the manuals): What is this
>>>>> thing?
>>>> IIRC: AMD processors don't context switch TR on vmexit,
>>> Correct
>>>
>>>> which makes using IST handlers tricky there.
>>> (That is one way of putting it)
>>>
>>> IST handlers cannot be used by Xen if Xen does not switch the task
>>> register before stgi, or IST exceptions (NMI, MCE and double fault) will
>>> be taken with guest-supplied stack pointers.
>>>
>>>> We'd have to do the TR context switch ourselves, and that would be
>>>> expensive.
>>> It is suspected to be expensive, but I have never actually seen any
>>> numbers one way or another.
>>>
>>>> Andrew, am I remembering that right?
>>> Looks about right.
>>>
>>> I have been meaning to investigate this for a while, but never had
>>> the time.
>>>
>>> Xen opts for disabling interrupt stack tables in the context of AMD HVM
>>> vcpus, which interacts catastrophically with debug builds using
>>> MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
>>> detect stack overflows, but without an IST double fault handler, ends in
>>> a triple fault rather than a host crash detailing the stack overflow.
>>>
>>> KVM unilaterally reloads the host task register on vmexit, and I suspect
>>> this is probably the way to go, but have not had time to investigate
>>> whether there is any performance impact from doing so.  Given how little
>>> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
>>> as expensive as it might have been in legacy modes.
>>>
>>> (CC'ing the AMD SVM maintainers to see if they have any information on
>>> this subject)
>>>
>> I actually didn't even realize that TR is not saved on vmexit ;-/.
>>
>> Would switching TR only when we know that we need to enter this
>> deprivileged mode help?
> This is an absolute must.  It is not safe to use syscall/sysexit without
> IST in place for NMIs and MCEs.
>
>> Assuming that it is less expensive than copying the stack.
> I was referring to the stack overflow issue, and whether it might be
> sensible to pro-actively which TR.

Ahem! s/which/switch/

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Andrew Cooper
On 11/08/15 19:29, Boris Ostrovsky wrote:
> On 08/11/2015 01:19 PM, Andrew Cooper wrote:
>> On 11/08/15 18:05, Tim Deegan wrote:
>>>>>> * Under this model, PV exception handlers should copy themselves
>>>>>> onto the privileged execution stack.
>>>>>> * Currently, the IST handlers copy themselves onto the primary
>>>>>> stack if they interrupt guest context.
>>>>>> * AMD Task Register on vmexit.  (this old gem)
>>>>> Gah, this thing. :
>>>> Curious (and I can't seem to find this in the manuals): What is this
>>>> thing?
>>> IIRC: AMD processors don't context switch TR on vmexit,
>> Correct
>>
>>> which makes using IST handlers tricky there.
>> (That is one way of putting it)
>>
>> IST handlers cannot be used by Xen if Xen does not switch the task
>> register before stgi, or IST exceptions (NMI, MCE and double fault) will
>> be taken with guest-supplied stack pointers.
>>
>>> We'd have to do the TR context switch ourselves, and that would be
>>> expensive.
>> It is suspected to be expensive, but I have never actually seen any
>> numbers one way or another.
>>
>>> Andrew, am I remembering that right?
>> Looks about right.
>>
>> I have been meaning to investigate this for a while, but never had
>> the time.
>>
>> Xen opts for disabling interrupt stack tables in the context of AMD HVM
>> vcpus, which interacts catastrophically with debug builds using
>> MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
>> detect stack overflows, but without an IST double fault handler, ends in
>> a triple fault rather than a host crash detailing the stack overflow.
>>
>> KVM unilaterally reloads the host task register on vmexit, and I suspect
>> this is probably the way to go, but have not had time to investigate
>> whether there is any performance impact from doing so.  Given how little
>> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
>> as expensive as it might have been in legacy modes.
>>
>> (CC'ing the AMD SVM maintainers to see if they have any information on
>> this subject)
>>
>
> I actually didn't even realize that TR is not saved on vmexit ;-/.
>
> Would switching TR only when we know that we need to enter this
> deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.

> Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Tim Deegan
At 14:22 +0100 on 12 Aug (1439389325), Ben Catterall wrote:
> On 11/08/15 18:05, Tim Deegan wrote:
> > BTW, I think there need to be three stacks anyway, since the depriv
> > code shouldn't be allowed to write to the priv code's stack frames.
> > Or maybe I've misunderstood how much access the depriv code will have.
> So, just to clarify:
> 
> We have a separate deprivileged stack allocated which the deprivileged 
> code uses. This is mapped in user mode.
> 
> We have the privileged stack which Xen runs on. To prevent this being 
> clobbered when we are in our mode and take an interrupt, we copy this 
> out to a buffer. This buffer is the saved privileged stack state.
> 
> So, we sort of have three stacks already, just the privileged stack is 
> copied out to a buffer, rather than switching pointers to another 
> interrupt stack.
> 
> Hopefully that clarifies?

Yes, thanks -- the buffer is what I was thinking of as a third stack.

Cheers,

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Ben Catterall



On 11/08/15 18:05, Tim Deegan wrote:

Hi,

At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote:

On 11/08/15 10:55, Tim Deegan wrote:

At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:

On 10/08/15 10:49, Tim Deegan wrote:

Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

   - one to hold the depriv execution context;
   - one to hold the privileged execution context; and
   - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?


That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.


Right, and this is why we need a third stack - so interrupts don't
trash the existing priv state on the 'normal' Xen stack.  And so we
either need to copy the priv stack out (and maybe copy it back), or
tell the CPU to use a different stack.


The copy is relatively small and paid only on the first and last entries
into the mode. I don't know if this is cheaper than the  bookwork that
would be needed on entering and returning from the mode to switch to
these stacks. I'm assuming the sp pointers in the TSS and ISTs would
need changing on the first and last entry/exit if we have the extra
stack, is that correct?


Yep.


Or, is this a more dramatic change in that
everything uses this three stack model rather than just this feature.


Well, some other parts would have to change to accommodate this new
behaviour - that was what Andrew was talking about.

BTW, I think there need to be three stacks anyway, since the depriv
code shouldn't be allowed to write to the priv code's stack frames.
Or maybe I've misunderstood how much access the depriv code will have.

So, just to clarify:

We have a separate deprivileged stack allocated which the deprivileged 
code uses. This is mapped in user mode.


We have the privileged stack which Xen runs on. To prevent this being 
clobbered when we are in our mode and take an interrupt, we copy this 
out to a buffer. This buffer is the saved privileged stack state.


So, we sort of have three stacks already, just the privileged stack is 
copied out to a buffer, rather than switching pointers to another 
interrupt stack.


Hopefully that clarifies?
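
To illustrate the "copy out to a buffer" scheme described above, here is a small user-space model. It assumes a downward-growing stack held in a byte array, where the live region runs from the current stack pointer up to the end of the array; struct mock_priv_stack, struct mock_saved and both functions are hypothetical names, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_STACK_BYTES 64

/* Live data occupies mem[sp .. MOCK_STACK_BYTES). */
struct mock_priv_stack {
    unsigned char mem[MOCK_STACK_BYTES];
    size_t sp;
};

/* The buffer holding the saved privileged stack state. */
struct mock_saved {
    unsigned char buf[MOCK_STACK_BYTES];
    size_t sp;
};

/* On entry to depriv mode: copy the live region out to the buffer. */
void save_priv_stack(const struct mock_priv_stack *s, struct mock_saved *out)
{
    assert(s->sp <= MOCK_STACK_BYTES);
    out->sp = s->sp;
    memcpy(out->buf, s->mem + s->sp, MOCK_STACK_BYTES - s->sp);
}

/* On leaving depriv mode: copy the live region back and restore sp. */
void restore_priv_stack(struct mock_priv_stack *s, const struct mock_saved *in)
{
    assert(in->sp <= MOCK_STACK_BYTES);
    s->sp = in->sp;
    memcpy(s->mem + in->sp, in->buf, MOCK_STACK_BYTES - in->sp);
}
```

Only the bytes between the saved stack pointer and the top of the stack are copied, so the cost scales with how deep the privileged stack is at the point of entry.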




I'm not sure how much in Xen would need changing to switch across to
using three stacks. Also, would this also need to be done for PV guests?
Would that need to be a separate patch series?

What's the overall consensus? Thanks!


I'm not sure there is one yet -- needs some more discussion of
whether the non-copying approach is feasible.


If we had enough headroom, we could try to be clever and tell the CPU
to take interrupts on the priv stack _below_ the existing state.  That
would avoid the first of your problems below.


* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)


Gah, this thing. :

Curious (and I can't seem to find this in the manuals): What is this thing?


IIRC: AMD processors don't context switch TR on vmexit, which makes
using IST handlers tricky there.  We'd have to do the TR context
switch ourselves, and that would be expensive.  Andrew, am I
remembering that right?


Thanks!

Tim.





Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Jan Beulich
>>> On 11.08.15 at 19:19,  wrote:
> KVM unilaterally reloads the host task register on vmexit, and I suspect
> this is probably the way to go, but have not had time to investigate
> whether there is any performance impact from doing so.  Given how little
> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
> as expensive as it might have been in legacy modes.

How much of the TSS is being used shouldn't matter at all for
LTR execution time, since the instruction doesn't access the
TSS itself (and iirc there's also no caching documented for TSS
uses).

Jan




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Boris Ostrovsky

On 08/11/2015 01:19 PM, Andrew Cooper wrote:

On 11/08/15 18:05, Tim Deegan wrote:

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)

Gah, this thing. :

Curious (and I can't seem to find this in the manuals): What is this thing?

IIRC: AMD processors don't context switch TR on vmexit,

Correct


which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task
register before stgi, or IST exceptions (NMI, MCE and double fault) will
be taken with guest-supplied stack pointers.


We'd have to do the TR context switch ourselves, and that would be expensive.

It is suspected to be expensive, but I have never actually seen any
numbers one way or another.


Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM
vcpus, which interacts catastrophically with debug builds using
MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
detect stack overflows, but without an IST double fault handler, ends in
a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect
this is probably the way to go, but have not had time to investigate
whether there is any performance impact from doing so.  Given how little
of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
as expensive as it might have been in legacy modes.

(CC'ing the AMD SVM maintainers to see if they have any information on
this subject)



I actually didn't even realize that TR is not saved on vmexit ;-/.

Would switching TR only when we know that we need to enter this 
deprivileged mode help? Assuming that it is less expensive than copying 
the stack.


-boris



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Andrew Cooper
On 11/08/15 18:05, Tim Deegan wrote:
>
>>>> * Under this model, PV exception handlers should copy themselves onto
>>>> the privileged execution stack.
>>>> * Currently, the IST handlers copy themselves onto the primary stack if
>>>> they interrupt guest context.
>>>> * AMD Task Register on vmexit.  (this old gem)
>>> Gah, this thing. :
>> Curious (and I can't seem to find this in the manuals): What is this thing?
> IIRC: AMD processors don't context switch TR on vmexit,

Correct

> which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task
register before stgi, or IST exceptions (NMI, MCE and double fault) will
be taken with guest-supplied stack pointers.
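
A deliberately simplified model of the hazard, in plain C: the task register decides which TSS supplies the IST stack pointers, so if it still refers to guest state when interrupts are re-enabled, an NMI lands on a guest-chosen stack. The types and mock_* functions are hypothetical; real code would involve `ltr` (or equivalent host-state reloading) and `stgi`, neither of which can be run from user space.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced TSS: only the stack pointer used for the NMI IST entry. */
struct mock_tss {
    uint64_t ist_nmi_rsp;
};

struct mock_cpu {
    const struct mock_tss *tr;  /* what the task register refers to */
};

/* Model of a vmexit where TR is not context switched: it keeps
 * referring to the guest's TSS. */
void mock_vmexit(struct mock_cpu *c, const struct mock_tss *guest_tss)
{
    c->tr = guest_tss;
}

/* Stand-in for the host reloading its own task register. */
void mock_reload_host_tr(struct mock_cpu *c, const struct mock_tss *host_tss)
{
    c->tr = host_tss;
}

/* Stack pointer an NMI would be taken on once interrupts are enabled. */
uint64_t mock_nmi_rsp(const struct mock_cpu *c)
{
    return c->tr->ist_nmi_rsp;
}
```

In the model, mock_nmi_rsp() yields the guest-supplied value unless mock_reload_host_tr() has run first, which mirrors the "switch the task register before stgi" requirement.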

> We'd have to do the TR context switch ourselves, and that would be expensive.

It is suspected to be expensive, but I have never actually seen any
numbers one way or another.

> Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM
vcpus, which interacts catastrophically with debug builds using
MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
detect stack overflows, but without an IST double fault handler, ends in
a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect
this is probably the way to go, but have not had time to investigate
whether there is any performance impact from doing so.  Given how little
of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
as expensive as it might have been in legacy modes.

(CC'ing the AMD SVM maintainers to see if they have any information on
this subject)

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Tim Deegan
Hi,

At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote:
> On 11/08/15 10:55, Tim Deegan wrote:
> > At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
> >> On 10/08/15 10:49, Tim Deegan wrote:
> >>> Hi,
> >>>
> >>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>  The process to switch into and out of deprivileged mode can be likened to
>  setjmp/longjmp.
> 
>  To enter deprivileged mode, we take a copy of the stack from the guest's
>  registers up to the current stack pointer.
> >>> This copy is pretty unfortunate, but I can see that avoiding it will
> >>> be a bit complex.  Could we do something with more stacks?  AFAICS
> >>> there have to be three stacks anyway:
> >>>
> >>>   - one to hold the depriv execution context;
> >>>   - one to hold the privileged execution context; and
> >>>   - one to take interrupts on.
> >>>
> >>> So maybe we could do some fiddling to make Xen take interrupts on a
> >>> different stack while we're depriv'd?
> >>
> >> That should happen naturally by virtue of the privilege level change
> >> involved in taking the interrupt.
> >
> > Right, and this is why we need a third stack - so interrupts don't
> > trash the existing priv state on the 'normal' Xen stack.  And so we
> > either need to copy the priv stack out (and maybe copy it back), or
> > tell the CPU to use a different stack.
> 
> The copy is relatively small and paid only on the first and last entries 
> into the mode. I don't know if this is cheaper than the  bookwork that 
> would be needed on entering and returning from the mode to switch to 
> these stacks. I'm assuming the sp pointers in the TSS and ISTs would 
> need changing on the first and last entry/exit if we have the extra 
> stack, is that correct?

Yep.

> Or, is this a more dramatic change in that 
> everything uses this three stack model rather than just this feature.

Well, some other parts would have to change to accommodate this new
behaviour - that was what Andrew was talking about.

BTW, I think there need to be three stacks anyway, since the depriv
code shouldn't be allowed to write to the priv code's stack frames.
Or maybe I've misunderstood how much access the depriv code will have.

> I'm not sure how much in Xen would need changing to switch across to 
> using three stacks. Also, would this also need to be done for PV guests? 
> Would that need to be a separate patch series?
> 
> What's the overall consensus? Thanks!

I'm not sure there is one yet -- needs some more discussion of
whether the non-copying approach is feasible.

> > If we had enough headroom, we could try to be clever and tell the CPU
> > to take interrupts on the priv stack _below_ the existing state.  That
> > would avoid the first of your problems below.
> >
> >> * Under this model, PV exception handlers should copy themselves onto
> >> the privileged execution stack.
> >> * Currently, the IST handlers  copy themselves onto the primary stack if
> >> they interrupt guest context.
> >> * AMD Task Register on vmexit.  (this old gem)
> >
> > Gah, this thing. :(
> Curious (and I can't seem to find this in the manuals): What is this thing?

IIRC: AMD processors don't context switch TR on vmexit, which makes
using IST handlers tricky there.  We'd have to do the TR context
switch ourselves, and that would be expensive.  Andrew, am I
remembering that right?

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Ben Catterall



On 11/08/15 10:55, Tim Deegan wrote:
> At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
>> On 10/08/15 10:49, Tim Deegan wrote:
>>> Hi,
>>>
>>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>>>> The process to switch into and out of deprivileged mode can be likened to
>>>> setjmp/longjmp.
>>>>
>>>> To enter deprivileged mode, we take a copy of the stack from the guest's
>>>> registers up to the current stack pointer.
>>> This copy is pretty unfortunate, but I can see that avoiding it will
>>> be a bit complex.  Could we do something with more stacks?  AFAICS
>>> there have to be three stacks anyway:
>>>
>>>   - one to hold the depriv execution context;
>>>   - one to hold the privileged execution context; and
>>>   - one to take interrupts on.
>>>
>>> So maybe we could do some fiddling to make Xen take interrupts on a
>>> different stack while we're depriv'd?
>>
>> That should happen naturally by virtue of the privilege level change
>> involved in taking the interrupt.
>
> Right, and this is why we need a third stack - so interrupts don't
> trash the existing priv state on the 'normal' Xen stack.  And so we
> either need to copy the priv stack out (and maybe copy it back), or
> tell the CPU to use a different stack.

The copy is relatively small and paid only on the first and last entries
into the mode. I don't know if this is cheaper than the bookwork that
would be needed on entering and returning from the mode to switch to
these stacks. I'm assuming the sp pointers in the TSS and ISTs would
need changing on the first and last entry/exit if we have the extra
stack, is that correct? Or, is this a more dramatic change in that
everything uses this three stack model rather than just this feature?

I'm not sure how much in Xen would need changing to switch across to
using three stacks. Also, would this also need to be done for PV guests?
Would that need to be a separate patch series?

What's the overall consensus? Thanks!

> If we had enough headroom, we could try to be clever and tell the CPU
> to take interrupts on the priv stack _below_ the existing state.  That
> would avoid the first of your problems below.
>
>> * Under this model, PV exception handlers should copy themselves onto
>> the privileged execution stack.
>> * Currently, the IST handlers  copy themselves onto the primary stack if
>> they interrupt guest context.
>> * AMD Task Register on vmexit.  (this old gem)
>
> Gah, this thing. :(

Curious (and I can't seem to find this in the manuals): What is this thing?

> Tim.





Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Ben Catterall



On 10/08/15 10:49, Tim Deegan wrote:
> Hi,
>
> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer.
>
> This copy is pretty unfortunate, but I can see that avoiding it will
> be a bit complex.  Could we do something with more stacks?  AFAICS
> there have to be three stacks anyway:
>
>   - one to hold the depriv execution context;
>   - one to hold the privileged execution context; and
>   - one to take interrupts on.
>
> So maybe we could do some fiddling to make Xen take interrupts on a
> different stack while we're depriv'd?
>
> If we do have to copy, we could track whether the original stack has
> been clobbered by an interrupt, and so avoid (at least some of) the
> copy back afterwards?
>
> One nit in the assembler - if I've followed correctly, this saved IP:
>
>> +/* Perform a near call to push rip onto the stack */
>> +call   1f
>
> is returned to (with adjustments) here:
>
>> +/* Go to user mode return code */
>> +jmp*(%rsi)
>
> It would be good to make this a matched pair of call/ret if we can;
> the CPU has special branch prediction tracking for function calls that
> gets confused by a call that's not returned to.

sure, will do.

> Cheers,
>
> Tim.





Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Tim Deegan
At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
> On 10/08/15 10:49, Tim Deegan wrote:
> > Hi,
> >
> > At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> >> The process to switch into and out of deprivileged mode can be likened to
> >> setjmp/longjmp.
> >>
> >> To enter deprivileged mode, we take a copy of the stack from the guest's
> >> registers up to the current stack pointer.
> > This copy is pretty unfortunate, but I can see that avoiding it will
> > be a bit complex.  Could we do something with more stacks?  AFAICS
> > there have to be three stacks anyway:
> >
> >  - one to hold the depriv execution context;
> >  - one to hold the privileged execution context; and
> >  - one to take interrupts on.
> >
> > So maybe we could do some fiddling to make Xen take interrupts on a
> > different stack while we're depriv'd?
> 
> That should happen naturally by virtue of the privilege level change
> involved in taking the interrupt.

Right, and this is why we need a third stack - so interrupts don't
trash the existing priv state on the 'normal' Xen stack.  And so we
either need to copy the priv stack out (and maybe copy it back), or
tell the CPU to use a different stack.

If we had enough headroom, we could try to be clever and tell the CPU
to take interrupts on the priv stack _below_ the existing state.  That
would avoid the first of your problems below.

> * Under this model, PV exception handlers should copy themselves onto
> the privileged execution stack.
> * Currently, the IST handlers  copy themselves onto the primary stack if
> they interrupt guest context.
> * AMD Task Register on vmexit.  (this old gem)

Gah, this thing. :(

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Ian Campbell
On Thu, 2015-08-06 at 21:55 +0100, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
> > The process to switch into and out of deprivileged mode can be likened 
> > to
> > setjmp/longjmp.
> > 
> > To enter deprivileged mode, we take a copy of the stack from the 
> > guest's
> > registers up to the current stack pointer. This allows us to restore 
> > the stack
> > when we have finished the deprivileged mode operation, meaning we can 
> > continue
> > execution from that point. This is similar to if a context switch had 
> > happened.
> > 
> > To exit deprivileged mode, we copy the stack back, replacing the 
> > current stack.
> > We can then continue execution from where we left off, which will 
> > unwind the
> > stack and free up resources. This method means that we do not need to
> > change any other code paths and its invocation will be transparent to 
> > callers.
> > This should allow the feature to be more easily deployed to different 
> > parts
> > of Xen.
> > 
> > Note that this copy of the stack is per-vcpu but, it will contain per
> > -pcpu data.
> > Extra work is needed to properly migrate vcpus between pcpus.
> 
> Under what circumstances do you see there being persistent state in the
> depriv area between calls, given that the calls are synchronous from VM
> actions?

Would we not want to keep (some of) the device model's state in a depriv
area? e.g. anything which is purely internal to the DM which is therefore
only accessed from depriv-land?

Ian.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-10 Thread Andrew Cooper
On 10/08/15 10:49, Tim Deegan wrote:
> Hi,
>
> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer.
> This copy is pretty unfortunate, but I can see that avoiding it will
> be a bit complex.  Could we do something with more stacks?  AFAICS
> there have to be three stacks anyway:
>
>  - one to hold the depriv execution context;
>  - one to hold the privileged execution context; and
>  - one to take interrupts on.
>
> So maybe we could do some fiddling to make Xen take interrupts on a
> different stack while we're depriv'd?

That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.  Conceptually, taking interrupts from
depriv mode is no different to taking them in a PV guest.

Some complications which come to mind (none insurmountable):

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-10 Thread Tim Deegan
Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> The process to switch into and out of deprivileged mode can be likened to
> setjmp/longjmp.
> 
> To enter deprivileged mode, we take a copy of the stack from the guest's
> registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

 - one to hold the depriv execution context;
 - one to hold the privileged execution context; and
 - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?

If we do have to copy, we could track whether the original stack has
been clobbered by an interrupt, and so avoid (at least some of) the
copy back afterwards?
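
To make the idea concrete, here is a minimal user-space sketch of the bookkeeping, under stated assumptions: the names (`depriv_ctx`, `depriv_note_interrupt`, and so on) and the fixed toy buffer size are hypothetical and are not taken from the patch. It only shows the flag logic, that is, copy on entry, mark on interrupt, copy back only if marked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STACK_BYTES 256   /* toy size for the sketch, not Xen's stack size */

struct depriv_ctx {
    unsigned char saved[STACK_BYTES]; /* copy of the privileged stack */
    size_t len;                       /* number of bytes saved */
    bool clobbered;                   /* did an interrupt reuse the stack? */
};

/* On entry: take the copy and assume the original is still intact. */
static void depriv_save(struct depriv_ctx *c,
                        const unsigned char *stack, size_t len)
{
    assert(len <= STACK_BYTES);
    memcpy(c->saved, stack, len);
    c->len = len;
    c->clobbered = false;
}

/* Would be called from the interrupt path if it ran on that stack. */
static void depriv_note_interrupt(struct depriv_ctx *c)
{
    c->clobbered = true;
}

/* On exit: copy back only when needed. Returns the bytes copied. */
static size_t depriv_restore(struct depriv_ctx *c, unsigned char *stack)
{
    if (!c->clobbered)
        return 0;
    memcpy(stack, c->saved, c->len);
    c->clobbered = false;
    return c->len;
}
```

A finer-grained variant could record how deep the interrupt frames reached and copy back only that range, which is the "at least some of" case.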

One nit in the assembler - if I've followed correctly, this saved IP:

> +/* Perform a near call to push rip onto the stack */
> +call   1f

is returned to (with adjustments) here:

> +/* Go to user mode return code */
> +jmp*(%rsi)

It would be good to make this a matched pair of call/ret if we can;
the CPU has special branch prediction tracking for function calls that
gets confused by a call that's not returned to.

Cheers,

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-07 Thread Andrew Cooper
On 07/08/15 13:51, Ben Catterall wrote:
> On 06/08/15 21:55, Andrew Cooper wrote:
>> On 06/08/15 17:45, Ben Catterall wrote:
>>> The process to switch into and out of deprivileged mode can be
>>> likened to
>>> setjmp/longjmp.
>>>
>>> To enter deprivileged mode, we take a copy of the stack from the
>>> guest's
>>> registers up to the current stack pointer. This allows us to restore
>>> the stack
>>> when we have finished the deprivileged mode operation, meaning we
>>> can continue
>>> execution from that point. This is similar to if a context switch
>>> had happened.
>>>
>>> To exit deprivileged mode, we copy the stack back, replacing the
>>> current stack.
>>> We can then continue execution from where we left off, which will
>>> unwind the
>>> stack and free up resources. This method means that we do not need to
>>> change any other code paths and its invocation will be transparent
>>> to callers.
>>> This should allow the feature to be more easily deployed to
>>> different parts
>>> of Xen.
>>>
>>> Note that this copy of the stack is per-vcpu but, it will contain
>>> per-pcpu data.
>>> Extra work is needed to properly migrate vcpus between pcpus.
>>
>> Under what circumstances do you see there being persistent state in the
>> depriv area between calls, given that the calls are synchronous from VM
>> actions?
>
> I don't know if we can make these synchronous as we need a way to
> interrupt the vcpu if it's spinning for a long time. Otherwise an
> attacker could just spin in depriv and cause a DoS. With that in mind,
> the scheduler may decide to migrate the vcpu whilst it's in depriv
> mode which would mean this per-pcpu data is held in the stack copy
> which is then migrated to another pcpu incorrectly.

If the emulator spins for a sufficient time, it is fine to shoot the
domain.  This is a strict improvement on the current behaviour where a
spinning emulator would shoot the host, via a watchdog timeout.

As said elsewhere, this kind of DoS is not a very interesting attack
vector.  State handling errors which cause Xen to change the wrong thing
are far more interesting from a guest's point of view.

http://xenbits.xen.org/xsa/advisory-123.html (full host compromise) or
http://xenbits.xen.org/xsa/advisory-108.html (read other guests data)
are examples of kinds of interesting issues which could potentially be
mitigated with this depriv infrastructure.

>
>>
>>>
>>> The switch to and from deprivileged mode is performed using sysret
>>> and syscall
>>> respectively.
>>
>> I suspect we need to borrow the SS attribute workaround from Linux to
>> make this function reliably on AMD systems.
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
>>
>>
> >
> Ah! ok, I'll look into this. Thanks!

Just be aware of it.  Don't spend your time attempting to retrofit it to
Xen.  It is more work than it looks.

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-07 Thread David Vrabel
On 07/08/15 13:51, Ben Catterall wrote:
> 
> I don't know if we can make these synchronous as we need a way to
> interrupt the vcpu if it's spinning for a long time. Otherwise an
> attacker could just spin in depriv and cause a DoS. With that in mind,
> the scheduler may decide to migrate the vcpu whilst it's in depriv mode
> which would mean this per-pcpu data is held in the stack copy which is
> then migrated to another pcpu incorrectly.

IMO, DoS attacks on depriv'd emulators aren't very interesting.

I think it is counter-productive to address this attack in this initial
implementation at the expense (delays/complexity/etc.) of solving the
key requirement of mitigating information leaks and privilege escalation
attacks.

David



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-07 Thread Ben Catterall



On 06/08/15 21:55, Andrew Cooper wrote:

On 06/08/15 17:45, Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer. This allows us to restore the stack
when we have finished the deprivileged mode operation, meaning we can continue
execution from that point. This is similar to if a context switch had happened.

To exit deprivileged mode, we copy the stack back, replacing the current stack.
We can then continue execution from where we left off, which will unwind the
stack and free up resources. This method means that we do not need to
change any other code paths and its invocation will be transparent to callers.
This should allow the feature to be more easily deployed to different parts
of Xen.

Note that this copy of the stack is per-vcpu but, it will contain per-pcpu data.
Extra work is needed to properly migrate vcpus between pcpus.


Under what circumstances do you see there being persistent state in the
depriv area between calls, given that the calls are synchronous from VM
actions?


I don't know if we can make these synchronous as we need a way to 
interrupt the vcpu if it's spinning for a long time. Otherwise an 
attacker could just spin in depriv and cause a DoS. With that in mind, 
the scheduler may decide to migrate the vcpu whilst it's in depriv mode 
which would mean this per-pcpu data is held in the stack copy which is 
then migrated to another pcpu incorrectly.






The switch to and from deprivileged mode is performed using sysret and syscall
respectively.


I suspect we need to borrow the SS attribute workaround from Linux to
make this function reliably on AMD systems.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b


>
Ah! ok, I'll look into this. Thanks!


The return paths in entry.S have been edited so that, when we receive an
interrupt whilst in deprivileged mode, we return into that mode correctly.

A hook on the syscall handler in entry.S has also been added which handles
returning from user mode and will support deprivileged mode system calls when
these are needed.

Signed-off-by: Ben Catterall 
---
  xen/arch/x86/domain.c   |  12 +++
  xen/arch/x86/hvm/Makefile   |   1 +
  xen/arch/x86/hvm/deprivileged.c | 103 ++
  xen/arch/x86/hvm/deprivileged_asm.S | 205 
  xen/arch/x86/hvm/vmx/vmx.c  |   7 ++
  xen/arch/x86/x86_64/asm-offsets.c   |   5 +
  xen/arch/x86/x86_64/entry.S |  35 ++
  xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
  xen/include/xen/hvm/deprivileged.h  |  38 +++
  xen/include/xen/sched.h |  18 +++-
  10 files changed, 425 insertions(+), 1 deletion(-)
  create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..a0e5e70 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -62,6 +62,7 @@
  #include 
  #include 
  #include 
+#include 

  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
  DEFINE_PER_CPU(unsigned long, cr4);
@@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
  if ( has_hvm_container_domain(d) )
  {
  rc = hvm_vcpu_initialise(v);
+
+/* Initialise HVM deprivileged mode */
+printk("HVM initialising deprivileged mode ...");


All printk()s should have a XENLOG_$severity prefix.


will do.

+hvm_deprivileged_prepare_vcpu(v);
+printk("Done.\n");
+
  goto done;
  }

@@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
  vcpu_destroy_fpu(v);

  if ( has_hvm_container_vcpu(v) )
+{
+/* Destroy the deprivileged mode on this vcpu */
+hvm_deprivileged_destroy_vcpu(v);
+
  hvm_vcpu_destroy(v);
+}
  else
  xfree(v->arch.pv_vcpu.trap_ctxt);
  }
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index bd83ba3..6819886 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -17,6 +17,7 @@ obj-y += quirks.o
  obj-y += rtc.o
  obj-y += save.o
  obj-y += deprivileged.o
+obj-y += deprivileged_asm.o
  obj-y += stdvga.o
  obj-y += vioapic.o
  obj-y += viridian.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 071d900..979fc69 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
  }
  return 0;
  }
+
+/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu.
+ */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
+{
+struct page_info *pg;
+
+/* TODO: clarify if this MEMF is correct */
+/* Allocate 2^STACK_ORDER contiguous pages */
+pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
+if( pg == NULL )
+{
+panic("

Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-06 Thread Andrew Cooper
On 06/08/15 17:45, Ben Catterall wrote:
> The process to switch into and out of deprivileged mode can be likened to
> setjmp/longjmp.
>
> To enter deprivileged mode, we take a copy of the stack from the guest's
> registers up to the current stack pointer. This allows us to restore the stack
> when we have finished the deprivileged mode operation, meaning we can continue
> execution from that point. This is similar to if a context switch had 
> happened.
>
> To exit deprivileged mode, we copy the stack back, replacing the current 
> stack.
> We can then continue execution from where we left off, which will unwind the
> stack and free up resources. This method means that we do not need to
> change any other code paths and its invocation will be transparent to callers.
> This should allow the feature to be more easily deployed to different parts
> of Xen.
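
For readers unfamiliar with the analogy, a plain C sketch of the setjmp/longjmp pattern being referred to. This is not the patch's mechanism (which copies the stack and uses sysret/syscall); the function names are hypothetical and only illustrate "save a resume point, run something else, jump back to it".

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf priv_ctx;   /* saved resume point of the "privileged" side */
static int depriv_ran;     /* static, so its value survives the longjmp */

/* Stand-in for the deprivileged operation; it never returns normally. */
static void depriv_operation(void)
{
    depriv_ran = 1;
    longjmp(priv_ctx, 1);  /* analogous to leaving deprivileged mode */
}

/* Returns 1 once the operation has run and control has come back. */
static int run_deprivileged(void)
{
    if (setjmp(priv_ctx) == 0) {
        /* First pass: context saved, now "enter" the operation. */
        depriv_operation();
        return -1;         /* not reached */
    }
    /* Second pass: we arrive here via longjmp, as if nothing happened. */
    return depriv_ran;
}
```

The caller of `run_deprivileged()` sees an ordinary function call, which is the transparency property claimed above.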
>
> Note that this copy of the stack is per-vcpu but, it will contain per-pcpu 
> data.
> Extra work is needed to properly migrate vcpus between pcpus.

Under what circumstances do you see there being persistent state in the
depriv area between calls, given that the calls are synchronous from VM
actions?

>
> The switch to and from deprivileged mode is performed using sysret and syscall
> respectively.

I suspect we need to borrow the SS attribute workaround from Linux to
make this function reliably on AMD systems.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b

>
> The return paths in entry.S have been edited so that, when we receive an
> interrupt whilst in deprivileged mode, we return into that mode correctly.
>
> A hook on the syscall handler in entry.S has also been added which handles
> returning from user mode and will support deprivileged mode system calls when
> these are needed.
>
> Signed-off-by: Ben Catterall 
> ---
>  xen/arch/x86/domain.c   |  12 +++
>  xen/arch/x86/hvm/Makefile   |   1 +
>  xen/arch/x86/hvm/deprivileged.c | 103 ++
>  xen/arch/x86/hvm/deprivileged_asm.S | 205 
> 
>  xen/arch/x86/hvm/vmx/vmx.c  |   7 ++
>  xen/arch/x86/x86_64/asm-offsets.c   |   5 +
>  xen/arch/x86/x86_64/entry.S |  35 ++
>  xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
>  xen/include/xen/hvm/deprivileged.h  |  38 +++
>  xen/include/xen/sched.h |  18 +++-
>  10 files changed, 425 insertions(+), 1 deletion(-)
>  create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 045f6ff..a0e5e70 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -62,6 +62,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
>  DEFINE_PER_CPU(unsigned long, cr4);
> @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
>  if ( has_hvm_container_domain(d) )
>  {
>  rc = hvm_vcpu_initialise(v);
> +
> +/* Initialise HVM deprivileged mode */
> +printk("HVM initialising deprivileged mode ...");

All printk()s should have a XENLOG_$severity prefix.
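
As an illustration of the convention, a sketch that compiles outside Xen. The `XENLOG_*` definitions below are stand-ins written for this example (the real ones come from Xen's headers and their values are assumed here), and `log_line` is a hypothetical substitute for printk(). The point is only that the severity is a string literal concatenated onto the front of the format string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for this sketch only; in the tree, use Xen's own definitions. */
#define XENLOG_ERR     "<0>"
#define XENLOG_WARNING "<1>"
#define XENLOG_INFO    "<2>"
#define XENLOG_DEBUG   "<3>"

/* Hypothetical printk() substitute that formats into a caller's buffer. */
static int log_line(char *out, size_t n, const char *msg)
{
    return snprintf(out, n, "%s", msg);
}

/* The severity literal is glued to the message at compile time. */
static int log_depriv_init(char *out, size_t n)
{
    return log_line(out, n,
                    XENLOG_INFO "HVM: deprivileged mode initialised\n");
}
```

Emitting one complete line per call, rather than a "..." followed later by "Done.", also avoids other CPUs' messages landing in the middle.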

> +hvm_deprivileged_prepare_vcpu(v);
> +printk("Done.\n");
> +
>  goto done;
>  }
>  
> @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
>  vcpu_destroy_fpu(v);
>  
>  if ( has_hvm_container_vcpu(v) )
> +{
> +/* Destroy the deprivileged mode on this vcpu */
> +hvm_deprivileged_destroy_vcpu(v);
> +
>  hvm_vcpu_destroy(v);
> +}
>  else
>  xfree(v->arch.pv_vcpu.trap_ctxt);
>  }
> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
> index bd83ba3..6819886 100644
> --- a/xen/arch/x86/hvm/Makefile
> +++ b/xen/arch/x86/hvm/Makefile
> @@ -17,6 +17,7 @@ obj-y += quirks.o
>  obj-y += rtc.o
>  obj-y += save.o
>  obj-y += deprivileged.o
> +obj-y += deprivileged_asm.o
>  obj-y += stdvga.o
>  obj-y += vioapic.o
>  obj-y += viridian.o
> diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
> index 071d900..979fc69 100644
> --- a/xen/arch/x86/hvm/deprivileged.c
> +++ b/xen/arch/x86/hvm/deprivileged.c
> @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>  }
>  return 0;
>  }
> +
> +/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu.
> + */
> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
> +{
> +struct page_info *pg;
> +
> +/* TODO: clarify if this MEMF is correct */
> +/* Allocate 2^STACK_ORDER contiguous pages */
> +pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
> +if( pg == NULL )
> +{
> +panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n");
> +return -ENOMEM;
> +}
> +
> +vcpu->stack = page_to_virt(pg);

Xen has two heaps, the xenheap and the domheap.

You may only construct pointers like this into the xenheap.  The domheap
is not guaranteed to have safe virtual m