Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-20 Thread Ben Catterall



On 10/08/15 11:14, Andrew Cooper wrote:

On 10/08/15 10:49, Tim Deegan wrote:

Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

  - one to hold the depriv execution context;
  - one to hold the privileged execution context; and
  - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?


That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.  Conceptually, taking interrupts from
depriv mode is no different to taking them in a PV guest.

Some complications which come to mind (none insurmountable):

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.

From what I understand from entry.S's assembly:
handle_ist_exception is used for machine_check and nmi ISTs and these 
perform this copy. The double fault handler does not do this copy.

 - we take the IST on a different stack page
 - the handler copies the guest's registers from its current page to 
 the bottom of the privileged stack so access routines for this still 
work as usual

 - Moves its rsp to just after this structure in the privileged stack
 - Calls do_nmi
 - does a ret_from_intr with the stack ptr on the privileged stack

Now, I _think_ it's sufficient to perform this copy and then just keep 
the rsp on the IST stack page (rather than moving it across as is 
currently done) so that we don't clobber the privileged stack.


Then, on the return path, move our rsp back to the privileged stack, 
just after the guest registers so that ret_from_intr can use the copied 
(and possibly modified) guest's registers.
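
For illustration, the proposed flow can be modelled in user space. This is only a sketch under stated assumptions, not the entry.S code: `struct regs`, `struct stack_page`, `struct cpu_model` and both functions are hypothetical names, and the simulated `rsp` is just a pointer field.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-down guest register frame. */
struct regs {
    uint64_t rip, rsp, rax;
};

/* Hypothetical stack page: ordinary stack space with the guest
 * register frame at the base of the stack (highest addresses). */
struct stack_page {
    uint64_t body[61];   /* in-flight state lives here */
    struct regs guest;   /* frame used by the access routines */
};

struct cpu_model {
    struct stack_page ist;   /* page the IST exception was taken on */
    struct stack_page priv;  /* privileged stack, must not be clobbered */
    void *rsp;               /* simulated stack pointer */
};

/* Proposed entry: copy the frame across, but keep rsp on the IST page
 * so the handler never pushes onto the privileged stack. */
static void ist_entry_proposed(struct cpu_model *c)
{
    memcpy(&c->priv.guest, &c->ist.guest, sizeof(c->priv.guest));
    c->rsp = &c->ist.guest;
}

/* Proposed exit: point rsp at the copied (and possibly modified) frame
 * on the privileged stack, ready for the return path to consume it. */
static void ist_exit_proposed(struct cpu_model *c)
{
    c->rsp = &c->priv.guest;
}
```

In this model the body of the privileged stack is never written between entry and exit, which is the property the proposal relies on.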


Does that sound reasonable?

Thanks in advance!

* AMD Task Register on vmexit.  (this old gem)

~Andrew



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-19 Thread Ben Catterall



On 18/08/15 17:55, Andrew Cooper wrote:



On 17/08/15 08:07, Tim Deegan wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

On 12/08/15 14:33, Andrew Cooper wrote:

On 12/08/15 14:29, Andrew Cooper wrote:

On 11/08/15 19:29, Boris Ostrovsky wrote:

Would switching TR only when we know that we need to enter this
deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.


Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.

Ahem! s/which/switch/

~Andrew


So, have we arrived at a decision for this? Thanks!


Apologies for the delay - I am currently at the Xen Developer Summit.


No worries! Hope you're enjoying the summit. :)

Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a
different SP;


Xen re-enables interrupts in most interrupt handlers, which means that
they must not have an IST set.  If an IST was set, a second interrupt
would clobber the frame of the first.

However, just adjusting tss->rsp0 and syscall top-of-stack to the
current rsp when entering depriv mode should be sufficient, and will
avoid needing to copy the stack.

Got it, thanks!



  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];
  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.


They will need adjusting, but just disabling the copy entirely should be
ok.

ok.




Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Agreed.  It is orthogonal to this problem.

~Andrew




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Ben Catterall



On 18/08/15 11:25, Ben Catterall wrote:



On 17/08/15 16:17, Jan Beulich wrote:

On 17.08.15 at 17:07, t...@xen.org wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

So, have we arrived at a decision for this? Thanks!


Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a
different SP;
  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];

 From Xen's cpu stack layout, page 4 is currently unused so I'll put it
here. Is this acceptable?
Or, would it be better to put it at position 5, and move the optional 
MEMORY_GUARD page down to position 4?

  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.

Any other proposals?


No.





I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Just like for the other one - at this point I think anything that works
should be okay. Dealing with quirks can be deferred (but it would
be nice if a respective note was added in a prominent place so it
doesn't get forgotten once/if these patches leave RFC state).

Jan


Ok, thanks all!




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Ben Catterall



On 17/08/15 16:17, Jan Beulich wrote:

On 17.08.15 at 17:07, t...@xen.org wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

So, have we arrived at a decision for this? Thanks!


Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];
From Xen's cpu stack layout, page 4 is currently unused so I'll put it 
here. Is this acceptable?

  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.

Any other proposals?


No.





I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Just like for the other one - at this point I think anything that works
should be okay. Dealing with quirks can be deferred (but it would
be nice if a respective note was added in a prominent place so it
doesn't get forgotten once/if these patches leave RFC state).

Jan


Ok, thanks all!



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Andrew Cooper



On 17/08/15 08:07, Tim Deegan wrote:

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:

On 12/08/15 14:33, Andrew Cooper wrote:

On 12/08/15 14:29, Andrew Cooper wrote:

On 11/08/15 19:29, Boris Ostrovsky wrote:

Would switching TR only when we know that we need to enter this
deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.


Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.

Ahem! s/which/switch/

~Andrew


So, have we arrived at a decision for this? Thanks!


Apologies for the delay - I am currently at the Xen Developer Summit.


Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a different SP;


Xen re-enables interrupts in most interrupt handlers, which means that 
they must not have an IST set.  If an IST was set, a second interrupt 
would clobber the frame of the first.
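
The clobbering can be illustrated with a small user-space model. This is a sketch, not Xen code: `struct frame`, `deliver_via_ist()` and `deliver_nested()` are hypothetical names, and a "frame" here holds only a return address.

```c
#include <stdint.h>

/* Hypothetical, cut-down exception frame. */
struct frame {
    uint64_t rip;
};

#define SLOTS 8

/* One IST stack, modelled as an array of frame-sized slots. */
static struct frame ist_stack[SLOTS];

/* Delivery through an IST entry: the CPU always starts from the same
 * fixed top-of-stack, so every frame lands in the same slot. */
static struct frame *deliver_via_ist(uint64_t rip)
{
    struct frame *f = &ist_stack[SLOTS - 1];
    f->rip = rip;
    return f;
}

/* Delivery without IST while already in ring 0: the frame is pushed
 * below the current stack pointer, so nested frames stack up. */
static struct frame *deliver_nested(struct frame *sp, uint64_t rip)
{
    struct frame *f = sp - 1;
    f->rip = rip;
    return f;
}
```

Calling `deliver_via_ist()` twice returns the same slot, with the first return address overwritten; calling `deliver_nested()` twice, passing the first result as the new stack pointer, leaves both frames intact.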


However, just adjusting tss->rsp0 and syscall top-of-stack to the
current rsp when entering depriv mode should be sufficient, and will 
avoid needing to copy the stack.
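
As a rough sketch of that suggestion (user-space model only; `struct tss_model`, `struct depriv_state` and the function names are hypothetical, and the syscall top-of-stack adjustment is not modelled):

```c
#include <stdint.h>

/* Hypothetical stand-in for the one TSS field this sketch needs. */
struct tss_model {
    uint64_t rsp0;  /* stack pointer loaded on a ring 3 -> ring 0 entry */
};

/* Hypothetical bookkeeping kept while in deprivileged mode. */
struct depriv_state {
    uint64_t saved_rsp0;
};

/* On entry: remember the old rsp0 and make ring-0 entries start at the
 * current rsp, i.e. below the privileged state already on the stack. */
static void depriv_enter(struct tss_model *tss, struct depriv_state *st,
                         uint64_t current_rsp)
{
    st->saved_rsp0 = tss->rsp0;
    tss->rsp0 = current_rsp;
}

/* On exit: put the original rsp0 back. */
static void depriv_leave(struct tss_model *tss, const struct depriv_state *st)
{
    tss->rsp0 = st->saved_rsp0;
}

/* Lowest address an interrupt frame of frame_bytes would occupy when
 * taken from ring 3: the frame is pushed downwards from rsp0. */
static uint64_t interrupt_frame_base(const struct tss_model *tss,
                                     uint64_t frame_bytes)
{
    return tss->rsp0 - frame_bytes;
}
```

With `rsp0` set to the current `rsp`, an interrupt frame lies entirely below the existing privileged state, so nothing above `rsp` is overwritten and no copy is needed.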



  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];
  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.


They will need adjusting, but just disabling the copy entirely should be ok.



Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a
separate issue.


Agreed.  It is orthogonal to this problem.

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-18 Thread Jan Beulich
 On 18.08.15 at 12:26, ben.catter...@citrix.com wrote:
 On 18/08/15 11:25, Ben Catterall wrote:
 On 17/08/15 16:17, Jan Beulich wrote:
 On 17.08.15 at 17:07, t...@xen.org wrote:
 At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
 So, have we arrived at a decision for this? Thanks!

 Seems to have stalled a bit.  OK, I propose that:
   - we use TR/IST to make Xen take interrupts/exceptions at a
 different SP;
   - we make that SP be an extension of the main stack, so that things
 like current() Just Work[tm];
  From Xen's cpu stack layout, page 4 is currently unused so I'll put it
 here. Is this acceptable?
 Or, would it be better to put it at position 5, and move the optional 
 MEMORY_GUARD page down to position 4?

At this stage either would do. Once (if) the experiment yielded
positive results, we can still evaluate the pros and cons of either
placement.

Jan




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-17 Thread Ben Catterall



On 12/08/15 14:33, Andrew Cooper wrote:

On 12/08/15 14:29, Andrew Cooper wrote:

On 11/08/15 19:29, Boris Ostrovsky wrote:

On 08/11/2015 01:19 PM, Andrew Cooper wrote:

On 11/08/15 18:05, Tim Deegan wrote:

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)

Gah, this thing. :(

Curious (and I can't seem to find this in the manuals): What is this
thing?

IIRC: AMD processors don't context switch TR on vmexit,

Correct


which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task
register before stgi, or IST exceptions (NMI, MCE and double fault) will
be taken with guest-supplied stack pointers.


We'd have to do the TR context switch ourselves, and that would be
expensive.

It is suspected to be expensive, but I have never actually seen any
numbers one way or another.


Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had
the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM
vcpus, which interacts catastrophically with debug builds using
MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
detect stack overflows, but without an IST double fault handler, ends in
a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect
this is probably the way to go, but have not had time to investigate
whether there is any performance impact from doing so.  Given how little
of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
as expensive as it might have been in legacy modes.

(CC'ing the AMD SVM maintainers to see if they have any information on
this subject)


I actually didn't even realize that TR is not saved on vmexit ;-/.

Would switching TR only when we know that we need to enter this
deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.


Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.


Ahem! s/which/switch/

~Andrew



So, have we arrived at a decision for this? Thanks!

Ben



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-17 Thread Tim Deegan
At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
 On 12/08/15 14:33, Andrew Cooper wrote:
  On 12/08/15 14:29, Andrew Cooper wrote:
  On 11/08/15 19:29, Boris Ostrovsky wrote:
  Would switching TR only when we know that we need to enter this
  deprivileged mode help?
  This is an absolute must.  It is not safe to use syscall/sysexit without
  IST in place for NMIs and MCEs.
 
  Assuming that it is less expensive than copying the stack.
  I was referring to the stack overflow issue, and whether it might be
  sensible to pro-actively which TR.
 
  Ahem! s/which/switch/
 
  ~Andrew
 
 
 So, have we arrived at a decision for this? Thanks!

Seems to have stalled a bit.  OK, I propose that:
 - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
 - we make that SP be an extension of the main stack, so that things
   like current() Just Work[tm];
 - we set this up and tear it down when we enter/leave depriv mode.
 - someone ought to look at the case where IST handlers copy
   themselves to the main stack, and see if we need to adjust that too.

Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a
separate issue.

Cheers,

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-17 Thread Jan Beulich
 On 17.08.15 at 17:07, t...@xen.org wrote:
 At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
 So, have we arrived at a decision for this? Thanks!
 
 Seems to have stalled a bit.  OK, I propose that:
  - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
  - we make that SP be an extension of the main stack, so that things
like current() Just Work[tm];
  - we set this up and tear it down when we enter/leave depriv mode.
  - someone ought to look at the case where IST handlers copy
themselves to the main stack, and see if we need to adjust that too.
 
 Any other proposals?

No.

 I think we can leave the question of TR switching on VMEXIT as a
 separate issue.

Just like for the other one - at this point I think anything that works
should be okay. Dealing with quirks can be deferred (but it would
be nice if a respective note was added in a prominent place so it
doesn't get forgotten once/if these patches leave RFC state).

Jan




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Jan Beulich
 On 11.08.15 at 19:19, andrew.coop...@citrix.com wrote:
 KVM unilaterally reloads the host task register on vmexit, and I suspect
 this is probably the way to go, but have not had time to investigate
 whether there is any performance impact from doing so.  Given how little
 of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
 as expensive as it might have been in legacy modes.

How much of the TSS is being used shouldn't matter at all for
LTR execution time, since the instruction doesn't access the
TSS itself (and iirc there's also no caching documented for TSS
uses).

Jan




Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Ben Catterall



On 11/08/15 18:05, Tim Deegan wrote:

Hi,

At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote:

On 11/08/15 10:55, Tim Deegan wrote:

At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:

On 10/08/15 10:49, Tim Deegan wrote:

Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

   - one to hold the depriv execution context;
   - one to hold the privileged execution context; and
   - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?


That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.


Right, and this is why we need a third stack - so interrupts don't
trash the existing priv state on the 'normal' Xen stack.  And so we
either need to copy the priv stack out (and maybe copy it back), or
tell the CPU to use a different stack.


The copy is relatively small and paid only on the first and last entries
into the mode. I don't know if this is cheaper than the  bookwork that
would be needed on entering and returning from the mode to switch to
these stacks. I'm assuming the sp pointers in the TSS and ISTs would
need changing on the first and last entry/exit if we have the extra
stack, is that correct?


Yep.


Or, is this a more dramatic change in that
everything uses this three stack model rather than just this feature.


Well, some other parts would have to change to accommodate this new
behaviour - that was what Andrew was talking about.

BTW, I think there need to be three stacks anyway, since the depriv
code shouldn't be allowed to write to the priv code's stack frames.
Or maybe I've misunderstood how much access the depriv code will have.

So, just to clarify:

We have a separate deprivileged stack allocated which the deprivileged 
code uses. This is mapped in user mode.


We have the privileged stack which Xen runs on. To prevent this being 
clobbered when we are in our mode and take an interrupt, we copy this 
out to a buffer. This buffer is the saved privileged stack state.


So, we sort of have three stacks already, just the privileged stack is 
copied out to a buffer, rather than switching pointers to another 
interrupt stack.
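
To make the copy-out/copy-back concrete, here is a minimal user-space sketch. It is not the code from the patch: `struct saved_stack`, `stack_save()` and `stack_restore()` are hypothetical names, and the 4096-byte buffer size is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 4096   /* assumed buffer size for this sketch */

/* The buffer holding the saved privileged stack state. */
struct saved_stack {
    uint8_t buf[STACK_BYTES];
    size_t len;              /* number of bytes currently saved */
};

/* Copy the live region [sp, base) of a downward-growing stack into the
 * buffer. Returns 0 on success, -1 if the region does not fit. */
static int stack_save(struct saved_stack *s, const uint8_t *sp,
                      const uint8_t *base)
{
    size_t len = (size_t)(base - sp);

    if (len > sizeof(s->buf))
        return -1;
    memcpy(s->buf, sp, len);
    s->len = len;
    return 0;
}

/* Copy the saved region back so that it ends at base again, replacing
 * whatever interrupts left on the stack in the meantime. */
static void stack_restore(const struct saved_stack *s, uint8_t *base)
{
    memcpy(base - s->len, s->buf, s->len);
}
```

The buffer plays the role of the third stack in the sense discussed above: the privileged state survives in it while interrupts reuse the real stack.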


Hopefully that clarifies?




I'm not sure how much in Xen would need changing to switch across to
using three stacks. Also, would this also need to be done for PV guests?
Would that need to be a separate patch series?

What's the overall consensus? Thanks!


I'm not sure there is one yet -- needs some more discussion of
whether the non-copying approach is feasible.


If we had enough headroom, we could try to be clever and tell the CPU
to take interrupts on the priv stack _below_ the existing state.  That
would avoid the first of your problems below.


* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)


Gah, this thing. :(

Curious (and I can't seem to find this in the manuals): What is this thing?


IIRC: AMD processors don't context switch TR on vmexit, which makes
using IST handlers tricky there.  We'd have to do the TR context
switch ourselves, and that would be expensive.  Andrew, am I
remembering that right?


Thanks!

Tim.





Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Tim Deegan
At 14:22 +0100 on 12 Aug (1439389325), Ben Catterall wrote:
 On 11/08/15 18:05, Tim Deegan wrote:
  BTW, I think there need to be three stacks anyway, since the depriv
  code shouldn't be allowed to write to the priv code's stack frames.
  Or maybe I've misunderstood how much access the depriv code will have.
 So, just to clarify:
 
 We have a separate deprivileged stack allocated which the deprivileged 
 code uses. This is mapped in user mode.
 
 We have the privileged stack which Xen runs on. To prevent this being 
 clobbered when we are in our mode and take an interrupt, we copy this 
 out to a buffer. This buffer is the saved privileged stack state.
 
 So, we sort of have three stacks already, just the privileged stack is 
 copied out to a buffer, rather than switching pointers to another 
 interrupt stack.
 
 Hopefully that clarifies?

Yes, thanks -- the buffer is what I was thinking of as a third stack.

Cheers,

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Andrew Cooper
On 12/08/15 14:29, Andrew Cooper wrote:
 On 11/08/15 19:29, Boris Ostrovsky wrote:
 On 08/11/2015 01:19 PM, Andrew Cooper wrote:
 On 11/08/15 18:05, Tim Deegan wrote:
 * Under this model, PV exception handlers should copy themselves onto
 the privileged execution stack.
 * Currently, the IST handlers copy themselves onto the primary stack if
 they interrupt guest context.
 * AMD Task Register on vmexit.  (this old gem)
 Gah, this thing. :(
 Curious (and I can't seem to find this in the manuals): What is this
 thing?
 IIRC: AMD processors don't context switch TR on vmexit,
 Correct

 which makes using IST handlers tricky there.
 (That is one way of putting it)

 IST handlers cannot be used by Xen if Xen does not switch the task
 register before stgi, or IST exceptions (NMI, MCE and double fault) will
 be taken with guest-supplied stack pointers.

 We'd have to do the TR context switch ourselves, and that would be
 expensive.
 It is suspected to be expensive, but I have never actually seen any
 numbers one way or another.

 Andrew, am I remembering that right?
 Looks about right.

 I have been meaning to investigate this for a while, but never had
 the time.

 Xen opts for disabling interrupt stack tables in the context of AMD HVM
 vcpus, which interacts catastrophically with debug builds using
 MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
 detect stack overflows, but without an IST double fault handler, ends in
 a triple fault rather than a host crash detailing the stack overflow.

 KVM unilaterally reloads the host task register on vmexit, and I suspect
 this is probably the way to go, but have not had time to investigate
 whether there is any performance impact from doing so.  Given how little
 of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
 as expensive as it might have been in legacy modes.

 (CC'ing the AMD SVM maintainers to see if they have any information on
 this subject)

 I actually didn't even realize that TR is not saved on vmexit ;-/.

 Would switching TR only when we know that we need to enter this
 deprivileged mode help?
 This is an absolute must.  It is not safe to use syscall/sysexit without
 IST in place for NMIs and MCEs.

 Assuming that it is less expensive than copying the stack.
 I was referring to the stack overflow issue, and whether it might be
 sensible to pro-actively which TR.

Ahem! s/which/switch/

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-12 Thread Andrew Cooper
On 11/08/15 19:29, Boris Ostrovsky wrote:
 On 08/11/2015 01:19 PM, Andrew Cooper wrote:
 On 11/08/15 18:05, Tim Deegan wrote:
 * Under this model, PV exception handlers should copy themselves onto
 the privileged execution stack.
 * Currently, the IST handlers copy themselves onto the primary stack if
 they interrupt guest context.
 * AMD Task Register on vmexit.  (this old gem)
 Gah, this thing. :(
 Curious (and I can't seem to find this in the manuals): What is this
 thing?
 IIRC: AMD processors don't context switch TR on vmexit,
 Correct

 which makes using IST handlers tricky there.
 (That is one way of putting it)

 IST handlers cannot be used by Xen if Xen does not switch the task
 register before stgi, or IST exceptions (NMI, MCE and double fault) will
 be taken with guest-supplied stack pointers.

 We'd have to do the TR context switch ourselves, and that would be
 expensive.
 It is suspected to be expensive, but I have never actually seen any
 numbers one way or another.

 Andrew, am I remembering that right?
 Looks about right.

 I have been meaning to investigate this for a while, but never had
 the time.

 Xen opts for disabling interrupt stack tables in the context of AMD HVM
 vcpus, which interacts catastrophically with debug builds using
 MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
 detect stack overflows, but without an IST double fault handler, ends in
 a triple fault rather than a host crash detailing the stack overflow.

 KVM unilaterally reloads the host task register on vmexit, and I suspect
 this is probably the way to go, but have not had time to investigate
 whether there is any performance impact from doing so.  Given how little
 of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
 as expensive as it might have been in legacy modes.

 (CC'ing the AMD SVM maintainers to see if they have any information on
 this subject)


 I actually didn't even realize that TR is not saved on vmexit ;-/.

 Would switching TR only when we know that we need to enter this
 deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.

 Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Tim Deegan
At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
 On 10/08/15 10:49, Tim Deegan wrote:
  Hi,
 
  At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
  The process to switch into and out of deprivileged mode can be likened to
  setjmp/longjmp.
 
  To enter deprivileged mode, we take a copy of the stack from the guest's
  registers up to the current stack pointer.
  This copy is pretty unfortunate, but I can see that avoiding it will
  be a bit complex.  Could we do something with more stacks?  AFAICS
  there have to be three stacks anyway:
 
   - one to hold the depriv execution context;
   - one to hold the privileged execution context; and
   - one to take interrupts on.
 
  So maybe we could do some fiddling to make Xen take interrupts on a
  different stack while we're depriv'd?
 
 That should happen naturally by virtue of the privilege level change
 involved in taking the interrupt.

Right, and this is why we need a third stack - so interrupts don't
trash the existing priv state on the 'normal' Xen stack.  And so we
either need to copy the priv stack out (and maybe copy it back), or
tell the CPU to use a different stack.

If we had enough headroom, we could try to be clever and tell the CPU
to take interrupts on the priv stack _below_ the existing state.  That
would avoid the first of your problems below.

 * Under this model, PV exception handlers should copy themselves onto
 the privileged execution stack.
 * Currently, the IST handlers  copy themselves onto the primary stack if
 they interrupt guest context.
 * AMD Task Register on vmexit.  (this old gem)

Gah, this thing. :(

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Ian Campbell
On Thu, 2015-08-06 at 21:55 +0100, Andrew Cooper wrote:
 On 06/08/15 17:45, Ben Catterall wrote:
  The process to switch into and out of deprivileged mode can be likened to
  setjmp/longjmp.
  
  To enter deprivileged mode, we take a copy of the stack from the guest's
  registers up to the current stack pointer. This allows us to restore the
  stack when we have finished the deprivileged mode operation, meaning we can
  continue execution from that point. This is similar to if a context switch
  had happened.
  
  To exit deprivileged mode, we copy the stack back, replacing the current
  stack. We can then continue execution from where we left off, which will
  unwind the stack and free up resources. This method means that we do not
  need to change any other code paths and its invocation will be transparent
  to callers. This should allow the feature to be more easily deployed to
  different parts of Xen.
  
  Note that this copy of the stack is per-vcpu but, it will contain per-pcpu
  data. Extra work is needed to properly migrate vcpus between pcpus.
 
 Under what circumstances do you see there being persistent state in the
 depriv area between calls, given that the calls are synchronous from VM
 actions?

Would we not want to keep (some of) the device model's state in a depriv
area? e.g. anything which is purely internal to the DM which is therefore
only accessed from depriv-land?

Ian.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Ben Catterall



On 10/08/15 10:49, Tim Deegan wrote:

Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer.


This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

  - one to hold the depriv execution context;
  - one to hold the privileged execution context; and
  - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?

If we do have to copy, we could track whether the original stack has
been clobbered by an interrupt, and so avoid (at least some of) the
copy back afterwards?
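
A minimal sketch of that idea, as a user-space model with hypothetical names (`struct depriv_ctx` and the `ctx_*` functions); where the flag would actually be set in Xen is left open here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical saved-state record with a "clobbered" flag. */
struct depriv_ctx {
    uint8_t saved[256];   /* copy of the original stack contents */
    size_t len;           /* bytes saved */
    bool clobbered;       /* set once an interrupt has used the stack */
    unsigned restores;    /* counts real copy-backs, for illustration */
};

/* Take the copy on entry. Returns 0 on success, -1 if it does not fit. */
static int ctx_save(struct depriv_ctx *c, const uint8_t *live, size_t len)
{
    if (len > sizeof(c->saved))
        return -1;
    memcpy(c->saved, live, len);
    c->len = len;
    c->clobbered = false;
    return 0;
}

/* Would be called from the interrupt path while deprivileged. */
static void ctx_note_interrupt(struct depriv_ctx *c)
{
    c->clobbered = true;
}

/* Copy back only if something actually overwrote the original. */
static void ctx_restore(struct depriv_ctx *c, uint8_t *live)
{
    if (!c->clobbered)
        return;
    memcpy(live, c->saved, c->len);
    c->clobbered = false;
    c->restores++;
}
```

If no interrupt arrives while deprivileged, `ctx_restore()` returns without copying; the copy on entry is still paid in every case.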

One nit in the assembler - if I've followed correctly, this saved IP:


+/* Perform a near call to push rip onto the stack */
+call   1f


is returned to (with adjustments) here:


+/* Go to user mode return code */
+jmp    *(%rsi)


It would be good to make this a matched pair of call/ret if we can;
the CPU has special branch prediction tracking for function calls that
gets confused by a call that's not returned to.


sure, will do.

Cheers,

Tim.





Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Ben Catterall



On 11/08/15 10:55, Tim Deegan wrote:

At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:

On 10/08/15 10:49, Tim Deegan wrote:

Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

  - one to hold the depriv execution context;
  - one to hold the privileged execution context; and
  - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?


That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.


Right, and this is why we need a third stack - so interrupts don't
trash the existing priv state on the 'normal' Xen stack.  And so we
either need to copy the priv stack out (and maybe copy it back), or
tell the CPU to use a different stack.


The copy is relatively small and paid only on the first and last entries
into the mode. I don't know if this is cheaper than the bookwork that
would be needed on entering and returning from the mode to switch to
these stacks. I'm assuming the sp pointers in the TSS and ISTs would
need changing on the first and last entry/exit if we have the extra
stack, is that correct? Or is this a more dramatic change, in that
everything uses this three-stack model rather than just this feature?


I'm not sure how much in Xen would need changing to switch across to 
using three stacks. Also, would this also need to be done for PV guests? 
Would that need to be a separate patch series?


What's the overall consensus? Thanks!



If we had enough headroom, we could try to be clever and tell the CPU
to take interrupts on the priv stack _below_ the existing state.  That
would avoid the first of your problems below.


* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)


Gah, this thing.

Curious (and I can't seem to find this in the manuals): What is this thing?


Tim.





Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Tim Deegan
Hi,

At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote:
 On 11/08/15 10:55, Tim Deegan wrote:
  At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
  On 10/08/15 10:49, Tim Deegan wrote:
  Hi,
 
  At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
  The process to switch into and out of deprivileged mode can be likened to
  setjmp/longjmp.
 
  To enter deprivileged mode, we take a copy of the stack from the guest's
  registers up to the current stack pointer.
  This copy is pretty unfortunate, but I can see that avoiding it will
  be a bit complex.  Could we do something with more stacks?  AFAICS
  there have to be three stacks anyway:
 
- one to hold the depriv execution context;
- one to hold the privileged execution context; and
- one to take interrupts on.
 
  So maybe we could do some fiddling to make Xen take interrupts on a
  different stack while we're depriv'd?
 
  That should happen naturally by virtue of the privilege level change
  involved in taking the interrupt.
 
  Right, and this is why we need a third stack - so interrupts don't
  trash the existing priv state on the 'normal' Xen stack.  And so we
  either need to copy the priv stack out (and maybe copy it back), or
  tell the CPU to use a different stack.
 
 The copy is relatively small and paid only on the first and last entries 
 into the mode. I don't know if this is cheaper than the  bookwork that 
 would be needed on entering and returning from the mode to switch to 
 these stacks. I'm assuming the sp pointers in the TSS and ISTs would 
 need changing on the first and last entry/exit if we have the extra 
 stack, is that correct?

Yep.

 Or, is this a more dramatic change in that 
 everything uses this three stack model rather than just this feature.

Well, some other parts would have to change to accommodate this new
behaviour - that was what Andrew was talking about.

BTW, I think there need to be three stacks anyway, since the depriv
code shouldn't be allowed to write to the priv code's stack frames.
Or maybe I've misunderstood how much access the depriv code will have.

 I'm not sure how much in Xen would need changing to switch across to 
 using three stacks. Also, would this also need to be done for PV guests? 
 Would that need to be a separate patch series?
 
 What's the overall consensus? Thanks!

I'm not sure there is one yet -- needs some more discussion of
whether the non-copying approach is feasible.

  If we had enough headroom, we could try to be clever and tell the CPU
  to take interrupts on the priv stack _below_ the existing state.  That
  would avoid the first of your problems below.
 
  * Under this model, PV exception handlers should copy themselves onto
  the privileged execution stack.
  * Currently, the IST handlers  copy themselves onto the primary stack if
  they interrupt guest context.
  * AMD Task Register on vmexit.  (this old gem)
 
  Gah, this thing.
 Curious (and I can't seem to find this in the manuals): What is this thing?

IIRC: AMD processors don't context switch TR on vmexit, which makes
using IST handlers tricky there.  We'd have to do the TR context
switch ourselves, and that would be expensive.  Andrew, am I
remembering that right?

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Andrew Cooper
On 11/08/15 18:05, Tim Deegan wrote:

 * Under this model, PV exception handlers should copy themselves onto
 the privileged execution stack.
 * Currently, the IST handlers  copy themselves onto the primary stack if
 they interrupt guest context.
 * AMD Task Register on vmexit.  (this old gem)
 Gah, this thing.
 Curious (and I can't seem to find this in the manuals): What is this thing?
 IIRC: AMD processors don't context switch TR on vmexit,

Correct

 which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task
register before stgi, or IST exceptions (NMI, MCE and double fault) will
be taken with guest-supplied stack pointers.

 We'd have to do the TR context switch ourselves, and that would be expensive.

It is suspected to be expensive, but I have never actually seen any
numbers one way or another.

 Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM
vcpus, which interacts catastrophically with debug builds using
MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
detect stack overflows, but without an IST double fault handler, ends in
a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect
this is probably the way to go, but have not had time to investigate
whether there is any performance impact from doing so.  Given how little
of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
as expensive as it might have been in legacy modes.

(CC'ing the AMD SVM maintainers to see if they have any information on
this subject)

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-11 Thread Boris Ostrovsky

On 08/11/2015 01:19 PM, Andrew Cooper wrote:

On 11/08/15 18:05, Tim Deegan wrote:

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)

Gah, this thing.

Curious (and I can't seem to find this in the manuals): What is this thing?

IIRC: AMD processors don't context switch TR on vmexit,

Correct


which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task
register before stgi, or IST exceptions (NMI, MCE and double fault) will
be taken with guest-supplied stack pointers.


We'd have to do the TR context switch ourselves, and that would be expensive.

It is suspected to be expensive, but I have never actually seen any
numbers one way or another.


Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM
vcpus, which interacts catastrophically with debug builds using
MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
detect stack overflows, but without an IST double fault handler, ends in
a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect
this is probably the way to go, but have not had time to investigate
whether there is any performance impact from doing so.  Given how little
of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
as expensive as it might have been in legacy modes.

(CC'ing the AMD SVM maintainers to see if they have any information on
this subject)



I actually didn't even realize that TR is not saved on vmexit ;-/.

Would switching TR only when we know that we need to enter this 
deprivileged mode help? Assuming that it is less expensive than copying 
the stack.


-boris



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-10 Thread Tim Deegan
Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
 The process to switch into and out of deprivileged mode can be likened to
 setjmp/longjmp.
 
 To enter deprivileged mode, we take a copy of the stack from the guest's
 registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

 - one to hold the depriv execution context;
 - one to hold the privileged execution context; and
 - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?

If we do have to copy, we could track whether the original stack has
been clobbered by an interrupt, and so avoid (at least some of) the
copy back afterwards?
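
A minimal sketch of that bookkeeping, in plain C rather than Xen code:
keep a flag alongside the saved copy, set it from the interrupt path, and
skip the copy back when the flag is clear. The struct, the function names
and the fixed buffer size are all hypothetical; nothing here is taken
from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SAVE_BYTES 64   /* hypothetical size, for illustration only */

/* Hypothetical per-vcpu record of the saved privileged stack. */
struct stack_save {
    unsigned char copy[SAVE_BYTES];
    size_t len;
    bool clobbered;   /* set if an interrupt ran on the original stack */
};

/* On entry to depriv mode: copy the live stack out and mark it clean. */
int stack_save_enter(struct stack_save *s,
                     const unsigned char *live, size_t len)
{
    if ( len > SAVE_BYTES )
        return -1;
    memcpy(s->copy, live, len);
    s->len = len;
    s->clobbered = false;
    return 0;
}

/* From the interrupt path: record that the original stack was reused. */
void stack_save_note_interrupt(struct stack_save *s)
{
    s->clobbered = true;
}

/* On exit: copy back only if needed. Returns true if a copy happened. */
bool stack_save_exit(struct stack_save *s, unsigned char *live)
{
    if ( !s->clobbered )
        return false;
    memcpy(live, s->copy, s->len);
    s->clobbered = false;
    return true;
}
```

Whether the flag can be set cheaply enough on the real interrupt path is
exactly the open question; the sketch only shows the shape of the idea.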

One nit in the assembler - if I've followed correctly, this saved IP:

 +/* Perform a near call to push rip onto the stack */
 +call   1f

is returned to (with adjustments) here:

 +/* Go to user mode return code */
 +jmp*(%rsi)

It would be good to make this a matched pair of call/ret if we can;
the CPU has special branch prediction tracking for function calls that
gets confused by a call that's not returned to.

Cheers,

Tim.



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-10 Thread Andrew Cooper
On 10/08/15 10:49, Tim Deegan wrote:
 Hi,

 At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
 The process to switch into and out of deprivileged mode can be likened to
 setjmp/longjmp.

 To enter deprivileged mode, we take a copy of the stack from the guest's
 registers up to the current stack pointer.
 This copy is pretty unfortunate, but I can see that avoiding it will
 be a bit complex.  Could we do something with more stacks?  AFAICS
 there have to be three stacks anyway:

  - one to hold the depriv execution context;
  - one to hold the privileged execution context; and
  - one to take interrupts on.

 So maybe we could do some fiddling to make Xen take interrupts on a
 different stack while we're depriv'd?

That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.  Conceptually, taking interrupts from
depriv mode is no different to taking them in a PV guest.

Some complications which come to mind (none insurmountable):

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-07 Thread Andrew Cooper
On 07/08/15 13:51, Ben Catterall wrote:
 On 06/08/15 21:55, Andrew Cooper wrote:
 On 06/08/15 17:45, Ben Catterall wrote:
 The process to switch into and out of deprivileged mode can be
 likened to
 setjmp/longjmp.

 To enter deprivileged mode, we take a copy of the stack from the
 guest's
 registers up to the current stack pointer. This allows us to restore
 the stack
 when we have finished the deprivileged mode operation, meaning we
 can continue
 execution from that point. This is similar to if a context switch
 had happened.

 To exit deprivileged mode, we copy the stack back, replacing the
 current stack.
 We can then continue execution from where we left off, which will
 unwind the
 stack and free up resources. This method means that we do not need to
 change any other code paths and its invocation will be transparent
 to callers.
 This should allow the feature to be more easily deployed to
 different parts
 of Xen.

 Note that this copy of the stack is per-vcpu but, it will contain
 per-pcpu data.
 Extra work is needed to properly migrate vcpus between pcpus.

 Under what circumstances do you see there being persistent state in the
 depriv area between calls, given that the calls are synchronous from VM
 actions?

 I don't know if we can make these synchronous as we need a way to
 interrupt the vcpu if it's spinning for a long time. Otherwise an
 attacker could just spin in depriv and cause a DoS. With that in mind,
 the scheduler may decide to migrate the vcpu whilst it's in depriv
 mode which would mean this per-pcpu data is held in the stack copy
 which is then migrated to another pcpu incorrectly.

If the emulator spins for a sufficient time, it is fine to shoot the
domain.  This is a strict improvement on the current behaviour where a
spinning emulator would shoot the host, via a watchdog timeout.

As said elsewhere, this kind of DoS is not a very interesting attack
vector.  State handling errors which cause Xen to change the wrong thing
are far more interesting from a guest's point of view.

http://xenbits.xen.org/xsa/advisory-123.html (full host compromise) or
http://xenbits.xen.org/xsa/advisory-108.html (read other guests data)
are examples of kinds of interesting issues which could potentially be
mitigated with this depriv infrastructure.




 The switch to and from deprivileged mode is performed using sysret
 and syscall
 respectively.

 I suspect we need to borrow the SS attribute workaround from Linux to
 make this function reliably on AMD systems.

 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b


 
 Ah! ok, I'll look into this. Thanks!

Just be aware of it.  Don't spend your time attempting to retrofit it to
Xen.  It is more work than it looks.

~Andrew



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-07 Thread Ben Catterall



On 06/08/15 21:55, Andrew Cooper wrote:

On 06/08/15 17:45, Ben Catterall wrote:

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack from the guest's
registers up to the current stack pointer. This allows us to restore the stack
when we have finished the deprivileged mode operation, meaning we can continue
execution from that point. This is similar to if a context switch had happened.

To exit deprivileged mode, we copy the stack back, replacing the current stack.
We can then continue execution from where we left off, which will unwind the
stack and free up resources. This method means that we do not need to
change any other code paths and its invocation will be transparent to callers.
This should allow the feature to be more easily deployed to different parts
of Xen.
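
For readers less familiar with the analogy, a user-space C sketch of the
setjmp/longjmp pattern follows. It is only an analogy: setjmp saves
registers and relies on the caller's frame staying live, whereas the
patch copies the stack itself. All names below are hypothetical.

```c
#include <setjmp.h>

static jmp_buf priv_ctx;       /* saved context to resume in */
static int depriv_result;      /* value handed back across the jump */

/* Stand-in for the deprivileged operation; it never returns normally. */
static void depriv_op(int arg)
{
    depriv_result = arg * 2;
    longjmp(priv_ctx, 1);      /* "exit depriv mode": resume at setjmp */
}

/* Stand-in for the caller: looks like an ordinary function call. */
int run_deprivileged(int arg)
{
    if ( setjmp(priv_ctx) == 0 )
        depriv_op(arg);        /* "enter depriv mode" */

    /* Reached only via longjmp, with the saved context restored. */
    return depriv_result;
}
```

From the caller's side run_deprivileged() behaves like a normal function,
which is the transparency property the description is after.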

Note that this copy of the stack is per-vcpu but, it will contain per-pcpu data.
Extra work is needed to properly migrate vcpus between pcpus.


Under what circumstances do you see there being persistent state in the
depriv area between calls, given that the calls are synchronous from VM
actions?


I don't know if we can make these synchronous as we need a way to 
interrupt the vcpu if it's spinning for a long time. Otherwise an 
attacker could just spin in depriv and cause a DoS. With that in mind, 
the scheduler may decide to migrate the vcpu whilst it's in depriv mode 
which would mean this per-pcpu data is held in the stack copy which is 
then migrated to another pcpu incorrectly.






The switch to and from deprivileged mode is performed using sysret and syscall
respectively.


I suspect we need to borrow the SS attribute workaround from Linux to
make this function reliably on AMD systems.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b



Ah! ok, I'll look into this. Thanks!


The return paths in entry.S have been edited so that, when we receive an
interrupt whilst in deprivileged mode, we return into that mode correctly.

A hook on the syscall handler in entry.S has also been added which handles
returning from user mode and will support deprivileged mode system calls when
these are needed.

Signed-off-by: Ben Catterall ben.catter...@citrix.com
---
  xen/arch/x86/domain.c   |  12 +++
  xen/arch/x86/hvm/Makefile   |   1 +
  xen/arch/x86/hvm/deprivileged.c | 103 ++
  xen/arch/x86/hvm/deprivileged_asm.S | 205 
  xen/arch/x86/hvm/vmx/vmx.c  |   7 ++
  xen/arch/x86/x86_64/asm-offsets.c   |   5 +
  xen/arch/x86/x86_64/entry.S |  35 ++
  xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
  xen/include/xen/hvm/deprivileged.h  |  38 +++
  xen/include/xen/sched.h |  18 +++-
  10 files changed, 425 insertions(+), 1 deletion(-)
  create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..a0e5e70 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -62,6 +62,7 @@
  #include <xen/iommu.h>
  #include <compat/vcpu.h>
  #include <asm/psr.h>
+#include <xen/hvm/deprivileged.h>

  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
  DEFINE_PER_CPU(unsigned long, cr4);
@@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
  if ( has_hvm_container_domain(d) )
  {
  rc = hvm_vcpu_initialise(v);
+
+/* Initialise HVM deprivileged mode */
+printk("HVM initialising deprivileged mode ...");


All printk()s should have a XENLOG_$severity prefix.


will do.

+hvm_deprivileged_prepare_vcpu(v);
+printk("Done.\n");
+
  goto done;
  }

@@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
  vcpu_destroy_fpu(v);

  if ( has_hvm_container_vcpu(v) )
+{
+/* Destroy the deprivileged mode on this vcpu */
+hvm_deprivileged_destroy_vcpu(v);
+
  hvm_vcpu_destroy(v);
+}
  else
  xfree(v->arch.pv_vcpu.trap_ctxt);
  }
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index bd83ba3..6819886 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -17,6 +17,7 @@ obj-y += quirks.o
  obj-y += rtc.o
  obj-y += save.o
  obj-y += deprivileged.o
+obj-y += deprivileged_asm.o
  obj-y += stdvga.o
  obj-y += vioapic.o
  obj-y += viridian.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 071d900..979fc69 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
  }
  return 0;
  }
+
+/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu.
+ */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
+{
+struct page_info *pg;
+
+/* TODO: clarify if this MEMF is correct */
+/* Allocate 2^STACK_ORDER contiguous pages */
+pg = alloc_domheap_pages(NULL, 

Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-07 Thread David Vrabel
On 07/08/15 13:51, Ben Catterall wrote:
 
 I don't know if we can make these synchronous as we need a way to
 interrupt the vcpu if it's spinning for a long time. Otherwise an
 attacker could just spin in depriv and cause a DoS. With that in mind,
 the scheduler may decide to migrate the vcpu whilst it's in depriv mode
 which would mean this per-pcpu data is held in the stack copy which is
 then migrated to another pcpu incorrectly.

IMO, DoS attacks on depriv'd emulators aren't very interesting.

I think it is counter-productive to address this attack in this initial
implementation at the expense (delays/complexity/etc.) of solving the
key requirement of mitigating information leaks and privilege escalation
attacks.

David



Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

2015-08-06 Thread Andrew Cooper
On 06/08/15 17:45, Ben Catterall wrote:
 The process to switch into and out of deprivileged mode can be likened to
 setjmp/longjmp.

 To enter deprivileged mode, we take a copy of the stack from the guest's
 registers up to the current stack pointer. This allows us to restore the stack
 when we have finished the deprivileged mode operation, meaning we can continue
 execution from that point. This is similar to if a context switch had 
 happened.

 To exit deprivileged mode, we copy the stack back, replacing the current 
 stack.
 We can then continue execution from where we left off, which will unwind the
 stack and free up resources. This method means that we do not need to
 change any other code paths and its invocation will be transparent to callers.
 This should allow the feature to be more easily deployed to different parts
 of Xen.

 Note that this copy of the stack is per-vcpu but, it will contain per-pcpu 
 data.
 Extra work is needed to properly migrate vcpus between pcpus.

Under what circumstances do you see there being persistent state in the
depriv area between calls, given that the calls are synchronous from VM
actions?


 The switch to and from deprivileged mode is performed using sysret and syscall
 respectively.

I suspect we need to borrow the SS attribute workaround from Linux to
make this function reliably on AMD systems.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b


 The return paths in entry.S have been edited so that, when we receive an
 interrupt whilst in deprivileged mode, we return into that mode correctly.

 A hook on the syscall handler in entry.S has also been added which handles
 returning from user mode and will support deprivileged mode system calls when
 these are needed.

 Signed-off-by: Ben Catterall ben.catter...@citrix.com
 ---
  xen/arch/x86/domain.c   |  12 +++
  xen/arch/x86/hvm/Makefile   |   1 +
  xen/arch/x86/hvm/deprivileged.c | 103 ++
  xen/arch/x86/hvm/deprivileged_asm.S | 205 
 
  xen/arch/x86/hvm/vmx/vmx.c  |   7 ++
  xen/arch/x86/x86_64/asm-offsets.c   |   5 +
  xen/arch/x86/x86_64/entry.S |  35 ++
  xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
  xen/include/xen/hvm/deprivileged.h  |  38 +++
  xen/include/xen/sched.h |  18 +++-
  10 files changed, 425 insertions(+), 1 deletion(-)
  create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S

 diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
 index 045f6ff..a0e5e70 100644
 --- a/xen/arch/x86/domain.c
 +++ b/xen/arch/x86/domain.c
 @@ -62,6 +62,7 @@
  #include <xen/iommu.h>
  #include <compat/vcpu.h>
  #include <asm/psr.h>
 +#include <xen/hvm/deprivileged.h>
  
  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
  DEFINE_PER_CPU(unsigned long, cr4);
 @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
  if ( has_hvm_container_domain(d) )
  {
  rc = hvm_vcpu_initialise(v);
 +
 +/* Initialise HVM deprivileged mode */
 +printk("HVM initialising deprivileged mode ...");

All printk()s should have a XENLOG_$severity prefix.
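
As a rough illustration of what the prefix looks like: the severity is a
string literal that the compiler concatenates onto the front of the
format string. The macro value below is a stand-in chosen for this
sketch, and snprintf stands in for printk so that the fragment builds
outside Xen; format_init_message() is a hypothetical name.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for Xen's XENLOG_INFO; the value here is an assumption. */
#define XENLOG_INFO "<2>"

/* Formats one complete, severity-tagged message into buf. */
int format_init_message(char *buf, size_t len)
{
    return snprintf(buf, len,
                    XENLOG_INFO "HVM: initialising deprivileged mode\n");
}
```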

 +hvm_deprivileged_prepare_vcpu(v);
 +printk("Done.\n");
 +
  goto done;
  }
  
 @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
  vcpu_destroy_fpu(v);
  
  if ( has_hvm_container_vcpu(v) )
 +{
 +/* Destroy the deprivileged mode on this vcpu */
 +hvm_deprivileged_destroy_vcpu(v);
 +
  hvm_vcpu_destroy(v);
 +}
  else
  xfree(v->arch.pv_vcpu.trap_ctxt);
  }
 diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
 index bd83ba3..6819886 100644
 --- a/xen/arch/x86/hvm/Makefile
 +++ b/xen/arch/x86/hvm/Makefile
 @@ -17,6 +17,7 @@ obj-y += quirks.o
  obj-y += rtc.o
  obj-y += save.o
  obj-y += deprivileged.o
 +obj-y += deprivileged_asm.o
  obj-y += stdvga.o
  obj-y += vioapic.o
  obj-y += viridian.o
 diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
 index 071d900..979fc69 100644
 --- a/xen/arch/x86/hvm/deprivileged.c
 +++ b/xen/arch/x86/hvm/deprivileged.c
 @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
  }
  return 0;
  }
 +
 +/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu.
 + */
 +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
 +{
 +struct page_info *pg;
 +
 +/* TODO: clarify if this MEMF is correct */
 +/* Allocate 2^STACK_ORDER contiguous pages */
 +pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
 +if( pg == NULL )
 +{
 +panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n");
 +return -ENOMEM;
 +}
 +
 +vcpu->stack = page_to_virt(pg);

Xen has two heaps, the xenheap and the domheap.

You may only construct pointers like this into the xenheap.  The domheap
is not guaranteed to have safe virtual mappings.  (This code only
works because your