Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 10/08/15 11:14, Andrew Cooper wrote: On 10/08/15 10:49, Tim Deegan wrote: Hi, At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote: The process to switch into and out of deprivileged mode can be likened to setjmp/longjmp. To enter deprivileged mode, we take a copy of the stack from the guest's registers up to the current stack pointer. This copy is pretty unfortunate, but I can see that avoiding it will be a bit complex. Could we do something with more stacks? AFAICS there have to be three stacks anyway: - one to hold the depriv execution context; - one to hold the privileged execution context; and - one to take interrupts on. So maybe we could do some fiddling to make Xen take interrupts on a different stack while we're depriv'd? That should happen naturally by virtue of the privilege level change involved in taking the interrupt. Conceptually, taking interrupts from depriv mode is no different to taking them in a PV guest. Some complications which come to mind (none insurmountable): * Under this model, PV exception handlers should copy themselves onto the privileged execution stack. * Currently, the IST handlers copy themselves onto the primary stack if they interrupt guest context. From what I understand from entry.S's assembly: handle_ist_exception is used for machine_check and nmi ISTs and these perform this copy. The double fault handler does not do this copy. - we take the IST on a different stack page - the handler copies the guest's registers from its current page to the bottom of the privileged stack so access routines for this still work as usual - Moves its rsp to just after this structure in the privileged stack - Calls do_nmi - does a ret_from_intr with the stack ptr on the privileged stack Now, I _think_ it's sufficient to perform this copy and then just keep the rsp on the IST stack page (rather than moving it across as is currently done) so that we don't clobber the privileged stack. 
Then, on the return path, move our rsp back to the privileged stack, just after the guest registers so that ret_from_intr can use the copied (and possibly modified) guest's registers. Does that sound reasonable? Thanks in advance! * AMD Task Register on vmexit. (this old gem) ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
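The copy step Ben describes for the NMI/MCE IST handlers can be modelled in plain C. This is only a sketch under stated assumptions: `struct model_guest_regs`, `PRIV_STACK_SIZE` and `ist_copy_guest_regs` are hypothetical names, the register frame is trimmed to three fields, and "bottom of the privileged stack" is taken to mean the highest addresses of that stack. It is not the code in entry.S.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRIV_STACK_SIZE 8192u   /* assumption: two 4 KiB pages */

/* Hypothetical, heavily trimmed stand-in for the saved guest register frame. */
struct model_guest_regs {
    uint64_t rip;
    uint64_t rsp;
    uint64_t rflags;
};

/*
 * Copy the frame captured on the IST stack page into the slot at the base
 * (highest addresses) of the privileged stack, so that code which expects
 * to find the guest registers there keeps working.
 *
 * Returns the offset just below the copied frame. In the proposal, rsp
 * stays on the IST page while the handler runs and is only moved to this
 * offset on the return path.
 */
size_t ist_copy_guest_regs(unsigned char *priv_stack,
                           const struct model_guest_regs *ist_frame)
{
    size_t off = PRIV_STACK_SIZE - sizeof(*ist_frame);

    assert(priv_stack != NULL && ist_frame != NULL);
    memcpy(priv_stack + off, ist_frame, sizeof(*ist_frame));
    return off;
}
```

Only the frame slot is written; everything below the returned offset on the privileged stack is left untouched, which is the property the proposal relies on.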
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 18/08/15 17:55, Andrew Cooper wrote: On 17/08/15 08:07, Tim Deegan wrote: At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote: On 12/08/15 14:33, Andrew Cooper wrote: On 12/08/15 14:29, Andrew Cooper wrote: On 11/08/15 19:29, Boris Ostrovsky wrote: Would switching TR only when we know that we need to enter this deprivileged mode help? This is an absolute must. It is not safe to use syscall/sysexit without IST in place for NMIs and MCEs. Assuming that it is less expensive than copying the stack. I was referring to the stack overflow issue, and whether it might be sensible to pro-actively which TR. Ahem! s/which/switch/ ~Andrew So, have we arrived at a decision for this? Thanks! Apologies for the delay - I am currently at the Xen Developer Summit. No worries! Hope you're enjoying the summit. :) Seems to have stalled a bit. OK, I propose that: - we use TR/IST to make Xen take interrupts/exceptions at a different SP; Xen re-enables interrupts in most interrupt handlers, which means that they must not have an IST set. If an IST was set, a second interrupt would clobber the frame of the first. However, just adjusting tss->rsp0 and syscall top-of-stack to the current rsp when entering depriv mode should be sufficient, and will avoid needing to copy the stack. Got it, thanks! - we make that SP be an extension of the main stack, so that things like current() Just Work[tm]; - we set this up and tear it down when we enter/leave depriv mode. - someone ought to look at the case where IST handlers copy themselves to the main stack, and see if we need to adjust that too. They will need adjusting, but just disabling the copy entirely should be ok. ok. Any other proposals? I think we can leave the question of TR switching on VMEXIT as a separate issue. Agreed. It is orthogonal to this problem. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 17/08/15 08:07, Tim Deegan wrote:
> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>> On 12/08/15 14:33, Andrew Cooper wrote:
>>> On 12/08/15 14:29, Andrew Cooper wrote:
>>>> On 11/08/15 19:29, Boris Ostrovsky wrote:
>>>>> Would switching TR only when we know that we need to enter this
>>>>> deprivileged mode help?
>>>> This is an absolute must. It is not safe to use syscall/sysexit without
>>>> IST in place for NMIs and MCEs.
>>>>
>>>>> Assuming that it is less expensive than copying the stack.
>>>> I was referring to the stack overflow issue, and whether it might be
>>>> sensible to pro-actively which TR.
>>>
>>> Ahem! s/which/switch/
>>>
>>> ~Andrew
>>
>> So, have we arrived at a decision for this? Thanks!

Apologies for the delay - I am currently at the Xen Developer Summit.

> Seems to have stalled a bit. OK, I propose that:
> - we use TR/IST to make Xen take interrupts/exceptions at a different SP;

Xen re-enables interrupts in most interrupt handlers, which means that they must not have an IST set. If an IST was set, a second interrupt would clobber the frame of the first.

However, just adjusting tss->rsp0 and syscall top-of-stack to the current rsp when entering depriv mode should be sufficient, and will avoid needing to copy the stack.

> - we make that SP be an extension of the main stack, so that things
>   like current() Just Work[tm];
> - we set this up and tear it down when we enter/leave depriv mode.
> - someone ought to look at the case where IST handlers copy
>   themselves to the main stack, and see if we need to adjust that too.

They will need adjusting, but just disabling the copy entirely should be ok.

> Any other proposals?
>
> I think we can leave the question of TR switching on VMEXIT as a
> separate issue.

Agreed. It is orthogonal to this problem.

~Andrew
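Andrew's suggestion of adjusting tss->rsp0 and the syscall top-of-stack on entry, instead of copying the stack, can be sketched as a small C model. Everything here is an assumption for illustration: `struct model_tss`, `struct model_cpu`, `depriv_enter` and `depriv_leave` are hypothetical names, the 16-byte alignment is a guess at what the hardware frame would want, and a real implementation would write the actual TSS and syscall entry state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the one TSS field of interest. */
struct model_tss {
    uint64_t rsp0;              /* stack used on a ring 3 -> ring 0 interrupt */
};

/* Hypothetical per-CPU state; the saved_* fields exist only in this model. */
struct model_cpu {
    struct model_tss tss;
    uint64_t syscall_top;       /* stack top used by the syscall entry path */
    uint64_t saved_rsp0;
    uint64_t saved_syscall_top;
};

/*
 * On entry to deprivileged mode, point both entry stacks at the current
 * stack pointer. Interrupts and syscalls taken from ring 3 then push their
 * frames below the live privileged frames instead of over them.
 */
void depriv_enter(struct model_cpu *c, uint64_t current_rsp)
{
    uint64_t top = current_rsp & ~UINT64_C(0xf);   /* assumed alignment */

    assert(top <= c->tss.rsp0);   /* must lie within the existing stack */
    c->saved_rsp0 = c->tss.rsp0;
    c->saved_syscall_top = c->syscall_top;
    c->tss.rsp0 = top;
    c->syscall_top = top;
}

/* On leaving deprivileged mode, restore the original entry points. */
void depriv_leave(struct model_cpu *c)
{
    c->tss.rsp0 = c->saved_rsp0;
    c->syscall_top = c->saved_syscall_top;
}
```

The point of the design is that nothing is copied: the privileged frames stay where they are, and only two pointers change on entry and exit.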
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
>>> On 18.08.15 at 12:26, wrote:
> On 18/08/15 11:25, Ben Catterall wrote:
>> On 17/08/15 16:17, Jan Beulich wrote:
>>> On 17.08.15 at 17:07, wrote:
>>>> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>>>>> So, have we arrived at a decision for this? Thanks!
>>>> Seems to have stalled a bit. OK, I propose that:
>>>> - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
>>>> - we make that SP be an extension of the main stack, so that things
>>>>   like current() Just Work[tm];
>> From Xen's cpu stack layout, page 4 is currently unused so I'll put it
>> here. Is this acceptable?
> Or, would it be better to put it at position 5, and move the optional
> MEMORY_GUARD page down to position 4?

At this stage either would do. Once (if) the experiment yielded positive results, we can still evaluate the pros and cons of either placement.

Jan
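The two placements being weighed (page 4 versus position 5) differ only in which page of the per-CPU stack block the new stack pointer starts from. A minimal sketch of that address arithmetic follows. The page size and page count are assumptions of this model, and `stack_page_base` and `stack_page_top` are hypothetical helpers, not Xen functions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE   4096u  /* assumption: 4 KiB pages */
#define MODEL_STACK_PAGES 8u     /* assumption: pages in the per-CPU stack block */

/* Lowest address of page `idx` within the per-CPU stack block. */
uint64_t stack_page_base(uint64_t stack_base, unsigned int idx)
{
    assert(idx < MODEL_STACK_PAGES);
    return stack_base + (uint64_t)idx * MODEL_PAGE_SIZE;
}

/*
 * One past the highest address of page `idx`. Stacks grow down, so this is
 * the initial stack pointer for a stack living on that page.
 */
uint64_t stack_page_top(uint64_t stack_base, unsigned int idx)
{
    return stack_page_base(stack_base, idx) + MODEL_PAGE_SIZE;
}
```

For example, with a block based at 0x100000, a stack placed on page 4 would start at `stack_page_top(0x100000, 4)`, which is also the base of page 5.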
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 18/08/15 11:25, Ben Catterall wrote: On 17/08/15 16:17, Jan Beulich wrote: On 17.08.15 at 17:07, wrote: At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote: So, have we arrived at a decision for this? Thanks! Seems to have stalled a bit. OK, I propose that: - we use TR/IST to make Xen take interrupts/exceptions at a different SP; - we make that SP be an extension of the main stack, so that things like current() Just Work[tm]; From Xen's cpu stack layout, page 4 is currently unused so I'll put it here. Is this an acceptable? Or, would it be better to put it at position 5, and move the optional MEMORY_GUARD page down to position 4? - we set this up and tear it down when we enter/leave depriv mode. - someone ought to look at the case where IST handlers copy themselves to the main stack, and see if we need to adjust that too. Any other proposals? No. I think we can leave the question of TR switching on VMEXIT as a separate issue. Just like for the other one - at this point I think anything that work should be okay. Dealing with quirks can be deferred (but it would be nice if a respective note was added in a prominent place so it doesn't get forgotten once/if these patches leave RFC state). Jan Ok, thanks all! ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 17/08/15 16:17, Jan Beulich wrote: On 17.08.15 at 17:07, wrote: At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote: So, have we arrived at a decision for this? Thanks! Seems to have stalled a bit. OK, I propose that: - we use TR/IST to make Xen take interrupts/exceptions at a different SP; - we make that SP be an extension of the main stack, so that things like current() Just Work[tm]; From Xen's cpu stack layout, page 4 is currently unused so I'll put it here. Is this an acceptable? - we set this up and tear it down when we enter/leave depriv mode. - someone ought to look at the case where IST handlers copy themselves to the main stack, and see if we need to adjust that too. Any other proposals? No. I think we can leave the question of TR switching on VMEXIT as a separate issue. Just like for the other one - at this point I think anything that work should be okay. Dealing with quirks can be deferred (but it would be nice if a respective note was added in a prominent place so it doesn't get forgotten once/if these patches leave RFC state). Jan Ok, thanks all! ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
>>> On 17.08.15 at 17:07, wrote:
> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>> So, have we arrived at a decision for this? Thanks!
>
> Seems to have stalled a bit. OK, I propose that:
> - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
> - we make that SP be an extension of the main stack, so that things
>   like current() Just Work[tm];
> - we set this up and tear it down when we enter/leave depriv mode.
> - someone ought to look at the case where IST handlers copy
>   themselves to the main stack, and see if we need to adjust that too.
>
> Any other proposals?

No.

> I think we can leave the question of TR switching on VMEXIT as a
> separate issue.

Just like for the other one - at this point I think anything that works should be okay. Dealing with quirks can be deferred (but it would be nice if a respective note was added in a prominent place so it doesn't get forgotten once/if these patches leave RFC state).

Jan
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
> On 12/08/15 14:33, Andrew Cooper wrote:
> > On 12/08/15 14:29, Andrew Cooper wrote:
> >> On 11/08/15 19:29, Boris Ostrovsky wrote:
> >>> Would switching TR only when we know that we need to enter this
> >>> deprivileged mode help?
> >> This is an absolute must. It is not safe to use syscall/sysexit without
> >> IST in place for NMIs and MCEs.
> >>
> >>> Assuming that it is less expensive than copying the stack.
> >> I was referring to the stack overflow issue, and whether it might be
> >> sensible to pro-actively which TR.
> >
> > Ahem! s/which/switch/
> >
> > ~Andrew
> >
>
> So, have we arrived at a decision for this? Thanks!

Seems to have stalled a bit. OK, I propose that:
- we use TR/IST to make Xen take interrupts/exceptions at a different SP;
- we make that SP be an extension of the main stack, so that things
  like current() Just Work[tm];
- we set this up and tear it down when we enter/leave depriv mode.
- someone ought to look at the case where IST handlers copy
  themselves to the main stack, and see if we need to adjust that too.

Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a separate issue.

Cheers,

Tim.
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 12/08/15 14:33, Andrew Cooper wrote: On 12/08/15 14:29, Andrew Cooper wrote: On 11/08/15 19:29, Boris Ostrovsky wrote: On 08/11/2015 01:19 PM, Andrew Cooper wrote: On 11/08/15 18:05, Tim Deegan wrote: * Under this model, PV exception handlers should copy themselves onto the privileged execution stack. * Currently, the IST handlers copy themselves onto the primary stack if they interrupt guest context. * AMD Task Register on vmexit. (this old gem) Gah, this thing. : Curious (and I can't seem find this in the manuals): What is this thing? IIRC: AMD processors don't context switch TR on vmexit, Correct which makes using IST handlers tricky there. (That is one way of putting it) IST handlers cannot be used by Xen if Xen does not switch the task register before stgi, or IST exceptions (NMI, MCE and double fault) will be taken with guest-supplied stack pointers. We'd have to do the TR context switch ourselves, and that would be expensive. It is suspected to be expensive, but I have never actually seen any numbers one way or another. Andrew, am I remembering that right? Looks about right. I have been meaning to investigate this for a while, but never had the time. Xen opts for disabling interrupt stack tables in the context of AMD HVM vcpus, which interacts catastrophically with debug builds using MEMORY_GUARD. MEMORY_GUARD shoots a page out of the primary stack to detect stack overflows, but without an IST double fault hander, ends in a triple fault rather than a host crash detailing the stack overflow. KVM unilaterally reloads the host task register on vmexit, and I suspect this is probably the way to go, but have not had time to investigate whether there is any performance impact from doing so. Given how little of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be as expensive as it might have been in legacy modes. 
(CC'ing the AMD SVM maintainers to see if they have any information on this subject) I actually didn't even realize that TR is not saved on vmexit ;-/. Would switching TR only when we know that we need to enter this deprivileged mode help? This is an absolute must. It is not safe to use syscall/sysexit without IST in place for NMIs and MCEs. Assuming that it is less expensive than copying the stack. I was referring to the stack overflow issue, and whether it might be sensible to pro-actively which TR. Ahem! s/which/switch/ ~Andrew So, have we arrived at a decision for this? Thanks! Ben ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 12/08/15 14:29, Andrew Cooper wrote: > On 11/08/15 19:29, Boris Ostrovsky wrote: >> On 08/11/2015 01:19 PM, Andrew Cooper wrote: >>> On 11/08/15 18:05, Tim Deegan wrote: >>> * Under this model, PV exception handlers should copy themselves >>> onto >>> the privileged execution stack. >>> * Currently, the IST handlers copy themselves onto the primary >>> stack if >>> they interrupt guest context. >>> * AMD Task Register on vmexit. (this old gem) >> Gah, this thing. : > Curious (and I can't seem find this in the manuals): What is this > thing? IIRC: AMD processors don't context switch TR on vmexit, >>> Correct >>> which makes using IST handlers tricky there. >>> (That is one way of putting it) >>> >>> IST handlers cannot be used by Xen if Xen does not switch the task >>> register before stgi, or IST exceptions (NMI, MCE and double fault) will >>> be taken with guest-supplied stack pointers. >>> We'd have to do the TR context switch ourselves, and that would be expensive. >>> It is suspected to be expensive, but I have never actually seen any >>> numbers one way or another. >>> Andrew, am I remembering that right? >>> Looks about right. >>> >>> I have been meaning to investigate this for a while, but never had >>> the time. >>> >>> Xen opts for disabling interrupt stack tables in the context of AMD HVM >>> vcpus, which interacts catastrophically with debug builds using >>> MEMORY_GUARD. MEMORY_GUARD shoots a page out of the primary stack to >>> detect stack overflows, but without an IST double fault hander, ends in >>> a triple fault rather than a host crash detailing the stack overflow. >>> >>> KVM unilaterally reloads the host task register on vmexit, and I suspect >>> this is probably the way to go, but have not had time to investigate >>> whether there is any performance impact from doing so. Given how little >>> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be >>> as expensive as it might have been in legacy modes. 
>>> >>> (CC'ing the AMD SVM maintainers to see if they have any information on >>> this subject) >>> >> I actually didn't even realize that TR is not saved on vmexit ;-/. >> >> Would switching TR only when we know that we need to enter this >> deprivileged mode help? > This is an absolute must. It is not safe to use syscall/sysexit without > IST in place for NMIs and MCEs. > >> Assuming that it is less expensive than copying the stack. > I was referring to the stack overflow issue, and whether it might be > sensible to pro-actively which TR. Ahem! s/which/switch/ ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 11/08/15 19:29, Boris Ostrovsky wrote: > On 08/11/2015 01:19 PM, Andrew Cooper wrote: >> On 11/08/15 18:05, Tim Deegan wrote: >> * Under this model, PV exception handlers should copy themselves >> onto >> the privileged execution stack. >> * Currently, the IST handlers copy themselves onto the primary >> stack if >> they interrupt guest context. >> * AMD Task Register on vmexit. (this old gem) > Gah, this thing. : Curious (and I can't seem find this in the manuals): What is this thing? >>> IIRC: AMD processors don't context switch TR on vmexit, >> Correct >> >>> which makes using IST handlers tricky there. >> (That is one way of putting it) >> >> IST handlers cannot be used by Xen if Xen does not switch the task >> register before stgi, or IST exceptions (NMI, MCE and double fault) will >> be taken with guest-supplied stack pointers. >> >>> We'd have to do the TR context switch ourselves, and that would be >>> expensive. >> It is suspected to be expensive, but I have never actually seen any >> numbers one way or another. >> >>> Andrew, am I remembering that right? >> Looks about right. >> >> I have been meaning to investigate this for a while, but never had >> the time. >> >> Xen opts for disabling interrupt stack tables in the context of AMD HVM >> vcpus, which interacts catastrophically with debug builds using >> MEMORY_GUARD. MEMORY_GUARD shoots a page out of the primary stack to >> detect stack overflows, but without an IST double fault hander, ends in >> a triple fault rather than a host crash detailing the stack overflow. >> >> KVM unilaterally reloads the host task register on vmexit, and I suspect >> this is probably the way to go, but have not had time to investigate >> whether there is any performance impact from doing so. Given how little >> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be >> as expensive as it might have been in legacy modes. 
>> >> (CC'ing the AMD SVM maintainers to see if they have any information on >> this subject) >> > > I actually didn't even realize that TR is not saved on vmexit ;-/. > > Would switching TR only when we know that we need to enter this > deprivileged mode help? This is an absolute must. It is not safe to use syscall/sysexit without IST in place for NMIs and MCEs. > Assuming that it is less expensive than copying the stack. I was referring to the stack overflow issue, and whether it might be sensible to pro-actively which TR. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
At 14:22 +0100 on 12 Aug (1439389325), Ben Catterall wrote:
> On 11/08/15 18:05, Tim Deegan wrote:
> > BTW, I think there need to be three stacks anyway, since the depriv
> > code shouldn't be allowed to write to the priv code's stack frames.
> > Or maybe I've misunderstood how much access the depriv code will have.
> So, just to clarify:
>
> We have a separate deprivileged stack allocated which the deprivileged
> code uses. This is mapped in user mode.
>
> We have the privileged stack which Xen runs on. To prevent this being
> clobbered when we are in our mode and take an interrupt, we copy this
> out to a buffer. This buffer is the saved privileged stack state.
>
> So, we sort of have three stacks already, just the privileged stack is
> copied out to a buffer, rather than switching pointers to another
> interrupt stack.
>
> Hopefully that clarifies?

Yes, thanks -- the buffer is what I was thinking of as a third stack.

Cheers,

Tim.
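The copy-out scheme Ben describes, where the live part of the privileged stack is saved to a buffer on entry and copied back on exit, can be sketched in C as follows. This is a model, not the RFC's code: `struct saved_priv_stack`, `save_priv_stack` and `restore_priv_stack` are hypothetical names, the stack is a plain byte array, and positions are offsets rather than real stack pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_SIZE 4096u   /* assumption: one page is enough for the model */

/* Hypothetical buffer holding the saved privileged stack state. */
struct saved_priv_stack {
    unsigned char buf[STACK_SIZE];
    size_t len;      /* number of bytes saved */
    size_t sp_off;   /* stack pointer offset at the time of the save */
};

/*
 * The stack grows down, so the live region is [sp_off, top_off).
 * Copy it out before entering deprivileged mode. Returns 0 on success,
 * -1 if the offsets do not describe a region inside the stack.
 */
int save_priv_stack(struct saved_priv_stack *s, const unsigned char *stack,
                    size_t sp_off, size_t top_off)
{
    if (sp_off > top_off || top_off > STACK_SIZE)
        return -1;

    s->len = top_off - sp_off;
    s->sp_off = sp_off;
    memcpy(s->buf, stack + sp_off, s->len);
    return 0;
}

/*
 * Copy the saved region back on leaving deprivileged mode and return the
 * stack pointer offset to resume from, much as longjmp resumes at the
 * point recorded by setjmp.
 */
size_t restore_priv_stack(const struct saved_priv_stack *s,
                          unsigned char *stack)
{
    memcpy(stack + s->sp_off, s->buf, s->len);
    return s->sp_off;
}
```

Whatever an interrupt writes over the live region while deprivileged is undone by the restore, which is why the thread treats the buffer as a third stack.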
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 11/08/15 18:05, Tim Deegan wrote: Hi, At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote: On 11/08/15 10:55, Tim Deegan wrote: At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote: On 10/08/15 10:49, Tim Deegan wrote: Hi, At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote: The process to switch into and out of deprivileged mode can be likened to setjmp/longjmp. To enter deprivileged mode, we take a copy of the stack from the guest's registers up to the current stack pointer. This copy is pretty unfortunate, but I can see that avoiding it will be a bit complex. Could we do something with more stacks? AFAICS there have to be three stacks anyway: - one to hold the depriv execution context; - one to hold the privileged execution context; and - one to take interrupts on. So maybe we could do some fiddling to make Xen take interrupts on a different stack while we're depriv'd? That should happen naturally by virtue of the privilege level change involved in taking the interrupt. Right, and this is why we need a third stack - so interrupts don't trash the existing priv state on the 'normal' Xen stack. And so we either need to copy the priv stack out (and maybe copy it back), or tell the CPU to use a different stack. The copy is relatively small and paid only on the first and last entries into the mode. I don't know if this is cheaper than the bookwork that would be needed on entering and returning from the mode to switch to these stacks. I'm assuming the sp pointers in the TSS and ISTs would need changing on the first and last entry/exit if we have the extra stack, is that correct? Yep. Or, is this a more dramatic change in that everything uses this three stack model rather than just this feature. Well, some other parts would have to change to accomodate this new behaviour - that was what Andrew was talking about. BTW, I think there need to be three stacks anyway, since the depriv code shouldn't be allowed to write to the priv code's stack frames. 
Or maybe I've misunderstood how much access the depriv code will have. So, just to clarify: We have a separate deprivileged stack allocated which the deprivileged code uses. This is mapped in user mode. We have the privileged stack which Xen runs on. To prevent this being clobbered when we are in our mode and take an interrupt, we copy this out to a buffer. This buffer is the saved privileged stack state. So, we sort of have three stacks already, just the privileged stack is copied out to a buffer, rather than switching pointers to another interrupt stack. Hopefully that clarifies? I'm not sure how much in Xen would need changing to switch across to using three stacks. Also, would this also need to be done for PV guests? Would that need to be a separate patch series? What's the overall consensus? Thanks! I'm not sure there is one yet -- needs some more discussion of whether the non-copying approach is feasible. If we had enough headroom, we could try to be clever and tell the CPU to take interrupts on the priv stack _below_ the existing state. That would avoid the first of your problems below. * Under this model, PV exception handlers should copy themselves onto the privileged execution stack. * Currently, the IST handlers copy themselves onto the primary stack if they interrupt guest context. * AMD Task Register on vmexit. (this old gem) Gah, this thing. : Curious (and I can't seem find this in the manuals): What is this thing? IIRC: AMD processors don't context switch TR on vmexit, which makes using IST handlers tricky there. We'd have to do the TR context switch ourselves, and that would be expensive. Andrew, am I remembering that right? Thanks! Tim. ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
>>> On 11.08.15 at 19:19, wrote:
> KVM unilaterally reloads the host task register on vmexit, and I suspect
> this is probably the way to go, but have not had time to investigate
> whether there is any performance impact from doing so. Given how little
> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
> as expensive as it might have been in legacy modes.

How much of the TSS is being used shouldn't matter at all for LTR execution time, since the instruction doesn't access the TSS itself (and iirc there's also no caching documented for TSS uses).

Jan
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 08/11/2015 01:19 PM, Andrew Cooper wrote: On 11/08/15 18:05, Tim Deegan wrote: * Under this model, PV exception handlers should copy themselves onto the privileged execution stack. * Currently, the IST handlers copy themselves onto the primary stack if they interrupt guest context. * AMD Task Register on vmexit. (this old gem) Gah, this thing. : Curious (and I can't seem find this in the manuals): What is this thing? IIRC: AMD processors don't context switch TR on vmexit, Correct which makes using IST handlers tricky there. (That is one way of putting it) IST handlers cannot be used by Xen if Xen does not switch the task register before stgi, or IST exceptions (NMI, MCE and double fault) will be taken with guest-supplied stack pointers. We'd have to do the TR context switch ourselves, and that would be expensive. It is suspected to be expensive, but I have never actually seen any numbers one way or another. Andrew, am I remembering that right? Looks about right. I have been meaning to investigate this for a while, but never had the time. Xen opts for disabling interrupt stack tables in the context of AMD HVM vcpus, which interacts catastrophically with debug builds using MEMORY_GUARD. MEMORY_GUARD shoots a page out of the primary stack to detect stack overflows, but without an IST double fault hander, ends in a triple fault rather than a host crash detailing the stack overflow. KVM unilaterally reloads the host task register on vmexit, and I suspect this is probably the way to go, but have not had time to investigate whether there is any performance impact from doing so. Given how little of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be as expensive as it might have been in legacy modes. (CC'ing the AMD SVM maintainers to see if they have any information on this subject) I actually didn't even realize that TR is not saved on vmexit ;-/. Would switching TR only when we know that we need to enter this deprivileged mode help? 
Assuming that it is less expensive than copying the stack. -boris ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 11/08/15 18:05, Tim Deegan wrote:
>>>> * Under this model, PV exception handlers should copy themselves onto
>>>> the privileged execution stack.
>>>> * Currently, the IST handlers copy themselves onto the primary stack if
>>>> they interrupt guest context.
>>>> * AMD Task Register on vmexit. (this old gem)
>>> Gah, this thing. :
>> Curious (and I can't seem to find this in the manuals): What is this thing?
> IIRC: AMD processors don't context switch TR on vmexit,

Correct

> which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task register before stgi, or IST exceptions (NMI, MCE and double fault) will be taken with guest-supplied stack pointers.

> We'd have to do the TR context switch ourselves, and that would be expensive.

It is suspected to be expensive, but I have never actually seen any numbers one way or another.

> Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM vcpus, which interacts catastrophically with debug builds using MEMORY_GUARD. MEMORY_GUARD shoots a page out of the primary stack to detect stack overflows, but without an IST double fault handler, ends in a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect this is probably the way to go, but have not had time to investigate whether there is any performance impact from doing so. Given how little of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be as expensive as it might have been in legacy modes.

(CC'ing the AMD SVM maintainers to see if they have any information on this subject)

~Andrew
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
Hi,

At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote:
> On 11/08/15 10:55, Tim Deegan wrote:
> > At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
> >> On 10/08/15 10:49, Tim Deegan wrote:
> >>> Hi,
> >>>
> >>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> >>>> The process to switch into and out of deprivileged mode can be likened to
> >>>> setjmp/longjmp.
> >>>>
> >>>> To enter deprivileged mode, we take a copy of the stack from the guest's
> >>>> registers up to the current stack pointer.
> >>> This copy is pretty unfortunate, but I can see that avoiding it will
> >>> be a bit complex. Could we do something with more stacks? AFAICS
> >>> there have to be three stacks anyway:
> >>>
> >>> - one to hold the depriv execution context;
> >>> - one to hold the privileged execution context; and
> >>> - one to take interrupts on.
> >>>
> >>> So maybe we could do some fiddling to make Xen take interrupts on a
> >>> different stack while we're depriv'd?
> >>
> >> That should happen naturally by virtue of the privilege level change
> >> involved in taking the interrupt.
> >
> > Right, and this is why we need a third stack - so interrupts don't
> > trash the existing priv state on the 'normal' Xen stack. And so we
> > either need to copy the priv stack out (and maybe copy it back), or
> > tell the CPU to use a different stack.
>
> The copy is relatively small and paid only on the first and last entries
> into the mode. I don't know if this is cheaper than the bookwork that
> would be needed on entering and returning from the mode to switch to
> these stacks. I'm assuming the sp pointers in the TSS and ISTs would
> need changing on the first and last entry/exit if we have the extra
> stack, is that correct?

Yep.

> Or, is this a more dramatic change in that
> everything uses this three stack model rather than just this feature.

Well, some other parts would have to change to accommodate this new
behaviour - that was what Andrew was talking about.

BTW, I think there need to be three stacks anyway, since the depriv
code shouldn't be allowed to write to the priv code's stack frames.
Or maybe I've misunderstood how much access the depriv code will have.

> I'm not sure how much in Xen would need changing to switch across to
> using three stacks. Also, would this also need to be done for PV guests?
> Would that need to be a separate patch series?
>
> What's the overall consensus? Thanks!

I'm not sure there is one yet -- needs some more discussion of
whether the non-copying approach is feasible.

> > If we had enough headroom, we could try to be clever and tell the CPU
> > to take interrupts on the priv stack _below_ the existing state. That
> > would avoid the first of your problems below.
> >
> >> * Under this model, PV exception handlers should copy themselves onto
> >> the privileged execution stack.
> >> * Currently, the IST handlers copy themselves onto the primary stack if
> >> they interrupt guest context.
> >> * AMD Task Register on vmexit. (this old gem)
> >
> > Gah, this thing. :(
> Curious (and I can't seem to find this in the manuals): What is this thing?

IIRC: AMD processors don't context switch TR on vmexit, which makes
using IST handlers tricky there. We'd have to do the TR context
switch ourselves, and that would be expensive.

Andrew, am I remembering that right?

Tim.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 11/08/15 10:55, Tim Deegan wrote:
> At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
>> On 10/08/15 10:49, Tim Deegan wrote:
>>> Hi,
>>>
>>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>>>> The process to switch into and out of deprivileged mode can be likened to
>>>> setjmp/longjmp.
>>>>
>>>> To enter deprivileged mode, we take a copy of the stack from the guest's
>>>> registers up to the current stack pointer.
>>> This copy is pretty unfortunate, but I can see that avoiding it will
>>> be a bit complex. Could we do something with more stacks? AFAICS
>>> there have to be three stacks anyway:
>>>
>>> - one to hold the depriv execution context;
>>> - one to hold the privileged execution context; and
>>> - one to take interrupts on.
>>>
>>> So maybe we could do some fiddling to make Xen take interrupts on a
>>> different stack while we're depriv'd?
>>
>> That should happen naturally by virtue of the privilege level change
>> involved in taking the interrupt.
>
> Right, and this is why we need a third stack - so interrupts don't
> trash the existing priv state on the 'normal' Xen stack. And so we
> either need to copy the priv stack out (and maybe copy it back), or
> tell the CPU to use a different stack.

The copy is relatively small and paid only on the first and last entries
into the mode. I don't know if this is cheaper than the bookwork that
would be needed on entering and returning from the mode to switch to
these stacks. I'm assuming the sp pointers in the TSS and ISTs would
need changing on the first and last entry/exit if we have the extra
stack, is that correct? Or, is this a more dramatic change in that
everything uses this three stack model rather than just this feature.

I'm not sure how much in Xen would need changing to switch across to
using three stacks. Also, would this also need to be done for PV guests?
Would that need to be a separate patch series?

What's the overall consensus? Thanks!

> If we had enough headroom, we could try to be clever and tell the CPU
> to take interrupts on the priv stack _below_ the existing state. That
> would avoid the first of your problems below.
>
>> * Under this model, PV exception handlers should copy themselves onto
>> the privileged execution stack.
>> * Currently, the IST handlers copy themselves onto the primary stack if
>> they interrupt guest context.
>> * AMD Task Register on vmexit. (this old gem)
>
> Gah, this thing. :(

Curious (and I can't seem to find this in the manuals): What is this thing?

> Tim.
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 10/08/15 10:49, Tim Deegan wrote:
> Hi,
>
> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer.
> This copy is pretty unfortunate, but I can see that avoiding it will
> be a bit complex. Could we do something with more stacks? AFAICS
> there have to be three stacks anyway:
>
> - one to hold the depriv execution context;
> - one to hold the privileged execution context; and
> - one to take interrupts on.
>
> So maybe we could do some fiddling to make Xen take interrupts on a
> different stack while we're depriv'd?
>
> If we do have to copy, we could track whether the original stack has
> been clobbered by an interrupt, and so avoid (at least some of) the
> copy back afterwards?
>
> One nit in the assembler - if I've followed correctly, this saved IP:
>
>> +/* Perform a near call to push rip onto the stack */
>> +call 1f
>
> is returned to (with adjustments) here:
>
>> +/* Go to user mode return code */
>> +jmp *(%rsi)
>
> It would be good to make this a matched pair of call/ret if we can;
> the CPU has special branch prediction tracking for function calls that
> gets confused by a call that's not returned to.

sure, will do.

> Cheers,
>
> Tim.
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
> On 10/08/15 10:49, Tim Deegan wrote:
> > Hi,
> >
> > At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> >> The process to switch into and out of deprivileged mode can be likened to
> >> setjmp/longjmp.
> >>
> >> To enter deprivileged mode, we take a copy of the stack from the guest's
> >> registers up to the current stack pointer.
> > This copy is pretty unfortunate, but I can see that avoiding it will
> > be a bit complex. Could we do something with more stacks? AFAICS
> > there have to be three stacks anyway:
> >
> > - one to hold the depriv execution context;
> > - one to hold the privileged execution context; and
> > - one to take interrupts on.
> >
> > So maybe we could do some fiddling to make Xen take interrupts on a
> > different stack while we're depriv'd?
>
> That should happen naturally by virtue of the privilege level change
> involved in taking the interrupt.

Right, and this is why we need a third stack - so interrupts don't
trash the existing priv state on the 'normal' Xen stack. And so we
either need to copy the priv stack out (and maybe copy it back), or
tell the CPU to use a different stack.

If we had enough headroom, we could try to be clever and tell the CPU
to take interrupts on the priv stack _below_ the existing state. That
would avoid the first of your problems below.

> * Under this model, PV exception handlers should copy themselves onto
> the privileged execution stack.
> * Currently, the IST handlers copy themselves onto the primary stack if
> they interrupt guest context.
> * AMD Task Register on vmexit. (this old gem)

Gah, this thing. :(

Tim.
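
[Editorial sketch, not part of the patch.] The "enough headroom" idea above can be illustrated in plain C. This is only a sketch under stated assumptions: the stack grows downwards from the saved stack pointer towards stack_base, the interrupt entry point is aligned down to 16 bytes, and the returned address is what the CPU would be told to use (on x86-64 presumably via the stack pointer fields in the TSS that Ben mentions elsewhere in the thread). The names irq_stack_top_below and IRQ_HEADROOM_BYTES, and the 1024-byte figure, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimum space to leave for interrupt frames and handlers. */
#define IRQ_HEADROOM_BYTES 1024u

/*
 * Given the lowest usable address of the privileged stack (stack_base) and
 * the stack pointer at which the privileged state ends (saved_rsp), return
 * an address just below the existing state at which interrupts could be
 * taken, or 0 if there is not enough headroom left.
 */
static uintptr_t irq_stack_top_below(uintptr_t stack_base, uintptr_t saved_rsp)
{
    uintptr_t top;

    assert(saved_rsp >= stack_base);

    /* Align down to 16 bytes so the existing state above is never touched. */
    top = saved_rsp & ~(uintptr_t)0xf;

    if (top < stack_base || top - stack_base < IRQ_HEADROOM_BYTES)
        return 0; /* fall back to copying, or to a separate stack */

    return top;
}
```

A caller would fall back to the copying approach (or a separate interrupt stack) whenever this returns 0.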
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On Thu, 2015-08-06 at 21:55 +0100, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
> > The process to switch into and out of deprivileged mode can be likened to
> > setjmp/longjmp.
> >
> > To enter deprivileged mode, we take a copy of the stack from the guest's
> > registers up to the current stack pointer. This allows us to restore the
> > stack when we have finished the deprivileged mode operation, meaning we
> > can continue execution from that point. This is similar to if a context
> > switch had happened.
> >
> > To exit deprivileged mode, we copy the stack back, replacing the current
> > stack. We can then continue execution from where we left off, which will
> > unwind the stack and free up resources. This method means that we do not
> > need to change any other code paths and its invocation will be
> > transparent to callers. This should allow the feature to be more easily
> > deployed to different parts of Xen.
> >
> > Note that this copy of the stack is per-vcpu but, it will contain
> > per-pcpu data. Extra work is needed to properly migrate vcpus between
> > pcpus.
>
> Under what circumstances do you see there being persistent state in the
> depriv area between calls, given that the calls are synchronous from VM
> actions?

Would we not want to keep (some of) the device model's state in a depriv
area? e.g. anything which is purely internal to the DM which is therefore
only accessed from depriv-land?

Ian.
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 10/08/15 10:49, Tim Deegan wrote:
> Hi,
>
> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer.
> This copy is pretty unfortunate, but I can see that avoiding it will
> be a bit complex. Could we do something with more stacks? AFAICS
> there have to be three stacks anyway:
>
> - one to hold the depriv execution context;
> - one to hold the privileged execution context; and
> - one to take interrupts on.
>
> So maybe we could do some fiddling to make Xen take interrupts on a
> different stack while we're depriv'd?

That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.

Conceptually, taking interrupts from depriv mode is no different to
taking them in a PV guest.

Some complications which come to mind (none insurmountable):

* Under this model, PV exception handlers should copy themselves onto
  the privileged execution stack.
* Currently, the IST handlers copy themselves onto the primary stack if
  they interrupt guest context.
* AMD Task Register on vmexit. (this old gem)

~Andrew
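
[Editorial sketch, not part of the patch.] For readers unfamiliar with the setjmp/longjmp comparison quoted at the top of this message, the control flow it refers to looks roughly like this in standard C. It is only an analogy: setjmp saves registers and does not copy the stack, whereas the patch under discussion copies the stack so that interrupts taken while deprivileged cannot destroy the privileged state. The names priv_ctx, run_deprivileged and enter_deprivileged are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf priv_ctx; /* stands in for the saved privileged context */
static int depriv_ran;   /* records that the "deprivileged" work executed */

/* Stand-in for the deprivileged operation; it never returns normally. */
static void run_deprivileged(void)
{
    depriv_ran = 1;
    longjmp(priv_ctx, 1); /* "exit depriv mode": resume at the save point */
}

/* Save the context, run the operation, and continue from the save point. */
static int enter_deprivileged(void)
{
    depriv_ran = 0;

    if (setjmp(priv_ctx) == 0) {
        run_deprivileged();     /* "enter depriv mode" */
        assert(!"unreachable"); /* run_deprivileged() always longjmps */
    }

    /* Execution continues here as if nothing had happened in between. */
    return depriv_ran;
}
```

The caller of enter_deprivileged() sees an ordinary function return, which is the "transparent to callers" property Ben describes.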
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> The process to switch into and out of deprivileged mode can be likened to
> setjmp/longjmp.
>
> To enter deprivileged mode, we take a copy of the stack from the guest's
> registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex. Could we do something with more stacks? AFAICS
there have to be three stacks anyway:

- one to hold the depriv execution context;
- one to hold the privileged execution context; and
- one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?

If we do have to copy, we could track whether the original stack has
been clobbered by an interrupt, and so avoid (at least some of) the
copy back afterwards?

One nit in the assembler - if I've followed correctly, this saved IP:

> +/* Perform a near call to push rip onto the stack */
> +call 1f

is returned to (with adjustments) here:

> +/* Go to user mode return code */
> +jmp *(%rsi)

It would be good to make this a matched pair of call/ret if we can;
the CPU has special branch prediction tracking for function calls that
gets confused by a call that's not returned to.

Cheers,

Tim.
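
[Editorial sketch, not part of the patch.] The suggestion above of tracking whether the original stack was clobbered, and skipping the copy back when it was not, might look something like the following in plain C. This is a user-space illustration under stated assumptions, not Xen code: the structure, the function names and the fixed 256-byte buffer are all hypothetical, and in a real implementation the flag would have to be set from the interrupt entry path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PRIV_STACK_BYTES 256 /* hypothetical size, for illustration only */

/* Hypothetical per-vcpu bookkeeping; none of these names are from the patch. */
struct depriv_save {
    unsigned char copy[PRIV_STACK_BYTES]; /* saved privileged stack contents */
    size_t len;                           /* number of bytes saved */
    bool clobbered;                       /* did an interrupt overwrite it? */
};

/* On entry: take a copy of the live privileged stack region. */
static int depriv_save_stack(struct depriv_save *s,
                             const unsigned char *live, size_t len)
{
    assert(s != NULL && live != NULL);
    if (len > sizeof(s->copy))
        return -1;
    memcpy(s->copy, live, len);
    s->len = len;
    s->clobbered = false;
    return 0;
}

/* Called from the (simulated) interrupt path when it writes to the stack. */
static void depriv_note_clobber(struct depriv_save *s)
{
    s->clobbered = true;
}

/* On exit: copy back only if needed. Returns the number of bytes copied. */
static size_t depriv_restore_stack(struct depriv_save *s, unsigned char *live)
{
    if (!s->clobbered)
        return 0; /* the live stack is still intact, so skip the copy */
    memcpy(live, s->copy, s->len);
    s->clobbered = false;
    return s->len;
}
```

A finer-grained variant could record the lowest address written and copy back only that part, which would correspond to the "(at least some of)" in the suggestion.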
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 07/08/15 13:51, Ben Catterall wrote:
> On 06/08/15 21:55, Andrew Cooper wrote:
>> On 06/08/15 17:45, Ben Catterall wrote:
>>> The process to switch into and out of deprivileged mode can be
>>> likened to setjmp/longjmp.
>>>
>>> To enter deprivileged mode, we take a copy of the stack from the
>>> guest's registers up to the current stack pointer. This allows us to
>>> restore the stack when we have finished the deprivileged mode
>>> operation, meaning we can continue execution from that point. This is
>>> similar to if a context switch had happened.
>>>
>>> To exit deprivileged mode, we copy the stack back, replacing the
>>> current stack. We can then continue execution from where we left off,
>>> which will unwind the stack and free up resources. This method means
>>> that we do not need to change any other code paths and its invocation
>>> will be transparent to callers. This should allow the feature to be
>>> more easily deployed to different parts of Xen.
>>>
>>> Note that this copy of the stack is per-vcpu but, it will contain
>>> per-pcpu data. Extra work is needed to properly migrate vcpus between
>>> pcpus.
>>
>> Under what circumstances do you see there being persistent state in the
>> depriv area between calls, given that the calls are synchronous from VM
>> actions?
>
> I don't know if we can make these synchronous as we need a way to
> interrupt the vcpu if it's spinning for a long time. Otherwise an
> attacker could just spin in depriv and cause a DoS. With that in mind,
> the scheduler may decide to migrate the vcpu whilst it's in depriv
> mode which would mean this per-pcpu data is held in the stack copy
> which is then migrated to another pcpu incorrectly.

If the emulator spins for a sufficient time, it is fine to shoot the
domain. This is a strict improvement on the current behaviour where a
spinning emulator would shoot the host, via a watchdog timeout.

As said elsewhere, this kind of DoS is not a very interesting attack
vector. State handling errors which cause Xen to change the wrong thing
are far more interesting from a guest's point of view.

http://xenbits.xen.org/xsa/advisory-123.html (full host compromise) or
http://xenbits.xen.org/xsa/advisory-108.html (read other guests data)
are examples of kinds of interesting issues which could potentially be
mitigated with this depriv infrastructure.

>>> The switch to and from deprivileged mode is performed using sysret
>>> and syscall respectively.
>>
>> I suspect we need to borrow the SS attribute workaround from Linux to
>> make this function reliably on AMD systems.
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
>
> Ah! ok, I'll look into this. Thanks!

Just be aware of it. Don't spend your time attempting to retrofit it to
Xen. It is more work than it looks.

~Andrew
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 07/08/15 13:51, Ben Catterall wrote:
>
> I don't know if we can make these synchronous as we need a way to
> interrupt the vcpu if it's spinning for a long time. Otherwise an
> attacker could just spin in depriv and cause a DoS. With that in mind,
> the scheduler may decide to migrate the vcpu whilst it's in depriv mode
> which would mean this per-pcpu data is held in the stack copy which is
> then migrated to another pcpu incorrectly.

IMO, DoS attacks on depriv'd emulators aren't very interesting.

I think it is counter-productive to address this attack in this initial
implementation at the expense (delays/complexity/etc.) of solving the
key requirement of mitigating information leaks and privilege escalation
attacks.

David
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 06/08/15 21:55, Andrew Cooper wrote: On 06/08/15 17:45, Ben Catterall wrote: The process to switch into and out of deprivileged mode can be likened to setjmp/longjmp. To enter deprivileged mode, we take a copy of the stack from the guest's registers up to the current stack pointer. This allows us to restore the stack when we have finished the deprivileged mode operation, meaning we can continue execution from that point. This is similar to if a context switch had happened. To exit deprivileged mode, we copy the stack back, replacing the current stack. We can then continue execution from where we left off, which will unwind the stack and free up resources. This method means that we do not need to change any other code paths and its invocation will be transparent to callers. This should allow the feature to be more easily deployed to different parts of Xen. Note that this copy of the stack is per-vcpu but, it will contain per-pcpu data. Extra work is needed to properly migrate vcpus between pcpus. Under what circumstances do you see there being persistent state in the depriv area between calls, given that the calls are synchronous from VM actions? I don't know if we can make these synchronous as we need a way to interrupt the vcpu if it's spinning for a long time. Otherwise an attacker could just spin in depriv and cause a DoS. With that in mind, the scheduler may decide to migrate the vcpu whilst it's in depriv mode which would mean this per-pcpu data is held in the stack copy which is then migrated to another pcpu incorrectly. The switch to and from deprivileged mode is performed using sysret and syscall respectively. I suspect we need to borrow the SS attribute workaround from Linux to make this function reliably on AMD systems. https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b > Ah! ok, I'll look into this. Thanks! 
The return paths in entry.S have been edited so that, when we receive an interrupt whilst in deprivileged mode, we return into that mode correctly. A hook on the syscall handler in entry.S has also been added which handles returning from user mode and will support deprivileged mode system calls when these are needed. Signed-off-by: Ben Catterall --- xen/arch/x86/domain.c | 12 +++ xen/arch/x86/hvm/Makefile | 1 + xen/arch/x86/hvm/deprivileged.c | 103 ++ xen/arch/x86/hvm/deprivileged_asm.S | 205 xen/arch/x86/hvm/vmx/vmx.c | 7 ++ xen/arch/x86/x86_64/asm-offsets.c | 5 + xen/arch/x86/x86_64/entry.S | 35 ++ xen/include/asm-x86/hvm/vmx/vmx.h | 2 + xen/include/xen/hvm/deprivileged.h | 38 +++ xen/include/xen/sched.h | 18 +++- 10 files changed, 425 insertions(+), 1 deletion(-) create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index 045f6ff..a0e5e70 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -62,6 +62,7 @@ #include #include #include +#include DEFINE_PER_CPU(struct vcpu *, curr_vcpu); DEFINE_PER_CPU(unsigned long, cr4); @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v) if ( has_hvm_container_domain(d) ) { rc = hvm_vcpu_initialise(v); + +/* Initialise HVM deprivileged mode */ +printk("HVM initialising deprivileged mode ..."); All printk()s should have a XENLOG_$severity prefix. will do. 
+hvm_deprivileged_prepare_vcpu(v); +printk("Done.\n"); + goto done; } @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v) vcpu_destroy_fpu(v); if ( has_hvm_container_vcpu(v) ) +{ +/* Destroy the deprivileged mode on this vcpu */ +hvm_deprivileged_destroy_vcpu(v); + hvm_vcpu_destroy(v); +} else xfree(v->arch.pv_vcpu.trap_ctxt); } diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile index bd83ba3..6819886 100644 --- a/xen/arch/x86/hvm/Makefile +++ b/xen/arch/x86/hvm/Makefile @@ -17,6 +17,7 @@ obj-y += quirks.o obj-y += rtc.o obj-y += save.o obj-y += deprivileged.o +obj-y += deprivileged_asm.o obj-y += stdvga.o obj-y += vioapic.o obj-y += viridian.o diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c index 071d900..979fc69 100644 --- a/xen/arch/x86/hvm/deprivileged.c +++ b/xen/arch/x86/hvm/deprivileged.c @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d, } return 0; } + +/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu. + */ +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu) +{ +struct page_info *pg; + +/* TODO: clarify if this MEMF is correct */ +/* Allocate 2^STACK_ORDER contiguous pages */ +pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner); +if( pg == NULL ) +{ +panic("
Re: [Xen-devel] [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
On 06/08/15 17:45, Ben Catterall wrote: > The process to switch into and out of deprivileged mode can be likened to > setjmp/longjmp. > > To enter deprivileged mode, we take a copy of the stack from the guest's > registers up to the current stack pointer. This allows us to restore the stack > when we have finished the deprivileged mode operation, meaning we can continue > execution from that point. This is similar to if a context switch had > happened. > > To exit deprivileged mode, we copy the stack back, replacing the current > stack. > We can then continue execution from where we left off, which will unwind the > stack and free up resources. This method means that we do not need to > change any other code paths and its invocation will be transparent to callers. > This should allow the feature to be more easily deployed to different parts > of Xen. > > Note that this copy of the stack is per-vcpu but, it will contain per-pcpu > data. > Extra work is needed to properly migrate vcpus between pcpus. Under what circumstances do you see there being persistent state in the depriv area between calls, given that the calls are synchronous from VM actions? > > The switch to and from deprivileged mode is performed using sysret and syscall > respectively. I suspect we need to borrow the SS attribute workaround from Linux to make this function reliably on AMD systems. https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b > > The return paths in entry.S have been edited so that, when we receive an > interrupt whilst in deprivileged mode, we return into that mode correctly. > > A hook on the syscall handler in entry.S has also been added which handles > returning from user mode and will support deprivileged mode system calls when > these are needed. 
> > Signed-off-by: Ben Catterall > --- > xen/arch/x86/domain.c | 12 +++ > xen/arch/x86/hvm/Makefile | 1 + > xen/arch/x86/hvm/deprivileged.c | 103 ++ > xen/arch/x86/hvm/deprivileged_asm.S | 205 > > xen/arch/x86/hvm/vmx/vmx.c | 7 ++ > xen/arch/x86/x86_64/asm-offsets.c | 5 + > xen/arch/x86/x86_64/entry.S | 35 ++ > xen/include/asm-x86/hvm/vmx/vmx.h | 2 + > xen/include/xen/hvm/deprivileged.h | 38 +++ > xen/include/xen/sched.h | 18 +++- > 10 files changed, 425 insertions(+), 1 deletion(-) > create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S > > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c > index 045f6ff..a0e5e70 100644 > --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -62,6 +62,7 @@ > #include > #include > #include > +#include > > DEFINE_PER_CPU(struct vcpu *, curr_vcpu); > DEFINE_PER_CPU(unsigned long, cr4); > @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v) > if ( has_hvm_container_domain(d) ) > { > rc = hvm_vcpu_initialise(v); > + > +/* Initialise HVM deprivileged mode */ > +printk("HVM initialising deprivileged mode ..."); All printk()s should have a XENLOG_$severity prefix. 
> +hvm_deprivileged_prepare_vcpu(v); > +printk("Done.\n"); > + > goto done; > } > > @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v) > vcpu_destroy_fpu(v); > > if ( has_hvm_container_vcpu(v) ) > +{ > +/* Destroy the deprivileged mode on this vcpu */ > +hvm_deprivileged_destroy_vcpu(v); > + > hvm_vcpu_destroy(v); > +} > else > xfree(v->arch.pv_vcpu.trap_ctxt); > } > diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile > index bd83ba3..6819886 100644 > --- a/xen/arch/x86/hvm/Makefile > +++ b/xen/arch/x86/hvm/Makefile > @@ -17,6 +17,7 @@ obj-y += quirks.o > obj-y += rtc.o > obj-y += save.o > obj-y += deprivileged.o > +obj-y += deprivileged_asm.o > obj-y += stdvga.o > obj-y += vioapic.o > obj-y += viridian.o > diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c > index 071d900..979fc69 100644 > --- a/xen/arch/x86/hvm/deprivileged.c > +++ b/xen/arch/x86/hvm/deprivileged.c > @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d, > } > return 0; > } > + > +/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu. > + */ > +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu) > +{ > +struct page_info *pg; > + > +/* TODO: clarify if this MEMF is correct */ > +/* Allocate 2^STACK_ORDER contiguous pages */ > +pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner); > +if( pg == NULL ) > +{ > +panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n"); > +return -ENOMEM; > +} > + > +vcpu->stack = page_to_virt(pg); Xen has two heaps, the xenheap and the domheap. You may only construct pointers like this into the xenheap. The domheap is not guaranteed to have safe virtual m