Re: TSC in nested SVM and VMX

2010-10-03 Thread Nadav Har'El
On Sun, Oct 03, 2010, Alexander Graf wrote about Re: TSC in nested SVM and 
VMX:
 Looking through the spec, the only indicator I've found is this passage:
 
 TSC_OFFSET - an offset to add when the guest reads the TSC (time stamp
 counter). Guest writes to the TSC can be intercepted and emulated by
 changing the offset (without writing the physical TSC). This offset is
 cleared when the guest exits back to the host.
 
 So apparently writes to TSC don't affect tsc_offset, but instead affect
 the host's tsc skew. So with nesting a non-intercepted tsc write affects
 L1's tsc_offset. This means the code is correct. Sorry for the fuss :).

I don't understand, how does this passage imply that writes to the TSC don't
affect the tsc_offset? It says that writes to the TSC can (I don't know why
this word was used...) be emulated by changing the offset. I don't understand why a guest
should be allowed to ruin its host's TSC (or in the nested case, why an L2
should be allowed to ruin L1's TSC without L1's knowledge) - isn't this
exactly why the TSC offset exists?
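
For reference, the mechanism the quoted passage describes can be sketched roughly as below. This is only an illustration under my own assumptions: struct vcpu_tsc, guest_read_tsc() and emulate_guest_tsc_write() are hypothetical names, not KVM's or the hardware's, and the host TSC is passed in as a plain number.

```c
#include <stdint.h>

/* Hypothetical model of one virtual CPU's TSC state (not a real KVM struct). */
struct vcpu_tsc {
    uint64_t tsc_offset;    /* added whenever the guest reads the TSC */
};

/* What the guest observes when it reads the TSC. */
static uint64_t guest_read_tsc(const struct vcpu_tsc *v, uint64_t host_tsc)
{
    return host_tsc + v->tsc_offset;    /* unsigned, so it wraps modulo 2^64 */
}

/*
 * Emulate an intercepted guest TSC write: choose the offset that makes the
 * guest see new_value right now. The physical (host) TSC is never written.
 */
static void emulate_guest_tsc_write(struct vcpu_tsc *v, uint64_t host_tsc,
                                    uint64_t new_value)
{
    v->tsc_offset = new_value - host_tsc;
}
```

In this model a guest write only ever moves tsc_offset, which is the behaviour I would expect the offset to exist for.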

-- 
Nadav Har'El|  Sunday, Oct  3 2010, 25 Tishri 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Computers are useless. They can only
http://nadav.harel.org.il   |give you answers. -- Pablo Picasso
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TSC in nested SVM and VMX

2010-10-03 Thread Alexander Graf

On 03.10.2010, at 10:35, Nadav Har'El wrote:

 On Sun, Oct 03, 2010, Alexander Graf wrote about Re: TSC in nested SVM and 
 VMX:
 Looking through the spec, the only indicator I've found is this passage:
 
 TSC_OFFSET - an offset to add when the guest reads the TSC (time stamp
 counter). Guest writes to the TSC can be intercepted and emulated by
 changing the offset (without writing the physical TSC). This offset is
 cleared when the guest exits back to the host.
 
 So apparently writes to TSC don't affect tsc_offset, but instead affect
 the host's tsc skew. So with nesting a non-intercepted tsc write affects
 L1's tsc_offset. This means the code is correct. Sorry for the fuss :).
 
 I don't understand, how does this passage imply that writes to the TSC don't
 affect the tsc_offset? It says that writes to the TSC can (I don't know why
 this word was used...) be emulated by changing the offset. I don't understand why a guest
 should be allowed to ruin its host's TSC (or in the nested case, why an L2
 should be allowed to ruin L1's TSC without L1's knowledge) - isn't this
 exactly why the TSC offset exists?

Yes, it is. But since nothing apart from that passage indicates that 
tsc_offset is changed by a TSC write, it probably doesn't happen.

Also, L2 affecting the host's TSC can make sense at times. Hyper-V for example 
runs its Dom0 inside of a VM context. If that wants to change the system's tsc 
offset, it should be allowed to, no? Unless the hypervisor wants to use the TSC 
too of course - in which case all hell breaks loose.


Alex



Re: TSC in nested SVM and VMX

2010-10-02 Thread Alexander Graf

On 01.10.2010, at 21:22, Zachary Amsden <zams...@redhat.com> wrote:

 On 10/01/2010 04:46 AM, Alexander Graf wrote:
 On 01.10.2010, at 13:21, Nadav Har'El wrote:
 
   
 On Thu, Sep 30, 2010, Zachary Amsden wrote about Re: TSC in nested SVM and 
 VMX:
 
 1)  When reading an MSR, we are not emulating the L2 guest; we are
 DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2
 guest is actually done by the code running /inside/ the L1 emulation, so
 MSR reads for the L2 guest are handed by L1, and MSR reads for the L1
 guest are handled by L0, which is this code.
 ...
 So if we are currently running nested, the L1 tsc_offset is stored in
 the nested.hsave field; the vmcb which is active is polluted by the L2
 guest offset, which would be incorrect to return to the L1 emulation.
   
 Thanks for the detailed explanation.
 
 It seems, then, that the nested VMX logic is somewhat different from that
 of the nested SVM. In nested VMX, if a function gets called when running
 L1, the current VMCS will be that of L1 (aka vmcs01), not of its guest L2
 (and I'm not even sure *which* L2 that would be when there are multiple
 L2 guests for the one L1).
 
 If the #vmexit comes while you're in L1, everything works on the L1's vmcb. 
 If you hit it while in L2, everything works on the L2's vmcb unless special 
 attention is taken.
 
 The reason behind the TSC shift is very simple. With the tsc_offset setting 
 we're trying to adjust the L1's offset. Adjusting the L1's offset means we 
 need to adjust L1 and L2 alike, as the virtual L2's offset == L1 offset + 
 vmcb L2 offset, because L2's TSC is also offset by the amount L1 is.
 
 So basically what happens is:
 
 nested VMRUN:
 
 svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
 
 please note the +=!
 
 
 svm_write_tsc_offset:
 
 This gets called when we really want to current level's TSC offset only 
 because the guest issued a tsc write. In L2 this means the L2's value.
 
 if (is_nested(svm)) {
         g_tsc_offset = svm->vmcb->control.tsc_offset -
                        svm->nested.hsave->control.tsc_offset;
 
 Remember the difference between L1 and L2.
 
         svm->nested.hsave->control.tsc_offset = offset;
 
 Set L1 to the new offset
 
 }
 
 svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
 
 Set L2 to new offset + delta.
 
 
 So what this function does is that it treats TSC writes as L1 writes even 
 while in L2 and adjusts L2 accordingly. Joerg, this sounds fishy to me. Are 
 you sure this is intended and works when L1 doesn't intercept MSR writes to 
 TSC?
   
 
 L1 must intercept MSR writes to TSC for this to work.  It does, so all is 
 well.

Sure, in nested kvm all is fine because we becer hit the above code path. But 
other nypervisors might not intercept tsc writes which should only be reflected 
in an l2 tsc offset change, no?

Alex


Re: TSC in nested SVM and VMX

2010-10-02 Thread Alexander Graf

On 02.10.2010, at 03:56, Alexander Graf wrote:

 
 On 01.10.2010, at 21:22, Zachary Amsden <zams...@redhat.com> wrote:
 
 On 10/01/2010 04:46 AM, Alexander Graf wrote:
 On 01.10.2010, at 13:21, Nadav Har'El wrote:
 
 
 On Thu, Sep 30, 2010, Zachary Amsden wrote about Re: TSC in nested SVM 
 and VMX:
 
 1)  When reading an MSR, we are not emulating the L2 guest; we are
 DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2
 guest is actually done by the code running /inside/ the L1 emulation, so
 MSR reads for the L2 guest are handed by L1, and MSR reads for the L1
 guest are handled by L0, which is this code.
 ...
 So if we are currently running nested, the L1 tsc_offset is stored in
 the nested.hsave field; the vmcb which is active is polluted by the L2
 guest offset, which would be incorrect to return to the L1 emulation.
 
 Thanks for the detailed explanation.
 
 It seems, then, that the nested VMX logic is somewhat different from that
 of the nested SVM. In nested VMX, if a function gets called when running
 L1, the current VMCS will be that of L1 (aka vmcs01), not of its guest L2
 (and I'm not even sure *which* L2 that would be when there are multiple
 L2 guests for the one L1).
 
 If the #vmexit comes while you're in L1, everything works on the L1's vmcb. 
 If you hit it while in L2, everything works on the L2's vmcb unless special 
 attention is taken.
 
 The reason behind the TSC shift is very simple. With the tsc_offset setting 
 we're trying to adjust the L1's offset. Adjusting the L1's offset means we 
 need to adjust L1 and L2 alike, as the virtual L2's offset == L1 offset + 
 vmcb L2 offset, because L2's TSC is also offset by the amount L1 is.
 
 So basically what happens is:
 
 nested VMRUN:
 
 svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
 
 please note the +=!
 
 
 svm_write_tsc_offset:
 
 This gets called when we really want to current level's TSC offset only 
 because the guest issued a tsc write. In L2 this means the L2's value.
 
 if (is_nested(svm)) {
         g_tsc_offset = svm->vmcb->control.tsc_offset -
                        svm->nested.hsave->control.tsc_offset;
 
 Remember the difference between L1 and L2.
 
         svm->nested.hsave->control.tsc_offset = offset;
 
 Set L1 to the new offset
 
 }
 
 svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
 
 Set L2 to new offset + delta.
 
 
 So what this function does is that it treats TSC writes as L1 writes even 
 while in L2 and adjusts L2 accordingly. Joerg, this sounds fishy to me. Are 
 you sure this is intended and works when L1 doesn't intercept MSR writes to 
 TSC?
 
 
 L1 must intercept MSR writes to TSC for this to work.  It does, so all is 
 well.
 
 Sure, in nested kvm all is fine because we becer

never

 hit the above code path. But other nypervisors

hypervisors

 might not intercept tsc writes which should only be reflected in an l2 tsc 
 offset change, no?

Note to self: proof-read mails when writing from a phone.


Alex



Re: TSC in nested SVM and VMX

2010-10-02 Thread Zachary Amsden

On 10/02/2010 01:19 AM, Alexander Graf wrote:

On 02.10.2010, at 03:56, Alexander Graf wrote:

   

On 01.10.2010, at 21:22, Zachary Amsden <zams...@redhat.com> wrote:

 

On 10/01/2010 04:46 AM, Alexander Graf wrote:
   

On 01.10.2010, at 13:21, Nadav Har'El wrote:


 

On Thu, Sep 30, 2010, Zachary Amsden wrote about Re: TSC in nested SVM and 
VMX:

   

1)  When reading an MSR, we are not emulating the L2 guest; we are
DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2
guest is actually done by the code running /inside/ the L1 emulation, so
MSR reads for the L2 guest are handed by L1, and MSR reads for the L1
guest are handled by L0, which is this code.
...
So if we are currently running nested, the L1 tsc_offset is stored in
the nested.hsave field; the vmcb which is active is polluted by the L2
guest offset, which would be incorrect to return to the L1 emulation.

 

Thanks for the detailed explanation.

It seems, then, that the nested VMX logic is somewhat different from that
of the nested SVM. In nested VMX, if a function gets called when running
L1, the current VMCS will be that of L1 (aka vmcs01), not of its guest L2
(and I'm not even sure *which* L2 that would be when there are multiple
L2 guests for the one L1).

   

If the #vmexit comes while you're in L1, everything works on the L1's vmcb. If 
you hit it while in L2, everything works on the L2's vmcb unless special 
attention is taken.

The reason behind the TSC shift is very simple. With the tsc_offset setting 
we're trying to adjust the L1's offset. Adjusting the L1's offset means we need 
to adjust L1 and L2 alike, as the virtual L2's offset == L1 offset + vmcb L2 
offset, because L2's TSC is also offset by the amount L1 is.

So basically what happens is:

nested VMRUN:

svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;

please note the +=!


svm_write_tsc_offset:

This gets called when we really want to current level's TSC offset only because 
the guest issued a tsc write. In L2 this means the L2's value.

if (is_nested(svm)) {
        g_tsc_offset = svm->vmcb->control.tsc_offset -
                       svm->nested.hsave->control.tsc_offset;

Remember the difference between L1 and L2.

        svm->nested.hsave->control.tsc_offset = offset;

Set L1 to the new offset

}

svm->vmcb->control.tsc_offset = offset + g_tsc_offset;

Set L2 to new offset + delta.


So what this function does is that it treats TSC writes as L1 writes even while 
in L2 and adjusts L2 accordingly. Joerg, this sounds fishy to me. Are you sure 
this is intended and works when L1 doesn't intercept MSR writes to TSC?

 

L1 must intercept MSR writes to TSC for this to work.  It does, so all is well.
   

Sure, in nested kvm all is fine because we becer
 

never

   

hit the above code path. But other nypervisors
 


We do hit that code path, and it works fine because it is correct.  It 
only applies to L1 TSC writes.


An L2 guest writing to the TSC MSR will not run this code path. Instead, 
an L2 guest writing to the TSC will (assuming the L1 guest traps writes) 
trigger a #VMEXIT, which should be forwarded to the L1 guest.  In response, 
the L1 guest has two choices:


1) adjust the TSC offset in the vmcb for the L2 guest.
2) rewrite the TSC instead, triggering the above code path, which 
follows the standard case (as it is not running nested), adjusting the 
TSC offset for the L1 guest only.


In both cases, this adjusted offset will be added to the L2 guest offset 
when it resumes in the nested #VMRUN.


The only time the above code path follows the nested case is

1) A nested L2 guest is running
2) The L0 emulation of L1 requires adjusting the hardware TSC offset 
because of a hardware CPU TSC change



hypervisors

   

might not intercept tsc writes which should only be reflected in an l2 tsc 
offset change, no?
 




Other hypervisors are irrelevant here.  L1 hypervisor, whatever it may 
be, may or may not intercept TSC writes.  If it does not, it does not 
correctly virtualize L2.


We correctly virtualize the L1 guest's mistake by allowing the L2 guest 
in that case to rewrite the TSC for the L1 guest.  This may cause some 
slight disruption for L1's correct timekeeping...


Zach


Re: TSC in nested SVM and VMX

2010-10-02 Thread Alexander Graf

On 03.10.2010, at 00:46, Zachary Amsden wrote:

 On 10/02/2010 01:19 AM, Alexander Graf wrote:
 On 02.10.2010, at 03:56, Alexander Graf wrote:
 
   
 On 01.10.2010, at 21:22, Zachary Amsden <zams...@redhat.com> wrote:
 
 
 On 10/01/2010 04:46 AM, Alexander Graf wrote:
   
 On 01.10.2010, at 13:21, Nadav Har'El wrote:
 
 
 
 On Thu, Sep 30, 2010, Zachary Amsden wrote about Re: TSC in nested SVM 
 and VMX:
 
   
 1)  When reading an MSR, we are not emulating the L2 guest; we are
 DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2
 guest is actually done by the code running /inside/ the L1 emulation, so
 MSR reads for the L2 guest are handed by L1, and MSR reads for the L1
 guest are handled by L0, which is this code.
 ...
 So if we are currently running nested, the L1 tsc_offset is stored in
 the nested.hsave field; the vmcb which is active is polluted by the L2
 guest offset, which would be incorrect to return to the L1 emulation.
 
 
 Thanks for the detailed explanation.
 
 It seems, then, that the nested VMX logic is somewhat different from that
 of the nested SVM. In nested VMX, if a function gets called when running
 L1, the current VMCS will be that of L1 (aka vmcs01), not of its guest L2
 (and I'm not even sure *which* L2 that would be when there are multiple
 L2 guests for the one L1).
 
   
 If the #vmexit comes while you're in L1, everything works on the L1's 
 vmcb. If you hit it while in L2, everything works on the L2's vmcb unless 
 special attention is taken.
 
 The reason behind the TSC shift is very simple. With the tsc_offset 
 setting we're trying to adjust the L1's offset. Adjusting the L1's offset 
 means we need to adjust L1 and L2 alike, as the virtual L2's offset == L1 
 offset + vmcb L2 offset, because L2's TSC is also offset by the amount L1 
 is.
 
 So basically what happens is:
 
 nested VMRUN:
 
 svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
 
 please note the +=!
 
 
 svm_write_tsc_offset:
 
 This gets called when we really want to current level's TSC offset only 
 because the guest issued a tsc write. In L2 this means the L2's value.
 
 if (is_nested(svm)) {
         g_tsc_offset = svm->vmcb->control.tsc_offset -
                        svm->nested.hsave->control.tsc_offset;
 
 Remember the difference between L1 and L2.
 
         svm->nested.hsave->control.tsc_offset = offset;
 
 Set L1 to the new offset
 
 }
 
 svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
 
 Set L2 to new offset + delta.
 
 
 So what this function does is that it treats TSC writes as L1 writes even 
 while in L2 and adjusts L2 accordingly. Joerg, this sounds fishy to me. 
 Are you sure this is intended and works when L1 doesn't intercept MSR 
 writes to TSC?
 
 
 L1 must intercept MSR writes to TSC for this to work.  It does, so all is 
 well.
   
 Sure, in nested kvm all is fine because we becer
 
 never
 
   
 hit the above code path. But other nypervisors
 
 
 We do hit that code path, and it works fine because it is correct.  It only 
 applies to L1 TSC writes.
 
 An L2 guest writing to TSC MSR will not run this code path, an L2 guest 
 writing to TSC will (assuming the L1 guest traps writes) trigger a #VMEXIT 
 which should be forwarded to the L1 guest.  In response, the L1 guest has two 
 choices:
 
 1) adjust the TSC offset in the vmcb for the L2 guest.
 2) rewrite the TSC instead, triggering the above code path, which follows the 
 standard case (as it is not running nested), adjusting the TSC offset for the 
 L1 guest only.
 
 In both cases, this adjusted offset will be added to the L2 guest offset when 
 it resumes in the nested #VMRUN.

Once tsc writes within L2 are intercepted, all is fine. No need to worry.

 
 The only time the above code path follows the nested case is
 
 1) A nested L2 guest is running

Yes.

 2) The L0 emulation of L1 requires adjusting the hardware TSC offset because 
 of a hardware CPU TSC change

Not that I've found. Mind showing me the code path that triggers this?

 
 hypervisors
 
   
 might not intercept tsc writes which should only be reflected in an l2 tsc 
 offset change, no?
 
 
 
 Other hypervisors are irrelevant here.  L1 hypervisor, whatever it may be, 
 may or may not intercept TSC writes.  If it does not, it does not correctly 
 virtualize L2.

You mean because it doesn't contain L2 properly? That's for L1's hv to decide.

 
 We correctly virtualize the L1 guest's mistake by allowing the L2 guest in 
 that case to rewrite the TSC for the L1 guest.  This may cause some slight 
 disruption for L1's correct timekeeping...

Ok, let's take a step back. Let's take a look at how it works without nesting.

If the guest writes the MSR, does the host's tsc get changed or does the 
guest's tsc_offset field get changed? If it's the former, the code is correct. 
If it's the latter, it's wrong.

Looking through

Re: TSC in nested SVM and VMX

2010-10-01 Thread Nadav Har'El
On Thu, Sep 30, 2010, Zachary Amsden wrote about Re: TSC in nested SVM and 
VMX:
 1)  When reading an MSR, we are not emulating the L2 guest; we are 
 DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2 
 guest is actually done by the code running /inside/ the L1 emulation, so 
 MSR reads for the L2 guest are handed by L1, and MSR reads for the L1 
 guest are handled by L0, which is this code.
...
 So if we are currently running nested, the L1 tsc_offset is stored in 
 the nested.hsave field; the vmcb which is active is polluted by the L2 
 guest offset, which would be incorrect to return to the L1 emulation.

Thanks for the detailed explanation.

It seems, then, that the nested VMX logic is somewhat different from that
of the nested SVM. In nested VMX, if a function gets called when running
L1, the current VMCS will be that of L1 (aka vmcs01), not of its guest L2
(and I'm not even sure *which* L2 that would be when there are multiple
L2 guests for the one L1).

Nadav.

-- 
Nadav Har'El|  Friday, Oct  1 2010, 23 Tishri 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |What's tiny, yellow and very dangerous? A
http://nadav.harel.org.il   |canary with the super-user password.


Re: TSC in nested SVM and VMX

2010-10-01 Thread Alexander Graf

On 01.10.2010, at 13:21, Nadav Har'El wrote:

 On Thu, Sep 30, 2010, Zachary Amsden wrote about Re: TSC in nested SVM and 
 VMX:
 1)  When reading an MSR, we are not emulating the L2 guest; we are 
 DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2 
 guest is actually done by the code running /inside/ the L1 emulation, so 
 MSR reads for the L2 guest are handed by L1, and MSR reads for the L1 
 guest are handled by L0, which is this code.
 ...
 So if we are currently running nested, the L1 tsc_offset is stored in 
 the nested.hsave field; the vmcb which is active is polluted by the L2 
 guest offset, which would be incorrect to return to the L1 emulation.
 
 Thanks for the detailed explanation.
 
 It seems, then, that the nested VMX logic is somewhat different from that
 of the nested SVM. In nested VMX, if a function gets called when running
 L1, the current VMCS will be that of L1 (aka vmcs01), not of its guest L2
 (and I'm not even sure *which* L2 that would be when there are multiple
 L2 guests for the one L1).

If the #vmexit comes while you're in L1, everything works on the L1's vmcb. If 
you hit it while in L2, everything works on the L2's vmcb unless special 
attention is taken.

The reason behind the TSC shift is very simple. With the tsc_offset setting 
we're trying to adjust the L1's offset. Adjusting the L1's offset means we need 
to adjust L1 and L2 alike, as the virtual L2's offset == L1 offset + vmcb L2 
offset, because L2's TSC is also offset by the amount L1 is.

So basically what happens is:

nested VMRUN:

svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;

please note the +=!


svm_write_tsc_offset:

This gets called when we really want to change the current level's TSC offset 
only, because the guest issued a TSC write. In L2 this means L2's value.

if (is_nested(svm)) {
        g_tsc_offset = svm->vmcb->control.tsc_offset -
                       svm->nested.hsave->control.tsc_offset;

Remember the difference between L1 and L2.

        svm->nested.hsave->control.tsc_offset = offset;

Set L1 to the new offset

}

svm->vmcb->control.tsc_offset = offset + g_tsc_offset;

Set L2 to new offset + delta.


So what this function does is that it treats TSC writes as L1 writes even while 
in L2 and adjusts L2 accordingly. Joerg, this sounds fishy to me. Are you sure 
this is intended and works when L1 doesn't intercept MSR writes to TSC?


svm_adjust_tsc_offset:

svm->vmcb->control.tsc_offset += adjustment;
if (is_nested(svm))
        svm->nested.hsave->control.tsc_offset += adjustment;

Very simple case. We want to adjust L1's offset, so we need to adjust L1 and L2 
because the change is transparent to L2.
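
To illustrate why both values move together, here is a small standalone model of that function. It is a sketch only: struct nested_tsc and its field names are made up for this mail and merely stand in for svm->vmcb->control.tsc_offset and svm->nested.hsave->control.tsc_offset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two offsets discussed above. */
struct nested_tsc {
    bool     nested;         /* true while L2 is running */
    uint64_t active_offset;  /* models svm->vmcb->control.tsc_offset */
    uint64_t hsave_offset;   /* models svm->nested.hsave->control.tsc_offset */
};

/*
 * Model of svm_adjust_tsc_offset(): shift the active offset and, when
 * nested, the saved L1 offset by the same amount, so the L2-L1 delta
 * is left untouched.
 */
static void adjust_tsc_offset(struct nested_tsc *s, int64_t adjustment)
{
    s->active_offset += (uint64_t)adjustment;
    if (s->nested)
        s->hsave_offset += (uint64_t)adjustment;
}
```

If the hardware TSC drops by some amount and we pass that amount as the adjustment, both the L1 and the L2 view of the TSC stay where they were.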


#VMEXIT:

/* Restore the original control entries */
copy_vmcb_control_area(vmcb, hsave);

which again does:

dst->tsc_offset = from->tsc_offset;

So we're setting the tsc offset to the value that's stored in the host save 
area.



Alex



Re: TSC in nested SVM and VMX

2010-10-01 Thread Zachary Amsden

On 10/01/2010 04:46 AM, Alexander Graf wrote:

On 01.10.2010, at 13:21, Nadav Har'El wrote:

   

On Thu, Sep 30, 2010, Zachary Amsden wrote about Re: TSC in nested SVM and 
VMX:
 

1)  When reading an MSR, we are not emulating the L2 guest; we are
DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2
guest is actually done by the code running /inside/ the L1 emulation, so
MSR reads for the L2 guest are handed by L1, and MSR reads for the L1
guest are handled by L0, which is this code.
...
So if we are currently running nested, the L1 tsc_offset is stored in
the nested.hsave field; the vmcb which is active is polluted by the L2
guest offset, which would be incorrect to return to the L1 emulation.
   

Thanks for the detailed explanation.

It seems, then, that the nested VMX logic is somewhat different from that
of the nested SVM. In nested VMX, if a function gets called when running
L1, the current VMCS will be that of L1 (aka vmcs01), not of its guest L2
(and I'm not even sure *which* L2 that would be when there are multiple
L2 guests for the one L1).
 

If the #vmexit comes while you're in L1, everything works on the L1's vmcb. If 
you hit it while in L2, everything works on the L2's vmcb unless special 
attention is taken.

The reason behind the TSC shift is very simple. With the tsc_offset setting 
we're trying to adjust the L1's offset. Adjusting the L1's offset means we need 
to adjust L1 and L2 alike, as the virtual L2's offset == L1 offset + vmcb L2 
offset, because L2's TSC is also offset by the amount L1 is.

So basically what happens is:

nested VMRUN:

 svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;

please note the +=!


svm_write_tsc_offset:

This gets called when we really want to current level's TSC offset only because 
the guest issued a tsc write. In L2 this means the L2's value.

 if (is_nested(svm)) {
         g_tsc_offset = svm->vmcb->control.tsc_offset -
                        svm->nested.hsave->control.tsc_offset;

Remember the difference between L1 and L2.

         svm->nested.hsave->control.tsc_offset = offset;

Set L1 to the new offset

 }

 svm->vmcb->control.tsc_offset = offset + g_tsc_offset;

Set L2 to new offset + delta.


So what this function does is that it treats TSC writes as L1 writes even while 
in L2 and adjusts L2 accordingly. Joerg, this sounds fishy to me. Are you sure 
this is intended and works when L1 doesn't intercept MSR writes to TSC?
   


L1 must intercept MSR writes to TSC for this to work.  It does, so all 
is well.



Re: TSC in nested SVM and VMX

2010-09-30 Thread Zachary Amsden

On 09/30/2010 12:50 PM, Nadav Har'El wrote:

Hi,

I noticed that the TSC handling code has recently changed, and since it
wasn't done correctly in the nested VMX patch, I wanted to take the opportunity
to fix it.

I looked at what nested SVM does about TSC, and most of it I think I
understand, but a couple of other things I don't understand.

The basic point is that when L1 starts L2 with some vmcs12.tsc_offset
(nested_vmcb->control.tsc_offset in SVM nomenclature), the TSC that L1
actually thinks it is offsetting from is already the hardware TSC plus
the vmcs01.tsc_offset. So when L0 runs L2, it needs to use the offset
vmcs01.tsc_offset + vmcs12.tsc_offset.

This explains the line
 svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
in nested_svm_vmrun().

In svm_adjust_tsc_offset(), when the hardware TSC changes (e.g., when moving
cores?) and the TSC offset is changed to keep the guest TSC unchanged,
when in nested mode we need to update both the currently running L2's tsc
offset and L1's, for when we eventually return to it, which explains
the code in that function.

But there are two things I don't understand:

1. in svm_get_msr(), MSR_IA32_TSC, there is a special is_nested() case which
basically ignores the above mentioned addition and uses just the L0-L1
tsc offset. Why? Why isn't svm->vmcb->control.tsc_offset + native_read_tsc()
the correct thing in both cases?

2. In svm_write_tsc_offset(), when in a nested guest, we don't write the offset
given to us, but (if I understand correctly) set this offset for the *L1*
guest (and set the L2's tsc offset accordingly, adding to it vmcs12's
tsc offset). Why was this done? Why was the simple code
  svm->vmcb->control.tsc_offset = offset;
and that's it, not the right thing to do in this function?
   


The offset for the L1 guest is copied out to a temporary structure, 
nested.hsave; this holds all L1 state information while the L2 guest is 
running.  The L2 guest state is copied directly into the L1 control 
block, and runs IN PLACE in that structure.  Upon nested vmexit, notice 
that the tsc_offset is NOT subtracted, rather, the original offset from 
nested.hsave is copied back into place.


Now, this answers both of your questions.

1)  When reading an MSR, we are not emulating the L2 guest; we are 
DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2 
guest is actually done by the code running /inside/ the L1 emulation, so 
MSR reads for the L2 guest are handled by L1, and MSR reads for the L1 
guest are handled by L0, which is this code.


So if we are currently running nested, the L1 tsc_offset is stored in 
the nested.hsave field; the vmcb which is active is polluted by the L2 
guest offset, which would be incorrect to return to the L1 emulation.


2) When writing an MSR, we are also writing directly for the L1 
emulation.  If we are currently nested, we want the new tsc_offset to be 
copied back into place from nested.hsave when the L2 guest triggers a 
vmexit.  So if nested, we must write the offset for L1 emulation into 
nested.hsave.  We must also apply the delta which was created by 
applying this new L1 offset to L2, as we presume, the offset is being 
used to correct for hardware variations.  This is why we compute 
g_tsc_offset and add it to the active L2 offset (which is running live, 
directly in the original vmcb).
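
The read side (point 1) can be sketched like this. It is a simplified model, not the actual svm_get_msr() code; struct nested_tsc and l1_read_tsc() are hypothetical names, and the host TSC is passed in rather than read with native_read_tsc().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two offsets discussed above. */
struct nested_tsc {
    bool     nested;         /* true while L2 is running */
    uint64_t active_offset;  /* models svm->vmcb->control.tsc_offset */
    uint64_t hsave_offset;   /* models svm->nested.hsave->control.tsc_offset */
};

/*
 * TSC value as L1 should see it. While L2 runs, the active vmcb carries
 * the combined L1+L2 offset, so L1's own offset has to come from hsave.
 */
static uint64_t l1_read_tsc(const struct nested_tsc *s, uint64_t host_tsc)
{
    uint64_t l1_offset = s->nested ? s->hsave_offset : s->active_offset;

    return host_tsc + l1_offset;
}
```

Either way L1 gets the same answer, whether or not an L2 guest happens to be running at the time of the read.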


Now, when we exit and re-enter L2, observe:

delta = L2 offset - L1 offset = svm->vmcb->control.tsc_offset -
        svm->nested.hsave->control.tsc_offset


now emulation changes the underlying L1 offset to L1':

L1 offset = L1'
active L2 offset += L1' - L1  (*)

Note vmcb is unaffected - vmexit does NOT preserve tsc_offset for the L2 
guest.

However, next time vmrun for L2 is attempted, we compute:

active L2 offset = L1' + L2 offset = (L1 + L2 offset) + (L1' - L1) = (*)

so equivalence is preserved.
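
The same argument as a toy model that can be compiled and checked. This is a sketch under my own simplifications: the struct and function names are invented for illustration, and only the tsc_offset handling of nested VMRUN, svm_write_tsc_offset and nested #VMEXIT is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two offsets discussed above. */
struct nested_tsc {
    bool     nested;         /* true while L2 is running */
    uint64_t active_offset;  /* models svm->vmcb->control.tsc_offset */
    uint64_t hsave_offset;   /* models svm->nested.hsave->control.tsc_offset */
};

/* Nested VMRUN: save L1's offset, then add L2's offset on top (the +=). */
static void nested_vmrun(struct nested_tsc *s, uint64_t l2_offset)
{
    s->hsave_offset = s->active_offset;
    s->active_offset += l2_offset;
    s->nested = true;
}

/* Model of svm_write_tsc_offset(): set L1's offset, keep the L2-L1 delta. */
static void write_tsc_offset(struct nested_tsc *s, uint64_t offset)
{
    uint64_t g_tsc_offset = 0;

    if (s->nested) {
        g_tsc_offset = s->active_offset - s->hsave_offset;
        s->hsave_offset = offset;
    }
    s->active_offset = offset + g_tsc_offset;
}

/* Nested #VMEXIT: L1's offset is copied back from hsave, not subtracted. */
static void nested_vmexit(struct nested_tsc *s)
{
    s->active_offset = s->hsave_offset;
    s->nested = false;
}
```

With L1 = 100, L2 offset = 30 and L1' = 250, the active offset after the write is 280, and an exit followed by a fresh VMRUN with the same L2 offset also yields 280, which is the (*) equality above.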


It took me many hours of staring and scratching my head to figure out 
that code.  I was convinced many times that it was wrong or buggy, until 
I realized how it actually works.


It gets more complicated when you realize the L1 guest can also be 
manipulating TSC offset for the L2 guest, inside of the vmcb that we are 
running in place.  However, that is beyond the scope of our work; bugs 
in the L1 guest stay in the L1 guest, and it will do its own MSR read / 
write and tsc_offset emulation beyond our visibility.


We really should write this down in a document somewhere.  Avi, shall I 
add this to the timekeeping docs?  (And I hope I got all the L1 and L2s 
correct in the above - apologies if I made it even more confusing by 
misstating something).


Zach