On Mon, Jan 14, 2008 at 02:06:47PM -0200, Marcelo Tosatti wrote:
> Hi Avi,
> 
> On Sun, Jan 13, 2008 at 02:19:29PM +0200, Avi Kivity wrote:
> > Marcelo Tosatti wrote:
> > >The boot TSC sync check is failing on recent Linux SMP guests on TSC
> > >stable hosts.
> > >
> > >  
> > 
> > What about tsc unstable hosts?  If your patch convinces the guest its 
> > tsc is table, while the host tsc is not, then it may cause confusion 
> > later on.
> 
> The adjustment to zero won't fool the guest because it assumes that the
> TSC's are synchronized. It will simply set the guest TSC to zero on all
> VCPUs based on the time VCPU0 was initialized.
> 
> That is, setting -(vcpu[0].first_tsc) on all VCPU's will not correct any
> synchronization problem.
> 
> > >Following patch attempts to address the problems, which are:
> > >
> > >1) APIC_DM_STARTUP, which is only called for vcpu's other than vcpu0,
> > >will trigger ->vcpu_reset(), setting the TSC to 0. Fix that by moving
> > >the guest_write_tsc(0) to vcpu_load.
> > >
> > >2) vcpu's are initialized at different times by QEMU (vcpu0 init happens
> > >way earlier than the rest). Fix that by initializing the TSC of vcpu's >
> > >0 with reference to vcpu0 init tsc value. This way TSC synchronization
> > >is kept (apparently Xen does something similar).
> > >
> > >3) The code which adjusts the TSC of a VCPU on physical CPU switch is
> > >meant to guarantee that the guest sees a monotonically increasing value.
> > >However there is a large gap, in terms of clocks, between the time which
> > >the source CPU TSC is read and the time the destination CPU TSC is read.
> > >So move those two reads to happen at vcpu_clear.
> > >
> > >I believe that 3) is the reason for the -no-kvm-irqchip problems
> > >reported by Avi on RHEL5, with kernels < 2.6.21 (where vcpu->cpu pinning
> > >would fix the problem). Unfortunately I could reproduce that problem.
> > >
> > >4-way guest with constant tick at 250Hz now reliably sees the TSC's
> > >synchronized, and idle guest CPU consumption is reduced by 50% (3-4%
> > >instead of 8%, the latter with acpi_pm_good boot parameter).
> > >
> > >
> > >diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > >index 4741806..e1c8cf4 100644
> > >--- a/arch/x86/kvm/vmx.c
> > >+++ b/arch/x86/kvm/vmx.c
> > >@@ -47,6 +47,8 @@ struct vcpu_vmx {
> > >   struct kvm_vcpu       vcpu;
> > >   int                   launched;
> > >   u8                    fail;
> > >+  u64                   first_tsc;
> > >+  u64                   tsc_this;
> > >   u32                   idt_vectoring_info;
> > >   struct kvm_msr_entry *guest_msrs;
> > >   struct kvm_msr_entry *host_msrs;
> > >@@ -254,6 +256,7 @@ static void vcpu_clear(struct vcpu_vmx *vmx)
> > >   if (vmx->vcpu.cpu == -1)
> > >           return;
> > >   smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
> > >+  rdtscll(vmx->tsc_this);
> > >   vmx->launched = 0;
> > > }
> > > 
> > >@@ -480,6 +483,8 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
> > >   reload_host_efer(vmx);
> > > }
> > > 
> > >+static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
> > >+
> > > /*
> > >  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
> > >  * vcpu mutex is already taken.
> > >@@ -488,7 +493,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> > >cpu)
> > > {
> > >   struct vcpu_vmx *vmx = to_vmx(vcpu);
> > >   u64 phys_addr = __pa(vmx->vmcs);
> > >-  u64 tsc_this, delta;
> > >+  u64 delta;
> > > 
> > >   if (vcpu->cpu != cpu) {
> > >           vcpu_clear(vmx);
> > >@@ -511,6 +516,19 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> > >cpu)
> > >           struct descriptor_table dt;
> > >           unsigned long sysenter_esp;
> > > 
> > >+          if (unlikely(vcpu->cpu == -1)) {
> > >  
> > 
> > This can happen after migration, I believe.
> 
> Right now the destination host will do ->vcpu_reset() on all VCPU's on
> startup... so its already broken. This patch is not introducing any
> issues, just changing where it happens :)
> 
> Perhaps fixing migration should be the subject of a separate patch?
> 
> > >+                  rdtscll(vcpu->arch.host_tsc);
> > >+                  vmx->tsc_this = vcpu->arch.host_tsc;
> > >+                  if (vcpu->vcpu_id == 0) {
> > >+                          guest_write_tsc(vcpu->arch.host_tsc, 0);
> > >+                          vmx->first_tsc = vcpu->arch.host_tsc;
> > >+                  } else {
> > >+                          struct vcpu_vmx *cpu0;
> > >+                          cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
> > >+                          guest_write_tsc(cpu0->first_tsc, 0);
> > >+                  }
> > >+          }
> > >+
> > >  
> > 
> > Depending on vcpu_id == 0 can cause problems (for example, if vcpu 0 is 
> > somehow not the first to run).
> 
> But QEMU will always run kvm_create() first (which does
> kvm_create_vcpu(0)), then start the other threads later. So I don't see
> how that can happen.
> 
> > We might initialize the tsc base on vm creation, and if the host tsc is 
> > synced, then the guest tsc should also be.
> > 
> > >           vcpu->cpu = cpu;
> > >           /*
> > >            * Linux uses per-cpu TSS and GDT, so set these when 
> > >            switching
> > >@@ -526,8 +544,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> > >cpu)
> > >           /*
> > >            * Make sure the time stamp counter is monotonous.
> > >            */
> > >-          rdtscll(tsc_this);
> > >-          delta = vcpu->arch.host_tsc - tsc_this;
> > >+          delta = vcpu->arch.host_tsc - vmx->tsc_this;
> > >           vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
> > >  
> > 
> > This is a little roundabout, how about moving the delta calculation 
> > immediately after the call to vcpu_clear()?
> > 
> > I don't think this is the cause of the problem, it can't account for 
> > more than a few hundred cycles, compared to the much greater vmentry cost.
> 
> Actually it accounts for several thousand cycles warp by the time the
> kernel checks the TSC synchronization between CPU's, which happens very
> early on boot.
> 
> You saw the hang on userspace init, by then there could be many
> VCPU->CPU switchs.
> 
> > 
> > Anyway it should be in a separate patch.
> 
> How does the following look (minus proper changelog):

Actually setting the TSC_OFFSET at vcpu_clear() seems a bit early:

vmwrite error: reg 2010 value 82470677 (err -2109274505)
Pid: 5590, comm: qemu-system-x86 Not tainted 2.6.24-rc6-ge7706133 #26

Call Trace:
 [<ffffffff880a5983>] :kvm_intel:vmwrite_error+0x22/0x28
 [<ffffffff880a5a99>] :kvm_intel:vcpu_clear+0x54/0x60
 [<ffffffff880a5ad0>] :kvm_intel:vmx_vcpu_load+0x29/0x12b
 [<ffffffff8024d803>] get_futex_value_locked+0x1d/0x38
 [<ffffffff8808fcf1>] :kvm:kvm_arch_vcpu_ioctl_run+0x13/0x4b9
 [<ffffffff8808c3df>] :kvm:kvm_vcpu_ioctl+0xda/0x2dd

So the first patch seemed alright (taking into account the comments from
my previous email, that migration should probably be fixed in a separate
patch since its broken already and that its guaranteed that vcpu0 is the
first to hit vcpu_load).

Avi?

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel

Reply via email to