RE: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Dong, Eddie
> >
> > Even if the device driver doesn't support migration, you still want to
> > migrate VM? That maybe risk and we should add the "bad path" for the
> > driver at least.
> 
> At a minimum we should have support for hot-plug if we are expecting to
> support migration.  You would simply have to hot-plug the device before you
> start migration and then return it after.  That is how the current bonding
> approach for this works if I am not mistaken.

Hotplug is good for eliminating the device-specific state clone, but the bonding
approach is very network specific; it doesn't work for other devices such as
FPGA devices, QAT devices and GPU devices, which we plan to support gradually :)

> 
> The advantage we are looking to gain is to avoid removing/disabling the
> device for as long as possible.  Ideally we want to keep the device active
> through the warm-up period, but if the guest doesn't do that we should still
> be able to fall back on the older approaches if needed.
> 

RE: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Dong, Eddie
> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu  wrote:
> > On 2015-11-25 13:30, Alexander Duyck wrote:
> >> No, what I am getting at is that you can't go around and modify the
> >> configuration space for every possible device out there.  This
> >> solution won't scale.
> >
> >
> > PCI config space regs are emulated by Qemu, so we can find free PCI
> > config space regs for the faked PCI capability. Its position need not
> > be permanent.
> 
> Yes, but do you really want to edit every driver on every OS that you plan to
> support this on.  What about things like direct assignment of regular Ethernet
> ports?  What you really need is a solution that will work generically on any
> existing piece of hardware out there.

The fundamental assumption of this patch series is to modify the driver in the
guest to self-emulate or track the device state, so that migration becomes
possible.
I don't think we can do this by modifying only the OS, without modifying the
drivers, even when using the PCIe hotplug mechanism.
In the meantime, modifying the Windows OS is a big challenge, given that only
Microsoft can do it. Modifying a driver, on the other hand, is relatively
simple and manageable for a device vendor, if that vendor wants to support
state-clone based migration.
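
To make "track the device state in the driver" a bit more concrete, here is a
minimal sketch (plain C; the struct layout and helper names are made up for
illustration and are not an existing kernel API): the driver shadows whatever
it programs into the device, and replays the shadow into the new device
instance after migration.

#include <stdint.h>

/* Illustrative only: made-up names showing a guest driver shadowing what it
 * programs into the device, so the same values can be replayed after
 * migration without the hypervisor understanding the device. */
struct shadow_regs {
	uint32_t intr_mask;	/* last interrupt mask written            */
	uint64_t rx_ring_base;	/* guest-physical ring addresses          */
	uint64_t tx_ring_base;
	uint16_t rx_tail, tx_tail;
};

/* Every MMIO write goes through one helper that also updates the shadow. */
static void dev_write32(volatile uint32_t *mmio, uint32_t *shadow, uint32_t val)
{
	*shadow = val;		/* remember what we programmed            */
	*mmio = val;		/* program the real (assigned) device     */
}

/* After the VM resumes on the destination, the driver replays the shadow
 * into the new device instance instead of expecting HW state to migrate. */
static void dev_replay_intr_mask(volatile uint32_t *mmio_intr_mask,
				 const struct shadow_regs *s)
{
	*mmio_intr_mask = s->intr_mask;
	/* ring bases, tail pointers, etc. would be replayed the same way */
}

Only the driver knows which registers and ring state are enough to rebuild the
device, which is exactly why the modification belongs to the device vendor.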

Thx Eddie


RE: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-04 Thread Dong, Eddie
> > Thanks Dave:
> > Whatever randomness in values/branches/code paths the PVM and SVM may
> > have, it is only a performance issue. COLO never assumes the PVM and SVM
> > have the same internal machine state. From a correctness p.o.v., as long
> > as the PVM and SVM generate identical responses, we can view the SVM as a
> > valid replica of the PVM, and the SVM can take over when the PVM suffers
> > a hardware failure. We can view the client as talking to the SVM all
> > along, without the notion of the PVM. Of course, if the SVM dies, we can
> > regenerate a copy of the PVM with a new checkpoint too.
> > The SOCC paper has the detailed recovery model :)
> 
> I've had a read; I think the bit I was asking about was what you labelled
> 'D' in that paper's fig.4 - so I think that does explain it for me.

Very good :)

> But I also have some more questions:
> 
>   1) 5.3.3 Web server
> a) In fig 11 it shows Remus's performance dropping off with the number
> of threads - why is that? Is it
>just an increase in the amount of memory changes in each
> snapshot?

I didn't dig into the details; we just documented the throughput we observed.
I felt it was a bit strange too - the dirty memory page set may be larger than
in the small-connection case - but I am not sure, and that is the data we
saw :(

> b) Is fig 11/12 measured with all of the TCP optimisations shown in fig
> 13 on?

Yes.

> 
>   2) Did you manage to overcome the issue shown in 5.6 with newer guest
> kernels degredation - could you just fall
>  back to micro checkpointing if the guests diverge too quickly?

In general, I would say the COLO performance for these 2 workloads is pretty
good, and I actually didn't include subsection 5.6 initially. It was the
conference shepherd who asked me to add this paragraph to make the paper more
balanced :)

In summary, COLO can have very good MP-guest performance compared with Remus,
at the cost of potential optimization/modification effort in the guest TCP/IP
stack. One solution may not work for all workloads, but it leaves a large room
for OSVs to provide customized solutions for specific usages -- which I think
is very good for the open source business model: make money through
consulting. Huawei Technologies Ltd. announced support for COLO in their cloud
OS, probably for a specific usage too.

> 
>   3) Was the link between the two servers for synchronisation a low-latency
> dedicated connection?

We used a 10 Gbps NIC in the paper, and yes, it is a dedicated link, but the
solution itself doesn't require a dedicated link.

> 
>   4) Did you try an ftp PUT benchmark using external storage - i.e. that
> wouldn't have the local disc
>  synchronisation overhead?

Not yet.
External network shared storage works, but today the performance may not be
that good, because our optimization so far is still very limited. It is just
an initial effort to make the 2 common workloads happy. We believe there is
large room ahead to make the response of the TCP/IP stack more predictable.
Once the basic COLO stuff is ready for production and accepted by the
industry, we may be able to influence the TCP community to keep this kind of
predictability in mind for future protocol development, which will greatly
help the performance.


Thx Eddie


RE: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-04 Thread Dong, Eddie
> >
> > Let me clarify on this issue. COLO doesn't ignore the TCP sequence
> > number, but uses a new implementation to make the sequence number
> > best-effort identical between the primary VM (PVM) and secondary VM
> > (SVM). Likely, the VMM has to synchronize the emulation of the
> > randomization number generation mechanism between the PVM and SVM, like
> > the lock-stepping mechanism does.
> >
> > Furthermore, for long TCP connections, we can rely on the (on-demand) VM
> > checkpoint to get identical sequence numbers in both the PVM and SVM.
> 
> That wasn't really my question; I was worrying about other forms of
> randomness, such as winners of lock contention, and other SMP
> non-determinisms, and I'm also worried by what proportion of time the
> system can't recover from a failure due to being unable to distinguish an
> SVM failure from a randomness issue.
> 
Thanks Dave:
Whatever randomness in values/branches/code paths the PVM and SVM may have,
it is only a performance issue. COLO never assumes the PVM and SVM have the
same internal machine state. From a correctness p.o.v., as long as the PVM
and SVM generate identical responses, we can view the SVM as a valid replica
of the PVM, and the SVM can take over when the PVM suffers a hardware
failure. We can view the client as talking to the SVM all along, without the
notion of the PVM. Of course, if the SVM dies, we can regenerate a copy of
the PVM with a new checkpoint too.
The SOCC paper has the detailed recovery model :)

Thanks, Eddie





RE: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-04 Thread Dong, Eddie
> >
> > I didn't quite understand a couple of things though, perhaps you can
> > explain:
> >1) If we ignore the TCP sequence number problem, in an SMP machine
> > don't we get other randomnesses - e.g. which core completes something
> > first, or who wins a lock contention, so the output stream might not
> > be identical - so do those normal bits of randomness cause the
> > machines to flag as out-of-sync?
> 
> It's about COLO agent, CCing Congyang, he can give the detailed
> explanation.
> 

Let me clarify on this issue. COLO doesn't ignore the TCP sequence number,
but uses a new implementation to make the sequence number best-effort
identical between the primary VM (PVM) and secondary VM (SVM). Likely, the VMM
has to synchronize the emulation of the randomization number generation
mechanism between the PVM and SVM, like the lock-stepping mechanism does.

Furthermore, for long TCP connections, we can rely on the (on-demand) VM
checkpoint to get identical sequence numbers in both the PVM and SVM.


Thanks, Eddie


RE: [KVM timekeeping 26/35] Catchup slower TSC to guest rate

2010-09-06 Thread Dong, Eddie
Zachary:
	Will you extend the logic to cover the situation where the host TSC
runs at a higher rate than the guest rate but the PCPU is over-committed? In
that case, we can likely use the time spent while the VCPU is scheduled out to
catch up as well. Of course, if the VCPU's scheduled-out time is not enough to
compensate for the extra cycles caused by the fast host TSC (exceeding a
threshold), we will eventually have to fall back to trap-and-emulate mode.
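
Roughly, the compensation I have in mind looks like the sketch below (plain C
with made-up names, not the actual KVM code): at VCPU entry, compare the TSC
value the guest should see, computed from elapsed wall time at the guest rate,
with the highest value it may already have observed, and only fall back to
trapping RDTSC when the drift exceeds a threshold.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sketch.  The guest's virtual TSC rate is lower than the host
 * TSC rate: while the VCPU runs, its TSC drifts ahead; while it is scheduled
 * out, the "correct" guest TSC keeps advancing and can catch up with the
 * value the guest last observed. */
struct vtsc {
	uint64_t guest_tsc_khz;		/* rate the guest should see           */
	uint64_t last_seen;		/* highest guest TSC possibly observed */
	uint64_t drift_threshold;	/* beyond this, trap-and-emulate RDTSC */
};

/* Called at VCPU entry; elapsed_ns is wall time since the guest's TSC base.
 * Returns false when passthrough can no longer stay monotonic and accurate. */
static bool vtsc_resync(struct vtsc *v, uint64_t elapsed_ns,
			uint64_t *guest_tsc_now)
{
	uint64_t correct = elapsed_ns * v->guest_tsc_khz / 1000000; /* cycles */

	if (correct >= v->last_seen) {
		*guest_tsc_now = correct;	/* caught up: resync the offset */
		return true;
	}
	if (v->last_seen - correct > v->drift_threshold)
		return false;			/* too far ahead: trap RDTSC    */
	*guest_tsc_now = v->last_seen;		/* stay monotonic, retry later  */
	return true;
}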

Thx, Eddie

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Zachary Amsden
Sent: 2010-08-20 16:08
To: kvm@vger.kernel.org
Cc: Zachary Amsden; Avi Kivity; Marcelo Tosatti; Glauber Costa; Thomas 
Gleixner; John Stultz; linux-ker...@vger.kernel.org
Subject: [KVM timekeeping 26/35] Catchup slower TSC to guest rate

Use the catchup code to continue adjusting the TSC when
running at lower than the guest rate.

Signed-off-by: Zachary Amsden 
---
 arch/x86/kvm/x86.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4215d7..086d56a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1013,8 +1013,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
 	}
 	local_irq_restore(flags);
-	if (catchup)
+	if (catchup) {
+		if (this_tsc_khz < v->kvm->arch.virtual_tsc_khz)
+			vcpu->tsc_rebase = 1;
 		return 0;
+	}
 
 	/*
 	 * Time as measured by the TSC may go backwards when resetting the base
@@ -5022,6 +5025,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	kvm_guest_exit();
 
+	/* Running on slower TSC without kvmclock, we must bump TSC */
+	if (vcpu->arch.tsc_rebase)
+		kvm_request_clock_update(vcpu);
+
 	preempt_enable();
 
 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-- 
1.7.1



RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

2010-08-03 Thread Dong, Eddie
Arnd Bergmann wrote:
> On Friday 30 July 2010 17:51:52 Shirley Ma wrote:
>> On Fri, 2010-07-30 at 16:53 +0800, Xin, Xiaohui wrote:
 Since vhost-net already supports macvtap/tun backends, do you think
 whether it's better to implement zero copy in macvtap/tun than
 inducing a new media passthrough device here?
 
>>> 
>>> I'm not sure if there will be more duplicated code in the kernel.
>> 
>> I think it should be less duplicated code in the kernel if we use
>> macvtap to support what media passthrough driver here. Since macvtap
>> has support virtio_net head and offloading already, the only missing
>> func is zero copy. Also QEMU supports macvtap, we just need add a
>> zero copy flag in option.
> 
> Yes, I fully agree and that was one of the intended directions for
> macvtap to start with. Thank you so much for following up on that,
> I've long been planning to work on macvtap zero-copy myself but it's
> now lower on my priorities, so it's good to hear that you made
> progress on it, even if there are still performance issues.
> 

But zero-copy is a generic Linux feature that can be used by other VMMs as
well, if their BE service drivers want to incorporate it. If we can make the
mp device VMM-agnostic (it may not be yet in the current patch), that will
help Linux more.


Thx, Eddie


RE: [PATCH 0/24] Nested VMX, v5

2010-07-09 Thread Dong, Eddie
Nadav Har'El wrote:
> Hi Avi,
> 
> This is a followup of our nested VMX patches that Orit Wasserman
> posted in December. We've addressed most of the comments and concerns
> that you and others on the mailing list had with the previous patch
> set. We hope you'll find these patches easier to understand, and
> suitable for applying to KVM. 
> 
> 
> The following 24 patches implement nested VMX support. The patches
> enable a guest to use the VMX APIs in order to run its own nested
> guests. I.e., it allows running hypervisors (that use VMX) under KVM.
> We describe the theory behind this work, our implementation, and its
> performance characteristics, 
> in IBM Research report H-0282, "The Turtles Project: Design and
> Implementation of Nested Virtualization", available at:
> 
>   http://bit.ly/a0o9te
> 
> The current patches support running Linux under a nested KVM using
> shadow page table (with bypass_guest_pf disabled). They support
> multiple nested hypervisors, which can run multiple guests. Only
> 64-bit nested hypervisors are supported. SMP is supported. Additional
> patches for running Windows under nested KVM, and Linux under nested
> VMware server, and support for nested EPT, are currently running in
> the lab, and will be sent as follow-on patchsets. 
> 

Nadav & All:
	Thanks for the posting; in general the patches are well written. I
like the concept of vmcsXY and I feel it is pretty clear (better than my
previous naming as well), but there is some confusion inside, especially
around the term "shadow", which I find quite hard to follow.

Comments from me:
	1: Basically there are 2 different types of VMCS: one is defined by
hardware, whose layout is unknown to the VMM. The other is defined by the VMM
(this patch) and used for vmcs12.

	The former uses "struct vmcs" to describe its data instances, but the
latter doesn't have a clear definition (or is it struct vmcs12?). I suggest we
have a distinct struct for this, for example "struct sw_vmcs" (software vmcs)
or "struct vvmcs" (virtual vmcs).

	2: vmcsXY (vmcs12, vmcs02, vmcs01) are instances of either "struct
vmcs" or "struct sw_vmcs", but not structures themselves. A clear distinction
between data structure and instance helps, IMO.

	3: We may use a prefix or suffix in addition to vmcsXY to explicitly
state the format of an instance. For example, vmcs02 in the current patch is
for hardware use, hence it is an instance of "struct vmcs", but vmcs01 is an
instance of "struct sw_vmcs". A prefix or suffix helps readers understand.

	4: Renaming l2_vmcs to vmcs02, l1_shadow_vmcs to vmcs01, and l1_vmcs
to vmcs02, with such a prefix/suffix, would strengthen the vmcsXY concept
above.


	5: Guest VMPTRLD emulation. The current patch creates a vmcs02
instance each time the guest does VMPTRLD, and frees the instance at VMCLEAR.
The code may fail if the number of (un-VMCLEARed) vmcs instances exceeds a
certain threshold, to avoid denial of service. That is fine, but it brings
additional complexity and may cost a lot of memory. I think we can emulate
using the concept of a "cached vmcs" here, in case the L1 VMM doesn't VMCLEAR
in time: the L0 VMM can simply flush those vmcs02s back to guest memory, i.e.
vmcs12, as needed. For example, if the number of cached vmcs02s exceeds 10, we
can flush automatically.
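
A rough sketch of that cached-vmcs02 idea (plain C, made-up types and helpers,
not the patch code): a small fixed-size cache keyed by the vmcs12 guest
physical address, with LRU eviction that flushes the evicted vmcs02 back to
its vmcs12 in guest memory.

#include <stdint.h>

#define VMCS02_CACHE_SIZE 10

struct vmcs02_entry {
	uint64_t vmcs12_gpa;	/* which guest VMCS this hardware VMCS backs */
	void	*hw_vmcs;	/* the real vmcs02                           */
	uint64_t last_use;	/* for LRU eviction                          */
};

struct vmcs02_cache {
	struct vmcs02_entry e[VMCS02_CACHE_SIZE];
	int used;
	uint64_t clock;
};

static struct vmcs02_entry *vmcs02_get(struct vmcs02_cache *c, uint64_t gpa,
				       void *(*alloc_vmcs)(void),
				       void (*flush_to_guest)(struct vmcs02_entry *))
{
	int i, lru = 0;

	for (i = 0; i < c->used; i++)
		if (c->e[i].vmcs12_gpa == gpa) {
			c->e[i].last_use = ++c->clock;
			return &c->e[i];	/* hit: reuse existing vmcs02 */
		}

	if (c->used < VMCS02_CACHE_SIZE) {
		i = c->used++;
		c->e[i].hw_vmcs = alloc_vmcs();	/* first reference: allocate  */
	} else {
		for (i = 1; i < VMCS02_CACHE_SIZE; i++)
			if (c->e[i].last_use < c->e[lru].last_use)
				lru = i;
		flush_to_guest(&c->e[lru]);	/* write state back to vmcs12 */
		i = lru;
	}
	c->e[i].vmcs12_gpa = gpa;
	c->e[i].last_use = ++c->clock;
	return &c->e[i];
}

This bounds both the memory and the number of loaded hardware VMCSs, without
failing VMPTRLD when L1 is merely lazy about VMCLEAR.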


Thx, Eddie






RE: [PATCH 8/24] Hold a vmcs02 for each vmcs12

2010-07-06 Thread Dong, Eddie
> +/* Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if
> one + * does not already exist. The allocation is done in L0 memory,
> so to avoid + * denial-of-service attack by guests, we limit the
> number of concurrently- + * allocated vmcss. A well-behaving L1 will
> VMCLEAR unused vmcs12s and not + * trigger this limit.
> + */
> +static const int NESTED_MAX_VMCS = 256;
> +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> + struct vmcs_list *new_l2_guest;
> + struct vmcs *l2_vmcs;
> +
> + if (nested_get_current_vmcs(vcpu))
> + return 0; /* nothing to do - we already have a VMCS */
> +
> + if (to_vmx(vcpu)->nested.l2_vmcs_num >= NESTED_MAX_VMCS)
> + return -ENOMEM;
> +
> + new_l2_guest = (struct vmcs_list *)
> + kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
> + if (!new_l2_guest)
> + return -ENOMEM;
> +
> + l2_vmcs = alloc_vmcs();

I didn't see where it was used. Hints on the usage?

> + if (!l2_vmcs) {
> + kfree(new_l2_guest);
> + return -ENOMEM;
> + }
> +
> + new_l2_guest->vmcs_addr = to_vmx(vcpu)->nested.current_vmptr;
> + new_l2_guest->l2_vmcs = l2_vmcs;
> + list_add(&(new_l2_guest->list),
> &(to_vmx(vcpu)->nested.l2_vmcs_list));
> +	to_vmx(vcpu)->nested.l2_vmcs_num++;
> +	return 0;
> +}
> +


RE: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12

2010-07-05 Thread Dong, Eddie
Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to
> actually run the L2 guest, vmcs02. prepare_vmcs02 appropriately
> merges the information in shadow_vmcs that L1 built for L2 (vmcs12),
> and that in the VMCS that we built for L1 (vmcs01).
>
> VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS),
> which makes it difficult for us to read from vmcs01 while writing to
> vmcs12. This is why we first make a copy of vmcs01 in memory
> (l1_shadow_vmcs) and then read that memory copy while writing to
> vmcs12.
>
> Signed-off-by: Nadav Har'El 
> ---
> --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
> @@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
>   return flexpriority_enabled;
>  }
>
> +static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu
> *vcpu) +{
> + return cpu_has_vmx_tpr_shadow() &&
> + get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> + CPU_BASED_TPR_SHADOW;
> +}
> +
> +static inline bool nested_cpu_has_secondary_exec_ctrls(struct
> kvm_vcpu *vcpu) +{
> + return cpu_has_secondary_exec_ctrls() &&
> + get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> +}
> +
> +static inline bool nested_vm_need_virtualize_apic_accesses(struct
> kvm_vcpu +   *vcpu)
> +{
> + return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> + (get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> + SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
> +}
> +
> +static inline bool nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
> +{
> + return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> + (get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> + SECONDARY_EXEC_ENABLE_EPT);
> +}
> +
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>   int i;
> @@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
>   preempt_enable();
>  }
>
> +int load_vmcs_host_state(struct shadow_vmcs *src)
> +{
> + vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
> + vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
> + vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
> + vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
> + vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
> + vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
> + vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
> +
> + vmcs_write64(TSC_OFFSET, src->tsc_offset);
> +
> + if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
> + vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
> +
> + vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
> +
> + vmcs_writel(HOST_CR0, src->host_cr0);
> + vmcs_writel(HOST_CR3, src->host_cr3);
> + vmcs_writel(HOST_CR4, src->host_cr4);
> + vmcs_writel(HOST_FS_BASE, src->host_fs_base);
> + vmcs_writel(HOST_GS_BASE, src->host_gs_base);
> + vmcs_writel(HOST_TR_BASE, src->host_tr_base);
> + vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
> + vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
> + vmcs_writel(HOST_RSP, src->host_rsp);
> + vmcs_writel(HOST_RIP, src->host_rip);
> + vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
> + vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
> +
> + return 0;
> +}
> +
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put(), but
> assumes
>   * vcpu mutex is already taken.
> @@ -1922,6 +1985,71 @@ static void vmclear_local_vcpus(void)
>   __vcpu_clear(vmx);
>  }
>
> +int load_vmcs_common(struct shadow_vmcs *src)
> +{
> + vmcs_write16(GUEST_ES_SELECTOR, src->guest_es_selector);
> + vmcs_write16(GUEST_CS_SELECTOR, src->guest_cs_selector);
> + vmcs_write16(GUEST_SS_SELECTOR, src->guest_ss_selector);
> + vmcs_write16(GUEST_DS_SELECTOR, src->guest_ds_selector);
> + vmcs_write16(GUEST_FS_SELECTOR, src->guest_fs_selector);
> + vmcs_write16(GUEST_GS_SELECTOR, src->guest_gs_selector);
> + vmcs_write16(GUEST_LDTR_SELECTOR, src->guest_ldtr_selector);
> + vmcs_write16(GUEST_TR_SELECTOR, src->guest_tr_selector);
> +
> + vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);
> +
> + if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> + vmcs_write64(GUEST_IA32_PAT, src->guest_ia32_pat);
> +
> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, src->vm_entry_intr_info_field);
> +	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +		src->vm_entry_exception_error_code);
> +	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, src->vm_entry_instruction_len);
> +
> +	vmcs_write32(GUEST_ES_LIMIT, src->guest_es_limit);
> +	vmcs_write32(GUEST_CS_LIMIT, src->guest_cs_limit);
> +	vmcs_write32

RE: [PATCH 10/24] Implement VMPTRLD

2010-07-05 Thread Dong, Eddie
Nadav Har'El wrote:
> This patch implements the VMPTRLD instruction.
> 
> Signed-off-by: Nadav Har'El 
> ---
> --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
> @@ -3829,6 +3829,26 @@ static int read_guest_vmcs_gpa(struct kv
>   return 0;
>  }
> 
> +static void set_rflags_to_vmx_fail_invalid(struct kvm_vcpu *vcpu)
> +{
> + unsigned long rflags;
> + rflags = vmx_get_rflags(vcpu);
> + rflags |= X86_EFLAGS_CF;
> + rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_ZF &
> + ~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> + vmx_set_rflags(vcpu, rflags);
> +}
> +
> +static void set_rflags_to_vmx_fail_valid(struct kvm_vcpu *vcpu)
> +{
> + unsigned long rflags;
> + rflags = vmx_get_rflags(vcpu);
> + rflags |= X86_EFLAGS_ZF;
> + rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_CF &
> + ~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> + vmx_set_rflags(vcpu, rflags);
> +}
> +
>  static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
>  {
>   unsigned long rflags;
> @@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
>   return 1;
>  }
> 
> +static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t
> guest_vmcs_addr) +{
> + bool ret;
> + struct vmcs12 *vmcs12;
> + struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
> + if (vmcs_page == NULL)
> + return 0;
> + vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
> + if (vmcs12->revision_id == VMCS12_REVISION)
> + ret = 1;
> + else {
> + set_rflags_to_vmx_fail_valid(vcpu);
> + ret = 0;
> + }
> + kunmap_atomic(vmcs12, KM_USER0);
> + kvm_release_page_dirty(vmcs_page);
> + return ret;
> +}
> +
> +/* Emulate the VMPTRLD instruction */
> +static int handle_vmptrld(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + gpa_t guest_vmcs_addr;
> +
> + if (!nested_vmx_check_permission(vcpu))
> + return 1;
> +
> + if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
> + set_rflags_to_vmx_fail_invalid(vcpu);
> + return 1;
> + }
> +
> + if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
> + return 1;
> +
> + if (vmx->nested.current_vmptr != guest_vmcs_addr) {
> + vmx->nested.current_vmptr = guest_vmcs_addr;
> +
> + if (nested_create_current_vmcs(vcpu)) {
> + printk(KERN_ERR "%s error could not allocate memory",
> + __func__);
> + return -ENOMEM;
> + }
> + }
> +
> + clear_rflags_cf_zf(vcpu);
> + skip_emulated_instruction(vcpu);

How about the "launch" state? Should we get that state from vmcs1x to
distinguish guest VMLAUNCH & VMRESUME?

> + return 1;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4153,7 +4224,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>   [EXIT_REASON_VMCALL]  = handle_vmcall,
>   [EXIT_REASON_VMCLEAR] = handle_vmclear,
>   [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
> - [EXIT_REASON_VMPTRLD] = handle_vmx_insn,
> + [EXIT_REASON_VMPTRLD] = handle_vmptrld,
>   [EXIT_REASON_VMPTRST] = handle_vmx_insn,
>   [EXIT_REASON_VMREAD]  = handle_vmx_insn,
>   [EXIT_REASON_VMRESUME]= handle_vmx_insn,



RE: [PATCH 9/24] Implement VMCLEAR

2010-07-05 Thread Dong, Eddie
Nadav Har'El wrote:
> This patch implements the VMCLEAR instruction.
> 
> Signed-off-by: Nadav Har'El 
> ---
> --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
> @@ -138,6 +138,8 @@ struct __attribute__ ((__packed__)) vmcs
>*/
>   u32 revision_id;
>   u32 abort;
> +
> + bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
>  };
> 
>  struct vmcs_list {
> @@ -3827,6 +3829,46 @@ static int read_guest_vmcs_gpa(struct kv
>   return 0;
>  }
> 
> +static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
> +{
> + unsigned long rflags;
> + rflags = vmx_get_rflags(vcpu);
> + rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
> + vmx_set_rflags(vcpu, rflags);
> +}
> +
> +/* Emulate the VMCLEAR instruction */
> +static int handle_vmclear(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + gpa_t guest_vmcs_addr, save_current_vmptr;
> +
> + if (!nested_vmx_check_permission(vcpu))
> + return 1;
> +
> + if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
> + return 1;
> +

The SDM specifies an alignment check, a range check and a reserved-bit check,
and may generate VMfail (VMCLEAR with invalid physical address), as well as an
"addr != VMXON pointer" check.
Are these missed here?

> + save_current_vmptr = vmx->nested.current_vmptr;
> +
> + vmx->nested.current_vmptr = guest_vmcs_addr;
> + if (!nested_map_current(vcpu))
> + return 1;
> + vmx->nested.current_l2_page->launch_state = 0;
> + nested_unmap_current(vcpu);
> +
> + nested_free_current_vmcs(vcpu);
> +
> + if (save_current_vmptr == guest_vmcs_addr)
> + vmx->nested.current_vmptr = -1ull;
> + else
> + vmx->nested.current_vmptr = save_current_vmptr;
> +
> + skip_emulated_instruction(vcpu);
> + clear_rflags_cf_zf(vcpu);

The SDM has a formal definition of VMsucceed. Clearing CF/ZF only is not
sufficient, as SDM 2B 5.2 mentions.
Any special concern here?

BTW, should we define formal VMfail() & VMsucceed() APIs, for easier
understanding and a direct mapping to the SDM?
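
For illustration, such helpers could look like the sketch below (standalone C,
not the patch's code; the VM-instruction error field write for VMfailValid is
left out), following the SDM Vol. 2B 5.2 conventions: VMsucceed clears all six
status flags, VMfailInvalid sets only CF, VMfailValid sets only ZF.

#include <stdint.h>

#define X86_EFLAGS_CF 0x0001
#define X86_EFLAGS_PF 0x0004
#define X86_EFLAGS_AF 0x0010
#define X86_EFLAGS_ZF 0x0040
#define X86_EFLAGS_SF 0x0080
#define X86_EFLAGS_OF 0x0800

#define VMX_STATUS_FLAGS (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | \
			  X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)

static uint64_t vmx_succeed(uint64_t rflags)
{
	return rflags & ~VMX_STATUS_FLAGS;		/* CF=PF=AF=ZF=SF=OF=0 */
}

static uint64_t vmx_fail_invalid(uint64_t rflags)
{
	return (rflags & ~VMX_STATUS_FLAGS) | X86_EFLAGS_CF;	/* only CF set */
}

/* The error number would additionally be written to the VM-instruction error
 * field of the current VMCS; omitted here. */
static uint64_t vmx_fail_valid(uint64_t rflags)
{
	return (rflags & ~VMX_STATUS_FLAGS) | X86_EFLAGS_ZF;	/* only ZF set */
}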

> + return 1;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4109,7 +4151,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>   [EXIT_REASON_HLT] = handle_halt,
>   [EXIT_REASON_INVLPG]  = handle_invlpg,
>   [EXIT_REASON_VMCALL]  = handle_vmcall,
> - [EXIT_REASON_VMCLEAR] = handle_vmx_insn,
> + [EXIT_REASON_VMCLEAR] = handle_vmclear,
>   [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
>   [EXIT_REASON_VMPTRLD] = handle_vmx_insn,
>   [EXIT_REASON_VMPTRST] = handle_vmx_insn,



RE: [PATCH v3] KVM: VMX: Execute WBINVD to keep data consistency with assigned devices

2010-06-28 Thread Dong, Eddie
Avi Kivity wrote:
> On 06/28/2010 10:30 AM, Dong, Eddie wrote:
>>> 
>>> Several milliseconds of non-responsiveness may not be acceptable for
>>> some applications.  So I think queue_work_on() and a clflush loop is
>>> better than an IPI and wbinvd.
>>> 
>>> 
>> Probably we should make it configurable. For RT usage models, we do
>> care about responsiveness more than performance, but for typical
>> server useg model, we'd better focus on performance in this issue.
>> WBINVD may perform much much better than CLFLUSH, and a mallicious
>> guest repeatedly issuing wbinvd may greatly impact the system
>> performance. 
>> 
> 
> I'm not even sure clflush can work.  I thought you could loop on just
> the cache size, but it appears you'll need to loop over the entire
> guest address space, which could take ages.

If the RT usage model comes into reality, we may have to do it this way,
though we'd pay a huge overhead :)
Are there any RT customers here?

> 
> So I guess we'll have to settle for wbinvd, just avoiding it when the
> hardware allows us to.

Yes, for now I agree we can just use wbinvd to emulate wbinvd :)

Thx, Eddie



RE: [PATCH v3] KVM: VMX: Execute WBINVD to keep data consistency with assigned devices

2010-06-28 Thread Dong, Eddie
Avi Kivity wrote:
> On 06/28/2010 09:42 AM, Sheng Yang wrote:
 +static void wbinvd_ipi(void *garbage)
 +{
 +  wbinvd();
 +}
 
>>> Like Jan mentioned, this is quite heavy.  What about a clflush()
>>> loop instead?  That may take more time, but at least it's
>>> preemptible.  Of course, it isn't preemptible in an IPI.
>>> 
>> 
>> I think this kind of behavior happened rarely, and most recent
>> processor should have WBINVD exit which means it's an IPI... So I
>> think it's maybe acceptable here. 
>> 
>> 
> 
> Several milliseconds of non-responsiveness may not be acceptable for
> some applications.  So I think queue_work_on() and a clflush loop is
> better than an IPI and wbinvd.
> 

Probably we should make it configurable. For RT usage models we care about
responsiveness more than performance, but for the typical server usage model
we'd better focus on performance for this issue. WBINVD may perform much, much
better than CLFLUSH, and a malicious guest repeatedly issuing wbinvd may
greatly impact system performance.
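
For reference, the CLFLUSH alternative is basically the loop below (standalone
C sketch; the 64-byte line size is assumed instead of being read from CPUID).
Unlike WBINVD it can be chunked and preempted, but it has to walk the whole
range, which for an entire guest address space is very expensive.

#include <stddef.h>

#define CACHE_LINE 64

static void clflush_range(volatile void *addr, size_t len)
{
	volatile char *p = (volatile char *)addr;
	volatile char *end = p + len;

	asm volatile("mfence" ::: "memory");		/* order prior stores  */
	for (; p < end; p += CACHE_LINE)
		asm volatile("clflush %0" : "+m" (*p));	/* evict one line      */
	asm volatile("mfence" ::: "memory");		/* wait for completion */
}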

thx, eddie


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-24 Thread Dong, Eddie
Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
>> 
>> I mean once the frontend side driver post the buffers to the backend
>> driver, the backend driver will "immediately" use that buffers to
>> compose skb or gro_frags and post them to the assigned host NIC
>> driver as receive buffers. In that case, if the backend driver
>> recieves a packet from the NIC that requires to do copy, it may be
>> unable to find additional free guest buffer because all of them are
>> already used by the NIC driver. We have to reserve some guest
>> buffers for the possible copy even if the buffer address is not
>> identified by original skb :(
> 
> OK I see what you mean.  Can you tell me how does Xiaohui's
> previous patch-set deal with this problem?
> 
> Thanks,

In the current patch, each SKB for the assigned device (an SRIOV VF or NIC or
a complete queue pair) uses a buffer from the guest, so it eliminates copying
completely in software and requires the hardware to do so. If we had an
additional place to store a buffer per skb (which may cause a copy later on),
we could do the copy later or re-post the buffer to the assigned NIC driver
later on. But that may not be very clean either :(

BTW, some hardware may require a certain level of packet copying, such as for
broadcast packets on very old VMDq devices, which is not addressed in
Xiaohui's previous patches yet. We may address this by implementing an
additional virtqueue between guest and host for the slow path (broadcast
packets only here), with additional complexity in the FE/BE drivers.

Thx, Eddie


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-23 Thread Dong, Eddie
Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:
>> 
>> Xiaohui & Herbert:
>>  Mixing copy of head & 0-copy of bulk data imposes additional
>>  challange to find the guest buffer. The backend driver may be
>> unable to find a spare guest buffer from virtqueue at that time
>> which may block the receiving process then. Can't we completely
>> eliminate netdev_alloc_skb here? Assigning guest buffer at this time
>> makes life much easier.
> 
> I'm not sure I understand you concern.  If you mean that when
> the guest doesn't give enough pages to the host and the host
> can't receive on behalf of the guest then isn't that already
> the case with the original patch-set?
> 

I mean that once the frontend driver posts the buffers to the backend driver,
the backend driver will "immediately" use those buffers to compose skbs or
gro_frags and post them to the assigned host NIC driver as receive buffers. In
that case, if the backend driver receives a packet from the NIC that requires
a copy, it may be unable to find an additional free guest buffer, because all
of them are already used by the NIC driver. We have to reserve some guest
buffers for the possible copy even if the buffer address is not identified by
the original skb :(
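
The reservation idea is roughly the sketch below (plain C, made-up names): of
the buffers the guest posts, keep a small number aside that only the copy path
may use, so the NIC can never consume every buffer and starve the copy case.

#include <stddef.h>

#define COPY_RESERVE 16

struct guest_buf_pool {
	void	*bufs[1024];
	size_t	count;			/* buffers currently available */
};

/* Buffers handed to the NIC as receive buffers: never dip into the reserve. */
static void *get_buf_for_nic(struct guest_buf_pool *p)
{
	if (p->count <= COPY_RESERVE)
		return NULL;		/* leave the copy reserve untouched */
	return p->bufs[--p->count];
}

/* Buffers used when the backend must copy a packet: may use the reserve. */
static void *get_buf_for_copy(struct guest_buf_pool *p)
{
	if (p->count == 0)
		return NULL;
	return p->bufs[--p->count];
}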

Thx, Eddie


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-23 Thread Dong, Eddie

> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will
> allocate 
> as usual, the data pointed by skb->data will be copied into the first
> guest buffer. 
> That means we should reserve sufficient room in guest buffer. For PS
> mode 
> supported driver (for example ixgbe), the room will be more than 128.
> After 128bytes, 
> we will put the first frag data. Look into virtio-net.c the function
> page_to_skb() 
> and receive_mergeable(), that means we should modify guest virtio-net
> driver to 
> compute the offset as the parameter for skb_set_frag().
> 
> How do you think about this? Attached is a patch to how to modify the
> guest driver. 
> I reserve 512 bytes as an example, and transfer the header len of the
> skb in hdr->hdr_len. 
> 
Xiaohui & Herbert:
	Mixing a copied head & zero-copy bulk data imposes an additional
challenge in finding the guest buffer. The backend driver may be unable to
find a spare guest buffer in the virtqueue at that time, which may then block
the receiving process.
	Can't we completely eliminate netdev_alloc_skb here? Assigning the
guest buffer at this time makes life much easier.
	Thx, Eddie


RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-10 Thread Dong, Eddie

>> A VF interrupt usually happens at 4-8 kHz. How about virtio?
>> I assume virtio will be widely used together with legacy guests in INTx
>> mode.
>> 
> 
> True, but in time it will be replaced by MSI.
> 
> Note without vhost virtio is also in userspace, so there are lots of
> exits anyway for the status register.

A few months ago, we noticed the interrupt frequency of PV I/O in a previous
solution was almost the same as the physical NIC interrupt rate, which ticks
at ~4 kHz. Each PV I/O frontend driver (or its interrupt source) has a similar
interrupt frequency, which means Nx more interrupts. I guess virtio is in a
similar situation.

We then did an optimization for PV I/O to mitigate the interrupts to the guest
by setting an interrupt throttle on the backend side, because native NICs also
do it that way -- the so-called ITR register in Intel NICs. We saw 30-90% CPU
utilization savings depending on how many frontend driver interrupts were
involved. Not sure if this has been adopted on the vhost side.

One drawback of course is latency, but it is mostly tolerable if the rate is
reduced to ~1 kHz.
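
The backend-side throttle is essentially the logic below (plain C sketch with
made-up names, not the code we used): inject an interrupt only if enough time
has passed since the last one, otherwise just mark the work pending and let a
later injection or timer pick it up; min_interval_ns = 1,000,000 corresponds
to the ~1 kHz mentioned above.

#include <stdint.h>
#include <stdbool.h>

struct irq_throttle {
	uint64_t last_inject_ns;
	uint64_t min_interval_ns;	/* e.g. 1000000 ns for ~1 kHz */
	bool	 pending;
};

/* Returns true if the caller should inject the guest interrupt now. */
static bool irq_throttle_should_inject(struct irq_throttle *t, uint64_t now_ns)
{
	if (now_ns - t->last_inject_ns < t->min_interval_ns) {
		t->pending = true;	/* coalesce: deliver later */
		return false;
	}
	t->last_inject_ns = now_ns;
	t->pending = false;
	return true;
}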

Thx, Eddie


RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-09 Thread Dong, Eddie
Avi Kivity wrote:
> On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>> 
>> Besides VF IO interrupt and timer interrupt introduced performance
>> overhead risk, 
> 
> VF usually uses MSI

Typo, I meant PV I/O.
A VF interrupt usually happens at 4-8 kHz. How about virtio?
I assume virtio will be widely used together with legacy guests in INTx mode.

> 
>>   EOI message deliver from lapic to ioapic,
> 
> Only for non-MSI
> 
>>   which becomes in user land now, may have potential scalability
>> issue. For example, if we have a 64 VCPU guest, if each vcpu has
>> 1khz interrupt (or ipi), the EOI from guest will normally have to
>> involve ioapic module for clearance in 64khz which may have long
>> lock contentio.
> 
> No, EOI for IPI or for local APIC timer does not involve the IOAPIC.
> 
>> you may reduce the involvement of ioapic eoi by tracking ioapic
>> pin<->  vector map in kernel, but not sure if it is clean enough. 
> 
> It's sufficient to look at TMR, no?  For edge triggered I don't think
> we need the EOI.

Mmm, I noticed a difference between the new SDM and the old SDM.
In the old SDM, an IPI can have both edge and level trigger modes, but the new
SDM says only INIT has both choices.
Given that the new SDM eliminates the level trigger mode, it is OK.
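
For reference, the TMR check itself is trivial (standalone C sketch; the local
APIC's Trigger Mode Register is a 256-bit bitmap, 8 x 32-bit words, one bit per
vector): only a vector whose TMR bit is set, i.e. level-triggered, needs its
EOI propagated to the IOAPIC; edge-triggered EOIs can stay in the local APIC.

#include <stdint.h>
#include <stdbool.h>

static bool eoi_needs_ioapic(const uint32_t tmr[8], uint8_t vector)
{
	return tmr[vector >> 5] & (1u << (vector & 31));
}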

> 
> But, the amount of churn and risk worries me, so I don't think the
> move is worthwhile.

This also reminds me of the debate at the early stage of KVM, when Gregory
Haskins was working on an arbitrary choice of irqchip.
Even at that time (without SMP) the patch was very complicated.

I agree that from an ROI point of view, this move may not be worthwhile.
thx, eddie



RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-09 Thread Dong, Eddie
Avi Kivity wrote:
> I am currently investigating a problem with the a guest running Linux
> malfunctioning in the NMI watchdog code.  The problem is that we don't
> handle NMI delivery mode for the local APIC LINT0 pin; instead we
> expect ExtInt deliver mode or that the line is disabled completely. 
> In addition the i8254 timer is tied to the BSP, while in this case the
> timer can broadcast to all vcpus.
> 
> There is some code that tries to second-guess the guest and provide it
> the inputs it sees, but this is fragile.  The only way to get reliable
> operation is to emulate the hardware fully.
> 
> Now I'd much rather do that in userspace, since it's a lot of
> sensitive work.  I'll enumerate below the general motivation,
> advantages and disadvantages, and a plan for moving forward.
> 
> Motivation
> ==
> 
> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices.  Both devices are
> high interaction since they deal with interrupts; practically after
> every interrupt there is either a PIC ioport write, or an APIC bus
> message, both signalling an EOI operation.  Moving the PIT into the
> kernel allowed us to catch up with missed timer interrupt injections,
> and speeded up guests which read the PIT counters (e.g. tickless
> guests). 
> 
> However, modern guests running on modern qemu use MSI extensively;
> both virtio and assigned devices now have MSI support; and the
> planned VFIO only supports kernel delivery via MSI anyway; line based
> interrupts will need to be mediated by userspace.
> 
> The only high frequency non-MSI interrupt sources remaining are the
> various timers; and the default one, HPET, is in userspace (and having
> its own scaling problems as a result).  So in theory we can move PIC,
> IOAPIC, and PIT support to userspace and not lose much performance.
> 
> Moving the implementation to userspace allows us more flexibility, and
> more consistency in the implementation of timekeeping for the various
> clock chips; it becomes easier to follow the nuances of real hardware
> in this area.
> 
> Interestingly, while the IOAPIC/PIC code was written we proposed
> making it independent of the local APIC; had we done so, the move
> would have been much easier (simply dropping the existing code).
> 
> 
> Advantages of a move
> 
> 
> 1. Reduced kernel footprint
> 
> Good for security, and allows fixing bugs without reboots.
> 
> 2. Centralized timekeeping
> 
> Instead of having one solution for PIT timekeeping, and another for
> RTC and HPET timekeeping, we can have all timer chips in userspace. 
> The local APIC timer still needs to be in the kernel - it is much too
> high bandwidth to be in userspace; but on the other hand it is very
> different from the other timer chips.
> 
> 3. Flexibility
> 
> Easier to have wierd board layouts (multiple IOAPICs, etc.).  Not a
> very strong advantage.
> 
> Disadvantages
> =
> 
> 1. Still need to keep the old code around for a long while
> 
> We can't just rip it out - old userspace depends on it.  So the
> security advantages are only with cooperating userspace, and the
> other advantages only show up.
> 
> 2. Need to bring the qemu code up to date
> 
> The current qemu ioapic code lags some way behind the kernel; also
> need PIT timekeeping
> 
> 3. May need kernel support for interval-timer-follows-thread
> 
> Currently the timekeeping code has an optimization which causes the
> hrtimer that models the PIT to follow the BSP (which is most likely to
> receive the interrupt); this reduces cpu cross-talk.
> 
> I don't think the kernel interval timer code has such an optimization;
> we may need to implement it.
> 
> 4. Much churn
> 
> This is a lot of work.
> 
> 5. Risk
> 
> We may find out after all this is implemented that performance is not
> acceptable and all the work will have to be dropped.
> 
> 

Besides the performance overhead risk introduced by VF I/O interrupts and
timer interrupts, EOI message delivery from the LAPIC to the IOAPIC, which
would now happen in userland, may have a potential scalability issue. For
example, if we have a 64-VCPU guest and each vcpu takes 1 kHz of interrupts
(or IPIs), the EOIs from the guest will normally have to involve the ioapic
module for clearance at 64 kHz, which may see long lock contention. You may
reduce the involvement of the ioapic EOI by tracking the ioapic pin <-> vector
map in the kernel, but I am not sure that is clean enough.


RE: Nested virtualization without nested vmx

2010-03-02 Thread Dong, Eddie
Olivier Berghmans wrote:
> Hi,
> 
> Thanks for the reply. I have a Intel Q9550 processor. EPT was
> introduced in the Nehalem series, where my processor is not part of.
> Based on that, I conclude that I am not using EPT and thus using
> shadow page tables.
> 
> The problem of running (for example) a VirtualBox guest inside a KVM
> guest is a problem in the use of the shadow page tables? Is there a
> way I can test or confirm this so that I can know what exactly is the
> problem with the shadow page tables? Using EPT (in a Nehalem
> processor) the nesting might work without the nested vmx support in
> KVM?
> 

Based on the experience on the Xen side, I suspect the issue you saw is on the
shadow page table side.
With EPT or NPT, the complicated shadow page table is removed, so you may
avoid those issues and see something different.

Eddie


RE: Nested virtualization without nested vmx

2010-03-01 Thread Dong, Eddie
The goal of HVM virtualization is to present the guest with exactly the same
platform as native, both in KVM guests and Xen HVM guests; however, for
various reasons that is not strictly followed in today's virtualization
solutions. VMMs normally take shortcuts to make commodity OSes run well, for
performance etc. That brings trouble to the nested VirtualBox case you
mention.

3 years ago, Intel tried to run VirtualPC inside a Xen HVM guest. Disheng Su
fixed a number of bugs in Xen and eventually successfully launched the nested
VirtualPC. You may still find the patches and comments in Xenbits (from
Disheng Su, for nested VirtualPC). The main issue at that time was the shadow
page table, which optimizes MMU virtualization but does not exactly follow
architectural equivalence, for performance reasons. Are you running on shadow
page tables or EPT?

IBM and Intel have both developed virtual VMX support for nested
virtualization. You may leverage that.

Thx, Eddie


Olivier Berghmans wrote:
> Hi all,
> 
> I'm currently researching the possibility of nested virtualization.
> The goal is to test which hypervisor can be nested inside kvm.
> 
> I have already read much about nested virtualization on AMD machines
> (on the mailing list and Avi's blog) with the nested svm patch, but I
> did not find much information about this subject on Intel machines.
> What I did find is that the nested vmx support is not yet finished (or
> maybe almost?). So I understand that testing whether kvm runs inside
> kvm on Intel will not work until this patch is included in kvm.
> 
> However, the goal of the patch is to insert the virtualization
> extension into the guest (in virtualized processor), right?
> Hypervisors that use software virtualization do not need these
> extension so is it correct to state that these hypervisors could
> already be nested inside kvm (e.g. running a VirtualBox or VMware
> guest inside a kvm guest)? When I tried this with VMware, the KVM
> guest (L1) crashed and rebooted. The test with VirtualBox displayed
> the message "Kernel panic - not syncing: Attempted to kill init!".
> Nesting a Xen guest inside a KVM guest also failed in that the Xen
> guest tries to boot, crashes and tries to reboot, etc.
> 
> Is there something that I'm overlooking or something that I can try in
> order to get it to work? Or what could be the problem that these
> hypervisors cannot be nested?
> 
> Some technical information:
> 
> host$ cat /proc/cpuinfo
> model name: Intel(R) Core(TM)2 Quad CPUQ9550  @ 2.83GHz
> 
> host$ dmesg | grep kvm
> [   18.480937] loaded kvm module (kvm-kmod-devel-88)
> [ 4924.495094] kvm: emulating exchange as write
> 
> host$ uname -a
> Linux 2.6.28-16-generic #57-Ubuntu SMP Wed Nov 11 09:47:24 UTC 2009
> i686 GNU/Linux
> 
> Kind regards,
> Olivier
> 
> --
> 
> Met vriendelijke groet
> Olivier Berghmans



RE: KVM and Qemu Synchronization

2009-09-03 Thread Dong, Eddie
Saksena, Abhishek wrote:
> Correction to question:-
> 
> Can keeping device models and KVM pit synchronize with the host clock
> <> result in any issues with booting up OSes? 
> 

Time virtualization, especially in an SMP guest running a legacy OS, is
tricky. It can hardly be really synced.
But Linux has a PV timer for KVM & Xen; you can use the PV timer for the SMP
case. Windows relies more on the TSC, which is relatively easy to sync.

Thx, eddie


RE: EOI acceleration for high bandwidth IO

2009-07-06 Thread Dong, Eddie
Avi Kivity wrote:
> On 07/06/2009 04:42 PM, Dong, Eddie wrote:
>>  EOI is one of key VM Exit at high bandwidth IO such as VT-d
>>  with 10Gb/s NIC. This patch accelerate guest EOI emulation
>> utilizing HW VM Exit  information. 
>> 
> 
> Won't this fail if the guest uses STOSD to issue the EOI?
> 
Good catch. Should we use an exclusion list for the opcode?
Or use a decode cache for hot IPs (RO in EPT for the guest IP)?

We noticed a huge amount of vEOI activity with a 10Gb/s NIC, which is ~70 kHz
of EOIs. With SR-IOV, it could go up much more, even to the million level.
Decode and emulation cost 7K cycles, while the short path may only spend 3-4K
cycles.
Eddie


EOI acceleration for high bandwidth IO

2009-07-06 Thread Dong, Eddie


EOI is one of the key VM exits for high-bandwidth I/O such as VT-d with a
10Gb/s NIC. This patch accelerates guest EOI emulation by utilizing hardware
VM exit information.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ccafe0d..b63138f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -875,6 +875,15 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8)
 | (apic_get_reg(apic, APIC_TASKPRI) & 4));
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (apic)
+   apic_set_eoi(apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu)
 {
struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 40010b0..3a7a29a 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -27,6 +27,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3a75db3..6eea29d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3125,6 +3125,12 @@ static int handle_apic_access(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 	offset = exit_qualification & 0xffful;
+	if (((exit_qualification >> 12) & 0xf) == 1 &&
+	    offset == APIC_EOI) {	/* EOI write */
+		kvm_lapic_set_eoi(vcpu);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
 
 	er = emulate_instruction(vcpu, kvm_run, 0, 0, 0);
 



RE: [PATCH 0/1] x2apic implementation for kvm

2009-05-25 Thread Dong, Eddie
Avi Kivity wrote:
> Sheng Yang wrote:
>> I think that means the PV interface for lapic. And yes, we can
>> support it follow MS's interface, but x2apic still seems another
>> story as you noted... I still don't think support x2apic here would
>> bring us more benefits. 
>> 
> 
> x2apic has the following benefit:
> 
> - msr exits are faster than mmio (no page table walk, emulation)
> - no need to read back ICR to look at the busy bit
> - one ICR write instead of two
> - potential to support large guests once we add interrupt remapping
> - shared code with the Hyper-V paravirt interface
> 
Is there any plan to implement a PV irqchip, such as the Xen irqchip, for KVM?


RE: event injection MACROs

2009-05-15 Thread Dong, Eddie
Gleb Natapov wrote:
> On Thu, May 14, 2009 at 10:34:11PM +0800, Dong, Eddie wrote:
>> Gleb Natapov wrote:
>>> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
>>>> Avi Kivity wrote:
>>>>> Dong, Eddie wrote:
>>>>>> OK.
>>>>>> Also back to Gleb's question, the reason I want to do that is to
>>>>>> simplify event generation mechanism in current KVM.
>>>>>> 
>>>>>> Today KVM use additional layer of exception/nmi/interrupt such as
>>>>>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>>>>>> vcpu->arch.nmi_injected. All those additional layer is due to
>>>>>> compete of VM_ENTRY_INTR_INFO_FIELD
>>>>>> write to inject the event. Both SVM & VMX has only one resource
>>>>>> to inject the virtual event but KVM generates 3 catagory of
>>>>>> events in parallel which further requires additional
>>>>>> logic to dictate among them.
>>>>> 
>>>>> I thought of using a queue to hold all pending events (in a common
>>>>> format), sort it by priority, and inject the head.
>>>> 
>>>> The SDM Table 5-4 requires to merge 2 events together, i.e.
>>>> convert to #DF/ Triple fault or inject serially when 2 events
>>>> happens no matter NMI, IRQ or exception. 
>>>> 
>>>> As if considering above events merging activity, that is a single
>>>> element queue.
>>> I don't know how you got to this conclusion from you previous
>>> statement. See explanation to table 5-2 for instate where it is
>>> stated that interrupt should be held pending if there is exception
>>> with higher priority. Should be held pending where? In the queue,
>>> like we do. Note that low prio exceptions are just dropped since
>>> they will be regenerated.
>> 
>> I have different understanding here.
>> My understanding is that "held" means NO INTA in HW, i.e. LAPIC
>> still hold this IRQ. 
>> 
> And what if INTA already happened and CPU is ready to fetch IDT for
> interrupt vector and at this very moment CPU faults?

If INTA happens, that means the interrupt is delivered. If its delivery
triggers another exception, that is what Table 5-4 handles.

My understanding is that it is a 2-stage process. Table 5-2 talks about events
happening before delivery, so the hardware needs to prioritize them. Once a
decision is made, the highest-priority one is delivered, but its delivery
could then trigger another exception when fetching the IDT etc.

The current exception.pending/interrupt.pending/nmi_injected doesn't match
either of the above; interrupt/nmi is only for failed event injection, plus a
strange fixed priority check when the event is really injected:
exception > failed NMI > failed IRQ > new NMI > new IRQ.

Table 5-2 looks missed in current KVM, IMO, except for a wrong (but minor)
exception > NMI > IRQ sequence.
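
For reference, the Table 5-4 merging part is small and mechanical; a sketch
(standalone C, not KVM code) based on the usual classes - benign, contributory
(#DE, #TS, #NP, #SS, #GP) and page fault - could look like this:

#include <stdint.h>

enum merge_result { DELIVER_SECOND_SERIALLY, DELIVER_DOUBLE_FAULT, TRIPLE_FAULT };

static int is_contributory(uint8_t vec)
{
	return vec == 0 || (vec >= 10 && vec <= 13);  /* #DE, #TS, #NP, #SS, #GP */
}

/* "first" is the exception whose delivery raised "second". */
static enum merge_result merge_exceptions(uint8_t first, uint8_t second)
{
	if (first == 8)					/* fault during #DF     */
		return TRIPLE_FAULT;
	if (is_contributory(first) && is_contributory(second))
		return DELIVER_DOUBLE_FAULT;
	if (first == 14 && (is_contributory(second) || second == 14))
		return DELIVER_DOUBLE_FAULT;		/* #PF then #PF/contrib */
	return DELIVER_SECOND_SERIALLY;			/* at least one benign  */
}

The prioritization part (Table 5-2) is then just picking the highest-priority
pending source before delivery; the merge above only applies when the second
event is raised while delivering the first.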

> 
>>> 
>>>>  We could have either:  1) A pure SW "queue" that will be flush to
>>>> HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW
>>>> register. 
>>>> 
>>> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
>>> have queue of three elements sorted by priority. On each entry we
>>> should 
>> 
>> Table 5-4 alreadys says NMI/IRQ is BENIGN.
> Table 5-2 applies here not table 5-4 I think.
> 
>> 
>>> inject an event with highest priority. And remove it from queue on
>>> exit.
>> 
>> The problem is that we have to decide to inject only one of above 3,
>> and discard the rest. Whether priority them or merge (to one event
>> as Table 5-4) is another story. 
> Only a small number of event are merged into #DF. Most handled
> serially (SDM does not define what serially means unfortunately), so
> I don't understand where "discard the rest" is come from. We can

vmx_complete_interrupts clears all of them at the next exit.

Even from the hardware point of view, if there are a pending NMI, IRQ and
exception, the CPU picks the highest one, the NMI, ignores/discards the IRQ
(but the LAPIC still holds the IRQ, so it can be re-injected), and completely
discards the exception.

I don't say discarding has any problem, but it is unnecessary to keep all 3;
the only difference is when to discard the other 2, at queue_exception/irq/nmi
time or later on (even at the next exit), which is the same to me.

> discard exception since it will be regenerated anyway, but IRQ and
> NMI is another story. SDM says that IRQ should be held pending (once
> again not much explanation here), nothing about NMI.
> 
>>> 

RE: event injection MACROs

2009-05-14 Thread Dong, Eddie
Gleb Natapov wrote:
> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
>> Avi Kivity wrote:
>>> Dong, Eddie wrote:
>>>> OK.
>>>> Also back to Gleb's question, the reason I want to do that is to
>>>> simplify event generation mechanism in current KVM.
>>>> 
>>>> Today KVM use additional layer of exception/nmi/interrupt such as
>>>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>>>> vcpu->arch.nmi_injected. All those additional layer is due to
>>>> compete of VM_ENTRY_INTR_INFO_FIELD
>>>> write to inject the event. Both SVM & VMX has only one resource to
>>>> inject the virtual event but KVM generates 3 catagory of events in
>>>> parallel which further requires additional
>>>> logic to dictate among them.
>>> 
>>> I thought of using a queue to hold all pending events (in a common
>>> format), sort it by priority, and inject the head.
>> 
>> The SDM Table 5-4 requires to merge 2 events together, i.e. convert
>> to #DF/ 
>> Triple fault or inject serially when 2 events happens no matter NMI,
>> IRQ or exception. 
>> 
>> As if considering above events merging activity, that is a single
>> element queue. 
> I don't know how you got to this conclusion from you previous
> statement. 
> See explanation to table 5-2 for instate where it is stated that
> interrupt should be held pending if there is exception with higher
> priority. Should be held pending where? In the queue, like we do. Note
> that low prio exceptions are just dropped since they will be
> regenerated. 

I have a different understanding here.
My understanding is that "held" means NO INTA in HW, i.e. the LAPIC still holds this 
IRQ.

> 
>>  We could have either:  1) A pure SW "queue" that will be flush to HW
>> register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW register.
>> 
> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
> have 
> queue of three elements sorted by priority. On each entry we should

Table 5-4 already says NMI/IRQ are BENIGN.

> inject an event with highest priority. And remove it from queue on
> exit. 

The problem is that we have to decide to inject only one of the above 3 and 
discard the rest.
Whether to prioritize them or merge them (into one event per Table 5-4) is another story.

> 
>> 
>> A potential benefit is that it can avoid duplicated code and
>> potential bugs 
>> in current code as following patch shows if I understand correctly:
>> 
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu
>> *vcpu, struct kvm_run *kvm_run) cr2 =
>> vmcs_readl(EXIT_QUALIFICATION);
>> KVMTRACE_3D(PAGE_FAULT, vcpu,
>> error_code, (u32)cr2, (u32)((u64)cr2 >> 32), handler); -
>> if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending ) +
>> if (vcpu->arch.interrupt.pending ||
>> vcpu->arch.exception.pending  ||
>> vcpu->arch.nmi_injected) kvm_mmu_unprotect_page_virt(vcpu,
>> cr2); return kvm_mmu_page_fault(vcpu, cr2, error_code); } 
> This fix is already in Avi's tree (not yet pushed).
> 
>> Either way are OK and up to you. BTW Xen uses HW register directly
>> to representing 
>> an pending event.
>> 
> In this particular case I don't mind to use HW register either, but I
> don't see any advantage.
> 
>>> 
>>>> One example is that exception has higher priority
>>>> than NMI/IRQ injection in current code which is not true in
>>>> reality. 
>>>> 
>>> 
>>> I don't think it matters in practice, since the guest will see it
>>> as a timing issue.  NMIs and IRQs are asynchronous (even those
>>> generated by the guest through the local APIC).
>> 
>> Yes. But also cause IRQ injection be delayed which may have side
>> effect. 
>> For example if guest exception handler is very longer or if guest
>> VCPU fall into recursive #GP. Within current logic, a guest IRQ
>> event from KDB (IPI) running 
>> on VCPU0, as an example, can't force the dead loop VCPU1 into KDB
>> since it 
>> is recursively #GP.
> If one #GP causes another #GP this is a #DF. If CPU has a chance to

I mean another #GP on the next instruction, i.e. at the beginning of the #GP handler in the guest.
No #DF here.

> executes 
> something in between KVM will have a chance to inject NMI.

Could have no chance in so

RE: event injection MACROs

2009-05-14 Thread Dong, Eddie
Avi Kivity wrote:
> Dong, Eddie wrote:
>> OK.
>> Also back to Gleb's question, the reason I want to do that is to
>> simplify event 
>> generation mechanism in current KVM.
>> 
>> Today KVM use additional layer of exception/nmi/interrupt such as
>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>> vcpu->arch.nmi_injected. 
>> All those additional layer is due to compete of
>> VM_ENTRY_INTR_INFO_FIELD 
>> write to inject the event. Both SVM & VMX has only one resource to
>> inject the virtual event but KVM generates 3 catagory of events in
>> parallel which further requires additional 
>> logic to dictate among them.
> 
> I thought of using a queue to hold all pending events (in a common
> format), sort it by priority, and inject the head.

SDM Table 5-4 requires merging 2 events together, i.e. converting them to #DF/
triple fault, or injecting them serially when 2 events happen, regardless of whether they are NMI, IRQ or 
exception.

If we take the above event-merging activity into account, that is a single-element queue.
We could have either: 1) a pure SW "queue" that is flushed to the HW 
register (VM_ENTRY_INTR_INFO_FIELD) later, or 2) direct use of the HW register.


A potential benefit is that it can avoid duplicated code and potential bugs
in the current code, as the following patch shows, if I understand correctly:

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct
kvm_run *kvm_run)
cr2 = vmcs_readl(EXIT_QUALIFICATION);
KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
(u32)((u64)cr2 >> 32), handler);
-   if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending
)
+   if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending
 || vcpu->arch.nmi_injected)
kvm_mmu_unprotect_page_virt(vcpu, cr2);
return kvm_mmu_page_fault(vcpu, cr2, error_code);
}


If we use the above merged SW "queue" or the HW register directly, we can do the 
following:

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct
kvm_run *kvm_run)
cr2 = vmcs_readl(EXIT_QUALIFICATION);
KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
(u32)((u64)cr2 >> 32), handler);
-   if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending
)
+   if (vmcs_read(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK)
kvm_mmu_unprotect_page_virt(vcpu, cr2);
return kvm_mmu_page_fault(vcpu, cr2, error_code);
}

Either way is OK and up to you. BTW, Xen uses the HW register directly to 
represent
a pending event.

> 
>> One example is that exception has higher priority
>> than NMI/IRQ injection in current code which is not true in reality.
>> 
> 
> I don't think it matters in practice, since the guest will see it as a
> timing issue.  NMIs and IRQs are asynchronous (even those generated by
> the guest through the local APIC).

Yes. But it also causes IRQ injection to be delayed, which may have side effects,
for example if the guest exception handler is very long or if a guest VCPU falls into
a recursive #GP. With the current logic, a guest IRQ event from KDB (an IPI) issued
on VCPU0, as an example, can't force the dead-looping VCPU1 into KDB, since it
is recursively taking #GP.

> 
>> Another issue is that an failed event from previous injection say
>> IRQ or NMI may be discarded if an virtual exception happens in the
>> EXIT handling now. With the patch of generic double fault handling,
>> this case should be handled as normally. 
>> 
> 
> Discarding an exception is usually okay as it will be regenerated.  I
> don't think we discard interrupts or NMIs.
In reality (running an OS in the guest), it hasn't happened so far. But 
architecturally, 
it could. For example, KVM injects an IRQ, but the VM resume gets a #PF and 
comes back to KVM with IDT_VECTORING valid. KVM then puts the failed 
IRQ back into the interrupt queue. But if the #PF handling generates another exception,
the interrupt queue can't be injected, since KVM injects the 
exception first. And the interrupt queue is discarded at the next VM exit.
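
As an illustration only, that sequence can be mocked up in a few lines of standalone C (none of this is KVM code; the state and function names are invented):

/*
 * Illustration only (not KVM code; names invented): the scenario above,
 * where a requeued IRQ is lost because exit handling queued an exception.
 */
#include <stdio.h>
#include <stdbool.h>

struct vcpu_events {
	bool irq_pending;        /* IRQ put back from IDT_VECTORING info      */
	bool exception_pending;  /* exception queued while handling the exit  */
};

static void vm_entry(struct vcpu_events *e)
{
	if (e->exception_pending)
		printf("entry: inject exception (requeued IRQ skipped)\n");
	else if (e->irq_pending)
		printf("entry: inject IRQ\n");
	else
		printf("entry: nothing to inject\n");
}

static void vm_exit(struct vcpu_events *e)
{
	/* Per the discussion, whatever was not delivered is cleared at the
	 * next exit, so the skipped IRQ is gone. */
	e->irq_pending = false;
	e->exception_pending = false;
}

int main(void)
{
	struct vcpu_events e = {
		.irq_pending = true,       /* failed IRQ requeued     */
		.exception_pending = true, /* #PF handling queued one */
	};

	vm_entry(&e);   /* the exception wins            */
	vm_exit(&e);    /* the requeued IRQ is discarded */
	vm_entry(&e);   /* nothing left to inject        */
	return 0;
}

The second vm_entry() injects nothing: the requeued IRQ was lost exactly as described.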

Overall, I think this is mostly a simplification, but it may benefit us a lot in the
future. Especially with Gleb's recent cleanup, it sounds much easier to
do than before. 

I may be making a mistake here; I would like to see more comments.
thx, eddie


RE: event injection MACROs

2009-05-13 Thread Dong, Eddie
Avi Kivity wrote:
> Dong, Eddie wrote:
>> I noticed the MACRO for SVM vmcb->control.event_inj and VMX
>> VM_EXIT_INTR_INFO are almost same, I have a need to query the event
>> injection situation in common code so plan to expose this register
>> read/write to x86.c.  Should we define a new format for
>> evtinj/VM_EXIT_INTR_INFO as common KVM format, or just move those
>> original MACRO to kvm_host.h? 
>> 
>> 
> 
> This is dangerous if additional bits or field values are defined by
> either architecture.  Better to use accessors.

OK.
Also back to Gleb's question, the reason I want to do that is to simplify event
generation mechanism in current KVM.

Today KVM use additional layer of exception/nmi/interrupt such as
vcpu.arch.exception.pending, vcpu->arch.interrupt.pending & 
vcpu->arch.nmi_injected.
All those additional layer is due to compete of VM_ENTRY_INTR_INFO_FIELD
write to inject the event. Both SVM & VMX has only one resource to inject the 
virtual event
but KVM generates 3 catagory of events in parallel which further requires 
additional
logic to dictate among them. One example is that exception has higher priority
than NMI/IRQ injection in current code which is not true in reality. 

Another issue is that a failed event from a previous injection, say an IRQ or NMI, 
may be 
discarded if a virtual exception happens during the exit handling. With the 
patch for 
generic double fault handling, this case should now be handled normally.

Will post RFC soon.
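
As for the "use accessors" suggestion quoted above, a rough sketch of what that could look like (these ops and names do not exist in KVM; this is only an illustration of the idea, not proposed code):

/*
 * Rough sketch of the "use accessors" idea: these ops and names do not
 * exist in KVM, they only illustrate how common code could query the
 * staged event without knowing the raw VMX/SVM encoding.
 */
#include <stdbool.h>

struct kvm_vcpu;   /* opaque here */

struct kvm_event_ops {
	/* true if VMX VM_ENTRY_INTR_INFO_FIELD / SVM EVENTINJ holds a valid event */
	bool (*event_staged)(struct kvm_vcpu *vcpu);
	/* vector of the staged event; only meaningful if event_staged() is true */
	unsigned int (*staged_vector)(struct kvm_vcpu *vcpu);
};

/* Common code (x86.c) would only ever go through the ops: */
static inline bool kvm_event_staged(const struct kvm_event_ops *ops,
				    struct kvm_vcpu *vcpu)
{
	return ops->event_staged(vcpu);
}

Each of VMX and SVM would fill the ops from its own register format, so architecture-specific bits defined later would not leak into the common code.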

Thx, eddie


RE: Implement generic double fault generation mechanism

2009-05-13 Thread Dong, Eddie
> That is OK, You can send two patches. The first one will WARN_ON and
> overwrite exception like the current code does. And the second one
> will remove WARN_ON explaining that this case is actually possible to
> trigger from a guest.
> 
It sounds like you would rather not provide this additional one yourself, so here it is, for the 
purpose of
removing the blocking issue. My basic position is still the same as mentioned 
in the previous mail, but I am neutral either way.

Thx, eddie

Signed-off-by: Eddie Dong 

Overwriting the former event may help forward progress
in case multiple exceptions/interrupts happen serially.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0e75a2..b3de5d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -183,11 +183,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
int class1, class2;
 
if (!vcpu->arch.exception.pending) {
-   vcpu->arch.exception.pending = true;
-   vcpu->arch.exception.has_error_code = has_error;
-   vcpu->arch.exception.nr = nr;
-   vcpu->arch.exception.error_code = error_code;
-   return;
+   goto out;
}
 
/* to check exception */
@@ -208,9 +204,15 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
vcpu->arch.exception.has_error_code = true;
vcpu->arch.exception.nr = DF_VECTOR;
vcpu->arch.exception.error_code = 0;
+   return;
} else
printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
prev_nr, nr);
+out:
+   vcpu->arch.exception.pending = true;
+   vcpu->arch.exception.has_error_code = has_error;
+   vcpu->arch.exception.nr = nr;
+   vcpu->arch.exception.error_code = error_code;
 }
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)



RE: Enable IRQ windows after exception injection if there are pending virq

2009-05-13 Thread Dong, Eddie
Gleb Natapov wrote:
> On Tue, May 12, 2009 at 11:06:39PM +0800, Dong, Eddie wrote:
>> 
>> I didn't take many test since our PTS system stop working now due to
>> KVM userspace 
>> build changes. But since the logic is pretty simple, so I want to
>> post here to see comments. Thx, eddie 
>> 
>> 
>> 
>> 
>> If there is pending irq after an virtual exception is injected,
>> KVM needs to enable IRQ window to trap back earlier once
>> the exception is handled.
>> 
> I already posted patch to do that
> http://patchwork.kernel.org/patch/21830/ Is you patch different?
> 
Is it based on the idea I mentioned to you in a private mail (April 27), or a 
novel one?



Enable IRQ windows after exception injection if there are pending virq

2009-05-12 Thread Dong, Eddie

I didn't run many tests since our PTS system has stopped working due to KVM 
userspace
build changes. But since the logic is pretty simple, I want to post it here to 
get comments.
Thx, eddie




If there is a pending IRQ after a virtual exception is injected,
KVM needs to enable the IRQ window to trap back earlier once 
the exception is handled.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 308d8e9..f8ceaea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3154,15 +3154,18 @@ static void inject_irq(struct kvm_vcpu *vcpu)
}
 }
 
-static void inject_pending_irq(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static void inject_pending_irq(struct kvm_vcpu *vcpu)
 {
-   bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
-   kvm_run->request_interrupt_window;
-
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
kvm_x86_ops->drop_interrupt_shadow(vcpu);
 
inject_irq(vcpu);
+}
+
+static void set_pending_virq(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+   bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
+   kvm_run->request_interrupt_window;
 
/* enable NMI/IRQ window open exits if needed */
if (vcpu->arch.nmi_pending)
@@ -3229,7 +3232,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct 
kvm_run *kvm_run)
if (vcpu->arch.exception.pending)
__queue_exception(vcpu);
else
-   inject_pending_irq(vcpu, kvm_run);
+   inject_pending_irq(vcpu);
+
+   set_pending_virq(vcpu, kvm_run);
 
if (kvm_lapic_enabled(vcpu)) {
if (!vcpu->arch.apic->vapic_addr)



event injection MACROs

2009-05-12 Thread Dong, Eddie
I noticed the macros for SVM vmcb->control.event_inj and VMX VM_EXIT_INTR_INFO 
are almost the same. I need to query the event injection situation in common 
code, so I plan to expose this register's read/write to x86.c. Should we define a 
new format for evtinj/VM_EXIT_INTR_INFO as a common KVM format, or just move 
those original macros to kvm_host.h?

Thx, eddie


RE: Implement generic double fault generation mechanism

2009-05-11 Thread Dong, Eddie
Gleb Natapov wrote:
> On Mon, May 11, 2009 at 09:04:52AM +0800, Dong, Eddie wrote:
>> 
>>> There is not point referring to current code. Current code does not
>>> handle serial exceptions properly. So fix it in your patch
>>> otherwise I propose to use my patch that fixes current code
>>> (http://patchwork.kernel.org/patch/21829/).
>>> 
>> 
>> I would like Avi to decide. As comments to the difference of 2
>> patches, my undrestanding is that I am addressing the problem base
>> on SDM 5-4 with the answer to serial injection as first in first
>> service. Your patch doesn;t solve generic double fault case for
>> example exception 11 on 11, or GP on GP which needs to be converted
>> to #DF per SDM, rather you only handle the case the secondary
>> exception is PF,  and servicing PF.  
>> 
> There is nothing to decide really. I prefer your patch with serial
> exception handling fixed. If you'll not do it I'll do it.

OK, an additional patch will be constructive, but my position is neutral. The 
reasons (as mentioned) are:

1: Current KVM just does WARN_ON for those cases (and it is never hit), so this 
patch won't introduce 
additional issues. Either printk or WARN_ON, to notify us in case we meet the 
problem in the future, is the safer way for me.

2: In case a real "serial exception" happens, from an architectural point of 
view, I think we'd better consult Table 5-2 to prioritize the events, which is 
neither keeping the former exception nor overwriting it (a rough sketch of such a 
prioritized queue follows this list). But as you mentioned, the 
table is not complete. My point is that this is another complicated scenario 
that we should spend time on in the future; it is not related to the current patch.

3: This function will soon need to be extended to cover the IRQ case too, which 
requires pushing back the overwritten IRQ. We need a total solution for this, so I 
prefer to do that some time later.

4: I prefer to split issues if possible. 
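
The sketch referred to in point 2, for illustration only (plain userspace C, invented names, placeholder priorities rather than the real Table 5-2 ordering):

/*
 * Illustration only: one queue of pending events sorted by priority,
 * inject the head.  The priority numbers are placeholders, not the
 * real Table 5-2 ordering.
 */
#include <stdio.h>

enum evt_type { EVT_EXCEPTION, EVT_NMI, EVT_IRQ, EVT_NR };

struct pending {
	int valid;
	int vector;
	int prio;
};

static struct pending queue[EVT_NR];

static void queue_event(enum evt_type t, int vector, int prio)
{
	queue[t] = (struct pending){ .valid = 1, .vector = vector, .prio = prio };
}

/*
 * Pick the highest-priority valid entry.  It is cleared here for
 * brevity; the proposal in the thread removes it on the following exit.
 */
static int inject_head(void)
{
	int best = -1;

	for (int i = 0; i < EVT_NR; i++)
		if (queue[i].valid && (best < 0 || queue[i].prio > queue[best].prio))
			best = i;
	if (best < 0)
		return -1;
	queue[best].valid = 0;
	return queue[best].vector;
}

int main(void)
{
	queue_event(EVT_IRQ, 0xec, 1);      /* external interrupt               */
	queue_event(EVT_EXCEPTION, 14, 3);  /* #PF, higher priority placeholder */

	printf("inject vector %d\n", inject_head());  /* 14         */
	printf("inject vector %d\n", inject_head());  /* 236 (0xec) */
	return 0;
}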


> 
>> I can check with internal architecture to see what does "handle
>> exceptions serially" mean in really. For me serial means first in
>> first out, and thus we should remain 1st exception.  
>> 
> There is a table 5.2 that defines an order between some events.  The
> table is not complete, I don't see #DE there for instance.  But
> consider 
> this case: #DE (or #NP) happens while exception stack is paged out so
> #PF happens next. #PF is handled by TSS gate so it uses its own stack
> and it fixes exception stack in its handler. If we drop #PF because
> #DE is already waiting we will keep trying to inject #DE
> indefinitely. The result is hanging QEMU process eating 100% cpu
> time. If we replace #DE with #PF on the other hand then #PF handler
> will fix exception stack instruction that caused #DE will be
> re-executed, #DE regenerated and handled properly. So which scenario
> do you prefer? 

See above.

Thx, eddie


RE: Luvalley-2 has been released: running KVM below any operating system

2009-05-10 Thread Dong, Eddie
Xiaodong Yi wrote:
> It is not a typo. I copied from UnixBench output directly. Howver, it
> must be a bug of Luvalley because even the native Linux benchmark on
> Double-Precision Whetstone is not that high. I also noticed that other
> benchmarks are all lower than native Linux.
> 
> About timing, Luvalley does nothing more than KVM except that Luvalley
> implemented the VLAPIC timer using TSC while KVM uses the services of
> Linux kernel. The other timers of both Luvalley and KVM, I think, are
> all implemented in Qemu.
> 
> Moreover, I could not explain why Luvalley's benchmarks on process
> creation, execl throughput, file copy and shell script are 20% ~ 40%
> higher than KVM, while other benchmarks on pipe throughput, context
> switching and syscall overhead are almost the same as KVM.
> 

A typical issue when benchmarking a VMM with an OS benchmark such as the one you used is 
time inaccuracy.

Benchmarks that use guest time won't get the right time or the right duration 
due to the scheduler etc., and thus VMM benchmarks use network time for 
measuring, as vConsolidate does. Spec.org is defining their benchmark for VMMs, 
and I believe they will use network time too.

For simplicity, you may continue to use an OS benchmark to measure a VMM, but then you 
need to calibrate the guest time accuracy first, for example with a stopwatch. In 
both Xen & KVM, we benchmark using OS benchmarks too, but usually only to 
see the improvement from a performance patch. Formal benchmark data needs to consult a 
wall clock or stopwatch.

Thx, eddie


RE: Implement generic double fault generation mechanism

2009-05-10 Thread Dong, Eddie

> There is not point referring to current code. Current code does not
> handle serial exceptions properly. So fix it in your patch otherwise I
> propose to use my patch that fixes current code
> (http://patchwork.kernel.org/patch/21829/).
> 

I would like Avi to decide. As comments on the difference between the 2 patches, my 
understanding is that I am addressing the problem based on SDM 5-4, with the 
answer to serial injection being first in, first served. Your patch doesn't solve the 
generic double fault case, for example exception 11 on 11, or GP on GP, which 
needs to be converted to #DF per the SDM; rather, you only handle the case where the 
secondary exception is a PF, and service the PF.

I can check with an internal architect to see what "handle exceptions 
serially" really means. For me, serial means first in, first out, and thus we 
should keep the 1st exception.
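
As a cross-check of what "converted to #DF per SDM" covers, the class rules can be exercised with a few lines of standalone C; this mirrors the exception_class()/merge check from the patch in this thread, with becomes_df() as an invented helper name:

/*
 * Standalone cross-check of the class rules: contributory-on-contributory,
 * or anything non-benign on a #PF, collapses into #DF per SDM Table 5-5.
 */
#include <stdio.h>

#define EXCPT_BENIGN        0
#define EXCPT_CONTRIBUTORY  1
#define EXCPT_PF            2

static int exception_class(int vector)
{
	if (vector == 14)
		return EXCPT_PF;
	if (vector == 0 || (vector >= 10 && vector <= 13))
		return EXCPT_CONTRIBUTORY;
	return EXCPT_BENIGN;
}

static int becomes_df(int prev_nr, int nr)
{
	int class1 = exception_class(prev_nr);
	int class2 = exception_class(nr);

	return (class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY) ||
	       (class1 == EXCPT_PF && class2 != EXCPT_BENIGN);
}

int main(void)
{
	printf("#NP(11) then #NP(11): %s\n", becomes_df(11, 11) ? "#DF" : "serial");
	printf("#GP(13) then #GP(13): %s\n", becomes_df(13, 13) ? "#DF" : "serial");
	printf("#PF(14) then #PF(14): %s\n", becomes_df(14, 14) ? "#DF" : "serial");
	printf("#DE(0)  then #PF(14): %s\n", becomes_df(0, 14)  ? "#DF" : "serial");
	return 0;
}

So 11-on-11, GP-on-GP and PF-on-PF collapse into #DF, while #DE followed by #PF is still handled serially.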


Eddie.


RE: Implement generic double fault generation mechanism

2009-05-08 Thread Dong, Eddie
Gleb Natapov wrote:
> On Fri, May 08, 2009 at 06:46:14PM +0800, Dong, Eddie wrote:
>> Dong, Eddie wrote:
>>> ction will be re-executed.
>>>>> 
>>>>> Do you want it to be covered for now? For exception, it is easy
>>>>> but for IRQ, it needs to be pushed back.
>>>>> 
>>>> Yes I want it to be covered now otherwise any serial exception
>>>> generates flood of "Exception happens serially" messages. This
>>>> function does not handle IRQ so no problem there.
>>> 
>>> But we soon will let this function cove IRQ as well per SDM.
>>> Why not do that a little bit later?
>>> 
>>> BTW, this issue exist in original code as well.
>>> 
>>> Eddie
>> 
>> Actually this is already addressed in current patch too: Just keep
>> the former exception. If you mean the prink should be removed, I am
>> fine.  
> Keeping the former exception is not the right thing to do. It can't be
> delivered because delivering it cause another exception and handler
> that may fix the situation is not called since you drop last
> exception and keep re-injecting the one that can't be handled.
> 
>> BTW, this case doesn't happen in reality.
>> 
> Then why do you write all this code then? :) I can easily write test

I am fixing the potential #DF bug in the current code, which only handles
PF on PF.
For those sequential exceptions, the current code just does WARN_ON.

> case that will do that (actually I did) and if not handled properly it
> just loops taking 100% cpu trying to reinject exception that cannot be
> handled.

Are you sure the current code dead-loops in WARN_ON with your test code? 
I don't claim it can never happen (that is why I printk it), but it shouldn't occur
with the current guests that KVM can support.

See the original kvm_queue_exception in case you missed the code.

void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
WARN_ON(vcpu->arch.exception.pending);
vcpu->arch.exception.pending = true;
vcpu->arch.exception.has_error_code = false;
vcpu->arch.exception.nr = nr;
}

Any comments from Avi?

Thx, eddie




RE: Implement generic double fault generation mechanism

2009-05-08 Thread Dong, Eddie
Dong, Eddie wrote:
> ction will be re-executed.
>>> 
>>> Do you want it to be covered for now? For exception, it is easy but
>>> for IRQ, it needs to be pushed back.
>>> 
>> Yes I want it to be covered now otherwise any serial exception
>> generates flood of "Exception happens serially" messages. This
>> function does not handle IRQ so no problem there.
> 
> But we soon will let this function cove IRQ as well per SDM.
> Why not do that a little bit later?
> 
> BTW, this issue exist in original code as well.
> 
> Eddie

Actually this is already addressed in the current patch too: just keep the former 
exception. If you mean the printk should be removed, I am fine with that. 
BTW, this case doesn't happen in reality.

Thx, eddie


RE: Implement generic double fault generation mechanism

2009-05-08 Thread Dong, Eddie
ction will be re-executed.
>> 
>> Do you want it to be covered for now? For exception, it is easy but
>> for IRQ, it needs to be pushed back. 
>> 
> Yes I want it to be covered now otherwise any serial exception
> generates flood of "Exception happens serially" messages. This
> function does not handle IRQ so no problem there.

But we will soon let this function cover IRQs as well, per the SDM.
Why not do that a little bit later?

BTW, this issue exists in the original code as well.

Eddie


RE: Implement generic double fault generation mechanism

2009-05-08 Thread Dong, Eddie
Gleb Natapov wrote:
>> +
>> +static int exception_class(int vector)
>> +{
>> +if (vector == 14)
>> +return EXCPT_PF;
>> +else if (vector == 0 || (vector >= 10 && vector <= 13)) +   
>> return
>> EXCPT_CONTRIBUTORY; +else
>> +return EXCPT_BENIGN;
>> +}
>> +
> This makes double fault (8) benign exception. Surely not what you
> want. 

Double fault falls into neither of the above classes per the SDM. But it should be 
checked earlier, before generating a #DF. See the new update.
>> +/* to check exception */
>> +prev_nr = vcpu->arch.exception.nr;
>> +class2 = exception_class(nr);
>> +class1 = exception_class(prev_nr);
>> +if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
>> +|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
>> +/* generate double fault per SDM Table 5-5 */
>> +printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
>> +prev_nr, nr); + vcpu->arch.exception.pending = 
>> true;
>> +vcpu->arch.exception.has_error_code = 1;
>> +vcpu->arch.exception.nr = DF_VECTOR;
>> +vcpu->arch.exception.error_code = 0;
>> +if (prev_nr == DF_VECTOR) {
>> +/* triple fault -> shutdown */
>> +set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests); +   
>> }
>> +} else
>> +printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
>> +prev_nr, nr); +}
> When two exceptions happens serially is is better to replace pending
> exception with a new one. This way the first exception (that is lost)
> will be regenerated when instruction will be re-executed.

Do you want it to be covered for now? For an exception it is easy, but an IRQ 
needs to be pushed back.

Thx, eddie



Move Double-Fault generation logic out of page fault
exception generating function to cover more generic case.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab1fdac..d0e75a2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -162,12 +162,60 @@ void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
+#define EXCPT_BENIGN   0
+#define EXCPT_CONTRIBUTORY 1
+#define EXCPT_PF   2
+
+static int exception_class(int vector)
+{
+   if (vector == 14)
+   return EXCPT_PF;
+   else if (vector == 0 || (vector >= 10 && vector <= 13))
+   return EXCPT_CONTRIBUTORY;
+   else
+   return EXCPT_BENIGN;
+}
+
+static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
+   unsigned nr, bool has_error, u32 error_code)
+{
+   u32 prev_nr;
+   int class1, class2;
+
+   if (!vcpu->arch.exception.pending) {
+   vcpu->arch.exception.pending = true;
+   vcpu->arch.exception.has_error_code = has_error;
+   vcpu->arch.exception.nr = nr;
+   vcpu->arch.exception.error_code = error_code;
+   return;
+   }
+
+   /* to check exception */
+   prev_nr = vcpu->arch.exception.nr;
+   if (prev_nr == DF_VECTOR) {
+   /* triple fault -> shutdown */
+   set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+   return;
+   }
+   class1 = exception_class(prev_nr);
+   class2 = exception_class(nr);
+   if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
+   || (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
+   /* generate double fault per SDM Table 5-5 */
+   printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
+   prev_nr, nr);
+   vcpu->arch.exception.pending = true;
+   vcpu->arch.exception.has_error_code = true;
+   vcpu->arch.exception.nr = DF_VECTOR;
+   vcpu->arch.exception.error_code = 0;
+   } else
+   printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
+   prev_nr, nr);
+}
+
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
 {
-   WARN_ON(vcpu->arch.exception.pending);
-   vcpu->arch.exception.pending = true;
-   vcpu->arch.exception.has_error_code = false;
-   vcpu->arch.exception.nr = nr;
+   kvm_multiple_exception(vcpu, nr, false, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception);
 
@@ -176,18 +224,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned 
long addr,
 {
++vcpu->stat.pf_guest;
 
-   if (vcpu->arch.exception.pending) {
-   if (vcpu->arch.exception.nr == PF_VECTOR) {
-   printk(KERN_DEBUG "kvm: inject_page_fault:"
-   " double fault 0x%lx\n", addr);
-   vcpu->arch.exception.nr = DF_VECTOR;
-   vcpu->arch.exception.error_code = 0;
-   } else if (vcpu->arch.

RE: Implement generic double fault generation mechanism

2009-05-08 Thread Dong, Eddie
Dong, Eddie wrote:
> Move Double-Fault generation logic out of page fault
> exception generating function to cover more generic case.
> 
> Signed-off-by: Eddie Dong 
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ab1fdac..51a8dad 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -162,12 +162,59 @@ void kvm_set_apic_base(struct kvm_vcpu *vcpu,
>  u64 data) }
>  EXPORT_SYMBOL_GPL(kvm_set_apic_base);
> 
> +#define EXCPT_BENIGN 0
> +#define EXCPT_CONTRIBUTORY   1
> +#define EXCPT_PF 2
> +
> +static int exception_class(int vector)
> +{
> + if (vector == 14)
> + return EXCPT_PF;
> + else if (vector == 0 || (vector >= 10 && vector <= 13))
> + return EXCPT_CONTRIBUTORY;
> + else
> + return EXCPT_BENIGN;
> +}
> +
> +static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
> + unsigned nr, bool has_error, u32 error_code)
> +{
> + u32 prev_nr;
> + int class1, class2;
> +
> + if (!vcpu->arch.exception.pending) {
> + vcpu->arch.exception.pending = true;
> + vcpu->arch.exception.has_error_code = has_error;
> + vcpu->arch.exception.nr = nr;
> + vcpu->arch.exception.error_code = error_code;
> + return;
> + }
> +
> + /* to check exception */
> + prev_nr = vcpu->arch.exception.nr;
> + class2 = exception_class(nr);
> + class1 = exception_class(prev_nr);
> + if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
> + || (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
> + /* generate double fault per SDM Table 5-5 */
> + printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
> + prev_nr, nr);
> + vcpu->arch.exception.pending = true;
> + vcpu->arch.exception.has_error_code = 1;
> + vcpu->arch.exception.nr = DF_VECTOR;
> + vcpu->arch.exception.error_code = 0;
> + if (prev_nr == DF_VECTOR) {
> + /* triple fault -> shutdown */
> + set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
> + }
> + } else
> + printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
> + prev_nr, nr);
> +}
> +
>  void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
>  {
> - WARN_ON(vcpu->arch.exception.pending);
> - vcpu->arch.exception.pending = true;
> - vcpu->arch.exception.has_error_code = false;
> - vcpu->arch.exception.nr = nr;
> + kvm_multiple_exception(vcpu, nr, false, 0);
>  }
>  EXPORT_SYMBOL_GPL(kvm_queue_exception);
> 
> @@ -176,18 +223,6 @@ void kvm_inject_page_fault(struct kvm_vcpu
>  *vcpu, unsigned long addr, {
>   ++vcpu->stat.pf_guest;
> 
> - if (vcpu->arch.exception.pending) {
> - if (vcpu->arch.exception.nr == PF_VECTOR) {
> - printk(KERN_DEBUG "kvm: inject_page_fault:"
> - " double fault 0x%lx\n", addr);
> - vcpu->arch.exception.nr = DF_VECTOR;
> - vcpu->arch.exception.error_code = 0;
> - } else if (vcpu->arch.exception.nr == DF_VECTOR) {
> - /* triple fault -> shutdown */
> - set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
> - }
> - return;
> - }
>   vcpu->arch.cr2 = addr;
>   kvm_queue_exception_e(vcpu, PF_VECTOR, error_code);
>  }
> @@ -200,11 +235,7 @@ EXPORT_SYMBOL_GPL(kvm_inject_nmi);
> 
>  void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32
>  error_code) {
> - WARN_ON(vcpu->arch.exception.pending);
> - vcpu->arch.exception.pending = true;
> - vcpu->arch.exception.has_error_code = true;
> - vcpu->arch.exception.nr = nr;
> - vcpu->arch.exception.error_code = error_code;
> + kvm_multiple_exception(vcpu, nr, true, error_code);
>  }
>  EXPORT_SYMBOL_GPL(kvm_queue_exception_e);





Implement generic double fault generation mechanism

2009-04-30 Thread Dong, Eddie


Move Double-Fault generation logic out of page fault
exception generating function to cover more generic case.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab1fdac..51a8dad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -162,12 +162,59 @@ void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
+#define EXCPT_BENIGN   0
+#define EXCPT_CONTRIBUTORY 1
+#define EXCPT_PF   2
+
+static int exception_class(int vector)
+{
+   if (vector == 14)
+   return EXCPT_PF;
+   else if (vector == 0 || (vector >= 10 && vector <= 13))
+   return EXCPT_CONTRIBUTORY;
+   else
+   return EXCPT_BENIGN;
+}
+
+static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
+   unsigned nr, bool has_error, u32 error_code)
+{
+   u32 prev_nr;
+   int class1, class2;
+
+   if (!vcpu->arch.exception.pending) {
+   vcpu->arch.exception.pending = true;
+   vcpu->arch.exception.has_error_code = has_error;
+   vcpu->arch.exception.nr = nr;
+   vcpu->arch.exception.error_code = error_code;
+   return;
+   }
+
+   /* to check exception */
+   prev_nr = vcpu->arch.exception.nr;
+   class2 = exception_class(nr);
+   class1 = exception_class(prev_nr);
+   if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
+   || (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
+   /* generate double fault per SDM Table 5-5 */
+   printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
+   prev_nr, nr);
+   vcpu->arch.exception.pending = true;
+   vcpu->arch.exception.has_error_code = 1;
+   vcpu->arch.exception.nr = DF_VECTOR;
+   vcpu->arch.exception.error_code = 0;
+   if (prev_nr == DF_VECTOR) {
+   /* triple fault -> shutdown */
+   set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+   }
+   } else
+   printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
+   prev_nr, nr);
+}
+
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
 {
-   WARN_ON(vcpu->arch.exception.pending);
-   vcpu->arch.exception.pending = true;
-   vcpu->arch.exception.has_error_code = false;
-   vcpu->arch.exception.nr = nr;
+   kvm_multiple_exception(vcpu, nr, false, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception);
 
@@ -176,18 +223,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned 
long addr,
 {
++vcpu->stat.pf_guest;
 
-   if (vcpu->arch.exception.pending) {
-   if (vcpu->arch.exception.nr == PF_VECTOR) {
-   printk(KERN_DEBUG "kvm: inject_page_fault:"
-   " double fault 0x%lx\n", addr);
-   vcpu->arch.exception.nr = DF_VECTOR;
-   vcpu->arch.exception.error_code = 0;
-   } else if (vcpu->arch.exception.nr == DF_VECTOR) {
-   /* triple fault -> shutdown */
-   set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
-   }
-   return;
-   }
vcpu->arch.cr2 = addr;
kvm_queue_exception_e(vcpu, PF_VECTOR, error_code);
 }
@@ -200,11 +235,7 @@ EXPORT_SYMBOL_GPL(kvm_inject_nmi);
 
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
 {
-   WARN_ON(vcpu->arch.exception.pending);
-   vcpu->arch.exception.pending = true;
-   vcpu->arch.exception.has_error_code = true;
-   vcpu->arch.exception.nr = nr;
-   vcpu->arch.exception.error_code = error_code;
+   kvm_multiple_exception(vcpu, nr, true, error_code);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception_e);
 



deactivate fpu when cr0.PE ON?

2009-04-10 Thread Dong, Eddie
The following code deactivates the FPU when CR0.PE is on; is there any explanation? 
The rest of the code activates/deactivates the FPU based on the CR0.TS bit.


thx, eddie




static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
unsigned long guest_cr3;
u64 eptp;

guest_cr3 = cr3;
if (enable_ept) {
eptp = construct_eptp(cr3);
vmcs_write64(EPT_POINTER, eptp);
ept_sync_context(eptp);
ept_load_pdptrs(vcpu);
guest_cr3 = is_paging(vcpu) ? vcpu->arch.cr3 :
VMX_EPT_IDENTITY_PAGETABLE_ADDR;
}

vmx_flush_tlb(vcpu);
vmcs_writel(GUEST_CR3, guest_cr3);
if (vcpu->arch.cr0 & X86_CR0_PE)
vmx_fpu_deactivate(vcpu);
}


Check valid bit of VM_EXIT_INTR_INFO

2009-04-01 Thread Dong, Eddie
Thx, eddie



commit ad4a9829c8d5b30995f008e32774bd5f555b7e9f
Author: root 
Date:   Thu Apr 2 11:16:03 2009 +0800

Check the valid bit of VM_EXIT_INTR_INFO before unblocking NMI.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..689523a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3268,16 +3268,18 @@ static void vmx_complete_interrupts(struct vcpu_vmx 
*vmx)
 
exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
if (cpu_has_virtual_nmis()) {
-   unblock_nmi = (exit_intr_info & INTR_INFO_UNBLOCK_NMI) != 0;
-   vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
/*
 * SDM 3: 25.7.1.2
 * Re-set bit "block by NMI" before VM entry if vmexit caused by
 * a guest IRET fault.
 */
-   if (unblock_nmi && vector != DF_VECTOR)
-   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+   if (exit_intr_info & INTR_INFO_VALID_MASK) {
+   unblock_nmi = !!(exit_intr_info & 
INTR_INFO_UNBLOCK_NMI);
+   vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
+   if (unblock_nmi && vector != DF_VECTOR)
+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
  GUEST_INTR_STATE_NMI);
+   }
} else if (unlikely(vmx->soft_vnmi_blocked))
vmx->vnmi_blocked_time +=
ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));
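
For illustration, the check the patch adds can be exercised standalone (the bit positions follow the VMX interruption-information format: vector in bits 7:0, NMI-unblocking in bit 12, valid in bit 31; the helper name is invented):

/*
 * Only trust the NMI-unblocking and vector fields of VM_EXIT_INTR_INFO
 * when the valid bit (bit 31) is set; that is what the patch adds.
 */
#include <stdio.h>
#include <stdint.h>

#define INTR_INFO_VECTOR_MASK   0xffu        /* bits 7:0                   */
#define INTR_INFO_UNBLOCK_NMI   (1u << 12)   /* NMI unblocking due to IRET */
#define INTR_INFO_VALID_MASK    (1u << 31)   /* bit 31                     */
#define DF_VECTOR               8

/* Re-set "blocked by NMI" only when the field is valid and the exit was
 * caused by a fault on the guest's IRET (and not a double fault). */
static int nmi_blocking_needs_reset(uint32_t exit_intr_info)
{
	if (!(exit_intr_info & INTR_INFO_VALID_MASK))
		return 0;   /* stale field: ignore it */
	if (!(exit_intr_info & INTR_INFO_UNBLOCK_NMI))
		return 0;
	return (exit_intr_info & INTR_INFO_VECTOR_MASK) != DF_VECTOR;
}

int main(void)
{
	uint32_t pf_on_iret = INTR_INFO_VALID_MASK | INTR_INFO_UNBLOCK_NMI | 14;

	printf("%d\n", nmi_blocking_needs_reset(0));           /* 0: invalid field */
	printf("%d\n", nmi_blocking_needs_reset(pf_on_iret));  /* 1: #PF on IRET   */
	return 0;
}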



RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit

2009-03-31 Thread Dong, Eddie

> 
> Looks good, but doesn't apply; please check if you are working against
> the latest version.

Rebased on top of a317a1e496b22d1520218ecf16a02498b99645e2 + previous rsvd bits 
violation check patch.

thx, eddie



Use rsvd_bits_mask in load_pdptrs and remove bit 5-6 from rsvd_bits_mask 
per latest SDM.

Signed-off-by: Eddie Dong 


diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 41a0482..400c056 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -225,11 +225,6 @@ static int is_nx(struct kvm_vcpu *vcpu)
return vcpu->arch.shadow_efer & EFER_NX;
 }
 
-static int is_present_pte(unsigned long pte)
-{
-   return pte & PT_PRESENT_MASK;
-}
-
 static int is_shadow_present_pte(u64 pte)
 {
return pte != shadow_trap_nonpresent_pte
@@ -2195,6 +2190,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int 
level)
context->rsvd_bits_mask[1][0] = 0;
break;
case PT32E_ROOT_LEVEL:
+   context->rsvd_bits_mask[0][2] =
+   rsvd_bits(maxphyaddr, 63) |
+   rsvd_bits(7, 8) | rsvd_bits(1, 2);  /* PDPTE */
context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 62);  /* PDE */
context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index eaab214..3494a2f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -75,4 +75,9 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
return vcpu->arch.cr0 & X86_CR0_PG;
 }
 
+static inline int is_present_pte(unsigned long pte)
+{
+   return pte & PT_PRESENT_MASK;
+}
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9702353..3d07c9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -234,7 +234,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
goto out;
}
for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-   if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+   if (is_present_pte(pdpte[i]) &&
+   (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
ret = 0;
goto out;
}
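
As a quick sanity check of the new PDPTE mask, rsvd_bits() can be exercised in a standalone program (maxphyaddr = 36 is only an example value): for that width the mask comes out as 0xfffffff000000186, i.e. bits 1-2, 7-8 and everything from bit 36 upward, while bits 5-6 are no longer treated as reserved, per the "remove bit 5-6" note in the commit message.

/* Standalone check of how rsvd_bits() composes the PDPTE mask above. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rsvd_bits(int s, int e)
{
	return ((1ULL << (e - s + 1)) - 1) << s;
}

int main(void)
{
	int maxphyaddr = 36;   /* example value only */
	uint64_t pdpte_rsvd = rsvd_bits(maxphyaddr, 63) |  /* above MAXPHYADDR */
			      rsvd_bits(7, 8) |             /* bits 7-8         */
			      rsvd_bits(1, 2);              /* bits 1-2         */

	printf("rsvd_bits(1,2)   = %#llx\n", (unsigned long long)rsvd_bits(1, 2));
	printf("rsvd_bits(7,8)   = %#llx\n", (unsigned long long)rsvd_bits(7, 8));
	printf("rsvd_bits(36,63) = %#llx\n", (unsigned long long)rsvd_bits(36, 63));
	printf("PDPTE mask       = %#llx\n", (unsigned long long)pdpte_rsvd);
	return 0;
}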


RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit

2009-03-31 Thread Dong, Eddie
Neiger, Gil wrote:
> PDPTEs are used only if CR0.PG=CR4.PAE=1.
> 
> In that situation, their format depends the value of IA32_EFER.LMA.
> 
> If IA32_EFER.LMA=0, bit 63 is reserved and must be 0 in any PDPTE
> that is marked present.  The execute-disable setting of a page is
> determined only by the PDE and PTE.  
> 
> If IA32_EFER.LMA=1, bit 63 is used for the execute-disable in PML4
> entries, PDPTEs, PDEs, and PTEs (assuming IA32_EFER.NXE=1). 
> 
>   - Gil

Rebased.
Thanks, eddie


commit 032caed3da123950eeb3e192baf444d4eae80c85
Author: root 
Date:   Tue Mar 31 16:22:49 2009 +0800

Use rsvd_bits_mask in load_pdptrs and remove bit 5-6 from rsvd_bits_mask 
per latest SDM.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2eab758..1bed3aa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -225,11 +225,6 @@ static int is_nx(struct kvm_vcpu *vcpu)
return vcpu->arch.shadow_efer & EFER_NX;
 }
 
-static int is_present_pte(unsigned long pte)
-{
-   return pte & PT_PRESENT_MASK;
-}
-
 static int is_shadow_present_pte(u64 pte)
 {
return pte != shadow_trap_nonpresent_pte
@@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int 
level)
context->rsvd_bits_mask[1][0] = 0;
break;
case PT32E_ROOT_LEVEL:
+   context->rsvd_bits_mask[0][2] =
+   rsvd_bits(maxphyaddr, 63) |
+   rsvd_bits(7, 8) | rsvd_bits(1, 2);  /* PDPTE */
context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 62);  /* PDE */
context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 258e5d5..2a6eb50 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -75,4 +75,9 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
return vcpu->arch.cr0 & X86_CR0_PG;
 }
 
+static inline int is_present_pte(unsigned long pte)
+{
+   return pte & PT_PRESENT_MASK;
+}
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 961bd2b..b449ff0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -233,7 +233,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
goto out;
}
for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-   if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+   if (is_present_pte(pdpte[i]) &&
+   (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
ret = 0;
goto out;
}



FW: Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit

2009-03-30 Thread Dong, Eddie
Avi Kivity wrote:
> Dong, Eddie wrote:
>> @@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu
>>  *vcpu, int level) context->rsvd_bits_mask[1][0] = 0;
>>  break;
>>  case PT32E_ROOT_LEVEL:
>> +context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
>> +rsvd_bits(maxphyaddr, 62) |
>> +rsvd_bits(7, 8) | rsvd_bits(1, 2);  /* PDPTE */
>>  context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
>>  rsvd_bits(maxphyaddr, 62);  /* PDE */
>>  context->rsvd_bits_mask[0][0] = exb_bit_rsvd
> 
> Are you sure that PDPTEs support NX?  They don't support R/W and U/S,
> so it seems likely that NX is reserved as well even when EFER.NXE is
> enabled. 


Gil:
Here is the original mail on the KVM mailing list. If you are able to 
help, that would be great.
thx, eddie


RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit

2009-03-30 Thread Dong, Eddie
Avi Kivity wrote:
> Dong, Eddie wrote:
>> @@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu
>>  *vcpu, int level) context->rsvd_bits_mask[1][0] = 0;
>>  break;
>>  case PT32E_ROOT_LEVEL:
>> +context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
>> +rsvd_bits(maxphyaddr, 62) |
>> +rsvd_bits(7, 8) | rsvd_bits(1, 2);  /* PDPTE */
>>  context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
>>  rsvd_bits(maxphyaddr, 62);  /* PDE */
>>  context->rsvd_bits_mask[0][0] = exb_bit_rsvd
> 
> Are you sure that PDPTEs support NX?  They don't support R/W and U/S,
> so it seems likely that NX is reserved as well even when EFER.NXE is
> enabled. 

I am referring to Fig. 3-20/3-21 of SDM 3A, but I think Fig. 3-20/21 has the EXB bit 
missing, judging from Table 3-5 and section 3.10.3.
I will double-check with an internal architect. 
thx, eddie


RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit

2009-03-30 Thread Dong, Eddie
Dong, Eddie wrote:
> This is followup of rsvd_bits emulation.
> 
Based on the new rsvd_bits emulation patch.
thx, eddie


commit 2c1472ef2b9fd87a261e8b58a7db11afd6a111dc
Author: root 
Date:   Mon Mar 30 17:05:47 2009 +0800

Use rsvd_bits_mask in load_pdptrs for cleanup with EXB bit considered.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2eab758..eaf41c0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -225,11 +225,6 @@ static int is_nx(struct kvm_vcpu *vcpu)
return vcpu->arch.shadow_efer & EFER_NX;
 }
 
-static int is_present_pte(unsigned long pte)
-{
-   return pte & PT_PRESENT_MASK;
-}
-
 static int is_shadow_present_pte(u64 pte)
 {
return pte != shadow_trap_nonpresent_pte
@@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int 
level)
context->rsvd_bits_mask[1][0] = 0;
break;
case PT32E_ROOT_LEVEL:
+   context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62) |
+   rsvd_bits(7, 8) | rsvd_bits(1, 2);  /* PDPTE */
context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 62);  /* PDE */
context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 258e5d5..2a6eb50 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -75,4 +75,9 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
return vcpu->arch.cr0 & X86_CR0_PG;
 }
 
+static inline int is_present_pte(unsigned long pte)
+{
+   return pte & PT_PRESENT_MASK;
+}
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 961bd2b..b449ff0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -233,7 +233,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
goto out;
}
for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-   if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+   if (is_present_pte(pdpte[i]) &&
+   (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
ret = 0;
goto out;
}


RE: Cleanup to reuse is_long_mode()

2009-03-30 Thread Dong, Eddie
Avi Kivity wrote:
> Dong, Eddie wrote:
>>  struct vcpu_svm *svm = to_svm(vcpu);
>> 
>>  #ifdef CONFIG_X86_64
>> -if (vcpu->arch.shadow_efer & EFER_LME) {
>> +if (is_long_mode(vcpu)) {
>> 
> 
> is_long_mode() actually tests EFER_LMA, so this is incorrect.

Is something missing? Here is the definition of is_long_mode; the patch is just 
an equivalent replacement.
thx, eddie


static inline int is_long_mode(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_X86_64
return vcpu->arch.shadow_efer & EFER_LME;
#else
return 0;
#endif
}


RE: RFC: Add reserved bits check

2009-03-30 Thread Dong, Eddie
> 
> Just noticed that walk_addr() too can be called from tdp context, so
> need to make sure rsvd_bits_mask is initialized in init_kvm_tdp_mmu()
> as well.

Yes, fixed.
Thx, eddie


commit b282565503a78e75af643de42fe7bf495e2213ec
Author: root 
Date:   Mon Mar 30 16:57:39 2009 +0800

Emulate #PF error code of reserved bits violation.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55fd4c5..4fe2742 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,6 +261,7 @@ struct kvm_mmu {
union kvm_mmu_page_role base_role;
 
u64 *pae_root;
+   u64 rsvd_bits_mask[2][4];
 };
 
 struct kvm_vcpu_arch {
@@ -791,5 +792,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ef060ec..2eab758 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -126,6 +126,7 @@ module_param(oos_shadow, bool, 0644);
 #define PFERR_PRESENT_MASK (1U << 0)
 #define PFERR_WRITE_MASK (1U << 1)
 #define PFERR_USER_MASK (1U << 2)
+#define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
 #define PT_DIRECTORY_LEVEL 2
@@ -179,6 +180,11 @@ static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mt_mask;
 
+static inline u64 rsvd_bits(int s, int e)
+{
+   return ((1ULL << (e - s + 1)) - 1) << s;
+}
+
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
 {
shadow_trap_nonpresent_pte = trap_pte;
@@ -2155,6 +2161,14 @@ static void paging_free(struct kvm_vcpu *vcpu)
nonpaging_free(vcpu);
 }
 
+static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
+{
+   int bit7;
+
+   bit7 = (gpte >> 7) & 1;
+   return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -2163,6 +2177,54 @@ static void paging_free(struct kvm_vcpu *vcpu)
 #include "paging_tmpl.h"
 #undef PTTYPE
 
+void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
+{
+   struct kvm_mmu *context = &vcpu->arch.mmu;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+   u64 exb_bit_rsvd = 0;
+
+   if (!is_nx(vcpu))
+   exb_bit_rsvd = rsvd_bits(63, 63);
+   switch (level) {
+   case PT32_ROOT_LEVEL:
+   /* no rsvd bits for 2 level 4K page table entries */
+   context->rsvd_bits_mask[0][1] = 0;
+   context->rsvd_bits_mask[0][0] = 0;
+   if (is_cpuid_PSE36())
+   /* 36bits PSE 4MB page */
+   context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
+   else
+   /* 32 bits PSE 4MB page */
+   context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
+   context->rsvd_bits_mask[1][0] = 0;
+   break;
+   case PT32E_ROOT_LEVEL:
+   context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62);  /* PDE */
+   context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62);  /* PTE */
+   context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62) |
+   rsvd_bits(13, 20);  /* large page */
+   context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+   break;
+   case PT64_ROOT_LEVEL:
+   context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+   context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
+   context->rsvd_bits_mask[1][2] = context->rsvd_bits_mask[0][2];
+   context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
+   context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+   break;
+   }
+}
+
 static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
 {
struct kvm_mmu *context = &vcpu->arch.mmu;
@@ -2183,6 +2245,7 @@ static int paging64_init_context_common(struct kvm_vcpu 
*vcpu, int level)
 
 static int paging64_init_context(struct kvm_vcpu *vcpu)
 {
+   reset_rsvds_bits_mask(vcpu, PT64_

Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit

2009-03-29 Thread Dong, Eddie
This is a follow-up to the rsvd_bits emulation.

thx, eddie




commit 171eb2b2d8282dd913a5d5c6c695fd64e1ddcf4c
Author: root 
Date:   Mon Mar 30 11:39:50 2009 +0800

Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0a6f109..b0bf8b2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2255,6 +2255,9 @@ static int paging32E_init_context(struct kvm_vcpu *vcpu)
if (!is_nx(vcpu))
exb_bit_rsvd = rsvd_bits(63, 63);
 
+   context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62) |
+   rsvd_bits(7, 8) | rsvd_bits(1, 2);  /* PDPTE */
context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 62);  /* PDE */
context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
@@ -2270,6 +2273,17 @@ static int paging32E_init_context(struct kvm_vcpu *vcpu)
 static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 {
struct kvm_mmu *context = &vcpu->arch.mmu;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+   u64 exb_bit_rsvd = 0;
+
+   if (!is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu)) {
+   if (!is_nx(vcpu))
+   exb_bit_rsvd = rsvd_bits(63, 63);
+
+   context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62) |
+   rsvd_bits(7, 8) | rsvd_bits(1, 2);  /* PDPTE */
+   }
 
context->new_cr3 = nonpaging_new_cr3;
context->page_fault = tdp_page_fault;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 961bd2b..ff178fd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -233,7 +233,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
goto out;
}
for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-   if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+   if ((pdpte[i] & PT_PRESENT_MASK) &&
+   (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
ret = 0;
goto out;
}


Cleanup to reuse is_long_mode()

2009-03-29 Thread Dong, Eddie


Thanks, Eddie




commit 6688a1fbc37330f2c4e16d1a78050b64e1ce5dcc
Author: root 
Date:   Mon Mar 30 11:31:10 2009 +0800

cleanup to reuse is_long_mode(vcpu)

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index db5021b..affc31d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -859,7 +859,7 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned 
long cr0)
struct vcpu_svm *svm = to_svm(vcpu);
 
 #ifdef CONFIG_X86_64
-   if (vcpu->arch.shadow_efer & EFER_LME) {
+   if (is_long_mode(vcpu)) {
if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
vcpu->arch.shadow_efer |= EFER_LMA;
svm->vmcb->save.efer |= EFER_LMA | EFER_LME;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9913a1d..b1f1458 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1543,7 +1543,7 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned 
long cr0)
enter_rmode(vcpu);
 
 #ifdef CONFIG_X86_64
-   if (vcpu->arch.shadow_efer & EFER_LME) {
+   if (is_long_mode(vcpu)) {
if (!is_paging(vcpu) && (cr0 & X86_CR0_PG))
enter_lmode(vcpu);
if (is_paging(vcpu) && !(cr0 & X86_CR0_PG))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bf6683a..961bd2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -289,7 +289,7 @@ void kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 
if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
 #ifdef CONFIG_X86_64
-   if ((vcpu->arch.shadow_efer & EFER_LME)) {
+   if (is_long_mode(vcpu)) {
int cs_db, cs_l;
 
if (!is_pae(vcpu)) {

cleanup.patch
Description: cleanup.patch


RE: RFC: Add reserved bits check

2009-03-29 Thread Dong, Eddie
>> +static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int
>> level) +{ +  int ps = 0;
>> +
>> +if (level == PT_DIRECTORY_LEVEL)
>> +ps = !!(gpte & PT_PAGE_SIZE_MASK);
>> 
> 
> No need for this.  If you set rsvd_bits_mask[1][0] ==
> rsvd_bits_mask[0][0], then you get the same behaviour.  The first
> index is not the page size, it's just bit 7.

Sure, fixed.

> 
> You'll need to fill all the indexes for bit 7 == 1, but it's worth it,
> with the 1GB pages patch.
> 
>> +return (gpte & vcpu->arch.mmu.rsvd_bits_mask[ps][level-1]) != 0; +}
>> +
>>  #define PTTYPE 64
>>  #include "paging_tmpl.h"
>>  #undef PTTYPE
>> 
>> +int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>> +{
>> +struct kvm_cpuid_entry2 *best;
>> +
>> +best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
>> +if (best)
>> +return best->eax & 0xff;
>> +return 32;
>> +}
>> +
>> 
> 
> Best to return 36 if the cpu doesn't support cpuid 0x80000008 but does
> support pae.

Mmm, I noticed conflicting information in the SDM, but you are right :)
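
For clarity, a minimal sketch of the suggested fallback (an illustration of the
suggestion only, not the code in the reposted patch below; the PAE check via
CPUID.1:EDX bit 6 and the return of 36 are assumptions made here):

static int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
{
        struct kvm_cpuid_entry2 *best;

        best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
        if (best)
                return best->eax & 0xff;        /* MAXPHYADDR in EAX[7:0] */

        /* No leaf 0x80000008: a CPU that advertises PAE (CPUID.1:EDX bit 6)
         * still supports 36 physical address bits. */
        best = kvm_find_cpuid_entry(vcpu, 1, 0);
        if (best && (best->edx & (1 << 6)))
                return 36;
        return 32;
}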

One more modification: after double checking with our internal architect, the
RSVD bit in the #PF error code is not set if P=0.
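
In other words, a small self-contained sketch of the intended error-code
behaviour (the mask values match the PFERR_* definitions in mmu.c; the helper
itself is only illustrative, not the posted code):

#define PFERR_PRESENT_MASK (1U << 0)
#define PFERR_RSVD_MASK    (1U << 3)

/* Reserved-bit violations are only reported for a present guest entry;
 * when P=0 the fault is delivered as a plain not-present fault. */
static unsigned int pf_error_code(int present, int rsvd_fault,
                                  unsigned int access_flags)
{
        unsigned int error_code = access_flags; /* write/user/fetch bits */

        if (!present)
                return error_code;              /* RSVD bit stays clear */
        error_code |= PFERR_PRESENT_MASK;
        if (rsvd_fault)
                error_code |= PFERR_RSVD_MASK;
        return error_code;
}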

Thanks and reposted.
Eddie




Emulate #PF error code of reserved bits violation.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55fd4c5..4fe2742 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,6 +261,7 @@ struct kvm_mmu {
union kvm_mmu_page_role base_role;
 
u64 *pae_root;
+   u64 rsvd_bits_mask[2][4];
 };
 
 struct kvm_vcpu_arch {
@@ -791,5 +792,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ef060ec..0a6f109 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -126,6 +126,7 @@ module_param(oos_shadow, bool, 0644);
 #define PFERR_PRESENT_MASK (1U << 0)
 #define PFERR_WRITE_MASK (1U << 1)
 #define PFERR_USER_MASK (1U << 2)
+#define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
 #define PT_DIRECTORY_LEVEL 2
@@ -179,6 +180,11 @@ static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mt_mask;
 
+static inline u64 rsvd_bits(int s, int e)
+{
+   return ((1ULL << (e - s + 1)) - 1) << s;
+}
+
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
 {
shadow_trap_nonpresent_pte = trap_pte;
@@ -2155,6 +2161,14 @@ static void paging_free(struct kvm_vcpu *vcpu)
nonpaging_free(vcpu);
 }
 
+static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
+{
+   int bit7;
+
+   bit7 = (gpte >> 7) & 1;
+   return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -2183,6 +2197,25 @@ static int paging64_init_context_common(struct kvm_vcpu 
*vcpu, int level)
 
 static int paging64_init_context(struct kvm_vcpu *vcpu)
 {
+   struct kvm_mmu *context = &vcpu->arch.mmu;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+   u64 exb_bit_rsvd = 0;
+
+   if (!is_nx(vcpu))
+   exb_bit_rsvd = rsvd_bits(63, 63);
+
+   context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+   context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
+   context->rsvd_bits_mask[1][2] = context->rsvd_bits_mask[0][2];
+   context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
+   context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL);
 }
 
@@ -2190,6 +2223,16 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 {
struct kvm_mmu *context = &vcpu->arch.mmu;
 
+   /* no rsvd bits for 2 level 4K page table entries */
+   context->rsvd_bits_mask[0][1] = 0;
+   context->rsvd_bits_mask[0][0] = 0;
+   if (is_cpuid_PSE36())
+   /* 36bits PSE 4MB page */
+   context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
+   else
+   /* 32 bits PSE 4MB page */
+   context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
+   context->rsvd_bits_mask[1][0] = 0;
context->new_cr3 = paging_new_cr3;
context->page_fault = paging32_page_fault;
context->gva_to_gpa = pagin

RE: RFC: Add reserved bits check

2009-03-27 Thread Dong, Eddie
> 
> Need to make sure rsvd_bits_mask[] is maintained on ept and npt, then.

Sure, that will be in the next patch; posting the current modified one below.

Thx, eddie



Current KVM doesn't check reserved bits of guest page table entries, but uses
reserved bits to bypass guest #PF in VMX.

 

This patch adds a reserved bits check, leaving the shadow pte un-constructed if
a guest entry has RSVD=1.


commit dd1d697edf42953d407c10f4d38c650aafd3d3d5
Author: root 
Date:   Fri Mar 27 23:35:27 2009 +0800

Emulate #PF error code of reserved bits violation.

Signed-off-by: Eddie Dong 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55fd4c5..4fe2742 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,6 +261,7 @@ struct kvm_mmu {
union kvm_mmu_page_role base_role;
 
u64 *pae_root;
+   u64 rsvd_bits_mask[2][4];
 };
 
 struct kvm_vcpu_arch {
@@ -791,5 +792,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ef060ec..35af90a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -126,6 +126,7 @@ module_param(oos_shadow, bool, 0644);
 #define PFERR_PRESENT_MASK (1U << 0)
 #define PFERR_WRITE_MASK (1U << 1)
 #define PFERR_USER_MASK (1U << 2)
+#define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
 #define PT_DIRECTORY_LEVEL 2
@@ -179,6 +180,11 @@ static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mt_mask;
 
+static inline u64 rsvd_bits(int s, int e)
+{
+   return ((1ULL << (e - s + 1)) - 1) << s;
+}
+
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
 {
shadow_trap_nonpresent_pte = trap_pte;
@@ -2155,6 +2161,15 @@ static void paging_free(struct kvm_vcpu *vcpu)
nonpaging_free(vcpu);
 }
 
+static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
+{
+   int ps = 0;
+
+   if (level == PT_DIRECTORY_LEVEL)
+   ps = !!(gpte & PT_PAGE_SIZE_MASK);
+   return (gpte & vcpu->arch.mmu.rsvd_bits_mask[ps][level-1]) != 0;
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -2183,6 +2198,22 @@ static int paging64_init_context_common(struct kvm_vcpu 
*vcpu, int level)
 
 static int paging64_init_context(struct kvm_vcpu *vcpu)
 {
+   struct kvm_mmu *context = &vcpu->arch.mmu;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+   u64 exb_bit_rsvd = 0;
+
+   if (!is_nx(vcpu))
+   exb_bit_rsvd = rsvd_bits(63, 63);
+
+   context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+   context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+   context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL);
 }
 
@@ -2190,6 +2221,15 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 {
struct kvm_mmu *context = &vcpu->arch.mmu;
 
+   /* no rsvd bits for 2 level 4K page table entries */
+   context->rsvd_bits_mask[0][0] = 0;
+   context->rsvd_bits_mask[0][1] = 0;
+   if (is_cpuid_PSE36())
+   /* 36bits PSE 4MB page */
+   context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
+   else
+   /* 32 bits PSE 4MB page */
+   context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
context->new_cr3 = paging_new_cr3;
context->page_fault = paging32_page_fault;
context->gva_to_gpa = paging32_gva_to_gpa;
@@ -2205,6 +2245,21 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 
 static int paging32E_init_context(struct kvm_vcpu *vcpu)
 {
+   struct kvm_mmu *context = &vcpu->arch.mmu;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+   u64 exb_bit_rsvd = 0;
+
+   if (!is_nx(vcpu))
+   exb_bit_rsvd = rsvd_bits(63, 63);
+
+   context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62);  /* PDE */
+   context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62);  /* PTE */
+   context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+   rsvd_bits(maxphyaddr, 62) |
+   rsvd_bits(13, 20);  /* large page */
+
return paging64_init_context_common(vcpu, PT32E_ROOT_LEVEL);
 }
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_

RE: RFC: Add reserved bits check

2009-03-27 Thread Dong, Eddie

>> Will never be use, PDPTEs are loaded by set_cr3(), not walk_addr().
>> 
> 
> I see, then how about to replace CR3_PAE_RESERVED_BITS check at cr3
> load with 
> rsvd_bits_mask[2]? Seems current code are lacking of enough reserved
> bits check too. 
> 

typo, I mean this:

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -233,7 +233,7 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
goto out;
}
for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-   if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+   if ((pdpte[i] & 1) && (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
ret = 0;
goto out;
}


RE: RFC: Add reserved bits check

2009-03-27 Thread Dong, Eddie
>> + context->rsvd_bits_mask[0] = rsvd_bits(maxphyaddr, 51);
>> + context->large_page_rsvd_mask =  /* 2MB PDE */
>> +  rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
>>   return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL);  }
>> 
> 
> Isn't bit 63 reserved if NX is disabled?

Sure.

> 
>> 
>> @@ -2206,6 +2258,18 @@ static int paging32_init_context(struct
>> kvm_vcpu *vcpu) 
>> 
>>  static int paging32E_init_context(struct kvm_vcpu *vcpu)  {
>> + struct kvm_mmu *context = &vcpu->arch.mmu;
>> + int maxphyaddr = cpuid_maxphyaddr(vcpu);
>> +
>> + /* 3 levels */
>> + context->rsvd_bits_mask[2] = rsvd_bits(maxphyaddr, 63) |
>> +  rsvd_bits(7, 8) | rsvd_bits(1,2); /* PDPTE */
>> 
> 
> Will never be use, PDPTEs are loaded by set_cr3(), not walk_addr().
> 

I see; then how about replacing the CR3_PAE_RESERVED_BITS check at cr3 load with
rsvd_bits_mask[2]? It seems the current code is lacking sufficient reserved bits
checks too.

>> @@ -153,10 +154,13 @@ walk:
>>  walker->level - 1, table_gfn);
>> 
>>kvm_read_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
>> +  rsvd_fault = is_rsvd_bits_set(vcpu, pte, walker->level);
>> 
> 
> Does a not present pte set PFERR_RSVD?

Yes, though most commercial OSes don't use it.
I plan to post a follow-up patch to fix the potential RSVD_fault error code
mismatch when bypass_guest_pf=1.


RFC: Add reserved bits check

2009-03-26 Thread Dong, Eddie
 



Current KVM doesn't check reserved bits of the guest page table, while it may use
reserved bits to bypass guest #PF in VMX.

 

This patch adds this check, leaving the shadow pte un-constructed if a guest
entry has RSVD=1.

 

Comments?

Thx, eddie
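
As a quick worked example of the rsvd_bits() helper introduced below (a
stand-alone, compilable illustration; the two masks shown are ones the patch
builds, assuming a 40-bit MAXPHYADDR):

#include <stdio.h>

static inline unsigned long long rsvd_bits(int s, int e)
{
        return ((1ULL << (e - s + 1)) - 1) << s;        /* bits s..e set */
}

int main(void)
{
        /* low bits of a large-page PDE: bits 13..20 -> 0x1fe000 */
        printf("rsvd_bits(13, 20) = %#llx\n", rsvd_bits(13, 20));
        /* address bits above a 40-bit MAXPHYADDR: 40..51 -> 0xfff0000000000 */
        printf("rsvd_bits(40, 51) = %#llx\n", rsvd_bits(40, 51));
        return 0;
}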

 

 

 

 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55fd4c5..9370ff0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,6 +261,8 @@ struct kvm_mmu {
  union kvm_mmu_page_role base_role;
 
  u64 *pae_root;
+ u64 rsvd_bits_mask[4];
+ u64 large_page_rsvd_mask;
 };
 
 struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 31ba3cb..7f55c4a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -127,6 +127,7 @@ module_param(oos_shadow, bool, 0644);
 #define PFERR_PRESENT_MASK (1U << 0)
 #define PFERR_WRITE_MASK (1U << 1)
 #define PFERR_USER_MASK (1U << 2)
+#define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
 #define PT_DIRECTORY_LEVEL 2
@@ -179,6 +180,13 @@ static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mt_mask;
+extern struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(
+ struct kvm_vcpu *vcpu, u32 function, u32 index);
+
+static inline u64 rsvd_bits(int s, int e)
+{
+ return ((1ULL << (e - s + 1)) - 1) << s;
+}
 
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
 {
@@ -251,6 +259,18 @@ static int is_rmap_pte(u64 pte)
  return is_shadow_present_pte(pte);
 }
 
+static int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
+{
+ u32 function=0x80000008;
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, function, 0);
+ if (best) {
+  return best->eax & 0xff;
+ }
+ return 40;
+}
+
 static pfn_t spte_to_pfn(u64 pte)
 {
  return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
@@ -2156,6 +2176,17 @@ static void paging_free(struct kvm_vcpu *vcpu)
  nonpaging_free(vcpu);
 }
 
+static int is_rsvd_bits_set(struct kvm_vcpu *vcpu, unsigned long pte, int level)
+{
+ if (level == PT_DIRECTORY_LEVEL && (pte & PT_PAGE_SIZE_MASK)) {
+  /* large page */
+  return (pte & vcpu->arch.mmu.large_page_rsvd_mask) != 0;
+ }
+ else
+  /* 4K page */
+  return (pte & vcpu->arch.mmu.rsvd_bits_mask[level-1]) != 0;
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -2184,6 +2215,18 @@ static int paging64_init_context_common(struct kvm_vcpu 
*vcpu, int level)
 
 static int paging64_init_context(struct kvm_vcpu *vcpu)
 {
+ struct kvm_mmu *context = &vcpu->arch.mmu;
+ int maxphyaddr = cpuid_maxphyaddr(vcpu);
+
+ context->rsvd_bits_mask[3] =
+  rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+ context->rsvd_bits_mask[2] =
+  rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+ context->rsvd_bits_mask[1] =
+  rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+ context->rsvd_bits_mask[0] = rsvd_bits(maxphyaddr, 51);
+ context->large_page_rsvd_mask =  /* 2MB PDE */
+  rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
  return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL);
 }
 
@@ -2191,6 +2234,15 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 {
  struct kvm_mmu *context = &vcpu->arch.mmu;
 
+ /* no rsvd bits for 2 level 4K page table entries */
+ context->rsvd_bits_mask[0] = 0;
+ context->rsvd_bits_mask[1] = 0;
+ if (is_cpuid_PSE36())
+  /* 36bits PSE 4MB page */
+  context->large_page_rsvd_mask = rsvd_bits(17, 21);
+ else
+  /* 32 bits PSE 4MB page */
+  context->large_page_rsvd_mask = rsvd_bits(13, 21);
  context->new_cr3 = paging_new_cr3;
  context->page_fault = paging32_page_fault;
  context->gva_to_gpa = paging32_gva_to_gpa;
@@ -2206,6 +2258,18 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 
 static int paging32E_init_context(struct kvm_vcpu *vcpu)
 {
+ struct kvm_mmu *context = &vcpu->arch.mmu;
+ int maxphyaddr = cpuid_maxphyaddr(vcpu);
+
+ /* 3 levels */
+ context->rsvd_bits_mask[2] = rsvd_bits(maxphyaddr, 63) |
+  rsvd_bits(7, 8) | rsvd_bits(1,2); /* PDPTE */
+ context->rsvd_bits_mask[1] = rsvd_bits(maxphyaddr, 63); /* PDE */
+ context->rsvd_bits_mask[0] =   /* PTE */
+  rsvd_bits(maxphyaddr, 63) | rsvd_bits(7, 8) | rsvd_bits(1, 2);
+ context->large_page_rsvd_mask =   /* 2M page */
+   rsvd_bits(maxphyaddr, 63) | rsvd_bits(13, 20);
+
  return paging64_init_context_common(vcpu, PT32E_ROOT_LEVEL);
 }
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7314c09..844efe9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -123,6 +123,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
  gfn_t table_gfn;
  unsigned index, pt_access, pte_access;
  gpa_t pte_gpa;
+ int rsvd_fault;
 
  pgprintk("%s: addr %lx\n", __func__, addr);
 walk:
@@ -153,10 +154,13 @@ walk:
 walker->level - 1, table_gfn);
 
   kvm_read_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
+  rsvd_fault = is_rsvd_bits_set(vcpu, pte, walker->level);
 
   if (!is_present_pte(pte))
goto not_p

RE: Fix kernel pio emulation mistake

2009-03-20 Thread Dong, Eddie

> kvm_emulate_pio() returns 1 when emulation is complete,
> and 0 when emulation needs further processing in
> userspace.  So I think in both cases cannot_emulate is
> the wrong answer.  I think 'in' emulation gets it right.

OK, yes. Do you mean this? I may have misunderstood.

Thx, eddie

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index ca91749..838fdc9 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1838,11 +1838,11 @@ special_insn:
io_dir_in = 0;
do_io:  if (kvm_emulate_pio(ctxt->vcpu, NULL, io_dir_in,
   (c->d & ByteOp) ? 1 : c->op_bytes,
-  port) != 0) {
+  port) == 0) {
c->eip = saved_eip;
-   goto cannot_emulate;
+   return -1;
}
-   break;
+   return 0;
case 0xf4:  /* hlt */
ctxt->vcpu->arch.halt_request = 1;
break;
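
To make the convention explicit, a stand-alone sketch (not the emulator code
itself; kvm_emulate_pio() is stubbed out here, and the key facts assumed are the
ones stated above: it returns 1 when the PIO completed in the kernel and 0 when
a userspace exit is needed, and neither value is an emulation failure):

/* Stand-in for kvm_emulate_pio(): 1 = completed in kernel,
 * 0 = must be finished in userspace. */
static int emulate_pio_stub(int completed_in_kernel)
{
        return completed_in_kernel ? 1 : 0;
}

/* Mirrors the hunk above: on 0, restore eip and leave the emulator so
 * userspace can complete the PIO; on 1, keep emulating.  Jumping to
 * cannot_emulate would be wrong in both cases. */
static int do_io_sketch(unsigned long *eip, unsigned long saved_eip,
                        int completed_in_kernel)
{
        if (emulate_pio_stub(completed_in_kernel) == 0) {
                *eip = saved_eip;       /* userspace completes the PIO */
                return -1;              /* leave the emulator */
        }
        return 0;                       /* continue with the next instruction */
}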


Fix kernel pio emulation mistake

2009-03-19 Thread Dong, Eddie
The kernel PIO emulation return value is mistakenly checked; fortunately it is not
hit yet for normal OS bootup :(


Signed-off-by: Eddie Dong 



commit 98d3dc8b67ba0bc7f494de3ade8f2b5cfcadaeb4
Author: root 
Date:   Thu Mar 19 15:44:39 2009 +0800

fix a bug when kernel PIO is emulated.

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index ca91749..0edd2e7 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1838,7 +1838,7 @@ special_insn:
io_dir_in = 0;
do_io:  if (kvm_emulate_pio(ctxt->vcpu, NULL, io_dir_in,
   (c->d & ByteOp) ? 1 : c->op_bytes,
-  port) != 0) {
+  port) == 0) {
c->eip = saved_eip;
goto cannot_emulate;
}

pio.patch
Description: pio.patch


RE: copyless virtio net thoughts?

2009-02-18 Thread Dong, Eddie
Simon Horman wrote:
> On Wed, Feb 18, 2009 at 10:08:00PM +1030, Rusty Russell
> wrote: 
>> 
>> 2) Direct NIC attachment This is particularly
>> interesting with SR-IOV or other multiqueue nics, but
>> for boutique cases or benchmarks, could be for normal
>> NICs.  So far I have some very sketched-out patches: for
>> the attached nic dev_alloc_skb() gets an skb from the
>> guest (which supplies them via some kind of AIO
>> interface), and a branch in netif_receive_skb() which
>> returned it to the guest.  This bypasses all firewalling
>> in the host though; we're basically having the guest
>> process drive the NIC directly.  
> 
> Hi Rusty,
> 
> Can I clarify that the idea with utilising SR-IOV would
> be to assign virtual functions to guests? That is,
> something conceptually similar to PCI pass-through in Xen
> (although I'm not sure that anyone has virtual function
> pass-through working yet). If so, wouldn't this also be
> useful on machines that have multiple NICs? 
> 
Yes, and we have successfully got it running by assigning a VF to the guest in both
Xen & KVM, but we are still working on pushing those patches out since it needs
Linux PCI subsystem support & driver support.

Thx, eddie


RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-06 Thread Dong, Eddie

> What we would rather do in KVM, is have the VFs appear in
> the host as standard network devices.  We would then like
> to back our existing PV driver to this VF directly
> bypassing the host networking stack.  A key feature here
> is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get
> zero-copy  
> receive traffic.  This will perform just as well as doing
> passthrough (at least) and avoid all that ugliness of
> dealing with SR-IOV in the guest. 
> 

Anthony:
This is already addressed by the VMDq solution (or so-called netchannel2),
right? Qing He is debugging the KVM side patch and is pretty close to the end.

For this single purpose, we don't need SR-IOV. BTW, at least the Intel
SR-IOV NIC also supports VMDq, so you can achieve this by simply using a "native"
VMDq-enabled driver here, plus the work we are debugging now.

Thx, eddie


RE: [PATCH 6/6 v3] PCI: document the change

2008-10-13 Thread Dong, Eddie
Matthew Wilcox wrote:
> On Tue, Oct 14, 2008 at 10:14:35AM +0800, Yu Zhao wrote:
>>> BTW, the SR-IOV patch is not only for network, some
>>> other devices such as IDE will use same code base as
>>> well and we image it could have other parameter to set
>>> such as starting LBA of a IDE VF.   
>> 
>> As Eddie said, we have two problems here:
>> 1) User has to set device specific parameters of a VF
>> when he wants to use this VF with KVM (assign this
>> device to KVM guest). In this case, 
>> VF driver is not loaded in the host environment. So
>> operations which 
>> are implemented as driver callback (e.g.
>> set_mac_address()) are not supported.
> 
> I suspect what you want to do is create, then configure
> the device in the host, then assign it to the guest.

That is not true. Remember that the created VFs will be destroyed by a PF
power event or by an error-recovery reset.
So what we want is:

Config, create, assign, and then deassign and destroy and then
recreate...
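
For example, with the proposed layout the "config" step can be a plain sysfs
write from a host-side tool before the VF is assigned to a guest (a sketch
only; the device address and MAC value below are made up for illustration):

#include <stdio.h>

int main(void)
{
        /* hypothetical PF at 0000:01:00.0, configuring its first VF */
        const char *path = "/sys/bus/pci/devices/0000:01:00.0/iov/0/mac";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return 1;
        }
        fprintf(f, "00:1b:21:aa:bb:cc\n");     /* value update notifies the PF driver */
        fclose(f);
        return 0;
}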

> 
>> 2) For security reason, some SR-IOV devices prohibit the
>> VF driver configuring the VF via its own register space.
>> Instead, the configurations must be done through the PF
>> which the VF is associated with. This means PF driver
>> has to receive parameters that are used to configure its
>> VFs. These parameters obviously can be passed by
>> traditional tools, if without modification for SR-IOV. 
> 
> I think that idea also covers this point.
> 
Sorry, can you explain a little bit more? The SR-IOV patch won't define
what kind of entries should be created or not; we leave it to the network
subsystem to decide what to do. Same for the disk subsystem, etc.

Thx, eddie


RE: [PATCH 6/6 v3] PCI: document the change

2008-10-13 Thread Dong, Eddie
Matthew Wilcox wrote:
> On Tue, Oct 14, 2008 at 08:23:34AM +0800, Dong, Eddie
> wrote: 
>> Matthew Wilcox wrote:
>>> Wouldn't it be more useful to have the iov/N directories
>>> be a symlink to the actual pci_dev used by the virtual
>>> function?
>> 
>> The main concern here is that a VF may be disabed such
>> as when PF enter D3 state or undergo an reset and thus
>> be plug-off, but user won't re-configure the VF in case
>> the PF return back to working state. 
> 
> If we're relying on the user to reconfigure virtual
> functions on return to D0 from D3, that's a very fragile
> system. 

No, that is the concern: we don't put those configurations under the VF nodes
because they will disappear.
Am I missing something?

> 
>>>> +For network device, there are:
>>>> +   - /sys/bus/pci/devices/BB:DD.F/iov/N/mac
>>>> +   - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan
>>>> +   (value update will notify PF driver)
>>> 
>>> We already have tools to set the MAC and VLAN parameters
>>> for network devices.
>> 
>> Do you mean Ethtool? If yes, it is impossible for SR-IOV
>> since the configuration has to be done in PF side,
>> rather than VF. 
> 
> I don't think ethtool has that ability; ip(8) can set mac
> addresses and vconfig(8) sets vlan parameters.
> 
> The device driver already has to be aware of SR-IOV.  If
> it's going to support the standard tools (and it damn
> well ought to), then it should call the PF driver to set
> these kinds of parameters. 

OK, if it takes a VF parameter, I will look into the details.
BTW, the SR-IOV patch is not only for network; some other devices such
as IDE will use the same code base as well, and we imagine they could have other
parameters to set, such as the starting LBA of an IDE VF.



> 
>>> I'm not 100% convinced about this API.  The assumption
>>> here is that the driver will do it, but I think it
>>> should probably be in the core.  The driver probably
>>> wants to be 
>> 
>> Our concern is that the PF driver may put an default
>> state when it is loaded so that SR-IOV can work without
>> any user level configuration, but of course the driver
>> won't dynamically change it. 
>> Do u mean we remove this ability?
>> 
>>> notified that the PCI core is going to create a virtual
>>> function, and would it please prepare to do so, but I'm
>>> not convinced this should be triggered by the driver.
>>> How would the driver decide to create a new virtual
>>> function?
> 
> Let me try to explain this a bit better.
> 
> The user decides they want a new ethernet virtual
> function.  In the scheme as you have set up:
> 
> 1. User communicates to ethernet driver "I want a new VF"
> 2. Ethernet driver tells PCI core "create new VF".
> 
> I propose:
> 
> 1. User tells PCI core "I want a new VF on PCI device
> :01:03.0" 
> 2. PCI core tells driver "User wants a new VF"

If the user needs a new VF, the VF must already be enabled and exist in the OS.
Otherwise, we need to disable all VFs first and then change NumVFs to
re-enable them (see the sketch below).
The spec says: "NumVFs may only be written while VF Enable is Clear"
> 
> My scheme gives us a unified way of creating new VFs,
> yours requires each driver to invent a way for the user
> to tell them to create a new VF. Unless I've
> misunderstood your code and docs. 

Assigning a VF is kind of a user & core job.

> 
>>> From my reading of the SR-IOV spec, this isn't how it's
>>> supposed to work.  The device is supposed to be a fully
>>> functional PCI device that on demand can start peeling
>>> off virtual functions; it's not supposed to boot up and
>>> initialise all its virtual functions at once.
>> 
>> The spec defines either we enable all VFs or Disable.
>> Per VF enabling is not supported. Is this what you
>> concern? 
> 
> I don't think that's true.  The spec requires you to
> enable all the 
> VFs from 0 to NumVFs, but NumVFs can be lower than
> TotalVFs.  At least, that's how I read it.

Yes, but setting NumVFs can only occur while the VFs are disabled.
The following is from the spec:

NumVFs may only be written while VF Enable is Clear. If NumVFs is
written when VF Enable is
Set, the results are undefined.
The initial value of NumVFs is undefined.

> 
> But no, that isn't my concern.  My concern is that you've
> written a driver here that seems to be a stub driver. 
> That doesn't seem to be 
> how SR-IOV is supposed to work; it's supposed to be a
> fully-functional driver that has SR-IOV knowledge added
> to it. 

Yes, it is a full-featured driver as long as the PF keeps resources for itself, for
example when not all queues are assigned to VFs.

Thx, eddie


RE: [PATCH 6/6 v3] PCI: document the change

2008-10-13 Thread Dong, Eddie
Matthew Wilcox wrote:
 
> Wouldn't it be more useful to have the iov/N directories
> be a symlink to the actual pci_dev used by the virtual
> function? 

The main concern here is that a VF may be disabled, such as when the PF enters
the D3 state or undergoes a reset and is thus unplugged, but the user won't
re-configure the VF when the PF returns to the working state.

> 
>> +For network device, there are:
>> +   - /sys/bus/pci/devices/BB:DD.F/iov/N/mac
>> +   - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan
>> +   (value update will notify PF driver)
> 
> We already have tools to set the MAC and VLAN parameters
> for network devices.

Do you mean ethtool? If yes, it is impossible for SR-IOV since the
configuration has to be done on the PF side, rather than the VF.

> 
> I'm not 100% convinced about this API.  The assumption
> here is that the driver will do it, but I think it should
> probably be in the core.  The driver probably wants to be

Our concern is that the PF driver may set a default state when it is
loaded so that SR-IOV can work without any user-level configuration, but
of course the driver won't dynamically change it.
Do you mean we should remove this ability?

> notified that the PCI core is going to create a virtual
> function, and would it please prepare to do so, but I'm
> not convinced this should be triggered by the driver. 
> How would the driver decide to create a new virtual
> function?  
> 
> 
> From my reading of the SR-IOV spec, this isn't how it's
> supposed to work.  The device is supposed to be a fully
> functional PCI device that on demand can start peeling
> off virtual functions; it's not supposed to boot up and
> initialise all its virtual functions at once. 

The spec defines that we either enable all VFs or disable them all; per-VF
enabling is not supported.
Is this what concerns you?

Thanks, eddie


RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Dong, Eddie
Avi Kivity wrote:
> Dong, Eddie wrote:
>>> I don't see how this relates to shared guest interrupts.
>>> Whatever you have on the host side, you still need to
>>> support shared guest interrupts.  The only way to avoid
>>> the issue is by using MSI for the guest, and even then
>>> we still have to support interrupt sharing since not
>>> all guests have MSI support. 
>>> 
>> 
>> Yes, but guest sharing is easy to solve by emulating
>> NAND gate of PCI line. 
>> 
> 
> Certainly, it isn't difficult.

BTW, did you have a look at the SR-IOV patch? It addresses both Xen & KVM.
Thx, eddie


RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Dong, Eddie

> I don't see how this relates to shared guest interrupts. 
> Whatever you have on the host side, you still need to
> support shared guest interrupts.  The only way to avoid
> the issue is by using MSI for the guest, and even then we
> still have to support interrupt sharing since not all
> guests have MSI support. 

Yes, but guest sharing is easy to solve by emulating the NAND gate of the PCI
line (a sketch follows).
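
A minimal sketch of one way to model a shared, level-triggered guest interrupt
line fed by several assigned devices (an illustration only: per-device
assertion state is tracked active-high, the guest line is asserted while any
source asserts it, and the active-low wire level is simply the complement):

#include <stdbool.h>

#define MAX_SHARED_SRCS 8

struct shared_intx {
        bool asserted[MAX_SHARED_SRCS];         /* per-device assertion state */
};

/* Logical level seen by the guest: asserted while any source asserts. */
static bool shared_intx_asserted(const struct shared_intx *line)
{
        bool level = false;
        int i;

        for (i = 0; i < MAX_SHARED_SRCS; i++)
                level |= line->asserted[i];
        return level;
}

/* The physical, active-low INTx# wire level is the complement. */
static bool shared_intx_wire_level(const struct shared_intx *line)
{
        return !shared_intx_asserted(line);
}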


RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Dong, Eddie
Tian, Kevin wrote:
>> From: Avi Kivity
>> Sent: September 27, 2008, 17:50
>> 
>> Yang, Sheng wrote:
>>> After check host shared interrupts situation, I got a
>>> question here: 
>>> 
>>> If I understand correctly, current solution don't block
>>> host shared irq, just come with the performance pentry.
>>> The penalty come with host disabled irq line for a
>>> period. We have to wait guest to write EOI. But I fail
>>> to see the correctness problem here (except a lot of
>>> spurious interrupt in the guest).  
>>> 
>>> I've checked mail, but can't find clue about that. Can
>>> you explain the situation? 
>>> 
>>> 
>> 
>> If the guest fails to disable interrupts on a device
>> that shares an interrupt line with the host, the host
>> will experience an interrupt flood.  Eventually the host
>> will disable the host device as well. 
>> 
> 
> This issue also exists on host side, that one misbehaved
> driver can hurt all other drivers sharing same irq line.
> But it seems no good way to avoid it. Since not all
> devices support MSI, we still need support irq sharing
> possibly with above caveats given. 
> 
> Existing approach at least works with a sane guest
> driver, with some performance penality there.
> 
> Or do you have better alternative?
> 
> Thanks,
> Kevin
>

MSI is always the first choice, including taking a host MSI for the guest IOAPIC
case, because we don't know if the guest OS has MSI support, but we are sure host
Linux does.

When MSI is impossible, I recommend we disable device assignment for devices
sharing an interrupt, or assign all devices with the same interrupt to the same
guest. Yes, the issue is the same on native, but on native the whole OS (kernel)
is in the same isolation domain, while now different guests are in different
isolation domains :(

In one word, MSI is pretty important for direct I/O, and SR-IOV is the #1 usage
in the future. Just advocating more and wishing more people can ack the SR-IOV
patch from ZhaoYU so that we can see 2.6.28 work for direct I/O without
sacrificing sharing :)

Eddie


RE: Remaining passthrough/VT-d tasks list

2008-09-24 Thread Dong, Eddie
Avi Kivity wrote:
> Han, Weidong wrote:
>> Hi all,
>> 
>> The initial passthrough/VT-d patches have been in kvm,
>> it's time to enhance it, and push them into 2.6.28.
>> 
>>  - Shared Interrupt support
>> 
> 
> Shared guest interrupts is a prerequisite for merging
> into mainline. Without this, device assignment is useless
> in anything but a benchmark scenario.  I won't push
> device assignment for 2.6.28 without it. 
> 
> Shared host interrupts are a different matter; which one
> did you mean? 
> 
Avi:
How about we think about it another way? The top usage model of the IOMMU is
SR-IOV in my mind, at least for the enterprise usage model. We are pushing
the SR-IOV patch for 2.6.28, and are continuously polishing the patch.
Even if it misses the 2.6.28 merge window (unlikely?), we should be able
to ask OSVs to take the SR-IOV patch separately before the code freeze, since
it is very small; it is harder to ask them to take the whole IOMMU patch set.

On the Xen side, the IOMMU is there and MSI-X is there, so the SR-IOV patch is
the only piece missing to enable SR-IOV. On the KVM side, very likely we can
get the MSI patch done soon, before the Chinese holiday, and we of course will
spend tons of effort on quality too. Should we target this? If yes, we
put in the MSI patch and push for 2.6.28 as the first priority. We would then see
the next major release of VMMs using KVM have HW I/O virtualization
technology: close-to-native performance, no sacrifice of I/O sharing,
minimal CPU utilization, etc.
For legacy PCI pass-through support, we can continue to improve
it too.
Thanks, eddie


RE: Event channels in KVM?

2008-09-22 Thread Dong, Eddie
Matt Anger wrote:
> I was referring to the bounce from host kernel to qemu
> and then back 
> to the host kernel for my BE driver.
> Xen:
> guest -> guest kernel driver-> host kernel driver
> 
> For both situations I need a FE and BE driver, but for
> KVM I need to modify QEMU and teach it how to pass the
> virtio calls to my Host 
> driver. In Xen BE driver and Host driver are the same and
> I don't have 
> to recompile any part of Xen.
> 
You can implement a (host) kernel-level device model too, which will have the
same result as Xen, i.e. skip Qemu.
I don't think Xen & KVM have a fundamental difference here, just a different
implementation style. Xen implements the BE driver in the driver domain kernel,
while KVM likes to implement it in a user-level host process. Different people
have different favorites; I am neutral actually :)

Thx, eddie


RE: Event channels in KVM?

2008-09-22 Thread Dong, Eddie
Matt Anger wrote:
> Thanks for the info, I've been looking into it by trying
> to look 
> around kvm source code.
> Apparently I have to write a kernel driver for the guest
> os and then 
> also write backend driver and modify qemu to use it? Is
> that correct? 
> That seems ugly, especially since now my io goes
> guest->guest kernel driver->host kernel kvm->qemu->host
> kernel driver 
> With Xen event-channels my driver gets the event
> notification directly 
> with no user space intervention removing the middle 2
> steps and I 

Why would virtio need to start from guest user space while an event channel would not?
It is exactly the same: an event channel can start from user space or kernel
space, and the same holds for virtio.
The only difference is that guest<->host communication needs to go through Qemu
(a host process) in the model you describe, while Xen handles event channels in
the hypervisor. But this can also be enhanced, IMO, by employing a kernel-level
device model that does the inter-communication for the CPU if performance is
critical, though most probably it is not.

> don't have to touch any of Xen's codebase. My driver just
> registers 
> for an event-channel.

But you still need the counterpart side to handle it.
I think you may want to reuse the Xen BE-side code, which is a wrong
assumption.

> 
> Thanks
> -Matt
> 
> On Fri, Sep 19, 2008 at 12:14 PM, Anthony Liguori
> <[EMAIL PROTECTED]> wrote: 
>> Javier Guerra wrote:
>>> 
>>> On Fri, Sep 19, 2008 at 1:31 PM, Anthony Liguori
>>> <[EMAIL PROTECTED]> wrote: 
>>> 
 
 Matt Anger wrote:
 
> 
> Does KVM have any interface similar to event-channels
> like Xen does? Basically a way to send notifications
> between the host and guest. 
> 
> 
 
 virtio is the abstraction we use.
 
 But virtio is based on the standard hardware
 interfaces of the PC--PIO, MMIO, and interrupts. 
 
>>> 
>>> this is rather low-level, it would be nice to have a
>>> multiplatform interface to this abstraction.
>>> 
>> 
>> That's exactly the purpose of virtio.  virtio is a
>> high-level, cross platform interface.  It's been tested
>> on x86, PPC, s390, and I believe ia64.  It also works in
>> lguest.  
>> 
>> It happens to use PIO, MMIO, and interrupts on x86 under
>> KVM but other virtio implementations exist for other
>> platforms. 
>> 
>>> just for kicks, i've found and printed Rusty's paper
>>> about it. hope it's current :-) 
>>> 
>> 
>> The other good thing to look at is the lguest
>> documentation.  You can skip to just the virtio bits if
>> you're so inclined.  It's really quite thoroughly
>> documented.  
>> 
>> Regards,
>> 
>> Anthony Liguori



RE: KVM IO performance

2008-07-14 Thread Dong, Eddie
Sukanto Ghosh wrote:
> Hi all,
> 
> Why has the IO device emulation part been kept in
> userspace ? 
> IO attempts cause VM-exits to the KVM (running in
> kernel-mode) it then forwards these requests to the
> userspace (mode-switch). After completion of IO in
> userspace, another mode switch is done to KVM before
> resuming the guests. 
> What would be the problems had the device emulation parts
> been moved totally to kernelspace ?
> 
The simple reason is that people don't want to pull a lot of code into the
kernel whose failure could crash the whole system; of course there are
many other reasons, such as maintaining synchronization with Qemu
upstream, coding style, maintenance effort, etc.
Eddie


Vringfd

2008-06-04 Thread Dong, Eddie
Rusty/Anthony/Avi and all:
I read Rusty's April 18 post of the vringfd patch. Though I didn't find
explicit code for how get_user_skb_frags is called, it seems clear that the
user only needs to trigger a "flush_tx" command; all the rest happens
automatically, fetching the ring and sending to the bridge, etc., which is
perfect from a performance point of view.
Now I am working on enabling VMDq, which provides a HW pre-sorting
mechanism based on MAC address. It can definitely take advantage of vringfd,
with the additional advantage of zero-copy on the receive side, which makes
VMDq-enabled virtio-net fly with zero-copy for both transmit & receive. The
only way a VMDq agent (a vringfd-like component in the kernel) may need to
bother user level is for initialization and notification, given that vringfd
already solved the ring operations; since the KVM virtual interrupt module
is on the kernel side, it is pretty easy to free the user from receiving
packets at run time.
Any comments?
Thanks, eddie