RE: [Qemu-devel] live migration vs device assignment (motivation)
> > Even if the device driver doesn't support migration, you still want to
> > migrate the VM? That may be risky and we should add the "bad path" for
> > the driver at least.
>
> At a minimum we should have support for hot-plug if we are expecting to
> support migration. You would simply have to hot-plug the device before you
> start migration and then return it after. That is how the current bonding
> approach for this works if I am not mistaken.

Hotplug is good to eliminate the device-specific state clone, but the bonding approach is very network-specific; it doesn't work for other devices such as FPGA devices, QAT devices & GPU devices, which we plan to support gradually :)

> The advantage we are looking to gain is to avoid removing/disabling the
> device for as long as possible. Ideally we want to keep the device active
> through the warm-up period, but if the guest doesn't do that we should still
> be able to fall back on the older approaches if needed.
RE: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu wrote:
> > On 2015年11月25日 13:30, Alexander Duyck wrote:
> >> No, what I am getting at is that you can't go around and modify the
> >> configuration space for every possible device out there. This
> >> solution won't scale.
> >
> > PCI config space regs are emulated by Qemu and so we can find the
> > free PCI config space regs for the faked PCI capability. Its position
> > need not be permanent.
>
> Yes, but do you really want to edit every driver on every OS that you plan to
> support this on. What about things like direct assignment of regular Ethernet
> ports? What you really need is a solution that will work generically on any
> existing piece of hardware out there.

The fundamental assumption of this patch series is to modify the driver in the guest to self-emulate or track the device state, so that migration may be possible. I don't think we can modify the OS without modifying the drivers, even using the PCIe hotplug mechanism. Meanwhile, modifying the Windows OS is a big challenge, given that only Microsoft can do so. Modifying the driver, on the other hand, is relatively simple and manageable for device vendors, if the device vendor wants to support state-clone based migration.

Thx Eddie
RE: [Qemu-devel] [RFC] COLO HA Project proposal
> > Thanks Dave:
> > Whatever randomness in values/branches/code paths the PVM and SVM
> > may have, it is only a performance issue. COLO never assumes the PVM
> > and SVM have the same internal machine state. From a correctness
> > p.o.v., as long as the PVM and SVM generate identical responses, we
> > can view the SVM as a valid replica of the PVM, and the SVM can take
> > over when the PVM suffers from a hardware failure. We can view the
> > client as all the way talking with the SVM, without the notion of the
> > PVM. Of course, if the SVM dies, we can regenerate a copy of the PVM
> > with a new checkpoint too.
> > The SOCC paper has the detailed recovery model :)
>
> I've had a read; I think the bit I was asking about was what you labelled 'D' in
> that paper's fig.4 - so I think that does explain it for me.

Very good :)

> But I also have some more questions:
>
> 1) 5.3.3 Web server
>    a) In fig 11 it shows Remus's performance dropping off with the number
>       of threads - why is that? Is it just an increase in the amount of
>       memory changes in each snapshot?

I didn't dig into the details of that; the paper documents the throughput we observed. I found it a bit strange too - the dirty-page set may be larger than in the small-connection case - but I am not sure, and that is the data we saw :(

>    b) Is fig 11/12 measured with all of the TCP optimisations shown in fig
>       13 on?

Yes.

> 2) Did you manage to overcome the issue shown in 5.6 with newer guest
>    kernels' degradation - could you just fall back to micro-checkpointing
>    if the guests diverge too quickly?

In general, I would say the COLO performance for these 2 workloads is pretty good, and I actually didn't list subsection 5.6 initially. It was the conference shepherd who asked me to add this paragraph to make the paper balanced :) In summary, COLO can have very good MP-guest performance compared with Remus, at the cost of potential optimization/modification effort to the guest TCP/IP stack.
One solution may not work for all workloads, but it provides a large room for OSVs to provide customized solutions for a specific usage -- which I think is very good for the open source biz model: make money through consulting. Huawei Technology Ltd. announced support for COLO in their cloud OS, probably for a specific usage too.

> 3) Was the link between the two servers for synchronisation a low-latency
>    dedicated connection?

We use a 10 Gbps NIC in the paper, and yes it is a dedicated link, but the solution itself doesn't require a dedicated link.

> 4) Did you try an ftp PUT benchmark using external storage - i.e. that
>    wouldn't have the local disc synchronisation overhead?

Not yet. External network shared storage works, but today the performance may not be that good, because our optimization so far is still very limited. It is just an initial effort to make the 2 common workloads happy. We believe there is large room ahead to make the response of the TCP/IP stack more predictable. Once the basic COLO stuff is ready for production and accepted by the industry, it is possible we may influence the TCP community to have this kind of predictability in mind for future protocol development, which will greatly help the performance.

Thx Eddie
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [Qemu-devel] [RFC] COLO HA Project proposal
> > Let me clarify on this issue. COLO doesn't ignore the TCP sequence
> > number, but uses a new implementation to make the sequence number
> > best-effort identical between the primary VM (PVM) and secondary VM
> > (SVM). Likewise, the VMM has to synchronize the emulation of the
> > random number generation mechanism between the PVM and SVM, like the
> > lock-stepping mechanism does.
> >
> > Furthermore, for long TCP connections, we can rely on the (on-demand)
> > VM checkpoint to get an identical sequence number in both the PVM and
> > SVM.
>
> That wasn't really my question; I was worrying about other forms of
> randomness, such as winners of lock contention, and other SMP
> non-determinisms, and I'm also worried by what proportion of time the
> system can't recover from a failure due to being unable to distinguish an
> SVM failure from a randomness issue.

Thanks Dave:
Whatever randomness in values/branches/code paths the PVM and SVM may have, it is only a performance issue. COLO never assumes the PVM and SVM have the same internal machine state. From a correctness p.o.v., as long as the PVM and SVM generate identical responses, we can view the SVM as a valid replica of the PVM, and the SVM can take over when the PVM suffers from a hardware failure. We can view the client as all the way talking with the SVM, without the notion of the PVM. Of course, if the SVM dies, we can regenerate a copy of the PVM with a new checkpoint too. The SOCC paper has the detailed recovery model :)

Thanks, Eddie
RE: [Qemu-devel] [RFC] COLO HA Project proposal
> > I didn't quite understand a couple of things though, perhaps you can
> > explain:
> > 1) If we ignore the TCP sequence number problem, in an SMP machine
> > don't we get other randomnesses - e.g. which core completes something
> > first, or who wins a lock contention, so the output stream might not
> > be identical - so do those normal bits of randomness cause the
> > machines to flag as out-of-sync?
>
> It's about the COLO agent. CCing Congyang, he can give the detailed
> explanation.

Let me clarify on this issue. COLO doesn't ignore the TCP sequence number, but uses a new implementation to make the sequence number best-effort identical between the primary VM (PVM) and secondary VM (SVM). Likewise, the VMM has to synchronize the emulation of the random number generation mechanism between the PVM and SVM, like the lock-stepping mechanism does.

Furthermore, for long TCP connections, we can rely on the (on-demand) VM checkpoint to get an identical sequence number in both the PVM and SVM.

Thanks, Eddie
RE: [KVM timekeeping 26/35] Catchup slower TSC to guest rate
Zachary:
Will you extend the logic to cover the situation where the host TSC runs faster than the guest rate but the PCPU is over-committed? In that case, likely we can use the time spent while the VCPU is scheduled out to catch up as well. Of course, if the VCPU's scheduled-out time is not enough to compensate for the cycles caused by the fast host TSC (exceeding a threshold), we will eventually have to fall back to trap-and-emulate mode.

Thx, Eddie

-----Original Message-----
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Zachary Amsden
Sent: 2010年8月20日 16:08
To: kvm@vger.kernel.org
Cc: Zachary Amsden; Avi Kivity; Marcelo Tosatti; Glauber Costa; Thomas Gleixner; John Stultz; linux-ker...@vger.kernel.org
Subject: [KVM timekeeping 26/35] Catchup slower TSC to guest rate

Use the catchup code to continue adjusting the TSC when running at lower than the guest rate

Signed-off-by: Zachary Amsden
---
 arch/x86/kvm/x86.c | 9 +++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4215d7..086d56a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1013,8 +1013,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
 	}
 	local_irq_restore(flags);
-	if (catchup)
+	if (catchup) {
+		if (this_tsc_khz < v->kvm->arch.virtual_tsc_khz)
+			vcpu->tsc_rebase = 1;
 		return 0;
+	}

 	/*
 	 * Time as measured by the TSC may go backwards when resetting the base
@@ -5022,6 +5025,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

 	kvm_guest_exit();

+	/* Running on slower TSC without kvmclock, we must bump TSC */
+	if (vcpu->arch.tsc_rebase)
+		kvm_request_clock_update(vcpu);
+
 	preempt_enable();

 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
--
1.7.1
RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
Arnd Bergmann wrote:
> On Friday 30 July 2010 17:51:52 Shirley Ma wrote:
>> On Fri, 2010-07-30 at 16:53 +0800, Xin, Xiaohui wrote:
>>>> Since vhost-net already supports macvtap/tun backends, do you think
>>>> whether it's better to implement zero copy in macvtap/tun than
>>>> inducing a new media passthrough device here?
>>>
>>> I'm not sure if there will be more duplicated code in the kernel.
>>
>> I think it should be less duplicated code in the kernel if we use
>> macvtap to support what the media passthrough driver does here. Since
>> macvtap has supported the virtio_net header and offloading already,
>> the only missing func is zero copy. Also QEMU supports macvtap, we
>> just need to add a zero copy flag in the option.
>
> Yes, I fully agree and that was one of the intended directions for
> macvtap to start with. Thank you so much for following up on that,
> I've long been planning to work on macvtap zero-copy myself but it's
> now lower on my priorities, so it's good to hear that you made
> progress on it, even if there are still performance issues.

But zero-copy is a generic Linux feature that can be used by other VMMs as well, if the BE service drivers want to incorporate it. If we can make the mp device VMM-agnostic (it may not be so yet in the current patch), that will help Linux more.

Thx, Eddie
RE: [PATCH 0/24] Nested VMX, v5
Nadav Har'El wrote:
> Hi Avi,
>
> This is a followup of our nested VMX patches that Orit Wasserman
> posted in December. We've addressed most of the comments and concerns
> that you and others on the mailing list had with the previous patch
> set. We hope you'll find these patches easier to understand, and
> suitable for applying to KVM.
>
> The following 24 patches implement nested VMX support. The patches
> enable a guest to use the VMX APIs in order to run its own nested
> guests. I.e., it allows running hypervisors (that use VMX) under KVM.
> We describe the theory behind this work, our implementation, and its
> performance characteristics, in IBM Research report H-0282, "The
> Turtles Project: Design and Implementation of Nested Virtualization",
> available at:
>
> http://bit.ly/a0o9te
>
> The current patches support running Linux under a nested KVM using
> shadow page table (with bypass_guest_pf disabled). They support
> multiple nested hypervisors, which can run multiple guests. Only
> 64-bit nested hypervisors are supported. SMP is supported. Additional
> patches for running Windows under nested KVM, and Linux under nested
> VMware server, and support for nested EPT, are currently running in
> the lab, and will be sent as follow-on patchsets.

Nadav & All:
Thanks for the posting; in general the patches are well written. I like the concept of VMCSxy and I feel it is pretty clear (better than my previous naming as well), but there is some confusion inside, especially around the term "shadow", which I find quite hard to follow. Comments from me:

1: Basically there are 2 different types of VMCS. One is defined by hardware, whose layout is unknown to the VMM. The other is defined by the VMM (this patch) and used for vmcs12. The former uses "struct vmcs" to describe its data instances, but the latter doesn't have a clear definition (or is it struct vmcs12?).
I suggest we have a distinct struct for this, for example "struct sw_vmcs" (software vmcs), or "struct vvmcs" (virtual vmcs).

2: vmcsxy (vmcs12, vmcs02, vmcs01) are instances of either "struct vmcs" or "struct sw_vmcs", but not structs themselves. A clear distinction between data structure and instance helps, IMO.

3: We may use a prefix or suffix in addition to vmcsxy to explicitly state the format of that instance. For example, vmcs02 in the current patch is for hardware use, hence it is an instance of "struct vmcs", but vmcs01 is an instance of "struct sw_vmcs". A prefix or postfix helps to make this easier to understand.

4: Rename l2_vmcs to vmcs02, l1_shadow_vmcs to vmcs01, l1_vmcs to vmcs02; together with the prefix/postfix this can strengthen the above concept of vmcsxy.

5: Guest VMPTRLD emulation. The current patch creates a vmcs02 instance each time the guest does VMPTRLD, and frees the instance at VMCLEAR. The code may fail if the number of (un-VMCLEARed) VMCSs exceeds a certain threshold, to avoid denial of service. That is fine, but it brings additional complexity and may pay with a lot of memory. I think we can emulate using the concept of a "cached vmcs" here, in case the L1 VMM doesn't VMCLEAR in time: the L0 VMM can simply flush those vmcs02s to guest memory, i.e. vmcs12, per need. For example, if the number of cached vmcs02s exceeds 10, we can flush automatically.

Thx, Eddie
RE: [PATCH 8/24] Hold a vmcs02 for each vmcs12
> +/* Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> + * does not already exist. The allocation is done in L0 memory, so to avoid
> + * denial-of-service attack by guests, we limit the number of concurrently-
> + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> + * trigger this limit.
> + */
> +static const int NESTED_MAX_VMCS = 256;
> +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_list *new_l2_guest;
> +	struct vmcs *l2_vmcs;
> +
> +	if (nested_get_current_vmcs(vcpu))
> +		return 0; /* nothing to do - we already have a VMCS */
> +
> +	if (to_vmx(vcpu)->nested.l2_vmcs_num >= NESTED_MAX_VMCS)
> +		return -ENOMEM;
> +
> +	new_l2_guest = (struct vmcs_list *)
> +		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
> +	if (!new_l2_guest)
> +		return -ENOMEM;
> +
> +	l2_vmcs = alloc_vmcs();

I didn't see where this is used. Hints on the usage?

> +	if (!l2_vmcs) {
> +		kfree(new_l2_guest);
> +		return -ENOMEM;
> +	}
> +
> +	new_l2_guest->vmcs_addr = to_vmx(vcpu)->nested.current_vmptr;
> +	new_l2_guest->l2_vmcs = l2_vmcs;
> +	list_add(&(new_l2_guest->list), &(to_vmx(vcpu)->nested.l2_vmcs_list));
> +	to_vmx(vcpu)->nested.l2_vmcs_num++;
> +	return 0;
> +}
> +
RE: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12
Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to
> actually run the L2 guest, vmcs02. prepare_vmcs02 appropriately
> merges the information in shadow_vmcs that L1 built for L2 (vmcs12),
> and that in the VMCS that we built for L1 (vmcs01).
>
> VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS),
> which makes it difficult for us to read from vmcs01 while writing to
> vmcs12. This is why we first make a copy of vmcs01 in memory
> (l1_shadow_vmcs) and then read that memory copy while writing to
> vmcs12.
>
> Signed-off-by: Nadav Har'El
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.0 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.0 +0300
> @@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
>  	return flexpriority_enabled;
>  }
>
> +static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
> +{
> +	return cpu_has_vmx_tpr_shadow() &&
> +		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +		CPU_BASED_TPR_SHADOW;
> +}
> +
> +static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
> +{
> +	return cpu_has_secondary_exec_ctrls() &&
> +		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> +}
> +
> +static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
> +	*vcpu)
> +{
> +	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> +		(get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> +		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
> +}
> +
> +static inline bool nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
> +{
> +	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> +		(get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> +		SECONDARY_EXEC_ENABLE_EPT);
> +}
> +
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>  	int i;
> @@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
>  	preempt_enable();
>  }
>
> +int load_vmcs_host_state(struct shadow_vmcs *src)
> +{
> +	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
> +	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
> +	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
> +	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
> +	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
> +	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
> +	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
> +
> +	vmcs_write64(TSC_OFFSET, src->tsc_offset);
> +
> +	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
> +		vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
> +
> +	vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
> +
> +	vmcs_writel(HOST_CR0, src->host_cr0);
> +	vmcs_writel(HOST_CR3, src->host_cr3);
> +	vmcs_writel(HOST_CR4, src->host_cr4);
> +	vmcs_writel(HOST_FS_BASE, src->host_fs_base);
> +	vmcs_writel(HOST_GS_BASE, src->host_gs_base);
> +	vmcs_writel(HOST_TR_BASE, src->host_tr_base);
> +	vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
> +	vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
> +	vmcs_writel(HOST_RSP, src->host_rsp);
> +	vmcs_writel(HOST_RIP, src->host_rip);
> +	vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
> +	vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
> +
> +	return 0;
> +}
> +
> /*
>  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>  * vcpu mutex is already taken.
> @@ -1922,6 +1985,71 @@ static void vmclear_local_vcpus(void)
>  	__vcpu_clear(vmx);
>  }
>
> +int load_vmcs_common(struct shadow_vmcs *src)
> +{
> +	vmcs_write16(GUEST_ES_SELECTOR, src->guest_es_selector);
> +	vmcs_write16(GUEST_CS_SELECTOR, src->guest_cs_selector);
> +	vmcs_write16(GUEST_SS_SELECTOR, src->guest_ss_selector);
> +	vmcs_write16(GUEST_DS_SELECTOR, src->guest_ds_selector);
> +	vmcs_write16(GUEST_FS_SELECTOR, src->guest_fs_selector);
> +	vmcs_write16(GUEST_GS_SELECTOR, src->guest_gs_selector);
> +	vmcs_write16(GUEST_LDTR_SELECTOR, src->guest_ldtr_selector);
> +	vmcs_write16(GUEST_TR_SELECTOR, src->guest_tr_selector);
> +
> +	vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);
> +
> +	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +		vmcs_write64(GUEST_IA32_PAT, src->guest_ia32_pat);
> +
> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, src->vm_entry_intr_info_field);
> +	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +		src->vm_entry_exception_error_code);
> +	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, src->vm_entry_instruction_len);
> +
> +	vmcs_write32(GUEST_ES_LIMIT, src->guest_es_limit);
> +	vmcs_write32(GUEST_CS_LIMIT, src->guest_cs_limit);
> +	vmcs_write32
RE: [PATCH 10/24] Implement VMPTRLD
Nadav Har'El wrote:
> This patch implements the VMPTRLD instruction.
>
> Signed-off-by: Nadav Har'El
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.0 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.0 +0300
> @@ -3829,6 +3829,26 @@ static int read_guest_vmcs_gpa(struct kv
>  	return 0;
>  }
>
> +static void set_rflags_to_vmx_fail_invalid(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags |= X86_EFLAGS_CF;
> +	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_ZF &
> +		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
> +static void set_rflags_to_vmx_fail_valid(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags |= X86_EFLAGS_ZF;
> +	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_CF &
> +		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
>  static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long rflags;
> @@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
>  	return 1;
>  }
>
> +static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
> +{
> +	bool ret;
> +	struct vmcs12 *vmcs12;
> +	struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
> +	if (vmcs_page == NULL)
> +		return 0;
> +	vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
> +	if (vmcs12->revision_id == VMCS12_REVISION)
> +		ret = 1;
> +	else {
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		ret = 0;
> +	}
> +	kunmap_atomic(vmcs12, KM_USER0);
> +	kvm_release_page_dirty(vmcs_page);
> +	return ret;
> +}
> +
> +/* Emulate the VMPTRLD instruction */
> +static int handle_vmptrld(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
> +		set_rflags_to_vmx_fail_invalid(vcpu);
> +		return 1;
> +	}
> +
> +	if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
> +		return 1;
> +
> +	if (vmx->nested.current_vmptr != guest_vmcs_addr) {
> +		vmx->nested.current_vmptr = guest_vmcs_addr;
> +
> +		if (nested_create_current_vmcs(vcpu)) {
> +			printk(KERN_ERR "%s error could not allocate memory",
> +				__func__);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	clear_rflags_cf_zf(vcpu);
> +	skip_emulated_instruction(vcpu);

What about the "launch" status? Should we get that status from vmcs1x to distinguish a guest VMLAUNCH from a VMRESUME?

> +	return 1;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4153,7 +4224,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_VMCALL]		= handle_vmcall,
>  	[EXIT_REASON_VMCLEAR]		= handle_vmclear,
>  	[EXIT_REASON_VMLAUNCH]		= handle_vmx_insn,
> -	[EXIT_REASON_VMPTRLD]		= handle_vmx_insn,
> +	[EXIT_REASON_VMPTRLD]		= handle_vmptrld,
>  	[EXIT_REASON_VMPTRST]		= handle_vmx_insn,
>  	[EXIT_REASON_VMREAD]		= handle_vmx_insn,
>  	[EXIT_REASON_VMRESUME]		= handle_vmx_insn,
RE: [PATCH 9/24] Implement VMCLEAR
Nadav Har'El wrote:
> This patch implements the VMCLEAR instruction.
>
> Signed-off-by: Nadav Har'El
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.0 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.0 +0300
> @@ -138,6 +138,8 @@ struct __attribute__ ((__packed__)) vmcs
>  	 */
>  	u32 revision_id;
>  	u32 abort;
> +
> +	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
>  };
>
>  struct vmcs_list {
> @@ -3827,6 +3829,46 @@ static int read_guest_vmcs_gpa(struct kv
>  	return 0;
>  }
>
> +static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
> +/* Emulate the VMCLEAR instruction */
> +static int handle_vmclear(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr, save_current_vmptr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
> +		return 1;
> +

The SDM specifies an alignment check, a range check and a reserved-bit check, and may generate VMfail(VMCLEAR with invalid physical address), as well as an "addr != VMXON pointer" check. Missed?

> +	save_current_vmptr = vmx->nested.current_vmptr;
> +
> +	vmx->nested.current_vmptr = guest_vmcs_addr;
> +	if (!nested_map_current(vcpu))
> +		return 1;
> +	vmx->nested.current_l2_page->launch_state = 0;
> +	nested_unmap_current(vcpu);
> +
> +	nested_free_current_vmcs(vcpu);
> +
> +	if (save_current_vmptr == guest_vmcs_addr)
> +		vmx->nested.current_vmptr = -1ull;
> +	else
> +		vmx->nested.current_vmptr = save_current_vmptr;
> +
> +	skip_emulated_instruction(vcpu);
> +	clear_rflags_cf_zf(vcpu);

The SDM has a formal definition of VMsucceed. Clearing CF/ZF only is not sufficient, as SDM 2B 5.2 mentions. Any special concern here? BTW, should we define formal VMfail() & VMsucceed() APIs for easier understanding and a direct mapping to the SDM?

> +	return 1;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4109,7 +4151,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_HLT]		= handle_halt,
>  	[EXIT_REASON_INVLPG]		= handle_invlpg,
>  	[EXIT_REASON_VMCALL]		= handle_vmcall,
> -	[EXIT_REASON_VMCLEAR]		= handle_vmx_insn,
> +	[EXIT_REASON_VMCLEAR]		= handle_vmclear,
>  	[EXIT_REASON_VMLAUNCH]		= handle_vmx_insn,
>  	[EXIT_REASON_VMPTRLD]		= handle_vmx_insn,
>  	[EXIT_REASON_VMPTRST]		= handle_vmx_insn,
RE: [PATCH v3] KVM: VMX: Execute WBINVD to keep data consistency with assigned devices
Avi Kivity wrote:
> On 06/28/2010 10:30 AM, Dong, Eddie wrote:
>>> Several milliseconds of non-responsiveness may not be acceptable for
>>> some applications. So I think queue_work_on() and a clflush loop is
>>> better than an IPI and wbinvd.
>>
>> Probably we should make it configurable. For RT usage models, we do
>> care about responsiveness more than performance, but for the typical
>> server usage model, we'd better focus on performance in this issue.
>> WBINVD may perform much, much better than CLFLUSH, and a malicious
>> guest repeatedly issuing wbinvd may greatly impact the system
>> performance.
>
> I'm not even sure clflush can work. I thought you could loop on just
> the cache size, but it appears you'll need to loop over the entire
> guest address space, which could take ages.

If the RT usage model comes into reality, we may have to do it this way, though we would pay with huge overhead :) Are there any RT customers here?

> So I guess we'll have to settle for wbinvd, just avoiding it when the
> hardware allows us to.

Yes, for now I agree we can just use wbinvd to emulate wbinvd :)

Thx, Eddie
RE: [PATCH v3] KVM: VMX: Execute WBINVD to keep data consistency with assigned devices
Avi Kivity wrote:
> On 06/28/2010 09:42 AM, Sheng Yang wrote:
>>>> +static void wbinvd_ipi(void *garbage)
>>>> +{
>>>> +	wbinvd();
>>>> +}
>>>
>>> Like Jan mentioned, this is quite heavy. What about a clflush()
>>> loop instead? That may take more time, but at least it's
>>> preemptible. Of course, it isn't preemptible in an IPI.
>>
>> I think this kind of behavior happens rarely, and most recent
>> processors should have WBINVD exit which means it's an IPI... So I
>> think it's maybe acceptable here.
>
> Several milliseconds of non-responsiveness may not be acceptable for
> some applications. So I think queue_work_on() and a clflush loop is
> better than an IPI and wbinvd.

Probably we should make it configurable. For RT usage models, we do care about responsiveness more than performance, but for the typical server usage model, we'd better focus on performance in this issue. WBINVD may perform much, much better than CLFLUSH, and a malicious guest repeatedly issuing wbinvd may greatly impact the system performance.

thx, eddie
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
>> I mean once the frontend side driver posts the buffers to the backend
>> driver, the backend driver will "immediately" use those buffers to
>> compose skbs or gro_frags and post them to the assigned host NIC
>> driver as receive buffers. In that case, if the backend driver
>> receives a packet from the NIC that requires a copy, it may be
>> unable to find an additional free guest buffer because all of them
>> are already used by the NIC driver. We have to reserve some guest
>> buffers for the possible copy even if the buffer address is not
>> identified by the original skb :(
>
> OK I see what you mean. Can you tell me how does Xiaohui's
> previous patch-set deal with this problem?
>
> Thanks,

In the current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete queue pair) uses a buffer from the guest, so it eliminates copy completely in software and requires hardware to do so. If we can have an additional place to store the buffer per skb (which may cause a copy later on), we can do the copy later on or re-post the buffer to the assigned NIC driver later on. But that may not be very clean either :(

BTW, some hardware may require a certain level of packet copy, such as for broadcast packets in very old VMDq devices, which is not addressed in Xiaohui's previous patch yet. We may address this by implementing an additional virtqueue between guest and host for the slow path (broadcast packets only here), with additional complexity in the FE/BE drivers.

Thx, Eddie
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:
>>
>> Xiaohui & Herbert:
>> Mixing copying of the head & 0-copy of the bulk data imposes an
>> additional challenge in finding the guest buffer. The backend driver
>> may be unable to find a spare guest buffer from the virtqueue at that
>> time, which may block the receiving process then. Can't we completely
>> eliminate netdev_alloc_skb here? Assigning the guest buffer at this
>> time makes life much easier.
>
> I'm not sure I understand your concern. If you mean that when
> the guest doesn't give enough pages to the host and the host
> can't receive on behalf of the guest then isn't that already
> the case with the original patch-set?

I mean once the frontend side driver posts the buffers to the backend driver, the backend driver will "immediately" use those buffers to compose skbs or gro_frags and post them to the assigned host NIC driver as receive buffers. In that case, if the backend driver receives a packet from the NIC that requires a copy, it may be unable to find an additional free guest buffer because all of them are already used by the NIC driver. We have to reserve some guest buffers for the possible copy even if the buffer address is not identified by the original skb :(

Thx, Eddie
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will
> allocate as usual; the data pointed to by skb->data will be copied into
> the first guest buffer. That means we should reserve sufficient room in
> the guest buffer. For a PS-mode supported driver (for example ixgbe),
> the room will be more than 128 bytes. After 128 bytes, we will put the
> first frag data. Look into virtio-net.c at the functions page_to_skb()
> and receive_mergeable(); that means we should modify the guest
> virtio-net driver to compute the offset as the parameter for
> skb_set_frag().
>
> What do you think about this? Attached is a patch showing how to modify
> the guest driver. I reserve 512 bytes as an example, and transfer the
> header len of the skb in hdr->hdr_len.

Xiaohui & Herbert:
Mixing copying of the head & 0-copy of the bulk data imposes an additional challenge in finding the guest buffer. The backend driver may be unable to find a spare guest buffer from the virtqueue at that time, which may block the receiving process then. Can't we completely eliminate netdev_alloc_skb here? Assigning the guest buffer at this time makes life much easier.

Thx, Eddie
RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
>> A VF interrupt usually happens at 4-8 kHz. How about virtio?
>> I assume virtio will be widely used together w/ legacy guests in
>> INTx mode.
>
> True, but in time it will be replaced by MSI.
>
> Note without vhost virtio is also in userspace, so there are lots of
> exits anyway for the status register.

A few months ago, we noticed that the interrupt frequency of PV I/O in a previous solution was almost the same as the physical NIC interrupt, which ticks at ~4 kHz. Each PV I/O frontend driver (or its interrupt source) has a similar interrupt frequency, which means N× more interrupts. I guess virtio is in a similar situation.

We then did an optimization for PV I/O to mitigate interrupts to the guest by setting an interrupt throttle on the backend side, because a native NIC also works that way -- the so-called ITR register in Intel NICs. We saw 30-90% CPU utilization savings, depending on how many frontend driver interrupts were employed. Not sure if it is adopted on the vhost side. One drawback of course is latency, but it is mostly tolerable if the rate is reduced to ~1 kHz.

Thx, Eddie
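The backend-side throttling Eddie describes (modeled on the ITR register in Intel NICs) amounts to suppressing guest notifications until a minimum interval has elapsed. A rough sketch of that idea, with hypothetical names and a caller-supplied clock — this is an illustration of the rate-limiting logic, not the actual backend code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state: deliver at most one guest interrupt
 * per `interval_ns`, coalescing anything that arrives in between. */
struct irq_throttle {
	uint64_t interval_ns;	/* minimum gap between injections */
	uint64_t last_ns;	/* time of last injection */
	bool pending;		/* work arrived while throttled */
};

/* Called for each backend event; returns true if the guest should be
 * interrupted now, false if the event is coalesced. */
static bool throttle_event(struct irq_throttle *t, uint64_t now_ns)
{
	if (now_ns - t->last_ns >= t->interval_ns) {
		t->last_ns = now_ns;
		t->pending = false;
		return true;	/* inject immediately */
	}
	t->pending = true;	/* coalesce; a timer would flush this later */
	return false;
}
```

With an interval of 1 ms this caps the guest at ~1 kHz, matching the figure in the mail. A real implementation also needs a timer to flush `pending` so a lone packet is not delayed indefinitely — that is the latency cost mentioned above.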
RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
Avi Kivity wrote:
> On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>>
>> Besides VF IO interrupt and timer interrupt introduced performance
>> overhead risk,
>
> VF usually uses MSI

Typo, I meant PV IO. A VF interrupt usually happens at 4-8 kHz. How about virtio? I assume virtio will be widely used together w/ legacy guests in INTx mode.

>
>> EOI message deliver from lapic to ioapic,
>
> Only for non-MSI
>
>> which becomes in user land now, may have potential scalability
>> issue. For example, if we have a 64 VCPU guest, if each vcpu has
>> 1khz interrupt (or ipi), the EOI from guest will normally have to
>> involve ioapic module for clearance in 64khz which may have long
>> lock contention.
>
> No, EOI for IPI or for local APIC timer does not involve the IOAPIC.
>
>> you may reduce the involvement of ioapic eoi by tracking ioapic
>> pin <-> vector map in kernel, but not sure if it is clean enough.
>
> It's sufficient to look at TMR, no? For edge triggered I don't think
> we need the EOI.

Mmm, I noticed a difference in statements between the new SDM and the old SDM. The old SDM says an IPI can have both edge and level trigger modes, but the new SDM says only INIT has both choices. Given that the new SDM eliminates the level trigger mode, it is OK.

>
> But, the amount of churn and risk worries me, so I don't think the
> move is worthwhile.

This also reminds me of the debate at an early stage of KVM, when Gregory Haskins was working on an arbitrary choice of irqchip. The patch even at that time (without SMP) was very complicated. I agree that from an ROI point of view, this movement may not be worthwhile.

thx, eddie
RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
Avi Kivity wrote:
> I am currently investigating a problem with a guest running Linux
> malfunctioning in the NMI watchdog code. The problem is that we don't
> handle NMI delivery mode for the local APIC LINT0 pin; instead we
> expect ExtInt delivery mode or that the line is disabled completely.
> In addition the i8254 timer is tied to the BSP, while in this case the
> timer can broadcast to all vcpus.
>
> There is some code that tries to second-guess the guest and provide it
> the inputs it sees, but this is fragile. The only way to get reliable
> operation is to emulate the hardware fully.
>
> Now I'd much rather do that in userspace, since it's a lot of
> sensitive work. I'll enumerate below the general motivation,
> advantages and disadvantages, and a plan for moving forward.
>
> Motivation
> ==========
>
> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices. Both devices are
> high interaction since they deal with interrupts; practically after
> every interrupt there is either a PIC ioport write, or an APIC bus
> message, both signalling an EOI operation. Moving the PIT into the
> kernel allowed us to catch up with missed timer interrupt injections,
> and sped up guests which read the PIT counters (e.g. tickless
> guests).
>
> However, modern guests running on modern qemu use MSI extensively;
> both virtio and assigned devices now have MSI support; and the
> planned VFIO only supports kernel delivery via MSI anyway; line based
> interrupts will need to be mediated by userspace.
>
> The only high frequency non-MSI interrupt sources remaining are the
> various timers; and the default one, HPET, is in userspace (and having
> its own scaling problems as a result). So in theory we can move PIC,
> IOAPIC, and PIT support to userspace and not lose much performance.
>
> Moving the implementation to userspace allows us more flexibility, and
> more consistency in the implementation of timekeeping for the various
> clock chips; it becomes easier to follow the nuances of real hardware
> in this area.
>
> Interestingly, while the IOAPIC/PIC code was written we proposed
> making it independent of the local APIC; had we done so, the move
> would have been much easier (simply dropping the existing code).
>
> Advantages of a move
> ====================
>
> 1. Reduced kernel footprint
>
> Good for security, and allows fixing bugs without reboots.
>
> 2. Centralized timekeeping
>
> Instead of having one solution for PIT timekeeping, and another for
> RTC and HPET timekeeping, we can have all timer chips in userspace.
> The local APIC timer still needs to be in the kernel - it is much too
> high bandwidth to be in userspace; but on the other hand it is very
> different from the other timer chips.
>
> 3. Flexibility
>
> Easier to have weird board layouts (multiple IOAPICs, etc.). Not a
> very strong advantage.
>
> Disadvantages
> =============
>
> 1. Still need to keep the old code around for a long while
>
> We can't just rip it out - old userspace depends on it. So the
> security advantages are only with cooperating userspace, and the
> other advantages only show up with updated userspace.
>
> 2. Need to bring the qemu code up to date
>
> The current qemu ioapic code lags some way behind the kernel; also
> need PIT timekeeping.
>
> 3. May need kernel support for interval-timer-follows-thread
>
> Currently the timekeeping code has an optimization which causes the
> hrtimer that models the PIT to follow the BSP (which is most likely to
> receive the interrupt); this reduces cpu cross-talk.
>
> I don't think the kernel interval timer code has such an optimization;
> we may need to implement it.
>
> 4. Much churn
>
> This is a lot of work.
>
> 5. Risk
>
> We may find out after all this is implemented that performance is not
> acceptable and all the work will have to be dropped.
>
Besides the performance overhead risk introduced by VF IO interrupts and timer interrupts, EOI message delivery from the lapic to the ioapic, which is now in userland, may have a potential scalability issue. For example, if we have a 64 VCPU guest and each vcpu has a 1 kHz interrupt (or IPI) rate, the EOIs from the guest will normally have to involve the ioapic module for clearance at 64 kHz, which may have long lock contention. You may reduce the involvement of the ioapic in EOI by tracking the ioapic pin <-> vector map in the kernel, but I am not sure if it is clean enough.
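The pin <-> vector map Eddie suggests would let the EOI path skip the ioapic (and its lock) whenever the vector being acknowledged is not routed through it — which is the common case for IPIs and local APIC timer interrupts. A minimal sketch of that cache, with hypothetical names (the real structure would live alongside the kernel's irq routing state and be rebuilt when the guest reprograms a redirection entry):

```c
#include <assert.h>
#include <stdbool.h>

#define IOAPIC_NUM_PINS	24
#define NR_VECTORS	256

/* Hypothetical cached map from vector to ioapic pin; -1 means the
 * vector is not routed through the ioapic at all. */
static int vector_to_pin[NR_VECTORS];

static void map_init(void)
{
	for (int v = 0; v < NR_VECTORS; v++)
		vector_to_pin[v] = -1;
}

/* Record that a redirection entry routes `pin` to `vector`. */
static void map_set(int pin, int vector)
{
	if (pin >= 0 && pin < IOAPIC_NUM_PINS &&
	    vector >= 0 && vector < NR_VECTORS)
		vector_to_pin[vector] = pin;
}

/* On guest EOI: only call into the ioapic (taking its lock) when the
 * vector is actually owned by an ioapic pin. EOIs for IPI and local
 * APIC timer vectors skip the ioapic entirely, avoiding the 64 kHz
 * contention described above. */
static bool eoi_needs_ioapic(int vector)
{
	return vector >= 0 && vector < NR_VECTORS &&
	       vector_to_pin[vector] >= 0;
}
```

This matches Avi's TMR observation in spirit: most EOIs need no ioapic involvement, so the lookup filters them out before any shared state is touched.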
RE: Nested virtualization without nested vmx
Olivier Berghmans wrote:
> Hi,
>
> Thanks for the reply. I have an Intel Q9550 processor. EPT was
> introduced in the Nehalem series, which my processor is not part of.
> Based on that, I conclude that I am not using EPT and thus using
> shadow page tables.
>
> The problem of running (for example) a VirtualBox guest inside a KVM
> guest is a problem in the use of the shadow page tables? Is there a
> way I can test or confirm this so that I can know what exactly is the
> problem with the shadow page tables? Using EPT (in a Nehalem
> processor) the nesting might work without the nested vmx support in
> KVM?

Based on the experience on the Xen side, I suspect the issue you saw is on the shadow page table side. With EPT or NPT, the complicated shadow page table is removed, so you may avoid those issues and see something different.

Eddie
RE: Nested virtualization without nested vmx
The goal of HVM virtualization is to present to the guest an environment identical to the native platform, in both KVM guests and Xen HVM guests; however, for various reasons, this is not strictly followed in today's virtualization solutions. VMMs normally take shortcuts to make commodity OSes work, for performance etc. That brings trouble to the nested VirtualBox mentioned in your case.

3 years ago, Intel tried to run VirtualPC inside a Xen HVM guest. Disheng Su fixed a number of bugs in Xen and eventually successfully launched the nested VirtualPC. You may still find the patches and comments in xenbits (from Disheng Su, for nested VirtualPC). The main issue at that time was the shadow page table, which optimizes MMU virtualization but does not exactly follow architectural equivalence, for performance reasons. Are you running on shadow page tables or EPT?

IBM and Intel both developed virtual VMX support for nested virtualization. You may leverage that.

Thx, Eddie

Olivier Berghmans wrote:
> Hi all,
>
> I'm currently researching the possibility of nested virtualization.
> The goal is to test which hypervisor can be nested inside kvm.
>
> I have already read much about nested virtualization on AMD machines
> (on the mailing list and Avi's blog) with the nested svm patch, but I
> did not find much information about this subject on Intel machines.
> What I did find is that the nested vmx support is not yet finished (or
> maybe almost?). So I understand that testing whether kvm runs inside
> kvm on Intel will not work until this patch is included in kvm.
>
> However, the goal of the patch is to insert the virtualization
> extension into the guest (in the virtualized processor), right?
> Hypervisors that use software virtualization do not need these
> extensions, so is it correct to state that these hypervisors could
> already be nested inside kvm (e.g. running a VirtualBox or VMware
> guest inside a kvm guest)? When I tried this with VMware, the KVM
> guest (L1) crashed and rebooted.
> The test with VirtualBox displayed
> the message "Kernel panic - not syncing: Attempted to kill init!".
> Nesting a Xen guest inside a KVM guest also failed in that the Xen
> guest tries to boot, crashes and tries to reboot, etc.
>
> Is there something that I'm overlooking or something that I can try in
> order to get it to work? Or what could be the problem that these
> hypervisors cannot be nested?
>
> Some technical information:
>
> host$ cat /proc/cpuinfo
> model name: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
>
> host$ dmesg | grep kvm
> [ 18.480937] loaded kvm module (kvm-kmod-devel-88)
> [ 4924.495094] kvm: emulating exchange as write
>
> host$ uname -a
> Linux 2.6.28-16-generic #57-Ubuntu SMP Wed Nov 11 09:47:24 UTC 2009
> i686 GNU/Linux
>
> Kind regards,
> Olivier
RE: KVM and Qemu Synchronization
Saksena, Abhishek wrote:
> Correction to question:-
>
> Can keeping device models and the KVM pit synchronized with the host
> clock result in any issues with booting up OSes?

Time virtualization, especially in an SMP guest for a legacy OS, is tricky; it can hardly be truly synced. But Linux has a PV timer for KVM & Xen; you can use the PV timer for the SMP case. Windows relies more on the TSC, which is relatively easy to sync.

Thx, eddie
RE: EOI acceleration for high bandwidth IO
Avi Kivity wrote:
> On 07/06/2009 04:42 PM, Dong, Eddie wrote:
>> EOI is one of the key VM exits at high-bandwidth IO such as VT-d
>> with a 10Gb/s NIC. This patch accelerates guest EOI emulation
>> utilizing HW VM exit information.
>
> Won't this fail if the guest uses STOSD to issue the EOI?

Good catch; should we use an exclusion list for the opcode? Or use a decode cache for hot IPs (RO in EPT for the gip)?

We noticed a huge amount of vEOI activity with a 10Gb/s NIC, ~70 kHz of EOIs. With SR-IOV, it could go up much more, even to the million level. Decode and emulation cost ~7K cycles, while the short path may only spend 3-4K cycles.

Eddie
EOI acceleration for high bandwidth IO
EOI is one of the key VM exits at high-bandwidth IO such as VT-d with a 10Gb/s NIC. This patch accelerates guest EOI emulation utilizing HW VM exit information.

Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ccafe0d..b63138f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -875,6 +875,15 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8)
 		     | (apic_get_reg(apic, APIC_TASKPRI) & 4));
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_set_eoi(apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 40010b0..3a7a29a 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -27,6 +27,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3a75db3..6eea29d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3125,6 +3125,12 @@ static int handle_apic_access(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 	offset = exit_qualification & 0xffful;
 
+	if (((exit_qualification >> 12) & 0xf) == 1 &&
+	    offset == APIC_EOI) {	/* EOI write */
+		kvm_lapic_set_eoi(vcpu);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
 	er = emulate_instruction(vcpu, kvm_run, 0, 0, 0);
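The patch hinges on decoding the VMX APIC-access exit qualification: bits 11:0 give the page offset of the access and bits 15:12 the access type, where type 1 is a linear write (per the Intel SDM), and the EOI register lives at offset 0xB0 of the APIC page. A standalone sketch of that decode follows; the helper name is illustrative. Note that the type-field extraction must be fully parenthesized before the comparison, since `==` binds tighter than `&` in C:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APIC_EOI		0xB0	/* EOI register offset in the APIC page */
#define APIC_ACCESS_WRITE	1	/* exit qualification access type: linear write */

/* Illustrative decode of the VMX APIC-access exit qualification:
 * bits 11:0 hold the offset into the APIC page, bits 15:12 the access
 * type. Returns true when the exit is a write to the EOI register —
 * the case the patch short-circuits past the full emulator. */
static bool is_eoi_write(uint64_t exit_qualification)
{
	uint64_t offset = exit_qualification & 0xffful;
	uint64_t type = (exit_qualification >> 12) & 0xf;

	return type == APIC_ACCESS_WRITE && offset == APIC_EOI;
}
```

When this predicate holds, the handler can acknowledge the EOI directly and skip the instruction, avoiding the decode-and-emulate path entirely; every other APIC access still falls through to the emulator.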
RE: [PATCH 0/1] x2apic implementation for kvm
Avi Kivity wrote:
> Sheng Yang wrote:
>> I think that means the PV interface for the lapic. And yes, we can
>> support it following MS's interface, but x2apic still seems another
>> story as you noted... I still don't think supporting x2apic here would
>> bring us more benefits.
>
> x2apic has the following benefits:
>
> - msr exits are faster than mmio (no page table walk, emulation)
> - no need to read back ICR to look at the busy bit
> - one ICR write instead of two
> - potential to support large guests once we add interrupt remapping
> - shared code with the Hyper-V paravirt interface

Is there any plan to implement a PV irqchip, such as the Xen irqchip, for KVM?
RE: event injection MACROs
Gleb Natapov wrote:
> On Thu, May 14, 2009 at 10:34:11PM +0800, Dong, Eddie wrote:
>> Gleb Natapov wrote:
>>> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
>>>> Avi Kivity wrote:
>>>>> Dong, Eddie wrote:
>>>>>> OK.
>>>>>> Also back to Gleb's question, the reason I want to do that is to
>>>>>> simplify the event generation mechanism in current KVM.
>>>>>>
>>>>>> Today KVM uses an additional layer of exception/nmi/interrupt state
>>>>>> such as vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>>>>>> vcpu->arch.nmi_injected. All those additional layers are due to
>>>>>> competition for the VM_ENTRY_INTR_INFO_FIELD write to inject the
>>>>>> event. Both SVM & VMX have only one resource to inject the virtual
>>>>>> event, but KVM generates 3 categories of events in parallel, which
>>>>>> further requires additional logic to dictate among them.
>>>>>
>>>>> I thought of using a queue to hold all pending events (in a common
>>>>> format), sort it by priority, and inject the head.
>>>>
>>>> The SDM Table 5-4 requires merging 2 events together, i.e.
>>>> converting to #DF/triple fault, or injecting serially when 2 events
>>>> happen, no matter NMI, IRQ or exception.
>>>>
>>>> If considering the above event merging activity, that is a single
>>>> element queue.
>>> I don't know how you got to this conclusion from your previous
>>> statement. See the explanation to table 5-2 for instance, where it is
>>> stated that an interrupt should be held pending if there is an
>>> exception with higher priority. Should be held pending where? In the
>>> queue, like we do. Note that low prio exceptions are just dropped
>>> since they will be regenerated.
>>
>> I have a different understanding here.
>> My understanding is that "held" means NO INTA in HW, i.e. the LAPIC
>> still holds this IRQ.
>>
> And what if INTA already happened and the CPU is ready to fetch the IDT
> for the interrupt vector, and at this very moment the CPU faults?

If INTA happens, that means it is delivered.
If its delivery triggers another exception, that is what Table 5-4 handles. My understanding is that it is a 2-stage process: Table 5-2 talks about events happening before delivery, so that HW needs to prioritize them. Once a decision is made, the highest one is delivered, but then it could trigger another exception when fetching the IDT etc.

The current exception.pending/interrupt.pending/nmi_injected doesn't match either of the above; interrupt/nmi is only for failed event injection, plus a strange fixed priority check when it is really injected: exception > failed NMI > failed IRQ > new NMI > new IRQ. Table 5-2 looks missed in current KVM, IMO, except for a wrong (but minor) exception > NMI > IRQ sequence.

>
>>>
>>>> We could have either: 1) A pure SW "queue" that will be flushed to
>>>> the HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use of
>>>> the HW register.
>>>>
>>> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
>>> have a queue of three elements sorted by priority. On each entry we
>>> should
>>
>> Table 5-4 already says NMI/IRQ is BENIGN.
> Table 5-2 applies here, not table 5-4, I think.
>
>>
>>> inject an event with highest priority. And remove it from queue on
>>> exit.
>>
>> The problem is that we have to decide to inject only one of the above
>> 3, and discard the rest. Whether to prioritize them or merge (to one
>> event as in Table 5-4) is another story.
> Only a small number of events are merged into #DF. Most are handled
> serially (the SDM does not define what serially means, unfortunately),
> so I don't understand where "discard the rest" comes from.

We can have vmx_complete_interrupts clear all of them at the next exit. Even from a HW point of view, if there are pending NMI/IRQ/exception, the CPU picks the highest one, NMI, ignores/discards the IRQ (but the LAPIC still holds the IRQ, thus it can be re-injected), and completely discards the exception. I don't say discarding has any problem, but it's unnecessary to keep all 3.
The only difference is when to discard the other 2: at queue_exception/irq/nmi time, or later on (even at the next exit time), which is the same to me.

> discard exception since it will be regenerated anyway, but IRQ and
> NMI is another story. SDM says that IRQ should be held pending (once
> again not much explanation here), nothing about NMI.
>
>>>
RE: event injection MACROs
Gleb Natapov wrote:
> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
>> Avi Kivity wrote:
>>> Dong, Eddie wrote:
>>>> OK.
>>>> Also back to Gleb's question, the reason I want to do that is to
>>>> simplify the event generation mechanism in current KVM.
>>>>
>>>> Today KVM uses an additional layer of exception/nmi/interrupt state
>>>> such as vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>>>> vcpu->arch.nmi_injected. All those additional layers are due to
>>>> competition for the VM_ENTRY_INTR_INFO_FIELD write to inject the
>>>> event. Both SVM & VMX have only one resource to inject the virtual
>>>> event, but KVM generates 3 categories of events in parallel, which
>>>> further requires additional logic to dictate among them.
>>>
>>> I thought of using a queue to hold all pending events (in a common
>>> format), sort it by priority, and inject the head.
>>
>> The SDM Table 5-4 requires merging 2 events together, i.e. converting
>> to #DF/triple fault, or injecting serially when 2 events happen, no
>> matter NMI, IRQ or exception.
>>
>> If considering the above event merging activity, that is a single
>> element queue.
> I don't know how you got to this conclusion from your previous
> statement.
> See the explanation to table 5-2 for instance, where it is stated that
> an interrupt should be held pending if there is an exception with
> higher priority. Should be held pending where? In the queue, like we
> do. Note that low prio exceptions are just dropped since they will be
> regenerated.

I have a different understanding here. My understanding is that "held" means NO INTA in HW, i.e. the LAPIC still holds this IRQ.

>
>> We could have either: 1) A pure SW "queue" that will be flushed to the
>> HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use of the HW
>> register.
>>
> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
> have a queue of three elements sorted by priority. On each entry we
> should

Table 5-4 already says NMI/IRQ is BENIGN.
> inject an event with highest priority. And remove it from queue on
> exit.

The problem is that we have to decide to inject only one of the above 3, and discard the rest. Whether to prioritize them or merge (to one event as in Table 5-4) is another story.

>
>> A potential benefit is that it can avoid duplicated code and potential
>> bugs in the current code, as the following patch shows, if I
>> understand correctly:
>>
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>> 		cr2 = vmcs_readl(EXIT_QUALIFICATION);
>> 		KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
>> 			    (u32)((u64)cr2 >> 32), handler);
>> -		if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
>> +		if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
>> 			kvm_mmu_unprotect_page_virt(vcpu, cr2);
>> 		return kvm_mmu_page_fault(vcpu, cr2, error_code);
>> 	}
> This fix is already in Avi's tree (not yet pushed).
>
>> Either way is OK and up to you. BTW Xen uses the HW register directly
>> to represent a pending event.
>>
> In this particular case I don't mind using the HW register either, but
> I don't see any advantage.
>
>>>
>>>> One example is that exceptions have higher priority
>>>> than NMI/IRQ injection in the current code, which is not true in
>>>> reality.
>>>>
>>>
>>> I don't think it matters in practice, since the guest will see it
>>> as a timing issue. NMIs and IRQs are asynchronous (even those
>>> generated by the guest through the local APIC).
>>
>> Yes. But it also causes IRQ injection to be delayed, which may have
>> side effects. For example, if the guest exception handler is very
>> long, or if the guest VCPU falls into a recursive #GP. Within the
>> current logic, a guest IRQ event from KDB (IPI) running on VCPU0, as
>> an example, can't force the dead-looping VCPU1 into KDB, since it is
>> recursively taking #GP.
> If one #GP causes another #GP this is a #DF.
> If CPU has a chance to

I mean another #GP at the next instruction, i.e. the beginning of the #GP handler in the guest. No #DF here.

> executes
> something in between KVM will have a chance to inject NMI.

Could have no chance in so
RE: event injection MACROs
Avi Kivity wrote:
> Dong, Eddie wrote:
>> OK.
>> Also back to Gleb's question, the reason I want to do that is to
>> simplify the event generation mechanism in current KVM.
>>
>> Today KVM uses an additional layer of exception/nmi/interrupt state
>> such as vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>> vcpu->arch.nmi_injected.
>> All those additional layers are due to competition for the
>> VM_ENTRY_INTR_INFO_FIELD write to inject the event. Both SVM & VMX
>> have only one resource to inject the virtual event, but KVM generates
>> 3 categories of events in parallel, which further requires additional
>> logic to dictate among them.
>
> I thought of using a queue to hold all pending events (in a common
> format), sort it by priority, and inject the head.

The SDM Table 5-4 requires merging 2 events together, i.e. converting to #DF/triple fault, or injecting serially when 2 events happen, no matter NMI, IRQ or exception. If considering the above event merging activity, that is a single element queue.

We could have either: 1) A pure SW "queue" that will be flushed to the HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use of the HW register.
A potential benefit is that it can avoid duplicated code and potential bugs in the current code, as the following patch shows, if I understand correctly:

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		cr2 = vmcs_readl(EXIT_QUALIFICATION);
 		KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
 			    (u32)((u64)cr2 >> 32), handler);
-		if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
+		if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
 			kvm_mmu_unprotect_page_virt(vcpu, cr2);
 		return kvm_mmu_page_fault(vcpu, cr2, error_code);
 	}

If using the above merged SW "queue" or the HW register directly, we can do the following:

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		cr2 = vmcs_readl(EXIT_QUALIFICATION);
 		KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
 			    (u32)((u64)cr2 >> 32), handler);
-		if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
+		if (vmcs_read(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK)
 			kvm_mmu_unprotect_page_virt(vcpu, cr2);
 		return kvm_mmu_page_fault(vcpu, cr2, error_code);
 	}

Either way is OK and up to you. BTW Xen uses the HW register directly to represent a pending event.

>
>> One example is that exceptions have higher priority
>> than NMI/IRQ injection in the current code, which is not true in
>> reality.
>>
>
> I don't think it matters in practice, since the guest will see it as a
> timing issue. NMIs and IRQs are asynchronous (even those generated by
> the guest through the local APIC).

Yes. But it also causes IRQ injection to be delayed, which may have side effects. For example, if the guest exception handler is very long, or if the guest VCPU falls into a recursive #GP. Within the current logic, a guest IRQ event from KDB (IPI) running on VCPU0, as an example, can't force the dead-looping VCPU1 into KDB, since it is recursively taking #GP.
>
>> Another issue is that a failed event from a previous injection, say an
>> IRQ or NMI, may be discarded if a virtual exception happens in the
>> EXIT handling now. With the patch for generic double fault handling,
>> this case should be handled normally.
>>
>
> Discarding an exception is usually okay as it will be regenerated. I
> don't think we discard interrupts or NMIs.

In reality (running an OS in the guest), it doesn't happen so far. But architecturally, it could. For example, KVM injects an IRQ, but VM resume gets a #PF and comes back to KVM with IDT_VECTORING valid. Then KVM will put the failed IRQ back in the interrupt queue. But if the #PF handling generates another exception, then the interrupt queue won't be able to be injected, since KVM injects the exception first. And the interrupt queue is discarded at the next VM exit.

Overall, I think this is mostly a simplification but may benefit the future a lot. Especially with Gleb's recent cleanup, it sounds much easier to do than before. I may be making mistakes here; would like to see more comments.

thx, eddie
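The merging rule being debated (SDM Table 5-4: two contributory exceptions combine into #DF, a page fault followed by a non-benign exception also gives #DF, anything else is handled serially) can be sketched as a small classifier. This is an illustrative model of the rule, not the KVM code itself; the class assignments follow the SDM's benign/contributory split:

```c
#include <assert.h>

/* Vector numbers for the exceptions we classify (Intel SDM). */
#define DE_VECTOR	0	/* divide error: contributory */
#define DB_VECTOR	1	/* debug: benign */
#define DF_VECTOR	8	/* double fault */
#define TS_VECTOR	10
#define NP_VECTOR	11
#define SS_VECTOR	12
#define GP_VECTOR	13	/* contributory */
#define PF_VECTOR	14	/* page fault: its own class */

enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGE_FAULT };

static enum exc_class classify(int nr)
{
	switch (nr) {
	case DE_VECTOR: case TS_VECTOR: case NP_VECTOR:
	case SS_VECTOR: case GP_VECTOR:
		return EXC_CONTRIBUTORY;
	case PF_VECTOR:
		return EXC_PAGE_FAULT;
	default:
		return EXC_BENIGN;
	}
}

/* Given a pending exception `prev` and a new one `nr`, return the
 * vector that should actually be delivered per the Table 5-4 rules:
 * contributory + contributory, or #PF + (contributory or #PF), merge
 * into #DF; every other pair is handled serially, so the new event
 * simply replaces the old one (which will be regenerated). */
static int merge_exceptions(int prev, int nr)
{
	enum exc_class c1 = classify(prev), c2 = classify(nr);

	if ((c1 == EXC_CONTRIBUTORY && c2 == EXC_CONTRIBUTORY) ||
	    (c1 == EXC_PAGE_FAULT && c2 != EXC_BENIGN))
		return DF_VECTOR;
	return nr;	/* handled serially */
}
```

This is the same class1/class2 shape used by the double fault patch quoted later in the thread: only the #DF cases need special handling, and the serial case falls through to plain re-queueing of the new event.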
RE: event injection MACROs
Avi Kivity wrote:
> Dong, Eddie wrote:
>> I noticed the MACROs for SVM vmcb->control.event_inj and VMX
>> VM_EXIT_INTR_INFO are almost the same. I have a need to query the
>> event injection situation in common code, so I plan to expose this
>> register read/write to x86.c. Should we define a new format for
>> evtinj/VM_EXIT_INTR_INFO as a common KVM format, or just move those
>> original MACROs to kvm_host.h?
>>
>
> This is dangerous if additional bits or field values are defined by
> either architecture. Better to use accessors.

OK. Also back to Gleb's question, the reason I want to do that is to simplify the event generation mechanism in current KVM.

Today KVM uses an additional layer of exception/nmi/interrupt state such as vcpu.arch.exception.pending, vcpu->arch.interrupt.pending & vcpu->arch.nmi_injected. All those additional layers are due to competition for the VM_ENTRY_INTR_INFO_FIELD write to inject the event. Both SVM & VMX have only one resource to inject the virtual event, but KVM generates 3 categories of events in parallel, which further requires additional logic to dictate among them. One example is that exceptions have higher priority than NMI/IRQ injection in the current code, which is not true in reality.

Another issue is that a failed event from a previous injection, say an IRQ or NMI, may be discarded if a virtual exception happens in the EXIT handling now. With the patch for generic double fault handling, this case should be handled normally.

Will post an RFC soon.

Thx, eddie
RE: Implement generic double fault generation mechanism
> That is OK. You can send two patches. The first one will WARN_ON and
> overwrite the exception like the current code does. And the second one
> will remove the WARN_ON, explaining that this case is actually possible
> to trigger from a guest.

Since it sounds like you would rather not provide this additional one yourself, here it is, for the purpose of removing the blocking issue. My basic position is still the same as in my previous mail, but I am neutral either way.

Thx, eddie

Overwriting the former event may help forward progress in case multiple exceptions/interrupts happen serially.

Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0e75a2..b3de5d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -183,11 +183,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 	int class1, class2;
 
 	if (!vcpu->arch.exception.pending) {
-		vcpu->arch.exception.pending = true;
-		vcpu->arch.exception.has_error_code = has_error;
-		vcpu->arch.exception.nr = nr;
-		vcpu->arch.exception.error_code = error_code;
-		return;
+		goto out;
 	}
 
 	/* to check exception */
@@ -208,9 +204,15 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 		vcpu->arch.exception.has_error_code = true;
 		vcpu->arch.exception.nr = DF_VECTOR;
 		vcpu->arch.exception.error_code = 0;
+		return;
 	} else
 		printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
 			prev_nr, nr);
+out:
+	vcpu->arch.exception.pending = true;
+	vcpu->arch.exception.has_error_code = has_error;
+	vcpu->arch.exception.nr = nr;
+	vcpu->arch.exception.error_code = error_code;
 }
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
RE: Enable IRQ windows after exception injection if there are pending virq
Gleb Natapov wrote:
> On Tue, May 12, 2009 at 11:06:39PM +0800, Dong, Eddie wrote:
>> I didn't run many tests since our PTS system stopped working due to
>> KVM userspace build changes. But since the logic is pretty simple, I
>> want to post it here to see comments. Thx, eddie
>>
>> If there is a pending IRQ after a virtual exception is injected,
>> KVM needs to enable the IRQ window to trap back earlier once
>> the exception is handled.
>
> I already posted a patch to do that
> http://patchwork.kernel.org/patch/21830/ Is your patch different?

Is it based on the idea I mentioned to you in the private mail (April 27), or a novel one?
Enable IRQ windows after exception injection if there are pending virq
I didn't run many tests since our PTS system stopped working due to KVM userspace build changes. But since the logic is pretty simple, I want to post it here to see comments. Thx, eddie

If there is a pending IRQ after a virtual exception is injected, KVM needs to enable the IRQ window to trap back earlier once the exception is handled.

Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 308d8e9..f8ceaea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3154,15 +3154,18 @@ static void inject_irq(struct kvm_vcpu *vcpu)
 	}
 }
 
-static void inject_pending_irq(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static void inject_pending_irq(struct kvm_vcpu *vcpu)
 {
-	bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
-		kvm_run->request_interrupt_window;
-
 	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
 		kvm_x86_ops->drop_interrupt_shadow(vcpu);
 
 	inject_irq(vcpu);
+}
+
+static void set_pending_virq(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+	bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
+		kvm_run->request_interrupt_window;
 
 	/* enable NMI/IRQ window open exits if needed */
 	if (vcpu->arch.nmi_pending)
@@ -3229,7 +3232,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (vcpu->arch.exception.pending)
 		__queue_exception(vcpu);
 	else
-		inject_pending_irq(vcpu, kvm_run);
+		inject_pending_irq(vcpu);
+
+	set_pending_virq(vcpu, kvm_run);
 
 	if (kvm_lapic_enabled(vcpu)) {
 		if (!vcpu->arch.apic->vapic_addr)
event injection MACROs
I noticed the MACROs for SVM vmcb->control.event_inj and VMX VM_EXIT_INTR_INFO are almost the same. I have a need to query the event injection situation in common code, so I plan to expose this register read/write to x86.c. Should we define a new format for evtinj/VM_EXIT_INTR_INFO as a common KVM format, or just move those original MACROs to kvm_host.h?

Thx, eddie
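For reference, the layout the two formats share can be sketched as below. This is a sketch from my reading of the SDM/APM, not copied from the KVM headers, so treat the masks and type values as assumptions to verify: bits 7:0 vector, bits 10:8 event type, bit 11 deliver-error-code, bit 31 valid.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed common layout of SVM EVENTINJ / VMX VM_EXIT_INTR_INFO
 * (verify against the SDM/APM before relying on it). */
#define EVT_VECTOR_MASK		0xffu
#define EVT_TYPE_SHIFT		8
#define EVT_TYPE_MASK		(7u << EVT_TYPE_SHIFT)
#define EVT_ERRCODE_VALID	(1u << 11)
#define EVT_VALID		(1u << 31)

#define EVT_TYPE_EXT_INTR	0u	/* external interrupt */
#define EVT_TYPE_NMI		2u
#define EVT_TYPE_EXCEPTION	3u	/* hardware exception */

/* Pack an event into the assumed common format. */
static inline uint32_t evt_pack(uint32_t vector, uint32_t type, int has_err)
{
	return EVT_VALID | (type << EVT_TYPE_SHIFT) |
	       (vector & EVT_VECTOR_MASK) |
	       (has_err ? EVT_ERRCODE_VALID : 0);
}

static inline uint32_t evt_vector(uint32_t info)
{
	return info & EVT_VECTOR_MASK;
}
```

With a common format like this, x86.c could query what is pending without caring whether the backend is VMX or SVM, which is what the suggested accessors would wrap.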
RE: Implement generic double fault generation mechanism
Gleb Natapov wrote:
> On Mon, May 11, 2009 at 09:04:52AM +0800, Dong, Eddie wrote:
>>
>>> There is no point referring to current code. Current code does not
>>> handle serial exceptions properly. So fix it in your patch,
>>> otherwise I propose to use my patch that fixes current code
>>> (http://patchwork.kernel.org/patch/21829/).
>>
>> I would like Avi to decide. As comments on the difference between the
>> 2 patches, my understanding is that I am addressing the problem based
>> on SDM 5-4, with the answer to serial injection being first in, first
>> served. Your patch doesn't solve the generic double fault case, for
>> example exception 11 on 11, or GP on GP, which needs to be converted
>> to #DF per the SDM; rather you only handle the case where the
>> secondary exception is a PF, by servicing the PF.
>>
> There is nothing to decide really. I prefer your patch with serial
> exception handling fixed. If you'll not do it I'll do it.

OK, an additional patch will be constructive, but my position is neutral. The reasons (mentioned) are:

1: Current KVM just does WARN_ON for those cases (and they are never hit), so this patch won't introduce additional issues. Either printk or WARN_ON, to notify us in case we meet the problem in the future, is the safer way for me.

2: In case a real "serial exception" happens, from an architectural point of view, I think we'd better consult Table 5-2 to prioritize them, which is neither preserving the former exception nor overwriting it. But as you mentioned, the list is not complete. My point is that this is another complicated scenario that we should spend time on in the future, but it is not related to the current patch.

3: This function will soon need to be extended to cover the IRQ case too, which needs to push back the overwritten IRQ. We need a total solution for this, so I prefer to do that some time later.

4: I prefer to split issues if possible.

>
>> I can check with our internal architects to see what "handle
>> exceptions serially" means in reality.
>> For me, serial means first in,
>> first out, and thus we should retain the 1st exception.
>>
> There is a Table 5-2 that defines an order between some events. The
> table is not complete; I don't see #DE there, for instance. But
> consider this case: #DE (or #NP) happens while the exception stack is
> paged out, so #PF happens next. #PF is handled by a TSS gate, so it
> uses its own stack and fixes the exception stack in its handler. If we
> drop the #PF because #DE is already waiting, we will keep trying to
> inject #DE indefinitely. The result is a hanging QEMU process eating
> 100% cpu time. If we replace #DE with #PF, on the other hand, the #PF
> handler will fix the exception stack, the instruction that caused #DE
> will be re-executed, and #DE will be regenerated and handled properly.
> So which scenario do you prefer?

See above.

Thx, eddie
RE: Luvalley-2 has been released: running KVM below any operating system
Xiaodong Yi wrote:
> It is not a typo. I copied from the UnixBench output directly. However,
> it must be a bug of Luvalley, because even the native Linux benchmark
> on Double-Precision Whetstone is not that high. I also noticed that
> other benchmarks are all lower than native Linux.
>
> About timing, Luvalley does nothing more than KVM, except that Luvalley
> implemented the VLAPIC timer using the TSC while KVM uses the services
> of the Linux kernel. The other timers of both Luvalley and KVM, I
> think, are all implemented in Qemu.
>
> Moreover, I could not explain why Luvalley's benchmarks on process
> creation, execl throughput, file copy and shell script are 20% ~ 40%
> higher than KVM, while other benchmarks on pipe throughput, context
> switching and syscall overhead are almost the same as KVM.

A typical issue in VMM benchmarking with an OS benchmark such as the one you used is time inaccuracy. Benchmarks using guest time won't get the right time or the right duration, due to the scheduler etc., and thus VMM benchmarks use network time for measuring, as vConsolidate does. Spec.org is defining their benchmark for VMMs, and I believe they will use network time too.

For simplicity, you may continue to use an OS benchmark to measure a VMM, but then you need to calibrate the guest time accuracy first, such as by using a stopwatch. In both Xen & KVM we benchmark using OS benchmarks too, but usually only to see the improvement of a performance patch. Formal benchmark data needs to consult a wall clock or stopwatch.

Thx, eddie
RE: Implement generic double fault generation mechanism
> There is no point referring to current code. Current code does not
> handle serial exceptions properly. So fix it in your patch, otherwise I
> propose to use my patch that fixes current code
> (http://patchwork.kernel.org/patch/21829/).

I would like Avi to decide. As comments on the difference between the 2 patches, my understanding is that I am addressing the problem based on SDM 5-4, with the answer to serial injection being first in, first served. Your patch doesn't solve the generic double fault case, for example exception 11 on 11, or GP on GP, which needs to be converted to #DF per the SDM; rather you only handle the case where the secondary exception is a PF, by servicing the PF.

I can check with our internal architects to see what "handle exceptions serially" means in reality. For me, serial means first in, first out, and thus we should retain the 1st exception.

Eddie.
RE: Implement generic double fault generation mechanism
Gleb Natapov wrote:
> On Fri, May 08, 2009 at 06:46:14PM +0800, Dong, Eddie wrote:
>> Dong, Eddie wrote:
>>>>> instruction will be re-executed.
>>>>>
>>>>> Do you want it to be covered for now? For exceptions it is easy,
>>>>> but for IRQs it needs to be pushed back.
>>>>>
>>>> Yes, I want it to be covered now, otherwise any serial exception
>>>> generates a flood of "Exception happens serially" messages. This
>>>> function does not handle IRQs, so no problem there.
>>>
>>> But we will soon let this function cover IRQs as well per the SDM.
>>> Why not do that a little bit later?
>>>
>>> BTW, this issue exists in the original code as well.
>>>
>>> Eddie
>>
>> Actually this is already addressed in the current patch too: just
>> keep the former exception. If you mean the printk should be removed,
>> I am fine.
> Keeping the former exception is not the right thing to do. It can't be
> delivered, because delivering it causes another exception, and the
> handler that may fix the situation is not called, since you drop the
> last exception and keep re-injecting the one that can't be handled.
>
>> BTW, this case doesn't happen in reality.
>>
> Then why do you write all this code then? :) I can easily write a test

I am fixing the potential #DF bug in the current code, which only handles PF on PF. Those sequential exceptions only hit a WARN_ON in the current code.

> case that will do that (actually I did), and if not handled properly it
> just loops taking 100% cpu, trying to reinject an exception that cannot
> be handled.

Are you sure the current code dead-loops in the WARN_ON with your test code? I don't see that it can ever happen, which is why it is only a printk; it shouldn't exist in any guest that KVM currently supports. See the original kvm_queue_exception in case you missed the code.

void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
	WARN_ON(vcpu->arch.exception.pending);
	vcpu->arch.exception.pending = true;
	vcpu->arch.exception.has_error_code = false;
	vcpu->arch.exception.nr = nr;
}

Any comments from Avi?
Thx, eddie
RE: Implement generic double fault generation mechanism
Dong, Eddie wrote:
>>>> instruction will be re-executed.
>>>>
>>>> Do you want it to be covered for now? For exceptions it is easy,
>>>> but for IRQs it needs to be pushed back.
>>>>
>>> Yes, I want it to be covered now, otherwise any serial exception
>>> generates a flood of "Exception happens serially" messages. This
>>> function does not handle IRQs, so no problem there.
>>
>> But we will soon let this function cover IRQs as well per the SDM.
>> Why not do that a little bit later?
>>
>> BTW, this issue exists in the original code as well.
>>
>> Eddie

Actually this is already addressed in the current patch too: just keep the former exception. If you mean the printk should be removed, I am fine.

BTW, this case doesn't happen in reality.

Thx, eddie
RE: Implement generic double fault generation mechanism
>> instruction will be re-executed.
>>
>> Do you want it to be covered for now? For exceptions it is easy, but
>> for IRQs it needs to be pushed back.
>>
> Yes, I want it to be covered now, otherwise any serial exception
> generates a flood of "Exception happens serially" messages. This
> function does not handle IRQs, so no problem there.

But we will soon let this function cover IRQs as well per the SDM. Why not do that a little bit later?

BTW, this issue exists in the original code as well.

Eddie
RE: Implement generic double fault generation mechanism
Gleb Natapov wrote:
>> +
>> +static int exception_class(int vector)
>> +{
>> +	if (vector == 14)
>> +		return EXCPT_PF;
>> +	else if (vector == 0 || (vector >= 10 && vector <= 13))
>> +		return EXCPT_CONTRIBUTORY;
>> +	else
>> +		return EXCPT_BENIGN;
>> +}
>> +
> This makes double fault (8) a benign exception. Surely not what you
> want.

Double fault (8) falls into none of the above classes per the SDM. But it should be checked before generating the #DF fault. See the new update.

>> +	/* to check exception */
>> +	prev_nr = vcpu->arch.exception.nr;
>> +	class2 = exception_class(nr);
>> +	class1 = exception_class(prev_nr);
>> +	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
>> +		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
>> +		/* generate double fault per SDM Table 5-5 */
>> +		printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
>> +			prev_nr, nr);
>> +		vcpu->arch.exception.pending = true;
>> +		vcpu->arch.exception.has_error_code = 1;
>> +		vcpu->arch.exception.nr = DF_VECTOR;
>> +		vcpu->arch.exception.error_code = 0;
>> +		if (prev_nr == DF_VECTOR) {
>> +			/* triple fault -> shutdown */
>> +			set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
>> +		}
>> +	} else
>> +		printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
>> +			prev_nr, nr);
>> +}
> When two exceptions happen serially it is better to replace the pending
> exception with the new one. This way the first exception (that is lost)
> will be regenerated when the instruction is re-executed.

Do you want it to be covered for now? For exceptions it is easy, but for IRQs it needs to be pushed back.

Thx, eddie

Move the Double-Fault generation logic out of the page fault exception generating function to cover the more generic case.
Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab1fdac..d0e75a2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -162,12 +162,60 @@ void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
+#define EXCPT_BENIGN		0
+#define EXCPT_CONTRIBUTORY	1
+#define EXCPT_PF		2
+
+static int exception_class(int vector)
+{
+	if (vector == 14)
+		return EXCPT_PF;
+	else if (vector == 0 || (vector >= 10 && vector <= 13))
+		return EXCPT_CONTRIBUTORY;
+	else
+		return EXCPT_BENIGN;
+}
+
+static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
+		unsigned nr, bool has_error, u32 error_code)
+{
+	u32 prev_nr;
+	int class1, class2;
+
+	if (!vcpu->arch.exception.pending) {
+		vcpu->arch.exception.pending = true;
+		vcpu->arch.exception.has_error_code = has_error;
+		vcpu->arch.exception.nr = nr;
+		vcpu->arch.exception.error_code = error_code;
+		return;
+	}
+
+	/* to check exception */
+	prev_nr = vcpu->arch.exception.nr;
+	if (prev_nr == DF_VECTOR) {
+		/* triple fault -> shutdown */
+		set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+		return;
+	}
+	class1 = exception_class(prev_nr);
+	class2 = exception_class(nr);
+	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
+		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
+		/* generate double fault per SDM Table 5-5 */
+		printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
+			prev_nr, nr);
+		vcpu->arch.exception.pending = true;
+		vcpu->arch.exception.has_error_code = true;
+		vcpu->arch.exception.nr = DF_VECTOR;
+		vcpu->arch.exception.error_code = 0;
+	} else
+		printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
+			prev_nr, nr);
+}
+
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
 {
-	WARN_ON(vcpu->arch.exception.pending);
-	vcpu->arch.exception.pending = true;
-	vcpu->arch.exception.has_error_code = false;
-	vcpu->arch.exception.nr = nr;
+	kvm_multiple_exception(vcpu, nr, false, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception);
 
@@ -176,18 +224,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long addr,
 {
 	++vcpu->stat.pf_guest;
 
-	if (vcpu->arch.exception.pending) {
-		if (vcpu->arch.exception.nr == PF_VECTOR) {
-			printk(KERN_DEBUG "kvm: inject_page_fault:"
-				" double fault 0x%lx\n", addr);
-			vcpu->arch.exception.nr = DF_VECTOR;
-			vcpu->arch.exception.error_code = 0;
-		} else if (vcpu->arch.
RE: Implement generic double fault generation mechanism
Dong, Eddie wrote:
> Move Double-Fault generation logic out of page fault
> exception generating function to cover more generic case.
>
> Signed-off-by: Eddie Dong
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ab1fdac..51a8dad 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -162,12 +162,59 @@ void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_apic_base);
>
> +#define EXCPT_BENIGN		0
> +#define EXCPT_CONTRIBUTORY	1
> +#define EXCPT_PF		2
> +
> +static int exception_class(int vector)
> +{
> +	if (vector == 14)
> +		return EXCPT_PF;
> +	else if (vector == 0 || (vector >= 10 && vector <= 13))
> +		return EXCPT_CONTRIBUTORY;
> +	else
> +		return EXCPT_BENIGN;
> +}
> +
> +static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
> +		unsigned nr, bool has_error, u32 error_code)
> +{
> +	u32 prev_nr;
> +	int class1, class2;
> +
> +	if (!vcpu->arch.exception.pending) {
> +		vcpu->arch.exception.pending = true;
> +		vcpu->arch.exception.has_error_code = has_error;
> +		vcpu->arch.exception.nr = nr;
> +		vcpu->arch.exception.error_code = error_code;
> +		return;
> +	}
> +
> +	/* to check exception */
> +	prev_nr = vcpu->arch.exception.nr;
> +	class2 = exception_class(nr);
> +	class1 = exception_class(prev_nr);
> +	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
> +		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
> +		/* generate double fault per SDM Table 5-5 */
> +		printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
> +			prev_nr, nr);
> +		vcpu->arch.exception.pending = true;
> +		vcpu->arch.exception.has_error_code = 1;
> +		vcpu->arch.exception.nr = DF_VECTOR;
> +		vcpu->arch.exception.error_code = 0;
> +		if (prev_nr == DF_VECTOR) {
> +			/* triple fault -> shutdown */
> +			set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
> +		}
> +	} else
> +		printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
> +			prev_nr, nr);
> +}
> +
>  void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
>  {
> -	WARN_ON(vcpu->arch.exception.pending);
> -	vcpu->arch.exception.pending = true;
> -	vcpu->arch.exception.has_error_code = false;
> -	vcpu->arch.exception.nr = nr;
> +	kvm_multiple_exception(vcpu, nr, false, 0);
>  }
>  EXPORT_SYMBOL_GPL(kvm_queue_exception);
>
> @@ -176,18 +223,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long addr,
>  {
>  	++vcpu->stat.pf_guest;
>
> -	if (vcpu->arch.exception.pending) {
> -		if (vcpu->arch.exception.nr == PF_VECTOR) {
> -			printk(KERN_DEBUG "kvm: inject_page_fault:"
> -				" double fault 0x%lx\n", addr);
> -			vcpu->arch.exception.nr = DF_VECTOR;
> -			vcpu->arch.exception.error_code = 0;
> -		} else if (vcpu->arch.exception.nr == DF_VECTOR) {
> -			/* triple fault -> shutdown */
> -			set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
> -		}
> -		return;
> -	}
>  	vcpu->arch.cr2 = addr;
>  	kvm_queue_exception_e(vcpu, PF_VECTOR, error_code);
>  }
> @@ -200,11 +235,7 @@ EXPORT_SYMBOL_GPL(kvm_inject_nmi);
>
>  void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
>  {
> -	WARN_ON(vcpu->arch.exception.pending);
> -	vcpu->arch.exception.pending = true;
> -	vcpu->arch.exception.has_error_code = true;
> -	vcpu->arch.exception.nr = nr;
> -	vcpu->arch.exception.error_code = error_code;
> +	kvm_multiple_exception(vcpu, nr, true, error_code);
>  }
>  EXPORT_SYMBOL_GPL(kvm_queue_exception_e);
Implement generic double fault generation mechanism
Move Double-Fault generation logic out of the page fault exception generating function to cover the more generic case.

Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab1fdac..51a8dad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -162,12 +162,59 @@ void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
+#define EXCPT_BENIGN		0
+#define EXCPT_CONTRIBUTORY	1
+#define EXCPT_PF		2
+
+static int exception_class(int vector)
+{
+	if (vector == 14)
+		return EXCPT_PF;
+	else if (vector == 0 || (vector >= 10 && vector <= 13))
+		return EXCPT_CONTRIBUTORY;
+	else
+		return EXCPT_BENIGN;
+}
+
+static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
+		unsigned nr, bool has_error, u32 error_code)
+{
+	u32 prev_nr;
+	int class1, class2;
+
+	if (!vcpu->arch.exception.pending) {
+		vcpu->arch.exception.pending = true;
+		vcpu->arch.exception.has_error_code = has_error;
+		vcpu->arch.exception.nr = nr;
+		vcpu->arch.exception.error_code = error_code;
+		return;
+	}
+
+	/* to check exception */
+	prev_nr = vcpu->arch.exception.nr;
+	class2 = exception_class(nr);
+	class1 = exception_class(prev_nr);
+	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
+		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
+		/* generate double fault per SDM Table 5-5 */
+		printk(KERN_DEBUG "kvm: double fault 0x%x on 0x%x\n",
+			prev_nr, nr);
+		vcpu->arch.exception.pending = true;
+		vcpu->arch.exception.has_error_code = 1;
+		vcpu->arch.exception.nr = DF_VECTOR;
+		vcpu->arch.exception.error_code = 0;
+		if (prev_nr == DF_VECTOR) {
+			/* triple fault -> shutdown */
+			set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+		}
+	} else
+		printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
+			prev_nr, nr);
+}
+
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
 {
-	WARN_ON(vcpu->arch.exception.pending);
-	vcpu->arch.exception.pending = true;
-	vcpu->arch.exception.has_error_code = false;
-	vcpu->arch.exception.nr = nr;
+	kvm_multiple_exception(vcpu, nr, false, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception);
 
@@ -176,18 +223,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long addr,
 {
 	++vcpu->stat.pf_guest;
 
-	if (vcpu->arch.exception.pending) {
-		if (vcpu->arch.exception.nr == PF_VECTOR) {
-			printk(KERN_DEBUG "kvm: inject_page_fault:"
-				" double fault 0x%lx\n", addr);
-			vcpu->arch.exception.nr = DF_VECTOR;
-			vcpu->arch.exception.error_code = 0;
-		} else if (vcpu->arch.exception.nr == DF_VECTOR) {
-			/* triple fault -> shutdown */
-			set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
-		}
-		return;
-	}
 	vcpu->arch.cr2 = addr;
 	kvm_queue_exception_e(vcpu, PF_VECTOR, error_code);
 }
@@ -200,11 +235,7 @@ EXPORT_SYMBOL_GPL(kvm_inject_nmi);
 
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
 {
-	WARN_ON(vcpu->arch.exception.pending);
-	vcpu->arch.exception.pending = true;
-	vcpu->arch.exception.has_error_code = true;
-	vcpu->arch.exception.nr = nr;
-	vcpu->arch.exception.error_code = error_code;
+	kvm_multiple_exception(vcpu, nr, true, error_code);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception_e);
deactivate fpu when cr0.PE ON?
The following code deactivates the FPU when CR0.PE is on; any explanation? The rest of the code activates/deactivates the FPU based on the CR0.TS bit.

thx, eddie

static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
	unsigned long guest_cr3;
	u64 eptp;

	guest_cr3 = cr3;
	if (enable_ept) {
		eptp = construct_eptp(cr3);
		vmcs_write64(EPT_POINTER, eptp);
		ept_sync_context(eptp);
		ept_load_pdptrs(vcpu);
		guest_cr3 = is_paging(vcpu) ? vcpu->arch.cr3 :
			VMX_EPT_IDENTITY_PAGETABLE_ADDR;
	}

	vmx_flush_tlb(vcpu);
	vmcs_writel(GUEST_CR3, guest_cr3);
	if (vcpu->arch.cr0 & X86_CR0_PE)
		vmx_fpu_deactivate(vcpu);
}
Check valid bit of VM_EXIT_INTR_INFO
Thx, eddie

commit ad4a9829c8d5b30995f008e32774bd5f555b7e9f
Author: root
Date: Thu Apr 2 11:16:03 2009 +0800

    Check valid bit of VM_EXIT_INTR_INFO before unblock nmi.

    Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..689523a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3268,16 +3268,18 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 	if (cpu_has_virtual_nmis()) {
-		unblock_nmi = (exit_intr_info & INTR_INFO_UNBLOCK_NMI) != 0;
-		vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
 		/*
 		 * SDM 3: 25.7.1.2
 		 * Re-set bit "block by NMI" before VM entry if vmexit caused by
 		 * a guest IRET fault.
 		 */
-		if (unblock_nmi && vector != DF_VECTOR)
-			vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+		if (exit_intr_info & INTR_INFO_VALID_MASK) {
+			unblock_nmi = !!(exit_intr_info & INTR_INFO_UNBLOCK_NMI);
+			vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
+			if (unblock_nmi && vector != DF_VECTOR)
+				vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
 				GUEST_INTR_STATE_NMI);
+		}
 	} else if (unlikely(vmx->soft_vnmi_blocked))
 		vmx->vnmi_blocked_time +=
 			ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));
RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considering the EXB bit
> > Looks good, but doesn't apply; please check if you are working against
> the latest version.

Rebased on top of a317a1e496b22d1520218ecf16a02498b99645e2 + the previous rsvd bits violation check patch.

thx, eddie

Use rsvd_bits_mask in load_pdptrs and remove bits 5-6 from rsvd_bits_mask per the latest SDM.

Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 41a0482..400c056 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -225,11 +225,6 @@ static int is_nx(struct kvm_vcpu *vcpu)
 	return vcpu->arch.shadow_efer & EFER_NX;
 }
 
-static int is_present_pte(unsigned long pte)
-{
-	return pte & PT_PRESENT_MASK;
-}
-
 static int is_shadow_present_pte(u64 pte)
 {
 	return pte != shadow_trap_nonpresent_pte
@@ -2195,6 +2190,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
 		context->rsvd_bits_mask[1][0] = 0;
 		break;
 	case PT32E_ROOT_LEVEL:
+		context->rsvd_bits_mask[0][2] =
+			rsvd_bits(maxphyaddr, 63) |
+			rsvd_bits(7, 8) | rsvd_bits(1, 2);	/* PDPTE */
 		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 62);		/* PDE */
 		context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index eaab214..3494a2f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -75,4 +75,9 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
 	return vcpu->arch.cr0 & X86_CR0_PG;
 }
 
+static inline int is_present_pte(unsigned long pte)
+{
+	return pte & PT_PRESENT_MASK;
+}
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9702353..3d07c9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -234,7 +234,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
 		goto out;
 	}
 	for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-		if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+		if (is_present_pte(pdpte[i]) &&
+		    (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
 			ret = 0;
 			goto out;
 		}
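The PDPTE mask composed above can be checked in isolation. The sketch below reimplements rsvd_bits() locally (the helper body is an assumption written to mirror KVM's, and maxphyaddr is whatever the CPU reports; 36 in the test below is just an example value):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed local mirror of KVM's rsvd_bits(): a mask with bits s..e
 * (inclusive) set. */
static inline uint64_t rsvd_bits(int s, int e)
{
	return ((2ULL << (e - s)) - 1) << s;
}

/* PAE PDPTE reserved bits per the patch above: everything above
 * maxphyaddr, plus bits 8:7 and bits 2:1. */
static uint64_t pdpte_rsvd_mask(int maxphyaddr)
{
	return rsvd_bits(maxphyaddr, 63) | rsvd_bits(7, 8) | rsvd_bits(1, 2);
}
```

load_pdptrs() would then reject any present PDPTE that intersects this mask, replacing the hard-coded 0xfff001e6ull constant.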
RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considering the EXB bit
Neiger, Gil wrote:
> PDPTEs are used only if CR0.PG=CR4.PAE=1.
>
> In that situation, their format depends on the value of IA32_EFER.LMA.
>
> If IA32_EFER.LMA=0, bit 63 is reserved and must be 0 in any PDPTE
> that is marked present. The execute-disable setting of a page is
> determined only by the PDE and PTE.
>
> If IA32_EFER.LMA=1, bit 63 is used for execute-disable in PML4
> entries, PDPTEs, PDEs, and PTEs (assuming IA32_EFER.NXE=1).
>
> - Gil

Rebased. Thanks, eddie

commit 032caed3da123950eeb3e192baf444d4eae80c85
Author: root
Date: Tue Mar 31 16:22:49 2009 +0800

    Use rsvd_bits_mask in load_pdptrs and remove bits 5-6 from
    rsvd_bits_mask per the latest SDM.

    Signed-off-by: Eddie Dong

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2eab758..1bed3aa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -225,11 +225,6 @@ static int is_nx(struct kvm_vcpu *vcpu)
 	return vcpu->arch.shadow_efer & EFER_NX;
 }
 
-static int is_present_pte(unsigned long pte)
-{
-	return pte & PT_PRESENT_MASK;
-}
-
 static int is_shadow_present_pte(u64 pte)
 {
 	return pte != shadow_trap_nonpresent_pte
@@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
 		context->rsvd_bits_mask[1][0] = 0;
 		break;
 	case PT32E_ROOT_LEVEL:
+		context->rsvd_bits_mask[0][2] =
+			rsvd_bits(maxphyaddr, 63) |
+			rsvd_bits(7, 8) | rsvd_bits(1, 2);	/* PDPTE */
 		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 62);		/* PDE */
 		context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 258e5d5..2a6eb50 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -75,4 +75,9 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
 	return vcpu->arch.cr0 & X86_CR0_PG;
 }
 
+static inline int is_present_pte(unsigned long pte)
+{
+	return pte & PT_PRESENT_MASK;
+}
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 961bd2b..b449ff0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -233,7 +233,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
 		goto out;
 	}
 	for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-		if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+		if (is_present_pte(pdpte[i]) &&
+		    (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
 			ret = 0;
 			goto out;
 		}
FW: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit
Avi Kivity wrote:
> Dong, Eddie wrote:
>> @@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu
>> *vcpu, int level) context->rsvd_bits_mask[1][0] = 0;
>> 	break;
>> case PT32E_ROOT_LEVEL:
>> +	context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
>> +		rsvd_bits(maxphyaddr, 62) |
>> +		rsvd_bits(7, 8) | rsvd_bits(1, 2);	/* PDPTE */
>> 	context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
>> 		rsvd_bits(maxphyaddr, 62);	/* PDE */
>> 	context->rsvd_bits_mask[0][0] = exb_bit_rsvd
>
> Are you sure that PDPTEs support NX? They don't support R/W and U/S,
> so it seems likely that NX is reserved as well even when EFER.NXE is
> enabled.

Gil: Here is the original mail on the KVM mailing list. If you would be able to help, that would be great.

thx, eddie
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit
Avi Kivity wrote:
> Dong, Eddie wrote:
>> @@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu
>> *vcpu, int level) context->rsvd_bits_mask[1][0] = 0;
>> 	break;
>> case PT32E_ROOT_LEVEL:
>> +	context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
>> +		rsvd_bits(maxphyaddr, 62) |
>> +		rsvd_bits(7, 8) | rsvd_bits(1, 2);	/* PDPTE */
>> 	context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
>> 		rsvd_bits(maxphyaddr, 62);	/* PDE */
>> 	context->rsvd_bits_mask[0][0] = exb_bit_rsvd
>
> Are you sure that PDPTEs support NX? They don't support R/W and U/S,
> so it seems likely that NX is reserved as well even when EFER.NXE is
> enabled.

I am referring to Fig 3-20/3-21 of SDM 3A, but I think Fig 3-20/21 is missing the EXB bit, given Table 3-5 and section 3.10.3. I will double-check with our internal architect.

thx, eddie
RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit
Dong, Eddie wrote: > This is followup of rsvd_bits emulation. > Base on new rsvd_bits emulation patch. thx, eddie commit 2c1472ef2b9fd87a261e8b58a7db11afd6a111dc Author: root Date: Mon Mar 30 17:05:47 2009 +0800 Use rsvd_bits_mask in load_pdptrs for cleanup with EXB bit considered. Signed-off-by: Eddie Dong diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2eab758..eaf41c0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -225,11 +225,6 @@ static int is_nx(struct kvm_vcpu *vcpu) return vcpu->arch.shadow_efer & EFER_NX; } -static int is_present_pte(unsigned long pte) -{ - return pte & PT_PRESENT_MASK; -} - static int is_shadow_present_pte(u64 pte) { return pte != shadow_trap_nonpresent_pte @@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level) context->rsvd_bits_mask[1][0] = 0; break; case PT32E_ROOT_LEVEL: + context->rsvd_bits_mask[0][2] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62) | + rsvd_bits(7, 8) | rsvd_bits(1, 2); /* PDPTE */ context->rsvd_bits_mask[0][1] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 62); /* PDE */ context->rsvd_bits_mask[0][0] = exb_bit_rsvd | diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 258e5d5..2a6eb50 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -75,4 +75,9 @@ static inline int is_paging(struct kvm_vcpu *vcpu) return vcpu->arch.cr0 & X86_CR0_PG; } +static inline int is_present_pte(unsigned long pte) +{ + return pte & PT_PRESENT_MASK; +} + #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 961bd2b..b449ff0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -233,7 +233,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3) goto out; } for (i = 0; i < ARRAY_SIZE(pdpte); ++i) { - if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) { + if (is_present_pte(pdpte[i]) && + (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) { ret = 0; goto out; }-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to 
majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Cleanup to reuse is_long_mode()
Avi Kivity wrote:
> Dong, Eddie wrote:
>> 	struct vcpu_svm *svm = to_svm(vcpu);
>>
>> #ifdef CONFIG_X86_64
>> -	if (vcpu->arch.shadow_efer & EFER_LME) {
>> +	if (is_long_mode(vcpu)) {
>>
>
> is_long_mode() actually tests EFER_LMA, so this is incorrect.

Is something missing? Here is the definition of is_long_mode(); the patch is just an equivalent replacement.

thx, eddie

static inline int is_long_mode(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_X86_64
	return vcpu->arch.shadow_efer & EFER_LME;
#else
	return 0;
#endif
}
RE: RFC: Add reserved bits check
> > Just noticed that walk_addr() too can be called from tdp context, so > need to make sure rsvd_bits_mask is initialized in init_kvm_tdp_mmu() > as well. Yes, fixed. Thx, eddie commit b282565503a78e75af643de42fe7bf495e2213ec Author: root Date: Mon Mar 30 16:57:39 2009 +0800 Emulate #PF error code of reserved bits violation. Signed-off-by: Eddie Dong diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 55fd4c5..4fe2742 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -261,6 +261,7 @@ struct kvm_mmu { union kvm_mmu_page_role base_role; u64 *pae_root; + u64 rsvd_bits_mask[2][4]; }; struct kvm_vcpu_arch { @@ -791,5 +792,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void); #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva(struct kvm *kvm, unsigned long hva); int kvm_age_hva(struct kvm *kvm, unsigned long hva); +int cpuid_maxphyaddr(struct kvm_vcpu *vcpu); #endif /* _ASM_X86_KVM_HOST_H */ diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ef060ec..2eab758 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -126,6 +126,7 @@ module_param(oos_shadow, bool, 0644); #define PFERR_PRESENT_MASK (1U << 0) #define PFERR_WRITE_MASK (1U << 1) #define PFERR_USER_MASK (1U << 2) +#define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) #define PT_DIRECTORY_LEVEL 2 @@ -179,6 +180,11 @@ static u64 __read_mostly shadow_accessed_mask; static u64 __read_mostly shadow_dirty_mask; static u64 __read_mostly shadow_mt_mask; +static inline u64 rsvd_bits(int s, int e) +{ + return ((1ULL << (e - s + 1)) - 1) << s; +} + void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte) { shadow_trap_nonpresent_pte = trap_pte; @@ -2155,6 +2161,14 @@ static void paging_free(struct kvm_vcpu *vcpu) nonpaging_free(vcpu); } +static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level) +{ + int bit7; + + bit7 = (gpte >> 7) & 1; + return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0; 
+} + #define PTTYPE 64 #include "paging_tmpl.h" #undef PTTYPE @@ -2163,6 +2177,54 @@ static void paging_free(struct kvm_vcpu *vcpu) #include "paging_tmpl.h" #undef PTTYPE +void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level) +{ + struct kvm_mmu *context = &vcpu->arch.mmu; + int maxphyaddr = cpuid_maxphyaddr(vcpu); + u64 exb_bit_rsvd = 0; + + if (!is_nx(vcpu)) + exb_bit_rsvd = rsvd_bits(63, 63); + switch (level) { + case PT32_ROOT_LEVEL: + /* no rsvd bits for 2 level 4K page table entries */ + context->rsvd_bits_mask[0][1] = 0; + context->rsvd_bits_mask[0][0] = 0; + if (is_cpuid_PSE36()) + /* 36bits PSE 4MB page */ + context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21); + else + /* 32 bits PSE 4MB page */ + context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21); + context->rsvd_bits_mask[1][0] = 0; + break; + case PT32E_ROOT_LEVEL: + context->rsvd_bits_mask[0][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62); /* PDE */ + context->rsvd_bits_mask[0][0] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62); /* PTE */ + context->rsvd_bits_mask[1][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62) | + rsvd_bits(13, 20); /* large page */ + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0]; + break; + case PT64_ROOT_LEVEL: + context->rsvd_bits_mask[0][3] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][2] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51); + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3]; + context->rsvd_bits_mask[1][2] = context->rsvd_bits_mask[0][2]; + context->rsvd_bits_mask[1][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20); + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0]; + break; + } +} + static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level) { struct kvm_mmu *context = 
&vcpu->arch.mmu; @@ -2183,6 +2245,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level) static int paging64_init_context(struct kvm_vcpu *vcpu) { + reset_rsvds_bits_mask(vcpu, PT64_
Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit
This is followup of rsvd_bits emulation. thx, eddie commit 171eb2b2d8282dd913a5d5c6c695fd64e1ddcf4c Author: root Date: Mon Mar 30 11:39:50 2009 +0800 Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit. Signed-off-by: Eddie Dong diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 0a6f109..b0bf8b2 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2255,6 +2255,9 @@ static int paging32E_init_context(struct kvm_vcpu *vcpu) if (!is_nx(vcpu)) exb_bit_rsvd = rsvd_bits(63, 63); + context->rsvd_bits_mask[0][2] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62) | + rsvd_bits(7, 8) | rsvd_bits(1, 2); /* PDPTE */ context->rsvd_bits_mask[0][1] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 62); /* PDE */ context->rsvd_bits_mask[0][0] = exb_bit_rsvd | @@ -2270,6 +2273,17 @@ static int paging32E_init_context(struct kvm_vcpu *vcpu) static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) { struct kvm_mmu *context = &vcpu->arch.mmu; + int maxphyaddr = cpuid_maxphyaddr(vcpu); + u64 exb_bit_rsvd = 0; + + if (!is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu)) { + if (!is_nx(vcpu)) + exb_bit_rsvd = rsvd_bits(63, 63); + + context->rsvd_bits_mask[0][2] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62) | + rsvd_bits(7, 8) | rsvd_bits(1, 2); /* PDPTE */ + } context->new_cr3 = nonpaging_new_cr3; context->page_fault = tdp_page_fault; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 961bd2b..ff178fd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -233,7 +233,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3) goto out; } for (i = 0; i < ARRAY_SIZE(pdpte); ++i) { - if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) { + if ((pdpte[i] & PT_PRESENT_MASK) && + (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) { ret = 0; goto out; }-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Cleanup to reuse is_long_mode()
Thanks, Eddie commit 6688a1fbc37330f2c4e16d1a78050b64e1ce5dcc Author: root Date: Mon Mar 30 11:31:10 2009 +0800 cleanup to reuse is_long_mode(vcpu) Signed-off-by: Eddie Dong diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index db5021b..affc31d 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -859,7 +859,7 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) struct vcpu_svm *svm = to_svm(vcpu); #ifdef CONFIG_X86_64 - if (vcpu->arch.shadow_efer & EFER_LME) { + if (is_long_mode(vcpu)) { if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) { vcpu->arch.shadow_efer |= EFER_LMA; svm->vmcb->save.efer |= EFER_LMA | EFER_LME; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 9913a1d..b1f1458 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1543,7 +1543,7 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) enter_rmode(vcpu); #ifdef CONFIG_X86_64 - if (vcpu->arch.shadow_efer & EFER_LME) { + if (is_long_mode(vcpu)) { if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) enter_lmode(vcpu); if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bf6683a..961bd2b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -289,7 +289,7 @@ void kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) { #ifdef CONFIG_X86_64 - if ((vcpu->arch.shadow_efer & EFER_LME)) { + if (is_long_mode(vcpu)) { int cs_db, cs_l; if (!is_pae(vcpu)) { cleanup.patch Description: cleanup.patch
RE: RFC: Add reserved bits check
>> +static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int
>> level)
>> +{
>> +	int ps = 0;
>> +
>> +	if (level == PT_DIRECTORY_LEVEL)
>> +		ps = !!(gpte & PT_PAGE_SIZE_MASK);
>>
>
> No need for this. If you set rsvd_bits_mask[1][0] ==
> rsvd_bits_mask[0][0], then you get the same behaviour. The first
> index is not the page size, it's just bit 7.

Sure, fixed.

>
> You'll need to fill all the indexes for bit 7 == 1, but it's worth it,
> with the 1GB pages patch.
>
>> +	return (gpte & vcpu->arch.mmu.rsvd_bits_mask[ps][level-1]) != 0;
>> +}
>> +
>> #define PTTYPE 64
>> #include "paging_tmpl.h"
>> #undef PTTYPE
>>
>> +int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_cpuid_entry2 *best;
>> +
>> +	best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
>> +	if (best)
>> +		return best->eax & 0xff;
>> +	return 32;
>> +}
>> +
>>
>
> Best to return 36 if the cpu doesn't support cpuid 0x80000008 but does
> support pae.

Mmm, I noticed conflicting information in the SDM, but you are right :) One more modification: after double-checking with our internal architect, the RSVD bit of the error code won't be updated if P=0.

Thanks, and reposted.
Eddie

Emulate #PF error code of reserved bits violation.
Signed-off-by: Eddie Dong diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 55fd4c5..4fe2742 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -261,6 +261,7 @@ struct kvm_mmu { union kvm_mmu_page_role base_role; u64 *pae_root; + u64 rsvd_bits_mask[2][4]; }; struct kvm_vcpu_arch { @@ -791,5 +792,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void); #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva(struct kvm *kvm, unsigned long hva); int kvm_age_hva(struct kvm *kvm, unsigned long hva); +int cpuid_maxphyaddr(struct kvm_vcpu *vcpu); #endif /* _ASM_X86_KVM_HOST_H */ diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ef060ec..0a6f109 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -126,6 +126,7 @@ module_param(oos_shadow, bool, 0644); #define PFERR_PRESENT_MASK (1U << 0) #define PFERR_WRITE_MASK (1U << 1) #define PFERR_USER_MASK (1U << 2) +#define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) #define PT_DIRECTORY_LEVEL 2 @@ -179,6 +180,11 @@ static u64 __read_mostly shadow_accessed_mask; static u64 __read_mostly shadow_dirty_mask; static u64 __read_mostly shadow_mt_mask; +static inline u64 rsvd_bits(int s, int e) +{ + return ((1ULL << (e - s + 1)) - 1) << s; +} + void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte) { shadow_trap_nonpresent_pte = trap_pte; @@ -2155,6 +2161,14 @@ static void paging_free(struct kvm_vcpu *vcpu) nonpaging_free(vcpu); } +static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level) +{ + int bit7; + + bit7 = (gpte >> 7) & 1; + return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0; +} + #define PTTYPE 64 #include "paging_tmpl.h" #undef PTTYPE @@ -2183,6 +2197,25 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level) static int paging64_init_context(struct kvm_vcpu *vcpu) { + struct kvm_mmu *context = &vcpu->arch.mmu; + int maxphyaddr = cpuid_maxphyaddr(vcpu); + u64 exb_bit_rsvd = 0; 
+ + if (!is_nx(vcpu)) + exb_bit_rsvd = rsvd_bits(63, 63); + + context->rsvd_bits_mask[0][3] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][2] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51); + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3]; + context->rsvd_bits_mask[1][2] = context->rsvd_bits_mask[0][2]; + context->rsvd_bits_mask[1][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20); + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0]; return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL); } @@ -2190,6 +2223,16 @@ static int paging32_init_context(struct kvm_vcpu *vcpu) { struct kvm_mmu *context = &vcpu->arch.mmu; + /* no rsvd bits for 2 level 4K page table entries */ + context->rsvd_bits_mask[0][1] = 0; + context->rsvd_bits_mask[0][0] = 0; + if (is_cpuid_PSE36()) + /* 36bits PSE 4MB page */ + context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21); + else + /* 32 bits PSE 4MB page */ + context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21); + context->rsvd_bits_mask[1][0] = 0; context->new_cr3 = paging_new_cr3; context->page_fault = paging32_page_fault; context->gva_to_gpa = pagin
RE: RFC: Add reserved bits check
> > Need to make sure rsvd_bits_mask[] is maintained on ept and npt, then. Sure, will be in next patch, post the current modified one. Thx, eddie Current KVM doesn't check reserved bits of guest page table entry, but use reserved bits to bypass guest #PF in VMX. This patch add reserved bit check while leaving shadow pte un-constructed if guest RSVD=1. commit dd1d697edf42953d407c10f4d38c650aafd3d3d5 Author: root Date: Fri Mar 27 23:35:27 2009 +0800 Emulate #PF error code of reserved bits violation. Signed-off-by: Eddie Dong diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 55fd4c5..4fe2742 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -261,6 +261,7 @@ struct kvm_mmu { union kvm_mmu_page_role base_role; u64 *pae_root; + u64 rsvd_bits_mask[2][4]; }; struct kvm_vcpu_arch { @@ -791,5 +792,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void); #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva(struct kvm *kvm, unsigned long hva); int kvm_age_hva(struct kvm *kvm, unsigned long hva); +int cpuid_maxphyaddr(struct kvm_vcpu *vcpu); #endif /* _ASM_X86_KVM_HOST_H */ diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ef060ec..35af90a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -126,6 +126,7 @@ module_param(oos_shadow, bool, 0644); #define PFERR_PRESENT_MASK (1U << 0) #define PFERR_WRITE_MASK (1U << 1) #define PFERR_USER_MASK (1U << 2) +#define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) #define PT_DIRECTORY_LEVEL 2 @@ -179,6 +180,11 @@ static u64 __read_mostly shadow_accessed_mask; static u64 __read_mostly shadow_dirty_mask; static u64 __read_mostly shadow_mt_mask; +static inline u64 rsvd_bits(int s, int e) +{ + return ((1ULL << (e - s + 1)) - 1) << s; +} + void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte) { shadow_trap_nonpresent_pte = trap_pte; @@ -2155,6 +2161,15 @@ static void paging_free(struct kvm_vcpu *vcpu) nonpaging_free(vcpu); } +static 
bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level) +{ + int ps = 0; + + if (level == PT_DIRECTORY_LEVEL) + ps = !!(gpte & PT_PAGE_SIZE_MASK); + return (gpte & vcpu->arch.mmu.rsvd_bits_mask[ps][level-1]) != 0; +} + #define PTTYPE 64 #include "paging_tmpl.h" #undef PTTYPE @@ -2183,6 +2198,22 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level) static int paging64_init_context(struct kvm_vcpu *vcpu) { + struct kvm_mmu *context = &vcpu->arch.mmu; + int maxphyaddr = cpuid_maxphyaddr(vcpu); + u64 exb_bit_rsvd = 0; + + if (!is_nx(vcpu)) + exb_bit_rsvd = rsvd_bits(63, 63); + + context->rsvd_bits_mask[0][3] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][2] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51); + context->rsvd_bits_mask[1][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20); return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL); } @@ -2190,6 +2221,15 @@ static int paging32_init_context(struct kvm_vcpu *vcpu) { struct kvm_mmu *context = &vcpu->arch.mmu; + /* no rsvd bits for 2 level 4K page table entries */ + context->rsvd_bits_mask[0][0] = 0; + context->rsvd_bits_mask[0][1] = 0; + if (is_cpuid_PSE36()) + /* 36bits PSE 4MB page */ + context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21); + else + /* 32 bits PSE 4MB page */ + context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21); context->new_cr3 = paging_new_cr3; context->page_fault = paging32_page_fault; context->gva_to_gpa = paging32_gva_to_gpa; @@ -2205,6 +2245,21 @@ static int paging32_init_context(struct kvm_vcpu *vcpu) static int paging32E_init_context(struct kvm_vcpu *vcpu) { + struct kvm_mmu *context = &vcpu->arch.mmu; + int maxphyaddr = cpuid_maxphyaddr(vcpu); + u64 exb_bit_rsvd = 0; + + if (!is_nx(vcpu)) + exb_bit_rsvd = rsvd_bits(63, 63); + 
+ context->rsvd_bits_mask[0][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62); /* PDE */ + context->rsvd_bits_mask[0][0] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62); /* PTE */ + context->rsvd_bits_mask[1][1] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62) | + rsvd_bits(13, 20); /* large page */ + return paging64_init_context_common(vcpu, PT32E_ROOT_LEVEL); } diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_
RE: RFC: Add reserved bits check
>> Will never be use, PDPTEs are loaded by set_cr3(), not walk_addr().
>>
>
> I see, then how about to replace CR3_PAE_RESERVED_BITS check at cr3
> load with rsvd_bits_mask[2]? Seems current code are lacking of enough
> reserved bits check too.
>

Typo, I meant this:

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -233,7 +233,7 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
 		goto out;
 	}
 	for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
-		if ((pdpte[i] & 1) && (pdpte[i] & 0xfff001e6ull)) {
+		if ((pdpte[i] & 1) &&
+		    (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
 			ret = 0;
 			goto out;
 		}
RE: RFC: Add reserved bits check
>> + context->rsvd_bits_mask[0] = rsvd_bits(maxphyaddr, 51); >> + context->large_page_rsvd_mask = /* 2MB PDE */ >> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20); >> return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL); } >> > > Isn't bit 63 reserved if NX is disabled? Sure. > >> >> @@ -2206,6 +2258,18 @@ static int paging32_init_context(struct >> kvm_vcpu *vcpu) >> >> static int paging32E_init_context(struct kvm_vcpu *vcpu) { >> + struct kvm_mmu *context = &vcpu->arch.mmu; >> + int maxphyaddr = cpuid_maxphyaddr(vcpu); >> + >> + /* 3 levels */ >> + context->rsvd_bits_mask[2] = rsvd_bits(maxphyaddr, 63) | >> + rsvd_bits(7, 8) | rsvd_bits(1,2); /* PDPTE */ >> > > Will never be use, PDPTEs are loaded by set_cr3(), not walk_addr(). > I see, then how about to replace CR3_PAE_RESERVED_BITS check at cr3 load with rsvd_bits_mask[2]? Seems current code are lacking of enough reserved bits check too. >> @@ -153,10 +154,13 @@ walk: >> walker->level - 1, table_gfn); >> >>kvm_read_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte)); >> + rsvd_fault = is_rsvd_bits_set(vcpu, pte, walker->level); >> > > Does a not present pte set PFERR_RSVD? Yes though most commercial OS doesn't use it. I plan to post a follow up patch to fix the potential RSVD_fault error code mismatch when bypass_guest_pf=1. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RFC: Add reserved bits check
Current KVM doesn't check reserved bits of guest page table, while may use reserved bits to bypass guest #PF in VMX. This patch add this check while leaving shadow pte un-constructed if guest RSVD=1. Comments? Thx, eddie diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 55fd4c5..9370ff0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -261,6 +261,8 @@ struct kvm_mmu { union kvm_mmu_page_role base_role; u64 *pae_root; + u64 rsvd_bits_mask[4]; + u64 large_page_rsvd_mask; }; struct kvm_vcpu_arch { diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 31ba3cb..7f55c4a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -127,6 +127,7 @@ module_param(oos_shadow, bool, 0644); #define PFERR_PRESENT_MASK (1U << 0) #define PFERR_WRITE_MASK (1U << 1) #define PFERR_USER_MASK (1U << 2) +#define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) #define PT_DIRECTORY_LEVEL 2 @@ -179,6 +180,13 @@ static u64 __read_mostly shadow_user_mask; static u64 __read_mostly shadow_accessed_mask; static u64 __read_mostly shadow_dirty_mask; static u64 __read_mostly shadow_mt_mask; +extern struct kvm_cpuid_entry2 *kvm_find_cpuid_entry( + struct kvm_vcpu *vcpu, u32 function, u32 index); + +static inline u64 rsvd_bits(int s, int e) +{ + return ((1ULL << (e - s + 1)) - 1) << s; +} void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte) { @@ -251,6 +259,18 @@ static int is_rmap_pte(u64 pte) return is_shadow_present_pte(pte); } +static int cpuid_maxphyaddr(struct kvm_vcpu *vcpu) +{ + u32 function=0x8008; + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, function, 0); + if (best) { + return best->eax & 0xff; + } + return 40; +} + static pfn_t spte_to_pfn(u64 pte) { return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; @@ -2156,6 +2176,17 @@ static void paging_free(struct kvm_vcpu *vcpu) nonpaging_free(vcpu); } +static int is_rsvd_bits_set(struct kvm_vcpu *vcpu, unsigned long pte, int 
level) +{ + if (level == PT_DIRECTORY_LEVEL && (pte & PT_PAGE_SIZE_MASK)) { + /* large page */ + return (pte & vcpu->arch.mmu.large_page_rsvd_mask) != 0; + } + else + /* 4K page */ + return (pte & vcpu->arch.mmu.rsvd_bits_mask[level-1]) != 0; +} + #define PTTYPE 64 #include "paging_tmpl.h" #undef PTTYPE @@ -2184,6 +2215,18 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level) static int paging64_init_context(struct kvm_vcpu *vcpu) { + struct kvm_mmu *context = &vcpu->arch.mmu; + int maxphyaddr = cpuid_maxphyaddr(vcpu); + + context->rsvd_bits_mask[3] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[2] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[1] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8); + context->rsvd_bits_mask[0] = rsvd_bits(maxphyaddr, 51); + context->large_page_rsvd_mask = /* 2MB PDE */ + rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20); return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL); } @@ -2191,6 +2234,15 @@ static int paging32_init_context(struct kvm_vcpu *vcpu) { struct kvm_mmu *context = &vcpu->arch.mmu; + /* no rsvd bits for 2 level 4K page table entries */ + context->rsvd_bits_mask[0] = 0; + context->rsvd_bits_mask[1] = 0; + if (is_cpuid_PSE36()) + /* 36bits PSE 4MB page */ + context->large_page_rsvd_mask = rsvd_bits(17, 21); + else + /* 32 bits PSE 4MB page */ + context->large_page_rsvd_mask = rsvd_bits(13, 21); context->new_cr3 = paging_new_cr3; context->page_fault = paging32_page_fault; context->gva_to_gpa = paging32_gva_to_gpa; @@ -2206,6 +2258,18 @@ static int paging32_init_context(struct kvm_vcpu *vcpu) static int paging32E_init_context(struct kvm_vcpu *vcpu) { + struct kvm_mmu *context = &vcpu->arch.mmu; + int maxphyaddr = cpuid_maxphyaddr(vcpu); + + /* 3 levels */ + context->rsvd_bits_mask[2] = rsvd_bits(maxphyaddr, 63) | + rsvd_bits(7, 8) | rsvd_bits(1,2); /* PDPTE */ + context->rsvd_bits_mask[1] = rsvd_bits(maxphyaddr, 63); /* PDE */ + 
context->rsvd_bits_mask[0] = /* PTE */ + rsvd_bits(maxphyaddr, 63) | rsvd_bits(7, 8) | rsvd_bits(1, 2); + context->large_page_rsvd_mask = /* 2M page */ + rsvd_bits(maxphyaddr, 63) | rsvd_bits(13, 20); + return paging64_init_context_common(vcpu, PT32E_ROOT_LEVEL); } diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 7314c09..844efe9 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -123,6 +123,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker, gfn_t table_gfn; unsigned index, pt_access, pte_access; gpa_t pte_gpa; + int rsvd_fault; pgprintk("%s: addr %lx\n", __func__, addr); walk: @@ -153,10 +154,13 @@ walk: walker->level - 1, table_gfn); kvm_read_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte)); + rsvd_fault = is_rsvd_bits_set(vcpu, pte, walker->level); if (!is_present_pte(pte)) goto not_p
RE: Fix kernel pio emulation mistake
> kvm_emulate_pio() returns 1 when emulation is complete,
> and 0 when emulation needs further processing in
> userspace. So I think in both cases cannot_emulate is
> the wrong answer. I think 'in' emulation gets it right.

OK, yes. Do you mean this? I may have misunderstood.

Thx, eddie

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index ca91749..838fdc9 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1838,11 +1838,11 @@ special_insn:
 		io_dir_in = 0;
 	do_io:
 		if (kvm_emulate_pio(ctxt->vcpu, NULL, io_dir_in,
 				    (c->d & ByteOp) ? 1 : c->op_bytes,
-				    port) != 0) {
+				    port) == 0) {
 			c->eip = saved_eip;
-			goto cannot_emulate;
+			return -1;
 		}
-		break;
+		return 0;
 	case 0xf4:	/* hlt */
 		ctxt->vcpu->arch.halt_request = 1;
 		break;
Fix kernel pio emulation mistake
Kernel pio emulation's return value is mistakenly checked; fortunately it is not hit yet during normal OS bootup :(

Signed-off-by: Eddie Dong

commit 98d3dc8b67ba0bc7f494de3ade8f2b5cfcadaeb4
Author: root
Date: Thu Mar 19 15:44:39 2009 +0800

    fix a bug when kernel PIO is emulated.

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index ca91749..0edd2e7 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1838,7 +1838,7 @@ special_insn:
 		io_dir_in = 0;
 	do_io:
 		if (kvm_emulate_pio(ctxt->vcpu, NULL, io_dir_in,
 				    (c->d & ByteOp) ? 1 : c->op_bytes,
-				    port) != 0) {
+				    port) == 0) {
 			c->eip = saved_eip;
 			goto cannot_emulate;
 		}
RE: copyless virtio net thoughts?
Simon Horman wrote:
> On Wed, Feb 18, 2009 at 10:08:00PM +1030, Rusty Russell wrote:
>>
>> 2) Direct NIC attachment This is particularly interesting with
>> SR-IOV or other multiqueue nics, but for boutique cases or
>> benchmarks, could be for normal NICs. So far I have some very
>> sketched-out patches: for the attached nic dev_alloc_skb() gets an
>> skb from the guest (which supplies them via some kind of AIO
>> interface), and a branch in netif_receive_skb() which returned it to
>> the guest. This bypasses all firewalling in the host though; we're
>> basically having the guest process drive the NIC directly.
>
> Hi Rusty,
>
> Can I clarify that the idea with utilising SR-IOV would be to assign
> virtual functions to guests? That is, something conceptually similar
> to PCI pass-through in Xen (although I'm not sure that anyone has
> virtual function pass-through working yet). If so, wouldn't this also
> be useful on machines that have multiple NICs?
>

Yes, and we have successfully got it running with a VF assigned to the guest in both Xen and KVM, but we are still working on pushing those patches out since this needs Linux PCI subsystem support and driver support.

Thx, eddie
RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
> What we would rather do in KVM, is have the VFs appear in the host as standard network devices. We would then like to back our existing PV driver to this VF directly, bypassing the host networking stack. A key feature here is being able to fill the VF's receive queue with guest memory instead of host kernel memory so that you can get zero-copy receive traffic. This will perform just as well as doing passthrough (at least) and avoid all that ugliness of dealing with SR-IOV in the guest.

Anthony: This is already addressed by the VMDq solution (the so-called netchannel2), right? Qing He is debugging the KVM-side patch and is pretty close to done. For this single purpose, we don't need SR-IOV. BTW, at least the Intel SR-IOV NIC also supports VMDq, so you can achieve this by simply using a "native" VMDq-enabled driver here, plus the work we are debugging now.

Thx, eddie
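The zero-copy receive idea Anthony describes — the guest posts its own buffers into the VF's receive ring, so packet data lands directly in guest memory — can be sketched as a toy model. All names here are hypothetical, and the real mechanism involves DMA-mapping guest pages rather than a memcpy; the memcpy below only stands in for the device's DMA:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a zero-copy receive ring: the guest owns the buffers,
 * and the "device" fills them directly, so the host never copies. */
#define RING_SIZE 4

struct rx_ring {
    void  *buf[RING_SIZE];   /* guest-owned buffers posted by the guest */
    size_t len[RING_SIZE];   /* filled length, 0 = slot still empty */
    int    head;             /* next slot the simulated device fills */
};

/* Guest posts one of its own buffers into the ring. */
void rx_post_buffer(struct rx_ring *r, int slot, void *guest_buf)
{
    r->buf[slot] = guest_buf;
    r->len[slot] = 0;
}

/* Simulated device DMA: writes straight into the guest buffer.
 * (In this sketch the caller guarantees the buffer is large enough.) */
int rx_deliver(struct rx_ring *r, const void *pkt, size_t n)
{
    int slot = r->head;
    if (!r->buf[slot] || r->len[slot] != 0)
        return -1;                  /* no free guest buffer posted */
    memcpy(r->buf[slot], pkt, n);   /* stands in for the device DMA */
    r->len[slot] = n;
    r->head = (slot + 1) % RING_SIZE;
    return slot;
}
```

The point of the design discussed in the thread is that the only "copy" is the device's own DMA into guest memory; the host networking stack never touches the payload.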
RE: [PATCH 6/6 v3] PCI: document the change
Matthew Wilcox wrote:
> On Tue, Oct 14, 2008 at 10:14:35AM +0800, Yu Zhao wrote:
>>> BTW, the SR-IOV patch is not only for network; some other devices such as IDE will use the same code base as well, and we imagine it could have other parameters to set, such as the starting LBA of an IDE VF.
>>
>> As Eddie said, we have two problems here:
>> 1) User has to set device specific parameters of a VF when he wants to use this VF with KVM (assign this device to KVM guest). In this case, VF driver is not loaded in the host environment. So operations which are implemented as driver callback (e.g. set_mac_address()) are not supported.
>
> I suspect what you want to do is create, then configure the device in the host, then assign it to the guest.

That is not true. Remember that the created VFs will be destroyed regardless, whether by a PF power event or an error-recovery-induced reset. So what we want is: configure, create, assign, and then deassign and destroy, and then recreate...

>> 2) For security reason, some SR-IOV devices prohibit the VF driver configuring the VF via its own register space. Instead, the configurations must be done through the PF which the VF is associated with. This means PF driver has to receive parameters that are used to configure its VFs. These parameters obviously can be passed by traditional tools, if without modification for SR-IOV.
>
> I think that idea also covers this point.

Sorry, can you explain a little bit more? The SR-IOV patch won't define what kind of entries should be created or not; we leave it to the network subsystem to decide what to do. Same for the disk subsystem, etc.

Thx, eddie
RE: [PATCH 6/6 v3] PCI: document the change
Matthew Wilcox wrote:
> On Tue, Oct 14, 2008 at 08:23:34AM +0800, Dong, Eddie wrote:
>> Matthew Wilcox wrote:
>>> Wouldn't it be more useful to have the iov/N directories be a symlink to the actual pci_dev used by the virtual function?
>>
>> The main concern here is that a VF may be disabled, such as when the PF enters the D3 state or undergoes a reset and is thus unplugged, but the user won't want to re-configure the VF when the PF returns to a working state.
>
> If we're relying on the user to reconfigure virtual functions on return to D0 from D3, that's a very fragile system.

No. That is the concern: we don't put those configuration entries under the VF nodes because they will disappear. Do I miss something?

>>>> +For network device, there are:
>>>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/mac
>>>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan
>>>> + (value update will notify PF driver)
>>>
>>> We already have tools to set the MAC and VLAN parameters for network devices.
>>
>> Do you mean ethtool? If yes, it is impossible for SR-IOV since the configuration has to be done on the PF side, rather than the VF.
>
> I don't think ethtool has that ability; ip(8) can set mac addresses and vconfig(8) sets vlan parameters.
>
> The device driver already has to be aware of SR-IOV. If it's going to support the standard tools (and it damn well ought to), then it should call the PF driver to set these kinds of parameters.

OK, if it has the VF parameter, I will look into the details. BTW, the SR-IOV patch is not only for network; some other devices such as IDE will use the same code base as well, and we imagine it could have other parameters to set, such as the starting LBA of an IDE VF.

>>> I'm not 100% convinced about this API. The assumption here is that the driver will do it, but I think it should probably be in the core. 
>>> The driver probably wants to be
>>
>> Our concern is that the PF driver may apply a default state when it is loaded so that SR-IOV can work without any user-level configuration, but of course the driver won't dynamically change it.
>> Do you mean we should remove this ability?
>>
>>> notified that the PCI core is going to create a virtual function, and would it please prepare to do so, but I'm not convinced this should be triggered by the driver.
>>> How would the driver decide to create a new virtual function?
>
> Let me try to explain this a bit better.
>
> The user decides they want a new ethernet virtual function. In the scheme as you have set up:
>
> 1. User communicates to ethernet driver "I want a new VF"
> 2. Ethernet driver tells PCI core "create new VF".
>
> I propose:
>
> 1. User tells PCI core "I want a new VF on PCI device :01:03.0"
> 2. PCI core tells driver "User wants a new VF"

If the user needs a new VF, the VF must already be enabled or exist in the OS. Otherwise, we need to disable all VFs first and then change NumVFs to re-enable the VFs. The spec says: "NumVFs may only be written while VF Enable is Clear."

> My scheme gives us a unified way of creating new VFs; yours requires each driver to invent a way for the user to tell them to create a new VF. Unless I've misunderstood your code and docs.

Assigning a VF is a job for both the user and the core.

>>> From my reading of the SR-IOV spec, this isn't how it's supposed to work. The device is supposed to be a fully functional PCI device that on demand can start peeling off virtual functions; it's not supposed to boot up and initialise all its virtual functions at once.
>>
>> The spec defines that either we enable all VFs or disable them. Per-VF enabling is not supported. Is this what you are concerned about?
>
> I don't think that's true. The spec requires you to enable all the VFs from 0 to NumVFs, but NumVFs can be lower than TotalVFs. At least, that's how I read it. 
Yes, but setting NumVFs can only occur while the VFs are disabled. The following is from the spec:

    NumVFs may only be written while VF Enable is Clear. If NumVFs is written when VF Enable is Set, the results are undefined. The initial value of NumVFs is undefined.

> But no, that isn't my concern. My concern is that you've written a driver here that seems to be a stub driver. That doesn't seem to be how SR-IOV is supposed to work; it's supposed to be a fully-functional driver that has SR-IOV knowledge added to it.

Yes, it is a full-featured driver, provided the PF has resources left, for example when not all queues are assigned to VFs.

Thx, eddie
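The ordering constraint quoted from the spec can be captured in a small illustrative model (not kernel code; the structure and function names are made up for the sketch). Writes to NumVFs are rejected while VF Enable is set, since the spec leaves their effect undefined:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the SR-IOV control rule quoted above:
 * "NumVFs may only be written while VF Enable is Clear." */
struct sriov_cap {
    uint16_t total_vfs;   /* TotalVFs: hardware maximum */
    uint16_t num_vfs;     /* NumVFs: how many VFs to enable */
    int      vf_enable;   /* VF Enable bit */
};

/* Returns 0 on success, -1 if the write is rejected.
 * The spec says the result is undefined while VF Enable is set,
 * so this model simply refuses the write in that state. */
int sriov_set_numvfs(struct sriov_cap *cap, uint16_t n)
{
    if (cap->vf_enable)
        return -1;
    if (n > cap->total_vfs)
        return -1;
    cap->num_vfs = n;
    return 0;
}

void sriov_set_vf_enable(struct sriov_cap *cap, int enable)
{
    cap->vf_enable = enable ? 1 : 0;
}
```

So changing the VF count means the full disable/reconfigure/re-enable cycle described above — which is exactly why the created VFs cannot be treated as permanent objects.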
RE: [PATCH 6/6 v3] PCI: document the change
Matthew Wilcox wrote:
> Wouldn't it be more useful to have the iov/N directories be a symlink to the actual pci_dev used by the virtual function?

The main concern here is that a VF may be disabled, such as when the PF enters the D3 state or undergoes a reset and is thus unplugged, but the user won't want to re-configure the VF when the PF returns to a working state.

>> +For network device, there are:
>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/mac
>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan
>> + (value update will notify PF driver)
>
> We already have tools to set the MAC and VLAN parameters for network devices.

Do you mean ethtool? If yes, it is impossible for SR-IOV since the configuration has to be done on the PF side, rather than the VF.

> I'm not 100% convinced about this API. The assumption here is that the driver will do it, but I think it should probably be in the core. The driver probably wants to be

Our concern is that the PF driver may apply a default state when it is loaded so that SR-IOV can work without any user-level configuration, but of course the driver won't dynamically change it. Do you mean we should remove this ability?

> notified that the PCI core is going to create a virtual function, and would it please prepare to do so, but I'm not convinced this should be triggered by the driver. How would the driver decide to create a new virtual function?
>
> From my reading of the SR-IOV spec, this isn't how it's supposed to work. The device is supposed to be a fully functional PCI device that on demand can start peeling off virtual functions; it's not supposed to boot up and initialise all its virtual functions at once.

The spec defines that either we enable all VFs or disable them. Per-VF enabling is not supported. Is this what you are concerned about?

Thanks, eddie
RE: Remaining passthrough/VT-d tasks list
Avi Kivity wrote:
> Dong, Eddie wrote:
>>> I don't see how this relates to shared guest interrupts. Whatever you have on the host side, you still need to support shared guest interrupts. The only way to avoid the issue is by using MSI for the guest, and even then we still have to support interrupt sharing since not all guests have MSI support.
>>
>> Yes, but guest sharing is easy to solve by emulating the NAND gate behavior of the PCI line.
>
> Certainly, it isn't difficult.

BTW, did you have a look at the SR-IOV patch? It addresses both Xen and KVM.

Thx, eddie
RE: Remaining passthrough/VT-d tasks list
> I don't see how this relates to shared guest interrupts. Whatever you have on the host side, you still need to support shared guest interrupts. The only way to avoid the issue is by using MSI for the guest, and even then we still have to support interrupt sharing since not all guests have MSI support.

Yes, but guest sharing is easy to solve by emulating the NAND gate behavior of the PCI line.
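For illustration, the "NAND gate" emulation amounts to this: PCI INTx is level-triggered, active-low and open-drain, so the wire behaves like an AND of active-low outputs — equivalently, the guest's interrupt pin reads as asserted while any sharing device asserts its line. A minimal sketch, with hypothetical names:

```c
#include <assert.h>

/* Sketch of emulating one shared PCI INTx line presented to a guest.
 * The guest's PIC/IOAPIC pin stays asserted while ANY device sharing
 * the line is still driving it. */
#define MAX_SHARED 8

struct intx_line {
    int ndevs;
    int asserted[MAX_SHARED];   /* 1 = device i is driving the line */
};

/* A device (or its emulation) raises or drops its own contribution. */
void intx_set(struct intx_line *l, int dev, int level)
{
    l->asserted[dev] = level;
}

/* Logical level presented to the guest's interrupt pin:
 * the wired-OR of all per-device assertions. */
int intx_level(const struct intx_line *l)
{
    for (int i = 0; i < l->ndevs; i++)
        if (l->asserted[i])
            return 1;
    return 0;
}
```

The consequence for sharing is visible in the model: the line only deasserts after every device has dropped its contribution, which is why a guest that never EOIs one device can keep the line flooded for everyone.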
RE: Remaining passthrough/VT-d tasks list
Tian, Kevin wrote:
>> From: Avi Kivity
>> Sent: September 27, 2008 17:50
>>
>> Yang, Sheng wrote:
>>> After checking the host shared interrupts situation, I got a question here:
>>>
>>> If I understand correctly, the current solution doesn't block host shared irqs; it just comes with a performance penalty. The penalty comes from the host disabling the irq line for a period: we have to wait for the guest to write EOI. But I fail to see the correctness problem here (except a lot of spurious interrupts in the guest).
>>>
>>> I've checked the mail, but can't find a clue about that. Can you explain the situation?
>>
>> If the guest fails to disable interrupts on a device that shares an interrupt line with the host, the host will experience an interrupt flood. Eventually the host will disable the host device as well.
>
> This issue also exists on the host side: one misbehaved driver can hurt all other drivers sharing the same irq line. But there seems to be no good way to avoid it. Since not all devices support MSI, we still need to support irq sharing, possibly with the above caveats.
>
> The existing approach at least works with a sane guest driver, with some performance penalty.
>
> Or do you have a better alternative?
>
> Thanks,
> Kevin

MSI is always the first choice, including using host MSI for the guest IOAPIC case, because we don't know if the guest OS has MSI support, but we are sure host Linux does. When MSI is impossible, I recommend we disable device assignment for devices sharing an interrupt, or we assign all devices with the same interrupt to the same guest. Yes, the issue is the same on native hardware, but there the whole OS (kernel) is in the same isolation domain, while here different guests are in different isolation domains :(

In one word, MSI is pretty important for direct I/O, and SR-IOV is the #1 usage in future. 
Just advocating once more, and hoping more people can ack the SR-IOV patch from Yu Zhao, so that we can see 2.6.28 work for direct I/O without sacrificing sharing :)

Eddie
RE: Remaining passthrough/VT-d tasks list
Avi Kivity wrote:
> Han, Weidong wrote:
>> Hi all,
>>
>> The initial passthrough/VT-d patches have been in kvm; it's time to enhance them and push them into 2.6.28.
>>
>> - Shared Interrupt support
>
> Shared guest interrupts is a prerequisite for merging into mainline. Without this, device assignment is useless in anything but a benchmark scenario. I won't push device assignment for 2.6.28 without it.
>
> Shared host interrupts are a different matter; which one did you mean?

Avi: How about we think of it another way? The top usage model of the IOMMU is SR-IOV in my mind, at least for the enterprise usage model. We are pushing the SR-IOV patch for 2.6.28, and are continuously polishing the patch. Even if it missed the 2.6.28 merge window (unlikely?), we would be able to ask OSVs to take the SR-IOV patch separately before the code freeze, since it is very small, but it is hard to ask them to take the whole IOMMU patch set.

On the Xen side, IOMMU support is there and MSI-X is there, so the SR-IOV patch is the only piece missing to enable SR-IOV. On the KVM side, very likely we can get the MSI patch done soon, before the Chinese holiday, and we of course will spend tons of effort on quality too.

Should we target this? If yes, we make the MSI patch and the 2.6.28 push the first priority. We would then be able to see the next major release of a VMM using KVM have hardware I/O virtualization technology: close-to-native performance, no sacrifice of I/O sharing, minimal CPU utilization, etc. For the legacy PCI pass-through support, we can continue to improve it too.

Thanks, eddie
RE: Event channels in KVM?
Matt Anger wrote:
> I was referring to the bounce from host kernel to qemu and then back to the host kernel for my BE driver.
> Xen:
> guest -> guest kernel driver -> host kernel driver
>
> For both situations I need a FE and BE driver, but for KVM I need to modify QEMU and teach it how to pass the virtio calls to my Host driver. In Xen BE driver and Host driver are the same and I don't have to recompile any part of Xen.

You can implement a (host) kernel-level device model too, which will have the same result as Xen, i.e. skip Qemu. I don't think Xen and KVM have a fundamental difference here, just different implementation styles. Xen implements the BE driver in the driver domain kernel, while KVM likes to implement it in a user-level host process. Different people have different preferences; I am neutral, actually :)

Thx, eddie
RE: Event channels in KVM?
Matt Anger wrote:
> Thanks for the info, I've been looking into it by trying to look around the kvm source code.
> Apparently I have to write a kernel driver for the guest os and then also write a backend driver and modify qemu to use it? Is that correct?
> That seems ugly, especially since now my io goes guest -> guest kernel driver -> host kernel kvm -> qemu -> host kernel driver
> With Xen event-channels my driver gets the event notification directly with no user space intervention, removing the middle 2 steps, and I

Why would virtio need to start from guest user space while an event channel does not? It is exactly the same: an event channel can start from user space or kernel space, and the same is true for virtio. The only difference is that guest<->host communication needs to go through Qemu (a host process) in the model you describe, while Xen handles event channels in the hypervisor. But this can be enhanced too, IMO, by employing a kernel-level device model that does the inter-communication if performance is critical — though most probably it is not.

> don't have to touch any of Xen's codebase. My driver just registers for an event-channel.

But you still need the counterpart side to handle it. I think you may want to reuse the Xen BE side code, which is a wrong assumption.

> Thanks
> -Matt
>
> On Fri, Sep 19, 2008 at 12:14 PM, Anthony Liguori <[EMAIL PROTECTED]> wrote:
>> Javier Guerra wrote:
>>> On Fri, Sep 19, 2008 at 1:31 PM, Anthony Liguori <[EMAIL PROTECTED]> wrote:
>>> Matt Anger wrote:
>>>> Does KVM have any interface similar to event-channels like Xen does? Basically a way to send notifications between the host and guest.
>>>
>>> virtio is the abstraction we use. But virtio is based on the standard hardware interfaces of the PC--PIO, MMIO, and interrupts.
>>>
>>> this is rather low-level, it would be nice to have a multiplatform interface to this abstraction.
>>
>> That's exactly the purpose of virtio. virtio is a high-level, cross platform interface. 
It's been tested >> on x86, PPC, s390, and I believe ia64. It also works in lguest. >> >> It happens to use PIO, MMIO, and interrupts on x86 under KVM but other virtio implementations exist for other platforms. >> >>> just for kicks, i've found and printed Rusty's paper about it. hope it's current :-) >> >> The other good thing to look at is the lguest documentation. You can skip to just the virtio bits if you're so inclined. It's really quite thoroughly documented. >> >> Regards, >> >> Anthony Liguori
RE: KVM IO performance
Sukanto Ghosh wrote:
> Hi all,
>
> Why has the IO device emulation part been kept in userspace?
> IO attempts cause VM-exits to KVM (running in kernel mode), which then forwards these requests to userspace (a mode switch). After completion of the IO in userspace, another mode switch is done back to KVM before resuming the guest.
> What would be the problems had the device emulation parts been moved totally to kernelspace?

The simple reason is that people don't want to pull a lot of code into the kernel whose failure may crash the whole system. Of course there are many other reasons, such as maintaining synchronization with Qemu upstream, coding style, maintenance effort, etc.

Eddie
Vringfd
Rusty/Anthony/Avi and all:

I did a read of Rusty's April 18 post of the vringfd patch. Though I didn't find explicit code for how to call get_user_skb_frags, it seems clear that the user only needs to trigger a "flush_tx" command. All the rest happens automatically — fetching the ring, sending to the bridge, etc. — which is perfect from a performance point of view.

Now I am working on enabling VMDq, which provides a HW pre-sorting mechanism based on MAC address. It can definitely take advantage of vringfd, with the additional advantage of zero-copy on the receive side, which makes VMDq-enabled virtio-net fly with zero-copy for both transmit and receive.

The only things a VMDq agency (a vringfd-like component in the kernel) may bother user level with are initialization and notification, given that vringfd already solved the ring operations; since the KVM virtual interrupt module is on the kernel side, it is pretty easy to free the user from receiving packets at run time.

Any comments?

Thanks, eddie