Re: [PATCH 18/24] Exiting from L2 to L1
On Monday 13 September 2010 01:21:29 Avi Kivity wrote: > On 09/12/2010 07:05 PM, Nadav Har'El wrote: > >> I don't think so. We write this field as part of guest entry (that is, > >> after the decision re which vmcs to use, yes?), so guest entry will > >> always follow writing this field. Since guest entry clears the field, > >> reading it after an exit will necessarily return 0. > > > > Well, you obviously know the KVM code much better than I do, but from > > what I saw, I thought (maybe I misunderstood) that in normal > > (non-nested) KVM, this field only gets written on injection, not on > > every entry, so the code relies on the fact that the processor turns off > > the "valid" bit during exit, to avoid the same event being injected when > > the same field value is used for another entry. > > Correct. > > > I can only find code which resets this field in vmx_vcpu_reset(), > > but that doesn't get called on every entry, right? Or am I missing > > something? > > prepare_vmcs12() is called in response for a 2->1 vmexit which is first > trapped by 0, yes? Because it's called immediately after a vmexit, > VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the > processor. > > There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not > be cleared by hardware: > > 1. if we call prepare_vmcs12() between injection and entry. This cannot > happen AFAICT. > 2. if the vmexit was really a failed 1->2 vmentry, and if the processor > doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures > (need to check scripture) > > If neither of these are valid, the code can be removed. If only the > second, we might make it conditional. > > >> What can happen is that the contents of the field is transferred to the > >> IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. > >> > >> (question: on a failed vmentry, is this field cleared?) > > > > I don't know the answer :-) > > Sheng? According to SDM 23.7 "VM-ENTRY FAILURES DURING OR AFTER LOADING GUEST STATE": Although this process resembles that of a VM exit, many steps taken during a VM exit do not occur for these VM-entry failures: • Most VM-exit information fields are not updated (see step 1 above). • The valid bit in the VM-entry interruption-information field is *not* cleared. • The guest-state area is not modified. • No MSRs are saved into the VM-exit MSR-store area. So VM entry failure would result in _keep_ valid bit of VM_ENTRY_INTR_INFO_FIELD. -- regards Yang, Sheng -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/4] Add a new API to virtio-pci
"Michael S. Tsirkin" wrote on 09/12/2010 05:16:37 PM: > "Michael S. Tsirkin" > 09/12/2010 05:16 PM > > On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote: > > Unfortunately I need a > > constant in vhost for now. > > Maybe not even that: you create multiple vhost-net > devices so vhost-net in kernel does not care about these > either, right? So this can be just part of vhost_net.h > in qemu. Sorry, I didn't understand what you meant. I can remove all socks[] arrays/constants by pre-allocating sockets in vhost_setup_vqs. Then I can remove all "socks" parameters in vhost_net_stop, vhost_net_release and vhost_net_reset_owner. Does this make sense? Thanks, - KK -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
"Michael S. Tsirkin" wrote on 09/12/2010 05:10:25 PM: > > SINGLE vhost (Guest -> Host): > >1 netperf:BW: 10.7% SD: -1.4% > >4 netperfs: BW: 3%SD: 1.4% > >8 netperfs: BW: 17.7% SD: -10% > > 16 netperfs: BW: 4.7% SD: -7.0% > > 32 netperfs: BW: -6.1% SD: -5.7% > > BW and SD both improves (guest multiple txqs help). For 32 > > netperfs, SD improves. > > > > But with multiple vhosts, guest is able to send more packets > > and BW increases much more (SD too increases, but I think > > that is expected). > > Why is this expected? Results with the original kernel: _ # BW SD RSD __ 1 20903 1 6 2 21963 6 25 4 22042 23 102 8 21674 97 419 16 22281 379 1663 24 22521 857 3748 32 22976 15286594 40 23197 239010239 48 22973 354215074 64 23809 648627244 80 23564 10169 43118 96 22977 14954 62948 128 23649 27067 113892 With higher number of threads running in parallel, SD increased. In this case most threads run in parallel only till __dev_xmit_skb (#numtxqs=1). With mq TX patch, higher number of threads run in parallel through ndo_start_xmit. I *think* the increase in SD is to do with higher # of threads running for larger code path >From the numbers I posted with the patch (cut-n-paste only the % parts), BW increased much more than the SD, sometimes more than twice the increase in SD. N# BW% SD% RSD% 4 54.30 40.00-1.16 8 71.79 46.59-2.68 16 71.89 50.40-2.50 32 72.24 34.26-14.52 48 70.10 31.51-14.35 64 69.01 38.81-9.66 96 70.68 71.2610.74 I also think SD calculation gets skewed for guest->local host testing. For this test, I ran a guest with numtxqs=16. The first result below is with my patch, which creates 16 vhosts. The second result is with a modified patch which creates only 2 vhosts (testing with #netperfs = 64): #vhosts BW% SD%RSD% 16 20.79 186.01 149.74 230.89 34.55 18.44 The remote SD increases with the number of vhost threads, but that number seems to correlate with guest SD. So though BW% increased slightly from 20% to 30%, SD fell drastically from 186% to 34%. I think it could be a calculation skew with host SD, which also fell from 150% to 18%. I am planning to submit 2nd patch rev with restricted number of vhosts. > > Likely cause for the 1 stream degradation with multiple > > vhost patch: > > > > 1. Two vhosts run handling the RX and TX respectively. > >I think the issue is related to cache ping-pong esp > >since these run on different cpus/sockets. > > Right. With TCP I think we are better off handling > TX and RX for a socket by the same vhost, so that > packet and its ack are handled by the same thread. > Is this what happens with RX multiqueue patch? > How do we select an RX queue to put the packet on? My (unsubmitted) RX patch doesn't do this yet, that is something I will check. Thanks, - KK -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] S390: Add virtio hotplug add support
On Sun, 12 Sep 2010 06:30:43 pm Avi Kivity wrote: > On 09/12/2010 02:42 AM, Alexander Graf wrote: > > On 24.08.2010, at 15:48, Alexander Graf wrote: > > > >> The one big missing feature in s390-virtio was hotplugging. This is no > >> more. > >> This patch implements hotplug add support, so you can on the fly add new > >> devices > >> in the guest. > >> > >> Keep in mind that this needs a patch for qemu to actually leverage the > >> functionality. > >> > >> Signed-off-by: Alexander Graf > > ping (on the patch set)? > > > > Actually Marcelo applied it. But the natural place for it is Rusty's > virtio tree. Rusty, if you want to take it, let me know and I'll drop > it from kvm.git. I thought it would be in the s390 tree, which is why I didn't take it... But I'm *always* happy to let do the work! Thanks, Rusty. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM-test: Add a new macaddress pool algorithm v3
From: Amos Kong Old method uses the addresses in the config files which could lead serious problem when multiple tests running in different hosts. This patch adds a new macaddress pool algorithm, it generates the mac prefix based on mac address of the host, and fix it to correspond to IEEE802. When user have set the mac_prefix in the config file, we should use it instead of the dynamic generated mac prefix. Add a parameter like 'preserve_mac', to preserve the original mac address, for things like migration. MAC addresses are recorded into a dictionary 'address_pool' in following format: {{'20100310-165222-Wt7l:0' : 'AE:9D:94:6A:9b:f9'},...} 20100310-165222-Wt7l : instance attribute of VM 0: index of NIC AE:9D:94:6A:9b:f9: mac address Use 'vm instance' + 'nic index' as the key, macaddress is the value. Changes from v2: - Instead of basing ourselves in a physical interface address to get an address prefix, just generate one with the prefix 0x9a (convention) randomly and add it to the address pool. If there's already one prefix, keep it there and just return it. - Made messages more consistent and informative - Made function names consistent all across the board - Fixed some single line spacing between functions Changs from v1: - Use 'vm instance' + 'nic index' as the key of address_pool, address is value. - Put 'mac_lock' and 'address_pool' to '/tmp', for sharing them to other autotest instances running on the same host. - Change function names for less confusion. - Do not copy 'vm.instance' in vm.clone() - Split 'adding get_ifname function' to another patch Signed-off-by: Jason Wang Signed-off-by: Feng Yang Signed-off-by: Amos Kong --- client/tests/kvm/kvm_utils.py | 114 client/tests/kvm/kvm_vm.py | 83 ++-- client/tests/kvm/tests_base.cfg.sample |2 +- 3 files changed, 193 insertions(+), 6 deletions(-) diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py index fb2d1c2..bb5c868 100644 --- a/client/tests/kvm/kvm_utils.py +++ b/client/tests/kvm/kvm_utils.py @@ -5,6 +5,7 @@ KVM test utility functions. """ import time, string, random, socket, os, signal, re, logging, commands, cPickle +import fcntl, shelve from autotest_lib.client.bin import utils from autotest_lib.client.common_lib import error, logging_config import kvm_subprocess @@ -82,6 +83,119 @@ def get_sub_dict_names(dict, keyword): # Functions related to MAC/IP addresses +def _generate_mac_address_prefix(): +""" +Generate a MAC address prefix. By convention we will set KVM autotest +MAC addresses to start with 0x9a. +""" +l = [0x9a, random.randint(0x00, 0x7f), random.randint(0x00, 0x7f), + random.randint(0x00, 0xff)] +prefix = ':'.join(map(lambda x: "%02x" % x, l)) + ":" +return prefix + + +def generate_mac_address_prefix(): +""" +Generate a random MAC address prefix and add it to the MAC pool dictionary. +If there's a MAC prefix there already, do not update the MAC pool and just +return what's in there. +""" +lock_file = open("/tmp/mac_lock", 'w') +fcntl.lockf(lock_file.fileno() ,fcntl.LOCK_EX) +mac_pool = shelve.open("/tmp/address_pool", writeback=False) + +if mac_pool.get('prefix'): +prefix = mac_pool.get('prefix') +logging.debug('Retrieved previously generated MAC prefix for this ' + 'host: %s', prefix) +else: +prefix = _generate_mac_address_prefix() +mac_pool['prefix'] = prefix +logging.debug('Generated MAC address prefix for this host: %s', prefix) + +mac_pool.close() +fcntl.lockf(lock_file.fileno(), fcntl.LOCK_UN) +lock_file.close() + +return prefix + + +def generate_mac_address(root_dir, instance_vm, nic_index, prefix=None): +""" +Random generate a MAC address and add it to the MAC pool. + +Try to generate macaddress based on the mac address prefix, add it to a +dictionary 'address_pool'. +key = VM instance + nic index, value = mac address +{['20100310-165222-Wt7l:0'] : 'AE:9D:94:6A:9b:f9'} + +@param root_dir: Root dir for kvm. +@param instance_vm: Here we use instance of vm. +@param nic_index: The index of nic. +@param prefix: Prefix of MAC address. +@return: MAC address string. +""" +if prefix is None: +prefix = generate_mac_address_prefix() + +lock_file = open("/tmp/mac_lock", 'w') +fcntl.lockf(lock_file.fileno() ,fcntl.LOCK_EX) +mac_pool = shelve.open("/tmp/address_pool", writeback=False) +found = False +key = "%s:%s" % (instance_vm, nic_index) + +if mac_pool.get(key): +found = True +mac = mac_pool.get(key) + +while not found: +suffix = "%02x:%02x" % (random.randint(0x00,0xfe), +random.randint(0x00,0xfe)) +mac = prefix + suffix +mac_list = mac.split(":") +# Clear multic
Re: [PATCH 18/24] Exiting from L2 to L1
On Sun, Sep 12, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1": > >+vmcs12->vm_entry_intr_info_field = > >+vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); >... > prepare_vmcs12() is called in response for a 2->1 vmexit which is first > trapped by 0, yes? Because it's called immediately after a vmexit, > VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the > processor. Indeed - it is cleared in the real vmcs, i.e., vmcs02, but we also need to clear it in the emulated vmcs that L1 sees for L2, i.e., vmcs12. The original code (quoted above) just copied the vmcs02 value to vmcs12, which was (mostly) fine because the vmcs02 value has a correctly-cleared bit - but you asked to avoid the vmread. So the second option is to just explicitly remove the valid bit from vmcs12->vm_entry_intro_info_field, which I do now. vmcs12->vm_entry_intro_info_field was set by L1 before it entered L2, and now that L2 is exiting back to L1, we need to clear the valid bit. The more I think about it, the more I become convinced that the second option is indeed better than the first option (the original code in the patch). > There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not > be cleared by hardware: >... > If neither of these are valid, the code can be removed. If only the > second, we might make it conditional. Again, unless I'm misunderstanding what you mean, the hardware only modified vmcs02 (the hardware vmcs), not vmcs12. We need to modify vmcs12 as well, to remove the "valid" bit. If we don't, when L1 enters into the same L2 again, the same old value will be copied again from vmcs12 to vmcs02, and cause an injection of the same interrupt again. And by the way, I haven't said this enough, but thanks for your continued reviews and all your very useful corrections for these patches! Nadav. -- Nadav Har'El| Sunday, Sep 12 2010, 5 Tishri 5771 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |I before E except after C. We live in a http://nadav.harel.org.il |weird society! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/24] Exiting from L2 to L1
On 09/12/2010 07:05 PM, Nadav Har'El wrote: + vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); Without msr bitmaps, cannot change. I added a TODO before this (and a couple of others) for future optimization. I'm not even convinced how much quicker it is to check the MSR bitmap before doing vmcs_read64, vs just to going ahead and vmreading it in any case. IIRC we don't support msr bitmaps now, so no need to check anything. I do think we support msr bitmaps... E.g., we have nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a certain MSR access. Where don't we support them? (but I'm not denying the possiblity that this support still has holes or bugs). I was just talking from memory. If you do support them, that's great. Note that kvm itself doesn't support give the guest control of DEBUGCTLMSR, so you should just be able to read it from the shadow value (which strangely doesn't exist - I'll post a fix). In general, avoid vmcs reads as much as possible. Just think of your code running in a guest! Yes. On the other hand, I don't want to be sorry in the future when I want to support some feature, but because I wanted to shave off 1% of the L2->L1 switching time, and 0.01% of the total run time (and I'm just making numbers up...), I now need to find a dozen places where things need to change to support this feature. On the other hand, this will likely happen anyway ;-) Well, with msrs you have two cases: those which the guest controls and those which are shadowed. So all we need is a systematic way for dealing with the two types. + vmcs12->vm_entry_intr_info_field = + vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Autocleared, no need to read. Well, we need to clear the "valid bit" on exit, so we don't mistakenly inject the same interrupt twice. I don't think so. We write this field as part of guest entry (that is, after the decision re which vmcs to use, yes?), so guest entry will always follow writing this field. Since guest entry clears the field, reading it after an exit will necessarily return 0. Well, you obviously know the KVM code much better than I do, but from what I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM, this field only gets written on injection, not on every entry, so the code relies on the fact that the processor turns off the "valid" bit during exit, to avoid the same event being injected when the same field value is used for another entry. Correct. I can only find code which resets this field in vmx_vcpu_reset(), but that doesn't get called on every entry, right? Or am I missing something? prepare_vmcs12() is called in response for a 2->1 vmexit which is first trapped by 0, yes? Because it's called immediately after a vmexit, VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the processor. There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not be cleared by hardware: 1. if we call prepare_vmcs12() between injection and entry. This cannot happen AFAICT. 2. if the vmexit was really a failed 1->2 vmentry, and if the processor doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures (need to check scripture) If neither of these are valid, the code can be removed. If only the second, we might make it conditional. What can happen is that the contents of the field is transferred to the IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. (question: on a failed vmentry, is this field cleared?) I don't know the answer :-) Sheng? There were two ways to do it: 1. clear it ourselves, or 2. copy the value from vmcs02 where the processor already cleared it. There are pros and cons for each approach, but I'll change like you suggest, to clear it ourselves: vmcs12->vm_entry_intr_info_field&= ~INTR_INFO_VALID_MASK; That's really a temporary variable, I don't think you need to touch it. But we need to emulate the correct VMX behavior. According to the spec, the "valid" bit on this field needs to be cleared on vmexit, so we need to do it also on emulated exits from L2 to L1. If we're sure that we already cleared it earlier, then fine, but if not (and like I said, I failed to find this code), we need to do it now, on exit - either by explictly clearing the bit or by copying a value where the processor cleared this bit (arguably, the former is more correct emulation). Sorry, I misread it as vmx->idt_vectoring_info which is a temporary variable used to cache IDT_VECTORING_INFO. Ignore my remark. I didn't mean register independent helper; one function for cr0 and one function for cr4 so the reader can just see the name and pretend to understand what it does, instead of seeing a bunch of incomprehensible bitwise operations. Ok, done: /* * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK) * without L
Re: [PATCH 18/24] Exiting from L2 to L1
On Sun, Sep 12, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1": > Great. Hopefully you can commit some time to it or you'll spend a lot > of cycles just catching up. Right. I will. > Joerg just merged nested npt; as far as I can tell it is 100% in line > with nested ept, but please take a look as well. Indeed. Making nested EPT work based on the nested NPT work is one of the first things I plan to do after the basic nested VMX patches are accepted. As you know, we already have a version of nested EPT working in our testing code, but I'll need to tweak it a bit to use the common nested NPT code. > >>>+ vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); > >>> > >>Without msr bitmaps, cannot change. > >I added a TODO before this (and a couple of others) for future > >optimization. > >I'm not even convinced how much quicker it is to check the MSR bitmap > >before > >doing vmcs_read64, vs just to going ahead and vmreading it in any case. > > IIRC we don't support msr bitmaps now, so no need to check anything. I do think we support msr bitmaps... E.g., we have nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a certain MSR access. Where don't we support them? (but I'm not denying the possiblity that this support still has holes or bugs). > In general, avoid vmcs reads as much as possible. Just think of your > code running in a guest! Yes. On the other hand, I don't want to be sorry in the future when I want to support some feature, but because I wanted to shave off 1% of the L2->L1 switching time, and 0.01% of the total run time (and I'm just making numbers up...), I now need to find a dozen places where things need to change to support this feature. On the other hand, this will likely happen anyway ;-) > > >>>+vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER); > >>Can this change? >.. > If it changes in the future, it can only be under the influence of some > control or at least guest-discoverable capability. Since we don't > expose such control or capability, there's no need to copy it. You convinced me. Removed it. > >>>+ vmcs12->vm_entry_intr_info_field = > >>>+ vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); > >>> > >>Autocleared, no need to read. > >Well, we need to clear the "valid bit" on exit, so we don't mistakenly > >inject > >the same interrupt twice. > > I don't think so. We write this field as part of guest entry (that is, > after the decision re which vmcs to use, yes?), so guest entry will > always follow writing this field. Since guest entry clears the field, > reading it after an exit will necessarily return 0. Well, you obviously know the KVM code much better than I do, but from what I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM, this field only gets written on injection, not on every entry, so the code relies on the fact that the processor turns off the "valid" bit during exit, to avoid the same event being injected when the same field value is used for another entry. I can only find code which resets this field in vmx_vcpu_reset(), but that doesn't get called on every entry, right? Or am I missing something? > What can happen is that the contents of the field is transferred to the > IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. > > (question: on a failed vmentry, is this field cleared?) I don't know the answer :-) > >There were two ways to do it: 1. clear it ourselves, > >or 2. copy the value from vmcs02 where the processor already cleared it. > >There are pros and cons for each approach, but I'll change like you > >suggest, > >to clear it ourselves: > > > > vmcs12->vm_entry_intr_info_field&= ~INTR_INFO_VALID_MASK; > > That's really a temporary variable, I don't think you need to touch it. But we need to emulate the correct VMX behavior. According to the spec, the "valid" bit on this field needs to be cleared on vmexit, so we need to do it also on emulated exits from L2 to L1. If we're sure that we already cleared it earlier, then fine, but if not (and like I said, I failed to find this code), we need to do it now, on exit - either by explictly clearing the bit or by copying a value where the processor cleared this bit (arguably, the former is more correct emulation). > I didn't mean register independent helper; one function for cr0 and one > function for cr4 so the reader can just see the name and pretend to > understand what it does, instead of seeing a bunch of incomprehensible > bitwise operations. Ok, done: /* * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK) * without L0 trapping the change and updating vmcs12. * This function returns the value we should put in vmcs12.guest_cr0. It's not * enough to just return the current (vmcs02) GUEST_CR0. This may not be the * guest cr0 that L1 thought it was giving its L2 guest - it is
Re: [ANNOUNCE] qemu-kvm-0.13.0-rc1
On 09/12/2010 05:31 PM, Anthony Liguori wrote: On 09/12/2010 01:11 AM, Avi Kivity wrote: On 09/10/2010 10:48 PM, Anthony Liguori wrote: I agree, is there any reason not to enable compiling less into the binary? There are folks interested in eliminating as much as possible to reduce the attack surface and auditing requirements, for example. It's not a bad idea, it's just that what --disable-cpu-emulation does is evil. Being that I wrote the implementation, I'm quite confident in declare it as such :-) Oh, I thought you were against the idea in itself for some reason. I'll patch it for 0.13, but any ideas on how it should be rework for master? Glauber's old Accel interface was close to the right approach. We need to abstract the exec.c interfaces to use a function pointer table and have a TCG and KVM implementation. The function pointer tables can then be registered by a module_init() and we can simply not include the kvm or TCG files are build time to disable the functionality. Yes, I remember it now. Glauber, can you bring those patches back from the land of the dead? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] qemu-kvm-0.13.0-rc1
On 09/12/2010 01:11 AM, Avi Kivity wrote: On 09/10/2010 10:48 PM, Anthony Liguori wrote: I agree, is there any reason not to enable compiling less into the binary? There are folks interested in eliminating as much as possible to reduce the attack surface and auditing requirements, for example. It's not a bad idea, it's just that what --disable-cpu-emulation does is evil. Being that I wrote the implementation, I'm quite confident in declare it as such :-) Oh, I thought you were against the idea in itself for some reason. I'll patch it for 0.13, but any ideas on how it should be rework for master? Glauber's old Accel interface was close to the right approach. We need to abstract the exec.c interfaces to use a function pointer table and have a TCG and KVM implementation. The function pointer tables can then be registered by a module_init() and we can simply not include the kvm or TCG files are build time to disable the functionality. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vhost, something changed between 2.6.35 and 2.6.36 ?
On Sun, Sep 12, 2010 at 04:39:29PM +0200, Dhaval Giani wrote: > On Sun, Sep 12, 2010 at 2:05 PM, Michael S. Tsirkin wrote: > > On Fri, Sep 10, 2010 at 03:37:36PM +0200, Dhaval Giani wrote: > >> Hi, > >> > >> I have been trying to get vhost+macvtap to work for me. I run it as > >> > >> /root/qemu-kvm-vhost-net/bin/qemu-system-x86_64 -hda $IMAGE -serial > >> stdio -monitor telnet::,server,nowait -vnc :4: -m 3G -net > >> nic,model=virtio,macaddr=$MACADDR,netdev=macvtap0 -netdev > >> tap,id=macvtap0,vhost=on,fd=3 3<> /dev/tap5 > >> > >> in 2.6.35, which worked just fine. On the other hand, with 2.6.36, i > >> don't have working networking. I am using the same image and same > >> macaddress. The qemu is the version from > >> git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git vhost . > > > > BTW, by now, all these patches are merged so upstream qemu-kvm should work > > just fine for you as well. > > > >> Any suggestions will be welcome! > >> > >> Thanks, > >> Dhaval > > > > You are running this as non-root user, correct? > > nope as root. > > > This could be the permission issue that got fixed > > by 87d6a412bd1ed82c14cabd4b408003b23bbd2880. > > Could you please check the latest master from Linus, > > and let me and the list know? Thanks! > > > > this is with git of friday evening CEST. > > > Another thing to try if this does *not* help: > > > > enable CONFIG_DYNAMIC_DEBUG in kernel, > > rebuild the kernel, > > mount debugfs: > > > > mount -t debugfs none /sys/kernel/debug > > > > and then enable debug for vhost_net as described in > > Documentation/dynamic-debug-howto.txt: > > > > I will give this a run on monday morning when i am at the lab again. > > > echo 'module vhost_net +p' > /sys/kernel/debug/dynamic_debug/control > > > > Then start qemu, and after running a test, run dmesg and see if there > > are any messages from vhost_net. If yes please send them to > > me and to the list. > > > > Thanks! > > > > > thanks! > Dhaval Another thing to try check is generic net core issues. For this, try running tcpdump on both tap in host and on virtio net device in guest. Then send packets to host from guest and back, and check whether they appears on virtio and on tap. Thanks! -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] svm: Add test for selective cr0 intercept
On 09/12/2010 11:36 AM, Joerg Roedel wrote: On Sun, Sep 12, 2010 at 11:06:34AM +0200, Avi Kivity wrote: On 09/10/2010 06:34 PM, Joerg Roedel wrote: + /* +* If we are here the test failed, not sure what to do now because we +* are not in guest-mode anymore so we can't trigger an intercept. +* Trigger a tripple-fault for now. +*/ + printf("sel_cr0 test failed. Can not recover from this - exiting\n"); + exit(1); Don't understand - we're still in guest mode (only running very slowly...). All you have to do is fall off the end here, and you'll exit with VMMCALL. The bug I fixed was, that the guest continues to run in l1 mode with the l2 rip, rsp, and rax. So if the bug is there, it continues to run in this function, but with no chance to intercept anymore because the guest is not longer in emulated guest mode. I see. Of course it is right to test for specific regressions, not just random features. I agree that this is a test specific to that bug. I should probably add another test to check if the correct intercepts are reported. The more, the merrier! -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: Document that KVM_GET_SUPPORTED_CPUID may return emulated values
Signed-off-by: Avi Kivity --- Documentation/kvm/api.txt |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index 24d6341..b336266 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -1042,8 +1042,9 @@ number is just right, the 'nent' field is adjusted to the number of valid entries in the 'entries' array, which is then filled. The entries returned are the host cpuid as returned by the cpuid instruction, -with unknown or unsupported features masked out. The fields in each entry -are defined as follows: +with unknown or unsupported features masked out. Some features (for example, +x2apic), may not be present in the host cpu, but are exposed by kvm if it can +emulate them efficiently. The fields in each entry are defined as follows: function: the eax value used to obtain the entry index: the ecx value used to obtain the entry (for entries that are -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] svm: Run tests with NPT enabled if available
On Sun, Sep 12, 2010 at 11:10:50AM +0200, Avi Kivity wrote: > On 09/10/2010 06:34 PM, Joerg Roedel wrote: >> This patch adds code to setup a nested page table which is >> used for all tests. >> > >> + >> +printf("NPT detected - running all tests with NPT enabled\n"); >> + >> +/* >> + * Nested paging supported - Build a nested page table >> + * Build the page-table bottom-up and map everything with 2M pages >> + */ >> + >> +address = 0; >> + >> +/* PTE level */ > > Conflicts with previous comment - these aren't 2M pages. Oh right, I used 2M pages in the first version but figured out then that 2MB are not granular enough to write all the NPT tests. Thus I changed it to 4k pages and forgot to update the comment. I will resend this one. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vhost, something changed between 2.6.35 and 2.6.36 ?
On Sun, Sep 12, 2010 at 2:05 PM, Michael S. Tsirkin wrote: > On Fri, Sep 10, 2010 at 03:37:36PM +0200, Dhaval Giani wrote: >> Hi, >> >> I have been trying to get vhost+macvtap to work for me. I run it as >> >> /root/qemu-kvm-vhost-net/bin/qemu-system-x86_64 -hda $IMAGE -serial >> stdio -monitor telnet::,server,nowait -vnc :4: -m 3G -net >> nic,model=virtio,macaddr=$MACADDR,netdev=macvtap0 -netdev >> tap,id=macvtap0,vhost=on,fd=3 3<> /dev/tap5 >> >> in 2.6.35, which worked just fine. On the other hand, with 2.6.36, i >> don't have working networking. I am using the same image and same >> macaddress. The qemu is the version from >> git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git vhost . > > BTW, by now, all these patches are merged so upstream qemu-kvm should work > just fine for you as well. > >> Any suggestions will be welcome! >> >> Thanks, >> Dhaval > > You are running this as non-root user, correct? nope as root. > This could be the permission issue that got fixed > by 87d6a412bd1ed82c14cabd4b408003b23bbd2880. > Could you please check the latest master from Linus, > and let me and the list know? Thanks! > this is with git of friday evening CEST. > Another thing to try if this does *not* help: > > enable CONFIG_DYNAMIC_DEBUG in kernel, > rebuild the kernel, > mount debugfs: > > mount -t debugfs none /sys/kernel/debug > > and then enable debug for vhost_net as described in > Documentation/dynamic-debug-howto.txt: > I will give this a run on monday morning when i am at the lab again. > echo 'module vhost_net +p' > /sys/kernel/debug/dynamic_debug/control > > Then start qemu, and after running a test, run dmesg and see if there > are any messages from vhost_net. If yes please send them to > me and to the list. > > Thanks! > thanks! Dhaval -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 04:30 PM, Joerg Roedel wrote: Either way, I don't think we need a phenom2 type. The features additional are minor enough to not really matter and all use cases I can come up with require either -cpu host (local virt) or -cpu phenom (migration). I'm fine with this (or with adding phenom2). But don't make phenom contain flags that real phenoms don't have. How about features that are not supported by the hardware but can be supported in emulation? The VMCBCLEAN feature is one of those which makes a lot of sense to reduce the emulated world-switch times. I guess its ok to enable those with -cpu host? It's a good question. We do that with x2apic, so yes. Userspace can do four things with KVM_GET_SUPPORTED_CPUID: - feed it right back to KVM_SET_CPUID2, getting maximum features and hopefully performance - use it to mask the real cpuid which it fetches directly, getting something as similar as possible to the host - use it to mask a predefined cpu type, getting something as similar as possible to that cpu type - use it to verify that a predefined cpu type is supported, getting somthing exactly the same as that cpu type -cpu host is the first option. If the need comes for the second option, we can provide it. I'll note that KVM_GET_SUPPORTED_CPUID can return values not present in the host cpuid in the documentation. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On Sun, Sep 12, 2010 at 10:01:28AM +0200, Avi Kivity wrote: > On 09/12/2010 09:16 AM, Alexander Graf wrote: >> Either way, I don't think we need a phenom2 type. The features >> additional are minor enough to not really matter and all use cases I >> can come up with require either -cpu host (local virt) or -cpu phenom >> (migration). > > I'm fine with this (or with adding phenom2). But don't make phenom > contain flags that real phenoms don't have. How about features that are not supported by the hardware but can be supported in emulation? The VMCBCLEAN feature is one of those which makes a lot of sense to reduce the emulated world-switch times. I guess its ok to enable those with -cpu host? Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/24] Exiting from L2 to L1
On 09/12/2010 04:05 PM, Nadav Har'El wrote: Hi, Continuing to work on the nested VMX patches, Great. Hopefully you can commit some time to it or you'll spend a lot of cycles just catching up. Joerg just merged nested npt; as far as I can tell it is 100% in line with nested ept, but please take a look as well. + vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); Without msr bitmaps, cannot change. I added a TODO before this (and a couple of others) for future optimization. I'm not even convinced how much quicker it is to check the MSR bitmap before doing vmcs_read64, vs just to going ahead and vmreading it in any case. IIRC we don't support msr bitmaps now, so no need to check anything. In general, avoid vmcs reads as much as possible. Just think of your code running in a guest! +vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER); Can this change? Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state field, but it is listed as being for "future expansion". I guess that with current hardware, it cannot change, but for future hardware it might. I'm not sure if it's wiser to ignore this field for now (and shave a bit off the l2->l1 switch time), or just copy it anyway, as I do now. What would you prefer? If it changes in the future, it can only be under the influence of some control or at least guest-discoverable capability. Since we don't expose such control or capability, there's no need to copy it. + vmcs12->vm_entry_intr_info_field = + vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Autocleared, no need to read. Well, we need to clear the "valid bit" on exit, so we don't mistakenly inject the same interrupt twice. I don't think so. We write this field as part of guest entry (that is, after the decision re which vmcs to use, yes?), so guest entry will always follow writing this field. Since guest entry clears the field, reading it after an exit will necessarily return 0. What can happen is that the contents of the field is transferred to the IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. (question: on a failed vmentry, is this field cleared?) There were two ways to do it: 1. clear it ourselves, or 2. copy the value from vmcs02 where the processor already cleared it. There are pros and cons for each approach, but I'll change like you suggest, to clear it ourselves: vmcs12->vm_entry_intr_info_field&= ~INTR_INFO_VALID_MASK; That's really a temporary variable, I don't think you need to touch it. + vmcs12->vm_exit_reason = vmcs_read32(VM_EXIT_REASON); + vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE); + vmcs12->idt_vectoring_info_field = + vmcs_read32(IDT_VECTORING_INFO_FIELD); + vmcs12->idt_vectoring_error_code = + vmcs_read32(IDT_VECTORING_ERROR_CODE); + vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); For the above, if the host handles the exit, we must not clobber guest fields. A subsequent guest vmread will see the changed values even though from its point of view a vmexit has not occurred. But no, that can't happen, since a vmread needs to have a vmexit first to happen. Still, best to delay this.+ All this code is in prepare_vmcs_12, which only gets called if we exit from L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host handles the exit). Ok. That answers my concern. As long as a certain field gets written to on *every* exit, and not just on some of them, I believe we can safely copy the current values in vmcs02 to vmcs12, knowing these are current values from the current exit, and not some old values we shouldn't be copying. You may have a point (if that was your point?) that some fields are not always written to - e.g., for most exits vm_exit_intr_info doesn't get written to and just one "valid" bit is cleared. As the code is now, we copy vmcs02's field, which might have been written earlier (e.g., during an exit to L0) and not now, and an observant L1 might notice this value it should not have seen. However, I don't see any problem with that, because the "valid" bit would be correctly turned off, and the spec says that all other bits are undefined (for irrelevant exits) and copying-old-values is one legal setting for undefined bits... No, I wasn't worried about that, simply misunderstood the code. + /* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may +* have changed some cr0 bits without us ever saving them in the shadow +* vmcs. So we need to save these changes now. ... + + vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4); Can't we have the same issue with cr4? I guess we can. I didn't think this (giving guest full control over cr4 bits) was
Re: [PATCH 18/24] Exiting from L2 to L1
Hi, Continuing to work on the nested VMX patches, On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1": > On 06/13/2010 03:31 PM, Nadav Har'El wrote: >... > >+/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to > >+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this > >+ * function updates it to reflect the state of the registers during the > >exit, >... > >+vmcs12->tsc_offset = vmcs_read64(TSC_OFFSET); > > > > TSC_OFFSET cannot have changed. Right. I cleaned up this function now, to only copy the fields that could have changed, namely fields listed as guest-state or exit-information fields in the spec. Control fields like this TSC_OFFSET and more examples you found below, indeed could not have changed while L2 was running or during the exit. > >+vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); > > > Without msr bitmaps, cannot change. I added a TODO before this (and a couple of others) for future optimization. I'm not even convinced how much quicker it is to check the MSR bitmap before doing vmcs_read64, vs just to going ahead and vmreading it in any case. > >+vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER); > > Can this change? Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state field, but it is listed as being for "future expansion". I guess that with current hardware, it cannot change, but for future hardware it might. I'm not sure if it's wiser to ignore this field for now (and shave a bit off the l2->l1 switch time), or just copy it anyway, as I do now. What would you prefer? > >+if (vmcs_config.vmentry_ctrl& VM_ENTRY_LOAD_IA32_PAT) > >+vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT); > > > > Should check for VM_EXIT_SAVE_IA32_PAT, no? You're absolutely right. Fixed. > >+vmcs12->vm_entry_intr_info_field = > >+vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); > > > > Autocleared, no need to read. Well, we need to clear the "valid bit" on exit, so we don't mistakenly inject the same interrupt twice. There were two ways to do it: 1. clear it ourselves, or 2. copy the value from vmcs02 where the processor already cleared it. There are pros and cons for each approach, but I'll change like you suggest, to clear it ourselves: vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK; > >+vmcs12->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR); > > We don't want to pass this to the guest? I didn't quite understand your question, but now that I look at it, I see that VM_INSTRUCTION_ERROR has nothing to do with exits, but is only modified when running VMX instructions, and our emulation of VMX instructions already sets it appropriately, so no sense of copying it here. > > >+vmcs12->vm_exit_reason = vmcs_read32(VM_EXIT_REASON); > >+vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); > >+vmcs12->vm_exit_intr_error_code = > >vmcs_read32(VM_EXIT_INTR_ERROR_CODE); > >+vmcs12->idt_vectoring_info_field = > >+vmcs_read32(IDT_VECTORING_INFO_FIELD); > >+vmcs12->idt_vectoring_error_code = > >+vmcs_read32(IDT_VECTORING_ERROR_CODE); > >+vmcs12->vm_exit_instruction_len = > >vmcs_read32(VM_EXIT_INSTRUCTION_LEN); > >+vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); > > > > For the above, if the host handles the exit, we must not clobber guest > fields. A subsequent guest vmread will see the changed values even > though from its point of view a vmexit has not occurred. > > But no, that can't happen, since a vmread needs to have a vmexit first > to happen. Still, best to delay this.+ All this code is in prepare_vmcs_12, which only gets called if we exit from L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host handles the exit). As long as a certain field gets written to on *every* exit, and not just on some of them, I believe we can safely copy the current values in vmcs02 to vmcs12, knowing these are current values from the current exit, and not some old values we shouldn't be copying. You may have a point (if that was your point?) that some fields are not always written to - e.g., for most exits vm_exit_intr_info doesn't get written to and just one "valid" bit is cleared. As the code is now, we copy vmcs02's field, which might have been written earlier (e.g., during an exit to L0) and not now, and an observant L1 might notice this value it should not have seen. However, I don't see any problem with that, because the "valid" bit would be correctly turned off, and the spec says that all other bits are undefined (for irrelevant exits) and copying-old-values is one legal setting for undefined bits... > >+/* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may > >+ * have changed some cr0 bits without us ever saving them in the > >shadow > >+ * vmcs. So we need to save t
Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote: > >>Playing with rlimit on data path, transparently to the application in this > >>way > >>looks strange to me, I suspect this has unexpected security implications. > >>Further, applications may have other uses for locked memory > >>besides mpassthru - you should not just take it because it's there. > >> > >>Can we have an ioctl that lets userspace configure how much > >>memory to lock? This ioctl will decrement the rlimit and store > >>the data in the device structure so we can do accounting > >>internally. Put it back on close or on another ioctl. > >Yes, we can decrement the rlimit in ioctl in one time to avoid > >data path. > > > >>Need to be careful for when this operation gets called > >>again with 0 or another small value while we have locked memory - > >>maybe just fail with EBUSY? or wait until it gets unlocked? > >>Maybe 0 can be special-cased and deactivate zero-copy?. > >> > > How about we don't use a new ioctl, but just check the rlimit > in one MPASSTHRU_BINDDEV ioctl? If we find mp device > break the rlimit, then we fail the bind ioctl, and thus can't do > zero copy any more. Yes, and not just check, but decrement as well. I think we should give userspace control over how much memory we can lock and subtract from the rlimit. It's OK to add this as a parameter to MPASSTHRU_BINDDEV. Then increment the rlimit back on unbind and on close? This opens up an interesting condition: process 1 calls bind, process 2 calls unbind or close. This will increment rlimit for process 2. Not sure how to fix this properly. -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vhost, something changed between 2.6.35 and 2.6.36 ?
On Fri, Sep 10, 2010 at 03:37:36PM +0200, Dhaval Giani wrote: > Hi, > > I have been trying to get vhost+macvtap to work for me. I run it as > > /root/qemu-kvm-vhost-net/bin/qemu-system-x86_64 -hda $IMAGE -serial > stdio -monitor telnet::,server,nowait -vnc :4: -m 3G -net > nic,model=virtio,macaddr=$MACADDR,netdev=macvtap0 -netdev > tap,id=macvtap0,vhost=on,fd=3 3<> /dev/tap5 > > in 2.6.35, which worked just fine. On the other hand, with 2.6.36, i > don't have working networking. I am using the same image and same > macaddress. The qemu is the version from > git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git vhost . BTW, by now, all these patches are merged so upstream qemu-kvm should work just fine for you as well. > Any suggestions will be welcome! > > Thanks, > Dhaval You are running this as non-root user, correct? This could be the permission issue that got fixed by 87d6a412bd1ed82c14cabd4b408003b23bbd2880. Could you please check the latest master from Linus, and let me and the list know? Thanks! Another thing to try if this does *not* help: enable CONFIG_DYNAMIC_DEBUG in kernel, rebuild the kernel, mount debugfs: mount -t debugfs none /sys/kernel/debug and then enable debug for vhost_net as described in Documentation/dynamic-debug-howto.txt: echo 'module vhost_net +p' > /sys/kernel/debug/dynamic_debug/control Then start qemu, and after running a test, run dmesg and see if there are any messages from vhost_net. If yes please send them to me and to the list. Thanks! -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 01:14 PM, Alexander Graf wrote: Uh, something in the power management module. Just give it a try on an Istambul system. Just use the 11.1 iso - the kernels are close enough. Oh, probably some silly msr. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/4] Add a new API to virtio-pci
On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote: > Unfortunately I need a > constant in vhost for now. Maybe not even that: you create multiple vhost-net devices so vhost-net in kernel does not care about these either, right? So this can be just part of vhost_net.h in qemu. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
On Thu, Sep 09, 2010 at 03:15:53PM +0530, Krishna Kumar2 wrote: > > Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM: > > Some more results and likely cause for single netperf > degradation below. > > > > Guest -> Host (single netperf): > > I am getting a drop of almost 20%. I am trying to figure out > > why. > > > > Host -> guest (single netperf): > > I am getting an improvement of almost 15%. Again - unexpected. > > > > Guest -> Host TCP_RR: I get an average 7.4% increase in #packets > > for runs upto 128 sessions. With fewer netperf (under 8), there > > was a drop of 3-7% in #packets, but beyond that, the #packets > > improved significantly to give an average improvement of 7.4%. > > > > So it seems that fewer sessions is having negative effect for > > some reason on the tx side. The code path in virtio-net has not > > changed much, so the drop in some cases is quite unexpected. > > The drop for the single netperf seems to be due to multiple vhost. > I changed the patch to start *single* vhost: > > Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45% > Guest -> Host (1 netperf) : Latency: -3%, SD: 3.5% > > Single vhost performs well but hits the barrier at 16 netperf > sessions: > > SINGLE vhost (Guest -> Host): > 1 netperf:BW: 10.7% SD: -1.4% > 4 netperfs: BW: 3%SD: 1.4% > 8 netperfs: BW: 17.7% SD: -10% > 16 netperfs: BW: 4.7% SD: -7.0% > 32 netperfs: BW: -6.1% SD: -5.7% > BW and SD both improves (guest multiple txqs help). For 32 > netperfs, SD improves. > > But with multiple vhosts, guest is able to send more packets > and BW increases much more (SD too increases, but I think > that is expected). Why is this expected? > From the earlier results: > > N# BW1 BW2(%) SD1 SD2(%) RSD1RSD2(%) > ___ > 4 26387 40716 (54.30) 20 28 (40.00)86 85 > (-1.16) > 8 24356 41843 (71.79) 88 129 (46.59)372 362 > (-2.68) > 16 23587 40546 (71.89) 375 564 (50.40)15581519 > (-2.50) > 32 22927 39490 (72.24) 16172171 (34.26)66945722 > (-14.52) > 48 23067 39238 (70.10) 39315170 (31.51)15823 13552 > (-14.35) > 64 22927 38750 (69.01) 71429914 (38.81)28972 26173 > (-9.66) > 96 22568 38520 (70.68) 16258 27844 (71.26) 65944 73031 > (10.74) > ___ > (All tests were done without any tuning) > > >From my testing: > > 1. Single vhost improves mq guest performance upto 16 >netperfs but degrades after that. > 2. Multiple vhost degrades single netperf guest >performance, but significantly improves performance >for any number of netperf sessions. > > Likely cause for the 1 stream degradation with multiple > vhost patch: > > 1. Two vhosts run handling the RX and TX respectively. >I think the issue is related to cache ping-pong esp >since these run on different cpus/sockets. Right. With TCP I think we are better off handling TX and RX for a socket by the same vhost, so that packet and its ack are handled by the same thread. Is this what happens with RX multiqueue patch? How do we select an RX queue to put the packet on? > 2. I (re-)modified the patch to share RX with TX[0]. The >performance drop is the same, but the reason is the >guest is not using txq[0] in most cases (dev_pick_tx), >so vhost's rx and tx are running on different threads. >But whenever the guest uses txq[0], only one vhost >runs and the performance is similar to original. > > I went back to my *submitted* patch and started a guest > with numtxq=16 and pinned every vhost to cpus #0&1. Now > whether guest used txq[0] or txq[n], the performance is > similar or better (between 10-27% across 10 runs) than > original code. Also, -6% to -24% improvement in SD. > > I will start a full test run of original vs submitted > code with minimal tuning (Avi also suggested the same), > and re-send. Please let me know if you need any other > data. > > Thanks, > > - KK -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
Am 12.09.2010 um 12:22 schrieb Avi Kivity : > On 09/12/2010 12:06 PM, Alexander Graf wrote: For the Phenom type, I honestly don't remember why, but there was also a good reason to add it. In fact, I use it today to have nested virt without -cpu host on hardware that's too new for my guests. >>> Curious, what guests balk at modern hardware but are fine with phenom? >> Sles11 GA ;). > > Still curious, how does -cpu host break it? Uh, something in the power management module. Just give it a try on an Istambul system. Just use the 11.1 iso - the kernels are close enough. Alex > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 12:06 PM, Alexander Graf wrote: For the Phenom type, I honestly don't remember why, but there was also a good reason to add it. In fact, I use it today to have nested virt without -cpu host on hardware that's too new for my guests. Curious, what guests balk at modern hardware but are fine with phenom? Sles11 GA ;). Still curious, how does -cpu host break it? Either way, I don't think we need a phenom2 type. The features additional are minor enough to not really matter and all use cases I can come up with require either -cpu host (local virt) or -cpu phenom (migration). I'm fine with this (or with adding phenom2). But don't make phenom contain flags that real phenoms don't have. Those were my words :). Then we are in agreement. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
Am 12.09.2010 um 10:01 schrieb Avi Kivity : > On 09/12/2010 09:16 AM, Alexander Graf wrote: Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old Phenoms don't have it. For the SVM features it is not that important what the host hardware supports but what KVM can emulate. VMCBCLEAN can be emulated without supporting it in the host for example. >>> Well, let's have a phenom2 type for those new features (and any other >>> features the phenom 2 has). What's the point of using the name of existing >>> hardware if it doesn't match that hardware? >> Isn't that what cpu=host is there for? I don't want to see cpu type >> cluttering like we have it on ppc. I added the core2duo type for Mac OS X >> guests for which those are basically the oldest supported CPUs. > > -cpu host is to all supported features into your guest. > -cpu phenom is to pretend you are running on a phenom cpu. This is useful > for a migration farm for which the greatest common denominator is a phenom. > > Those are separate use cases. Exactly. And the benefit from phenom -> phenom2 is so minor that it doesn't make sense. > >> For the Phenom type, I honestly don't remember why, but there was also a >> good reason to add it. In fact, I use it today to have nested virt without >> -cpu host on hardware that's too new for my guests. > > Curious, what guests balk at modern hardware but are fine with phenom? Sles11 GA ;). > >> Either way, I don't think we need a phenom2 type. The features additional >> are minor enough to not really matter and all use cases I can come up with >> require either -cpu host (local virt) or -cpu phenom (migration). > > I'm fine with this (or with adding phenom2). But don't make phenom contain > flags that real phenoms don't have. Those were my words :). Alex > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] svm: Add test for selective cr0 intercept
On Sun, Sep 12, 2010 at 11:06:34AM +0200, Avi Kivity wrote: > On 09/10/2010 06:34 PM, Joerg Roedel wrote: >> +/* >> + * If we are here the test failed, not sure what to do now because we >> + * are not in guest-mode anymore so we can't trigger an intercept. >> + * Trigger a tripple-fault for now. >> + */ >> +printf("sel_cr0 test failed. Can not recover from this - exiting\n"); >> +exit(1); > > Don't understand - we're still in guest mode (only running very > slowly...). All you have to do is fall off the end here, and you'll > exit with VMMCALL. The bug I fixed was, that the guest continues to run in l1 mode with the l2 rip, rsp, and rax. So if the bug is there, it continues to run in this function, but with no chance to intercept anymore because the guest is not longer in emulated guest mode. I agree that this is a test specific to that bug. I should probably add another test to check if the correct intercepts are reported. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: high load with usb device
On 09/10/2010 07:17 PM, Michael Tokarev wrote: Note the changed subject line. I just did a few tests with linux guest (amd64 2.6.35 kernel). And it shows the same behavour as win7 (unlike winXP), namely, high host CPU load when guest is idle. Not for me - F12 idle guest takes 3.5% cpu. What does 'top' in the guest show now? Connect with ssh to avoid triggering the GUI. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] svm: Run tests with NPT enabled if available
On 09/10/2010 06:34 PM, Joerg Roedel wrote: This patch adds code to setup a nested page table which is used for all tests. + +printf("NPT detected - running all tests with NPT enabled\n"); + +/* + * Nested paging supported - Build a nested page table + * Build the page-table bottom-up and map everything with 2M pages + */ + +address = 0; + +/* PTE level */ Conflicts with previous comment - these aren't 2M pages. +for (i = 0; i< 2048; ++i) { +page = alloc_page(); + +for (j = 0; j< 512; ++j, address += 4096) +page[j] = address | 0x067ULL; { } static void vmcb_set_seg(struct vmcb_seg *seg, u16 selector, @@ -56,6 +110,11 @@ static void vmcb_ident(struct vmcb *vmcb) save->g_pat = rdmsr(MSR_IA32_CR_PAT); save->dbgctl = rdmsr(MSR_IA32_DEBUGCTLMSR); ctrl->intercept = (1ULL<< INTERCEPT_VMRUN) | (1ULL<< INTERCEPT_VMMCALL); + +if (npt_supported()) { +ctrl->nested_ctl = 1; +ctrl->nested_cr3 = (u64)pml4e; +} } struct test { -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] svm: Add test for selective cr0 intercept
On 09/10/2010 06:34 PM, Joerg Roedel wrote: This patch adds a test to check if the selective cr0 intercept emulation of the kvm svm emulation works. Signed-off-by: Joerg Roedel --- x86/svm.c | 37 - 1 files changed, 36 insertions(+), 1 deletions(-) diff --git a/x86/svm.c b/x86/svm.c index 2f1c900..e65360e 100644 --- a/x86/svm.c +++ b/x86/svm.c @@ -357,6 +357,40 @@ static bool check_asid_zero(struct test *test) return test->vmcb->control.exit_code == SVM_EXIT_ERR; } +static void sel_cr0_prepare(struct test *test) +{ +vmcb_ident(test->vmcb); +test->vmcb->control.intercept |= (1ULL<< INTERCEPT_SELECTIVE_CR0); +} + +static bool sel_cr0_finished(struct test *test) +{ + return true; +} Coding style - kvm-unit-tests uses the qemu style. Please drop the tabs from new code. + +static void sel_cr0_test(struct test *test) +{ + unsigned long cr0; + + /* read cr0, clear CD, and write back */ set CD. Better to ^= it to be sure to trigger. + cr0 = read_cr0(); + cr0 |= (1UL<< 30); + write_cr0(cr0); How about a test that ^= TS to see that we don't intercept unnecessarily? + + /* +* If we are here the test failed, not sure what to do now because we +* are not in guest-mode anymore so we can't trigger an intercept. +* Trigger a tripple-fault for now. +*/ + printf("sel_cr0 test failed. Can not recover from this - exiting\n"); + exit(1); Don't understand - we're still in guest mode (only running very slowly...). All you have to do is fall off the end here, and you'll exit with VMMCALL. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] S390: Add virtio hotplug add support
On 09/12/2010 02:42 AM, Alexander Graf wrote: On 24.08.2010, at 15:48, Alexander Graf wrote: The one big missing feature in s390-virtio was hotplugging. This is no more. This patch implements hotplug add support, so you can on the fly add new devices in the guest. Keep in mind that this needs a patch for qemu to actually leverage the functionality. Signed-off-by: Alexander Graf ping (on the patch set)? Actually Marcelo applied it. But the natural place for it is Rusty's virtio tree. Rusty, if you want to take it, let me know and I'll drop it from kvm.git. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/29] Nested Paging Virtualization for KVM v4
On 09/10/2010 06:30 PM, Joerg Roedel wrote: Hi Avi, Marcelo, as promised here is -v4 of my patch-set for virtualizing nested paging in KVM. I addresses all of your review comments in this version and fixed a misbehavior in the nested page-walker code where it would have reported the wrong error-code on an emulated nested page fault. Here is the complete list of changes: * Fixed the bug in the gpa_to_gfn function * Made sure that the right fault values are kept in the two-dimensional page-table walker * Fixed the return code of x86_decode_insn so that a page-fault within that function can be handled * Set vcpu->arch.mmu.direct_map to true for nonpaging mode too * Changed code so that the right access-mode is used on nested page-table walks * Made the NX mode a capability of the MMU context to distinguish between l1-nx and l2-nx (1 additional patch) * Fixed the bug that KVM always reports the SVM flag as supported to userspace, it should only be reported on AMD hardware when nesting is enabled As the patch-set before this one was tested with the same set of combinations too. I found no regressions to current avi/master. This patch-set applies on current avi/master plus the three fixes I sent last week. As with the last version of the patch-set I also pushed this one to a tree on korg. Find it in git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux-2.6-kvm.git npt-virt-v4 Please review these patches and/or apply them :-) As usual I appreciate your feedback. All applied, thanks. I'll go play with my new toy now. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/29] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On 09/10/2010 06:30 PM, Joerg Roedel wrote: This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest. @@ -338,6 +338,22 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu) kvm_queue_exception_e(vcpu, PF_VECTOR, error_code); } +void kvm_propagate_fault(struct kvm_vcpu *vcpu) +{ + u32 nested, error; + + error = vcpu->arch.fault.error_code; + nested = error& PFERR_NESTED_MASK; + error = error& ~PFERR_NESTED_MASK; + + vcpu->arch.fault.error_code = error; + + if (mmu_is_nested(vcpu)&& !nested) + vcpu->arch.nested_mmu.inject_page_fault(vcpu); + else + vcpu->arch.mmu.inject_page_fault(vcpu); +} + Tacking a non-architectural bit on top of the error code is not very nice. Can you move it to a separate vcpu->arch.fault.nested? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 24/29] KVM: MMU: Track NX state in struct kvm_mmu
On 09/10/2010 06:31 PM, Joerg Roedel wrote: With Nested Paging emulation the NX state between the two MMU contexts may differ. To make sure that always the right fault error code is recorded this patch moves the NX state into struct kvm_mmu so that the code can distinguish between L1 and L2 NX state. --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -259,6 +259,8 @@ struct kvm_mmu { u64 *lm_root; u64 rsvd_bits_mask[2][4]; + bool nx; + u64 pdptrs[4]; /* pae */ }; This is redundant with mmu->base_role.nx. Please post a follow on patch to use that instead. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 09:16 AM, Alexander Graf wrote: Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old Phenoms don't have it. For the SVM features it is not that important what the host hardware supports but what KVM can emulate. VMCBCLEAN can be emulated without supporting it in the host for example. Well, let's have a phenom2 type for those new features (and any other features the phenom 2 has). What's the point of using the name of existing hardware if it doesn't match that hardware? Isn't that what cpu=host is there for? I don't want to see cpu type cluttering like we have it on ppc. I added the core2duo type for Mac OS X guests for which those are basically the oldest supported CPUs. -cpu host is to all supported features into your guest. -cpu phenom is to pretend you are running on a phenom cpu. This is useful for a migration farm for which the greatest common denominator is a phenom. Those are separate use cases. For the Phenom type, I honestly don't remember why, but there was also a good reason to add it. In fact, I use it today to have nested virt without -cpu host on hardware that's too new for my guests. Curious, what guests balk at modern hardware but are fine with phenom? Either way, I don't think we need a phenom2 type. The features additional are minor enough to not really matter and all use cases I can come up with require either -cpu host (local virt) or -cpu phenom (migration). I'm fine with this (or with adding phenom2). But don't make phenom contain flags that real phenoms don't have. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Don't launch guest if -no-kvm when tcg is not configured in
Signed-off-by: Avi Kivity --- vl.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/vl.c b/vl.c index 2a32cc5..22a3616 100644 --- a/vl.c +++ b/vl.c @@ -2473,6 +2473,10 @@ int main(int argc, char **argv, char **envp) break; case QEMU_OPTION_no_kvm: kvm_allowed = 0; +#ifdef CONFIG_NO_CPU_EMULATION +fprintf(stderr, "cpu emulation not configured\n"); +exit(1); +#endif break; #ifdef CONFIG_KVM case QEMU_OPTION_no_kvm_irqchip: { -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Add missing tcg_prologue_init() for --disable-cpu-emulation build
Add missing tcg_prologue_init() and tcg_ctx, remove code_gen_max_block_size. Fixes ./configure --disable-cpu-emulation. Signed-off-by: Avi Kivity --- target-i386/fake-exec.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/target-i386/fake-exec.c b/target-i386/fake-exec.c index dfa202d..e6f8363 100644 --- a/target-i386/fake-exec.c +++ b/target-i386/fake-exec.c @@ -12,22 +12,20 @@ */ #include "exec.h" #include "cpu.h" +#include "tcg.h" int code_copy_enabled = 0; CCTable cc_table[CC_OP_NB]; +TCGContext tcg_ctx; + void cpu_dump_statistics (CPUState *env, FILE*f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...), int flags) { } -unsigned long code_gen_max_block_size(void) -{ -return 32; -} - void cpu_gen_init(void) { } @@ -48,3 +46,7 @@ int cpu_x86_gen_code(CPUState *env, TranslationBlock *tb, int *gen_code_size_ptr void optimize_flags_init(void) { } + +void tcg_prologue_init(TCGContext *ctx) +{ +} -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 12.09.2010, at 08:05, Avi Kivity wrote: > On 09/11/2010 05:20 PM, Joerg Roedel wrote: >> On Sat, Sep 11, 2010 at 03:43:02PM +0200, Alexander Graf wrote: @@ -305,6 +322,8 @@ static x86_def_t builtin_x86_defs[] = { CPUID_EXT3_OSVW, CPUID_EXT3_IBS */ .ext3_features = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM | CPUID_EXT3_ABM | CPUID_EXT3_SSE4A, +.svm_features = CPUID_SVM_NPT | CPUID_SVM_LBRV | CPUID_SVM_NRIPSAVE | +CPUID_SVM_VMCBCLEAN, >>> Does that phenom already do all those? It does NPT, but I'm not sure >>> about NRIPSAVE for example. >> Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old >> Phenoms don't have it. For the SVM features it is not that important >> what the host hardware supports but what KVM can emulate. VMCBCLEAN can >> be emulated without supporting it in the host for example. > > Well, let's have a phenom2 type for those new features (and any other > features the phenom 2 has). What's the point of using the name of existing > hardware if it doesn't match that hardware? Isn't that what cpu=host is there for? I don't want to see cpu type cluttering like we have it on ppc. I added the core2duo type for Mac OS X guests for which those are basically the oldest supported CPUs. For the Phenom type, I honestly don't remember why, but there was also a good reason to add it. In fact, I use it today to have nested virt without -cpu host on hardware that's too new for my guests. Either way, I don't think we need a phenom2 type. The features additional are minor enough to not really matter and all use cases I can come up with require either -cpu host (local virt) or -cpu phenom (migration). Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html