Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin <m...@redhat.com> wrote:
> > > I think we discussed the need for external to guest testing over 10G.
> > > For large messages we should not see any change but you should be able
> > > to get better numbers for small messages assuming a MQ NIC card.
> >
> > For external host, there is a contention among different queues (vhosts)
> > when packets are processed in tun/bridge, unless I implement MQ TX for
> > macvtap (tun/bridge?). So my testing shows a small improvement (1 to 1.5%
> > average) in BW and a rise in SD (between 10-15%). For remote host, I
> > think tun/macvtap needs MQ TX support?
>
> Confused. I thought this *is* with a multiqueue tun/macvtap? bridge does
> not do any queueing AFAIK ... I think we need to fix the contention. With
> migration what was guest to host a minute ago might become guest to
> external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is required for macvtap, though. Is it enough for existing macvtap sendmsg to work, since it calls dev_queue_xmit which selects the txq for the outgoing device?

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
> Michael S. Tsirkin <m...@redhat.com> wrote:
> > > > I think we discussed the need for external to guest testing over 10G.
> > > > For large messages we should not see any change but you should be
> > > > able to get better numbers for small messages assuming a MQ NIC card.
> > >
> > > For external host, there is a contention among different queues
> > > (vhosts) when packets are processed in tun/bridge, unless I implement
> > > MQ TX for macvtap (tun/bridge?). So my testing shows a small
> > > improvement (1 to 1.5% average) in BW and a rise in SD (between
> > > 10-15%). For remote host, I think tun/macvtap needs MQ TX support?
> >
> > Confused. I thought this *is* with a multiqueue tun/macvtap? bridge does
> > not do any queueing AFAIK ... I think we need to fix the contention. With
> > migration what was guest to host a minute ago might become guest to
> > external now ...
>
> Macvtap RX is MQ but not TX. I don't think MQ TX support is required for
> macvtap, though. Is it enough for existing macvtap sendmsg to work, since
> it calls dev_queue_xmit which selects the txq for the outgoing device?

I think there would be an issue with using a single poll notifier and contention on the send buffer atomic variable. Is tun different than macvtap? We need to support both long term ...

--
MST
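The question above hinges on how dev_queue_xmit picks a tx queue for a multiqueue device. A minimal sketch of that idea, NOT the kernel code: the kernel hashes the flow (skb_tx_hash) and maps the hash onto num_tx_queues, so one flow stays on one queue while distinct flows spread out. The hash function and flow-key fields below are illustrative assumptions only.

```python
import zlib

def select_txq(src_port, dst_port, num_tx_queues):
    """Pick a tx queue from a flow hash, in the spirit of skb_tx_hash()."""
    # Flow key choice (ports only) is a simplification for illustration.
    flow_key = ("%d:%d" % (src_port, dst_port)).encode()
    h = zlib.crc32(flow_key) & 0xffffffff
    return h % num_tx_queues

# Packets of one flow always land on the same queue...
q1 = select_txq(5000, 80, 8)
q2 = select_txq(5000, 80, 8)

# ...while many different flows tend to spread across the queues.
queues = {select_txq(p, 80, 8) for p in range(5000, 5064)}
```

This is why existing macvtap sendmsg may be "enough" on TX: the spreading happens inside dev_queue_xmit for the outgoing device, not in macvtap itself.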
Re: [PATCH 4/8] KVM: avoid unnecessary wait for a async pf
On 10/27/2010 06:42 PM, Gleb Natapov wrote:
> On Wed, Oct 27, 2010 at 05:04:58PM +0800, Xiao Guangrong wrote:
>> In current code, it checks async pf completion out of the wait context,
>> like this:
>>
>> if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
>>     !vcpu->arch.apf.halted)
>>         r = vcpu_enter_guest(vcpu);
>> else {
>>         ......
>>         kvm_vcpu_block(vcpu)
>>          ^- waiting until 'async_pf.done' is not empty
>> }
>> kvm_check_async_pf_completion(vcpu)
>>  ^- delete list from async_pf.done
>>
>> So, if we check async pf completion first, it can be blocked at
>> kvm_vcpu_block
>
> Correct, but it can be fixed by adding vcpu->arch.apf.halted = false; to
> kvm_arch_async_page_present(), no? Adding kvm_check_async_pf_completion()
> to arch independent kvm_vcpu_block() constrains how other archs may
> implement async pf support IMO.

Um, i think it's reasonable, will fix it to address your comment.
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:

Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning):

Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding.

If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the results later today.

Correction: The result with binding is much better for SD/CPU compared to without-binding:

_____________________________________________________
       numtxqs=8, vhosts=5, Bind vs No-bind
#     BW%     CPU%    RCPU%   SD%     RSD%
_____________________________________________________
1     11.25   10.77   1.89    0       -6.06
2     18.66   7.20    7.20    -14.28  -7.40
4     4.24    -1.27   1.56    -2.70   -.98
8     14.91   -3.79   5.46    -12.19  -3.76
16    12.32   -8.67   4.63    -35.97  -26.66
24    11.68   -7.83   5.10    -40.73  -32.37
32    13.09   -10.51  6.57    -51.52  -42.28
40    11.04   -4.12   11.23   -50.69  -42.81
48    8.61    -10.30  6.04    -62.38  -55.54
64    7.55    -6.05   6.41    -61.20  -56.04
80    8.74    -11.45  6.29    -72.65  -67.17
96    9.84    -6.01   9.87    -69.89  -64.78
128   5.57    -6.23   8.99    -75.03  -70.97
_____________________________________________________
BW: 10.4%, CPU/RCPU: -7.4%,7.7%, SD: -70.5%,-65.7%

Notes:
1. All my test results earlier were with binding vhost to cpus 0-3 for both org and new kernel.
2. I am not using MST's use_mq patch, only mainline kernel. However, I reported earlier that I got better results with that patch. The result for MQ vs MQ+use_mm patch (from my earlier mail):
   BW: 0  CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK
Re: [PATCH 5/8] KVM: don't touch vcpu stat after async pf is complete
On 10/27/2010 06:44 PM, Gleb Natapov wrote:
> On Wed, Oct 27, 2010 at 05:05:57PM +0800, Xiao Guangrong wrote:
>> Don't make a KVM_REQ_UNHALT request after async pf is completed since it
>> can break guest's 'halt' instruction.
>
> Why is it a problem? CPU may be unhalted by different events so OS
> shouldn't depend on it.

We don't know how the guest OS handles it after the HLT instruction is completed. According to X86's spec, only NMI/INTR/RESET/INIT/SMI can break the halt state; it violates the hardware behavior if we allow other events to break this state. Your opinion? :-)
[PATCH] KVM test: Add a subtest kdump
Add a new subtest to check whether kdump works correctly in a guest. This test tries to trigger a crash on each vcpu and then verifies it by checking the vmcore.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 client/tests/kvm/tests/kdump.py              |   79 ++++++++++++++++++++++
 client/tests/kvm/tests_base.cfg.sample       |   11 ++++
 client/tests/kvm/unattended/RHEL-5-series.ks |    1 +
 3 files changed, 91 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/kdump.py

diff --git a/client/tests/kvm/tests/kdump.py b/client/tests/kvm/tests/kdump.py
new file mode 100644
index 000..8fa3cca
--- /dev/null
+++ b/client/tests/kvm/tests/kdump.py
@@ -0,0 +1,79 @@
+import logging, time
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils, kvm_utils
+
+
+def run_kdump(test, params, env):
+    """
+    KVM reboot test:
+    1) Log into a guest
+    2) Check and enable the kdump
+    3) For each vcpu, trigger a crash and check the vmcore
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    timeout = float(params.get("login_timeout", 240))
+    crash_timeout = float(params.get("crash_timeout", 360))
+    session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2)
+    def_kernel_param_cmd = "grubby --update-kernel=`grubby --default-kernel` " \
+                           "--args=crashkernel=1...@64m"
+    kernel_param_cmd = params.get("kernel_param_cmd", def_kernel_param_cmd)
+    def_kdump_enable_cmd = "chkconfig kdump on && service kdump start"
+    kdump_enable_cmd = params.get("kdump_enable_cmd", def_kdump_enable_cmd)
+
+    def crash_test(vcpu):
+        """
+        Trigger a crash dump through sysrq-trigger
+
+        @param vcpu: vcpu which is used to trigger a crash
+        """
+        session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2)
+        session.get_command_status("rm -rf /var/crash/*")
+
+        logging.info("Triggering crash on vcpu %d ...", vcpu)
+        crash_cmd = "taskset -c %d echo c > /proc/sysrq-trigger" % vcpu
+        session.sendline(crash_cmd)
+
+        if not kvm_utils.wait_for(lambda: not session.is_responsive(), 240, 0,
+                                  1):
+            raise error.TestFail("Could not trigger crash on vcpu %d" % vcpu)
+
+        logging.info("Waiting for the completion of dumping")
+        session = kvm_test_utils.wait_for_login(vm, 0, crash_timeout, 0, 2)
+
+        logging.info("Probing vmcore file ...")
+        s = session.get_command_status("ls -R /var/crash | grep vmcore")
+        if s != 0:
+            raise error.TestFail("Could not find the generated vmcore file!")
+        else:
+            logging.info("Found vmcore.")
+
+        session.get_command_status("rm -rf /var/crash/*")
+
+    try:
+        logging.info("Check the existence of crash kernel ...")
+        prob_cmd = "grep -q 1 /sys/kernel/kexec_crash_loaded"
+        s = session.get_command_status(prob_cmd)
+        if s != 0:
+            logging.info("Crash kernel is not loaded. Try to load it.")
+            # We need to setup the kernel params
+            s, o = session.get_command_status_output(kernel_param_cmd)
+            if s != 0:
+                raise error.TestFail("Could not add crashkernel params to"
+                                     " kernel")
+            session = kvm_test_utils.reboot(vm, session, timeout=timeout)
+
+        logging.info("Enable kdump service ...")
+        # the initrd may be rebuilt here so we need to wait a little more
+        s, o = session.get_command_status_output(kdump_enable_cmd, timeout=120)
+        if s != 0:
+            raise error.TestFail("Could not enable kdump service:%s" % o)
+
+        nvcpu = int(params.get("smp", 1))
+        [crash_test(i) for i in range(nvcpu)]
+
+    finally:
+        session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample
index fe3563c..25ad688 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -665,6 +665,15 @@ variants:
             image_name_snapshot1 = sn1
             image_name_snapshot2 = sn2

+    - kdump:
+        type = kdump
+        # time waited for the completion of crash dump
+        # crash_timeout = 360
+        # command to add the crashkerne...@y to kernel cmd line
+        # kernel_param_cmd = grubby --update-kernel=`grubby --default-kernel` --args=crashkernel=1...@64m
+        # command to enable kdump service
+        # kdump_enable_cmd = chkconfig kdump on && service kdump start
+
     # system_powerdown, system_reset and shutdown *must* be the last ones
     # defined (in this order), since the effect of such tests can leave
     # the VM on a bad state.
@@ -1924,6 +1933,8 @@
 virtio_net|virtio_blk|e1000|balloon_check:
     only Fedora.11 Fedora.12 Fedora.13 RHEL.5 OpenSUSE.11 SLES.11 Ubuntu-8.10-server
#
Re: [PATCH 5/8] KVM: don't touch vcpu stat after async pf is complete
On Thu, Oct 28, 2010 at 03:35:13PM +0800, Xiao Guangrong wrote:
> On 10/27/2010 06:44 PM, Gleb Natapov wrote:
>> On Wed, Oct 27, 2010 at 05:05:57PM +0800, Xiao Guangrong wrote:
>>> Don't make a KVM_REQ_UNHALT request after async pf is completed since it
>>> can break guest's 'halt' instruction.
>>
>> Why is it a problem? CPU may be unhalted by different events so OS
>> shouldn't depend on it.
>
> We don't know how the guest OS handles it after the HLT instruction is
> completed. According to X86's spec, only NMI/INTR/RESET/INIT/SMI can break
> the halt state; it violates the hardware behavior if we allow other events
> to break this state. Your opinion? :-)

I agree in principle, but since SMI (which is completely out of guest OS control) can cause the CPU to exit halt, in practice the OS can't rely on the CPU being unhalted only by events controlled by the OS itself. In the past we had a bug where any timer event unhalted the vcpu even when the timer interrupt was masked. The only practical problem it caused was that a vcpu that executed a "cli; 1: hlt; jmp 1b" sequence still consumed host cpu time. That said, I am not against fixing it if the fix is easy. Your current fix though relies on patch 4, which I have a problem with.

--
	Gleb.
Re: [PATCH 6/8] KVM: simply wakup async pf
On 10/27/2010 06:50 PM, Gleb Natapov wrote:
> On Wed, Oct 27, 2010 at 05:07:32PM +0800, Xiao Guangrong wrote:
>> The current way is to queue a completed async pf with:
>>   async_pf.page = bad_page
>>   async_pf.arch.gfn = 0
>>
>> It has two problems while kvm_check_async_pf_completion handles this
>> async_pf:
>> - since !async_pf.page, it can retry a pseudo #PF
>
> kvm_arch_async_page_ready checks for is_error_page()
>
>> - it can delete gfn 0 from vcpu->arch.apf.gfns[]
>
> kvm_arch_async_page_present() checks for is_error_page() too and, in case
> of PV guest, injects a special token if it is true.

Ah, sorry for my stupid questions.

> After your patch the special token will not be injected and migration will
> not work.
>
>> Actually, we can simply record this wakeup request and let
>> kvm_check_async_pf_completion simply break the wait
>
> Maybe the wakeup_all function naming is misleading. It means wake up all
> PV guest processes by sending a broadcast async pf notification. It is not
> about waking the host vcpu thread.

I'm not good at the KVM PV way, i'll dig into it, please ignore this patch, thanks.
Re: [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace
On Sunday 24 October 2010 20:23:20 Michael S. Tsirkin wrote:
> On Sun, Oct 24, 2010 at 08:19:09PM +0800, Sheng Yang wrote:
>>> You need a guarantee that MSIX per-vector mask is used for
>>> disable_irq/enable_irq, right? I can't see how this provides it.
>>
>> This one is meant to directly operate the mask/unmask bit of the MSI-X
>> table, to emulate the mask/unmask behavior that the guest wants. In the
>> previous patch I used enable_irq()/disable_irq(), but they won't directly
>> operate the MSI-X table unless it's necessary, and Michael wants to read
>> the table in userspace, so he prefers using mask/unmask directly.
>
> As I said, the main problem was really that the proposed implementation
> only works for interrupts used by assigned devices. I would like for it to
> work for irqfd as well.

I think we can't let QEmu access the mask or pending bit directly. It must communicate with the kernel to get the info if the kernel owns the mask. That's because the guest's supposed mask/unmask operation in fact has nothing to do with what the host would do. Maybe we can emulate it by doing the same thing on the device, but it's two layers in fact. Also we know the host kernel does disabling/enabling according to its own mechanism, e.g. it may disable an interrupt temporarily if there are too many interrupts. What the host does should be transparent to the guest. Directly accessing the data from the device should be prohibited.

And the pending bit case is the same. In fact the kernel knows which IRQ is pending; we can check the IRQ_PENDING bit of desc, though we don't have such an interface now. But we can do it in the future if it's necessary.

I'm proposing a new interface like kvm_get_msix_entry, to return the mask bit of a specific entry. The pending bit support can be added in the future if it's needed. But we can't directly access the MSI-X table/PBA in theory.

--
regards
Yang, Sheng
Re: Page Eviction Algorithm
On 10/26/2010 03:31 PM, Prasad Joshi wrote:
> On Tue, Oct 26, 2010 at 2:07 PM, Avi Kivity <a...@redhat.com> wrote:
>> On 10/26/2010 12:42 PM, Prasad Joshi wrote:
>>> Thanks a lot for your reply.
>>>
>>> On Tue, Oct 26, 2010 at 11:31 AM, Avi Kivity <a...@redhat.com> wrote:
>>>> On 10/26/2010 11:19 AM, Prasad Joshi wrote:
>>>>> Hi All, I was just going over TODO list on KVM page. In MMU related
>>>>> TODO I saw only page eviction algorithm currently implemented is FIFO.
>>>>> Is it really the case?
>>>>
>>>> Yes.
>>>>
>>>>> If yes I would like to work on it. Can someone let me know the place
>>>>> where the FIFO code is implemented?
>>>>
>>>> Look at the code that touches mmu_active_list. FWIW improving the
>>>> algorithm is not critically important. It's rare that mmu shadow pages
>>>> need to be evicted.
>>>
>>> I would be doing a University project on Virtualization. I would like to
>>> work on Linux kernel and KVM. I was looking over the TODO list on KVM
>>> wiki. Can you please suggest me something that would add value to KVM?
>>
>> O(1) write protection (on the TODO page) is interesting and important.
>> It's difficult, so you may want to start with O(1) invalidation.
>
> I am not sure if I can understand what exactly is a MMU invalidation. Is
> it cache invalidation or TLB invalidation? Can you please elaborate. I am
> really sorry if I am asking a silly question.

Invalidation of all shadow page tables. The current code which does this is in kvm_mmu_zap_all().

--
error compiling committee.c: too many arguments to function
Re: Page Eviction Algorithm
On 10/26/2010 05:08 PM, Prasad Joshi wrote:
>>> Can you please suggest me something that would add value to KVM?
>>
>> O(1) write protection (on the TODO page) is interesting and important.
>> It's difficult, so you may want to start with O(1) invalidation.
>
> I am not sure if I can understand what exactly is a MMU invalidation. Is
> it cache invalidation or TLB invalidation? Can you please elaborate. I am
> really sorry if I am asking a silly question.
>
> Does this MMU invalidation have to do something with the EPT (Extended
> Page Table)

No

> and instruction INVEPT?

No, (though INVEPT has to be run as part of this operation, via kvm_flush_remote_tlbs).

--
error compiling committee.c: too many arguments to function
Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
On Wed, Oct 27, 2010 at 10:05 PM, Shirley Ma <mashi...@us.ibm.com> wrote:
> This patch changes vhost TX used buffer signal to guest from one by one to
> up to 3/4 of vring size. This change improves vhost TX message size from
> 256 to 8K performance for both bandwidth and CPU utilization without
> inducing any regression.

Any concerns about introducing latency or does the guest not care when TX completions come in?

> Signed-off-by: Shirley Ma <x...@us.ibm.com>
> ---
>  drivers/vhost/net.c   |   19 ++-
>  drivers/vhost/vhost.c |   31 +++
>  drivers/vhost/vhost.h |    3 +++
>  3 files changed, 52 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 4b4da5b..bd1ba71 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -198,7 +198,24 @@ static void handle_tx(struct vhost_net *net)
>                 if (err != len)
>                         pr_debug("Truncated TX packet: len %d != %zd\n",
>                                  err, len);
> -               vhost_add_used_and_signal(net->dev, vq, head, 0);
> +               /*
> +                * if no pending buffer size allocate, signal used buffer
> +                * one by one, otherwise, signal used buffer when reaching
> +                * 3/4 ring size to reduce CPU utilization.
> +                */
> +               if (unlikely(vq->pend))
> +                       vhost_add_used_and_signal(net->dev, vq, head, 0);
> +               else {
> +                       vq->pend[vq->num_pend].id = head;

I don't understand the logic here: if !vq->pend then we assign to vq->pend[vq->num_pend].

> +                       vq->pend[vq->num_pend].len = 0;
> +                       ++vq->num_pend;
> +                       if (vq->num_pend == (vq->num - (vq->num >> 2))) {
> +                               vhost_add_used_and_signal_n(net->dev, vq,
> +                                                           vq->pend,
> +                                                           vq->num_pend);
> +                               vq->num_pend = 0;
> +                       }
> +               }
>                 total_len += len;
>                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>                         vhost_poll_queue(&vq->poll);
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 94701ff..47696d2 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>         vq->call_ctx = NULL;
>         vq->call = NULL;
>         vq->log_ctx = NULL;
> +       /* signal pending used buffers */
> +       if (vq->pend) {
> +               if (vq->num_pend != 0) {
> +                       vhost_add_used_and_signal_n(dev, vq, vq->pend,
> +                                                   vq->num_pend);
> +                       vq->num_pend = 0;
> +               }
> +               kfree(vq->pend);
> +       }
> +       vq->pend = NULL;
>  }
>
>  static int vhost_worker(void *data)
> @@ -273,7 +283,13 @@ long vhost_dev_init(struct vhost_dev *dev,
>                 dev->vqs[i].heads = NULL;
>                 dev->vqs[i].dev = dev;
>                 mutex_init(&dev->vqs[i].mutex);
> +               dev->vqs[i].num_pend = 0;
> +               dev->vqs[i].pend = NULL;
>                 vhost_vq_reset(dev, dev->vqs + i);
> +               /* signal 3/4 of ring size used buffers */
> +               dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
> +                                          (dev->vqs[i].num >> 2)) *
> +                                          sizeof *vq->peed, GFP_KERNEL);

Has this patch been compile tested?  vq->peed?

Stefan
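A toy model of what the patch is aiming for, with the inverted `vq->pend` check corrected along the lines Stefan points out. This is an illustrative sketch in Python, not vhost's API: the assumption is that used buffers are deferred into a pend array and one guest notification covers the whole batch once it reaches 3/4 of the ring size.

```python
class TxVirtqueue:
    """Toy model of deferred used-buffer signalling (names are hypothetical)."""
    def __init__(self, num):
        self.num = num            # ring size
        self.pend = []            # deferred (head, len) used entries
        self.signals = 0          # guest notifications issued

    def add_used(self, head):
        # Defer the used entry instead of signalling one by one
        # (the posted patch's condition did the opposite - that is the bug).
        self.pend.append((head, 0))
        if len(self.pend) == self.num - (self.num >> 2):   # 3/4 of ring
            self.flush()

    def flush(self):
        """Report all deferred entries with a single signal."""
        if self.pend:
            self.pend = []
            self.signals += 1

vq = TxVirtqueue(256)             # threshold = 256 - 64 = 192 entries
for head in range(1000):
    vq.add_used(head)
vq.flush()   # MST's suggestion: flush leftovers before leaving handle_tx
```

With a 256-entry ring, 1000 used buffers produce 5 batched signals plus one final flush instead of 1000 signals, which is the CPU-utilization win the patch claims; the latency question Stefan raises is exactly the cost of that deferral.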
Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
On Thu, Oct 28, 2010 at 9:57 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote: Just read the patch 1/1 discussion and it looks like you're already on it. Sorry for the noise.

Stefan
Re: [PATCH 7/8] KVM: make async_pf work queue lockless
On 10/27/2010 07:41 PM, Gleb Natapov wrote:
> On Wed, Oct 27, 2010 at 05:09:41PM +0800, Xiao Guangrong wrote:
>> The async_pf number is very few since only a pending interrupt can let it
>> re-enter to the guest mode. During my test (Host 4 CPU + 4G, Guest 4 VCPU
>> + 6G), it's no more than 10 requests in the system. So, we can only
>> increase the completion counter in the work queue context, and walk
>> vcpu->async_pf.queue list to get all completed async_pf
>
> That depends on the load. I used memory cgroups to create very big memory
> pressure and I saw hundreds of apfs per second. We shouldn't optimize for
> very low numbers. With vcpu->async_pf.queue having more than one element I
> am not sure your patch is beneficial.

Maybe we need a new no-lock way to record the completed apfs, i'll reproduce your test environment and improve it.

>> +
>> +		list_del(&work->queue);
>> +		vcpu->async_pf.queued--;
>> +		kmem_cache_free(async_pf_cache, work);
>> +		if (atomic_dec_and_test(&vcpu->async_pf.done))
>> +			break;
>
> You should do atomic_dec() and always break. We cannot inject two apfs
> during one vcpu entry.

Sorry, i'm a little confused. Why should 'atomic_dec_and_test(&vcpu->async_pf.done)' always break? async_pf.done is used to record the completed apfs and many apfs may be completed when the vcpu enters guest mode (it means vcpu->async_pf.done > 1)

Look at the current code:

void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
{
	......
	spin_lock(&vcpu->async_pf.lock);
	work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
	list_del(&work->link);
	spin_unlock(&vcpu->async_pf.lock);
	......
}

You only handle one completed apf here; why not inject them all at once? Did I miss something? :-(
Re: [PATCH 8/8] KVM: add debugfs file to show the number of async pf
On 10/27/2010 06:58 PM, Gleb Natapov wrote: On Wed, Oct 27, 2010 at 05:10:51PM +0800, Xiao Guangrong wrote: It can help us to see the state of async pf I have patch to add three async pf statistics: apf_not_present apf_present apf_doublefault But Avi now wants to deprecate debugfs interface completely and move towards ftrace, so I had to drop it. OK, let's ignore this patch, thanks :-)
Re: [PATCH 7/8] KVM: make async_pf work queue lockless
On Thu, Oct 28, 2010 at 05:08:58PM +0800, Xiao Guangrong wrote:
> On 10/27/2010 07:41 PM, Gleb Natapov wrote:
>> On Wed, Oct 27, 2010 at 05:09:41PM +0800, Xiao Guangrong wrote:
>>> The async_pf number is very few since only a pending interrupt can let
>>> it re-enter to the guest mode. During my test (Host 4 CPU + 4G, Guest 4
>>> VCPU + 6G), it's no more than 10 requests in the system. So, we can only
>>> increase the completion counter in the work queue context, and walk
>>> vcpu->async_pf.queue list to get all completed async_pf
>>
>> That depends on the load. I used memory cgroups to create very big memory
>> pressure and I saw hundreds of apfs per second. We shouldn't optimize for
>> very low numbers. With vcpu->async_pf.queue having more than one element
>> I am not sure your patch is beneficial.
>
> Maybe we need a new no-lock way to record the completed apfs, i'll
> reproduce your test environment and improve it.

That is always welcomed :)

>>> +
>>> +		list_del(&work->queue);
>>> +		vcpu->async_pf.queued--;
>>> +		kmem_cache_free(async_pf_cache, work);
>>> +		if (atomic_dec_and_test(&vcpu->async_pf.done))
>>> +			break;
>>
>> You should do atomic_dec() and always break. We cannot inject two apfs
>> during one vcpu entry.
>
> Sorry, i'm a little confused. Why should
> 'atomic_dec_and_test(&vcpu->async_pf.done)' always break? async_pf.done
> is used to

In your code it is not, but it should (at least if guest is PV, read below).

> record the completed apfs and many apfs may be completed when the vcpu
> enters guest mode (it means vcpu->async_pf.done > 1)

Correct, but only one apf should be handled on each vcpu entry in case of a PV guest. Look at kvm_arch_async_page_present(vcpu, work); that is called in a loop in your code. If vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED is not null it injects an exception into the guest. You can't inject more than one exception on each guest entry. If the guest is not PV you are correct that we can loop here until vcpu->async_pf.done == 0.

> Look at the current code:
>
> void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> {
> 	......
> 	spin_lock(&vcpu->async_pf.lock);
> 	work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
> 	list_del(&work->link);
> 	spin_unlock(&vcpu->async_pf.lock);
> 	......
> }
>
> You only handle one completed apf here; why not inject them all at once?
> Did I miss something? :-(

--
	Gleb.
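Gleb's constraint above can be modelled in a few lines: for a PV guest, each vcpu entry may report at most one completed async pf (one injected exception), while a non-PV guest's done list can be drained in one pass. This is an illustrative sketch with hypothetical names, not KVM's code.

```python
from collections import deque

class Vcpu:
    """Toy model of per-vcpu completed async page faults."""
    def __init__(self, pv_guest):
        self.pv_guest = pv_guest
        self.done = deque()          # completed apfs, filled by work queue
        self.injected = []           # tokens delivered to the guest

    def complete(self, token):
        self.done.append(token)

    def check_completion(self):
        """Called once per guest entry."""
        if self.pv_guest:
            # At most ONE exception can be injected per entry,
            # so handle only the first completed apf.
            if self.done:
                self.injected.append(self.done.popleft())
        else:
            # Non-PV guest: nothing is injected, safe to drain everything.
            self.done.clear()

vcpu = Vcpu(pv_guest=True)
for t in (1, 2, 3):
    vcpu.complete(t)

entries = 0
while vcpu.done:
    vcpu.check_completion()          # one guest entry each iteration
    entries += 1
```

Three completed apfs therefore need three guest entries on a PV guest, which is why `kvm_check_async_pf_completion` handles a single element of `done` per call.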
Re: [PATCH] KVM test: Add a subtest kdump
On Thu, 2010-10-28 at 15:36 +0800, Jason Wang wrote: Add a new subtest to check whether kdump work correctly in guest. This test just try to trigger crash on each vcpu and then verify it by checking the vmcore. Nice test Jason, some comments below: Signed-off-by: Jason Wang jasow...@redhat.com --- client/tests/kvm/tests/kdump.py | 79 ++ client/tests/kvm/tests_base.cfg.sample | 11 client/tests/kvm/unattended/RHEL-5-series.ks |1 3 files changed, 91 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/kdump.py diff --git a/client/tests/kvm/tests/kdump.py b/client/tests/kvm/tests/kdump.py new file mode 100644 index 000..8fa3cca --- /dev/null +++ b/client/tests/kvm/tests/kdump.py @@ -0,0 +1,79 @@ +import logging, time +from autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils + + +def run_kdump(test, params, env): + +KVM reboot test: +1) Log into a guest +2) Check and enable the kdump +3) For each vcpu, trigger a crash and check the vmcore + +@param test: kvm test object +@param params: Dictionary with the test parameters +@param env: Dictionary with test environment. + +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm)) +timeout = float(params.get(login_timeout, 240)) +crash_timeout = float(params.get(crash_timeout, 360)) +session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2) +def_kernel_param_cmd = grubby --update-kernel=`grubby --default-kernel` \ + --args=crashkernel=1...@64m ^ Implicit line continuation is better here def_kernel_param_cmd = (command param1 param 2... 
param8 param9) +kernel_param_cmd = params.get(kernel_param_cmd, def_kernel_param_cmd) +def_kdump_enable_cmd = chkconfig kdump on service kdump start +kdump_enable_cmd = params.get(kdump_enable_cmd, def_kdump_enable_cmd) + +def crash_test(vcpu): + +Trigger a crash dump through sysrq-trigger + +@param vcpu: vcpu which is used to trigger a crash + +session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2) +session.get_command_status(rm -rf /var/crash/*) + +logging.info(Triggering crash on vcpu %d ..., vcpu) +crash_cmd = taskset -c %d echo c /proc/sysrq-trigger % vcpu +session.sendline(crash_cmd) + +if not kvm_utils.wait_for(lambda: not session.is_responsive(), 240, 0, + 1): +raise error.TestFail(Could not trigger crash on vcpu %d % vcpu) + +logging.info(Waiting for the completion of dumping) ^ Waiting for kernel crash dump to complete would be better +session = kvm_test_utils.wait_for_login(vm, 0, crash_timeout, 0, 2) + +logging.info(Probing vmcore file ...) +s = session.get_command_status(ls -R /var/crash | grep vmcore) +if s != 0: +raise error.TestFail(Could not find the generated vmcore file!) +else: +logging.info(Found vmcore.) + +session.get_command_status(rm -rf /var/crash/*) + +try: +logging.info(Check the existence of crash kernel ...) +prob_cmd = grep -q 1 /sys/kernel/kexec_crash_loaded +s = session.get_command_status(prob_cmd) +if s != 0: +logging.info(Crash kernel is not loaded. Try to load it.) +# We need to setup the kernel params +s, o = session.get_command_status_output(kernel_param_cmd) +if s != 0: +raise error.TestFail(Could not add crashkernel params to + kernel) +session = kvm_test_utils.reboot(vm, session, timeout=timeout); + +logging.info(Enable kdump service ...) +# the initrd may be rebuilt here so we need to wait a little more +s, o = session.get_command_status_output(kdump_enable_cmd, timeout=120) ^ I remember initrd built usually takes longer than 2 minutes in most machines, does this work fine on both Fedora and RHEL? 
+if s != 0: +raise error.TestFail(Could not enable kdump service:%s % o) + +nvcpu = int(params.get(smp, 1)) +[crash_test(i) for i in range(nvcpu)] ^ Although list comprehension is indeed very cool, since we're not going to do anything with this list, I'd rather prefer to use the good old for loop. +finally: +session.close() diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index fe3563c..25ad688 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -665,6 +665,15 @@ variants: image_name_snapshot1 = sn1 image_name_snapshot2 = sn2 +- kdump: +type = kdump +# time waited for
Re: Page Eviction Algorithm
On Thu, Oct 28, 2010 at 1:45 AM, Avi Kivity <a...@redhat.com> wrote: Does this MMU invalidation has to do something with the EPT (Extended Page Table) No and instruction INVEPT? No, (though INVEPT has to be run as part of this operation, via kvm_flush_remote_tlbs). Thanks a lot Avi for your help. I would look at the code and do my study. I would ask clarification whenever I need help.

--
Re: [PATCH 1/2] msix: Allow msix_init on a device with existing MSI-X capability
On 10/23/2010 06:55 PM, Alex Williamson wrote:
> On Sat, 2010-10-23 at 18:18 +0200, Michael S. Tsirkin wrote:
>> On Fri, Oct 22, 2010 at 02:40:31PM -0600, Alex Williamson wrote:
>>> To enable common msix support to be used with pass through devices,
>>> don't attempt to change the BAR if the device already has an MSI-X
>>> capability. This also means we want to pay closer attention to the size
>>> when we map the msix table page, as it isn't necessarily covering the
>>> entire end of the BAR.
>>>
>>> Signed-off-by: Alex Williamson <alex.william...@redhat.com>
>>> ---
>>>  hw/msix.c |   67 +++--
>>>  1 files changed, 38 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/hw/msix.c b/hw/msix.c
>>> index 43efbd2..4122395 100644
>>> --- a/hw/msix.c
>>> +++ b/hw/msix.c
>>> @@ -167,35 +167,43 @@ static int msix_add_config(struct PCIDevice *pdev, unsigned short nentries,
>>>  {
>>>      int config_offset;
>>>      uint8_t *config;
>>> -    uint32_t new_size;
>>> -    if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1)
>>> -        return -EINVAL;
>>> -    if (bar_size > 0x8000)
>>> -        return -ENOSPC;
>>> -
>>> -    /* Add space for MSI-X structures */
>>> -    if (!bar_size) {
>>> -        new_size = MSIX_PAGE_SIZE;
>>> -    } else if (bar_size < MSIX_PAGE_SIZE) {
>>> -        bar_size = MSIX_PAGE_SIZE;
>>> -        new_size = MSIX_PAGE_SIZE * 2;
>>> -    } else {
>>> -        new_size = bar_size * 2;
>>> -    }
>>> -
>>> -    pdev->msix_bar_size = new_size;
>>> -    config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, MSIX_CAP_LENGTH);
>>> -    if (config_offset < 0)
>>> -        return config_offset;
>>> -    config = pdev->config + config_offset;
>>> -
>>> -    pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
>>> -    /* Table on top of BAR */
>>> -    pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
>>> -    /* Pending bits on top of that */
>>> -    pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + MSIX_PAGE_PENDING) |
>>> -                 bar_nr);
>>> +    pdev->msix_bar_size = bar_size;
>>> +
>>> +    config_offset = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
>>> +
>>> +    if (!config_offset) {
>>> +        uint32_t new_size;
>>> +
>>> +        if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1)
>>> +            return -EINVAL;
>>> +        if (bar_size > 0x8000)
>>> +            return -ENOSPC;
>>> +
>>> +        /* Add space for MSI-X structures */
>>> +        if (!bar_size) {
>>> +            new_size = MSIX_PAGE_SIZE;
>>> +        } else if (bar_size < MSIX_PAGE_SIZE) {
>>> +            bar_size = MSIX_PAGE_SIZE;
>>> +            new_size = MSIX_PAGE_SIZE * 2;
>>> +        } else {
>>> +            new_size = bar_size * 2;
>>> +        }
>>> +
>>> +        pdev->msix_bar_size = new_size;
>>> +        config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX,
>>> +                                           MSIX_CAP_LENGTH);
>>> +        if (config_offset < 0)
>>> +            return config_offset;
>>> +        config = pdev->config + config_offset;
>>> +
>>> +        pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
>>> +        /* Table on top of BAR */
>>> +        pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
>>> +        /* Pending bits on top of that */
>>> +        pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + MSIX_PAGE_PENDING) |
>>> +                     bar_nr);
>>> +    }
>>>      pdev->msix_cap = config_offset;
>>>      /* Make flags bit writeable. */
>>>      pdev->wmask[config_offset + MSIX_CONTROL_OFFSET] |= MSIX_ENABLE_MASK |
>>> @@ -337,7 +345,8 @@ void msix_mmio_map(PCIDevice *d, int region_num,
>>>          return;
>>>      if (size <= offset)
>>>          return;
>>> -    cpu_register_physical_memory(addr + offset, size - offset,
>>> +    cpu_register_physical_memory(addr + offset,
>>> +                                 MIN(size - offset, MSIX_PAGE_SIZE),
>>
>> This is wrong I think, the table might not fit in a single page. You
>> would need to read the table size out of device config.
>
> That's true, but I was hoping to save that for later since we don't seem
> to be running into that problem yet. Current device assignment code
> assumes a single page, and I haven't heard of anyone with a vector table
> that exceeds that yet. Thanks,

Ok; applied. Please add some warning if the condition happens so the breakage is at least not silent.

--
error compiling committee.c: too many arguments to function
Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
On Thu, 2010-10-28 at 07:20 +0200, Michael S. Tsirkin wrote: My concern is that this can delay signalling for unlimited time. Could you please test this with guests that do not have 2b5bbe3b8bee8b38bdc27dd9c0270829b6eb7eeb and b0c39dbdc204006ef3558a66716ff09797619778, that is, 2.6.31 and older?

I will test it out.

This seems to be slightly out of spec, even though for TX, signals are less important. Two ideas:
1. How about writing out used, just delaying the signal? This way we don't have to queue separately.
2. How about flushing out queued stuff before we exit the handle_tx loop? That would address most of the spec issue.

I will modify the patch to test out the performance for both approaches 1 and 2.

Thanks
Shirley
Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
On Thu, 2010-10-28 at 07:20 +0200, Michael S. Tsirkin wrote: My concern is this can delay signalling for unlimited time. Could you pls test this with guests that do not have 2b5bbe3b8bee8b38bdc27dd9c0270829b6eb7eeb b0c39dbdc204006ef3558a66716ff09797619778 that is 2.6.31 and older?

The patch only delays signalling for an unlimited time when there is no TX packet to transmit. I thought TX signalling only notifies the guest to release the used buffers; is there anything else besides this? I tested a RHEL 5.5 guest (2.6.18 kernel) and it works fine. I checked the two commit logs, and I don't think this patch could cause any issue without those two patches.

Also, I found a big TX regression between old guests and new guests. For an old guest, I am able to get almost 11Gb/s for 2K message size, but for the new guest kernel I can only get 3.5Gb/s with the patch and the same host. I will dig into why.

thanks
Shirley
[PATCH 0/4] VFIO V5: Non-privileged user level PCI drivers
[Rebased to 2.6.36]

Just in time for Halloween - be afraid! This version adds support for PCIe extended capabilities, including Advanced Error Reporting. All of the config table initialization has been rewritten to be much more readable. All config accesses are byte-at-a-time and endian issues have been resolved. Reading of PCI ROMs is now handled properly. Problems with usage of pci_iomap and pci_request_regions have now been resolved. Devices are now reset upon first open. Races in memlock rlimit accounting have been cleaned up. Lots of other tweaks and comments.

Blurb from version 4:

After a long summer break, it's tanned, it's rested, and it's ready to rumble! In this version: *** REBASE to 2.6.35 ***

There's new code using generic netlink messages which allows the kernel to notify the user level of weird events and allows the user level to respond. This is currently used to handle device removal (whether software or hardware driven), PCI error events, and system suspend/hibernate. The driver now supports devices which use multiple MSI interrupts, reflecting the actual number of interrupts allocated by the system to the user level. PCI config accesses are now done through the pci_user_{read,write}_config routines from drivers/pci/access.c. Numerous other tweaks and cleanups.

Blurb from version 3:

There are lots of bug fixes and cleanups in this version, but the main change is to check to make sure that the IOMMU has interrupt remapping enabled, which is necessary to prevent user level code from triggering spurious interrupts for other devices. Since most platforms today do not have the necessary hardware and/or software for this, a module option can override this check, thus making vfio useful (but not safe) on many more platforms. In the next version I plan to add kernel to user messaging using the generic netlink mechanism to allow the user driver to react to hot add and remove, and power management requests.

Blurb from version 2:

This version now requires an IOMMU domain to be set before any access to device registers is granted (except that config space may be read). In addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API which does not have sufficient controls around IOMMU usage. The IOMMU domain is obtained from the 'uiommu' driver which is included in this patch. Various locking, security, and documentation issues have also been fixed.

Please commit - it or me! But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?

Blurb from version 1:

This patch is the evolution of code which was first proposed as a patch to uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely out of the uio framework, and things seem much cleaner. Of course, there is a lot of functional overlap with uio, but the previous version just seemed like a giant mode switch in the uio code that did not lead to clarity for either the new or old code. [a pony for avi...]

The major new functionality in this version is the ability to deal with PCI config space accesses (through read/write calls) - but includes table driven code to determine what's safe to write and what is not. Also, some virtualization of the config space to allow drivers to think they're writing some registers when they're not. Also, IO space accesses are also allowed. Drivers for devices which use MSI-X are now prevented from directly writing the MSI-X vector area. All interrupts are now handled using eventfds, which makes things very simple.

The name VFIO refers to the Virtual Function capabilities of SR-IOV devices, but the driver does support many more types of devices. I was none too sure what driver directory this should live in, so for now I made up my own under drivers/vfio. As a new driver/new directory, who makes the commit decision?

I currently have user level drivers working for 3 different network adapters - the Cisco Palo enic, the Intel 82599 VF, and the Intel 82576 VF (but the whole user level framework is a long way from release). This driver could also clearly replace a number of other drivers written just to give user access to certain devices - but that will take time.

Tom Lyon (4):
  VFIO V5: export pci_user_{read,write}_config
  VFIO V5: additions to include/linux/pci_regs.h
  VFIO V5: uiommu driver - allow user progs to manipulate iommu domains
  VFIO V5: Non-privileged user level PCI drivers

 Documentation/ioctl/ioctl-number.txt |    1 +
 Documentation/vfio.txt               |  182 ++
 MAINTAINERS                          |    8 +
 drivers/Kconfig                      |    2 +
 drivers/Makefile                     |    2 +
 drivers/pci/access.c                 |    6 +-
 drivers/pci/pci.h                    |    7 -
 drivers/vfio/Kconfig                 |   18 +
 drivers/vfio/Makefile                |   11 +
 drivers/vfio/uiommu.c                |  126
 drivers/vfio/vfio_dma.c              |  400
[PATCH 1/4] VFIO V5: export pci_user_{read,write}_config
Acked-by: Jesse Barnes <jbar...@virtuousgeek.org>
Signed-off-by: Tom Lyon <p...@cisco.com>
---
 drivers/pci/access.c |    6 --
 drivers/pci/pci.h    |    7 ---
 include/linux/pci.h  |    8
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 531bc69..96ed449 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -157,7 +157,8 @@ int pci_user_read_config_##size \
     raw_spin_unlock_irq(&pci_lock);             \
     *val = (type)data;                          \
     return ret;                                 \
-}
+}                                               \
+EXPORT_SYMBOL_GPL(pci_user_read_config_##size);

 #define PCI_USER_WRITE_CONFIG(size,type)        \
 int pci_user_write_config_##size                \
@@ -171,7 +172,8 @@ int pci_user_write_config_##size \
                     pos, sizeof(type), val);    \
     raw_spin_unlock_irq(&pci_lock);             \
     return ret;                                 \
-}
+}                                               \
+EXPORT_SYMBOL_GPL(pci_user_write_config_##size);

 PCI_USER_READ_CONFIG(byte, u8)
 PCI_USER_READ_CONFIG(word, u16)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6beb11b..e1db481 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -77,13 +77,6 @@ static inline bool pci_is_bridge(struct pci_dev *pci_dev)
     return !!(pci_dev->subordinate);
 }

-extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
-extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
-extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
-extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
-extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
-extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);
-
 struct pci_vpd_ops {
     ssize_t (*read)(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
     ssize_t (*write)(struct pci_dev *dev, loff_t pos, size_t count, const void *buf);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c8d95e3..7f22c8a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -756,6 +756,14 @@ static inline int pci_write_config_dword(struct pci_dev *dev, int where,
     return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }

+/* user-space driven config access */
+extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
+extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
+extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
+extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
+extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
+extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);
+
 int __must_check pci_enable_device(struct pci_dev *dev);
 int __must_check pci_enable_device_io(struct pci_dev *dev);
 int __must_check pci_enable_device_mem(struct pci_dev *dev);
--
1.6.0.2
[PATCH 3/4] VFIO V5: uiommu driver - allow user progs to manipulate iommu domains
Signed-off-by: Tom Lyon <p...@cisco.com>
---
 drivers/Kconfig        |    2 +
 drivers/Makefile       |    1 +
 drivers/vfio/Kconfig   |    8 +++
 drivers/vfio/Makefile  |    1 +
 drivers/vfio/uiommu.c  |  126
 include/linux/uiommu.h |   76 +
 6 files changed, 214 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/uiommu.c
 create mode 100644 include/linux/uiommu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index a2b902f..711c1cb 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
 source "drivers/staging/Kconfig"

 source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index a2aea53..c445440 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_FUSION)  += message/
 obj-y               += firewire/
 obj-y               += ieee1394/
 obj-$(CONFIG_UIO)   += uio/
+obj-$(CONFIG_UIOMMU) += vfio/
 obj-y               += cdrom/
 obj-y               += auxdisplay/
 obj-$(CONFIG_PCCARD) += pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 000..3ab9af3
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig UIOMMU
+	tristate "User level manipulation of IOMMU"
+	help
+	  Device driver to allow user level programs to
+	  manipulate IOMMU domains.
+
+	  If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 000..556f3c1
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UIOMMU) += uiommu.o
diff --git a/drivers/vfio/uiommu.c b/drivers/vfio/uiommu.c
new file mode 100644
index 000..5c17c5a
--- /dev/null
+++ b/drivers/vfio/uiommu.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, p...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+/*
+ * uiommu driver - issue fd handles for IOMMU domains
+ * so they may be passed to vfio (and others?)
+ */
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tom Lyon <p...@cisco.com>");
+MODULE_DESCRIPTION("User IOMMU driver");
+
+static struct uiommu_domain *uiommu_domain_alloc(void)
+{
+	struct iommu_domain *domain;
+	struct uiommu_domain *udomain;
+
+	domain = iommu_domain_alloc();
+	if (!domain)
+		return NULL;
+	udomain = kzalloc(sizeof *udomain, GFP_KERNEL);
+	if (!udomain) {
+		iommu_domain_free(domain);
+		return NULL;
+	}
+	udomain->domain = domain;
+	atomic_inc(&udomain->refcnt);
+	return udomain;
+}
+
+static int uiommu_open(struct inode *inode, struct file *file)
+{
+	struct uiommu_domain *udomain;
+
+	udomain = uiommu_domain_alloc();
+	if (!udomain)
+		return -ENOMEM;
+	file->private_data = udomain;
+	return 0;
+}
+
+static int uiommu_release(struct inode *inode, struct file *file)
+{
+	struct uiommu_domain *udomain;
+
+	udomain = file->private_data;
+	uiommu_put(udomain);
+	return 0;
+}
+
+static const struct file_operations uiommu_fops = {
+	.owner	 = THIS_MODULE,
+	.open	 = uiommu_open,
+	.release = uiommu_release,
+};
+
+static struct miscdevice uiommu_dev = {
+	.name  = "uiommu",
+	.minor = MISC_DYNAMIC_MINOR,
+	.fops  = &uiommu_fops,
+};
+
+struct uiommu_domain *uiommu_fdget(int fd)
+{
+	struct file *file;
+	struct uiommu_domain *udomain;
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+	if (file->f_op != &uiommu_fops) {
+		fput(file);
+		return ERR_PTR(-EINVAL);
+	}
+	udomain = file->private_data;
+	atomic_inc(&udomain->refcnt);
+	return
[PATCH 2/4] VFIO V5: additions to include/linux/pci_regs.h
Signed-off-by: Tom Lyon <p...@cisco.com>
---
 include/linux/pci_regs.h |  107 ++
 1 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 455b9cc..70addc9 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -26,6 +26,7 @@
  * Under PCI, each device has 256 bytes of configuration address space,
  * of which the first 64 bytes are standardized as follows:
  */
+#define PCI_STD_HEADER_SIZEOF	64
 #define PCI_VENDOR_ID		0x00	/* 16 bits */
 #define PCI_DEVICE_ID		0x02	/* 16 bits */
 #define PCI_COMMAND		0x04	/* 16 bits */
@@ -209,9 +210,12 @@
 #define  PCI_CAP_ID_SHPC	0x0C	/* PCI Standard Hot-Plug Controller */
 #define  PCI_CAP_ID_SSVID	0x0D	/* Bridge subsystem vendor/device ID */
 #define  PCI_CAP_ID_AGP3	0x0E	/* AGP Target PCI-PCI bridge */
+#define  PCI_CAP_ID_SECDEV	0x0F	/* Secure Device */
 #define  PCI_CAP_ID_EXP		0x10	/* PCI Express */
 #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
+#define  PCI_CAP_ID_SATA	0x12	/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF		0x13	/* PCI Advanced Features */
+#define  PCI_CAP_ID_MAX		PCI_CAP_ID_AF
 #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
 #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF		4
@@ -276,6 +280,7 @@
 #define  PCI_VPD_ADDR_MASK	0x7fff	/* Address mask */
 #define  PCI_VPD_ADDR_F		0x8000	/* Write 0, 1 indicates completion */
 #define PCI_VPD_DATA		4	/* 32-bits of data returned here */
+#define PCI_CAP_VPD_SIZEOF	8

 /* Slot Identification */

@@ -297,8 +302,10 @@
 #define PCI_MSI_ADDRESS_HI	8	/* Upper 32 bits (if PCI_MSI_FLAGS_64BIT set) */
 #define PCI_MSI_DATA_32		8	/* 16 bits of data for 32-bit devices */
 #define PCI_MSI_MASK_32		12	/* Mask bits register for 32-bit devices */
+#define PCI_MSI_PENDING_32	16	/* Pending intrs for 32-bit devices */
 #define PCI_MSI_DATA_64		12	/* 16 bits of data for 64-bit devices */
 #define PCI_MSI_MASK_64		16	/* Mask bits register for 64-bit devices */
+#define PCI_MSI_PENDING_64	20	/* Pending intrs for 64-bit devices */

 /* MSI-X registers (these are at offset PCI_MSIX_FLAGS) */
 #define PCI_MSIX_FLAGS		2
@@ -306,6 +313,7 @@
 #define  PCI_MSIX_FLAGS_ENABLE	(1 << 15)
 #define  PCI_MSIX_FLAGS_MASKALL	(1 << 14)
 #define PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
+#define PCI_CAP_MSIX_SIZEOF	12	/* size of MSIX registers */

 /* CompactPCI Hotswap Register */

@@ -328,6 +336,7 @@
 #define  PCI_AF_CTRL_FLR	0x01
 #define PCI_AF_STATUS		5
 #define  PCI_AF_STATUS_TP	0x01
+#define PCI_CAP_AF_SIZEOF	6	/* size of AF registers */

 /* PCI-X registers */

@@ -364,6 +373,9 @@
 #define  PCI_X_STATUS_SPL_ERR	0x2000	/* Rcvd Split Completion Error Msg */
 #define  PCI_X_STATUS_266MHZ	0x4000	/* 266 MHz capable */
 #define  PCI_X_STATUS_533MHZ	0x8000	/* 533 MHz capable */
+#define PCI_X_ECC_CSR		8	/* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0	8	/* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V12	24	/* size for Version 1 & 2 */

 /* PCI Bridge Subsystem ID registers */

@@ -451,6 +463,7 @@
 #define  PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */
 #define  PCI_EXP_LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */
 #define  PCI_EXP_LNKSTA_LABS	0x8000	/* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1	20	/* v1 endpoints end here */
 #define PCI_EXP_SLTCAP		20	/* Slot Capabilities */
 #define  PCI_EXP_SLTCAP_ABP	0x0001	/* Attention Button Present */
 #define  PCI_EXP_SLTCAP_PCP	0x0002	/* Power Controller Present */
@@ -498,6 +511,7 @@
 #define  PCI_EXP_DEVCAP2_ARI	0x20	/* Alternative Routing-ID */
 #define PCI_EXP_DEVCTL2		40	/* Device Control 2 */
 #define  PCI_EXP_DEVCTL2_ARI	0x20	/* Alternative Routing-ID */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	44	/* v2 endpoints end here */
 #define PCI_EXP_LNKCTL2		48	/* Link Control 2 */
 #define PCI_EXP_SLTCTL2		56	/* Slot Control 2 */

@@ -506,20 +520,42 @@
 #define PCI_EXT_CAP_VER(header)		((header >> 16) & 0xf)
 #define PCI_EXT_CAP_NEXT(header)	((header >> 20) & 0xffc)

-#define PCI_EXT_CAP_ID_ERR	1
-#define PCI_EXT_CAP_ID_VC	2
-#define PCI_EXT_CAP_ID_DSN	3
-#define PCI_EXT_CAP_ID_PWR	4
-#define PCI_EXT_CAP_ID_VNDR	11
-#define PCI_EXT_CAP_ID_ACS	13
-#define PCI_EXT_CAP_ID_ARI	14
-#define PCI_EXT_CAP_ID_ATS	15
-#define PCI_EXT_CAP_ID_SRIOV
Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
On Thu, 2010-10-28 at 12:32 -0700, Shirley Ma wrote: Also, I found a big TX regression between old guests and new guests. For an old guest, I am able to get almost 11Gb/s for 2K message size, but for the new guest kernel I can only get 3.5Gb/s with the patch and the same host. I will dig into why.

The regression is from the guest kernel, not from this patch. I tested the 2.6.31 kernel; its performance is already less than 2Gb/s for 2K message size. I will resubmit the patch for review.

I will start testing from the 2.6.30 kernel to figure out when the TX regression was introduced in virtio_net. Any suggestion on which guest kernel I should test to find this regression?

Thanks
Shirley
Re: [Qemu-devel] Hitting 29 NIC limit (+Intel VT-c)
On Thu, 14 Oct 2010 14:07 +0200, Avi Kivity a...@redhat.com wrote: On 10/14/2010 12:54 AM, Anthony Liguori wrote: On 10/13/2010 05:32 PM, Anjali Kulkarni wrote: What's the motivation for such a huge number of interfaces?

Ultimately to bring multiple 10Gb bonds into a Vyatta guest.

---

BTW, I don't think it's possible to hot-add physical functions.

I believe I know of a card that supports dynamic add of physical functions (pre-dating SR-IOV). I don't know what you're talking about, but it seems you have a better handle than I on this VT-c stuff, so perhaps misguidedly I'll direct my next question to you.

Is additional configuration required to make use of SR-IOV VT-q? I don't immediately understand how the queueing knows who is who in the absence of eth.vlan, or if I need to for that matter. My hope is that this is something like plug-n-play as long as kernel and host driver versions are foo, but I haven't yet found documentation to confirm it.

For the sake of future queries, I've come across these references so far:
http://download.intel.com/design/network/applnots/321211.pdf
http://www.linux-kvm.org/wiki/images/6/6a/KvmForum2008%24kdf2008_7.pdf
http://www.mail-archive.com/kvm@vger.kernel.org/msg27860.html
http://www.mail-archive.com/kvm@vger.kernel.org/msg22721.html
http://thread.gmane.org/gmane.linux.kernel.mm/38508
http://ark.intel.com/Product.aspx?id=36918
Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
On Thu, 2010-10-28 at 13:13 -0700, Shirley Ma wrote: On Thu, 2010-10-28 at 12:32 -0700, Shirley Ma wrote: Also, I found a big TX regression between old guests and new guests. For an old guest, I am able to get almost 11Gb/s for 2K message size, but for the new guest kernel I can only get 3.5Gb/s with the patch and the same host. I will dig into why.

The regression is from the guest kernel, not from this patch. I tested the 2.6.31 kernel; its performance is already less than 2Gb/s for 2K message size. I will resubmit the patch for review. I will start testing from the 2.6.30 kernel to figure out when the TX regression was introduced in virtio_net. Any suggestion on which guest kernel I should test to find this regression?

It would be some change in the virtio-net driver that may have improved the latency of small messages, which in turn would have reduced the bandwidth as TCP could not accumulate and send large packets.

Thanks
Sridhar
Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
On Thu, 2010-10-28 at 14:04 -0700, Sridhar Samudrala wrote: It would be some change in the virtio-net driver that may have improved the latency of small messages, which in turn would have reduced the bandwidth as TCP could not accumulate and send large packets.

I will check out any latency improvement patches in virtio_net. If that's the case, would it be good to have a tunable parameter to benefit both BW and latency workloads?

Shirley
[RFC][PATCH 0/3] KVM page cache optimization (v3)
This is version 3 of the page cache control patches.

From: Balbir Singh bal...@linux.vnet.ibm.com

This series has three patches: the first controls the amount of unmapped page cache usage via a boot parameter and sysctl. The second patch controls page and slab cache via the balloon driver. Both patches make heavy use of the zone_reclaim() functionality already present in the kernel. The last patch in the series is against QEmu, to make the ballooning hint optional.

V2 was posted a long time back (see http://lwn.net/Articles/391293/). One of the review suggestions was to make the hint optional (discussed in the community call as well). I'd appreciate any test results with the patches.

TODO
1. libvirt exploits for the optional hint

page-cache-control
balloon-page-cache
provide-memory-hint-during-ballooning

---
 b/balloon.c                       |   18 +++-
 b/balloon.h                       |    4
 b/drivers/virtio/virtio_balloon.c |   17 +++
 b/hmp-commands.hx                 |    7 +
 b/hw/virtio-balloon.c             |   14 ++-
 b/hw/virtio-balloon.h             |    3
 b/include/linux/gfp.h             |    8 +
 b/include/linux/mmzone.h          |    2
 b/include/linux/swap.h            |    3
 b/include/linux/virtio_balloon.h  |    3
 b/mm/page_alloc.c                 |    9 +-
 b/mm/vmscan.c                     |  162 --
 b/qmp-commands.hx                 |    7 -
 include/linux/swap.h              |    9 --
 mm/page_alloc.c                   |    3
 mm/vmscan.c                       |    2
 16 files changed, 202 insertions(+), 69 deletions(-)

--
Three Cheers,
Balbir
[RFC][PATCH 1/3] Linux/Guest unmapped page cache control
Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario:

- In a virtualized environment with cache=writethrough, we see double caching - one in the host and one in the guest. As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache.

The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid aggressiveness in reaping page cache ever so frequently, at the same time providing control.

The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim.

Host usage without boot parameter (memory in KB):

 MemFree  Cached  Time
 19900    292912  137
 17540    296196  139
 17900    296124  141
 19356    296660  141

Host usage: (memory in KB)

 RSS      Cache   mapped  swap
 2788664  781884  3780    359536

Guest usage with boot parameter (memory in KB):

 MemFree  Cached  Time
 244824   74828   144
 237840   81764   143
 235880   83044   138
 239312   80092   148

Host usage: (memory in KB)

 RSS      Cache   mapped  swap
 2700184  958012  334848  398412

TODOS
1. Balance slab cache as well

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |    2 -
 include/linux/swap.h   |    3 +
 mm/page_alloc.c        |    9 ++-
 mm/vmscan.c            |  162
 4 files changed, 132 insertions(+), 44 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..a591a7a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -300,12 +300,12 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];

+	unsigned long		min_unmapped_pages;
 #ifdef CONFIG_NUMA
 	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7cdd633..5d29097 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -251,10 +251,11 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
+extern int sysctl_min_unmapped_ratio;

 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f12ad18..d8fe29f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1642,6 +1642,9 @@ zonelist_scan:
 		unsigned long mark;
 		int ret;

+		if (should_balance_unmapped_pages(zone))
+			wakeup_kswapd(zone, order);
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (zone_watermark_ok(zone, order, mark,
 				      classzone_idx, alloc_flags))
@@ -4101,10 +4104,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-
[RFC][PATCH 2/3] Linux/Guest cooperative unmapped page cache control
Balloon unmapped page cache pages first From: Balbir Singh bal...@linux.vnet.ibm.com This patch builds on the ballooning infrastructure by ballooning unmapped page cache pages first. It looks for low hanging fruit first and tries to reclaim clean unmapped pages first. This patch brings zone_reclaim() and other dependencies out of CONFIG_NUMA and then reuses the zone_reclaim_mode logic if __GFP_FREE_CACHE is passed in the gfp_mask. The virtio balloon driver has been changed to use __GFP_FREE_CACHE. During fill_balloon(), the driver looks for hints provided by the hypervisor to reclaim cached memory. By default the hint is off and can be turned on by passing an argument that specifies that we intend to reclaim cached memory. Tests: Test 1 -- I ran a simple filter function that kept frequently ballon a single VM running kernbench. The VM was configured with 2GB of memory and 2 VCPUs. The filter function was a triangular wave function that ballooned the VM under study from 500MB to 1500MB using a triangular wave function continously. The run times of the VM with and without changes are shown below. The run times showed no significant impact of the changes. Withchanges Elapsed Time 223.86 (1.52822) User Time 191.01 (0.65395) System Time 199.468 (2.43616) Percent CPU 174 (1) Context Switches 103182 (595.05) Sleeps 39107.6 (1505.67) Without changes Elapsed Time 225.526 (2.93102) User Time 193.53 (3.53626) System Time 199.832 (3.26281) Percent CPU 173.6 (1.14018) Context Switches 103744 (1311.53) Sleeps 39383.2 (831.865) The key advantage was that it resulted in lesser RSS usage in the host and more cached usage, indicating that the caching had been pushed towards the host. The guest cached memory usage was lower and free memory in the guest was also higher. Test 2 -- I ran kernbench under the memory overcommit manager (6 VM's with 2 vCPUs, 2GB) with KSM and ksmtuned enabled. memory overcommit manager details are at http://github.com/aglitke/mom/wiki. 
The command line for kernbench was "kernbench -M". The tests showed the
following:

With changes
Elapsed Time 842.936 (12.2247)
Elapsed Time 844.266 (25.8047)
Elapsed Time 844.696 (11.2433)
Elapsed Time 846.08  (14.0249)
Elapsed Time 838.58  (7.44609)
Elapsed Time 842.362 (4.37463)

Without changes
Elapsed Time 837.604 (14.1311)
Elapsed Time 839.322 (17.1772)
Elapsed Time 843.744 (9.21541)
Elapsed Time 842.592 (7.48622)
Elapsed Time 844.272 (25.486)
Elapsed Time 838.858 (7.5044)

General observations

1. Free memory in each of the guests was higher with the changes. The
   additional free memory was of the order of 120MB per VM.
2. Cached memory in each guest was lower with the changes.
3. Host free memory was almost constant (independent of the changes).
4. Host anonymous memory usage was lower with the changes.

The goal of this patch is to free up memory locked up in duplicated cache
contents, and (1) above shows that we are able to successfully free it up.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 drivers/virtio/virtio_balloon.c |   17 +++--
 include/linux/gfp.h             |    8 +++-
 include/linux/swap.h            |    9 +++--
 include/linux/virtio_balloon.h  |    3 +++
 mm/page_alloc.c                 |    3 ++-
 mm/vmscan.c                     |    2 +-
 6 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0f1da45..70f97ea 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -99,12 +99,24 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
+	u32 reclaim_cache_first;
+	int err;
+	gfp_t mask = GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC |
+			__GFP_NOWARN;
+
+	err = virtio_config_val(vb->vdev, VIRTIO_BALLOON_F_BALLOON_HINT,
+				offsetof(struct virtio_balloon_config,
+					 reclaim_cache_first),
+				&reclaim_cache_first);
+
+	if (!err && reclaim_cache_first)
+		mask |= __GFP_FREE_CACHE;
+
 	/* We can only do one array worth at a time.
	 */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-					__GFP_NOMEMALLOC | __GFP_NOWARN);
+		struct page *page = alloc_page(mask);
 		if (!page) {
 			if (printk_ratelimit())
 				dev_printk(KERN_INFO, &vb->vdev->dev,
@@ -358,6 +370,7 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev)
 static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
+	VIRTIO_BALLOON_F_BALLOON_HINT,
 };
 
 static struct
[RFC][PATCH 3/3] QEmu changes to provide balloon hint
Provide memory hint during ballooning

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch adds an optional hint to the qemu monitor balloon command. The
hint tells the guest operating system to consider a class of memory during
reclaim. Currently the supported hint is cached memory. The design is
generic and can be extended to provide other hints in the future if
required.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 balloon.c           |   18 ++
 balloon.h           |    4 +++-
 hmp-commands.hx     |    7 +--
 hw/virtio-balloon.c |   15 +++
 hw/virtio-balloon.h |    3 +++
 qmp-commands.hx     |    7 ---
 6 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/balloon.c b/balloon.c
index 0021fef..b2bdda5 100644
--- a/balloon.c
+++ b/balloon.c
@@ -41,11 +41,13 @@ void qemu_add_balloon_handler(QEMUBalloonEvent *func, void *opaque)
     qemu_balloon_event_opaque = opaque;
 }
 
-int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque)
+int qemu_balloon(ram_addr_t target, bool reclaim_cache_first,
+                 MonitorCompletion cb, void *opaque)
 {
     if (qemu_balloon_event) {
         trace_balloon_event(qemu_balloon_event_opaque, target);
-        qemu_balloon_event(qemu_balloon_event_opaque, target, cb, opaque);
+        qemu_balloon_event(qemu_balloon_event_opaque, target,
+                           reclaim_cache_first, cb, opaque);
         return 1;
     } else {
         return 0;
@@ -55,7 +57,7 @@ int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque)
 int qemu_balloon_status(MonitorCompletion cb, void *opaque)
 {
     if (qemu_balloon_event) {
-        qemu_balloon_event(qemu_balloon_event_opaque, 0, cb, opaque);
+        qemu_balloon_event(qemu_balloon_event_opaque, 0, 0, cb, opaque);
         return 1;
     } else {
         return 0;
@@ -131,13 +133,21 @@ int do_balloon(Monitor *mon, const QDict *params,
                MonitorCompletion cb, void *opaque)
 {
     int ret;
+    int val;
+    const char *cache_hint;
+    int reclaim_cache_first = 0;
 
     if (kvm_enabled() && !kvm_has_sync_mmu()) {
         qerror_report(QERR_KVM_MISSING_CAP, "synchronous MMU", "balloon");
         return -1;
     }
-    ret = qemu_balloon(qdict_get_int(params, "value"), cb, opaque);
+    val = qdict_get_int(params, "value");
+    cache_hint = qdict_get_try_str(params, "hint");
+    if (cache_hint)
+        reclaim_cache_first = 1;
+
+    ret = qemu_balloon(val, reclaim_cache_first, cb, opaque);
     if (ret == 0) {
         qerror_report(QERR_DEVICE_NOT_ACTIVE, "balloon");
         return -1;
diff --git a/balloon.h b/balloon.h
index d478e28..65d68c1 100644
--- a/balloon.h
+++ b/balloon.h
@@ -17,11 +17,13 @@
 #include "monitor.h"
 
 typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target,
+                                bool reclaim_cache_first,
                                 MonitorCompletion cb, void *cb_data);
 
 void qemu_add_balloon_handler(QEMUBalloonEvent *func, void *opaque);
-int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque);
+int qemu_balloon(ram_addr_t target, bool reclaim_cache_first,
+                 MonitorCompletion cb, void *opaque);
 int qemu_balloon_status(MonitorCompletion cb, void *opaque);
 
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 81999aa..80e42aa 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -925,8 +925,8 @@ ETEXI
 
     {
         .name       = "balloon",
-        .args_type  = "value:M",
-        .params     = "target",
+        .args_type  = "value:M,hint:s?",
+        .params     = "target [cache]",
         .help       = "request VM to change its memory allocation (in MB)",
         .user_print = monitor_user_noop,
         .mhandler.cmd_async = do_balloon,
@@ -937,6 +937,9 @@ STEXI
 @item balloon @var{value}
 @findex balloon
 Request VM to change its memory allocation to @var{value} (in MB).
+An optional @var{hint} can be specified to indicate if the guest
+should reclaim from the cached memory in the guest first. The
+@var{hint} may be ignored by the guest.
 ETEXI
 
     {
diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 8adddea..e363507 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -44,6 +44,7 @@ typedef struct VirtIOBalloon
     size_t stats_vq_offset;
     MonitorCompletion *stats_callback;
     void *stats_opaque_callback_data;
+    uint32_t reclaim_cache_first;
 } VirtIOBalloon;
 
 static VirtIOBalloon *to_virtio_balloon(VirtIODevice *vdev)
@@ -181,8 +182,11 @@ static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
 
     config.num_pages = cpu_to_le32(dev->num_pages);
     config.actual = cpu_to_le32(dev->actual);
-
-    memcpy(config_data, &config, 8);
+    if (vdev->guest_features & (1 << VIRTIO_BALLOON_F_BALLOON_HINT)) {
+        config.reclaim_cache_first = cpu_to_le32(dev->reclaim_cache_first);
+        memcpy(config_data, &config, 12);
+    } else
+        memcpy(config_data, &config, 8);
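Given the new "value:M,hint:s?" argument spec, a monitor session using the
hint would presumably look like this (the 1024 MB target is illustrative):

```
(qemu) balloon 1024 cache
```

Omitting the trailing "cache" keyword leaves reclaim_cache_first at 0, so
the command keeps its old behavior for existing users.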
[PATCH iproute2] Support 'mode' parameter when creating macvtap device
Add support for a 'mode' parameter when creating a macvtap device. This
allows a macvtap device to be created in bridge, private or the default
vepa mode.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
---
diff --git a/ip/Makefile b/ip/Makefile
index 2f223ca..6054e8a 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
     iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
-    iplink_macvlan.o
+    iplink_macvlan.o iplink_macvtap.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
new file mode 100644
index 000..35199b1
--- /dev/null
+++ b/ip/iplink_macvtap.c
@@ -0,0 +1,90 @@
+/*
+ * iplink_macvtap.c	macvtap device support
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+		"Usage: ...
+		" macvtap mode { private | vepa | bridge }\n"
+	);
+}
+
+static int mode_arg(void)
+{
+	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
+		"\"vepa\" or \"bridge\"\n");
+	return -1;
+}
+
+static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
+			     struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "mode") == 0) {
+			__u32 mode = 0;
+			NEXT_ARG();
+
+			if (strcmp(*argv, "private") == 0)
+				mode = MACVLAN_MODE_PRIVATE;
+			else if (strcmp(*argv, "vepa") == 0)
+				mode = MACVLAN_MODE_VEPA;
+			else if (strcmp(*argv, "bridge") == 0)
+				mode = MACVLAN_MODE_BRIDGE;
+			else
+				return mode_arg();
+
+			addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "macvtap: what is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	__u32 mode;
+
+	if (!tb)
+		return;
+
+	if (!tb[IFLA_MACVLAN_MODE] ||
+	    RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) < sizeof(__u32))
+		return;
+
+	/* Note: the posted patch read tb[IFLA_VLAN_ID] here, which looks like
+	 * a copy-and-paste slip; the attribute checked above and carrying the
+	 * mode is IFLA_MACVLAN_MODE. */
+	mode = *(__u32 *)RTA_DATA(tb[IFLA_MACVLAN_MODE]);
+	fprintf(f, "mode %s ",
+		mode == MACVLAN_MODE_PRIVATE ? "private"
+		: mode == MACVLAN_MODE_VEPA   ? "vepa"
+		: mode == MACVLAN_MODE_BRIDGE ? "bridge"
+		:				"unknown");
+}
+
+struct link_util macvtap_link_util = {
+	.id		= "macvtap",
+	.maxattr	= IFLA_MACVLAN_MAX,
+	.parse_opt	= macvtap_parse_opt,
+	.print_opt	= macvtap_print_opt,
+};
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH iproute2] macvlan/macvtap: support 'passthru' mode
Add support for 'passthru' mode when creating a macvlan/macvtap device,
which allows takeover of the underlying device so it can be passed to a
KVM guest using virtio with a macvtap backend. Only one macvlan device is
allowed in passthru mode; it inherits the mac address of the underlying
device and sets it in promiscuous mode to receive and forward all the
packets.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
---
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index f5bb2dc..23de79e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -230,6 +230,7 @@ enum macvlan_mode {
 	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
 	MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */
 	MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
+	MACVLAN_MODE_PASSTHRU = 8, /* take over the underlying device */
 };
 
 /* SR-IOV virtual function management section */
diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
index a3c78bd..15022aa 100644
--- a/ip/iplink_macvlan.c
+++ b/ip/iplink_macvlan.c
@@ -23,14 +23,14 @@ static void explain(void)
 {
 	fprintf(stderr,
-		"Usage: ... macvlan mode { private | vepa | bridge }\n"
+		"Usage: ... macvlan mode { private | vepa | bridge | passthru }\n"
 	);
 }
 
 static int mode_arg(void)
 {
 	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
-		"\"vepa\" or \"bridge\"\n");
+		"\"vepa\", \"bridge\" or \"passthru\"\n");
 	return -1;
 }
 
@@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			mode = MACVLAN_MODE_VEPA;
 		else if (strcmp(*argv, "bridge") == 0)
 			mode = MACVLAN_MODE_BRIDGE;
+		else if (strcmp(*argv, "passthru") == 0)
+			mode = MACVLAN_MODE_PASSTHRU;
 		else
 			return mode_arg();
 
@@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
 		mode == MACVLAN_MODE_PRIVATE ? "private"
 		: mode == MACVLAN_MODE_VEPA   ? "vepa"
 		: mode == MACVLAN_MODE_BRIDGE ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU ?
+		  "passthru"
 		:				"unknown");
 }
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
index 35199b1..5665b6d 100644
--- a/ip/iplink_macvtap.c
+++ b/ip/iplink_macvtap.c
@@ -20,14 +20,14 @@ static void explain(void)
 {
 	fprintf(stderr,
-		"Usage: ... macvtap mode { private | vepa | bridge }\n"
+		"Usage: ... macvtap mode { private | vepa | bridge | passthru }\n"
 	);
 }
 
 static int mode_arg(void)
 {
 	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
-		"\"vepa\" or \"bridge\"\n");
+		"\"vepa\", \"bridge\" or \"passthru\"\n");
 	return -1;
 }
 
@@ -45,6 +45,8 @@ static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
 			mode = MACVLAN_MODE_VEPA;
 		else if (strcmp(*argv, "bridge") == 0)
 			mode = MACVLAN_MODE_BRIDGE;
+		else if (strcmp(*argv, "passthru") == 0)
+			mode = MACVLAN_MODE_PASSTHRU;
 		else
 			return mode_arg();
 
@@ -79,6 +81,7 @@ static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
 		mode == MACVLAN_MODE_PRIVATE ? "private"
 		: mode == MACVLAN_MODE_VEPA   ? "vepa"
 		: mode == MACVLAN_MODE_BRIDGE ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU ? "passthru"
 		:				"unknown");
 }
Re: KVM devices assignment; PCIe AER?
On Wed, 27 Oct 2010, Alex Williamson wrote:

> KVM already has an internal IRQ ACK notifier (which is what current
> device assignment uses to do the same thing), it's just a matter of
> adding a callback that does a kvm_register_irq_ack_notifier that sends
> off the eventfd signal. I've got this working and will probably send out
> the KVM patch this week. For now the eventfd goes to userspace, but this
> is where I imagine we could steal some of the irqfd code to make VFIO
> consume the irqfd signal directly. Thanks,

Thanks for the clarification. I must admit I was somewhat confused about
that irqfd mechanism until I realized that all it does is consume an
eventfd from kernel context (like you pointed out earlier...). So from
userspace, I guess that means the same eventfd is going to be assigned to
both VFIO and KVM, right?

Going back to the original discussion, I think that device assignment over
VFIO is a great way to support PCIe AER for the assigned devices. I'm
going to spend some time in that direction for sure. In the meantime I'll
send some patches (shortly) that address the problem without any major
surgery to the current implementation.

thanks,
-Etienne
KVM Test report, kernel 1414115... qemu 013ddf74...
Hi, all,

This is the KVM test result against kvm.git
1414115b34b9ae69d260a2e4e5d2fd6e956b64b9 and qemu-kvm.git
013ddf74dd9ac698d0206effdf268c8768959099.

Currently qemu-kvm has a build failure on RHEL5 systems; this issue has
existed for about 1 month, so we build qemu-kvm on RHEL5u1 with a
workaround patch (attachment mail). The Linux guest slow-boot issue got
fixed. However, we found a regression: a 32-bit PAE Windows guest can not
boot up without ACPI on.

Fixed issue:
1. [KVM] Linux guest is too slow to boot up
   https://bugzilla.kernel.org/show_bug.cgi?id=17882

New issue:
1. [KVM] Noacpi Windows guest can not boot up on 32bit KVM host
   https://bugzilla.kernel.org/show_bug.cgi?id=21402

Four old issues:
1. ltp diotest running time is 2.54 times than before
   https://sourceforge.net/tracker/?func=detail&aid=2723366&group_id=180599&atid=893831
2. 32bit RHEL5/FC6 guest may fail to reboot after installation
   https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1991647&group_id=180599
3. perfctr wrmsr warning when booting 64bit RHEL5.3
   https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831
4.
[SR] qemu return form migrate command spend long time
   https://sourceforge.net/tracker/?func=detail&aid=2942079&group_id=180599&atid=893831

Test environment
================
Platform A:   Westmere-EP
CPU:          8
Memory size:  12G

============= Summary Test Report of Last Session =============
                          Total  Pass  Fail  NoResult  Crash
===============================================================
control_panel_ept_vpid       12    12     0         0      0
control_panel_vpid            3     3     0         0      0
control_panel                 3     3     0         0      0
control_panel_ept             4     4     0         0      0
gtest_vpid                    1     1     0         0      0
gtest_ept                     1     1     0         0      0
gtest                         3     3     0         0      0
vtd_ept_vpid                  8     8     0         0      0
gtest_ept_vpid               11    11     0         0      0
sriov_ept_vpid                5     5     0         0      0
===============================================================
control_panel_ept_vpid       12    12     0         0      0
 :KVM_LM_Continuity_64_g3     1     1     0         0      0
 :KVM_four_dguest_64_g32e     1     1     0         0      0
 :KVM_1500M_guest_64_gPAE     1     1     0         0      0
 :KVM_LM_SMP_64_g32e          1     1     0         0      0
 :KVM_SR_SMP_64_g32e          1     1     0         0      0
 :KVM_linux_win_64_g32e       1     1     0         0      0
 :KVM_1500M_guest_64_g32e     1     1     0         0      0
 :KVM_two_winxp_64_g32e       1     1     0         0      0
 :KVM_256M_guest_64_gPAE      1     1     0         0      0
 :KVM_SR_Continuity_64_g3     1     1     0         0      0
 :KVM_256M_guest_64_g32e      1     1     0         0      0
 :KVM_four_sguest_64_g32e     1     1     0         0      0
control_panel_vpid            3     3     0         0      0
 :KVM_linux_win_64_g32e       1     1     0         0      0
 :KVM_1500M_guest_64_g32e     1     1     0         0      0
 :KVM_1500M_guest_64_gPAE     1     1     0         0      0
control_panel                 3     3     0         0      0
 :KVM_1500M_guest_64_g32e     1     1     0         0      0
 :KVM_1500M_guest_64_gPAE     1     1     0         0      0
 :KVM_LM_SMP_64_g32e          1     1     0         0      0
control_panel_ept             4     4     0         0      0
 :KVM_linux_win_64_g32e       1     1     0         0      0
 :KVM_1500M_guest_64_g32e     1     1     0         0      0
 :KVM_1500M_guest_64_gPAE     1     1     0         0      0
 :KVM_LM_SMP_64_g32e          1     1     0         0      0
gtest_vpid                    1     1     0         0      0
 :boot_smp_win7_ent_64_g3     1     1     0         0      0
gtest_ept                     1     1     0         0      0
 :boot_smp_win7_ent_64_g3     1     1     0         0      0
gtest                         3     3     0         0      0
 :boot_smp_win2008_64_g32     1     1     0         0      0
 :boot_smp_win7_ent_64_gP     1     1     0         0      0
 :boot_smp_vista_64_g32e      1     1     0         0      0
vtd_ept_vpid                  8     8     0         0      0
 :one_pcie_up_64_g32e         1     1     0         0      0
 :hp_pcie_smp_nomsi_64_g3     1     1     0         0      0
 :lm_pcie_smp_64_g32e         1     1     0         0      0