Re: windows 2008 guest causing rcu_sched to emit NMI
On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote:
On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote:
On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:

Thank you Marcelo,

The host node now locks up somewhat later than yesterday, but the problem is still here; please see the attached dmesg. The stuck process looks like

root 19251 0.0 0.0 228476 12488 ? D 14:42 0:00 /usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device virtio-blk-pci,? -device

on the fourth VM by count. Should I try an upstream kernel instead of applying the patch to the latest 3.4, or is that useless?

If you can upgrade to an upstream kernel, please do that.

With vanilla 3.7.4 almost nothing changes, and the NMI started firing again. The external symptoms are as follows: starting from some VM count, maybe the third or the sixth VM, the qemu-kvm process allocates its memory very slowly and in jumps, 20M-200M-700M-1.6G over minutes. The patch helps, of course: on both patched 3.4 and vanilla 3.7 I am able to kill the stuck kvm processes and the node returns to normal, whereas on 3.2 sending SIGKILL to the process causes zombies and hung ``ps'' output (the problem and a workaround when no scheduler is involved are described here: http://www.spinics.net/lists/kvm/msg84799.html).

Try disabling pause loop exiting with ple_gap=0 kvm-intel.ko module parameter.

Hi Marcelo, thanks, this parameter increased the number of working VMs by about half an order of magnitude, from 3-4 to 10-15.
A very high SY load, 10 to 15 percent, persists at such VM counts for a long time, whereas Linux guests in the same configuration do not go over one percent even under a stress benchmark. After I disabled HT, the crash happens only in long runs and now it is a kernel panic :) The stair-like memory allocation behaviour disappeared, but another symptom leading to the crash, which I had not counted previously, persists: if the VM count is ``enough'' for a crash, some qemu processes start to eat one core each, and they will panic the system after running for tens of minutes in that state, or if I try to attach a debugger to one of them. If needed, I can log the entire crash output via netconsole; for now I have only the tail, almost the same every time: http://xdel.ru/downloads/btwin.png

Yes, please log entire crash output, thanks.

Here please, 3.7.4-vanilla, 16 vms, ple_gap=0: http://xdel.ru/downloads/oops-default-kvmintel.txt

Just an update: I was able to reproduce that on pure Linux VMs using qemu-1.3.0 with the ``stress'' benchmark running on them; the panic occurs at the start of a VM (with ten working machines at that moment). Qemu-1.1.2 generally is not able to reproduce that, but a host node with the older version crashes with fewer Windows VMs (three to six instead of ten to fifteen) than with 1.3, please see the trace below: http://xdel.ru/downloads/oops-old-qemu.txt

Single bit memory error, apparently. Try:
1. memtest86.
2. Boot with slub_debug=ZFPU kernel parameter.
3. 
Reproduce on different machine

Hi Marcelo,

I always follow the rule: if some weird bug exists, check it on an ECC-enabled machine and check the IPMI logs too before starting to complain :)

I have finally managed to ``fix'' the problem, but my solution seems a bit strange:
- I noticed that if the virtual machines are started without any cgroup setting, they do not cause this bug under any conditions,
- I had assumed, quite wrongly, that CONFIG_SCHED_AUTOGROUP would regroup only the tasks without any cgroup and would not touch tasks already inside an existing cpu cgroup. A first look at the 200-line patch shows that autogrouping always applies to all tasks, so I tried to disable it,
- wild magic appears: the VMs did not crash the host any more, even with a count of 30+ they work fine.

I still do not know what exactly triggered that and whether I will face it again under different conditions, so my solution is more likely a patch of mud in the wall of the dam than a proper fix.

There seem to be two possible origins of such an error: a very, very hideous race condition involving cgroups and processes like qemu-kvm that cause frequent context switches, or a simple incompatibility between NUMA, the logic of CONFIG_SCHED_AUTOGROUP and qemu VMs already doing work in the cgroup, since I have not observed these errors on a single NUMA node (meaning a desktop) under relatively heavier conditions.

-- To unsubscribe from this list: send the line unsubscribe kvm in
Re: What to do about non-qdevified devices?
On 30.01.2013 08:02, Markus Armbruster wrote:
Anthony Liguori aligu...@us.ibm.com writes:
[...]
The problems I ran into were (1) this is a lot of work (2) it basically requires that all bus children have been qdev/QOM-ified. Even with something like the ISA bus which is where I started, quite a few devices were not qdevified still.

So what's the plan to complete the qdevification job? Lay really low and quietly hope the problem goes away? We've tried that for about three years, doesn't seem to work.

Stating (file) names would make that discussion much easier... ;)

I'd expect non-qdev'ified devices to rather be SysBusDevices (e.g., m68k, sh4, ppc). PReP's pc87312 qdev'ification was forgotten for 1.2 and recently merged.

Would dma.c be a candidate for ISADevice? It uses isa_* API. (The stubs in sun4m.c/sun4u.c due to use in fdc.c might be a candidate for stubs/ at least, short of an fdc.c rewrite.)

I recently went through all ISADevices and QOM'ified them: https://lists.gnu.org/archive/html/qemu-devel/2012-11/msg02746.html

It became too late for 1.4 and I'm not quite sure where Anthony wanted to draw the line between his 1) and 2): https://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00071.html

Thus I've only been rebasing my queue [1] without sending a v2 so far. Lack of an official ISA maintainer for reviewing is another issue, any volunteers? :)

Cheers,
Andreas

[1] https://github.com/afaerber/qemu-cpu/commits/realize-isa

--
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] s390/kvm fixes
On 29/01/13 22:03, Gleb Natapov wrote:
The question about 1/1. It is CCed to stable, does this mean you want it to go to 3.8? kvm-next is for 3.9.

On the second thought, if it is not a regression 3.9 is the right place.

The store status part is broken, but it only has a severe impact in case of a machine check. (The machine check handler revalidates all registers with the content of the save area). Since machine checks are part of the virtio-ccw code, this can go into 3.9. Feel free to remove the CC:stable.

Christian
Re: [Qemu-devel] QEMU buildbot maintenance state
Hi,

Gerd: Are you willing to co-maintain the QEMU buildmaster with Daniel and Christian? It would be awesome if you could do this given your experience running and customizing buildbot.

I'll try to set aside some time for that. Christian's idea to host the config at github is good, that certainly makes it easier to spread the work across more people.

Another thing which would be helpful: Any chance we can set up a maintainer tree mirror @ git.qemu.org? A single repository where each maintainer tree shows up as a branch?

This would make the buildbot setup *a lot* easier. We can go for an AnyBranchScheduler then with BuildFactory and BuildConfig shared, instead of needing one BuildFactory and BuildConfig per branch. It also makes the buildbot web interface less cluttered as we don't have an insane amount of BuildConfigs any more. And it saves some resources (bandwidth + diskspace) for the buildslaves.

I think it would also be a nice service for people who want to see what is coming, or who want to test what is cooking, to have a one-stop shop where they can get everything.

cheers,
Gerd
Re: QEMU buildbot maintenance state
On Tue, Jan 29, 2013 at 04:04:39PM +0100, Christian Berendt wrote:
On 01/28/2013 03:29 PM, Daniel Gollub wrote:
JFYI, the main buildbot configuration which controls everything (beside buildslave credentials) is accessible to everyone: http://people.b1-systems.de/~gollub/buildbot/

If you are familiar with buildbot feel free to incorporate your suggested changes directly on a copy and send me or Christian the diff so we just have to review and apply it.

I moved the configuration to GitHub (https://github.com/b1-systems/buildbot). I'll add a cron job to the buildbot system to regularly pull and apply the latest configuration. Simply open a pull request to modify the configuration.

Thanks Christian! I have updated the QEMU wiki page: http://wiki.qemu.org/ContinuousIntegration

Stefan
Re: [Qemu-devel] [PATCH V2 11/20] tap: support enabling or disabling a queue
On 01/30/2013 07:03 AM, Michael S. Tsirkin wrote: On Tue, Jan 29, 2013 at 04:55:25PM -0600, Anthony Liguori wrote: Michael S. Tsirkin m...@redhat.com writes: On Tue, Jan 29, 2013 at 08:10:26PM +, Blue Swirl wrote: On Tue, Jan 29, 2013 at 1:50 PM, Jason Wang jasow...@redhat.com wrote: On 01/26/2013 03:13 AM, Blue Swirl wrote: On Fri, Jan 25, 2013 at 10:35 AM, Jason Wang jasow...@redhat.com wrote: This patch introduce a new bit - enabled in TAPState which tracks whether a specific queue/fd is enabled. The tap/fd is enabled during initialization and could be enabled/disabled by tap_enalbe() and tap_disable() which calls platform specific helpers to do the real work. Polling of a tap fd can only done when the tap was enabled. Signed-off-by: Jason Wang jasow...@redhat.com --- include/net/tap.h |2 ++ net/tap-win32.c | 10 ++ net/tap.c | 43 --- 3 files changed, 52 insertions(+), 3 deletions(-) diff --git a/include/net/tap.h b/include/net/tap.h index bb7efb5..0caf8c4 100644 --- a/include/net/tap.h +++ b/include/net/tap.h @@ -35,6 +35,8 @@ int tap_has_vnet_hdr_len(NetClientState *nc, int len); void tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr); void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, int ecn, int ufo); void tap_set_vnet_hdr_len(NetClientState *nc, int len); +int tap_enable(NetClientState *nc); +int tap_disable(NetClientState *nc); int tap_get_fd(NetClientState *nc); diff --git a/net/tap-win32.c b/net/tap-win32.c index 265369c..a2cd94b 100644 --- a/net/tap-win32.c +++ b/net/tap-win32.c @@ -764,3 +764,13 @@ void tap_set_vnet_hdr_len(NetClientState *nc, int len) { assert(0); } + +int tap_enable(NetClientState *nc) +{ +assert(0); abort() This is just to be consistent with the reset of the helpers in this file. 
+} + +int tap_disable(NetClientState *nc) +{ +assert(0); +} diff --git a/net/tap.c b/net/tap.c index 67080f1..95e557b 100644 --- a/net/tap.c +++ b/net/tap.c @@ -59,6 +59,7 @@ typedef struct TAPState { unsigned int write_poll : 1; unsigned int using_vnet_hdr : 1; unsigned int has_ufo: 1; +unsigned int enabled : 1; bool without bit field? Also to be consistent with other field. If you wish I can send patches to convert all those bit field to bool on top of this series. That would be nice, likewise for the assert(0). OK so let's go ahead with this patchset as is, and a cleanup patch will be send after 1.4 then. Why? I'd prefer that we didn't rush things into 1.4 just because. There's still ample time to respin a corrected series. Regards, Anthony Liguori Confused. Do you want the coding style rework of net/tap.c switching it from assert(0)/bitfields to abort()/bool for 1.4? I will send a new series with the patches that addresses Blue's comments on assert(0) and bitfields. Thanks Thanks VHostNetState *vhost_net; unsigned host_vnet_hdr_len; } TAPState; @@ -72,9 +73,9 @@ static void tap_writable(void *opaque); static void tap_update_fd_handler(TAPState *s) { qemu_set_fd_handler2(s-fd, - s-read_poll ? tap_can_send : NULL, - s-read_poll ? tap_send : NULL, - s-write_poll ? tap_writable : NULL, + s-read_poll s-enabled ? tap_can_send : NULL, + s-read_poll s-enabled ? tap_send : NULL, + s-write_poll s-enabled ? tap_writable : NULL, s); } @@ -339,6 +340,7 @@ static TAPState *net_tap_fd_init(NetClientState *peer, s-host_vnet_hdr_len = vnet_hdr ? 
sizeof(struct virtio_net_hdr) : 0; s-using_vnet_hdr = 0; s-has_ufo = tap_probe_has_ufo(s-fd); +s-enabled = 1; tap_set_offload(s-nc, 0, 0, 0, 0, 0); /* * Make sure host header length is set correctly in tap: @@ -737,3 +739,38 @@ VHostNetState *tap_get_vhost_net(NetClientState *nc) assert(nc-info-type == NET_CLIENT_OPTIONS_KIND_TAP); return s-vhost_net; } + +int tap_enable(NetClientState *nc) +{ +TAPState *s = DO_UPCAST(TAPState, nc, nc); +int ret; + +if (s-enabled) { +return 0; +} else { +ret = tap_fd_enable(s-fd); +if (ret == 0) { +s-enabled = 1; +tap_update_fd_handler(s); +} +return ret; +} +} + +int tap_disable(NetClientState *nc) +{ +TAPState *s = DO_UPCAST(TAPState, nc, nc); +int ret; + +if (s-enabled == 0) { +return 0; +} else { +ret = tap_fd_disable(s-fd); +if (ret == 0) { +qemu_purge_queued_packets(nc); +s-enabled = 0; +tap_update_fd_handler(s); +} +return ret; +} +} -- 1.7.1 -- To unsubscribe from this list: send the line
[Bug 53191] hardware error 0x80000021 on a KVM virtual machine with kernel 3.7
https://bugzilla.kernel.org/show_bug.cgi?id=53191

Gleb g...@redhat.com changed:

What |Removed |Added
CC   |        |g...@redhat.com

--- Comment #2 from Gleb g...@redhat.com 2013-01-30 09:51:48 ---
Can you try to load kvm-intel module with emulate_invalid_guest_state=0 flag?
Re: [PATCH 0/3] s390/kvm fixes
On Wed, Jan 30, 2013 at 09:51:24AM +0100, Christian Borntraeger wrote:
On 29/01/13 22:03, Gleb Natapov wrote:
The question about 1/1. It is CCed to stable, does this mean you want it to go to 3.8? kvm-next is for 3.9.

On the second thought, if it is not a regression 3.9 is the right place.

The store status part is broken, but it only has a severe impact in case of a machine check. (The machine check handler revalidates all registers with the content of the save area). Since machine checks are part of the virtio-ccw code, this can go into 3.9. Feel free to remove the CC:stable.

No reason to drop stable, but 3.8 will have to get the fix through stable after it is released.

--
Gleb.
Re: [Qemu-devel] What to do about non-qdevified devices? (was: KVM call minutes 2013-01-29)
On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
Anthony Liguori aligu...@us.ibm.com writes:
[...]
The problems I ran into were (1) this is a lot of work (2) it basically requires that all bus children have been qdev/QOM-ified. Even with something like the ISA bus which is where I started, quite a few devices were not qdevified still.

So what's the plan to complete the qdevification job? Lay really low and quietly hope the problem goes away? We've tried that for about three years, doesn't seem to work.

Do we have a list of not-yet-qdevified devices? Maybe we need to start saying "fix X, Y and Z or platform P is dropped from the next release". (This would of course be easier if we had a way to let users know that platform P was in danger...)

-- PMM
Re: [PATCH 0/3] s390/kvm fixes
On Fri, Jan 25, 2013 at 03:34:14PM +0100, Christian Borntraeger wrote:
Gleb, Marcelo,

here are 3 kvm fixes for kvm-next.

Christian Borntraeger (3):
  s390/kvm: Fix store status for ACRS/FPRS
  s390/virtio-ccw: Fix setup_vq error handling.
  s390/kvm: Fix instruction decoding

 arch/s390/kvm/kvm-s390.c      |  8
 arch/s390/kvm/kvm-s390.h      | 25 ++---
 drivers/s390/kvm/virtio_ccw.c | 20 +++-
 3 files changed, 33 insertions(+), 20 deletions(-)

Applied, thanks.

--
Gleb.
[PATCH 0/2] KVM: set_memory_region: Cleanup and new restriction
Patch 1: just rebased for this series.
Patch 2: an API change, so please let me know if you notice any problems.

Takuya Yoshikawa (2):
  KVM: set_memory_region: Identify the requested change explicitly
  KVM: set_memory_region: Disallow changing read-only attribute later

 Documentation/virtual/kvm/api.txt | 12 ++--
 virt/kvm/kvm_main.c               | 95 +
 2 files changed, 60 insertions(+), 47 deletions(-)

--
1.7.5.4
[PATCH 1/2 -v3] KVM: set_memory_region: Identify the requested change explicitly
KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify what kind of change is being requested by checking the arguments. The current code does this checking at various points in code and each condition being used there is not easy to understand at first glance.

This patch consolidates these checks and introduces an enum to name the possible changes to clean up the code.

Although this does not introduce any functional changes, there is one change which optimizes the code a bit: if we have nothing to change, the new code returns 0 immediately.

Note that the return value for this case cannot be changed since QEMU relies on it: we noticed this when we changed it to -EINVAL and got a section mismatch error at the final stage of live migration.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
---
 v2: updated iommu related parts
 v3: converted !(A == B) to A != B

 virt/kvm/kvm_main.c | 64 +++
 1 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a83ca63..64c5dc3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -719,6 +719,24 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 }
 
 /*
+ * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
+ * - create a new memory slot
+ * - delete an existing memory slot
+ * - modify an existing memory slot
+ *   -- move it in the guest physical memory space
+ *   -- just change its flags
+ *
+ * Since flags can be changed by some of these operations, the following
+ * differentiation is the best we can do for __kvm_set_memory_region():
+ */
+enum kvm_mr_change {
+	KVM_MR_CREATE,
+	KVM_MR_DELETE,
+	KVM_MR_MOVE,
+	KVM_MR_FLAGS_ONLY,
+};
+
+/*
  * Allocate some memory and give it an address in the guest physical address
  * space.
  *
@@ -737,6 +755,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	struct kvm_memory_slot old, new;
 	struct kvm_memslots *slots = NULL, *old_memslots;
 	bool old_iommu_mapped;
+	enum kvm_mr_change change;
 
 	r = check_memory_region_flags(mem);
 	if (r)
@@ -780,17 +799,30 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	old_iommu_mapped = old.npages;
 
-	/*
-	 * Disallow changing a memory slot's size or changing anything about
-	 * zero sized slots that doesn't involve making them non-zero.
-	 */
 	r = -EINVAL;
-	if (npages && old.npages && npages != old.npages)
-		goto out;
-	if (!npages && !old.npages)
+	if (npages) {
+		if (!old.npages)
+			change = KVM_MR_CREATE;
+		else { /* Modify an existing slot. */
+			if ((mem->userspace_addr != old.userspace_addr) ||
+			    (npages != old.npages))
+				goto out;
+
+			if (base_gfn != old.base_gfn)
+				change = KVM_MR_MOVE;
+			else if (new.flags != old.flags)
+				change = KVM_MR_FLAGS_ONLY;
+			else { /* Nothing to change. */
+				r = 0;
+				goto out;
+			}
+		}
+	} else if (old.npages) {
+		change = KVM_MR_DELETE;
+	} else /* Modify a non-existent slot: disallowed. */
 		goto out;
 
-	if ((npages && !old.npages) || (base_gfn != old.base_gfn)) {
+	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
 		/* Check for overlaps */
 		r = -EEXIST;
 		kvm_for_each_memslot(slot, kvm->memslots) {
@@ -808,20 +840,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		new.dirty_bitmap = NULL;
 
 	r = -ENOMEM;
-
-	/*
-	 * Allocate if a slot is being created. If modifying a slot,
-	 * the userspace_addr cannot change.
-	 */
-	if (!old.npages) {
+	if (change == KVM_MR_CREATE) {
 		new.user_alloc = user_alloc;
 		new.userspace_addr = mem->userspace_addr;
 
 		if (kvm_arch_create_memslot(&new, npages))
 			goto out_free;
-	} else if (npages && mem->userspace_addr != old.userspace_addr) {
-		r = -EINVAL;
-		goto out_free;
 	}
 
 	/* Allocate page dirty bitmap if needed */
@@ -830,7 +854,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			goto out_free;
 	}
 
-	if (!npages || base_gfn != old.base_gfn) {
+	if ((change == KVM_MR_DELETE) || (change == KVM_MR_MOVE)) {
 		r = -ENOMEM;
 		slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
 				GFP_KERNEL);
@@ -881,7 +905,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	 * slots (size changes, userspace
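The classification that this patch folds into one place can be tried out in isolation. The following is a minimal user-space sketch, assuming a simplified slot description: `struct sk_slot`, `sk_classify()` and the `SK_MR_*` names are hypothetical, and `SK_MR_NONE` stands in for the patch's "nothing to change, return 0 immediately" path. It is not the kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors enum kvm_mr_change from the patch; SK_MR_NONE is a hypothetical
 * extra value for the "r = 0; goto out" early return. */
enum sk_mr_change {
    SK_MR_CREATE,
    SK_MR_DELETE,
    SK_MR_MOVE,
    SK_MR_FLAGS_ONLY,
    SK_MR_NONE
};

/* Simplified stand-in for the fields the check looks at. */
struct sk_slot {
    uint64_t base_gfn;
    uint64_t npages;
    uint64_t userspace_addr;
    uint32_t flags;
};

/* Returns 0 and stores the kind of change, or -EINVAL if disallowed. */
int sk_classify(const struct sk_slot *old, const struct sk_slot *req,
                enum sk_mr_change *change)
{
    if (req->npages) {
        if (!old->npages) {
            *change = SK_MR_CREATE;
            return 0;
        }
        /* Modifying an existing slot: size and backing address are fixed. */
        if (req->userspace_addr != old->userspace_addr ||
            req->npages != old->npages)
            return -EINVAL;

        if (req->base_gfn != old->base_gfn)
            *change = SK_MR_MOVE;
        else if (req->flags != old->flags)
            *change = SK_MR_FLAGS_ONLY;
        else
            *change = SK_MR_NONE;   /* nothing to change */
        return 0;
    }
    if (old->npages) {
        *change = SK_MR_DELETE;
        return 0;
    }
    return -EINVAL;                 /* modifying a non-existent slot */
}
```

Naming the cases up front is what lets the later conditions in the patch read as `change == KVM_MR_CREATE` and so on, instead of re-deriving the case from `npages` and `base_gfn` each time.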
[PATCH 2/2] KVM: set_memory_region: Disallow changing read-only attribute later
As Xiao pointed out, there are a few problems with it:
- kvm_arch_commit_memory_region() write protects the memory slot only for GET_DIRTY_LOG when modifying the flags.
- FNAME(sync_page) uses the old spte value to set a new one without checking KVM_MEM_READONLY flag.

Since we flush all shadow pages when creating a new slot, the simplest fix is to disallow such problematic flag changes: this is safe because no one is doing such things.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Cc: Alex Williamson alex.william...@redhat.com
---
 Documentation/virtual/kvm/api.txt | 12 ++--
 virt/kvm/kvm_main.c               | 35 ---
 2 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 09905cb..0e03b19 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -874,12 +874,12 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical. This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flag, KVM_MEM_LOG_DIRTY_PAGES, which instructs
-kvm to keep track of writes to memory within the slot. See KVM_GET_DIRTY_LOG
-ioctl. The KVM_CAP_READONLY_MEM capability indicates the availability of the
-KVM_MEM_READONLY flag. When this flag is set for a memory region, KVM only
-allows read accesses. Writes will be posted to userspace as KVM_EXIT_MMIO
-exits.
+The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
+KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
+writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
+use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+to make a new slot read-only. In this case, writes to this memory will be
+posted to userspace as KVM_EXIT_MMIO exits.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest. For example, an
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 64c5dc3..2e93630 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -754,7 +754,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	struct kvm_memory_slot *slot;
 	struct kvm_memory_slot old, new;
 	struct kvm_memslots *slots = NULL, *old_memslots;
-	bool old_iommu_mapped;
 	enum kvm_mr_change change;
 
 	r = check_memory_region_flags(mem);
@@ -797,15 +796,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new.npages = npages;
 	new.flags = mem->flags;
 
-	old_iommu_mapped = old.npages;
-
 	r = -EINVAL;
 	if (npages) {
 		if (!old.npages)
 			change = KVM_MR_CREATE;
 		else { /* Modify an existing slot. */
 			if ((mem->userspace_addr != old.userspace_addr) ||
-			    (npages != old.npages))
+			    (npages != old.npages) ||
+			    ((new.flags ^ old.flags) & KVM_MEM_READONLY))
 				goto out;
 
 			if (base_gfn != old.base_gfn)
@@ -867,7 +865,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 		/* slot was deleted or moved, clear iommu mapping */
 		kvm_iommu_unmap_pages(kvm, &old);
-		old_iommu_mapped = false;
 		/* From this point no new shadow pages pointing to a deleted,
 		 * or moved, memslot will be created.
 		 *
@@ -898,25 +895,17 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	/*
 	 * IOMMU mapping: New slots need to be mapped. Old slots need to be
-	 * un-mapped and re-mapped if their base changes or if flags that the
-	 * iommu cares about change (read-only). Base change unmapping is
-	 * handled above with slot deletion, so we only unmap incompatible
-	 * flags here. Anything else the iommu might care about for existing
-	 * slots (size changes, userspace addr changes) is disallowed above,
-	 * so any other attribute changes getting here can be skipped.
+	 * un-mapped and re-mapped if their base changes. Since base change
+	 * unmapping is handled above with slot deletion, mapping alone is
+	 * needed here. Anything else the iommu might care about for existing
+	 * slots (size changes, userspace addr changes and read-only flag
+	 * changes) is disallowed above, so any other attribute changes getting
+	 * here can be skipped.
 	 */
-	if (change != KVM_MR_DELETE) {
-		if (old_iommu_mapped &&
-		    ((new.flags ^ old.flags) & KVM_MEM_READONLY)) {
-			kvm_iommu_unmap_pages(kvm, &old);
-			old_iommu_mapped = false;
-
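The new restriction boils down to one XOR-and-mask test on the flags of an existing slot. A minimal standalone sketch, assuming local `SK_MEM_*` constants that mirror the bit positions of `KVM_MEM_LOG_DIRTY_PAGES` and `KVM_MEM_READONLY`; the function names are hypothetical and not part of the KVM API:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the KVM flag bits (bit 0 and bit 1). */
#define SK_MEM_LOG_DIRTY_PAGES (1u << 0)
#define SK_MEM_READONLY        (1u << 1)

/* True if the read-only bit differs between the old and the new flags. */
bool sk_readonly_toggled(uint32_t old_flags, uint32_t new_flags)
{
    return ((new_flags ^ old_flags) & SK_MEM_READONLY) != 0;
}

/* Flag changes on an existing slot: dirty logging may be switched on or
 * off, but flipping the read-only attribute is rejected. */
int sk_check_flags_change(uint32_t old_flags, uint32_t new_flags)
{
    return sk_readonly_toggled(old_flags, new_flags) ? -EINVAL : 0;
}
```

XOR leaves a bit set only where the two flag words differ, so masking with the read-only bit catches both directions (read-only to writable and back) with a single comparison.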
vCPU hotplug roadmap (was: Minutes for KVM call 2013-01-15)
On 15.01.2013 17:16, Juan Quintela wrote:
* cpu hot plug
  - use qdev properties connected to a set of socket objects (anthony)
  - cpusets are the wrong interface (anthony)
  - make a link between cpu - socket instead of a property?
  - how far are we from being able to describe a cpu with -device? (didn't hear the answer, andreas?)
  - perhaps the best approach?
  - After soft-freeze, exceptions depend on the maintainer
  - After hard-freeze, no exceptions
  - -device doesn't require a bus, just an implementation detail, we can change that
  - use cpuset as an intermediate step until full vision is implemented
  - several approaches from where we are now, to have something before we get a full solution

At this point, Andreas agreed to write a better summary of the discussion and suggestions O:-)

Got buried, here we go:

== vCPU hot-plug user interfaces ==

=== cpu_set ===

Previously available in qemu-kvm.git: `cpu_set n+1 online` via HMP

Pros:
* Hides QOM/qdev implementation details (afaerber)
* Thus: Doesn't depend on QOM CPUState refactoring (imammedo)
* Opens a fast route to implementing vCPU unplug in KVM (imammedo)
* Unintrusive to add and easy to obsolete/remove in future (imammedo)
* Existing virt-test cases (afaerber)
* Supported by libvirt (imammedo)
* Prevents confusing guests by hot-plugging random mix of CPUs (agraf)

Cons:
* Cannot express topologies (ehabkost)

=== device_add ===

`device_add driver=Haswell-x86_64-cpu id=qdevid`
[You can try this today and see it failing / not working.]

Pros:
* QMP/HMP command available today and known to users (afaerber)
* Unified command for device and CPU hot-plug (imammedo)
* Would allow first doing thread-level vCPU hotplug (imammedo)
* Could be extended to support socket-level hot-plug (aliguori/imammedo)

Cons:
* Operates on raw QOM type name unlike -cpu (afaerber)
* Needs support in libvirt for device_add driver=CPU (imammedo)
* libvirt needs means to enumerate CPU types (imammedo)
  => QMP? (AF)

Challenges:
* No CPU qbus (afaerber)
  => should work without (aliguori)
* CPU subclasses needed for identifying type name (afaerber/imammedo)
  => Haswell-x86_64-cpu does not exist yet, just x86_64-cpu
* CPU class_init for -cpu host requires KVM init (imammedo)
  [suggestion by ehabkost to use kvm_arch_vcpu_init, WIP by afaerber]
* Conversion of CPU features to static properties needed (imammedo)
  => device_add driver=foo,level=x,xlevel=y,...
* Alternatively conversion to global properties (imammedo)
* Cements type names - rename for 1.4? (afaerber)
  => permissible (alig.) [patches for arm, m68k, openrisc, unicore32 on list]

=== qom-set ===

`qom-set` via QMP w/ link<CPUSocket> property (aliguori)

Topology represented in QOM:
CPUSocket has-a CPUCore has-a CPUThread a.k.a. CPUState, or
CPUSocket links-to CPUCore links-to CPUThread a.k.a. CPUState

Challenges (afaerber):
* No CPUSocket/CPUCore objects yet and may take a while to get there...
  topology fields being moved to CPUState for 1.4 [done, more WIP]
* No decisions on canonical paths for CPUs: CPU? machine? unassigned?
* Duality of thread-level device types and socket-level? (afaerber)
  => fine to have, e.g., quad-core Xeon 500 device (aliguori)
* CPUState is no_user (afaerber)
  => need to generally drop no_user for QOM (aliguori)

=== libvirt ===

libvirt's XML topology modelling is closer to today's -smp than to the desired QOM modelling: http://www.libvirt.org/formatcaps.html

`virsh setvcpus domain n`
http://libvirt.org/sources/virshcmdref/html/sect-setvcpus.html

== qom-cpu course of action (afaerber) ==

It was requested to have vCPU hot-plug in v1.5.

For device_add we need to move code from cpu_init() into QOM facilities.
=> QOM realize support would help [applied by aliguori]
=> cleanups piggy-backed onto CPU realizefn [applied to qom-cpu-next]

Agreement on goal of X86CPU subclasses, but conflicts how to get there:
* Refactor x86_def_t to X86CPUInfo for X86CPUClass class_init? (AF 2012)
* Refactor x86_def_t to X86CPU instance_init as done for arm?
* Refactor x86_def_t to class_inits? (afaerber)
  - heavy merge conflicts due to bug fixes / cleanups
  Pro: We can get things into a consistent QOM'ish state across targets.
  Con: We will refactor again on top for machine-compat properties.
* Keep x86_def_t within X86CPUClass as done for ppc? (WIP: afaerber)
  => smallest common denominator, separates x86 from cross-target work

APIC ID topology fixes are being reviewed for 1.4. [merged]
X86CPU wave 4 cleanups by Igor are being reviewed for 1.4. [merged]
Rename CPU types according to unified name-arch-cpu scheme for 1.4? (aliguori: permissible) [patches on list]
VMState series by Juan being rebased - subset for 1.4, rest for 1.5. [1.4 part on list, WIP for 1.5]
Remainder is considered 1.5 material, qom-cpu-next avail. during Freeze.

== Common issues (imammedo) ==

- back-port CPU hot-plug ACPI notification
- hot-plug is not allowed on SysBus:
- APIC that
RE: [PATCH 8/8] KVM:PPC:booke: Allow debug interrupt injection to guest
-Original Message- From: kvm-ppc-ow...@vger.kernel.org [mailto:kvm-ppc-ow...@vger.kernel.org] On Behalf Of Alexander Graf Sent: Friday, January 25, 2013 5:44 PM To: Bhushan Bharat-R65777 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Bhushan Bharat-R65777 Subject: Re: [PATCH 8/8] KVM:PPC:booke: Allow debug interrupt injection to guest On 16.01.2013, at 09:24, Bharat Bhushan wrote: Allow userspace to inject debug interrupt to guest. QEMU can s/QEMU/user space. inject the debug interrupt to guest if it is not able to handle the debug interrupt. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- arch/powerpc/kvm/booke.c | 32 +++- arch/powerpc/kvm/e500mc.c | 10 +- 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index faa0a0b..547797f 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -133,6 +133,13 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu) #endif } +#ifdef CONFIG_KVM_BOOKE_HV +static int kvmppc_core_pending_debug(struct kvm_vcpu *vcpu) { + return test_bit(BOOKE_IRQPRIO_DEBUG, +vcpu-arch.pending_exceptions); } #endif + /* * Helper function for full MSR writes. No need to call this if only * EE/CE/ME/DE/RI are changing. @@ -144,7 +151,11 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr) #ifdef CONFIG_KVM_BOOKE_HV new_msr |= MSR_GS; - if (vcpu-guest_debug) + /* +* Set MSR_DE if the hardware debug resources are owned by user-space +* and there is no debug interrupt pending for guest to handle. Why? QEMU is using the IAC/DAC registers to set hardware breakpoint/watchpoints via debug ioctls. As debug events are enabled/gated by MSR_DE so somehow we need to set MSR_DE on hardware MSR when guest is running in this case. On bookehv this is how I am controlling the MSR_DE in hardware MSR. And why is this whole thing only executed on HV? 
On e500v2 we always enable MSR_DE using vcpu-arch.shadow_msr in e500.c #ifndef CONFIG_KVM_BOOKE_HV - vcpu-arch.shadow_msr = MSR_USER | MSR_IS | MSR_DS; + vcpu-arch.shadow_msr = MSR_USER | MSR_DE | MSR_IS | MSR_DS; vcpu-arch.shadow_pid = 1; vcpu-arch.shared-msr = 0; #endif Thanks -Bharat Alex +*/ + if (vcpu-guest_debug !kvmppc_core_pending_debug(vcpu)) new_msr |= MSR_DE; #endif @@ -234,6 +245,16 @@ static void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu) clear_bit(BOOKE_IRQPRIO_WATCHDOG, vcpu-arch.pending_exceptions); } +static void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu) +{ + kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_DEBUG); +} + +static void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu) +{ + clear_bit(BOOKE_IRQPRIO_DEBUG, vcpu-arch.pending_exceptions); +} + static void set_guest_srr(struct kvm_vcpu *vcpu, unsigned long srr0, u32 srr1) { #ifdef CONFIG_KVM_BOOKE_HV @@ -1278,6 +1299,7 @@ static void get_sregs_base(struct kvm_vcpu *vcpu, sregs-u.e.dec = kvmppc_get_dec(vcpu, tb); sregs-u.e.tb = tb; sregs-u.e.vrsave = vcpu-arch.vrsave; + sregs-u.e.dbsr = vcpu-arch.dbsr; } static int set_sregs_base(struct kvm_vcpu *vcpu, @@ -1310,6 +1332,14 @@ static int set_sregs_base(struct kvm_vcpu *vcpu, update_timer_ints(vcpu); } + if (sregs-u.e.update_special KVM_SREGS_E_UPDATE_DBSR) { + vcpu-arch.dbsr = sregs-u.e.dbsr; + if (vcpu-arch.dbsr) + kvmppc_core_queue_debug(vcpu); + else + kvmppc_core_dequeue_debug(vcpu); + } + return 0; } diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c index 81abe92..7d90622 100644 --- a/arch/powerpc/kvm/e500mc.c +++ b/arch/powerpc/kvm/e500mc.c @@ -208,7 +208,7 @@ void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu); sregs-u.e.features |= KVM_SREGS_E_ARCH206_MMU | KVM_SREGS_E_PM | - KVM_SREGS_E_PC; + KVM_SREGS_E_PC | KVM_SREGS_E_ED; sregs-u.e.impl_id = KVM_SREGS_E_IMPL_FSL; sregs-u.e.impl.fsl.features = 0; @@ -216,6 +216,9 @@ void 
kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) sregs-u.e.impl.fsl.hid0 = vcpu_e500-hid0; sregs-u.e.impl.fsl.mcar = vcpu_e500-mcar; + sregs-u.e.dsrr0 = vcpu-arch.dsrr0; + sregs-u.e.dsrr1 = vcpu-arch.dsrr1; + kvmppc_get_sregs_e500_tlb(vcpu, sregs); sregs-u.e.ivor_high[3] = @@ -256,6 +259,11 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) sregs-u.e.ivor_high[5]; } + if (sregs-u.e.features KVM_SREGS_E_ED) { + vcpu-arch.dsrr0 = sregs-u.e.dsrr0; +
[PATCH V4 00/22] Multiqueue virtio-net
Hello all:

This series is an update of the last version of multiqueue virtio-net support. It brings multiqueue support to virtio-net through a multiqueue-capable tap backend and multiple vhost threads.

Patch 1 converts the bitfields in TAPState to bool. Patch 2 replaces assert(0) with abort() in tap.

To support this, multiqueue nic support was added to qemu. This is done by introducing an array of NetClientStates in NICState and making each pair of peers a queue of the nic. This is done in patches 3-9.

Tap was also converted to be able to create a multiqueue backend. Currently only Linux supports this, by issuing TUNSETIFF N times with the same device name to create N queues. Each fd on which TUNSETIFF was issued is a queue supported by the kernel. Three new command line options were introduced: queues tells qemu how many queues to create; fds passes multiple pre-created tap file descriptors to qemu; vhostfds passes multiple pre-created vhost descriptors to qemu. This is done in patches 10-15.

A method of deleting a queue and a queue_index were also introduced for virtio; this is done in patches 16-17.

Vhost was also changed to support multiqueue by introducing a start vq index, which tracks the first virtqueue that vhost will use, instead of assuming that vhost always uses virtqueues starting from index 0. This is done in patch 18.

The last part is the multiqueue userspace changes; this is done in patches 19-22.
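The tap setup described above (TUNSETIFF issued N times with the same device name) can be sketched roughly as follows. This is a minimal illustration, not code from the series: tap_mq_fill_ifreq() and tap_mq_open() are hypothetical helper names, and the sketch assumes a Linux kernel whose tun driver supports the IFF_MULTI_QUEUE flag.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/if_tun.h>

#ifndef IFF_MULTI_QUEUE
#define IFF_MULTI_QUEUE 0x0100   /* fallback for older kernel headers */
#endif

/* Hypothetical helper: build the request that is reused for every queue. */
static void tap_mq_fill_ifreq(struct ifreq *ifr, const char *name)
{
    memset(ifr, 0, sizeof(*ifr));
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
    strncpy(ifr->ifr_name, name, IFNAMSIZ - 1);
}

/*
 * Hypothetical helper: open /dev/net/tun once per queue and attach every
 * fd to the same device name. Returns 0 on success; on failure returns -1
 * after closing all fds opened so far.
 */
static int tap_mq_open(const char *name, int *fds, int queues)
{
    struct ifreq ifr;
    int i;

    tap_mq_fill_ifreq(&ifr, name);
    for (i = 0; i < queues; i++) {
        fds[i] = open("/dev/net/tun", O_RDWR);
        if (fds[i] < 0 || ioctl(fds[i], TUNSETIFF, &ifr) < 0) {
            if (fds[i] >= 0) {
                close(fds[i]);
            }
            while (i-- > 0) {
                close(fds[i]);
            }
            return -1;
        }
    }
    return 0;
}
```

Calling tap_mq_open("tap0", fds, 2) would normally need CAP_NET_ADMIN; the resulting descriptors are the kind of pre-created fds that the fds= option is meant to receive.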
With these changes, a user can start a multiqueue virtio-net device through

./qemu -netdev tap,id=hn0,queues=2,vhost=on -device virtio-net-pci,netdev=hn0

Management tools such as libvirt can pass multiple pre-created fds/vhostfds through

./qemu -netdev tap,id=hn0,fds=X:Y,vhostfds=M:N -device virtio-net-pci,netdev=hn0

For those who want to try it, a git tree is available at:

git://github.com/jasowang/qemu.git

Changes from V3:
- convert bitfield to bool in TAPState (Blue)
- use abort() instead of assert(0) in tap code (Blue)
- rebase to the latest
- fix a bug that breaks the non-tap network

Changes from V2:
- Don't start/stop vhost threads when changing queues and simplify the
  interface between virtio-net and vhost further.

Changes from V1:
- silence checkpatch (Blue)
- use fds/vhostfds instead of fd/vhostfd (Stefan)
- use fds=X:Y:Z instead of fd=X,fd=Y,fd=Z (Anthony)
- split patches (Stefan)
- typos in commit log (Stefan)
- Warn 'queues=' when fds/vhostfds is used (Stefan)
- rename __net_init_tap to net_init_tap_one (Stefan)
- check the consistency of vnet_hdr of multiple tap fds (Stefan)
- disable multiqueue support for bridge-helper (Stefan)
- rename tap_attach()/tap_detach() to tap_enable()/tap_disable() (Stefan)
- fix booting with legacy guest (WanLong)
- don't bump the version when doing migration (Michael)
- simplify the interface between virtio-net and multiqueue vhost_net (Michael)
- rebase the patches to latest
- re-order the patches so that the net part comes first, to simplify reviewing
- simplify the interface between virtio-net and multiqueue vhost_net
- move the guest notifiers setup from vhost to vhost_net
- fix a build issue of hw/mcf_fec.c

Changes from RFC v2:
- rebase the code to latest qemu
- align the multiqueue virtio-net implementation to the virtio spec
- split the patches into smaller patches
- set_link and hotplug support

Changes from RFC V1:
- rebase to the latest
- fix memory leak in parse_netdev
- fix guest notifiers assignment/de-assignment
- change the command lines to:
  qemu -netdev tap,queues=2 -device virtio-net-pci,queues=2

Reference:
V1: http://lists.nongnu.org/archive/html/qemu-devel/2012-12/msg03558.html
RFC v2: http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04108.html
RFC v1: http://comments.gmane.org/gmane.comp.emulators.qemu/100481

Perf Numbers:
- norm is short for normalized result
- trans.rate is short for transaction rate

Two Intel Xeon 5620 with directly connected Intel 82599EB
Host/Guest kernel: David's net tree
vhost enabled

- lots of improvements in both latency and cpu utilization in the
  request-response test
- a regression when the guest sends small packets, because TCP tends to
  batch less when latency is improved

1q/2q/4q TCP_RR
size  #sessions  trans.rate  norm     trans.rate  norm     trans.rate  norm
1     1          9393.26     595.64   9408.18     597.34   9375.19     584.12
1     20         72162.1     2214.24  129880.22   2456.13  196949.81   2298.13
1     50         107513.38   2653.99  139721.93   2490.58  259713.82   2873.57
1     100        126734.63   2676.54  145553.5    2406.63  265252.68   2943
64    1          9453.42     632.33   9371.37     616.13   9338.19     615.97
64    20         70620.03    2093.68  125155.75   2409.15  191239.91   2253.32
64    50         106966      2448.29  146518.67   2514.47  242134.07   2720.91
64    100        117046.35   2394.56  190153.09   2696.82  238881.29   2704.41
256   1          8733.29     736.36   8701.07     680.83   8608.92     530.1
256   20         69279.89    2274.45  115103.07   2299.76  144555.16   1963.53
256   50         97676.02    2296.09  150719.57   2522.92  254510.5    3028.44
256   100
[PATCH V4 01/22] net: tap: using bool instead of bitfield
Signed-off-by: Jason Wang jasow...@redhat.com --- hw/virtio-net.c |2 +- include/net/tap.h |4 ++-- net/tap-win32.c |6 +++--- net/tap.c | 38 ++ 4 files changed, 24 insertions(+), 26 deletions(-) diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 3bb01b1..faf4cc9 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -1069,7 +1069,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf, n-nic = qemu_new_nic(net_virtio_info, conf, object_get_typename(OBJECT(dev)), dev-id, n); peer_test_vnet_hdr(n); if (peer_has_vnet_hdr(n)) { -tap_using_vnet_hdr(n-nic-nc.peer, 1); +tap_using_vnet_hdr(n-nic-nc.peer, true); n-host_hdr_len = sizeof(struct virtio_net_hdr); } else { n-host_hdr_len = 0; diff --git a/include/net/tap.h b/include/net/tap.h index bb7efb5..883cebf 100644 --- a/include/net/tap.h +++ b/include/net/tap.h @@ -29,10 +29,10 @@ #include qemu-common.h #include qapi-types.h -int tap_has_ufo(NetClientState *nc); +bool tap_has_ufo(NetClientState *nc); int tap_has_vnet_hdr(NetClientState *nc); int tap_has_vnet_hdr_len(NetClientState *nc, int len); -void tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr); +void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr); void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, int ecn, int ufo); void tap_set_vnet_hdr_len(NetClientState *nc, int len); diff --git a/net/tap-win32.c b/net/tap-win32.c index 265369c..3052bba 100644 --- a/net/tap-win32.c +++ b/net/tap-win32.c @@ -722,9 +722,9 @@ int net_init_tap(const NetClientOptions *opts, const char *name, return 0; } -int tap_has_ufo(NetClientState *nc) +bool tap_has_ufo(NetClientState *nc) { -return 0; +return false; } int tap_has_vnet_hdr(NetClientState *nc) @@ -741,7 +741,7 @@ void tap_fd_set_vnet_hdr_len(int fd, int len) { } -void tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr) +void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr) { } diff --git a/net/tap.c b/net/tap.c index eb40c42..5542c98 100644 --- a/net/tap.c +++ 
b/net/tap.c @@ -55,10 +55,10 @@ typedef struct TAPState { char down_script[1024]; char down_script_arg[128]; uint8_t buf[TAP_BUFSIZE]; -unsigned int read_poll : 1; -unsigned int write_poll : 1; -unsigned int using_vnet_hdr : 1; -unsigned int has_ufo: 1; +bool read_poll; +bool write_poll; +bool using_vnet_hdr; +bool has_ufo; VHostNetState *vhost_net; unsigned host_vnet_hdr_len; } TAPState; @@ -78,15 +78,15 @@ static void tap_update_fd_handler(TAPState *s) s); } -static void tap_read_poll(TAPState *s, int enable) +static void tap_read_poll(TAPState *s, bool enable) { -s-read_poll = !!enable; +s-read_poll = enable; tap_update_fd_handler(s); } -static void tap_write_poll(TAPState *s, int enable) +static void tap_write_poll(TAPState *s, bool enable) { -s-write_poll = !!enable; +s-write_poll = enable; tap_update_fd_handler(s); } @@ -94,7 +94,7 @@ static void tap_writable(void *opaque) { TAPState *s = opaque; -tap_write_poll(s, 0); +tap_write_poll(s, false); qemu_flush_queued_packets(s-nc); } @@ -108,7 +108,7 @@ static ssize_t tap_write_packet(TAPState *s, const struct iovec *iov, int iovcnt } while (len == -1 errno == EINTR); if (len == -1 errno == EAGAIN) { -tap_write_poll(s, 1); +tap_write_poll(s, true); return 0; } @@ -186,7 +186,7 @@ ssize_t tap_read_packet(int tapfd, uint8_t *buf, int maxlen) static void tap_send_completed(NetClientState *nc, ssize_t len) { TAPState *s = DO_UPCAST(TAPState, nc, nc); -tap_read_poll(s, 1); +tap_read_poll(s, true); } static void tap_send(void *opaque) @@ -209,12 +209,12 @@ static void tap_send(void *opaque) size = qemu_send_packet_async(s-nc, buf, size, tap_send_completed); if (size == 0) { -tap_read_poll(s, 0); +tap_read_poll(s, false); } } while (size 0 qemu_can_send_packet(s-nc)); } -int tap_has_ufo(NetClientState *nc) +bool tap_has_ufo(NetClientState *nc) { TAPState *s = DO_UPCAST(TAPState, nc, nc); @@ -253,12 +253,10 @@ void tap_set_vnet_hdr_len(NetClientState *nc, int len) s-host_vnet_hdr_len = len; } -void 
tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr) +void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr) { TAPState *s = DO_UPCAST(TAPState, nc, nc); -using_vnet_hdr = using_vnet_hdr != 0; - assert(nc-info-type == NET_CLIENT_OPTIONS_KIND_TAP); assert(!!s-host_vnet_hdr_len == using_vnet_hdr); @@ -290,8 +288,8 @@ static void tap_cleanup(NetClientState *nc) if (s-down_script[0]) launch_script(s-down_script, s-down_script_arg, s-fd); -tap_read_poll(s, 0); -tap_write_poll(s, 0); +
[PATCH V4 02/22] net: tap: use abort() instead of assert(0)
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 net/tap-linux.c |    4 ++--
 net/tap-win32.c |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/tap-linux.c b/net/tap-linux.c
index 059f5f3..0a6acc7 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -164,7 +164,7 @@ int tap_probe_vnet_hdr_len(int fd, int len)
     if (ioctl(fd, TUNSETVNETHDRSZ, &orig) == -1) {
         fprintf(stderr, "TUNGETVNETHDRSZ ioctl() failed: %s. Exiting.\n",
                 strerror(errno));
-        assert(0);
+        abort();
         return -errno;
     }
     return 1;
@@ -175,7 +175,7 @@ void tap_fd_set_vnet_hdr_len(int fd, int len)
     if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1) {
         fprintf(stderr, "TUNSETVNETHDRSZ ioctl() failed: %s. Exiting.\n",
                 strerror(errno));
-        assert(0);
+        abort();
     }
 }
 
diff --git a/net/tap-win32.c b/net/tap-win32.c
index 3052bba..601437e 100644
--- a/net/tap-win32.c
+++ b/net/tap-win32.c
@@ -762,5 +762,5 @@ int tap_has_vnet_hdr_len(NetClientState *nc, int len)
 
 void tap_set_vnet_hdr_len(NetClientState *nc, int len)
 {
-    assert(0);
+    abort();
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
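The patch carries no commit message, so the rationale here is an assumption: a common reason to prefer abort() over assert(0) is that assert() expands to nothing when NDEBUG is defined, so execution would continue past the failed ioctl. The sketch below only demonstrates that C behaviour; after_disabled_assert() is a hypothetical name, not QEMU code.

```c
#define NDEBUG              /* simulate a build with assertions disabled */
#include <assert.h>

/* With NDEBUG defined, assert(0) is a no-op, so control reaches the return. */
static int after_disabled_assert(void)
{
    assert(0);
    return 1;               /* reached only because the assert vanished */
}

#undef NDEBUG               /* restore the normal assert() from here on */
#include <assert.h>
```

abort() from <stdlib.h> is not affected by NDEBUG, so a call placed where assert(0) stands would terminate the process in every build configuration.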
[PATCH V4 03/22] net: introduce qemu_get_queue()
To support multiqueue, the patch introduce a helper qemu_get_queue() which is used to get the NetClientState of a device. The following patches would refactor this helper to support multiqueue. Signed-off-by: Jason Wang jasow...@redhat.com --- hw/cadence_gem.c|9 +++-- hw/dp8393x.c|9 +++-- hw/e1000.c | 24 --- hw/eepro100.c | 12 hw/etraxfs_eth.c|5 ++- hw/lan9118.c| 10 +++--- hw/mcf_fec.c|4 +- hw/milkymist-minimac2.c |4 +- hw/mipsnet.c|4 +- hw/musicpal.c |2 +- hw/ne2000-isa.c |2 +- hw/ne2000.c |7 ++-- hw/opencores_eth.c |6 ++-- hw/pcnet-pci.c |2 +- hw/pcnet.c |7 ++-- hw/rtl8139.c| 14 hw/smc91c111.c |4 +- hw/spapr_llan.c |4 +- hw/stellaris_enet.c |5 ++- hw/usb/dev-network.c| 10 +++--- hw/virtio-net.c | 76 ++- hw/xen_nic.c| 13 +--- hw/xgmac.c |4 +- hw/xilinx_axienet.c |4 +- hw/xilinx_ethlite.c |6 ++-- include/net/net.h |1 + net/net.c |5 +++ savevm.c|2 +- 28 files changed, 140 insertions(+), 115 deletions(-) diff --git a/hw/cadence_gem.c b/hw/cadence_gem.c index 0d83442..9de688f 100644 --- a/hw/cadence_gem.c +++ b/hw/cadence_gem.c @@ -389,10 +389,10 @@ static void gem_init_register_masks(GemState *s) */ static void phy_update_link(GemState *s) { -DB_PRINT(down %d\n, s-nic-nc.link_down); +DB_PRINT(down %d\n, qemu_get_queue(s-nic)-link_down); /* Autonegotiation status mirrors link status. 
*/ -if (s-nic-nc.link_down) { +if (qemu_get_queue(s-nic)-link_down) { s-phy_regs[PHY_REG_STATUS] = ~(PHY_REG_STATUS_ANEGCMPL | PHY_REG_STATUS_LINK); s-phy_regs[PHY_REG_INT_ST] |= PHY_REG_INT_ST_LINKC; @@ -906,9 +906,10 @@ static void gem_transmit(GemState *s) /* Send the packet somewhere */ if (s-phy_loop) { -gem_receive(s-nic-nc, tx_packet, total_bytes); +gem_receive(qemu_get_queue(s-nic), tx_packet, total_bytes); } else { -qemu_send_packet(s-nic-nc, tx_packet, total_bytes); +qemu_send_packet(qemu_get_queue(s-nic), tx_packet, + total_bytes); } /* Prepare for next packet */ diff --git a/hw/dp8393x.c b/hw/dp8393x.c index b501450..c2d0bc8 100644 --- a/hw/dp8393x.c +++ b/hw/dp8393x.c @@ -339,6 +339,7 @@ static void do_receiver_disable(dp8393xState *s) static void do_transmit_packets(dp8393xState *s) { +NetClientState *nc = qemu_get_queue(s-nic); uint16_t data[12]; int width, size; int tx_len, len; @@ -408,13 +409,13 @@ static void do_transmit_packets(dp8393xState *s) if (s-regs[SONIC_RCR] (SONIC_RCR_LB1 | SONIC_RCR_LB0)) { /* Loopback */ s-regs[SONIC_TCR] |= SONIC_TCR_CRSL; -if (s-nic-nc.info-can_receive(s-nic-nc)) { +if (nc-info-can_receive(nc)) { s-loopback_packet = 1; -s-nic-nc.info-receive(s-nic-nc, s-tx_buffer, tx_len); +nc-info-receive(nc, s-tx_buffer, tx_len); } } else { /* Transmit packet */ -qemu_send_packet(s-nic-nc, s-tx_buffer, tx_len); +qemu_send_packet(nc, s-tx_buffer, tx_len); } s-regs[SONIC_TCR] |= SONIC_TCR_PTX; @@ -903,7 +904,7 @@ void dp83932_init(NICInfo *nd, hwaddr base, int it_shift, s-nic = qemu_new_nic(net_dp83932_info, s-conf, nd-model, nd-name, s); -qemu_format_nic_info_str(s-nic-nc, s-conf.macaddr.a); +qemu_format_nic_info_str(qemu_get_queue(s-nic), s-conf.macaddr.a); qemu_register_reset(nic_reset, s); nic_reset(s); diff --git a/hw/e1000.c b/hw/e1000.c index ef06ca1..7b310d7 100644 --- a/hw/e1000.c +++ b/hw/e1000.c @@ -167,11 +167,11 @@ set_phy_ctrl(E1000State *s, int index, uint16_t val) { if ((val MII_CR_AUTO_NEG_EN) (val 
MII_CR_RESTART_AUTO_NEG)) { /* no need auto-negotiation if link was down */ -if (s-nic-nc.link_down) { +if (qemu_get_queue(s-nic)-link_down) { s-phy_reg[PHY_STATUS] |= MII_SR_AUTONEG_COMPLETE; return; } -s-nic-nc.link_down = true; +qemu_get_queue(s-nic)-link_down = true; e1000_link_down(s); s-phy_reg[PHY_STATUS] = ~MII_SR_AUTONEG_COMPLETE; DBGOUT(PHY, Start link auto negotiation\n); @@ -183,7 +183,7 @@ static void e1000_autoneg_timer(void *opaque) { E1000State *s = opaque; -s-nic-nc.link_down = false; +qemu_get_queue(s-nic)-link_down = false; e1000_link_up(s); s-phy_reg[PHY_STATUS] |= MII_SR_AUTONEG_COMPLETE;
[PATCH V4 04/22] net: introduce qemu_get_nic()
To support multiqueue, this patch introduces a helper qemu_get_nic() to get NICState from a NetClientState. The following patches would refactor this helper to support multiqueue. Signed-off-by: Jason Wang jasow...@redhat.com --- hw/cadence_gem.c|8 hw/dp8393x.c|6 +++--- hw/e1000.c |8 hw/eepro100.c |6 +++--- hw/etraxfs_eth.c|6 +++--- hw/lan9118.c|6 +++--- hw/lance.c |2 +- hw/mcf_fec.c|6 +++--- hw/milkymist-minimac2.c |6 +++--- hw/mipsnet.c|6 +++--- hw/musicpal.c |4 ++-- hw/ne2000-isa.c |2 +- hw/ne2000.c |6 +++--- hw/opencores_eth.c |6 +++--- hw/pcnet-pci.c |2 +- hw/pcnet.c |6 +++--- hw/rtl8139.c|8 hw/smc91c111.c |6 +++--- hw/spapr_llan.c |4 ++-- hw/stellaris_enet.c |6 +++--- hw/usb/dev-network.c|6 +++--- hw/virtio-net.c | 10 +- hw/xen_nic.c|4 ++-- hw/xgmac.c |6 +++--- hw/xilinx_axienet.c |6 +++--- hw/xilinx_ethlite.c |6 +++--- include/net/net.h |2 ++ net/net.c | 20 28 files changed, 92 insertions(+), 78 deletions(-) diff --git a/hw/cadence_gem.c b/hw/cadence_gem.c index 9de688f..ab35329 100644 --- a/hw/cadence_gem.c +++ b/hw/cadence_gem.c @@ -409,7 +409,7 @@ static int gem_can_receive(NetClientState *nc) { GemState *s; -s = DO_UPCAST(NICState, nc, nc)-opaque; +s = qemu_get_nic_opaque(nc); DB_PRINT(\n); @@ -612,7 +612,7 @@ static ssize_t gem_receive(NetClientState *nc, const uint8_t *buf, size_t size) uint8_trxbuf[2048]; uint8_t *rxbuf_ptr; -s = DO_UPCAST(NICState, nc, nc)-opaque; +s = qemu_get_nic_opaque(nc); /* Do nothing if receive is not enabled. 
*/ if (!(s-regs[GEM_NWCTRL] GEM_NWCTRL_RXENA)) { @@ -1149,7 +1149,7 @@ static const MemoryRegionOps gem_ops = { static void gem_cleanup(NetClientState *nc) { -GemState *s = DO_UPCAST(NICState, nc, nc)-opaque; +GemState *s = qemu_get_nic_opaque(nc); DB_PRINT(\n); s-nic = NULL; @@ -1158,7 +1158,7 @@ static void gem_cleanup(NetClientState *nc) static void gem_set_link(NetClientState *nc) { DB_PRINT(\n); -phy_update_link(DO_UPCAST(NICState, nc, nc)-opaque); +phy_update_link(qemu_get_nic_opaque(nc)); } static NetClientInfo net_gem_info = { diff --git a/hw/dp8393x.c b/hw/dp8393x.c index c2d0bc8..0273fad 100644 --- a/hw/dp8393x.c +++ b/hw/dp8393x.c @@ -676,7 +676,7 @@ static const MemoryRegionOps dp8393x_ops = { static int nic_can_receive(NetClientState *nc) { -dp8393xState *s = DO_UPCAST(NICState, nc, nc)-opaque; +dp8393xState *s = qemu_get_nic_opaque(nc); if (!(s-regs[SONIC_CR] SONIC_CR_RXEN)) return 0; @@ -725,7 +725,7 @@ static int receive_filter(dp8393xState *s, const uint8_t * buf, int size) static ssize_t nic_receive(NetClientState *nc, const uint8_t * buf, size_t size) { -dp8393xState *s = DO_UPCAST(NICState, nc, nc)-opaque; +dp8393xState *s = qemu_get_nic_opaque(nc); uint16_t data[10]; int packet_type; uint32_t available, address; @@ -861,7 +861,7 @@ static void nic_reset(void *opaque) static void nic_cleanup(NetClientState *nc) { -dp8393xState *s = DO_UPCAST(NICState, nc, nc)-opaque; +dp8393xState *s = qemu_get_nic_opaque(nc); memory_region_del_subregion(s-address_space, s-mmio); memory_region_destroy(s-mmio); diff --git a/hw/e1000.c b/hw/e1000.c index 7b310d7..36f4051 100644 --- a/hw/e1000.c +++ b/hw/e1000.c @@ -743,7 +743,7 @@ receive_filter(E1000State *s, const uint8_t *buf, int size) static void e1000_set_link_status(NetClientState *nc) { -E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque; +E1000State *s = qemu_get_nic_opaque(nc); uint32_t old_status = s-mac_reg[STATUS]; if (nc-link_down) { @@ -777,7 +777,7 @@ static bool e1000_has_rxbufs(E1000State *s, 
size_t total_size) static int e1000_can_receive(NetClientState *nc) { -E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque; +E1000State *s = qemu_get_nic_opaque(nc); return (s-mac_reg[RCTL] E1000_RCTL_EN) e1000_has_rxbufs(s, 1); } @@ -793,7 +793,7 @@ static uint64_t rx_desc_base(E1000State *s) static ssize_t e1000_receive(NetClientState *nc, const uint8_t *buf, size_t size) { -E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque; +E1000State *s = qemu_get_nic_opaque(nc); struct e1000_rx_desc desc; dma_addr_t base; unsigned int n, rdt; @@ -1230,7 +1230,7 @@ e1000_mmio_setup(E1000State *d) static void e1000_cleanup(NetClientState *nc) { -E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque; +E1000State *s = qemu_get_nic_opaque(nc); s-nic = NULL; } diff --git a/hw/eepro100.c b/hw/eepro100.c index
[PATCH V4 05/22] net: introduce qemu_del_nic()
To support multiqueue nics, this patch separates the nic destructor from qemu_del_net_client() into a new helper, qemu_del_nic(), since the mapping between NICState and NetClientState is no longer 1:1 in multiqueue. The following patches will refactor this function to support multiqueue nics.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 hw/e1000.c           |    2 +-
 hw/eepro100.c        |    2 +-
 hw/ne2000.c          |    2 +-
 hw/pcnet-pci.c       |    2 +-
 hw/rtl8139.c         |    2 +-
 hw/usb/dev-network.c |    2 +-
 hw/virtio-net.c      |    2 +-
 hw/xen_nic.c         |    2 +-
 include/net/net.h    |    1 +
 net/net.c            |   15 ++++++++++++++-
 10 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/hw/e1000.c b/hw/e1000.c
index 36f4051..f3590a9 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -1244,7 +1244,7 @@ pci_e1000_uninit(PCIDevice *dev)
     qemu_free_timer(d->autoneg_timer);
     memory_region_destroy(&d->mmio);
     memory_region_destroy(&d->io);
-    qemu_del_net_client(qemu_get_queue(d->nic));
+    qemu_del_nic(d->nic);
 }
 
 static NetClientInfo net_e1000_info = {
diff --git a/hw/eepro100.c b/hw/eepro100.c
index f9856ae..5d23796 100644
--- a/hw/eepro100.c
+++ b/hw/eepro100.c
@@ -1849,7 +1849,7 @@ static void pci_nic_uninit(PCIDevice *pci_dev)
     memory_region_destroy(&s->flash_bar);
     vmstate_unregister(&pci_dev->qdev, s->vmstate, s);
     eeprom93xx_free(&pci_dev->qdev, s->eeprom);
-    qemu_del_net_client(qemu_get_queue(s->nic));
+    qemu_del_nic(s->nic);
 }
 
 static NetClientInfo net_eepro100_info = {
diff --git a/hw/ne2000.c b/hw/ne2000.c
index c989190..3dd1c84 100644
--- a/hw/ne2000.c
+++ b/hw/ne2000.c
@@ -751,7 +751,7 @@ static void pci_ne2000_exit(PCIDevice *pci_dev)
     NE2000State *s = &d->ne2000;
 
     memory_region_destroy(&s->io);
-    qemu_del_net_client(qemu_get_queue(s->nic));
+    qemu_del_nic(s->nic);
 }
 
 static Property ne2000_properties[] = {
diff --git a/hw/pcnet-pci.c b/hw/pcnet-pci.c
index 26c90bf..df63b22 100644
--- a/hw/pcnet-pci.c
+++ b/hw/pcnet-pci.c
@@ -279,7 +279,7 @@ static void pci_pcnet_uninit(PCIDevice *dev)
     memory_region_destroy(&d->io_bar);
     qemu_del_timer(d->state.poll_timer);
     qemu_free_timer(d->state.poll_timer);
-qemu_del_net_client(qemu_get_queue(d-state.nic)); +qemu_del_nic(d-state.nic); } static NetClientInfo net_pci_pcnet_info = { diff --git a/hw/rtl8139.c b/hw/rtl8139.c index b825e83..d7716be 100644 --- a/hw/rtl8139.c +++ b/hw/rtl8139.c @@ -3446,7 +3446,7 @@ static void pci_rtl8139_uninit(PCIDevice *dev) } qemu_del_timer(s-timer); qemu_free_timer(s-timer); -qemu_del_net_client(qemu_get_queue(s-nic)); +qemu_del_nic(s-nic); } static void rtl8139_set_link_status(NetClientState *nc) diff --git a/hw/usb/dev-network.c b/hw/usb/dev-network.c index abc6eac..a01a5e7 100644 --- a/hw/usb/dev-network.c +++ b/hw/usb/dev-network.c @@ -1330,7 +1330,7 @@ static void usb_net_handle_destroy(USBDevice *dev) /* TODO: remove the nd_table[] entry */ rndis_clear_responsequeue(s); -qemu_del_net_client(qemu_get_queue(s-nic)); +qemu_del_nic(s-nic); } static NetClientInfo net_usbnet_info = { diff --git a/hw/virtio-net.c b/hw/virtio-net.c index af9a17b..1a3fc74 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -1124,6 +1124,6 @@ void virtio_net_exit(VirtIODevice *vdev) qemu_bh_delete(n-tx_bh); } -qemu_del_net_client(qemu_get_queue(n-nic)); +qemu_del_nic(n-nic); virtio_cleanup(n-vdev); } diff --git a/hw/xen_nic.c b/hw/xen_nic.c index 55b7960..4be077d 100644 --- a/hw/xen_nic.c +++ b/hw/xen_nic.c @@ -408,7 +408,7 @@ static void net_disconnect(struct XenDevice *xendev) netdev-rxs = NULL; } if (netdev-nic) { -qemu_del_net_client(qemu_get_queue(netdev-nic)); +qemu_del_nic(netdev-nic); netdev-nic = NULL; } } diff --git a/include/net/net.h b/include/net/net.h index 96e05c4..f0d1aa2 100644 --- a/include/net/net.h +++ b/include/net/net.h @@ -77,6 +77,7 @@ NICState *qemu_new_nic(NetClientInfo *info, const char *model, const char *name, void *opaque); +void qemu_del_nic(NICState *nic); NetClientState *qemu_get_queue(NICState *nic); NICState *qemu_get_nic(NetClientState *nc); void *qemu_get_nic_opaque(NetClientState *nc); diff --git a/net/net.c b/net/net.c index 41dc12c..8999f8d 100644 --- a/net/net.c 
+++ b/net/net.c @@ -291,6 +291,15 @@ void qemu_del_net_client(NetClientState *nc) return; } +assert(nc-info-type != NET_CLIENT_OPTIONS_KIND_NIC); + +qemu_cleanup_net_client(nc); +qemu_free_net_client(nc); +} + +void qemu_del_nic(NICState *nic) +{ +NetClientState *nc = qemu_get_queue(nic); /* If this is a peer NIC and peer has already been deleted, free it now. */ if (nc-peer nc-info-type == NET_CLIENT_OPTIONS_KIND_NIC) { NICState *nic = qemu_get_nic(nc); @@ -933,7 +942,11 @@ void net_cleanup(void)
[PATCH V4 06/22] net: introduce qemu_find_net_clients_except()
In multiqueue, all NetClientStates that belong to the same netdev or nic have the same id. So this patch introduces a helper, qemu_find_net_clients_except(), which finds all NetClientStates with the same id, skipping those of a given type. This will be used by multiqueue networking.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 include/net/net.h |    2 ++
 net/net.c         |   21 +++++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/net/net.h b/include/net/net.h
index f0d1aa2..995df5c 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -68,6 +68,8 @@ typedef struct NICState {
 } NICState;
 
 NetClientState *qemu_find_netdev(const char *id);
+int qemu_find_net_clients_except(const char *id, NetClientState **ncs,
+                                 NetClientOptionsKind type, int max);
 NetClientState *qemu_new_net_client(NetClientInfo *info,
                                     NetClientState *peer,
                                     const char *model,
diff --git a/net/net.c b/net/net.c
index 8999f8d..6457fc0 100644
--- a/net/net.c
+++ b/net/net.c
@@ -508,6 +508,27 @@ NetClientState *qemu_find_netdev(const char *id)
     return NULL;
 }
 
+int qemu_find_net_clients_except(const char *id, NetClientState **ncs,
+                                 NetClientOptionsKind type, int max)
+{
+    NetClientState *nc;
+    int ret = 0;
+
+    QTAILQ_FOREACH(nc, &net_clients, next) {
+        if (nc->info->type == type) {
+            continue;
+        }
+        if (!strcmp(nc->name, id)) {
+            if (ret < max) {
+                ncs[ret] = nc;
+            }
+            ret++;
+        }
+    }
+
+    return ret;
+}
+
 static int nic_get_free_idx(void)
 {
     int index;
-- 
1.7.1
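The helper above stores at most max matches but keeps counting, so the return value can exceed max and lets a caller notice that its array was too small. A standalone sketch of that count-and-fill pattern, using a plain array instead of the QTAILQ list; struct client and find_clients_except() are illustrative stand-ins, not QEMU definitions:

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for NetClientState; field names are illustrative. */
struct client {
    const char *name;
    int type;
};

/*
 * Collect up to max clients named id whose type differs from skip_type.
 * Returns the total number of matches, which may be larger than max, so
 * the caller can detect that out[] could not hold all of them.
 */
static int find_clients_except(struct client *all, int n, const char *id,
                               struct client **out, int skip_type, int max)
{
    int ret = 0;

    for (int i = 0; i < n; i++) {
        if (all[i].type == skip_type) {
            continue;
        }
        if (strcmp(all[i].name, id) == 0) {
            if (ret < max) {
                out[ret] = &all[i];
            }
            ret++;
        }
    }
    return ret;
}
```

A caller would typically compare the result against the capacity it passed in and treat a larger value as an error rather than silently dropping queues.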
[PATCH V4 07/22] net: introduce qemu_net_client_setup()
This patch separates the setup of NetClientState from its allocation. This allows allocating an array of NetClientStates and initializing them one by one, which is what multiqueue needs.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 net/net.c |   29 +++++++++++++++++++----------
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/net.c b/net/net.c
index 6457fc0..4e84d54 100644
--- a/net/net.c
+++ b/net/net.c
@@ -182,17 +182,12 @@ static char *assign_name(NetClientState *nc1, const char *model)
     return g_strdup(buf);
 }
 
-NetClientState *qemu_new_net_client(NetClientInfo *info,
-                                    NetClientState *peer,
-                                    const char *model,
-                                    const char *name)
+static void qemu_net_client_setup(NetClientState *nc,
+                                  NetClientInfo *info,
+                                  NetClientState *peer,
+                                  const char *model,
+                                  const char *name)
 {
-    NetClientState *nc;
-
-    assert(info->size >= sizeof(NetClientState));
-
-    nc = g_malloc0(info->size);
-
     nc->info = info;
     nc->model = g_strdup(model);
     if (name) {
@@ -210,6 +205,20 @@ NetClientState *qemu_new_net_client(NetClientInfo *info,
 
     nc->send_queue = qemu_new_net_queue(nc);
 
+}
+
+NetClientState *qemu_new_net_client(NetClientInfo *info,
+                                    NetClientState *peer,
+                                    const char *model,
+                                    const char *name)
+{
+    NetClientState *nc;
+
+    assert(info->size >= sizeof(NetClientState));
+
+    nc = g_malloc0(info->size);
+    qemu_net_client_setup(nc, info, peer, model, name);
+
     return nc;
 }
 
-- 
1.7.1
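Why the split helps can be shown with a small standalone sketch: once setup no longer allocates, one block can hold several clients that are then initialized in a loop. The types and names here (struct client, client_setup(), clients_new(), clients_free()) are hypothetical and much simpler than the QEMU ones.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for NetClientState; not the real QEMU structure. */
struct client {
    char *model;
    int index;
};

/* Setup only: initializes storage that the caller already owns. */
static int client_setup(struct client *c, const char *model, int index)
{
    size_t len = strlen(model) + 1;

    c->model = malloc(len);
    if (!c->model) {
        return -1;
    }
    memcpy(c->model, model, len);
    c->index = index;
    return 0;
}

/* Allocate n clients in one block, then set them up one by one. */
static struct client *clients_new(const char *model, int n)
{
    struct client *cs = calloc(n, sizeof(*cs));

    if (!cs) {
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        if (client_setup(&cs[i], model, i) < 0) {
            while (i-- > 0) {
                free(cs[i].model);
            }
            free(cs);
            return NULL;
        }
    }
    return cs;
}

/* Release everything clients_new() allocated. */
static void clients_free(struct client *cs, int n)
{
    for (int i = 0; i < n; i++) {
        free(cs[i].model);
    }
    free(cs);
}
```

The single-client case is still available by allocating one element and calling the same setup function, which mirrors how qemu_new_net_client() wraps qemu_net_client_setup() in the patch.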
[PATCH V4 08/22] net: introduce NetClientState destructor
To allow allocating an array of NetClientStates and freeing it in one go, this patch introduces a destructor for NetClientState. The destructor can do a type-specific free, which multiqueue can use to free the whole array at once.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 include/net/net.h |    2 ++
 net/net.c         |   17 +++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/net/net.h b/include/net/net.h
index 995df5c..22adc99 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -35,6 +35,7 @@ typedef ssize_t (NetReceive)(NetClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(NetClientState *, const struct iovec *, int);
 typedef void (NetCleanup) (NetClientState *);
 typedef void (LinkStatusChanged)(NetClientState *);
+typedef void (NetClientDestructor)(NetClientState *);
 
 typedef struct NetClientInfo {
     NetClientOptionsKind type;
@@ -58,6 +59,7 @@ struct NetClientState {
     char *name;
     char info_str[256];
     unsigned receive_disabled : 1;
+    NetClientDestructor *destructor;
 };
 
 typedef struct NICState {
diff --git a/net/net.c b/net/net.c
index 4e84d54..6368896 100644
--- a/net/net.c
+++ b/net/net.c
@@ -182,11 +182,17 @@ static char *assign_name(NetClientState *nc1, const char *model)
     return g_strdup(buf);
 }
 
+static void qemu_net_client_destructor(NetClientState *nc)
+{
+    g_free(nc);
+}
+
 static void qemu_net_client_setup(NetClientState *nc,
                                   NetClientInfo *info,
                                   NetClientState *peer,
                                   const char *model,
-                                  const char *name)
+                                  const char *name,
+                                  NetClientDestructor *destructor)
 {
     nc->info = info;
     nc->model = g_strdup(model);
@@ -204,7 +210,7 @@ static void qemu_net_client_setup(NetClientState *nc,
     QTAILQ_INSERT_TAIL(&net_clients, nc, next);
 
     nc->send_queue = qemu_new_net_queue(nc);
-
+    nc->destructor = destructor;
 }
 
 NetClientState *qemu_new_net_client(NetClientInfo *info,
@@ -217,7 +223,8 @@ NetClientState *qemu_new_net_client(NetClientInfo *info,
     assert(info->size >= sizeof(NetClientState));
 
     nc = g_malloc0(info->size);
-    qemu_net_client_setup(nc, info, peer, model, name);
+    qemu_net_client_setup(nc, info, peer, model, name,
+                          qemu_net_client_destructor);
 
     return nc;
 }
@@ -279,7 +286,9 @@ static void qemu_free_net_client(NetClientState *nc)
     }
     g_free(nc->name);
     g_free(nc->model);
-    g_free(nc);
+    if (nc->destructor) {
+        nc->destructor(nc);
+    }
 }
 
 void qemu_del_net_client(NetClientState *nc)
-- 
1.7.1
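The destructor hook can be illustrated with a reduced sketch. It assumes, as one plausible use rather than something this patch states, that clients living inside a shared array carry a NULL destructor and that the array owner frees the storage once. All names (struct client, client_release(), free_calls) are hypothetical.

```c
#include <stdlib.h>

struct client;
typedef void (*client_destructor)(struct client *);

struct client {
    client_destructor destructor;   /* may be NULL: storage owned elsewhere */
};

static int free_calls;              /* counts frees, for illustration only */

/* Default destructor for a client that was allocated on its own. */
static void client_free_self(struct client *c)
{
    free_calls++;
    free(c);
}

/* Generic teardown: defer the final free to the per-client destructor. */
static void client_release(struct client *c)
{
    if (c->destructor) {
        c->destructor(c);
    }
}

/* Allocate a standalone client that frees itself on release. */
static struct client *client_new(void)
{
    struct client *c = calloc(1, sizeof(*c));

    if (c) {
        c->destructor = client_free_self;
    }
    return c;
}
```

With this shape, the generic release path never has to know whether a client was allocated individually or as part of a larger block.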
[PATCH V4 09/22] net: multiqueue support
This patch adds basic multiqueue support for qemu. The idea is simple: an array of NetClientStates was introduced in NICState, and parse_netdev() was extended to find all NetClientStates that belong to the backend and place their pointers in NICConf. Then qemu_new_nic() can set up an N:N mapping between the NetClientStates that belong to a nic and the NetClientStates that belong to the netdev. A queue_index was also introduced in NetClientState to track its index. After this, each peer of a NICState is abstracted as a queue.

After this change, all NetClientStates that belong to the same backend/nic have the same id. When the user wants to change the link status, all NetClientStates that belong to the same backend/nic are changed as well. When the user wants to delete a device or netdev, all NetClientStates that belong to the same backend/nic are deleted as well. Changing or deleting a specific queue is not allowed.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 hw/dp8393x.c                |    2 +-
 hw/mcf_fec.c                |    2 +-
 hw/qdev-properties-system.c |   46 +++---
 hw/qdev-properties.h        |    6 +-
 include/net/net.h           |   18 +--
 net/net.c                   |  113 +++
 6 files changed, 139 insertions(+), 48 deletions(-)

diff --git a/hw/dp8393x.c b/hw/dp8393x.c
index 0273fad..808157b 100644
--- a/hw/dp8393x.c
+++ b/hw/dp8393x.c
@@ -900,7 +900,7 @@ void dp83932_init(NICInfo *nd, hwaddr base, int it_shift,
     s->regs[SONIC_SR] = 0x0004; /* only revision recognized by Linux */
 
     s->conf.macaddr = nd->macaddr;
-    s->conf.peer = nd->netdev;
+    s->conf.peers.ncs[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_dp83932_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/mcf_fec.c b/hw/mcf_fec.c
index 909e32b..8e60f09 100644
--- a/hw/mcf_fec.c
+++ b/hw/mcf_fec.c
@@ -472,7 +472,7 @@ void mcf_fec_init(MemoryRegion *sysmem, NICInfo *nd,
     memory_region_add_subregion(sysmem, base, &s->iomem);
 
     s->conf.macaddr = nd->macaddr;
-    s->conf.peer = nd->netdev;
+    s->conf.peers.ncs[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_mcf_fec_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/qdev-properties-system.c
b/hw/qdev-properties-system.c index ce0f793..ce3af22 100644 --- a/hw/qdev-properties-system.c +++ b/hw/qdev-properties-system.c @@ -173,16 +173,47 @@ PropertyInfo qdev_prop_chr = { static int parse_netdev(DeviceState *dev, const char *str, void **ptr) { -NetClientState *netdev = qemu_find_netdev(str); +NICPeers *peers_ptr = (NICPeers *)ptr; +NICConf *conf = container_of(peers_ptr, NICConf, peers); +NetClientState **ncs = peers_ptr-ncs; +NetClientState *peers[MAX_QUEUE_NUM]; +int queues, i = 0; +int ret; -if (netdev == NULL) { -return -ENOENT; +queues = qemu_find_net_clients_except(str, peers, + NET_CLIENT_OPTIONS_KIND_NIC, + MAX_QUEUE_NUM); +if (queues == 0) { +ret = -ENOENT; +goto err; } -if (netdev-peer) { -return -EEXIST; + +if (queues MAX_QUEUE_NUM) { +ret = -E2BIG; +goto err; +} + +for (i = 0; i queues; i++) { +if (peers[i] == NULL) { +ret = -ENOENT; +goto err; +} + +if (peers[i]-peer) { +ret = -EEXIST; +goto err; +} + +ncs[i] = peers[i]; +ncs[i]-queue_index = i; } -*ptr = netdev; + +conf-queues = queues; + return 0; + +err: +return ret; } static const char *print_netdev(void *ptr) @@ -249,7 +280,8 @@ static void set_vlan(Object *obj, Visitor *v, void *opaque, { DeviceState *dev = DEVICE(obj); Property *prop = opaque; -NetClientState **ptr = qdev_get_prop_ptr(dev, prop); +NICPeers *peers_ptr = qdev_get_prop_ptr(dev, prop); +NetClientState **ptr = peers_ptr-ncs[0]; Error *local_err = NULL; int32_t id; NetClientState *hubport; diff --git a/hw/qdev-properties.h b/hw/qdev-properties.h index ddcf774..20c67f3 100644 --- a/hw/qdev-properties.h +++ b/hw/qdev-properties.h @@ -31,7 +31,7 @@ extern PropertyInfo qdev_prop_pci_host_devaddr; .name = (_name),\ .info = (_prop), \ .offset= offsetof(_state, _field)\ -+ type_check(_type,typeof_field(_state, _field)),\ ++ type_check(_type, typeof_field(_state, _field)), \ } #define DEFINE_PROP_DEFAULT(_name, _state, _field, _defval, _prop, _type) { \ .name = (_name), \ @@ -77,9 +77,9 @@ extern PropertyInfo 
qdev_prop_pci_host_devaddr; #define DEFINE_PROP_STRING(_n, _s, _f) \ DEFINE_PROP(_n, _s, _f, qdev_prop_string, char*) #define DEFINE_PROP_NETDEV(_n, _s, _f) \ -DEFINE_PROP(_n, _s, _f,
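For readers who want to experiment with the matching step outside of qemu, here is a minimal standalone sketch of the checks parse_netdev() performs. It is not the qemu code: Client, MAX_QUEUES and attach_queues() are hypothetical stand-ins for NetClientState, MAX_QUEUE_NUM and the body of parse_netdev(), and, unlike the patch, the sketch validates every client before it assigns any slot.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_QUEUES 8 /* hypothetical limit, stands in for MAX_QUEUE_NUM */

/* Hypothetical, stripped-down stand-in for qemu's NetClientState. */
typedef struct Client {
    struct Client *peer;      /* non-NULL once the client is attached */
    unsigned int queue_index; /* position of the client within its backend */
} Client;

/*
 * Attach the `nfound` backend clients in `found` to the NIC slots in `slots`.
 * Returns the number of queues on success or a negative errno value,
 * using the same error codes as the patch.
 */
static int attach_queues(Client **found, int nfound, Client **slots)
{
    int i;

    assert(found != NULL && slots != NULL);

    if (nfound == 0) {
        return -ENOENT; /* no backend with that id */
    }
    if (nfound > MAX_QUEUES) {
        return -E2BIG;  /* more queues than the NIC can hold */
    }

    /* Validate everything first so a failure leaves `slots` untouched. */
    for (i = 0; i < nfound; i++) {
        if (found[i] == NULL) {
            return -ENOENT;
        }
        if (found[i]->peer != NULL) {
            return -EEXIST; /* backend queue already has a peer */
        }
    }

    for (i = 0; i < nfound; i++) {
        slots[i] = found[i];
        slots[i]->queue_index = (unsigned int)i;
    }
    return nfound;
}
```

The return value plays the role of conf->queues in the patch: the caller learns how many queue pairs the backend provides.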
[PATCH V4 10/22] tap: import linux multiqueue constants
Import the multiqueue constants from if_tun.h in Linux 3.8-rc3. A new ifr flag, IFF_MULTI_QUEUE, was introduced to create a multiqueue backend by calling TUNSETIFF with this flag and the same interface name multiple times. A new ioctl, TUNSETQUEUE, was also introduced. Calling it with IFF_DETACH_QUEUE disables the queue in the Linux kernel; calling it with IFF_ATTACH_QUEUE enables the queue.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/tap-linux.h |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/tap-linux.h b/net/tap-linux.h
index cb2a6d4..65087e1 100644
--- a/net/tap-linux.h
+++ b/net/tap-linux.h
@@ -29,6 +29,7 @@
 #define TUNSETSNDBUF _IOW('T', 212, int)
 #define TUNGETVNETHDRSZ _IOR('T', 215, int)
 #define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE _IOW('T', 217, int)

 #endif

@@ -36,6 +37,9 @@
 #define IFF_TAP 0x0002
 #define IFF_NO_PI 0x1000
 #define IFF_VNET_HDR 0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400

 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
-- 
1.7.1

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
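To make the "call TUNSETIFF many times" description concrete, the following is a rough Linux-only sketch of how userspace could open several queues of one tap device. It is an illustration, not qemu's tap_open(): mq_tap_flags() and open_tap_queues() are hypothetical names, the fallback #define repeats the value imported by this patch, and running it for real needs access to /dev/net/tun and a kernel with multiqueue tap (3.8 or later).

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Fallback for pre-3.8 headers; the value is the one imported by the patch. */
#ifndef IFF_MULTI_QUEUE
#define IFF_MULTI_QUEUE 0x0100
#endif

/* Flags for one queue of a multiqueue tap device (hypothetical helper). */
int mq_tap_flags(int vnet_hdr)
{
    int flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;

    if (vnet_hdr) {
        flags |= IFF_VNET_HDR;
    }
    return flags;
}

/*
 * Open `n` queues of the tap device `ifname` by issuing TUNSETIFF once per
 * file descriptor with the same name. On success fds[0..n-1] are filled in
 * and 0 is returned; on failure all opened fds are closed and -1 is returned.
 */
int open_tap_queues(const char *ifname, int *fds, int n)
{
    int i;

    assert(ifname != NULL && fds != NULL && n > 0);

    for (i = 0; i < n; i++) {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0) {
            goto fail;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = (short)mq_tap_flags(0);
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) != 0) {
            close(fd);
            goto fail;
        }
        fds[i] = fd;
    }
    return 0;

fail:
    while (--i >= 0) {
        close(fds[i]);
    }
    return -1;
}
```

Each returned descriptor is an independent queue of the same interface; a queue can later be parked or revived with TUNSETQUEUE and IFF_DETACH_QUEUE / IFF_ATTACH_QUEUE.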
[PATCH V4 11/22] tap: factor out common tap initialization
This patch factors out the common tap initialization into a new helper, net_init_tap_one(). It will be used by the multiqueue tap patches.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/tap.c |  130 ++---
 1 files changed, 73 insertions(+), 57 deletions(-)

diff --git a/net/tap.c b/net/tap.c
index 5542c98..23fb6e0 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -591,6 +591,73 @@ static int net_tap_init(const NetdevTapOptions *tap, int *vnet_hdr,
     return fd;
 }

+static int net_init_tap_one(const NetdevTapOptions *tap, NetClientState *peer,
+                            const char *model, const char *name,
+                            const char *ifname, const char *script,
+                            const char *downscript, const char *vhostfdname,
+                            int vnet_hdr, int fd)
+{
+    TAPState *s;
+
+    s = net_tap_fd_init(peer, model, name, fd, vnet_hdr);
+    if (!s) {
+        close(fd);
+        return -1;
+    }
+
+    if (tap_set_sndbuf(s->fd, tap) < 0) {
+        return -1;
+    }
+
+    if (tap->has_fd) {
+        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "fd=%d", fd);
+    } else if (tap->has_helper) {
+        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "helper=%s",
+                 tap->helper);
+    } else {
+        const char *downscript;
+
+        downscript = tap->has_downscript ? tap->downscript :
+            DEFAULT_NETWORK_DOWN_SCRIPT;
+
+        snprintf(s->nc.info_str, sizeof(s->nc.info_str),
+                 "ifname=%s,script=%s,downscript=%s", ifname, script,
+                 downscript);
+
+        if (strcmp(downscript, "no") != 0) {
+            snprintf(s->down_script, sizeof(s->down_script), "%s", downscript);
+            snprintf(s->down_script_arg, sizeof(s->down_script_arg),
+                     "%s", ifname);
+        }
+    }
+
+    if (tap->has_vhost ? tap->vhost :
+        vhostfdname || (tap->has_vhostforce && tap->vhostforce)) {
+        int vhostfd;
+
+        if (tap->has_vhostfd) {
+            vhostfd = monitor_handle_fd_param(cur_mon, vhostfdname);
+            if (vhostfd == -1) {
+                return -1;
+            }
+        } else {
+            vhostfd = -1;
+        }
+
+        s->vhost_net = vhost_net_init(&s->nc, vhostfd,
+                                      tap->has_vhostforce && tap->vhostforce);
+        if (!s->vhost_net) {
+            error_report("vhost-net requested but could not be initialized");
+            return -1;
+        }
+    } else if (tap->has_vhostfd) {
+        error_report("vhostfd= is not valid without vhost");
+        return -1;
+    }
+
+    return 0;
+}
+
 int net_init_tap(const NetClientOptions *opts, const char *name,
                  NetClientState *peer)
 {
@@ -598,10 +665,10 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
     int fd, vnet_hdr = 0;
     const char *model;
-    TAPState *s;

     /* for the no-fd, no-helper case */
     const char *script = NULL; /* suppress wrong "uninit'd use" gcc warning */
+    const char *downscript = NULL;
     char ifname[128];

     assert(opts->kind == NET_CLIENT_OPTIONS_KIND_TAP);
@@ -647,6 +714,8 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
     } else {
         script = tap->has_script ? tap->script : DEFAULT_NETWORK_SCRIPT;
+        downscript = tap->has_downscript ? tap->downscript :
+            DEFAULT_NETWORK_DOWN_SCRIPT;
         fd = net_tap_init(tap, &vnet_hdr, script, ifname, sizeof ifname);
         if (fd == -1) {
             return -1;
@@ -655,62 +724,9 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
         model = "tap";
     }

-    s = net_tap_fd_init(peer, model, name, fd, vnet_hdr);
-    if (!s) {
-        close(fd);
-        return -1;
-    }
-
-    if (tap_set_sndbuf(s->fd, tap) < 0) {
-        return -1;
-    }
-
-    if (tap->has_fd) {
-        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "fd=%d", fd);
-    } else if (tap->has_helper) {
-        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "helper=%s",
-                 tap->helper);
-    } else {
-        const char *downscript;
-
-        downscript = tap->has_downscript ? tap->downscript :
-                                           DEFAULT_NETWORK_DOWN_SCRIPT;
-
-        snprintf(s->nc.info_str, sizeof(s->nc.info_str),
-                 "ifname=%s,script=%s,downscript=%s", ifname, script,
-                 downscript);
-
-        if (strcmp(downscript, "no") != 0) {
-            snprintf(s->down_script, sizeof(s->down_script), "%s", downscript);
-            snprintf(s->down_script_arg, sizeof(s->down_script_arg), "%s", ifname);
-        }
-    }
-
-    if (tap->has_vhost ? tap->vhost :
-        tap->has_vhostfd || (tap->has_vhostforce && tap->vhostforce)) {
-        int vhostfd;
-
-        if (tap->has_vhostfd) {
-            vhostfd = monitor_handle_fd_param(cur_mon, tap->vhostfd);
-            if (vhostfd == -1) {
-                return -1;
-            }
-        } else {
-            vhostfd =
[PATCH V4 12/22] tap: add Linux multiqueue support
This patch adds basic multiqueue support for Linux. When multiqueue is needed, we first check whether the kernel supports multiqueue tap before creating more queues. Two new functions, tap_fd_enable() and tap_fd_disable(), are introduced to enable and disable a specific queue. Since multiqueue is only supported on Linux, they return an error on other platforms.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/tap-aix.c     |   10 ++
 net/tap-bsd.c     |   11 +++
 net/tap-haiku.c   |   11 +++
 net/tap-linux.c   |   52
 net/tap-solaris.c |   11 +++
 net/tap_int.h     |    2 ++
 6 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/net/tap-aix.c b/net/tap-aix.c
index aff6c52..66e0574 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -59,3 +59,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+    return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+    return -1;
+}
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 01c705b..cfc7a28 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -145,3 +145,14 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+    return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+    return -1;
+}
+
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index 08cc034..664d40f 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -59,3 +59,14 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+    return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+    return -1;
+}
+
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 0a6acc7..bdb0a79 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -41,6 +41,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
     struct ifreq ifr;
     int fd, ret;
     int len = sizeof(struct virtio_net_hdr);
+    int mq_required = 0;

     TFR(fd = open(PATH_NET_TUN, O_RDWR));
     if (fd < 0) {
@@ -76,6 +77,20 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
         ioctl(fd, TUNSETVNETHDRSZ, &len);
     }

+    if (mq_required) {
+        unsigned int features;
+
+        if ((ioctl(fd, TUNGETFEATURES, &features) != 0) ||
+            !(features & IFF_MULTI_QUEUE)) {
+            error_report("multiqueue required, but no kernel "
+                         "support for IFF_MULTI_QUEUE available");
+            close(fd);
+            return -1;
+        } else {
+            ifr.ifr_flags |= IFF_MULTI_QUEUE;
+        }
+    }
+
     if (ifname[0] != '\0')
         pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
     else
@@ -209,3 +224,40 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
         }
     }
 }
+
+/* Enable a specific queue of tap. */
+int tap_fd_enable(int fd)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_ATTACH_QUEUE;
+    ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not enable queue");
+    }
+
+    return ret;
+}
+
+/* Disable a specific queue of tap/ */
+int tap_fd_disable(int fd)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_DETACH_QUEUE;
+    ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not disable queue");
+    }
+
+    return ret;
+}
+
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index 486a7ea..12cc392 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -225,3 +225,14 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+    return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+    return -1;
+}
+
diff --git a/net/tap_int.h b/net/tap_int.h
index 1dffe12..ca1c21b 100644
--- a/net/tap_int.h
+++ b/net/tap_int.h
@@ -42,5 +42,7 @@ int tap_probe_vnet_hdr_len(int fd, int len);
 int tap_probe_has_ufo(int fd);
 void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo);
 void tap_fd_set_vnet_hdr_len(int fd, int len);
+int tap_fd_enable(int fd);
+int tap_fd_disable(int fd);

 #endif /* QEMU_TAP_H */
-- 
1.7.1
[PATCH V4 13/22] tap: support enabling or disabling a queue
This patch introduces a new bit, enabled, in TAPState, which tracks whether a specific queue/fd is enabled. The tap/fd is enabled during initialization and can be enabled/disabled by tap_enable() and tap_disable(), which call platform-specific helpers to do the real work. Polling of a tap fd can only be done while the tap is enabled.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/tap.h |    2 ++
 net/tap-win32.c   |   10 ++
 net/tap.c         |   43 ---
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/include/net/tap.h b/include/net/tap.h
index 883cebf..a994f20 100644
--- a/include/net/tap.h
+++ b/include/net/tap.h
@@ -35,6 +35,8 @@ int tap_has_vnet_hdr_len(NetClientState *nc, int len);
 void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr);
 void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, int ecn, int ufo);
 void tap_set_vnet_hdr_len(NetClientState *nc, int len);
+int tap_enable(NetClientState *nc);
+int tap_disable(NetClientState *nc);

 int tap_get_fd(NetClientState *nc);

diff --git a/net/tap-win32.c b/net/tap-win32.c
index 601437e..91e9e84 100644
--- a/net/tap-win32.c
+++ b/net/tap-win32.c
@@ -764,3 +764,13 @@ void tap_set_vnet_hdr_len(NetClientState *nc, int len)
 {
     abort();
 }
+
+int tap_enable(NetClientState *nc)
+{
+    abort();
+}
+
+int tap_disable(NetClientState *nc)
+{
+    abort();
+}
diff --git a/net/tap.c b/net/tap.c
index 23fb6e0..8610ba2 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -59,6 +59,7 @@ typedef struct TAPState {
     bool write_poll;
     bool using_vnet_hdr;
     bool has_ufo;
+    bool enabled;
     VHostNetState *vhost_net;
     unsigned host_vnet_hdr_len;
 } TAPState;
@@ -72,9 +73,9 @@ static void tap_writable(void *opaque);
 static void tap_update_fd_handler(TAPState *s)
 {
     qemu_set_fd_handler2(s->fd,
-                         s->read_poll ? tap_can_send : NULL,
-                         s->read_poll ? tap_send : NULL,
-                         s->write_poll ? tap_writable : NULL,
+                         s->read_poll && s->enabled ? tap_can_send : NULL,
+                         s->read_poll && s->enabled ? tap_send : NULL,
+                         s->write_poll && s->enabled ? tap_writable : NULL,
                          s);
 }

@@ -337,6 +338,7 @@ static TAPState *net_tap_fd_init(NetClientState *peer,
     s->host_vnet_hdr_len = vnet_hdr ? sizeof(struct virtio_net_hdr) : 0;
     s->using_vnet_hdr = false;
     s->has_ufo = tap_probe_has_ufo(s->fd);
+    s->enabled = true;
     tap_set_offload(&s->nc, 0, 0, 0, 0, 0);
     /*
      * Make sure host header length is set correctly in tap:
@@ -735,3 +737,38 @@ VHostNetState *tap_get_vhost_net(NetClientState *nc)
     assert(nc->info->type == NET_CLIENT_OPTIONS_KIND_TAP);
     return s->vhost_net;
 }
+
+int tap_enable(NetClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled) {
+        return 0;
+    } else {
+        ret = tap_fd_enable(s->fd);
+        if (ret == 0) {
+            s->enabled = true;
+            tap_update_fd_handler(s);
+        }
+        return ret;
+    }
+}
+
+int tap_disable(NetClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled == 0) {
+        return 0;
+    } else {
+        ret = tap_fd_disable(s->fd);
+        if (ret == 0) {
+            qemu_purge_queued_packets(nc);
+            s->enabled = false;
+            tap_update_fd_handler(s);
+        }
+        return ret;
+    }
+}
-- 
1.7.1
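The enabled bit makes tap_enable() and tap_disable() idempotent: the platform helper is only called on a real state change. The standalone sketch below models just that behaviour. Queue, backend_set() and the call counter are hypothetical stand-ins (backend_set() takes the place of tap_fd_enable()/tap_fd_disable() and always succeeds), and the fd handler update and packet purge of the real code are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of TAPState that matter here. */
typedef struct Queue {
    int fd;
    bool enabled;
    int backend_calls; /* how often the platform helper was invoked */
} Queue;

/* Stand-in for tap_fd_enable()/tap_fd_disable(); always succeeds here. */
static int backend_set(Queue *q, bool on)
{
    (void)on;
    q->backend_calls++;
    return 0;
}

/* Enable the queue; a no-op if it is already enabled. */
static int queue_enable(Queue *q)
{
    int ret;

    assert(q != NULL);
    if (q->enabled) {
        return 0;
    }
    ret = backend_set(q, true);
    if (ret == 0) {
        q->enabled = true;
    }
    return ret;
}

/* Disable the queue; a no-op if it is already disabled. */
static int queue_disable(Queue *q)
{
    int ret;

    assert(q != NULL);
    if (!q->enabled) {
        return 0;
    }
    ret = backend_set(q, false);
    if (ret == 0) {
        q->enabled = false;
    }
    return ret;
}
```

As in the patch, the flag only changes when the helper reports success, so a failed ioctl leaves the recorded state in line with the kernel's.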
[PATCH V4 14/22] tap: introduce a helper to get the name of an interface
This patch introduces a helper, tap_get_ifname(), to get the device name of a tap device. It is needed when ifname is not specified on the command line and qemu is asked to create the tap device itself. In that case the name is allocated by the kernel, so if multiqueue is requested we need to fetch the name after creating the first queue. Only Linux has this support, since it is the only platform that supports multiqueue tap.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/tap.h |    1 +
 net/tap-aix.c     |    6 ++
 net/tap-bsd.c     |    4
 net/tap-haiku.c   |    4
 net/tap-linux.c   |   13 +
 net/tap-solaris.c |    4
 net/tap_int.h     |    1 +
 7 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/net/tap.h b/include/net/tap.h
index a994f20..c3eb85a 100644
--- a/include/net/tap.h
+++ b/include/net/tap.h
@@ -37,6 +37,7 @@ void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, int ecn,
 void tap_set_vnet_hdr_len(NetClientState *nc, int len);
 int tap_enable(NetClientState *nc);
 int tap_disable(NetClientState *nc);
+int tap_get_ifname(NetClientState *nc, char *ifname);

 int tap_get_fd(NetClientState *nc);

diff --git a/net/tap-aix.c b/net/tap-aix.c
index 66e0574..e760e9a 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -69,3 +69,9 @@ int tap_fd_disable(int fd)
 {
     return -1;
 }
+
+int tap_fd_get_ifname(int fd, char *ifname)
+{
+    return -1;
+}
+
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index cfc7a28..4f22109 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -156,3 +156,7 @@ int tap_fd_disable(int fd)
     return -1;
 }

+int tap_fd_get_ifname(int fd, char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index 664d40f..b3b5fbb 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -70,3 +70,7 @@ int tap_fd_disable(int fd)
     return -1;
 }

+int tap_fd_get_ifname(int fd, char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-linux.c b/net/tap-linux.c
index bdb0a79..3b21662 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -261,3 +261,16 @@ int tap_fd_disable(int fd)
     return ret;
 }

+int tap_fd_get_ifname(int fd, char *ifname)
+{
+    struct ifreq ifr;
+
+    if (ioctl(fd, TUNGETIFF, &ifr) != 0) {
+        error_report("TUNGETIFF ioctl() failed: %s",
+                     strerror(errno));
+        return -1;
+    }
+
+    pstrcpy(ifname, sizeof(ifr.ifr_name), ifr.ifr_name);
+    return 0;
+}
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index 12cc392..214d95e 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -236,3 +236,7 @@ int tap_fd_disable(int fd)
     return -1;
 }

+int tap_fd_get_ifname(int fd, char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap_int.h b/net/tap_int.h
index ca1c21b..125f83d 100644
--- a/net/tap_int.h
+++ b/net/tap_int.h
@@ -44,5 +44,6 @@ void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo);
 void tap_fd_set_vnet_hdr_len(int fd, int len);
 int tap_fd_enable(int fd);
 int tap_fd_disable(int fd);
+int tap_fd_get_ifname(int fd, char *ifname);

 #endif /* QEMU_TAP_H */
-- 
1.7.1
[PATCH V4 15/22] tap: multiqueue support
Recently, Linux gained support for multiqueue tap, which lets userspace call TUNSETIFF for a single device many times to create multiple file descriptors as independent queues. The user can also enable/disable a specific queue through TUNSETQUEUE.

The patch adds the generic infrastructure to create multiqueue taps. To achieve this, a new parameter, queues, is introduced to specify how many queues qemu itself is expected to create for the tap. Alternatively, management can pass multiple pre-created tap file descriptors separated with ':' through a new parameter, fds, like -netdev tap,id=hn0,fds=X:Y:..:Z. Multiple vhost file descriptors can also be passed in this way.

Each TAPState is still associated with one tap fd, which means that multiple TAPStates are created when the user needs multiqueue taps. Since each TAPState contains one NetClientState, with the multiqueue nic support, N peers of NetClientState are built up.

A new parameter, mq_required, is introduced in tap_open() to create multiqueue tap fds.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/tap.h |    1 -
 net/tap-aix.c     |    3 +-
 net/tap-bsd.c     |    3 +-
 net/tap-haiku.c   |    3 +-
 net/tap-linux.c   |    4 +-
 net/tap-solaris.c |    3 +-
 net/tap.c         |  158 +
 net/tap_int.h     |    3 +-
 qapi-schema.json  |    5 +-
 9 files changed, 139 insertions(+), 44 deletions(-)

diff --git a/include/net/tap.h b/include/net/tap.h
index c3eb85a..a994f20 100644
--- a/include/net/tap.h
+++ b/include/net/tap.h
@@ -37,7 +37,6 @@ void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, int ecn,
 void tap_set_vnet_hdr_len(NetClientState *nc, int len);
 int tap_enable(NetClientState *nc);
 int tap_disable(NetClientState *nc);
-int tap_get_ifname(NetClientState *nc, char *ifname);

 int tap_get_fd(NetClientState *nc);

diff --git a/net/tap-aix.c b/net/tap-aix.c
index e760e9a..804d164 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -25,7 +25,8 @@
 #include "tap_int.h"
 #include <stdio.h>

-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     fprintf(stderr, "no tap on AIX\n");
     return -1;
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 4f22109..bcdb268 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -33,7 +33,8 @@
 #include <net/if_tap.h>
 #endif

-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     int fd;
 #ifdef TAPGIFNAME
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index b3b5fbb..e5ce436 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -25,7 +25,8 @@
 #include "tap_int.h"
 #include <stdio.h>

-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     fprintf(stderr, "no tap on Haiku\n");
     return -1;
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 3b21662..a953189 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -36,12 +36,12 @@

 #define PATH_NET_TUN "/dev/net/tun"

-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     struct ifreq ifr;
     int fd, ret;
     int len = sizeof(struct virtio_net_hdr);
-    int mq_required = 0;

     TFR(fd = open(PATH_NET_TUN, O_RDWR));
     if (fd < 0) {
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index 214d95e..9c7278f 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -173,7 +173,8 @@ static int tap_alloc(char *dev, size_t dev_size)
     return tap_fd;
 }

-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     char dev[10]="";
     int fd;
diff --git a/net/tap.c b/net/tap.c
index 8610ba2..1bf7609 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -558,17 +558,10 @@ int net_init_bridge(const NetClientOptions *opts, const char *name,

 static int net_tap_init(const NetdevTapOptions *tap, int *vnet_hdr,
                         const char *setup_script, char *ifname,
-                        size_t ifname_sz)
+                        size_t ifname_sz, int mq_required)
 {
     int fd, vnet_hdr_required;

-    if (tap->has_ifname) {
-        pstrcpy(ifname, ifname_sz, tap->ifname);
-    } else {
-        assert(ifname_sz > 0);
-        ifname[0] = '\0';
-    }
-
     if (tap->has_vnet_hdr) {
         *vnet_hdr = tap->vnet_hdr;
         vnet_hdr_required = *vnet_hdr;
@@ -577,7 +570,8 @@ static int net_tap_init(const NetdevTapOptions *tap, int *vnet_hdr,
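The fds=X:Y:..:Z syntax described in the commit message amounts to splitting a ':'-separated list. The part of the patch that does this is not included in this excerpt, so the following is only a sketch of one way to parse such a list. parse_fd_list() is a hypothetical name, and the sketch assumes every element is a plain non-negative decimal descriptor number.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a ':'-separated list such as "3:4:5" into fds[0..max-1].
 * Returns the number of descriptors parsed, or -1 if the string is empty,
 * malformed, contains a negative or out-of-range number, or holds more
 * than `max` entries.
 */
static int parse_fd_list(const char *str, int *fds, int max)
{
    const char *p = str;
    int n = 0;

    assert(str != NULL && fds != NULL && max > 0);

    for (;;) {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p || errno != 0 || v < 0 || v > INT_MAX) {
            return -1; /* empty element or not a valid descriptor number */
        }
        if (n == max) {
            return -1; /* more entries than the caller can hold */
        }
        fds[n++] = (int)v;

        if (*end == '\0') {
            break;     /* consumed the whole string */
        }
        if (*end != ':') {
            return -1; /* unexpected separator */
        }
        p = end + 1;
    }
    return n;
}
```

With one descriptor per queue, the returned count would correspond to the number of TAPStates to create, and the same routine could be applied to a list of vhost descriptors.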
[PATCH V4 16/22] vhost: multiqueue support
This patch lets vhost support multiqueue. The idea is simple: launch multiple vhost threads and let each thread process a subset of the device's virtqueues. After this change, each emulated device can have multiple vhost threads as its backend.

To do this, a virtqueue index is introduced to record the first virtqueue that will be handled by this vhost_net device. Based on this and nvqs, vhost can calculate its relative index to set up the vhost_net device.

Since there may be many vhost/net devices for one virtio-net device, the setting of guest notifiers is moved out of the starting/stopping of a specific vhost thread. vhost_net_{start|stop}() are renamed to vhost_net_{start|stop}_one(), and new vhost_net_{start|stop}() are introduced to configure the guest notifiers and start/stop all vhost/vhost_net devices.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/vhost.c      |   82 +++-
 hw/vhost.h      |    2 +
 hw/vhost_net.c  |   86 +-
 hw/vhost_net.h  |    4 +-
 hw/virtio-net.c |    4 +-
 5 files changed, 120 insertions(+), 58 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index cee8aad..38257b9 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -619,14 +619,17 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
 {
     hwaddr s, l, a;
     int r;
+    int vhost_vq_index = idx - dev->vq_index;
     struct vhost_vring_file file = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct VirtQueue *vvq = virtio_get_queue(vdev, idx);

+    assert(idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs);
+
     vq->num = state.num = virtio_queue_get_num(vdev, idx);
     r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
     if (r) {
@@ -669,11 +672,12 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
         goto fail_alloc_ring;
     }

-    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
+    r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
     if (r < 0) {
         r = -errno;
         goto fail_alloc;
     }
+
     file.fd = event_notifier_get_fd(virtio_queue_get_host_notifier(vvq));
     r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
     if (r) {
@@ -709,9 +713,10 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
                                     unsigned idx)
 {
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = idx - dev->vq_index
     };
     int r;
+    assert(idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs);
     r = ioctl(dev->control, VHOST_GET_VRING_BASE, &state);
     if (r < 0) {
         fprintf(stderr, "vhost VQ %d ring restore failed: %d\n", idx, r);
@@ -867,7 +872,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     }

     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, true);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->vq_index + i,
+                                             true);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier binding failed: %d\n", i, -r);
             goto fail_vq;
@@ -877,7 +884,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     return 0;
 fail_vq:
     while (--i >= 0) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->vq_index + i,
+                                             false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup error: %d\n", i, -r);
             fflush(stderr);
@@ -898,7 +907,9 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     int i, r;

     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->vq_index + i,
+                                             false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup failed: %d\n", i, -r);
             fflush(stderr);
@@ -912,8 +923,9 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
  */
 bool vhost_virtqueue_pending(struct vhost_dev *hdev, int n)
 {
-    struct vhost_virtqueue *vq = hdev->vqs + n;
+    struct vhost_virtqueue *vq = hdev->vqs + n - hdev->vq_index;
     assert(hdev->started);
+    assert(n >= hdev->vq_index && n < hdev->vq_index + hdev->nvqs);
     return event_notifier_test_and_clear(&vq->masked_notifier);
 }

@@ -922,15 +934,16 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev,
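The index arithmetic used throughout this patch is small enough to show in isolation. In the sketch below, struct vq_slice and to_vhost_index() are hypothetical names. Only the two fields vq_index and nvqs and the range check come from the patch, which asserts the same condition before it subtracts vq_index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two vhost_dev fields the patch relies on. */
struct vq_slice {
    int vq_index; /* first virtqueue of the virtio device owned by this vhost device */
    int nvqs;     /* number of virtqueues owned by this vhost device */
};

/*
 * Translate a virtio-device-wide virtqueue index into the index relative
 * to one vhost device, i.e. the value passed to the vhost ioctls.
 */
static int to_vhost_index(const struct vq_slice *d, int idx)
{
    assert(d != NULL);
    assert(idx >= d->vq_index && idx < d->vq_index + d->nvqs);
    return idx - d->vq_index;
}
```

For example, a vhost device that owns virtqueues 2 and 3 of the virtio device sees them as its own queues 0 and 1, while host notifiers are still bound with the device-wide index hdev->vq_index + i.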
[PATCH V4 17/22] virtio: introduce virtio_del_queue()
Some devices (such as virtio-net) need the ability to destroy or re-order virtqueues; this patch adds a helper to do this.

Signed-off-by: Jason Wang jasowang
---
 hw/virtio.c |    9 +
 hw/virtio.h |    2 ++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index ca170c3..d8c77b0 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -701,6 +701,15 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
     return &vdev->vq[i];
 }

+void virtio_del_queue(VirtIODevice *vdev, int n)
+{
+    if (n < 0 || n >= VIRTIO_PCI_QUEUE_MAX) {
+        abort();
+    }
+
+    vdev->vq[n].vring.num = 0;
+}
+
 void virtio_irq(VirtQueue *vq)
 {
     trace_virtio_irq(vq);
diff --git a/hw/virtio.h b/hw/virtio.h
index 9cc7b85..d3da1d2 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -181,6 +181,8 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
                             void (*handle_output)(VirtIODevice *,
                                                   VirtQueue *));

+void virtio_del_queue(VirtIODevice *vdev, int n);
+
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len);
 void virtqueue_flush(VirtQueue *vq, unsigned int count);
-- 
1.7.1
[PATCH V4 18/22] virtio: add a queue_index to VirtQueue
Add a queue_index to VirtQueue and a helper to fetch it; this can be used by devices that support multiqueue.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio.c |    8
 hw/virtio.h |    1 +
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index d8c77b0..e259348 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -73,6 +73,8 @@ struct VirtQueue
     /* Notification enabled? */
     bool notification;

+    uint16_t queue_index;
+
     int inuse;

     uint16_t vector;
@@ -931,6 +933,7 @@ void virtio_init(VirtIODevice *vdev, const char *name,
     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
         vdev->vq[i].vector = VIRTIO_NO_VECTOR;
         vdev->vq[i].vdev = vdev;
+        vdev->vq[i].queue_index = i;
     }

     vdev->name = name;
@@ -1018,6 +1021,11 @@ VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n)
     return vdev->vq + n;
 }

+uint16_t virtio_get_queue_index(VirtQueue *vq)
+{
+    return vq->queue_index;
+}
+
 static void virtio_queue_guest_notifier_read(EventNotifier *n)
 {
     VirtQueue *vq = container_of(n, VirtQueue, guest_notifier);
diff --git a/hw/virtio.h b/hw/virtio.h
index d3da1d2..a29a54d 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -280,6 +280,7 @@ hwaddr virtio_queue_get_ring_size(VirtIODevice *vdev, int n);
 uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n);
 void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
 VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n);
+uint16_t virtio_get_queue_index(VirtQueue *vq);
 int virtio_queue_get_id(VirtQueue *vq);
 EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq);
 void virtio_queue_set_guest_notifier_fd_handler(VirtQueue *vq, bool assign,
-- 
1.7.1
[PATCH V4 19/22] virtio-net: separate virtqueue from VirtIONet
To support multiqueue virtio-net, the first step is to separate the virtqueue-related fields of VirtIONet into a new structure, VirtIONetQueue. The following patches will add an array of VirtIONetQueue to VirtIONet based on this patch.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c |  195 ---
 1 files changed, 114 insertions(+), 81 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index d30cc31..b4d53b3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -26,28 +26,33 @@
 #define MAC_TABLE_ENTRIES    64
 #define MAX_VLAN    (1 << 12)   /* Per 802.1Q definition */

+typedef struct VirtIONetQueue {
+    VirtQueue *rx_vq;
+    VirtQueue *tx_vq;
+    QEMUTimer *tx_timer;
+    QEMUBH *tx_bh;
+    int tx_waiting;
+    struct {
+        VirtQueueElement elem;
+        ssize_t len;
+    } async_tx;
+    struct VirtIONet *n;
+} VirtIONetQueue;
+
 typedef struct VirtIONet
 {
     VirtIODevice vdev;
     uint8_t mac[ETH_ALEN];
     uint16_t status;
-    VirtQueue *rx_vq;
-    VirtQueue *tx_vq;
+    VirtIONetQueue vq;
     VirtQueue *ctrl_vq;
     NICState *nic;
-    QEMUTimer *tx_timer;
-    QEMUBH *tx_bh;
     uint32_t tx_timeout;
     int32_t tx_burst;
-    int tx_waiting;
     uint32_t has_vnet_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
     uint8_t has_ufo;
-    struct {
-        VirtQueueElement elem;
-        ssize_t len;
-    } async_tx;
     int mergeable_rx_bufs;
     uint8_t promisc;
     uint8_t allmulti;
@@ -67,6 +72,12 @@ typedef struct VirtIONet
     DeviceState *qdev;
 } VirtIONet;

+static VirtIONetQueue *virtio_net_get_queue(NetClientState *nc)
+{
+    VirtIONet *n = qemu_get_nic_opaque(nc);
+
+    return &n->vq;
+}
 /* TODO
  * - we could suppress RX interrupt if we were so inclined.
  */
@@ -134,6 +145,8 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
             error_report("unable to start vhost net: %d: "
                          "falling back on userspace virtio", -r);
             n->vhost_started = 0;
+        } else {
+            n->vhost_started = 1;
         }
     } else {
         vhost_net_stop(&n->vdev, nc, 1);
@@ -144,25 +157,26 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    VirtIONetQueue *q = &n->vq;

     virtio_net_vhost_status(n, status);

-    if (!n->tx_waiting) {
+    if (!q->tx_waiting) {
         return;
     }

     if (virtio_net_started(n, status) && !n->vhost_started) {
-        if (n->tx_timer) {
-            qemu_mod_timer(n->tx_timer,
+        if (q->tx_timer) {
+            qemu_mod_timer(q->tx_timer,
                            qemu_get_clock_ns(vm_clock) + n->tx_timeout);
         } else {
-            qemu_bh_schedule(n->tx_bh);
+            qemu_bh_schedule(q->tx_bh);
         }
     } else {
-        if (n->tx_timer) {
-            qemu_del_timer(n->tx_timer);
+        if (q->tx_timer) {
+            qemu_del_timer(q->tx_timer);
         } else {
-            qemu_bh_cancel(n->tx_bh);
+            qemu_bh_cancel(q->tx_bh);
         }
     }
 }
@@ -474,35 +488,40 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 static int virtio_net_can_receive(NetClientState *nc)
 {
     VirtIONet *n = qemu_get_nic_opaque(nc);
+    VirtIONetQueue *q = virtio_net_get_queue(nc);
+
     if (!n->vdev.vm_running) {
         return 0;
     }

-    if (!virtio_queue_ready(n->rx_vq) ||
-        !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
+    if (!virtio_queue_ready(q->rx_vq) ||
+        !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return 0;
+    }

     return 1;
 }

-static int virtio_net_has_buffers(VirtIONet *n, int bufsize)
+static int virtio_net_has_buffers(VirtIONetQueue *q, int bufsize)
 {
-    if (virtio_queue_empty(n->rx_vq) ||
+    VirtIONet *n = q->n;
+    if (virtio_queue_empty(q->rx_vq) ||
         (n->mergeable_rx_bufs &&
-         !virtqueue_avail_bytes(n->rx_vq, bufsize, 0))) {
-        virtio_queue_set_notification(n->rx_vq, 1);
+         !virtqueue_avail_bytes(q->rx_vq, bufsize, 0))) {
+        virtio_queue_set_notification(q->rx_vq, 1);

         /* To avoid a race condition where the guest has made some buffers
          * available after the above check but before notification was
          * enabled, check for available buffers again.
          */
-        if (virtio_queue_empty(n->rx_vq) ||
+        if (virtio_queue_empty(q->rx_vq) ||
             (n->mergeable_rx_bufs &&
-             !virtqueue_avail_bytes(n->rx_vq, bufsize, 0)))
+             !virtqueue_avail_bytes(q->rx_vq, bufsize, 0))) {
             return 0;
+        }
     }

-    virtio_queue_set_notification(n->rx_vq, 0);
+    virtio_queue_set_notification(q->rx_vq, 0);
     return 1;
 }

@@ -605,6 +624,7 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)

 static ssize_t virtio_net_receive(NetClientState *nc, const
[PATCH V4 20/22] virtio-net: multiqueue support
This patch implements both userspace and vhost support for multiple queue
virtio-net (VIRTIO_NET_F_MQ). This is done by introducing an array of
VirtIONetQueue to VirtIONet.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c | 303 +++
 hw/virtio-net.h | 28 +-
 2 files changed, 264 insertions(+), 67 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index b4d53b3..0e4063f 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -44,7 +44,7 @@ typedef struct VirtIONet
     VirtIODevice vdev;
     uint8_t mac[ETH_ALEN];
     uint16_t status;
-    VirtIONetQueue vq;
+    VirtIONetQueue vqs[MAX_QUEUE_NUM];
     VirtQueue *ctrl_vq;
     NICState *nic;
     uint32_t tx_timeout;
@@ -70,14 +70,23 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    int multiqueue;
+    uint16_t max_queues;
+    uint16_t curr_queues;
 } VirtIONet;

-static VirtIONetQueue *virtio_net_get_queue(NetClientState *nc)
+static VirtIONetQueue *virtio_net_get_subqueue(NetClientState *nc)
 {
     VirtIONet *n = qemu_get_nic_opaque(nc);

-    return &n->vq;
+    return &n->vqs[nc->queue_index];
 }
+
+static int vq2q(int queue_index)
+{
+    return queue_index / 2;
+}
+
 /* TODO
  * - we could suppress RX interrupt if we were so inclined.
  */
@@ -93,6 +102,7 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
     struct virtio_net_config netcfg;

     stw_p(&netcfg.status, n->status);
+    stw_p(&netcfg.max_virtqueue_pairs, n->max_queues);
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -119,6 +129,7 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
 static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 {
     NetClientState *nc = qemu_get_queue(n->nic);
+    int queues = n->multiqueue ? n->max_queues : 1;

     if (!nc->peer) {
         return;
@@ -130,6 +141,7 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
     if (!tap_get_vhost_net(nc->peer)) {
         return;
     }
+
     if (!!n->vhost_started == virtio_net_started(n, status) &&
                               !nc->peer->link_down) {
         return;
@@ -140,16 +152,14 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
             return;
         }
         n->vhost_started = 1;
-        r = vhost_net_start(&n->vdev, nc, 1);
+        r = vhost_net_start(&n->vdev, n->nic->ncs, queues);
         if (r < 0) {
             error_report("unable to start vhost net: %d: "
                          "falling back on userspace virtio", -r);
             n->vhost_started = 0;
-        } else {
-            n->vhost_started = 1;
         }
     } else {
-        vhost_net_stop(&n->vdev, nc, 1);
+        vhost_net_stop(&n->vdev, n->nic->ncs, queues);
         n->vhost_started = 0;
     }
 }
@@ -157,26 +167,38 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
 {
     VirtIONet *n = to_virtio_net(vdev);
-    VirtIONetQueue *q = &n->vq;
+    VirtIONetQueue *q;
+    int i;
+    uint8_t queue_status;

     virtio_net_vhost_status(n, status);

-    if (!q->tx_waiting) {
-        return;
-    }
+    for (i = 0; i < n->max_queues; i++) {
+        q = &n->vqs[i];

-    if (virtio_net_started(n, status) && !n->vhost_started) {
-        if (q->tx_timer) {
-            qemu_mod_timer(q->tx_timer,
-                           qemu_get_clock_ns(vm_clock) + n->tx_timeout);
+        if ((!n->multiqueue && i != 0) || i >= n->curr_queues) {
+            queue_status = 0;
         } else {
-            qemu_bh_schedule(q->tx_bh);
+            queue_status = status;
         }
-    } else {
-        if (q->tx_timer) {
-            qemu_del_timer(q->tx_timer);
+
+        if (!q->tx_waiting) {
+            continue;
+        }
+
+        if (virtio_net_started(n, queue_status) && !n->vhost_started) {
+            if (q->tx_timer) {
+                qemu_mod_timer(q->tx_timer,
+                               qemu_get_clock_ns(vm_clock) + n->tx_timeout);
+            } else {
+                qemu_bh_schedule(q->tx_bh);
+            }
         } else {
-            qemu_bh_cancel(q->tx_bh);
+            if (q->tx_timer) {
+                qemu_del_timer(q->tx_timer);
+            } else {
+                qemu_bh_cancel(q->tx_bh);
+            }
         }
     }
 }
@@ -208,6 +230,8 @@ static void virtio_net_reset(VirtIODevice *vdev)
     n->nomulti = 0;
     n->nouni = 0;
     n->nobcast = 0;
+    /* multiqueue is disabled by default */
+    n->curr_queues = 1;

     /* Flush any MAC and VLAN filter table state */
     n->mac_table.in_use = 0;
@@ -249,18 +273,70 @@ static int peer_has_ufo(VirtIONet *n)

 static void virtio_net_set_mrg_rx_bufs(VirtIONet *n, int mergeable_rx_bufs)
 {
+    int i;
+    NetClientState *nc;
+
     n->mergeable_rx_bufs = mergeable_rx_bufs;

     n->guest_hdr_len = n->mergeable_rx_bufs ?
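The vq2q() helper in the patch above maps a virtqueue index to a queue-pair index by integer division. A minimal sketch of that index arithmetic follows. It assumes the usual multiqueue layout in which pair N owns virtqueue 2N for receive and 2N+1 for transmit; the helpers q2rx(), q2tx() and vq_is_tx() are hypothetical names introduced only for illustration and do not appear in the patch.

```c
#include <stdbool.h>

/* Queue-pair index for a virtqueue index, same arithmetic as the patch's vq2q(). */
int vq2q(int queue_index)
{
    return queue_index / 2;
}

/* Hypothetical inverse helpers, assuming pair N owns virtqueues 2N (rx) and 2N+1 (tx). */
int q2rx(int pair)
{
    return 2 * pair;
}

int q2tx(int pair)
{
    return 2 * pair + 1;
}

/* Under the same assumed layout, odd virtqueue indices are transmit queues. */
bool vq_is_tx(int queue_index)
{
    return queue_index % 2 == 1;
}
```

Under this assumption, virtqueues 4 and 5 both map to pair 2, which is why a single VirtIONetQueue entry in vqs[] can carry both rx_vq and tx_vq.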
[PATCH V4 21/22] virtio-net: migration support for multiqueue
This patch adds migration support for multiqueue virtio-net. Instead of
bumping the version, we conditionally send the multiqueue info only when
the device supports more than one queue, to maintain backward
compatibility.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c | 35 +--
 1 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 0e4063f..d57b255 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -1062,8 +1062,8 @@ static void virtio_net_set_multiqueue(VirtIONet *n, int multiqueue, int ctrl)

 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
+    int i;
     VirtIONet *n = opaque;
-    VirtIONetQueue *q = &n->vqs[0];

     /* At this point, backend must be stopped, otherwise
      * it might keep writing to memory. */
@@ -1071,7 +1071,7 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     virtio_save(&n->vdev, f);

     qemu_put_buffer(f, n->mac, ETH_ALEN);
-    qemu_put_be32(f, q->tx_waiting);
+    qemu_put_be32(f, n->vqs[0].tx_waiting);
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
     qemu_put_byte(f, n->promisc);
@@ -1087,13 +1087,19 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     qemu_put_byte(f, n->nouni);
     qemu_put_byte(f, n->nobcast);
     qemu_put_byte(f, n->has_ufo);
+    if (n->max_queues > 1) {
+        qemu_put_be16(f, n->max_queues);
+        qemu_put_be16(f, n->curr_queues);
+        for (i = 1; i < n->curr_queues; i++) {
+            qemu_put_be32(f, n->vqs[i].tx_waiting);
+        }
+    }
 }

 static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 {
     VirtIONet *n = opaque;
-    VirtIONetQueue *q = &n->vqs[0];
-    int ret, i;
+    int ret, i, link_down;

     if (version_id < 2 || version_id > VIRTIO_NET_VM_VERSION)
         return -EINVAL;
@@ -1104,7 +1110,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     }

     qemu_get_buffer(f, n->mac, ETH_ALEN);
-    q->tx_waiting = qemu_get_be32(f);
+    n->vqs[0].tx_waiting = qemu_get_be32(f);

     virtio_net_set_mrg_rx_bufs(n, qemu_get_be32(f));

@@ -1174,6 +1180,20 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
         }
     }

+    if (n->max_queues > 1) {
+        if (n->max_queues != qemu_get_be16(f)) {
+            error_report("virtio-net: different max_queues ");
+            return -1;
+        }
+
+        n->curr_queues = qemu_get_be16(f);
+        for (i = 1; i < n->curr_queues; i++) {
+            n->vqs[i].tx_waiting = qemu_get_be32(f);
+        }
+    }
+
+    virtio_net_set_queues(n);
+
     /* Find the first multicast entry in the saved MAC filter */
     for (i = 0; i < n->mac_table.in_use; i++) {
         if (n->mac_table.macs[i * ETH_ALEN] & 1) {
@@ -1184,7 +1204,10 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)

     /* nc.link_down can't be migrated, so infer link_down according
      * to link status bit in n->status */
-    qemu_get_queue(n->nic)->link_down = (n->status & VIRTIO_NET_S_LINK_UP) == 0;
+    link_down = (n->status & VIRTIO_NET_S_LINK_UP) == 0;
+    for (i = 0; i < n->max_queues; i++) {
+        qemu_get_subqueue(n->nic, i)->link_down = link_down;
+    }

     return 0;
 }
--
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
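The compatibility trick in this patch is that the extra multiqueue fields are written only when max_queues is greater than one, so a single-queue device produces the same byte stream as before. The sketch below illustrates that pattern in isolation. It is not QEMU code: Stream, Dev, dev_save() and dev_load() are hypothetical stand-ins for QEMUFile, VirtIONet and the save/load handlers, MAX_QUEUES is an arbitrary bound, and the curr_queues range check in dev_load() is an addition that the patch does not contain.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_QUEUES 8   /* arbitrary bound for this sketch; max_queues must not exceed it */

/* Hypothetical stand-in for QEMUFile: a flat big-endian byte stream. */
typedef struct {
    uint8_t buf[128];
    size_t pos;
} Stream;

static void put_be16(Stream *s, uint16_t v)
{
    s->buf[s->pos++] = (uint8_t)(v >> 8);
    s->buf[s->pos++] = (uint8_t)(v & 0xff);
}

static void put_be32(Stream *s, uint32_t v)
{
    put_be16(s, (uint16_t)(v >> 16));
    put_be16(s, (uint16_t)(v & 0xffff));
}

static uint16_t get_be16(Stream *s)
{
    uint16_t v = (uint16_t)((s->buf[s->pos] << 8) | s->buf[s->pos + 1]);
    s->pos += 2;
    return v;
}

static uint32_t get_be32(Stream *s)
{
    uint32_t hi = get_be16(s);
    uint32_t lo = get_be16(s);
    return (hi << 16) | lo;
}

/* Hypothetical device state, reduced to the fields the patch migrates. */
typedef struct {
    uint16_t max_queues;
    uint16_t curr_queues;
    uint32_t tx_waiting[MAX_QUEUES];
} Dev;

void dev_save(const Dev *d, Stream *s)
{
    put_be32(s, d->tx_waiting[0]);          /* legacy field, always present */
    if (d->max_queues > 1) {                /* extra section only for multiqueue devices */
        put_be16(s, d->max_queues);
        put_be16(s, d->curr_queues);
        for (int i = 1; i < d->curr_queues; i++) {
            put_be32(s, d->tx_waiting[i]);
        }
    }
}

int dev_load(Dev *d, Stream *s)
{
    d->tx_waiting[0] = get_be32(s);
    if (d->max_queues > 1) {
        if (get_be16(s) != d->max_queues) {
            return -1;                      /* source and destination disagree */
        }
        d->curr_queues = get_be16(s);
        if (d->curr_queues > d->max_queues) {
            return -1;                      /* extra sanity check, not in the patch */
        }
        for (int i = 1; i < d->curr_queues; i++) {
            d->tx_waiting[i] = get_be32(s);
        }
    }
    return 0;
}
```

The design relies on both sides being configured with the same max_queues: the loader decides whether to expect the extra section from its own configuration, not from anything in the stream, which is why a mismatch has to be treated as an error.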
[PATCH V4 22/22] virtio-net: compat multiqueue support
Disable multiqueue support for pre 1.4.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/pc_piix.c | 4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index b9a9b2e..84069b1 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -309,6 +309,10 @@ static QEMUMachine pc_i440fx_machine_v1_4 = {
             .driver   = "usb-tablet",\
             .property = "usb_version",\
             .value    = stringify(1),\
+        },{ \
+            .driver   = "virtio-net-pci", \
+            .property = "mq", \
+            .value    = "off", \
         }

 static QEMUMachine pc_machine_v1_3 = {
--
1.7.1
RE: [PATCH 3/8] KVM: PPC: booke: Added debug handler
> -----Original Message-----
> From: Alexander Graf [mailto:ag...@suse.de]
> Sent: Friday, January 25, 2013 5:13 PM
> To: Bhushan Bharat-R65777
> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 3/8] KVM: PPC: booke: Added debug handler
>
> On 16.01.2013, at 09:24, Bharat Bhushan wrote:
>
> > From: Bharat Bhushan bharat.bhus...@freescale.com
> >
> > Installed debug handler will be used for guest debug support and debug
> > facility emulation features (patches for these features will follow this
> > patch).
> >
> > Signed-off-by: Liu Yu yu@freescale.com
> > [bharat.bhus...@freescale.com: Substantial changes]
> > Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
> > ---
> >  arch/powerpc/include/asm/kvm_host.h | 1 +
> >  arch/powerpc/kernel/asm-offsets.c | 1 +
> >  arch/powerpc/kvm/booke_interrupts.S | 49 ++-
> >  3 files changed, 44 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > index 8a72d59..f4ba881 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -503,6 +503,7 @@ struct kvm_vcpu_arch {
> >  	u32 tlbcfg[4];
> >  	u32 mmucfg;
> >  	u32 epr;
> > +	u32 crit_save;
> >  	struct kvmppc_booke_debug_reg dbg_reg;
> >  #endif
> >  	gpa_t paddr_accessed;
> > diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> > index 46f6afd..02048f3 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -562,6 +562,7 @@ int main(void)
> >  	DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
> >  	DEFINE(VCPU_FAULT_DEAR, offsetof(struct kvm_vcpu, arch.fault_dear));
> >  	DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
> > +	DEFINE(VCPU_CRIT_SAVE, offsetof(struct kvm_vcpu, arch.crit_save));
> >  #endif /* CONFIG_PPC_BOOK3S */
> >  #endif /* CONFIG_KVM */
> >
> > diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
> > index eae8483..dd9c5d4 100644
> > --- a/arch/powerpc/kvm/booke_interrupts.S
> > +++ b/arch/powerpc/kvm/booke_interrupts.S
> > @@ -52,12 +52,7 @@
> >  	(1<<BOOKE_INTERRUPT_PROGRAM) | \
> >  	(1<<BOOKE_INTERRUPT_DTLB_MISS))
> >
> > -.macro KVM_HANDLER ivor_nr scratch srr0
> > -_GLOBAL(kvmppc_handler_\ivor_nr)
> > -	/* Get pointer to vcpu and record exit number. */
> > -	mtspr	\scratch , r4
> > -	mfspr	r4, SPRN_SPRG_THREAD
> > -	lwz	r4, THREAD_KVM_VCPU(r4)
> > +.macro __KVM_HANDLER ivor_nr scratch srr0
> >  	stw	r3, VCPU_GPR(R3)(r4)
> >  	stw	r5, VCPU_GPR(R5)(r4)
> >  	stw	r6, VCPU_GPR(R6)(r4)
> > @@ -74,6 +69,46 @@ _GLOBAL(kvmppc_handler_\ivor_nr)
> >  	bctr
> >  .endm
> >
> > +.macro KVM_HANDLER ivor_nr scratch srr0
> > +_GLOBAL(kvmppc_handler_\ivor_nr)
> > +	/* Get pointer to vcpu and record exit number. */
> > +	mtspr	\scratch , r4
> > +	mfspr	r4, SPRN_SPRG_THREAD
> > +	lwz	r4, THREAD_KVM_VCPU(r4)
> > +	__KVM_HANDLER \ivor_nr \scratch \srr0
> > +.endm
> > +
> > +.macro KVM_DBG_HANDLER ivor_nr scratch srr0
> > +_GLOBAL(kvmppc_handler_\ivor_nr)
> > +	mtspr	\scratch, r4
> > +	mfspr	r4, SPRN_SPRG_THREAD
> > +	lwz	r4, THREAD_KVM_VCPU(r4)
> > +	stw	r3, VCPU_CRIT_SAVE(r4)
> > +	mfcr	r3
> > +	mfspr	r4, SPRN_CSRR1
> > +	andi.	r4, r4, MSR_PR
> > +	bne	1f
> > +	/* debug interrupt happened in enter/exit path */
> > +	mfspr	r4, SPRN_CSRR1
> > +	rlwinm	r4, r4, 0, ~MSR_DE
> > +	mtspr	SPRN_CSRR1, r4
> > +	lis	r4, 0x
> > +	ori	r4, r4, 0x
> > +	mtspr	SPRN_DBSR, r4
> > +	mfspr	r4, SPRN_SPRG_THREAD
> > +	lwz	r4, THREAD_KVM_VCPU(r4)
> > +	mtcr	r3
> > +	lwz	r3, VCPU_CRIT_SAVE(r4)
> > +	mfspr	r4, \scratch
> > +	rfci
>
> What is this part doing? Try to ignore the debug exit?

As BOOKE doesn't have hardware support for virtualization, the hardware
never knows whether the current PC is in the guest or in the host. So when
hardware single step is enabled for the guest, it cannot be disabled at
guest exit time. Thus, a single-step interrupt occurs at the beginning of
the guest exit path.

With the above code we recognize this kind of single-step interrupt,
disable single step, and rfci.

> Why would we have MSR_DE enabled in the first place when we can't handle it?

When QEMU is using hardware debug resources, we always set MSR_DE while the
guest is running.

> > +1:	/* debug interrupt happened in guest */
> > +	mtcr	r3
> > +	mfspr	r4, SPRN_SPRG_THREAD
> > +	lwz	r4, THREAD_KVM_VCPU(r4)
> > +	lwz	r3, VCPU_CRIT_SAVE(r4)
> > +	__KVM_HANDLER \ivor_nr \scratch \srr0
>
> I don't think you need the __KVM_HANDLER split. This should be quite easily
> refactorable into a simple DBG prolog.

Can you please elaborate how you are envisioning this?

Thanks
-Bharat

> Alex
>
> > +.endm
> > +
> >  .macro KVM_HANDLER_ADDR ivor_nr
> >  	.long	kvmppc_handler_\ivor_nr
> >  .endm
> > @@ -98,7 +133,7 @@ KVM_HANDLER BOOKE_INTERRUPT_FIT SPRN_SPRG_RSCRATCH0 SPRN_SRR0
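The decision made by KVM_DBG_HANDLER can be modelled in a few lines of C, which may make the control flow easier to follow than the assembly. This is only a sketch under stated assumptions: DebugState and debug_interrupt_from_guest() are hypothetical names, the MSR_PR and MSR_DE bit positions are the Book-E values as I recall them from the Linux headers, and the DBSR clear is modelled as simply zeroing the field because the constant written by the assembly is elided in the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_PR (1u << 14)  /* problem state; assumed Book-E bit position */
#define MSR_DE (1u << 9)   /* debug interrupt enable; assumed Book-E bit position */

/* Hypothetical model of the two registers the handler inspects and modifies. */
typedef struct {
    uint32_t csrr1;  /* MSR value saved by the critical/debug interrupt */
    uint32_t dbsr;   /* pending debug status bits */
} DebugState;

/*
 * Returns true when the debug interrupt was taken while the guest was running
 * (saved MSR has PR set) and must be forwarded to the normal KVM exit path.
 * Otherwise the interrupt hit the host enter/exit path: drop MSR_DE from the
 * saved MSR so single step does not fire again, clear the pending status,
 * and tell the caller to return from the interrupt directly.
 */
bool debug_interrupt_from_guest(DebugState *s)
{
    if (s->csrr1 & MSR_PR) {
        return true;
    }
    s->csrr1 &= ~MSR_DE;
    s->dbsr = 0;   /* models the DBSR clear; the real constant is not shown in the patch */
    return false;
}
```

This matches Bharat's explanation: the hardware cannot tell guest from host, so the handler uses the PR bit of the saved MSR as the only available hint.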
[Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On 29.01.2013 16:41, Juan Quintela wrote:
> * Portio port to new memory regions?
>   Andreas, could you fill?

MemoryRegion's .old_portio mechanism requires workarounds for VGA on ppc,
affecting among others the sPAPR PCI host bridge:
http://git.qemu.org/?p=qemu.git;a=commit;h=a3cfa18eb075c7ef78358ca1956fe7b01caa1724

Patches were posted and merged removing all .old_portio users but one:

hw/ioport.c:portio_list_add_1(), used by portio_list_add()

hw/isa-bus.c:    portio_list_add(piolist, isabus->address_space_io, start);
hw/qxl.c:    portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0);
hw/vga.c:    portio_list_add(vga_port_list, address_space_io, 0x3b0);
hw/vga.c:    portio_list_add(vbe_port_list, address_space_io, 0x1ce);

Proposal by hpoussin was to move _list_add() code to ISADevice:
http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Concerns:
* PCI devices (VGA, QXL) register I/O ports as well
  => above patches add dependency on ISABus to machines
     -> benh: "no mac ever had one"
  => PCIDevice shouldn't use ISA API with NULL ISADevice
* Lack of avi: Who decides about memory API these days?

armbru and agraf concluded that moving this into ISA is wrong.

=> I will drop the remaining ioport patches from above series.

Suggestions on how to proceed with tackling the issue are welcome.

Regards,
Andreas

--
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
> Proposal by hpoussin was to move _list_add() code to ISADevice:
> http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
>
> Concerns:
> * PCI devices (VGA, QXL) register I/O ports as well
>   => above patches add dependency on ISABus to machines
>      -> benh: "no mac ever had one"
>   => PCIDevice shouldn't use ISA API with NULL ISADevice
> * Lack of avi: Who decides about memory API these days?
>
> armbru and agraf concluded that moving this into ISA is wrong.
>
> => I will drop the remaining ioport patches from above series.
>
> Suggestions on how to proceed with tackling the issue are welcome.

How does this stuff work on real hardware? I would have expected that a
PCI device registering the fact it has IO ports would have to do so via
the PCI controller it is plugged into...

My naive don't-know-much-about-portio suggestion is that this should work
the same way as memory regions: each device provides portio regions, and
the controller for the bus (ISA or PCI) exposes those to the next layer
up, and something at board level maps it all into the right places.

-- PMM
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, Jan 30, 2013 at 11:48:14AM +0000, Peter Maydell wrote:
> On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
> > Proposal by hpoussin was to move _list_add() code to ISADevice:
> > http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
> >
> > Concerns:
> > * PCI devices (VGA, QXL) register I/O ports as well
> >   => above patches add dependency on ISABus to machines
> >      -> benh: "no mac ever had one"
> >   => PCIDevice shouldn't use ISA API with NULL ISADevice
> > * Lack of avi: Who decides about memory API these days?
> >
> > armbru and agraf concluded that moving this into ISA is wrong.
> >
> > => I will drop the remaining ioport patches from above series.
> >
> > Suggestions on how to proceed with tackling the issue are welcome.
>
> How does this stuff work on real hardware? I would have expected that a
> PCI device registering the fact it has IO ports would have to do so via
> the PCI controller it is plugged into...

All programming is done by the OS; devices do not register with the
controller.

Each bridge has two ways to claim an IO transaction:
- the transaction is within the window programmed in the bridge
- subtractive decoding is enabled and no one else claims the transaction

At the bus level, a transaction happens on a bus and an appropriate device
will claim it.

> My naive don't-know-much-about-portio suggestion is that this should work
> the same way as memory regions: each device provides portio regions, and
> the controller for the bus (ISA or PCI) exposes those to the next layer
> up, and something at board level maps it all into the right places.
>
> -- PMM
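The two ways a bridge can claim an I/O transaction, as described in the reply above, can be written down as a small predicate. This is only an illustrative sketch, not QEMU or kernel code: the Bridge structure, its field names, and the functions bridge_window_hit() and bridge_claims() are hypothetical, and the window is modelled as a simple inclusive port range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the I/O decoding state of one bridge. */
typedef struct {
    uint32_t io_base;     /* first port of the window programmed by the OS */
    uint32_t io_limit;    /* last port of that window (inclusive) */
    bool subtractive;     /* subtractive decoding enabled */
} Bridge;

/* Positive decode: the port falls inside the programmed window. */
bool bridge_window_hit(const Bridge *b, uint32_t port)
{
    return port >= b->io_base && port <= b->io_limit;
}

/*
 * A bridge claims the transaction if the port is inside its window, or if
 * subtractive decoding is enabled and no other agent on the bus claimed it.
 */
bool bridge_claims(const Bridge *b, uint32_t port, bool claimed_by_other)
{
    if (bridge_window_hit(b, port)) {
        return true;
    }
    return b->subtractive && !claimed_by_other;
}
```

With such a model, a legacy port like 0x3b0 that lies outside every programmed window still reaches the bus behind the one bridge that has subtractive decoding enabled, provided nobody else claimed it first.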
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On 30.01.2013, at 12:48, Peter Maydell wrote:

> On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
>> Proposal by hpoussin was to move _list_add() code to ISADevice:
>> http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
>>
>> Concerns:
>> * PCI devices (VGA, QXL) register I/O ports as well
>>   => above patches add dependency on ISABus to machines
>>      -> benh: "no mac ever had one"
>>   => PCIDevice shouldn't use ISA API with NULL ISADevice
>> * Lack of avi: Who decides about memory API these days?
>>
>> armbru and agraf concluded that moving this into ISA is wrong.
>>
>> => I will drop the remaining ioport patches from above series.
>>
>> Suggestions on how to proceed with tackling the issue are welcome.
>
> How does this stuff work on real hardware? I would have expected that a
> PCI device registering the fact it has IO ports would have to do so via
> the PCI controller it is plugged into...

That's pretty much how it works for PCI hardware, yes. For ISA like
hardware, I asked Ben last night:

29-01-2013 23:41:10 agraf: benh: hey ben :)
29-01-2013 23:41:50 agraf: benh: do you remember if g3 beige (grackle) and/or U2 based macs had an actual ISA bus exposed through MMIO or whether it was PCI only with a PIO compat region mapped by the PCI controller?
29-01-2013 23:59:28 benh!~benh@180.200.150.145: agraf: no ISA
29-01-2013 23:59:48 benh!~benh@180.200.150.145: agraf: no mac ever had one
29-01-2013 23:59:57 agraf: benh: well, MCP750 has one
30-01-2013 00:00:06 agraf: benh: that's why I'm asking :)
30-01-2013 00:00:17 benh!~benh@180.200.150.145: mcp750 ? what is this ?
30-01-2013 00:00:28 agraf: benh: some motorola soc
30-01-2013 00:00:39 benh!~benh@180.200.150.145: ah ok
30-01-2013 00:00:50 benh!~benh@180.200.150.145: mostly ISA is just hooked onto PCI anyway
30-01-2013 00:00:59 benh!~benh@180.200.150.145: ie, PCI cycles with low addresses land on ISA
30-01-2013 00:01:59 agraf: benh: sounds tricky to model :)
30-01-2013 00:02:44 benh!~benh@180.200.150.145: that's also how it works on x86
30-01-2013 00:03:05 benh!~benh@180.200.150.145: dunno how it works on that specific SoC tho but that's how it's usually done
30-01-2013 00:04:36 agraf: interesting - didn't know that :)
30-01-2013 00:04:51 agraf: on x86 it's hard to see from a software pov, because everything's linear ;)
30-01-2013 00:26:27 benh!~benh@180.200.150.145: yeah, that's why x86 has a memory hole to make room for ISA
30-01-2013 00:26:40 benh!~benh@180.200.150.145: while usually on ppc we remap things with an offset so we don't have to punch a hole in ram

> My naive don't-know-much-about-portio suggestion is that this should work
> the same way as memory regions: each device provides portio regions, and
> the controller for the bus (ISA or PCI) exposes those to the next layer
> up, and something at board level maps it all into the right places.

Right. With the addition that on some boards, the PCI host controller
which provides a portio map would also expose an ISABus for devices to
plug in. At least if I understand Ben correctly.

Alex
[PATCH v4] KVM: VMX: enable acknowledge interupt on vmexit
From: Yang Zhang yang.z.zh...@intel.com

The "acknowledge interrupt on exit" feature controls processor behavior
for external interrupt acknowledgement. When this control is set, the
processor acknowledges the interrupt controller to acquire the interrupt
vector on VM exit.

After enabling this feature, an interrupt that arrives while the target
CPU is running in VMX non-root mode is handled by the VMX handler instead
of the handler in the IDT. Currently, the VMX handler only fakes an
interrupt stack and jumps to the IDT to let the real handler handle it.
In the future, we will recognize the interrupt and deliver through the IDT
only those interrupts that do not belong to the current vCPU. Interrupts
that belong to the current vCPU will be handled inside the VMX handler.
This will reduce KVM's interrupt handling cost.

The interrupt enable logic also changes when this feature is turned on:
before this patch, the hypervisor called local_irq_enable() to enable
interrupts directly. Now the IF bit is set on the interrupt stack frame,
so interrupts are enabled on return from the interrupt handler if an
external interrupt exists. If there is no external interrupt,
local_irq_enable() is still called to enable them.

Refer to Intel SDM volume 3, chapter 33.2.
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/svm.c | 6 +++
 arch/x86/kvm/vmx.c | 70 --
 arch/x86/kvm/x86.c | 4 ++-
 4 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 77d56a4..1f1b2f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -725,6 +725,7 @@ struct kvm_x86_ops {
 	int (*check_intercept)(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
+	void (*handle_external_intr)(struct kvm_vcpu *vcpu);
 };

 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d29d3cd..c283185 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4227,6 +4227,11 @@ out:
 	return ret;
 }

+static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
+{
+	local_irq_enable();
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -4318,6 +4323,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.set_tdp_cr3 = set_tdp_cr3,

 	.check_intercept = svm_check_intercept,
+	.handle_external_intr = svm_handle_external_intr,
 };

 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 02eeba8..eaef185 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -379,6 +379,7 @@ struct vcpu_vmx {
 	struct shared_msr_entry *guest_msrs;
 	int nmsrs;
 	int save_nmsrs;
+	unsigned long host_idt_base;
 #ifdef CONFIG_X86_64
 	u64 msr_host_kernel_gs_base;
 	u64 msr_guest_kernel_gs_base;
@@ -2565,7 +2566,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
-	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
+	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
+		VM_EXIT_ACK_INTR_ON_EXIT;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
 				&_vmexit_control) < 0)
 		return -EIO;
@@ -3742,11 +3744,12 @@ static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
  * Note that host-state that does change is set elsewhere. E.g., host-state
  * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
  */
-static void vmx_set_constant_host_state(void)
+static void vmx_set_constant_host_state(struct kvm_vcpu *vcpu)
 {
 	u32 low32, high32;
 	unsigned long tmpl;
 	struct desc_ptr dt;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);

 	vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS);  /* 22.2.3 */
 	vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
@@ -3770,6 +3773,7 @@ static void vmx_set_constant_host_state(void)

 	native_store_idt(&dt);
 	vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
+	vmx->host_idt_base = dt.address;

 	vmcs_writel(HOST_RIP, vmx_return); /* 22.2.5 */

@@ -3884,7 +3888,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)

 	vmcs_write16(HOST_FS_SELECTOR, 0);            /* 22.2.4 */
 	vmcs_write16(HOST_GS_SELECTOR, 0);            /* 22.2.4 */
-	vmx_set_constant_host_state();
+	vmx_set_constant_host_state(&vmx->vcpu);
 #ifdef CONFIG_X86_64
 	rdmsrl(MSR_FS_BASE, a);
 	vmcs_writel(HOST_FS_BASE, a); /* 22.2.4 */
@@ -6094,6 +6098,63 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx
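The commit message says the VMX handler jumps into the IDT, and the patch records host_idt_base for that purpose. The part of the patch that performs the jump is cut off above, so the following is only a sketch of the lookup such a handler needs, assuming 64-bit mode: find the gate for the acknowledged vector and reassemble the handler address from the three offset fields of the 16-byte gate descriptor. Gate64, idt_gate() and gate_entry() are hypothetical names for illustration, not the kernel's own types or helpers.

```c
#include <stdint.h>

/* 64-bit mode IDT gate descriptor: 16 bytes, handler offset split in three parts. */
typedef struct {
    uint16_t offset_low;     /* handler address bits 15..0 */
    uint16_t selector;       /* code segment selector */
    uint8_t  ist;            /* interrupt stack table index in the low 3 bits */
    uint8_t  type_attr;      /* gate type, DPL, present bit */
    uint16_t offset_middle;  /* handler address bits 31..16 */
    uint32_t offset_high;    /* handler address bits 63..32 */
    uint32_t reserved;
} Gate64;

/* Gate for a vector, given a pointer to the start of the IDT. */
const Gate64 *idt_gate(const Gate64 *idt_base, uint8_t vector)
{
    return &idt_base[vector];
}

/* Reassemble the 64-bit handler entry point from the split offset fields. */
uint64_t gate_entry(const Gate64 *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_middle << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

A handler following the approach in the commit message would then build an interrupt stack frame by hand and transfer control to the address returned by gate_entry(); that step needs inline assembly and is left out of this sketch.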
Re: [Qemu-devel] What to do about non-qdevified devices?
Peter Maydell peter.mayd...@linaro.org writes:

> On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
>> Anthony Liguori aligu...@us.ibm.com writes:
>>
>> [...]
>>> The problems I ran into were (1) this is a lot of work (2) it basically
>>> requires that all bus children have been qdev/QOM-ified. Even with
>>> something like the ISA bus which is where I started, quite a few devices
>>> were not qdevified still.
>>
>> So what's the plan to complete the qdevification job? Lay really low
>> and quietly hope the problem goes away? We've tried that for about
>> three years, doesn't seem to work.
>
> Do we have a list of not-yet-qdevified devices? Maybe we need to
> start saying "fix X Y and Z or platform P is dropped from the next
> release". (This would of course be easier if we had a way to let users
> know that platform P was in danger...)

I think that's a good idea. Only problem is identifying pre-qdev
devices in the code requires code inspection (grep won't do, I'm
afraid).

If we agree on a "qdevify or else" plan, I'd be prepared to help with
the digging up of devices.
Re: [PATCH v4] KVM: VMX: enable acknowledge interupt on vmexit
On Wed, Jan 30, 2013 at 08:36:12PM +0800, Yang Zhang wrote:
> From: Yang Zhang yang.z.zh...@intel.com
>
> The "acknowledge interrupt on exit" feature controls processor behavior
> for external interrupt acknowledgement. When this control is set, the
> processor acknowledges the interrupt controller to acquire the interrupt
> vector on VM exit.
>
> After enabling this feature, an interrupt that arrives while the target
> CPU is running in VMX non-root mode is handled by the VMX handler instead
> of the handler in the IDT. Currently, the VMX handler only fakes an
> interrupt stack and jumps to the IDT to let the real handler handle it.
> In the future, we will recognize the interrupt and deliver through the IDT
> only those interrupts that do not belong to the current vCPU. Interrupts
> that belong to the current vCPU will be handled inside the VMX handler.
> This will reduce KVM's interrupt handling cost.
>
> The interrupt enable logic also changes when this feature is turned on:
> before this patch, the hypervisor called local_irq_enable() to enable
> interrupts directly. Now the IF bit is set on the interrupt stack frame,
> so interrupts are enabled on return from the interrupt handler if an
> external interrupt exists. If there is no external interrupt,
> local_irq_enable() is still called to enable them.
>
> Refer to Intel SDM volume 3, chapter 33.2.
>
Looks good to me except for one comment below. Send this patch as part of
the posted interrupt series; there is no point in applying it separately.
> Signed-off-by: Yang Zhang yang.z.zh...@intel.com
> ---
>  arch/x86/include/asm/kvm_host.h | 1 +
>  arch/x86/kvm/svm.c | 6 +++
>  arch/x86/kvm/vmx.c | 70 --
>  arch/x86/kvm/x86.c | 4 ++-
>  4 files changed, 76 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 77d56a4..1f1b2f8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -725,6 +725,7 @@ struct kvm_x86_ops {
>  	int (*check_intercept)(struct kvm_vcpu *vcpu,
>  			       struct x86_instruction_info *info,
>  			       enum x86_intercept_stage stage);
> +	void (*handle_external_intr)(struct kvm_vcpu *vcpu);
>  };
>
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d29d3cd..c283185 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -4227,6 +4227,11 @@ out:
>  	return ret;
>  }
>
> +static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
> +{
> +	local_irq_enable();
> +}
> +
>  static struct kvm_x86_ops svm_x86_ops = {
>  	.cpu_has_kvm_support = has_svm,
>  	.disabled_by_bios = is_disabled,
> @@ -4318,6 +4323,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>  	.set_tdp_cr3 = set_tdp_cr3,
>
>  	.check_intercept = svm_check_intercept,
> +	.handle_external_intr = svm_handle_external_intr,
>  };
>
>  static int __init svm_init(void)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 02eeba8..eaef185 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -379,6 +379,7 @@ struct vcpu_vmx {
>  	struct shared_msr_entry *guest_msrs;
>  	int nmsrs;
>  	int save_nmsrs;
> +	unsigned long host_idt_base;
>  #ifdef CONFIG_X86_64
>  	u64 msr_host_kernel_gs_base;
>  	u64 msr_guest_kernel_gs_base;
> @@ -2565,7 +2566,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  #ifdef CONFIG_X86_64
>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
> +		VM_EXIT_ACK_INTR_ON_EXIT;
>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
>  				&_vmexit_control) < 0)
>  		return -EIO;
> @@ -3742,11 +3744,12 @@ static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>   * Note that host-state that does change is set elsewhere. E.g., host-state
>   * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
>   */
> -static void vmx_set_constant_host_state(void)
> +static void vmx_set_constant_host_state(struct kvm_vcpu *vcpu)
Pass vmx to the function. No need to convert vmx to vcpu and back.

>  {
>  	u32 low32, high32;
>  	unsigned long tmpl;
>  	struct desc_ptr dt;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>
>  	vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS);  /* 22.2.3 */
>  	vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
> @@ -3770,6 +3773,7 @@ static void vmx_set_constant_host_state(void)
>
>  	native_store_idt(&dt);
>  	vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
> +	vmx->host_idt_base = dt.address;
>
>  	vmcs_writel(HOST_RIP, vmx_return); /* 22.2.5 */
>
> @@ -3884,7 +3888,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>
>  	vmcs_write16(HOST_FS_SELECTOR, 0);            /* 22.2.4 */
>  	vmcs_write16(HOST_GS_SELECTOR, 0);            /* 22.2.4 */
Re: vCPU hotplug roadmap (was: Minutes for KVM call 2013-01-15)
On Wed, Jan 30, 2013 at 11:58:56AM +0100, Andreas Färber wrote:
On 15.01.2013 17:16, Juan Quintela wrote:

* cpu hot plug
  - use qdev properties connected to a set of socket objects (anthony)
  - cpusets are the wrong interface (anthony)
  - make a link between cpu - socket instead of a property?
  - how far are we from being able to describe a cpu with -device? (didn't hear the answer, andreas?)
  - perhaps the best approach?
  - After soft-freeze, exceptions depend on the maintainer
  - After hard-freeze, no exceptions
  - -device doesn't require a bus, just an implementation detail, we can change that
  - use cpuset as an intermediate step until full vision is implemented
  - several approaches from where we are now, to have something before we get a full solution

At this point, Andreas agreed to write a better summary of the discussion and suggestions O:-)

Got buried, here we go:

== vCPU hot-plug user interfaces ==

=== cpu_set ===

Previously available in qemu-kvm.git: `cpu_set n+1 online` via HMP

Pros:
* Hides QOM/qdev implementation details (afaerber)
* Thus: Doesn't depend on QOM CPUState refactoring (imammedo)
* Opens a fast route to implementing vCPU unplug in KVM (imammedo)
* Unintrusive to add and easy to obsolete/remove in future (imammedo)
* Existing virt-test cases (afaerber)
* Supported by libvirt (imammedo)
* Prevents confusing guests by hot-plugging random mix of CPUs (agraf)

Cons:
* Cannot express topologies (ehabkost)

Actually, I believe this is not the main problem (we will have exactly the same limitation if using thread-level device_add). To me, the main problem is that we are creating a new QMP command that should eventually be obsoleted by device_add.

=== device_add ===

`device_add driver=Haswell-x86_64-cpu id=qdevid`
[You can try this today and see it failing / not working.]

Pros:
* QMP/HMP command available today and known to users (afaerber)
* Unified command for device and CPU hot-plug (imammedo)
* Would allow first doing thread-level vCPU hotplug (imammedo)
* Could be extended to support socket-level hot-plug (aliguori/imammedo)

Cons:
* Operates on raw QOM type name unlike -cpu (afaerber)
* Needs support in libvirt for device_add driver=CPU (imammedo)
* libvirt needs means to enumerate CPU types (imammedo) => QMP? (AF)

Challenges:
* No CPU qbus (afaerber) => should work without (aliguori)
* CPU subclasses needed for identifying type name (afaerber/imammedo) => Haswell-x86_64-cpu does not exist yet, just x86_64-cpu
* CPU class_init for -cpu host requires KVM init (imammedo) [suggestion by ehabkost to use kvm_arch_vcpu_init, WIP by afaerber]

I don't know what you mean by "use kvm_arch_vcpu_init()". I sent an RFC following somebody's suggestion of simply making kvm_arch_init() call a function to finish the -cpu host initialization, as we can't initialize everything inside class_init. See x86_cpu_finish_host_class_init() at:
Message-Id: 1357329382-20944-7-git-send-email-ehabk...@redhat.com
http://article.gmane.org/gmane.comp.emulators.qemu/186778

* Conversion of CPU features to static properties needed (imammedo) => device_add driver=foo,level=x,xlevel=y,...
* Alternatively conversion to global properties (imammedo)
* Cements type names - rename for 1.4? (afaerber) => permissible (alig.) [patches for arm, m68k, openrisc, unicore32 on list]

=== qom-set ===

`qom-set` via QMP w/ link<CPUSocket> property (aliguori)

Topology represented in QOM:
CPUSocket has-a CPUCore has-a CPUThread a.k.a. CPUState, or
CPUSocket links-to CPUCore links-to CPUThread a.k.a. CPUState

Challenges (afaerber):
* No CPUSocket/CPUCore objects yet and may take a while to get there... topology fields being moved to CPUState for 1.4 [done, more WIP]
* No decisions on canonical paths for CPUs: CPU? machine? unassigned?
* Duality of thread-level device types and socket-level? (afaerber) => fine to have, e.g., quad-core Xeon 500 device (aliguori)
* CPUState is no_user (afaerber) => need to generally drop no_user for QOM (aliguori)

I would like to drop no_user in 1.5 even if we don't manage to finish CPU hotplug, as exposing the CPU objects and classes will be very useful for allowing libvirt to probe for the available CPU models and features.

=== libvirt ===

libvirt's XML topology modelling is closer to today's -smp than to the desired QOM modelling:
http://www.libvirt.org/formatcaps.html

`virsh setvcpus domain n`
http://libvirt.org/sources/virshcmdref/html/sect-setvcpus.html

== qom-cpu course of action (afaerber) ==

It was requested to have vCPU hot-plug in v1.5.

For device_add we need to move code from cpu_init() into QOM facilities.
=> QOM realize support would help [applied by aliguori]
=> cleanups piggy-backed onto CPU realizefn [applied to qom-cpu-next]

Agreement on goal of X86CPU subclasses, but conflicts how to get there:
* Refactor
RE: [PATCH v4] KVM: VMX: enable acknowledge interupt on vmexit
Gleb Natapov wrote on 2013-01-30:
On Wed, Jan 30, 2013 at 08:36:12PM +0800, Yang Zhang wrote:
From: Yang Zhang yang.z.zh...@intel.com

The "acknowledge interrupt on exit" feature controls processor behavior for external interrupt acknowledgement. When this control is set, the processor acknowledges the interrupt controller to acquire the interrupt vector on VM exit.

After enabling this feature, an interrupt that arrives while the target CPU is running in VMX non-root mode will be handled by the VMX handler instead of the handler in the IDT. Currently, the VMX handler only fakes an interrupt stack and jumps to the IDT to let the real handler handle it. In the future, we will recognize the interrupt and deliver through the IDT only those interrupts that do not belong to the current vcpu. Interrupts that belong to the current vcpu will be handled inside the VMX handler. This will reduce the interrupt handling cost of KVM.

Also, the interrupt enable logic changes when this feature is turned on: before this patch, the hypervisor called local_irq_enable() to enable interrupts directly. Now the IF bit is set on the interrupt stack frame, and interrupts will be enabled on return from the interrupt handler if an external interrupt exists. If there is no external interrupt, local_irq_enable() is still called.

Refer to Intel SDM volume 3, chapter 33.2.

Looks good to me except one comment below. Send that patch as part of the posted interrupt series; there is no point in applying it separately.

Sure. I will send out the PI patch after it passes all testing.
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/svm.c              |    6 +++
 arch/x86/kvm/vmx.c              |   70 --
 arch/x86/kvm/x86.c              |    4 ++-
 4 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 77d56a4..1f1b2f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -725,6 +725,7 @@ struct kvm_x86_ops {
        int (*check_intercept)(struct kvm_vcpu *vcpu,
                               struct x86_instruction_info *info,
                               enum x86_intercept_stage stage);
+       void (*handle_external_intr)(struct kvm_vcpu *vcpu);
 };

 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d29d3cd..c283185 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4227,6 +4227,11 @@ out:
        return ret;
 }

+static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
+{
+       local_irq_enable();
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
        .cpu_has_kvm_support = has_svm,
        .disabled_by_bios = is_disabled,
@@ -4318,6 +4323,7 @@ static struct kvm_x86_ops svm_x86_ops = {
        .set_tdp_cr3 = set_tdp_cr3,

        .check_intercept = svm_check_intercept,
+       .handle_external_intr = svm_handle_external_intr,
 };

 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 02eeba8..eaef185 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -379,6 +379,7 @@ struct vcpu_vmx {
        struct shared_msr_entry *guest_msrs;
        int nmsrs;
        int save_nmsrs;
+       unsigned long host_idt_base;
 #ifdef CONFIG_X86_64
        u64 msr_host_kernel_gs_base;
        u64 msr_guest_kernel_gs_base;
@@ -2565,7 +2566,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 #ifdef CONFIG_X86_64
        min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
-       opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
+       opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
+               VM_EXIT_ACK_INTR_ON_EXIT;
        if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
                                &_vmexit_control) < 0)
                return -EIO;
@@ -3742,11 +3744,12 @@ static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
  * Note that host-state that does change is set elsewhere. E.g., host-state
  * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
  */
-static void vmx_set_constant_host_state(void)
+static void vmx_set_constant_host_state(struct kvm_vcpu *vcpu)

Pass vmx to the function. No need to convert vmx op vcpu and back.

 {
        u32 low32, high32;
        unsigned long tmpl;
        struct desc_ptr dt;
+       struct vcpu_vmx *vmx = to_vmx(vcpu);

        vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS);  /* 22.2.3 */
        vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
@@ -3770,6 +3773,7 @@ static void vmx_set_constant_host_state(void)

        native_store_idt(&dt);
        vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
+       vmx->host_idt_base = dt.address;

        vmcs_writel(HOST_RIP, vmx_return); /* 22.2.5 */

@@ -3884,7 +3888,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
        vmcs_write16(HOST_FS_SELECTOR,
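The setup_vmcs_config() hunk in this patch adds VM_EXIT_ACK_INTR_ON_EXIT to the optional VM-exit controls, so the feature is used only where the CPU offers it. Below is a minimal user-space sketch of how a min/opt negotiation of that kind can work. It assumes the capability MSR reports must-be-one bits in its low 32 bits and may-be-one bits in its high 32 bits. The helper name adjust_controls and the EXIT_* bit values are illustrative stand-ins, not taken from the patch, and should be checked against the SDM or the kernel headers.

```c
#include <stdint.h>

/* Illustrative bit values (assumed; verify against the SDM or kernel headers). */
#define EXIT_HOST_ADDR_SPACE_SIZE 0x00000200u
#define EXIT_ACK_INTR_ON_EXIT     0x00008000u
#define EXIT_SAVE_IA32_PAT        0x00040000u
#define EXIT_LOAD_IA32_PAT        0x00080000u

/*
 * Hypothetical helper modelled on the min/opt pattern seen in the patch.
 * msr_low:  controls the CPU forces to 1.
 * msr_high: controls the CPU allows to be 1.
 * Returns 0 and stores the usable controls in *result, or -1 if a
 * required (min) control is not supported.
 */
static int adjust_controls(uint32_t min, uint32_t opt,
                           uint32_t msr_low, uint32_t msr_high,
                           uint32_t *result)
{
    uint32_t ctl = min | opt;

    ctl &= msr_high;   /* drop controls the CPU cannot set  */
    ctl |= msr_low;    /* force controls the CPU requires   */

    if (min & ~ctl)    /* a mandatory control was dropped   */
        return -1;

    *result = ctl;
    return 0;
}
```

With this shape, an optional control such as acknowledge-interrupt-on-exit silently disappears from the result on CPUs that lack it, while a missing mandatory control makes setup fail.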
Re: vCPU hotplug roadmap
On 30.01.2013 13:49, Eduardo Habkost wrote:
On Wed, Jan 30, 2013 at 11:58:56AM +0100, Andreas Färber wrote:

* CPU class_init for -cpu host requires KVM init (imammedo) [suggestion by ehabkost to use kvm_arch_vcpu_init, WIP by afaerber]

I don't know what you mean by "use kvm_arch_vcpu_init()".

Sorry, scratch the _vcpu. I.e., the x86-specific KVM init hook.

I sent an RFC following somebody's suggestion of simply making kvm_arch_init() call a function to finish the -cpu host initialization, as we can't initialize everything inside class_init. See x86_cpu_finish_host_class_init() at:
Message-Id: 1357329382-20944-7-git-send-email-ehabk...@redhat.com
http://article.gmane.org/gmane.comp.emulators.qemu/186778

...and I have been working on making it even simpler for the still-x86_def_t-based approach. I'm still busy looking at 1.4 issues currently though.

Andreas

--
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] vCPU hotplug roadmap
On Wed, 30 Jan 2013 14:02:16 +0100 Andreas Färber afaer...@suse.de wrote:
On 30.01.2013 13:49, Eduardo Habkost wrote:
On Wed, Jan 30, 2013 at 11:58:56AM +0100, Andreas Färber wrote:

[...]
http://article.gmane.org/gmane.comp.emulators.qemu/186778

...and I have been working on making it even simpler for the still-x86_def_t-based approach. I'm still busy looking at 1.4 issues currently though.

Andreas

I'll try to cook up a series that does properties and classes in one seamless approach, without intermediate steps. Perhaps that would work out better.
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Michael S. Tsirkin m...@redhat.com writes:
On Wed, Jan 30, 2013 at 11:48:14AM +, Peter Maydell wrote:
On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:

Proposal by hpoussin was to move _list_add() code to ISADevice:
http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Concerns:
* PCI devices (VGA, QXL) register I/O ports as well
  => above patches add dependency on ISABus to machines - benh no mac ever had one
  => PCIDevice shouldn't use ISA API with NULL ISADevice
* Lack of avi: Who decides about memory API these days?

armbru and agraf concluded that moving this into ISA is wrong.
=> I will drop the remaining ioport patches from above series.

Suggestions on how to proceed with tackling the issue are welcome.

How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into...

All programming is done by the OS, devices do not register with controller. Each bridge has two ways to claim an IO transaction:
- transaction is within the window programmed in the bridge
- subtractive decoding enabled and no one else claims the transaction

And there can only be one endpoint that accepts subtractive decoding and this is usually the ISA bridge.

Also note that there are some really special cases with PCI. The legacy VGA ports are always routed to the first device with a DISPLAY class type. Likewise, legacy IDE ports are routed to the first device with an IDE class. That's the only reason you can have these legacy devices not behind the ISA bridge.

Regards,
Anthony Liguori

At the bus level, transaction happens on a bus and an appropriate device will claim it.

My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, and the controller for the bus (ISA or PCI) exposes those to the next layer up, and something at board level maps it all into the right places.

-- PMM
[PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
VCPU's MMUCFG register initialization should not depend on the KVM_CAP_SW_TLB ioctl call. Move it earlier, into the TLB initialization phase.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500_mmu.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 5c44759..bb1b2b0 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -692,8 +692,6 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
        vcpu_e500->gtlb_offset[0] = 0;
        vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];

-       vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
-
        vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
        if (params.tlb_sizes[0] <= 2048)
                vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
@@ -781,6 +779,8 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
        if (!vcpu_e500->g2h_tlb1_map)
                goto err;

+       vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
+
        /* Init TLB configuration register */
        vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
                             ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
--
1.7.4.1
[PATCH 2/5] KVM: PPC: e500: Emulate TLBnPS registers
Emulate TLBnPS registers which are available in MMU Architecture Version (MAV) 2.0.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/include/asm/kvm_host.h |    1 +
 arch/powerpc/kvm/e500.h             |    5 +
 arch/powerpc/kvm/e500_emulate.c     |   10 ++
 arch/powerpc/kvm/e500_mmu.c         |    5 +
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8a72d59..88fcfe6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -501,6 +501,7 @@ struct kvm_vcpu_arch {
        spinlock_t wdt_lock;
        struct timer_list wdt_timer;
        u32 tlbcfg[4];
+       u32 tlbps[4];
        u32 mmucfg;
        u32 epr;
        struct kvmppc_booke_debug_reg dbg_reg;
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 41cefd4..b9f76d8 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -303,4 +303,9 @@ static inline unsigned int get_tlbmiss_tid(struct kvm_vcpu *vcpu)
 #define get_tlb_sts(gtlbe)              (MAS1_TS)
 #endif /* !BOOKE_HV */

+static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
+{
+       return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
+}
+
 #endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index e78f353..5515dc5 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -329,6 +329,16 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
                *spr_val = vcpu->arch.ivor[BOOKE_IRQPRIO_DBELL_CRIT];
                break;
 #endif
+       case SPRN_TLB0PS:
+               if (!has_mmu_v2(vcpu))
+                       return EMULATE_FAIL;
+               *spr_val = vcpu->arch.tlbps[0];
+               break;
+       case SPRN_TLB1PS:
+               if (!has_mmu_v2(vcpu))
+                       return EMULATE_FAIL;
+               *spr_val = vcpu->arch.tlbps[1];
+               break;
        default:
                emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
        }
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index bb1b2b0..129299a 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -794,6 +794,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
        vcpu->arch.tlbcfg[1] |=
                vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;

+       if (has_mmu_v2(vcpu)) {
+               vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
+               vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
+       }
+
        kvmppc_recalc_tlb1map_range(vcpu_e500);
        return 0;

--
1.7.4.1
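The mfspr emulation in this patch gates TLBnPS reads on the MMU architecture version recorded in the guest's MMUCFG. A minimal user-space sketch of that gating follows. The struct vcpu_model type, the read_tlbps helper and the MAVN encodings are assumptions made for illustration, not the kernel's definitions; the bounds check on the array index is also an addition of this sketch.

```c
#include <stdint.h>

/* Assumed encodings of the MMUCFG MAVN field; treat as illustrative. */
#define MMUCFG_MAVN     0x00000003u
#define MMUCFG_MAVN_V2  0x00000001u

enum emul_result { EMULATE_DONE, EMULATE_FAIL };

/* Simplified stand-in for the vcpu state touched by the patch. */
struct vcpu_model {
    uint32_t mmucfg;
    uint32_t tlbps[4];
};

/* True when the modelled MMU reports architecture version 2.0. */
static int has_mmu_v2(const struct vcpu_model *v)
{
    return (v->mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2;
}

/*
 * Read TLBnPS for TLB array n (0 or 1). The read fails, as in the
 * patch, when the MMU is older than MAV 2.0.
 */
static enum emul_result read_tlbps(const struct vcpu_model *v,
                                   unsigned int n, uint32_t *val)
{
    if (n > 1 || !has_mmu_v2(v))
        return EMULATE_FAIL;

    *val = v->tlbps[n];
    return EMULATE_DONE;
}
```

Returning a failure for pre-MAV-2.0 MMUs keeps the register invisible to guests whose architecture version does not define it.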
[PATCH 0/5] KVM: PPC: e500: Enable FSL e6500 core
Enable Freescale e6500 core adding missing MAV 2.0 support. LRAT and Page Table are not addressed by this series.

Mihai Caraman (5):
  KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
  KVM: PPC: e500: Emulate TLBnPS registers
  KVM: PPC: e500: Remove E.PT category from VCPUs
  KVM: PPC: e500: Emulate EPTCFG register
  KVM: PPC: e500mc: Enable e6500 cores

 arch/powerpc/include/asm/kvm_host.h |    2 ++
 arch/powerpc/kvm/e500.h             |   11 +++
 arch/powerpc/kvm/e500_emulate.c     |   19 +++
 arch/powerpc/kvm/e500_mmu.c         |   24 ++--
 arch/powerpc/kvm/e500mc.c           |    2 ++
 5 files changed, 52 insertions(+), 6 deletions(-)

--
1.7.4.1
[PATCH 3/5] KVM: PPC: e500: Remove E.PT category from VCPUs
Embedded.Page Table (E.PT) category in VMs requires indirect tlb entries emulation which is not supported yet. Configure TLBnCFG to remove E.PT category from VCPUs.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500_mmu.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 129299a..9a1f7b7 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -692,12 +692,14 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
        vcpu_e500->gtlb_offset[0] = 0;
        vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];

-       vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+       vcpu->arch.tlbcfg[0] &=
+               ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
        if (params.tlb_sizes[0] <= 2048)
                vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
        vcpu->arch.tlbcfg[0] |= params.tlb_ways[0] << TLBnCFG_ASSOC_SHIFT;

-       vcpu->arch.tlbcfg[1] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+       vcpu->arch.tlbcfg[1] &=
+               ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
        vcpu->arch.tlbcfg[1] |= params.tlb_sizes[1];
        vcpu->arch.tlbcfg[1] |= params.tlb_ways[1] << TLBnCFG_ASSOC_SHIFT;

@@ -783,13 +785,13 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)

        /* Init TLB configuration register */
        vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
-                            ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+                            ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
        vcpu->arch.tlbcfg[0] |= vcpu_e500->gtlb_params[0].entries;
        vcpu->arch.tlbcfg[0] |=
                vcpu_e500->gtlb_params[0].ways << TLBnCFG_ASSOC_SHIFT;

        vcpu->arch.tlbcfg[1] = mfspr(SPRN_TLB1CFG) &
-                            ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+                            ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
        vcpu->arch.tlbcfg[1] |= vcpu_e500->gtlb_params[1].entries;
        vcpu->arch.tlbcfg[1] |=
                vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
--
1.7.4.1
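The effect of this patch can be seen by modelling how a guest-visible TLBnCFG value is derived from the host's: geometry fields are cleared and refilled with the guest's own values, and the indirect-entry (IND) capability bit is cleared so the guest does not see E.PT. The sketch below is a user-space illustration only. The helper name guest_tlbcfg and the field masks are assumptions, and unlike the patch it also masks the inputs to their field widths.

```c
#include <stdint.h>

/* Assumed TLBnCFG field layout; illustrative, not taken from the patch. */
#define TLBnCFG_N_ENTRY     0x00000fffu
#define TLBnCFG_IND         0x00020000u
#define TLBnCFG_ASSOC       0xff000000u
#define TLBnCFG_ASSOC_SHIFT 24

/*
 * Hypothetical helper: build the guest view of TLBnCFG from the host
 * value, the guest TLB entry count and the guest associativity.
 */
static uint32_t guest_tlbcfg(uint32_t host_cfg, uint32_t entries, uint32_t ways)
{
    /* Clear geometry fields and hide indirect-entry support. */
    uint32_t cfg = host_cfg & ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);

    cfg |= entries & TLBnCFG_N_ENTRY;
    cfg |= (ways << TLBnCFG_ASSOC_SHIFT) & TLBnCFG_ASSOC;
    return cfg;
}
```

Any other capability bits in the host value pass through unchanged, which matches the intent of removing only the E.PT category.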
[PATCH 4/5] KVM: PPC: e500: Emulate EPTCFG register
EPTCFG register defined by E.PT is accessed unconditionally by Linux guests in the presence of MAV 2.0. Emulate EPTCFG register now.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/include/asm/kvm_host.h |    1 +
 arch/powerpc/kvm/e500.h             |    6 ++
 arch/powerpc/kvm/e500_emulate.c     |    9 +
 arch/powerpc/kvm/e500_mmu.c         |    5 +
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 88fcfe6..f480b20 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -503,6 +503,7 @@ struct kvm_vcpu_arch {
        u32 tlbcfg[4];
        u32 tlbps[4];
        u32 mmucfg;
+       u32 eptcfg;
        u32 epr;
        struct kvmppc_booke_debug_reg dbg_reg;
 #endif
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index b9f76d8..983eb95 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -308,4 +308,10 @@ static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
        return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
 }

+static inline unsigned int supports_page_tables(const struct kvm_vcpu *vcpu)
+{
+       return ((vcpu->arch.tlbcfg[0] & TLBnCFG_IND)
+               || (vcpu->arch.tlbcfg[1] & TLBnCFG_IND));
+}
+
 #endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 5515dc5..493e231 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -339,6 +339,15 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
                        return EMULATE_FAIL;
                *spr_val = vcpu->arch.tlbps[1];
                break;
+       case SPRN_EPTCFG:
+               if (!has_mmu_v2(vcpu))
+                       return EMULATE_FAIL;
+               /*
+                * Legacy Linux guests access EPTCFG register even if the E.PT
+                * category is disabled in the VM. Give them a chance to live.
+                */
+               *spr_val = vcpu->arch.eptcfg;
+               break;
        default:
                emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
        }
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 9a1f7b7..199c11e 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -799,6 +799,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
        if (has_mmu_v2(vcpu)) {
                vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
                vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
+
+               if (supports_page_tables(vcpu))
+                       vcpu->arch.eptcfg = mfspr(SPRN_EPTCFG);
+               else
+                       vcpu->arch.eptcfg = 0;
        }

        kvmppc_recalc_tlb1map_range(vcpu_e500);
--
1.7.4.1
[PATCH 5/5] KVM: PPC: e500mc: Enable e6500 cores
Extend processor compatibility names to e6500 cores.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500mc.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index 1f89d26..6c87299 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -172,6 +172,8 @@ int kvmppc_core_check_processor_compat(void)
                r = 0;
        else if (strcmp(cur_cpu_spec->cpu_name, "e5500") == 0)
                r = 0;
+       else if (strcmp(cur_cpu_spec->cpu_name, "e6500") == 0)
+               r = 0;
        else
                r = -ENOTSUPP;

--
1.7.4.1
Re: [Qemu-devel] What to do about non-qdevified devices?
On 30.01.2013 13:35, Markus Armbruster wrote:
Peter Maydell peter.mayd...@linaro.org writes:
On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
Anthony Liguori aligu...@us.ibm.com writes:

[...]

The problems I ran into were (1) this is a lot of work (2) it basically requires that all bus children have been qdev/QOM-ified. Even with something like the ISA bus which is where I started, quite a few devices were not qdevified still.

So what's the plan to complete the qdevification job? Lay really low and quietly hope the problem goes away? We've tried that for about three years, doesn't seem to work.

Do we have a list of not-yet-qdevified devices? Maybe we need to start saying "fix X Y and Z or platform P is dropped from the next release". (This would of course be easier if we had a way to let users know that platform P was in danger...)

I think that's a good idea. Only problem is identifying pre-qdev devices in the code requires code inspection (grep won't do, I'm afraid).

+1. That would address my request as well. Having a list of low-hanging fruit on the Wiki might also give new contributors some ideas of where and how to start poking at the code.

If we agree on a "qdevify or else" plan, I'd be prepared to help with the digging up of devices.

I disagree on the "or else" part. I have been qdev'ifying and QOM'ifying devices in my maintenance area, and progress is slow. It gets even slower once one leaves clearly maintained areas. I see no good reason to hold a gun to someone's head, like you have done for IDE, unless there is a compelling reason to do so. Currently I don't see any.

Just think of my pending ide/mmio.c patch [1] that no one has reviewed or applied so far. Similarly, Fred's virtio refactoring has pretty long review cycles, with discussions about very basic QOM and OOD idioms. If we want to make progress, we need to encourage contributors to send such patches by making sure they get feedback and find their way into the tree within a reasonable time frame.

It's always easier to rip out and damage other people's work than to get things right yourself. To take that thought to the extreme, I could propose to rip out any qdev device that's not properly QOM'ified and realize'ified yet. That would include i440fx, fdc and many core x86 devices in the repository...

Technical risks have been raised elsewhere: making random code SysBusDevices can lead to PCIDevices instantiating them no longer being hot-pluggable, simply because SysBus is a crappy fallback, overused for lack of a clear alternative. I already started reviewing parent_bus and qdev_get_parent_bus() uses in the tree [2, 3], but constructive help would be more welcome than constant nagging about code that's in bad shape. There's a lot of work to be done!

Andreas

[1] http://patchwork.ozlabs.org/patch/215482/
[2] http://patchwork.ozlabs.org/patch/209499/
[3] http://patchwork.ozlabs.org/patch/213971/
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Andreas Färber afaer...@suse.de writes:
On 29.01.2013 16:41, Juan Quintela wrote:

* Portio port to new memory regions? Andreas, could you fill?

MemoryRegion's .old_portio mechanism requires workarounds for VGA on ppc, affecting among others the sPAPR PCI host bridge:
http://git.qemu.org/?p=qemu.git;a=commit;h=a3cfa18eb075c7ef78358ca1956fe7b01caa1724

Patches were posted and merged removing all .old_portio users but one:
hw/ioport.c:portio_list_add_1(), used by portio_list_add()
hw/isa-bus.c:portio_list_add(piolist, isabus->address_space_io, start);
hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0);
hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);
hw/vga.c:portio_list_add(vbe_port_list, address_space_io, 0x1ce);

Proposal by hpoussin was to move _list_add() code to ISADevice:
http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Okay, a couple of things here:

There is no such thing as PIO as a general concept. What leaves the CPU and what a bus interprets are totally different things.

An x86 CPU has a MMIO capability that's essentially 65 bits. Whether the top bit is set determines whether it's a PIO transaction or an MMIO transaction. A large chunk of that address space is invalid of course.

PCI has a 65 bit address space too. The 65th bit determines whether it's an IO transaction or an MMIO transaction.

For architectures that only have a 64-bit address space, what the PCI controller typically does is pick a 16-bit window within that address space to map to a PCI address with the 65th bit set.

Within the PCI bus, transactions are usually routed to devices via positive decoding. The device lists what address regions it wants to handle (via BARs) and the PCI bus uses those to determine who to send transactions to.

There are some exceptions though. Specifically:

1) A chipset will route any non-positively decoded IO transaction (65th bit set) to a single end point (usually the ISA-bridge). Which one it chooses is up to the chipset. This is called subtractive decoding because the PCI bus will wait multiple cycles for that device to claim the transaction before bouncing it.

2) There are special hacks in most PCI chipsets to route very specific address ranges to certain devices. Namely, legacy VGA IO transactions go to the first VGA device. Legacy IDE IO transactions go to the first IDE device. This doesn't need to be programmed in the BARs. It will just happen.

3) As it turns out, all legacy PIIX3 devices are positively decoded and sent to the ISA-bridge (because it's faster this way).

Notice the lack of the word ISA in all of this other than describing the PCI class of an end point.

So how should this be modeled?

On x86, the CPU has a pio address space. That can propagate down through the PCI bus which is what we do today.

On !x86, the PCI controller ought to set up a MemoryRegion for downstream PIO that devices can use to register on.

We probably need to do something like change the PCI VGA devices to export a MemoryRegion and allow the PCI controller to decide how to register that as a subregion.

Regards,
Anthony Liguori

Concerns:
* PCI devices (VGA, QXL) register I/O ports as well
  => above patches add dependency on ISABus to machines - benh no mac ever had one
  => PCIDevice shouldn't use ISA API with NULL ISADevice
* Lack of avi: Who decides about memory API these days?

armbru and agraf concluded that moving this into ISA is wrong.
=> I will drop the remaining ioport patches from above series.

Suggestions on how to proceed with tackling the issue are welcome.

Regards,
Andreas
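The routing rules described in this message (positive decoding first, then a single subtractive-decode endpoint as fallback) can be sketched as a small lookup. This is a toy model for illustration only: struct io_endpoint and route_io are hypothetical names, real hardware resolves claims with bus signalling rather than a table walk, and the VGA/IDE special cases are represented here simply as fixed positive ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one endpoint on the bus. */
struct io_endpoint {
    const char *name;
    uint16_t base;      /* first I/O port claimed by positive decode */
    uint16_t len;       /* number of ports; 0 means no positive range */
    int subtractive;    /* nonzero: accepts otherwise unclaimed I/O   */
};

/* Return the endpoint that would claim an I/O port, or NULL if none. */
static const struct io_endpoint *
route_io(const struct io_endpoint *devs, size_t n, uint16_t port)
{
    const struct io_endpoint *fallback = NULL;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].len && port >= devs[i].base &&
            (uint32_t)port < (uint32_t)devs[i].base + devs[i].len)
            return &devs[i];            /* positive decode wins */

        if (devs[i].subtractive && !fallback)
            fallback = &devs[i];        /* only one is expected */
    }

    /* Nobody claimed it positively: hand it to the subtractive decoder. */
    return fallback;
}
```

For example, with a VGA endpoint claiming 0x3b0..0x3df and an ISA bridge marked subtractive, port 0x3c0 goes to VGA and port 0x60 falls through to the bridge.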
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, Jan 30, 2013 at 07:24:57AM -0600, Anthony Liguori wrote:
Michael S. Tsirkin m...@redhat.com writes:
On Wed, Jan 30, 2013 at 11:48:14AM +, Peter Maydell wrote:
On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:

Proposal by hpoussin was to move _list_add() code to ISADevice:
http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Concerns:
* PCI devices (VGA, QXL) register I/O ports as well
  => above patches add dependency on ISABus to machines - benh no mac ever had one
  => PCIDevice shouldn't use ISA API with NULL ISADevice
* Lack of avi: Who decides about memory API these days?

armbru and agraf concluded that moving this into ISA is wrong.
=> I will drop the remaining ioport patches from above series.

Suggestions on how to proceed with tackling the issue are welcome.

How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into...

All programming is done by the OS, devices do not register with controller. Each bridge has two ways to claim an IO transaction:
- transaction is within the window programmed in the bridge
- subtractive decoding enabled and no one else claims the transaction

And there can only be one endpoint that accepts subtractive decoding and this is usually the ISA bridge.

Also note that there are some really special cases with PCI. The legacy VGA ports are always routed to the first device with a DISPLAY class type. Likewise, legacy IDE ports are routed to the first device with an IDE class. That's the only reason you can have these legacy devices not behind the ISA bridge.

Regards,
Anthony Liguori

Yes. And to further clarify that: 'routed' in the sense that the spec specifies the addresses for each class; it's a hard-coded set of addresses. The hardware never looks at the class; each device simply knows which addresses to claim and whether it's enabled.

What happens if you have more than one VGA adapter on a bus? As long as only one is enabled, you are fine. If more than one is enabled, bad things will happen, including possibly overheating.

Also, it's not just the class that specifies the addresses, it's the programming interface too. For example, for display, hardcoded addresses are used for legacy subclass 0x0 and for programming interfaces 0x0 (VGA compatible adapter) and 0x1 (8514 compatible adapter). But again, it specifies this to the OS.

At the bus level, transaction happens on a bus and an appropriate device will claim it.

My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, and the controller for the bus (ISA or PCI) exposes those to the next layer up, and something at board level maps it all into the right places.

-- PMM
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Peter Maydell peter.mayd...@linaro.org writes:
On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:

Proposal by hpoussin was to move _list_add() code to ISADevice:
http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Concerns:
* PCI devices (VGA, QXL) register I/O ports as well
  => above patches add dependency on ISABus to machines - benh no mac ever had one
  => PCIDevice shouldn't use ISA API with NULL ISADevice
* Lack of avi: Who decides about memory API these days?

armbru and agraf concluded that moving this into ISA is wrong.
=> I will drop the remaining ioport patches from above series.

Suggestions on how to proceed with tackling the issue are welcome.

How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into...

My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, and the controller for the bus (ISA or PCI) exposes those to the next layer up, and something at board level maps it all into the right places.

Makes sense to me, but I'm naive, too :)

For me, I/O ports are just an alternate address space some devices have. For instance, x86 CPUs have an extra pin for selecting I/O vs. memory address space. The ISA bus has separate read/write pins for memory and I/O. This isn't terribly special. Mapping address spaces around is what devices bridging buses do.

I'd expect a system bus for an x86 CPU to have both a memory and an I/O address space. I'd expect an ISA PC's sysbus - ISA bridge to map both directly. I'd expect an ISA bridge for a sysbus without a separate I/O address space to map the ISA I/O address space into the sysbus's normal address space somehow.

PCI ISA bridges have their own rules, but I've gotten away with ignoring the details so far :)
RE: [PATCH 5/8] KVM: PPC: debug stub interface parameter defined
-----Original Message-----
From: Alexander Graf [mailto:ag...@suse.de]
Sent: Friday, January 25, 2013 5:24 PM
To: Bhushan Bharat-R65777
Cc: Paul Mackerras; kvm-...@vger.kernel.org; kvm@vger.kernel.org
Subject: Re: [PATCH 5/8] KVM: PPC: debug stub interface parameter defined

On 17.01.2013, at 12:11, Bhushan Bharat-R65777 wrote:

-----Original Message-----
From: Paul Mackerras [mailto:pau...@samba.org]
Sent: Thursday, January 17, 2013 12:53 PM
To: Bhushan Bharat-R65777
Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan Bharat-R65777
Subject: Re: [PATCH 5/8] KVM: PPC: debug stub interface parameter defined

On Wed, Jan 16, 2013 at 01:54:42PM +0530, Bharat Bhushan wrote:

This patch defines the interface parameter for KVM_SET_GUEST_DEBUG ioctl support. Follow up patches will use this for setting up hardware breakpoints, watchpoints and software breakpoints.

[snip]

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 453a10f..7d5a51c 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1483,6 +1483,12 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg)
 	return r;
 }
 
+int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
+					 struct kvm_guest_debug *dbg)
+{
+	return -EINVAL;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	return -ENOTSUPP;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 934413c..4c94ca9 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -532,12 +532,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 #endif
 }
 
-int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
-					struct kvm_guest_debug *dbg)
-{
-	return -EINVAL;
-}
-

This will break the build for non-book E machines, since kvm_arch_vcpu_ioctl_set_guest_debug() is referenced from generic code. You need to add it to arch/powerpc/kvm/book3s.c as well.

right, I will correct this.

Would the implementation actually be different on booke vs book3s? My feeling is that powerpc.c is actually the right place for this.

I am not sure there will be anything common between book3s and booke. Should we define the cpu specific function something like kvm_ppc_vcpu_ioctl_set_guest_debug() for booke and book3s and call this new defined function from kvm_arch_vcpu_ioctl_set_guest_debug() in powerpc.c ?

Thanks
-Bharat
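The split Bharat asks about could look roughly like the sketch below: the generic entry point stays in powerpc.c, so every PowerPC build links, and it forwards to a per-subarchitecture hook. This is an assumption-laden illustration, not the code that was eventually merged. The hook name is taken from the message above, the struct definitions are minimal stand-ins so that the sketch compiles outside the kernel, and the stub body represents a subarchitecture without debug support.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for the kernel types; only here so the sketch compiles alone. */
struct kvm_vcpu { int id; };
struct kvm_guest_debug { unsigned int control; };

/*
 * Per-subarchitecture hook, named as suggested in the thread.
 * booke.c and book3s.c would each provide their own definition;
 * this stub models a subarch that does not support guest debug yet.
 */
int kvm_ppc_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
                                       struct kvm_guest_debug *dbg)
{
    (void)vcpu;
    (void)dbg;
    return -EINVAL;
}

/* Generic entry point referenced from common KVM code (powerpc.c). */
int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
                                        struct kvm_guest_debug *dbg)
{
    return kvm_ppc_vcpu_ioctl_set_guest_debug(vcpu, dbg);
}

/* Usage example: the generic call reports the subarch's answer. */
void set_guest_debug_self_check(void)
{
    struct kvm_vcpu vcpu = { 0 };
    struct kvm_guest_debug dbg = { 0 };

    assert(kvm_arch_vcpu_ioctl_set_guest_debug(&vcpu, &dbg) == -EINVAL);
}
```

With this shape the build problem Paul points out cannot recur, because the symbol that generic code references is defined exactly once for all PowerPC flavours.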
Re: [Qemu-devel] What to do about non-qdevified devices?
Markus Armbruster arm...@redhat.com writes:
Peter Maydell peter.mayd...@linaro.org writes:
On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
Anthony Liguori aligu...@us.ibm.com writes:

[...]

The problems I ran into were (1) this is a lot of work (2) it basically requires that all bus children have been qdev/QOM-ified. Even with something like the ISA bus which is where I started, quite a few devices were not qdevified still.

So what's the plan to complete the qdevification job? Lay really low and quietly hope the problem goes away? We've tried that for about three years, doesn't seem to work.

Do we have a list of not-yet-qdevified devices? Maybe we need to start saying "fix X Y and Z or platform P is dropped from the next release". (This would of course be easier if we had a way to let users know that platform P was in danger...)

I think that's a good idea. Only problem is identifying pre-qdev devices in the code requires code inspection (grep won't do, I'm afraid). If we agree on a "qdevify or else" plan, I'd be prepared to help with the digging up of devices.

That's a nice thought, but we're not going to rip out dma.c and break every PC target. But I will help put together a list of devices that need converting. I have patches actually for most of the PC devices.

Regards, Anthony Liguori
Re: [Qemu-devel] QEMU buildbot maintenance state
Gerd Hoffmann kra...@redhat.com writes:

Hi,

Gerd: Are you willing to co-maintain the QEMU buildmaster with Daniel and Christian? It would be awesome if you could do this given your experience running and customizing buildbot.

I'll try to set aside some time for that. Christian's idea to host the config at github is good; that certainly makes it easier to spread things across more people.

Another thing which would be helpful: Any chance we can set up a maintainer tree mirror @ git.qemu.org? A single repository where each maintainer tree shows up as a branch?

I will set up a tree based on the 'T:' fields in MAINTAINERS. So if you want your tree to be part of buildbot, please make sure that you have a correct entry in MAINTAINERS.

Regards, Anthony Liguori

This would make the buildbot setup *a lot* easier. We can go for an AnyBranchScheduler then with BuildFactory and BuildConfig shared, instead of needing one BuildFactory and BuildConfig per branch. It also makes the buildbot web interface less cluttered, as we don't have an insane amount of BuildConfigs any more. And it saves some resources (bandwidth + diskspace) for the buildslaves.

I think it would also be a nice service for people who want to see what is coming, or who want to test stuff that is cooking, to have a one-stop shop where they can get everything.

cheers, Gerd
[PATCH 1/6] KVM: MMU: make spte_is_locklessly_modifiable() more clear
From: Gleb Natapov g...@redhat.com

spte_is_locklessly_modifiable() checks that both SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are present on spte. Make it more explicit.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9f628f7..2fa82b0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -448,7 +448,8 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
 
 static bool spte_is_locklessly_modifiable(u64 spte)
 {
-	return !(~spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE));
+	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
+		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
 }
 
 static bool spte_has_volatile_bits(u64 spte)
-- 
1.7.10.4
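That the old and new expressions in this patch are equivalent can be checked with a small standalone program. The sketch below is illustrative only: the bit positions chosen for `HOST_WRITEABLE` and `MMU_WRITEABLE` are assumptions made for the example (the real SPTE_* values are defined in mmu.c), and the function names are invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; not the kernel's actual definitions. */
#define HOST_WRITEABLE (1ULL << 10)
#define MMU_WRITEABLE  (1ULL << 11)
#define BOTH_WRITEABLE (HOST_WRITEABLE | MMU_WRITEABLE)

/* Old form: no bit of the mask is missing from spte. */
static bool modifiable_old(uint64_t spte)
{
    return !(~spte & BOTH_WRITEABLE);
}

/* New form: masking spte yields exactly the mask. */
static bool modifiable_new(uint64_t spte)
{
    return (spte & BOTH_WRITEABLE) == BOTH_WRITEABLE;
}

/* Compare both forms over every combination of the two bits. */
void check_forms_agree(void)
{
    const uint64_t cases[] = { 0, HOST_WRITEABLE, MMU_WRITEABLE,
                               BOTH_WRITEABLE };

    for (unsigned i = 0; i < 4; i++) {
        /* Unrelated low bits are set to show they do not matter. */
        uint64_t spte = cases[i] | 0x7;

        assert(modifiable_old(spte) == modifiable_new(spte));
    }
}
```

Both forms are true only when the two bits are set together; the new one states that directly instead of through a double negation.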
[PATCH 2/6] KVM: MMU: drop unneeded checks.
From: Gleb Natapov g...@redhat.com

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2fa82b0..40737b3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2328,9 +2328,8 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 		if (s->role.level != PT_PAGE_TABLE_LEVEL)
 			return 1;
 
-		if (!need_unsync && !s->unsync) {
+		if (!s->unsync)
 			need_unsync = true;
-		}
 	}
 	if (need_unsync)
 		kvm_unsync_pages(vcpu, gfn);
@@ -4008,7 +4007,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 			      !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
 			      & mask.word) && rmap_can_add(vcpu))
 				mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
-			if (!remote_flush && need_remote_flush(entry, *spte))
+			if (need_remote_flush(entry, *spte))
 				remote_flush = true;
 			++spte;
 		}
-- 
1.7.10.4
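The checks dropped by this patch follow one pattern: a flag that is only ever set to true inside a loop does not need to be tested before it is set. A minimal sketch of that pattern follows, with hypothetical function names and a plain array standing in for the kernel's conditions. One assumption is worth stating: the guard also skipped evaluating the condition once the flag was set, so the two shapes give the same result only when the condition has no side effects.

```c
#include <assert.h>
#include <stdbool.h>

/* Old shape: test the flag before testing the condition. */
static bool any_set_guarded(const bool *cond, int n)
{
    bool flag = false;

    for (int i = 0; i < n; i++)
        if (!flag && cond[i])
            flag = true;
    return flag;
}

/* New shape: the guard is dropped; setting true twice is harmless. */
static bool any_set_plain(const bool *cond, int n)
{
    bool flag = false;

    for (int i = 0; i < n; i++)
        if (cond[i])
            flag = true;
    return flag;
}

/* Compare both shapes over every 3-element combination of conditions. */
void check_guard_is_redundant(void)
{
    for (int bits = 0; bits < 8; bits++) {
        bool cond[3] = { (bits & 1) != 0, (bits & 2) != 0, (bits & 4) != 0 };

        assert(any_set_guarded(cond, 3) == any_set_plain(cond, 3));
    }
}
```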
[PATCH 0/6] small cleanups in MMU code
From: Gleb Natapov g...@redhat.com

None of these should change functionality.

Gleb Natapov (6):
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert KVM: MMU: split kvm_mmu_free_page

 arch/x86/kvm/mmu.c         |   32 +---
 arch/x86/kvm/paging_tmpl.h |    3 ---
 arch/x86/kvm/x86.c         |    2 --
 3 files changed, 9 insertions(+), 28 deletions(-)

-- 
1.7.10.4
[PATCH 6/6] Revert KVM: MMU: split kvm_mmu_free_page
From: Gleb Natapov g...@redhat.com

This reverts commit bd4c86eaa6ff10abc4e00d0f45d2a28b10b09df4. There is no user for kvm_mmu_isolate_page() any more.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |   21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 42ba85c..0242a8a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1461,28 +1461,14 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-/*
- * Remove the sp from shadow page cache, after call it,
- * we can not find this sp from the cache, and the shadow
- * page table is still valid.
- * It should be under the protection of mmu lock.
- */
-static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
-	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
-}
-
-/*
- * Free the shadow page table and the sp, we can do it
- * out of the protection of mmu lock.
- */
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
-{
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
+	if (!sp->role.direct)
+		free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2126,7 +2112,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	do {
 		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_isolate_page(sp);
 		kvm_mmu_free_page(sp);
 	} while (!list_empty(invalid_list));
 }
-- 
1.7.10.4
[PATCH 5/6] KVM: MMU: drop superfluous is_present_gpte() check.
From: Gleb Natapov g...@redhat.com

The guest page walker puts only present ptes into the ptes[] array. No need to check it again.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/paging_tmpl.h |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ca69dcc..34c5c99 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -409,9 +409,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	unsigned direct_access, access = gw->pt_access;
 	int top_level, emulate = 0;
 
-	if (!is_present_gpte(gw->ptes[gw->level - 1]))
-		return 0;
-
 	direct_access = gw->pte_access;
 
 	top_level = vcpu->arch.mmu.root_level;
-- 
1.7.10.4
[PATCH 4/6] KVM: MMU: drop superfluous min() call.
From: Gleb Natapov g...@redhat.com

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8028ac6..42ba85c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3854,7 +3854,7 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
 		/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
 		*gpa &= ~(gpa_t)7;
 		*bytes = 8;
-		r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, min(*bytes, 8));
+		r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, 8);
 		if (r)
 			gentry = 0;
 		new = (const u8 *)&gentry;
-- 
1.7.10.4
[PATCH 3/6] KVM: MMU: set base_role.nxe during mmu initialization.
From: Gleb Natapov g...@redhat.com

Move base_role.nxe initialisation to where all other roles are initialized.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |    1 +
 arch/x86/kvm/x86.c |    2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40737b3..8028ac6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3687,6 +3687,7 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
 	else
 		r = paging32_init_context(vcpu, context);
 
+	vcpu->arch.mmu.base_role.nxe = is_nx(vcpu);
 	vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
 	vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);
 	vcpu->arch.mmu.base_role.smep_andnot_wp
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf512e70..373e17a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -870,8 +870,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 
 	kvm_x86_ops->set_efer(vcpu, efer);
 
-	vcpu->arch.mmu.base_role.nxe = (efer & EFER_NX) && !tdp_enabled;
-
 	/* Update reserved bits */
 	if ((efer ^ old_efer) & EFER_NX)
 		kvm_mmu_reset_context(vcpu);
-- 
1.7.10.4
Re: [PATCH 0/6] small cleanups in MMU code
Something went wrong with git send-email. Ignore this one please.

On Wed, Jan 30, 2013 at 04:42:27PM +0200, y...@redhat.com wrote:

From: Gleb Natapov g...@redhat.com

Any of those should not change functionality.

Gleb Natapov (6):
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert KVM: MMU: split kvm_mmu_free_page

 arch/x86/kvm/mmu.c         |   32 +---
 arch/x86/kvm/paging_tmpl.h |    3 ---
 arch/x86/kvm/x86.c         |    2 --
 3 files changed, 9 insertions(+), 28 deletions(-)

-- 
1.7.10.4

-- 
Gleb.
Re: [PATCH 0/6] small cleanups in MMU code
On Wed, Jan 30, 2013 at 10:42 PM, y...@redhat.com wrote:

y...@redhat.com ?

From: Gleb Natapov g...@redhat.com

Any of those should not change functionality.

Gleb Natapov (6):
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert KVM: MMU: split kvm_mmu_free_page

 arch/x86/kvm/mmu.c         |   32 +---
 arch/x86/kvm/paging_tmpl.h |    3 ---
 arch/x86/kvm/x86.c         |    2 --
 3 files changed, 9 insertions(+), 28 deletions(-)

-- 
1.7.10.4

-- 
Asias He
Re: [PATCH 0/6] small cleanups in MMU code
On Wed, Jan 30, 2013 at 10:46:56PM +0800, Asias He wrote:
On Wed, Jan 30, 2013 at 10:42 PM, y...@redhat.com wrote:

y...@redhat.com ?

y not?

From: Gleb Natapov g...@redhat.com

Any of those should not change functionality.

Gleb Natapov (6):
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert KVM: MMU: split kvm_mmu_free_page

 arch/x86/kvm/mmu.c         |   32 +---
 arch/x86/kvm/paging_tmpl.h |    3 ---
 arch/x86/kvm/x86.c         |    2 --
 3 files changed, 9 insertions(+), 28 deletions(-)

-- 
1.7.10.4

-- 
Asias He

-- 
Gleb.
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Markus Armbruster arm...@redhat.com writes:
Peter Maydell peter.mayd...@linaro.org writes:
On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:

Proposal by hpoussin was to move _list_add() code to ISADevice:
http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Concerns:
* PCI devices (VGA, QXL) register I/O ports as well
  => above patches add dependency on ISABus to machines - benh no mac ever had one
  => PCIDevice shouldn't use ISA API with NULL ISADevice
* Lack of avi: Who decides about memory API these days?

armbru and agraf concluded that moving this into ISA is wrong.
=> I will drop the remaining ioport patches from above series.

Suggestions on how to proceed with tackling the issue are welcome.

How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into...

My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, and the controller for the bus (ISA or PCI) exposes those to the next layer up, and something at board level maps it all into the right places.

Makes sense to me, but I'm naive, too :)

For me, I/O ports are just an alternate address space some devices have. For instance, x86 CPUs have an extra pin for selecting I/O vs. memory address space. The ISA bus has separate read/write pins for memory and I/O. This isn't terribly special. Mapping address spaces around is what devices bridging buses do.

I'd expect a system bus for an x86 CPU to have both a memory and an I/O address space.

There is no such thing as a system bus. There is a bus that links the CPUs to each other and to the North Bridge. This is QPI on modern systems. Sometimes there's a bus to link the North Bridge to the South Bridge. On modern systems, this is QPI.

On the i440fx, the i440fx is both the South Bridge and North Bridge and the link between the two is internal to the chip. The South Bridge may then export one or more downstream interfaces. In the i440fx, it only exports PCI. Behind the PCI bus, there may be bridges. On the i440fx, there is an ISA Bridge which also acts as a Super I/O chip. It exposes a downstream ISA bus.

sysbus is a relic of poor modeling. A major milestone in QEMU's evolution will be when sysbus is completely removed.

Regards, Anthony Liguori

I'd expect an ISA PC's sysbus - ISA bridge to map both directly. I'd expect an ISA bridge for a sysbus without a separate I/O address space to map the ISA I/O address space into the sysbus's normal address space somehow.

PCI ISA bridges have their own rules, but I've gotten away with ignoring the details so far :)
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Hi,

hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0);
hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

That reminds me I should solve this in a more elegant way.

qxl takes over the vga io ports. The reason it does this is because qxl switches into vga mode in case the vga ports are accessed while not in vga mode. After doing the check (and possibly switching mode) the vga handler is called to actually handle it.

That twist makes it a bit hard to convert vga ...

Does anyone know how one would do that with the memory api instead? I think taking over the ports is easy, as the memory regions have priorities, so I can simply register a region with higher priority. I have no clue how to forward the access to the vga code though.

Does anyone have clues / suggestions?

thanks, Gerd
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Gerd Hoffmann kra...@redhat.com writes:

Hi,

hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0);
hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

That reminds me I should solve this in a more elegant way.

qxl takes over the vga io ports. The reason it does this is because qxl switches into vga mode in case the vga ports are accessed while not in vga mode. After doing the check (and possibly switching mode) the vga handler is called to actually handle it.

The best way to handle this would be to remodel how we do VGA. Make VGACommonState a proper QOM object and use it as the base class for QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.

The VGA accessors should be exposed as a memory region but the sub class ought to be responsible for actually adding it to a subregion.

That twist makes it a bit hard to convert vga ...

Anyone knows how one would do that with the memory api instead? I think taking over the ports is easy as the memory regions have priorities so I can simply register a region with higher priority. I have no clue how to forward the access to the vga code though.

That should be possible with priorities, but I think it's wrong. There aren't two VGA devices. QXL is-a VGA device and the best way to override behavior of base VGA device is through polymorphism.

This isn't really a memory API issue, it's a modeling issue.

Regards, Anthony Liguori

Anyone has clues / suggestions?

thanks, Gerd
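The "QXL is-a VGA device" modelling suggested in this message can be illustrated with plain C structs and a function pointer. This is a hedged sketch, not QOM and not QEMU code: `struct vga_common`, `struct qxl_like` and all function names are hypothetical, the mode switch is reduced to a boolean, and the base handler returns a dummy register value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical base "class": common VGA state plus an overridable hook. */
struct vga_common {
    uint32_t (*ioport_read)(struct vga_common *s, uint16_t port);
    uint8_t dummy_reg; /* stands in for the real VGA register file */
};

/* Base behaviour: plain VGA port read. */
static uint32_t vga_ioport_read(struct vga_common *s, uint16_t port)
{
    (void)port;
    return s->dummy_reg;
}

static void vga_common_init(struct vga_common *s)
{
    s->ioport_read = vga_ioport_read;
    s->dummy_reg = 0x42;
}

/* Hypothetical subclass: embeds the base as its first member. */
struct qxl_like {
    struct vga_common vga;
    bool native_mode;
};

/* Override: drop back to VGA mode first, then defer to the base handler. */
static uint32_t qxl_like_ioport_read(struct vga_common *s, uint16_t port)
{
    /* Valid cast because vga is the first member of struct qxl_like. */
    struct qxl_like *q = (struct qxl_like *)s;

    if (q->native_mode)
        q->native_mode = false;
    return vga_ioport_read(s, port);
}

static void qxl_like_init(struct qxl_like *q)
{
    vga_common_init(&q->vga);
    q->vga.ioport_read = qxl_like_ioport_read;
    q->native_mode = true;
}

/* Usage example: a VGA port access switches the device out of native mode. */
void qxl_like_self_check(void)
{
    struct qxl_like q;

    qxl_like_init(&q);
    assert(q.native_mode);
    assert(q.vga.ioport_read(&q.vga, 0x3c0) == 0x42);
    assert(!q.native_mode);
}
```

There is only one device and one set of port handlers here; the subclass wraps the base behaviour instead of registering a second, higher-priority region on top of it.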
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Am 30.01.2013 17:33, schrieb Anthony Liguori: Gerd Hoffmann kra...@redhat.com writes: hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0); hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0); That reminds me I should solve this in a more elegant way. qxl takes over the vga io ports. The reason it does this is because qxl switches into vga mode in case the vga ports are accessed while not in vga mode. After doing the check (and possibly switching mode) the vga handler is called to actually handle it. The best way to handle this would be to remodel how we do VGA. Make VGACommonState a proper QOM object and use it as the base class for QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA. That would require polymorphism since we already need to derive from PCIDevice or ISADevice respectively for interfacing with the bus... Modern object-oriented languages have tried to avoid multi-inheritence due to arising complications, I thought. Wouldn't object if someone wanted to do the dirty implementation work though. ;) Another such example is EHCI, with PCIDevice and SysBusDevice frontends, sharing an EHCIState struct and having helper functions operating on that core state only. Quite a few device share such a pattern today actually (serial, m48t59, ...). The VGA accessors should be exposed as a memory region but the sub class ought to be responsible for actually adding it to a subregion. That twist makes it a bit hard to convert vga ... Anyone knows how one would do that with the memory api instead? I think taking over the ports is easy as the memory regions have priorities so I can simply register a region with higher priority. I have no clue how to forward the access to the vga code though. That should be possible with priorities, but I think it's wrong. There aren't two VGA devices. QXL is-a VGA device and the best way to override behavior of base VGA device is through polymorphism. 
In this particular case QXL is-a PCI VGA device though, so we can decouple it from core VGA modeling. Placing the MemoryRegionOps inside the Class (rather than static const) might be a short-term solution for overriding read/write handlers of a particular VGA MemoryRegion. :) Cheers, Andreas This isn't really a memory API issue, it's a modeling issue. Regards, Anthony Liguori Anyone has clues / suggestions? thanks, Gerd -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
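A minimal sketch of the short-term idea above (handlers stored in the class rather than in a static const table, so a subclass can override them and still forward to the base handler). This is a toy model in plain C, not QEMU code: `ToyVGA`, `ToyVGAClass`, `ToyQXL`, `toy_port_write` and the mode enum are hypothetical names invented for illustration, and no real QOM or memory API calls are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these are real QEMU/QOM types. */
typedef struct ToyVGA ToyVGA;

typedef struct ToyVGAClass {
    /* The handler lives in the class, so a subclass can replace it. */
    void (*ioport_write)(ToyVGA *vga, uint32_t addr, uint32_t val);
} ToyVGAClass;

struct ToyVGA {
    const ToyVGAClass *klass;
    uint32_t last_addr;   /* last port written, kept for demonstration */
    uint32_t last_val;
    int writes;
};

typedef enum { TOY_QXL_MODE_NATIVE, TOY_QXL_MODE_VGA } ToyQXLMode;

typedef struct ToyQXL {
    ToyVGA vga;           /* base state embedded as the first member */
    ToyQXLMode mode;
    int mode_switches;
} ToyQXL;

/* Base handler: the plain VGA behaviour. */
void toy_vga_ioport_write(ToyVGA *vga, uint32_t addr, uint32_t val)
{
    vga->last_addr = addr;
    vga->last_val = val;
    vga->writes++;
}

/* Override: switch to VGA mode if needed, then forward to the base handler. */
void toy_qxl_ioport_write(ToyVGA *vga, uint32_t addr, uint32_t val)
{
    ToyQXL *qxl = (ToyQXL *)vga;  /* valid because vga is the first member */

    if (qxl->mode != TOY_QXL_MODE_VGA) {
        qxl->mode = TOY_QXL_MODE_VGA;
        qxl->mode_switches++;
    }
    toy_vga_ioport_write(vga, addr, val);
}

const ToyVGAClass toy_vga_class = { .ioport_write = toy_vga_ioport_write };
const ToyVGAClass toy_qxl_class = { .ioport_write = toy_qxl_ioport_write };

/* Dispatch through the class, as a port I/O layer would. */
void toy_port_write(ToyVGA *vga, uint32_t addr, uint32_t val)
{
    assert(vga != NULL && vga->klass != NULL);
    vga->klass->ioport_write(vga, addr, val);
}
```

A caller would initialise a `ToyQXL` with `.vga.klass = &toy_qxl_class` and go through `toy_port_write()`. The first write flips the mode and later writes only reach the base handler, which is the check-then-forward behaviour Gerd describes for qxl.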
Re: [Qemu-devel] What to do about non-qdevified devices?
On 30/01/2013 14:44, Andreas Färber wrote:
> I disagree on the "or else" part. I have been qdev'ifying and QOM'ifying devices in my maintenance area, and progress is slow. It gets even slower if one leaves clearly maintained areas.
>
> I see no good reason to hold a pistol to someone's chest, like you have done for IDE, unless there is a good reason to do so. Currently I don't see any.

The reason for IDE is that it involved devices that are not SysBusDevices (the IDE disk devices). Having the same code work in two ways, one qdevified and one not, is bad.

For a simple SysBusDevice you're changing a crappy default to a less bad one, but there's really little incentive for qdev/QOM-ification.

Paolo
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Il 30/01/2013 17:33, Anthony Liguori ha scritto: Gerd Hoffmann kra...@redhat.com writes: Hi, hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0); hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0); That reminds me I should solve this in a more elegant way. qxl takes over the vga io ports. The reason it does this is because qxl switches into vga mode in case the vga ports are accessed while not in vga mode. After doing the check (and possibly switching mode) the vga handler is called to actually handle it. The best way to handle this would be to remodel how we do VGA. Make VGACommonState a proper QOM object and use it as the base class for QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA. I think QXL should have-a VGA rather than being one. It completely bypasses the VGA infrastructure if not in VGA mode. The VGA accessors should be exposed as a memory region but the sub class ought to be responsible for actually adding it to a subregion. That twist makes it a bit hard to convert vga ... Anyone knows how one would do that with the memory api instead? I think taking over the ports is easy as the memory regions have priorities so I can simply register a region with higher priority. I have no clue how to forward the access to the vga code though. Avi had a prototype patch series for IOMMU regions. You could add one between the QXL device and the VGA. It doesn't have to do a translation, but trying to translate a VGA address already means that you must go to VGA mode. Paolo That should be possible with priorities, but I think it's wrong. There aren't two VGA devices. QXL is-a VGA device and the best way to override behavior of base VGA device is through polymorphism. This isn't really a memory API issue, it's a modeling issue. Regards, Anthony Liguori Anyone has clues / suggestions? 
thanks, Gerd
Re: [Qemu-devel] What to do about non-qdevified devices?
On 30.01.2013 17:58, Paolo Bonzini wrote:
> On 30/01/2013 14:44, Andreas Färber wrote:
>> I disagree on the "or else" part. I have been qdev'ifying and QOM'ifying devices in my maintenance area, and progress is slow. It gets even slower if one leaves clearly maintained areas.
>>
>> I see no good reason to hold a pistol to someone's chest, like you have done for IDE, unless there is a good reason to do so. Currently I don't see any.
>
> The reason for IDE is that it involved devices that are not SysBusDevices (the IDE disk devices). Having the same code work in two ways, one qdevified and one not, is bad.

Sure, I did help with the QOM'ification there. "Currently I don't see any [good reason]", by contrast, referred to removing *all* devices that are not yet qdev/QOM'ified without such a pressing reason.

> For a simple SysBusDevice you're changing a crappy default to a less bad one, but there's really little incentive for qdev/QOM-ification.

No disagreement. The benefits don't come from doing a conversion; they come from basing new work on the result of a conversion. :)

Andreas

-- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Andreas Färber afaer...@suse.de writes: Am 30.01.2013 17:33, schrieb Anthony Liguori: Gerd Hoffmann kra...@redhat.com writes: hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0); hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0); That reminds me I should solve this in a more elegant way. qxl takes over the vga io ports. The reason it does this is because qxl switches into vga mode in case the vga ports are accessed while not in vga mode. After doing the check (and possibly switching mode) the vga handler is called to actually handle it. The best way to handle this would be to remodel how we do VGA. Make VGACommonState a proper QOM object and use it as the base class for QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA. That would require polymorphism since we already need to derive from PCIDevice or ISADevice respectively for interfacing with the bus... Nope. You can use composition: QXLDevice is-a VGACommonState QXLPCI is-a PCIDevice has-a QXLDevice Modern object-oriented languages have tried to avoid multi-inheritence due to arising complications, I thought. Wouldn't object if someone wanted to do the dirty implementation work though. ;) There is no need for MI. Another such example is EHCI, with PCIDevice and SysBusDevice frontends, sharing an EHCIState struct and having helper functions operating on that core state only. Quite a few device share such a pattern today actually (serial, m48t59, ...). Yes, this is all about chipset modelling. Chipsets should derive from device and then be embedded in the appropriate bus device. For instance. SerialState is-a DeviceState ISASerialState is-a ISADevice, has-a SerialState MMIOSerialState is-a SysbusDevice, has-a SerialState This is what we're doing in practice, we just aren't modeling the chipsets and we're open coding the relationships (often in subtley different ways). 
Regards, Anthony Liguori The VGA accessors should be exposed as a memory region but the sub class ought to be responsible for actually adding it to a subregion. That twist makes it a bit hard to convert vga ... Anyone knows how one would do that with the memory api instead? I think taking over the ports is easy as the memory regions have priorities so I can simply register a region with higher priority. I have no clue how to forward the access to the vga code though. That should be possible with priorities, but I think it's wrong. There aren't two VGA devices. QXL is-a VGA device and the best way to override behavior of base VGA device is through polymorphism. In this particular case QXL is-a PCI VGA device though, so we can decouple it from core VGA modeling. Placing the MemoryRegionOps inside the Class (rather than static const) might be a short-term solution for overriding read/write handlers of a particular VGA MemoryRegion. :) Cheers, Andreas This isn't really a memory API issue, it's a modeling issue. Regards, Anthony Liguori Anyone has clues / suggestions? thanks, Gerd -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
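The is-a / has-a layout Anthony lists for the serial case can be sketched in plain C by embedding structs: "is-a" is the parent as the first member, "has-a" is the chip core as an ordinary member. This is only an illustration under assumed names; `ToyDevice`, `ToySerialState`, `ToyISASerial`, `ToyMMIOSerial` and the helper functions are hypothetical and do not correspond to QEMU's actual types or signatures.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: hypothetical stand-ins, not QEMU types. */
typedef struct ToyDevice { const char *name; } ToyDevice;

/* Chip core: is-a device, knows nothing about any bus. */
typedef struct ToySerialState {
    ToyDevice parent;
    uint8_t regs[8];
} ToySerialState;

/* Bus-level base types. */
typedef struct ToyISADevice {
    ToyDevice parent;
    uint16_t iobase;      /* first I/O port of the device */
} ToyISADevice;

typedef struct ToySysBusDevice {
    ToyDevice parent;
    uint64_t mmio_base;   /* first MMIO address of the device */
    unsigned reg_shift;   /* log2 of the spacing between registers */
} ToySysBusDevice;

/* Front ends: is-a bus device, has-a chip core. */
typedef struct ToyISASerial {
    ToyISADevice parent;
    ToySerialState serial;
} ToyISASerial;

typedef struct ToyMMIOSerial {
    ToySysBusDevice parent;
    ToySerialState serial;
} ToyMMIOSerial;

/* Core helpers operate on the chip state only. */
void toy_serial_write(ToySerialState *s, unsigned reg, uint8_t val)
{
    assert(reg < 8);
    s->regs[reg] = val;
}

uint8_t toy_serial_read(const ToySerialState *s, unsigned reg)
{
    assert(reg < 8);
    return s->regs[reg];
}

/* Each front end only translates its bus address into a register index. */
void toy_isa_serial_outb(ToyISASerial *d, uint16_t port, uint8_t val)
{
    toy_serial_write(&d->serial, (unsigned)(port - d->parent.iobase), val);
}

void toy_mmio_serial_write(ToyMMIOSerial *d, uint64_t addr, uint8_t val)
{
    unsigned reg = (unsigned)((addr - d->parent.mmio_base) >> d->parent.reg_shift);
    toy_serial_write(&d->serial, reg, val);
}
```

Both front ends end up in the same `toy_serial_write()`, so the chip behaviour is written once and the bus-specific part shrinks to address translation. No multiple inheritance is needed for this.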
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Am 30.01.2013 12:48, schrieb Peter Maydell: On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote: Proposal by hpoussin was to move _list_add() code to ISADevice: http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html Concerns: * PCI devices (VGA, QXL) register I/O ports as well = above patches add dependency on ISABus to machines - benh no mac ever had one = PCIDevice shouldn't use ISA API with NULL ISADevice * Lack of avi: Who decides about memory API these days? armbru and agraf concluded that moving this into ISA is wrong. = I will drop the remaining ioport patches from above series. Suggestions on how to proceed with tackling the issue are welcome. How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into... My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, One remark on same way as memory regions, me not knowing all the gory hardware details myself. PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO device you would have a continuous region from say 0xa000 to 0xa007 inclusive and within that region you have some kind of sparse registers. With ISA ports you often have dense overlapping ranges, say, 0x3-0x6 byte-reads foo, while 0x4 word-write does bar. This is handled by having lists of (offset, length, size, handler) quadruplets and consolidating those into MemoryRegions and aliases (cf. patches) that then have a validation function to check whether a particular access is valid and by whom it should be handled - that's what MemoryRegionPortio[] and similar APIs are good for. So yes, it might be possible to have a device declare its ports at PCIDevice or DeviceState level, but it can't be directly passed through to MemoryRegion API in most cases, or conflicts would arise. 
At least that was my experience with PReP.

Andreas

> and the controller for the bus (ISA or PCI) exposes those to the next layer up, and something at board level maps it all into the right places.
>
> -- PMM

-- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
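The list of (offset, length, size, handler) quadruplets described above, together with the check of whether an access is valid and who handles it, could look roughly like this. It is a toy model written for this thread, not the real portio list code: `ToyPortio`, `toy_portio_find`, `toy_portio_write` and the `foo`/`bar` handlers are hypothetical names, and the table simply encodes the "0x3-0x6 byte access does foo, 0x4 word access does bar" example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*ToyPortWrite)(void *opaque, uint32_t offset, uint32_t val);

/* One quadruplet: which ports, which access width, which handler. */
typedef struct ToyPortio {
    uint32_t offset;     /* first port covered, relative to the list base */
    uint32_t len;        /* number of consecutive ports covered */
    unsigned size;       /* access width in bytes this entry accepts */
    ToyPortWrite write;
} ToyPortio;

/* Validation: return the entry handling (offset, size), or NULL if none. */
const ToyPortio *toy_portio_find(const ToyPortio *list, size_t n,
                                 uint32_t offset, unsigned size)
{
    for (size_t i = 0; i < n; i++) {
        const ToyPortio *p = &list[i];
        if (offset >= p->offset && offset - p->offset < p->len &&
            size == p->size) {
            return p;
        }
    }
    return NULL;
}

/* Dispatch a write; returns 1 if some entry handled it, 0 otherwise. */
int toy_portio_write(const ToyPortio *list, size_t n, void *opaque,
                     uint32_t offset, unsigned size, uint32_t val)
{
    const ToyPortio *p = toy_portio_find(list, n, offset, size);
    if (p == NULL) {
        return 0;
    }
    assert(p->write != NULL);
    p->write(opaque, offset, val);
    return 1;
}

/* Demo handlers that only count calls. */
typedef struct ToyLog { int foo_calls; int bar_calls; uint32_t last_val; } ToyLog;

void toy_foo_write(void *opaque, uint32_t offset, uint32_t val)
{
    ToyLog *log = opaque;
    (void)offset;
    log->foo_calls++;
    log->last_val = val;
}

void toy_bar_write(void *opaque, uint32_t offset, uint32_t val)
{
    ToyLog *log = opaque;
    (void)offset;
    log->bar_calls++;
    log->last_val = val;
}

/* Dense, overlapping ranges distinguished by access width. */
const ToyPortio toy_list[] = {
    { .offset = 0x3, .len = 4, .size = 1, .write = toy_foo_write }, /* 0x3-0x6, byte */
    { .offset = 0x4, .len = 1, .size = 2, .write = toy_bar_write }, /* 0x4, word */
};
```

In this model port 0x4 is covered twice and the access width alone decides which handler runs, while a word access at 0x3 matches nothing and is reported as unhandled. How such a table should map onto MemoryRegions is exactly the open question in the thread, so the sketch takes no position on it.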
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, Jan 30, 2013 at 11:29:58AM -0600, Anthony Liguori wrote: Andreas Färber afaer...@suse.de writes: Am 30.01.2013 17:33, schrieb Anthony Liguori: Gerd Hoffmann kra...@redhat.com writes: hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0); hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0); That reminds me I should solve this in a more elegant way. qxl takes over the vga io ports. The reason it does this is because qxl switches into vga mode in case the vga ports are accessed while not in vga mode. After doing the check (and possibly switching mode) the vga handler is called to actually handle it. The best way to handle this would be to remodel how we do VGA. Make VGACommonState a proper QOM object and use it as the base class for QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA. That would require polymorphism since we already need to derive from PCIDevice or ISADevice respectively for interfacing with the bus... Nope. You can use composition: QXLDevice is-a VGACommonState QXLPCI is-a PCIDevice has-a QXLDevice But why like this? The distinction is artificial, isn't it? Modern object-oriented languages have tried to avoid multi-inheritence due to arising complications, I thought. Wouldn't object if someone wanted to do the dirty implementation work though. ;) There is no need for MI. Another such example is EHCI, with PCIDevice and SysBusDevice frontends, sharing an EHCIState struct and having helper functions operating on that core state only. Quite a few device share such a pattern today actually (serial, m48t59, ...). Yes, this is all about chipset modelling. Chipsets should derive from device and then be embedded in the appropriate bus device. For instance. SerialState is-a DeviceState ISASerialState is-a ISADevice, has-a SerialState MMIOSerialState is-a SysbusDevice, has-a SerialState ISASerialState is not a SerialState? Hmm but why? 
This is what we're doing in practice, we just aren't modeling the chipsets and we're open coding the relationships (often in subtley different ways). Regards, Anthony Liguori The VGA accessors should be exposed as a memory region but the sub class ought to be responsible for actually adding it to a subregion. That twist makes it a bit hard to convert vga ... Anyone knows how one would do that with the memory api instead? I think taking over the ports is easy as the memory regions have priorities so I can simply register a region with higher priority. I have no clue how to forward the access to the vga code though. That should be possible with priorities, but I think it's wrong. There aren't two VGA devices. QXL is-a VGA device and the best way to override behavior of base VGA device is through polymorphism. In this particular case QXL is-a PCI VGA device though, so we can decouple it from core VGA modeling. Placing the MemoryRegionOps inside the Class (rather than static const) might be a short-term solution for overriding read/write handlers of a particular VGA MemoryRegion. :) Cheers, Andreas This isn't really a memory API issue, it's a modeling issue. Regards, Anthony Liguori Anyone has clues / suggestions? thanks, Gerd -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, Jan 30, 2013 at 06:55:47PM +0100, Andreas Färber wrote: Am 30.01.2013 12:48, schrieb Peter Maydell: On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote: Proposal by hpoussin was to move _list_add() code to ISADevice: http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html Concerns: * PCI devices (VGA, QXL) register I/O ports as well = above patches add dependency on ISABus to machines - benh no mac ever had one = PCIDevice shouldn't use ISA API with NULL ISADevice * Lack of avi: Who decides about memory API these days? armbru and agraf concluded that moving this into ISA is wrong. = I will drop the remaining ioport patches from above series. Suggestions on how to proceed with tackling the issue are welcome. How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into... My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, One remark on same way as memory regions, me not knowing all the gory hardware details myself. PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO device you would have a continuous region from say 0xa000 to 0xa007 inclusive and within that region you have some kind of sparse registers. With ISA ports you often have dense overlapping ranges, say, 0x3-0x6 byte-reads foo, while 0x4 word-write does bar. Hmm on x86 this is what happens with cf8..cfb range registers for example. We plan handle this ATM using memory region priorities. Same would work for prep won't it? This is handled by having lists of (offset, length, size, handler) quadruplets and consolidating those into MemoryRegions and aliases (cf. patches) that then have a validation function to check whether a particular access is valid and by whom it should be handled - that's what MemoryRegionPortio[] and similar APIs are good for. 
So yes, it might be possible to have a device declare its ports at PCIDevice or DeviceState level, but it can't be directly passed through to MemoryRegion API in most cases, or conflicts would arise. At least that was my experience with PReP. Andreas and the controller for the bus (ISA or PCI) exposes those to the next layer up, and something at board level maps it all into the right places. -- PMM -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Am 30.01.2013 18:29, schrieb Anthony Liguori: Andreas Färber afaer...@suse.de writes: Am 30.01.2013 17:33, schrieb Anthony Liguori: Gerd Hoffmann kra...@redhat.com writes: hw/qxl.c:portio_list_add(qxl_vga_port_list, pci_address_space_io(dev), 0x3b0); hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0); That reminds me I should solve this in a more elegant way. qxl takes over the vga io ports. The reason it does this is because qxl switches into vga mode in case the vga ports are accessed while not in vga mode. After doing the check (and possibly switching mode) the vga handler is called to actually handle it. The best way to handle this would be to remodel how we do VGA. Make VGACommonState a proper QOM object and use it as the base class for QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA. That would require polymorphism since we already need to derive from PCIDevice or ISADevice respectively for interfacing with the bus... Nope. You can use composition: QXLDevice is-a VGACommonState QXLPCI is-a PCIDevice has-a QXLDevice Modern object-oriented languages have tried to avoid multi-inheritence due to arising complications, I thought. Wouldn't object if someone wanted to do the dirty implementation work though. ;) There is no need for MI. Another such example is EHCI, with PCIDevice and SysBusDevice frontends, sharing an EHCIState struct and having helper functions operating on that core state only. Quite a few device share such a pattern today actually (serial, m48t59, ...). Yes, this is all about chipset modelling. Chipsets should derive from device and then be embedded in the appropriate bus device. For instance. SerialState is-a DeviceState ISASerialState is-a ISADevice, has-a SerialState MMIOSerialState is-a SysbusDevice, has-a SerialState Okay, but I don't like that both are transitively DeviceState then. It's much too easy to add / hot-add the wrong device then, especially when dropping no_user. 
Andreas

> This is what we're doing in practice; we just aren't modeling the chipsets, and we're open-coding the relationships (often in subtly different ways).
>
> Regards,
> Anthony Liguori

-- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On 30 January 2013 20:08, Michael S. Tsirkin m...@redhat.com wrote:
> Anthony wrote:
>> Nope. You can use composition:
>>
>> QXLDevice is-a VGACommonState
>> QXLPCI is-a PCIDevice, has-a QXLDevice
>
> But why like this? The distinction is artificial, isn't it?

I think it's the wrong way round. QXLPCI should has-a PCI interface (the physical card possesses an edge connector which fits a PCI socket; it is not the case that the physical card is a kind of edge connector). Having PCI card models inherit from PCIDevice is just a convenient (but misleading) shortcut, and that is what we should drop if it turns out that we should be inheriting from some other class.

Or you could make them both has-a; I don't know enough about QXLDevice to know if it should be is-a or has-a.

-- PMM
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
Am 30.01.2013 21:20, schrieb Michael S. Tsirkin: On Wed, Jan 30, 2013 at 06:55:47PM +0100, Andreas Färber wrote: Am 30.01.2013 12:48, schrieb Peter Maydell: On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote: Proposal by hpoussin was to move _list_add() code to ISADevice: http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html Concerns: * PCI devices (VGA, QXL) register I/O ports as well = above patches add dependency on ISABus to machines - benh no mac ever had one = PCIDevice shouldn't use ISA API with NULL ISADevice * Lack of avi: Who decides about memory API these days? armbru and agraf concluded that moving this into ISA is wrong. = I will drop the remaining ioport patches from above series. Suggestions on how to proceed with tackling the issue are welcome. How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into... My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, One remark on same way as memory regions, me not knowing all the gory hardware details myself. PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO device you would have a continuous region from say 0xa000 to 0xa007 inclusive and within that region you have some kind of sparse registers. With ISA ports you often have dense overlapping ranges, say, 0x3-0x6 byte-reads foo, while 0x4 word-write does bar. Hmm on x86 this is what happens with cf8..cfb range registers for example. We plan handle this ATM using memory region priorities. Same would work for prep won't it? Hm, my point was that iiuc a MemoryRegion is per-address-range whereas for I/O ports we seem to have per-data-width mappings. 
Priorities would allow us to say: 0x1-0xff is one region 0x8-0xab is a region with higher priority but fallback for, e.g., word-access at 0xa0 to the lower-priority region being unsupported today, no? I.e., the region being opaque. Having said that, for the purposes of this discussion PReP is pretty much a PC with a PowerPC CPU in it, unlike the modern CHRP machines. Andreas This is handled by having lists of (offset, length, size, handler) quadruplets and consolidating those into MemoryRegions and aliases (cf. patches) that then have a validation function to check whether a particular access is valid and by whom it should be handled - that's what MemoryRegionPortio[] and similar APIs are good for. So yes, it might be possible to have a device declare its ports at PCIDevice or DeviceState level, but it can't be directly passed through to MemoryRegion API in most cases, or conflicts would arise. At least that was my experience with PReP. -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, Jan 30, 2013 at 09:33:05PM +0100, Andreas Färber wrote: Am 30.01.2013 21:20, schrieb Michael S. Tsirkin: On Wed, Jan 30, 2013 at 06:55:47PM +0100, Andreas Färber wrote: Am 30.01.2013 12:48, schrieb Peter Maydell: On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote: Proposal by hpoussin was to move _list_add() code to ISADevice: http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html Concerns: * PCI devices (VGA, QXL) register I/O ports as well = above patches add dependency on ISABus to machines - benh no mac ever had one = PCIDevice shouldn't use ISA API with NULL ISADevice * Lack of avi: Who decides about memory API these days? armbru and agraf concluded that moving this into ISA is wrong. = I will drop the remaining ioport patches from above series. Suggestions on how to proceed with tackling the issue are welcome. How does this stuff work on real hardware? I would have expected that a PCI device registering the fact it has IO ports would have to do so via the PCI controller it is plugged into... My naive don't-know-much-about-portio suggestion is that this should work the same way as memory regions: each device provides portio regions, One remark on same way as memory regions, me not knowing all the gory hardware details myself. PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO device you would have a continuous region from say 0xa000 to 0xa007 inclusive and within that region you have some kind of sparse registers. With ISA ports you often have dense overlapping ranges, say, 0x3-0x6 byte-reads foo, while 0x4 word-write does bar. Hmm on x86 this is what happens with cf8..cfb range registers for example. We plan handle this ATM using memory region priorities. Same would work for prep won't it? Hm, my point was that iiuc a MemoryRegion is per-address-range whereas for I/O ports we seem to have per-data-width mappings. 
Priorities would allow us to say: 0x1-0xff is one region 0x8-0xab is a region with higher priority but fallback for, e.g., word-access at 0xa0 to the lower-priority region being unsupported today, no? I.e., the region being opaque. No, MemoryRegion takes data width into account too. See 'PIIX3: reset the VM when the Reset Control Register's RCPU bit gets set' as one example. Having said that, for the purposes of this discussion PReP is pretty much a PC with a PowerPC CPU in it, unlike the modern CHRP machines. Andreas This is handled by having lists of (offset, length, size, handler) quadruplets and consolidating those into MemoryRegions and aliases (cf. patches) that then have a validation function to check whether a particular access is valid and by whom it should be handled - that's what MemoryRegionPortio[] and similar APIs are good for. So yes, it might be possible to have a device declare its ports at PCIDevice or DeviceState level, but it can't be directly passed through to MemoryRegion API in most cases, or conflicts would arise. At least that was my experience with PReP. -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, 2013-01-30 at 07:59 -0600, Anthony Liguori wrote: An x86 CPU has a MMIO capability that's essentially 65 bits. Whether the top bit is set determines whether it's a PIO transaction or an MMIO transaction. A large chunk of that address space is invalid of course. PCI has a 65 bit address space too. The 65th bit determines whether it's an IO transaction or an MMIO transaction. This is somewhat an over simplification since IO and MMIO differs in other ways, such as ordering rules :-) But for the sake of memory regions decoding I suppose it will do. For architectures that only have a 64-bit address space, what the PCI controller typically does is pick a 16-bit window within that address space to map to a PCI address with the 65th bit set. Sort-of yes. The window doesn't have to be 16-bit (we commonly have larger IO space windows on powerpc) and there's a window per host bridge, so there's effectively more than one IO space (as there is more than one PCI MMIO space, with only a window off the CPU space routed to each brigde). Making a hard wired assumption that the PCI (MMIO and IO) space relates directly to the CPU bus space is wrong on pretty much all !x86 architectures. .../... You make it sound like substractive decode is a chipset hack. It's not, it's specified in the PCI spec. 1) A chipset will route any non-positively decoded IO transaction (65th bit set) to a single end point (usually the ISA-bridge). Which one it chooses is up to the chipset. This is called subtractive decoding because the PCI bus will wait multiple cycles for that device to claim the transaction before bouncing it. This is not a chipset matter. It's the ISA bridge itself that does substractive decoding. There also exists P2P bridges doing such substractive decoding, this used to be fairly common with transparent bridges used for laptop docking. 2) There are special hacks in most PCI chipsets to route very specific addresses ranges to certain devices. 
Namely, legacy VGA IO transactions go to the first VGA device. Legacy IDE IO transactions go to the first IDE device. This doesn't need to be programmed in the BARs. It will just happen. This is also mostly not a hack in the chipset. It's a well defined behaviour for legacy devices, sometimes call hard decoding. Of course often those devices are built into the chipset but they don't have to. Plug-in VGA devices will hard decode legacy VGA regions for both IO and MMIO by default (this can be disabled on most of them nowadays) for example. This has nothing to do with the chipset. There's a specific bit in P2P bridge to control the forwarding of legacy transaction downstream (and VGA palette snoops), this is also fully specified in the PCI spec. 3) As it turns out, all legacy PIIX3 devices are positively decoded and sent to the ISA-bridge (because it's faster this way). Chipsets don't send to a bridge. It's the bridge itself that decodes. Notice the lack of the word ISA in all of this other than describing the PCI class of an end point. ISA is only relevant to the extent that the legacy regions of IO space originate from the original ISA addresses of devices (VGA, IDE, etc...) and to the extent that an ISA bus might still be present which will get the transactions that nothing else have decoded in that space. So how should this be modeled? On x86, the CPU has a pio address space. That can propagate down through the PCI bus which is what we do today. On !x86, the PCI controller ought to setup a MemoryRegion for downstream PIO that devices can use to register on. We probably need to do something like change the PCI VGA devices to export a MemoryRegion and allow the PCI controller to device how to register that as a subregion. The VGA device should just register fixed address port IOs the same way it would register an IO BAR. 
Essentially, hard coded IO addresses (or memory, VGA does memory too, don't forget that) are equivalent to having an invisible BAR with a fixed value in it. There should be no global port IO because that concept is broken on real multi-domain setups. Those legacy address ranges are just hard-wired sub regions of the normal PCI space on which the device sits (unless you start doing real non-PCI ISA x86).

Cheers, Ben.
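The decoding model argued for in this message (one IO space per host bridge, legacy ranges treated as BARs with a fixed base, and a subtractive fallback) can be sketched in a few lines of C. This is only an illustration under stated assumptions: every type and function name here (`pci_domain`, `domain_add_io`, `domain_decode`, the `dev_id` values) is hypothetical and is not QEMU's MemoryRegion API, and the window addresses used with it are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not QEMU's MemoryRegion API. */

enum dev_id { DEV_NONE = 0, DEV_VGA, DEV_NIC, DEV_ISA_BRIDGE };

enum { MAX_REGIONS = 8 };

/* One claimed range inside a PCI domain's IO space. */
struct io_region {
    uint32_t base;        /* PCI IO address, not a CPU address */
    uint32_t size;
    enum dev_id owner;
};

/* One PCI domain: each host bridge has its own IO space. */
struct pci_domain {
    uint64_t cpu_window_base;  /* where the bridge maps IO space in CPU space */
    uint32_t io_size;          /* window size; need not be 64K */
    struct io_region regions[MAX_REGIONS];
    size_t nregions;
    enum dev_id subtractive;   /* takes unclaimed cycles, or DEV_NONE */
};

/* Register a range. A programmable IO BAR and a hard-decoded legacy range
 * look the same here; the legacy one just has a base that never changes. */
void domain_add_io(struct pci_domain *d, uint32_t base, uint32_t size,
                   enum dev_id owner)
{
    assert(d->nregions < MAX_REGIONS);
    d->regions[d->nregions++] = (struct io_region){ base, size, owner };
}

/* Resolve a CPU address to the device that claims it in this domain.
 * Returns DEV_NONE if the address is outside this bridge's window or
 * nothing claims it and there is no subtractive decoder. */
enum dev_id domain_decode(const struct pci_domain *d, uint64_t cpu_addr)
{
    if (cpu_addr < d->cpu_window_base ||
        cpu_addr - d->cpu_window_base >= d->io_size) {
        return DEV_NONE;                  /* not routed to this bridge */
    }
    uint32_t io_addr = (uint32_t)(cpu_addr - d->cpu_window_base);
    for (size_t i = 0; i < d->nregions; i++) {
        const struct io_region *r = &d->regions[i];
        if (io_addr >= r->base && io_addr - r->base < r->size) {
            return r->owner;              /* positive or hard decode */
        }
    }
    return d->subtractive;                /* subtractive decode */
}
```

With two domains set up this way, a VGA device in each can claim IO ports 0x3c0-0x3df without conflict, because the lookup always starts from a specific bridge window rather than from a single global port space.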
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, 2013-01-30 at 17:54 +0100, Andreas Färber wrote:

That would require polymorphism since we already need to derive from PCIDevice or ISADevice respectively for interfacing with the bus... Modern object-oriented languages have tried to avoid multiple inheritance due to arising complications, I thought. Wouldn't object if someone wanted to do the dirty implementation work though. ;) Another such example is EHCI, with PCIDevice and SysBusDevice frontends, sharing an EHCIState struct and having helper functions operating on that core state only. Quite a few devices share such a pattern today actually (serial, m48t59, ...).

This is a design bug of your model :-) You shouldn't derive from your bus interface IMHO but from your functional interface, and have an ownership relation to the PCIDevice (a bit like IOKit does if my memory serves me well).

Cheers, Ben.
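The "own the bus attachment instead of deriving from it" idea can be illustrated with a small C sketch. It is a hedged illustration only: `bus_attachment`, `usb_host_ctrl`, `count_irq` and `usb_hc_port_change` are hypothetical names, not QEMU's QOM types and not IOKit's, and the IRQ counter stands in for whatever a real PCI or system-bus frontend would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names are hypothetical; none of this is QEMU's QOM or IOKit API. */

/* Bus attachment: the only part that differs between a PCI and a
 * system-bus variant of the same device. */
struct bus_attachment {
    const char *bus_name;
    void (*raise_irq)(struct bus_attachment *self);
    unsigned irqs_raised;
};

/* Functional object: a USB host controller that owns a bus attachment
 * instead of deriving from one. */
struct usb_host_ctrl {
    struct bus_attachment *bus;   /* has-a, not is-a */
    uint32_t status;
};

enum { STS_PORT_CHANGE = 1u << 2 };

/* Stand-in for a bus-specific interrupt path; it only counts calls. */
void count_irq(struct bus_attachment *self)
{
    self->irqs_raised++;
}

/* Core logic is written once and never inspects the bus type. */
void usb_hc_port_change(struct usb_host_ctrl *hc)
{
    assert(hc->bus != NULL && hc->bus->raise_irq != NULL);
    hc->status |= STS_PORT_CHANGE;
    hc->bus->raise_irq(hc->bus);
}
```

The same `usb_hc_port_change` then serves a controller attached through a PCI-style attachment and one attached through a system-bus-style attachment, which is the sharing that the EHCIState pattern achieves today through helper functions.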
Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O
On Wed, 2013-01-30 at 18:08 +0100, Paolo Bonzini wrote: Make VGACommonState a proper QOM object and use it as the base class for QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA. I think QXL should have-a VGA rather than being one. It completely bypasses the VGA infrastructure if not in VGA mode. ... Like any modern video card the minute you turn off the enable legacy crap bit on them :-) Cheers, Ben.
Re: windows 2008 guest causing rcu_shed to emit NMI
On Wed, Jan 30, 2013 at 11:21:08AM +0300, Andrey Korolyov wrote:
On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote:
On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote:
On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:

Thank you Marcelo, the host node is locking up, sometimes later than yesterday, but the problem is still here; please see the attached dmesg. The stuck process looks like

root 19251 0.0 0.0 228476 12488 ? D 14:42 0:00 /usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device virtio-blk-pci,? -device

on the fourth VM by count. Should I try an upstream kernel instead of applying the patch to the latest 3.4, or is it useless?

If you can upgrade to an upstream kernel, please do that.

With vanilla 3.7.4 there are almost no changes, and the NMI started firing again. The external symptoms look like the following: starting from some count, maybe the third or sixth VM, the qemu-kvm process allocates its memory very slowly and in jumps, 20M-200M-700M-1.6G over minutes. The patch helps, of course: on both patched 3.4 and vanilla 3.7 I'm able to kill the stuck kvm processes and the node returns to normal, whereas on 3.2 sending SIGKILL to the process causes zombies and a hung "ps" output (the problem, and a workaround when no scheduler is involved, are described here: http://www.spinics.net/lists/kvm/msg84799.html).

Try disabling pause loop exiting with ple_gap=0 kvm-intel.ko module parameter.

Hi Marcelo, thanks, this parameter helped to increase the number of working VMs by half an order of magnitude, from 3-4 to 10-15.
Very high SY load, 10 to 15 percent, persists at such numbers for a long time, whereas Linux guests in the same configuration do not go over one percent even under a stress bench. After I disabled HT, the crash happens only in long runs, and now it is a kernel panic :) The stair-like memory allocation behaviour disappeared, but another symptom leading to the crash, which I had not counted previously, persists: if the VM count is "enough" for a crash, some qemu processes start to eat one core, and they'll panic the system after running for tens of minutes in that state, or if I try to attach a debugger to one of them. If needed, I can log the entire crash output via netconsole; for now I have some tail, almost the same every time: http://xdel.ru/downloads/btwin.png

Yes, please log entire crash output, thanks.

Here please, 3.7.4-vanilla, 16 vms, ple_gap=0: http://xdel.ru/downloads/oops-default-kvmintel.txt

Just an update: I was able to reproduce that on pure Linux VMs using qemu-1.3.0 and the "stress" benchmark running on them; the panic occurs at the start of a VM (with ten working machines at the moment). Qemu-1.1.2 generally is not able to reproduce that, but a host node with the older version crashes with fewer Windows VMs (three to six instead of ten to fifteen) than with 1.3; please see the trace below: http://xdel.ru/downloads/oops-old-qemu.txt

Single bit memory error, apparently. Try:

1. memtest86.
2. Boot with slub_debug=ZFPU kernel parameter.
3. Reproduce on different machine

Hi Marcelo, I always follow the rule: if some weird bug exists, check it on an ECC-enabled machine and check the IPMI logs too before starting to complain :) I have finally managed to "fix" the problem, but my solution seems a bit strange:

- I have noticed that if virtual machines are started without any cgroup setting, they will not cause this bug under any conditions,
- I had thought, quite wrongly, that CONFIG_SCHED_AUTOGROUP should regroup only the tasks without any cgroup and should not touch tasks already inside an existing cpu cgroup. A first look at the 200-line patch shows that autogrouping always applies to all tasks, so I tried to disable it,
- wild magic appears: the VMs did not crash the host any more; even with 30+ of them they work fine.

I still don't know what exactly triggered that and whether I will face it again under different conditions, so my solution is more likely a patch of mud in the wall of the dam than a proper fix. There seem to be two possible origins of such an error: a very, very hideous race condition involving cgroups and processes like qemu-kvm causing frequent context switches, and a simple incompatibility between NUMA, the logic of CONFIG_SCHED_AUTOGROUP and qemu VMs already doing