Re: Windows slow boot: contractor wanted
On 08/17/2012 03:36 PM, Richard Davies wrote:
> Hi Avi,
>
> Thanks to you and several others for offering help. We will work with
> Avi at first, but are grateful for all the other offers of help. We have
> a number of other qemu-related projects which we'd be interested in
> getting done, and will get in touch with these names (and anyone else
> who comes forward) to see if any are of interest to you.
>
> This slow boot problem is intermittent and varies in how slow the boots
> are, but I managed to trigger it this morning with medium slow booting
> (5-10 minutes) and link to the requested traces below.
>
> The host in question has 128GB RAM and dual AMD Opteron 6128 (16 cores
> total). It is running kernel 3.5.1 and qemu-kvm 1.1.1.
>
> In this morning's test, we have 3 guests, all booting Windows with 40GB
> RAM and 8 cores each (we have seen small VMs go slow as I originally
> said, but it is easier to trigger with big VMs):
>
> pid 15665: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
>   -vga cirrus -usbdevice tablet -vnc :99 -monitor stdio -hda test1.raw
> pid 15676: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
>   -vga cirrus -usbdevice tablet -vnc :98 -monitor stdio -hda test2.raw
> pid 15653: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
>   -vga cirrus -usbdevice tablet -vnc :97 -monitor stdio -hda test3.raw

40+40+40=120, pretty close to your server specs. Are you swapping?

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC-v2 3/6] vhost-scsi: add -vhost-scsi host device for use with tcm-vhost
On Sat, Aug 18, 2012 at 05:36:26PM -0700, Nicholas A. Bellinger wrote: On Sat, 2012-08-18 at 22:12 +0300, Michael S. Tsirkin wrote: On Tue, Aug 14, 2012 at 01:31:14PM -0700, Nicholas A. Bellinger wrote: On Mon, 2012-08-13 at 11:53 +0300, Michael S. Tsirkin wrote: On Mon, Aug 13, 2012 at 08:35:14AM +, Nicholas A. Bellinger wrote: From: Stefan Hajnoczi stefa...@linux.vnet.ibm.com SNIP +static VHostSCSI *vhost_scsi_add(const char *id, const char *wwpn, + uint16_t tpgt) +{ +VHostSCSI *vs = g_malloc0(sizeof(*vs)); +int ret; + +/* TODO set up vhost-scsi device and bind to tcm_vhost/$wwpm/tpgt_$tpgt */ +fprintf(stderr, wwpn = \%s\ tpgt = \%u\\n, id, tpgt); + +ret = vhost_dev_init(vs-dev, -1, /dev/vhost-scsi, false); This -1 is a hack. You need to support passing in fd from the monitor, and pass it here. Mmm, looking at how vhost_net_init + tap.c does this, but am not quite what fd needs to be propagated up for virtio-scsi - vhost-scsi.. Can you please elaborate on this one a bit more..? The idea is to allow running as a user without access to /dev/vhost-scsi. For this, allow passing in the fd of /dev/vhost-scsi through unix domain sockets. Ah, that is a pretty neat trick.. So for vhost-scsi code, this would mean something along the lines of the following, yes..? Yes but with one correction. See below. Thanks MST! 
diff --git a/hw/vhost-scsi.c b/hw/vhost-scsi.c index 4206a75..8af8758 100644 --- a/hw/vhost-scsi.c +++ b/hw/vhost-scsi.c @@ -21,6 +21,7 @@ struct VHostSCSI { const char *id; const char *wwpn; uint16_t tpgt; +int vhostfd; struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_SCSI_VQ_NUM]; QLIST_ENTRY(VHostSCSI) list; @@ -114,13 +115,32 @@ void vhost_scsi_stop(VHostSCSI *vs, VirtIODevice *vdev) } static VHostSCSI *vhost_scsi_add(const char *id, const char *wwpn, - uint16_t tpgt) + uint16_t tpgt, const char *vhostfd_str) { -VHostSCSI *vs = g_malloc0(sizeof(*vs)); +VHostSCSI *vs; int ret; +vs = g_malloc0(sizeof(*vs)); +if (!vs) { +error_report(vhost-scsi: unable to allocate *vs\n); +return NULL; +} +vs-vhostfd = -1; + +if (vhostfd_str) { +if (!qemu_isdigit(vhostfd_str[0])) { +error_report(vhost-scsi: passed vhostfd value is not a digit\n); +return NULL; This let you use an fd which was open at exec but does not allow for fd to be open later in case device is hot-plugged. See net_handle_fd_param - I think you can just rename it qemu_handle_fd_param to avoid code duplication. 
+} + +vs-vhostfd = qemu_parse_fd(vhostfd_str); +if (vs-vhostfd == -1) { +error_report(vhost-scsi: unable to parse vs-vhostfd\n); +return NULL; +} +} /* TODO set up vhost-scsi device and bind to tcm_vhost/$wwpm/tpgt_$tpgt */ -ret = vhost_dev_init(vs-dev, -1, /dev/vhost-scsi, false); +ret = vhost_dev_init(vs-dev, vs-vhostfd, /dev/vhost-scsi, false); if (ret 0) { error_report(vhost-scsi: vhost initialization failed: %s\n, strerror(-ret)); @@ -140,7 +160,7 @@ static VHostSCSI *vhost_scsi_add(const char *id, const char *wwpn, VHostSCSI *vhost_scsi_add_opts(QemuOpts *opts) { const char *id; -const char *wwpn; +const char *wwpn, *vhostfd; uint64_t tpgt; id = qemu_opts_id(opts); @@ -164,6 +184,7 @@ VHostSCSI *vhost_scsi_add_opts(QemuOpts *opts) error_report(vhost-scsi: \%s\ needs a 16-bit tpgt\n, id); return NULL; } +vhostfd = qemu_opt_get(opts, vhostfd); -return vhost_scsi_add(id, wwpn, tpgt); +return vhost_scsi_add(id, wwpn, tpgt, vhostfd); } diff --git a/qemu-config.c b/qemu-config.c index 33399ea..2d4884c 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -636,6 +636,9 @@ QemuOptsList qemu_vhost_scsi_opts = { }, { .name = tpgt, .type = QEMU_OPT_NUMBER, +}, { +.name = vhostfd, +.type = QEMU_OPT_STRING, }, { /* end of list */ } }, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
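The digit check plus parse step discussed above can be sketched in isolation. This is only an illustration under stated assumptions: `parse_fd_string` is a hypothetical helper, not QEMU's `qemu_parse_fd`, and it covers only a numeric fd inherited at exec time. It does not handle the named, monitor-passed fds that the review asks for via `net_handle_fd_param`.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal file-descriptor string.
 * Returns the fd (>= 0) on success, or -1 if the string is missing,
 * does not start with a digit, has trailing junk, or overflows int.
 */
static int parse_fd_string(const char *s)
{
    char *end;
    long v;

    if (!s || !isdigit((unsigned char)s[0]))
        return -1;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > INT_MAX)
        return -1;

    return (int)v;
}
```

A caller would treat -1 as "no usable fd" and report an error instead of falling back silently.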
Re: Windows slow boot: contractor wanted
Avi Kivity wrote:
> Richard Davies wrote:
>> The host in question has 128GB RAM and dual AMD Opteron 6128 (16 cores
>> total). It is running kernel 3.5.1 and qemu-kvm 1.1.1.
>>
>> In this morning's test, we have 3 guests, all booting Windows with 40GB
>> RAM and 8 cores each (we have seen small VMs go slow as I originally
>> said, but it is easier to trigger with big VMs):
>
> 40+40+40=120, pretty close to your server specs. Are you swapping?

No - you can see on the top screenshot that there's no swap in use.

Richard.
Re: [regression] virtio net locks up
On Sat, Aug 18, 2012 at 01:25:05AM +0200, Bernd Schubert wrote: On 08/12/2012 01:45 PM, Michael S. Tsirkin wrote: On Mon, Jul 30, 2012 at 08:08:31PM +0200, Bernd Schubert wrote: On 07/30/2012 07:33 PM, Bernd Schubert wrote: Hello Stefan, Stefan Hajnoczi stefanha at gmail.com writes: On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert bernd.schubert at itwm.fraunhofer.de wrote: On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote: Try pinging the host's IP address from inside the guest. Run tcpdump on the guest's tap interface from the host and observe whether or not you see any packets being sent from the guest. sorry for my terribly late reply. As usual I got distracted by too many other things and then returned the hardware I was running the VMs on. My new desktop system is better suitable to run kvm and I can easily reproduce it now with 3.5 on host and guest side. So its not fixed in recent versions yet. Seems arp requests are still going out, but then don't go in: 17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui Unknown), length 28 17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28 17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown), length 28 17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28 Okay, so it seems networking from the tap device and beyond is fine. rmmod virtio_net inside the guest and then modprobe virtio_net again. See if network connectivity is restored (remember to rerun DHCP or whatever, if necessary). Yep, that makes it work again. But probably is not the real solution ;) It's just another piece of information which helps debug this :). At least nothing has wedged itself into an unrecoverable state. When you said the problem happens without vhost, did you explicitly run vhost=off? Or did you just omit vhost=on? It was definitely off and I can confirm that it also locks up with vhost=on and vhost=off with 3.5. This sounds like a guest kernel/driver issue. 
I recommend testing git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in the guest to see if this has already been fixed. If you have the -dbg RPMs installed it may be possible to insert a probe into the virtio_net kernel module and observe receive interrupts. This does require the right kernel CONFIG_ but you might already have it enabled: $ sudo perf probe --add skb_recv_done $ sudo perf record -e probe:skb_recv_done -a ...send some packets to the guest... ^C $ sudo perf script If you see no skb_recv_done events then the guest driver is not receiving a notification when packets are received. You can find more about how to use perf-probe(1) at http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html. Ah nice, I would have used systemtap, but always wanted to check how to do it with perf :) So once the virtio NIC has locked up, I don't get any events from it anymore - until I remove/re-insert the virtio module (including ifup/ifdown). I will try to find some time later on this week to look into it again. Any further ideas how to proceed (I haven't even checked yet how virtio works at all...). I took a quick glance where skb_recv_done is registered at all and traced it back to vp_find_vqs(). Looking into that function I noticed MSI and so tried to boot with pci=nomsi. And indeed I guessed it right, with pci=nomsi I don't get any lockups anymore. Am I the only one booting kvm-qemu usually with enabled MSI? Cheers, Bernd No :) I am guessing it has to do with OOM handling in the guest - it is tested very little but maybe your guest is such that atomic pool gets exhausted for some reason. Could you pls check whether refill_work runs by tracing it? This is our OOM handler. Just checked it, it does not show up in perf script output. 
Cheers, Bernd When running with vhost-net on, if you enable DEBUG in *host* kernel build (or set CONFIG_DYNAMIC_DEBUG and enable messages for the vhost_net module) pr_debug will output some debug messages if guest bug is detected. Can you try this? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file
On 08/19/2012 05:56 AM, Andi Kleen wrote:
> From: Andi Kleen a...@linux.intel.com
>
> The VMX code references a local assembler label between two inline
> assembler statements. This assumes they both end up in the same
> assembler files. In some experimental builds of gcc this is not
> necessarily true, causing linker failures.
>
> Replace the local label reference with a more traditional asmlinkage
> extern.
>
> This also eliminates one assembler statement and generates a bit better
> code on 64bit: the compiler can use a RIP relative LEA instead of a
> movabs, saving a few bytes.

I'm happy to see work on lto-enabling the kernel.

> +extern __visible unsigned long kvm_vmx_return;
> +
>  /*
>   * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>   * will not change in the lifetime of the guest.
> @@ -3753,8 +3755,7 @@ static void vmx_set_constant_host_state(void)
>  	native_store_idt(&dt);
>  	vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
>
> -	asm("mov $.Lkvm_vmx_return, %0" : "=r"(tmpl));
> -	vmcs_writel(HOST_RIP, tmpl); /* 22.2.5 */
> +	vmcs_writel(HOST_RIP, (unsigned long)&kvm_vmx_return); /* 22.2.5 */
>
>  	rdmsr(MSR_IA32_SYSENTER_CS, low32, high32);
>  	vmcs_write32(HOST_IA32_SYSENTER_CS, low32);
> @@ -6305,9 +6306,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  		/* Enter guest mode */
>  		"jne .Llaunched \n\t"
>  		__ex(ASM_VMX_VMLAUNCH) "\n\t"
> -		"jmp .Lkvm_vmx_return \n\t"
> +		"jmp kvm_vmx_return \n\t"
>  		".Llaunched: " __ex(ASM_VMX_VMRESUME) "\n\t"
> -		".Lkvm_vmx_return: "
> +		".globl kvm_vmx_return\n"
> +		"kvm_vmx_return: "
>  		/* Save guest registers, load host registers, keep flags */
>  		"mov %0, %c[wordsize](%%"R"sp) \n\t"
>  		"pop %0 \n\t"

The reason we use a local label is so that the function isn't split
into two from the profiler's point of view. See cd2276a795b013d1.

One way to fix this is to have a .data variable initialized to point to
.Lkvm_vmx_return (this can be done from the same asm statement in
vmx_vcpu_run), and reference that variable in
vmx_set_constant_host_state(). If no one comes up with a better idea,
I'll write a patch doing this.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3] KVM: x86 emulator: access GPRs on demand
On 08/17/2012 08:29 PM, Marcelo Tosatti wrote:
> On Thu, Aug 16, 2012 at 05:54:49PM +0300, Avi Kivity wrote:
>> Instead of populating the entire register file, read in registers
>> as they are accessed, and write back only the modified ones. This
>> saves a VMREAD and VMWRITE on Intel (for rsp, since it is not usually
>> used during emulation), and two 128-byte copies for the registers.
>>
>> @@ -2715,14 +2764,17 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
>>  {
>>  	int rc;
>>
>> +	invalidate_registers(ctxt);
>>  	ctxt->_eip = ctxt->eip;
>>  	ctxt->dst.type = OP_NONE;
>>
>>  	rc = emulator_do_task_switch(ctxt, tss_selector, idt_index, reason,
>>  				     has_error_code, error_code);
>>
>> -	if (rc == X86EMUL_CONTINUE)
>> +	if (rc == X86EMUL_CONTINUE) {
>>  		ctxt->eip = ctxt->_eip;
>> +		writeback_registers(ctxt);
>> +	}
>>
>>  	return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK;
>>  }
>
> No clear point when the emulator register cache is active and when it
> is not (AFAICS this patch does not invalidate registers on emulation
> start (the above being one of the exceptions), and does not clear the
> valid bit on writeback-to-vcpu-cache on emulation exit).

It is cleared when emulation starts. For the non-insn-emulation entry
points, there is an explicit invalidate. For the emulation entry point,
there is a memset() that clears everything up to _regs, which includes
the cache.

This discrepancy isn't nice, but it preexists. I don't know whether we
should decompose the memset() or not, it is rather efficient.

> Concern is that emulator can start with cached registers marked as valid
> but in fact are invalid from previous emulation round. Maybe move
> invalidate() to init_emulate_ctxt?

See the memset() in init_decode_cache().

-- 
error compiling committee.c: too many arguments to function
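For readers following the valid-bit discussion, here is a minimal standalone sketch of an on-demand register cache. It is an illustration, not the emulator code: only the names `invalidate_registers` and `writeback_registers` come from the patch, while `struct vcpu_regs`, `struct reg_cache`, `reg_read` and `reg_write` are hypothetical, and the read/write counters merely stand in for the cost of a VMREAD or VMWRITE.

```c
#include <stdint.h>

#define NR_REGS 16

/* Hypothetical backing store standing in for the vcpu register file. */
struct vcpu_regs {
    uint64_t gpr[NR_REGS];
    unsigned reads;   /* counts simulated expensive reads  */
    unsigned writes;  /* counts simulated expensive writes */
};

struct reg_cache {
    struct vcpu_regs *vcpu;
    uint64_t regs[NR_REGS];
    uint32_t valid;   /* bit n set: regs[n] holds a current copy */
    uint32_t dirty;   /* bit n set: regs[n] must be written back */
};

static void cache_init(struct reg_cache *c, struct vcpu_regs *v)
{
    c->vcpu = v;
    c->valid = 0;
    c->dirty = 0;
}

/* Fetch a register from the backing store only on first access. */
static uint64_t reg_read(struct reg_cache *c, unsigned nr)
{
    if (!(c->valid & (1u << nr))) {
        c->regs[nr] = c->vcpu->gpr[nr];
        c->vcpu->reads++;
        c->valid |= 1u << nr;
    }
    return c->regs[nr];
}

/* A write makes the cached copy both valid and dirty. */
static void reg_write(struct reg_cache *c, unsigned nr, uint64_t val)
{
    c->regs[nr] = val;
    c->valid |= 1u << nr;
    c->dirty |= 1u << nr;
}

/* Push only the modified registers back to the backing store. */
static void writeback_registers(struct reg_cache *c)
{
    for (unsigned nr = 0; nr < NR_REGS; nr++) {
        if (c->dirty & (1u << nr)) {
            c->vcpu->gpr[nr] = c->regs[nr];
            c->vcpu->writes++;
        }
    }
    c->dirty = 0;
}

/* Forget all cached copies so the next round re-reads from the vcpu. */
static void invalidate_registers(struct reg_cache *c)
{
    c->valid = 0;
    c->dirty = 0;
}
```

The concern raised in the thread corresponds to skipping `invalidate_registers()` between rounds: a stale valid bit would make `reg_read()` return the previous round's value.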
Re: [kvmarm] [PATCH v10 07/14] KVM: ARM: Memory virtualization setup
On 19 August 2012 05:34, Christoffer Dall c.d...@virtualopensystems.com wrote:
> On Thu, Aug 16, 2012 at 2:25 PM, Alexander Graf ag...@suse.de wrote:
>> A single hva can have multiple gpas mapped, no? At least that's what I
>> gathered from the discussion about my attempt to a function similar to
>> this :).
>
> I don't think this is the case for ARM, can you provide an example? We
> use gfn_to_pfn_prot and only allow user memory regions. What you suggest
> would be multiple physical addresses pointing to the same memory bank,
> I don't think that makes any sense on ARM hardware, for x86 and PPC I
> don't know.

I don't know what an hva is, but yes, ARM boards can have the same
block of RAM aliased into multiple places in the physical address space.
(we don't currently bother to implement the aliases in qemu's
vexpress-a15 though because it's a bunch of mappings of the low 2GB
into high addresses mostly intended to let you test LPAE code without
having to put lots of RAM on the hardware).

-- PMM
Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly
On 08/17/2012 09:39 PM, Marcelo Tosatti wrote:
> Yes. Well, Avi mentioned earlier that there are users for change of GPA
> base. But, if my understanding is correct, the code that emulates
> change of BAR in QEMU is:
>
>     /* now do the real mapping */
>     if (r->addr != PCI_BAR_UNMAPPED) {
>         memory_region_del_subregion(r->address_space, r->memory);
>     }
>     r->addr = new_addr;
>     if (r->addr != PCI_BAR_UNMAPPED) {
>         memory_region_add_subregion_overlap(r->address_space,
>                                             r->addr, r->memory, 1);
>
> These translate to two kvm_set_user_memory ioctls.

Not directly. These functions change a qemu-internal memory map, which
is then transferred to kvm. Those two calls might be in a transaction
(they aren't now), in which case the memory map update is atomic.

So indeed we issue two ioctls now, but that's a side effect of the
implementation, not related to those two calls being separate.

>>> Without taking into consideration backwards compatibility, userspace
>>> can first delete the slot and later create a new one.
>>
>> Current qemu will in fact do that. Not sure about older ones.
>
> Avi, where it does that?

By that I meant first deleting the first slot and then creating a new one.

-- 
error compiling committee.c: too many arguments to function
Re: qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop
On 08/17/2012 06:04 PM, Jan Kiszka wrote: Can anyone imagine that such a barrier may actually be required? If it is currently possible that env-stop is evaluated before we called into sigtimedwait in qemu_kvm_eat_signals, then we could actually eat the signal without properly processing its reason (stop). Should not be required (TM): Both signal eating / stop checking and stop setting / signal generation happens under the BQL, thus the ordering must not make a difference here. Agree. Don't see where we could lose a signal. Maybe due to a subtle memory corruption that sets thread_kicked to non-zero, preventing the kicking this way. Cannot be ruled out, yet too much of a coincidence. Could be a kernel bug (either in kvm or elsewhere), we've had several before in this area. Is this reproducible? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/7] memory: Flush coalesced MMIO on selected region access
On 08/17/2012 01:55 PM, Jan Kiszka wrote: On 2012-07-10 12:41, Jan Kiszka wrote: On 2012-07-02 11:07, Avi Kivity wrote: On 06/29/2012 07:37 PM, Jan Kiszka wrote: Instead of flushing pending coalesced MMIO requests on every vmexit, this provides a mechanism to selectively flush when memory regions related to the coalesced one are accessed. This first of all includes the coalesced region itself but can also applied to other regions, e.g. of the same device, by calling memory_region_set_flush_coalesced. Looks fine. I have a hard time deciding whether this should go through the kvm tree or memory tree. Anthony, perhaps you can commit it directly to avoid the livelock? Reviewed-by: Avi Kivity a...@redhat.com Anthony, ping? Argh, missed that this series was forgotten. Patch 1 is a bug fix, will resend it separately so that it can make it into 1.2. Will repost the rest once master reopens. My fault, I should have just taken it into memory/core and sent a pull request. Sorry about that. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: perf uncore lkvm woes
On 08/17/2012 09:56 AM, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 09:40 +0800, Yan, Zheng wrote:
>> Peter, do I need to submit a patch that disables uncore on virtualized
>> CPU?
>
> I think Avi prefers the method where KVM 'fakes' the MSRs and we have to
> detect if the MSRs actually work or not.

s/we have/we don't have/.

> If you're willing to have a go at that, please do so. If you're not sure
> how to do the KVM part, I'm sure Avi and/or Gleb can help you out.

Certainly, please see kvm_pmu_get_msr() and kvm_pmu_set_msr(). The
approach is that if an msr write can be emulated correctly (for example,
it disables a counter) then we let it proceed; if it cannot be emulated
correctly (for example it enables a counter that we cannot emulate),
then we ignore it, but print out a message that tells the user that
we're faking something that may cause the guest to malfunction.

-- 
error compiling committee.c: too many arguments to function
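The policy described above (accept writes that can be emulated, ignore the rest with a warning) can be sketched as follows. This is a toy model under stated assumptions, not the `kvm_pmu_set_msr()` implementation: `struct fake_msr`, `fake_msr_write` and the `CTR_ENABLE_BIT` position are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical "counter enable" bit; the real layout depends on the MSR. */
#define CTR_ENABLE_BIT (1ULL << 22)

enum msr_result { MSR_EMULATED, MSR_IGNORED };

/* Hypothetical model of one faked control MSR. */
struct fake_msr {
    uint64_t value;
    bool can_emulate_counting;  /* false: we cannot really count events */
};

static enum msr_result fake_msr_write(struct fake_msr *msr, uint64_t data)
{
    /* Disabling a counter is always safe to emulate: nothing needs to count. */
    if (!(data & CTR_ENABLE_BIT) || msr->can_emulate_counting) {
        msr->value = data;
        return MSR_EMULATED;
    }

    /* Enabling a counter we cannot emulate: drop the write, warn the user. */
    fprintf(stderr,
            "ignoring unsupported MSR write 0x%llx; guest may malfunction\n",
            (unsigned long long)data);
    return MSR_IGNORED;
}
```

With this shape the guest never takes a fault for touching the MSR, and the host log shows when something was faked.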
[PATCH 3.6] KVM: x86 emulator: use stack size attribute to mask rsp in stack ops
The sub-register used to access the stack (sp, esp, or rsp) is not
determined by the address size attribute like other memory references,
but by the stack segment's B bit (if not in x86_64 mode).

Fix by using the existing stack_mask() to figure out the correct mask.

This long-existing bug was exposed by a combination of a27685c33ae
(emulate invalid guest state by default), which causes many more
instructions to be emulated, and a seabios change (possibly a bug) which
causes the high 16 bits of esp to become polluted across calls to real
mode software interrupts.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 97d9a99..a3b57a2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -475,13 +475,26 @@ static int stack_size(struct x86_emulate_ctxt *ctxt)
 	return address_mask(ctxt, reg);
 }
 
+static void masked_increment(ulong *reg, ulong mask, int inc)
+{
+	assign_masked(reg, *reg + inc, mask);
+}
+
 static inline void
 register_address_increment(struct x86_emulate_ctxt *ctxt, unsigned long *reg, int inc)
 {
+	ulong mask;
+
 	if (ctxt->ad_bytes == sizeof(unsigned long))
-		*reg += inc;
+		mask = ~0UL;
 	else
-		*reg = (*reg & ~ad_mask(ctxt)) | ((*reg + inc) & ad_mask(ctxt));
+		mask = ad_mask(ctxt);
+	masked_increment(reg, mask, inc);
+}
+
+static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc)
+{
+	masked_increment(&ctxt->regs[VCPU_REGS_RSP], stack_mask(ctxt), inc);
 }
 
 static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel)
@@ -1522,8 +1535,8 @@ static int push(struct x86_emulate_ctxt *ctxt, void *data, int bytes)
 {
 	struct segmented_address addr;
 
-	register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP], -bytes);
-	addr.ea = register_address(ctxt, ctxt->regs[VCPU_REGS_RSP]);
+	rsp_increment(ctxt, -bytes);
+	addr.ea = ctxt->regs[VCPU_REGS_RSP] & stack_mask(ctxt);
 	addr.seg = VCPU_SREG_SS;
 
 	return segmented_write(ctxt, addr, data, bytes);
@@ -1542,13 +1555,13 @@ static int emulate_pop(struct x86_emulate_ctxt *ctxt,
 	int rc;
 	struct segmented_address addr;
 
-	addr.ea = register_address(ctxt, ctxt->regs[VCPU_REGS_RSP]);
+	addr.ea = ctxt->regs[VCPU_REGS_RSP] & stack_mask(ctxt);
 	addr.seg = VCPU_SREG_SS;
 	rc = segmented_read(ctxt, addr, dest, len);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
 
-	register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP], len);
+	rsp_increment(ctxt, len);
 	return rc;
 }
 
@@ -1688,8 +1701,7 @@ static int em_popa(struct x86_emulate_ctxt *ctxt)
 
 	while (reg >= VCPU_REGS_RAX) {
 		if (reg == VCPU_REGS_RSP) {
-			register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP],
-							ctxt->op_bytes);
+			rsp_increment(ctxt, ctxt->op_bytes);
 			--reg;
 		}
 
@@ -2825,7 +2837,7 @@ static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt)
 	rc = emulate_pop(ctxt, &ctxt->dst.val, ctxt->op_bytes);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
-	register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP], ctxt->src.val);
+	rsp_increment(ctxt, ctxt->src.val);
 	return X86EMUL_CONTINUE;
 }
 
-- 
1.7.11.3
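To make the effect of the stack mask concrete, here is a small standalone sketch of the masked increment. The names `assign_masked` and `masked_increment` mirror the patch, but these are simplified re-implementations for illustration, and `stack_mask_for_size` is a hypothetical stand-in for `stack_mask(ctxt)` that takes the stack size in bytes directly.

```c
#include <stdint.h>

/* Replace only the bits of *dest selected by mask with those of src. */
static void assign_masked(uint64_t *dest, uint64_t src, uint64_t mask)
{
    *dest = (*dest & ~mask) | (src & mask);
}

/* Add inc to the sub-register selected by mask; upper bits are untouched. */
static void masked_increment(uint64_t *reg, uint64_t mask, int inc)
{
    assign_masked(reg, *reg + (uint64_t)(int64_t)inc, mask);
}

/*
 * Hypothetical stand-in for stack_mask(): 2 bytes selects sp,
 * 4 bytes selects esp, 8 bytes selects the full rsp.
 */
static uint64_t stack_mask_for_size(int stack_bytes)
{
    return stack_bytes == 8 ? ~0ULL : (1ULL << (stack_bytes * 8)) - 1;
}
```

With a 16-bit stack, pushing 4 bytes at sp = 0x0002 wraps to 0xfffe while any stale bits above bit 15 stay as they were, which is the case the commit message describes for polluted high bits of esp.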
Re: [kvmarm] [PATCH v10 07/14] KVM: ARM: Memory virtualization setup
On 08/19/2012 12:38 PM, Peter Maydell wrote:
> On 19 August 2012 05:34, Christoffer Dall c.d...@virtualopensystems.com wrote:
>> On Thu, Aug 16, 2012 at 2:25 PM, Alexander Graf ag...@suse.de wrote:
>>> A single hva can have multiple gpas mapped, no? At least that's what I
>>> gathered from the discussion about my attempt to a function similar to
>>> this :).
>>
>> I don't think this is the case for ARM, can you provide an example? We
>> use gfn_to_pfn_prot and only allow user memory regions. What you suggest
>> would be multiple physical addresses pointing to the same memory bank,
>> I don't think that makes any sense on ARM hardware, for x86 and PPC I
>> don't know.
>
> I don't know what an hva is,

host virtual address (see Documentation/virtual/kvm/mmu.txt for more
TLAs in this area).

> but yes, ARM boards can have the same
> block of RAM aliased into multiple places in the physical address space.
> (we don't currently bother to implement the aliases in qemu's
> vexpress-a15 though because it's a bunch of mappings of the low 2GB
> into high addresses mostly intended to let you test LPAE code without
> having to put lots of RAM on the hardware).

Even if it weren't common, the API allows it, so we must behave sensibly.

-- 
error compiling committee.c: too many arguments to function
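The aliasing case can be shown with a toy reverse lookup. This is a sketch under stated assumptions: `struct slot` and `hva_to_gpas` are hypothetical simplifications, not KVM's memslot structures, and the addresses in the usage note are made up.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified memory slot: a guest-physical range backed
 * by a host-virtual range of the same size.
 */
struct slot {
    uint64_t gpa_base;
    uint64_t hva_base;
    uint64_t size;
};

/*
 * Collect every guest-physical address that maps to hva.
 * Returns the number of matches stored in out (at most max).
 */
static size_t hva_to_gpas(const struct slot *slots, size_t nslots,
                          uint64_t hva, uint64_t *out, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < nslots && found < max; i++) {
        if (hva >= slots[i].hva_base &&
            hva - slots[i].hva_base < slots[i].size) {
            out[found++] = slots[i].gpa_base + (hva - slots[i].hva_base);
        }
    }
    return found;
}
```

If two slots share the same `hva_base`, as with RAM aliased at a low and a high physical address, the lookup returns two gpas for one hva, so a helper that returns only the first match would miss the alias.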
Re: Windows slow boot: contractor wanted
On 08/17/2012 03:36 PM, Richard Davies wrote: Hi Avi, Thanks to you and several others for offering help. We will work with Avi at first, but are grateful for all the other offers of help. We have a number of other qemu-related projects which we'd be interested in getting done, and will get in touch with these names (and anyone else who comes forward) to see if any are of interest to you. This slow boot problem is intermittent and varys in how slow the boots are, but I managed to trigger it this morning with medium slow booting (5-10 minutes) and link to the requested traces below. The host in question has 128GB RAM and dual AMD Opteron 6128 (16 cores total). It is running kernel 3.5.1 and qemu-kvm 1.1.1. In this morning's test, we have 3 guests, all booting Windows with 40GB RAM and 8 cores each (we have seen small VMs go slow as I originally said, but it is easier to trigger with big VMs): pid 15665: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \ -vga cirrus -usbdevice tablet -vnc :99 -monitor stdio -hda test1.raw pid 15676: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \ -vga cirrus -usbdevice tablet -vnc :98 -monitor stdio -hda test2.raw pid 15653: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \ -vga cirrus -usbdevice tablet -vnc :97 -monitor stdio -hda test3.raw We are running with hv_relaxed since this was suggested in the previous thread, but we see intermittent slow boots with and without this flag. All 3 VMs are booting slowly for most of the attached capture, which I started after confirming the slow boots and stopped as soon as the first of them (15665) had booted. In terms of visible symptoms, the VMs are showing the Windows boot progress bar, which is moving very slowly. In top, the VMs are at 400% CPU and their resident state size (RES) memory is slowly counting up until it reaches the full VM size, at which point they finish booting. 
Here are the trace files: http://users.org.uk/slow-win-boot-1/ps.txt (ps auxwwwf as root) http://users.org.uk/slow-win-boot-1/top.txt (top with 2 VMs still slow) http://users.org.uk/slow-win-boot-1/trace-console.txt (running trace-cmd) http://users.org.uk/slow-win-boot-1/trace.dat (the 1.7G trace data file) http://users.org.uk/slow-win-boot-1/trace-report.txt (the 4G trace report) Please let me know if there is anything else which I can provide? There are tons of PAUSE exits indicating cpu overcommit (and indeed you are overcommitted by about 50%). What host kernel version are you running? Does this reproduce without overcommit? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
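The overcommit figure follows from the numbers in the report: 3 guests with 8 vcpus each on 16 host cores. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
/*
 * vCPU overcommit as a percentage above the number of host cores.
 * Returns 0 when all vcpus fit on the host.
 */
static int cpu_overcommit_percent(int guests, int vcpus_per_guest,
                                  int host_cores)
{
    int vcpus = guests * vcpus_per_guest;

    if (vcpus <= host_cores)
        return 0;
    return (vcpus - host_cores) * 100 / host_cores;
}

/* Total guest RAM in GB, to compare against host RAM. */
static int guest_ram_total_gb(int guests, int gb_per_guest)
{
    return guests * gb_per_guest;
}
```

For this host, 24 vcpus on 16 cores gives 50% CPU overcommit, and 3 x 40GB gives 120GB of guest RAM against 128GB of host RAM.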
Big real mode use in ipxe
ipxe contains the following snippet:

	/* Copy ROM to image source PMM block */
	pushw	%es
	xorw	%ax, %ax
	movw	%ax, %es
	movl	%esi, %edi
	xorl	%esi, %esi
	movzbl	romheader_size, %ecx
	shll	$9, %ecx
	addr32 rep movsb	/* PMM presence implies flat real mode */

Which copies an image to %edi, with %edi = 0x1. This is in accordance
with the PMM spec:

  3.2.4 Accessing Extended Memory

  This section specifies how clients should access extended memory
  blocks allocated by the PMM. When control is passed to an option ROM
  from a BIOS that supports PMM, the processor will be in big real mode,
  and Gate A20 will be disabled (segment wrap turned off). This allows
  access to extended memory blocks using real mode addressing. In big
  real mode, access to memory above 1MB can be accomplished by using a
  32-bit extended index register (EDI, etc.) and setting the segment
  register to 0000h. The following code example assumes that the
  pmmAllocate function was just called to allocate a block of extended
  memory, and DX:AX returned the 32-bit buffer address.

  ; Assume here that DX:AX contains the 32-bit address of our allocated buffer.
  ; Clear the DS segment register.
  push 0000h
  pop ds
  ; Put the DX:AX 32-bit buffer address into EDI.
  mov di, dx   ; Get the upper word.
  shl edi, 16  ; Shift it to upper EDI.
  mov di, ax   ; Get the lower word.
  ; Example: clear the first four bytes of the extended memory buffer.
  mov [edi], 00000000h ; DS:EDI is used as the memory pointer.

  In a similar way, the other segment registers and 32-bit index
  registers can be used for extended memory accessing.

So far so good. But the Intel SDM says (20.1.1):

  The IA-32 processors beginning with the Intel386 processor can
  generate 32-bit offsets using an address override prefix; however, in
  real-address mode, the value of a 32-bit offset may not exceed FFFFH
  without causing an exception. For full compatibility with Intel 286
  real-address mode, pseudo-protection faults (interrupt 12 or 13) occur
  if a 32-bit offset is generated outside the range 0 through FFFFH.

Which is exactly what happens here. My understanding of big real mode
is that to achieve a segment limit != 0xffff, you must go into 32-bit
protected mode, load a segment with a larger limit, and return into real
mode without touching the segment. The next load of the segment will
reset the limit to 0xffff.

Due to bugs in both qemu tcg and kvm, limit checks are not enforced in
real mode, but once these bugs are fixed, the code above will break.

The PMM spec also has this to say (1.3):

  Big Real Mode

  Big Real Mode is a modified version of the processor’s real mode with
  the segment limits changed from 1MB to 4GB. Big real mode allows the
  BIOS or an Option ROM to read and write extended memory without the
  overhead of protected mode. The BIOS puts the processor in big real
  mode during POST to allow simplified access to extended memory. The
  processor will be in big real mode while the PMM Services are
  callable.

This is more in line with the Intel spec, and means that the
modification to %es must be avoided (and that seabios needs changes to
either work in big real mode, or to put the processor back into big real
mode after returning from a PMM service).

The whole thing is very unfortunate, as kvm is very slow while in big
real mode, on certain processors.

-- 
error compiling committee.c: too many arguments to function
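The limit check at issue can be modelled in a few lines. This is a simplified sketch assuming an expand-up data segment; `within_segment_limit` and the two limit constants are hypothetical names, and a real emulator also has to consider segment type, alignment and the faulting vector.

```c
#include <stdbool.h>
#include <stdint.h>

#define REAL_MODE_LIMIT 0xffffu      /* limit after a real-mode segment load */
#define BIG_REAL_LIMIT  0xffffffffu  /* limit left over from protected mode  */

/*
 * True if an access of `size` bytes at `offset` stays within a segment
 * whose limit is `limit` (expand-up segment assumed).
 */
static bool within_segment_limit(uint32_t offset, uint32_t size,
                                 uint32_t limit)
{
    uint64_t last;

    if (size == 0)
        return true;
    last = (uint64_t)offset + size - 1;  /* widen to avoid 32-bit wraparound */
    return last <= limit;
}
```

Under this model a copy to an offset above 64KB through a segment reloaded in real mode fails the check, while the same access through a segment that still carries a 4GB limit passes.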
Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file
> The reason we use a local label is so that the function isn't split into two from the profiler's point of view. See cd2276a795b013d1.

Hmm that commit message is not very enlightening.

The goal was to force a compiler error? With LTO there is no way to force two functions be in the same assembler file. The partitioner is always allowed to split.

> One way to fix this is to have a .data variable initialized to point to .Lkvm_vmx_return (this can be done from the same asm statement in vmx_vcpu_run), and reference that variable in vmx_set_constant_host_state(). If no one comes up with a better idea, I'll write a patch doing this.

I'm not clear how that is better than my patch.

-andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file
On 08/19/2012 06:09 PM, Andi Kleen wrote:
>> The reason we use a local label is so that the function isn't split into two from the profiler's point of view. See cd2276a795b013d1.
>
> Hmm that commit message is not very enlightening.
>
> The goal was to force a compiler error?

No, the goal was to avoid a global label in the middle of a function. The profiler interprets it as a new function. After your patch, profiles will show a function named kvm_vmx_return taking a few percent cpu, although there is no such function.

> With LTO there is no way to force two functions be in the same assembler file. The partitioner is always allowed to split.

I'm not trying to force two functions to be in the same assembler file.

>> One way to fix this is to have a .data variable initialized to point to .Lkvm_vmx_return (this can be done from the same asm statement in vmx_vcpu_run), and reference that variable in vmx_set_constant_host_state(). If no one comes up with a better idea, I'll write a patch doing this.
>
> I'm not clear how that is better than my patch.

My patch will not generate the artifact with kvm_vmx_return.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file
On Sun, Aug 19, 2012 at 06:12:57PM +0300, Avi Kivity wrote:
> On 08/19/2012 06:09 PM, Andi Kleen wrote:
>>> The reason we use a local label is so that the function isn't split into two from the profiler's point of view. See cd2276a795b013d1.
>>
>> Hmm that commit message is not very enlightening.
>>
>> The goal was to force a compiler error?
>
> No, the goal was to avoid a global label in the middle of a function. The profiler interprets it as a new function. After your patch,

Ah got it now. I always used to have the same problem with sys_call_return.

I wonder if there shouldn't be a way to tell perf to ignore a symbol.

>>> One way to fix this is to have a .data variable initialized to point to .Lkvm_vmx_return (this can be done from the same asm statement in vmx_vcpu_run), and reference that variable in vmx_set_constant_host_state(). If no one comes up with a better idea, I'll write a patch doing this.
>>
>> I'm not clear how that is better than my patch.
>
> My patch will not generate the artifact with kvm_vmx_return.

Ok fine for me. I'll keep this patch for now, until you have something better.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [ipxe-devel] Big real mode use in ipxe
On 08/19/2012 06:34 PM, Michael Brown wrote:
> On Sunday 19 Aug 2012 16:07:05 Avi Kivity wrote:
>> Which is exactly what happens here. My understanding of big real mode is that to achieve a segment limit != 0x, you must go into 32-bit protected mode, load a segment with a larger limit, and return into real mode without touching the segment. The next load of the segment will reset the limit to 0x.
>
> Not quite. You can't return into real mode without touching the segment, since part of the process of returning to real mode is to reload the segment registers with real-mode values, and this happens _after_ setting CR0.PE=0.
>
> Whenever CR0.PE=0, loading a segment register with value N will load the literal value (N<<4) into the base address for that segment, without changing the limit. This is the trick that allows flat real mode (aka big real mode) to work; the limit remains at 4G even after loading the segment register with a real-mode value.

So I see, from looking at the Xen source. I'll also double-check with bochs.

Looks like I'll need to fix kvm not to reset the segment limit when reloading a segment in real mode.

>> (and that seabios needs changes to either work in big real mode, or to put the processor back into big real mode after returning from a PMM service.
>
> If seabios switches into protected mode when performing a PMM service, then it _must_ leave the segment limits at 4G when returning to real mode. To do otherwise will violate the PMM spec, and will break conforming clients such as iPXE.

This probably works, since iPXE works on kvm on AMD and on Intel processors with unrestricted guest support.

--
error compiling committee.c: too many arguments to function
Re: [ipxe-devel] Big real mode use in ipxe
On Sunday 19 Aug 2012 16:07:05 Avi Kivity wrote:
> Which is exactly what happens here. My understanding of big real mode is that to achieve a segment limit != 0x, you must go into 32-bit protected mode, load a segment with a larger limit, and return into real mode without touching the segment. The next load of the segment will reset the limit to 0x.

Not quite. You can't return into real mode without touching the segment, since part of the process of returning to real mode is to reload the segment registers with real-mode values, and this happens _after_ setting CR0.PE=0.

Whenever CR0.PE=0, loading a segment register with value N will load the literal value (N<<4) into the base address for that segment, without changing the limit. This is the trick that allows flat real mode (aka big real mode) to work; the limit remains at 4G even after loading the segment register with a real-mode value.

> (and that seabios needs changes to either work in big real mode, or to put the processor back into big real mode after returning from a PMM service.

If seabios switches into protected mode when performing a PMM service, then it _must_ leave the segment limits at 4G when returning to real mode. To do otherwise will violate the PMM spec, and will break conforming clients such as iPXE.

Michael
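
The segment-load behaviour described in this message can be sketched as a small model. This is only an illustration, not code from kvm, qemu, SeaBIOS or iPXE: the struct and function names are hypothetical, the descriptor is reduced to a base and a limit, and 0xFFFF is used as the ordinary real-mode limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the hidden (cached) part of one segment register. */
struct seg_cache {
    uint16_t selector;
    uint32_t base;
    uint32_t limit;
};

/* Simplified descriptor: only the two fields this sketch needs. */
struct descriptor {
    uint32_t base;
    uint32_t limit;
};

/* Load with CR0.PE=0: the base becomes selector << 4, the limit is untouched. */
static void load_seg_real(struct seg_cache *seg, uint16_t selector)
{
    seg->selector = selector;
    seg->base = (uint32_t)selector << 4;
}

/* Load with CR0.PE=1: base and limit are both taken from the descriptor. */
static void load_seg_protected(struct seg_cache *seg, uint16_t selector,
                               const struct descriptor *desc)
{
    seg->selector = selector;
    seg->base = desc->base;
    seg->limit = desc->limit;
}

/* Limit check for a len-byte access starting at offset. */
static bool access_ok(const struct seg_cache *seg, uint32_t offset, uint32_t len)
{
    assert(len > 0);
    return offset <= seg->limit && len - 1 <= seg->limit - offset;
}
```

In this model, a segment that starts with limit 0xFFFF rejects an access at an offset above 1MB. After one protected-mode load of a 4G descriptor, any number of later real-mode loads change only the base, so the same access passes the limit check.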
Re: [Qemu-devel] Big real mode use in ipxe
On Sun, Aug 19, 2012 at 06:07:05PM +0300, Avi Kivity wrote:
> ipxe contains the following snippet:
>
>     /* Copy ROM to image source PMM block */
>     pushw   %es
>     xorw    %ax, %ax
>     movw    %ax, %es
>     movl    %esi, %edi
>     xorl    %esi, %esi
>     movzbl  romheader_size, %ecx
>     shll    $9, %ecx
>     addr32 rep movsb    /* PMM presence implies flat real mode */
>
> Which copies an image to %edi, with %edi = 0x1. This is in accordance with the PMM spec:

[...]

> So far so good. But the Intel SDM says (20.1.1):
>
>     The IA-32 processors beginning with the Intel386 processor can generate 32-bit offsets using an address override prefix; however, in real-address mode, the value of a 32-bit offset may not exceed H without causing an exception. For full compatibility with Intel 286 real-address mode, pseudo-protection faults (interrupt 12 or 13) occur if a 32-bit offset is generated outside the range 0 through H.

I interpreted the above to mean "however, in [normal real-mode where the segment registers are set to 0x] real-address mode, the value of a 32-bit offset may not exceed H without causing an exception".

> Which is exactly what happens here. My understanding of big real mode is that to achieve a segment limit != 0x, you must go into 32-bit protected mode, load a segment with a larger limit, and return into real mode without touching the segment. The next load of the segment will reset the limit to 0x.

No, the segment limit is only changed when the protected mode bit is set and the segment register is loaded. When the protected mode bit is not set, only the segment offset changes.

[...]

> The PMM spec also has this to say (1.3):
>
>     Big Real Mode
>
>     Big Real Mode is a modified version of the processor’s real mode with the segment limits changed from 1MB to 4GB. Big real mode allows the BIOS or an Option ROM to read and write extended memory without the overhead of protected mode. The BIOS puts the processor in big real mode during POST to allow simplified access to extended memory. The processor will be in big real mode while the PMM Services are callable.
>
> This is more in line with the Intel spec, and means that the modification to %es must be avoided (and that seabios needs changes to either work in big real mode, or to put the processor back into big real mode after returning from a PMM service).

The SeaBIOS code is regularly used on a variety of real processors (which do enforce segment limits). This includes several different AMD processors and Intel processors. It has also been tested in the past with other manufacturers (eg, Via). We've never seen an issue with the big real mode support.

> The whole thing is very unfortunate, as kvm is very slow while in big real mode, on certain processors.

Unfortunately, big real mode is a requirement for option roms.

-Kevin
Re: [Qemu-devel] Big real mode use in ipxe
On 08/19/2012 06:44 PM, Kevin O'Connor wrote:
> On Sun, Aug 19, 2012 at 06:07:05PM +0300, Avi Kivity wrote:
>> ipxe contains the following snippet:
>>
>>     /* Copy ROM to image source PMM block */
>>     pushw   %es
>>     xorw    %ax, %ax
>>     movw    %ax, %es
>>     movl    %esi, %edi
>>     xorl    %esi, %esi
>>     movzbl  romheader_size, %ecx
>>     shll    $9, %ecx
>>     addr32 rep movsb    /* PMM presence implies flat real mode */
>>
>> Which copies an image to %edi, with %edi = 0x1. This is in accordance with the PMM spec:
>
> [...]
>
>> So far so good. But the Intel SDM says (20.1.1):
>>
>>     The IA-32 processors beginning with the Intel386 processor can generate 32-bit offsets using an address override prefix; however, in real-address mode, the value of a 32-bit offset may not exceed H without causing an exception. For full compatibility with Intel 286 real-address mode, pseudo-protection faults (interrupt 12 or 13) occur if a 32-bit offset is generated outside the range 0 through H.
>
> I interpreted the above to mean "however, in [normal real-mode where the segment registers are set to 0x] real-address mode, the value of a 32-bit offset may not exceed H without causing an exception".

I understood it the same way.

>> Which is exactly what happens here. My understanding of big real mode is that to achieve a segment limit != 0x, you must go into 32-bit protected mode, load a segment with a larger limit, and return into real mode without touching the segment. The next load of the segment will reset the limit to 0x.
>
> No, the segment limit is only changed when the protected mode bit is set and the segment register is loaded. When the protected mode bit is not set, only the segment offset changes.

That's what I missed. I always understood a segment reload in real mode to reset the limit field, though I had no basis for it. I'll fix kvm not to do this.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [ipxe-devel] Big real mode use in ipxe
On Sun, Aug 19, 2012 at 04:34:50PM +0100, Michael Brown wrote:
> On Sunday 19 Aug 2012 16:07:05 Avi Kivity wrote:
>> (and that seabios needs changes to either work in big real mode, or to put the processor back into big real mode after returning from a PMM service.
>
> If seabios switches into protected mode when performing a PMM service, then it _must_ leave the segment limits at 4G when returning to real mode. To do otherwise will violate the PMM spec, and will break conforming clients such as iPXE.

SeaBIOS does switch to 32bit mode during PMM calls and does switch to 16bit big real mode (segment limits set to 4G) on return.

-Kevin
Re: [kvmarm] [PATCH v10 07/14] KVM: ARM: Memory virtualization setup
On Sun, Aug 19, 2012 at 9:00 AM, Avi Kivity a...@redhat.com wrote:
> On 08/19/2012 12:38 PM, Peter Maydell wrote:
>> On 19 August 2012 05:34, Christoffer Dall c.d...@virtualopensystems.com wrote:
>>> On Thu, Aug 16, 2012 at 2:25 PM, Alexander Graf ag...@suse.de wrote:
>>>> A single hva can have multiple gpas mapped, no? At least that's what I gathered from the discussion about my attempt to a function similar to this :).
>>>
>>> I don't think this is the case for ARM, can you provide an example? We use gfn_to_pfn_prot and only allow user memory regions. What you suggest would be multiple physical addresses pointing to the same memory bank, I don't think that makes any sense on ARM hardware, for x86 and PPC I don't know.
>>
>> I don't know what an hva is,
>
> host virtual address (see Documentation/virtual/kvm/mmu.txt for more TLAs in this area).
>
>> but yes, ARM boards can have the same block of RAM aliased into multiple places in the physical address space. (we don't currently bother to implement the aliases in qemu's vexpress-a15 though because it's a bunch of mappings of the low 2GB into high addresses mostly intended to let you test LPAE code without having to put lots of RAM on the hardware).

I stand corrected.

> Even if it weren't common, the API allows it, so we must behave sensibly.

true, this should be a solution:

commit 2a8661fd7e6c15889a20a4547bd7861e84b778a8
Author: Christoffer Dall c.d...@virtualopensystems.com
Date:   Sun Aug 19 15:52:10 2012 -0400

    KVM: ARM: A single hva can map multiple gpas

    Handle mmu notifier ops for every such mapping.

    Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 3df4fa8..9b23230 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -754,11 +754,14 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	return ret ? ret : 1;
 }
 
-static bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa)
+static int handle_hva_to_gpa(struct kvm *kvm, unsigned long hva,
+			     void (*handler)(struct kvm *kvm, unsigned long hva,
+					     gpa_t gpa, void *data),
+			     void *data)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	bool found = false;
+	int cnt = 0;
 
 	slots = kvm_memslots(kvm);
 
@@ -769,31 +772,36 @@ static bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa)
 
 		end = start + (memslot->npages << PAGE_SHIFT);
 		if (hva >= start && hva < end) {
+			gpa_t gpa;
 			gpa_t gpa_offset = hva - start;
-			*gpa = (memslot->base_gfn << PAGE_SHIFT) + gpa_offset;
-			found = true;
-			/* no overlapping memslots allowed: break */
-			break;
+			gpa = (memslot->base_gfn << PAGE_SHIFT) + gpa_offset;
+			handler(kvm, hva, gpa, data);
+			cnt++;
 		}
 	}
 
-	return found;
+	return cnt;
+}
+
+static void kvm_unmap_hva_handler(struct kvm *kvm, unsigned long hva,
+				  gpa_t gpa, void *data)
+{
+	spin_lock(&kvm->arch.pgd_lock);
+	stage2_clear_pte(kvm, gpa);
+	spin_unlock(&kvm->arch.pgd_lock);
 }
 
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-	bool found;
-	gpa_t gpa;
+	int found;
 
 	if (!kvm->arch.pgd)
 		return 0;
 
-	found = hva_to_gpa(kvm, hva, &gpa);
-	if (found) {
-		spin_lock(&kvm->arch.pgd_lock);
-		stage2_clear_pte(kvm, gpa);
-		spin_unlock(&kvm->arch.pgd_lock);
-	}
+	found = handle_hva_to_gpa(kvm, hva, kvm_unmap_hva_handler, NULL);
+	if (found > 0)
+		__kvm_tlb_flush_vmid(kvm);
+
 	return 0;
 }
 
@@ -814,21 +822,27 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 	return 0;
 }
 
+static void kvm_set_spte_handler(struct kvm *kvm, unsigned long hva,
+				 gpa_t gpa, void *data)
+{
+	pte_t *pte = (pte_t *)data;
+
+	spin_lock(&kvm->arch.pgd_lock);
+	stage2_set_pte(kvm, NULL, gpa, pte);
+	spin_unlock(&kvm->arch.pgd_lock);
+}
+
+
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
-	gpa_t gpa;
-	bool found;
+	int found;
 
 	if (!kvm->arch.pgd)
 		return;
 
-	found = hva_to_gpa(kvm, hva, &gpa);
-	if (found) {
-		spin_lock(&kvm->arch.pgd_lock);
-		stage2_set_pte(kvm, NULL, gpa, &pte);
-		spin_unlock(&kvm->arch.pgd_lock);
+	found = handle_hva_to_gpa(kvm, hva, kvm_set_spte_handler, &pte);
+	if (found > 0)
 		__kvm_tlb_flush_vmid(kvm);
-	}
 }
 
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu)
--
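
The idea of the patch above, calling a handler once for every memslot that contains the hva instead of stopping at the first match, can be tried outside the kernel with a minimal sketch. Everything here is a simplified stand-in: `struct memslot`, `hva_handler_t`, `struct gpa_record` and `record_gpa` are hypothetical names, the slots live in a plain array rather than `kvm_memslots()`, and `PAGE_SHIFT` is assumed to be 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumed 4K pages for this sketch */

typedef uint64_t gpa_t;

/* Simplified stand-in for struct kvm_memory_slot. */
struct memslot {
    uint64_t base_gfn;       /* first guest frame number of the slot */
    uint64_t npages;         /* slot length in pages */
    uint64_t userspace_addr; /* hva where the slot starts */
};

typedef void (*hva_handler_t)(uint64_t hva, gpa_t gpa, void *data);

/* Call handler for every slot that maps hva; return how many matched. */
static int handle_hva_to_gpa(const struct memslot *slots, size_t nslots,
                             uint64_t hva, hva_handler_t handler, void *data)
{
    int cnt = 0;

    assert(handler != NULL);
    for (size_t i = 0; i < nslots; i++) {
        uint64_t start = slots[i].userspace_addr;
        uint64_t end = start + (slots[i].npages << PAGE_SHIFT);

        if (hva >= start && hva < end) {
            gpa_t gpa = (slots[i].base_gfn << PAGE_SHIFT) + (hva - start);

            handler(hva, gpa, data);
            cnt++;
        }
    }
    return cnt;
}

/* Example handler: remember each gpa the hva translated to. */
struct gpa_record {
    gpa_t seen[8];
    size_t n;
};

static void record_gpa(uint64_t hva, gpa_t gpa, void *data)
{
    struct gpa_record *rec = data;

    (void)hva;
    if (rec->n < 8)
        rec->seen[rec->n++] = gpa;
}
```

With two slots that share the same `userspace_addr` but have different `base_gfn` values (an aliased RAM block), one hva inside that range reaches the handler twice, once per gpa. The returned count is what the patch uses to decide whether a TLB flush is needed.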
Re: perf uncore lkvm woes
On 08/19/2012 05:55 PM, Avi Kivity wrote:
> On 08/17/2012 09:56 AM, Peter Zijlstra wrote:
>> On Fri, 2012-08-17 at 09:40 +0800, Yan, Zheng wrote:
>>> Peter, do I need to submit a patch disables uncore on virtualized CPU?
>>
>> I think Avi prefers the method where KVM 'fakes' the MSRs and we have to detect if the MSRs actually work or not.
>
> s/we have/we don't have/.
>
>> If you're willing to have a go at that, please do so. If you're not sure how to do the KVM part, I'm sure Avi and/or Gleb can help you out.
>
> Certainly, please see kvm_pmu_get_msr() and kvm_pmu_set_msr().
>
> The approach is that if an msr write can be emulated correctly (for example, it disables a counter) then we let it proceed; if it cannot be emulated correctly (for example it enables a counter that we cannot emulate), then we ignore it, but print out a message that tells the user that we're faking something that may cause the guest to malfunction.

There is only one kvm_pmu structure in struct kvm_vcpu_arch, but the uncore driver may define dozens of PMUs. Besides, the uncore PMUs make extensive use of extra registers, and I don't think we can store this information in the kvm_pmu structure. The uncore PMU collects system-wide events on a given socket, so it may not be possible to simulate it on a virtualized CPU. I think it's better to just disable uncore on virtualized CPU.

Regards
Yan, Zheng
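
The write policy quoted above (let a write through when it can be emulated, otherwise ignore it and warn) can be sketched as follows. This is a toy model and not the kvm_pmu_set_msr() implementation: `struct fake_pmu_msr`, `fake_pmu_set_msr()` and the `FAKE_CTL_ENABLE` bit are hypothetical and do not correspond to any real MSR layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical enable bit of a toy counter-control register. */
#define FAKE_CTL_ENABLE 0x1u

struct fake_pmu_msr {
    uint64_t value;   /* what the guest reads back */
    bool can_emulate; /* whether the host can actually back this counter */
};

enum msr_write_result { MSR_WRITE_APPLIED, MSR_WRITE_IGNORED };

/*
 * Apply a guest write if it can be emulated correctly. A write that enables
 * a counter the host cannot emulate is dropped, with a warning, instead of
 * being reported to the guest as a fault.
 */
static enum msr_write_result fake_pmu_set_msr(struct fake_pmu_msr *msr,
                                              uint64_t data)
{
    assert(msr != NULL);

    if ((data & FAKE_CTL_ENABLE) && !msr->can_emulate) {
        fprintf(stderr,
                "pmu: ignoring enable of unemulated counter (data=%#llx); "
                "guest may malfunction\n",
                (unsigned long long)data);
        return MSR_WRITE_IGNORED;
    }

    msr->value = data;
    return MSR_WRITE_APPLIED;
}
```

In this model a write that disables a counter always proceeds, because disabling is trivially emulated. Only an enable on a counter with no backing is dropped, and the stored value is left unchanged in that case.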
Re: perf uncore lkvm woes
On 08/19/2012 05:55 PM, Avi Kivity wrote:
> On 08/17/2012 09:56 AM, Peter Zijlstra wrote:
>> On Fri, 2012-08-17 at 09:40 +0800, Yan, Zheng wrote:
>>> Peter, do I need to submit a patch disables uncore on virtualized CPU?
>>
>> I think Avi prefers the method where KVM 'fakes' the MSRs and we have to detect if the MSRs actually work or not.
>
> s/we have/we don't have/.
>
>> If you're willing to have a go at that, please do so. If you're not sure how to do the KVM part, I'm sure Avi and/or Gleb can help you out.
>
> Certainly, please see kvm_pmu_get_msr() and kvm_pmu_set_msr().
>
> The approach is that if an msr write can be emulated correctly (for example, it disables a counter) then we let it proceed; if it cannot be emulated correctly (for example it enables a counter that we cannot emulate), then we ignore it, but print out a message that tells the user that we're faking something that may cause the guest to malfunction.

Does anyone know how to detect whether the kernel is running on a virtualized CPU?

Regards
Yan, Zheng
Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly
On 08/17/2012 09:39 PM, Marcelo Tosatti wrote:
>>> Yes. Well, Avi mentioned earlier that there are users for change of GPA base. But, if my understanding is correct, the code that emulates change of BAR in QEMU is:
>>>
>>>     /* now do the real mapping */
>>>     if (r->addr != PCI_BAR_UNMAPPED) {
>>>         memory_region_del_subregion(r->address_space, r->memory);
>>>     }
>>>     r->addr = new_addr;
>>>     if (r->addr != PCI_BAR_UNMAPPED) {
>>>         memory_region_add_subregion_overlap(r->address_space,
>>>                                             r->addr, r->memory, 1);
>>>
>>> These translate to two kvm_set_user_memory ioctls.
>>
>> Not directly. These functions change a qemu-internal memory map, which is then transferred to kvm. Those two calls might be in a transaction (they aren't now), in which case the memory map update is atomic. So indeed we issue two ioctls now, but that's a side effect of the implementation, not related to those two calls being separate.
>>
>>> Without taking into consideration backwards compatibility, userspace can first delete the slot and later create a new one.
>>
>> Current qemu will in fact do that. Not sure about older ones.
>
> Avi, where it does that?

By that I meant first deleting the first slot and then creating a new one.

--
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
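
The "delete the slot, then create a new one" sequence discussed above can be illustrated with a toy slot table. This is a sketch under stated assumptions, not the KVM ioctl interface or QEMU's memory API: `struct toy_memmap` and the `toy_*` helpers are hypothetical, and each helper call stands in for one separate update of the kernel's view of guest memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_SLOTS 4

/* Hypothetical slot: a guest-physical range that is currently mapped. */
struct toy_slot {
    bool used;
    uint64_t gpa_base;
    uint64_t size;
};

struct toy_memmap {
    struct toy_slot slots[TOY_MAX_SLOTS];
};

/* First update: remove the slot, leaving its old range unmapped. */
static void toy_delete_slot(struct toy_memmap *m, int id)
{
    assert(id >= 0 && id < TOY_MAX_SLOTS);
    m->slots[id].used = false;
}

/* Second update: create the slot again at its new guest-physical base. */
static void toy_create_slot(struct toy_memmap *m, int id,
                            uint64_t gpa_base, uint64_t size)
{
    assert(id >= 0 && id < TOY_MAX_SLOTS);
    m->slots[id].used = true;
    m->slots[id].gpa_base = gpa_base;
    m->slots[id].size = size;
}

/* Would a guest access to gpa hit any slot right now? */
static bool toy_gpa_mapped(const struct toy_memmap *m, uint64_t gpa)
{
    for (int i = 0; i < TOY_MAX_SLOTS; i++) {
        const struct toy_slot *s = &m->slots[i];

        if (s->used && gpa >= s->gpa_base && gpa - s->gpa_base < s->size)
            return true;
    }
    return false;
}
```

Because the move is two independent updates, there is a point between them where neither the old nor the new base is mapped. An update applied as a single transaction, as mentioned in the thread, would not expose that intermediate state.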