Clarification on KVM + vhost-net
Hi, I would like to invoke QEMU and KVM so that the guest sees a virtio NIC, and that NIC goes through a SR-IOV VF of a host NIC as directly and efficiently as possible. But I don't actually want to pass the VF through to the guest. I've found a bunch of discussion and confusing examples on the web, but I'm not able to figure out what the right thing to do with modern QEMU is. I don't think I want to create a macvtap interface attached to the VF, because I just want to use one MAC address for the VF itself (and allow the NIC anti-spoofing hardware to work etc). Am I supposed to create a raw socket bound to the interface I want to use in a helper, and then pass that to qemu? How exactly do I pass that in — do I still use "-net tap"? Do I have to create my own vhostfd in my helper too? Thanks! Roland -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Splitting a multi-function PCI device between guests with VFIO?
Hi everyone,

I'm updating my dev environment to use the shiny new vfio infrastructure for PCI assignment to kvm guests, and I'm not able to do what I used to do with the old-school KVM passthrough.

In particular, I have, say, a two-port QLogic adapter that looks like:

  82:00.0 0200: 1077:8030 (rev 02)
  82:00.1 0200: 1077:8030 (rev 02)
  82:00.2 0c04: 1077:8031 (rev 02)
  82:00.3 0c04: 1077:8031 (rev 02)
  82:00.4 0280: 1077:8032 (rev 02)
  82:00.5 0280: 1077:8032 (rev 02)

that is, each port gets three different PCI functions (one for NIC, one for FCoE and one for iSCSI).

I used to be able to assign 82:00.2 to one VM and 82:00.3 to a different VM by binding those devices to pci_stub and using

  -device pci-assign,host=82:00.2

and

  -device pci-assign,host=82:00.3

on my respective QEMU command lines. (That let me have an initiator and target in separate VMs with one adapter in one dev system.)

However, all of those PCI devices have the same iommu_group, so now if I bind the devices to vfio-pci and do s/pci-assign/vfio-pci/, the second QEMU to start fails with something like

  qemu-system-x86_64: -device vfio-pci,host=82:00.3: vfio: error opening /dev/vfio/41: Device or resource busy
  qemu-system-x86_64: -device vfio-pci,host=82:00.3: vfio: failed to get group 41
  qemu-system-x86_64: -device vfio-pci,host=82:00.3: Device initialization failed.
  qemu-system-x86_64: -device vfio-pci,host=82:00.3: Device 'vfio-pci' could not be initialized

Is there a way to split multi-function devices (with the same iommu_group) between VMs with vfio?

Thanks!
Roland
Re: Help with strange problem passing mlx4 device into kvm guests
On Tue, Mar 27, 2012 at 3:24 PM, Roland Dreier rol...@purestorage.com wrote:
> Just to follow up on this, it turns out this is a bug in how the
> Mellanox firmware deals with FLR (function level reset). The FW will
> be fixed in a future release, but in the meantime I've been able to
> work around this with the following hack (probably going to be
> whitespace destroyed by the gmail web interface I'm using, but you
> should be able to recreate it if you care):
>
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3085,6 +3085,12 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, int probe)
>         return 0;
>  }
>
> +static int reset_mellanox_dev(struct pci_dev *dev, int probe)
> +{
> +       /* skip FLR, it busts the Mellanox FW */
> +       return 0;
> +}
> +
>  #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed
>
>  static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
> @@ -3092,6 +3098,8 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>                 reset_intel_82599_sfp_virtfn },
>         { PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
>                 reset_intel_generic_dev },
> +       { PCI_VENDOR_ID_MELLANOX, 0x673c,
> +               reset_mellanox_dev },
>         { 0 }
>  };

And just to be clear, this is in the host kernel to avoid FLR there. The guest running the standard kernel would never do FLR anyway, so with the hack above in the host, the standard driver works fine.

 - R.
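As a rough illustration of why a do-nothing entry in that table is enough, here is a userspace sketch of a vendor/device lookup over a reset-method table. All names with a sketch_ prefix are invented for this example and the matching logic is a simplification, not a copy of drivers/pci/quirks.c. The IDs 15b3:673c are taken from the lspci output earlier in this thread. The assumption, labeled as such, is that a method returning 0 counts as a completed reset, while -ENOTTY means no quirk matched and the caller goes on to try generic methods such as FLR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ANY_ID ((uint16_t)0xffff)   /* wildcard, stand-in for PCI_ANY_ID */

struct sketch_dev {
    uint16_t vendor;
    uint16_t device;
};

struct sketch_reset_method {
    uint16_t vendor;
    uint16_t device;
    int (*reset)(const struct sketch_dev *dev, int probe);
};

/* Report success without touching the device, as the hack above does. */
static int sketch_reset_noop(const struct sketch_dev *dev, int probe)
{
    (void)dev;
    (void)probe;
    return 0;
}

static const struct sketch_reset_method sketch_methods[] = {
    { 0x15b3, 0x673c, sketch_reset_noop },  /* Mellanox MT26428 */
    { 0, 0, NULL }                           /* terminator */
};

/* Return the matching method's result, or -ENOTTY if nothing matches. */
static int sketch_specific_reset(const struct sketch_dev *dev, int probe)
{
    const struct sketch_reset_method *m;

    for (m = sketch_methods; m->reset; m++) {
        if ((m->vendor == dev->vendor || m->vendor == SKETCH_ANY_ID) &&
            (m->device == dev->device || m->device == SKETCH_ANY_ID))
            return m->reset(dev, probe);
    }
    return -ENOTTY;
}
```

In this model the Mellanox device is claimed by the table entry, so the generic path (and with it FLR) is never reached, while any other device falls through unchanged.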
Re: Help with strange problem passing mlx4 device into kvm guests
On Sun, Jan 29, 2012 at 6:29 PM, Roland Dreier rol...@purestorage.com wrote:
> I'm having a strange problem passing an mlx4 device into a kvm guest.
> The device in question is:
>
>   05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)
>
> running the latest (I believe) FW version 2.9.1000.
>
> The symptom of the problem is that when the mlx4_core driver starts,
> I get normal output like
>
>   mlx4_core :00:04.0: FW version 2.9.1000 (cmd intf rev 3), max commands 16
>   mlx4_core :00:04.0: Catastrophic error buffer at 0x1f020, size 0x10, BAR 0
>   mlx4_core :00:04.0: FW size 385 KB
>
> up until the driver tries to enable interrupts, when I get a long
> stream of
>
>   Completion event for bogus CQ
>
> and then it gives up because the NOP command interrupt test fails.

Just to follow up on this, it turns out this is a bug in how the Mellanox firmware deals with FLR (function level reset). The FW will be fixed in a future release, but in the meantime I've been able to work around this with the following hack (probably going to be whitespace destroyed by the gmail web interface I'm using, but you should be able to recreate it if you care):

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3085,6 +3085,12 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, int probe)
        return 0;
 }

+static int reset_mellanox_dev(struct pci_dev *dev, int probe)
+{
+       /* skip FLR, it busts the Mellanox FW */
+       return 0;
+}
+
 #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed

 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
@@ -3092,6 +3098,8 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
                reset_intel_82599_sfp_virtfn },
        { PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
                reset_intel_generic_dev },
+       { PCI_VENDOR_ID_MELLANOX, 0x673c,
+               reset_mellanox_dev },
        { 0 }
 };
Help with strange problem passing mlx4 device into kvm guests
Hi everyone,

I'm having a strange problem passing an mlx4 device into a kvm guest. The device in question is:

  05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)

running the latest (I believe) FW version 2.9.1000. The host system is a fairly standard dual-socket Xeon 5600 system, perhaps a tiny bit unusual in that it is a dual Tylersburg motherboard. I'm using

  QEMU emulator version 1.0 (qemu-kvm-1.0 Debian 1.0+dfsg-3), Copyright (c) 2003-2008 Fabrice Bellard

and

  Linux pure-driver3 3.1.0-1-amd64 #1 SMP Tue Jan 10 05:01:58 UTC 2012 x86_64 GNU/Linux

(the latest Debian testing versions).

The symptom of the problem is that when the mlx4_core driver starts, I get normal output like

  mlx4_core :00:04.0: FW version 2.9.1000 (cmd intf rev 3), max commands 16
  mlx4_core :00:04.0: Catastrophic error buffer at 0x1f020, size 0x10, BAR 0
  mlx4_core :00:04.0: FW size 385 KB

up until the driver tries to enable interrupts, when I get a long stream of

  Completion event for bogus CQ

and then it gives up because the NOP command interrupt test fails. Apparently what happens is that the SW2HW_EQ firmware command succeeds as far as the driver is concerned, but the EQ buffer is left as all 0s, so the driver thinks every entry is a completion event (for CQN 0).

Several things are weird here: first, the command interface including DMA from the device is definitely working since we get a reasonable-looking response for the query FW command etc, so I'm not sure what is different about the SW2HW_EQ command (it is the first thing that uses the MTT I guess, so maybe there is a problem setting that up?)

The guest is running 2.6.39, so there is no SR-IOV support in the mlx4 driver (but I am passing the only physical function of a non-virtualized device through, so I hope that isn't needed -- the device shouldn't know it's talking to a guest at all).

Second, passing through another device on the same system:

  86:00.0 Ethernet controller [0200]: Intel Corporation 82599EB 10 Gigabit TN Network Connection [8086:151c] (rev 01)

works fine, including MSI-X interrupts, running traffic works, etc.

Finally, the craziest thing is that this setup was working a week or so ago, but there may have been BIOS, kernel and kvm updates since then (my guest image is unchanged at least ;).

Anyone have any idea what might be going on or how to debug this further? Unfortunately I don't have a PCIe analyzer handy to get a better idea of what's happening with the device...

Thanks,
Roland
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
> > This is Linux virtualization, where _both_ the host and the guest
> > source code is fully known, and bugs (if any) can be found with a
> > high degree of determinism. This is Linux where the players dont
> > just vanish overnight, and are expected to do a proper job.
>
> It may sound strange but Windows is very popular guest and last I
> checked my HW there was no Windows sources there, but the answer to
> that is to emulate HW as close as possible to real one and then closed
> source guests will not have a reason to be upset.

And without even getting into closed/proprietary guests, virt is useful for testing/developing/deploying many free OSes, eg FreeBSD, NetBSD, OpenBSD, Hurd, random research OS, etc. Not to mention just wanting a stable [virtual] platform to run old enterprise Linux distro on.

So having a virtual platform whose interface doesn't change very often or very much has a lot of value at least in avoiding churn in guest OSes.

 - R.
Still LSI scsi problems with Windows guest and kvm 82?
I have a (32-bit) Windows XP guest installed, and if I start kvm-82 (from the Debian packages) on a 2.6.29-rc1 kernel (running 64-bit on an AMD host), the guest boots but soon dies with messages along the lines of:

  lsi_scsi: error: Unimplemented message 0x0c
  lsi_scsi: error: Reselect with pending DMA
  scsi-disk: Tag 0x0 already in use
  scsi-disk: Unsupported command length, command f0
  lsi_scsi: error: Unimplemented message 0x0c
  lsi_scsi: error: Reselect with pending DMA
  scsi-disk: Unsupported command length, command f0
  lsi_scsi: error: Reselect with pending DMA
  scsi-disk: Bad buffer tag 0x0

Is this a known problem?

 - R.
Re: [PATCH] regression: vmalloc easily fail.
> I'm guessing that the missing comment explains that this is
> intentional, to trap buffer overflows?

Actually, speaking of comments, it's interesting that __get_vm_area_node() -- which is called from vmalloc() -- does:

        /*
         * We always allocate a guard page.
         */
        size += PAGE_SIZE;

        va = alloc_vmap_area(size, align, start, end, node, gfp_mask);

and alloc_vmap_area() adds another PAGE_SIZE, as the original email pointed out:

        while (addr + size >= first->va_start && addr + size <= vend) {
                addr = ALIGN(first->va_end + PAGE_SIZE, align);

I wonder if the double padding is causing a problem when things get too fragmented?

 - R.
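To put a number on the double padding, here is a small arithmetic sketch. It is a simplified model, not the allocator itself: it assumes a 4 KiB page, page-aligned requests, and areas placed back to back, and the names SKETCH_PAGE_SIZE and vmap_stride() are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumed page size for illustration */

/* Address space consumed per allocation when areas are packed back to back:
 * the requested size, plus the guard page added in __get_vm_area_node(),
 * plus the one-page gap that alloc_vmap_area() leaves after the previous
 * area's va_end. `size` must be a multiple of the page size. */
static uint64_t vmap_stride(uint64_t size)
{
    uint64_t padded;

    assert(size % SKETCH_PAGE_SIZE == 0);
    padded = size + SKETCH_PAGE_SIZE;     /* guard page */
    return padded + SKETCH_PAGE_SIZE;     /* extra gap before the next area */
}
```

Under these assumptions a single-page vmalloc() ends up occupying three pages of vmalloc address space, so small allocations pay proportionally the most.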
Re: [PATCH] regression: vmalloc easily fail.
> I suspect it's a case of off-by-one... ALIGN() might round down, and
> the + (PAGE_SIZE-1) was there to make it round up. Except for that
> missing -1 ...

ALIGN() has always rounded up, at least back to 2.4.
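For reference, the round-up behavior can be sketched in userspace like this. align_up() is a made-up name that mirrors the usual mask-based formulation for power-of-two alignments; it is not the kernel macro itself.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a, where a is a power of two.
 * Values that are already aligned are returned unchanged. */
static inline uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}
```

With a 4096-byte alignment, 4097 rounds up to 8192 while 4096 stays at 4096, so no additional "+ (PAGE_SIZE-1)" is needed on top of it to get rounding up.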
Re: [PATCH 4/6 v3] PCI: support SR-IOV capability
> +       ctrl = pci_ari_enabled(dev) ? PCI_IOV_CTRL_ARI : 0;
> +       pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl);
> +       ssleep(1);

You seem to sleep for 1 second wherever you write the IOV_CTRL register. Why is this? Is this specified by PCI, or is it coming from somewhere else?

 - R.
Re: [PATCH 2/4 v2] PCI: support ARI capability
> > +config PCI_ARI
> > +       bool "PCI ARI support"
> > +       depends on PCI
> > +       default n
> > +       help
> > +         This enables PCI Alternative Routing-ID Interpretation.
>
> This Kconfig help text is a little weak. Why not include the text
> you've already written here:
>
>   Support Alternative Routing-ID Interpretation (ARI), which increases
>   the number of functions that can be supported by a PCIe endpoint.
>   ARI is required by SR-IOV.

I agree with this improvement to the help text. But a further question is whether ARI even merits its own user-visible config option. Is it worth having yet another choice for users? When would someone want ARI but not SR-IOV?

 - R.
Re: Fresh install of Windows XP hangs early in boot?
> Hm, your comment later on makes me think you tried this on AMD. If so,
> I have also run into a similar problem with Windows guests under AMD.
> After installing WinDbg, it told me that it was a Paging Request in
> Non-Paged memory related to the Video memory area. Does yours look
> similar to that? I have not had time to track it further than that,
> though.

Yes, I got a bluescreen on an AMD host but not an Intel host. And the bluescreen (I posted earlier) said PAGE_FAULT_IN_NONPAGED_AREA.

 - R.
Re: Fresh install of Windows XP hangs early in boot?
> I experienced random hangs during the install (stracing kvm shows no
> system calls, and it appears to be spinning at 100% CPU), but
> eventually I got an install that ran all the way to completion.
> However, that image seems to hang every time shortly after boot
> starts. I see the Windows splash screen, the little blue dots move for
> a few seconds, and then the guest hangs in the same way -- no system
> calls, 100% CPU.

FWIW, when I ltrace the stuck kvm process, I get an endless string of

  memcpy(0x7fffb09355f0, \224\213'\206, 4)

(the value changes from run to run, but stays constant within a run).

Unfortunately Debian packages don't seem to be built with debugging symbols, so gdb doesn't show a very enlightening backtrace. I'll try to get more info.

 - R.
Re: Fresh install of Windows XP hangs early in boot?
I built with debugging symbols, and this seems to be an issue with SCSI disk emulation. The traceback is:

  #0  0x7fc086d7dd10 in memcpy () from /lib/libc.so.6
  #1  0x004a319b in cpu_physical_memory_rw (addr=108661608, buf=0x7fff904ca190 \224['\206\210\030z\006I�A, len=4, is_write=0) at /users/rdreier/kvm-deb.git/qemu/exec.c:2847
  #2  0x0041f0c2 in lsi_execute_script (s=0x2ef7a30) at ../cpu-all.h:924
  #3  0x0049bd91 in qcow_aio_read_cb (opaque=0x3018d70, ret=0) at block-qcow2.c:840
  #4  0x0041cba0 in qemu_aio_poll () at /users/rdreier/kvm-deb.git/qemu/block-raw-posix.c:513
  #5  0x0040b38a in main_loop_wait (timeout=<value optimized out>) at /users/rdreier/kvm-deb.git/qemu/vl.c:
  #6  0x004f607a in kvm_main_loop () at /users/rdreier/kvm-deb.git/qemu/qemu-kvm.c:587
  #7  0x00412b46 in main (argc=<value optimized out>, argv=0x7fff904cb0c8) at /users/rdreier/kvm-deb.git/qemu/vl.c:7811

and no progress ever seems to be made (the same address is read over and over).

I'm trying again with IDE instead of SCSI disks. But I would like to help debug the SCSI emulation... will look at it further later, and I'm happy to provide any info someone else could use.

 - R.
Re: Fresh install of Windows XP hangs early in boot?
> > BTW I tried using if=ide to install Windows XP and got a blue screen
> > during the installer. What are people doing to run XP in a kvm guest?
>
> Are you using a recent version of kvm-userspace/kernel modules?
> Please save the blue screen and mail it to the list or fill a bug.

Pretty recent... this is with kvm modules from vanilla mainline 2.6.27-rc1 and kvm-72 userspace from Debian. Host is a 64-bit kernel running on AMD CPU (no NPT).

Here's the bluescreen -- it seems the same as the last install, so pretty reproducible:

attachment: xp-bluescreen.png
Re: Fresh install of Windows XP hangs early in boot?
> BTW I tried using if=ide to install Windows XP and got a blue screen
> during the installer. What are people doing to run XP in a kvm guest?

Funnily enough, installing XP SP2 with if=ide worked fine on an Intel host. I notice that I left off -std-vga on the working install too (I used it on the AMD host that blue-screened).

Anyway, just another data point. Let me know if any other data would help.
Re: Fresh install of Windows XP hangs early in boot?
> Known problem:
> http://www.nabble.com/LSI:-avoid-infinite-loops-p17116605.html

I tried this hack (and actually made the magic insns number 500), and doing an XP install I got

  lsi_scsi: error: Reselect with pending DMA

Do you have any feeling if this is because the script execution got stopped too soon? Or is this likely a further issue?

 - R.
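For readers who have not looked at the linked patch, the general idea of an instruction budget can be sketched as below. This is a toy model written for this digest, not the QEMU code: run_script() and SKETCH_MAX_INSNS are invented names, the "script" is just an array of next-instruction indices, and 500 is simply the value mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_INSNS 500   /* budget used in the experiment above */

/* Execute a toy script in which next[pc] is the index of the following
 * instruction, or a negative value to stop. Returns the number of
 * instructions executed, or -1 if the budget ran out first, which is what
 * happens when the script polls in a tight loop. */
static int run_script(const int *next, size_t len)
{
    size_t pc = 0;
    int executed = 0;

    while (pc < len) {
        if (executed >= SKETCH_MAX_INSNS)
            return -1;              /* give up instead of spinning forever */
        executed++;
        if (next[pc] < 0)
            break;                  /* script finished on its own */
        pc = (size_t)next[pc];
    }
    return executed;
}
```

The open question in the message above corresponds to the -1 case: if the emulator bails out of a script that was about to make progress, whatever state it left half-done can surface later as a different error.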
Re: KVM overflows the stack
> Yes, things like kvm_lapic_state are way too big to be on the stack.

I had a quick look at the code, and my worry about dynamic allocation would be that handling allocation failure seems like it might get tricky. Eg for handling struct kvm_pv_mmu_op_buffer (which is 528 bytes on the stack in kvm_pv_mmu_op()) can you deal with an mmu op failing? (maybe in that case you can easily by just setting *ret to 0?)

> There's an additional problem here, that apparently your gcc (which
> version?) doesn't fold objects in a switch statement into the same
> stack slot:
>
>    switch (...) {
>    case x: {
>        struct medium a;
>        ...
>    }
>    case y: {
>        struct medium b;
>        ...
>    }
>    };

A trick for this is to do:

        union {
                struct medium1 a;
                struct medium2 b;
        } u;

        switch (...) {
        case x:
                use u.a;
                ...
        case y:
                use u.b;
                ...
        }
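The union trick can be tried out in userspace with something like the following. The struct sizes (256 and 512 bytes) and the names scratch and handle_op() are arbitrary choices for this sketch; the point is only that the union occupies one slot sized for its largest member, whatever the compiler does with block-scoped locals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct medium1 { char buf[256]; };   /* arbitrary sizes for illustration */
struct medium2 { char buf[512]; };

/* One object big enough for either case, instead of one local per case. */
union scratch {
    struct medium1 a;
    struct medium2 b;
};

/* Each case uses only its own member of the shared union and returns how
 * many bytes of scratch space it touched. */
static size_t handle_op(int op)
{
    union scratch u;

    switch (op) {
    case 0:
        memset(&u.a, 0, sizeof(u.a));
        return sizeof(u.a);
    case 1:
        memset(&u.b, 0, sizeof(u.b));
        return sizeof(u.b);
    default:
        return 0;
    }
}
```

Here sizeof(union scratch) is 512 rather than 768, which is the stack saving the trick guarantees even when the compiler does not merge the per-case locals on its own.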
Re: [RFC][PATCH] reduce KVM stack usage
> +struct kvm_pv_mmu_op_buffer *buffer =
> +kmalloc(GFP_KERNEL, sizeof(struct kvm_pv_mmu_op_buffer));

Surely this produces a warning? kmalloc takes (size, flags) -- you have them reversed here.

> +lapic = kzalloc(GFP_KERNEL, sizeof(*lapic));

> +lapic = kmalloc(GFP_KERNEL, sizeof(*lapic));

> +struct kvm_irqchip *chip = kmalloc(GFP_KERNEL, sizeof(*chip));

> +kvm_sregs = kmalloc(GFP_KERNEL, sizeof kvm_sregs);

> +fpu = kmalloc(GFP_KERNEL, sizeof(*fpu));

same for all of these places.

> +if (lapic)
> +kfree(lapic);
> +if (fpu)
> +kfree(fpu);
> +if (kvm_sregs)
> +kfree(kvm_sregs);

kfree(NULL) is fine, so you can remove the if()s here.
Re: [RFC][PATCH] reduce KVM stack usage
> > > + kmalloc(GFP_KERNEL, sizeof(struct kvm_pv_mmu_op_buffer));
> >
> > Surely this produces a warning? kmalloc takes (size, flags) -- you
> > have them reversed here.
>
> Heh. It actually doesn't.

Yeah, I guess you need sparse to catch the gfp_t mismatch.

> > kfree(NULL) is fine, so you can remove the if()s here.
>
> I know it is fine, but I kinda like putting the if()s, just to let
> people know that we don't always *expect* something to be in there.
> But, it doesn't matter to me too much either way.

It's not really a big deal, but

        if (x)
                kfree(x);

does bloat the object code with the extra test of 'x'. I guess you could put a comment like

        /* free any temp structures we allocated */

or something like that.

 - R.
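The reason a plain C compiler stays quiet about the swapped arguments can be shown with a userspace stand-in. Everything here is a mock-up: gfp_sketch_t, GFP_SKETCH (with a made-up value of 0xd0) and kmalloc_sketch() are invented for this example and only imitate the (size, flags) parameter order. Because both parameters are ordinary integer types, the swapped call converts implicitly and compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int gfp_sketch_t;            /* stand-in for gfp_t */
#define GFP_SKETCH ((gfp_sketch_t)0xd0u)      /* made-up flag value */

static size_t last_size;                      /* records the size actually requested */

/* Mock allocator with the same parameter order as kmalloc(size, flags). */
static void *kmalloc_sketch(size_t size, gfp_sketch_t flags)
{
    (void)flags;
    last_size = size;
    return malloc(size);
}

/* Call the mock with the arguments reversed, as in the patch under review,
 * and report how many bytes were actually requested. */
static size_t reversed_call_size(size_t obj_size)
{
    void *p = kmalloc_sketch(GFP_SKETCH, obj_size);   /* swapped on purpose */

    free(p);
    return last_size;
}
```

In this mock-up, asking for a 528-byte object with the arguments reversed requests 0xd0 (208) bytes instead, with the real size silently passed as the flags. A checker that treats the flags type as distinct from plain integers, which is what sparse is suggested for above, is what it takes to flag this.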