Clarification on KVM + vhost-net

2015-10-21 Thread Roland Dreier
Hi,

I would like to invoke QEMU and KVM so that the guest sees a virtio
NIC, and that NIC goes through a SR-IOV VF of a host NIC as directly
and efficiently as possible.  But I don't actually want to pass the VF
through to the guest.  I've found a bunch of discussion and confusing
examples on the web, but I'm not able to figure out what the right
thing to do with modern QEMU is.

I don't think I want to create a macvtap interface attached to the VF,
because I just want to use one MAC address for the VF itself (and
allow the NIC anti-spoofing hardware to work etc).  Am I supposed to
create a raw socket bound to the interface I want to use in a helper,
and then pass that to qemu?  How exactly do I pass that in -- do I
still use "-net tap"?  Do I have to create my own vhostfd in my helper
too?

Thanks!
  Roland
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Splitting a multi-function PCI device between guests with VFIO?

2014-02-10 Thread Roland Dreier
Hi everyone,

I'm updating my dev environment to use the shiny new vfio
infrastructure for PCI assignment to kvm guests, and I'm not able to
do what I used to do with the old-school KVM passthrough.  In
particular, I have, say, a two-port QLogic adapter that looks like:

82:00.0 0200: 1077:8030 (rev 02)
82:00.1 0200: 1077:8030 (rev 02)
82:00.2 0c04: 1077:8031 (rev 02)
82:00.3 0c04: 1077:8031 (rev 02)
82:00.4 0280: 1077:8032 (rev 02)
82:00.5 0280: 1077:8032 (rev 02)

that is, each port gets three different PCI functions (one for NIC,
one for FCoE and one for iSCSI).

I used to be able to assign 82:00.2 to one VM and 82:00.3 to a
different VM by binding those devices to pci_stub and using -device
pci-assign,host=82:00.2 and -device pci-assign,host=82:00.3 on my
respective QEMU command lines.  (That let me have an initiator and
target in separate VMs with one adapter in one dev system)

However, all of those PCI devices have the same iommu_group, so now if
I bind the devices to vfio-pci and do s/pci-assign/vfio-pci/, the
second QEMU to start fails with something like

qemu-system-x86_64: -device vfio-pci,host=82:00.3: vfio: error
opening /dev/vfio/41: Device or resource busy
qemu-system-x86_64: -device vfio-pci,host=82:00.3: vfio: failed to
get group 41
qemu-system-x86_64: -device vfio-pci,host=82:00.3: Device
initialization failed.
qemu-system-x86_64: -device vfio-pci,host=82:00.3: Device
'vfio-pci' could not be initialized

Is there a way to split multi-function devices (with the same
iommu_group) between VMs with vfio?

Thanks!
  Roland


Re: Help with strange problem passing mlx4 device into kvm guests

2012-03-28 Thread Roland Dreier
On Tue, Mar 27, 2012 at 3:24 PM, Roland Dreier rol...@purestorage.com wrote:
> Just to follow up on this, it turns out this is a bug in how the
> Mellanox firmware deals with FLR (function level reset).  The
> FW will be fixed in a future release, but in the meantime I've
> been able to work around this with the following hack (probably
> going to be whitespace destroyed by the gmail web interface
> I'm using, but you should be able to recreate it if you care):
>
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3085,6 +3085,12 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, int probe)
>         return 0;
>  }
>
> +static int reset_mellanox_dev(struct pci_dev *dev, int probe)
> +{
> +       /* skip FLR, it busts the Mellanox FW */
> +       return 0;
> +}
> +
>  #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed
>
>  static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
> @@ -3092,6 +3098,8 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>                 reset_intel_82599_sfp_virtfn },
>         { PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
>                 reset_intel_generic_dev },
> +       { PCI_VENDOR_ID_MELLANOX, 0x673c,
> +               reset_mellanox_dev },
>         { 0 }
>  };

And just to be clear, this is in the host kernel to avoid FLR there.
The guest running the standard kernel would never do FLR anyway,
so with the hack above in the host, the standard driver works fine.

 - R.


Re: Help with strange problem passing mlx4 device into kvm guests

2012-03-27 Thread Roland Dreier
On Sun, Jan 29, 2012 at 6:29 PM, Roland Dreier rol...@purestorage.com wrote:

> I'm having a strange problem passing an mlx4 device into a kvm guest.
> The device in question is:
>
>    05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX
> VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)
>
> running the latest (I believe) FW version 2.9.1000.
>
> The symptom of the problem is
> that when the mlx4_core driver starts, I get normal output like
>
>    mlx4_core :00:04.0: FW version 2.9.1000 (cmd intf rev 3), max
> commands 16
>    mlx4_core :00:04.0: Catastrophic error buffer at 0x1f020, size
> 0x10, BAR 0
>    mlx4_core :00:04.0: FW size 385 KB
>
> up until the driver tries to enable interrupts, when I get a long
> stream of
>
>    Completion event for bogus CQ 
>
> and then it gives up because the NOP command interrupt test
> fails.

Just to follow up on this, it turns out this is a bug in how the
Mellanox firmware deals with FLR (function level reset).  The
FW will be fixed in a future release, but in the meantime I've
been able to work around this with the following hack (probably
going to be whitespace destroyed by the gmail web interface
I'm using, but you should be able to recreate it if you care):

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3085,6 +3085,12 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, int probe)
        return 0;
 }

+static int reset_mellanox_dev(struct pci_dev *dev, int probe)
+{
+       /* skip FLR, it busts the Mellanox FW */
+       return 0;
+}
+
 #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed

 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
@@ -3092,6 +3098,8 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
                reset_intel_82599_sfp_virtfn },
        { PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
                reset_intel_generic_dev },
+       { PCI_VENDOR_ID_MELLANOX, 0x673c,
+               reset_mellanox_dev },
        { 0 }
 };


Help with strange problem passing mlx4 device into kvm guests

2012-01-29 Thread Roland Dreier
Hi everyone,

I'm having a strange problem passing an mlx4 device into a kvm guest.
The device in question is:

05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX
VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)

running the latest (I believe) FW version 2.9.1000.  The host system
is a fairly standard dual-socket Xeon 5600 system, perhaps a tiny bit
unusual in that it is a dual Tylersburg motherboard.  I'm using

QEMU emulator version 1.0 (qemu-kvm-1.0 Debian 1.0+dfsg-3),
Copyright (c) 2003-2008 Fabrice Bellard

and

Linux pure-driver3 3.1.0-1-amd64 #1 SMP Tue Jan 10 05:01:58 UTC
2012 x86_64 GNU/Linux

(the latest Debian testing versions).  The symptom of the problem is
that when the mlx4_core driver starts, I get normal output like

mlx4_core :00:04.0: FW version 2.9.1000 (cmd intf rev 3), max
commands 16
mlx4_core :00:04.0: Catastrophic error buffer at 0x1f020, size
0x10, BAR 0
mlx4_core :00:04.0: FW size 385 KB

up until the driver tries to enable interrupts, when I get a long
stream of

Completion event for bogus CQ 

and then it gives up because the NOP command interrupt test
fails.

Apparently what happens is that the SW2HW_EQ firmware command succeeds
as far as the driver is concerned, but the EQ buffer is left as all
0s, so the driver thinks every entry is a completion event (for CQN 0).

Several things are weird here: first, the command interface including
DMA from the device is definitely working since we get a
reasonable-looking response for the query FW command etc, so I'm not
sure what is different about the SW2HW_EQ command (it is the first
thing that uses the MTT I guess, so maybe there is a problem setting
that up?)  The guest is running 2.6.39, so there is no SR-IOV support
in the mlx4 driver (but I am passing the only physical function of a
non-virtualized device through, so I hope that isn't needed -- the
device shouldn't know it's talking to a guest at all)

Second, passing through another device on the same system:

86:00.0 Ethernet controller [0200]: Intel Corporation 82599EB 10
Gigabit TN Network Connection [8086:151c] (rev 01)

works fine, including MSI-X interrupts, running traffic works, etc.

Finally, the craziest thing is that this setup was working a week or
so ago, but there may have been BIOS, kernel and kvm updates since
then (my guest image is unchanged at least ;).

Anyone have any idea what might be going on or how to debug this
further?  Unfortunately I don't have a PCIe analyzer handy to get a
better idea of what's happening with the device...

Thanks,
  Roland


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-24 Thread Roland Dreier

> > This is Linux virtualization, where _both_ the host and the guest source
> > code is fully known, and bugs (if any) can be found with a high degree of
>
> It may sound strange but Windows is very popular guest and last I
> checked my HW there was no Windows sources there, but the answer to that
> is to emulate HW as close as possible to real one and then closed source
> guests will not have a reason to be upset.
>
> > determinism. This is Linux where the players dont just vanish overnight,
> > and are expected to do a proper job.

And without even getting into closed/proprietary guests, virt is useful
for testing/developing/deploying many free OSes, eg FreeBSD, NetBSD,
OpenBSD, Hurd, random research OS, etc.  Not to mention just wanting a
stable [virtual] platform to run old enterprise Linux distro on.  So
having a virtual platform whose interface doesn't change very often or
very much has a lot of value at least in avoiding churn in guest OSes.

 - R.


Still LSI scsi problems with Windows guest and kvm 82?

2009-01-13 Thread Roland Dreier
I have a (32-bit) Windows XP guest installed, and if I start kvm-82
(from the Debian packages) on a 2.6.29-rc1 kernel (running 64-bit on an
AMD host), the guest boots but soon dies with messages along the lines
of:

lsi_scsi: error: Unimplemented message 0x0c
lsi_scsi: error: Reselect with pending DMA
scsi-disk: Tag 0x0 already in use
scsi-disk: Unsupported command length, command f0
lsi_scsi: error: Unimplemented message 0x0c
lsi_scsi: error: Reselect with pending DMA
scsi-disk: Unsupported command length, command f0
lsi_scsi: error: Reselect with pending DMA
scsi-disk: Bad buffer tag 0x0

Is this a known problem?

 - R.



Re: [PATCH] regression: vmalloc easily fail.

2008-10-28 Thread Roland Dreier
> I'm guessing that the missing comment explains that this is
> intentional, to trap buffer overflows?

Actually, speaking of comments, it's interesting that
__get_vm_area_node() -- which is called from vmalloc() -- does:

        /*
         * We always allocate a guard page.
         */
        size += PAGE_SIZE;

        va = alloc_vmap_area(size, align, start, end, node, gfp_mask);

and alloc_vmap_area() adds another PAGE_SIZE, as the original email
pointed out:

        while (addr + size >= first->va_start && addr + size <= vend) {
                addr = ALIGN(first->va_end + PAGE_SIZE, align);

I wonder if the double padding is causing a problem when things get too
fragmented?

 - R.


Re: [PATCH] regression: vmalloc easily fail.

2008-10-28 Thread Roland Dreier
> I suspect it's a case of off-by-one... ALIGN() might round down, and
> the + (PAGE_SIZE-1) was there to make it round up.
> Except for that missing -1 ...

ALIGN() has always rounded up, at least back to 2.4.


Re: [PATCH 4/6 v3] PCI: support SR-IOV capability

2008-09-30 Thread Roland Dreier
> +   ctrl = pci_ari_enabled(dev) ? PCI_IOV_CTRL_ARI : 0;
> +   pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl);
> +   ssleep(1);

You seem to sleep for 1 second wherever you write the IOV_CTRL
register.  Why is this?  Is this specified by PCI, or is it coming from
somewhere else?

 - R.


Re: [PATCH 2/4 v2] PCI: support ARI capability

2008-09-01 Thread Roland Dreier
> > +config PCI_ARI
> > +  bool "PCI ARI support"
> > +  depends on PCI
> > +  default n
> > +  help
> > +    This enables PCI Alternative Routing-ID Interpretation.
>
> This Kconfig help text is a little weak. Why not include the text
> you've already written here:
>
> > Support Alternative Routing-ID Interpretation (ARI), which
> > increases the number of functions that can be supported by a PCIe
> > endpoint. ARI is required by SR-IOV.

I agree with this improvement to the help text.  But a further question
is whether ARI even merits its own user-visible config option.  Is it
worth having yet another choice for users?  When would someone want ARI
but not SR-IOV?

 - R.



Re: Fresh install of Windows XP hangs early in boot?

2008-07-30 Thread Roland Dreier
> Hm, your comment later on makes me think you tried this on AMD.  If so, I
> have also run into a similar problem with Windows guests under AMD.  After
> installing WinDbg, it told me that it was a Paging Request in Non-Paged
> memory related to the Video memory area.  Does yours look similar to that?
> I have not had time to track it further than that, though.

Yes, I got a bluescreen on an AMD host but not an Intel host.  And the
bluescreen (I posted earlier) said PAGE_FAULT_IN_NONPAGED_AREA

 - R.


Re: Fresh install of Windows XP hangs early in boot?

2008-07-29 Thread Roland Dreier
> I experienced random hangs during the install (stracing kvm shows no
> system calls, and it appears to be spinning at 100% CPU), but eventually
> I got an install that ran all the way to completion.  However, that
> image seems to hang every time shortly after boot starts.  I see the
> Windows splash screen, the little blue dots move for a few seconds, and
> then the guest hangs in the same way -- no system calls, 100% CPU.

FWIW, when I ltrace the stuck kvm process, I get an endless string of

memcpy(0x7fffb09355f0, "\224\213'\206", 4)

(the value changes from run to run, but stays constant within a run)

Unfortunately Debian packages don't seem to be built with debugging
symbols, so gdb doesn't show a very enlightening backtrace.  I'll try to
get more info.

 - R.


Re: Fresh install of Windows XP hangs early in boot?

2008-07-29 Thread Roland Dreier
I built with debugging symbols, and this seems to be an issue with SCSI
disk emulation.  The traceback is:

#0  0x7fc086d7dd10 in memcpy () from /lib/libc.so.6
#1  0x004a319b in cpu_physical_memory_rw (addr=108661608,
buf=0x7fff904ca190 \224['\206\210\030z\006I�A, len=4, is_write=0)
at /users/rdreier/kvm-deb.git/qemu/exec.c:2847
#2  0x0041f0c2 in lsi_execute_script (s=0x2ef7a30) at ../cpu-all.h:924
#3  0x0049bd91 in qcow_aio_read_cb (opaque=0x3018d70, ret=0) at block-qcow2.c:840
#4  0x0041cba0 in qemu_aio_poll () at /users/rdreier/kvm-deb.git/qemu/block-raw-posix.c:513
#5  0x0040b38a in main_loop_wait (timeout=<value optimized out>)
at /users/rdreier/kvm-deb.git/qemu/vl.c:
#6  0x004f607a in kvm_main_loop () at /users/rdreier/kvm-deb.git/qemu/qemu-kvm.c:587
#7  0x00412b46 in main (argc=<value optimized out>, argv=0x7fff904cb0c8)
at /users/rdreier/kvm-deb.git/qemu/vl.c:7811

and no progress ever seems to be made (the same address is read over and
over)

I'm trying again with IDE instead of SCSI disks.  But I would like to
help debug the SCSI emulation... will look at it further later, and I'm
happy to provide any info someone else could use.

 - R.


Re: Fresh install of Windows XP hangs early in boot?

2008-07-29 Thread Roland Dreier
> > BTW I tried using if=ide to install Windows XP and got a blue screen
> > during the installer.  What are people doing to run XP in a kvm guest?
>
> Are you using a recent version of kvm-userspace/kernel modules? Please
> save the blue screen and mail it to the list or fill a bug.

Pretty recent... this is with kvm modules from vanilla mainline
2.6.27-rc1 and kvm-72 userspace from Debian.  Host is a 64-bit kernel
running on AMD CPU (no NPT).

Here's the bluescreen -- it seems the same as the last install, so
pretty reproducible:

attachment: xp-bluescreen.png

Re: Fresh install of Windows XP hangs early in boot?

2008-07-29 Thread Roland Dreier
> BTW I tried using if=ide to install Windows XP and got a blue screen
> during the installer.  What are people doing to run XP in a kvm guest?

Funnily enough installing XP SP2 with if=ide worked fine on an Intel
host.  I notice that I left off -std-vga on the working install too (I
used it on the AMD host that blue-screened).

Anyway, just another data point.  Let me know if any other data would help.


Re: Fresh install of Windows XP hangs early in boot?

2008-07-29 Thread Roland Dreier
> Known problem:
> http://www.nabble.com/LSI:-avoid-infinite-loops-p17116605.html

I tried this hack (and actually made the magic insns number 500), and
doing an XP install I got

lsi_scsi: error: Reselect with pending DMA

do you have any feeling if this is because the script execution got
stopped too soon?  Or is this likely a further issue?

 - R.


Re: KVM overflows the stack

2008-07-17 Thread Roland Dreier
> Yes, things like kvm_lapic_state are way too big to be on the stack.

I had a quick look at the code, and my worry about dynamic allocation
would be that handling allocation failure seems like it might get
tricky.  Eg for handling struct kvm_pv_mmu_op_buffer (which is 528 bytes
on the stack in kvm_pv_mmu_op()) can you deal with an mmu op failing?
(maybe in that case you can easily by just setting *ret to 0?)

> There's an additional problem here, that apparently your gcc (which
> version?) doesn't fold objects in a switch statement into the same
> stack slot:
>
>   switch (...) {
>   case x: {
>       struct medium a;
>       ...
>   }
>   case y: {
>       struct medium b;
>       ...
>   }
>   };

A trick for this is to do:

        union {
                struct medium1 a;
                struct medium2 b;
        } u;

        switch (...) {
        case x:
                use u.a;
                ...

        case y:
                use u.b;
                ...
        }


Re: [RFC][PATCH] reduce KVM stack usage

2008-07-17 Thread Roland Dreier
> +struct kvm_pv_mmu_op_buffer *buffer =
> +kmalloc(GFP_KERNEL, sizeof(struct kvm_pv_mmu_op_buffer));

Surely this produces a warning?  kmalloc takes (size, flags) -- you have
them reversed here.

> +lapic = kzalloc(GFP_KERNEL, sizeof(*lapic));
> +lapic = kmalloc(GFP_KERNEL, sizeof(*lapic));
> +struct kvm_irqchip *chip = kmalloc(GFP_KERNEL, sizeof(*chip));
> +kvm_sregs = kmalloc(GFP_KERNEL, sizeof kvm_sregs);
> +fpu = kmalloc(GFP_KERNEL, sizeof(*fpu));

same for all of these places.

> +if (lapic)
> +kfree(lapic);
>
> +if (fpu)
> +kfree(fpu);
> +if (kvm_sregs)
> +kfree(kvm_sregs);

kfree(NULL) is fine, so you can remove the if()s here.


Re: [RFC][PATCH] reduce KVM stack usage

2008-07-17 Thread Roland Dreier
> > > +   kmalloc(GFP_KERNEL, sizeof(struct kvm_pv_mmu_op_buffer));
> >
> > Surely this produces a warning?  kmalloc takes (size, flags) -- you have
> > them reversed here.
>
> Heh.  It actually doesn't.

Yeah, I guess you need sparse to catch the gfp_t mismatch.

> > kfree(NULL) is fine, so you can remove the if()s here.
>
> I know it is fine, but I kinda like putting the if()s, just to let
> people know that we don't always *expect* something to be in there.
> But, it doesn't matter to me too much either way.

It's not really a big deal, but if (x) kfree(x); does bloat the object
code with the extra test of 'x'.  I guess you could put a comment like
/* free any temp structures we allocated */ or something like that.

 - R.