Hi Laine,

I have a few questions before I can give my opinion.
The Mellanox dual-port cards that expose both ports through a single PCI device 
are the ConnectX-3 and ConnectX-3 Pro (maybe other cards as well; I will check 
this). ConnectX-4 dual-port cards and above are implemented with two PCI 
devices, one per PF.
I can check with our driver architect why it was done like this in the past.

The PCI address in the XML below is the VF's PCI address, which is different 
for every VF, so I am not sure why it causes problems for libvirt when setting 
the MAC?
    <source>
      <address type='pci' slot='0x08' function='0x4'/>
    </source>


Shifting for a moment to OpenStack and its SR-IOV mechanism driver: I remember 
we tried to enable support for the second port of such cards in OpenStack. We 
tested a Mellanox ConnectX-3 Pro dual-port card with OpenStack to allow 
booting a VM on both ports. I implemented pci-passthrough-whitelist-regex [1] 
to allow a flexible way to whitelist PCI devices, and we also had a patch in 
the Neutron SR-IOV agent [2] to allow mapping multiple PFs to one PCI device, 
but the community (especially Intel) didn't like it.

[1] - 
https://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/pci-passthrough-whitelist-regex.html
 
[2] - https://review.openstack.org/#/c/409526/




-----Original Message-----
From: sendmail [mailto:justsendmailnothinge...@gmail.com] On Behalf Of Laine 
Stump
Sent: Thursday, August 3, 2017 7:09 AM
To: Libvirt <libvir-list@redhat.com>
Cc: Doug Ledford <dledf...@redhat.com>; Moshe Levi <mosh...@mellanox.com>; 
Daniel P. Berrange <berra...@redhat.com>
Subject: RFC: support for configuring all ports of a multiport SRIOV VF when 
assigning to guest

("No matter how far you've gone down the wrong road, turn back." - paraphrase 
of a Turkish proverb that is apropos to this discussion)

Several years ago, when I was apparently naive and narrow in my thinking and 
someone wanted us to support setting the MAC address and vlan tag for SRIOV VFs 
when assigning them to a guest with PCI device assignment (this was before VFIO 
existed), I had the idea to do this by creating a new type of <interface> 
device:

   <interface type='hostdev'>
    ....

My thinking was that <interface> already had elements for mac address, 
802.1Qb[gh] virtualport config, and vlan tag (or maybe it was that we were 
*going to add* support for vlan tag), so by just adding a <source> that was a 
PCI address, we would have everything we needed. Basically, there is some 
amount of config that needs to be applied to the device before it's assigned to 
the guest, and since the device ends up being a netdev in the guest, all that 
config is already present in an <interface>. As a bonus, because it was an 
<interface> we could easily re-use the recently added "pool of devices" network 
type (with some minor adjustment) to avoid needing to hardcode the host-side 
PCI address of the VF.
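
For reference, a device-pool network looks something like this (the network 
name and PF netdev name here are hypothetical); libvirt picks a free VF from 
the PF's pool each time a guest starts:

    <network>
      <name>vf-pool</name>
      <forward mode='hostdev' managed='yes'>
        <pf dev='enp6s0f0'/>
      </forward>
    </network>

and the guest config just references the network:

    <interface type='network'>
      <source network='vf-pool'/>
      <mac address='52:54:00:01:01:01'/>
    </interface>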

At the time Dan Berrange countered (I think - correct me if I'm wrong!) that we 
should instead do this with modifications to <hostdev>, but somehow I managed 
to either convince him, or maybe he just finally tired of my stubbornness and 
decided it was easier to deal with the after effects of giving in rather than 
continuing to debate with me :-)

So right now if you want to assign an SRIOV VF network device to a guest with 
VFIO, you need something like this (ignoring network device pools for the 
moment):

    <interface type='hostdev'>
      <source>
        <address type='pci' slot='0x08' function='0x4'/>
      </source>
      <mac address='52:54:00:01:01:01'/>
      <vlan>
        <tag id='42'/>
      </vlan>
    </interface>

(or in place of <vlan>, you could have a <virtualport> element for 
802.1Qb[gh]).
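
For completeness, the virtualport variant would look something like this (the 
profileid value is hypothetical):

    <interface type='hostdev'>
      <source>
        <address type='pci' slot='0x08' function='0x4'/>
      </source>
      <mac address='52:54:00:01:01:01'/>
      <virtualport type='802.1Qbh'>
        <parameters profileid='my-port-profile'/>
      </virtualport>
    </interface>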


The SRIOV cards that we had around when we were doing this work had multiple 
physical ports on them (either 2 or 4), but each physical port was associated 
with its own PCI Physical Function (PF), and each of the PCI Virtual Functions 
associated with a PF was tied to a single netdev, i.e. in all cases there was 
always a 1:1 correspondence between a netdev and a PCI device. All of libvirt's 
code dealing with SRIOV VFs and PFs assumes this 1:1 relationship.
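
On those cards the 1:1 mapping is directly visible in sysfs; a sketch, with 
hypothetical PCI addresses and netdev names:

    $ ls /sys/bus/pci/devices/0000:06:00.0/net/    # PF for port 1
    enp6s0f0
    $ ls /sys/bus/pci/devices/0000:06:00.1/net/    # PF for port 2
    enp6s0f1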

And then came Mellanox "dual port" SRIOV cards....

A Mellanox SRIOV NIC doesn't necessarily do that. Instead, it can operate in 
"dual port" mode, where it has a single PCI PF device for both physical ports; 
the single PF PCI device has 2 separate netdevs associated with it (so when you 
look in the "net" subdirectory for the PCI device, you'll see two netdevs 
listed, and when you look in the "device" subdirectory of those two netdevs in 
sysfs, they both point back to the same PCI device). VFs associated with that 
PF will also each have two netdevs associated with them. This means that when 
you assign a VF to a guest, the guest is getting a single PCI device, but it's 
getting two netdevs. (I've been told that the advantage of doing both ports 
with a single PCI device is that each Mellanox PCI device uses a huge amount of 
MMIO space, so putting two ports on one device cuts the MMIO usage in half.)
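
The same sysfs check on a dual-port Mellanox VF would look roughly like this 
(PCI address and netdev names again hypothetical):

    $ ls /sys/bus/pci/devices/0000:06:02.0/net/    # one VF, two netdevs
    ens1  ens1d1
    $ readlink -f /sys/class/net/ens1/device /sys/class/net/ens1d1/device
    /sys/devices/pci0000:00/.../0000:06:02.0
    /sys/devices/pci0000:00/.../0000:06:02.0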

In order for this to be useful, libvirt needs to set the mac address and vlan 
tag of *both* netdevs prior to starting the guest. But we have no way to 
represent that in our configuration. In the past it's been suggested that we 
just do something like this:

   <interface type='hostdev'>
     <mac address='blah'/>
     <mac2 address='blah'/>
     ...
   </interface>

but I have two problems with that:

1) <interface> is supposed to represent a single network device, but this is 
trying to make it represent 2 network devices (and what if someone else comes 
up with a card that puts *4* netdevs on the same PCI
device?)

2) We would need to do the same thing for <vlan> tag. It starts to get ugly.

Alternatively, we could add a new <port number='2'> subelement, like this:


    <interface type='hostdev'>
      <source>
        <address type='pci' slot='0x08' function='0x4'/>
      </source>
      <mac address='52:54:00:01:01:01'/>
      <vlan>
        <tag id='42'/>
      </vlan>
      <port number='2'>
        <mac address='52:54:00:01:01:01'/>
        <vlan>
          <tag id='42'/>
        </vlan>
      </port>
    </interface>

(or some variation of that) just so that all the stuff for the 2nd port is 
grouped together. But I don't like that the config for port 1 is at a different 
level in the hierarchy than the config for port 2, and we still have the 
problem that we're trying to describe *2* netdevs with a single <interface> 
element, which just feels wrong.

- OR -

what if we admit that <interface type='hostdev'> was a bad idea, and try doing 
it all with <hostdev>, something like this:

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x06' slot='0x02' function='0x0'/>
    </source>
    <netdev port='1'>
      <mac address='52:54:00:01:02:03'/>
      <vlan>
        <tag id='42'/>
      </vlan>
    </netdev>
    <netdev port='2'>
      <mac address='52:54:00:01:02:03'/>
      <vlan>
        <tag id='43'/>
      </vlan>
    </netdev>
  </hostdev>

The downsides are:

1) It's providing a 2nd way of describing single port VFs, which could confuse 
people (my recommendation would be to deprecate usage of <interface 
type='hostdev'> in the documentation, while still allowing it; i.e. we'd still 
have to maintain that code while discouraging its use).

2) This wouldn't be able to take advantage of the pools of devices maintained 
by libvirt networks. (This isn't a problem for OpenStack, since they don't use 
that anyway, but oVirt does use it.)

3) It's an explicit admission that I made a bad decision in 2011 :-P

The upsides?

1) it models the hardware more correctly. (it really is a PCI device that has 
two subordinate netdevs, *not* a netdev that is part of a PCI device, "oh and 
that PCI device also has another netdev")

2) it could be more logically and easily expanded if there were more ports, or 
if there were other types of PCI devices that had different kinds of 
device-type-specific config that needed to be setup.

3) we could eliminate "downside (2)" by enhancing the nodedevice driver to 
provide and manage more generalized pools of devices (if desired by anyone - 
OpenStack's opinion seems to be that libvirt shouldn't be doing this anyway).


So does anyone have an opinion about this? An alternate proposal? (E.g. should 
we instead just tell everyone to run their Mellanox cards in single port mode 
and ignore/avoid all this complexity?)

