On 9/4/2018 3:20 AM, Jakub Kicinski wrote:
On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:
On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:
Hi!
Hi Jakub, and sorry for the late reply, this crazily hot summer refuses to die.

Note that I replied a couple of minutes ago but it didn't get to the list,
so let's take it from this one:

I wonder if we can use phys_port_id in switchdev to group together
interfaces of a single PCI PF?  Here is the problem:

With a mix of PF and VF interfaces it gets increasingly difficult to
figure out which one corresponds to which PF.  We can identify which
*representor* is which, by means of phys_port_name and devlink
flavours.  But if the actual VF/PF interfaces are also present on the
same host, it gets confusing when one tries to identify the PF they
came from.  Generally one has to resort to matching between the PCI DBDF
of the PF and VFs or read the relevant info out of ethtool -i.
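
For illustration, a minimal sketch of the DBDF matching described above.
It assumes the usual sysfs layout, where a VF's PCI device directory
carries a physfn symlink pointing at its PF.  The function name
pf_dbdf_of_vf and the sysfs_root parameter are made up for this example
and are not taken from any existing tool:

```python
import os


def pf_dbdf_of_vf(vf_dbdf, sysfs_root="/sys"):
    """Return the PCI DBDF of the PF owning the given VF, or None.

    Relies on the physfn symlink in the VF's PCI device directory.
    sysfs_root is a hypothetical knob so the lookup can be pointed at
    a fake tree when testing.
    """
    physfn = os.path.join(sysfs_root, "bus", "pci", "devices",
                          vf_dbdf, "physfn")
    try:
        # The link target looks like "../0000:03:00.0".
        target = os.readlink(physfn)
    except OSError:
        # Not a VF, or no such device.
        return None
    return os.path.basename(target)
```

This only tells you which PF a VF hangs off by address.  It says nothing
about which netdev under that PF is the right one, which is the gap
discussed in the rest of this thread.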

In a multi-host scenario this is particularly painful, as there seems to
be no immediately obvious way to match the PCI interface ID of a card (0,
1, 2, 3, 4...) to the DBDF we have connected.

Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
from /sys/bus/pci/devices/$VF_DBDF/physfn/net/ to run the NDOs on in a
somewhat random manner, which means we have to provide those for all
devices with a link to the PF (all reprs).  And we have to link them (a)
because it's right (tm) and (b) to get correct naming.
Wait, as you commented later, not only the Mellanox VF reprs but also
the nfp VF reprs are not linked to the PF, because the ip link output
grows quadratically.
Right, correct.  If we set phys_port_id libvirt will reliably pick the
correct netdev to run NDOs on (PF/PF repr) so we can remove them from
the other netdevs and therefore limit the size of ip link show output.

The only reliable way to make
user space (libvirt) choose the repr it should run the NDOs on (which is
IMHO the corresponding PF repr) is to set phys_port_id on actual VFs,
VF reprs, PFs and PF reprs to a value corresponding to the *PCI PF*,
not the external/Ethernet port when in switchdev mode.  User space
should understand phys_port_id in this context, given it was originally
introduced for matching VFs to ports.
Using phys_port_id to match/group VFs to PFs makes sense to me.

So what would be the libvirt use case you envision that needs
the VF and PF reprs to support that as well?  Or maybe you were
not referring to libvirt but to some other provisioning element?  I need
to refresh my memory on that area.
Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
phys_port_id on the actual VF and then *a* netdev linked to physfn in
sysfs which will have the legacy NDOs.

We can't set the phys_port_id on the VF reprs because then we're back
to the problem of ip link output growing.  Perhaps we shouldn't set it
on PF repr either?

Let's make a table (assuming bare metal cloud scenario where Host0 is
controlling the network, while Host1 is the actual server):

[act - actual; rpr - representor; SN - serial number]

Today:

  dev     | host | sysfs | phys_-  | switch- | phys_-    | NDOs
          |      | link  | port_id | dev_id  | port_name |
---------------------------------------------------------------
uplink    |   0  |   PF0 |   -     | ASIC SN | p0        | PF0
act PF0   |   0  |   PF0 |   -     |   -     |  -        |  -
act VF0/0 |   0  | VF0/0 |   -     |   -     |  -        |  -
rpr PF0   |   0  |    -  |   -     | ASIC SN | pf0       |  -
rpr VF0/0 |   0  |    -  |   -     | ASIC SN | pf0vf0    |  -
act PF1   |   1  |   PF1 |   -     |   -     |  -        | PF1
act VF1/0 |   1  | VF1/0 |   -     |   -     |  -        |  -
rpr PF1   |   0  |    -  |   -     | ASIC SN | pf1       |  -
rpr VF1/0 |   0  |    -  |   -     | ASIC SN | pf1vf0    |  -

Proposed:

  dev     | host | sysfs | phys_-  | switch- | phys_-    | NDOs
          |      | link  | port_id | dev_id  | port_name |
---------------------------------------------------------------
uplink    |   0  |   PF0 |   -     | ASIC SN | p0        |  -
act PF0   |   0  |   PF0 | PF0 SN  |   -     |  -        | PF0
act VF0/0 |   0  | VF0/0 | PF0 SN  |   -     |  -        |  -
rpr PF0   |   0  |   PF0 |   -     | ASIC SN | pf0       |  -
rpr VF0/0 |   0  |   PF0 |   -     | ASIC SN | pf0vf0    |  -
act PF1   |   1  |   PF1 | PF1 SN  |   -     |  -        | PF1
act VF1/0 |   1  | VF1/0 | PF1 SN  |   -     |  -        |  -
rpr PF1   |   0  |   PF0 |   -     | ASIC SN | pf1       |  -
rpr VF1/0 |   0  |   PF0 |   -     | ASIC SN | pf1vf0    |  -

With this, libvirt on Host0 should easily find the actual PF0 netdev to
run the NDO on, if it wants to use VFs:
  - libvirt finds act VF0/0 to plug into the VF;
  - reads its phys_port_id -> "PF0 SN";
  - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
  - runs NDOs on "act PF0" for PF0's VF correctly.
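
The lookup steps above could be sketched roughly as follows.  This is
not libvirt's actual code, just an illustration under the assumption
that phys_port_id is exposed at /sys/class/net/<dev>/phys_port_id and
that the PF's netdevs are listed under the VF's device/physfn/net/
directory.  The helper names and the sysfs_root parameter are
hypothetical:

```python
import os


def read_phys_port_id(netdev, sysfs_root="/sys"):
    """Return the netdev's phys_port_id string, or None if not set.

    Reading the attribute fails when the driver does not report an
    ID, so any OSError is treated as "no ID".
    """
    path = os.path.join(sysfs_root, "class", "net", netdev, "phys_port_id")
    try:
        with open(path) as f:
            value = f.read().strip()
    except OSError:
        return None
    return value or None


def find_ndo_netdev(vf_netdev, sysfs_root="/sys"):
    """Pick the netdev linked to the VF's physfn with a matching ID."""
    vf_id = read_phys_port_id(vf_netdev, sysfs_root)
    if vf_id is None:
        return None
    physfn_net = os.path.join(sysfs_root, "class", "net", vf_netdev,
                              "device", "physfn", "net")
    try:
        candidates = sorted(os.listdir(physfn_net))
    except OSError:
        # No physfn link, so this is not a VF netdev.
        return None
    for cand in candidates:
        # Netdevs without an ID (uplink, reprs in the proposed table)
        # never match, so they are skipped automatically.
        if read_phys_port_id(cand, sysfs_root) == vf_id:
            return cand
    return None
```

With the "Proposed" table, calling this on act VF0/0 would skip the
uplink and the reprs (no phys_port_id) and land on act PF0, which is the
only netdev that has to carry the legacy NDOs.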

I think Host0 corresponds to the embedded OS on the NIC. Is this correct?
I guess in this setup, only PF0's PCI interface on Host0 is in switchdev
mode and the representors for PF0 and its VFs are created on Host0 when
they come up on Host1. I would think PF0 on Host0 acts as a Control PF
for PF1 on Host1.

Isn't the hypervisor running only on Host1?

The other problem remains unsolved - Host0 can't be sure without
vendor-specific knowledge whether it's connected to PF0 or PF1.
That's why I was thinking maybe we should provide phys_port_id
on PF representors as well.  That means we'd have to provide the
legacy NDOs on PF reprs too because libvirt may now find PF repr...
Would it be cleaner to add a new attribute?

Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?

Do you mean the legacy VF ndo ops on the PF?  I think it is possible to
configure the VFs on Host1 via the port representors except for the MAC
address.

