Re: [PATCH net-next 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.

2018-05-30 Thread Samudrala, Sridhar

On 5/30/2018 3:53 PM, Jakub Kicinski wrote:

On Wed, 30 May 2018 14:23:06 -0700, Samudrala, Sridhar wrote:

On 5/29/2018 11:33 PM, Jakub Kicinski wrote:

On Tue, 29 May 2018 23:08:11 -0700, Michael Chan wrote:

On Tue, May 29, 2018 at 10:56 PM, Jakub Kicinski wrote:

On Tue, 29 May 2018 20:19:54 -0700, Michael Chan wrote:

On Tue, May 29, 2018 at 1:46 PM, Samudrala, Sridhar wrote:

Isn't ndo_set_vf_xxx() considered a legacy interface and not planned to be
extended?

+1 it's painful to see this feature being added to the legacy
API :(  Another duplicated configuration knob.
  

I didn't know about that.
  

Shouldn't we enable this via ethtool on the port representor netdev?

We discussed this.  ethtool on the VF representor will only work in
switchdev mode, and it will not support min/max values.

Ethtool channel API may be overdue a rewrite in devlink anyway, but I
feel like implementing switchdev mode and rewriting features in devlink
may be too much to ask.

Totally agreed.  And switchdev mode doesn't seem to be that widely
used at the moment.  Do you have other suggestions besides NDO?

At some point you (Broadcom) were working on a whole bunch of devlink
configuration options for the PCIe side of the ASIC.  The number of
queues relates to things like the number of allocated MSI-X vectors,
which, if memory serves, was in your devlink patch set.  In an ideal
world we would try to keep all of those in one place :)

For PCIe config there is always the question of what can be configured
at runtime and what requires a HW reset.  Therefore that devlink API,
which could configure current as well as persistent device settings,
was quite nice.  I'm not sure if reallocating queues would ever require
a PCIe block reset, but maybe...  Certainly, the notion of min queues
would make more sense to me in a PCIe configuration devlink API than in
the ethtool channel API as well.

Queues are in a grey area between netdev and non-netdev constructs.
They make sense both from a PCIe resource allocation perspective (i.e.
devlink PCIe settings) and from a netdev perspective (ethtool), because
they feed into things like qdisc offloads, maybe per-queue stats, etc.

So yes...  IMHO it would be nice to add this to a devlink SR-IOV config
API and/or switchdev representors.  But neither of those is really an
option for you today, so IDK :)

One reason why 'switchdev' mode is not yet widely used or enabled by
default could be the requirement to program flow rules only via the
slow path.

Do you mean the fallback traffic requirement?


Yes.




Would it make sense to relax this requirement and support a mode where
port representors are created, the PF driver implements a default
policy that adds flow rules for all the VFs to enable connectivity, and
the user can add/modify the rules via the port representors?

I definitely share your concerns; stopping a major HW vendor from using
this new and preferred mode is not helping us make progress.

The problem is that if we allow this diversion, i.e. the driver
implementing some special policy or pre-populating a bridge in a
configuration that suits the HW, we may condition users to expect that
as the standard Linux behaviour.  And we will be stuck with it forever,
even though your next-gen HW (ice?) may support the correct behaviour.


Yes.  ice can support the slow-path behavior required for OVS offload.
However, I was just wondering if we should have an option to allow
switchdev without the slow path, so that the user can use alternate
mechanisms to program the flow rules instead of having to use OVS.




We should perhaps separate switchdev mode from TC flower/OvS offloads.
Is your objective to implement OvS offload or just switchdev mode?

For OvS without proper fallback behaviour you may struggle.

Switchdev mode could be within your reach even without changing the
default rules.  What if you spawned all port netdevs (I dislike the
term representor, sorry, it's confusing people) in the down state and
then refused to bring them up unless the user instantiated a bridge
that would behave in a way your HW can support?  If the ports are down
you won't have fallback traffic, so there is no problem to solve.


If we want to use the port netdev's admin state to control the link
state of the VFs, then this will not work.
We need to disable only TX/RX; admin state and link state still need to
be supported on the port netdevs.
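
As a rough illustration of the "spawn the port netdevs in the down
state" idea above, a PF driver could gate ndo_open on the device being
in a configuration it can fully offload.  This is only a sketch, not
from any posted patch; example_port, example_pf and
example_offload_ready() are hypothetical names.

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/types.h>

struct example_pf {
	bool offload_cfg_present;	/* set once an offloadable bridge exists */
};

struct example_port {
	struct example_pf *pf;
};

/* Hypothetical check: has the user set up forwarding the HW can offload? */
static bool example_offload_ready(const struct example_pf *pf)
{
	return pf->offload_cfg_present;
}

static int example_port_open(struct net_device *netdev)
{
	struct example_port *port = netdev_priv(netdev);

	/* Refuse to bring the port netdev up until an offloadable
	 * forwarding configuration exists, so no fallback (slow-path)
	 * traffic is ever required.
	 */
	if (!example_offload_ready(port->pf))
		return -EOPNOTSUPP;

	netif_carrier_on(netdev);
	netif_tx_start_all_queues(netdev);
	return 0;
}

As noted just above, this trades away using the port netdev's admin
state to control the VF link state, which is the objection raised in
the reply.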




Re: [PATCH net-next 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.

2018-05-30 Thread Jakub Kicinski
On Wed, 30 May 2018 00:18:39 -0700, Michael Chan wrote:
> On Tue, May 29, 2018 at 11:33 PM, Jakub Kicinski wrote:
> > At some point you (Broadcom) were working on a whole bunch of devlink
> > configuration options for the PCIe side of the ASIC.  The number of
> > queues relates to things like the number of allocated MSI-X vectors,
> > which, if memory serves, was in your devlink patch set.  In an ideal
> > world we would try to keep all of those in one place :)
> 
> Yeah, another colleague is now working with Mellanox on something similar.
> 
> One difference between those devlink parameters and these queue
> parameters is that the former are more permanent and global settings.
> For example, the number of VFs or the number of MSI-X vectors per VF
> are persistent settings: once set, they survive a PCIe reset.  On the
> other hand, these queue settings are pure run-time settings and may be
> unique for each VF.  They are not stored, as there is no room in NVRAM
> to store 128 or more sets of these parameters.

Indeed, I think the API must be flexible as to what is persistent and
what is not, because HW will certainly differ in that regard.  And
agreed, queues may be a bit of a stretch here, but worth a try.

> Anyway, let me discuss this with my colleague to see if there is a
> natural fit for these queue parameters in the devlink infrastructure
> that they are working on.

Thank you!


Re: [PATCH net-next 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.

2018-05-29 Thread Samudrala, Sridhar

On 5/29/2018 1:18 AM, Michael Chan wrote:

VF queue resources are always limited, and there is currently no
infrastructure to allow the admin on the host to add or reduce queue
resources for any particular VF.  With the ever-increasing number of
VFs being supported, it is desirable to allow the admin to configure
queue resources differently for the VFs.  Some VFs may require more or
fewer queues due to different bandwidth requirements or a different
number of vCPUs in the VM.  This patch adds the infrastructure to do
that by adding the IFLA_VF_QUEUES netlink attribute and a new
.ndo_set_vf_queues() callback to net_device_ops.

Four parameters are exposed for each VF:

o min_tx_queues - Guaranteed tx queues available to the VF.

o max_tx_queues - Maximum but not necessarily guaranteed tx queues
   available to the VF.

o min_rx_queues - Guaranteed rx queues available to the VF.

o max_rx_queues - Maximum but not necessarily guaranteed rx queues
   available to the VF.

The "ip link set" command will subsequently be patched to support the new
operation to set the above parameters.

After the admin makes a change to the above parameters, the
corresponding VF will have a new range of channels to set using
ethtool -L.  The VF may have to go through an IF down/up before the new
queues take effect.  Queue counts up to the min values are guaranteed;
counts up to the max values are possible but not guaranteed.

Signed-off-by: Michael Chan 
---
 include/linux/if_link.h      |  4 ++++
 include/linux/netdevice.h    |  6 ++++++
 include/uapi/linux/if_link.h |  9 +++++++++
 net/core/rtnetlink.c         | 32 +++++++++++++++++++++++++++++---
 4 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 622658d..8e81121 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -29,5 +29,9 @@ struct ifla_vf_info {
 	__u32 rss_query_en;
 	__u32 trusted;
 	__be16 vlan_proto;
+	__u32 min_tx_queues;
+	__u32 max_tx_queues;
+	__u32 min_rx_queues;
+	__u32 max_rx_queues;
 };
 #endif /* _LINUX_IF_LINK_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8452f72..17f5892 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1023,6 +1023,8 @@ struct dev_ifalias {
  *  with PF and querying it may introduce a theoretical security risk.
  * int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool setting);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ * int (*ndo_set_vf_queues)(struct net_device *dev, int vf, int min_txq,
+ * int max_txq, int min_rxq, int max_rxq);


Isn't ndo_set_vf_xxx() considered a legacy interface and not planned to be 
extended?
Shouldn't we enable this via ethtool on the port representor netdev?
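
For illustration of the semantics described in the changelog above (the
min values are guaranteed, the max values are best effort), a PF driver
implementation of the proposed callback could look roughly like the
sketch below.  This is not from the posted patch; the example_pf and
example_vf_res structures and the example_fw_set_vf_queues() helper are
hypothetical.

#include <linux/errno.h>
#include <linux/netdevice.h>

struct example_vf_res {
	int min_txq, max_txq;
	int min_rxq, max_rxq;
};

struct example_pf {
	int num_vfs;
	int hw_max_txq_per_vf, hw_max_rxq_per_vf;
	int free_txq, free_rxq;		/* unreserved queues in the shared pool */
	struct example_vf_res *vf_res;
};

/* Hypothetical firmware call applying the new range for one VF. */
static int example_fw_set_vf_queues(struct example_pf *pf, int vf)
{
	return 0;
}

static int example_ndo_set_vf_queues(struct net_device *dev, int vf,
				     int min_txq, int max_txq,
				     int min_rxq, int max_rxq)
{
	struct example_pf *pf = netdev_priv(dev);
	struct example_vf_res *res;

	if (vf < 0 || vf >= pf->num_vfs)
		return -EINVAL;
	if (min_txq > max_txq || min_rxq > max_rxq)
		return -EINVAL;
	if (max_txq > pf->hw_max_txq_per_vf || max_rxq > pf->hw_max_rxq_per_vf)
		return -ERANGE;

	res = &pf->vf_res[vf];

	/* Only the guaranteed (min) part is reserved from the shared pool;
	 * the max part is just a cap the VF may or may not reach later
	 * with "ethtool -L".
	 */
	if (pf->free_txq + res->min_txq < min_txq ||
	    pf->free_rxq + res->min_rxq < min_rxq)
		return -ENOSPC;

	pf->free_txq += res->min_txq - min_txq;
	pf->free_rxq += res->min_rxq - min_rxq;
	res->min_txq = min_txq;
	res->max_txq = max_txq;
	res->min_rxq = min_rxq;
	res->max_rxq = max_rxq;

	/* The VF typically has to go through an IF down/up and re-run
	 * "ethtool -L" before the new range takes effect.
	 */
	return example_fw_set_vf_queues(pf, vf);
}

Whether such a per-VF runtime knob should sit behind the legacy ndo,
devlink, or ethtool on the representor is exactly what is being debated
in the thread above.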