from:"Scott Feldman"

Re: [patch net-next v5 5/8] switchdev: introduce possibility to defer obj_add/del

2015-10-19 Thread Scott Feldman

On Wed, Oct 14, 2015 at 10:40 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Similar to the attr usecase, the caller knows if he is holding RTNL and is
> in atomic section. So let the called to decide the correct call variant.
>
> This allows drivers to sleep inside their ops and wait for hw to get the
> operation status. Then the status is propagated into switchdev core.
> This avoids silent errors in drivers.
>
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/switchdev.h   |   1 +
>  net/switchdev/switchdev.c | 100 
> --
>  2 files changed, 81 insertions(+), 20 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index f8672d7..bc865e2 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -69,6 +69,7 @@ enum switchdev_obj_id {
>
>  struct switchdev_obj {
> enum switchdev_obj_id id;
> +   u32 flags;
>  };
>
>  /* SWITCHDEV_OBJ_ID_PORT_VLAN */
> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index 5963d7a..eac68c4 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -362,21 +362,8 @@ static int __switchdev_port_obj_add(struct net_device 
> *dev,
> return err;
>  }
>
> -/**
> - * switchdev_port_obj_add - Add port object
> - *
> - * @dev: port device
> - * @id: object ID
> - * @obj: object to add
> - *
> - * Use a 2-phase prepare-commit transaction model to ensure
> - * system is not left in a partially updated state due to
> - * failure from driver/device.
> - *
> - * rtnl_lock must be held.
> - */
> -int switchdev_port_obj_add(struct net_device *dev,
> -  const struct switchdev_obj *obj)
> +static int switchdev_port_obj_add_now(struct net_device *dev,
> + const struct switchdev_obj *obj)
>  {
> struct switchdev_trans trans;
> int err;
> @@ -418,18 +405,53 @@ int switchdev_port_obj_add(struct net_device *dev,
>
> return err;
>  }
> -EXPORT_SYMBOL_GPL(switchdev_port_obj_add);
> +
> +static void switchdev_port_obj_add_deferred(struct net_device *dev,
> +   const void *data)
> +{
> +   const struct switchdev_obj *obj = data;
> +   int err;
> +
> +   err = switchdev_port_obj_add_now(dev, obj);
> +   if (err && err != -EOPNOTSUPP)
> +   netdev_err(dev, "failed (err=%d) to add object (id=%d)\n",
> +  err, obj->id);
> +}
> +
> +static int switchdev_port_obj_add_defer(struct net_device *dev,
> +   const struct switchdev_obj *obj)
> +{
> +   return switchdev_deferred_enqueue(dev, obj, sizeof(*obj),
> + switchdev_port_obj_add_deferred);
> +}

Hi Jiri,

I think there is a problem here with this patch.  When we defer adding
an obj, we're only copying the inner struct switchdev_obj.  The rest
of the containing obj is lost.  So when
switchdev_port_obj_add_deferred() runs, we're up casting to garbage.
So if this FDB obj was deferred, the addr, vid, and ndm_state members
would be garbage:

struct switchdev_obj_port_fdb {
struct switchdev_obj obj;
unsigned char addr[ETH_ALEN];
u16 vid;
u16 ndm_state;
};
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] Adding switchdev ageing notification on port bridged

2015-10-19 Thread Scott Feldman

On Mon, Oct 19, 2015 at 5:37 AM, Elad Raz <el...@mellanox.com> wrote:
> Configure ageing time to the HW for newly bridged device
>
> CC: Scott Feldman <sfel...@gmail.com>
> CC: Jiri Pirko <j...@resnulli.us>
> Signed-off-by: Elad Raz <el...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-16 Thread Scott Feldman

On Fri, Oct 16, 2015 at 1:23 AM, Jiri Pirko  wrote:
> Thu, Oct 15, 2015 at 05:21:22PM CEST, john.fastab...@gmail.com wrote:
>>On 15-10-14 10:40 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>> or if he has to defer the att_set processing for later.
>>>
>>> This also allows drivers to sleep inside attr_set and report operation
>>> status back to switchdev core. Switchdev core then warns if status is
>>> not ok, instead of silent errors happening in drivers.
>>>
>>> Benefit from newly introduced switchdev deferred ops infrastructure.
>>>
>>
>>A nit but the patch description should note your setting the defer bit
>>on the bridge set state.
>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   1 +
>>>  net/bridge/br_stp.c   |   3 +-
>>>  net/switchdev/switchdev.c | 108 
>>> ++
>>>  3 files changed, 46 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>> index d1c7f90..f7de6f8 100644
>>> --- a/include/net/switchdev.h
>>> +++ b/include/net/switchdev.h
>>> @@ -17,6 +17,7 @@
>>>
>>>  #define SWITCHDEV_F_NO_RECURSE  BIT(0)
>>>  #define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1)
>>> +#define SWITCHDEV_F_DEFER   BIT(2)
>>>
>>>  struct switchdev_trans_item {
>>>  struct list_head list;
>>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
>>> index db6d243de..80c34d7 100644
>>> --- a/net/bridge/br_stp.c
>>> +++ b/net/bridge/br_stp.c
>>> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned 
>>> int state)
>>>  {
>>>  struct switchdev_attr attr = {
>>>  .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>> +.flags = SWITCHDEV_F_DEFER,
>>>  .u.stp_state = state,
>>>  };
>>
>>
>>This creates a possible race (with 6/8) I think, please check!
>
> Wait. This patch does not change the previous behaviour. Patch 6 does,
> so I don't understand why you are asking here. Confusing.
>
>
>>
>>In del_nbp() we call br_stp_disable_port() to set the port state
>>to BR_STATE_DISABLE and disabling learning events. But with this
>>patch it can be deferred. Also note the STP agent may be in userspace
>>which actually seems more likely the case because you likely want to
>>run some more modern variant of STP than the kernel supports.
>>
>>So at some point in the future the driver will turn off learning. At
>>the same time we call br_fdb_delete_by_port which calls a deferred
>>set of fdb deletes.
>>
>>I don't see how you guarantee learning is off before you start doing
>>the deletes here and possibly learning new addresses after the software
>>side believes the port is down.
>>
>>So
>>
>>   br_stp_disable_port
>>   br_fdb_delete_by_port
>>   {fdb_del_external_learn}
>>   [hw learns a fdb]
>>   [hw disables learning]
>>
>>What stops this from happening?
>
> Okay. This behaviour is the same as without the patchset. What would
> resolve the issue it to put switchdev_deferred_process() after
> br_stp_disable_port() and before br_fdb_delete_by_port() call.
> That would enforce stp change to happen in hw before fdbs are explicitly
> deleted. Sound good to you?

Doesn't HW already see things in the right order since items are
dequeued from the deferred list in the order queued?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] switchdev: enforce no pvid flag in vlan ranges

2015-10-14 Thread Scott Feldman

On Wed, Oct 14, 2015 at 8:25 AM, Vivien Didelot
 wrote:
> On Oct. Wednesday 14 (42) 09:14 AM, Ido Schimmel wrote:
>> Tue, Oct 13, 2015 at 05:32:26PM IDT, vivien.dide...@savoirfairelinux.com 
>> wrote:
>> >On Oct. Tuesday 13 (42) 11:31 AM, Ido Schimmel wrote:
>> >> Mon, Oct 12, 2015 at 08:36:25PM IDT, vivien.dide...@savoirfairelinux.com 
>> >> wrote:
>> >> >Hi guys,
>> >> >
>> >> >On Oct. Monday 12 (42) 02:01 PM, Nikolay Aleksandrov wrote:
>> >> >> From: Nikolay Aleksandrov 
>> >> >>
>> >> >> We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
>> >> >>
>> >> >> Signed-off-by: Nikolay Aleksandrov 
>> >> >> ---
>> >> >>  net/switchdev/switchdev.c | 3 +++
>> >> >>  1 file changed, 3 insertions(+)
>> >> >>
>> >> >> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>> >> >> index 6e4a4f9ad927..256c596de896 100644
>> >> >> --- a/net/switchdev/switchdev.c
>> >> >> +++ b/net/switchdev/switchdev.c
>> >> >> @@ -720,6 +720,9 @@ static int switchdev_port_br_afspec(struct 
>> >> >> net_device *dev,
>> >> >> if (vlan.vid_begin)
>> >> >> return -EINVAL;
>> >> >> vlan.vid_begin = vinfo->vid;
>> >> >> +   /* don't allow range of pvids */
>> >> >> +   if (vlan.flags & BRIDGE_VLAN_INFO_PVID)
>> >> >> +   return -EINVAL;
>> >> >> } else if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END) {
>> >> >> if (!vlan.vid_begin)
>> >> >> return -EINVAL;
>> >> >> --
>> >> >> 2.4.3
>> >> >>
>> >> >
>> >> >Yes the patch looks good, but it is a minor check though. I hope the
>> >> >subject of this thread is making sense.
>> >> >
>> >> >VLAN ranges seem to have been included for an UX purpose (so commands
>> >> >look like Cisco IOS). We don't want to change any existing interface, so
>> >> >we pushed that down to drivers, with the only valid reason that, maybe
>> >> >one day, an hardware can be capable of programming a range on a per-port
>> >> >basis.
>> >> Hi,
>> >>
>> >> That's actually what we are doing in mlxsw. We can do up to 256 entries in
>> >> one go. We've yet to submit this part.
>> >
>> >Perfect Ido, thanks for pointing this out! I'm OK with the range then.
>> >
>> >So there is now a very last question in my head for this, which is more
>> >a matter of kernel design. Should the user be aware of such underlying
>> >support? In other words, would it make sense to do this in a driver:
>> >
>> >foo_port_vlan_add(struct net_device *dev,
>> >  struct switchdev_obj_port_vlan *vlan)
>> >{
>> >if (vlan->vid_begin != vlan->vid_end)
>> >return -ENOTSUPP; /* or something more relevant for user */
>> >
>> >return foo_port_single_vlan_add(dev, vlan->vid_begin);
>> >}
>> >
>> >So drivers keep being simple, and we can easily propagate the fact that
>> >one-or-all VLAN is not supportable, vs. the VLAN feature itself is not
>> >implemented and must be done in software.
>> I think that if you want to keep it simple, then Scott's advice from the
>> previous thread is the most appropriate one. I believe the hardware you
>> are using is simply not meant to support multiple 802.1Q bridges.
>
> You mean allowing only one Linux bridge over an hardware switch?
>
> It would for sure simplify how, as developers and users, we represent a
> physical switch. But I am not sure how to achieve that and I don't have
> strong opinions on this TBH.

Hi Vivien, I think it's possible to keep switch ports on just one
bridge if we do a little bit of work on the NETDEV_CHANGEUPPER
notifier.  This will give you the driver-level control you want.  Do
you have time to investigate?  The idea is:

1) In your driver's handler for NETDEV_CHANGEUPPER, if switch port is
being added to a second bridge,then return NOTIFY_BAD.  Your driver
needs to track the bridge count.

2) In __netdev_upper_dev_link(), check the return code from the
call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, ...) call, and if
NOTIFY_BAD, abort the linking operation (goto rollback_xxx).

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-14 Thread Scott Feldman

On Wed, Oct 14, 2015 at 10:40 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Caller should know if he can call attr_set directly (when holding RTNL)
> or if he has to defer the att_set processing for later.
>
> This also allows drivers to sleep inside attr_set and report operation
> status back to switchdev core. Switchdev core then warns if status is
> not ok, instead of silent errors happening in drivers.
>
> Benefit from newly introduced switchdev deferred ops infrastructure.
>
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/switchdev.h   |   1 +
>  net/bridge/br_stp.c   |   3 +-
>  net/switchdev/switchdev.c | 108 
> ++
>  3 files changed, 46 insertions(+), 66 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index d1c7f90..f7de6f8 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -17,6 +17,7 @@
>
>  #define SWITCHDEV_F_NO_RECURSE BIT(0)
>  #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1)
> +#define SWITCHDEV_F_DEFER  BIT(2)
>
>  struct switchdev_trans_item {
> struct list_head list;
> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
> index db6d243de..80c34d7 100644
> --- a/net/bridge/br_stp.c
> +++ b/net/bridge/br_stp.c
> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned int 
> state)
>  {
> struct switchdev_attr attr = {
> .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
> +   .flags = SWITCHDEV_F_DEFER,
> .u.stp_state = state,
> };
> int err;
>
> p->state = state;
> err = switchdev_port_attr_set(p->dev, );
> -   if (err && err != -EOPNOTSUPP)
> +   if (err)
> br_warn(p->br, "error setting offload STP state on port 
> %u(%s)\n",
> (unsigned int) p->port_no, p->dev->name);
>  }

Should this part of the patch be moved to patch 6/8 where
switchdev_deferred_process() is called from del_nbp()?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next v2 4/6] net: dsa: mv88e6xxx: implement port_fdb_dump

2015-10-14 Thread Scott Feldman

On Tue, Oct 13, 2015 at 9:46 AM, Vivien Didelot
<vivien.dide...@savoirfairelinux.com> wrote:
> Implement the port_fdb_dump DSA operation.
>
> Signed-off-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>

Reviewed-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 5/7] bridge: defer switchdev fdb del call in fdb_del_external_learn

2015-10-13 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:05 PM, Jiri Pirko <j...@resnulli.us> wrote:
> Tue, Oct 13, 2015 at 05:28:20AM CEST, sfel...@gmail.com wrote:
>>On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>> From: Jiri Pirko <j...@mellanox.com>
>>>
>>> Since spinlock is held here, defer the switchdev operation.
>>>
>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>> ---
>>>  net/bridge/br_fdb.c | 5 -
>>>  net/bridge/br_if.c  | 3 +++
>>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
>>> index f5e7da0..c88bd8e 100644
>>> --- a/net/bridge/br_fdb.c
>>> +++ b/net/bridge/br_fdb.c
>>> @@ -134,7 +134,10 @@ static void fdb_del_hw_addr(struct net_bridge *br, 
>>> const unsigned char *addr)
>>>  static void fdb_del_external_learn(struct net_bridge_fdb_entry *f)
>>>  {
>>> struct switchdev_obj_port_fdb fdb = {
>>> -   .obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
>>> +   .obj = {
>>> +   .id = SWITCHDEV_OBJ_ID_PORT_FDB,
>>> +   .flags = SWITCHDEV_F_DEFER,
>>> +   },
>>> .vid = f->vlan_id,
>>> };
>>>
>>> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
>>> index 934cae9..09147cb 100644
>>> --- a/net/bridge/br_if.c
>>> +++ b/net/bridge/br_if.c
>>> @@ -24,6 +24,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>
>>>  #include "br_private.h"
>>>
>>> @@ -249,6 +250,8 @@ static void del_nbp(struct net_bridge_port *p)
>>> list_del_rcu(>list);
>>>
>>> br_fdb_delete_by_port(br, p, 0, 1);
>>> +   switchdev_flush_deferred();
>>> +
>>
>>This potentially flushes other (valid) work on the deferred queue not
>>related to FDB del.
>
> flush_deferred is implemented by flush_workqueue. This enforces the
> workqueue to be processed heve and sleeps until it is done. Nothing is
> lost.

My bad, I saw "flush" and assumed everything went down the toilet.

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 6/7] rocker: remove nowait from switchdev callbacks.

2015-10-13 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:25 PM, Jiri Pirko <j...@resnulli.us> wrote:
> Tue, Oct 13, 2015 at 06:02:28AM CEST, sfel...@gmail.com wrote:
>>On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>> From: Jiri Pirko <j...@mellanox.com>
>>>
>>> No need to avoid sleeping in switchdev callbacks now, as the switchdev
>>> core allows it.
>>>
>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>> ---
>>>  drivers/net/ethernet/rocker/rocker.c | 7 +++
>>>  1 file changed, 3 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/rocker/rocker.c 
>>> b/drivers/net/ethernet/rocker/rocker.c
>>> index bb956a5..9629c5b5 100644
>>> --- a/drivers/net/ethernet/rocker/rocker.c
>>> +++ b/drivers/net/ethernet/rocker/rocker.c
>>> @@ -3672,7 +3672,7 @@ static int rocker_port_fdb_flush(struct rocker_port 
>>> *rocker_port,
>>> rocker_port->stp_state == BR_STATE_FORWARDING)
>>> return 0;
>>>
>>> -   flags |= ROCKER_OP_FLAG_REMOVE;
>>> +   flags |= ROCKER_OP_FLAG_NOWAIT | ROCKER_OP_FLAG_REMOVE;
>>
>>I understand the two changes below where you're removing NOWAIT, but
>>here you're adding NOWAIT which I'm not sure how that is related to
>>the switchdev core changes.  Is this two patches?
>
> I removed ROCKER_OP_FLAG_NOWAIT from attr_set. But here in
> rocker_port_fdb_flush, which is called from attr_set, we call
> rocker_port_fdb_learn with spin lock. Therefore I had to put
> ROCKER_OP_FLAG_NOWAIT here. Before ROCKER_OP_FLAG_NOWAIT removal from
> attr_set this was there already.

Gotcha.

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 2/7] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-13 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:21 PM, Jiri Pirko  wrote:
> Tue, Oct 13, 2015 at 08:03:46AM CEST, john.fastab...@gmail.com wrote:
>>On 15-10-12 10:44 PM, Jiri Pirko wrote:
>>> Tue, Oct 13, 2015 at 04:52:42AM CEST, sfel...@gmail.com wrote:
 On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Caller should know if he can call attr_set directly (when holding RTNL)
> or if he has to defer the att_set processing for later.
>
> This also allows drivers to sleep inside attr_set and report operation
> status back to switchdev core. Switchdev core then warns if status is
> not ok, instead of silent errors happening in drivers.
>
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/switchdev.h   |   1 +
>  net/bridge/br_stp.c   |   3 +-
>  net/switchdev/switchdev.c | 107 
> --
>  3 files changed, 59 insertions(+), 52 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index d2879f2..6b109e4 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -17,6 +17,7 @@
>
>  #define SWITCHDEV_F_NO_RECURSE BIT(0)
>  #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1)
> +#define SWITCHDEV_F_DEFER  BIT(2)
>
>  struct switchdev_trans_item {
> struct list_head list;
> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
> index db6d243de..80c34d7 100644
> --- a/net/bridge/br_stp.c
> +++ b/net/bridge/br_stp.c
> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned 
> int state)
>  {
> struct switchdev_attr attr = {
> .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
> +   .flags = SWITCHDEV_F_DEFER,
> .u.stp_state = state,
> };
> int err;
>
> p->state = state;
> err = switchdev_port_attr_set(p->dev, );
> -   if (err && err != -EOPNOTSUPP)
> +   if (err)

 This looks like a problem as now all other non-switchdev ports will
 get an WARN in the log when STP state changes.  We should only WARN if
 there was an err and the err is not -EOPNOTSUPP.
>>>
>>> If SWITCHDEV_F_DEFER flag is set, there's only 0 of -ENOMEM.
>>>
>>>

> br_warn(p->br, "error setting offload STP state on port 
> %u(%s)\n",
> (unsigned int) p->port_no, p->dev->name);
>  }

 

>  struct switchdev_attr_set_work {
> struct work_struct work;
> struct net_device *dev;
> @@ -183,14 +226,17 @@ static void switchdev_port_attr_set_work(struct 
> work_struct *work)
>  {
> struct switchdev_attr_set_work *asw =
> container_of(work, struct switchdev_attr_set_work, work);
> +   bool rtnl_locked = rtnl_is_locked();
> int err;
>
> -   rtnl_lock();
> -   err = switchdev_port_attr_set(asw->dev, >attr);
> +   if (!rtnl_locked)
> +   rtnl_lock();

 I'm not following this change.  If someone else has rtnl_lock, we'll
 not wait to grab it here ourselves, and proceed as if we have the
 lock.  But what if that someone else releases the lock in the middle
 of us doing switchdev_port_attr_set_now?  Seems we want to
 unconditionally wait and grab the lock.  We need to block anything
from moving while we do the attr set.
>>>
>>> Why would someone we call (driver) return the lock? In that case, he is
>>> buggy and should be fixed.
>>>
>>> This hunk only ensures we have rtnl_lock. If not, we take it here. We do
>>> not take it unconditionally because we may already have it, for example
>>> if caller of switchdev_flush_deferred holds rtnl_lock.
>>>
>>
>>This is where you lost me. How do you know another core doesn't happen
>>to have the lock when you hit this code path? Maybe someone is running
>>an ethtool command on another core or something.
>
> You are right. The same problem exists currently in switchdev_port_attr_set.

You are right as in you'll change this back to unconditional grabbing
of rtnl_lock?  I don't follow how this problem currently exists as
current code does an unconditional grab of rtnl_lock.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 4/4] bridge: vlan: combine (br|nbp)_vlan_flush into one

2015-10-13 Thread Scott Feldman

On Mon, Oct 12, 2015 at 10:51 AM, Ido Schimmel  wrote:
> Mon, Oct 12, 2015 at 02:41:09PM IDT, ra...@blackwall.org wrote:
>>From: Nikolay Aleksandrov 
>>
>>As Ido Schimmel pointed out the vlan_vid_del() loop in nbp_vlan_flush is
>>unnecessary (and is actually a remnant of the old vlan code) so we can
>>remove it and combine both br/nbp vlan_flush functions into one.
> Just a small note to Scott and Vivien:
>
> One of the side effects of Nik's recent patchsets is that when VLANs are
> flushed on a port the deletion is propagated to the driver via
> switchdev ops, as __vlan_vid_del is called.
>
> Therefore there is no need to do internal bookkeeping and remove VLANs
> yourself when port is removed from bridge.

Thanks for the heads-up.  Seems like a nice result.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 2/7] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-13 Thread Scott Feldman

On Tue, Oct 13, 2015 at 12:30 AM, Jiri Pirko  wrote:
> Tue, Oct 13, 2015 at 08:53:45AM CEST, sfel...@gmail.com wrote:
>>On Mon, Oct 12, 2015 at 11:21 PM, Jiri Pirko  wrote:
>>> Tue, Oct 13, 2015 at 08:03:46AM CEST, john.fastab...@gmail.com wrote:
On 15-10-12 10:44 PM, Jiri Pirko wrote:
> Tue, Oct 13, 2015 at 04:52:42AM CEST, sfel...@gmail.com wrote:
>> On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>> or if he has to defer the att_set processing for later.
>>>
>>> This also allows drivers to sleep inside attr_set and report operation
>>> status back to switchdev core. Switchdev core then warns if status is
>>> not ok, instead of silent errors happening in drivers.
>>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   1 +
>>>  net/bridge/br_stp.c   |   3 +-
>>>  net/switchdev/switchdev.c | 107 
>>> --
>>>  3 files changed, 59 insertions(+), 52 deletions(-)
>>>
>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>> index d2879f2..6b109e4 100644
>>> --- a/include/net/switchdev.h
>>> +++ b/include/net/switchdev.h
>>> @@ -17,6 +17,7 @@
>>>
>>>  #define SWITCHDEV_F_NO_RECURSE BIT(0)
>>>  #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1)
>>> +#define SWITCHDEV_F_DEFER  BIT(2)
>>>
>>>  struct switchdev_trans_item {
>>> struct list_head list;
>>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
>>> index db6d243de..80c34d7 100644
>>> --- a/net/bridge/br_stp.c
>>> +++ b/net/bridge/br_stp.c
>>> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, 
>>> unsigned int state)
>>>  {
>>> struct switchdev_attr attr = {
>>> .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>> +   .flags = SWITCHDEV_F_DEFER,
>>> .u.stp_state = state,
>>> };
>>> int err;
>>>
>>> p->state = state;
>>> err = switchdev_port_attr_set(p->dev, );
>>> -   if (err && err != -EOPNOTSUPP)
>>> +   if (err)
>>
>> This looks like a problem as now all other non-switchdev ports will
>> get an WARN in the log when STP state changes.  We should only WARN if
>> there was an err and the err is not -EOPNOTSUPP.
>
> If SWITCHDEV_F_DEFER flag is set, there's only 0 of -ENOMEM.
>
>
>>
>>> br_warn(p->br, "error setting offload STP state on port 
>>> %u(%s)\n",
>>> (unsigned int) p->port_no, 
>>> p->dev->name);
>>>  }
>>
>> 
>>
>>>  struct switchdev_attr_set_work {
>>> struct work_struct work;
>>> struct net_device *dev;
>>> @@ -183,14 +226,17 @@ static void switchdev_port_attr_set_work(struct 
>>> work_struct *work)
>>>  {
>>> struct switchdev_attr_set_work *asw =
>>> container_of(work, struct switchdev_attr_set_work, 
>>> work);
>>> +   bool rtnl_locked = rtnl_is_locked();
>>> int err;
>>>
>>> -   rtnl_lock();
>>> -   err = switchdev_port_attr_set(asw->dev, >attr);
>>> +   if (!rtnl_locked)
>>> +   rtnl_lock();
>>
>> I'm not following this change.  If someone else has rtnl_lock, we'll
>> not wait to grab it here ourselves, and proceed as if we have the
>> lock.  But what if that someone else releases the lock in the middle
>> of us doing switchdev_port_attr_set_now?  Seems we want to
>> unconditionally wait and grab the lock.  We need to block anything
>>from moving while we do the attr set.
>
> Why would someone we call (driver) return the lock? In that case, he is
> buggy and should be fixed.
>
> This hunk only ensures we have rtnl_lock. If not, we take it here. We do
> not take it unconditionally because we may already have it, for example
> if caller of switchdev_flush_deferred holds rtnl_lock.
>

This is where you lost me. How do you know another core doesn't happen
to have the lock when you hit this code path? Maybe someone is running
an ethtool command on another core or something.
>>>
>>> You are right. The same problem exists currently in switchdev_port_attr_set.
>>
>>You are right as in you'll change this back to unconditional grabbing
>>of rtnl_lock?  I don't follow how this problem currently exists as
>>current code does an unconditional grab of rtnl_lock.
>
>   cpu1  cpu2
> rtnl_lock()
> switchdev_port_attr_set
>   !rtnl_is_locked() == false
>   switchdev_trans_init
>

Re: [PATCH net-next] bridge: fix gc_timer mod/del race condition

2015-10-13 Thread Scott Feldman

d8a1 ("bridge: push bridge setting ageing_time down to 
> switchdev")
> Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
> ---
>  net/bridge/br_sysfs_br.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
> index 04ef1926ee7e..8365bd53c421 100644
> --- a/net/bridge/br_sysfs_br.c
> +++ b/net/bridge/br_sysfs_br.c
> @@ -102,7 +102,15 @@ static ssize_t ageing_time_show(struct device *d,
>
>  static int set_ageing_time(struct net_bridge *br, unsigned long val)
>  {
> -   return br_set_ageing_time(br, val);
> +   int ret;
> +
> +   if (!rtnl_trylock())
> +   return restart_syscall();
> +
> +   ret = br_set_ageing_time(br, val);
> +   rtnl_unlock();
> +
> +   return ret;
>  }

Looks good, thanks Nikolay.  The other option would have been to
simply not restart gc_timer in br_set_ageing_time(), but this would
make changes to ageing_time "sluggish" in that it might take a while
for the fdb cleanup algo to use the new ageing_time value.  By kicking
the timer when the ageing_time changes, fdb cleanup immediately
switches to the new schedule.  In any case,

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] switchdev: enforce no pvid flag in vlan ranges

2015-10-13 Thread Scott Feldman

On Mon, Oct 12, 2015 at 10:36 AM, Vivien Didelot
 wrote:
> Hi guys,
>
> On Oct. Monday 12 (42) 02:01 PM, Nikolay Aleksandrov wrote:
>> From: Nikolay Aleksandrov 
>>
>> We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
>>
>> Signed-off-by: Nikolay Aleksandrov 
>> ---
>>  net/switchdev/switchdev.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>> index 6e4a4f9ad927..256c596de896 100644
>> --- a/net/switchdev/switchdev.c
>> +++ b/net/switchdev/switchdev.c
>> @@ -720,6 +720,9 @@ static int switchdev_port_br_afspec(struct net_device 
>> *dev,
>>   if (vlan.vid_begin)
>>   return -EINVAL;
>>   vlan.vid_begin = vinfo->vid;
>> + /* don't allow range of pvids */
>> + if (vlan.flags & BRIDGE_VLAN_INFO_PVID)
>> + return -EINVAL;
>>   } else if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END) {
>>   if (!vlan.vid_begin)
>>   return -EINVAL;
>> --
>> 2.4.3
>>
>
> Yes the patch looks good, but it is a minor check though. I hope the
> subject of this thread is making sense.
>
> VLAN ranges seem to have been included for an UX purpose (so commands
> look like Cisco IOS). We don't want to change any existing interface, so
> we pushed that down to drivers, with the only valid reason that, maybe
> one day, an hardware can be capable of programming a range on a per-port
> basis.
>
> So what happens is that we'll add some code to fix and check non-sense
> (e.g. range + PVID) in switchdev, bridge, and I'm sure we are missing
> other spots.
>
> Sorry for being insistent, but this still doesn't look right to me.
>
> It seems like we are bloating bridge, switchdev and drivers for the only
> reason to maintain a kernel support for something like:
>
> # for i in $(seq 100 4000); do bridge vlan add vid $i dev swp0; done

[Adding Roopa]

(Roopa, correct me if I'm wrong).

Ref: commit # bdced7ef

IIRC, vlan ranges were introduced to reduce the volume of netlink
traffic when doing a command like your's above.  Your command would
create 3901 netlink msgs with a single IFLA_BRIDGE_VLAN_INFO, sent to
the kernel and then echoed back to user-space.  With vlan ranges, the
cmd "bridge vlan add vid 100-4000 dev swp0" would result in one
netlink msg with one IFLA_BRIDGE_VLAN_INFO.  Seems there were problems
with a getlink dump over-running the netlink msg (not using
NLM_MULTI)?   Maybe Roopa can expand on the problem that was solved.

So the bridge command, and everything below it, has this feature.  We
can't undo it...it's in the wild, so to speak  And, as mentioned
before, it's and all-or-nothing command, and the underlying switchdev
or driver code needs to deal with overlapping vlan maps.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 2/7] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Caller should know if he can call attr_set directly (when holding RTNL)
> or if he has to defer the att_set processing for later.
>
> This also allows drivers to sleep inside attr_set and report operation
> status back to switchdev core. Switchdev core then warns if status is
> not ok, instead of silent errors happening in drivers.
>
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/switchdev.h   |   1 +
>  net/bridge/br_stp.c   |   3 +-
>  net/switchdev/switchdev.c | 107 
> --
>  3 files changed, 59 insertions(+), 52 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index d2879f2..6b109e4 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -17,6 +17,7 @@
>
>  #define SWITCHDEV_F_NO_RECURSE BIT(0)
>  #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1)
> +#define SWITCHDEV_F_DEFER  BIT(2)
>
>  struct switchdev_trans_item {
> struct list_head list;
> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
> index db6d243de..80c34d7 100644
> --- a/net/bridge/br_stp.c
> +++ b/net/bridge/br_stp.c
> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned int 
> state)
>  {
> struct switchdev_attr attr = {
> .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
> +   .flags = SWITCHDEV_F_DEFER,
> .u.stp_state = state,
> };
> int err;
>
> p->state = state;
> err = switchdev_port_attr_set(p->dev, );
> -   if (err && err != -EOPNOTSUPP)
> +   if (err)

This looks like a problem as now all other non-switchdev ports will
get an WARN in the log when STP state changes.  We should only WARN if
there was an err and the err is not -EOPNOTSUPP.

> br_warn(p->br, "error setting offload STP state on port 
> %u(%s)\n",
> (unsigned int) p->port_no, p->dev->name);
>  }



>  struct switchdev_attr_set_work {
> struct work_struct work;
> struct net_device *dev;
> @@ -183,14 +226,17 @@ static void switchdev_port_attr_set_work(struct 
> work_struct *work)
>  {
> struct switchdev_attr_set_work *asw =
> container_of(work, struct switchdev_attr_set_work, work);
> +   bool rtnl_locked = rtnl_is_locked();
> int err;
>
> -   rtnl_lock();
> -   err = switchdev_port_attr_set(asw->dev, >attr);
> +   if (!rtnl_locked)
> +   rtnl_lock();

I'm not following this change.  If someone else has rtnl_lock, we'll
not wait to grab it here ourselves, and proceed as if we have the
lock.  But what if that someone else releases the lock in the middle
of us doing switchdev_port_attr_set_now?  Seems we want to
unconditionally wait and grab the lock.  We need to block anything
from moving while we do the attr set.

> +   err = switchdev_port_attr_set_now(asw->dev, >attr);
> if (err && err != -EOPNOTSUPP)
> netdev_err(asw->dev, "failed (err=%d) to set attribute 
> (id=%d)\n",
>err, asw->attr.id);
> -   rtnl_unlock();
> +   if (!rtnl_locked)
> +   rtnl_unlock();
>
> dev_put(asw->dev);
> kfree(work);
> @@ -211,7 +257,7 @@ static int switchdev_port_attr_set_defer(struct 
> net_device *dev,
> asw->dev = dev;
> memcpy(>attr, attr, sizeof(asw->attr));
>
> -   schedule_work(>work);
> +   queue_work(switchdev_wq, >work);
>
> return 0;
>  }
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 1/7] switchdev: introduce switchdev workqueue

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 10:54 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> This is going to be used for deferred operations.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 3/7] switchdev: remove pointers from switchdev objects

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> When object is used in deferred work, we cannot use pointers in
> switchdev object structures because the memory they point at may be already
> used by someone else. So rather do local copy of the value.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 5/7] bridge: defer switchdev fdb del call in fdb_del_external_learn

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Since spinlock is held here, defer the switchdev operation.
>
> Signed-off-by: Jiri Pirko 
> ---
>  net/bridge/br_fdb.c | 5 -
>  net/bridge/br_if.c  | 3 +++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index f5e7da0..c88bd8e 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -134,7 +134,10 @@ static void fdb_del_hw_addr(struct net_bridge *br, const 
> unsigned char *addr)
>  static void fdb_del_external_learn(struct net_bridge_fdb_entry *f)
>  {
> struct switchdev_obj_port_fdb fdb = {
> -   .obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
> +   .obj = {
> +   .id = SWITCHDEV_OBJ_ID_PORT_FDB,
> +   .flags = SWITCHDEV_F_DEFER,
> +   },
> .vid = f->vlan_id,
> };
>
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index 934cae9..09147cb 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "br_private.h"
>
> @@ -249,6 +250,8 @@ static void del_nbp(struct net_bridge_port *p)
> list_del_rcu(>list);
>
> br_fdb_delete_by_port(br, p, 0, 1);
> +   switchdev_flush_deferred();
> +

This potentially flushes other (valid) work on the deferred queue not
related to FDB del.

I wonder if this flush step is necessary at all?  The work we deferred
to delete the FDB entry can still happen after the port has been
removed (del_nbp).  If the port driver/device find the FDB entry, then
delete it, otherwise ignore it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net] switchdev: check if the vlan id is in the proper vlan range

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 5:31 AM, Nikolay Aleksandrov
<ra...@blackwall.org> wrote:
> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>
> VLANs 0 and 4095 are reserved and shouldn't be used, add checks to
> switchdev similar to the bridge. Also make sure ids above 4095 cannot
> be passed either.
>
> Fixes: 47f8328bb1a4 ("switchdev: add new switchdev bridge setlink")
> Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 4/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Similar to the attr usecase, the caller knows if he is holding RTNL and is
> in atomic section. So let the called to decide the correct call variant.
>
> This allows drivers to sleep inside their ops and wait for hw to get the
> operation status. Then the status is propagated into switchdev core.
> This avoids silent errors in drivers.
>
> Signed-off-by: Jiri Pirko 



> +static void switchdev_port_obj_work(struct work_struct *work)
> +{
> +   struct switchdev_obj_work *ow =
> +   container_of(work, struct switchdev_obj_work, work);
> +   bool rtnl_locked = rtnl_is_locked();
> +   int err;
> +
> +   if (!rtnl_locked)
> +   rtnl_lock();

Same comment as on patch 2/7 about not unconditionally grabbing rtnl_lock.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 6/7] rocker: remove nowait from switchdev callbacks.

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> No need to avoid sleeping in switchdev callbacks now, as the switchdev
> core allows it.
>
> Signed-off-by: Jiri Pirko 
> ---
>  drivers/net/ethernet/rocker/rocker.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/rocker/rocker.c 
> b/drivers/net/ethernet/rocker/rocker.c
> index bb956a5..9629c5b5 100644
> --- a/drivers/net/ethernet/rocker/rocker.c
> +++ b/drivers/net/ethernet/rocker/rocker.c
> @@ -3672,7 +3672,7 @@ static int rocker_port_fdb_flush(struct rocker_port 
> *rocker_port,
> rocker_port->stp_state == BR_STATE_FORWARDING)
> return 0;
>
> -   flags |= ROCKER_OP_FLAG_REMOVE;
> +   flags |= ROCKER_OP_FLAG_NOWAIT | ROCKER_OP_FLAG_REMOVE;

I understand the two changes below where you're removing NOWAIT, but
here you're adding NOWAIT which I'm not sure how that is related to
the switchdev core changes.  Is this two patches?


> spin_lock_irqsave(>fdb_tbl_lock, lock_flags);
>
> @@ -4382,8 +4382,7 @@ static int rocker_port_attr_set(struct net_device *dev,
>
> switch (attr->id) {
> case SWITCHDEV_ATTR_ID_PORT_STP_STATE:
> -   err = rocker_port_stp_update(rocker_port, trans,
> -ROCKER_OP_FLAG_NOWAIT,
> +   err = rocker_port_stp_update(rocker_port, trans, 0,
>  attr->u.stp_state);
> break;
> case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS:
> @@ -4517,7 +4516,7 @@ static int rocker_port_fdb_del(struct rocker_port 
> *rocker_port,
>const struct switchdev_obj_port_fdb *fdb)
>  {
> __be16 vlan_id = rocker_port_vid_to_vlan(rocker_port, fdb->vid, NULL);
> -   int flags = ROCKER_OP_FLAG_NOWAIT | ROCKER_OP_FLAG_REMOVE;
> +   int flags = ROCKER_OP_FLAG_REMOVE;
>
> if (!rocker_port_is_bridged(rocker_port))
> return -EINVAL;
> --
> 1.9.3
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v4 5/7] bridge: defer switchdev fdb del call in fdb_del_external_learn

2015-10-12 Thread Scott Feldman

On Mon, Oct 12, 2015 at 8:31 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 15-10-12 08:28 PM, Scott Feldman wrote:
>> On Mon, Oct 12, 2015 at 11:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>> From: Jiri Pirko <j...@mellanox.com>
>>>
>>> Since spinlock is held here, defer the switchdev operation.
>>>
>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>> ---
>>>  net/bridge/br_fdb.c | 5 -
>>>  net/bridge/br_if.c  | 3 +++
>>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
>>> index f5e7da0..c88bd8e 100644
>>> --- a/net/bridge/br_fdb.c
>>> +++ b/net/bridge/br_fdb.c
>>> @@ -134,7 +134,10 @@ static void fdb_del_hw_addr(struct net_bridge *br, 
>>> const unsigned char *addr)
>>>  static void fdb_del_external_learn(struct net_bridge_fdb_entry *f)
>>>  {
>>> struct switchdev_obj_port_fdb fdb = {
>>> -   .obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
>>> +   .obj = {
>>> +   .id = SWITCHDEV_OBJ_ID_PORT_FDB,
>>> +   .flags = SWITCHDEV_F_DEFER,
>>> +   },
>>> .vid = f->vlan_id,
>>> };
>>>
>>> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
>>> index 934cae9..09147cb 100644
>>> --- a/net/bridge/br_if.c
>>> +++ b/net/bridge/br_if.c
>>> @@ -24,6 +24,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>
>>>  #include "br_private.h"
>>>
>>> @@ -249,6 +250,8 @@ static void del_nbp(struct net_bridge_port *p)
>>> list_del_rcu(>list);
>>>
>>> br_fdb_delete_by_port(br, p, 0, 1);
>>> +   switchdev_flush_deferred();
>>> +
>>
>> This potentially flushes other (valid) work on the deferred queue not
>> related to FDB del.
>>
>> I wonder if this flush step is necessary at all?  The work we deferred
>> to delete the FDB entry can still happen after the port has been
>> removed (del_nbp).  If the port driver/device find the FDB entry, then
>> delete it, otherwise ignore it.
>>
>
> Just the first thing that springs to mind reading this comment is,
>
>   - del gets deffered
>   - add fdb
>   - del runs
>
> Is there an issue here? Sorry I'll do a more thorough review now just
> thought I would toss it out there before I forget.

It's a valid thought to consider, for sure.  The context is these are
only FDB entries added by an external learn event.  So I believe in
your sequence, the second step to add fdb entry wouldn't happen as the
fdb entry already exists at that point (in other words, the entry has
already been learned on external device and pushed up via notifier to
bridge).  So I think we're OK in regards to your question.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-12 Thread Scott Feldman

On Sat, Oct 10, 2015 at 8:56 AM, Vivien Didelot
 wrote:

> Scott, didn't you have a plan to add a struct device for the parent of
> switchdev ports?

I had sent out a rough RFC for a switch device in the last window.  I
have continued working on it, and I plan to send it very soon,
probably again as an RFC.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-11 Thread Scott Feldman

On Sat, Oct 10, 2015 at 12:04 AM, Jiri Pirko <j...@resnulli.us> wrote:
> Sat, Oct 10, 2015 at 04:53:52AM CEST, sfel...@gmail.com wrote:
>>On Thu, Oct 8, 2015 at 9:38 PM, Premkumar Jonnala <pjonn...@broadcom.com> 
>>wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: sfel...@gmail.com [mailto:sfel...@gmail.com]
>>>> Sent: Friday, October 09, 2015 7:53 AM
>>>> To: netdev@vger.kernel.org
>>>> Cc: da...@davemloft.net; j...@resnulli.us; siva.mannem@gmail.com;
>>>> Premkumar Jonnala; step...@networkplumber.org;
>>>> ro...@cumulusnetworks.com; and...@lunn.ch; f.faine...@gmail.com;
>>>> vivien.dide...@savoirfairelinux.com
>>>> Subject: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time 
>>>> down
>>>> to switchdev
>>>>
>>>> From: Scott Feldman <sfel...@gmail.com>
>>>>
>>>> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
>>>> support setting ageing_time (or setting bridge attrs in general).
>>>>
>>>> If push fails, don't update ageing_time in bridge and return err to user.
>>>>
>>>> If push succeeds, update ageing_time in bridge and run gc_timer now to
>>>> recalabrate when to run gc_timer next, based on new ageing_time.
>>>>
>>>> Signed-off-by: Scott Feldman <sfel...@gmail.com>
>>>> Signed-off-by: Jiri Pirko <j...@resnulli.us>
>>
>>
>>
>>>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
>>>> +{
>>>> + struct switchdev_attr attr = {
>>>> + .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
>>>> + .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
>>>> + .u.ageing_time = ageing_time,
>>>> + };
>>>> + unsigned long t = clock_t_to_jiffies(ageing_time);
>>>> + int err;
>>>> +
>>>> + if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
>>>> + return -ERANGE;
>>>> +
>>>> + err = switchdev_port_attr_set(br->dev, );
>>>
>>> A thought - given that the ageing time is not a per-bridge-port attr, why 
>>> are we using a "port based api"
>>> to pass the attribute down?  May be I'm missing something here?
>>
>>I think Florian raised the same point earlier.  Sigh, I think this
>>should be addressedv4 coming soon...thanks guys for keeping the
>>standard high.
>
> Scott, can you tell us how do you want to address this? I like the
> current implementation.

I like it also, but there is some awkwardness in calling
switchdev_port_attr_set() with the first argument being the bridge
netdev, not a port netdev.  But, the basic algorithm to recurse from
_this_ netdev down to its lower netdevs works great in this case; it's
just the name "switchdev_port_attr_set" implies a port netdev for
top-level netdev.  So I was thinking about adding another call,
something like "switchdev_master_attr_set", which basically just does
the same thing as switchdev_port_attr_set, except maybe skips the
check to call the ops->switchdev_port_attr_add on the top-level
netdev.  But now I don't like that idea so much as "master" would be
confusing when your passing a bond netdev (which is also a master),
but the bond _is_ the port netdev this time, a port on the bridge.

So let's scrap v4 and go with v3.  I think I can live with this naming
awkwardness, given that we got something for essentially free by using
switchdev_port_attr_set() in a new way.

Davem, I think we're OK going with v3.

(There is a follow-on discussion about a switch device, which we'll
continue but it shouldn't block this v3 version).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: switchdev and VLAN ranges

2015-10-11 Thread Scott Feldman

On Sun, Oct 11, 2015 at 5:13 PM, Nikolay Aleksandrov
 wrote:
> On 10/12/2015 12:41 AM, Vivien Didelot wrote:
>> On Oct. Sunday 11 (41) 09:12 AM, Jiri Pirko wrote:
>>> Sat, Oct 10, 2015 at 12:36:26PM CEST, niko...@cumulusnetworks.com wrote:
 On 10/10/2015 09:49 AM, Elad Raz wrote:
>
>> On Oct 10, 2015, at 2:30 AM, Vivien Didelot 
>>  wrote:
>>
>> I have two concerns in mind:
>>
>> a) if we imagine that drivers like Rocker allocate memory in the prepare
>> phase for each VID, preparing a range like 100-4000 would definitely not
>> be recommended.
>>
>> b) imagine that you have two Linux bridges on a switch, one using the
>> hardware VLAN 100. If you request the VLAN range 99-101 for the other
>> bridge members, it is not possible for the driver to say "I can
>> accelerate VLAN 99 and 101, but not 100". It must return OPNOTSUPP for
>> the whole range.
>
> Another concern I have with vid_being..vid_end range is the “flags”. 
> Where flags can be BRIDGE_VLAN_INFO_PVID.
> There is no sense having more than one VLAN as a PVID.
> This leave the HW vendor the choice which VLAN id they will use as the 
> PVID.
>

 iproute2 doesn't allow to do it but I can see that someone can actually 
 make it
 so the flags for the range have it and it doesn't look correct. Perhaps we 
 need
 something like the patch below to enforce this from kernel-side.


 diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
 index d78b4429505a..02b17b53e9a6 100644
 --- a/net/bridge/br_netlink.c
 +++ b/net/bridge/br_netlink.c
 @@ -524,6 +524,9 @@ static int br_afspec(struct net_bridge *br,
 if (vinfo_start)
 return -EINVAL;
 vinfo_start = vinfo;
 +   /* don't allow range of pvids */
 +   if (vinfo_start->flags & BRIDGE_VLAN_INFO_PVID)
 +   return -EINVAL;
 continue;
 }

>>>
>>> Looks correct to me. Could you please submit this properly? Thanks!
>>
>> The above patch is correct, but we only solve part of the problem, since
>> the range and bridge flags are exposed by switchdev_obj_port_vlan as is.
>>
>> Thanks,
>> -v
>>
>
> Yes, the above fixes the bridge side. About the switchdev side it seems like 
> it's
> up to the switchdev driver to do the right thing in its switchdev_ops. I took 
> a
> quick look at DSA and it seems correct, the flag isn't saved and on dump 
> request
> the flags are generated so it shouldn't be possible to export multiple pvids.
> But switchdev_port_br_afspec() seems problematic, in fact I don't even see a 
> vlan
> id check, i.e. ==0 || >= VLAN_N_MASK.
> Of course, I might be totally off point as I'm not that familiar with 
> switchdev and
> it's very late. :-)
> But maybe it needs something like:
>
> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index 6e4a4f9ad927..3dd52a53867f 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -716,10 +717,14 @@ static int switchdev_port_br_afspec(struct net_device 
> *dev,
> return -EINVAL;
> vinfo = nla_data(attr);
> vlan.flags = vinfo->flags;
> +   if (!vinfo->vid || vinfo->vid >= VLAN_VID_MASK)
> +return -EINVAL;
> if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) {
> if (vlan.vid_begin)
> return -EINVAL;
> vlan.vid_begin = vinfo->vid;
> +   if (vlan.flags & BRIDGE_VLAN_INFO_PVID)
> +   return -EINVAL;
> } else if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END) {
> if (!vlan.vid_begin)
> return -EINVAL;

This (and you other patch) seem right to me, if we're going to block
setting PVID when specifying a vlan range.  Would you mind combining
and resending both patches as one as a proper patch?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next] bridge: try switchdev op first in __vlan_vid_add/del

2015-10-11 Thread Scott Feldman

On Sat, Oct 10, 2015 at 9:03 AM, Vivien Didelot
<vivien.dide...@savoirfairelinux.com> wrote:
> On Oct. Friday 09 (41) 08:20 PM, Scott Feldman wrote:
>> On Fri, Oct 9, 2015 at 3:44 PM, Vivien Didelot
>> <vivien.dide...@savoirfairelinux.com> wrote:
>> > Hi Jiri,
>> >
>> > On Oct. Friday 09 (41) 01:54 PM, Jiri Pirko wrote:
>> >> From: Jiri Pirko <j...@mellanox.com>
>> >>
>> >> Some drivers need to implement both switchdev vlan ops and
>> >> vid_add/kill ndos. For that to work in bridge code, we need to try
>> >> switchdev op first when adding/deleting vlan id.
>> >
>> > Just curious, when would a driver need to have both operations?
>>
>> Ya, I was kind of curious of that myself. Is this because the driver
>> wants to support standalone VLANs using 802.1q module and vconfig, as
>> well as bridge vlans?  With the vlan support built into the bridge,
>> I've been working under the assumption that 802.1q module (and
>> vconfig) aren't needed, and vlans for a bridged and non-bridge port
>> can be managed using the "bridge" iproute2 cmd.
>>
>> > I kinda have the same question regarding ndo_fdb_{add,del} and the
>> > bridge_{get,set}link equivalent, which is a bit confusing to me.
>>
>> I had to look back at my commit 7f109539 to remind myself about the
>> vid_add/kill ndos and bridge_{get,set}link usage.   Maybe that
>> write-up helps?  I'm not following you on the ndo_fdb_add/del part of
>> your question.
>
> Sorry I wasn't clear. What is confusing to me for FDB ops is that we
> define net_device_ops like:
>
> .ndo_fdb_add= switchdev_port_fdb_add,
> .ndo_fdb_del= switchdev_port_fdb_del,
> .ndo_fdb_dump   = switchdev_port_fdb_dump,
>
> But if I'm not mistaken, "bridge fdb" commands use the
> .ndo_bridge_{get,set,del}link ops, isn't it?

No, the "bridge fdb" cmds use the .ndo_fdb_xxx ops.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 3/3] switchdev: introduce deferred variants of obj_add/del helpers

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 6:25 AM, Jiri Pirko  wrote:
> Thu, Oct 08, 2015 at 03:21:44PM CEST, gerlitz...@gmail.com wrote:
>>On Thu, Oct 8, 2015 at 4:09 PM, Jiri Pirko  wrote:
>>> Thu, Oct 08, 2015 at 10:28:58AM CEST, j...@resnulli.us wrote:
Thu, Oct 08, 2015 at 08:45:58AM CEST, gerlitz...@gmail.com wrote:
>On Wed, Oct 7, 2015 at 9:30 PM, Jiri Pirko  wrote:
>>
>This introduced a regression to the 2-phase commit scheme, since the
>prepare commit can fail
>and that would go un-noticed toward the upper layer, agree?
>>
Well, no. This still does the transaction for all lower devices in one
go. No change in that.
>>
>>> Now I get it, yes you are right. But currently there is no code in
>>> kernel which would control retval of deferred attr_set or obj_add/del
>>
>>I am not sure to understand your reply. You are saying that when the deferred
>>procedures complete (e.g fail in the prepare phase) they can't actually let
>>the upper layer to realize that this change isn't possible? this is
>>exactly the bug.
>
> Correct. But check the code. Callers of current deferred variants do
> not care about the retval. Therefore this is not a regression.
>
> It makes sense in my opinion. If you are a called and you explicitly say to
> defer the operation, you cannot expect retval.

Makes sense to me also, FWIW.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 00/13] rocker: add support for multiple worlds

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 7:36 AM, Jiri Pirko  wrote:
> Wed, Oct 07, 2015 at 07:39:56PM CEST, j...@resnulli.us wrote:
>>Wed, Oct 07, 2015 at 06:53:22PM CEST, sfel...@gmail.com wrote:
>>>On Tue, Oct 6, 2015 at 11:03 PM, Jiri Pirko  wrote:
 Tue, Oct 06, 2015 at 07:14:39PM CEST, sfel...@gmail.com wrote:
>On Tue, Oct 6, 2015 at 12:30 AM, Jiri Pirko  wrote:
>> Tue, Oct 06, 2015 at 05:56:12AM CEST, sfel...@gmail.com wrote:
>>>On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko  wrote:
 From: Jiri Pirko 

 This patchset allows new rocker worlds to be easily added in future 
 (like eBPF
 based one I have been working on). The main part of the patchset is 
 the OF-DPA
 carve-out. It resuts in OF-DPA specific file. Clean cut.

 v1->v2:
  - rtnl rocker mode change userspace expose patch was removed

 Jiri Pirko (13):
   rocker: remove unused rocker_port param from alloc funcs and shorten
 their names
   rocker: rename rocker.h to rocker_hw.h
   rocker: rename rocker.c to rocker_main.c
   rocker: push tlv processing into separate files
   rocker: implement set settings mode command
   rocker: introduce worlds infrastructure
   rocker: introduce OF-DPA world skeleton
   rocker: set default world on port probe and clean world on remove
   rocker: pass "learning" value as a parameter to
 rocker_port_set_learning
   rocker: pre-allocate wait structures during cmd ring init
   rocker: remove trans parameter to rocker_cmd_exec function
   rocker: call rocker_cmd_exec function with "nowait" boolean instead 
 of
 flags
   rocker: move OF-DPA stuff into separate file
>>>
>>>A couple of my tests are failing with this patchset.  A simple port
>>>test is failing and IPv4 routing test is failing.
>>>
>>>The port test is simple: just connect a port on DUT to a port on
>>>another system and assign an IP address to each port and verify IP
>>>connectivity.  I have this:
>>>
>>>   DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)
>>>
>>>The IPv4 routing tests is a bit more complicated to setup.  I'm using
>>>OSPF, but I'm not seeing full routes formed in the topology, so I
>>>suspect OSPF hellos aren't getting thru.
>>>
>>>Please fix find/fix these issues and send v3.  I don't want any git
>>>bisect issues when running tests.  Thanks.
>>
>> I fixed that. Sending v3 in a sec. Thanks.
>
>Sorry, both tests are still broken.  Would you send me your tests
>scripts so I can see why your tests are passing?

 I'm trying some smoke tests including bridge setup and just ip-ip
 setup by hand. Meybe if you send me your scripts, I can run it locally.
>>>
>>>My test scripts are already included in the qemu tree.
>>
>>Okay, will rework and use your scripts. Hope I will find some time
>>during this weekend.
>
> Scott, could you try to test with current net-next?
> I'm trying basic:
> DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)
> and it does not work for me now. It worked previously when I tested with
> my patchset. This is getting odd.

I had just re-run the tests against net-next before submitting the
ageing_time patchset and everything passes.

Are you using a namespace or a VM for host1?  Either one should work.

This would be a bad test, as the kernel will loop the traffic and the
offload device will not see it:

DUT:sw1p1 (11.0.0.1/24) <>DUT:sw1p2(11.0.0.2/24)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 2/3] switchdev: allow caller to explicitly use deferred attr_set version

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 11:46 PM, Jiri Pirko  wrote:
> Fri, Oct 09, 2015 at 06:39:41AM CEST, sfel...@gmail.com wrote:
>>On Thu, Oct 8, 2015 at 1:26 AM, Jiri Pirko  wrote:
>>> Thu, Oct 08, 2015 at 08:03:35AM CEST, sfel...@gmail.com wrote:
On Wed, Oct 7, 2015 at 10:39 PM, Jiri Pirko  wrote:
> Thu, Oct 08, 2015 at 06:27:07AM CEST, sfel...@gmail.com wrote:
>>On Wed, Oct 7, 2015 at 11:30 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>> or if he has to use deferred version of this function.
>>>
>>> This also allows drivers to sleep inside attr_set and report operation
>>> status back to switchdev core. Switchdev core then warns if status is
>>> not ok, instead of silent errors happening in drivers.
>>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   2 +
>>>  net/bridge/br_stp.c   |   4 +-
>>>  net/switchdev/switchdev.c | 113 
>>> +-
>>>  3 files changed, 65 insertions(+), 54 deletions(-)
>>>
>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>> index 89266a3..320be44 100644
>>> --- a/include/net/switchdev.h
>>> +++ b/include/net/switchdev.h
>>> @@ -168,6 +168,8 @@ int switchdev_port_attr_get(struct net_device *dev,
>>> struct switchdev_attr *attr);
>>>  int switchdev_port_attr_set(struct net_device *dev,
>>> struct switchdev_attr *attr);
>>> +int switchdev_port_attr_set_deferred(struct net_device *dev,
>>> +struct switchdev_attr *attr);
>>
>>Rather than adding another op, use attr->flags and define:
>>
>>#define SWITCHDEV_F_DEFERRED  BIT(x)
>>
>>So we get:
>>
>>void br_set_state(struct net_bridge_port *p, unsigned int state)
>>{
>>struct switchdev_attr attr = {
>>.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>+  .flags = SWITCHDEV_F_DEFERRED,
>>.u.stp_state = state,
>>};
>>int err;
>>
>>p->state = state;
>>err = switchdev_port_attr_set(p->dev, );
>>if (err && err != -EOPNOTSUPP)
>>br_warn(p->br, "error setting offload STP state on
>>port %u(%s)\n",
>>(unsigned int) p->port_no,
>>p->dev->name);
>>}
>>
>>(And add obj->flags to do the same).
>
> That's what I wanted to avoid. Also because the obj is const and for
> call from work, this flag would have to be removed.

What did you want to avoid?
>>>
>>> Having this as a flag. I don't like it too much.
>>> But that is cosmetics. Other than that, does the patchset make sense?
>>> Do you see some possible issues?
>>
>>patch 1/3 makes sense, I tested it out and no issues.  (Looks like
>>there are other places to assert rtnl_lock, are you going to add
>>those?)
>
> Sure, can you pinpoint the places?

Isn't every place we use netdev_for_each_lower_dev, like you mentioned
in 1/3 patch?

>>patch 2/3: Rather than trying to guess the call context in the core,
>>make the caller call the right variant for its context.  That part is
>>good.  On the flag vs. no flags, the reasons why I want this as a flag
>>are:
>>
>>a) I want to keep the switchdev ops set to the core set: get/set attr
>>and add/del/dump objs.  I've pushed back on changing this before.  I
>>don't want ops explosion (like netdev_ops), and I'd like to avoid the
>>1000-line patch when the arg list in an op changes, and we need to
>>update N drivers.  The flags lets the caller modify the algo behavior,
>>while keeping the core call (and args) fixed.
>>
>>b) the caller can combine flags, where it makes sense.  For example,
>>maybe I'm in a locked context and I don't want to recurse the device
>>tree, so I would make the call with NO_RECURSE | DEFERRED.  If we
>>didn't use flags, then we need to supply ops for each variant on the
>>call, and then things explode.
>
> Fair enough. I'll process this in.

Actually, I realized later that my reply here was only half true.
Part b) to combine flags for various calling situation is good.   Part
a) is bogus because I confused adding a new op or adding a new wrapper
to call existing op.  You did the latter; but I was complaining about
the former.  Sorry about that.  Regardless, port b) I think justifies
using flags.

>
>>
>>patch 3/3 I haven't looked at yet...I'm stuck on 2/3.
>
> It is very similar to 2/3, only for obj_add/del.

Do we have examples of a deferred obj add or del?  Maybe we should
hold off adding that support until someone finds a use-case.  I'm kind
of hoping there isn't a use-case, but who knows?
--
To unsubscribe

Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 9:38 PM, Premkumar Jonnala <pjonn...@broadcom.com> wrote:
>
>
>> -Original Message-
>> From: sfel...@gmail.com [mailto:sfel...@gmail.com]
>> Sent: Friday, October 09, 2015 7:53 AM
>> To: netdev@vger.kernel.org
>> Cc: da...@davemloft.net; j...@resnulli.us; siva.mannem@gmail.com;
>> Premkumar Jonnala; step...@networkplumber.org;
>> ro...@cumulusnetworks.com; and...@lunn.ch; f.faine...@gmail.com;
>> vivien.dide...@savoirfairelinux.com
>> Subject: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down
>> to switchdev
>>
>> From: Scott Feldman <sfel...@gmail.com>
>>
>> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
>> support setting ageing_time (or setting bridge attrs in general).
>>
>> If push fails, don't update ageing_time in bridge and return err to user.
>>
>> If push succeeds, update ageing_time in bridge and run gc_timer now to
>> recalabrate when to run gc_timer next, based on new ageing_time.
>>
>> Signed-off-by: Scott Feldman <sfel...@gmail.com>
>> Signed-off-by: Jiri Pirko <j...@resnulli.us>



>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
>> +{
>> + struct switchdev_attr attr = {
>> + .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
>> + .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
>> + .u.ageing_time = ageing_time,
>> + };
>> + unsigned long t = clock_t_to_jiffies(ageing_time);
>> + int err;
>> +
>> + if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
>> + return -ERANGE;
>> +
>> + err = switchdev_port_attr_set(br->dev, );
>
> A thought - given that the ageing time is not a per-bridge-port attr, why are 
> we using a "port based api"
> to pass the attribute down?  May be I'm missing something here?

I think Florian raised the same point earlier.  Sigh, I think this
should be addressedv4 coming soon...thanks guys for keeping the
standard high.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next] bridge: try switchdev op first in __vlan_vid_add/del

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 4:54 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Some drivers need to implement both switchdev vlan ops and
> vid_add/kill ndos. For that to work in bridge code, we need to try
> switchdev op first when adding/deleting vlan id.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>
> Signed-off-by: Ido Schimmel <ido...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: switchdev and VLAN ranges

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 4:30 PM, Vivien Didelot
 wrote:
> Hi All,
>
> I understand that specifying a VLAN range on the command line is nice
> for the user, and it makes no big deal for software implementation.

[Adding Roopa, since she did the original vlan range support in the
kernel/iproute2]

> However, AFAICT a VLAN range does not make sense at all for hardware
> such as Ethernet switch chips. Am I wrong?
>
> I would suggest to make switchdev directly answer to a bridge request
> that the operation is not supported when the user asks for a VLAN range.
>
> That way, we can simply use a single "vid" member in struct
> switchdev_obj_port_vlan instead of "vid_begin" and "vid_end" and thus
> avoid making drivers heavier with iteration loops on such range.
>
> I have two concerns in mind:
>
> a) if we imagine that drivers like Rocker allocate memory in the prepare
> phase for each VID, preparing a range like 100-4000 would definitely not
> be recommended.

This call should be in process context so it doesn't seem to terrible
for the driver to take its time to reserve/allocate resources in
prepare phase, even for a vlan range.  I think I'm missing your point.

> b) imagine that you have two Linux bridges on a switch, one using the
> hardware VLAN 100. If you request the VLAN range 99-101 for the other
> bridge members, it is not possible for the driver to say "I can
> accelerate VLAN 99 and 101, but not 100". It must return OPNOTSUPP for
> the whole range.

Well, it probably should return -ERANGE to indicate the range can't be
added, but that's an aside.

The reason why vlan ranges need to work down to the switchdev driver
is, from the user's perspective, it's an all-or-nothing request from
the user to add the vlan range to the device.  So we need to ask the
driver in the prepare phase, "can you support this range,
completely?", and if yes, then commit it as a whole.  The netlink
response back to the user isn't equipped to describe what subset of
the range was added, and what subset was not.

> That's why I think that avoiding VLAN range at the switchdev level would
> be a good idea.

As a general rule with switchdev, we've tried to keep the user's
experience the same when using {Linux} as a soft switch/router vs.
using {Linux + offload device} as a hard switch/router.  So if native
Linux supports some operation, for example vlan ranges, then we should
try to extend that to the offload model.  In other words, we don't
want to re-train the user when moving from soft switch to hard switch!
 But there are physical limitations when dealing with an offload
device

Anyway, with your vlan range example, we've got a case where each soft
bridge has an independent vlan set, and the vlan sets between soft
bridges can overlap.  For the (typical) hard switch, there is one vlan
set for the whole switch, and trying to overlay the soft bridges'
(overlapping) vlan sets on the hard switch fails.  That failure is
reported to the user.  We tried, but due to offload device
limitations, we can't support that operation.  Of course, if the vlan
sets didn't overlap, then we don't have a problem.

This will not be the only case where something we can do on a soft
Linux switch/router can't be offloaded to some physical offload
device.  But I think the philosophy has been to try offload what we
can, up to the point of failure.  In some cases, we can mask that
failure from the user by falling back to soft-switch only, but in
other cases the failure will pop up right in the user's face, like in
your example.

One idea to help mitigate the user's confusion would be to limit the
number of bridges overlaid on the device to just one.  Our drivers
know when ports are enslaved to bridges, so is there something we can
do there to fail the enslave on a second bridge?  Exercise left to the
reader.  If we had that, now vlan ranges work 1:1 with soft Linux
because both soft bridge and device have single vlan set.

Sorry for the long-winded response.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] ipv6 route: use err pointers instead of returning pointer by reference

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 10:26 AM, Roopa Prabhu  wrote:
> From: Roopa Prabhu 
>
> This patch makes ip6_route_info_create return err pointer instead of
> returning the rt pointer by reference as suggested  by Dave
>
> Signed-off-by: Roopa Prabhu 
> ---
> Dave, sorry abt the delay on this one. net-next was closed when i got to it
> and its been in my queue since then.
>
>  net/ipv6/route.c | 30 --
>  1 file changed, 16 insertions(+), 14 deletions(-)



>  int ip6_route_add(struct fib6_config *cfg)
> @@ -1980,9 +1976,12 @@ int ip6_route_add(struct fib6_config *cfg)
> struct rt6_info *rt = NULL;

nit: don't need to init rt since it's now set unconditionally.

> int err;
>
> -   err = ip6_route_info_create(cfg, );
> -   if (err)
> +   rt = ip6_route_info_create(cfg);
> +   if (IS_ERR(rt)) {
> +   err = PTR_ERR(rt);
> +   rt = NULL;
> goto out;
> +   }
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next] bridge: try switchdev op first in __vlan_vid_add/del

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 3:44 PM, Vivien Didelot
 wrote:
> Hi Jiri,
>
> On Oct. Friday 09 (41) 01:54 PM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>>
>> Some drivers need to implement both switchdev vlan ops and
>> vid_add/kill ndos. For that to work in bridge code, we need to try
>> switchdev op first when adding/deleting vlan id.
>
> Just curious, when would a driver need to have both operations?

Ya, I was kind of curious of that myself. Is this because the driver
wants to support standalone VLANs using 802.1q module and vconfig, as
well as bridge vlans?  With the vlan support built into the bridge,
I've been working under the assumption that 802.1q module (and
vconfig) aren't needed, and vlans for a bridged and non-bridge port
can be managed using the "bridge" iproute2 cmd.

> I kinda have the same question regarding ndo_fdb_{add,del} and the
> bridge_{get,set}link equivalent, which is a bit confusing to me.

I had to look back at my commit 7f109539 to remind myself about the
vid_add/kill ndos and bridge_{get,set}link usage.   Maybe that
write-up helps?  I'm not following you on the ndo_fdb_add/del part of
your question.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next v3] bridge: allow adding of fdb entries pointing to the bridge device

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 10:38 AM, Roopa Prabhu  wrote:
> From: Roopa Prabhu 
>
> This patch enables adding of fdb entries pointing to the bridge device.
> This can be used to propagate mac address of vlan interfaces
> configured on top of the vlan filtering bridge.
>
> Before:
> $bridge fdb add 44:38:39:00:27:9f dev bridge
> RTNETLINK answers: Invalid argument
>
> After:
> $bridge fdb add 44:38:39:00:27:9f dev bridge
>
> Signed-off-by: Roopa Prabhu 
> Reviewed-by: Nikolay Aleksandrov 
> ---
> v1 - v2 : fix kbuild warnings
> v2 - v3 : address review comments from Nikolay (use of br_vlan_should_use)
>
>  net/bridge/br_fdb.c  | 122 
> ++-
>  net/bridge/br_vlan.c |   1 +
>  2 files changed, 93 insertions(+), 30 deletions(-)
>
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index 7f7d551..f43ce05 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -608,13 +608,14 @@ void br_fdb_update(struct net_bridge *br, struct 
> net_bridge_port *source,
> }
>  }
>
> -static int fdb_to_nud(const struct net_bridge_fdb_entry *fdb)
> +static int fdb_to_nud(const struct net_bridge *br,
> + const struct net_bridge_fdb_entry *fdb)
>  {
> if (fdb->is_local)
> return NUD_PERMANENT;
> else if (fdb->is_static)
> return NUD_NOARP;
> -   else if (has_expired(fdb->dst->br, fdb))
> +   else if (has_expired(br, fdb))
> return NUD_STALE;
> else
> return NUD_REACHABLE;
> @@ -640,7 +641,7 @@ static int fdb_fill_info(struct sk_buff *skb, const 
> struct net_bridge *br,
> ndm->ndm_flags   = fdb->added_by_external_learn ? NTF_EXT_LEARNED : 0;
> ndm->ndm_type= 0;
> ndm->ndm_ifindex = fdb->dst ? fdb->dst->dev->ifindex : 
> br->dev->ifindex;
> -   ndm->ndm_state   = fdb_to_nud(fdb);
> +   ndm->ndm_state   = fdb_to_nud(br, fdb);
>
> if (nla_put(skb, NDA_LLADDR, ETH_ALEN, >addr))
> goto nla_put_failure;
> @@ -785,7 +786,7 @@ static int fdb_add_entry(struct net_bridge_port *source, 
> const __u8 *addr,
> }
> }
>
> -   if (fdb_to_nud(fdb) != state) {
> +   if (fdb_to_nud(br, fdb) != state) {


Hi Roopa,

Are the above changes to fdb_to_nud() related to the patch subject?
I was trying to figure out this part of the patch...seems unrelated.
Is fdb->dst->br now not valid in some cases?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 5/6] net: dsa: push prepare phase in port_fdb_add

2015-10-08 Thread Scott Feldman

On Wed, Oct 7, 2015 at 4:48 PM, Vivien Didelot
<vivien.dide...@savoirfairelinux.com> wrote:
> Now that the prepare phase is pushed down to the DSA drivers, propagate
> it to the port_fdb_add function.
>
> Signed-off-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>

Reviewed-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 6/6] net: dsa: use switchdev obj in port_fdb_del

2015-10-08 Thread Scott Feldman

On Wed, Oct 7, 2015 at 4:48 PM, Vivien Didelot
<vivien.dide...@savoirfairelinux.com> wrote:
> For consistency with the FDB add operation, propagate the
> switchdev_obj_port_fdb structure in the DSA drivers.
>
> Signed-off-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>

Reviewed-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 4/6] net: dsa: add port_fdb_prepare

2015-10-08 Thread Scott Feldman

On Wed, Oct 7, 2015 at 4:48 PM, Vivien Didelot
<vivien.dide...@savoirfairelinux.com> wrote:
> Push the prepare phase for FDB operations down to the DSA drivers, with
> a new port_fdb_prepare function. Currently only mv88e6xxx is affected.
>
> Signed-off-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>

Reviewed-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 2/3] switchdev: allow caller to explicitly use deferred attr_set version

2015-10-08 Thread Scott Feldman

On Wed, Oct 7, 2015 at 10:39 PM, Jiri Pirko  wrote:
> Thu, Oct 08, 2015 at 06:27:07AM CEST, sfel...@gmail.com wrote:
>>On Wed, Oct 7, 2015 at 11:30 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>> or if he has to use deferred version of this function.
>>>
>>> This also allows drivers to sleep inside attr_set and report operation
>>> status back to switchdev core. Switchdev core then warns if status is
>>> not ok, instead of silent errors happening in drivers.
>>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   2 +
>>>  net/bridge/br_stp.c   |   4 +-
>>>  net/switchdev/switchdev.c | 113 
>>> +-
>>>  3 files changed, 65 insertions(+), 54 deletions(-)
>>>
>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>> index 89266a3..320be44 100644
>>> --- a/include/net/switchdev.h
>>> +++ b/include/net/switchdev.h
>>> @@ -168,6 +168,8 @@ int switchdev_port_attr_get(struct net_device *dev,
>>> struct switchdev_attr *attr);
>>>  int switchdev_port_attr_set(struct net_device *dev,
>>> struct switchdev_attr *attr);
>>> +int switchdev_port_attr_set_deferred(struct net_device *dev,
>>> +struct switchdev_attr *attr);
>>
>>Rather than adding another op, use attr->flags and define:
>>
>>#define SWITCHDEV_F_DEFERRED  BIT(x)
>>
>>So we get:
>>
>>void br_set_state(struct net_bridge_port *p, unsigned int state)
>>{
>>struct switchdev_attr attr = {
>>.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>+  .flags = SWITCHDEV_F_DEFERRED,
>>.u.stp_state = state,
>>};
>>int err;
>>
>>p->state = state;
>>err = switchdev_port_attr_set(p->dev, );
>>if (err && err != -EOPNOTSUPP)
>>br_warn(p->br, "error setting offload STP state on
>>port %u(%s)\n",
>>(unsigned int) p->port_no,
>>p->dev->name);
>>}
>>
>>(And add obj->flags to do the same).
>
> That's what I wanted to avoid. Also because the obj is const and for
> call from work, this flag would have to be removed.

What did you want to avoid?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next v2 1/4] switchdev: add bridge attributes

2015-10-08 Thread Scott Feldman

On Thu, Oct 8, 2015 at 1:39 AM, Jiri Pirko <j...@resnulli.us> wrote:
> Thu, Oct 08, 2015 at 08:04:40AM CEST, sfel...@gmail.com wrote:
>>From: Scott Feldman <sfel...@gmail.com>
>>
>>Setting the stage to push bridge-level attributes down to port driver so
>>hardware can be programmed accordingly.  Bridge-level attribute example is
>>ageing_time.  This is a per-bridge attribute, not a per-bridge-port attr.
>>
>>Signed-off-by: Scott Feldman <sfel...@gmail.com>
>>---
>> include/net/switchdev.h  |5 +
>> include/uapi/linux/if_link.h |2 +-
>> 2 files changed, 6 insertions(+), 1 deletion(-)
>>
>>diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>index 89266a3..8d92cd0 100644
>>--- a/include/net/switchdev.h
>>+++ b/include/net/switchdev.h
>>@@ -43,6 +43,7 @@ enum switchdev_attr_id {
>>   SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
>>   SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>   SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
>>+  SWITCHDEV_ATTR_ID_BRIDGE,
>> };
>>
>> struct switchdev_attr {
>>@@ -52,6 +53,10 @@ struct switchdev_attr {
>>   struct netdev_phys_item_id ppid;/* PORT_PARENT_ID */
>>   u8 stp_state;   /* PORT_STP_STATE */
>>   unsigned long brport_flags; /* PORT_BRIDGE_FLAGS */
>>+  struct switchdev_attr_bridge {  /* BRIDGE */
>>+  enum ifla_br attr;
>
> I don't like pushing down IFLA_BR_* values throught switchdev. I think
> it might better to just intruduce:
> SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME
>
> and "u32 ageing_time" here. Something similar to stp_state. Much easier
> to read and does not give blank cheque for passing any bridge IFLA_BR_*
> down to drivers.
>
> It also aligns with bridge code nicely, I believe.

Done, v3 sent with your suggested change.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-08 Thread Scott Feldman

On Thu, Oct 8, 2015 at 7:40 PM, Florian Fainelli <f.faine...@gmail.com> wrote:
> 2015-10-08 19:23 GMT-07:00  <sfel...@gmail.com>:
>> From: Scott Feldman <sfel...@gmail.com>
>>
>> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
>> support setting ageing_time (or setting bridge attrs in general).
>>
>> If push fails, don't update ageing_time in bridge and return err to user.
>>
>> If push succeeds, update ageing_time in bridge and run gc_timer now to
>> recalabrate when to run gc_timer next, based on new ageing_time.
>>
>> Signed-off-by: Scott Feldman <sfel...@gmail.com>
>> Signed-off-by: Jiri Pirko <j...@resnulli.us>
>> ---
>>  net/bridge/br_ioctl.c|3 +--
>>  net/bridge/br_netlink.c  |6 +++---
>>  net/bridge/br_private.h  |1 +
>>  net/bridge/br_stp.c  |   23 +++
>>  net/bridge/br_sysfs_br.c |3 +--
>>  5 files changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
>> index 8d423bc..263b4de 100644
>> --- a/net/bridge/br_ioctl.c
>> +++ b/net/bridge/br_ioctl.c
>> @@ -200,8 +200,7 @@ static int old_dev_ioctl(struct net_device *dev, struct 
>> ifreq *rq, int cmd)
>> if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
>> return -EPERM;
>>
>> -   br->ageing_time = clock_t_to_jiffies(args[1]);
>> -   return 0;
>> +   return br_set_ageing_time(br, args[1]);
>>
>> case BRCTL_GET_PORT_INFO:
>> {
>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>> index d78b442..544ab96 100644
>> --- a/net/bridge/br_netlink.c
>> +++ b/net/bridge/br_netlink.c
>> @@ -870,9 +870,9 @@ static int br_changelink(struct net_device *brdev, 
>> struct nlattr *tb[],
>> }
>>
>> if (data[IFLA_BR_AGEING_TIME]) {
>> -   u32 ageing_time = nla_get_u32(data[IFLA_BR_AGEING_TIME]);
>> -
>> -   br->ageing_time = clock_t_to_jiffies(ageing_time);
>> +   err = br_set_ageing_time(br, 
>> nla_get_u32(data[IFLA_BR_AGEING_TIME]));
>> +   if (err)
>> +   return err;
>> }
>>
>> if (data[IFLA_BR_STP_STATE]) {
>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>> index 09d3ecb..ba0c67b 100644
>> --- a/net/bridge/br_private.h
>> +++ b/net/bridge/br_private.h
>> @@ -882,6 +882,7 @@ void __br_set_forward_delay(struct net_bridge *br, 
>> unsigned long t);
>>  int br_set_forward_delay(struct net_bridge *br, unsigned long x);
>>  int br_set_hello_time(struct net_bridge *br, unsigned long x);
>>  int br_set_max_age(struct net_bridge *br, unsigned long x);
>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time);
>>
>>
>>  /* br_stp_if.c */
>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
>> index 3a982c0..db6d243de 100644
>> --- a/net/bridge/br_stp.c
>> +++ b/net/bridge/br_stp.c
>> @@ -566,6 +566,29 @@ int br_set_max_age(struct net_bridge *br, unsigned long 
>> val)
>>
>>  }
>>
>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
>> +{
>> +   struct switchdev_attr attr = {
>> +   .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
>> +   .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
>> +   .u.ageing_time = ageing_time,
>> +   };
>> +   unsigned long t = clock_t_to_jiffies(ageing_time);
>> +   int err;
>> +
>> +   if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
>> +   return -ERANGE;
>> +
>> +   err = switchdev_port_attr_set(br->dev, );
>> +   if (err)
>> +   return err;
>> +
>> +   br->ageing_time = t;
>> +   mod_timer(>gc_timer, jiffies);
>
> If the switch driver/HW supports ageing, does it still make sense to
> have this software timer ticking?

Yes, because the bridge still needs to age out entries it has learned
(those not marked with added_by_external_learn), for example entries
learned on non-offloaded ports.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 2/3] switchdev: allow caller to explicitly use deferred attr_set version

2015-10-08 Thread Scott Feldman

On Thu, Oct 8, 2015 at 1:26 AM, Jiri Pirko  wrote:
> Thu, Oct 08, 2015 at 08:03:35AM CEST, sfel...@gmail.com wrote:
>>On Wed, Oct 7, 2015 at 10:39 PM, Jiri Pirko  wrote:
>>> Thu, Oct 08, 2015 at 06:27:07AM CEST, sfel...@gmail.com wrote:
On Wed, Oct 7, 2015 at 11:30 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Caller should know if he can call attr_set directly (when holding RTNL)
> or if he has to use deferred version of this function.
>
> This also allows drivers to sleep inside attr_set and report operation
> status back to switchdev core. Switchdev core then warns if status is
> not ok, instead of silent errors happening in drivers.
>
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/switchdev.h   |   2 +
>  net/bridge/br_stp.c   |   4 +-
>  net/switchdev/switchdev.c | 113 
> +-
>  3 files changed, 65 insertions(+), 54 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index 89266a3..320be44 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -168,6 +168,8 @@ int switchdev_port_attr_get(struct net_device *dev,
> struct switchdev_attr *attr);
>  int switchdev_port_attr_set(struct net_device *dev,
> struct switchdev_attr *attr);
> +int switchdev_port_attr_set_deferred(struct net_device *dev,
> +struct switchdev_attr *attr);

Rather than adding another op, use attr->flags and define:

#define SWITCHDEV_F_DEFERRED  BIT(x)

So we get:

void br_set_state(struct net_bridge_port *p, unsigned int state)
{
struct switchdev_attr attr = {
.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
+  .flags = SWITCHDEV_F_DEFERRED,
.u.stp_state = state,
};
int err;

p->state = state;
err = switchdev_port_attr_set(p->dev, );
if (err && err != -EOPNOTSUPP)
br_warn(p->br, "error setting offload STP state on
port %u(%s)\n",
(unsigned int) p->port_no,
p->dev->name);
}

(And add obj->flags to do the same).
>>>
>>> That's what I wanted to avoid. Also because the obj is const and for
>>> call from work, this flag would have to be removed.
>>
>>What did you want to avoid?
>
> Having this as a flag. I don't like it too much.
> But that is cosmetics. Other than that, does the patchset make sense?
> Do you see some possible issues?

patch 1/3 makes sense, I tested it out and no issues.  (Looks like
there are other places to assert rtnl_lock, are you going to add
those?)

patch 2/3: Rather than trying to guess the call context in the core,
make the caller call the right variant for its context.  That part is
good.  On the flag vs. no flags, the reasons why I want this as a flag
are:

a) I want to keep the switchdev ops set to the core set: get/set attr
and add/del/dump objs.  I've pushed back on changing this before.  I
don't want ops explosion (like netdev_ops), and I'd like to avoid the
1000-line patch when the arg list in an op changes, and we need to
update N drivers.  The flags lets the caller modify the algo behavior,
while keeping the core call (and args) fixed.

b) the caller can combine flags, where it makes sense.  For example,
maybe I'm in a locked context and I don't want to recurse the device
tree, so I would make the call with NO_RECURSE | DEFERRED.  If we
didn't use flags, then we need to supply ops for each variant on the
call, and then things explode.

patch 3/3 I haven't looked at yet...I'm stuck on 2/3.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 2/3] switchdev: allow caller to explicitly use deferred attr_set version

2015-10-07 Thread Scott Feldman

On Wed, Oct 7, 2015 at 11:30 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Caller should know if he can call attr_set directly (when holding RTNL)
> or if he has to use deferred version of this function.
>
> This also allows drivers to sleep inside attr_set and report operation
> status back to switchdev core. Switchdev core then warns if status is
> not ok, instead of silent errors happening in drivers.
>
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/switchdev.h   |   2 +
>  net/bridge/br_stp.c   |   4 +-
>  net/switchdev/switchdev.c | 113 
> +-
>  3 files changed, 65 insertions(+), 54 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index 89266a3..320be44 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -168,6 +168,8 @@ int switchdev_port_attr_get(struct net_device *dev,
> struct switchdev_attr *attr);
>  int switchdev_port_attr_set(struct net_device *dev,
> struct switchdev_attr *attr);
> +int switchdev_port_attr_set_deferred(struct net_device *dev,
> +struct switchdev_attr *attr);

Rather than adding another op, use attr->flags and define:

#define SWITCHDEV_F_DEFERRED  BIT(x)

So we get:

void br_set_state(struct net_bridge_port *p, unsigned int state)
{
struct switchdev_attr attr = {
.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
+  .flags = SWITCHDEV_F_DEFERRED,
.u.stp_state = state,
};
int err;

p->state = state;
err = switchdev_port_attr_set(p->dev, );
if (err && err != -EOPNOTSUPP)
br_warn(p->br, "error setting offload STP state on
port %u(%s)\n",
(unsigned int) p->port_no,
p->dev->name);
}

(And add obj->flags to do the same).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v3 06/14] rocker: introduce worlds infrastructure

2015-10-07 Thread Scott Feldman

On Tue, Oct 6, 2015 at 11:14 PM, Jiri Pirko  wrote:
> Wed, Oct 07, 2015 at 03:50:08AM CEST, sfel...@gmail.com wrote:
>
> 
>
>>
>>> Also I wonder how this works when a pkt ingresses a port in mode A and
>>> egresses a port in mode B? What fib/fdb tables does it cross when this
>>> happens? It seems easier to just have two switch devices not a
>>> hybrid. If this per port implementation maps to some hardware that
>>> would be really interesting though.
>>
>>In retrospect, I regret adding the port mode feature to rocker.  I
>>like the world idea, so we can have a device with different
>>pipeline/resources, but we should have locked all ports on a switch to
>>one mode, or even as you hinted at earlier, use a unique sub-device ID
>>for a switch with all ports in a particular mode.  If you want to
>>ports with different worlds, just instantiate a switch in each world.
>>Instantiating new devices is easy.
>>
>>But, now Jiri has locked on to the dynamic port mode idea with pit
>>bull zeal, to the point of being able to switch a port mode at any
>>time from one mode to another from the host.  I just don't see that as
>>a real-world use-case.  Life is too short and we need to be focusing
>>on switchdev features, not refactoring or adding cool but useless
>>features.
>
> Can can still change this if you want. We can make
> ROCKER_TLV_CMD_PORT_SETTINGS_MODE read-only in hw (As it is in fact now
> as we have only one world).
>
> Then we add another property:
> static Property rocker_properties[] = {
> DEFINE_PROP_STRING("name", Rocker, name),
> DEFINE_PROP_STRING("world", Rocker, world),
> 
>
> and we use this value in pci_rocker_init instead of r->world_dflt
>
> Looks straightforward.

Yes, perfect, I totally on-board with that change.  This puts all
ports on the device in the same mode.  I you want ports in a different
mode, instantiate another switch.

Mixing port modes on a switch, either statically or dynamically, would
have created another problem.  Switchdev uses the switch_id for each
port to know when ports belong to the same switch, same switching
domain, if you will.  Mixing port modes within a switch breaks this.
For example, the check we added to avoid double forwarding of pkts by
the bridge and the device depends on ports on the switch having the
same switch_id.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 00/13] rocker: add support for multiple worlds

2015-10-07 Thread Scott Feldman

On Tue, Oct 6, 2015 at 11:03 PM, Jiri Pirko  wrote:
> Tue, Oct 06, 2015 at 07:14:39PM CEST, sfel...@gmail.com wrote:
>>On Tue, Oct 6, 2015 at 12:30 AM, Jiri Pirko  wrote:
>>> Tue, Oct 06, 2015 at 05:56:12AM CEST, sfel...@gmail.com wrote:
On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> This patchset allows new rocker worlds to be easily added in future (like 
> eBPF
> based one I have been working on). The main part of the patchset is the 
> OF-DPA
> carve-out. It resuts in OF-DPA specific file. Clean cut.
>
> v1->v2:
>  - rtnl rocker mode change userspace expose patch was removed
>
> Jiri Pirko (13):
>   rocker: remove unused rocker_port param from alloc funcs and shorten
> their names
>   rocker: rename rocker.h to rocker_hw.h
>   rocker: rename rocker.c to rocker_main.c
>   rocker: push tlv processing into separate files
>   rocker: implement set settings mode command
>   rocker: introduce worlds infrastructure
>   rocker: introduce OF-DPA world skeleton
>   rocker: set default world on port probe and clean world on remove
>   rocker: pass "learning" value as a parameter to
> rocker_port_set_learning
>   rocker: pre-allocate wait structures during cmd ring init
>   rocker: remove trans parameter to rocker_cmd_exec function
>   rocker: call rocker_cmd_exec function with "nowait" boolean instead of
> flags
>   rocker: move OF-DPA stuff into separate file

A couple of my tests are failing with this patchset.  A simple port
test is failing and IPv4 routing test is failing.

The port test is simple: just connect a port on DUT to a port on
another system and assign an IP address to each port and verify IP
connectivity.  I have this:

   DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)

The IPv4 routing tests is a bit more complicated to setup.  I'm using
OSPF, but I'm not seeing full routes formed in the topology, so I
suspect OSPF hellos aren't getting thru.

Please fix find/fix these issues and send v3.  I don't want any git
bisect issues when running tests.  Thanks.
>>>
>>> I fixed that. Sending v3 in a sec. Thanks.
>>
>>Sorry, both tests are still broken.  Would you send me your tests
>>scripts so I can see why your tests are passing?
>
> I'm trying some smoke tests including bridge setup and just ip-ip
> setup by hand. Meybe if you send me your scripts, I can run it locally.

My test scripts are already included in the qemu tree.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v3 06/14] rocker: introduce worlds infrastructure

2015-10-06 Thread Scott Feldman

On Tue, Oct 6, 2015 at 2:25 PM, John Fastabend  wrote:
> [...]
>

 Using void * in these ops is unacceptable, I can't agree to this patch.

 There is a much cleaner way to architect this.  If you look at the ops
 defined, they're mostly duplicates of the already defined
 switchdev_ops.  It would be much cleaner to:

 0) set port mode on qemu/rocker (the device)
 1) get the port mode on port probe
 2) based on port mode, set the switchdev_ops to point to the port mode
 world switchdev_ops
 3) sub-class rocker_port, like I mentioned in before, to store
 world-specific stuff in rocker_port

 I don't buy the argument that we need to change port mode dynamically
 from the driver.  Set it at the device and be done.

>>>
>>> Maybe as a reference this strikes me as similar to how we do multiple
>>> device support in a single driver like ixgbe or fm10k (the two I'm most
>>> familiar with). At probe time we read the device id and then stub in
>>> the specific callbacks for that device.
>>
>> Exactly
>>
>>> Sorry I'm still hung up on the multiple worlds thing, is it really
>>> trying to model different devices under a single driver? In which case
>>> maybe rather than port mode expose it as its own device id. Just a
>>> thought.
>>
>> Yes, different devices under single driver.  New device ID or
>> sub-device ID will not work in this case as we're trying to slice it
>> at the port level, not the device level.
>>
>
> OK uncovered my next level of suspicion/confusion.
>
> Do you actually have or seen hardware that has completely different
> programming interface per port? And completely different pipelines?

I haven't.

> This seems really strange to me and perhaps just an artifact of
> the qemu implementation? Typically or at least what I expect is you
> have a switch pipeline with a set of data structures, tcams, hash
> tables, etc all connected together in some (possibly configurable)
> topology. Ports feed packets into this pipeline and packets egress
> out ports. In my logical view of a "switch" device the pipeline
> is a shared resource you can partition it so that ports are isolated
> in some sense but you can't use fundamentally different underlying
> resources per ports. Its not a per port attribute/mode like this
> series sort of hints at.

Yes, this is an artifact of the qemu implementation.  The idea was
ports could run in one mode (world) or another.  The device would make
sure port traffic stays within the world.  Each world would have its
own pipeline and resources.  So one switch device could have ports in
different worlds.

> Also I wonder how this works when a pkt ingresses a port in mode A and
> egresses a port in mode B? What fib/fdb tables does it cross when this
> happens? It seems easier to just have two switch devices not a
> hybrid. If this per port implementation maps to some hardware that
> would be really interesting though.

In retrospect, I regret adding the port mode feature to rocker.  I
like the world idea, so we can have a device with different
pipeline/resources, but we should have locked all ports on a switch to
one mode, or even as you hinted at earlier, use a unique sub-device ID
for a switch with all ports in a particular mode.  If you want to
ports with different worlds, just instantiate a switch in each world.
Instantiating new devices is easy.

But, now Jiri has locked on to the dynamic port mode idea with pit
bull zeal, to the point of being able to switch a port mode at any
time from one mode to another from the host.  I just don't see that as
a real-world use-case.  Life is too short and we need to be focusing
on switchdev features, not refactoring or adding cool but useless
features.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v3 06/14] rocker: introduce worlds infrastructure

2015-10-06 Thread Scott Feldman

On Tue, Oct 6, 2015 at 9:15 AM, Jiri Pirko  wrote:
> Tue, Oct 06, 2015 at 06:06:45PM CEST, sfel...@gmail.com wrote:
>>On Tue, Oct 6, 2015 at 12:51 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> This is another step on the way to per-world clean cut. Introduce world
>>> ops hooks which each world can implement in world-specific way.
>>> Also introduce world infrastructure including function for port worlds
>>> change.
>>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>> v1->v2:
>>>  - removed functions to change worlds based on name, as rtnl mode
>>>set patch is removed from patchset.
>>> v2->v3:
>>>  - fix checks in rocker_world_port_open and rocker_world_port_stop
>>> ---
>>>  drivers/net/ethernet/rocker/rocker.h  |  56 
>>>  drivers/net/ethernet/rocker/rocker_main.c | 464 
>>> +-
>>>  2 files changed, 519 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/rocker/rocker.h 
>>> b/drivers/net/ethernet/rocker/rocker.h
>>> index 650caa0..d49bc5d 100644
>>> --- a/drivers/net/ethernet/rocker/rocker.h
>>> +++ b/drivers/net/ethernet/rocker/rocker.h
>>> @@ -12,7 +12,11 @@
>>>  #ifndef _ROCKER_H
>>>  #define _ROCKER_H
>>>
>>> +#include 
>>>  #include 
>>> +#include 
>>> +#include 
>>> +#include 
>>>
>>>  #include "rocker_hw.h"
>>>
>>> @@ -24,4 +28,56 @@ struct rocker_desc_info {
>>> dma_addr_t mapaddr;
>>>  };
>>>
>>> +struct rocker;
>>> +struct rocker_port;
>>> +
>>> +struct rocker_world_ops {
>>> +   const char *kind;
>>> +   size_t priv_size;
>>> +   size_t port_priv_size;
>>> +   u8 mode;
>>> +   int (*init)(struct rocker *rocker, void *priv);
>>> +   void (*fini)(void *priv);
>>> +   int (*port_init)(struct rocker_port *rocker_port, void *priv,
>>> +void *port_priv);
>>> +   void (*port_fini)(void *port_priv);
>>> +   int (*port_open)(void *port_priv);
>>> +   void (*port_stop)(void *port_priv);
>>> +   int (*port_attr_stp_state_set)(void *port_priv, u8 state,
>>> +  struct switchdev_trans *trans);
>>> +   int (*port_attr_bridge_flags_set)(void *port_priv,
>>> + unsigned long brport_flags,
>>> + struct switchdev_trans *trans);
>>> +   int (*port_attr_bridge_flags_get)(void *port_priv,
>>> + unsigned long *p_brport_flags);
>>> +   int (*port_obj_vlan_add)(void *port_priv,
>>> +const struct switchdev_obj_port_vlan *vlan,
>>> +struct switchdev_trans *trans);
>>> +   int (*port_obj_vlan_del)(void *port_priv,
>>> +const struct switchdev_obj_port_vlan 
>>> *vlan);
>>> +   int (*port_obj_vlan_dump)(void *port_priv,
>>> + struct switchdev_obj_port_vlan *vlan,
>>> + switchdev_obj_dump_cb_t *cb);
>>> +   int (*port_obj_fib4_add)(void *port_priv,
>>> +const struct switchdev_obj_ipv4_fib *fib4,
>>> +struct switchdev_trans *trans);
>>> +   int (*port_obj_fib4_del)(void *port_priv,
>>> +const struct switchdev_obj_ipv4_fib *fib4);
>>> +   int (*port_obj_fdb_add)(void *port_priv,
>>> +   const struct switchdev_obj_port_fdb *fdb,
>>> +   struct switchdev_trans *trans);
>>> +   int (*port_obj_fdb_del)(void *port_priv,
>>> +   const struct switchdev_obj_port_fdb *fdb);
>>> +   int (*port_obj_fdb_dump)(void *port_priv,
>>> +struct switchdev_obj_port_fdb *fdb,
>>> +switchdev_obj_dump_cb_t *cb);
>>> +   int (*port_master_linked)(void *port_priv, struct net_device 
>>> *master);
>>> +   int (*port_master_unlinked)(void *port_priv, struct net_device 
>>> *master);
>>> +   int (*port_neigh_update)(void *port_priv, struct neighbour *n);
>>> +   int (*port_neigh_destroy)(void *port_priv, struct neighbour *n);
>>> +   int (*port_ev_mac_vlan_seen)(void *port_priv,
>>> +const unsigned char *addr,
>>> +__be16 vlan_id);
>>> +};
>>
>>Using void * in these ops is unacceptable, I can't agree to this patch.
>
> Could you please elaborate more why is it unacceptable. I don't see any
> problem in that.

void * is OK if you're passing around un-typed, TBD-sized objects.

Wait, this conversation doesn't even seem real.  You're really
advocating, publicly, the use of void * in ops callbacks, when the
pointer passed could be a well-typed pointer?

How could define a new API internal to the driver with void * as the main arg?

>>There is a much cleaner way to

Re: [patch net-next v3 06/14] rocker: introduce worlds infrastructure

2015-10-06 Thread Scott Feldman

On Tue, Oct 6, 2015 at 12:51 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> This is another step on the way to per-world clean cut. Introduce world
> ops hooks which each world can implement in world-specific way.
> Also introduce world infrastructure including function for port worlds
> change.
>
> Signed-off-by: Jiri Pirko 
> ---
> v1->v2:
>  - removed functions to change worlds based on name, as rtnl mode
>set patch is removed from patchset.
> v2->v3:
>  - fix checks in rocker_world_port_open and rocker_world_port_stop
> ---
>  drivers/net/ethernet/rocker/rocker.h  |  56 
>  drivers/net/ethernet/rocker/rocker_main.c | 464 
> +-
>  2 files changed, 519 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/rocker/rocker.h 
> b/drivers/net/ethernet/rocker/rocker.h
> index 650caa0..d49bc5d 100644
> --- a/drivers/net/ethernet/rocker/rocker.h
> +++ b/drivers/net/ethernet/rocker/rocker.h
> @@ -12,7 +12,11 @@
>  #ifndef _ROCKER_H
>  #define _ROCKER_H
>
> +#include 
>  #include 
> +#include 
> +#include 
> +#include 
>
>  #include "rocker_hw.h"
>
> @@ -24,4 +28,56 @@ struct rocker_desc_info {
> dma_addr_t mapaddr;
>  };
>
> +struct rocker;
> +struct rocker_port;
> +
> +struct rocker_world_ops {
> +   const char *kind;
> +   size_t priv_size;
> +   size_t port_priv_size;
> +   u8 mode;
> +   int (*init)(struct rocker *rocker, void *priv);
> +   void (*fini)(void *priv);
> +   int (*port_init)(struct rocker_port *rocker_port, void *priv,
> +void *port_priv);
> +   void (*port_fini)(void *port_priv);
> +   int (*port_open)(void *port_priv);
> +   void (*port_stop)(void *port_priv);
> +   int (*port_attr_stp_state_set)(void *port_priv, u8 state,
> +  struct switchdev_trans *trans);
> +   int (*port_attr_bridge_flags_set)(void *port_priv,
> + unsigned long brport_flags,
> + struct switchdev_trans *trans);
> +   int (*port_attr_bridge_flags_get)(void *port_priv,
> + unsigned long *p_brport_flags);
> +   int (*port_obj_vlan_add)(void *port_priv,
> +const struct switchdev_obj_port_vlan *vlan,
> +struct switchdev_trans *trans);
> +   int (*port_obj_vlan_del)(void *port_priv,
> +const struct switchdev_obj_port_vlan *vlan);
> +   int (*port_obj_vlan_dump)(void *port_priv,
> + struct switchdev_obj_port_vlan *vlan,
> + switchdev_obj_dump_cb_t *cb);
> +   int (*port_obj_fib4_add)(void *port_priv,
> +const struct switchdev_obj_ipv4_fib *fib4,
> +struct switchdev_trans *trans);
> +   int (*port_obj_fib4_del)(void *port_priv,
> +const struct switchdev_obj_ipv4_fib *fib4);
> +   int (*port_obj_fdb_add)(void *port_priv,
> +   const struct switchdev_obj_port_fdb *fdb,
> +   struct switchdev_trans *trans);
> +   int (*port_obj_fdb_del)(void *port_priv,
> +   const struct switchdev_obj_port_fdb *fdb);
> +   int (*port_obj_fdb_dump)(void *port_priv,
> +struct switchdev_obj_port_fdb *fdb,
> +switchdev_obj_dump_cb_t *cb);
> +   int (*port_master_linked)(void *port_priv, struct net_device *master);
> +   int (*port_master_unlinked)(void *port_priv, struct net_device 
> *master);
> +   int (*port_neigh_update)(void *port_priv, struct neighbour *n);
> +   int (*port_neigh_destroy)(void *port_priv, struct neighbour *n);
> +   int (*port_ev_mac_vlan_seen)(void *port_priv,
> +const unsigned char *addr,
> +__be16 vlan_id);
> +};

Using void * in these ops is unacceptable, I can't agree to this patch.

There is a much cleaner way to architect this.  If you look at the ops
defined, they're mostly duplicates of the already defined
switchdev_ops.  It would be much cleaner to:

0) set port mode on qemu/rocker (the device)
1) get the port mode on port probe
2) based on port mode, set the switchdev_ops to point to the port mode
world switchdev_ops
3) sub-class rocker_port, like I mentioned in before, to store
world-specific stuff in rocker_port

I don't buy the argument that we need to change port mode dynamically
from the driver.  Set it at the device and be done.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v3 06/14] rocker: introduce worlds infrastructure

2015-10-06 Thread Scott Feldman

On Tue, Oct 6, 2015 at 9:19 AM, John Fastabend <john.fastab...@gmail.com> wrote:
> On 15-10-06 09:06 AM, Scott Feldman wrote:
>> On Tue, Oct 6, 2015 at 12:51 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>> From: Jiri Pirko <j...@mellanox.com>
>>>
>>> This is another step on the way to per-world clean cut. Introduce world
>>> ops hooks which each world can implement in world-specific way.
>>> Also introduce world infrastructure including function for port worlds
>>> change.
>>>
>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>> ---
>>> v1->v2:
>>>  - removed functions to change worlds based on name, as rtnl mode
>>>set patch is removed from patchset.
>>> v2->v3:
>>>  - fix checks in rocker_world_port_open and rocker_world_port_stop
>>> ---
>>>  drivers/net/ethernet/rocker/rocker.h  |  56 
>>>  drivers/net/ethernet/rocker/rocker_main.c | 464 
>>> +-
>>>  2 files changed, 519 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/rocker/rocker.h 
>>> b/drivers/net/ethernet/rocker/rocker.h
>>> index 650caa0..d49bc5d 100644
>>> --- a/drivers/net/ethernet/rocker/rocker.h
>>> +++ b/drivers/net/ethernet/rocker/rocker.h
>>> @@ -12,7 +12,11 @@
>>>  #ifndef _ROCKER_H
>>>  #define _ROCKER_H
>>>
>>> +#include 
>>>  #include 
>>> +#include 
>>> +#include 
>>> +#include 
>>>
>>>  #include "rocker_hw.h"
>>>
>>> @@ -24,4 +28,56 @@ struct rocker_desc_info {
>>> dma_addr_t mapaddr;
>>>  };
>>>
>>> +struct rocker;
>>> +struct rocker_port;
>>> +
>>> +struct rocker_world_ops {
>>> +   const char *kind;
>>> +   size_t priv_size;
>>> +   size_t port_priv_size;
>>> +   u8 mode;
>>> +   int (*init)(struct rocker *rocker, void *priv);
>>> +   void (*fini)(void *priv);
>>> +   int (*port_init)(struct rocker_port *rocker_port, void *priv,
>>> +void *port_priv);
>>> +   void (*port_fini)(void *port_priv);
>>> +   int (*port_open)(void *port_priv);
>>> +   void (*port_stop)(void *port_priv);
>>> +   int (*port_attr_stp_state_set)(void *port_priv, u8 state,
>>> +  struct switchdev_trans *trans);
>>> +   int (*port_attr_bridge_flags_set)(void *port_priv,
>>> + unsigned long brport_flags,
>>> + struct switchdev_trans *trans);
>>> +   int (*port_attr_bridge_flags_get)(void *port_priv,
>>> + unsigned long *p_brport_flags);
>>> +   int (*port_obj_vlan_add)(void *port_priv,
>>> +const struct switchdev_obj_port_vlan *vlan,
>>> +struct switchdev_trans *trans);
>>> +   int (*port_obj_vlan_del)(void *port_priv,
>>> +const struct switchdev_obj_port_vlan 
>>> *vlan);
>>> +   int (*port_obj_vlan_dump)(void *port_priv,
>>> + struct switchdev_obj_port_vlan *vlan,
>>> + switchdev_obj_dump_cb_t *cb);
>>> +   int (*port_obj_fib4_add)(void *port_priv,
>>> +const struct switchdev_obj_ipv4_fib *fib4,
>>> +struct switchdev_trans *trans);
>>> +   int (*port_obj_fib4_del)(void *port_priv,
>>> +const struct switchdev_obj_ipv4_fib *fib4);
>>> +   int (*port_obj_fdb_add)(void *port_priv,
>>> +   const struct switchdev_obj_port_fdb *fdb,
>>> +   struct switchdev_trans *trans);
>>> +   int (*port_obj_fdb_del)(void *port_priv,
>>> +   const struct switchdev_obj_port_fdb *fdb);
>>> +   int (*port_obj_fdb_dump)(void *port_priv,
>>> +struct switchdev_obj_port_fdb *fdb,
>>> +switchdev_obj_dump_cb_t *cb);
>>> +   int (*port_master_linked)(void *port_priv, struct net_device 
>>> *master);
>>> +   int (*port_master_unlinked)(void *port_priv, struct net_device 
>>> *master);
>>> +   int (*port_ne

Re: [patch net-next v2 00/13] rocker: add support for multiple worlds

2015-10-06 Thread Scott Feldman

On Tue, Oct 6, 2015 at 12:30 AM, Jiri Pirko  wrote:
> Tue, Oct 06, 2015 at 05:56:12AM CEST, sfel...@gmail.com wrote:
>>On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> This patchset allows new rocker worlds to be easily added in future (like 
>>> eBPF
>>> based one I have been working on). The main part of the patchset is the 
>>> OF-DPA
>>> carve-out. It resuts in OF-DPA specific file. Clean cut.
>>>
>>> v1->v2:
>>>  - rtnl rocker mode change userspace expose patch was removed
>>>
>>> Jiri Pirko (13):
>>>   rocker: remove unused rocker_port param from alloc funcs and shorten
>>> their names
>>>   rocker: rename rocker.h to rocker_hw.h
>>>   rocker: rename rocker.c to rocker_main.c
>>>   rocker: push tlv processing into separate files
>>>   rocker: implement set settings mode command
>>>   rocker: introduce worlds infrastructure
>>>   rocker: introduce OF-DPA world skeleton
>>>   rocker: set default world on port probe and clean world on remove
>>>   rocker: pass "learning" value as a parameter to
>>> rocker_port_set_learning
>>>   rocker: pre-allocate wait structures during cmd ring init
>>>   rocker: remove trans parameter to rocker_cmd_exec function
>>>   rocker: call rocker_cmd_exec function with "nowait" boolean instead of
>>> flags
>>>   rocker: move OF-DPA stuff into separate file
>>
>>A couple of my tests are failing with this patchset.  A simple port
>>test is failing and IPv4 routing test is failing.
>>
>>The port test is simple: just connect a port on DUT to a port on
>>another system and assign an IP address to each port and verify IP
>>connectivity.  I have this:
>>
>>   DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)
>>
>>The IPv4 routing tests is a bit more complicated to setup.  I'm using
>>OSPF, but I'm not seeing full routes formed in the topology, so I
>>suspect OSPF hellos aren't getting thru.
>>
>>Please fix find/fix these issues and send v3.  I don't want any git
>>bisect issues when running tests.  Thanks.
>
> I fixed that. Sending v3 in a sec. Thanks.

Sorry, both tests are still broken.  Would you send me your tests
scripts so I can see why your tests are passing?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 13/13] rocker: move OF-DPA stuff into separate file

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 10:44 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Carve out OF-DPA would specific code from the common file to the world
> file. This change required struct rocker and struct rocker_port split
> into world specific struct ofdpa and struct ofdpa_port. Along with this
> the world specific functions and defines were renamed from prefix
> "rocker_" to "ofdpa_".
>
> Signed-off-by: Jiri Pirko 

I'd like a chance to run tests against this patchset and study the
patches closer.  It's going to take a couple of days, so I'd like to
ask that we take a little bit of time before merging this in.  It's
all good work; just want to do due-diligence before it's pushed.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 00/14] rocker: add support for multiple worlds

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 9:58 AM, John Fastabend  wrote:
> On 15-10-05 09:30 AM, Jiri Pirko wrote:
>> Mon, Oct 05, 2015 at 05:41:38PM CEST, john.fastab...@gmail.com wrote:
>>> On 15-10-04 02:25 PM, Jiri Pirko wrote:
 From: Jiri Pirko 

 This patchset allows new rocker worlds to be easily added in future (like 
 eBPF
 based one I have been working on). The main part of the patchset is the 
 OF-DPA
 carve-out. It resuts in OF-DPA specific file. Clean cut.
 The user is able to change rocker port world/mode using rtnl.

>>>
>>> Hi Jiri,
>>>
>>> I'm not sure I understand the motivation here. Are you thinking the
>>> "real" drivers will start to load worlds or what I've been calling
>>> profiles on the devices I have here. If this is the case using
>>> opaque strings without any other infrastructure around it to expose
>>> what the profile is doing is not sufficient in my opinion. What I
>>> would rather have is for drivers to expose the actual configuration
>>> parameters they are using, preferable these would be both readable
>>> and writable so we don't end up with what the firmware/device driver
>>> writers think is best. I think we can get there by exposing a model
>>> of the device and configuring "tables". I'll post my latest patch
>>> set today to give you a better idea what I'm thinking here. Without
>>> this I guess you will end up with drivers creating many profiles and
>>> in no consistent way so you end up with here is my "vxlan" profile,
>>> here is my "geneve" profile, here is my "magic-foo" profile, etc. I
>>> wanted to avoid this.
>>
>> This is just for rocker purposes. I do not want to do something similar
>> for real devices. It does not make sense as real hw always have some
>> hard-wired topology. Rocker HW does not. I think that this is the main
>> part that may cause some misunderstandings.
>
> I think your underestimating the flexibility of hardware. And
> completely missing the hardware that is based on FPGAs and/or cell
> architectures. This hardware is available today and could support
> topology changes like this. But even less exotic hardware can/will
> support parser updates which makes the device behave differently.
>
> Other hardware can reconfigure the topology within some constraints,
> the fm10k device supports this model. An extreme example would put
> an ebpf interpreter in a fpga on the nic and expose it via a driver.
>
> If its just for rocker purposes I'm not really excited about adding
> it to the kernel to support a qemu device. If we allow it for one
> driver I don't see how/why we should block it for "real" devices.
> From the kernels point of view these are all real drivers. I could
> build a qemu model that maps 1:1 with real hardware and do a drop
> in replacement.
>
>>
>> Rocker has a notion of "worlds". When a port is set to be in a certain
>> world, it behaves in completely different way. Now we have just OF-DPA
>> world. I will be adding BPF world shortly.
>>
>> This has nothing to do with profiles as you describe it, this is
>> something completely different!
>>
>>
>
> I'm missing why its different.
>
> Would you object to me adding multiple worlds to fm10k
> using opaque strings? I'll create a world with a topology that maps
> well to ipv4 networks, a world for ipv6 networks, a world for l2 flat
> networks, etc. Each world in this example will have a specific table
> topology and parser to support it. In this sense the ports will behave
> in completely different ways i.e. packets will be processed by
> different pipelines. Are you suggesting we do this?
>
> I'm not sure what you mean by completely different? Is it just a
> different parser and table topology? Real hardware can support changing
> or at least modifying these today.
>
>>>
>>> But if this is only meant to be a rocker thing then why expose it on
>>> the driver side vs just compiling it on the qemu side? If its just
>>
>> I want user to be able to set the world/mode of the port on fly. No need to
>> re-set the hardware if possible to do it from driver.
>>
>
> But the user has no way to know what these strings are doing?
>
>>
>>> for convenience and only meant for the emulated device we should be
>>> clear in the documentation and patch set.
>>
>> This is rocker-only patchset, where do you want to clear it?
>>
>
> I don't think this is reasonable from the kernel side to "know" or
> expose a driver is running on qemu like this. The kernel shouldn't
> know or care if a device is emulated or not.
>
>>
>>>
>>> Final, comment can we abstract the interfaces better? An L2 and L3
>>> table could be mapped generically onto a table pipeline model if the
>>> driver gave some small hints like this is my l2 table and this is my l3
>>> table. Then you don't need all the world specific callbacks and the
>>> OF-DPA model just looks like an instance of a pipeline with some
>>> specific hints where to put l2/l3 rules.
>>
>> I think you are

Re: [patch net-next 05/14] rocker: implement set settings mode command

2015-10-05 Thread Scott Feldman

On Sun, Oct 4, 2015 at 2:25 PM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Introduce a helper to ask HW for change of the port mode (world).
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 01/13] rocker: remove unused rocker_port param from alloc funcs and shorten their names

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> No need to pass rocker_port around to alloc/free rocker functions,
> since they now use switchdev_trans for memory management storage.
> With the param removal, shorten the name of the functions since they now
> has nothing to do with rocker port.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 03/13] rocker: rename rocker.c to rocker_main.c

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Since "rocker.c" is going to be split into multiple files, start with
> renaming original "rocker.c" file to "rocker_main.c". Multiple code
> parts are going to be cut from "rocker_main.c" later on.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 06/13] rocker: introduce worlds infrastructure

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> This is another step on the way to per-world clean cut. Introduce world
> ops hooks which each world can implement in world-specific way.
> Also introduce world infrastructure including function for port worlds
> change.
>
> Signed-off-by: Jiri Pirko 
> ---
> v1->v2:
>  - removed functions to change worlds based on name, as rtnl mode
>set patch is removed from patchset.
> ---
>  drivers/net/ethernet/rocker/rocker.h  |  56 
>  drivers/net/ethernet/rocker/rocker_main.c | 464 
> +-
>  2 files changed, 519 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/rocker/rocker.h 
> b/drivers/net/ethernet/rocker/rocker.h
> index 650caa0..d49bc5d 100644
> --- a/drivers/net/ethernet/rocker/rocker.h
> +++ b/drivers/net/ethernet/rocker/rocker.h
> @@ -12,7 +12,11 @@
>  #ifndef _ROCKER_H
>  #define _ROCKER_H
>
> +#include 
>  #include 
> +#include 
> +#include 
> +#include 
>
>  #include "rocker_hw.h"
>
> @@ -24,4 +28,56 @@ struct rocker_desc_info {
> dma_addr_t mapaddr;
>  };
>
> +struct rocker;
> +struct rocker_port;
> +
> +struct rocker_world_ops {
> +   const char *kind;
> +   size_t priv_size;
> +   size_t port_priv_size;
> +   u8 mode;
> +   int (*init)(struct rocker *rocker, void *priv);
> +   void (*fini)(void *priv);
> +   int (*port_init)(struct rocker_port *rocker_port, void *priv,
> +void *port_priv);
> +   void (*port_fini)(void *port_priv);
> +   int (*port_open)(void *port_priv);
> +   void (*port_stop)(void *port_priv);
> +   int (*port_attr_stp_state_set)(void *port_priv, u8 state,
> +  struct switchdev_trans *trans);
> +   int (*port_attr_bridge_flags_set)(void *port_priv,
> + unsigned long brport_flags,
> + struct switchdev_trans *trans);
> +   int (*port_attr_bridge_flags_get)(void *port_priv,
> + unsigned long *p_brport_flags);
> +   int (*port_obj_vlan_add)(void *port_priv,
> +const struct switchdev_obj_port_vlan *vlan,
> +struct switchdev_trans *trans);
> +   int (*port_obj_vlan_del)(void *port_priv,
> +const struct switchdev_obj_port_vlan *vlan);
> +   int (*port_obj_vlan_dump)(void *port_priv,
> + struct switchdev_obj_port_vlan *vlan,
> + switchdev_obj_dump_cb_t *cb);
> +   int (*port_obj_fib4_add)(void *port_priv,
> +const struct switchdev_obj_ipv4_fib *fib4,
> +struct switchdev_trans *trans);
> +   int (*port_obj_fib4_del)(void *port_priv,
> +const struct switchdev_obj_ipv4_fib *fib4);
> +   int (*port_obj_fdb_add)(void *port_priv,
> +   const struct switchdev_obj_port_fdb *fdb,
> +   struct switchdev_trans *trans);
> +   int (*port_obj_fdb_del)(void *port_priv,
> +   const struct switchdev_obj_port_fdb *fdb);
> +   int (*port_obj_fdb_dump)(void *port_priv,
> +struct switchdev_obj_port_fdb *fdb,
> +switchdev_obj_dump_cb_t *cb);
> +   int (*port_master_linked)(void *port_priv, struct net_device *master);
> +   int (*port_master_unlinked)(void *port_priv, struct net_device 
> *master);
> +   int (*port_neigh_update)(void *port_priv, struct neighbour *n);
> +   int (*port_neigh_destroy)(void *port_priv, struct neighbour *n);
> +   int (*port_ev_mac_vlan_seen)(void *port_priv,
> +const unsigned char *addr,
> +__be16 vlan_id);
> +};

Yuck, void *.  Can we do better?

How about using base struct rocker_port and then sub-classing
world-specific rocker ports.  And make rocker_world_ops pass struct
rocker_port * (for the port-centric ops).

struct rocker_port {
// common stuff for all worlds
};

struct my_world_port {
struct rocker_port rocker_port;
// world-specific stuff for port
};

#define MY_WORLD_PORT(rocker_port) \
container_of(rocker_port, struct my_world_port, rocker_port)

Same for world-centric ops:

struct rocker_world {
   // common stuff for all worlds
};

struct my_world {
struct rocker_world rocker_world;
// stuff for this world
};


> +static int rocker_world_port_ev_mac_vlan_seen(struct rocker_port 
> *rocker_port,
> + const unsigned char *addr,
> + __be16 vlan_id);
>
>  static int rocker_event_mac_vlan_seen(const struct rocker *rocker,

Re: [patch net-next v2 02/13] rocker: rename rocker.h to rocker_hw.h

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Since "rocker.h" file is going to be used for different purpose,
> rename the hardware-specific header to "rocker_hw.h".
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 04/13] rocker: push tlv processing into separate files

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Carve out TLV processing helpers into separate files.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 00/13] rocker: add support for multiple worlds

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> This patchset allows new rocker worlds to be easily added in future (like eBPF
> based one I have been working on). The main part of the patchset is the OF-DPA
> carve-out. It resuts in OF-DPA specific file. Clean cut.
>
> v1->v2:
>  - rtnl rocker mode change userspace expose patch was removed
>
> Jiri Pirko (13):
>   rocker: remove unused rocker_port param from alloc funcs and shorten
> their names
>   rocker: rename rocker.h to rocker_hw.h
>   rocker: rename rocker.c to rocker_main.c
>   rocker: push tlv processing into separate files
>   rocker: implement set settings mode command
>   rocker: introduce worlds infrastructure
>   rocker: introduce OF-DPA world skeleton
>   rocker: set default world on port probe and clean world on remove
>   rocker: pass "learning" value as a parameter to
> rocker_port_set_learning
>   rocker: pre-allocate wait structures during cmd ring init
>   rocker: remove trans parameter to rocker_cmd_exec function
>   rocker: call rocker_cmd_exec function with "nowait" boolean instead of
> flags
>   rocker: move OF-DPA stuff into separate file

A couple of my tests are failing with this patchset.  A simple port
test is failing and IPv4 routing test is failing.

The port test is simple: just connect a port on DUT to a port on
another system and assign an IP address to each port and verify IP
connectivity.  I have this:

   DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)

The IPv4 routing tests is a bit more complicated to setup.  I'm using
OSPF, but I'm not seeing full routes formed in the topology, so I
suspect OSPF hellos aren't getting thru.

Please fix find/fix these issues and send v3.  I don't want any git
bisect issues when running tests.  Thanks.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 09/14] rocker: add rtnl ops for port mode [gs]etting

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 9:50 AM, Jiri Pirko  wrote:
> Mon, Oct 05, 2015 at 06:37:20PM CEST, sfel...@gmail.com wrote:
>>On Mon, Oct 5, 2015 at 8:53 AM, Jiri Pirko  wrote:
>>> Mon, Oct 05, 2015 at 05:41:29PM CEST, sfel...@gmail.com wrote:
On Sun, Oct 4, 2015 at 2:25 PM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Introduce a stub for allowing user to change rocker port world/mode.
> This is implemented using rtnl changelink op.
>
> Signed-off-by: Jiri Pirko 

[cut]

> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 3a5f263..7da768e 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -416,6 +416,17 @@ enum {
>  };
>  #define IFLA_GENEVE_MAX(__IFLA_GENEVE_MAX - 1)
>
> +/* Rocker section */
> +enum {
> +   IFLA_ROCKER_UNSPEC,
> +   IFLA_ROCKER_MODE,
> +   __IFLA_ROCKER_MAX,
> +};
> +
> +#define IFLA_ROCKER_MAX(__IFLA_ROCKER_MAX - 1)
> +
> +#define ROCKER_MODE_MAX 16

Does this mean there is going to be a "ip link set dev DEV type rocker
mode MODE" command option?

It doesn't seem right to be adding driver-specific IFLA_'s here.  I
think this sets bad precedence for other drivers to add their own
knobs without thinking about a generic shared mechanism.
>>>
>>> I understand you point. I somehow share it as well. But on the other
>>> hand, That is neat way to set mode of the port, and I cannot find other
>>> this nice.
>>>
>>>

Actually, I don't see the point of letting the user dynamically change
the port mode.  I would prefer this knob be moved to qemu/rocker.  Let
the port mode be specified on device creation.
>>>
>>> Hmm, interesting, why? I find it great for user to be able to switch the
>>> port mode easily on the running system. It is a setting of a port.
>>> I don't see why this should be a hard-coded hw setting. We just have
>>> to find a way to expose this to user.
>>
>>You could just as easy restart the qemu session with different port
>>mode.  Even if rocker was a real device, I don't see (real) users
>>wanting to change the port mode at run-time.  It just doesn't seem
>>like a real-world scenario.   I know it's convenient for developers,
>
> We the hw-iface is already designed for on-fly changes.
> I'm convinced that this is one of the situation where we should not do
> stuff in HW just because we can. (Or we should do both).

I've used that argument myself for other things, such as the ageing
timer in rocker driver, so I agree with you on that point.

> We should develop in-kernel solution in order to wrap up HW functionality.
> Even if not easy. That is the purpose of rocker from day 1, and that is
> why we love it :)

Well then you need to define a generic interface for this concept of
changing port mode that's not rocker-specific.  What you have in
if_link.h is easy.  What's the not-easy solution?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 09/14] rocker: add rtnl ops for port mode [gs]etting

2015-10-05 Thread Scott Feldman

On Sun, Oct 4, 2015 at 2:25 PM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Introduce a stub for allowing user to change rocker port world/mode.
> This is implemented using rtnl changelink op.
>
> Signed-off-by: Jiri Pirko 

[cut]

> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 3a5f263..7da768e 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -416,6 +416,17 @@ enum {
>  };
>  #define IFLA_GENEVE_MAX(__IFLA_GENEVE_MAX - 1)
>
> +/* Rocker section */
> +enum {
> +   IFLA_ROCKER_UNSPEC,
> +   IFLA_ROCKER_MODE,
> +   __IFLA_ROCKER_MAX,
> +};
> +
> +#define IFLA_ROCKER_MAX(__IFLA_ROCKER_MAX - 1)
> +
> +#define ROCKER_MODE_MAX 16

Does this mean there is going to be a "ip link set dev DEV type rocker
mode MODE" command option?

It doesn't seem right to be adding driver-specific IFLA_'s here.  I
think this sets bad precedence for other drivers to add their own
knobs without thinking about a generic shared mechanism.

Actually, I don't see the point of letting the user dynamically change
the port mode.  I would prefer this knob be moved to qemu/rocker.  Let
the port mode be specified on device creation.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 09/14] rocker: add rtnl ops for port mode [gs]etting

2015-10-05 Thread Scott Feldman

On Mon, Oct 5, 2015 at 8:53 AM, Jiri Pirko  wrote:
> Mon, Oct 05, 2015 at 05:41:29PM CEST, sfel...@gmail.com wrote:
>>On Sun, Oct 4, 2015 at 2:25 PM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> Introduce a stub for allowing user to change rocker port world/mode.
>>> This is implemented using rtnl changelink op.
>>>
>>> Signed-off-by: Jiri Pirko 
>>
>>[cut]
>>
>>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>>> index 3a5f263..7da768e 100644
>>> --- a/include/uapi/linux/if_link.h
>>> +++ b/include/uapi/linux/if_link.h
>>> @@ -416,6 +416,17 @@ enum {
>>>  };
>>>  #define IFLA_GENEVE_MAX(__IFLA_GENEVE_MAX - 1)
>>>
>>> +/* Rocker section */
>>> +enum {
>>> +   IFLA_ROCKER_UNSPEC,
>>> +   IFLA_ROCKER_MODE,
>>> +   __IFLA_ROCKER_MAX,
>>> +};
>>> +
>>> +#define IFLA_ROCKER_MAX(__IFLA_ROCKER_MAX - 1)
>>> +
>>> +#define ROCKER_MODE_MAX 16
>>
>>Does this mean there is going to be a "ip link set dev DEV type rocker
>>mode MODE" command option?
>>
>>It doesn't seem right to be adding driver-specific IFLA_'s here.  I
>>think this sets bad precedence for other drivers to add their own
>>knobs without thinking about a generic shared mechanism.
>
> I understand you point. I somehow share it as well. But on the other
> hand, That is neat way to set mode of the port, and I cannot find other
> this nice.
>
>
>>
>>Actually, I don't see the point of letting the user dynamically change
>>the port mode.  I would prefer this knob be moved to qemu/rocker.  Let
>>the port mode be specified on device creation.
>
> Hmm, interesting, why? I find it great for user to be able to switch the
> port mode easily on the running system. It is a setting of a port.
> I don't see why this should be a hard-coded hw setting. We just have
> to find a way to expose this to user.

You could just as easy restart the qemu session with different port
mode.  Even if rocker was a real device, I don't see (real) users
wanting to change the port mode at run-time.  It just doesn't seem
like a real-world scenario.   I know it's convenient for developers,
but a qemu knob is just as convenient.  Doing this in qemu will keep
the rocker driver less complicated, cleaner.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 3/6] switchdev: rename switchdev_obj_vlan to switchdev_obj_port_vlan

2015-10-01 Thread Scott Feldman

On Thu, Oct 1, 2015 at 2:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Make the struct name in sync with object id name.
>
> Suggested-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 2/6] switchdev: rename SWITCHDEV_ATTR_* enum values to SWITCHDEV_ATTR_ID_*

2015-10-01 Thread Scott Feldman

On Thu, Oct 1, 2015 at 2:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> To be aligned with obj.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 5/6] switchdev: bring back switchdev_obj and use it as a generic object param

2015-10-01 Thread Scott Feldman

On Thu, Oct 1, 2015 at 2:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Replace "void *obj" with a generic structure. Introduce couple of
> helpers along that.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>
> ---
> v1->v2:
> * align object structure getter name with structure name as suggested by
>   Vivien
> ---

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 1/6] switchdev: rename SWITCHDEV_OBJ_* enum values to SWITCHDEV_OBJ_ID_*

2015-10-01 Thread Scott Feldman

On Thu, Oct 1, 2015 at 2:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Suggested-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 6/6] switchdev: push object ID back to object structure

2015-10-01 Thread Scott Feldman

On Thu, Oct 1, 2015 at 2:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Suggested-by: Scott Feldman <sfel...@gmail.com>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 4/6] switchdev: rename switchdev_obj_fdb to switchdev_obj_port_fdb

2015-10-01 Thread Scott Feldman

On Thu, Oct 1, 2015 at 2:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Make the struct name in sync with object id name.
>
> Suggested-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v3 02/10] switchdev: introduce transaction item queue for attr_set and obj_add

2015-09-30 Thread Scott Feldman

On Wed, Sep 30, 2015 at 11:56 AM, Vivien Didelot
<vivien.dide...@savoirfairelinux.com> wrote:
> Hi all,
>
> On Sep. Friday 25 (39) 11:03 AM, Vivien Didelot wrote:
>> On Sep. Thursday 24 (39) 10:55 PM, David Miller wrote:
>> > From: Scott Feldman <sfel...@gmail.com>
>> > Date: Thu, 24 Sep 2015 22:29:43 -0700
>> >
>> > > I'd rather keep 2-phase not optional, or at least make it some what of
>> > > a pain for drivers to opt-out of 2-phase.  Forcing the driver to see
>> > > both phases means the driver needs to put some code to skip phase 1
>> > > (and hopefully has some persistent comment explaining why its being
>> > > skipped).  Something like:
>> > >
>> > > /* I'm skipping phase 1 prepare for this operation.  I have infinite 
>> > > hardware
>> > >  * resources and I'm not setting any persistent state in the driver or 
>> > > device
>> > >  * and I don't need any dynamic resources from the kernel, so its 
>> > > impossible
>> > >  * for me to fail phase 2 commit.  Nothing to prepare, sorry.
>> > >  */
>> >
>> > I agree with Scott here.
>> >
>> > If you can opt out of something, you can not think about it and thus
>> > more likely get it wrong.
>> >
>> > I can just see a driver not implementing prepare at all and then doing
>> > stupid things in commit when they hit some resource limit or whatever,
>> > rather than taking care of such issues in prepare.
>>
>> OK, I have no experience with stacked devices nor what it actually looks
>> like, but I understand that it is a redundant setup where it makes sense
>> to ensure that an operation is feasible before programming the hardware.
>>
>> I agree with both of you on imposing switchdev drivers such notion.
>>
>> I was confused with the rtnl lock (from bridge netlink requests) which
>> seemed to limit a lot the usage of this prepare phase.
>>
>> I don't know the batch mode neither, but I can think about a potentially
>> powerful usage of the prepare phase in Marvell switches (or any basic
>> home router switches), please tell me if the following is feasible:
>>
>> Every hardware VLANs I know of are programmed with all port membership
>> in one shot. This is not feasible today with the bridge command. If I
>> could bundle in one request the equivalent of ("VID 100: 0u 1u 5t"):
>>
>> bridge vlan add master dev swp0 vid 100 pvid untagged
>> bridge vlan add master dev swp1 vid 100 pvid untagged
>> bridge vlan add master dev swp5 vid 100 # cpu
>>
>> In such case the prepare phase could be great to allocate and populate a
>> VLAN entry structure (i.e. struct mv88e6xxx_vtu_stu_entry) before
>> programming the hardware *just once*. Is that doable?
>
> May I get answers for this? I'd need that in order to suggest a next
> step for the prepare phase in DSA drivers.

You know we're just making this all up as we go, right?  ;)

Actually, bringing real-world examples us think about solutions, so thank you.

A partial answer to your question, I guess, is to consider
switchdev_port_obj_add(dev, obj), which is basically:

   switchdev_port_obj_add(dev, obj)
   dev->ops->switchdev_port_obj_add(dev, obj, PREPARE)
   if no err
   dev->ops->switchdev_port_obj_add(dev, obj, COMMIT)
   else
   // abort transaction

Seems you could make a batch version that iterates over list of devs
to add the same obj:

   switchdev_port_obj_add_batch(devs, obj)
   for dev in devs:
   dev->ops->switchdev_port_obj_add(dev, obj, PREPARE)
   if no errs:
   for dev in devs:
   dev->ops->switchdev_port_obj_add(dev, obj, COMMIT)
   else
   // abort transaction (on all devs)

Who calls this _batch version with a list of devs is TBD.   Not sure
if you can get the -batch and -force options in iproute2 ip cmd to get
you here.

And as I look back at your example, you really have two diff objs in
your cmd list.  First obj is vlan 100 pvid untagged, second obj is
vlan 100.  Would this be two batches?  Oh, hmmm, maybe the batch
version iterates over list of {dev, obj} tuples.  Something like
(sorry, my pseudo code is turning into python):

   switchdev_port_obj_add_batch(dev_objs)
   for dev, obj in dev_objs.items:
   dev->ops->switchdev_port_obj_add(dev, obj, PREPARE)
   if no errs:
   for dev, obj in dev_objs.items:
   dev->ops->switchdev_port_obj_add(dev, obj, COMMIT)
   else
   // abort transaction (on all devs)

Does this help?  Maybe we should walk before running and focus on
getting non-batch ops working and then revisit?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next] switchdev: bring back switchdev_obj and use it as a generic object param

2015-09-30 Thread Scott Feldman

On Wed, Sep 30, 2015 at 9:00 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Replace "void *obj" with a generic structure. Introduce couple of
> helpers along that.
>
> Signed-off-by: Jiri Pirko 

Looks good to me, except for the macro/enum collision Vivien pointed
out.  Vivien's suggestion to use SWITCHDEV_OBJ_ID_xxx for ID enums
would work.

Actually, id should be moved back into switchdev_obj, otherwise that
info is lost in dump callbacks.  Could have a common callback that
switches on id, for id specific work.  I'll submit a patch to move id
back into obj if Jiri doesn't add it to this one.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 net-next 6/6] net: switchdev: extract struct switchdev_obj_*

2015-09-30 Thread Scott Feldman

On Wed, Sep 30, 2015 at 3:36 AM, Jiri Pirko  wrote:
> Tue, Sep 29, 2015 at 06:07:18PM CEST, vivien.dide...@savoirfairelinux.com 
> wrote:
>>Now that switchdev and its drivers directly use specific switchdev_obj_*
>>structures, move them out of the switchdev_obj union and get rif of this
>>outer structure.
>>
>>Signed-off-by: Vivien Didelot 
>>---
>> include/net/switchdev.h | 53 
>> -
>> 1 file changed, 26 insertions(+), 27 deletions(-)
>>
>>diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>index bcadac3..e11425e 100644
>>--- a/include/net/switchdev.h
>>+++ b/include/net/switchdev.h
>>@@ -64,30 +64,29 @@ enum switchdev_obj_id {
>>   SWITCHDEV_OBJ_PORT_FDB,
>> };
>>
>>-struct switchdev_obj {
>>-  enum switchdev_obj_id id;
>>-  int (*cb)(struct switchdev_obj *obj);
>>-  union {
>>-  struct switchdev_obj_vlan { /* PORT_VLAN */
>>-  u16 flags;
>>-  u16 vid_begin;
>>-  u16 vid_end;
>>-  } vlan;
>>-  struct switchdev_obj_ipv4_fib { /* IPV4_FIB */
>>-  u32 dst;
>>-  int dst_len;
>>-  struct fib_info *fi;
>>-  u8 tos;
>>-  u8 type;
>>-  u32 nlflags;
>>-  u32 tb_id;
>>-  } ipv4_fib;
>>-  struct switchdev_obj_fdb {  /* PORT_FDB */
>>-  const unsigned char *addr;
>>-  u16 vid;
>>-  u16 ndm_state;
>>-  } fdb;
>>-  } u;
>>+/* SWITCHDEV_OBJ_PORT_VLAN */
>>+struct switchdev_obj_vlan {
>>+  u16 flags;
>>+  u16 vid_begin;
>>+  u16 vid_end;
>>+};
>>+
>>+/* SWITCHDEV_OBJ_IPV4_FIB */
>>+struct switchdev_obj_ipv4_fib {
>>+  u32 dst;
>>+  int dst_len;
>>+  struct fib_info *fi;
>>+  u8 tos;
>>+  u8 type;
>>+  u32 nlflags;
>>+  u32 tb_id;
>>+};
>>+
>>+/* SWITCHDEV_OBJ_PORT_FDB */
>>+struct switchdev_obj_fdb {
>>+  const unsigned char *addr;
>>+  u16 vid;
>>+  u16 ndm_state;
>> };
>
>
> I don't like these structs being passed down as a "void *". I think that
> we should have some "common" struct for these objects, event if it would
> be empty and pass it down. "void *" does not look good at all, does not
> tell the reader what that param is about. How about:
>
> struct switchdev_obj {
> };
>
> struct switchdev_obj_vlan {
> struct switchdev_obj obj;
> u16 flags;
> u16 vid_begin;
> u16 vid_end;
> };
> #define SWITCHDEV_OBJ_VLAN(obj) \
> container_of(obj, struct switchdev_obj_vlan, obj)
>
> /* SWITCHDEV_OBJ_IPV4_FIB */
> struct switchdev_obj_ipv4_fib {
> struct switchdev_obj obj;
> u32 dst;
> int dst_len;
> struct fib_info *fi;
> u8 tos;
> u8 type;
> u32 nlflags;
> u32 tb_id;
> };
> #define SWITCHDEV_OBJ_IPV4_FIB(obj) \
> container_of(obj, struct switchdev_obj_ipv4_fib, obj)
>
> /* SWITCHDEV_OBJ_PORT_FDB */
> struct switchdev_obj_fdb {
> struct switchdev_obj obj;
> const unsigned char *addr;
> u16 vid;
> u16 ndm_state;
> };
> #define SWITCHDEV_OBJ_FDB(obj)  \
> container_of(obj, struct switchdev_obj_fdb, obj)
>
>
> then pass struct switchdev_obj *obj down to drivers and in driver, get
> original object by SWITCHDEV_OBJ_* ?


Yes, I agree with Jiri, keep the struct switchdev_obj, use it within
each specific struct, and pass struct switchdev * in core funcs rather
than void *.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v3 00/10] switchdev: transaction item queue and cleanup

2015-09-24 Thread Scott Feldman

On Thu, Sep 24, 2015 at 1:02 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Jiri Pirko (10):
>   switchdev: rename "trans" to "trans_ph".
>   switchdev: introduce transaction item queue for attr_set and obj_add
>   switchdev: move transaction phase enum under transaction structure
>   switchdev: add switchdev_trans_ph_prepare/commit helpers
>   rocker: push struct switchdev_trans down through rocker code
>   rocker: use switchdev transaction queue for allocated memory
>   switchdev: remove "NONE" transaction phase
>   switchdev: remove "ABORT" transaction phase
>   dsa: use prepare/commit switchdev transaction helpers
>   switchdev: reduce transaction phase enum down to a boolean
>
>  Documentation/networking/switchdev.txt |  19 ++
>  drivers/net/ethernet/rocker/rocker.c   | 308 
> ++---
>  include/net/switchdev.h|  40 +++--
>  net/dsa/slave.c|  31 ++--
>  net/switchdev/switchdev.c  | 125 ++++++---
>  5 files changed, 296 insertions(+), 227 deletions(-)

Tests pass.  ACK on all.

Acked-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 0/4] switchdev: push bridge attributes down

2015-09-24 Thread Scott Feldman

On Thu, Sep 24, 2015 at 2:05 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> On Thu, 24 Sep 2015 13:59:26 -0700
> sfel...@gmail.com wrote:
>
>> From: Scott Feldman <sfel...@gmail.com>
>>
>> Push bridge-level attributes down to switchdev drivers.  This patchset
>> adds the infrastructure and then pushes, as an example, ageing_time attribute
>> down from bridge to switchdev (rocker) driver.  Add some range-checking
>> for ageing_time.
>>
>> # ip link set dev br0 type bridge ageing_time 1000
>>
>> # ip link set dev br0 type bridge ageing_time 999
>> RTNETLINK answers: Numerical result out of range
>>
>> Up until now, switchdev attrs where port-level attrs, so the netdev used in
>> switchdev_attr_set() would be a switch port or bond of switch ports.  With
>> bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
>> netdev.  The same recusive algo is used to visit the leaves of the stacked
>> drivers to set the attr, it's just in this case we start one layer higher in
>> the stack.  One note is not all ports in the bridge may support setting a
>> bridge-level attribute, so rather than failing the entire set, we'll skip 
>> over
>> those ports returning -EOPNOTSUPP.
>
>
> Rather than having more in bridge, shouldn't this just be a netlink event?

You lost me?  Oh, do you mean netdev notifier?  Jiri (privately) had
suggested that also.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 0/4] switchdev: push bridge attributes down

2015-09-24 Thread Scott Feldman

On Thu, Sep 24, 2015 at 6:23 PM, Florian Fainelli <f.faine...@gmail.com> wrote:
> On 24/09/15 13:59, sfel...@gmail.com wrote:
>> From: Scott Feldman <sfel...@gmail.com>
>>
>> Push bridge-level attributes down to switchdev drivers.  This patchset
>> adds the infrastructure and then pushes, as an example, ageing_time attribute
>> down from bridge to switchdev (rocker) driver.  Add some range-checking
>> for ageing_time.
>>
>> # ip link set dev br0 type bridge ageing_time 1000
>>
>> # ip link set dev br0 type bridge ageing_time 999
>> RTNETLINK answers: Numerical result out of range
>>
>> Up until now, switchdev attrs where port-level attrs, so the netdev used in
>> switchdev_attr_set() would be a switch port or bond of switch ports.  With
>> bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
>> netdev.  The same recusive algo is used to visit the leaves of the stacked
>> drivers to set the attr, it's just in this case we start one layer higher in
>> the stack.  One note is not all ports in the bridge may support setting a
>> bridge-level attribute, so rather than failing the entire set, we'll skip 
>> over
>> those ports returning -EOPNOTSUPP.
>
> So, without a better device to hold that kind of information (in the
> future it could be a global, switch-specific device holding that
> information), I agree with your decision to take the bridge device to
> hold that attribute, it still feels a bit uncomfortable to have
> switchdev_attr_port() take a bridge device parameter, but whatever, here
> is a scenario I am wondering how we would want to proceed with:
>
> - suppose we have a switch which is only able to control ageing
> globally, not per port or any other kind of logical domain
>
> - we have enabled two software bridges on the same physical switch, with
> different ageing timeouts
>
> It does not seem to me like it hurts ageing the other bridge faster than
> expected (even though that could be expensive for MDIO devices), but we
> would need to have consistent reporting here for the other bridge.

It could hurt a little bit by ageing out entries prematurely, forcing
relearning.  ;)

I think if the switch can't support multiple ageing timeouts (which is
probably typical), then the switch driver should not implement this
switchdev attr at the port level (this patchset).  So how does ageing
timeout get set for a switch with a global timer?  I believe we need
the switch-specific device to handle those switch-global attr sets.
I'll send out a refresh of my RFC patches for switch device class, and
add this ageing_time attr.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v3 02/10] switchdev: introduce transaction item queue for attr_set and obj_add

2015-09-24 Thread Scott Feldman

On Thu, Sep 24, 2015 at 9:36 PM, Vivien Didelot
 wrote:
> Hi Jiri,
>
> On Sep. Thursday 24 (39) 10:02 AM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>>
>> Now, the memory allocation in prepare/commit state is done separatelly
>> in each driver (rocker). Introduce the similar mechanism in generic
>> switchdev code, in form of queue. That can be used not only for memory
>> allocations, but also for different items. Abort item destruction
>> is handled as well.
>>
>> Signed-off-by: Jiri Pirko 
>
> [...]
>
>>  /**
>>   * struct switchdev_ops - switchdev operations
>>   *
>> @@ -94,9 +110,11 @@ struct switchdev_ops {
>>   int (*switchdev_port_attr_get)(struct net_device *dev,
>>  struct switchdev_attr *attr);
>>   int (*switchdev_port_attr_set)(struct net_device *dev,
>> -struct switchdev_attr *attr);
>> +struct switchdev_attr *attr,
>> +struct switchdev_trans *trans);
>>   int (*switchdev_port_obj_add)(struct net_device *dev,
>> -   struct switchdev_obj *obj);
>> +   struct switchdev_obj *obj,
>> +   struct switchdev_trans *trans);
>>   int (*switchdev_port_obj_del)(struct net_device *dev,
>> struct switchdev_obj *obj);
>>   int (*switchdev_port_obj_dump)(struct net_device *dev,
>
> This version is better than the current state, but it's too bad we
> didn't get feedback yet on the real purpose of this 2-phase model...

Hmmm...I think I missed the question about the 2-phase model.  Let me
see if I can answer it here.

We got here first when working with stacked drivers and we wanted to
propagate an change (attr set or obj add) down the stack in a safe way
as to not leave hardware or software in an inconsistent state when
unexpected things happen.  The simple stacked driver example was two
switch ports bonded.  Setting an attr on the bond would propagate down
to each switch port.  If the setting stuck on the first port, but
didn't on the second port, we didn't want to have to go back and undo
the setting on the first port.  It gets messy real fast trying to
handle the undo operation.  So 2-phase model was suggested by davem to
solve the problem in a generic way.  First phase was to check with
each device in the stack if the operation is OK, and second phase
would commit operation.  The second phase cannot fail; the first phase
already said everything was OK to proceed.  That means no running out
of hardware resources in phase 2.  Phase 1 should have check/reserved
the resources.  And no OOM situations in phase 2.  And so on for other
detectable failures.

Stacked driver case is one case where the 2-phase model strives to
maintain hw/sw consistency.  I could see it working for batch
operations also, although we don't have any examples of batch
processing yet (in switchdev).

I'd rather keep 2-phase not optional, or at least make it some what of
a pain for drivers to opt-out of 2-phase.  Forcing the driver to see
both phases means the driver needs to put some code to skip phase 1
(and hopefully has some persistent comment explaining why its being
skipped).  Something like:

/* I'm skipping phase 1 prepare for this operation.  I have infinite hardware
 * resources and I'm not setting any persistent state in the driver or device
 * and I don't need any dynamic resources from the kernel, so its impossible
 * for me to fail phase 2 commit.  Nothing to prepare, sorry.
 */

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next v2 02/10] switchdev: introduce transaction item queue for attr_set and obj_add

2015-09-23 Thread Scott Feldman

On Wed, Sep 23, 2015 at 12:57 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Now, the memory allocation in prepare/commit state is done separatelly
> in each driver (rocker). Introduce the similar mechanism in generic
> switchdev code, in form of queue. That can be used not only for memory
> allocations, but also for different items. Commit/abort item destruction
> is handled as well.
>
> Signed-off-by: Jiri Pirko 
> ---
> v1-v2:
> - added comment blocks to exported functions as suggested by Scott
> - made switchdev_trans_init and switchdev_trans_items_destroy static
> - added documentation for this to switchdev.txt as suggested by Scott

[cut]

> @@ -232,10 +304,13 @@ static int __switchdev_port_obj_add(struct net_device 
> *dev,
>   */
>  int switchdev_port_obj_add(struct net_device *dev, struct switchdev_obj *obj)
>  {
> +   struct switchdev_trans trans;
> int err;
>
> ASSERT_RTNL();
>
> +   switchdev_trans_init();
> +
> /* Phase I: prepare for obj add. Driver/device should fail
>  * here if there are going to be issues in the commit phase,
>  * such as lack of resources or support.  The driver/device
> @@ -244,7 +319,7 @@ int switchdev_port_obj_add(struct net_device *dev, struct 
> switchdev_obj *obj)
>  */
>
> obj->trans_ph = SWITCHDEV_TRANS_PREPARE;
> -   err = __switchdev_port_obj_add(dev, obj);
> +   err = __switchdev_port_obj_add(dev, obj, );
> if (err) {
> /* Prepare phase failed: abort the transaction.  Any
>  * resources reserved in the prepare phase are
> @@ -253,7 +328,8 @@ int switchdev_port_obj_add(struct net_device *dev, struct 
> switchdev_obj *obj)
>
> if (err != -EOPNOTSUPP) {
> obj->trans_ph = SWITCHDEV_TRANS_ABORT;
> -   __switchdev_port_obj_add(dev, obj);
> +   __switchdev_port_obj_add(dev, obj, );
> +   switchdev_trans_items_destroy();
> }
>
> return err;
> @@ -265,8 +341,9 @@ int switchdev_port_obj_add(struct net_device *dev, struct 
> switchdev_obj *obj)
>  */
>
> obj->trans_ph = SWITCHDEV_TRANS_COMMIT;
> -   err = __switchdev_port_obj_add(dev, obj);
> +   err = __switchdev_port_obj_add(dev, obj, );
> WARN(err, "%s: Commit of object (id=%d) failed.\n", dev->name, 
> obj->id);
> +   switchdev_trans_items_destroy();

Before this patchset, rocker would BUG_ON that the list was empty
after commit.  This was really helpful in finding bugs in the driver
as it meant the commit and prepare phases didn't follow the same code
path.  (Everything the driver queued in prepare was dequeued in
commit, so the list should be empty after commit).

Should this last switchdev_trans_items_destroy() be replaced with:

BUG_ON(!list_empty(_list));
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] rtnl_fdb_dump: catch errors from ndo_fdb_dump and ndo_dflt_fdb_dump

2015-09-23 Thread Scott Feldman

On Wed, Sep 23, 2015 at 9:21 AM, Roopa Prabhu <ro...@cumulusnetworks.com> wrote:
> From: Wilson Kok <w...@cumulusnetworks.com>
>
> current ndo_fdb_dump and ndo_dflt_fdb_dump always return the current
> fdb index. They dont return errors. Which results in fdb dumps
> continuing on errors.
>
> In one such case where bridges and vxlan devices were involved,
> bridge driver returned -EMSGSIZE on a bridge, but since it continued
> on error, the next vxlan device fdb dump (which was smaller in size)
> succeeded, leaving fdb idx at an inconsistent value. This
> resulted in the bridge fdb entry getting skipped and vxlan
> fdb entry getting dumped twice.
>
> This patch changes ndo_fdb_dump() to return the status and pass the
> idx by reference for update. The dump aborts if non-zero status is
> returned.
>
> Signed-off-by: Wilson Kok <w...@cumulusnetworks.com>
> Signed-off-by: Roopa Prabhu <ro...@cumulusnetworks.com>

Reviewed-by: Scott Feldman <sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 02/10] switchdev: introduce transaction item queue for attr_set and obj_add

2015-09-22 Thread Scott Feldman

On Tue, Sep 22, 2015 at 6:53 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> Now, the memory allocation in prepare/commit state is done separatelly
> in each driver (rocker). Introduce the similar mechanism in generic
> switchdev code, in form of queue. That can be used not only for memory
> allocations, but also for different items. Commit/abort item destruction
> is handled as well.
>
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index df5a544..55a8411 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -1,6 +1,6 @@
>  /*
>   * net/switchdev/switchdev.c - Switch device API
> - * Copyright (c) 2014 Jiri Pirko <j...@resnulli.us>
> + * Copyright (c) 2014-2015 Jiri Pirko <j...@resnulli.us>
>   * Copyright (c) 2014-2015 Scott Feldman <sfel...@gmail.com>
>   *
>   * This program is free software; you can redistribute it and/or modify
> @@ -16,9 +16,56 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>
> +void switchdev_trans_item_enqueue(struct switchdev_trans *trans,
> + void *data, void (*destructor)(void const 
> *),
> + struct switchdev_trans_item *tritem)
> +{
> +   tritem->data = data;
> +   tritem->destructor = destructor;
> +   list_add_tail(>list, >item_list);
> +}
> +EXPORT_SYMBOL_GPL(switchdev_trans_item_enqueue);
> +
> +static struct switchdev_trans_item *
> +__switchdev_trans_item_dequeue(struct switchdev_trans *trans)
> +{
> +   struct switchdev_trans_item *tritem;
> +
> +   if (list_empty(>item_list))
> +   return NULL;
> +   tritem = list_first_entry(>item_list,
> + struct switchdev_trans_item, list);
> +   list_del(>list);
> +   return tritem;
> +}
> +
> +void *switchdev_trans_item_dequeue(struct switchdev_trans *trans)
> +{
> +   struct switchdev_trans_item *tritem;
> +
> +   tritem = __switchdev_trans_item_dequeue(trans);
> +   BUG_ON(!tritem);
> +   return tritem->data;
> +}
> +EXPORT_SYMBOL_GPL(switchdev_trans_item_dequeue);

Please add header comment block for these EXPORT_SYMBOL_GPL funcs.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 00/10] switchdev: transaction item queue and cleanup

2015-09-22 Thread Scott Feldman

On Tue, Sep 22, 2015 at 6:53 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Jiri Pirko (10):
>   switchdev: rename "trans" to "trans_ph".
>   switchdev: introduce transaction item queue for attr_set and obj_add
>   switchdev: move transaction phase enum under transaction structure
>   switchdev: add switchdev_trans_ph_prepare/commit helpers
>   rocker: push struct switchdev_trans down through rocker code
>   rocker: use switchdev transaction queue for allocated memory
>   switchdev: remove "NONE" transaction phase
>   switchdev: remove "ABORT" transaction phase
>   dsa: use prepare/commit switchdev transaction helpers
>   switchdev: reduce transaction phase enum down to a boolean
>
>  drivers/net/ethernet/rocker/rocker.c | 311 
> +++
>  include/net/switchdev.h  |  40 +++--
>  net/dsa/slave.c  |  31 ++--
>  net/switchdev/switchdev.c|  99 ---
>  4 files changed, 252 insertions(+), 229 deletions(-)

I like this version much better!  Thank you for making the
adjustments.  My main concern about easily opting-out of prepare phase
is gone.  And I appreciate that you moved trans * as arg to attr_set
and obj_add.

Would you add a write-up in switchdev.txt about prepare-commit model
and how to use the transaction object?  Add it to this patchset, if
you can.

What testing have you done?  I'll do testing today on my bench and
report back, but I kind of want to know what to expect.

Give a day or two for testing and review.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 0/6] switchdev: introduce tranction enfra and for pre-commit split

2015-09-21 Thread Scott Feldman

On Sat, Sep 19, 2015 at 5:29 AM, Jiri Pirko  wrote:
> Jiri Pirko (6):
>   switchdev: rename "trans" to "trans_ph".
>   switchdev: introduce transaction infrastructure for attr_set and
> obj_add
>   rocker: switch to local transaction phase enum
>   switchdev: move transaction phase enum under transaction structure
>   rocker: use switchdev transaction queue for allocated memory
>   switchdev: split commit and prepare phase into two callbacks

Patches compile, but first test bombs.  Cut-and-paste of dump at end
of this email.

I'm not sure I'm liking this patchset because it looks like a way for
switchdev drivers to easily opt-out of the prepare-commit transaction
model by simply not implementing the *_pre op.  I would rather drivers
explicitly handle the PREPARE phase in code, even if that means
skipping it gracefully (in code) with a comment (in code) explaining
why it does not matter for this device/operation.  That's what DSA had
done, mostly because it was a retro-fit.

Also, the patchset removes the ABORT callback in case of a rollback
due to a failed PREPARE.  We can't make the assumption that it's just
a memory list to destroy on ABORT.  The driver, on PREPARE, may have
reserved device space or staged an operation on the device which we'll
need to undo on ABORT.

So we need ABORT back, and we need PREPARE to not be optional, so
what's left list enqueue/dequeue helpers, which I'm not seeing much
value in up-leveling as the driver can do list_add/del itself.

Am I missing something?  I didn't see a motivation statement for the
RFC so I'm not sure where you wanted to take this.

-scott



[1.998791] BUG: unable to handle kernel NULL pointer dereference
at 0008
[2.05] IP: []
rocker_port_kzalloc.isra.50+0x5/0x10 [rocker]
[2.05] PGD 0
[2.05] Oops:  [#1] SMP
[2.05] Modules linked in: floppy(+) ata_piix(+) libata
rocker(+) virtio_pci(+) virtio_ring virtio scsi_mod
[2.05] CPU: 0 PID: 91 Comm: modprobe Not tainted 4.2.0+ #3
[2.05] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[2.05] task: 88000f5e0800 ti: 88000f5f8000 task.ti:
88000f5f8000
[2.05] RIP: 0010:[]  []
rocker_port_kzalloc.isra.50+0x5/0x10 [rocker]
[2.05] RSP: 0018:88000f5fba20  EFLAGS: 00010246
[2.05] RAX: 88000f17c050 RBX: 88000b40 RCX: 0020
[2.05] RDX:  RSI:  RDI: 
[2.05] RBP:  R08:  R09: a0058680
[2.05] R10: 0001 R11: 9319 R12: 
[2.05] R13: 88000f5b7000 R14: 88000f5b7840 R15: 
[2.05] FS:  7fd132237700() GS:88000fc0()
knlGS:
[2.05] CS:  0010 DS:  ES:  CR0: 8005003b
[2.05] CR2: 0008 CR3: 0f5d5000 CR4: 06f0
[2.05] Stack:
[2.05]  a005a32a 0001 00ff88000f5b7000
8175725a
[2.05]  a0057750  a0058680
0020001a
[2.05]  88000b40  88000ea06000
88000f5b7000
[2.05] Call Trace:
[2.05]  [] ? rocker_cmd_exec+0x5a/0x1f0 [rocker]
[2.05]  [] ?
rocker_cmd_get_port_stats_prep+0x90/0x90 [rocker]
[2.05]  [] ?
rocker_cmd_get_port_settings_phys_name_proc+0xb0/0xb0 [rocker]
[2.05]  [] ? rocker_probe+0xb6a/0xe59 [rocker]
[2.05]  [] ? local_pci_probe+0x48/0xa0
[2.05]  [] ? pci_device_probe+0x112/0x130
[2.05]  [] ? driver_probe_device+0x196/0x2a0
[2.05]  [] ? driver_probe_device+0x2a0/0x2a0
[2.05]  [] ? __driver_attach+0x8d/0x90
[2.05]  [] ? driver_probe_device+0x2a0/0x2a0
[2.05]  [] ? bus_for_each_dev+0x4c/0x80
[2.05]  [] ? bus_add_driver+0x119/0x220
[2.05]  [] ? 0xa0065000
[2.05]  [] ? driver_register+0x5a/0xe0
[2.05]  [] ? 0xa0065000
[2.05]  [] ? rocker_module_init+0x33/0x1000 [rocker]
[2.05]  [] ? do_one_initcall+0x90/0x1e0
[2.05]  [] ? do_init_module+0x50/0x1d8
[2.05]  [] ? load_module+0x1c1c/0x2240
[2.05]  [] ? show_initstate+0x50/0x50
[2.05]  [] ? SyS_init_module+0x104/0x130
[2.05]  [] ? entry_SYSCALL_64_fastpath+0x16/0x75
[2.05] Code: 48 89 c1 48 c7 c2 10 dc 15 81 48 89 c6 e8 e4 0d
4a e1 48 8b 3b e8 1c 1b 4a e1 eb b6 66 2e 0f 1f 84 00 00 00 00 00 48
89 d1 89 f2 <8b> 77 08 e9 73 ff ff ff 0f 1f 00 48 83 ec 68 4c 89 7c 24
60 41
[2.05] RIP  []
rocker_port_kzalloc.isra.50+0x5/0x10 [rocker]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 0/6] switchdev: introduce tranction enfra and for pre-commit split

2015-09-21 Thread Scott Feldman

On Mon, Sep 21, 2015 at 1:09 AM, Jiri Pirko  wrote:
> Mon, Sep 21, 2015 at 09:23:24AM CEST, sfel...@gmail.com wrote:
>>On Sat, Sep 19, 2015 at 5:29 AM, Jiri Pirko  wrote:
>>> Jiri Pirko (6):
>>>   switchdev: rename "trans" to "trans_ph".
>>>   switchdev: introduce transaction infrastructure for attr_set and
>>> obj_add
>>>   rocker: switch to local transaction phase enum
>>>   switchdev: move transaction phase enum under transaction structure
>>>   rocker: use switchdev transaction queue for allocated memory
>>>   switchdev: split commit and prepare phase into two callbacks
>>
>>Patches compile, but first test bombs.  Cut-and-paste of dump at end
>>of this email.
>
> Told you :)
>
>
>>
>>I'm not sure I'm liking this patchset because it looks like a way for
>>switchdev drivers to easily opt-out of the prepare-commit transaction
>>model by simply not implementing the *_pre op.  I would rather drivers
>>explicitly handle the PREPARE phase in code, even if that means
>>skipping it gracefully (in code) with a comment (in code) explaining
>>why it does not matter for this device/operation.  That's what DSA had
>>done, mostly because it was a retro-fit.
>
> Each driver should handle this inside it. If it does not need prepare
> state, it simply does not implement it. That is the same for all cb,
> ndos, netdev notifiers, etc. It is much cleaner and nicer to have these as
> separate callbacks. Implementing multiple callback in one is just ugly,
> sorry.
>
>
>>
>>Also, the patchset removes the ABORT callback in case of a rollback
>>due to a failed PREPARE.  We can't make the assumption that it's just
>>a memory list to destroy on ABORT.  The driver, on PREPARE, may have
>>reserved device space or staged an operation on the device which we'll
>>need to undo on ABORT.
>
> Yep, just register an item with custom destructor, there you can do
> whatever. Also, I believe much nicer comparing to current code.
>
>
>>
>>So we need ABORT back, and we need PREPARE to not be optional, so
>>what's left list enqueue/dequeue helpers, which I'm not seeing much
>>value in up-leveling as the driver can do list_add/del itself.
>
> Why would every driver do it itself, over and over when there can be a
> clean infrastructure to do that. Including abort phase. Without the driver
> needed to be involved.
>
>
>>
>>Am I missing something?  I didn't see a motivation statement for the
>>RFC so I'm not sure where you wanted to take this.
>
> I want to make current code much nicer, easier to read and implement in
> other drivers. Look at rocker.c and how often there is == PREPARE there.
> It's nearly impossible to followthe code, sorry.

You're going to end up duplicate a lot of code and make it easier to
make mistakes.  The reason why the == PREPARE checks are there is the
same code is run for both PREPARE and COMMIT, with == PREPARE checks
right at the last moment to avoid a irreversible change, such as
deleting an entry from a table or changing a state variable or bumping
a hardware desc pointer.  Otherwise the code paths should be the same
for PREPARE and COMMIT.  If you split these, you're going to duplicate
lots of code and one path will deviate from the other over time and
now we've broken the prepare-commit model.  We want PREPARE to answer
the question "is it OK to do this?" and COMMIT to do without failure.

Do you have a specific example where you can't follow the code with == PREPARE?

> My next patchset is to un-mess rocker.c (that freaking ofdpa stuff is 
> everywhere)

Rocker implements OF-DPA so it's seems natural that freaking OF-DPA
stuff would be everywhere.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 7/7] switchdev: update documentation on FDB ageing_time

2015-09-20 Thread Scott Feldman

On Sun, Sep 20, 2015 at 7:24 AM, roopa <ro...@cumulusnetworks.com> wrote:
> On 9/19/15, 7:21 PM, Scott Feldman wrote:
>>
>> Yes, your switch driver is in user-space so you have to use NTF_USE to
>> refresh the entry since you cannot use the kernel driver model to
>> call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  Consequently, your
>> entries are not marked with NTF_EXT_LEARNED, so this patch is a no-op
>> for you.  You can continue to use the bridge driver to age out your
>> entries.
>
> yes, correct.  I was not really saying this because it will cause us any
> problems.
> I was trying to say this for switchdev in general.
>
>> I'd rather someone add that knob when it's actually needed. When the first
>> in-kernel switchdev driver that wants to use the bridge driver's ageing
>> function, then we can make that adjustment.
>
> I was suggesting the other way around. Keep the default to what is in the
> kernel today and the first in-kernel switchdev driver that wants to age,
> should introduce the ability to not age in the bridge driver (Rocker will
> continue to work as it does today). Because, I am only concerned that rocker
> may end up being the only device that uses the default behavior introduced
> by this patch. And every real hardware uses the bridge driver to age
> (because there are no in kernel examples today).  I am curious to know who
> else is using hardware ageing today.

A driver patch for a (real) hardware device which does the ageing in
hw is around the corner.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 7/7] switchdev: update documentation on FDB ageing_time

2015-09-19 Thread Scott Feldman

On Sat, Sep 19, 2015 at 6:21 PM, roopa <ro...@cumulusnetworks.com> wrote:
> On 9/18/15, 12:55 PM, sfel...@gmail.com wrote:
>>
>> From: Scott Feldman <sfel...@gmail.com>
>>
>> Signed-off-by: Scott Feldman <sfel...@gmail.com>
>> ---
>>   Documentation/networking/switchdev.txt |   24 
>>   1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/networking/switchdev.txt
>> b/Documentation/networking/switchdev.txt
>> index 476df04..67e43ee 100644
>> --- a/Documentation/networking/switchdev.txt
>> +++ b/Documentation/networking/switchdev.txt
>> @@ -239,20 +239,20 @@ The driver should initialize the attributes to the
>> hardware defaults.
>>   FDB Ageing
>>   ^^
>>   -There are two FDB ageing models supported: 1) ageing by the device, and
>> 2)
>> -ageing by the kernel.  Ageing by the device is preferred if many FDB
>> entries
>> -are supported.  The driver calls
>> call_switchdev_notifiers(SWITCHDEV_FDB_DEL,
>> -...) to age out the FDB entry.  In this model, ageing by the kernel
>> should be
>> -turned off.  XXX: how to turn off ageing in kernel on a per-port basis or
>> -otherwise prevent the kernel from ageing out the FDB entry?
>> -
>> -In the kernel ageing model, the standard bridge ageing mechanism is used
>> to age
>> -out stale FDB entries.  To keep an FDB entry "alive", the driver should
>> refresh
>> -the FDB entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD,
>> ...).  The
>> +The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and
>> it is
>> +the responsibility of the port driver/device to age out these entries.
>> If the
>> +port device supports ageing, when the FDB entry expires, it will notify
>> the
>> +driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL.  If
>> the
>> +device does not support ageing, the driver can simulate ageing using a
>> +garbage collection timer to monitor FBD entries.  Expired entries will be
>> +notified to the bridge using SWITCHDEV_FDB_DEL.  See rocker driver for
>> +example of driver running ageing timer.
>
> We do rely on the bridge driver ageing out entries. We have gone from
> hardware ageing to ageing in the switch driver to ultimately ageing in the
> bridge driver.  :). And we keep the fdb entries in the bridge driver "alive"
> by using 'NTF_USE' from the user-space driver.

Yes, your switch driver is in user-space so you have to use NTF_USE to
refresh the entry since you cannot use the kernel driver model to
call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  Consequently, your
entries are not marked with NTF_EXT_LEARNED, so this patch is a no-op
for you.  You can continue to use the bridge driver to age out your
entries.

>> +To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the
>> FDB
>> +entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
>>
> Even with your current patches, looks like the switch driver will need to
> refresh the entries anyways to keep the "last-used" time to now.
> In which case is there much value in switch driver doing the ageing ?.

"should" not "must".

Value is for the many learned FDB entries case, to move the ageing
function to hardware.

> I am thinking keeping the default behavior of the bridge driver to age and
> anything else configurable might be a better option.

I'd rather someone add that knob when it's actually needed.  When the
first in-kernel switchdev driver that wants to use the bridge driver's
ageing function, then we can make that adjustment.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 3/7] rocker: adding port ageing_time for ageing out FDB entries

2015-09-19 Thread Scott Feldman

On Fri, Sep 18, 2015 at 11:30 PM, Jiri Pirko <j...@resnulli.us> wrote:
> Fri, Sep 18, 2015 at 09:55:47PM CEST, sfel...@gmail.com wrote:
>>From: Scott Feldman <sfel...@gmail.com>
>>
>>Follow-up patcheset will allow user to change ageing_time, but for now
>>just hard-code it to a fixed value (the same value used as the default
>>for the bridge driver).
>>
>>Signed-off-by: Scott Feldman <sfel...@gmail.com>
>>---
>> drivers/net/ethernet/rocker/rocker.c |2 ++
>> 1 file changed, 2 insertions(+)
>>
>>diff --git a/drivers/net/ethernet/rocker/rocker.c 
>>b/drivers/net/ethernet/rocker/rocker.c
>>index f55ed2c..eba22f5 100644
>>--- a/drivers/net/ethernet/rocker/rocker.c
>>+++ b/drivers/net/ethernet/rocker/rocker.c
>>@@ -221,6 +221,7 @@ struct rocker_port {
>>   __be16 internal_vlan_id;
>>   int stp_state;
>>   u32 brport_flags;
>>+  unsigned long ageing_time;
>>   bool ctrls[ROCKER_CTRL_MAX];
>>   unsigned long vlan_bitmap[ROCKER_VLAN_BITMAP_LEN];
>>   struct napi_struct napi_tx;
>>@@ -4975,6 +4976,7 @@ static int rocker_probe_port(struct rocker *rocker, 
>>unsigned int port_number)
>>   rocker_port->port_number = port_number;
>>   rocker_port->pport = port_number + 1;
>>   rocker_port->brport_flags = BR_LEARNING | BR_LEARNING_SYNC;
>>+  rocker_port->ageing_time = 300 * HZ;
>
> How about to add also "BR_DEFAULT_AGEING_TIME" and use it here?

Yes, added for v2
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 4/7] bridge: define some min/max ageing time constants we'll use next

2015-09-19 Thread Scott Feldman

On Fri, Sep 18, 2015 at 11:45 PM, Jiri Pirko <j...@resnulli.us> wrote:
> Fri, Sep 18, 2015 at 09:55:48PM CEST, sfel...@gmail.com wrote:
>>From: Scott Feldman <sfel...@gmail.com>
>>
>>Signed-off-by: Scott Feldman <sfel...@gmail.com>
>>---
>> include/linux/if_bridge.h |4 
>> 1 file changed, 4 insertions(+)
>>
>>diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
>>index dad8b00..6cc6dbc 100644
>>--- a/include/linux/if_bridge.h
>>+++ b/include/linux/if_bridge.h
>>@@ -46,6 +46,10 @@ struct br_ip_list {
>> #define BR_LEARNING_SYNC  BIT(9)
>> #define BR_PROXYARP_WIFI  BIT(10)
>>
>>+/* values as per ieee8021QBridgeFdbAgingTime */
>>+#define BR_MIN_AGEING_TIME(10 * HZ)
>>+#define BR_MAX_AGEING_TIME(100 * HZ)
>
> I think that a bridge patch checking against these values should be
> introduced along with these values, in the same patchset

I need the MIN value for this patchset in rocker's ageing timer, so
it's introduced here.  MIN/MAX will be used again in follow-on patch
Prem is going to send to range check user input.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 0/6] switchdev: introduce tranction enfra and for pre-commit split

2015-09-19 Thread Scott Feldman

On Sat, Sep 19, 2015 at 5:29 AM, Jiri Pirko  wrote:
> Jiri Pirko (6):
>   switchdev: rename "trans" to "trans_ph".
>   switchdev: introduce transaction infrastructure for attr_set and
> obj_add
>   rocker: switch to local transaction phase enum
>   switchdev: move transaction phase enum under transaction structure
>   rocker: use switchdev transaction queue for allocated memory
>   switchdev: split commit and prepare phase into two callbacks

Whew, that's a lot of work!  Seems like a good idea to up-level this
for other drivers to share.  Let me apply the patches and run my tests
and get back to you.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next 1/3] net: switchdev: extract switchdev_obj_vlan

2015-09-09 Thread Scott Feldman

On Tue, Sep 8, 2015 at 1:47 PM, Vivien Didelot
 wrote:
> Move the switchdev_obj_vlan structure out of the switchdev_obj union.
>
> This lightens the switchdev_obj structure and allows drivers to access
> the object transaction and callback directly from a switchdev_obj_vlan.
> This is more consistent and convenient for add and dump operations.
>
> The patch updates bridge, the Rocker driver and DSA accordingly.
>
> Signed-off-by: Vivien Didelot 
> ---

[cut]

> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index 16c1c43..9923a97 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -397,7 +397,7 @@ int call_switchdev_notifiers(unsigned long val, struct 
> net_device *dev,
>  EXPORT_SYMBOL_GPL(call_switchdev_notifiers);
>
>  struct switchdev_vlan_dump {
> -   struct switchdev_obj obj;
> +   struct switchdev_obj_vlan vlan; /* must be first */
> struct sk_buff *skb;
> u32 filter_mask;
> u16 flags;
> @@ -439,9 +439,8 @@ static int switchdev_port_vlan_dump_put(struct net_device 
> *dev,
>  static int switchdev_port_vlan_dump_cb(struct net_device *dev,
>struct switchdev_obj *obj)
>  {
> -   struct switchdev_vlan_dump *dump =
> -   container_of(obj, struct switchdev_vlan_dump, obj);
> -   struct switchdev_obj_vlan *vlan = >obj.u.vlan;
> +   struct switchdev_vlan_dump *dump = (struct switchdev_vlan_dump *) obj;
> +   struct switchdev_obj_vlan *vlan = (struct switchdev_obj_vlan *) obj;

Same comment as Andrew's: use container_of() rather than casting for
both of the use-cases above.  (And don't require inner-structure to be
fist in outer structure).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next 0/3] net: switchdev: extract specific object structures

2015-09-09 Thread Scott Feldman

On Tue, Sep 8, 2015 at 1:47 PM, Vivien Didelot
 wrote:
> Hi!
>
> Current implementations of .switchdev_port_obj_add and 
> .switchdev_port_obj_dump
> must pass the generic switchdev_obj structure instead of a specific one (e.g.
> switchdev_obj_fdb) to the related driver accessors, because it contains the
> object transaction and callback. This is not very convenient for drivers.
>
> Instead of having all specific structures included into a switchdev_obj union,
> it would be simpler to embed a switchdev_obj structure as the first member of
> the more specific ones. That way, a driver can cast the switchdev_obj pointer
> to the specific structure and have access to all the information from it.
>
> As an example, this allows Rocker to change its two FDB accessors:
>
> rocker_port_fdb_add(rocker_port, obj->trans, >u.fdb);
> rocker_port_fdb_dump(rocker_port, obj);
>
> for these most consistent ones:
>
> fdb = (struct switchdev_obj_fdb *) obj;
> rocker_port_fdb_add(rocker_port, fdb);
> rocker_port_fdb_dump(rocker_port, fdb);
>
> This is what struct netdev_notifier_info and its specific supersets (e.g.
> struct netdev_notifier_changeupper_info) do in include/linux/netdevice.h.
>
> This patchset does that and updates bridge, Rocker and DSA accordingly.
>
> Note that this patchset was sent as an RFC not to bother David with new
> net-next stuffs, but if the changes look good, it is ready to merge.
>
> Also, please take note that the change sits on top of:
> http://patchwork.ozlabs.org/patch/514894/
>

HI Vivien, looks good. Some requests for the next version:

1) Move the cb() member out of struct switchdev_obj and into the
containing struct switchdev_obj_xxx structure.  This way the second
arg to cb() can be struct switchdev_obj_xxx *.  (And, I have pending
patches where the first arg to cb() is different, so making the dump
cb() func specific to obj type will help).

2) As already mentioned, use container_of where possible to avoid
locking struct member at a position within struct.

Should the same treatment be done for switchdev_attr?  Maybe that's a
second patchset.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net] switchdev: fix return value of switchdev_port_fdb_dump in case of error

2015-09-03 Thread Scott Feldman

On Thu, Sep 3, 2015 at 5:04 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> switchdev_port_fdb_dump is used as .ndo_fdb_dump. Its return value is
> idx, so we cannot return errval.
>
> Fixes: 45d4122ca7cd ("switchdev: add support for fdb add/del/dump via 
> switchdev_port_obj ops.")
> Signed-off-by: Jiri Pirko <j...@mellanox.com>
Acked-by: Scott Feldman<sfel...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing

2015-08-29 Thread Scott Feldman

On Sat, Aug 29, 2015 at 1:46 PM, David Miller da...@davemloft.net wrote:
 From: Peter Nørlund p...@ordbogen.com
 Date: Sat, 29 Aug 2015 22:31:15 +0200

 On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
 David Miller da...@davemloft.net wrote:

 From: p...@ordbogen.com
 Date: Fri, 28 Aug 2015 22:00:47 +0200

  When the routing cache was removed in 3.6, the IPv4 multipath
  algorithm changed from more or less being destination-based into
  being quasi-random per-packet scheduling. This increases the risk
  of out-of-order packets and makes it impossible to use multipath
  together with anycast services.

 Don't even try to be fancy.

 Simply kill the round-robin stuff off completely, and make hash based
 routing the one and only mode, no special configuration stuff
 necessary.

 I like the sound of that! Just to be clear - are you telling me to
 stick with L3 and skip the L4 part?

 For now it seems best to just do L3 and make ipv4 and ipv6 behave the
 same.

That makes mapping the hash to offload hardware easier also.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next 0/2] Add new switchdev device class

2015-08-27 Thread Scott Feldman

On Thu, Aug 27, 2015 at 2:06 AM, Andrew Lunn and...@lunn.ch wrote:
 On Thu, Aug 27, 2015 at 01:42:24AM -0700, Scott Feldman wrote:
 On Thu, Aug 27, 2015 at 12:45 AM, Andrew Lunn and...@lunn.ch wrote:
  I don't know about how this overlaps with DSA platform_class.  Florian?
 
  There is some overlap with DSA, but the current DSA model, with
  respect to probing, is broken. So this might be interesting as a way
  towards fix that.
 
  One thing to keep in mind is the D in DSA. You talk about switch,
  singular. DSA has a number of switches in a cluster. We currently
  export a single switchdev interface for the cluster, but there are
  some properties which are per switch, e.g. temperature, eeprom
  contents, statistics, power management etc.

 Export a single 'switchdev' or 'netdev' for the cluster?  I hope that
 was a typo.

 I probably expressed that badly. The hardware i have on my desk has
 three Marvell switches in a chain, with one end of the chain connected
 to a host Ethernet interface.

 From the switchdev ops level, you don't see anything of this
 chain. But some of the operations do need to be aware of this chain,
 for example vlans which span multiple chips in this chain.

 With switchdev device class, you'd instantiate one per
 phy switch, and have per-switch props (temp, eeprom, etc) thru each
 switchdev instance.

 O.K. This is fine, but we need people to understand that a switchdev
 device class represents some middle layer in the hierarchy, not the
 top layer. Otherwise false assumptions might be made.

So with kobj, a device can have a parent.  So I experimented with my
RFC patch and changed register_switchdev to take a parent switchdev
arg, which is NULL for leaf switchdevs:

int register_switchdev(struct switchdev *sdev, const char *name,
   struct switchdev *parent)
{
struct device *dev = sdev-dev;
int err;

device_initialize(dev);

dev-class = switchdev_class;
if (parent)
dev-parent = parent-dev;

err = dev_set_name(dev, %s, name);
if (err)
return err;

return device_add(dev);
}

Then I tried this with rocker and it works as expected.  On module
load, I create the master switchdev, and then on PCI probe for each
phys switch dev, I put the slave switchdev in the master using
register_switchdev.  Here's one slave in the master rockers switch:

tree /sys/class/switchdev/rockers
/sys/class/switchdev/rockers
├── 525400123501
│   ├── device - ../../rocker
│   ├── foo
│   ├── power
│   │   ├── async
│   │   ├── autosuspend_delay_ms
│   │   ├── control
│   │   ├── runtime_active_kids
│   │   ├── runtime_active_time
│   │   ├── runtime_enabled
│   │   ├── runtime_status
│   │   ├── runtime_suspended_time
│   │   └── runtime_usage
│   ├── subsystem - ../../../../../class/switchdev
│   └── uevent
├── foo
├── power
│   ├── async
│   ├── autosuspend_delay_ms
│   ├── control
│   ├── runtime_active_kids
│   ├── runtime_active_time
│   ├── runtime_enabled
│   ├── runtime_status
│   ├── runtime_suspended_time
│   └── runtime_usage
├── subsystem - ../../../../class/switchdev
└── uevent

With this, we can stack switchdevs, I guess as high as we want.  Does
this look usable for DSA?   An attr set on the master would get pushed
down to the leaves.  We'd can do it with the same style of recursive
algos we use for switchdev port attrs.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 3/6] net: add netif_is_ovs_master helper with IFF_OPENVSWITCH private flag

2015-08-27 Thread Scott Feldman

On Wed, Aug 26, 2015 at 10:43 PM, Jiri Pirko j...@resnulli.us wrote:
 Wed, Aug 26, 2015 at 07:43:18PM CEST, sfel...@gmail.com wrote:
On Wed, Aug 26, 2015 at 9:36 AM, Jiri Pirko j...@resnulli.us wrote:
 From: Jiri Pirko j...@mellanox.com

 Add this helper so code can easily figure out if netdev is openswitch.

 Signed-off-by: Jiri Pirko j...@mellanox.com
 ---
  include/linux/netdevice.h| 8 
  net/openvswitch/vport-internal_dev.c | 2 +-
  2 files changed, 9 insertions(+), 1 deletion(-)

 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index be625f4..0a884e6 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -1264,6 +1264,7 @@ struct net_device_ops {
   * @IFF_MACVLAN: Macvlan device
   * @IFF_VRF_MASTER: device is a VRF master
   * @IFF_NO_QUEUE: device can run without qdisc attached
 + * @IFF_VRF_OPENVSWITCH: device is a Open vSwitch master
   */
  enum netdev_priv_flags {
 IFF_802_1Q_VLAN = 10,
 @@ -1293,6 +1294,7 @@ enum netdev_priv_flags {
 IFF_IPVLAN_SLAVE= 124,
 IFF_VRF_MASTER  = 125,
 IFF_NO_QUEUE= 126,
 +   IFF_OPENVSWITCH = 127,
  };

  #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
 @@ -1322,6 +1324,7 @@ enum netdev_priv_flags {
  #define IFF_IPVLAN_SLAVE   IFF_IPVLAN_SLAVE
  #define IFF_VRF_MASTER IFF_VRF_MASTER
  #define IFF_NO_QUEUE   IFF_NO_QUEUE
 +#define IFF_OPENVSWITCHIFF_OPENVSWITCH

  /**
   * struct net_device - The DEVICE structure.
 @@ -3853,6 +3856,11 @@ static inline bool netif_is_bridge_master(const 
 struct net_device *dev)
 return dev-priv_flags  IFF_EBRIDGE;
  }

 +static inline bool netif_is_ovs_master(const struct net_device *dev)
 +{
 +   return dev-priv_flags  IFF_OPENVSWITCH;
 +}

We're going to run out of priv_flags bits.  This flag doesn't seem
like something that will be checked lots of places.  How about using
rtnl_link_ops-kind to save a bit in priv_flags?

static inline bool netif_is_ovs_master(const struct net_device *dev)
{
return !strcmp(dev-rtnl_link_ops-kind, openvswitch));
}

 There are lot of helpers like this for other soft-devices. I think that
 is okay to have it this way. The thing is that sometimes you need to use
 thi helper in fast path and in that case, you do not want to strcmp.

 There is plenty of priv_flags bits for now when I killed the bonding
 stuff.

Ya, but think about the bit: you (and others) used a bit in priv_flags
to indicate the netdev type.  Can you add an enum field to
rtnl_link_ops-type to indicate link type?  Then it's not a strcmp.
You can write your helper using strcmp first, and then later migrate
to using rtnl_link_ops-type.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next 0/2] Add new switchdev device class

2015-08-27 Thread Scott Feldman

On Thu, Aug 27, 2015 at 12:27 AM, Jiri Pirko j...@resnulli.us wrote:
 Thu, Aug 27, 2015 at 09:16:44AM CEST, sfel...@gmail.com wrote:
From: Scott Feldman sfel...@gmail.com

In the switchdev model, we use netdevs to represent switchdev ports, but we
have no representation for the switch itself.  So, introduce a new switchdev
device class so we can define semantics and programming interfaces for the
switch itself.  Switchdev device class isn't tied to any particular bus.

This patch set is just the skeleton to get us started.  It adds the sysfs
object registration for the new class and defines a class-level attr foo.
With the new class, we could hook PM functions, for example, to handle power
transitions at the switch level.  I registered rocker and get:

   $ ls /sys/class/switchdev/525400123501/
   foo  power  subsystem  uevent

 No, please avoid adding anything to sysfs. If we need to add anything,
 lets make is accesible using Netlink only.

I see no harm in using the device model to define and new device class
which just so happens to show up in sysfs.  What sysfs attrs get
exposed is where we can have some discussion/rules.

So what next?  I'd rather not build APIs around sysfs, so we need a netlink 
API
we can build on top of this.  It's not really rtnl.  Maybe genl would work?
What ever it is, we'd need to teach iproute2 about a new 'switch' command.

Netlink API would allow us to represent switch-wide objects such as registers,
tables, stats, firmware, and maybe even control.  I think with with netlink
TLVs, we can create a framework for these objects but still allow the switch
driver provide switch-specific info.  For example, a table object:

[TABLES]
   [TABLE]
   [FIELDS]
   [FIELD]
   [ID, TYPE]
   [DATA]
   [ID, VALUE]

 Alert! I feel that someone would like to abuse this iface for writing
 configuration through. This should be read-only by design. I also think
 that this should not be something switch-specific. I believe that NIC
 drivers would benefit from this iface as well when they want to expose
 something. I think we should use genl for this.

Read-only is fine.  Look, I'm just trying to dump rocker internal
tables in some format I can grep outside the kernel.  The tables are
get big and complicated fast and printk doesn't cut it.   I can use
degugfs privately, but I need to be able to dump same for field
troubleshooting.   I can't use debugfs, so I want some kind of
XML-like dump facility.  It's going to have device-specific data, so
I'm looking for an XML-like way to represent this data in netlink.


Maybe iproute2 has pretty-printers for specific switches like ethtool has for
reg dumps.

 I feel like a lot of what you described overlaps with existing
 interfaces and tools. Why don't we just reuse that? For firmware for
 example, just take one of the ports. Same for stats (I plan to expose my
 mlxsw switch-wide stats in ethtool so they are accessible through every
 port netdevice).

Port-centric stats are fine for port netdevs, but I'd like switch-wide
stats to show up elsewhere.

Thinking ahead: I'd like to put port into namespaces and I don't want
to dump stats on a port and see stats on ports in other namespaces.

 I still do not see the need for new device class. I have strong feeling
 that it should be avoided.

Ok
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next 0/2] Add new switchdev device class

2015-08-27 Thread Scott Feldman

On Thu, Aug 27, 2015 at 12:36 AM, John Fastabend
john.fastab...@gmail.com wrote:
 On 15-08-27 12:16 AM, sfel...@gmail.com wrote:
 From: Scott Feldman sfel...@gmail.com

 In the switchdev model, we use netdevs to represent switchdev ports, but we
 have no representation for the switch itself.  So, introduce a new switchdev
 device class so we can define semantics and programming interfaces for the
 switch itself.  Switchdev device class isn't tied to any particular bus.

 This patch set is just the skeleton to get us started.  It adds the sysfs
 object registration for the new class and defines a class-level attr foo.
 With the new class, we could hook PM functions, for example, to handle power
 transitions at the switch level.  I registered rocker and get:

$ ls /sys/class/switchdev/525400123501/
foo  power  subsystem  uevent

 So what next?  I'd rather not build APIs around sysfs, so we need a netlink 
 API
 we can build on top of this.  It's not really rtnl.  Maybe genl would work?
 What ever it is, we'd need to teach iproute2 about a new 'switch' command.

 Netlink API would allow us to represent switch-wide objects such as 
 registers,
 tables, stats, firmware, and maybe even control.  I think with with netlink
 TLVs, we can create a framework for these objects but still allow the switch
 driver provide switch-specific info.  For example, a table object:


 Hi Scott,

 I was going to take the flow-api presented in Feb and submitted
 to netdev and get it up to date. I started doing this today should be
 ready end of week.

 [TABLES]
   [TABLE]
   [FIELDS]
   [FIELD]
   [ID, TYPE]
   [DATA]
   [ID, VALUE]


 The structure I used previously which looks like your prototype I think,

  
 (https://github.com/jrfastab/rocker-net-next/blob/master/include/uapi/linux/if_flow.h)

  * [NFL_TABLE_IDENTIFIER_TYPE]
  * [NFL_TABLE_IDENTIFIER]
  * [NFL_TABLE_TABLES]
  * [NFL_TABLE]
  *   [NFL_TABLE_ATTR_NAME]
  *   [NFL_TABLE_ATTR_UID]
  *   [NFL_TABLE_ATTR_SOURCE]
  *   [NFL_TABLE_ATTR_APPLY]
  *   [NFL_TABLE_ATTR_SIZE]
  *   [NFL_TABLE_ATTR_MATCHES]
  * [NFL_FIELD_REF]
  *   [NFL_FIELD_REF_INSTANCE]
  *   [NFL_FIELD_REF_HEADER]
  *   [NFL_FIELD_REF_FIELD]
  *   [NFL_FIELD_REF_MASK]
  *   [NFL_FIELD_REF_TYPE]
  * [...]
  *   [NFL_TABLE_ATTR_ACTIONS]
  *   [NFL_ACTION_ATTR_UID]
  *   [...]
  * [NFL_TABLE]
  *   [...]
  *

 This is well-typed per Dave's comment. And because its Netlink based it
 can be easily extended as needed. The above gives basic information on
 the table. Sure it could stand to have some other entries in it but I
 never needed them for my capabilities/resource discovery. We could argue
 about removing some if they are too specific to my use cases.

I was looking for something more generic.  Not quite as raw as ethtool
reg dump, but not too rigid I have to bend the def of fields to make
it work.  Maybe what I want is impossible.

 Maybe iproute2 has pretty-printers for specific switches like ethtool has for
 reg dumps.

 I don't know about how this overlaps with DSA platform_class.  Florian?

 Comments?

 A few other interesting (at least to me) structures you can find in the
 if_flow.h header file give how the tables in the hardware are put
 together. I found this really useful when trying to sort out where my
 various ACLs/nexthop entries/etc were in the hardware order. This is
 important to know for many cases. This is NFL_TABLE_GRAPH.

 Also I found being able to know' what headers the hardware supports
 to be very useful. For example I want/need to know if QinQ will work.
 Or VXLAN/NSH, Geneve, etc. This is NFL_HEADER_GRAPH.

 Sure if_flow.h is a poor name and FlowAPI is probably not really great.
 But both those names come from how I model the hardware as match action
 tables. I can change those names to switchdev_resources or
 switchdev_info if folks want.

 My point is the interface is generic and provides a common interface for
 most hardware I've seen to date. Certainly curious if there is hardware
 it doesn't map to. Although tables and pipelines seems fairly universal
 for a large set of devices to me. Further it can be extended as needed.

 I'll drop the set_rule/del_rule parts to be merged with ebpf/tc/qdisc.

Ya, you're talking about tables in hardware.  I just want to dump some
driver-internal driver-specific data in tabular form to user-space so
I can grep thru it and debug.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 3/6] net: add netif_is_ovs_master helper with IFF_OPENVSWITCH private flag

2015-08-27 Thread Scott Feldman

On Wed, Aug 26, 2015 at 11:30 PM, Jiri Pirko j...@resnulli.us wrote:
 Thu, Aug 27, 2015 at 08:23:13AM CEST, sfel...@gmail.com wrote:
On Wed, Aug 26, 2015 at 10:43 PM, Jiri Pirko j...@resnulli.us wrote:
 Wed, Aug 26, 2015 at 07:43:18PM CEST, sfel...@gmail.com wrote:
On Wed, Aug 26, 2015 at 9:36 AM, Jiri Pirko j...@resnulli.us wrote:
 From: Jiri Pirko j...@mellanox.com

We're going to run out of priv_flags bits.  This flag doesn't seem
like something that will be checked lots of places.  How about using
rtnl_link_ops-kind to save a bit in priv_flags?

static inline bool netif_is_ovs_master(const struct net_device *dev)
{
return !strcmp(dev-rtnl_link_ops-kind, openvswitch));
}

 There are lot of helpers like this for other soft-devices. I think that
 is okay to have it this way. The thing is that sometimes you need to use
 thi helper in fast path and in that case, you do not want to strcmp.

 There is plenty of priv_flags bits for now when I killed the bonding
 stuff.

Ya, but think about the bit: you (and others) used a bit in priv_flags
to indicate the netdev type.  Can you add an enum field to
rtnl_link_ops-type to indicate link type?  Then it's not a strcmp.
You can write your helper using strcmp first, and then later migrate
to using rtnl_link_ops-type.


 Also, dev can be multiple things, it can be bridge port and vlan dev at
 the same time. Flags are good for this.

priv_flags bits are three types:

1) dev attribute (IFF_XMIT_DST_RELEASE, IFF_DISABLE_NETPOLL, etc)
2) dev type (IFF_802_1Q_VLAN, IFF_EBRIDGE, etc)
3) and dev membership (IFF_BRIDGE_PORT, IFF_TEAM_PORT, etc)

Are there types 2 or 3 in any fast paths?  Type 2 can move to enum;
they're mutually exclusive.  Type 3 is the dev's master's type 2, and
since dev can have only one master, no flag is needed: just look up
master type to know if dev is bridged or ovs'ed.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next 0/2] Add new switchdev device class

2015-08-27 Thread Scott Feldman

On Thu, Aug 27, 2015 at 12:45 AM, Andrew Lunn and...@lunn.ch wrote:
 I don't know about how this overlaps with DSA platform_class.  Florian?

 There is some overlap with DSA, but the current DSA model, with
 respect to probing, is broken. So this might be interesting as a way
 towards fix that.

 One thing to keep in mind is the D in DSA. You talk about switch,
 singular. DSA has a number of switches in a cluster. We currently
 export a single switchdev interface for the cluster, but there are
 some properties which are per switch, e.g. temperature, eeprom
 contents, statistics, power management etc.

Export a single 'switchdev' or 'netdev' for the cluster?  I hope that
was a typo.  With switchdev device class, you'd instantiate one per
phy switch, and have per-switch props (temp, eeprom, etc) thru each
switchdev instance.  The cluster netdev would stay a netdev which
spans the switches.


 Although ethtool does have options for these, it is not always a
 natural fit. ethtool --eeprom-dump on a switch port dumps the switch
 EEPROM, and all ports on the switch can be used. --register-dump on a
 port is good for showing the per ports registers, but there is no
 natural interface for showing the global switch registers. Florian's
 resent L2 interface patch shows we have issues getting access to the
 'cpu' port, the port which interfaces to the host.

 We need to be careful that any new interfaces we add are better at
 representing the true structure of the hardware, which includes there
 being multiple physical switches below a switchdev, and they are
 connected together by ports which are currently not visible as
 netdevs.

Right.  I'm thinking one switchdev instance (based on the new
switchdev class) per phy switch.  Port netdevs for switch ports worth
exposing (for user interaction).  And cluster netdev for dsa tagging
encap/decap.  Do I have this right?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 2/3] mlxsw: expose EMAD transactions statistics via debugfs

2015-08-26 Thread Scott Feldman

On Wed, Aug 26, 2015 at 10:49 AM, David Miller da...@davemloft.net wrote:
 From: Jiri Pirko j...@resnulli.us
 Date: Wed, 26 Aug 2015 09:37:57 +0200

 I don't think that are much more cases like this. Therefore I think that
 for this cases, debugfs might be a good way to expose debugging stats.

 Scott wanted to do similar things in rocker.  DSA guys too.

 Every switch device is going to have some kind of hierarchy like
 this, it's not a unique situation.

We've been able to get buy so far without a user-visible device for
the switch.  The switch ports are represented by netdevs, so that's
easy.  How can we create an object for the switch itself, so we can
attach common interfaces for the user to dump switch-level stats or
tables?   Using another netdev doesn't seem right.  Do we need a new
device class for switches, and then create some common tool/interfaces
for switch device class?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 0/6] rocker: make master change handling nicer

2015-08-26 Thread Scott Feldman

On Wed, Aug 26, 2015 at 9:36 AM, Jiri Pirko j...@resnulli.us wrote:
 From: Jiri Pirko j...@mellanox.com

 Jiri Pirko (6):
   net: introduce change upper device notifier change info
   net: add netif_is_bridge_master helper
   net: add netif_is_ovs_master helper with IFF_OPENVSWITCH private flag
   net: kill long time unused bonding private flags
   rocker: use new helper to figure out master kind
   rocker: use change upper info

Looks good Jiri, thanks for cleaning this up.  Only nit was about
conserving netdev-priv_flags.   Using rtnl_link_ops-kind to
determine netdev type scales better.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 6/6] rocker: use change upper info

2015-08-26 Thread Scott Feldman

On Wed, Aug 26, 2015 at 9:36 AM, Jiri Pirko j...@resnulli.us wrote:
 From: Jiri Pirko j...@mellanox.com

 Since now information about changed upper is passed along, benefit from
 that and use this info directly.

 This also fixes possible issues that could happen when non-master device
 is added (current code does not distinguish between master and non-master
 upper device).

 Signed-off-by: Jiri Pirko j...@mellanox.com

Acked-by: Scott Feldman sfel...@gmail.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 5/6] rocker: use new helper to figure out master kind

2015-08-26 Thread Scott Feldman

On Wed, Aug 26, 2015 at 9:36 AM, Jiri Pirko j...@resnulli.us wrote:
 From: Jiri Pirko j...@mellanox.com

 Looking at rtnl kind string is kind of ugly. So use new helpers to do
 this in nicer way.

 Signed-off-by: Jiri Pirko j...@mellanox.com

Acked-by: Scott Feldman sfel...@gmail.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next 3/6] net: add netif_is_ovs_master helper with IFF_OPENVSWITCH private flag

2015-08-26 Thread Scott Feldman

On Wed, Aug 26, 2015 at 9:36 AM, Jiri Pirko j...@resnulli.us wrote:
 From: Jiri Pirko j...@mellanox.com

 Add this helper so code can easily figure out if netdev is openswitch.

 Signed-off-by: Jiri Pirko j...@mellanox.com
 ---
  include/linux/netdevice.h| 8 
  net/openvswitch/vport-internal_dev.c | 2 +-
  2 files changed, 9 insertions(+), 1 deletion(-)

 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index be625f4..0a884e6 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -1264,6 +1264,7 @@ struct net_device_ops {
   * @IFF_MACVLAN: Macvlan device
   * @IFF_VRF_MASTER: device is a VRF master
   * @IFF_NO_QUEUE: device can run without qdisc attached
 + * @IFF_VRF_OPENVSWITCH: device is a Open vSwitch master
   */
  enum netdev_priv_flags {
 IFF_802_1Q_VLAN = 10,
 @@ -1293,6 +1294,7 @@ enum netdev_priv_flags {
 IFF_IPVLAN_SLAVE= 124,
 IFF_VRF_MASTER  = 125,
 IFF_NO_QUEUE= 126,
 +   IFF_OPENVSWITCH = 127,
  };

  #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
 @@ -1322,6 +1324,7 @@ enum netdev_priv_flags {
  #define IFF_IPVLAN_SLAVE   IFF_IPVLAN_SLAVE
  #define IFF_VRF_MASTER IFF_VRF_MASTER
  #define IFF_NO_QUEUE   IFF_NO_QUEUE
 +#define IFF_OPENVSWITCHIFF_OPENVSWITCH

  /**
   * struct net_device - The DEVICE structure.
 @@ -3853,6 +3856,11 @@ static inline bool netif_is_bridge_master(const struct 
 net_device *dev)
 return dev-priv_flags  IFF_EBRIDGE;
  }

 +static inline bool netif_is_ovs_master(const struct net_device *dev)
 +{
 +   return dev-priv_flags  IFF_OPENVSWITCH;
 +}

We're going to run out of priv_flags bits.  This flag doesn't seem
like something that will be checked lots of places.  How about using
rtnl_link_ops-kind to save a bit in priv_flags?

static inline bool netif_is_ovs_master(const struct net_device *dev)
{
return !strcmp(dev-rtnl_link_ops-kind, openvswitch));
}

 +
  static inline bool netif_index_is_vrf(struct net *net, int ifindex)
  {
 bool rc = false;
 diff --git a/net/openvswitch/vport-internal_dev.c 
 b/net/openvswitch/vport-internal_dev.c
 index c058bbf..80b3e12 100644
 --- a/net/openvswitch/vport-internal_dev.c
 +++ b/net/openvswitch/vport-internal_dev.c
 @@ -135,7 +135,7 @@ static void do_setup(struct net_device *netdev)
 netdev-netdev_ops = internal_dev_netdev_ops;

 netdev-priv_flags = ~IFF_TX_SKB_SHARING;
 -   netdev-priv_flags |= IFF_LIVE_ADDR_CHANGE;
 +   netdev-priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_OPENVSWITCH;
 netdev-destructor = internal_dev_destructor;
 netdev-ethtool_ops = internal_dev_ethtool_ops;
 netdev-rtnl_link_ops = internal_dev_link_ops;
 --
 1.9.3

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 >

1 - 100 of 230 matches

Mail list logo