date:20180626

Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-26 Thread Jakub Kicinski

On Mon, Jun 25, 2018 at 11:43 PM, Jiri Pirko  wrote:
> Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicin...@netronome.com wrote:
>>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> For the TC clsact offload these days, some of HW drivers need
>>> to hold a magic ball. The reason is, with the first inserted rule inside
>>> HW they need to guess what fields will be used for the matching. If
>>> later on this guess proves to be wrong and user adds a filter with a
>>> different field to match, there's a problem. Mlxsw resolves it now with
>>> couple of patterns. Those try to cover as many match fields as possible.
>>> This approach is far from optimal, both performance-wise and scale-wise.
>>> Also, there is a combination of filters that in certain order won't
>>> succeed.
>>>
>>> Most of the time, when user inserts filters in chain, he knows right away
>>> how the filters are going to look like - what type and option will they
>>> have. For example, he knows that he will only insert filters of type
>>> flower matching destination IP address. He can specify a template that
>>> would cover all the filters in the chain.
>>
>>Perhaps it's lack of sleep, but this paragraph threw me a little off
>>the track.  IIUC the goal of this set is to provide a way to inform the
>>HW about expected matches before any rule is programmed into the HW.
>>Not before any rule is added to a particular chain.  One can just use
>>the first rule in the chain to make a guess about the chain, but thanks
>>to this set user can configure *all* chains before any rules are added.
>
> The template is per-chain. User can use template for chain x and
> not-use it for chain y. Up to him.

Makes sense.

I can't help but wonder if it'd be better to associate the
constraints/rules with chains instead of creating a new "template"
object.  It seems more natural to create a chain with specific
constraints in place than add and delete template of which there can
be at most one to a chain...  Perhaps that's more about the user space
tc command line.  Anyway, not a strong objection, just a thought.

>>And that's needed because once any rule is added the tcam config can no
>>longer be easily modified?
>
> Yes.

Re: [PATCH bpf-next 1/7] nfp: bpf: allow source ptr type be map ptr in memcpy optimization

2018-06-26 Thread Jakub Kicinski

On Mon, Jun 25, 2018 at 10:50 PM, Song Liu  wrote:
> On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
>  wrote:
>> From: Jiong Wang 
>>
>> Map read has been supported on NFP, this patch enables optimization for
>> memcpy from map to packet.
>>
>> This patch also fixed one latent bug which will cause copying from
>> unexpected address once memcpy for map pointer enabled.
>>
>> Reported-by: Mary Pham 
>> Reported-by: David Beckett 
>> Signed-off-by: Jiong Wang 
>> Reviewed-by: Jakub Kicinski 
>> ---
>>  drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>> index 8a92088df0d7..33111739b210 100644
>> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>> @@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>> xfer_num = round_up(len, 4) / 4;
>>
>> if (src_40bit_addr)
>> -   addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
>> +   addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
>>   &off);
>
> Did this break other cases before this patch?
>
> I am sorry if this is a dumb question. I don't think I fully
> understand addr40_offset().

Only map memory uses 40 bit addressing right now, so the if was pretty
much dead code before the patch.

The memcpy optimization was left out of the initial map support due to
insufficient test coverage, I should have probably left more of the 40
bit addressing code out back then.

Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-26 Thread Jiri Pirko

Tue, Jun 26, 2018 at 09:00:45AM CEST, jakub.kicin...@netronome.com wrote:
>On Mon, Jun 25, 2018 at 11:43 PM, Jiri Pirko  wrote:
>> Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicin...@netronome.com wrote:
>>>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
>>>> From: Jiri Pirko
>>>>
>>>> For the TC clsact offload these days, some of HW drivers need
>>>> to hold a magic ball. The reason is, with the first inserted rule inside
>>>> HW they need to guess what fields will be used for the matching. If
>>>> later on this guess proves to be wrong and user adds a filter with a
>>>> different field to match, there's a problem. Mlxsw resolves it now with
>>>> couple of patterns. Those try to cover as many match fields as possible.
>>>> This approach is far from optimal, both performance-wise and scale-wise.
>>>> Also, there is a combination of filters that in certain order won't
>>>> succeed.
>>>>
>>>> Most of the time, when user inserts filters in chain, he knows right away
>>>> how the filters are going to look like - what type and option will they
>>>> have. For example, he knows that he will only insert filters of type
>>>> flower matching destination IP address. He can specify a template that
>>>> would cover all the filters in the chain.
>>>
>>>Perhaps it's lack of sleep, but this paragraph threw me a little off
>>>the track.  IIUC the goal of this set is to provide a way to inform the
>>>HW about expected matches before any rule is programmed into the HW.
>>>Not before any rule is added to a particular chain.  One can just use
>>>the first rule in the chain to make a guess about the chain, but thanks
>>>to this set user can configure *all* chains before any rules are added.
>>
>> The template is per-chain. User can use template for chain x and
>> not-use it for chain y. Up to him.
>
>Makes sense.
>
>I can't help but wonder if it'd be better to associate the
>constraints/rules with chains instead of creating a new "template"
>object.  It seems more natural to create a chain with specific
>constraints in place than add and delete template of which there can
>be at most one to a chain...  Perhaps that's more about the user space
>tc command line.  Anyway, not a strong objection, just a thought.

Hmm. I don't think it is a good idea. User should see the template in a
"show" command per chain. We would have to have 2 show commands, one to
list the template objects and one to list templates per chain. It makes
things more complicated for no good reason. I think that this simple
chain-lock is easier and serves the purpose.

>
>>>And that's needed because once any rule is added the tcam config can no
>>>longer be easily modified?
>>
>> Yes.

[patch net-next v2 1/9] net: sched: push ops lookup bits into tcf_proto_lookup_ops()

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

Push all bits that take care of ops lookup, including module loading,
out of the tcf_proto_create() function and into tcf_proto_lookup_ops().

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_api.c | 53 +++--
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index cdc3c87c53e6..db45931bbada 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -39,7 +39,7 @@ static DEFINE_RWLOCK(cls_mod_lock);
 
 /* Find classifier type by string name */
 
-static const struct tcf_proto_ops *tcf_proto_lookup_ops(const char *kind)
+static const struct tcf_proto_ops *__tcf_proto_lookup_ops(const char *kind)
 {
const struct tcf_proto_ops *t, *res = NULL;
 
@@ -57,6 +57,33 @@ static const struct tcf_proto_ops *tcf_proto_lookup_ops(const char *kind)
return res;
 }
 
+static const struct tcf_proto_ops *
+tcf_proto_lookup_ops(const char *kind, struct netlink_ext_ack *extack)
+{
+   const struct tcf_proto_ops *ops;
+
+   ops = __tcf_proto_lookup_ops(kind);
+   if (ops)
+   return ops;
+#ifdef CONFIG_MODULES
+   rtnl_unlock();
+   request_module("cls_%s", kind);
+   rtnl_lock();
+   ops = __tcf_proto_lookup_ops(kind);
+   /* We dropped the RTNL semaphore in order to perform
+* the module load. So, even if we succeeded in loading
+* the module we have to replay the request. We indicate
+* this using -EAGAIN.
+*/
+   if (ops) {
+   module_put(ops->owner);
+   return ERR_PTR(-EAGAIN);
+   }
+#endif
+   NL_SET_ERR_MSG(extack, "TC classifier not found");
+   return ERR_PTR(-ENOENT);
+}
+
 /* Register(unregister) new classifier type */
 
 int register_tcf_proto_ops(struct tcf_proto_ops *ops)
@@ -133,27 +160,9 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
if (!tp)
return ERR_PTR(-ENOBUFS);
 
-   err = -ENOENT;
-   tp->ops = tcf_proto_lookup_ops(kind);
-   if (!tp->ops) {
-#ifdef CONFIG_MODULES
-   rtnl_unlock();
-   request_module("cls_%s", kind);
-   rtnl_lock();
-   tp->ops = tcf_proto_lookup_ops(kind);
-   /* We dropped the RTNL semaphore in order to perform
-* the module load. So, even if we succeeded in loading
-* the module we have to replay the request. We indicate
-* this using -EAGAIN.
-*/
-   if (tp->ops) {
-   module_put(tp->ops->owner);
-   err = -EAGAIN;
-   } else {
-   NL_SET_ERR_MSG(extack, "TC classifier not found");
-   err = -ENOENT;
-   }
-#endif
+   tp->ops = tcf_proto_lookup_ops(kind, extack);
+   if (IS_ERR(tp->ops)) {
+   err = PTR_ERR(tp->ops);
goto errout;
}
tp->classify = tp->ops->classify;
-- 
2.14.4
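
The lookup-then-replay flow that this patch moves into tcf_proto_lookup_ops() can be sketched in userspace C. This is only an illustration under stated assumptions: err_ptr(), is_err() and ptr_err() are stand-ins for the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(), and demo_find(), demo_load() and the one-entry "registry" are hypothetical names that do not exist in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define DEMO_MAX_ERRNO 4095

static inline void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static inline long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct demo_ops {
	const char *kind;
};

/* Hypothetical registry: "flower" becomes visible only after a load. */
static const struct demo_ops demo_flower = { "flower" };
static int demo_flower_loaded;

/* Plays the role of the plain list walk (__tcf_proto_lookup_ops()). */
static const struct demo_ops *demo_find(const char *kind)
{
	assert(kind);
	if (demo_flower_loaded && strcmp(kind, demo_flower.kind) == 0)
		return &demo_flower;
	return NULL;
}

/* Stand-in for request_module(): only "flower" can be loaded. */
static void demo_load(const char *kind)
{
	if (strcmp(kind, "flower") == 0)
		demo_flower_loaded = 1;
}

/* Same shape as the new helper: found, replay (-EAGAIN) or -ENOENT. */
static const struct demo_ops *demo_lookup_ops(const char *kind)
{
	const struct demo_ops *ops = demo_find(kind);

	if (ops)
		return ops;
	demo_load(kind);
	/* A lock was dropped for the load in the real code, so even a
	 * successful load only asks the caller to replay the request.
	 */
	if (demo_find(kind))
		return err_ptr(-EAGAIN);
	return err_ptr(-ENOENT);
}
```

A caller then needs a single is_err() check, which mirrors how tcf_proto_create() shrinks to IS_ERR()/PTR_ERR() in the diff above: the first lookup of "flower" reports -EAGAIN, and the replayed request gets the ops pointer.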

[patch net-next v2 4/9] net: sched: cls_flower: change fl_init_dissector to accept mask and dissector

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

This function is going to be used for templates as well, so we need to
pass the dissector and mask pointers separately.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 39 ---
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 76c5516357d5..9ce4375b3252 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -793,47 +793,48 @@ static int fl_init_mask_hashtable(struct fl_flow_mask *mask)
FL_KEY_SET(keys, cnt, id, member);  \
} while(0);
 
-static void fl_init_dissector(struct fl_flow_mask *mask)
+static void fl_init_dissector(struct flow_dissector *dissector,
+ struct fl_flow_key *mask)
 {
struct flow_dissector_key keys[FLOW_DISSECTOR_KEY_MAX];
size_t cnt = 0;
 
FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_CONTROL, control);
FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_BASIC, basic);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ETH_ADDRS, eth);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_PORTS, tp);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_IP, ip);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_TCP, tcp);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ICMP, icmp);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ARP, arp);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_MPLS, mpls);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_VLAN, vlan);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, enc_ipv4);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, enc_ipv6);
-   if (FL_KEY_IS_MASKED(&mask->key, enc_ipv4) ||
-   FL_KEY_IS_MASKED(&mask->key, enc_ipv6))
+   if (FL_KEY_IS_MASKED(mask, enc_ipv4) ||
+   FL_KEY_IS_MASKED(mask, enc_ipv6))
FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_ENC_CONTROL,
   enc_control);
-   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_PORTS, enc_tp);
 
-   skb_flow_dissector_init(&mask->dissector, keys, cnt);
+   skb_flow_dissector_init(dissector, keys, cnt);
 }
 
 static struct fl_flow_mask *fl_create_new_mask(struct cls_fl_head *head,
@@ -852,7 +853,7 @@ static struct fl_flow_mask *fl_create_new_mask(struct cls_fl_head *head,
if (err)
goto errout_free;
 
-   fl_init_dissector(newmask);
+   fl_init_dissector(&newmask->dissector, &newmask->key);
 
INIT_LIST_HEAD_RCU(&newmask->filters);
 
-- 
2.14.4

[patch net-next v2 9/9] selftests: forwarding: add tests for TC chain templates

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

Add basic sanity tests for TC chain templates.

Signed-off-by: Jiri Pirko 
---
 tools/testing/selftests/net/forwarding/lib.sh  |   9 ++
 .../selftests/net/forwarding/tc_chaintemplates.sh  | 160 +
 2 files changed, 169 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_chaintemplates.sh

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index a736d1d7ecdb..128a5b5a8ea9 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -39,6 +39,15 @@ check_tc_shblock_support()
fi
 }
 
+check_tc_chaintemplate_support()
+{
+   tc filter help 2>&1|grep template &> /dev/null
+   if [[ $? -ne 0 ]]; then
+   echo "SKIP: iproute2 too old; tc is missing chain template support"
+   exit 1
+   fi
+}
+
 if [[ "$(id -u)" -ne 0 ]]; then
echo "SKIP: need root privileges"
exit 0
diff --git a/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh b/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh
new file mode 100755
index ..21f2c18e973a
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh
@@ -0,0 +1,160 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+ALL_TESTS="template_create_destroy template_filter_fits \
+  template_create_nonempty template_destroy_nonempty"
+NUM_NETIFS=2
+source tc_common.sh
+source lib.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/24
+}
+
+h1_destroy()
+{
+   simple_if_fini $h1 192.0.2.1/24
+}
+
+h2_create()
+{
+   simple_if_init $h2 192.0.2.2/24
+   tc qdisc add dev $h2 clsact
+}
+
+h2_destroy()
+{
+   tc qdisc del dev $h2 clsact
+   simple_if_fini $h2 192.0.2.2/24
+}
+
+template_create_destroy()
+{
+   RET=0
+
+   tc filter template add dev $h2 ingress protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF
+   check_err $? "Failed to create template for default chain"
+
+   tc filter template add dev $h2 ingress chain 1 protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF
+   check_err $? "Failed to create template for chain 1"
+
+   tc filter template del dev $h2 ingress
+   check_err $? "Failed to destroy template for default chain"
+
+   tc filter template del dev $h2 ingress chain 1
+   check_err $? "Failed to destroy template for chain 1"
+
+   log_test "template create destroy"
+}
+
+template_filter_fits()
+{
+   RET=0
+
+   tc filter template add dev $h2 ingress protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+   tc filter template add dev $h2 ingress chain 1 protocol ip \
+   flower src_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+
+   tc filter add dev $h2 ingress protocol ip pref 1 handle 1101 \
+   flower dst_mac $h2mac action drop
+   check_err $? "Failed to insert filter which fits template"
+
+   tc filter add dev $h2 ingress protocol ip pref 1 handle 1102 \
+   flower src_mac $h2mac action drop &> /dev/null
+   check_fail $? "Incorrectly succeeded to insert filter which does not fit template"
+
+   tc filter add dev $h2 ingress chain 1 protocol ip pref 1 handle 1101 \
+   flower src_mac $h2mac action drop
+   check_err $? "Failed to insert filter which fits template"
+
+   tc filter add dev $h2 ingress chain 1 protocol ip pref 1 handle 1102 \
+   flower dst_mac $h2mac action drop &> /dev/null
+   check_fail $? "Incorrectly succeeded to insert filter which does not fit template"
+
+   tc filter del dev $h2 ingress chain 1 protocol ip pref 1 handle 1102 \
+   flower &> /dev/null
+   tc filter del dev $h2 ingress chain 1 protocol ip pref 1 handle 1101 \
+   flower &> /dev/null
+
+   tc filter del dev $h2 ingress protocol ip pref 1 handle 1102 \
+   flower &> /dev/null
+   tc filter del dev $h2 ingress protocol ip pref 1 handle 1101 \
+   flower &> /dev/null
+
+   tc filter template del dev $h2 ingress chain 1
+   tc filter template del dev $h2 ingress
+
+   log_test "template filter fits"
+}
+
+template_create_nonempty()
+{
+   RET=0
+
+   tc filter add dev $h2 ingress protocol ip pref 1 handle 1101 \
+   flower dst_mac $h2mac action drop
+   tc filter template add dev $h2 ingress protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+   check_fail $? "Incorrectly succeeded to create template for non-empty chain"
+
+   tc filter template del dev $h2 ingress &> /dev/null
+   tc filter del dev $h2 ingress protocol ip pref 1 handle 1101 flower
+
+   log_test "template create non-empty"
+}
+
+template_destroy_nonempty()
+{
+   RET=0
+
+   tc filter template ad

[patch net-next v2 8/9] selftests: forwarding: move shblock tc support check to a separate helper

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

The shared block support is only needed for tc_shblocks.sh. No need to
require that for other tests.

Signed-off-by: Jiri Pirko 
---
 tools/testing/selftests/net/forwarding/lib.sh | 3 +++
 tools/testing/selftests/net/forwarding/tc_shblocks.sh | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7b18a53aa556..a736d1d7ecdb 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -28,7 +28,10 @@ check_tc_version()
echo "SKIP: iproute2 too old; tc is missing JSON support"
exit 1
fi
+}
 
+check_tc_shblock_support()
+{
tc filter help 2>&1 | grep block &> /dev/null
if [[ $? -ne 0 ]]; then
echo "SKIP: iproute2 too old; tc is missing shared block support"
diff --git a/tools/testing/selftests/net/forwarding/tc_shblocks.sh b/tools/testing/selftests/net/forwarding/tc_shblocks.sh
index b5b917203815..9826a446e2c0 100755
--- a/tools/testing/selftests/net/forwarding/tc_shblocks.sh
+++ b/tools/testing/selftests/net/forwarding/tc_shblocks.sh
@@ -105,6 +105,8 @@ cleanup()
ip link set $swp2 address $swp2origmac
 }
 
+check_tc_shblock_support
+
 trap cleanup EXIT
 
 setup_prepare
-- 
2.14.4

[patch net-next v2 7/9] mlxsw: spectrum: Implement chain template hinting

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

Since cls_flower provides information about the filter template for
specific chain, use this information in order to prepare a region.
Use the template to find out what elements are going to be used
and pass that down to mlxsw_sp_acl_tcam_group_add(). Later on, when the
first filter is inserted, the mlxsw_sp_acl_tcam_group_use_patterns()
function would use this element usage information instead of looking
up a pattern.
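
The selection described above can be sketched as follows. This is a simplified illustration, not the driver's code: element usage is modelled as a plain 32-bit mask, the DEMO_EL_* bits and demo_region_elements() are hypothetical names, and falling back to the rule's own usage when no pattern covers it is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element bits standing in for the key elements a region
 * can be built from.
 */
enum {
	DEMO_EL_DMAC     = 1u << 0,
	DEMO_EL_SMAC     = 1u << 1,
	DEMO_EL_DST_IP   = 1u << 2,
	DEMO_EL_SRC_IP   = 1u << 3,
	DEMO_EL_L4_PORTS = 1u << 4,
};

/* True when every element in "small" is also present in "big". */
static int demo_subset(uint32_t small, uint32_t big)
{
	return (small & ~big) == 0;
}

/* Decide which elements the region is prepared for when the first rule
 * arrives. With a template there is nothing to guess; without one the
 * first covering pattern wins.
 */
static uint32_t demo_region_elements(const uint32_t *tmplt_elusage,
				     uint32_t rule_elusage,
				     const uint32_t *patterns,
				     size_t npatterns)
{
	size_t i;

	assert(patterns || npatterns == 0);
	if (tmplt_elusage)
		return *tmplt_elusage;
	for (i = 0; i < npatterns; i++) {
		if (demo_subset(rule_elusage, patterns[i]))
			return patterns[i];
	}
	/* Assumption of this sketch: no pattern fits, use the rule as is. */
	return rule_elusage;
}
```

With a template that names only the destination IP element, the region is sized for that one element instead of for a broad pattern that also carries MAC and source IP elements.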

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  5 +++
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 12 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 12 --
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c| 25 ++--
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 44 --
 5 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 968b88af2ef5..da19fa343d0b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1441,6 +1441,11 @@ mlxsw_sp_setup_tc_cls_flower(struct mlxsw_sp_acl_block *acl_block,
return 0;
case TC_CLSFLOWER_STATS:
return mlxsw_sp_flower_stats(mlxsw_sp, acl_block, f);
+   case TC_CLSFLOWER_TMPLT_CREATE:
+   return mlxsw_sp_flower_tmplt_create(mlxsw_sp, acl_block, f);
+   case TC_CLSFLOWER_TMPLT_DESTROY:
+   mlxsw_sp_flower_tmplt_destroy(mlxsw_sp, acl_block, f);
+   return 0;
default:
return -EOPNOTSUPP;
}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 4a519d8edec8..b0a8e611e730 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -459,7 +459,8 @@ enum mlxsw_sp_acl_profile {
 struct mlxsw_sp_acl_profile_ops {
size_t ruleset_priv_size;
int (*ruleset_add)(struct mlxsw_sp *mlxsw_sp,
-  void *priv, void *ruleset_priv);
+  void *priv, void *ruleset_priv,
+  struct mlxsw_afk_element_usage *tmplt_elusage);
void (*ruleset_del)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv);
int (*ruleset_bind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv,
struct mlxsw_sp_port *mlxsw_sp_port,
@@ -514,7 +515,8 @@ mlxsw_sp_acl_ruleset_lookup(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp_acl_ruleset *
 mlxsw_sp_acl_ruleset_get(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp_acl_block *block, u32 chain_index,
-enum mlxsw_sp_acl_profile profile);
+enum mlxsw_sp_acl_profile profile,
+struct mlxsw_afk_element_usage *tmplt_elusage);
 void mlxsw_sp_acl_ruleset_put(struct mlxsw_sp *mlxsw_sp,
  struct mlxsw_sp_acl_ruleset *ruleset);
 u16 mlxsw_sp_acl_ruleset_group_id(struct mlxsw_sp_acl_ruleset *ruleset);
@@ -594,6 +596,12 @@ void mlxsw_sp_flower_destroy(struct mlxsw_sp *mlxsw_sp,
 int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp,
  struct mlxsw_sp_acl_block *block,
  struct tc_cls_flower_offload *f);
+int mlxsw_sp_flower_tmplt_create(struct mlxsw_sp *mlxsw_sp,
+struct mlxsw_sp_acl_block *block,
+struct tc_cls_flower_offload *f);
+void mlxsw_sp_flower_tmplt_destroy(struct mlxsw_sp *mlxsw_sp,
+  struct mlxsw_sp_acl_block *block,
+  struct tc_cls_flower_offload *f);
 
 /* spectrum_qdisc.c */
 int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
index 79b1fa27a9a4..ea42605c451d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
@@ -319,7 +319,8 @@ int mlxsw_sp_acl_block_unbind(struct mlxsw_sp *mlxsw_sp,
 static struct mlxsw_sp_acl_ruleset *
 mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp,
struct mlxsw_sp_acl_block *block, u32 chain_index,
-   const struct mlxsw_sp_acl_profile_ops *ops)
+   const struct mlxsw_sp_acl_profile_ops *ops,
+   struct mlxsw_afk_element_usage *tmplt_elusage)
 {
struct mlxsw_sp_acl *acl = mlxsw_sp->acl;
struct mlxsw_sp_acl_ruleset *ruleset;
@@ -339,7 +340,8 @@ mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err_rhashtable_init;
 
-   err = ops->ruleset_add(mlxsw_sp, acl->priv, ruleset->priv);
+   err = ops->ruleset_add(mlxsw_sp, acl->priv, ruleset->priv,
+

[patch net-next v2 2/9] net: sched: introduce chain templates

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

Introduce a group of new tc-rtnl commands to allow the user to set a
per-chain template. Templates lock down individual chains for particular
classifier type/options combinations. The classifier needs to support
templates, otherwise the kernel replies with an error.

For example, to lock chain 22 to allow only filters of type
flower with destination mac address, user needs to do:
  chain 22 flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF

If the chain already contains some filters, it is not possible to
add or remove a template. That is permitted only for empty chains.

Alongside the add/del commands, also introduce get/dump and
notifications.
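
The empty-chain rule can be sketched as a small state check. This is a minimal illustration, not the code from this patch: struct demo_chain and both helpers are hypothetical, and the specific error codes (-EBUSY, -EEXIST, -ENOENT) are assumptions chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced view of a chain: a filter count and a template. */
struct demo_chain {
	unsigned int nfilters;	/* filters currently in the chain */
	const void *tmplt;	/* NULL when no template is set */
};

/* Setting a template is allowed only while the chain holds no filters. */
static int demo_tmplt_add(struct demo_chain *chain, const void *tmplt)
{
	assert(chain && tmplt);
	if (chain->nfilters)
		return -EBUSY;		/* assumed error code */
	if (chain->tmplt)
		return -EEXIST;		/* at most one template per chain */
	chain->tmplt = tmplt;
	return 0;
}

/* Removing the template is likewise refused while filters exist. */
static int demo_tmplt_del(struct demo_chain *chain)
{
	assert(chain);
	if (chain->nfilters)
		return -EBUSY;		/* assumed error code */
	if (!chain->tmplt)
		return -ENOENT;		/* assumed error code */
	chain->tmplt = NULL;
	return 0;
}
```

The point of the rule is that the template stays fixed for as long as any filter relies on it, which is what lets a driver size its hardware resources once.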

Signed-off-by: Jiri Pirko 
---
 include/net/sch_generic.h  |  14 +-
 include/uapi/linux/rtnetlink.h |   7 +
 net/sched/cls_api.c| 371 -
 net/sched/cls_basic.c  |   2 +-
 net/sched/cls_bpf.c|   3 +-
 net/sched/cls_cgroup.c |   2 +-
 net/sched/cls_flow.c   |   3 +-
 net/sched/cls_flower.c |   3 +-
 net/sched/cls_fw.c |   3 +-
 net/sched/cls_matchall.c   |   3 +-
 net/sched/cls_route.c  |   2 +-
 net/sched/cls_rsvp.h   |   3 +-
 net/sched/cls_tcindex.c|   2 +-
 net/sched/cls_u32.c|   2 +-
 security/selinux/nlmsgtab.c|   2 +-
 15 files changed, 405 insertions(+), 17 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6488daa32f82..f2a27d41fed5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -235,6 +235,8 @@ struct tcf_result {
};
 };
 
+struct tcf_chain;
+
 struct tcf_proto_ops {
struct list_head head;
char kind[IFNAMSIZ];
@@ -250,17 +252,25 @@ struct tcf_proto_ops {
int (*change)(struct net *net, struct sk_buff *,
struct tcf_proto*, unsigned long,
u32 handle, struct nlattr **,
-   void **, bool,
+   void **, bool, void *tmplt_priv,
struct netlink_ext_ack *);
int (*delete)(struct tcf_proto *tp, void *arg,
  bool *last,
  struct netlink_ext_ack *);
void (*walk)(struct tcf_proto*, struct tcf_walker *arg);
void (*bind_class)(void *, u32, unsigned long);
+   void *  (*tmplt_create)(struct net *net,
+   struct tcf_chain *chain,
+   struct nlattr **tca,
+   struct netlink_ext_ack *extack);
+   void (*tmplt_destroy)(void *tmplt_priv);
 
/* rtnetlink specific */
int (*dump)(struct net*, struct tcf_proto*, void *,
struct sk_buff *skb, struct tcmsg*);
+   int (*tmplt_dump)(struct sk_buff *skb,
+ struct net *net,
+ void *tmplt_priv);
 
struct module   *owner;
 };
@@ -299,6 +309,8 @@ struct tcf_chain {
struct tcf_block *block;
u32 index; /* chain index */
unsigned int refcnt;
+   const struct tcf_proto_ops *tmplt_ops;
+   void *tmplt_priv;
 };
 
 struct tcf_block {
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 7d8502313c99..45fd8cc1fdb2 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -150,6 +150,13 @@ enum {
RTM_NEWCACHEREPORT = 96,
 #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
 
+   RTM_NEWCHAINTMPLT = 100,
+#define RTM_NEWCHAINTMPLT RTM_NEWCHAINTMPLT
+   RTM_DELCHAINTMPLT,
+#define RTM_DELCHAINTMPLT RTM_DELCHAINTMPLT
+   RTM_GETCHAINTMPLT,
+#define RTM_GETCHAINTMPLT RTM_GETCHAINTMPLT
+
__RTM_MAX,
 #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index db45931bbada..0c88520f80f2 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -227,7 +227,7 @@ static void tcf_chain_head_change(struct tcf_chain *chain,
tcf_chain_head_change_item(item, tp_head);
 }
 
-static void tcf_chain_flush(struct tcf_chain *chain)
+static void tcf_chain_flush(struct tcf_chain *chain, bool destroy_template)
 {
struct tcf_proto *tp = rtnl_dereference(chain->filter_chain);
 
@@ -238,6 +238,11 @@ static void tcf_chain_flush(struct tcf_chain *chain)
tp = rtnl_dereference(chain->filter_chain);
tcf_chain_put(chain);
}
+   if (destroy_template && chain->tmplt_ops) {
+   chain->tmplt_ops->tmplt_destroy(

[patch net-next v2 5/9] net: sched: cls_flower: implement chain templates

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

Use the previously introduced template extension and implement
callbacks to create, destroy and dump a chain template. The existing
parsing and dumping functions are re-used. Also, check whether newly added
filters fit the template if one is set.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 107 -
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9ce4375b3252..d64d43843a3a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -70,6 +70,13 @@ struct fl_flow_mask {
struct list_head list;
 };
 
+struct fl_flow_tmplt {
+   struct fl_flow_key dummy_key;
+   struct fl_flow_key mask;
+   struct flow_dissector dissector;
+   struct tcf_chain *chain;
+};
+
 struct cls_fl_head {
struct rhashtable ht;
struct list_head masks;
@@ -144,6 +151,23 @@ static void fl_set_masked_key(struct fl_flow_key *mkey, struct fl_flow_key *key,
*lmkey++ = *lkey++ & *lmask++;
 }
 
+static bool fl_mask_fits_tmplt(struct fl_flow_tmplt *tmplt,
+  struct fl_flow_mask *mask)
+{
+   const long *lmask = fl_key_get_start(&mask->key, mask);
+   const long *ltmplt;
+   int i;
+
+   if (!tmplt)
+   return true;
+   ltmplt = fl_key_get_start(&tmplt->mask, mask);
+   for (i = 0; i < fl_mask_range(mask); i += sizeof(long)) {
+   if (~*ltmplt++ & *lmask++)
+   return false;
+   }
+   return true;
+}
+
 static void fl_clear_masked_range(struct fl_flow_key *key,
  struct fl_flow_mask *mask)
 {
@@ -902,6 +926,7 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
struct cls_fl_filter *f, struct fl_flow_mask *mask,
unsigned long base, struct nlattr **tb,
struct nlattr *est, bool ovr,
+   struct fl_flow_tmplt *tmplt,
struct netlink_ext_ack *extack)
 {
int err;
@@ -922,6 +947,11 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
fl_mask_update_range(mask);
fl_set_masked_key(&f->mkey, &f->key, mask);
 
+   if (!fl_mask_fits_tmplt(tmplt, mask)) {
+   NL_SET_ERR_MSG_MOD(extack, "Mask does not fit the template");
+   return -EINVAL;
+   }
+
return 0;
 }
 
@@ -932,6 +962,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 struct netlink_ext_ack *extack)
 {
struct cls_fl_head *head = rtnl_dereference(tp->root);
+   struct fl_flow_tmplt *tmplt = tmplt_priv;
struct cls_fl_filter *fold = *arg;
struct cls_fl_filter *fnew;
struct nlattr **tb;
@@ -988,7 +1019,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
}
 
err = fl_set_parms(net, tp, fnew, &mask, base, tb, tca[TCA_RATE], ovr,
-  extack);
+  tmplt, extack);
if (err)
goto errout_idr;
 
@@ -1089,6 +1120,52 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg)
}
 }
 
+static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain,
+struct nlattr **tca,
+struct netlink_ext_ack *extack)
+{
+   struct fl_flow_tmplt *tmplt;
+   struct nlattr **tb;
+   int err;
+
+   if (!tca[TCA_OPTIONS])
+   return ERR_PTR(-EINVAL);
+
+   tb = kcalloc(TCA_FLOWER_MAX + 1, sizeof(struct nlattr *), GFP_KERNEL);
+   if (!tb)
+   return ERR_PTR(-ENOBUFS);
+   err = nla_parse_nested(tb, TCA_FLOWER_MAX, tca[TCA_OPTIONS],
+  fl_policy, NULL);
+   if (err)
+   goto errout_tb;
+
+   tmplt = kzalloc(sizeof(*tmplt), GFP_KERNEL);
+   if (!tmplt)
+   goto errout_tb;
+   tmplt->chain = chain;
+   err = fl_set_key(net, tb, &tmplt->dummy_key, &tmplt->mask, extack);
+   if (err)
+   goto errout_tmplt;
+   kfree(tb);
+
+   fl_init_dissector(&tmplt->dissector, &tmplt->mask);
+
+   return tmplt;
+
+errout_tmplt:
+   kfree(tmplt);
+errout_tb:
+   kfree(tb);
+   return ERR_PTR(err);
+}
+
+static void fl_tmplt_destroy(void *tmplt_priv)
+{
+   struct fl_flow_tmplt *tmplt = tmplt_priv;
+
+   kfree(tmplt);
+}
+
 static int fl_dump_key_val(struct sk_buff *skb,
   void *val, int val_type,
   void *mask, int mask_type, int len)
@@ -1435,6 +1512,31 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
return -1;
 }
 
+static int fl_tmplt_dump(struct sk_buff *skb, struct net *net, void *tmplt_priv)
+{
+   struct fl_flow_tmplt *tmplt = tmplt_priv;
+   struct fl_flow_key *key, *mask;
+   struct nlattr *nest;
+
+

[patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

For the TC clsact offload these days, some HW drivers need
to hold a magic ball. The reason is that, with the first rule inserted
into HW, they need to guess which fields will be used for matching. If
this guess later proves to be wrong and the user adds a filter matching
on a different field, there's a problem. Mlxsw currently resolves it with
a couple of patterns that try to cover as many match fields as possible.
This approach is far from optimal, both performance-wise and scale-wise.
Also, there are combinations of filters that won't succeed when inserted
in a certain order.

Most of the time, when a user inserts filters into a chain, he knows right
away what the filters are going to look like - what type and options they
will have. For example, he knows that he will only insert filters of type
flower matching on destination IP address. He can specify a template that
covers all the filters in the chain.

This patchset makes it possible for the user to provide such a template
to the kernel and propagates it all the way down to device drivers.

See the examples below.

Create dummy device with clsact first:
# ip link add type dummy
# tc qdisc add dev dummy0 clsact

There is no template assigned by default:
# tc filter template show dev dummy0 ingress

Add a template of type flower that allows inserting rules matching on the
last 2 bytes of the destination MAC address:
# tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF

The template is now shown in the list:
# tc filter template show dev dummy0 ingress
filter flower chain 0
  dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
  eth_type ipv4

Add another template, this time for chain number 22:
# tc filter template add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16
# tc filter template show dev dummy0 ingress
filter flower chain 0
  dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
  eth_type ipv4
filter flower chain 22
  eth_type ipv4
  dst_ip 0.0.0.0/16

Add a filter that fits the template:
# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:bb:cc:dd:ee:ff/00:00:00:00:00:0F action drop

Adding filters that do not fit the template fails:
# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:11:22:33:44:55/00:00:00:FF:00:00 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Add filters to chain 22; only masks that fit the template are accepted:
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/8 action drop
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/24 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Removing a template from a non-empty chain fails:
# tc filter template del dev dummy0 ingress
Error: The chain is not empty, unable to delete template.
We have an error talking to the kernel, -1

Once the chain is flushed, the template can be removed:
# tc filter del dev dummy0 ingress
# tc filter template del dev dummy0 ingress
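
The accept/reject results in the examples above are consistent with a
simple subset rule: a filter fits when its mask sets no bit that the
template mask leaves clear. A minimal userspace sketch of that rule
follows; mask_fits_template() is a hypothetical helper written for
illustration and is not the function this patchset adds to cls_flower.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if every bit set in mask is also set in tmplt. */
static bool mask_fits_template(const uint8_t *mask, const uint8_t *tmplt,
			       size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		/* A bit the filter matches on but the template does not allow. */
		if (mask[i] & ~tmplt[i])
			return false;
	}
	return true;
}
```

With the chain 22 template mask 255.255.0.0, a /8 filter mask passes this
check while /24 and /32 masks do not, which matches the output shown above.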

---
v1->v2:
-patch 6:
  - remove leftover extack arg in fl_hw_create_tmplt()

Jiri Pirko (9):
  net: sched: push ops lookup bits into tcf_proto_lookup_ops()
  net: sched: introduce chain templates
  net: sched: cls_flower: move key/mask dumping into a separate function
  net: sched: cls_flower: change fl_init_dissector to accept mask and
dissector
  net: sched: cls_flower: implement chain templates
  net: sched: cls_flower: propagate chain teplate creation and
destruction to drivers
  mlxsw: spectrum: Implement chain template hinting
  selftests: forwarding: move shblock tc support check to a separate
helper
  selftests: forwarding: add tests for TC chain templates

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   5 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  12 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |  12 +-
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c|  25 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  44 ++-
 include/net/pkt_cls.h  |   2 +
 include/net/sch_generic.h  |  14 +-
 include/uapi/linux/rtnetlink.h |   7 +
 net/sched/cls_api.c| 424 +++--
 net/sched/cls_basic.c  |   2 +-
 net/sched/cls_bpf.c|   3 +-
 net/sched/cls_cgroup.c |   2 +-
 net/sched/cls_flow.c   |   3 +-
 net/sched/cls_flower.c | 250 +---
 net/sched/cls_fw.c |   3 +-
 net/sched/cls_mat

[patch net-next v2 6/9] net: sched: cls_flower: propagate chain teplate creation and destruction to drivers

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

Introduce a couple of flower offload commands in order to propagate
template creation/destruction events down to device drivers.
Drivers may use this information to prepare HW in an optimal way
for future filter insertions.

Signed-off-by: Jiri Pirko 
---
v1->v2:
- remove leftover extack arg in fl_hw_create_tmplt()
---
 include/net/pkt_cls.h  |  2 ++
 net/sched/cls_flower.c | 39 +++
 2 files changed, 41 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a3c1a2c47cd4..e83968cf9a70 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -715,6 +715,8 @@ enum tc_fl_command {
TC_CLSFLOWER_REPLACE,
TC_CLSFLOWER_DESTROY,
TC_CLSFLOWER_STATS,
+   TC_CLSFLOWER_TMPLT_CREATE,
+   TC_CLSFLOWER_TMPLT_DESTROY,
 };
 
 struct tc_cls_flower_offload {
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index d64d43843a3a..614dd558d5f1 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1120,6 +1120,42 @@ static void fl_walk(struct tcf_proto *tp, struct 
tcf_walker *arg)
}
 }
 
+static void fl_hw_create_tmplt(struct tcf_chain *chain,
+  struct fl_flow_tmplt *tmplt)
+{
+   struct tc_cls_flower_offload cls_flower = {};
+   struct tcf_block *block = chain->block;
+   struct tcf_exts dummy_exts = { 0, };
+
+   cls_flower.common.chain_index = chain->index;
+   cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE;
+   cls_flower.cookie = (unsigned long) tmplt;
+   cls_flower.dissector = &tmplt->dissector;
+   cls_flower.mask = &tmplt->mask;
+   cls_flower.key = &tmplt->dummy_key;
+   cls_flower.exts = &dummy_exts;
+
+   /* We don't care if driver (any of them) fails to handle this
+* call. It serves just as a hint for it.
+*/
+   tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER,
+&cls_flower, false);
+}
+
+static void fl_hw_destroy_tmplt(struct tcf_chain *chain,
+   struct fl_flow_tmplt *tmplt)
+{
+   struct tc_cls_flower_offload cls_flower = {};
+   struct tcf_block *block = chain->block;
+
+   cls_flower.common.chain_index = chain->index;
+   cls_flower.command = TC_CLSFLOWER_TMPLT_DESTROY;
+   cls_flower.cookie = (unsigned long) tmplt;
+
+   tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER,
+&cls_flower, false);
+}
+
 static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain,
 struct nlattr **tca,
 struct netlink_ext_ack *extack)
@@ -1150,6 +1186,8 @@ static void *fl_tmplt_create(struct net *net, struct 
tcf_chain *chain,
 
fl_init_dissector(&tmplt->dissector, &tmplt->mask);
 
+   fl_hw_create_tmplt(chain, tmplt);
+
return tmplt;
 
 errout_tmplt:
@@ -1163,6 +1201,7 @@ static void fl_tmplt_destroy(void *tmplt_priv)
 {
struct fl_flow_tmplt *tmplt = tmplt_priv;
 
+   fl_hw_destroy_tmplt(tmplt->chain, tmplt);
kfree(tmplt);
 }
 
-- 
2.14.4
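
For readers wondering what the receiving side of these two commands could
look like, here is a rough userspace sketch. The enum mirrors the one
extended by the patch above; struct tc_cls_flower_offload is cut down to
two fields, and struct demo_drv and demo_setup_flower() are hypothetical
names invented for this example, not part of any real driver.

```c
#include <errno.h>

/* Userspace copy of the command enum as extended by the patch. */
enum tc_fl_command {
	TC_CLSFLOWER_REPLACE,
	TC_CLSFLOWER_DESTROY,
	TC_CLSFLOWER_STATS,
	TC_CLSFLOWER_TMPLT_CREATE,
	TC_CLSFLOWER_TMPLT_DESTROY,
};

/* Reduced stand-in: the real structure carries more fields. */
struct tc_cls_flower_offload {
	enum tc_fl_command command;
	unsigned long cookie;
};

/* Hypothetical driver state: remembers at most one template hint. */
struct demo_drv {
	unsigned long tmplt_cookie;
};

static int demo_setup_flower(struct demo_drv *drv,
			     const struct tc_cls_flower_offload *f)
{
	switch (f->command) {
	case TC_CLSFLOWER_TMPLT_CREATE:
		/* Record the hint; a real driver would prepare HW here. */
		drv->tmplt_cookie = f->cookie;
		return 0;
	case TC_CLSFLOWER_TMPLT_DESTROY:
		/* Forget the hint only if it is the one we recorded. */
		if (drv->tmplt_cookie == f->cookie)
			drv->tmplt_cookie = 0;
		return 0;
	default:
		/* Other commands are outside the scope of this sketch. */
		return -EOPNOTSUPP;
	}
}
```

Since the flower side ignores the callback result for template commands,
a driver that does not recognise them can keep returning an error without
affecting template creation.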

[patch net-next v2 3/9] net: sched: cls_flower: move key/mask dumping into a separate function

2018-06-26 Thread Jiri Pirko

From: Jiri Pirko 

Push key/mask dumping from fl_dump() into a separate function,
fl_dump_key(), which will be reused for template dumping.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 62 ++
 1 file changed, 37 insertions(+), 25 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 09d6c6e67f9d..76c5516357d5 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1217,29 +1217,9 @@ static int fl_dump_key_flags(struct sk_buff *skb, u32 
flags_key, u32 flags_mask)
return nla_put(skb, TCA_FLOWER_KEY_FLAGS_MASK, 4, &_mask);
 }
 
-static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
-  struct sk_buff *skb, struct tcmsg *t)
+static int fl_dump_key(struct sk_buff *skb, struct net *net,
+  struct fl_flow_key *key, struct fl_flow_key *mask)
 {
-   struct cls_fl_filter *f = fh;
-   struct nlattr *nest;
-   struct fl_flow_key *key, *mask;
-
-   if (!f)
-   return skb->len;
-
-   t->tcm_handle = f->handle;
-
-   nest = nla_nest_start(skb, TCA_OPTIONS);
-   if (!nest)
-   goto nla_put_failure;
-
-   if (f->res.classid &&
-   nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid))
-   goto nla_put_failure;
-
-   key = &f->key;
-   mask = &f->mask->key;
-
if (mask->indev_ifindex) {
struct net_device *dev;
 
@@ -1248,9 +1228,6 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, 
void *fh,
goto nla_put_failure;
}
 
-   if (!tc_skip_hw(f->flags))
-   fl_hw_update_stats(tp, f);
-
if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
sizeof(key->eth.dst)) ||
@@ -1404,6 +1381,41 @@ static int fl_dump(struct net *net, struct tcf_proto 
*tp, void *fh,
if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags))
goto nla_put_failure;
 
+   return 0;
+
+nla_put_failure:
+   return -EMSGSIZE;
+}
+
+static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
+  struct sk_buff *skb, struct tcmsg *t)
+{
+   struct cls_fl_filter *f = fh;
+   struct nlattr *nest;
+   struct fl_flow_key *key, *mask;
+
+   if (!f)
+   return skb->len;
+
+   t->tcm_handle = f->handle;
+
+   nest = nla_nest_start(skb, TCA_OPTIONS);
+   if (!nest)
+   goto nla_put_failure;
+
+   if (f->res.classid &&
+   nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid))
+   goto nla_put_failure;
+
+   key = &f->key;
+   mask = &f->mask->key;
+
+   if (fl_dump_key(skb, net, key, mask))
+   goto nla_put_failure;
+
+   if (!tc_skip_hw(f->flags))
+   fl_hw_update_stats(tp, f);
+
if (f->flags && nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags))
goto nla_put_failure;
 
-- 
2.14.4

[PATCH net-next V3 1/2] cxgb4: Add support for FW_ETH_TX_PKT_VM_WR

2018-06-26 Thread Ganesh Goudar

From: Arjun Vynipadath 

The present TX work request (FW_ETH_TX_PKT_WR) can't be used for
host->vf communication, since it doesn't loop back the outgoing
packets to virtual interfaces on the same port. This can be done using
FW_ETH_TX_PKT_VM_WR.
This fix depends on ethtool_flags to determine which WR to use for the
TX path. Support for setting these flags by the user is added in the next commit.
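
The per-port selection described above can be pictured with the small
userspace sketch below. The flag definitions follow the ones this patch
adds to cxgb4.h; struct port_info is reduced to the one relevant field,
and enum demo_wr and demo_select_tx_wr() are hypothetical names used only
for illustration, not code from the driver.

```c
#include <stdint.h>

#define BIT(n)	(1U << (n))

enum {
	PRIV_FLAG_PORT_TX_VM_BIT,
};

#define PRIV_FLAG_PORT_TX_VM	BIT(PRIV_FLAG_PORT_TX_VM_BIT)

/* Reduced stand-in for the driver's struct port_info. */
struct port_info {
	uint32_t eth_flags;
};

/* Hypothetical labels for the two work request types. */
enum demo_wr {
	DEMO_WR_ETH_TX_PKT,
	DEMO_WR_ETH_TX_PKT_VM,
};

/* Hypothetical selector: which work request a port would use for TX. */
static enum demo_wr demo_select_tx_wr(const struct port_info *pi)
{
	if (pi->eth_flags & PRIV_FLAG_PORT_TX_VM)
		return DEMO_WR_ETH_TX_PKT_VM;
	return DEMO_WR_ETH_TX_PKT;
}
```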

Based on the original work by: Casey Leedom 

V3
- Made eth_flags type consistent across struct adapter and
  struct port_info.
V2
- Renamed t4_eth_xmit() and t4vf_eth_xmit(), since some compilers
  were warning about conflicting definition in cxgb4vf driver

Signed-off-by: Casey Leedom 
Signed-off-by: Arjun Vynipadath 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  13 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   2 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c| 372 +++-
 3 files changed, 383 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 1adb968..a4ea53d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -522,6 +522,15 @@ enum {
MAX_INGQ = MAX_ETH_QSETS + INGQ_EXTRAS,
 };
 
+enum {
+   PRIV_FLAG_PORT_TX_VM_BIT,
+};
+
+#define PRIV_FLAG_PORT_TX_VM   BIT(PRIV_FLAG_PORT_TX_VM_BIT)
+
+#define PRIV_FLAGS_ADAP 0
+#define PRIV_FLAGS_PORT PRIV_FLAG_PORT_TX_VM
+
 struct adapter;
 struct sge_rspq;
 
@@ -558,6 +567,7 @@ struct port_info {
struct hwtstamp_config tstamp_config;
bool ptp_enable;
struct sched_table *sched_tbl;
+   u32 eth_flags;
 };
 
 struct dentry;
@@ -868,6 +878,7 @@ struct adapter {
unsigned int flags;
unsigned int adap_idx;
enum chip_type chip;
+   u32 eth_flags;
 
int msg_enable;
__be16 vxlan_port;
@@ -1334,7 +1345,7 @@ void t4_os_link_changed(struct adapter *adap, int 
port_id, int link_stat);
 void t4_free_sge_resources(struct adapter *adap);
 void t4_free_ofld_rxqs(struct adapter *adap, int n, struct sge_ofld_rxq *q);
 irq_handler_t t4_intr_handler(struct adapter *adap);
-netdev_tx_t t4_eth_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t t4_start_xmit(struct sk_buff *skb, struct net_device *dev);
 int t4_ethrx_handler(struct sge_rspq *q, const __be64 *rsp,
 const struct pkt_gl *gl);
 int t4_mgmt_tx(struct adapter *adap, struct sk_buff *skb);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index bc03c17..d3b0f9c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3217,7 +3217,7 @@ static netdev_features_t cxgb_fix_features(struct 
net_device *dev,
 static const struct net_device_ops cxgb4_netdev_ops = {
.ndo_open = cxgb_open,
.ndo_stop = cxgb_close,
-   .ndo_start_xmit   = t4_eth_xmit,
+   .ndo_start_xmit   = t4_start_xmit,
.ndo_select_queue = cxgb_select_queue,
.ndo_get_stats64  = cxgb_get_stats,
.ndo_set_rx_mode  = cxgb_set_rxmode,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 395e2a0..f1311fd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -1288,13 +1288,13 @@ static inline void t6_fill_tnl_lso(struct sk_buff *skb,
 }
 
 /**
- * t4_eth_xmit - add a packet to an Ethernet Tx queue
+ * cxgb4_eth_xmit - add a packet to an Ethernet Tx queue
  * @skb: the packet
  * @dev: the egress net device
  *
  * Add a packet to an SGE Ethernet Tx queue.  Runs with softirqs disabled.
  */
-netdev_tx_t t4_eth_xmit(struct sk_buff *skb, struct net_device *dev)
+static netdev_tx_t cxgb4_eth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
u32 wr_mid, ctrl0, op;
u64 cntrl, *end, *sgl;
@@ -1547,6 +1547,374 @@ out_free:   dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
 }
 
+/* Constants ... */
+enum {
+   /* Egress Queue sizes, producer and consumer indices are all in units
+* of Egress Context Units bytes.  Note that as far as the hardware is
+* concerned, the free list is an Egress Queue (the host produces free
+* buffers which the hardware consumes) and free list entries are
+* 64-bit PCI DMA addresses.
+*/
+   EQ_UNIT = SGE_EQ_IDXSIZE,
+   FL_PER_EQ_UNIT = EQ_UNIT / sizeof(__be64),
+   TXD_PER_EQ_UNIT = EQ_UNIT / sizeof(__be64),
+
+   T4VF_ETHTXQ_MAX_HDR = (sizeof(struct fw_eth_tx_pkt_vm_wr) +
+  sizeof(struct cpl_tx_pkt_lso_core) +
+  sizeof(struct cpl_tx_pkt_core)) / sizeof(__be64),
+};
+
+/**
+ * t4vf_is_eth_imm - can an Ethernet packet be sent a

[PATCH net-next V3 2/2] cxgb4: Support ethtool private flags

2018-06-26 Thread Ganesh Goudar

From: Arjun Vynipadath 

The private flag is used to switch the TX work request type, which
helps in host->vf communication.

Signed-off-by: Arjun Vynipadath 
Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
index f7eef93..ddb8b9e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
@@ -177,6 +177,10 @@ static char loopback_stats_strings[][ETH_GSTRING_LEN] = {
"bg3_frames_trunc   ",
 };
 
+static const char cxgb4_priv_flags_strings[][ETH_GSTRING_LEN] = {
+   [PRIV_FLAG_PORT_TX_VM_BIT] = "port_tx_vm_wr",
+};
+
 static int get_sset_count(struct net_device *dev, int sset)
 {
switch (sset) {
@@ -185,6 +189,8 @@ static int get_sset_count(struct net_device *dev, int sset)
   ARRAY_SIZE(adapter_stats_strings) +
   ARRAY_SIZE(channel_stats_strings) +
   ARRAY_SIZE(loopback_stats_strings);
+   case ETH_SS_PRIV_FLAGS:
+   return ARRAY_SIZE(cxgb4_priv_flags_strings);
default:
return -EOPNOTSUPP;
}
@@ -235,6 +241,7 @@ static void get_drvinfo(struct net_device *dev, struct 
ethtool_drvinfo *info)
 FW_HDR_FW_VER_MINOR_G(exprom_vers),
 FW_HDR_FW_VER_MICRO_G(exprom_vers),
 FW_HDR_FW_VER_BUILD_G(exprom_vers));
+   info->n_priv_flags = ARRAY_SIZE(cxgb4_priv_flags_strings);
 }
 
 static void get_strings(struct net_device *dev, u32 stringset, u8 *data)
@@ -250,6 +257,9 @@ static void get_strings(struct net_device *dev, u32 
stringset, u8 *data)
data += sizeof(channel_stats_strings);
memcpy(data, loopback_stats_strings,
   sizeof(loopback_stats_strings));
+   } else if (stringset == ETH_SS_PRIV_FLAGS) {
+   memcpy(data, cxgb4_priv_flags_strings,
+  sizeof(cxgb4_priv_flags_strings));
}
 }
 
@@ -1499,6 +1509,36 @@ static int cxgb4_get_module_eeprom(struct net_device 
*dev,
 offset, len, &data[eprom->len - len]);
 }
 
+static u32 cxgb4_get_priv_flags(struct net_device *netdev)
+{
+   struct port_info *pi = netdev_priv(netdev);
+   struct adapter *adapter = pi->adapter;
+
+   return (adapter->eth_flags | pi->eth_flags);
+}
+
+/**
+ * set_flags - set/unset specified flags if passed in new_flags
+ * @cur_flags: pointer to current flags
+ * @new_flags: new incoming flags
+ * @flags: set of flags to set/unset
+ */
+static inline void set_flags(u32 *cur_flags, u32 new_flags, u32 flags)
+{
+   *cur_flags = (*cur_flags & ~flags) | (new_flags & flags);
+}
+
+static int cxgb4_set_priv_flags(struct net_device *netdev, u32 flags)
+{
+   struct port_info *pi = netdev_priv(netdev);
+   struct adapter *adapter = pi->adapter;
+
+   set_flags(&adapter->eth_flags, flags, PRIV_FLAGS_ADAP);
+   set_flags(&pi->eth_flags, flags, PRIV_FLAGS_PORT);
+
+   return 0;
+}
+
 static const struct ethtool_ops cxgb_ethtool_ops = {
.get_link_ksettings = get_link_ksettings,
.set_link_ksettings = set_link_ksettings,
@@ -1535,6 +1575,8 @@ static const struct ethtool_ops cxgb_ethtool_ops = {
.get_dump_data = get_dump_data,
.get_module_info   = cxgb4_get_module_info,
.get_module_eeprom = cxgb4_get_module_eeprom,
+   .get_priv_flags= cxgb4_get_priv_flags,
+   .set_priv_flags= cxgb4_set_priv_flags,
 };
 
 void cxgb4_set_ethtool_ops(struct net_device *netdev)
-- 
2.1.0
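
A short worked example of the masked update done by set_flags() in the
patch above may help. The function below is a userspace copy using
uint32_t in place of the kernel's u32; the values in the note after it
are made up for illustration.

```c
#include <stdint.h>

/*
 * Userspace copy of the patch's set_flags(): bits selected by flags take
 * their value from new_flags, all other bits of *cur_flags are preserved.
 */
static inline void set_flags(uint32_t *cur_flags, uint32_t new_flags,
			     uint32_t flags)
{
	*cur_flags = (*cur_flags & ~flags) | (new_flags & flags);
}
```

Starting from 0x5 with new_flags 0x2 and flags 0x3, the two low bits are
replaced and bit 2 is kept, giving 0x6. With flags equal to 0, which is
what PRIV_FLAGS_ADAP is defined as in patch 1/2, the call leaves the
current value untouched.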

Re: [PATCH net-next 0/6] mlxsw: Support bridge router interfaces with non-default VLAN

2018-06-26 Thread David Miller

From: Ido Schimmel 
Date: Mon, 25 Jun 2018 10:48:12 +0300

> Petr says:
> 
> When traffic is inserted on a router interface associated with an 802.1q
> bridge, the VLAN that the traffic appears on is determined by PVID of
> the bridge device itself. However currently mlxsw always configures such
> traffic to be forwarded to VLAN 1, regardless of the bridge PVID.
> 
> Fix the problem by modifying the FID-handling code to assign such
> traffic not to FID that corresponds to VLAN 1, but to a FID that
> corresponds to the configured PVID. Bail out if there is no PVID. This
> is implemented in patches #1 and #2.
> 
> From that point on, also forbid any changes to bridge device PVID,
> because such changes would not be reflected. This is implemented in
> patches #3, #4 and #5.
> 
> Finally in patch #6, introduce tests that use bridge as a routed
> interface, and test mlxsw in both the currently-supported scenario of
> using PVID 1, and the newly-supported one of using a custom PVID.

Series applied, thank you.

Re: Request to enable setting the nested network namespace

2018-06-26 Thread Pamela Mei

I don't mean that tracking the whole history of netns changes should be
mandatory. I mean it would be better to have an option that lets the user
set a new parent for the child netns, not only the initial one.
Is there any technical bottleneck for this request?

Cheers,
Pamela MEI

On Thu, Jun 14, 2018 at 5:27 PM, Jiri Pirko  wrote:
> Thu, Jun 14, 2018 at 10:04:57AM CEST, pamela@gmail.com wrote:
>>In linux, set up 2 network namespaces, ns1 and ns2. "ip netns list"
>>can view the 2 network namespaces.
>>Move one network device from linux root namespace to ns1 then from ns1
>>to ns2, then delete ns2,
>>expect that network device can move back to ns1,
>>but actual result is that eth1 is back to linux root network
>>namespace. I'm not sure whether it's as expected.
>>
>>Here is the detail test steps:
>>
>>1.ip netns add ns1
>>
>>2.ip netns add ns2
>>
>>3.ip link set eth1 netns ns1
>>
>>4.ip netns exec ns1 ip link set eth1 netns ns2
>>
>>5.ip netns del ns2
>>
>>Expected result: eth1 will be in ns1
>>
>>Actual result: eth1 is back in linux root namespace 1
>>
>>Question: is there any method to realize such scenario to make sure
>>device can be back to ns1 not linux root network namespace 1?
>>
>>How about if there's a function to enable nest network namespace e.g.
>>can set ns1 as the parent namespace of ns2, then device can return to
>>ns1 when ns2 is gone.
>
> You would have to track the whole history of netns changes for each
> netdevice. That does not sound right. Move back to initial netns seems
> correct to me.
>
>
>>
>>
>>Cheers,
>>
>>Pamela MEI

[PATCH net-next] cxgb4: Add flag tc_flower_initialized

2018-06-26 Thread Ganesh Goudar

From: Casey Leedom 

Add flag tc_flower_initialized to indicate the
completion of tc flower initialization.

Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h   | 1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 8 
 drivers/net/ethernet/chelsio/cxgb4/sched.c   | 3 +++
 3 files changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index a4ea53d..4a8cbd8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -968,6 +968,7 @@ struct adapter {
struct chcr_stats_debug chcr_stats;
 
/* TC flower offload */
+   bool tc_flower_initialized;
struct rhashtable flower_tbl;
struct rhashtable_params flower_ht_params;
struct timer_list flower_stats_timer;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index 3ddd2c4..623f73d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -874,6 +874,9 @@ int cxgb4_init_tc_flower(struct adapter *adap)
 {
int ret;
 
+   if (adap->tc_flower_initialized)
+   return -EEXIST;
+
adap->flower_ht_params = cxgb4_tc_flower_ht_params;
ret = rhashtable_init(&adap->flower_tbl, &adap->flower_ht_params);
if (ret)
@@ -882,13 +885,18 @@ int cxgb4_init_tc_flower(struct adapter *adap)
INIT_WORK(&adap->flower_stats_work, ch_flower_stats_handler);
timer_setup(&adap->flower_stats_timer, ch_flower_stats_cb, 0);
mod_timer(&adap->flower_stats_timer, jiffies + STATS_CHECK_PERIOD);
+   adap->tc_flower_initialized = true;
return 0;
 }
 
 void cxgb4_cleanup_tc_flower(struct adapter *adap)
 {
+   if (!adap->tc_flower_initialized)
+   return;
+
if (adap->flower_stats_timer.function)
del_timer_sync(&adap->flower_stats_timer);
cancel_work_sync(&adap->flower_stats_work);
rhashtable_destroy(&adap->flower_tbl);
+   adap->tc_flower_initialized = false;
 }
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.c 
b/drivers/net/ethernet/chelsio/cxgb4/sched.c
index 9148abb..7fc6566 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.c
@@ -539,6 +539,9 @@ void t4_cleanup_sched(struct adapter *adap)
struct port_info *pi = netdev2pinfo(adap->port[j]);
 
s = pi->sched_tbl;
+   if (!s)
+   continue;
+
for (i = 0; i < s->sched_size; i++) {
struct sched_class *e;
 
-- 
2.1.0
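
The guard pattern in the patch above can be exercised in isolation with
the userspace sketch below. struct adapter is reduced to the new flag
plus a hypothetical resources counter that stands in for the hash table,
work item and timer; the demo_ functions are illustrative names, not
driver code.

```c
#include <errno.h>
#include <stdbool.h>

/* Reduced stand-in for struct adapter. */
struct adapter {
	bool tc_flower_initialized;
	int resources;	/* hypothetical: live init-time resources */
};

static int demo_init_tc_flower(struct adapter *adap)
{
	/* A second init must not set up the resources again. */
	if (adap->tc_flower_initialized)
		return -EEXIST;

	adap->resources++;
	adap->tc_flower_initialized = true;
	return 0;
}

static void demo_cleanup_tc_flower(struct adapter *adap)
{
	/* Cleanup without a prior successful init is a no-op. */
	if (!adap->tc_flower_initialized)
		return;

	adap->resources--;
	adap->tc_flower_initialized = false;
}
```

With the flag in place, calling init twice or cleanup twice leaves the
resource count balanced instead of double-allocating or double-freeing.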

[PATCH net-next] cxgb4: Add new T5 PCI device id 0x50ae

2018-06-26 Thread Ganesh Goudar

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
index c7f8d04..e3adf43 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
@@ -188,6 +188,7 @@ CH_PCI_DEVICE_ID_TABLE_DEFINE_BEGIN
CH_PCI_ID_TABLE_FENTRY(0x50ab), /* Custom T520-CR */
CH_PCI_ID_TABLE_FENTRY(0x50ac), /* Custom T540-BT */
CH_PCI_ID_TABLE_FENTRY(0x50ad), /* Custom T520-CR */
+   CH_PCI_ID_TABLE_FENTRY(0x50ae), /* Custom T540-XL-SO */
 
/* T6 adapters:
 */
-- 
2.1.0

Re: [PATCH v2 bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-26 Thread Daniel Borkmann

Hi David,

first of all, sorry for my late reply; I've been mostly offline last week. I
think there's still an issue with the current patch, more below:

On 06/21/2018 05:00 AM, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> For ACLs implemented using either FIB rules or FIB entries, the BPF
> program needs the FIB lookup status to be able to drop the packet.
> Since the bpf_fib_lookup API has not reached a released kernel yet,
> change the return code to contain an encoding of the FIB lookup
> result and return the nexthop device index in the params struct.
> 
> In addition, inform the BPF program of any post FIB lookup reason as
> to why the packet needs to go up the stack.
> 
> The fib result for unicast routes must have an egress device, so remove
> the check that it is non-NULL.
> 
> Signed-off-by: David Ahern 
> ---
> v2
> - drop BPF_FIB_LKUP_RET_NO_NHDEV; check in dev in fib result not needed
> - enhance documentation of BPF_FIB_LKUP_RET_ codes
> 
>  include/uapi/linux/bpf.h   | 28 ++
>  net/core/filter.c  | 72 
> ++
>  samples/bpf/xdp_fwd_kern.c |  8 +++---
>  3 files changed, 74 insertions(+), 34 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..b7db3261c62d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1857,7 +1857,8 @@ union bpf_attr {
>   *   is resolved), the nexthop address is returned in ipv4_dst
>   *   or ipv6_dst based on family, smac is set to mac address of
>   *   egress device, dmac is set to nexthop mac address, rt_metric
> - *   is set to metric from route (IPv4/IPv6 only).
> + *   is set to metric from route (IPv4/IPv6 only), and ifindex
> + *   is set to the device index of the nexthop from the FIB lookup.
>   *
>   * *plen* argument is the size of the passed in struct.
>   * *flags* argument can be a combination of one or more of the
> @@ -1873,9 +1874,10 @@ union bpf_attr {
>   * *ctx* is either **struct xdp_md** for XDP programs or
>   * **struct sk_buff** tc cls_act programs.
>   * Return
> - * Egress device index on success, 0 if packet needs to continue
> - * up the stack for further processing or a negative error in 
> case
> - * of failure.
> + *   * < 0 if any input argument is invalid
> + *   *   0 on success (packet is forwarded, nexthop neighbor exists)
> + *   * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
> + *   * packet is not forwarded or needs assist from full stack
>   *
>   * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map 
> *map, void *key, u64 flags)
>   *   Description
> @@ -2612,6 +2614,18 @@ struct bpf_raw_tracepoint_args {
>  #define BPF_FIB_LOOKUP_DIRECT  BIT(0)
>  #define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
>  
> +enum {
> + BPF_FIB_LKUP_RET_SUCCESS,  /* lookup successful */
> + BPF_FIB_LKUP_RET_BLACKHOLE,/* dest is blackholed; can be dropped */
> + BPF_FIB_LKUP_RET_UNREACHABLE,  /* dest is unreachable; can be dropped */
> + BPF_FIB_LKUP_RET_PROHIBIT, /* dest not allowed; can be dropped */
> + BPF_FIB_LKUP_RET_NOT_FWDED,/* packet is not forwarded */
> + BPF_FIB_LKUP_RET_FWD_DISABLED, /* fwding is not enabled on ingress */
> + BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires encapsulation */
> + BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
> + BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> +};
> +
>  struct bpf_fib_lookup {
>   /* input:  network family for lookup (AF_INET, AF_INET6)
>* output: network family of egress nexthop
> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
>  
>   /* total length of packet from network header - used for MTU check */
>   __u16   tot_len;
> - __u32   ifindex;  /* L3 device index for lookup */
> +
> + /* input: L3 device index for lookup
> +  * output: device index from FIB lookup
> +  */
> + __u32   ifindex;
>  
>   union {
>   /* inputs to lookup */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e7f12e9f598c..f8dd8aa89de4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4073,8 +4073,9 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup 
> *params,
>   memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>   params->h_vlan_TCI = 0;
>   params->h_vlan_proto = 0;
> + params->ifindex = dev->ifindex;
>  
> - return dev->ifindex;
> + return 0;
>  }
>  #endif
>  
> @@ -4098,7 +4099,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct 
> bpf_fib_lookup *params,
>   /* verify forwarding is enabled on this interface */
>   in_dev = __in_dev_get_rcu(dev);
>   if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
> - return 0;
> + return BPF_F

[PATCH 8/9] networking: e1000.rst: Get rid of Sphinx warnings

2018-06-26 Thread Mauro Carvalho Chehab

Documentation/networking/e1000.rst:83: ERROR: Unexpected indentation.
Documentation/networking/e1000.rst:84: WARNING: Block quote ends without a 
blank line; unexpected unindent.
Documentation/networking/e1000.rst:173: WARNING: Definition list ends 
without a blank line; unexpected unindent.
Documentation/networking/e1000.rst:236: WARNING: Definition list ends 
without a blank line; unexpected unindent.

While here, fix highlights and mark a table as such.

Signed-off-by: Mauro Carvalho Chehab 
---
 Documentation/networking/e1000.rst | 187 +
 1 file changed, 112 insertions(+), 75 deletions(-)

diff --git a/Documentation/networking/e1000.rst 
b/Documentation/networking/e1000.rst
index 144b87eef153..f10dd4086921 100644
--- a/Documentation/networking/e1000.rst
+++ b/Documentation/networking/e1000.rst
@@ -34,7 +34,8 @@ Command Line Parameters
 The default value for each parameter is generally the recommended setting,
 unless otherwise noted.
 
-NOTES:  For more information about the AutoNeg, Duplex, and Speed
+NOTES:
+   For more information about the AutoNeg, Duplex, and Speed
 parameters, see the "Speed and Duplex Configuration" section in
 this document.
 
@@ -45,22 +46,27 @@ NOTES:  For more information about the AutoNeg, Duplex, and 
Speed
 
 AutoNeg
 ---
+
 (Supported only on adapters with copper connections)
-Valid Range:   0x01-0x0F, 0x20-0x2F
-Default Value: 0x2F
+
+:Valid Range:   0x01-0x0F, 0x20-0x2F
+:Default Value: 0x2F
 
 This parameter is a bit-mask that specifies the speed and duplex settings
 advertised by the adapter.  When this parameter is used, the Speed and
 Duplex parameters must not be specified.
 
-NOTE:  Refer to the Speed and Duplex section of this readme for more
+NOTE:
+   Refer to the Speed and Duplex section of this readme for more
information on the AutoNeg parameter.
 
 Duplex
 --
+
 (Supported only on adapters with copper connections)
-Valid Range:   0-2 (0=auto-negotiate, 1=half, 2=full)
-Default Value: 0
+
+:Valid Range:   0-2 (0=auto-negotiate, 1=half, 2=full)
+:Default Value: 0
 
 This defines the direction in which data is allowed to flow.  Can be
 either one or two-directional.  If both Duplex and the link partner are
@@ -70,18 +76,22 @@ duplex.
 
 FlowControl
 ---
-Valid Range:   0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
-Default Value: Reads flow control settings from the EEPROM
+
+:Valid Range:   0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
+:Default Value: Reads flow control settings from the EEPROM
 
 This parameter controls the automatic generation(Tx) and response(Rx)
 to Ethernet PAUSE frames.
 
 InterruptThrottleRate
 -
+
 (not supported on Intel(R) 82542, 82543 or 82544-based adapters)
-Valid Range:   0,1,3,4,100-10 (0=off, 1=dynamic, 3=dynamic conservative,
- 4=simplified balancing)
-Default Value: 3
+
+:Valid Range:
+   0,1,3,4,100-10 (0=off, 1=dynamic, 3=dynamic conservative,
+   4=simplified balancing)
+:Default Value: 3
 
 The driver can limit the amount of interrupts per second that the adapter
 will generate for incoming packets. It does this by writing a value to the
@@ -135,13 +145,15 @@ Setting InterruptThrottleRate to 0 turns off any 
interrupt moderation
 and may improve small packet latency, but is generally not suitable
 for bulk throughput traffic.
 
-NOTE:  InterruptThrottleRate takes precedence over the TxAbsIntDelay and
+NOTE:
+   InterruptThrottleRate takes precedence over the TxAbsIntDelay and
RxAbsIntDelay parameters.  In other words, minimizing the receive
and/or transmit absolute delays does not force the controller to
generate more interrupts than what the Interrupt Throttle Rate
allows.
 
-CAUTION:  If you are using the Intel(R) PRO/1000 CT Network Connection
+CAUTION:
+  If you are using the Intel(R) PRO/1000 CT Network Connection
   (controller 82547), setting InterruptThrottleRate to a value
   greater than 75,000, may hang (stop transmitting) adapters
   under certain network conditions.  If this occurs a NETDEV
@@ -151,7 +163,8 @@ CAUTION:  If you are using the Intel(R) PRO/1000 CT Network 
Connection
   hang, ensure that InterruptThrottleRate is set no greater
   than 75,000 and is not set to 0.
 
-NOTE:  When e1000 is loaded with default settings and multiple adapters
+NOTE:
+   When e1000 is loaded with default settings and multiple adapters
are in use simultaneously, the CPU utilization may increase non-
linearly.  In order to limit the CPU utilization without impacting
the overall throughput, we recommend that you load the driver as
@@ -168,9 +181,11 @@ NOTE:  When e1000 is loaded with default settings and 
multiple adapters
 
 RxDescriptors
 -
-Valid Range:   48-256 for 82542 and 82543-based adapters
-   48-4096 for all other supported adapte

Re: [PATCH bpf] nfp: bpf: don't stop offload if replace failed

2018-06-26 Thread Daniel Borkmann

On 06/22/2018 08:56 PM, Jakub Kicinski wrote:
> Stopping offload completely if replace of program failed dates
> back to days of transparent offload.  Back then we wanted to
> silently fall back to the in-driver processing.  Today we mark
> programs for offload when they are loaded into the kernel, so
> the transparent offload is no longer a reality.
> 
> Flags check in the driver will only allow replace of a driver
> program with another driver program or an offload program with
> another offload program.
> 
> When driver program is replaced stopping offload is a no-op,
> because driver program isn't offloaded.  When replacing
> offloaded program if the offload fails the entire operation
> will fail all the way back to user space and we should continue
> using the old program.  IOW when replacing a driver program
> stopping offload is unnecessary and when replacing offloaded
> program - it's a bug, old program should continue to run.
> 
> In practice this bug would mean that if offload operation was to
> fail (either due to FW communication error, kernel OOM or new
> program being offloaded but for a different netdev) driver
> would continue reporting that previous XDP program is offloaded
> but in fact no program will be loaded in hardware.  The failure
> is fairly unlikely (found by inspection, when working on the code)
> but it's unpleasant.
> 
> Backport note: even though the bug was introduced in commit
> cafa92ac2553 ("nfp: bpf: add support for XDP_FLAGS_HW_MODE"),
> this fix depends on commit 441a33031fe5 ("net: xdp: don't allow
> device-bound programs in driver mode"), so this fix is sufficient
> only in v4.15 or newer.  Kernels v4.13.x and v4.14.x do need to
> stop offload if it was transparent/opportunistic, i.e. if
> XDP_FLAGS_HW_MODE was not set on running program.
> 
> Fixes: cafa92ac2553 ("nfp: bpf: add support for XDP_FLAGS_HW_MODE")
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 

Applied to bpf, thanks Jakub!

[patch net-next RFC 08/12] mlxsw: core: Extend cooling device with cooling levels

2018-06-26 Thread Vadim Pasternak

Extend the cooling device with a cooling levels vector to allow more
flexible PWM settings.
The thermal zone algorithm operates on numerical states for the PWM
setting. Each state is an index in the range 0 to 10 and is mapped to
the relevant duty cycle value, which is written to the PWM controller.
With the current definition, the FAN speed is set to 0% for state 0,
10% for state 1, and so on up to 100% for the maximum state 10.
Some systems have a minimum PWM speed limit. On such systems, setting
the PWM speed to 0% disables the ability to increase the speed again,
and the device stalls at zero speed.
Cooling levels allow the state vector to be configured according to
the requirements of the particular system. For example, if the PWM
speed is not allowed to drop below 30%, the cooling levels could be
configured as 30%, 30%, 30%, 30%, 40%, 50% and so on.
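
As a rough illustration of the mapping described above, the vector can be
filled so that every state below the minimum is clamped to it. This is a
minimal standalone sketch, not the driver code: MAX_STATE mirrors
MLXSW_THERMAL_MAX_STATE, while cooling_levels_fill() and state_to_percent()
are hypothetical helper names introduced only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STATE 10	/* mirrors MLXSW_THERMAL_MAX_STATE in the patch */

/* Hypothetical helper: fill levels[0..MAX_STATE] so that no state maps
 * below min_level. Each level is a speed expressed in units of 10%.
 */
static void cooling_levels_fill(uint8_t levels[MAX_STATE + 1],
				unsigned int min_level)
{
	unsigned int i;

	assert(min_level <= MAX_STATE);
	for (i = 0; i <= MAX_STATE; i++)
		levels[i] = (uint8_t)(i < min_level ? min_level : i);
}

/* Hypothetical helper: translate a thermal state into a speed in percent. */
static unsigned int state_to_percent(const uint8_t levels[MAX_STATE + 1],
				     unsigned int state)
{
	assert(state <= MAX_STATE);
	return levels[state] * 10u;
}
```

With min_level set to 3, states 0 through 3 all resolve to 30%, state 4 to
40%, and state 10 to 100%, which matches the example in the description.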

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 59 +-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 1587820..53e4ef9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -46,6 +46,15 @@
 #define MLXSW_THERMAL_HYSTERESIS_TEMP  5000    /* 5C */
 #define MLXSW_THERMAL_MAX_STATE        10
 #define MLXSW_THERMAL_MAX_DUTY         255
+/* Minimum and maximum FAN allowed speed in percent: from 20% to 100%. Values
+ * MLXSW_THERMAL_MAX_STATE + x, where x is between 2 and 10 are used for
+ * setting FAN speed dynamic minimum. For example, if value is set to 14 (40%)
+ * cooling levels vector will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 to
+ * introduce PWM speed in percent: 40, 40, 40, 40, 40, 50, 60, 70, 80, 90, 100.
+ */
+#define MLXSW_THERMAL_SPEED_MIN        (MLXSW_THERMAL_MAX_STATE + 2)
+#define MLXSW_THERMAL_SPEED_MAX        (MLXSW_THERMAL_MAX_STATE * 2)
+#define MLXSW_THERMAL_SPEED_MIN_LEVEL  2       /* 20 percent */
 
 struct mlxsw_thermal_trip {
int type;
@@ -97,6 +106,7 @@ struct mlxsw_thermal {
struct thermal_zone_device *tzdev;
int polling_delay;
struct thermal_cooling_device *cdevs[MLXSW_MFCR_PWMS_MAX];
+   u8 cooling_levels[MLXSW_THERMAL_MAX_STATE + 1];
struct mlxsw_thermal_trip trips[MLXSW_THERMAL_NUM_TRIPS];
enum thermal_device_mode mode;
 };
@@ -361,12 +371,52 @@ static int mlxsw_thermal_set_cur_state(struct 
thermal_cooling_device *cdev,
struct mlxsw_thermal *thermal = cdev->devdata;
struct device *dev = thermal->bus_info->dev;
char mfsc_pl[MLXSW_REG_MFSC_LEN];
-   int err, idx;
+   unsigned long cur_state;
+   int idx, i;
+   u8 duty;
+   int err;
 
idx = mlxsw_get_cooling_device_idx(thermal, cdev);
if (idx < 0)
return idx;
 
+   /* Verify if this request is for changing allowed FAN dynamical
+* minimum. If it is - update cooling levels accordingly and update
+* state, if current state is below the newly requested minimum state.
+* For example, if current state is 5, and minimal state is to be
+* changed from 4 to 6, thermal->cooling_levels[0 to 5] will be changed
+* all from 4 to 6. And state 5 (thermal->cooling_levels[4]) should be
+* overwritten.
+*/
+   if (state >= MLXSW_THERMAL_SPEED_MIN &&
+   state <= MLXSW_THERMAL_SPEED_MAX) {
+   state -= MLXSW_THERMAL_MAX_STATE;
+   for (i = 0; i < state; i++)
+   thermal->cooling_levels[i] = state;
+   for (i = state; i <= MLXSW_THERMAL_MAX_STATE; i++)
+   thermal->cooling_levels[i] = i;
+
+   mlxsw_reg_mfsc_pack(mfsc_pl, idx, 0);
+   err = mlxsw_reg_query(thermal->core, MLXSW_REG(mfsc), mfsc_pl);
+   if (err) {
+   dev_err(dev, "Failed to query PWM duty\n");
+   return err;
+   }
+
+   duty = mlxsw_reg_mfsc_pwm_duty_cycle_get(mfsc_pl);
+   cur_state = mlxsw_duty_to_state(duty);
+
+   if (state < cur_state)
+   return 0;
+
+   state = cur_state;
+   }
+
+   if (state > MLXSW_THERMAL_MAX_STATE)
+   return -EINVAL;
+
+   /* Normalize the state to the valid speed range. */
+   state = thermal->cooling_levels[state];
mlxsw_reg_mfsc_pack(mfsc_pl, idx, mlxsw_state_to_duty(state));
err = mlxsw_reg_write(thermal->core, MLXSW_REG(mfsc), mfsc_pl);
if (err) {
@@ -445,6 +495,13 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
}
}
 
+   /* Init cooling levels per PWM state. */
+   for (i = 0; i < MLXSW_THERMAL_SPEED_MIN_LEVEL; i++)
+   thermal->cooling_levels[i] = MLXSW_THERMAL_SPEED_MIN_

[patch net-next RFC 04/12] mlxsw: core: Add bus frequency capability flag for the bus type

2018-06-26 Thread Vadim Pasternak

Add a low frequency bus capability flag in order to allow core
functionality to be separated based on bus type. The driver can run
over PCIe, which is considered a high frequency bus, or over I2C,
which is considered a low frequency bus. In the latter case, time
settings such as the thermal polling interval should be increased.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.h | 1 +
 drivers/net/ethernet/mellanox/mlxsw/i2c.c  | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 552cfa2..95e6190 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -371,6 +371,7 @@ struct mlxsw_bus_info {
struct mlxsw_fw_rev fw_rev;
u8 vsd[MLXSW_CMD_BOARDINFO_VSD_LEN];
u8 psid[MLXSW_CMD_BOARDINFO_PSID_LEN];
+   bool low_frequency;
 };
 
 struct mlxsw_hwmon;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/i2c.c 
b/drivers/net/ethernet/mellanox/mlxsw/i2c.c
index 25f9915..384b337 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/i2c.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/i2c.c
@@ -536,6 +536,7 @@ static int mlxsw_i2c_probe(struct i2c_client *client,
mlxsw_i2c->bus_info.device_kind = id->name;
mlxsw_i2c->bus_info.device_name = client->name;
mlxsw_i2c->bus_info.dev = &client->dev;
+   mlxsw_i2c->bus_info.low_frequency = true;
mlxsw_i2c->dev = &client->dev;
 
err = mlxsw_core_bus_device_register(&mlxsw_i2c->bus_info,
-- 
2.1.4

[patch net-next RFC 07/12] mlxsw: core: Extend thermal zone operations with get_trend method

2018-06-26 Thread Vadim Pasternak

The thermal get_trend method is added in order to notify the user of a
fast temperature drop. This can happen when one or a few very hot port
cables are removed. In that situation the temperature trend could go
down once and then stay stable, while the PWM state would be decreased
only once and could remain at a non-optimal high state.
The notification allows the user to take appropriate action if
necessary.
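
The detection condition can be sketched in isolation as follows. This is only
an illustration under stated assumptions: TEMP_WINDOW_MC is a made-up
placeholder, since the actual MLXSW_ENV_TEMP_WINDOW value is defined in
core_env.h and not shown in this patch, and fast_temp_drop() is a hypothetical
helper name.

```c
#include <stdbool.h>

/* Placeholder window in millidegrees Celsius. The real
 * MLXSW_ENV_TEMP_WINDOW value is not visible in this patch.
 */
#define TEMP_WINDOW_MC 20000

/* Hypothetical helper: return true when the temperature fell by more
 * than the window between two consecutive polls.
 */
static bool fast_temp_drop(int last_temp_mc, int cur_temp_mc)
{
	return (last_temp_mc - cur_temp_mc) > TEMP_WINDOW_MC;
}
```

A rising temperature yields a negative delta and therefore never triggers the
notification, which is in line with the "fast temperature drop" wording above.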

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 27 ++
 1 file changed, 27 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 91c4946..1587820 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -281,6 +281,32 @@ static int mlxsw_thermal_set_trip_hyst(struct 
thermal_zone_device *tzdev,
return 0;
 }
 
+static int mlxsw_thermal_get_trend(struct thermal_zone_device *tzdev,
+  int trip, enum thermal_trend *trend)
+{
+   int delta;
+
+   if (trip < 0 || trip >= MLXSW_THERMAL_NUM_TRIPS)
+   return -EINVAL;
+
+   delta = tzdev->last_temperature - tzdev->temperature;
+   if (delta > MLXSW_ENV_TEMP_WINDOW) {
+   /* Notify user about fast temperature decreasing by sending
+* hwmon uevent. Decreasing could happen in case one or few
+* very hot port cables have been removed. In this situation
+* temperature trend could go down once, and then could stay
+* in a stable state, while PWM state will be decreased only
+* once. As a side effect PWM could be not at optimal speed.
+* Notification will allow user to handle such case, if user
+* supposes to optimize PWM state.
+*/
+   kobject_uevent(&tzdev->device.kobj, KOBJ_CHANGE);
+   }
+
+   /* Return non-zero value to pass control to get_tz_trend() routine. */
+   return 1;
+}
+
 static struct thermal_zone_device_ops mlxsw_thermal_ops = {
.bind   = mlxsw_thermal_bind,
.unbind = mlxsw_thermal_unbind,
@@ -292,6 +318,7 @@ static struct thermal_zone_device_ops mlxsw_thermal_ops = {
.set_trip_temp  = mlxsw_thermal_set_trip_temp,
.get_trip_hyst  = mlxsw_thermal_get_trip_hyst,
.set_trip_hyst  = mlxsw_thermal_set_trip_hyst,
+   .get_trend  = mlxsw_thermal_get_trend,
 };
 
 static int mlxsw_thermal_get_max_state(struct thermal_cooling_device *cdev,
-- 
2.1.4

[patch net-next RFC 05/12] mlxsw: core: Set different thermal polling time based on bus type

2018-06-26 Thread Vadim Pasternak

Use a different thermal polling interval based on bus type.
For the I2C bus the interval is set to 20 seconds, while for PCIe a
1 second polling interval is used.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index d866c98..152591d8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -41,6 +41,7 @@
 #include "core.h"
 
 #define MLXSW_THERMAL_POLL_INT 1000    /* ms */
+#define MLXSW_THERMAL_SLOW_POLL_INT    20000   /* ms */
 #define MLXSW_THERMAL_MAX_TEMP 110000  /* 110C */
 #define MLXSW_THERMAL_MAX_STATE        10
 #define MLXSW_THERMAL_MAX_DUTY 255
@@ -95,6 +96,7 @@ struct mlxsw_thermal {
struct mlxsw_core *core;
const struct mlxsw_bus_info *bus_info;
struct thermal_zone_device *tzdev;
+   int polling_delay;
struct thermal_cooling_device *cdevs[MLXSW_MFCR_PWMS_MAX];
struct mlxsw_thermal_trip trips[MLXSW_THERMAL_NUM_TRIPS];
enum thermal_device_mode mode;
@@ -190,7 +192,7 @@ static int mlxsw_thermal_set_mode(struct 
thermal_zone_device *tzdev,
mutex_lock(&tzdev->lock);
 
if (mode == THERMAL_DEVICE_ENABLED)
-   tzdev->polling_delay = MLXSW_THERMAL_POLL_INT;
+   tzdev->polling_delay = thermal->polling_delay;
else
tzdev->polling_delay = 0;
 
@@ -397,13 +399,18 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
}
}
 
+   if (bus_info->low_frequency)
+   thermal->polling_delay = MLXSW_THERMAL_SLOW_POLL_INT;
+   else
+   thermal->polling_delay = MLXSW_THERMAL_POLL_INT;
+
thermal->tzdev = thermal_zone_device_register("mlxsw",
  MLXSW_THERMAL_NUM_TRIPS,
  MLXSW_THERMAL_TRIP_MASK,
  thermal,
  &mlxsw_thermal_ops,
  NULL, 0,
- MLXSW_THERMAL_POLL_INT);
+ thermal->polling_delay);
if (IS_ERR(thermal->tzdev)) {
err = PTR_ERR(thermal->tzdev);
dev_err(dev, "Failed to register thermal zone\n");
-- 
2.1.4

[patch net-next RFC 09/12] mlxsw: core: Rename cooling device

2018-06-26 Thread Vadim Pasternak

The name "Fan" is too generic and is misleading when interpreted by
the user.
For example, the name "Fan" could also be used by ACPI.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 53e4ef9..65962ed 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -484,7 +484,8 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
if (pwm_active & BIT(i)) {
struct thermal_cooling_device *cdev;
 
-   cdev = thermal_cooling_device_register("Fan", thermal,
+   cdev = thermal_cooling_device_register("mlxsw_fan",
+   thermal,
&mlxsw_cooling_ops);
if (IS_ERR(cdev)) {
err = PTR_ERR(cdev);
-- 
2.1.4

[patch net-next RFC 06/12] mlxsw: core: Modify thermal zone definition

2018-06-26 Thread Vadim Pasternak

The thermal zone trip point settings are modified for better alignment
with the modified thermal algorithm.
Hysteresis thresholds are added to the thermal trips in order to avoid
throttling around a thermal trip point. If the hysteresis temperature
is not considered, the PWM can flip up and down at the thermal trip
point boundary.
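
The effect of a hysteresis threshold can be illustrated with a generic
two-threshold comparison. This is a sketch of the general idea only, not the
decision logic of the kernel thermal core or of any particular governor;
trip_active() is a hypothetical helper name and all temperatures are in
millidegrees Celsius.

```c
#include <stdbool.h>

/* Hypothetical helper: decide whether a trip is active, given its
 * previous state. The trip turns on at trip_mc and only turns off once
 * the temperature has dropped below trip_mc - hyst_mc.
 */
static bool trip_active(bool was_active, int temp_mc, int trip_mc,
			int hyst_mc)
{
	if (temp_mc >= trip_mc)
		return true;
	if (temp_mc < trip_mc - hyst_mc)
		return false;
	/* Inside the hysteresis band: keep the previous state. */
	return was_active;
}
```

With a 75000 trip point and the 5000 hysteresis used in this patch, a reading
of 72000 keeps an already active trip active but does not activate an inactive
one, so small oscillations around the trip point no longer toggle the PWM.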

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 63 ++
 1 file changed, 41 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 152591d8..91c4946 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -39,16 +39,18 @@
 #include 
 
 #include "core.h"
+#include "core_env.h"
 
-#define MLXSW_THERMAL_POLL_INT 1000    /* ms */
+#define MLXSW_THERMAL_POLL_INT         1000    /* ms */
 #define MLXSW_THERMAL_SLOW_POLL_INT    20000   /* ms */
-#define MLXSW_THERMAL_MAX_TEMP 110000  /* 110C */
-#define MLXSW_THERMAL_MAX_STATE        10
-#define MLXSW_THERMAL_MAX_DUTY 255
+#define MLXSW_THERMAL_HYSTERESIS_TEMP  5000    /* 5C */
+#define MLXSW_THERMAL_MAX_STATE        10
+#define MLXSW_THERMAL_MAX_DUTY         255
 
 struct mlxsw_thermal_trip {
int type;
int temp;
+   int hyst;
int min_state;
int max_state;
 };
@@ -56,32 +58,29 @@ struct mlxsw_thermal_trip {
 static const struct mlxsw_thermal_trip default_thermal_trips[] = {
{   /* In range - 0-40% PWM */
.type   = THERMAL_TRIP_ACTIVE,
-   .temp   = 75000,
+   .temp   = MLXSW_ENV_TEMP_NORM,
+   .hyst   = MLXSW_THERMAL_HYSTERESIS_TEMP,
.min_state  = 0,
.max_state  = (4 * MLXSW_THERMAL_MAX_STATE) / 10,
},
-   {   /* High - 40-100% PWM */
-   .type   = THERMAL_TRIP_ACTIVE,
-   .temp   = 8,
-   .min_state  = (4 * MLXSW_THERMAL_MAX_STATE) / 10,
-   .max_state  = MLXSW_THERMAL_MAX_STATE,
-   },
{
-   /* Very high - 100% PWM */
+   /* In range - 40-100% PWM */
.type   = THERMAL_TRIP_ACTIVE,
-   .temp   = 85000,
-   .min_state  = MLXSW_THERMAL_MAX_STATE,
+   .temp   = MLXSW_ENV_TEMP_HIGH,
+   .hyst   = MLXSW_THERMAL_HYSTERESIS_TEMP,
+   .min_state  = (4 * MLXSW_THERMAL_MAX_STATE) / 10,
.max_state  = MLXSW_THERMAL_MAX_STATE,
},
{   /* Warning */
.type   = THERMAL_TRIP_HOT,
-   .temp   = 105000,
+   .temp   = MLXSW_ENV_TEMP_HOT,
+   .hyst   = MLXSW_THERMAL_HYSTERESIS_TEMP,
.min_state  = MLXSW_THERMAL_MAX_STATE,
.max_state  = MLXSW_THERMAL_MAX_STATE,
},
{   /* Critical - soft poweroff */
.type   = THERMAL_TRIP_CRITICAL,
-   .temp   = MLXSW_THERMAL_MAX_TEMP,
+   .temp   = MLXSW_ENV_TEMP_CRIT,
.min_state  = MLXSW_THERMAL_MAX_STATE,
.max_state  = MLXSW_THERMAL_MAX_STATE,
}
@@ -257,22 +256,42 @@ static int mlxsw_thermal_set_trip_temp(struct 
thermal_zone_device *tzdev,
struct mlxsw_thermal *thermal = tzdev->devdata;
 
if (trip < 0 || trip >= MLXSW_THERMAL_NUM_TRIPS ||
-   temp > MLXSW_THERMAL_MAX_TEMP)
+   temp > MLXSW_ENV_TEMP_CRIT)
return -EINVAL;
 
thermal->trips[trip].temp = temp;
return 0;
 }
 
+static int mlxsw_thermal_get_trip_hyst(struct thermal_zone_device *tzdev,
+  int trip, int *p_hyst)
+{
+   struct mlxsw_thermal *thermal = tzdev->devdata;
+
+   *p_hyst = thermal->trips[trip].hyst;
+   return 0;
+}
+
+static int mlxsw_thermal_set_trip_hyst(struct thermal_zone_device *tzdev,
+  int trip, int hyst)
+{
+   struct mlxsw_thermal *thermal = tzdev->devdata;
+
+   thermal->trips[trip].hyst = hyst;
+   return 0;
+}
+
 static struct thermal_zone_device_ops mlxsw_thermal_ops = {
-   .bind = mlxsw_thermal_bind,
-   .unbind = mlxsw_thermal_unbind,
-   .get_mode = mlxsw_thermal_get_mode,
-   .set_mode = mlxsw_thermal_set_mode,
-   .get_temp = mlxsw_thermal_get_temp,
+   .bind   = mlxsw_thermal_bind,
+   .unbind = mlxsw_thermal_unbind,
+   .get_mode   = mlxsw_thermal_get_mode,
+   .set_mode   = mlxsw_thermal_set_mode,
+   .get_temp   = mlxsw_thermal_get_temp,
.get_trip_type  = mlxsw_thermal_get_trip_type,
.get_trip_temp  = mlxsw_thermal_get_trip_tem

[patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute

2018-06-26 Thread Vadim Pasternak

Add a new FAN hwmon attribute for exposing FAN faults (a fault is set
when the FAN tachometer reading is below the allowed minimum).

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c | 62 +++-
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
index 84185f8..dfd7adc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
@@ -44,6 +44,7 @@
 #define MLXSW_HWMON_TEMP_SENSOR_MAX_COUNT 127
 #define MLXSW_HWMON_ATTR_COUNT (MLXSW_HWMON_TEMP_SENSOR_MAX_COUNT * 4 + \
MLXSW_MFCR_TACHOS_MAX + MLXSW_MFCR_PWMS_MAX)
+#define MLXSW_HWMON_SPEED_MAX 5/* RPM */
 
 struct mlxsw_hwmon_attr {
struct device_attribute dev_attr;
@@ -61,6 +62,7 @@ struct mlxsw_hwmon {
struct attribute *attrs[MLXSW_HWMON_ATTR_COUNT + 1];
struct mlxsw_hwmon_attr hwmon_attrs[MLXSW_HWMON_ATTR_COUNT];
unsigned int attrs_count;
+   u16 tach_min;
 };
 
 static ssize_t mlxsw_hwmon_temp_show(struct device *dev,
@@ -152,6 +154,28 @@ static ssize_t mlxsw_hwmon_fan_rpm_show(struct device *dev,
return sprintf(buf, "%u\n", mlxsw_reg_mfsm_rpm_get(mfsm_pl));
 }
 
+static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+   struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
+   container_of(attr, struct mlxsw_hwmon_attr, dev_attr);
+   struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
+   char mfsm_pl[MLXSW_REG_MFSM_LEN];
+   u16 tach;
+   int err;
+
+   mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
+   err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm), mfsm_pl);
+   if (err) {
+   dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query fan\n");
+   return err;
+   }
+   tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
+
+   return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 : 0);
+}
+
 static ssize_t mlxsw_hwmon_pwm_show(struct device *dev,
struct device_attribute *attr,
char *buf)
@@ -203,6 +227,7 @@ enum mlxsw_hwmon_attr_type {
MLXSW_HWMON_ATTR_TYPE_TEMP_MAX,
MLXSW_HWMON_ATTR_TYPE_TEMP_RST,
MLXSW_HWMON_ATTR_TYPE_FAN_RPM,
+   MLXSW_HWMON_ATTR_TYPE_FAN_FAULT,
MLXSW_HWMON_ATTR_TYPE_PWM,
 };
 
@@ -240,6 +265,12 @@ static void mlxsw_hwmon_attr_add(struct mlxsw_hwmon 
*mlxsw_hwmon,
snprintf(mlxsw_hwmon_attr->name, sizeof(mlxsw_hwmon_attr->name),
 "fan%u_input", num + 1);
break;
+   case MLXSW_HWMON_ATTR_TYPE_FAN_FAULT:
+   mlxsw_hwmon_attr->dev_attr.show = mlxsw_hwmon_fan_fault_show;
+   mlxsw_hwmon_attr->dev_attr.attr.mode = 0444;
+   snprintf(mlxsw_hwmon_attr->name, sizeof(mlxsw_hwmon_attr->name),
+"fan%u_fault", num + 1);
+   break;
case MLXSW_HWMON_ATTR_TYPE_PWM:
mlxsw_hwmon_attr->dev_attr.show = mlxsw_hwmon_pwm_show;
mlxsw_hwmon_attr->dev_attr.store = mlxsw_hwmon_pwm_store;
@@ -297,9 +328,9 @@ static int mlxsw_hwmon_fans_init(struct mlxsw_hwmon 
*mlxsw_hwmon)
 {
char mfcr_pl[MLXSW_REG_MFCR_LEN] = {0};
enum mlxsw_reg_mfcr_pwm_frequency freq;
+   u16 tacho_active, tach_min;
unsigned int type_index;
unsigned int num;
-   u16 tacho_active;
u8 pwm_active;
int err;
 
@@ -310,11 +341,38 @@ static int mlxsw_hwmon_fans_init(struct mlxsw_hwmon 
*mlxsw_hwmon)
}
mlxsw_reg_mfcr_unpack(mfcr_pl, &freq, &tacho_active, &pwm_active);
num = 0;
+   /* Set tachometer to maximum value as the initial seed. */
+   mlxsw_hwmon->tach_min = MLXSW_HWMON_SPEED_MAX;
for (type_index = 0; type_index < MLXSW_MFCR_TACHOS_MAX; type_index++) {
-   if (tacho_active & BIT(type_index))
+   if (tacho_active & BIT(type_index)) {
+   char mfsl_pl[MLXSW_REG_MFSL_LEN] = {0};
+
mlxsw_hwmon_attr_add(mlxsw_hwmon,
 MLXSW_HWMON_ATTR_TYPE_FAN_RPM,
+type_index, num);
+   mlxsw_hwmon_attr_add(mlxsw_hwmon,
+MLXSW_HWMON_ATTR_TYPE_FAN_FAULT,
 type_index, num++);
+   /* Get tachometer minimum value. */
+   mlxsw_reg_mfsl_pack(mfsl_pl, type_index, 0, 0);
+   err = mlxsw_reg_query(mlxsw_hwmon->core,
+ MLXSW_REG(mfsl), mfsl_pl);
+

[patch net-next RFC 00/12] mlxsw thermal monitoring amendments

2018-06-26 Thread Vadim Pasternak

This patchset extends the mlxsw hwmon and thermal modules with port
temperature reading and adds new hwmon attributes for FAN and
temperature.

Port temperatures are the most critical component in system thermal
control and should be considered by the thermal algorithm.

New hwmon attributes, such as FAN faults and port temperature faults,
will improve system monitoring abilities.

Vadim Pasternak (12):
  mlxsw: spectrum: Move QSFP EEPROM defenitons to common location
  mlxsw: reg: Add MTBR register
  mlxsw: core: Add core environment module for port temperature reading
  mlxsw: core: Add bus frequency capability flag for the bus type
  mlxsw: core: Set different thermal polling time based on bus type
  mlxsw: core: Modify thermal zone definition
  mlxsw: core: Extend thermal zone operations with get_trend method
  mlxsw: core: Extend cooling device with cooling levels
  mlxsw: core: Rename cooling device
  mlxsw: core: Add ports temperature measurement to thermal algorithm
  mlxsw: core: Extend hwmon interface with FAN fault attribute
  mlxsw: core: Extend hwmon interface with port temperature attributes

 drivers/net/ethernet/mellanox/mlxsw/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/core.h |   1 +
 drivers/net/ethernet/mellanox/mlxsw/core_env.c | 316 +
 drivers/net/ethernet/mellanox/mlxsw/core_env.h |  63 
 drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c   | 164 ++-
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 231 +--
 drivers/net/ethernet/mellanox/mlxsw/i2c.c  |   1 +
 drivers/net/ethernet/mellanox/mlxsw/reg.h  | 101 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  62 ++--
 9 files changed, 865 insertions(+), 76 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_env.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_env.h

-- 
2.1.4

[patch net-next RFC 02/12] mlxsw: reg: Add MTBR register

2018-06-26 Thread Vadim Pasternak

Add MTBR (Management Temperature Bulk Register), which is used for port
temperature reading in a bulk mode.
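
For readers unfamiliar with the record format, the sketch below shows one way
a raw 16-bit MTBR temperature field could be interpreted, based on the
0.125 degree Celsius units and the status codes listed in the register
comments of this patch. It is an assumption-laden illustration: the
mtbr_raw_to_mc() helper is hypothetical, the macro names are shortened local
copies of the enum values, and negative temperatures are deliberately ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the values in enum mlxsw_reg_mtbr_temp_status. */
#define MTBR_NO_CONN		0x8000
#define MTBR_NO_TEMP_SENS	0x8001
#define MTBR_INDEX_NA		0x8002
#define MTBR_BAD_SENS_INFO	0x8003

/* Hypothetical helper: convert a raw record to millidegrees Celsius.
 * Returns false if the record carries a status code instead of a
 * temperature. One raw unit is 0.125 C, i.e. 125 millidegrees.
 */
static bool mtbr_raw_to_mc(uint16_t raw, int *p_temp_mc)
{
	switch (raw) {
	case MTBR_NO_CONN:
	case MTBR_NO_TEMP_SENS:
	case MTBR_INDEX_NA:
	case MTBR_BAD_SENS_INFO:
		return false;
	default:
		*p_temp_mc = (int)raw * 125;
		return true;
	}
}
```

For instance, a raw value of 320 corresponds to 40 degrees Celsius, that is
40000 millidegrees, which is the unit the thermal zone callbacks elsewhere in
this series work with.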

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 69 +++
 1 file changed, 69 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 6a41c48..cfe6bde 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -6703,6 +6703,74 @@ static inline void mlxsw_reg_mtmp_unpack(char *payload, 
unsigned int *p_temp,
mlxsw_reg_mtmp_sensor_name_memcpy_from(payload, sensor_name);
 }
 
+/* MTBR - Management Temperature Bulk Register
+ * ---
+ * This register is used for bulk temperature reading.
+ */
+#define MLXSW_REG_MTBR_ID  0x900F
+#define MLXSW_REG_MTBR_LEN 0xCC
+#define MLXSW_REG_MTBR_REC_MAX_COUNT   47
+
+MLXSW_REG_DEFINE(mtbr, MLXSW_REG_MTBR_ID, MLXSW_REG_MTBR_LEN);
+
+/* reg_mtbr_base_sensor_index
+ * Base sensors index to access (0 - ASIC sensor, 1-63 - ambient sensors,
+ * 64-127 are mapped to the SFP+/QSFP modules sequentially).
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, mtbr, base_sensor_index, 0x00, 0, 7);
+
+/* reg_mtbr_num_rec
+ * Request: Number of records to read
+ * Response: Number of records read
+ * See above description for more details.
+ * Ranges 0..64
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mtbr, num_rec, 0x04, 0, 8);
+
+/* reg_mtbr_temp
+ * Temperature reading from the sensor. Reading is in 0.125 Celsius
+ * degrees units.
+ * Access: RO
+ */
+MLXSW_ITEM32_INDEXED(reg, mtbr, temp, 0x10, 0, 16, 0x04, 0x00, false);
+
+/* reg_mtbr_max_temp
+ * The highest measured temperature from the sensor.
+ * When the bit mte is cleared, the field max_temperature is reserved.
+ * Access: RO
+ */
+MLXSW_ITEM32_INDEXED(reg, mtbr, max_temp, 0x10, 16, 16, 0x04, 0x00, false);
+
+static inline void mlxsw_reg_mtbr_pack(char *payload, u8 base_sensor_index,
+  u8 num_rec)
+{
+   MLXSW_REG_ZERO(mtbr, payload);
+   mlxsw_reg_mtbr_base_sensor_index_set(payload, base_sensor_index);
+   mlxsw_reg_mtbr_num_rec_set(payload, num_rec);
+}
+
+/* Error codes from temperature reading */
+enum mlxsw_reg_mtbr_temp_status {
+   MLXSW_REG_MTBR_NO_CONN  = 0x8000,
+   MLXSW_REG_MTBR_NO_TEMP_SENS = 0x8001,
+   MLXSW_REG_MTBR_INDEX_NA = 0x8002,
+   MLXSW_REG_MTBR_BAD_SENS_INFO= 0x8003,
+};
+
+/* Base index for reading ports temperature */
+#define MLXSW_REG_MTBR_BASE_PORT_INDEX 64
+
+static inline void mlxsw_reg_mtbr_temp_unpack(char *payload, int rec_index,
+ u16 *p_temp, u16 *p_max_temp)
+{
+   if (p_temp)
+   *p_temp = mlxsw_reg_mtbr_temp_get(payload, rec_index);
+   if (p_max_temp)
+   *p_max_temp = mlxsw_reg_mtbr_max_temp_get(payload, rec_index);
+}
+
 /* MCIA - Management Cable Info Access
  * ---
  * MCIA register is used to access the SFP+ and QSFP connector's EPROM.
@@ -7945,6 +8013,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(mfsc),
MLXSW_REG(mfsm),
MLXSW_REG(mfsl),
+   MLXSW_REG(mtbr),
MLXSW_REG(mtcap),
MLXSW_REG(mtmp),
MLXSW_REG(mcia),
-- 
2.1.4

[patch net-next RFC 10/12] mlxsw: core: Add ports temperature measurement to thermal algorithm

2018-06-26 Thread Vadim Pasternak

Port temperature has the most significant impact on the system thermal
state and should be considered by the thermal algorithm. The thermal
zone temperature is extended to read port temperatures along with the
chip temperature. The temperature value provided to the core thermal
algorithm will be the accumulated value of the chip and port
temperature sensing, normalized according to the basic constant
thresholds.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 66 --
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 65962ed..23d6197 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -109,6 +109,8 @@ struct mlxsw_thermal {
u8 cooling_levels[MLXSW_THERMAL_MAX_STATE + 1];
struct mlxsw_thermal_trip trips[MLXSW_THERMAL_NUM_TRIPS];
enum thermal_device_mode mode;
+   int count;
+   int *ports_temp_cache;
 };
 
 static inline u8 mlxsw_state_to_duty(int state)
@@ -213,10 +215,11 @@ static int mlxsw_thermal_set_mode(struct 
thermal_zone_device *tzdev,
return 0;
 }
 
-static int mlxsw_thermal_get_temp(struct thermal_zone_device *tzdev,
- int *p_temp)
+static int mlxsw_thermal_init_temp(struct mlxsw_thermal *thermal,
+  struct mlxsw_env_temp_thresh *delta,
+  struct mlxsw_env_temp_multi *multi,
+  int *p_temp, bool *p_crit)
 {
-   struct mlxsw_thermal *thermal = tzdev->devdata;
struct device *dev = thermal->bus_info->dev;
char mtmp_pl[MLXSW_REG_MTMP_LEN];
unsigned int temp;
@@ -231,10 +234,58 @@ static int mlxsw_thermal_get_temp(struct 
thermal_zone_device *tzdev,
}
mlxsw_reg_mtmp_unpack(mtmp_pl, &temp, NULL, NULL);
 
-   *p_temp = (int) temp;
+   if (temp >= MLXSW_ENV_TEMP_CRIT) {
+   *p_crit = true;
+   } else if (temp < MLXSW_ENV_TEMP_NORM) {
+   multi->thresh.normal = temp;
+   delta->normal = MLXSW_ENV_TEMP_NORM - temp;
+   } else if (temp >= MLXSW_ENV_TEMP_HOT) {
+   multi->thresh.crit = temp;
+   delta->crit = temp - MLXSW_ENV_TEMP_HOT;
+   multi->mask |= MLXSW_ENV_CRIT_MASK;
+   } else {
+   multi->thresh.hot = temp;
+   delta->hot = temp - MLXSW_ENV_TEMP_NORM;
+   multi->mask |= MLXSW_ENV_HOT_MASK;
+   }
+   *p_temp = temp;
+
return 0;
 }
 
+static int mlxsw_thermal_get_temp(struct thermal_zone_device *tzdev,
+ int *p_temp)
+{
+   struct mlxsw_thermal *thermal = tzdev->devdata;
+   struct device *dev = thermal->bus_info->dev;
+   struct mlxsw_env_temp_multi multi;
+   struct mlxsw_env_temp_thresh delta;
+   bool crit = false;
+   int err;
+
+   memset(&multi, 0, sizeof(struct mlxsw_env_temp_multi));
+   memset(&delta, 0, sizeof(struct mlxsw_env_temp_thresh));
+   /* Read ASIC temperature */
+   err = mlxsw_thermal_init_temp(thermal, &delta, &multi,
+ p_temp, &crit);
+   if (err) {
+   dev_err(dev, "Failed to query ASIC temp sensor\n");
+   return err;
+   }
+
+   /* No need to proceed with ports temperature reading, since a critical
+* ASIC temperature should result in system shutdown.
+*/
+   if (crit)
+   return 0;
+
+   /* Collect ports temperature */
+   return mlxsw_env_collect_port_temp(thermal->core,
+  thermal->ports_temp_cache,
+  thermal->count, &multi, &delta,
+  NULL, p_temp);
+}
+
 static int mlxsw_thermal_get_trip_type(struct thermal_zone_device *tzdev,
   int trip,
   enum thermal_trip_type *p_type)
@@ -436,6 +487,7 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
   const struct mlxsw_bus_info *bus_info,
   struct mlxsw_thermal **p_thermal)
 {
+   unsigned int max_ports = mlxsw_core_max_ports(core);
char mfcr_pl[MLXSW_REG_MFCR_LEN] = { 0 };
enum mlxsw_reg_mfcr_pwm_frequency freq;
struct device *dev = bus_info->dev;
@@ -452,6 +504,12 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
thermal->core = core;
thermal->bus_info = bus_info;
memcpy(thermal->trips, default_thermal_trips, sizeof(thermal->trips));
+   thermal->ports_temp_cache = devm_kmalloc_array(dev, max_ports,
+  sizeof(int),
+  GFP_KERNEL);
+   if (!thermal

[patch net-next RFC 12/12] mlxsw: core: Extend hwmon interface with port temperature attributes

2018-06-26 Thread Vadim Pasternak

Add new attributes to the hwmon object to expose the accumulative port
temperature input and the accumulative port temperature fault (the fault
is set if any one of the sensors is untrusted).
All port temperature and fault information is read from the hardware
through MTBR (Management Temperature Bulk Register).
If at least one port fault is detected, the user can take it into
account in the thermal algorithm. For example, in such a case the fan
speed could be increased.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c | 102 +++
 1 file changed, 102 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
index dfd7adc..ac28e6c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
@@ -40,6 +40,7 @@
 #include 
 
 #include "core.h"
+#include "core_env.h"
 
 #define MLXSW_HWMON_TEMP_SENSOR_MAX_COUNT 127
 #define MLXSW_HWMON_ATTR_COUNT (MLXSW_HWMON_TEMP_SENSOR_MAX_COUNT * 4 + \
@@ -63,6 +64,9 @@ struct mlxsw_hwmon {
struct mlxsw_hwmon_attr hwmon_attrs[MLXSW_HWMON_ATTR_COUNT];
unsigned int attrs_count;
u16 tach_min;
+   int *ports_temp_cache;
+   int count;
+   bool untrusted_sensor;
 };
 
 static ssize_t mlxsw_hwmon_temp_show(struct device *dev,
@@ -222,6 +226,47 @@ static ssize_t mlxsw_hwmon_pwm_store(struct device *dev,
return len;
 }
 
+static ssize_t mlxsw_hwmon_port_temp_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+   struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
+   container_of(attr, struct mlxsw_hwmon_attr, dev_attr);
+   struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
+   struct mlxsw_env_temp_multi multi;
+   struct mlxsw_env_temp_thresh delta;
+   int temp;
+   int err;
+
+   memset(&multi, 0, sizeof(struct mlxsw_env_temp_multi));
+   memset(&delta, 0, sizeof(struct mlxsw_env_temp_thresh));
+   /* Set initial value for normal temperature to unreachable value. */
+   delta.normal = MLXSW_ENV_TEMP_UNREACHABLE;
+   /* Collect ports temperature */
+   err = mlxsw_env_collect_port_temp(mlxsw_hwmon->core,
+ mlxsw_hwmon->ports_temp_cache,
+ mlxsw_hwmon->count, &multi, &delta,
+ &mlxsw_hwmon->untrusted_sensor,
+ &temp);
+   if (err) {
+   dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query port temp\n");
+   return err;
+   }
+
+   return sprintf(buf, "%u\n", temp);
+}
+
+static ssize_t mlxsw_hwmon_port_temp_fault_show(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
+   container_of(attr, struct mlxsw_hwmon_attr, dev_attr);
+   struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
+
+   return sprintf(buf, "%u\n", mlxsw_hwmon->untrusted_sensor ? 1 : 0);
+}
+
 enum mlxsw_hwmon_attr_type {
MLXSW_HWMON_ATTR_TYPE_TEMP,
MLXSW_HWMON_ATTR_TYPE_TEMP_MAX,
@@ -229,6 +274,8 @@ enum mlxsw_hwmon_attr_type {
MLXSW_HWMON_ATTR_TYPE_FAN_RPM,
MLXSW_HWMON_ATTR_TYPE_FAN_FAULT,
MLXSW_HWMON_ATTR_TYPE_PWM,
+   MLXSW_HWMON_ATTR_TYPE_TEMP_PORT,
+   MLXSW_HWMON_ATTR_TYPE_TEMP_PORT_FAULT,
 };
 
 static void mlxsw_hwmon_attr_add(struct mlxsw_hwmon *mlxsw_hwmon,
@@ -278,6 +325,19 @@ static void mlxsw_hwmon_attr_add(struct mlxsw_hwmon *mlxsw_hwmon,
snprintf(mlxsw_hwmon_attr->name, sizeof(mlxsw_hwmon_attr->name),
 "pwm%u", num + 1);
break;
+   case MLXSW_HWMON_ATTR_TYPE_TEMP_PORT:
+   mlxsw_hwmon_attr->dev_attr.show = mlxsw_hwmon_port_temp_show;
+   mlxsw_hwmon_attr->dev_attr.attr.mode = 0444;
+   snprintf(mlxsw_hwmon_attr->name, sizeof(mlxsw_hwmon_attr->name),
+"temp%u_input", num + 1);
+   break;
+   case MLXSW_HWMON_ATTR_TYPE_TEMP_PORT_FAULT:
+   mlxsw_hwmon_attr->dev_attr.show =
+   mlxsw_hwmon_port_temp_fault_show;
+   mlxsw_hwmon_attr->dev_attr.attr.mode = 0444;
+   snprintf(mlxsw_hwmon_attr->name, sizeof(mlxsw_hwmon_attr->name),
+"temp%u_fault", num + 1);
+   break;
default:
WARN_ON(1);
}
@@ -384,6 +444,43 @@ static int mlxsw_hwmon_fans_init(struct mlxsw_hwmon *mlxsw_hwmon)
return 0;
 }
 
+static int mlxsw_hwmon_port_init(struct mlxsw_hwmon *mlxsw_hwmon)
+{
+   unsigned int max_ports = mlxsw_core_max_ports

[patch net-next RFC 01/12] mlxsw: spectrum: Move QSFP EEPROM definitions to common location

2018-06-26 Thread Vadim Pasternak

Move the QSFP EEPROM definitions from the spectrum driver to a common
location in order to make them available to other mlxsw modules. They
are common to all kinds of chips and relate to the SFF specifications
8024, 8436, 8472 and 8636 rather than to the chip type.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
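A minimal sketch, not part of the patch, of how the relocated constants
bound a single EEPROM read, following the logic visible in the spectrum.c
hunk of this patch. The struct eeprom_read_plan and the function
eeprom_plan_read() are hypothetical names used only for illustration; the
constant values are the ones given in the reg.h hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Values as given in the reg.h hunk of this patch. */
#define MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH 256
#define MLXSW_REG_MCIA_EEPROM_SIZE        48
#define MLXSW_REG_MCIA_I2C_ADDR_LOW       0x50
#define MLXSW_REG_MCIA_I2C_ADDR_HIGH      0x51

/* Hypothetical container for one bounded read; not a driver type. */
struct eeprom_read_plan {
	uint16_t i2c_addr;
	uint16_t offset;
	uint16_t size;
};

/* Work out which I2C address, offset and size one register access uses. */
static struct eeprom_read_plan eeprom_plan_read(uint16_t offset, uint16_t size)
{
	struct eeprom_read_plan plan;

	assert(size > 0);

	/* One access carries at most MLXSW_REG_MCIA_EEPROM_SIZE bytes. */
	if (size > MLXSW_REG_MCIA_EEPROM_SIZE)
		size = MLXSW_REG_MCIA_EEPROM_SIZE;

	/* A read that would cross the page boundary stops at offset 256. */
	if (offset < MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH &&
	    offset + size > MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH)
		size = MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH - offset;

	/* Offsets at or above 256 are served from the high I2C address. */
	plan.i2c_addr = MLXSW_REG_MCIA_I2C_ADDR_LOW;
	if (offset >= MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH) {
		plan.i2c_addr = MLXSW_REG_MCIA_I2C_ADDR_HIGH;
		offset -= MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH;
	}

	plan.offset = offset;
	plan.size = size;
	return plan;
}
```

For example, a request for 48 bytes at offset 240 is shortened to 16 bytes
at the low address, and the caller is expected to issue a second read for
the remainder.
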
 drivers/net/ethernet/mellanox/mlxsw/reg.h  | 32 -
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 62 +-
 2 files changed, 52 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 1877d9f..6a41c48 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -6757,13 +6757,41 @@ MLXSW_ITEM32(reg, mcia, device_address, 0x04, 0, 16);
  */
 MLXSW_ITEM32(reg, mcia, size, 0x08, 0, 16);
 
-#define MLXSW_SP_REG_MCIA_EEPROM_SIZE 48
+#define MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH  256
+#define MLXSW_REG_MCIA_EEPROM_SIZE         48
+#define MLXSW_REG_MCIA_I2C_ADDR_LOW        0x50
+#define MLXSW_REG_MCIA_I2C_ADDR_HIGH       0x51
+#define MLXSW_REG_MCIA_PAGE0_LO_OFF        0xa0
+#define MLXSW_REG_MCIA_TH_SIZE             8
+#define MLXSW_REG_MCIA_TH_PAGE_NUM         3
+#define MLXSW_REG_MCIA_PAGE0_LO            0
+#define MLXSW_REG_MCIA_TH_PAGE_OFF         0x80
+
+enum mlxsw_reg_mcia_eeprom_module_info_rev_id {
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_REV_ID_UNSPC  = 0x00,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_REV_ID_8436   = 0x01,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_REV_ID_8636   = 0x03,
+};
+
+enum mlxsw_reg_mcia_eeprom_module_info_id {
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID_SFP= 0x03,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID_QSFP   = 0x0C,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID_QSFP_PLUS  = 0x0D,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID_QSFP28 = 0x11,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID_QSFP_DD= 0x18,
+};
+
+enum mlxsw_reg_mcia_eeprom_module_info {
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_REV_ID,
+   MLXSW_REG_MCIA_EEPROM_MODULE_INFO_SIZE,
+};
 
 /* reg_mcia_eeprom
  * Bytes to read/write.
  * Access: RW
  */
-MLXSW_ITEM_BUF(reg, mcia, eeprom, 0x10, MLXSW_SP_REG_MCIA_EEPROM_SIZE);
+MLXSW_ITEM_BUF(reg, mcia, eeprom, 0x10, MLXSW_REG_MCIA_EEPROM_SIZE);
 
 static inline void mlxsw_reg_mcia_pack(char *payload, u8 module, u8 lock,
   u8 page_number, u16 device_addr,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 968b88a..1b0d1bc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -2481,23 +2481,23 @@ static int mlxsw_sp_query_module_eeprom(struct mlxsw_sp_port *mlxsw_sp_port,
unsigned int *p_read_size)
 {
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
-   char eeprom_tmp[MLXSW_SP_REG_MCIA_EEPROM_SIZE];
+   char eeprom_tmp[MLXSW_REG_MCIA_EEPROM_SIZE];
char mcia_pl[MLXSW_REG_MCIA_LEN];
u16 i2c_addr;
int status;
int err;
 
-   size = min_t(u16, size, MLXSW_SP_REG_MCIA_EEPROM_SIZE);
+   size = min_t(u16, size, MLXSW_REG_MCIA_EEPROM_SIZE);
 
-   if (offset < MLXSW_SP_EEPROM_PAGE_LENGTH &&
-   offset + size > MLXSW_SP_EEPROM_PAGE_LENGTH)
+   if (offset < MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH &&
+   offset + size > MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH)
/* Cross pages read, read until offset 256 in low page */
-   size = MLXSW_SP_EEPROM_PAGE_LENGTH - offset;
+   size = MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH - offset;
 
-   i2c_addr = MLXSW_SP_I2C_ADDR_LOW;
-   if (offset >= MLXSW_SP_EEPROM_PAGE_LENGTH) {
-   i2c_addr = MLXSW_SP_I2C_ADDR_HIGH;
-   offset -= MLXSW_SP_EEPROM_PAGE_LENGTH;
+   i2c_addr = MLXSW_REG_MCIA_I2C_ADDR_LOW;
+   if (offset >= MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH) {
+   i2c_addr = MLXSW_REG_MCIA_I2C_ADDR_HIGH;
+   offset -= MLXSW_REG_MCIA_EEPROM_PAGE_LENGTH;
}
 
mlxsw_reg_mcia_pack(mcia_pl, mlxsw_sp_port->mapping.module,
@@ -2518,55 +2518,37 @@ static int mlxsw_sp_query_module_eeprom(struct mlxsw_sp_port *mlxsw_sp_port,
return 0;
 }
 
-enum mlxsw_sp_eeprom_module_info_rev_id {
-   MLXSW_SP_EEPROM_MODULE_INFO_REV_ID_UNSPC  = 0x00,
-   MLXSW_SP_EEPROM_MODULE_INFO_REV_ID_8436   = 0x01,
-   MLXSW_SP_EEPROM_MODULE_INFO_REV_ID_8636   = 0x03,
-};
-
-enum mlxsw_sp_eeprom_module_info_id {
-   MLXSW_SP_EEPROM_MODULE_INFO_ID_SFP  = 0x03,
-   MLXSW_SP_EEPROM_MODULE_INFO_ID_QSFP = 0x0C,
-   MLXSW_SP_EEPROM_MODULE_INFO_ID_QSFP_PLUS= 0x0D,
-   MLXSW_SP_EEPROM_MODULE_INFO_ID_QSFP28   = 0x11,
-};
-
-enum mlxsw_sp_eeprom_m

[patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading

2018-06-26 Thread Vadim Pasternak

Add a new core_env module to allow port temperature reading. This
information has the most critical impact on the system's thermal
monitoring and is to be used by the core_hwmon and core_thermal modules.

The new internal API reads the temperature from all the modules that are
equipped with a thermal sensor and exposes the temperature according to
the worst measurement. All individual temperature values are normalized
to a pre-defined range.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
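A minimal sketch, not part of the patch, of the "worst measurement"
aggregation described in the commit message. It assumes that "normalized
to a pre-defined range" can be modeled as clamping, which is only one
possible reading; the bounds SKETCH_TEMP_MIN/SKETCH_TEMP_MAX and the
function names are hypothetical and do not come from core_env.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounds for the normalized range; the patch defines its own. */
#define SKETCH_TEMP_MIN 0
#define SKETCH_TEMP_MAX 100

/* Clamp one reading into [SKETCH_TEMP_MIN, SKETCH_TEMP_MAX]. */
static int sketch_normalize(int temp)
{
	if (temp < SKETCH_TEMP_MIN)
		return SKETCH_TEMP_MIN;
	if (temp > SKETCH_TEMP_MAX)
		return SKETCH_TEMP_MAX;
	return temp;
}

/* Return the highest normalized reading among ports that have a sensor.
 * has_sensor[i] == 0 means port i has no thermal sensor and is skipped.
 * Returns SKETCH_TEMP_MIN when no port has a sensor.
 */
static int sketch_worst_port_temp(const int *temps, const int *has_sensor,
				  size_t count)
{
	int worst = SKETCH_TEMP_MIN;
	size_t i;

	assert(count == 0 || (temps && has_sensor));

	for (i = 0; i < count; i++) {
		int t;

		if (!has_sensor[i])
			continue;
		t = sketch_normalize(temps[i]);
		if (t > worst)
			worst = t;
	}
	return worst;
}
```

With readings {30, 150, 70} and the middle port lacking a sensor, the
sketch reports 70; if all three ports have sensors, the out-of-range
reading is clamped and the sketch reports 100.
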
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/core_env.c | 316 +
 drivers/net/ethernet/mellanox/mlxsw/core_env.h |  63 +
 3 files changed, 380 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_env.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_env.h

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index 0cadcab..9f1dc0b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MLXSW_CORE)   += mlxsw_core.o
 mlxsw_core-objs:= core.o core_acl_flex_keys.o \
-  core_acl_flex_actions.o
+  core_acl_flex_actions.o core_env.o
 mlxsw_core-$(CONFIG_MLXSW_CORE_HWMON) += core_hwmon.o
 mlxsw_core-$(CONFIG_MLXSW_CORE_THERMAL) += core_thermal.o
 obj-$(CONFIG_MLXSW_PCI)+= mlxsw_pci.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_env.c b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
new file mode 100644
index 000..fb6394d
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
@@ -0,0 +1,316 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/core_env.c
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include 
+#include 
+#include 
+
+#include "core.h"
+#include "core_env.h"
+#include "item.h"
+
+union mlxsw_env_port_thresh {
+   u8 buf[MLXSW_REG_MCIA_TH_SIZE];
+   struct mlxsw_env_port_temp_th {
+   u16 temp_alarm_hi;
+   u16 temp_alarm_lo;
+   u16 temp_warn_hi;
+   u16 temp_warn_low;
+   } t;
+};
+
+static int mlxsw_env_bulk_get(struct mlxsw_core *core,
+ int *ports_temp_cache, int port_count,
+ bool *untrusted_sensor)
+{
+   char mtbr_pl[MLXSW_REG_MTBR_LEN];
+   int i, j, count, off;
+   u16 temp;
+   int err;
+
+   /* Read ports temperature. */
+   if (untrusted_sensor)
+   *untrusted_sensor = false;
+   count = 0;
+   while (count < port_count) {
+   off = min_t(u8, MLXSW_REG_MTBR_REC_MAX_COUNT,
+   port_count - count);
+   mlxsw_reg_mtbr_pack(mtbr_pl, MLXSW_REG_MTBR_BASE_PORT_INDEX +
+   count, off);
+   err = mlxsw_reg_query(core, MLXSW_REG(mtbr), mtbr_pl);
+   if (err)
+   return err;
+
+   for (i = 0, j = count; i < off; i++, j++) {
+   mlxsw_reg_mtbr_temp_unpack(mtbr_pl, i, &temp, NULL);
+
+

Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-26 Thread Sowmini Varadhan

On (06/26/18 13:30), Ka-Cheong Poon wrote:
> 
> My answer to this is that if a socket is not bound to a link
> local address (meaning it is bound to a non-link local address)
> and it is used to send to a link local peer, I think it should
> fail.

Hmm, I'm not sure I agree. I don't think this is forbidden
by RFC 6724 - yes, such a packet cannot be forwarded, but
if everything is on the same link, and the dest only has
a link-local, you should not need to (create and) bind
another socket to a link-local to talk to this destination.

>  This is consistent with the scope_id check I mentioned in
> the previous mail.  If the socket is not bound to a link local
> address, the bound_scope_id is 0.  So if the socket is used to
> send to a link local address (which has a non-zero scope_id), the
> check will catch it and fail the call.  A new conn should not
> be created in this case.

Re: [patch net-next v2 7/9] mlxsw: spectrum: Implement chain template hinting

2018-06-26 Thread Ido Schimmel

On Tue, Jun 26, 2018 at 09:59:58AM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Since cld_flower provides information about the filter template for

s/cld_flower/cls_flower/

> specific chain, use this information in order to prepare a region.
> Use the template to find out what elements are going to be used
> and pass that down to mlxsw_sp_acl_tcam_group_add(). Later on, when the
> first filter is inserted, the mlxsw_sp_acl_tcam_group_use_patterns()
> function would use this element usage information instead of looking
> up a pattern.
> 
> Signed-off-by: Jiri Pirko 

Reviewed-by: Ido Schimmel

[patch net-next RFC 05/12] mlxsw: core: Set different thermal polling time based on bus type

2018-06-26 Thread Vadim Pasternak

Use a different thermal polling interval depending on the bus type.
For the I2C bus the interval is set to 20 seconds, while for PCIe a
1 second polling interval is used.

Signed-off-by: Vadim Pasternak 
Acked-by: Jiri Pirko 
---
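A minimal sketch, not part of the patch, of the interval selection
described in the commit message: 20 seconds for a low-frequency bus such
as I2C and 1 second otherwise. The struct sketch_bus_info, the macro names
and sketch_polling_delay_ms() are hypothetical stand-ins for the driver's
own types.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_POLL_INT_MS      1000   /* 1 second, PCIe per the commit message */
#define SKETCH_SLOW_POLL_INT_MS 20000  /* 20 seconds, I2C per the commit message */

/* Hypothetical stand-in for the bus description used by the driver. */
struct sketch_bus_info {
	bool low_frequency;
};

/* Pick the thermal polling interval once, based on the bus type. */
static int sketch_polling_delay_ms(const struct sketch_bus_info *bus_info)
{
	assert(bus_info);
	return bus_info->low_frequency ? SKETCH_SLOW_POLL_INT_MS :
					 SKETCH_POLL_INT_MS;
}
```

Storing the chosen value once lets both the zone registration and a later
mode change reuse the same interval instead of a fixed constant.
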
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index d866c98..152591d8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -41,6 +41,7 @@
 #include "core.h"
 
 #define MLXSW_THERMAL_POLL_INT 1000 /* ms */
+#define MLXSW_THERMAL_SLOW_POLL_INT 20000 /* ms */
 #define MLXSW_THERMAL_MAX_TEMP 110000 /* 110C */
 #define MLXSW_THERMAL_MAX_STATE 10
 #define MLXSW_THERMAL_MAX_DUTY 255
@@ -95,6 +96,7 @@ struct mlxsw_thermal {
struct mlxsw_core *core;
const struct mlxsw_bus_info *bus_info;
struct thermal_zone_device *tzdev;
+   int polling_delay;
struct thermal_cooling_device *cdevs[MLXSW_MFCR_PWMS_MAX];
struct mlxsw_thermal_trip trips[MLXSW_THERMAL_NUM_TRIPS];
enum thermal_device_mode mode;
@@ -190,7 +192,7 @@ static int mlxsw_thermal_set_mode(struct thermal_zone_device *tzdev,
mutex_lock(&tzdev->lock);
 
if (mode == THERMAL_DEVICE_ENABLED)
-   tzdev->polling_delay = MLXSW_THERMAL_POLL_INT;
+   tzdev->polling_delay = thermal->polling_delay;
else
tzdev->polling_delay = 0;
 
@@ -397,13 +399,18 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
}
}
 
+   if (bus_info->low_frequency)
+   thermal->polling_delay = MLXSW_THERMAL_SLOW_POLL_INT;
+   else
+   thermal->polling_delay = MLXSW_THERMAL_POLL_INT;
+
thermal->tzdev = thermal_zone_device_register("mlxsw",
  MLXSW_THERMAL_NUM_TRIPS,
  MLXSW_THERMAL_TRIP_MASK,
  thermal,
  &mlxsw_thermal_ops,
  NULL, 0,
- MLXSW_THERMAL_POLL_INT);
+ thermal->polling_delay);
if (IS_ERR(thermal->tzdev)) {
err = PTR_ERR(thermal->tzdev);
dev_err(dev, "Failed to register thermal zone\n");
-- 
2.1.4

Re: [PATCH v3,net-next] vlan: implement vlan id and protocol changes

2018-06-26 Thread Ido Schimmel

On Mon, Jun 25, 2018 at 02:45:24PM -0600, David Ahern wrote:
> On 6/25/18 4:30 AM, Chas Williams wrote:
> > vlan_changelink silently ignores attempts to change the vlan id
> > or protocol id of an existing vlan interface.  Implement by adding
> > the new vlan id and protocol to the interface's vlan group and then
> > removing the old vlan id and protocol from the vlan group.
> > 
> > Signed-off-by: Chas Williams <3ch...@gmail.com>
> > ---
> >  include/linux/netdevice.h |  1 +
> >  net/8021q/vlan.c  |  4 ++--
> >  net/8021q/vlan.h  |  2 ++
> >  net/8021q/vlan_netlink.c  | 38 ++
> >  net/core/dev.c|  1 +
> >  5 files changed, 44 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 3ec9850c7936..a95ae238addf 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2409,6 +2409,7 @@ enum netdev_cmd {
> > NETDEV_CVLAN_FILTER_DROP_INFO,
> > NETDEV_SVLAN_FILTER_PUSH_INFO,
> > NETDEV_SVLAN_FILTER_DROP_INFO,
> > +   NETDEV_CHANGEVLAN,
> >  };
> >  const char *netdev_cmd_to_name(enum netdev_cmd cmd);
> >  
> 
> you add the new notifier, but do not add any hooks to catch and process it.
> 
> Personally, I think it is a bit sketchy to change the vlan id on an
> existing device and I suspect it will cause latent errors.

+1

> 
> What's your use case for trying to implement the change versus causing
> it to generate an unsupported error?
> 
> If this patch does get accepted, I believe the mlxsw switchdev driver
> will be impacted.

Yes, at minimum we need to return an error for NETDEV_CHANGEVLAN, but
looking at the code it seems that there's no proper rollback.

Thanks for the Cc, David.

Re: [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-06-26 Thread Willem de Bruijn

On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
 wrote:
>
> Extend transmit queue sysfs attribute to configure Rx queue(s) map
> per Tx queue. By default no receive queues are configured for the
> Tx queue.
>
> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>
> Signed-off-by: Amritha Nambiar 
> ---

> +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
> +{
> +   struct net_device *dev = queue->dev;
> +   struct xps_dev_maps *dev_maps;
> +   unsigned long *mask, index;
> +   int j, len, num_tc = 1, tc = 0;
> +
> +   mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
> +  GFP_KERNEL);
> +   if (!mask)
> +   return -ENOMEM;
> +
> +   index = get_netdev_queue_index(queue);
> +
> +   if (dev->num_tc) {
> +   num_tc = dev->num_tc;
> +   tc = netdev_txq_to_tc(dev, index);
> +   if (tc < 0)
> +   return -EINVAL;

Must free mask
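
One common way to handle this is a single exit path that frees the
buffer. The following is a user-space model only, with calloc()/free()
standing in for kcalloc()/kfree(); sketch_show(), the allocation counter
and the helper names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Counts outstanding allocations so the sketch can be checked for leaks. */
static int sketch_live_allocs;

static void *sketch_alloc(size_t n, size_t size)
{
	void *p = calloc(n, size);

	if (p)
		sketch_live_allocs++;
	return p;
}

static void sketch_free(void *p)
{
	if (p) {
		assert(sketch_live_allocs > 0);
		sketch_live_allocs--;
	}
	free(p);
}

/* tc < 0 models the netdev_txq_to_tc() failure in the quoted hunk. */
static int sketch_show(int tc)
{
	unsigned long *mask;
	int err = 0;

	mask = sketch_alloc(4, sizeof(*mask));
	if (!mask)
		return -ENOMEM;

	if (tc < 0) {
		err = -EINVAL;
		goto out_free;	/* do not return with mask still allocated */
	}

	/* ... fill and print the mask here ... */

out_free:
	sketch_free(mask);
	return err;
}
```
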

> +static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
> + size_t len)
> +{
> +   struct net_device *dev = queue->dev;
> +   unsigned long *mask, index;
> +   int err;
> +
> +   if (!capable(CAP_NET_ADMIN))
> +   return -EPERM;

ns_capable?

Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-26 Thread Willem de Bruijn

On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
 wrote:
>
> Change 'skc_tx_queue_mapping' field in sock_common structure from
> 'int' to 'unsigned short' type with 0 indicating unset and
> a positive queue value being set. This way it is consistent with
> the queue_mapping field in the sk_buff. This will also accommodate
> adding a new 'unsigned short' field in sock_common in the next
> patch for rx_queue_mapping.
>
> Signed-off-by: Amritha Nambiar 
> ---

>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>  {
> -   sk->sk_tx_queue_mapping = tx_queue;
> +   /* sk_tx_queue_mapping accept only upto a 16-bit value */
> +   WARN_ON((unsigned short)tx_queue > USHRT_MAX);
> +   sk->sk_tx_queue_mapping = tx_queue + 1;
>  }

WARN_ON_ONCE to avoid flooding the kernel buffer.
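
The encoding described in the quoted commit message (0 means unset, a
stored value of n + 1 means queue n) can be sketched in user space as
below. The struct and function names are hypothetical. The sketch also
assumes the range check is made on the int value before narrowing, since
a value that has already been cast to unsigned short cannot compare
greater than USHRT_MAX.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the socket field discussed above. */
struct sketch_sock {
	unsigned short tx_queue_mapping;	/* 0 means unset */
};

/* Returns 0 on success, -1 if tx_queue cannot be stored. */
static int sketch_tx_queue_set(struct sketch_sock *sk, int tx_queue)
{
	assert(sk);

	/* Check the int before narrowing; tx_queue + 1 must fit in 16 bits. */
	if (tx_queue < 0 || tx_queue >= USHRT_MAX)
		return -1;

	sk->tx_queue_mapping = (unsigned short)(tx_queue + 1);
	return 0;
}

static void sketch_tx_queue_clear(struct sketch_sock *sk)
{
	sk->tx_queue_mapping = 0;
}

/* Returns the queue index, or -1 when no queue has been recorded. */
static int sketch_tx_queue_get(const struct sketch_sock *sk)
{
	return sk->tx_queue_mapping ? sk->tx_queue_mapping - 1 : -1;
}
```
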

Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues

2018-06-26 Thread Willem de Bruijn

On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
 wrote:
>
> This patch adds support to pick Tx queue based on the Rx queue(s) map
> configuration set by the admin through the sysfs attribute
> for each Tx queue. If the user configuration for receive queue(s) map
> does not apply, then the Tx queue selection falls back to CPU(s) map
> based selection and finally to hashing.
>
> Signed-off-by: Amritha Nambiar 
> ---

> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  {
>  #ifdef CONFIG_XPS
> struct xps_dev_maps *dev_maps;
> -   struct xps_map *map;
> +   struct sock *sk = skb->sk;
> int queue_index = -1;
>
> if (!static_key_false(&xps_needed))
> return -1;
>
> rcu_read_lock();
> -   dev_maps = rcu_dereference(dev->xps_cpus_map);
> +   if (!static_key_false(&xps_rxqs_needed))
> +   goto get_cpus_map;
> +
> +   dev_maps = rcu_dereference(dev->xps_rxqs_map);
> if (dev_maps) {
> -   unsigned int tci = skb->sender_cpu - 1;
> +   int tci = sk_rx_queue_get(sk);

What if the rx device differs from the tx device?
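
For reference, the selection order described in the quoted commit message
(Rx queue map, then CPU map, then hashing) could be sketched as follows.
This is a simplified model and not the kernel's get_xps_queue():
sketch_pick_tx_queue() and its parameters are hypothetical, and each
lookup result is modeled as a plain integer where -1 means that map did
not apply.

```c
#include <assert.h>

/* Pick a Tx queue using the first source that yields a valid index.
 * rxqs_map_queue: result of the Rx-queue-based map lookup, or -1.
 * cpus_map_queue: result of the CPU-based map lookup, or -1.
 * hash_queue:     queue chosen by hashing; always valid.
 */
static int sketch_pick_tx_queue(int rxqs_map_queue, int cpus_map_queue,
				int hash_queue)
{
	assert(hash_queue >= 0);

	if (rxqs_map_queue >= 0)
		return rxqs_map_queue;
	if (cpus_map_queue >= 0)
		return cpus_map_queue;
	return hash_queue;
}
```
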

Re: [PATCH net-next V3 1/2] cxgb4: Add support for FW_ETH_TX_PKT_VM_WR

2018-06-26 Thread kbuild test robot

Hi Arjun,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Ganesh-Goudar/cxgb4-Add-support-for-FW_ETH_TX_PKT_VM_WR/20180626-163628
config: x86_64-randconfig-x001-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/chelsio/cxgb4/sge.c: In function 'cxgb4_vf_eth_xmit':
>> drivers/net/ethernet/chelsio/cxgb4/sge.c:1646:18: error: assignment of read-only variable 'fw_hdr_copy_len'
 fw_hdr_copy_len = (sizeof(wr->ethmacdst) + sizeof(wr->ethmacsrc) +
 ^
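
For context, the compiler rejects any assignment to a const-qualified
object after its declaration, so such a variable has to receive its value
in the declaration itself. A minimal sketch of that pattern follows; the
struct sketch_wr and its field sizes are hypothetical stand-ins and do not
reflect the actual firmware work request layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the header fields named in the error above;
 * the field sizes are illustrative only.
 */
struct sketch_wr {
	uint8_t ethmacdst[6];
	uint8_t ethmacsrc[6];
	uint16_t ethtype;
	uint16_t vlantci;
};

static size_t sketch_hdr_copy_len(void)
{
	/* Only used inside sizeof, so it is never dereferenced. */
	struct sketch_wr *wr = NULL;
	/* A const object gets its value in the declaration. */
	const size_t fw_hdr_copy_len = sizeof(wr->ethmacdst) +
				       sizeof(wr->ethmacsrc) +
				       sizeof(wr->ethtype) +
				       sizeof(wr->vlantci);

	/* Holds because this stand-in struct has no padding. */
	assert(fw_hdr_copy_len == sizeof(struct sketch_wr));
	return fw_hdr_copy_len;
}
```
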

vim +/fw_hdr_copy_len +1646 drivers/net/ethernet/chelsio/cxgb4/sge.c

  1622  
  1623  /**
  1624   *  cxgb4_vf_eth_xmit - add a packet to an Ethernet TX queue
  1625   *  @skb: the packet
  1626   *  @dev: the egress net device
  1627   *
  1628   *  Add a packet to an SGE Ethernet TX queue.  Runs with softirqs 
disabled.
  1629   */
  1630  static netdev_tx_t cxgb4_vf_eth_xmit(struct sk_buff *skb,
  1631   struct net_device *dev)
  1632  {
  1633  dma_addr_t addr[MAX_SKB_FRAGS + 1];
  1634  const struct skb_shared_info *ssi;
  1635  struct fw_eth_tx_pkt_vm_wr *wr;
  1636  int qidx, credits, max_pkt_len;
  1637  const size_t fw_hdr_copy_len;
  1638  struct cpl_tx_pkt_core *cpl;
  1639  const struct port_info *pi;
  1640  unsigned int flits, ndesc;
  1641  struct sge_eth_txq *txq;
  1642  struct adapter *adapter;
  1643  u64 cntrl, *end;
  1644  u32 wr_mid;
  1645  
> 1646  fw_hdr_copy_len = (sizeof(wr->ethmacdst) + 
> sizeof(wr->ethmacsrc) +
  1647 sizeof(wr->ethtype) + sizeof(wr->vlantci));
  1648  
  1649  /* The chip minimum packet length is 10 octets but the firmware
  1650   * command that we are using requires that we copy the Ethernet 
header
  1651   * (including the VLAN tag) into the header so we reject 
anything
  1652   * smaller than that ...
  1653   */
  1654  if (unlikely(skb->len < fw_hdr_copy_len))
  1655  goto out_free;
  1656  
  1657  /* Discard the packet if the length is greater than mtu */
  1658  max_pkt_len = ETH_HLEN + dev->mtu;
  1659  if (skb_vlan_tag_present(skb))
  1660  max_pkt_len += VLAN_HLEN;
  1661  if (!skb_shinfo(skb)->gso_size && (unlikely(skb->len > 
max_pkt_len)))
  1662  goto out_free;
  1663  
  1664  /* Figure out which TX Queue we're going to use. */
  1665  pi = netdev_priv(dev);
  1666  adapter = pi->adapter;
  1667  qidx = skb_get_queue_mapping(skb);
  1668  WARN_ON(qidx >= pi->nqsets);
  1669  txq = &adapter->sge.ethtxq[pi->first_qset + qidx];
  1670  
  1671  /* Take this opportunity to reclaim any TX Descriptors whose DMA
  1672   * transfers have completed.
  1673   */
  1674  cxgb4_reclaim_completed_tx(adapter, &txq->q, true);
  1675  
  1676  /* Calculate the number of flits and TX Descriptors we're going 
to
  1677   * need along with how many TX Descriptors will be left over 
after
  1678   * we inject our Work Request.
  1679   */
  1680  flits = t4vf_calc_tx_flits(skb);
  1681  ndesc = flits_to_desc(flits);
  1682  credits = txq_avail(&txq->q) - ndesc;
  1683  
  1684  if (unlikely(credits < 0)) {
  1685  /* Not enough room for this packet's Work Request.  
Stop the
  1686   * TX Queue and return a "busy" condition.  The queue 
will get
  1687   * started later on when the firmware informs us that 
space
  1688   * has opened up.
  1689   */
  1690  eth_txq_stop(txq);
  1691  dev_err(adapter->pdev_dev,
  1692  "%s: TX ring %u full while queue awake!\n",
  1693  dev->name, qidx);
  1694  return NETDEV_TX_BUSY;
  1695  }
  1696  
  1697  if (!t4vf_is_eth_imm(skb) &&
  1698  unlikely(cxgb4_map_skb(adapter->pdev_dev, skb, addr) < 0)) {
  1699  /* We need to map the skb into PCI DMA space (because 
it can't
  1700   * be in-lined directly into the Work Request) and the 
mapping
  1701   * operation failed.  Record the error and drop the 
packet.
  1702   */
  1703  txq->mapping_err++;
  1704

[PATCH net-next V4 1/2] cxgb4: Add support for FW_ETH_TX_PKT_VM_WR

2018-06-26 Thread Ganesh Goudar

From: Arjun Vynipadath 

The present TX work request (FW_ETH_TX_PKT_WR) can't be used for
host->vf communication, since it doesn't loop back the outgoing
packets to virtual interfaces on the same port. This can be done
using FW_ETH_TX_PKT_VM_WR.
This fix depends on ethtool_flags to determine which WR to use for
the TX path. Support for setting these flags by the user is added in
the next commit.

Based on the original work by: Casey Leedom

Signed-off-by: Casey Leedom 
Signed-off-by: Arjun Vynipadath 
Signed-off-by: Ganesh Goudar 
---
V4: Fixed build errors.

V3: Made eth_flags type consistent across struct adapter and
struct port_info.

V2: Renamed t4_eth_xmit() and t4vf_eth_xmit(), since some compilers
were warning about conflicting definition in cxgb4vf driver
---
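A minimal sketch, not part of the patch, of how a per-port flag could
select the work request type on the TX path, as the commit message
describes. The dispatch shown here is an assumption about the general
shape only: struct sketch_port, enum sketch_wr_kind, SKETCH_BIT() and
sketch_select_wr() are hypothetical names; PRIV_FLAG_PORT_TX_VM_BIT and
PRIV_FLAG_PORT_TX_VM mirror the definitions added to cxgb4.h.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BIT(n) (1U << (n))	/* stand-in for the kernel's BIT() */

enum {
	PRIV_FLAG_PORT_TX_VM_BIT,
};

#define PRIV_FLAG_PORT_TX_VM SKETCH_BIT(PRIV_FLAG_PORT_TX_VM_BIT)

/* Which firmware work request the TX path would build. */
enum sketch_wr_kind {
	SKETCH_WR_ETH_TX_PKT,		/* FW_ETH_TX_PKT_WR */
	SKETCH_WR_ETH_TX_PKT_VM,	/* FW_ETH_TX_PKT_VM_WR */
};

/* Hypothetical stand-in for the per-port state carrying the flags. */
struct sketch_port {
	uint32_t eth_flags;
};

/* Use the VM work request only when the private flag is set on the port. */
static enum sketch_wr_kind sketch_select_wr(const struct sketch_port *pi)
{
	assert(pi);
	return (pi->eth_flags & PRIV_FLAG_PORT_TX_VM) ?
		SKETCH_WR_ETH_TX_PKT_VM : SKETCH_WR_ETH_TX_PKT;
}
```

Keeping the decision in one small helper leaves the two transmit routines
independent of how the flag is set.
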
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  13 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   2 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c| 372 +++-
 3 files changed, 383 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 1adb968..a4ea53d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -522,6 +522,15 @@ enum {
MAX_INGQ = MAX_ETH_QSETS + INGQ_EXTRAS,
 };
 
+enum {
+   PRIV_FLAG_PORT_TX_VM_BIT,
+};
+
+#define PRIV_FLAG_PORT_TX_VM   BIT(PRIV_FLAG_PORT_TX_VM_BIT)
+
+#define PRIV_FLAGS_ADAP        0
+#define PRIV_FLAGS_PORT        PRIV_FLAG_PORT_TX_VM
+
 struct adapter;
 struct sge_rspq;
 
@@ -558,6 +567,7 @@ struct port_info {
struct hwtstamp_config tstamp_config;
bool ptp_enable;
struct sched_table *sched_tbl;
+   u32 eth_flags;
 };
 
 struct dentry;
@@ -868,6 +878,7 @@ struct adapter {
unsigned int flags;
unsigned int adap_idx;
enum chip_type chip;
+   u32 eth_flags;
 
int msg_enable;
__be16 vxlan_port;
@@ -1334,7 +1345,7 @@ void t4_os_link_changed(struct adapter *adap, int port_id, int link_stat);
 void t4_free_sge_resources(struct adapter *adap);
 void t4_free_ofld_rxqs(struct adapter *adap, int n, struct sge_ofld_rxq *q);
 irq_handler_t t4_intr_handler(struct adapter *adap);
-netdev_tx_t t4_eth_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t t4_start_xmit(struct sk_buff *skb, struct net_device *dev);
 int t4_ethrx_handler(struct sge_rspq *q, const __be64 *rsp,
 const struct pkt_gl *gl);
 int t4_mgmt_tx(struct adapter *adap, struct sk_buff *skb);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index bc03c17..d3b0f9c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3217,7 +3217,7 @@ static netdev_features_t cxgb_fix_features(struct net_device *dev,
 static const struct net_device_ops cxgb4_netdev_ops = {
.ndo_open = cxgb_open,
.ndo_stop = cxgb_close,
-   .ndo_start_xmit   = t4_eth_xmit,
+   .ndo_start_xmit   = t4_start_xmit,
.ndo_select_queue = cxgb_select_queue,
.ndo_get_stats64  = cxgb_get_stats,
.ndo_set_rx_mode  = cxgb_set_rxmode,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 395e2a0..ebb46c4 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -1288,13 +1288,13 @@ static inline void t6_fill_tnl_lso(struct sk_buff *skb,
 }
 
 /**
- * t4_eth_xmit - add a packet to an Ethernet Tx queue
+ * cxgb4_eth_xmit - add a packet to an Ethernet Tx queue
  * @skb: the packet
  * @dev: the egress net device
  *
  * Add a packet to an SGE Ethernet Tx queue.  Runs with softirqs disabled.
  */
-netdev_tx_t t4_eth_xmit(struct sk_buff *skb, struct net_device *dev)
+static netdev_tx_t cxgb4_eth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
u32 wr_mid, ctrl0, op;
u64 cntrl, *end, *sgl;
@@ -1547,6 +1547,374 @@ out_free:   dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
 }
 
+/* Constants ... */
+enum {
+   /* Egress Queue sizes, producer and consumer indices are all in units
+* of Egress Context Units bytes.  Note that as far as the hardware is
+* concerned, the free list is an Egress Queue (the host produces free
+* buffers which the hardware consumes) and free list entries are
+* 64-bit PCI DMA addresses.
+*/
+   EQ_UNIT = SGE_EQ_IDXSIZE,
+   FL_PER_EQ_UNIT = EQ_UNIT / sizeof(__be64),
+   TXD_PER_EQ_UNIT = EQ_UNIT / sizeof(__be64),
+
+   T4VF_ETHTXQ_MAX_HDR = (sizeof(struct fw_eth_tx_pkt_vm_wr) +
+  sizeof(struct cpl_tx_pkt_lso_core) +
+  sizeof(struct cpl_tx_pkt_

[PATCH net-next V4 2/2] cxgb4: Support ethtool private flags

2018-06-26 Thread Ganesh Goudar

From: Arjun Vynipadath 

This is used to change TX workrequests, which helps in
host->vf communication.

Signed-off-by: Arjun Vynipadath 
Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
V4: No changes

V3: No changes

V2: No changes
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
index f7eef93..ddb8b9e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
@@ -177,6 +177,10 @@ static char loopback_stats_strings[][ETH_GSTRING_LEN] = {
"bg3_frames_trunc   ",
 };
 
+static const char cxgb4_priv_flags_strings[][ETH_GSTRING_LEN] = {
+   [PRIV_FLAG_PORT_TX_VM_BIT] = "port_tx_vm_wr",
+};
+
 static int get_sset_count(struct net_device *dev, int sset)
 {
switch (sset) {
@@ -185,6 +189,8 @@ static int get_sset_count(struct net_device *dev, int sset)
   ARRAY_SIZE(adapter_stats_strings) +
   ARRAY_SIZE(channel_stats_strings) +
   ARRAY_SIZE(loopback_stats_strings);
+   case ETH_SS_PRIV_FLAGS:
+   return ARRAY_SIZE(cxgb4_priv_flags_strings);
default:
return -EOPNOTSUPP;
}
@@ -235,6 +241,7 @@ static void get_drvinfo(struct net_device *dev, struct 
ethtool_drvinfo *info)
 FW_HDR_FW_VER_MINOR_G(exprom_vers),
 FW_HDR_FW_VER_MICRO_G(exprom_vers),
 FW_HDR_FW_VER_BUILD_G(exprom_vers));
+   info->n_priv_flags = ARRAY_SIZE(cxgb4_priv_flags_strings);
 }
 
 static void get_strings(struct net_device *dev, u32 stringset, u8 *data)
@@ -250,6 +257,9 @@ static void get_strings(struct net_device *dev, u32 
stringset, u8 *data)
data += sizeof(channel_stats_strings);
memcpy(data, loopback_stats_strings,
   sizeof(loopback_stats_strings));
+   } else if (stringset == ETH_SS_PRIV_FLAGS) {
+   memcpy(data, cxgb4_priv_flags_strings,
+  sizeof(cxgb4_priv_flags_strings));
}
 }
 
@@ -1499,6 +1509,36 @@ static int cxgb4_get_module_eeprom(struct net_device 
*dev,
 offset, len, &data[eprom->len - len]);
 }
 
+static u32 cxgb4_get_priv_flags(struct net_device *netdev)
+{
+   struct port_info *pi = netdev_priv(netdev);
+   struct adapter *adapter = pi->adapter;
+
+   return (adapter->eth_flags | pi->eth_flags);
+}
+
+/**
+ * set_flags - set/unset specified flags if passed in new_flags
+ * @cur_flags: pointer to current flags
+ * @new_flags: new incoming flags
+ * @flags: set of flags to set/unset
+ */
+static inline void set_flags(u32 *cur_flags, u32 new_flags, u32 flags)
+{
+   *cur_flags = (*cur_flags & ~flags) | (new_flags & flags);
+}
+
+static int cxgb4_set_priv_flags(struct net_device *netdev, u32 flags)
+{
+   struct port_info *pi = netdev_priv(netdev);
+   struct adapter *adapter = pi->adapter;
+
+   set_flags(&adapter->eth_flags, flags, PRIV_FLAGS_ADAP);
+   set_flags(&pi->eth_flags, flags, PRIV_FLAGS_PORT);
+
+   return 0;
+}
+
 static const struct ethtool_ops cxgb_ethtool_ops = {
.get_link_ksettings = get_link_ksettings,
.set_link_ksettings = set_link_ksettings,
@@ -1535,6 +1575,8 @@ static const struct ethtool_ops cxgb_ethtool_ops = {
.get_dump_data = get_dump_data,
.get_module_info   = cxgb4_get_module_info,
.get_module_eeprom = cxgb4_get_module_eeprom,
+   .get_priv_flags= cxgb4_get_priv_flags,
+   .set_priv_flags= cxgb4_set_priv_flags,
 };
 
 void cxgb4_set_ethtool_ops(struct net_device *netdev)
-- 
2.1.0
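
For readers following the set_flags() helper in the patch above, here is a
minimal userspace sketch of the same masking expression. The DEMO_* bit
values and the demo_set_flags() name are assumptions made up for this
illustration; the real PRIV_FLAG_* / PRIV_FLAGS_ADAP / PRIV_FLAGS_PORT
definitions are not part of this hunk.

```c
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. The driver's real
 * PRIV_FLAGS_ADAP and PRIV_FLAGS_PORT masks are defined elsewhere.
 */
#define DEMO_FLAGS_ADAP	(1u << 0)
#define DEMO_FLAGS_PORT	(1u << 1)

/* Same expression as set_flags() in the patch: bits covered by 'flags'
 * are taken from new_flags, bits outside 'flags' keep their old value.
 */
static inline void demo_set_flags(uint32_t *cur_flags, uint32_t new_flags,
				  uint32_t flags)
{
	*cur_flags = (*cur_flags & ~flags) | (new_flags & flags);
}
```

Calling the helper once per mask, as cxgb4_set_priv_flags() does, means a
single incoming u32 can update the adapter-wide and per-port words without
either call disturbing bits owned by the other.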

Re: [PATCH net-next 3/5] sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams

2018-06-26 Thread 吉藤英明

2018-06-26 13:33 GMT+09:00 Xin Long :
> On Tue, Jun 26, 2018 at 12:31 AM, Marcelo Ricardo Leitner
>  wrote:
>> Hi,
>>
>> On Tue, Jun 26, 2018 at 01:12:00AM +0900, 吉藤英明 wrote:
>>> Hi,
>>>
>>> 2018-06-25 22:03 GMT+09:00 Marcelo Ricardo Leitner 
>>> :
>>> > On Mon, Jun 25, 2018 at 07:28:47AM -0400, Neil Horman wrote:
>>> >> On Mon, Jun 25, 2018 at 04:31:26PM +0900, David Miller wrote:
>>> >> > From: Xin Long 
>>> >> > Date: Mon, 25 Jun 2018 10:14:35 +0800
>>> >> >
>>> >> > >  struct sctp_paddrparams {
>>> >> > > @@ -773,6 +775,8 @@ struct sctp_paddrparams {
>>> >> > >   __u32   spp_pathmtu;
>>> >> > >   __u32   spp_sackdelay;
>>> >> > >   __u32   spp_flags;
>>> >> > > + __u32   spp_ipv6_flowlabel;
>>> >> > > + __u8spp_dscp;
>>> >> > >  } __attribute__((packed, aligned(4)));
>>> >> >
>>> >> > I don't think you can change the size of this structure like this.
>>> >> >
>>> >> > This check in sctp_setsockopt_peer_addr_params():
>>> >> >
>>> >> > if (optlen != sizeof(struct sctp_paddrparams))
>>> >> > return -EINVAL;
>>> >> >
>>> >> > is going to trigger in old kernels when executing programs
>>> >> > built against the new struct definition.
>>> >
>>> > That will happen, yes, but do we really care about being future-proof
>>> > here? I mean: if we also update such check(s) to support dealing with
>>> > smaller-than-supported structs, newer kernels will be able to run
>>> > programs built against the old struct, and the new one; while building
>>> > using newer headers and running on older kernel may fool the
>>> > application in other ways too (like enabling support for something
>>> > that is available on newer kernel and that is not present in the older
>>> > one).
>>>
>>> We should not break existing apps.
>>> We still accept apps of pre-2.4 era without sin6_scope_id
>>> (e.g., net/ipv6/af_inet6.c:inet6_bind()).
>>
>> Yes. That's what I tried to say. That is supporting an old app built
>> with old kernel headers and running on a newer kernel, and not the
>> other way around (an app built with fresh headers and running on an
>> old kernel).
> To make it, I will update the check like:
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 1df5d07..c949d8c 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -2715,13 +2715,18 @@ static int
> sctp_setsockopt_peer_addr_params(struct sock *sk,
> struct sctp_sock*sp = sctp_sk(sk);
> int error;
> int hb_change, pmtud_change, sackdelay_change;
> +   int plen = sizeof(params);
> +   int old_plen = plen - sizeof(u32) * 2;

if (optlen < offsetof(struct sctp_paddrparams, spp_ipv6_flowlabel))
maybe?

>
> -   if (optlen != sizeof(struct sctp_paddrparams))
> +   if (optlen != plen && optlen != old_plen)
> return -EINVAL;
>
> if (copy_from_user(&params, optval, optlen))
> return -EFAULT;
>
> +   if (optlen == old_plen)
> +   params.spp_flags &= ~(SPP_DSCP | SPP_IPV6_FLOWLABEL);

I think we should return -EINVAL if size is not new one.

--yoshfuji

> +
> /* Validate flags and value parameters. */
> hb_change= params.spp_flags & SPP_HB;
> pmtud_change = params.spp_flags & SPP_PMTUD;
> @@ -5591,10 +5596,13 @@ static int
> sctp_getsockopt_peer_addr_params(struct sock *sk, int len,
> struct sctp_transport   *trans = NULL;
> struct sctp_association *asoc = NULL;
> struct sctp_sock*sp = sctp_sk(sk);
> +   int plen = sizeof(params);
> +   int old_plen = plen - sizeof(u32) * 2;
>
> -   if (len < sizeof(struct sctp_paddrparams))
> +   if (len < old_plen)
> return -EINVAL;
> -   len = sizeof(struct sctp_paddrparams);
> +
> +   len = len >= plen ? plen : old_plen;
> if (copy_from_user(&params, optval, len))
> return -EFAULT;
>
> does it look ok to you?
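
The length handling discussed in this thread can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions: the
struct is a cut-down stand-in for struct sctp_paddrparams, the DEMO_SPP_*
flag values are invented, and memcpy() stands in for copy_from_user(). It
follows the quoted diff in clearing the new flags for old-size callers and
uses offsetof() for the old length, as suggested in the reply; rejecting
such requests with -EINVAL instead would be a different policy.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Cut-down stand-in for struct sctp_paddrparams; only the tail matters. */
struct demo_paddrparams {
	uint32_t spp_pathmtu;
	uint32_t spp_sackdelay;
	uint32_t spp_flags;
	uint32_t spp_ipv6_flowlabel;	/* new field */
	uint8_t  spp_dscp;		/* new field */
} __attribute__((packed, aligned(4)));

/* Hypothetical flag bits, for illustration only. */
#define DEMO_SPP_IPV6_FLOWLABEL	(1u << 8)
#define DEMO_SPP_DSCP		(1u << 9)

/* Size of the struct as old binaries know it. */
#define DEMO_OLD_LEN offsetof(struct demo_paddrparams, spp_ipv6_flowlabel)

/* Accept either the old or the new struct size; anything else is invalid. */
static int demo_copy_params(struct demo_paddrparams *out,
			    const void *optval, size_t optlen)
{
	if (optlen != sizeof(*out) && optlen != DEMO_OLD_LEN)
		return -EINVAL;

	memset(out, 0, sizeof(*out));
	memcpy(out, optval, optlen);	/* copy_from_user() in the kernel */

	/* An old caller cannot have meant the new flags. */
	if (optlen == DEMO_OLD_LEN)
		out->spp_flags &= ~(DEMO_SPP_DSCP | DEMO_SPP_IPV6_FLOWLABEL);

	return 0;
}
```

With the packed, aligned(4) attribute the two new members grow the struct
by 8 bytes, which is where the "sizeof(u32) * 2" in the quoted diff comes
from.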

Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1

2018-06-26 Thread Peter Robinson

Hi Daniel,

>>> On 06/24/2018 11:24 AM, Peter Robinson wrote:
>> I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
>> a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
>> (doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
>> others, both LPAE/normal kernels.
>>>
>>> So this is arm32 right?
>>
>> Correct.
>>
>> I'm a bit out of my depth in this part of the kernel but I'm wondering
>> if it's known, I couldn't find anything that looked obvious on a few
>> mailing lists.
>>
>> Peter
>
> Hi Peter
>
> Could you provide symbolic information ?

 I passed it through scripts/decode_stacktrace.sh, is that what you were 
 after:

 [8.673880] Internal error: Oops: a06 [#10] SMP ARM
 [8.673949] ---[ end trace 049df4786ea3140a ]---
 [8.678754] Modules linked in:
 [8.678766] CPU: 1 PID: 206 Comm: systemd-udevd Tainted: G  D
 4.18.0-0.rc1.git0.1.fc29.armv7hl+lpae #1
 [8.678769] Hardware name: Allwinner sun8i Family
 [8.678781] PC is at sk_filter_trim_cap ()
 [8.678790] LR is at   (null)
 [8.709463] pc : lr : psr: 6013 ()
 [8.715722] sp : c996bd60  ip :   fp : 
 [8.720939] r10: ee79dc00  r9 : c12c9f80  r8 : 
 [8.726157] r7 :   r6 : 0001  r5 : f1648000  r4 : 
 [8.732674] r3 : 0007  r2 :   r1 :   r0 : 
 [8.739193] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  
 Segment user
 [8.746318] Control: 30c5387d  Table: 6e7bc880  DAC: ffe75ece
 [8.752055] Process systemd-udevd (pid: 206, stack limit = 0x(ptrval))
 [8.758574] Stack: (0xc996bd60 to 0xc996c000)
>>>
>>> Do you have BPF JIT enabled or disabled? Does it happen with disabled?
>>
>> Enabled, I can test with it disabled, BPF configs bits are:
>> CONFIG_BPF_EVENTS=y
>> # CONFIG_BPFILTER is not set
>> CONFIG_BPF_JIT_ALWAYS_ON=y
>> CONFIG_BPF_JIT=y
>> CONFIG_BPF_STREAM_PARSER=y
>> CONFIG_BPF_SYSCALL=y
>> CONFIG_BPF=y
>> CONFIG_CGROUP_BPF=y
>> CONFIG_HAVE_EBPF_JIT=y
>> CONFIG_IPV6_SEG6_BPF=y
>> CONFIG_LWTUNNEL_BPF=y
>> # CONFIG_NBPFAXI_DMA is not set
>> CONFIG_NET_ACT_BPF=m
>> CONFIG_NET_CLS_BPF=m
>> CONFIG_NETFILTER_XT_MATCH_BPF=m
>> # CONFIG_TEST_BPF is not set
>>
>>> I can see one bug, but your stack trace seems unrelated.
>>>
>>> Anyway, could you try with this?
>>
>> Build in process.
>>
>>> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
>>> index 6e8b716..f6a62ae 100644
>>> --- a/arch/arm/net/bpf_jit_32.c
>>> +++ b/arch/arm/net/bpf_jit_32.c
>>> @@ -1844,7 +1844,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
>>> *prog)
>>> /* there are 2 passes here */
>>> bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>>>
>>> -   set_memory_ro((unsigned long)header, header->pages);
>>> +   bpf_jit_binary_lock_ro(header);
>>> prog->bpf_func = (void *)ctx.target;
>>> prog->jited = 1;
>>> prog->jited_len = image_size;
>
> So with that and the other fix there was no improvement, with those
> and the BPF JIT disabled it works, I'm not sure if the two patches
> have any effect with the JIT disabled though.
>
> Will look at the other patches shortly, there's been some other issue
> introduced between rc1 and rc2 which I have to work out before I can
> test those though.

Quick update, with linus's head as of yesterday, basically rc2 plus
davem's network fixes it works if the JIT is disabled IE:
# CONFIG_BPF_JIT_ALWAYS_ON is not set
# CONFIG_BPF_JIT is not set

If I enable it the boot breaks even worse than the errors above in
that I get no console output at all, even with earlycon, so we've gone
backwards since rc1 somehow.

I'll try the above two reverted unless you have any other suggestions.

Peter

Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1

2018-06-26 Thread Daniel Borkmann

On 06/26/2018 02:23 PM, Peter Robinson wrote:
 On 06/24/2018 11:24 AM, Peter Robinson wrote:
>>> I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
>>> a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
>>> (doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
>>> others, both LPAE/normal kernels.

 So this is arm32 right?
>>>
>>> Correct.
>>>
>>> I'm a bit out of my depth in this part of the kernel but I'm wondering
>>> if it's known, I couldn't find anything that looked obvious on a few
>>> mailing lists.
>>>
>>> Peter
>>
>> Hi Peter
>>
>> Could you provide symbolic information ?
>
> I passed it through scripts/decode_stacktrace.sh, is that what you were 
> after:
>
> [8.673880] Internal error: Oops: a06 [#10] SMP ARM
> [8.673949] ---[ end trace 049df4786ea3140a ]---
> [8.678754] Modules linked in:
> [8.678766] CPU: 1 PID: 206 Comm: systemd-udevd Tainted: G  D
> 4.18.0-0.rc1.git0.1.fc29.armv7hl+lpae #1
> [8.678769] Hardware name: Allwinner sun8i Family
> [8.678781] PC is at sk_filter_trim_cap ()
> [8.678790] LR is at   (null)
> [8.709463] pc : lr : psr: 6013 ()
> [8.715722] sp : c996bd60  ip :   fp : 
> [8.720939] r10: ee79dc00  r9 : c12c9f80  r8 : 
> [8.726157] r7 :   r6 : 0001  r5 : f1648000  r4 : 
> [8.732674] r3 : 0007  r2 :   r1 :   r0 : 
> [8.739193] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  
> Segment user
> [8.746318] Control: 30c5387d  Table: 6e7bc880  DAC: ffe75ece
> [8.752055] Process systemd-udevd (pid: 206, stack limit = 0x(ptrval))
> [8.758574] Stack: (0xc996bd60 to 0xc996c000)

 Do you have BPF JIT enabled or disabled? Does it happen with disabled?
>>>
>>> Enabled, I can test with it disabled, BPF configs bits are:
>>> CONFIG_BPF_EVENTS=y
>>> # CONFIG_BPFILTER is not set
>>> CONFIG_BPF_JIT_ALWAYS_ON=y
>>> CONFIG_BPF_JIT=y
>>> CONFIG_BPF_STREAM_PARSER=y
>>> CONFIG_BPF_SYSCALL=y
>>> CONFIG_BPF=y
>>> CONFIG_CGROUP_BPF=y
>>> CONFIG_HAVE_EBPF_JIT=y
>>> CONFIG_IPV6_SEG6_BPF=y
>>> CONFIG_LWTUNNEL_BPF=y
>>> # CONFIG_NBPFAXI_DMA is not set
>>> CONFIG_NET_ACT_BPF=m
>>> CONFIG_NET_CLS_BPF=m
>>> CONFIG_NETFILTER_XT_MATCH_BPF=m
>>> # CONFIG_TEST_BPF is not set
>>>
 I can see one bug, but your stack trace seems unrelated.

 Anyway, could you try with this?
>>>
>>> Build in process.
>>>
 diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
 index 6e8b716..f6a62ae 100644
 --- a/arch/arm/net/bpf_jit_32.c
 +++ b/arch/arm/net/bpf_jit_32.c
 @@ -1844,7 +1844,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
 *prog)
 /* there are 2 passes here */
 bpf_jit_dump(prog->len, image_size, 2, ctx.target);

 -   set_memory_ro((unsigned long)header, header->pages);
 +   bpf_jit_binary_lock_ro(header);
 prog->bpf_func = (void *)ctx.target;
 prog->jited = 1;
 prog->jited_len = image_size;
>>
>> So with that and the other fix there was no improvement, with those
>> and the BPF JIT disabled it works, I'm not sure if the two patches
>> have any effect with the JIT disabled though.
>>
>> Will look at the other patches shortly, there's been some other issue
>> introduced between rc1 and rc2 which I have to work out before I can
>> test those though.
> 
> Quick update, with linus's head as of yesterday, basically rc2 plus
> davem's network fixes it works if the JIT is disabled IE:
> # CONFIG_BPF_JIT_ALWAYS_ON is not set
> # CONFIG_BPF_JIT is not set
> 
> If I enable it the boot breaks even worse than the errors above in
> that I get no console output at all, even with earlycon, so we've gone
> backwards since rc1 somehow.
> 
> I'll try the above two reverted unless you have any other suggestions.

Ok, thanks, let's do that!

I'm still working on fixes meanwhile, should have something by end of day.

Thanks,
Daniel

Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-26 Thread Ka-Cheong Poon


On 06/26/2018 06:16 PM, Sowmini Varadhan wrote:

On (06/26/18 13:30), Ka-Cheong Poon wrote:


My answer to this is that if a socket is not bound to a link
local address (meaning it is bound to a non-link local address)
and it is used to send to a link local peer, I think it should
fail.


Hmm, I'm not sure I agree. I dont think this is forbidden
by RFC 6724 - yes, such a packet cannot be forwarded, but
if everything is on  the same link, and the dest only has
a link-local, you should not need to (create and) bind
another socket to a link-local to talk to this destination..



In this case, RFC 6724 prefers link local address as source.
While using non-link local address (say ULA) is not forbidden,
doing this can easily cause inter-operability issues (does the
app really know that the non-link local source and the link
local destination addresses are really on the same link?).  I
think it is prudent to disallow this in RDS unless there is a
very clear and important reason to do so.  BTW, if it is really
needed, it can be added in future.



  This is consistent with the scope_id check I mentioned in
the previous mail.  If the socket is not bound to a link local
address, the bound_scope_id is 0.  So if the socket is used to
send to a link local address (which has a non-zero scope_id), the
check will catch it and fail the call.  A new conn should not
be created in this case.





--
K. Poon
ka-cheong.p...@oracle.com

Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-26 Thread Sowmini Varadhan

On (06/26/18 21:02), Ka-Cheong Poon wrote:
> 
> In this case, RFC 6724 prefers link local address as source.

the keyword is "prefers". 

> While using non-link local address (say ULA) is not forbidden,
> doing this can easily cause inter-operability issues (does the
> app really know that the non-link local source and the link
> local destination addresses are really on the same link?).  I
> think it is prudent to disallow this in RDS unless there is a
> very clear and important reason to do so. 

I remember the issues that triggered 6724. The "interop" issue
is that when you send from Link-local to global, and need forwarding,
it may not work.

but I dont think an RDS application today expects to deal with
the case that "oh I got back an error when I tried to send to
address X on rds socket rs1, let me go and check what I am bound
to, and maybe create another socket, and bind it to link-local"

You're not doing this for IPv4 and RDS today (you dont have to do this
for UDP, afaik)

This is especially true if "X" is a hostname that got resolved using DNS

> BTW, if it is really needed, it can be added in future.

shrug. You are introducing a new error return.

--Sowmini

[PATCH net-next 1/1] tc-testing: initial version of tunnel_key unit tests

2018-06-26 Thread Keara Leibovitz

Create unittests for the tc tunnel_key action.


Signed-off-by: Keara Leibovitz 
---
 .../tc-testing/tc-tests/actions/tunnel_key.json| 676 +
 1 file changed, 676 insertions(+)
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json

diff --git 
a/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
new file mode 100644
index ..bfe522ac8177
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
@@ -0,0 +1,676 @@
+[
+{
+"id": "2b11",
+"name": "Add tunnel_key set action with mandatory parameters",
+"category": [
+"actions",
+"tunnel_key"
+],
+"setup": [
+[
+"$TC actions flush action tunnel_key",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action tunnel_key set src_ip 
10.10.10.1 dst_ip 20.20.20.2 id 1",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action tunnel_key",
+"matchPattern": "action order [0-9]+: tunnel_key.*set.*src_ip 
10.10.10.1.*dst_ip 20.20.20.2.*key_id 1",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action tunnel_key"
+]
+},
+{
+"id": "dc6b",
+"name": "Add tunnel_key set action with missing mandatory src_ip 
parameter",
+"category": [
+"actions",
+"tunnel_key"
+],
+"setup": [
+[
+"$TC actions flush action tunnel_key",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action tunnel_key set dst_ip 
20.20.20.2 id 100",
+"expExitCode": "255",
+"verifyCmd": "$TC actions list action tunnel_key",
+"matchPattern": "action order [0-9]+: tunnel_key set.*dst_ip 
20.20.20.2.*key_id 100",
+"matchCount": "0",
+"teardown": [
+"$TC actions flush action tunnel_key"
+]
+},
+{
+"id": "7f25",
+"name": "Add tunnel_key set action with missing mandatory dst_ip 
parameter",
+"category": [
+"actions",
+"tunnel_key"
+],
+"setup": [
+[
+"$TC actions flush action tunnel_key",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action tunnel_key set src_ip 
10.10.10.1 id 100",
+"expExitCode": "255",
+"verifyCmd": "$TC actions list action tunnel_key",
+"matchPattern": "action order [0-9]+: tunnel_key set.*src_ip 
10.10.10.1.*key_id 100",
+"matchCount": "0",
+"teardown": [
+"$TC actions flush action tunnel_key"
+]
+},
+{
+"id": "ba4e",
+"name": "Add tunnel_key set action with missing mandatory id 
parameter",
+"category": [
+"actions",
+"tunnel_key"
+],
+"setup": [
+[
+"$TC actions flush action tunnel_key",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action tunnel_key set src_ip 
10.10.10.1 dst_ip 20.20.20.2",
+"expExitCode": "255",
+"verifyCmd": "$TC actions list action tunnel_key",
+"matchPattern": "action order [0-9]+: tunnel_key set.*src_ip 
10.10.10.1.*dst_ip 20.20.20.2",
+"matchCount": "0",
+"teardown": [
+"$TC actions flush action tunnel_key"
+]
+},
+{
+"id": "a5e0",
+"name": "Add tunnel_key set action with invalid src_ip parameter",
+"category": [
+"actions",
+"tunnel_key"
+],
+"setup": [
+[
+"$TC actions flush action tunnel_key",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action tunnel_key set src_ip 
300.168.100.1 dst_ip 192.168.200.1 id 7 index 1",
+"expExitCode": "1",
+"verifyCmd": "$TC actions get action tunnel_key index 1",
+"matchPattern": "action order [0-9]+: tunnel_key set.*src_ip 
300.168.100.1.*dst_ip 192.168.200.1.*key_id 7.*index 1 ref",
+"matchCount": "0",
+"teardown": [
+"$TC actions flush action tunnel_key"
+]
+},
+{
+"id": "eaa8",
+"name": "Add tunnel_key set action with invalid dst_ip parameter",
+"category": [
+"actions",
+"tunnel_key"
+],
+"setup": [
+[
+"$TC actions flush action tunnel_key",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$

Re: [PATCH V4 5/8] ARM: dts: stm32: Add ethernet dwmac on stm32mp1

2018-06-26 Thread Alexandre Torgue


Hi Christophe

On 05/23/2018 05:47 PM, Christophe Roullier wrote:

Add Ethernet support (Synopsys MAC IP 4.20a) on stm32mp1 SOC.
Enable feature supported by the stmmac driver, such as TSO.

Signed-off-by: Christophe Roullier 
---
  arch/arm/boot/dts/stm32mp157c.dtsi | 30 ++
  1 file changed, 30 insertions(+)

diff --git a/arch/arm/boot/dts/stm32mp157c.dtsi 
b/arch/arm/boot/dts/stm32mp157c.dtsi
index 3db03a2..ea7b6cb 100644
--- a/arch/arm/boot/dts/stm32mp157c.dtsi
+++ b/arch/arm/boot/dts/stm32mp157c.dtsi
@@ -179,5 +179,35 @@
clocks = <&rcc USART1_K>;
status = "disabled";
};
+
+   stmmac_axi_config_0: stmmac-axi-config {
+   snps,wr_osr_lmt = <0x7>;
+   snps,rd_osr_lmt = <0x7>;
+   snps,blen = <0 0 0 0 16 8 4>;
+   };
+
+   ethernet0: ethernet@5800a000 {
+   compatible = "st,stm32mp1-dwmac", "snps,dwmac-4.20a";
+   reg = <0x5800a000 0x2000>;
+   reg-names = "stmmaceth";
+   interrupts-extended = <&intc GIC_SPI 61 IRQ_TYPE_NONE>;


IRQ_TYPE_NONE shouldn't be used. Please provide an edge-sensitive or 
level-sensitive type.



+   interrupt-names = "macirq";
+   clock-names = "stmmaceth",
+ "mac-clk-tx",
+ "mac-clk-rx",
+ "ethstp",
+ "syscfg-clk";
+   clocks = <&rcc ETHMAC>,
+<&rcc ETHTX>,
+<&rcc ETHRX>,
+<&rcc ETHSTP>,
+<&rcc SYSCFG>;
+   st,syscon = <&syscfg 0x4>;
+   snps,mixed-burst;
+   snps,pbl = <2>;
+   snps,axi-config = <&stmmac_axi_config_0>;
+   snps,tso;
+   status = "disabled";
+   };
};
  };

Re: [PATCH 00/14] ARM: davinci: step towards removing at24_platform_data

2018-06-26 Thread Andrew Lunn

> I see. I see it this way: the setup callback comes from the time when
> we didn't have nvmem and should go away. I will protest loud whenever
> someone will try to use it again and will work towards removing it as
> soon as possible.

The setup() callback could be moved into the nvmem framework, rather
than in the at24 driver. Make the call when the cells have been
connected to the backing store.

> I will give your problem a thought and will try to get back with some
> proposals - maybe we should, as you suggested, extend nvmem even
> further to allow to remove nvmem info entries etc.

That does not help me too much. I have the same problem with i2c and
MDIO. So i actually prefer to keep this the same as all others.

Andrew

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread David Miller

From: Sowmini Varadhan 
Date: Mon, 25 Jun 2018 06:41:25 -0700

> The RDS core module creates rds_connections based on callbacks
> from rds_loop_transport when sending/receiving packets to local
> addresses.
> 
> These connections will need to be cleaned up when they are
> created from a netns that is not init_net, and that netns is deleted.
> 
> Add the changes aligned with the changes from
> commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize
> netns/module teardown and rds connection/workq management") for
> rds_loop_transport
> 
> Acked-by: Santosh Shilimkar 
> Signed-off-by: Sowmini Varadhan 

Since this probably fixes syzbot reports, this can be targeted
at 'net' instead?

Re: [lkp-robot] [bisect done] ace45bec6d [ 52.056290] EIP: lock_release

2018-06-26 Thread David Howells

kernel test robot  wrote:

> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> 
> commit ace45bec6d77bc061c3c3d8ad99e298ea9800c2b
> Author: David Howells 
> AuthorDate: Fri Mar 30 21:04:43 2018 +0100
> Commit: David Howells 
> CommitDate: Fri Mar 30 21:04:43 2018 +0100
> 
> rxrpc: Fix firewall route keepalive

Are you actually making AF_RXRPC or the AFS filesystem do anything?  Or is it
just happening spontaneously?

David

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan

On (06/26/18 22:23), David Miller wrote:
> 
> Since this probably fixes syzbot reports, this can be targeted
> at 'net' instead?

that thought occurred to me but I wanted to be conservative and have
it in net-next first, have the syzkaller-bugs team confirm the
fixes and then backport to earlier kernels (if needed).

--Sowmini

Re: [PATCH v2 bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-26 Thread David Ahern

On 6/26/18 3:50 AM, Daniel Borkmann wrote:

> [...]
> You change all the semantics of return code here, but this breaks 
> bpf_skb_fib_lookup().
> I cannot see how this would work in that case. The code does the following 
> with the
> bpf_ipv{4,6}_fib_lookup() return code:
> 
> [...]
> switch (params->family) {
> #if IS_ENABLED(CONFIG_INET)
> case AF_INET:
> index = bpf_ipv4_fib_lookup(net, params, flags, false);
> break;
> #endif
> #if IS_ENABLED(CONFIG_IPV6)
> case AF_INET6:
> index = bpf_ipv6_fib_lookup(net, params, flags, false);
> break;
> #endif
> }
> 
> if (index > 0) {
> struct net_device *dev;
> 
> dev = dev_get_by_index_rcu(net, index);
> if (!is_skb_forwardable(dev, skb))
> index = 0;
> }

Yes, I forgot to update the skb path. That should be rc now and then the
dev lookup based on params->ifindex. Will fix.

> [...]
> 
> So the BPF_FIB_LKUP_* results become the dev ifindex here and the 
> !is_skb_forwardable()
> case further suggests that the packet *can* be forwarded based on the new 
> semantics
> whereas MTU check is bypassed on success.
> 
> It probably helps to craft a selftest for XDP *and* tc case in future, so we 
> can be sure
> nothing breaks with new changes.

yes, will do.
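
A rough model of the skb path fix described above, with the lookup
returning a status code and the egress device taken from the ifindex
stored in params. All names here (the DEMO_* status codes, struct
demo_dev, demo_skb_fib_lookup()) are hypothetical and the packet-length
comparison only stands in for is_skb_forwardable(); this is not the
actual follow-up patch.

```c
#include <stddef.h>

/* Hypothetical status codes; the real BPF_FIB_LKUP_* values are not
 * listed in this thread.
 */
enum demo_lkup_status {
	DEMO_LKUP_SUCCESS = 0,
	DEMO_LKUP_NOT_FWDED,
	DEMO_LKUP_FRAG_NEEDED,
};

struct demo_fib_params {
	int ifindex;		/* egress ifindex filled in by the lookup */
};

struct demo_dev {
	int ifindex;
	unsigned int mtu;
};

/* Stand-in for dev_get_by_index_rcu(): linear search over a table. */
static const struct demo_dev *demo_dev_by_index(const struct demo_dev *devs,
						size_t n, int ifindex)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* rc is the lookup status, not an ifindex. The device is resolved from
 * params->ifindex and the MTU check runs only when the lookup succeeded.
 */
static int demo_skb_fib_lookup(int rc, const struct demo_fib_params *params,
			       const struct demo_dev *devs, size_t n,
			       unsigned int pkt_len)
{
	if (rc == DEMO_LKUP_SUCCESS) {
		const struct demo_dev *dev;

		dev = demo_dev_by_index(devs, n, params->ifindex);
		if (!dev || pkt_len > dev->mtu)
			rc = DEMO_LKUP_FRAG_NEEDED;
	}
	return rc;
}
```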

Re: [PATCH 0/4] lan78xx minor fixes

2018-06-26 Thread David Miller

From: Dave Stevenson 
Date: Mon, 25 Jun 2018 15:07:11 +0100

> This is a small set of patches for the Microchip LAN78xx chip,
> as used in the Raspberry Pi 3B+.
> The main debug/discussion was on
> https://github.com/raspberrypi/linux/issues/2458
 ...

Series applied, thank you.

Re: [PATCH net-next 0/7] l2tp: trivial cleanups

2018-06-26 Thread David Miller

From: Guillaume Nault 
Date: Mon, 25 Jun 2018 16:07:17 +0200

> Just a set of unrelated trivial cleanups (remove unused code, make
> local functions static, etc.).

Series applied, thanks.

Re: [PATCH net-next v2] selftests: net: Test headroom handling of ip6_gre devices

2018-06-26 Thread David Miller

From: Petr Machata 
Date: Mon, 25 Jun 2018 16:43:55 +0200

> Commit 5691484df961 ("net: ip6_gre: Fix headroom request in
> ip6erspan_tunnel_xmit()") and commit 01b8d064d58b ("net: ip6_gre:
> Request headroom in __gre6_xmit()") fix problems in reserving headroom
> in the packets tunneled through ip6gre/tap and ip6erspan netdevices.
> 
> These two patches included snippets that reproduced the issues. This
> patch elevates the snippets to a full-fledged test case.
> 
> Suggested-by: David Miller 
> Signed-off-by: Petr Machata 

Applied, thanks.

Re: [PATCH net-next] r8169: reject unsupported WoL options

2018-06-26 Thread David Miller

From: Heiner Kallweit 
Date: Mon, 25 Jun 2018 20:34:41 +0200

> So far unsupported WoL options are silently ignored. Change this and
> reject attempts to set unsupported options. This prevents situations
> where a user tries to set an unsupported WoL option and is under the
> impression it was successful because ethtool doesn't complain.
> 
> Signed-off-by: Heiner Kallweit 

Applied.
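
The behaviour change described in the commit message boils down to a mask
check before the options are stored. A minimal sketch, assuming a
hypothetical demo_set_wol() helper and a caller-supplied "supported"
mask; it is not the r8169 code itself:

```c
#include <errno.h>
#include <stdint.h>

/* Reject the request if it contains any bit outside the supported set,
 * instead of silently dropping the unsupported bits.
 */
static int demo_set_wol(uint32_t *cur_wolopts, uint32_t requested,
			uint32_t supported)
{
	if (requested & ~supported)
		return -EINVAL;

	*cur_wolopts = requested;
	return 0;
}
```

Returning an error lets ethtool report the failure to the user, which is
the point made in the commit message.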

[PATCH net-next 1/1] tc-tests: add an extreme-case csum action test

2018-06-26 Thread Keara Leibovitz

Added an extreme-case test for all 7 csum action headers.

Signed-off-by: Keara Leibovitz 
---
 .../tc-testing/tc-tests/actions/csum.json  | 24 ++
 1 file changed, 24 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/csum.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/csum.json
index 3a2f51fc7fd4..a022792d392a 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/csum.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/csum.json
@@ -336,6 +336,30 @@
 ]
 },
 {
+"id": "b10b",
+"name": "Add all 7 csum actions",
+"category": [
+"actions",
+"csum"
+],
+"setup": [
+[
+"$TC actions flush action csum",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action csum icmp ip4h sctp igmp 
udplite udp tcp index 7",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action csum index 7",
+"matchPattern": "action order [0-9]*: csum \\(iph, icmp, igmp, tcp, 
udp, udplite, sctp\\).*index 7 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action csum"
+]
+},
+{
 "id": "ce92",
 "name": "Add csum udp action with cookie",
 "category": [
-- 
2.7.4

Bug report: epoll can fail to report EPOLLOUT when unix datagram socket peer is closed

2018-06-26 Thread Ian Lance Taylor

I'm reporting what appears to be a bug in the Linux kernel's epoll
support.  epoll sometimes appears to fail to report an
EPOLLOUT event when the other side of an AF_UNIX/SOCK_DGRAM socket is
closed.  This bug report started as a Go program reported at
https://golang.org/issue/23604.  I've written a C program that
demonstrates the same symptoms, at
https://github.com/golang/go/issues/23604#issuecomment-398945027 .

The C program sets up an AF_UNIX/SOCK_DGRAM server and several
identical clients, all running in non-blocking mode.  All the
non-blocking sockets are added to epoll, using EPOLLET.  The server
periodically closes and reopens its socket.  The clients look for
ECONNREFUSED errors on their write calls, and close and reopen their
sockets when they see one.

The clients will sometimes fill up their buffer and block with EAGAIN.
At that point they expect the poller to return an EPOLLOUT event to
tell them when they are ready to write again.  The expectation is that
either the server will read data, freeing up buffer space, or will
close the socket, which should cause the sending packets to be
discarded, freeing up buffer space.  Generally the EPOLLOUT event
happens.  But sometimes, the poller never returns such an event, and
the client stalls.  In the test program this is reported as a client
that waits more than 20 seconds to be told to continue.

A similar bug report was made, with few details, at
https://stackoverflow.com/questions/38441059/edge-triggered-epoll-for-unix-domain-socket
.

I've tested the program and seen the failure on kernel 4.9.0-6-amd64.
A colleague has tested the program and seen the failure on
4.18.0-smp-DEV #3 SMP @1529531011 x86_64 GNU/Linux.

If there is a better way for me to report this, please let me know.

Thanks for your attention.

Ian
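
For reference, the client-side expectation described in the report can be
sketched as below. This is not the test program linked above and does not
reproduce the bug: it uses socketpair() rather than a bound server socket,
and it only exercises the ordinary path where the peer reads the queued
data. The function name and return convention are made up for this
illustration.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if EPOLLOUT is reported after the peer drains its queue,
 * 0 if epoll_wait() times out, -1 on a setup error.
 */
static int demo_epollout_after_drain(int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLOUT | EPOLLET };
	struct epoll_event got;
	char buf[64] = { 0 };
	int sv[2], ep = -1, ret = -1, i;

	if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_NONBLOCK, 0, sv) < 0)
		return -1;

	/* Client: write until the kernel pushes back with EAGAIN. */
	for (i = 0; i < 100000; i++)
		if (send(sv[0], buf, sizeof(buf), 0) < 0)
			break;
	if (i == 100000 || (errno != EAGAIN && errno != EWOULDBLOCK))
		goto done;

	/* Register the blocked writer, edge-triggered, as in the report. */
	ep = epoll_create1(0);
	if (ep < 0)
		goto done;
	ev.data.fd = sv[0];
	if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) < 0)
		goto done;

	/* Peer: read everything that was queued, freeing buffer space. */
	while (recv(sv[1], buf, sizeof(buf), 0) >= 0)
		;

	/* The writer now expects to be told it can write again. */
	ret = epoll_wait(ep, &got, 1, timeout_ms);
	if (ret == 1)
		ret = (got.events & EPOLLOUT) ? 1 : 0;
	else if (ret != 0)
		ret = -1;
done:
	if (ep >= 0)
		close(ep);
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

The stall in the report would correspond to the timeout case on the path
where the peer closes its socket instead of reading, which this sketch
does not cover.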

Re: [PATCH net-next v2 0/7] net: sched: support replay of filter offload when binding to block

2018-06-26 Thread David Miller

From: Jakub Kicinski 
Date: Mon, 25 Jun 2018 14:30:03 -0700

> This series from John adds the ability to replay filter offload requests
> when new offload callback is being registered on a TC block.  This is most
> likely to take place for shared blocks today, when a block which already
> has rules is bound to another interface.  Prior to this patch set if any
> of the rules were offloaded the block bind would fail.
> 
> A new tcf_proto_op is added to generate a filter-specific offload request.
> The new 'offload' op is supporting extack from day 0, hence we need to
> propagate extack to .ndo_setup_tc TC_BLOCK_BIND/TC_BLOCK_UNBIND and
> through tcf_block_cb_register() to tcf_block_playback_offloads().
> 
> The immediate use of this patch set is to simplify life of drivers which
> require duplicating rules when sharing blocks.  Switch drivers (mlxsw)
> can bind ports to rule lists dynamically, NIC drivers generally don't
> have that ability and need the rules to be duplicated for each ingress
> they match on.  In code terms this means that switch drivers don't
> register multiple callbacks for each port.  NIC drivers do, and get a
> separate request and hence rule per-port, as if the block was not shared.
> The registration fails today, however, if some rules were already present.
> 
> As John notes in description of patch 7, drivers which register multiple
> callbacks to shared blocks will likely need to flush the rules on block
> unbind.  This set makes the core not only replay the offload add
> requests but also offload remove requests when callback is unregistered.
> 
> v2:
>  - name parameters in patch 2;
>  - use unsigned int instead of u32 for in_hw_count;
>  - improve extack message in patch 7.

Series applied, thank you.
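
For readers following the cover letter quoted above, the replay idea can
be modelled in plain userspace C.  This is only a sketch under stated
assumptions: struct block, block_cb_register(), block_cb_unregister() and
port_cb() are hypothetical stand-ins, not the kernel's
tcf_block_cb_register() or tcf_block_playback_offloads().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum offload_cmd { OFFLOAD_ADD, OFFLOAD_DEL };

/* Hypothetical per-driver callback: returns 0 or a negative errno. */
typedef int (*block_cb_fn)(enum offload_cmd cmd, int rule_id, void *priv);

/* Hypothetical shared block that already holds some rules. */
struct block {
	int rules[8];
	size_t n_rules;
};

/* Replay "add" for every existing rule to a newly registered callback.
 * If any add fails, undo the ones already replayed and fail the bind. */
int block_cb_register(struct block *b, block_cb_fn cb, void *priv)
{
	for (size_t i = 0; i < b->n_rules; i++) {
		int err = cb(OFFLOAD_ADD, b->rules[i], priv);

		if (err) {
			while (i--)
				cb(OFFLOAD_DEL, b->rules[i], priv);
			return err;
		}
	}
	return 0;
}

/* Replay "remove" for every rule when the callback goes away. */
void block_cb_unregister(struct block *b, block_cb_fn cb, void *priv)
{
	for (size_t i = 0; i < b->n_rules; i++)
		cb(OFFLOAD_DEL, b->rules[i], priv);
}

/* Hypothetical NIC port with a fixed rule capacity. */
struct port_state {
	int installed;
	int capacity;
};

int port_cb(enum offload_cmd cmd, int rule_id, void *priv)
{
	struct port_state *p = priv;

	(void)rule_id;
	if (cmd == OFFLOAD_ADD) {
		if (p->installed >= p->capacity)
			return -ENOSPC;
		p->installed++;
	} else {
		p->installed--;
	}
	return 0;
}
```

In this model a second port binding to an already populated block receives
one add request per existing rule, which is the per-port duplication the
cover letter describes for NIC drivers.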

Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading

2018-06-26 Thread Andrew Lunn

On Tue, Jun 26, 2018 at 12:10:28PM +, Vadim Pasternak wrote:

Adding the linux...@vger.kernel.org list.

> Add new core_env module to allow port temperature reading. This
> information has most critical impact on system's thermal monitoring and
> is to be used by core_hwmon and core_thermal modules.
> 
> New internal API reads the temperature from all the modules, which are
> equipped with the thermal sensor and exposes temperature according to
> the worst measure. All individual temperature values are normalized to
> pre-defined range.

This patchset has been sent to the netdev list before. I raised a few
questions about this, which is why it is now being posted to a bigger
group for review.

The hardware has up to 64 temperature sensors. These sensors are
hot-pluggable, since they are inside SFP modules, which are
hot-pluggable. Different SFP modules can have different operating
temperature ranges. They contain an EEPROM which lists upper and lower
warning and fail temperatures, and report alarms when these thresholds
are reached.

This code takes the 64 sensor readings and calculates a single value
it passes to one thermal zone. That thermal zone then controls one fan
to keep this single value in range.

I queried whether this is the correct way to do this. Would it not be
better to have up to 64 thermal zones? Leave the thermal core to
iterate over all the zones in order to determine how the fan should be
driven?
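
To illustrate what "normalize and take the worst measure" could look like,
here is a small userspace sketch.  It is not the code from the patch: the
struct layout, the 0..100 scale and the choice of warning/fail thresholds
as the normalization range are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one SFP module sensor, in millidegrees Celsius. */
struct sfp_sensor {
	int present;	/* 0 if the module slot is empty */
	int temp_mc;	/* current reading */
	int warn_mc;	/* upper warning threshold from the EEPROM */
	int crit_mc;	/* upper fail threshold from the EEPROM */
};

/* Map a reading onto 0..100: 0 at or below the warning threshold,
 * 100 at or above the fail threshold, linear in between. */
int normalize_pct(const struct sfp_sensor *s)
{
	if (s->temp_mc <= s->warn_mc)
		return 0;
	if (s->temp_mc >= s->crit_mc)
		return 100;
	return (int)(((long)(s->temp_mc - s->warn_mc) * 100) /
		     (s->crit_mc - s->warn_mc));
}

/* Single aggregated value: the worst normalized reading of all
 * modules that are currently plugged in. */
int worst_pct(const struct sfp_sensor *s, size_t n)
{
	int worst = 0;

	for (size_t i = 0; i < n; i++) {
		int p;

		if (!s[i].present)
			continue;
		p = normalize_pct(&s[i]);
		if (p > worst)
			worst = p;
	}
	return worst;
}
```

The alternative raised above would instead expose each present sensor as
its own zone and leave the aggregation to the thermal core.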

This is possibly the first board with so many sensors. However, I
doubt it is totally unique. Other big Ethernet switches with lots of
SFP modules may be added later. Also, 10G copper PHYs often have
temperature sensors, so this is not limited to just boards with
optical ports. So having a generic solution would be good.

What do the Linux PM experts say about this?

Thanks
Andrew

Re: [PATCH net-next] selftests: forwarding: mirror_gre_vlan_bridge_1q: Unset rp_filter

2018-06-26 Thread David Miller

From: Petr Machata 
Date: Tue, 26 Jun 2018 01:20:32 +0200

> The IP addresses of tunnel endpoint at H3 are set at the VLAN device
> $h3.555. Therefore when test_gretap_untagged_egress() sets vlan 555 to
> egress untagged at $swp3, $h3's rp_filter rejects these packets. The
> test then spuriously fails.
> 
> Therefore turn off net.ipv4.conf.{all, $h3}.rp_filter.
> 
> Fixes: 9c7c8a82442c ("selftests: forwarding: mirror_gre_vlan_bridge_1q: Add 
> more tests")
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 

Applied.

Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute

2018-06-26 Thread Andrew Lunn

> +static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
> +   struct device_attribute *attr,
> +   char *buf)
> +{
> + struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
> + container_of(attr, struct mlxsw_hwmon_attr, dev_attr);
> + struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
> + char mfsm_pl[MLXSW_REG_MFSM_LEN];
> + u16 tach;
> + int err;
> +
> + mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
> + err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm), mfsm_pl);
> + if (err) {
> + dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query fan\n");
> + return err;
> + }
> + tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
> +
> + return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 : 0);
> +}

Documentation/hwmon/sysfs-interface says:

Alarms are direct indications read from the chips. The drivers do NOT
make comparisons of readings to thresholds. This allows violations
between readings to be caught and alarmed. The exact definition of an
alarm (for example, whether a threshold must be met or must be exceeded
to cause an alarm) is chip-dependent.

Now, this is a fault, not an alarm. But does the same apply?

 Andrew

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread David Miller

From: Sowmini Varadhan 
Date: Tue, 26 Jun 2018 09:40:43 -0400

> On (06/26/18 22:23), David Miller wrote:
>> 
>> Since this probably fixes syzbot reports, this can be targetted
>> at 'net' instead?
> 
> that thought occurred to me but I wanted to be conservative and have
> it in net-next first, have the syzkaller-bugs team confirm the
> fixes and then backport to earlier kernels (if needed)..

I think there is a way to ask syzbot to test a patch in an
email.

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan

On (06/26/18 23:29), David Miller wrote:
> 
> I think there is a way to ask syzbot to test a patch in an
> email.

Dmitry/syzkaller-bugs, can you clarify? 

This is for the cluster of dup reports like
 https://groups.google.com/forum/#!topic/syzkaller-bugs/zBph8Vu-q2U
and (most recently)
 https://www.spinics.net/lists/linux-rdma/msg66020.html

as I understand it, if there is no reproducer, you cannot really
have a pass/fail test to confirm the fix.

--Sowmini

RE: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute

2018-06-26 Thread Vadim Pasternak




> -Original Message-
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Tuesday, June 26, 2018 5:29 PM
> To: Vadim Pasternak 
> Cc: da...@davemloft.net; netdev@vger.kernel.org; li...@roeck-us.net;
> rui.zh...@intel.com; edubez...@gmail.com; j...@resnulli.us; mlxsw
> ; Michael Shych 
> Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface
> with FAN fault attribute
> 
> > +static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > +   struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
> > +   container_of(attr, struct mlxsw_hwmon_attr,
> dev_attr);
> > +   struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
> > +   char mfsm_pl[MLXSW_REG_MFSM_LEN];
> > +   u16 tach;
> > +   int err;
> > +
> > +   mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
> > +   err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm),
> mfsm_pl);
> > +   if (err) {
> > +   dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query
> fan\n");
> > +   return err;
> > +   }
> > +   tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
> > +
> > +   return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 : 0);
> > +}
> 
> Documentation/hwmon/sysfs-interface says:
> 
> Alarms are direct indications read from the chips. The drivers do NOT make
> comparisons of readings to thresholds. This allows violations between readings
> to be caught and alarmed. The exact definition of an alarm (for example,
> whether a threshold must be met or must be exceeded to cause an alarm) is
> chip-dependent.
> 
> Now, this is a fault, not an alarm. But does the same apply?

Hi Andrew,

The hardware provides a minimum value for the tachometer.
A tachometer is considered faulty if its reading is below this
value.
If any tachometer is faulty, the system requirements say that PWM
should be set to 100% until the fault is cleared (e.g. by
physically replacing the bad unit).
This is the motivation for exposing fan{x}_fault the way
it is exposed.
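
The policy described above can be sketched in a few lines of userspace C.
This is an illustration only, not driver code: fan_fault() mirrors the
comparison in the quoted patch, while pwm_with_fault_override() and its
parameters are hypothetical names for the "any fault forces 100%" rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A fan is reported faulty when its tachometer reading is below the
 * minimum provided by the hardware. */
int fan_fault(uint16_t tach_rpm, uint16_t tach_min)
{
	return tach_rpm < tach_min ? 1 : 0;
}

/* Hypothetical policy helper: if any fan is faulty, force PWM to 100%;
 * otherwise keep the requested duty cycle (clamped to 100). */
unsigned int pwm_with_fault_override(const uint16_t *tach, size_t n,
				     uint16_t tach_min,
				     unsigned int requested_pct)
{
	for (size_t i = 0; i < n; i++) {
		if (fan_fault(tach[i], tach_min))
			return 100;
	}
	return requested_pct > 100 ? 100 : requested_pct;
}
```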

Thanks,
Vadim.

> 
>  Andrew

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Dmitry Vyukov

On Tue, Jun 26, 2018 at 4:44 PM, Sowmini Varadhan
 wrote:
> On (06/26/18 23:29), David Miller wrote:
>>
>> I think there is a way to ask syzbot to test a patch in an
>> email.
>
> Dmitry/syzkaller-bugs, can you clarify?
>
> This is for the cluster of dup reports like
>  https://groups.google.com/forum/#!topic/syzkaller-bugs/zBph8Vu-q2U
> and (most recently)
>  https://www.spinics.net/lists/linux-rdma/msg66020.html
>
> as I understand it, if there is no reproducer, you cannot really
> have a pass/fail test to confirm the fix.

This bug has a reproducer as far as I see:

https://syzkaller.appspot.com/bug?id=f4ef381349e100280193c25f24e01d9d364132d9

It seems to be a subtle race since syzbot did not progress with
minimization too much:

https://syzkaller.appspot.com/text?tag=ReproSyz&x=16cbfeaf80

It probably hit the race by pure luck of the large program, but then
never had the same luck when it tried to remove any syscalls.
So it can make sense to submit several test requests to get more testing.

Re: [PATCH v2] fib_rules: match rules based on suppress_* properties too

2018-06-26 Thread Roopa Prabhu

On Mon, Jun 25, 2018 at 4:39 PM, Jason A. Donenfeld  wrote:
> Two rules with different values of suppress_prefix or suppress_ifgroup
> are not the same. This fixes an -EEXIST when running:
>
>$ ip -4 rule add table main suppress_prefixlength 0
>
> Signed-off-by: Jason A. Donenfeld 
> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs 
> into fib_nl2rule")
> ---
> This adds the new condition you mentioned. I'm not sure what you make of
> DaveM's remark about this not being in the original code, but here is
> nonetheless the requested change.

I just saw DaveM's comment and agree that the new rule_find is different,
but that was intentional: it merged the finding of the rule in the
newlink and dellink paths. I did port each of the conditions from the
previous rule_exists to the new rule_find, but forgot to add the new keys
which now became necessary. I replied with details on your other bug
report thread. Also pasting that response here:

So the previous rule_exists code did not check for attribute matches
correctly. It would ignore a rule at the first non-existent attribute
mismatch. And rule_find will always be called with a valid key.
E.g. in your case, it would return at the pref mismatch and never match
an existing rule.

$ip -4 rule add table main suppress_prefixlength 0
$ip -4 rule add table main suppress_prefixlength 0
$ip -4 rule add table main suppress_prefixlength 0

$ip rule show
0:  from all lookup local
32763:  from all lookup main suppress_prefixlength 0
32764:  from all lookup main suppress_prefixlength 0
32765:  from all lookup main suppress_prefixlength 0
32766:  from all lookup main
32767:  from all lookup default

With your patch, you should get proper EXISTS check
$ ip -4 rule add table main suppress_prefixlength 0
$ ip -4 rule add table main suppress_prefixlength 0

RTNETLINK answers: File exists
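
The matching behaviour discussed here can be modelled outside the kernel.
The sketch below is an assumption-laden illustration, not the code in
net/core/fib_rules.c: struct rule_key, rule_matches() and find_rule() are
hypothetical, with 0 meaning "pref/table not specified" and -1 meaning
"suppress_* not specified", following the -1 checks in the quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced rule key. */
struct rule_key {
	uint32_t table;
	uint32_t pref;
	int suppress_prefixlen;
	int suppress_ifgroup;
};

/* Compare a request against one installed rule.  Attributes the request
 * leaves unspecified act as wildcards.  check_suppress toggles the two
 * comparisons that the patch adds. */
int rule_matches(const struct rule_key *r, const struct rule_key *req,
		 int check_suppress)
{
	if (req->pref && r->pref != req->pref)
		return 0;
	if (req->table && r->table != req->table)
		return 0;
	if (check_suppress) {
		if (req->suppress_ifgroup != -1 &&
		    r->suppress_ifgroup != req->suppress_ifgroup)
			return 0;
		if (req->suppress_prefixlen != -1 &&
		    r->suppress_prefixlen != req->suppress_prefixlen)
			return 0;
	}
	return 1;
}

/* Return the first installed rule matching the request, or NULL. */
const struct rule_key *find_rule(const struct rule_key *rules, size_t n,
				 const struct rule_key *req,
				 int check_suppress)
{
	for (size_t i = 0; i < n; i++) {
		if (rule_matches(&rules[i], req, check_suppress))
			return &rules[i];
	}
	return NULL;
}
```

Without the suppress_* comparisons, a request for "table main
suppress_prefixlength 0" matches the plain "lookup main" rule, which is
the spurious -EEXIST; with them, only an identical rule matches.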

Dave, pls let me know if this is acceptable. If not
I can easily restore the previous rule_exists func. Will also submit a
patch to cover this in self-tests.

thanks.

>
>  net/core/fib_rules.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
> index 126ffc5bc630..bc8425d81022 100644
> --- a/net/core/fib_rules.c
> +++ b/net/core/fib_rules.c
> @@ -416,6 +416,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops 
> *ops,
> if (rule->mark && r->mark != rule->mark)
> continue;
>
> +   if (rule->suppress_ifgroup != -1 &&
> +   r->suppress_ifgroup != rule->suppress_ifgroup)
> +   continue;
> +
> +   if (rule->suppress_prefixlen != -1 &&
> +   r->suppress_prefixlen != rule->suppress_prefixlen)
> +   continue;
> +
> if (rule->mark_mask && r->mark_mask != rule->mark_mask)
> continue;
>
> --

Re: [PATCH net-next 1/1] tc-testing: initial version of tunnel_key unit tests

2018-06-26 Thread Davide Caratti

On Tue, 2018-06-26 at 09:17 -0400, Keara Leibovitz wrote:
> Create unittests for the tc tunnel_key action.
> 
> 
> Signed-off-by: Keara Leibovitz 
> ---
>  .../tc-testing/tc-tests/actions/tunnel_key.json| 676 
> +
>  1 file changed, 676 insertions(+)
>  create mode 100644 
> tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> 
> diff --git 
> a/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json 
> b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> new file mode 100644
> index ..bfe522ac8177

hello Keara!

I think the 'teardown' stage in some of these tests should be reviewed.
Those that are meant to test invalid configurations (like dc6b) should
allow non-zero exit codes in the teardown stage, if the wrong
configuration is caught by the userspace TC tool, before talking to the
kernel. 

Otherwise, those tests will fail when they are invoked one by one with the
act_tunnel_key module unloaded.

> --- /dev/null
> +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> @@ -0,0 +1,676 @@
> 
...

> +{
> +"id": "dc6b",
> +"name": "Add tunnel_key set action with missing mandatory src_ip 
> parameter",
> +"category": [
> +"actions",
> +"tunnel_key"
> +],
> +"setup": [
> +[
> +"$TC actions flush action tunnel_key",
> +0,
> +1,
> +255
> +]
> +],
> +"cmdUnderTest": "$TC actions add action tunnel_key set dst_ip 
> 20.20.20.2 id 100",
> +"expExitCode": "255",
> +"verifyCmd": "$TC actions list action tunnel_key",
> +"matchPattern": "action order [0-9]+: tunnel_key set.*dst_ip 
> 20.20.20.2.*key_id 100",
> +"matchCount": "0",
> +"teardown": [
> +"$TC actions flush action tunnel_key"
> +]
> +},

example: try the test above as follows:

[root@rhel tc-testing]# modprobe  act_tunnel_key
[root@rhel tc-testing]# ./tdc.py -e dc6b
Test dc6b: Add tunnel_key set action with missing mandatory src_ip parameter
All test results: 

1..1
ok 1 - dc6b # Add tunnel_key set action with missing mandatory src_ip parameter
about to flush the tap output if tests need to be skipped
done flushing skipped test tap output

[root@rhel tc-testing]# modprobe -r act_tunnel_key ; ./tdc.py -p 
/usr/local/src/iproute2/tc/tc -e dc6b
Test dc6b: Add tunnel_key set action with missing mandatory src_ip parameter

-> teardown stage *** Could not execute: "$TC actions flush action 
tunnel_key"

-> teardown stage *** Error message: "Error: Cannot flush unknown TC action.
We have an error flushing
"
[...]
---
accumulated output for this test:
---
All test results: 

1..1
about to flush the tap output if tests need to be skipped
ok 1 - dc6b # skipped - previous teardown failed 1 dc6b
done flushing skipped test tap output

(BTW: I'm fixing the bpf test suite for a similar problem, I forgot to fix
it when I posted commit f7017cafcdd ("tc-testing: fix tdc tests for 'bpf'
action") . Sorry for that.)


WDYT?

regards,
-- 
davide

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan

On (06/26/18 23:29), David Miller wrote:
> >> 
> >> Since this probably fixes syzbot reports, this can be targetted
> >> at 'net' instead?
> > 
> > that thought occurred to me but I wanted to be conservative and have
> > it in net-next first, have the syzkaller-bugs team confirm the
> > fixes and then backport to earlier kernels (if needed)..
> 
> I think there is a way to ask syzbot to test a patch in an
> email.

and just to add, the fix itself is logically correct, so belongs in
net-next. What I don't have (and therefore did not target net) is
official confirmation that the syzbot failures are root-caused to the
absence of this patch (since there is no reproducer for many of these,
and no crash dumps available from syzbot).  

--Sowmini

Re: Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath

2018-06-26 Thread Or Gerlitz

>  Forwarded Message 
> Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
> Date: Thu, 21 Jun 2018 14:35:55 +0100
> From: John Hurley 
> To: d...@openvswitch.org, r...@mellanox.com, g...@mellanox.com, 
> pa...@mellanox.com, f...@sysclose.org, simon.hor...@netronome.com
> CC: John Hurley 
> 
> This patchset extends OvS TC and the linux-netdev implementation to
> support the offloading of Linux Link Aggregation devices (LAG) and their
> slaves. TC blocks are used to provide this offload. Blocks, in TC, group
> together a series of qdiscs. If a filter is added to one of these qdiscs
> then it applied to all. Similarly, if a packet is matched on one of the
> grouped qdiscs then the stats for the entire block are increased. The
> basis of the LAG offload is that the LAG master (attached to the OvS
> bridge) and slaves that may exist outside of OvS are all added to the same
> TC block. OvS can then control the filters and collect the stats on the
> slaves via its interaction with the LAG master.
> 
> The TC API is extended within OvS to allow the addition of a block id to
> ingress qdisc adds. Block ids are then assigned to each LAG master that is
> attached to the OvS bridge. The linux netdev netlink socket is used to
> monitor slave devices. If a LAG slave is found whose master is on the bridge
> then it is added to the same block as its master. If the underlying slaves
> belong to an offloadable device then the Linux LAG device can be offloaded
> to hardware.

Guys (J/J/J), 

Doing this here b/c

a. this has impact on the kernel side of things

b. I am more of a netdev and not openvswitch citizen..

some comments, 

1. this + Jakub's patch for the replay are really a great design

2. re the egress side of things. Some NIC HWs can't just use LAG
as the egress port destination of an ACL (tc rule) and the HW rule
needs to be duplicated to both HW ports. So... in that case, do you 
see the HW driver doing the duplication (:() or we can somehow
make it happen from user-space?

3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
ingress (decap) rule is set on the vxlan device. Jakub, you mentioned 
a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind 
to the tunnel device for ingress rules. If we have an agreed way to identify
uplink representors, can we do that from ovs too? does it matter if we are
bonding + encapsulating or just encapsulating? note that under encap scheme
the bond is typically not part of the OVS bridge. 

Or.

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan

On (06/26/18 16:48), Dmitry Vyukov wrote:
> it probably hit the race by a pure luck of the large program, but then
> never had the same luck when tried to remove any syscalls.
> So it can make sense to submit several test requests to get more testing.

How does one submit test requests by email? 

the last time I asked this question, the answer was a pointer to
https://groups.google.com/forum/#!msg/syzkaller-bugs/7ucgCkAJKSk/skZjgavRAQAJ

Thanks
--Sowmini

Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Dmitry Vyukov

On Tue, Jun 26, 2018 at 5:04 PM, Sowmini Varadhan
 wrote:
> On (06/26/18 16:48), Dmitry Vyukov wrote:
>> it probably hit the race by a pure luck of the large program, but then
>> never had the same luck when tried to remove any syscalls.
>> So it can make sense to submit several test requests to get more testing.
>
> How does one submit test requests by email?

https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches

> the last time I asked this question, the answer was a pointer to
> https://groups.google.com/forum/#!msg/syzkaller-bugs/7ucgCkAJKSk/skZjgavRAQAJ

You probably asked to apply an unsubmitted patch to syzbot git tree.
That's the question that I gave that link to. But now it's also
detailed here:

https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches

Re: [PATCH v3,net-next] vlan: implement vlan id and protocol changes

2018-06-26 Thread Ido Schimmel

On Tue, Jun 26, 2018 at 09:33:40AM -0400, Chas Williams wrote:
> On Tue, Jun 26, 2018 at 6:32 AM Ido Schimmel  wrote:
> 
> > On Mon, Jun 25, 2018 at 02:45:24PM -0600, David Ahern wrote:
> > > On 6/25/18 4:30 AM, Chas Williams wrote:
> > > > vlan_changelink silently ignores attempts to change the vlan id
> > > > or protocol id of an existing vlan interface.  Implement by adding
> > > > the new vlan id and protocol to the interface's vlan group and then
> > > > removing the old vlan id and protocol from the vlan group.
> > > >
> > > > Signed-off-by: Chas Williams <3ch...@gmail.com>
> > > > ---
> > > >  include/linux/netdevice.h |  1 +
> > > >  net/8021q/vlan.c  |  4 ++--
> > > >  net/8021q/vlan.h  |  2 ++
> > > >  net/8021q/vlan_netlink.c  | 38 ++
> > > >  net/core/dev.c|  1 +
> > > >  5 files changed, 44 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > index 3ec9850c7936..a95ae238addf 100644
> > > > --- a/include/linux/netdevice.h
> > > > +++ b/include/linux/netdevice.h
> > > > @@ -2409,6 +2409,7 @@ enum netdev_cmd {
> > > > NETDEV_CVLAN_FILTER_DROP_INFO,
> > > > NETDEV_SVLAN_FILTER_PUSH_INFO,
> > > > NETDEV_SVLAN_FILTER_DROP_INFO,
> > > > +   NETDEV_CHANGEVLAN,
> > > >  };
> > > >  const char *netdev_cmd_to_name(enum netdev_cmd cmd);
> > > >
> > >
> > > you add the new notifier, but do not add any hooks to catch and process
> > it.
> > >
> > > Personally, I think it is a bit sketchy to change the vlan id on an
> > > existing device and I suspect it will cause latent errors.
> >
> > +1
> >
> > >
> > > What's your use case for trying to implement the change versus causing
> > > it to generate an unsupported error?
> > >
> > > If this patch does get accepted, I believe the mlxsw switchdev driver
> > > will be impacted.
> >
> > Yes, at minimum we need to return an error for NETDEV_CHANGEVLAN, but
> > looking at the code it seems that there's no proper rollback.
> >
> 
> I would prefer not to bother with error handling on the notification.  If
> something misses the notification, something misses the notification.
> It happens.

The notification is used so that relevant users in the kernel can
potentially veto the operation and refuse it. See other notifications
such as NETDEV_PRECHANGEUPPER.

The driver David mentioned is one existing user that needs to refuse the
VLAN change as it can't support it.
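
As a rough userspace model of the veto mechanism (not the kernel notifier
chain API), consider the sketch below.  The names change_vid(),
accept_any() and refuse_change() are hypothetical; the point is only that
every subscriber is asked first and a single refusal leaves the device
unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical subscriber: returns 0 to accept, negative errno to veto. */
typedef int (*change_notifier_fn)(int old_vid, int new_vid);

struct vlan_dev {
	int vid;
};

/* Ask every subscriber before committing the change.  The first refusal
 * aborts the operation and the VLAN id stays as it was. */
int change_vid(struct vlan_dev *dev, int new_vid,
	       const change_notifier_fn *subs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = subs[i](dev->vid, new_vid);

		if (err)
			return err;
	}
	dev->vid = new_vid;
	return 0;
}

/* A subscriber that does not care about VLAN id changes. */
int accept_any(int old_vid, int new_vid)
{
	(void)old_vid;
	(void)new_vid;
	return 0;
}

/* A subscriber, like a switch driver, that cannot support the change. */
int refuse_change(int old_vid, int new_vid)
{
	return old_vid != new_vid ? -EOPNOTSUPP : 0;
}
```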

[PATCH 1/3] ixgbe: split XDP_TX tail and XDP_REDIRECT map flushing

2018-06-26 Thread Jesper Dangaard Brouer

The driver was combining the XDP_TX tail flush and XDP_REDIRECT
map flushing (xdp_do_flush_map).  This is suboptimal, these two
flush operations should be kept separate.

Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map")
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4929f7265598..5f8a969638b2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2186,9 +2186,10 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring 
*rx_ring,
return skb;
 }
 
-#define IXGBE_XDP_PASS 0
-#define IXGBE_XDP_CONSUMED 1
-#define IXGBE_XDP_TX 2
+#define IXGBE_XDP_PASS 0
+#define IXGBE_XDP_CONSUMED BIT(0)
+#define IXGBE_XDP_TX   BIT(1)
+#define IXGBE_XDP_REDIRBIT(2)
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
   struct xdp_frame *xdpf);
@@ -2225,7 +2226,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter 
*adapter,
case XDP_REDIRECT:
err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
if (!err)
-   result = IXGBE_XDP_TX;
+   result = IXGBE_XDP_REDIR;
else
result = IXGBE_XDP_CONSUMED;
break;
@@ -2285,7 +2286,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
unsigned int mss = 0;
 #endif /* IXGBE_FCOE */
u16 cleaned_count = ixgbe_desc_unused(rx_ring);
-   bool xdp_xmit = false;
+   unsigned int xdp_xmit = 0;
struct xdp_buff xdp;
 
xdp.rxq = &rx_ring->xdp_rxq;
@@ -2328,8 +2329,10 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
}
 
if (IS_ERR(skb)) {
-   if (PTR_ERR(skb) == -IXGBE_XDP_TX) {
-   xdp_xmit = true;
+   unsigned int xdp_res = -PTR_ERR(skb);
+
+   if (xdp_res & (IXGBE_XDP_TX | IXGBE_XDP_REDIR)) {
+   xdp_xmit |= xdp_res;
ixgbe_rx_buffer_flip(rx_ring, rx_buffer, size);
} else {
rx_buffer->pagecnt_bias++;
@@ -2401,7 +2404,10 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
total_rx_packets++;
}
 
-   if (xdp_xmit) {
+   if (xdp_xmit & IXGBE_XDP_REDIR)
+   xdp_do_flush_map();
+
+   if (xdp_xmit & IXGBE_XDP_TX) {
struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 
/* Force memory writes to complete before letting h/w
@@ -2409,8 +2415,6 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 */
wmb();
writel(ring->next_to_use, ring->tail);
-
-   xdp_do_flush_map();
}
 
u64_stats_update_begin(&rx_ring->syncp);

[PATCH 2/3] i40e: split XDP_TX tail and XDP_REDIRECT map flushing

2018-06-26 Thread Jesper Dangaard Brouer

The driver was combining the XDP_TX tail flush and XDP_REDIRECT
map flushing (xdp_do_flush_map).  This is suboptimal, these two
flush operations should be kept separate.

It looks like the mistake was copy-pasted from ixgbe.

Fixes: d9314c474d4f ("i40e: add support for XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 8ffb7454e67c..c1c027743159 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2200,9 +2200,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
return true;
 }
 
-#define I40E_XDP_PASS 0
-#define I40E_XDP_CONSUMED 1
-#define I40E_XDP_TX 2
+#define I40E_XDP_PASS  0
+#define I40E_XDP_CONSUMED  BIT(0)
+#define I40E_XDP_TXBIT(1)
+#define I40E_XDP_REDIR BIT(2)
 
 static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
  struct i40e_ring *xdp_ring);
@@ -2249,7 +2250,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring 
*rx_ring,
break;
case XDP_REDIRECT:
err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-   result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+   result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
break;
default:
bpf_warn_invalid_xdp_action(act);
@@ -2312,7 +2313,8 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
struct sk_buff *skb = rx_ring->skb;
u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
-   bool failure = false, xdp_xmit = false;
+   unsigned int xdp_xmit = 0;
+   bool failure = false;
struct xdp_buff xdp;
 
xdp.rxq = &rx_ring->xdp_rxq;
@@ -2373,8 +2375,10 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
}
 
if (IS_ERR(skb)) {
-   if (PTR_ERR(skb) == -I40E_XDP_TX) {
-   xdp_xmit = true;
+   unsigned int xdp_res = -PTR_ERR(skb);
+
+   if (xdp_res & (I40E_XDP_TX | I40E_XDP_REDIR)) {
+   xdp_xmit |= xdp_res;
i40e_rx_buffer_flip(rx_ring, rx_buffer, size);
} else {
rx_buffer->pagecnt_bias++;
@@ -2428,12 +2432,14 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
total_rx_packets++;
}
 
-   if (xdp_xmit) {
+   if (xdp_xmit & I40E_XDP_REDIR)
+   xdp_do_flush_map();
+
+   if (xdp_xmit & I40E_XDP_TX) {
struct i40e_ring *xdp_ring =
rx_ring->vsi->xdp_rings[rx_ring->queue_index];
 
i40e_xdp_ring_update_tail(xdp_ring);
-   xdp_do_flush_map();
}
 
rx_ring->skb = skb;

[PATCH 0/3] xdp: don't mix XDP_TX and XDP_REDIRECT flush ops

2018-06-26 Thread Jesper Dangaard Brouer

Fix driver logic that combines XDP_TX flush and XDP_REDIRECT map
flushing.  These are two different XDP xmit modes, and it is clearly
wrong to invoke both types of flush operations when only one of the
XDP xmit modes is used.
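
The common pattern in the three patches can be sketched in userspace C as
follows.  This is a simplified model, not driver code: the XDP_RES_*
values, accumulate_xmit() and flush_after_poll() are hypothetical names,
and the counters stand in for the real tail bump and xdp_do_flush_map()
calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-packet results, one bit per xmit mode. */
#define XDP_RES_PASS		0u
#define XDP_RES_CONSUMED	(1u << 0)
#define XDP_RES_TX		(1u << 1)
#define XDP_RES_REDIR		(1u << 2)

/* Counters standing in for the two real flush operations. */
struct flush_stats {
	unsigned int tail_bumps;	/* XDP_TX tail write / kick */
	unsigned int map_flushes;	/* XDP_REDIRECT map flush */
};

/* Accumulate which xmit modes were used during one poll cycle. */
unsigned int accumulate_xmit(const unsigned int *res, size_t n)
{
	unsigned int mask = 0;

	for (size_t i = 0; i < n; i++) {
		if (res[i] & (XDP_RES_TX | XDP_RES_REDIR))
			mask |= res[i];
	}
	return mask;
}

/* At the end of the poll cycle, run only the flushes that are needed. */
void flush_after_poll(unsigned int mask, struct flush_stats *st)
{
	if (mask & XDP_RES_REDIR)
		st->map_flushes++;
	if (mask & XDP_RES_TX)
		st->tail_bumps++;
}
```

With a plain boolean, a cycle that only redirected packets would still
bump the TX tail, and a cycle that only used XDP_TX would still flush the
redirect maps.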

---
Unsure what git tree to send this against. Thus, I'll leave it up to
the patchwork assigner ;-)


Jesper Dangaard Brouer (3):
  ixgbe: split XDP_TX tail and XDP_REDIRECT map flushing
  i40e: split XDP_TX tail and XDP_REDIRECT map flushing
  virtio_net: split XDP_TX kick and XDP_REDIRECT map flushing


 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   24 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   24 
 drivers/net/virtio_net.c  |   30 -
 3 files changed, 48 insertions(+), 30 deletions(-)

--

[PATCH 3/3] virtio_net: split XDP_TX kick and XDP_REDIRECT map flushing

2018-06-26 Thread Jesper Dangaard Brouer

The driver was combining XDP_TX virtqueue_kick and XDP_REDIRECT
map flushing (xdp_do_flush_map).  This is suboptimal, these two
flush operations should be kept separate.

The suboptimal behavior was introduced in commit 9267c430c6b6
("virtio-net: add missing virtqueue kick when flushing packets").

Fixes: 9267c430c6b6 ("virtio-net: add missing virtqueue kick when flushing 
packets")
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/virtio_net.c |   30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1619ee3070b6..ae47ecf80c2d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -53,6 +53,10 @@ module_param(napi_tx, bool, 0644);
 /* Amount of XDP headroom to prepend to packets for use by xdp_adjust_head */
 #define VIRTIO_XDP_HEADROOM 256
 
+/* Separating two types of XDP xmit */
+#define VIRTIO_XDP_TX  BIT(0)
+#define VIRTIO_XDP_REDIR   BIT(1)
+
 /* RX packet size EWMA. The average packet size is used to determine the packet
  * buffer size when refilling RX rings. As the entire RX ring may be refilled
  * at once, the weight is chosen so that the EWMA will be insensitive to short-
@@ -582,7 +586,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 struct receive_queue *rq,
 void *buf, void *ctx,
 unsigned int len,
-bool *xdp_xmit)
+unsigned int *xdp_xmit)
 {
struct sk_buff *skb;
struct bpf_prog *xdp_prog;
@@ -654,14 +658,14 @@ static struct sk_buff *receive_small(struct net_device 
*dev,
trace_xdp_exception(vi->dev, xdp_prog, act);
goto err_xdp;
}
-   *xdp_xmit = true;
+   *xdp_xmit |= VIRTIO_XDP_TX;
rcu_read_unlock();
goto xdp_xmit;
case XDP_REDIRECT:
err = xdp_do_redirect(dev, &xdp, xdp_prog);
if (err)
goto err_xdp;
-   *xdp_xmit = true;
+   *xdp_xmit |= VIRTIO_XDP_REDIR;
rcu_read_unlock();
goto xdp_xmit;
default:
@@ -723,7 +727,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
 void *buf,
 void *ctx,
 unsigned int len,
-bool *xdp_xmit)
+unsigned int *xdp_xmit)
 {
struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
@@ -818,7 +822,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
put_page(xdp_page);
goto err_xdp;
}
-   *xdp_xmit = true;
+   *xdp_xmit |= VIRTIO_XDP_TX;
if (unlikely(xdp_page != page))
put_page(page);
rcu_read_unlock();
@@ -830,7 +834,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
put_page(xdp_page);
goto err_xdp;
}
-   *xdp_xmit = true;
+   *xdp_xmit |= VIRTIO_XDP_REDIR;
if (unlikely(xdp_page != page))
put_page(page);
rcu_read_unlock();
@@ -939,7 +943,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 }
 
 static int receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
-  void *buf, unsigned int len, void **ctx, bool *xdp_xmit)
+  void *buf, unsigned int len, void **ctx,
+  unsigned int *xdp_xmit)
 {
struct net_device *dev = vi->dev;
struct sk_buff *skb;
@@ -1232,7 +1237,8 @@ static void refill_work(struct work_struct *work)
}
 }
 
-static int virtnet_receive(struct receive_queue *rq, int budget, bool *xdp_xmit)
+static int virtnet_receive(struct receive_queue *rq, int budget,
+  unsigned int *xdp_xmit)
 {
struct virtnet_info *vi = rq->vq->vdev->priv;
unsigned int len, received = 0, bytes = 0;
@@ -1321,7 +1327,7 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
struct virtnet_info *vi = rq->vq->vdev->priv;
struct send_queue *sq;
unsigned int received, qp;
-   bool xdp_xmit = false;
+   unsigned int xdp_xmit = 0;
 
vir

[PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child

2018-06-26 Thread Eric Dumazet

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_minisocks.c | 223 ---
 1 file changed, 113 insertions(+), 110 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 1dda1341a223937580b4efdbedb21ae50b221ff7..dac5893a52b4520d86ed2fcadbfb561a559fcd3d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -449,119 +449,122 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
  struct sk_buff *skb)
 {
struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
-
-   if (newsk) {
-   const struct inet_request_sock *ireq = inet_rsk(req);
-   struct tcp_request_sock *treq = tcp_rsk(req);
-   struct inet_connection_sock *newicsk = inet_csk(newsk);
-   struct tcp_sock *newtp = tcp_sk(newsk);
-   struct tcp_sock *oldtp = tcp_sk(sk);
-
-   smc_check_reset_syn_req(oldtp, req, newtp);
-
-   /* Now setup tcp_sock */
-   newtp->pred_flags = 0;
-
-   newtp->rcv_wup = newtp->copied_seq =
-   newtp->rcv_nxt = treq->rcv_isn + 1;
-   newtp->segs_in = 1;
-
-   newtp->snd_sml = newtp->snd_una =
-   newtp->snd_nxt = newtp->snd_up = treq->snt_isn + 1;
-
-   INIT_LIST_HEAD(&newtp->tsq_node);
-   INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
-
-   tcp_init_wl(newtp, treq->rcv_isn);
-
-   newtp->srtt_us = 0;
-   newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-   minmax_reset(&newtp->rtt_min, tcp_jiffies32, ~0U);
-   newicsk->icsk_rto = TCP_TIMEOUT_INIT;
-   newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
-
-   newtp->packets_out = 0;
-   newtp->retrans_out = 0;
-   newtp->sacked_out = 0;
-   newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
-   newtp->tlp_high_seq = 0;
-   newtp->lsndtime = tcp_jiffies32;
-   newsk->sk_txhash = treq->txhash;
-   newtp->last_oow_ack_time = 0;
-   newtp->total_retrans = req->num_retrans;
-
-   /* So many TCP implementations out there (incorrectly) count the
-* initial SYN frame in their delayed-ACK and congestion control
-* algorithms that we must have the following bandaid to talk
-* efficiently to them.  -DaveM
-*/
-   newtp->snd_cwnd = TCP_INIT_CWND;
-   newtp->snd_cwnd_cnt = 0;
-
-   /* There's a bubble in the pipe until at least the first ACK. */
-   newtp->app_limited = ~0U;
-
-   tcp_init_xmit_timers(newsk);
-   newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
-
-   newtp->rx_opt.saw_tstamp = 0;
-
-   newtp->rx_opt.dsack = 0;
-   newtp->rx_opt.num_sacks = 0;
-
-   newtp->urg_data = 0;
-
-   if (sock_flag(newsk, SOCK_KEEPOPEN))
-   inet_csk_reset_keepalive_timer(newsk,
-  keepalive_time_when(newtp));
-
-   newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
-   newtp->rx_opt.sack_ok = ireq->sack_ok;
-   newtp->window_clamp = req->rsk_window_clamp;
-   newtp->rcv_ssthresh = req->rsk_rcv_wnd;
-   newtp->rcv_wnd = req->rsk_rcv_wnd;
-   newtp->rx_opt.wscale_ok = ireq->wscale_ok;
-   if (newtp->rx_opt.wscale_ok) {
-   newtp->rx_opt.snd_wscale = ireq->snd_wscale;
-   newtp->rx_opt.rcv_wscale = ireq->rcv_wscale;
-   } else {
-   newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 0;
-   newtp->window_clamp = min(newtp->window_clamp, 65535U);
-   }
-   newtp->snd_wnd = (ntohs(tcp_hdr(skb)->window) <<
- newtp->rx_opt.snd_wscale);
-   newtp->max_window = newtp->snd_wnd;
-
-   if (newtp->rx_opt.tstamp_ok) {
-   newtp->rx_opt.ts_recent = req->ts_recent;
-   newtp->rx_opt.ts_recent_stamp = get_seconds();
-   newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
-   } else {
-   newtp->rx_opt.ts_recent_stamp = 0;
-   newtp->tcp_header_len = sizeof(struct tcphdr);
-   }
-   newtp->tsoffset = treq->ts_off;
+   const struct inet_request_sock *ireq = inet_rsk(req);
+   struct tcp_request_sock *treq = tcp_rsk(req);
+   struct inet_connection_sock *newicsk;
+   struct tcp_sock *oldtp, *newtp;
+
+   if (!newsk)
+   return NULL;
+
+   newicsk = inet_csk(newsk);
+   newtp = tcp_sk(newsk);
+   oldtp = tc

Re: [PATCH v3,net-next] vlan: implement vlan id and protocol changes

2018-06-26 Thread Ido Schimmel

On Tue, Jun 26, 2018 at 09:31:55AM -0400, Chas Williams wrote:
> On Mon, Jun 25, 2018 at 4:45 PM David Ahern  wrote:
> 
> > On 6/25/18 4:30 AM, Chas Williams wrote:
> > > vlan_changelink silently ignores attempts to change the vlan id
> > > or protocol id of an existing vlan interface.  Implement by adding
> > > the new vlan id and protocol to the interface's vlan group and then
> > > removing the old vlan id and protocol from the vlan group.
> > >
> > > Signed-off-by: Chas Williams <3ch...@gmail.com>
> > > ---
> > >  include/linux/netdevice.h |  1 +
> > >  net/8021q/vlan.c  |  4 ++--
> > >  net/8021q/vlan.h  |  2 ++
> > >  net/8021q/vlan_netlink.c  | 38 ++
> > >  net/core/dev.c|  1 +
> > >  5 files changed, 44 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 3ec9850c7936..a95ae238addf 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -2409,6 +2409,7 @@ enum netdev_cmd {
> > >   NETDEV_CVLAN_FILTER_DROP_INFO,
> > >   NETDEV_SVLAN_FILTER_PUSH_INFO,
> > >   NETDEV_SVLAN_FILTER_DROP_INFO,
> > > + NETDEV_CHANGEVLAN,
> > >  };
> > >  const char *netdev_cmd_to_name(enum netdev_cmd cmd);
> > >
> >
> > you add the new notifier, but do not add any hooks to catch and process it.
> >
> 
> I can remove it.  I thought it would be prudent to add it now.
> This could also really be NETDEV_CHANGE.  I wasn't sure
> which would be more acceptable.
> 
> 
> > Personally, I think it is a bit sketchy to change the vlan id on an
> > existing device and I suspect it will cause latent errors.
> >
> 
> It's not any different than changing any other layer 2 property on a device.
> If you change the MTU or the MAC address, you are potentially going to
> cause latent errors.

It is different in switch ASICs, at least. The MTU and MAC don't have
any state associated with them. The VLAN does.

For example, when you assign an IP address to a VLAN device configured
on top of an mlxsw port (e.g., swp1.10), then you are basically creating
a router interface (RIF) that is able to route packets. This RIF is
bound to the port and the VLAN {1, 10} which cannot be changed during
the lifetime of the RIF (at least w/o impacting traffic). The MAC and
the MTU can be easily changed and are changed following
NETDEV_CHANGEADDR and NETDEV_CHANGEMTU events.

Similar problems exist in bridged VLAN devices.

> 
> 
> >
> > What's your use case for trying to implement the change versus causing
> > it to generate an unsupported error?
> >
> 
> It's far more convenient to be able to change the VLAN ID and proto
> instead of having to delete the link and put it back.  That's a lot of
> churn (netlink messages, kernel calls) for something relatively simple.
> 
> 
> >
> > If this patch does get accepted, I believe the mlxsw switchdev driver
> > will be impacted.
> >
> 
> How so?  It was relying on the fact that VLAN changes were ignored?

It is relying on existing kernel behavior, which doesn't allow changing
the VLAN.

tl;dr - I'm still not convinced this is actually needed, but if you're
going to allow such behavior, then please also include a notification
that enables existing in-kernel users to refuse the operation.
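
The kind of "refusable" notification requested here can be pictured with a small user-space model. This is only a sketch under stated assumptions: `DEMO_OK`, `DEMO_VETO`, `prechange_fn`, `call_prechange()` and `change_vlan_id()` are hypothetical names invented for illustration and are not kernel APIs. The point is simply that listeners are asked before the change is committed, and any one of them can veto it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes for this demo; not the kernel's NOTIFY_* values. */
#define DEMO_OK   0
#define DEMO_VETO 1

/* A listener inspects the proposed change and may refuse it. */
typedef int (*prechange_fn)(int old_vid, int new_vid);

/* Ask every listener in turn; stop at the first veto. */
static int call_prechange(const prechange_fn *chain, size_t n,
			  int old_vid, int new_vid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (chain[i](old_vid, new_vid) == DEMO_VETO)
			return DEMO_VETO;
	}
	return DEMO_OK;
}

/* Models a driver that binds hardware state to the VLAN and cannot rebind it. */
static int switch_driver_listener(int old_vid, int new_vid)
{
	return old_vid != new_vid ? DEMO_VETO : DEMO_OK;
}

/* Models a listener that does not care about the VLAN ID. */
static int passive_listener(int old_vid, int new_vid)
{
	(void)old_vid;
	(void)new_vid;
	return DEMO_OK;
}

/* Commit the new VLAN ID only if nobody objected. Returns 0 or -1. */
static int change_vlan_id(int *vid, int new_vid,
			  const prechange_fn *chain, size_t n)
{
	if (call_prechange(chain, n, *vid, new_vid) == DEMO_VETO)
		return -1;
	*vid = new_vid;
	return 0;
}
```

With only `passive_listener` registered the change goes through; adding `switch_driver_listener` makes the same request fail and leaves the VLAN ID untouched.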

Thanks

[PATCH v5 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-26 Thread Fu, Qiaobin

The new action inheritdsfield copies the DS field of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.
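
As a rough illustration of the value that ends up in skb->priority, the sketch below extracts the DS octet from raw IPv4 and IPv6 header bytes and shifts out the two low bits, mirroring the `>> 2` in the patch. The helper names (`demo_ipv4_dsfield`, `demo_ipv6_dsfield`, `demo_dsfield_to_prio`) are hypothetical user-space stand-ins, not the kernel helpers used in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* IPv4: the DS field is the second octet of the header. */
static uint8_t demo_ipv4_dsfield(const uint8_t *iph)
{
	return iph[1];
}

/* IPv6: the 8-bit Traffic Class follows the 4-bit version,
 * so it straddles the first two octets of the header. */
static uint8_t demo_ipv6_dsfield(const uint8_t *ip6h)
{
	return (uint8_t)(((ip6h[0] & 0x0f) << 4) | (ip6h[1] >> 4));
}

/* Drop the two low-order bits, keeping the upper six,
 * as the patch does before storing the value in skb->priority. */
static uint8_t demo_dsfield_to_prio(uint8_t dsfield)
{
	return dsfield >> 2;
}
```

For example, a DS octet of 0xb8 yields 46 in either address family, and the two low bits never influence the result.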

v5:
* Update the drop counter for TC_ACT_SHOT

v4:
* Do not allow setting flags other than the expected ones.
* Allow dumping the pure flags.

v3:
* Use optional flags, so that it won't break old versions of tc.
* Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.

v2:
* Fix the style issue
* Move the code from skbmod to skbedit

Original idea by Jamal Hadi Salim 

Signed-off-by: Qiaobin Fu 
Reviewed-by: Michel Machado 
Acked-by: Jamal Hadi Salim 
Reviewed-by: Marcelo Ricardo Leitner 
Acked-by: Davide Caratti 
---

Note that the motivation for this patch is found in the following discussion:
https://www.spinics.net/lists/netdev/msg501061.html
---
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h
index fbcfe27a4e6c..6de6071ebed6 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -30,6 +30,7 @@
#define SKBEDIT_F_MARK  0x4
#define SKBEDIT_F_PTYPE 0x8
#define SKBEDIT_F_MASK  0x10
+#define SKBEDIT_F_INHERITDSFIELD   0x20

struct tc_skbedit {
tc_gen;
@@ -45,6 +46,7 @@ enum {
TCA_SKBEDIT_PAD,
TCA_SKBEDIT_PTYPE,
TCA_SKBEDIT_MASK,
+   TCA_SKBEDIT_FLAGS,
__TCA_SKBEDIT_MAX
};
#define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 6138d1d71900..dfaf5d8028dd 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -23,6 +23,9 @@
#include 
#include 
#include 
+#include 
+#include 
+#include 

#include 
#include 
@@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a,

if (d->flags & SKBEDIT_F_PRIORITY)
skb->priority = d->priority;
+   if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
+   int wlen = skb_network_offset(skb);
+
+   switch (tc_skb_protocol(skb)) {
+   case htons(ETH_P_IP):
+   wlen += sizeof(struct iphdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+   break;
+
+   case htons(ETH_P_IPV6):
+   wlen += sizeof(struct ipv6hdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+   break;
+   }
+   }
if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
skb->dev->real_num_tx_queues > d->queue_mapping)
skb_set_queue_mapping(skb, d->queue_mapping);
@@ -53,6 +75,11 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a,

spin_unlock(&d->tcf_lock);
return d->tcf_action;
+
+err:
+   d->tcf_qstats.drops++;
+   spin_unlock(&d->tcf_lock);
+   return TC_ACT_SHOT;
}

static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
@@ -62,6 +89,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
[TCA_SKBEDIT_MARK]  = { .len = sizeof(u32) },
[TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) },
[TCA_SKBEDIT_MASK]  = { .len = sizeof(u32) },
+   [TCA_SKBEDIT_FLAGS] = { .len = sizeof(u64) },
};

static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
@@ -114,6 +142,13 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
mask = nla_data(tb[TCA_SKBEDIT_MASK]);
}

+   if (tb[TCA_SKBEDIT_FLAGS] != NULL) {
+   u64 *pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]);
+
+   if (*pure_flags & SKBEDIT_F_INHERITDSFIELD)
+   flags |= SKBEDIT_F_INHERITDSFIELD;
+   }
+
parm = nla_data(tb[TCA_SKBEDIT_PARMS]);

exists = tcf_idr_check(tn, parm->index, a, bind);
@@ -178,6 +213,7 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a,
.action  = d->tcf_action,
};
struct tcf_t t;
+   u64 pure_flags = 0;

if (nla_put(skb, TCA_SKBEDIT_PARMS, sizeof(opt), &opt))
goto nla_put_failure;
@@ -196,6 +232,11 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a,
if ((d->flags & SKBEDIT_F_MASK) &&
nla_put_u32(skb, TCA_SKBEDIT_MASK, d->mask))
goto nla_put_failure;
+   if (d->flags & SKBEDIT_F_INHERITDSFIELD)
+   pure_flags |= SKBEDIT_F_INHERITDSFIELD;
+   if (pure_flags != 0 &&
+   nla_put(skb, TCA_SKBEDIT_FLAGS, sizeof(pure_flags), &pure_flags))
+   goto nla_put_failure;

tcf_tm_dump(&t, &d->tcf_tm);
if (nla_

Re: [PATCH net-next] liquidio: fix kernel panic when NIC firmware is older than 1.7.2

2018-06-26 Thread Shannon Nelson


On 6/26/2018 4:58 AM, Felix Manlunas wrote:

From: Rick Farrington 

Pre-1.7.2 NIC firmware does not support (and does not respond to) the "get
speed" command which is sent by the 1.7.2 driver during modprobe.  Due to a
bug in older firmware (with respect to unknown commands), this unsupported
command causes a cascade of errors that ends in a kernel panic.

Fix it by making the sending of the "get speed" command conditional on the
firmware version.

Signed-off-by: Rick Farrington 
Acked-by: Derek Chickles 
Signed-off-by: Felix Manlunas 
---
Note: To avoid checkpatch.pl "WARNING: line over 80 characters", the comma
   that separates the arguments in the call to strcmp() was placed one
   line below the usual spot.

  drivers/net/ethernet/cavium/liquidio/lio_main.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 7cb4e75..f83f884 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3671,7 +3671,16 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
OCTEON_CN2350_25GB_SUBSYS_ID ||
octeon_dev->subsystem_id ==
OCTEON_CN2360_25GB_SUBSYS_ID) {
-   liquidio_get_speed(lio);
+   /* speed control unsupported in f/w older than 1.7.2 */
+   if (strcmp(octeon_dev->fw_info.liquidio_firmware_version
+  , "1.7.2") < 0) {


Will the liquidio_firmware_version ever end up something like 1.7.10? 
If so, this strcmp() may not do what you want.


sln


+   dev_info(&octeon_dev->pci_dev->dev,
+"speed setting not supported by f/w.");
+   octeon_dev->speed_setting = 25;
+   octeon_dev->no_speed_setting = 1;
+   } else {
+   liquidio_get_speed(lio);
+   }
  
  			if (octeon_dev->speed_setting == 0) {

octeon_dev->speed_setting = 25;
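
To make the strcmp() concern raised in this reply concrete: a plain string comparison orders "1.7.10" before "1.7.2", because it compares character by character. The sketch below shows one possible numeric comparison. `demo_fw_version_cmp()` is a hypothetical helper, and it assumes a strict "major.minor.patch" layout, which the thread does not confirm for liquidio firmware strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compare two dotted "major.minor.patch" strings field by field as numbers.
 * Returns <0, 0 or >0 like strcmp(). Missing or malformed fields count as 0. */
static int demo_fw_version_cmp(const char *a, const char *b)
{
	unsigned int va[3] = { 0, 0, 0 };
	unsigned int vb[3] = { 0, 0, 0 };
	int i;

	/* sscanf() stops at the first field it cannot parse; the rest stay 0. */
	sscanf(a, "%u.%u.%u", &va[0], &va[1], &va[2]);
	sscanf(b, "%u.%u.%u", &vb[0], &vb[1], &vb[2]);

	for (i = 0; i < 3; i++) {
		if (va[i] != vb[i])
			return va[i] < vb[i] ? -1 : 1;
	}
	return 0;
}
```

With this helper "1.7.10" compares as newer than "1.7.2", whereas strcmp() on the same pair returns a negative value.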

Re: [PATCH bpf-next 1/7] nfp: bpf: allow source ptr type be map ptr in memcpy optimization

2018-06-26 Thread Song Liu

On Tue, Jun 26, 2018 at 12:08 AM, Jakub Kicinski
 wrote:
> On Mon, Jun 25, 2018 at 10:50 PM, Song Liu  wrote:
>> On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
>>  wrote:
>>> From: Jiong Wang 
>>>
>>> Map read is already supported on NFP; this patch enables optimization for
>>> memcpy from map to packet.
>>>
>>> This patch also fixes a latent bug which would cause copying from an
>>> unexpected address once memcpy for map pointers is enabled.
>>>
>>> Reported-by: Mary Pham 
>>> Reported-by: David Beckett 
>>> Signed-off-by: Jiong Wang 
>>> Reviewed-by: Jakub Kicinski 
>>> ---
>>>  drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>>> index 8a92088df0d7..33111739b210 100644
>>> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>>> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>>> @@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>>> xfer_num = round_up(len, 4) / 4;
>>>
>>> if (src_40bit_addr)
>>> -   addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
>>> +   addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
>>>   &off);
>>
>> Did this break other cases before this patch?
>>
>> I am sorry if this is a dumb question. I don't think I fully
>> understand addr40_offset().
>
> Only map memory uses 40 bit addressing right now, so the if was pretty
> much dead code before the patch.
>
> The memcpy optimization was left out of the initial map support due to
> insufficient test coverage, I should have probably left more of the 40
> bit addressing code out back then.

Thanks for the explanation!

Acked-by: Song Liu

Re: [rds-devel] [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan

On (06/26/18 10:53), Sowmini Varadhan wrote:
> Date: Tue, 26 Jun 2018 10:53:23 -0400
> From: Sowmini Varadhan 
> To: David Miller 
> Cc: netdev@vger.kernel.org, rds-de...@oss.oracle.com
> Subject: Re: [rds-devel] [PATCH net-next] rds: clean up loopback
> 
> and just to add, the fix itself is logically correct, so belongs in
> net-next. What I don't have (and therefore did not target net) is
> official confirmation that the syzbot failures are root-caused to the
> absence of this patch (since there is no reproducer for many of these,
> and no crash dumps available from syzbot).  
> 

With help from Dmitry, I just got the confirmation from syzbot that
"syzbot has tested the proposed patch and the reproducer did not trigger 
crash:"

thus, we can mark this

Reported-and-tested-by: syzbot+4c20b3866171ce844...@syzkaller.appspotmail.com

and yes, it can target net.

Thanks
--Sowmini

Re: [PATCH V4 0/8] net: ethernet: stmmac: add support for stm32mp1

2018-06-26 Thread Alexandre Torgue


Hi Christophe,

On 05/23/2018 05:47 PM, Christophe Roullier wrote:

Patches to have Ethernet support on stm32mp1
Changelog:
Remark from Rob Herring
Move Documentation/devicetree/bindings/arm/stm32.txt in
Documentation/devicetree/bindings/arm/stm32/stm32.txt and create
Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt

Also replace, in arch/arm/boot/dts/stm32mp157c.dtsi, syscfg: system-config@5002
with syscfg: syscon@5002

Christophe Roullier (8):
   net: ethernet: stmmac: add adaptation for stm32mp157c.
   dt-bindings: stm32-dwmac: add support of MPU families
   ARM: dts: stm32: add ethernet pins to stm32mp157c
   ARM: dts: stm32: Add syscfg on stm32mp1
   ARM: dts: stm32: Add ethernet dwmac on stm32mp1
   net: stmmac: add dwmac-4.20a compatible
   ARM: dts: stm32: add support of ethernet on stm32mp157c-ev1
   dt-bindings: stm32: add compatible for syscon

  Documentation/devicetree/bindings/arm/stm32.txt|  10 -
  .../devicetree/bindings/arm/stm32/stm32-syscon.txt |  14 ++
  .../devicetree/bindings/arm/stm32/stm32.txt|  10 +
  .../devicetree/bindings/net/stm32-dwmac.txt|  18 +-
  arch/arm/boot/dts/stm32mp157-pinctrl.dtsi  |  46 
  arch/arm/boot/dts/stm32mp157c-ev1.dts  |  20 ++
  arch/arm/boot/dts/stm32mp157c.dtsi |  35 +++
  drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c  | 270 +++--
  .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |   3 +-
  9 files changed, 398 insertions(+), 28 deletions(-)
  delete mode 100644 Documentation/devicetree/bindings/arm/stm32.txt
  create mode 100644 Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
  create mode 100644 Documentation/devicetree/bindings/arm/stm32/stm32.txt



As discussed, I squashed "ARM: dts: stm32: add ethernet pins to
stm32mp157c" and "ARM: dts: stm32: add support of ethernet on
stm32mp157c-ev1" and fixed the interrupt binding issue.


So DT patches applied on stm32-next.

regards
Alex

Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute

2018-06-26 Thread Guenter Roeck

On Tue, Jun 26, 2018 at 02:47:01PM +, Vadim Pasternak wrote:
> 
> 
> > -Original Message-
> > From: Andrew Lunn [mailto:and...@lunn.ch]
> > Sent: Tuesday, June 26, 2018 5:29 PM
> > To: Vadim Pasternak 
> > Cc: da...@davemloft.net; netdev@vger.kernel.org; li...@roeck-us.net;
> > rui.zh...@intel.com; edubez...@gmail.com; j...@resnulli.us; mlxsw
> > ; Michael Shych 
> > Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface
> > with FAN fault attribute
> > 
> > > +static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
> > > +   struct device_attribute *attr,
> > > +   char *buf)
> > > +{
> > > + struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
> > > +         container_of(attr, struct mlxsw_hwmon_attr, dev_attr);
> > > + struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
> > > + char mfsm_pl[MLXSW_REG_MFSM_LEN];
> > > + u16 tach;
> > > + int err;
> > > +
> > > + mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
> > > + err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm), mfsm_pl);
> > > + if (err) {
> > > +         dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query fan\n");
> > > + return err;
> > > + }
> > > + tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
> > > +
> > > + return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 : 0);
> > > +}
> > 
> > Documentation/hwmon/sysfs-interface says:
> > 
> > Alarms are direct indications read from the chips. The drivers do NOT make
> > comparisons of readings to thresholds. This allows violations between
> > readings to be caught and alarmed. The exact definition of an alarm (for
> > example, whether a threshold must be met or must be exceeded to cause an
> > alarm) is chip-dependent.
> > 
> > Now, this is a fault, not an alarm. But does the same apply?
> 
Yes, it does. There are no "soft" alarms / faults.

> Hi Andrew,
> 
> Hardware provides minimum value for tachometer.
> Tachometer is considered as faulty in case it's below this
> value.

This is for user space to decide, not for the kernel.

> In case any tachometer is faulty, PWM according to the
> system requirements should be set to 100% until the fault

system requirements. Again, this is for user space to decide.

> is recovered (e.g. by physically replacing the bad unit).
> This is the motivation to expose fan{x}_fault in the way
> it's exposed.
> 
> Thanks,
> Vadim.
> 
> > 
> >  Andrew

Re: [PATCH 2/3] i40e: split XDP_TX tail and XDP_REDIRECT map flushing

2018-06-26 Thread Björn Töpel

On Tue, 26 Jun 2018 at 18:08, Jesper Dangaard Brouer wrote:
>
> The driver was combining the XDP_TX tail flush and XDP_REDIRECT
> map flushing (xdp_do_flush_map).  This is suboptimal, these two
> flush operations should be kept separate.
>
> It looks like the mistake was copy-pasted from ixgbe.
>
> Fixes: d9314c474d4f ("i40e: add support for XDP_REDIRECT")
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |   24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 8ffb7454e67c..c1c027743159 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -2200,9 +2200,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
> return true;
>  }
>
> -#define I40E_XDP_PASS 0
> -#define I40E_XDP_CONSUMED 1
> -#define I40E_XDP_TX 2
> +#define I40E_XDP_PASS  0
> +#define I40E_XDP_CONSUMED  BIT(0)
> +#define I40E_XDP_TX    BIT(1)
> +#define I40E_XDP_REDIR BIT(2)
>
>  static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
>   struct i40e_ring *xdp_ring);
> @@ -2249,7 +2250,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
> break;
> case XDP_REDIRECT:
> err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> -   result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> +   result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
> break;
> default:
> bpf_warn_invalid_xdp_action(act);
> @@ -2312,7 +2313,8 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> struct sk_buff *skb = rx_ring->skb;
> u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> -   bool failure = false, xdp_xmit = false;
> +   unsigned int xdp_xmit = 0;
> +   bool failure = false;
> struct xdp_buff xdp;
>
> xdp.rxq = &rx_ring->xdp_rxq;
> @@ -2373,8 +2375,10 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> }
>
> if (IS_ERR(skb)) {
> -   if (PTR_ERR(skb) == -I40E_XDP_TX) {
> -   xdp_xmit = true;
> +   unsigned int xdp_res = -PTR_ERR(skb);
> +
> +   if (xdp_res & (I40E_XDP_TX | I40E_XDP_REDIR)) {
> +   xdp_xmit |= xdp_res;
> i40e_rx_buffer_flip(rx_ring, rx_buffer, size);
> } else {
> rx_buffer->pagecnt_bias++;
> @@ -2428,12 +2432,14 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> total_rx_packets++;
> }
>
> -   if (xdp_xmit) {
> +   if (xdp_xmit & I40E_XDP_REDIR)
> +   xdp_do_flush_map();
> +
> +   if (xdp_xmit & I40E_XDP_TX) {
> struct i40e_ring *xdp_ring =
> rx_ring->vsi->xdp_rings[rx_ring->queue_index];
>
> i40e_xdp_ring_update_tail(xdp_ring);
> -   xdp_do_flush_map();
> }
>
> rx_ring->skb = skb;
>

Nice! Added intel-wired-lan to Cc.

Acked-by: Björn Töpel
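
The idea in the quoted patch, recording which kinds of XDP transmit happened during a poll as bit flags and then running each flush only when its flag is set, can be modelled in plain C. This is a user-space sketch: `DEMO_XDP_TX`, `DEMO_XDP_REDIR`, `struct demo_flush_stats` and both functions are hypothetical stand-ins for the driver's macros and flush calls, and the counters merely record which flush would have run.

```c
#include <assert.h>

/* Hypothetical stand-ins for the per-driver result bits in the patch. */
#define DEMO_XDP_TX    (1U << 0)
#define DEMO_XDP_REDIR (1U << 1)

/* Counts how often each flush operation would have been issued. */
struct demo_flush_stats {
	unsigned int tail_updates;  /* models the XDP_TX ring tail update */
	unsigned int map_flushes;   /* models the XDP_REDIRECT map flush */
};

/* OR together the per-packet results seen during one poll cycle. */
static unsigned int demo_accumulate(const unsigned int *results, int n)
{
	unsigned int xdp_xmit = 0;
	int i;

	for (i = 0; i < n; i++)
		xdp_xmit |= results[i];
	return xdp_xmit;
}

/* At the end of the poll, run each flush at most once and only if needed. */
static void demo_poll_done(unsigned int xdp_xmit, struct demo_flush_stats *st)
{
	if (xdp_xmit & DEMO_XDP_REDIR)
		st->map_flushes++;
	if (xdp_xmit & DEMO_XDP_TX)
		st->tail_updates++;
}
```

A poll that saw only XDP_TX packets updates the tail once and never touches the map flush, which a single boolean flag could not express.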

[PATCH net-next] l2tp: define helper for parsing struct sockaddr_pppol2tp*

2018-06-26 Thread Guillaume Nault

'sockaddr_len' is checked against various values when entering
pppol2tp_connect(), to verify its validity. It is used again later, to
find out which sockaddr structure was passed from user space. This
patch combines these two operations into one new function in order to
simplify pppol2tp_connect().

A new structure, l2tp_connect_info, is used to pass sockaddr data back
to pppol2tp_connect(), to avoid passing too many parameters to
pppol2tp_sockaddr_get_info(). Also, the first parameter is void* in order
to avoid casting between all sockaddr_* structures manually.
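
The approach described above, validating the length and selecting the structure in a single switch, then returning the fields through one info struct, can be sketched in user space as follows. The structures here (`demo_addr_v2`, `demo_addr_v3`, `demo_connect_info`) and `DEMO_PROTO` are hypothetical and deliberately simplified; they are not the real sockaddr_pppol2tp* layouts. The technique only works because the candidate structures have distinct sizes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PROTO 1 /* hypothetical protocol tag checked in every variant */

/* Two hypothetical address layouts with different sizes. */
struct demo_addr_v2 {
	uint16_t proto;
	uint16_t tunnel;
	uint16_t session;
};

struct demo_addr_v3 {
	uint16_t proto;
	uint16_t pad;
	uint32_t tunnel;
	uint32_t session;
};

/* Common result, independent of which layout the caller passed. */
struct demo_connect_info {
	uint8_t version;
	uint32_t tunnel_id;
	uint32_t session_id;
};

/* The length both validates the input and selects the layout. */
static int demo_sockaddr_get_info(const void *sa, int sa_len,
				  struct demo_connect_info *info)
{
	switch ((size_t)sa_len) {
	case sizeof(struct demo_addr_v2): {
		const struct demo_addr_v2 *a = sa;

		if (a->proto != DEMO_PROTO)
			return -EINVAL;
		info->version = 2;
		info->tunnel_id = a->tunnel;
		info->session_id = a->session;
		break;
	}
	case sizeof(struct demo_addr_v3): {
		const struct demo_addr_v3 *a = sa;

		if (a->proto != DEMO_PROTO)
			return -EINVAL;
		info->version = 3;
		info->tunnel_id = a->tunnel;
		info->session_id = a->session;
		break;
	}
	default:
		return -EINVAL; /* unknown length: reject */
	}

	return 0;
}
```

Because the first parameter is `const void *`, each case assigns it to the matching pointer type without an explicit cast, and the caller only ever deals with `demo_connect_info`.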

Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_ppp.c | 173 ++--
 1 file changed, 103 insertions(+), 70 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index eea5d7844473..d3a9355ac8ac 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -588,40 +588,113 @@ static void pppol2tp_session_init(struct l2tp_session *session)
}
 }
 
+struct l2tp_connect_info {
+   u8 version;
+   int fd;
+   u32 tunnel_id;
+   u32 peer_tunnel_id;
+   u32 session_id;
+   u32 peer_session_id;
+};
+
+static int pppol2tp_sockaddr_get_info(const void *sa, int sa_len,
+ struct l2tp_connect_info *info)
+{
+   switch (sa_len) {
+   case sizeof(struct sockaddr_pppol2tp):
+   {
+   const struct sockaddr_pppol2tp *sa_v2in4 = sa;
+
+   if (sa_v2in4->sa_protocol != PX_PROTO_OL2TP)
+   return -EINVAL;
+
+   info->version = 2;
+   info->fd = sa_v2in4->pppol2tp.fd;
+   info->tunnel_id = sa_v2in4->pppol2tp.s_tunnel;
+   info->peer_tunnel_id = sa_v2in4->pppol2tp.d_tunnel;
+   info->session_id = sa_v2in4->pppol2tp.s_session;
+   info->peer_session_id = sa_v2in4->pppol2tp.d_session;
+
+   break;
+   }
+   case sizeof(struct sockaddr_pppol2tpv3):
+   {
+   const struct sockaddr_pppol2tpv3 *sa_v3in4 = sa;
+
+   if (sa_v3in4->sa_protocol != PX_PROTO_OL2TP)
+   return -EINVAL;
+
+   info->version = 3;
+   info->fd = sa_v3in4->pppol2tp.fd;
+   info->tunnel_id = sa_v3in4->pppol2tp.s_tunnel;
+   info->peer_tunnel_id = sa_v3in4->pppol2tp.d_tunnel;
+   info->session_id = sa_v3in4->pppol2tp.s_session;
+   info->peer_session_id = sa_v3in4->pppol2tp.d_session;
+
+   break;
+   }
+   case sizeof(struct sockaddr_pppol2tpin6):
+   {
+   const struct sockaddr_pppol2tpin6 *sa_v2in6 = sa;
+
+   if (sa_v2in6->sa_protocol != PX_PROTO_OL2TP)
+   return -EINVAL;
+
+   info->version = 2;
+   info->fd = sa_v2in6->pppol2tp.fd;
+   info->tunnel_id = sa_v2in6->pppol2tp.s_tunnel;
+   info->peer_tunnel_id = sa_v2in6->pppol2tp.d_tunnel;
+   info->session_id = sa_v2in6->pppol2tp.s_session;
+   info->peer_session_id = sa_v2in6->pppol2tp.d_session;
+
+   break;
+   }
+   case sizeof(struct sockaddr_pppol2tpv3in6):
+   {
+   const struct sockaddr_pppol2tpv3in6 *sa_v3in6 = sa;
+
+   if (sa_v3in6->sa_protocol != PX_PROTO_OL2TP)
+   return -EINVAL;
+
+   info->version = 3;
+   info->fd = sa_v3in6->pppol2tp.fd;
+   info->tunnel_id = sa_v3in6->pppol2tp.s_tunnel;
+   info->peer_tunnel_id = sa_v3in6->pppol2tp.d_tunnel;
+   info->session_id = sa_v3in6->pppol2tp.s_session;
+   info->peer_session_id = sa_v3in6->pppol2tp.d_session;
+
+   break;
+   }
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 /* connect() handler. Attach a PPPoX socket to a tunnel UDP socket
  */
 static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
int sockaddr_len, int flags)
 {
struct sock *sk = sock->sk;
-   struct sockaddr_pppol2tp *sp = (struct sockaddr_pppol2tp *) uservaddr;
struct pppox_sock *po = pppox_sk(sk);
struct l2tp_session *session = NULL;
+   struct l2tp_connect_info info;
struct l2tp_tunnel *tunnel;
struct pppol2tp_session *ps;
struct l2tp_session_cfg cfg = { 0, };
-   int error = 0;
-   u32 tunnel_id, peer_tunnel_id;
-   u32 session_id, peer_session_id;
bool drop_refcnt = false;
bool drop_tunnel = false;
bool new_session = false;
bool new_tunnel = false;
-   int ver = 2;
-   int fd;
-
-   lock_sock(sk);
-
-   error = -EINVAL;
+   int error;
 
-   if (sockaddr_len != sizeof(struct sockaddr_pppol2tp) &&
-   sockaddr_len != sizeof(struct sockaddr_pppol2tpv3) &&
-   sockaddr_len != sizeof(struct sockaddr_pppol2tpin6) &&
-   s

RE: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute

2018-06-26 Thread Vadim Pasternak




> -Original Message-
> From: Guenter Roeck [mailto:li...@roeck-us.net]
> Sent: Tuesday, June 26, 2018 7:33 PM
> To: Vadim Pasternak 
> Cc: Andrew Lunn ; da...@davemloft.net;
> netdev@vger.kernel.org; rui.zh...@intel.com; edubez...@gmail.com;
> j...@resnulli.us; mlxsw ; Michael Shych
> 
> Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface
> with FAN fault attribute
> 
> On Tue, Jun 26, 2018 at 02:47:01PM +, Vadim Pasternak wrote:
> >
> >
> > > -Original Message-
> > > From: Andrew Lunn [mailto:and...@lunn.ch]
> > > Sent: Tuesday, June 26, 2018 5:29 PM
> > > To: Vadim Pasternak 
> > > Cc: da...@davemloft.net; netdev@vger.kernel.org; li...@roeck-us.net;
> > > rui.zh...@intel.com; edubez...@gmail.com; j...@resnulli.us; mlxsw
> > > ; Michael Shych 
> > > Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon
> > > interface with FAN fault attribute
> > >
> > > > +static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
> > > > + struct device_attribute *attr,
> > > > + char *buf)
> > > > +{
> > > > +   struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
> > > > > +   container_of(attr, struct mlxsw_hwmon_attr, dev_attr);
> > > > +   struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
> > > > +   char mfsm_pl[MLXSW_REG_MFSM_LEN];
> > > > +   u16 tach;
> > > > +   int err;
> > > > +
> > > > +   mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
> > > > > +   err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm), mfsm_pl);
> > > > +   if (err) {
> > > > > +   dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query fan\n");
> > > > +   return err;
> > > > +   }
> > > > +   tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
> > > > +
> > > > > +   return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 : 0);
> > > > > +}
> > >
> > > Documentation/hwmon/sysfs-interface says:
> > >
> > > Alarms are direct indications read from the chips. The drivers do
> > > NOT make comparisons of readings to thresholds. This allows
> > > violations between readings to be caught and alarmed. The exact
> > > definition of an alarm (for example, whether a threshold must be met
> > > or must be exceeded to cause an alarm) is chip-dependent.
> > >
> > > Now, this is a fault, not an alarm. But does the same apply?
> >
> Yes, it does. There are no "soft" alarms / faults.
> 
> > Hi Andrew,
> >
> > Hardware provides minimum value for tachometer.
> > Tachometer is considered as faulty in case it's below this value.
> 
> This is for user space to decide, not for the kernel.

Hi Guenter,

Do you suggest exposing fan{x}_min instead of fan{x}_fault,
and leaving it to the user to compare fan{x}_input against fan{x}_min
for the fault decision?

> 
> > In case any tachometer is faulty, PWM according to the system
> > requirements should be set to 100% until the fault
> 
> system requirements. Again, this is for user space to decide.


Yes, the user should decide in this case, and I wanted to provide
fan{x}_fault to the user for that purpose. But the user could of course
make the decision based on the input and min attributes.
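
A minimal user-space sketch of the comparison discussed above, assuming
the values have already been read from fan{x}_input and fan{x}_min (both
in RPM) and that PWM is expressed in the usual hwmon 0-255 range. The
function names are hypothetical, and the "any faulty fan means full
speed" policy is taken from the description in this thread rather than
from any driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy: a fan counts as faulty when its measured speed
 * is below the minimum reported for it. Both values are in RPM. */
static bool fan_is_faulty(unsigned int input_rpm, unsigned int min_rpm)
{
	return input_rpm < min_rpm;
}

/* Return the PWM value (0-255) to request: full speed if any fan is
 * faulty, otherwise the caller's normal setting. */
static unsigned int pick_pwm(const unsigned int *input_rpm,
			     const unsigned int *min_rpm,
			     unsigned int nfans, unsigned int normal_pwm)
{
	unsigned int i;

	assert(normal_pwm <= 255);

	for (i = 0; i < nfans; i++) {
		if (fan_is_faulty(input_rpm[i], min_rpm[i]))
			return 255;
	}
	return normal_pwm;
}
```

With this split the kernel only reports raw readings and limits, and the
threshold policy stays in user space where it can be changed.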

> 
> > is not recovered (f.e. by physical replacing of bad unit).
> > This is the motivation to expose fan{x}_fault in the way it's exposed.
> >
> > Thanks,
> > Vadim.
> >
> > >
> > >  Andrew

Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading

2018-06-26 Thread Guenter Roeck

On Tue, Jun 26, 2018 at 04:22:38PM +0200, Andrew Lunn wrote:
> On Tue, Jun 26, 2018 at 12:10:28PM +, Vadim Pasternak wrote:
> 
> Adding the linux...@vger.kernel.org list.
> 
> > Add new core_env module to allow port temperature reading. This
> > information has most critical impact on system's thermal monitoring and
> > is to be used by core_hwmon and core_thermal modules.
> > 
> > New internal API reads the temperature from all the modules, which are
> > equipped with the thermal sensor and exposes temperature according to
> > the worst measure. All individual temperature values are normalized to
> > pre-defined range.
> 
> This patchset has been sent to the netdev list before. I raised a few
> questions about this, which is why it is now being posted to a bigger
> group for review.
> 
> The hardware has up to 64 temperature sensors. These sensors are
> hot-pluggable, since they are inside SFP modules, which are
> hot-pluggable. Different SFP modules can have different operating
> temperature ranges. They contain an EEPROM which lists upper and lower
> warning and fail temperatures, and report alarms when these thresholds
> are reached.
> 
> This code takes the 64 sensors readings and calculates a single value
> it passes to one thermal zone. That thermal zone then controls one fan
> to keep this single value in range.
> 
> I queried whether this is the correct way to do this. Would it not be
> better to have up to 64 thermal zones? Leave the thermal core to
> iterate over all the zones in order to determine how the fan should be
> driven?
> 
I very much think so. This problem must exist elsewhere; essentially
it is the bundling of multiple temperature sensors into a single thermal
zone. I am not sure if this should be 64 thermal zones or one thermal
zone with up to 64 sensors and some algorithm to select the relevant
temperature; that would be up to the thermal subsystem maintainers
to decide. In either case, the sensors should be handled and reported
as individual sensors, with appropriate limits, not as a single sensor.
Yes, I understand that means we'll have hundreds of hwmon devices,
but that should not be a problem (and if it is, we'll have to fix
the problem, not the code exposing it).
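
To make the trade-off concrete, here is a rough sketch of the
"normalize and take the worst" aggregation that the patch description
mentions. It is not the mlxsw code: the struct layout, the 0-1000
scale, and all names are assumptions made for illustration. Each sensor
carries its own limits, as the SFP EEPROM thresholds would.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-module reading with its own limits, in millidegrees C. */
struct sensor {
	int temp_mc;
	int low_mc;
	int high_mc;
};

/* Map a reading onto 0..1000 relative to that sensor's own range. */
static int normalize(const struct sensor *s)
{
	long span = (long)s->high_mc - s->low_mc;
	long pos = (long)s->temp_mc - s->low_mc;

	assert(span > 0);

	if (pos < 0)
		return 0;
	if (pos > span)
		return 1000;
	return (int)(pos * 1000 / span);
}

/* Collapse all sensors into one value: the worst normalized reading. */
static int worst_normalized(const struct sensor *s, size_t n)
{
	int worst = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int v = normalize(&s[i]);

		if (v > worst)
			worst = v;
	}
	return worst;
}
```

The single output hides which module is hot and what its limits are,
which is the information lost when the sensors are not reported
individually.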

I understand that the thermal subsystem does not currently support
handling this problem. There may also be some missing pieces between
the hwmon and thermal subsystems, such as reporting limits or alarms
when a hwmon driver registers with the thermal subsystem.

Maybe it is time to add this support as part of this patch series?

> This is possibly the first board with so many sensors. However, I
> doubt it is totally unique. Other big Ethernet switches with lots of
> SFP modules may be added later. Also, 10G copper PHYs often have
> temperature sensors, so this is not limited to just boards with
> optical ports. So having a generic solution would be good.

Agreed.

Thanks,
Guenter

> 
> What do the Linux PM experts say about this?
> 
> Thanks
>   Andrew

[PATCH v3 net-next 2/4] selftests: rtnetlink: use dummydev as a test device

2018-06-26 Thread Shannon Nelson

We really shouldn't mess with local system settings, so let's
use the already created dummy device instead for ipsec testing.
Oh, and let's put the temp file into a proper directory.

Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 261a981..15948cf 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -523,21 +523,19 @@ kci_test_macsec()
 kci_test_ipsec()
 {
ret=0
-
-   # find an ip address on this machine and make up a destination
-   srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
-   net=`echo $srcip | cut -f1-3 -d.`
-   base=`echo $srcip | cut -f4 -d.`
-   dstip="$net."`expr $base + 1`
-
algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+   srcip=192.168.123.1
+   dstip=192.168.123.2
+   spi=7
+
+   ip addr add $srcip dev $devdummy
 
# flush to be sure there's nothing configured
ip x s flush ; ip x p flush
check_err $?
 
# start the monitor in the background
-   tmpfile=`mktemp ipsectestXXX`
+   tmpfile=`mktemp /var/run/ipsectestXXX`
mpid=`(ip x m > $tmpfile & echo $!) 2>/dev/null`
sleep 0.2
 
@@ -601,6 +599,7 @@ kci_test_ipsec()
check_err $?
ip x p flush
check_err $?
+   ip addr del $srcip/32 dev $devdummy
 
if [ $ret -ne 0 ]; then
echo "FAIL: ipsec"
-- 
2.7.4

[PATCH v3 net-next 1/4] selftests: rtnetlink: clear the return code at start of ipsec test

2018-06-26 Thread Shannon Nelson

Following the custom from the other functions, clear the global
ret code before starting the test so as to not have previously
failed tests cause us to think this test has failed.

Reported-by: Anders Roxell 
Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index b33a371..261a981 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -522,6 +522,8 @@ kci_test_macsec()
 #---
 kci_test_ipsec()
 {
+   ret=0
+
# find an ip address on this machine and make up a destination
srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
net=`echo $srcip | cut -f1-3 -d.`
-- 
2.7.4
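
The effect of the missing reset can be sketched in a few lines. This is
a C model of the shell logic, not the script itself. It assumes that
check_err stores the first non-zero code in the global ret and never
clears it; the helper is not shown in this excerpt, so treat that
behaviour as an assumption. run_test is a hypothetical stand-in for a
test function.

```c
#include <assert.h>

static int ret;	/* global sticky result, modelled on the script's $ret */

/* Assumed helper behaviour: remember only the first failure. */
static void check_err(int rc)
{
	if (ret == 0)
		ret = rc;
}

/* Hypothetical test body: optionally clear ret, then record one step. */
static int run_test(int clear_first, int step_rc)
{
	if (clear_first)
		ret = 0;
	check_err(step_rc);
	return ret;
}

static void demo(void)
{
	ret = 0;
	check_err(1);			/* an earlier test failed */
	assert(run_test(0, 0) == 1);	/* stale failure leaks into this test */
	assert(run_test(1, 0) == 0);	/* clearing first gives the real result */
	assert(run_test(1, 2) == 2);	/* genuine failures are still reported */
}
```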

[PATCH v3 net-next 0/4] Updates for ipsec selftests

2018-06-26 Thread Shannon Nelson

Fix up the existing ipsec selftest and add tests for
the ipsec offload driver API.

v2: addressed formatting nits in netdevsim from Jakub Kicinski
v3: a couple more nits from Jakub

Shannon Nelson (4):
  selftests: rtnetlink: clear the return code at start of ipsec test
  selftests: rtnetlink: use dummydev as a test device
  netdevsim: add ipsec offload testing
  selftests: rtnetlink: add ipsec offload API test

 drivers/net/netdevsim/Makefile   |   4 +
 drivers/net/netdevsim/ipsec.c| 345 +++
 drivers/net/netdevsim/netdev.c   |   7 +
 drivers/net/netdevsim/netdevsim.h|  37 
 tools/testing/selftests/net/rtnetlink.sh | 132 +++-
 5 files changed, 518 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/netdevsim/ipsec.c

-- 
2.7.4

[PATCH v3 net-next 3/4] netdevsim: add ipsec offload testing

2018-06-26 Thread Shannon Nelson

Implement the IPsec/XFRM offload API for testing.

Signed-off-by: Shannon Nelson 
---
V2 - addressed formatting comments from Jakub Kicinski
V3 - a couple more little xmas tree nits

 drivers/net/netdevsim/Makefile|   4 +
 drivers/net/netdevsim/ipsec.c | 297 ++
 drivers/net/netdevsim/netdev.c|   7 +
 drivers/net/netdevsim/netdevsim.h |  41 ++
 4 files changed, 349 insertions(+)
 create mode 100644 drivers/net/netdevsim/ipsec.c

diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index 449b2a1..0fee1d0 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -13,3 +13,7 @@ endif
 ifneq ($(CONFIG_NET_DEVLINK),)
 netdevsim-objs += devlink.o fib.o
 endif
+
+ifneq ($(CONFIG_XFRM_OFFLOAD),)
+netdevsim-objs += ipsec.o
+endif
diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
new file mode 100644
index 000..ceff544
--- /dev/null
+++ b/drivers/net/netdevsim/ipsec.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */
+
+#include 
+#include 
+#include 
+
+#include "netdevsim.h"
+
+#define NSIM_IPSEC_AUTH_BITS   128
+
+static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,
+   char __user *buffer,
+   size_t count, loff_t *ppos)
+{
+   struct netdevsim *ns = filp->private_data;
+   struct nsim_ipsec *ipsec = &ns->ipsec;
+   size_t bufsize;
+   char *buf, *p;
+   int len;
+   int i;
+
+   /* the buffer needed is
+* (num SAs * 3 lines each * ~60 bytes per line) + one more line
+*/
+   bufsize = (ipsec->count * 4 * 60) + 60;
+   buf = kzalloc(bufsize, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   p = buf;
+   p += snprintf(p, bufsize - (p - buf),
+ "SA count=%u tx=%u\n",
+ ipsec->count, ipsec->tx);
+
+   for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+   struct nsim_sa *sap = &ipsec->sa[i];
+
+   if (!sap->used)
+   continue;
+
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
+ i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
+ sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i]spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
+ i, be32_to_cpu(sap->xs->id.spi),
+ sap->xs->id.proto, sap->salt, sap->crypt);
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i]key=0x%08x %08x %08x %08x\n",
+ i, sap->key[0], sap->key[1],
+ sap->key[2], sap->key[3]);
+   }
+
+   len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);
+
+   kfree(buf);
+   return len;
+}
+
+static const struct file_operations ipsec_dbg_fops = {
+   .owner = THIS_MODULE,
+   .open = simple_open,
+   .read = nsim_dbg_netdev_ops_read,
+};
+
+static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
+{
+   u32 i;
+
+   if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
+   return -ENOSPC;
+
+   /* search sa table */
+   for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+   if (!ipsec->sa[i].used)
+   return i;
+   }
+
+   return -ENOSPC;
+}
+
+static int nsim_ipsec_parse_proto_keys(struct xfrm_state *xs,
+  u32 *mykey, u32 *mysalt)
+{
+   const char aes_gcm_name[] = "rfc4106(gcm(aes))";
+   struct net_device *dev = xs->xso.dev;
+   unsigned char *key_data;
+   char *alg_name = NULL;
+   int key_len;
+
+   if (!xs->aead) {
+   netdev_err(dev, "Unsupported IPsec algorithm\n");
+   return -EINVAL;
+   }
+
+   if (xs->aead->alg_icv_len != NSIM_IPSEC_AUTH_BITS) {
+   netdev_err(dev, "IPsec offload requires %d bit authentication\n",
+  NSIM_IPSEC_AUTH_BITS);
+   return -EINVAL;
+   }
+
+   key_data = &xs->aead->alg_key[0];
+   key_len = xs->aead->alg_key_len;
+   alg_name = xs->aead->alg_name;
+
+   if (strcmp(alg_name, aes_gcm_name)) {
+   netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n",
+  aes_gcm_name);
+   return -EINVAL;
+   }
+
+   /* 160 accounts for 16 byte key and 4 byte salt */
+   if (key_len > NSIM_IPSEC_AUTH_BITS) {
+   *mysalt = ((u32 *)key_data)[4];
+   } else if (key_len == NSIM_IPSEC_AUTH_BITS) {
+   *mysalt = 0;
+   } else {
+   netdev_err(dev, "IPsec hw offl

[PATCH v3 net-next 4/4] selftests: rtnetlink: add ipsec offload API test

2018-06-26 Thread Shannon Nelson

Using the netdevsim as a device for testing, try out the XFRM commands
for setting up IPsec hardware offloads.

Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 114 +++
 1 file changed, 114 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 15948cf..9e1a82e 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -608,6 +608,119 @@ kci_test_ipsec()
echo "PASS: ipsec"
 }
 
+#---
+# Example commands
+#   ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \
+#spi 0x07 mode transport reqid 0x07 replay-window 32 \
+#aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \
+#sel src 14.0.0.52/24 dst 14.0.0.70/24 \
+#offload dev sim1 dir out
+#   ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 \
+#tmpl proto esp src 14.0.0.52 dst 14.0.0.70 \
+#spi 0x07 mode transport reqid 0x07
+#
+#---
+kci_test_ipsec_offload()
+{
+   ret=0
+   algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+   srcip=192.168.123.3
+   dstip=192.168.123.4
+   dev=simx1
+   sysfsd=/sys/kernel/debug/netdevsim/$dev
+   sysfsf=$sysfsd/ipsec
+
+   # setup netdevsim since dummydev doesn't have offload support
+   modprobe netdevsim
+   check_err $?
+   if [ $ret -ne 0 ]; then
+   echo "FAIL: ipsec_offload can't load netdevsim"
+   return 1
+   fi
+
+   ip link add $dev type netdevsim
+   ip addr add $srcip dev $dev
+   ip link set $dev up
+   if [ ! -d $sysfsd ] ; then
+   echo "FAIL: ipsec_offload can't create device $dev"
+   return 1
+   fi
+   if [ ! -f $sysfsf ] ; then
+   echo "FAIL: ipsec_offload netdevsim doesn't support IPsec offload"
+   return 1
+   fi
+
+   # flush to be sure there's nothing configured
+   ip x s flush ; ip x p flush
+
+   # create offloaded SAs, both in and out
+   ip x p add dir out src $srcip/24 dst $dstip/24 \
+   tmpl proto esp src $srcip dst $dstip spi 9 \
+   mode transport reqid 42
+   check_err $?
+   ip x p add dir out src $dstip/24 dst $srcip/24 \
+   tmpl proto esp src $dstip dst $srcip spi 9 \
+   mode transport reqid 42
+   check_err $?
+
+   ip x s add proto esp src $srcip dst $dstip spi 9 \
+   mode transport reqid 42 $algo sel src $srcip/24 dst $dstip/24 \
+   offload dev $dev dir out
+   check_err $?
+   ip x s add proto esp src $dstip dst $srcip spi 9 \
+   mode transport reqid 42 $algo sel src $dstip/24 dst $srcip/24 \
+   offload dev $dev dir in
+   check_err $?
+   if [ $ret -ne 0 ]; then
+   echo "FAIL: ipsec_offload can't create SA"
+   return 1
+   fi
+
+   # does offload show up in ip output
+   lines=`ip x s list | grep -c "crypto offload parameters: dev $dev dir"`
+   if [ $lines -ne 2 ] ; then
+   echo "FAIL: ipsec_offload SA offload missing from list output"
+   check_err 1
+   fi
+
+   # use ping to exercise the Tx path
+   ping -I $dev -c 3 -W 1 -i 0 $dstip >/dev/null
+
+   # does driver have correct offload info
+   diff $sysfsf - << EOF
+SA count=2 tx=3
+sa[0] tx ipaddr=0x00000000 00000000 00000000 00000000
+sa[0]spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[0]key=0x34333231 38373635 32313039 36353433
+sa[1] rx ipaddr=0x00000000 00000000 00000000 037ba8c0
+sa[1]spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[1]key=0x34333231 38373635 32313039 36353433
+EOF
+   if [ $? -ne 0 ] ; then
+   echo "FAIL: ipsec_offload incorrect driver data"
+   check_err 1
+   fi
+
+   # does offload get removed from driver
+   ip x s flush
+   ip x p flush
+   lines=`grep -c "SA count=0" $sysfsf`
+   if [ $lines -ne 1 ] ; then
+   echo "FAIL: ipsec_offload SA not removed from driver"
+   check_err 1
+   fi
+
+   # clean up any leftovers
+   ip link del $dev
+   rmmod netdevsim
+
+   if [ $ret -ne 0 ]; then
+   echo "FAIL: ipsec_offload"
+   return 1
+   fi
+   echo "PASS: ipsec_offload"
+}
+
 kci_test_gretap()
 {
testns="testns"
@@ -862,6 +975,7 @@ kci_test_rtnl()
kci_test_encap
kci_test_macsec
kci_test_ipsec
+   kci_test_ipsec_offload
 
kci_del_dummy
 }
-- 
2.7.4
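
The expected key and salt values in the test above can be derived from
the key string by hand. The sketch below shows that arithmetic. It is
inferred from the "16 byte key and 4 byte salt" comment and the expected
debugfs output, not copied from the driver, whose parsing code is cut
off in this archive. It assumes the 20-byte blob is read as
little-endian 32-bit words, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_WORDS 4

/* Read four bytes as a little-endian 32-bit word, regardless of host. */
static uint32_t load_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Split a key blob into a 128-bit key and an optional 32-bit salt.
 * Returns 0 on success, -1 if the blob is shorter than 16 bytes. */
static int split_key_salt(const unsigned char *blob, size_t len,
			  uint32_t key[KEY_WORDS], uint32_t *salt)
{
	size_t i;

	assert(blob && key && salt);

	if (len < 16)
		return -1;

	for (i = 0; i < KEY_WORDS; i++)
		key[i] = load_le32(blob + 4 * i);

	/* The salt, when present, is the four bytes after the key. */
	*salt = (len >= 20) ? load_le32(blob + 16) : 0;
	return 0;
}
```

Feeding it the ASCII string 1234567890123456dcba gives key words
0x34333231 0x38373635 0x32313039 0x36353433 and salt 0x61626364, which
matches the key= and salt= fields the test expects.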

Re: [PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child

2018-06-26 Thread Yuchung Cheng

On Tue, Jun 26, 2018 at 8:45 AM, Eric Dumazet  wrote:
> Signed-off-by: Eric Dumazet 
> ---
nice refactor!
Acked-by: Yuchung Cheng 
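
For readers skimming the large diff below, the shape of the change is
the usual guard-clause refactor. The sketch is a generic illustration
with made-up names (struct child, setup_nested, setup_flat) and
arbitrary field values. It is not the TCP code, and the new side of the
hunk is cut off in this archive, so the exact form of the early return
there is an assumption.

```c
#include <assert.h>
#include <stddef.h>

struct child {
	int cwnd;
	int ready;
};

/* Before: the whole body is nested under the success check. */
static struct child *setup_nested(struct child *c)
{
	if (c) {
		c->cwnd = 10;
		c->ready = 1;
	}
	return c;
}

/* After: bail out early, so the body sits one level shallower. */
static struct child *setup_flat(struct child *c)
{
	if (!c)
		return NULL;

	c->cwnd = 10;
	c->ready = 1;
	return c;
}

/* Both forms behave identically for NULL and non-NULL input. */
static void demo(void)
{
	struct child a = {0, 0}, b = {0, 0};

	assert(setup_nested(NULL) == NULL);
	assert(setup_flat(NULL) == NULL);
	assert(setup_nested(&a) == &a);
	assert(setup_flat(&b) == &b);
	assert(a.cwnd == b.cwnd && a.ready == b.ready);
}
```
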

>  net/ipv4/tcp_minisocks.c | 223 ---
>  1 file changed, 113 insertions(+), 110 deletions(-)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 1dda1341a223937580b4efdbedb21ae50b221ff7..dac5893a52b4520d86ed2fcadbfb561a559fcd3d 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -449,119 +449,122 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>   struct sk_buff *skb)
>  {
> struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
> -
> -   if (newsk) {
> -   const struct inet_request_sock *ireq = inet_rsk(req);
> -   struct tcp_request_sock *treq = tcp_rsk(req);
> -   struct inet_connection_sock *newicsk = inet_csk(newsk);
> -   struct tcp_sock *newtp = tcp_sk(newsk);
> -   struct tcp_sock *oldtp = tcp_sk(sk);
> -
> -   smc_check_reset_syn_req(oldtp, req, newtp);
> -
> -   /* Now setup tcp_sock */
> -   newtp->pred_flags = 0;
> -
> -   newtp->rcv_wup = newtp->copied_seq =
> -   newtp->rcv_nxt = treq->rcv_isn + 1;
> -   newtp->segs_in = 1;
> -
> -   newtp->snd_sml = newtp->snd_una =
> -   newtp->snd_nxt = newtp->snd_up = treq->snt_isn + 1;
> -
> -   INIT_LIST_HEAD(&newtp->tsq_node);
> -   INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
> -
> -   tcp_init_wl(newtp, treq->rcv_isn);
> -
> -   newtp->srtt_us = 0;
> -   newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
> -   minmax_reset(&newtp->rtt_min, tcp_jiffies32, ~0U);
> -   newicsk->icsk_rto = TCP_TIMEOUT_INIT;
> -   newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
> -
> -   newtp->packets_out = 0;
> -   newtp->retrans_out = 0;
> -   newtp->sacked_out = 0;
> -   newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
> -   newtp->tlp_high_seq = 0;
> -   newtp->lsndtime = tcp_jiffies32;
> -   newsk->sk_txhash = treq->txhash;
> -   newtp->last_oow_ack_time = 0;
> -   newtp->total_retrans = req->num_retrans;
> -
> -   /* So many TCP implementations out there (incorrectly) count the
> -* initial SYN frame in their delayed-ACK and congestion control
> -* algorithms that we must have the following bandaid to talk
> -* efficiently to them.  -DaveM
> -*/
> -   newtp->snd_cwnd = TCP_INIT_CWND;
> -   newtp->snd_cwnd_cnt = 0;
> -
> -   /* There's a bubble in the pipe until at least the first ACK. */
> -   newtp->app_limited = ~0U;
> -
> -   tcp_init_xmit_timers(newsk);
> -   newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
> -
> -   newtp->rx_opt.saw_tstamp = 0;
> -
> -   newtp->rx_opt.dsack = 0;
> -   newtp->rx_opt.num_sacks = 0;
> -
> -   newtp->urg_data = 0;
> -
> -   if (sock_flag(newsk, SOCK_KEEPOPEN))
> -   inet_csk_reset_keepalive_timer(newsk,
> -  keepalive_time_when(newtp));
> -
> -   newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
> -   newtp->rx_opt.sack_ok = ireq->sack_ok;
> -   newtp->window_clamp = req->rsk_window_clamp;
> -   newtp->rcv_ssthresh = req->rsk_rcv_wnd;
> -   newtp->rcv_wnd = req->rsk_rcv_wnd;
> -   newtp->rx_opt.wscale_ok = ireq->wscale_ok;
> -   if (newtp->rx_opt.wscale_ok) {
> -   newtp->rx_opt.snd_wscale = ireq->snd_wscale;
> -   newtp->rx_opt.rcv_wscale = ireq->rcv_wscale;
> -   } else {
> -   newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 0;
> -   newtp->window_clamp = min(newtp->window_clamp, 65535U);
> -   }
> -   newtp->snd_wnd = (ntohs(tcp_hdr(skb)->window) <<
> - newtp->rx_opt.snd_wscale);
> -   newtp->max_window = newtp->snd_wnd;
> -
> -   if (newtp->rx_opt.tstamp_ok) {
> -   newtp->rx_opt.ts_recent = req->ts_recent;
> -   newtp->rx_opt.ts_recent_stamp = get_seconds();
> -   newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> -   } else {
> -   newtp->rx_opt.ts_recent_stamp = 0;
> -   newtp->tcp_header_len = sizeof(struct tcphdr);
> -   }
> -   newtp->tsoffset = treq->ts_off;
> +
