date:20180527

Re: [PATCH net] mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG

2018-05-27 Thread Ido Schimmel

On Mon, May 28, 2018 at 05:55:58AM +0200, Andrew Lunn wrote:
> > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
> > b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> > index ca38a30fbe91..adc6ab2cf429 100644
> > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> > @@ -4433,6 +4433,11 @@ static int 
> > mlxsw_sp_netdevice_port_upper_event(struct net_device *lower_dev,
> > NL_SET_ERR_MSG_MOD(extack, "Can not put a VLAN on an 
> > OVS port");
> > return -EINVAL;
> > }
> > +   if (is_vlan_dev(upper_dev) &&
> > +   vlan_dev_vlan_id(upper_dev) == 1) {
> > +   NL_SET_ERR_MSG_MOD(extack, "Creating a VLAN device with 
> > VID 1 is unsupported: VLAN 1 carries untagged traffic");
> > +   return -EINVAL;
> > +   }
> 
> Hi Ido
> 
> Would ENOTSUPP be a better return code. VLAN 1 is valid, you just
> don't support it.

OK, makes sense. We currently use EINVAL for such errors, but we can
convert to EOPNOTSUPP in net-next.

Thanks

Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-27 Thread Thomas Gleixner

On Mon, 28 May 2018, Christoph Hellwig wrote:

> On Mon, May 28, 2018 at 08:18:46AM +0200, Thomas Gleixner wrote:
> > > Which other two?  The boot optional removal patches?  They just remove
> > > the visible interface, but keep the implementation which is converted
> > > to the better mechanism here, so I think the order makes sense.
> > > But I might be missing something..
> > 
> > They remove the commandline switch before having the replacement in place
> > unless I'm misreading something.
> 
> The command line switch to force 32-bit dma is removed without
> replacement.   The PCI quirk for force the 32-bit dma for VIA bridges
> is kept in place, and switch to a different mechanism in this patch.

Fair enough

Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-27 Thread Christoph Hellwig

On Mon, May 28, 2018 at 08:18:46AM +0200, Thomas Gleixner wrote:
> > Which other two?  The boot optional removal patches?  They just remove
> > the visible interface, but keep the implementation which is converted
> > to the better mechanism here, so I think the order makes sense.
> > But I might be missing something..
> 
> They remove the commandline switch before having the replacement in place
> unless I'm misreading something.

The command line switch to force 32-bit dma is removed without
replacement.   The PCI quirk for force the 32-bit dma for VIA bridges
is kept in place, and switch to a different mechanism in this patch.

Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-27 Thread Thomas Gleixner

On Mon, 28 May 2018, Christoph Hellwig wrote:

> On Mon, May 28, 2018 at 08:10:40AM +0200, Thomas Gleixner wrote:
> > n Fri, 25 May 2018, Christoph Hellwig wrote:
> > 
> > x86/pci-dma: ...
> > 
> > Please
> > 
> > > Instead of globally disabling > 32bit DMA using the arch_dma_supported
> > > hook walk the PCI bus under the actually affected bridge and mark every
> > > device with the dma_32bit_limit flag.  This also gets rid of the
> > > arch_dma_supported hook entirely.
> > 
> > Shouldn't this go before the other two? 
> 
> Which other two?  The boot optional removal patches?  They just remove
> the visible interface, but keep the implementation which is converted
> to the better mechanism here, so I think the order makes sense.
> But I might be missing something..

They remove the commandline switch before having the replacement in place
unless I'm misreading something.

Thanks,

tglx

Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-27 Thread Christoph Hellwig

On Mon, May 28, 2018 at 08:10:40AM +0200, Thomas Gleixner wrote:
> n Fri, 25 May 2018, Christoph Hellwig wrote:
> 
> x86/pci-dma: ...
> 
> Please
> 
> > Instead of globally disabling > 32bit DMA using the arch_dma_supported
> > hook walk the PCI bus under the actually affected bridge and mark every
> > device with the dma_32bit_limit flag.  This also gets rid of the
> > arch_dma_supported hook entirely.
> 
> Shouldn't this go before the other two? 

Which other two?  The boot optional removal patches?  They just remove
the visible interface, but keep the implementation which is converted
to the better mechanism here, so I think the order makes sense.
But I might be missing something..

Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-27 Thread Thomas Gleixner

n Fri, 25 May 2018, Christoph Hellwig wrote:

x86/pci-dma: ...

Please

> Instead of globally disabling > 32bit DMA using the arch_dma_supported
> hook walk the PCI bus under the actually affected bridge and mark every
> device with the dma_32bit_limit flag.  This also gets rid of the
> arch_dma_supported hook entirely.

Shouldn't this go before the other two? 
 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Thomas Gleixner

Re: [PATCH 6/7] x86: remove the explicit nodac and allowdac option

2018-05-27 Thread Thomas Gleixner

On Fri, 25 May 2018, Christoph Hellwig wrote:

x86/pci-dma: ...

Please

> This is something drivers should decide (modulo chipset quirks like
> for VIA), which as far as I can tell is how things have been handled
> for the last 15 years.
> 
> Note that we keep the usedac option for now, as it is used in the wild
> to override the too generic VIA quirk.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Thomas Gleixner

Re: [PATCH 5/7] x86: remove the experimental forcesac boot option

2018-05-27 Thread Thomas Gleixner

On Fri, 25 May 2018, Christoph Hellwig wrote:

x86/pci-dma: ...

Please

> Limiting the dma mask to avoid PCI (pre-PCIe) DAC cycles while paying
> the huge overhead of an IOMMU is rather pointless, and this seriously
> gets in the way of dma mapping work.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Thomas Gleixner

[PATCH iproute2] net:sched: add action inheritdsfield to skbedit

2018-05-27 Thread Fu, Qiaobin

The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

Original idea by Jamal Hadi Salim 

Signed-off-by: Qiaobin Fu 
Reviewed-by: Michel Machado 
---

Note that the motivation for this patch is found in the following discussion:
https://www.spinics.net/lists/netdev/msg501061.html
---

diff --git a/include/uapi/linux/tc_act/tc_skbedit.h 
b/include/uapi/linux/tc_act/tc_skbedit.h
index fbcfe27..432ad2f 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -30,9 +30,11 @@
 #define SKBEDIT_F_MARK 0x4
 #define SKBEDIT_F_PTYPE0x8
 #define SKBEDIT_F_MASK 0x10
+#define SKBEDIT_F_INHERITDSFIELD   0x20
 
 struct tc_skbedit {
tc_gen;
+   __u64 flags;
 };
 
 enum {
diff --git a/tc/m_skbedit.c b/tc/m_skbedit.c
index db5c64c..7553a40 100644
--- a/tc/m_skbedit.c
+++ b/tc/m_skbedit.c
@@ -30,16 +30,18 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... skbedit <[QM] [PM] [MM] [PT]>\n"
+   fprintf(stderr, "Usage: ... skbedit <[QM] [PM] [MM] [PT] [IF]>\n"
"QM = queue_mapping QUEUE_MAPPING\n"
"PM = priority PRIORITY\n"
"MM = mark MARK\n"
"PT = ptype PACKETYPE\n"
+   "IF = inheritdsfield\n"
"PACKETYPE = is one of:\n"
"  host, otherhost, broadcast, multicast\n"
"QUEUE_MAPPING = device transmit queue to use\n"
"PRIORITY = classID to assign to priority field\n"
-   "MARK = firewall mark to set\n");
+   "MARK = firewall mark to set\n"
+   "note: inheritdsfield maps DS field to skb->priority\n");
 }
 
 static void
@@ -59,7 +61,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char 
***argv_p, int tca_id,
struct rtattr *tail;
unsigned int tmp;
__u16 queue_mapping, ptype;
-   __u32 flags = 0, priority, mark;
+   __u32 priority, mark;
struct tc_skbedit sel = { 0 };
 
if (matches(*argv, "skbedit") != 0)
@@ -69,7 +71,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char 
***argv_p, int tca_id,
 
while (argc > 0) {
if (matches(*argv, "queue_mapping") == 0) {
-   flags |= SKBEDIT_F_QUEUE_MAPPING;
+   sel.flags |= SKBEDIT_F_QUEUE_MAPPING;
NEXT_ARG();
if (get_unsigned(&tmp, *argv, 10) || tmp > 65535) {
fprintf(stderr, "Illegal queue_mapping\n");
@@ -78,7 +80,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char 
***argv_p, int tca_id,
queue_mapping = tmp;
ok++;
} else if (matches(*argv, "priority") == 0) {
-   flags |= SKBEDIT_F_PRIORITY;
+   sel.flags |= SKBEDIT_F_PRIORITY;
NEXT_ARG();
if (get_tc_classid(&priority, *argv)) {
fprintf(stderr, "Illegal priority\n");
@@ -86,7 +88,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char 
***argv_p, int tca_id,
}
ok++;
} else if (matches(*argv, "mark") == 0) {
-   flags |= SKBEDIT_F_MARK;
+   sel.flags |= SKBEDIT_F_MARK;
NEXT_ARG();
if (get_u32(&mark, *argv, 0)) {
fprintf(stderr, "Illegal mark\n");
@@ -109,7 +111,10 @@ parse_skbedit(struct action_util *a, int *argc_p, char 
***argv_p, int tca_id,
*argv);
return -1;
}
-   flags |= SKBEDIT_F_PTYPE;
+   sel.flags |= SKBEDIT_F_PTYPE;
+   ok++;
+   } else if (matches(*argv, "inheritdsfield") == 0) {
+   sel.flags |= SKBEDIT_F_INHERITDSFIELD;
ok++;
} else if (matches(*argv, "help") == 0) {
usage();
@@ -144,16 +149,16 @@ parse_skbedit(struct action_util *a, int *argc_p, char 
***argv_p, int tca_id,
 
tail = addattr_nest(n, MAX_MSG, tca_id);
addattr_l(n, MAX_MSG, TCA_SKBEDIT_PARMS, &sel, sizeof(sel));
-   if (flags & SKBEDIT_F_QUEUE_MAPPING)
+   if (sel.flags & SKBEDIT_F_QUEUE_MAPPING)
addattr_l(n, MAX_MSG, TCA_SKBEDIT_QUEUE_MAPPING,
  &queue_mapping, sizeof(queue_mapping));
-   if (flags & SKBEDIT_F_PRIORITY)
+   if (sel.flags & SKBEDIT_F_PRIORITY)
addattr_l(n, MAX_MSG, TCA_SKBEDIT_PRIORITY,
  &priority, sizeof(priority));
-   if (flags & SKBEDIT_F_MARK)
+   if (sel.flags & SKBEDIT_F_MARK)

[PATCH v2 net-next] net:sched: add action inheritdsfield to skbedit

2018-05-27 Thread Fu, Qiaobin

The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

Original idea by Jamal Hadi Salim 

Signed-off-by: Qiaobin Fu 
Reviewed-by: Michel Machado 
---

Note that the motivation for this patch is found in the following discussion:
https://www.spinics.net/lists/netdev/msg501061.html
---

diff --git a/include/uapi/linux/tc_act/tc_skbedit.h 
b/include/uapi/linux/tc_act/tc_skbedit.h
index fbcfe27..432ad2f 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -30,9 +30,11 @@
 #define SKBEDIT_F_MARK 0x4
 #define SKBEDIT_F_PTYPE0x8
 #define SKBEDIT_F_MASK 0x10
+#define SKBEDIT_F_INHERITDSFIELD   0x20
 
 struct tc_skbedit {
tc_gen;
+   __u64 flags;
 };
 
 enum {
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 6138d1d7..c3e9d03 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -23,6 +23,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -39,8 +42,32 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
tcf_lastuse_update(&d->tcf_tm);
bstats_update(&d->tcf_bstats, skb);
 
-   if (d->flags & SKBEDIT_F_PRIORITY)
-   skb->priority = d->priority;
+   if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
+   int wlen = skb_network_offset(skb);
+
+   switch (tc_skb_protocol(skb)) {
+   case htons(ETH_P_IP):
+   wlen += sizeof(struct iphdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+   break;
+
+   case htons(ETH_P_IPV6):
+   wlen += sizeof(struct ipv6hdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+   break;
+
+   default:
+   goto default_priority;
+   }
+   } else {
+default_priority:
+   if (d->flags & SKBEDIT_F_PRIORITY)
+   skb->priority = d->priority;
+   }
if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
skb->dev->real_num_tx_queues > d->queue_mapping)
skb_set_queue_mapping(skb, d->queue_mapping);
@@ -53,6 +80,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
 
spin_unlock(&d->tcf_lock);
return d->tcf_action;
+
+err:
+   spin_unlock(&d->tcf_lock);
+   return TC_ACT_SHOT;
 }
 
 static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
@@ -115,6 +146,8 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
}
 
parm = nla_data(tb[TCA_SKBEDIT_PARMS]);
+   if (parm->flags & SKBEDIT_F_INHERITDSFIELD)
+   flags |= SKBEDIT_F_INHERITDSFIELD;
 
exists = tcf_idr_check(tn, parm->index, a, bind);
if (exists && bind)

[PATCH net] be2net: Fix error detection logic for BE3

2018-05-27 Thread Suresh Reddy

Check for 0xE00 (RECOVERABLE_ERR) along with ARMFW UE (0x0)
in be_detect_error() to know whether the error is valid error or not

Fixes: 673c96e5a ("be2net: Fix UE detection logic for BE3")
Signed-off-by: Suresh Reddy 
---
 drivers/net/ethernet/emulex/benet/be_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index c697e79..8f75500 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -3309,7 +3309,9 @@ void be_detect_error(struct be_adapter *adapter)
if ((val & POST_STAGE_FAT_LOG_START)
 != POST_STAGE_FAT_LOG_START &&
(val & POST_STAGE_ARMFW_UE)
-!= POST_STAGE_ARMFW_UE)
+!= POST_STAGE_ARMFW_UE &&
+   (val & POST_STAGE_RECOVERABLE_ERR)
+!= POST_STAGE_RECOVERABLE_ERR)
return;
}
 
-- 
1.8.3.1

Re: [PATCH v2] Revert "alx: remove WoL support"

2018-05-27 Thread AceLan Kao

Hi all,

Just inform you a news reported by a user who confirmed the wake up
issue can't be reproduce by the new kernel.
https://bugzilla.kernel.org/show_bug.cgi?id=61651#c126


Guillaume de Jabrun 2018-05-27 15:12:54 UTC

I am using this patch for a long time. I was experiencing the "wake up
twice" bug, but with recent kernel version, I don't have this issue
anymore.
I can't tell which version actually fix the bug (I don't remember..).


Best regards,
AceLan Kao.

2018-05-21 11:18 GMT+08:00 David Miller :
> From: AceLan Kao 
> Date: Mon, 21 May 2018 11:14:00 +0800
>
>> We are willing to fix the issue, but we don't have a machine to
>> reproduce it, and the WoL feature has been removed 5 years ago, it's
>> hard to find those buggy machines.
>
> Have you bothered to ask the person who did the revert?
>
>> WoL is a feature that is only used by a very small group of people,
>> and the wake up issue looks like only happens on some
>> platforms. Which means only small part of the group of people are
>> affected.
>
> One of those people was the wireless networking stack maintainer.
>
>> So, it's not a serious issue worth to remove it from alx driver.
>
> I disagree.
>
> You must fix the regression solved by the revert, before adding
> WoL support back to the driver.
>
> I'm not going to say this again.
>
> Thank you.
>

[PATCH bpf v2 5/5] selftests/bpf: test_sockmap, print additional test options

2018-05-27 Thread Prashant Bhole

Print values of test options like apply, cork, start, end so that
individual failed tests can be identified for manual run

Acked-by: John Fastabend 
Signed-off-by: Prashant Bhole 
---
 tools/testing/selftests/bpf/test_sockmap.c | 28 +++---
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index e8faf4596dac..7970d48c4acb 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -869,6 +869,8 @@ static char *test_to_str(int test)
 #define OPTSTRING 60
 static void test_options(char *options)
 {
+   char tstr[OPTSTRING];
+
memset(options, 0, OPTSTRING);
 
if (txmsg_pass)
@@ -881,14 +883,22 @@ static void test_options(char *options)
strncat(options, "redir_noisy,", OPTSTRING);
if (txmsg_drop)
strncat(options, "drop,", OPTSTRING);
-   if (txmsg_apply)
-   strncat(options, "apply,", OPTSTRING);
-   if (txmsg_cork)
-   strncat(options, "cork,", OPTSTRING);
-   if (txmsg_start)
-   strncat(options, "start,", OPTSTRING);
-   if (txmsg_end)
-   strncat(options, "end,", OPTSTRING);
+   if (txmsg_apply) {
+   snprintf(tstr, OPTSTRING, "apply %d,", txmsg_apply);
+   strncat(options, tstr, OPTSTRING);
+   }
+   if (txmsg_cork) {
+   snprintf(tstr, OPTSTRING, "cork %d,", txmsg_cork);
+   strncat(options, tstr, OPTSTRING);
+   }
+   if (txmsg_start) {
+   snprintf(tstr, OPTSTRING, "start %d,", txmsg_start);
+   strncat(options, tstr, OPTSTRING);
+   }
+   if (txmsg_end) {
+   snprintf(tstr, OPTSTRING, "end %d,", txmsg_end);
+   strncat(options, tstr, OPTSTRING);
+   }
if (txmsg_ingress)
strncat(options, "ingress,", OPTSTRING);
if (txmsg_skb)
@@ -897,7 +907,7 @@ static void test_options(char *options)
 
 static int __test_exec(int cgrp, int test, struct sockmap_options *opt)
 {
-   char *options = calloc(60, sizeof(char));
+   char *options = calloc(OPTSTRING, sizeof(char));
int err;
 
if (test == SENDPAGE)
-- 
2.17.0

[PATCH bpf v2 4/5] selftests/bpf: test_sockmap, fix data verification

2018-05-27 Thread Prashant Bhole

When data verification is enabled, some tests fail because verification is done
incorrectly. Following changes fix it.

- Identify the size of data block to be verified
- Reset verification counter when data block size is reached
- Fixed the value printed in case of verfication failure

Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
Acked-by: John Fastabend 
Signed-off-by: Prashant Bhole 
---
 tools/testing/selftests/bpf/test_sockmap.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index f79397513362..e8faf4596dac 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -337,8 +337,15 @@ static int msg_loop(int fd, int iov_count, int iov_length, 
int cnt,
int fd_flags = O_NONBLOCK;
struct timeval timeout;
float total_bytes;
+   int bytes_cnt = 0;
+   int chunk_sz;
fd_set w;
 
+   if (opt->sendpage)
+   chunk_sz = iov_length * cnt;
+   else
+   chunk_sz = iov_length * iov_count;
+
fcntl(fd, fd_flags);
total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
err = clock_gettime(CLOCK_MONOTONIC, &s->start);
@@ -388,9 +395,14 @@ static int msg_loop(int fd, int iov_count, int iov_length, 
int cnt,
errno = -EIO;
fprintf(stderr,
"detected data 
corruption @iov[%i]:%i %02x != %02x, %02x ?= %02x\n",
-   i, j, d[j], k - 
1, d[j+1], k + 1);
+   i, j, d[j], k - 
1, d[j+1], k);
goto out_errno;
}
+   bytes_cnt++;
+   if (bytes_cnt == chunk_sz) {
+   k = 0;
+   bytes_cnt = 0;
+   }
recv--;
}
}
-- 
2.17.0

[PATCH bpf v2 3/5] selftests/bpf: test_sockmap, fix test timeout

2018-05-27 Thread Prashant Bhole

In order to reduce runtime of tests, recently timout for select() call
was reduced from 1sec to 10usec. This was causing many tests failures.
It was caught with failure handling commits in this series.

Restoring the timeout from 10usec to 1sec

Fixes: a18fda1a62c3 ("bpf: reduce runtime of test_sockmap tests")
Signed-off-by: Prashant Bhole 
---
 tools/testing/selftests/bpf/test_sockmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 8a81ea0e9fb6..f79397513362 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -345,8 +345,8 @@ static int msg_loop(int fd, int iov_count, int iov_length, 
int cnt,
if (err < 0)
perror("recv start time: ");
while (s->bytes_recvd < total_bytes) {
-   timeout.tv_sec = 0;
-   timeout.tv_usec = 10;
+   timeout.tv_sec = 1;
+   timeout.tv_usec = 0;
 
/* FD sets */
FD_ZERO(&w);
-- 
2.17.0

[PATCH bpf v2 2/5] selftests/bpf: test_sockmap, join cgroup in selftest mode

2018-05-27 Thread Prashant Bhole

In case of selftest mode, temporary cgroup environment is created but
cgroup is not joined. It causes test failures. Fixed by joining the
cgroup

Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
Acked-by: John Fastabend 
Signed-off-by: Prashant Bhole 
---
 tools/testing/selftests/bpf/test_sockmap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 34feb74c95c4..8a81ea0e9fb6 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -1342,6 +1342,11 @@ static int __test_suite(char *bpf_file)
return cg_fd;
}
 
+   if (join_cgroup(CG_PATH)) {
+   fprintf(stderr, "ERROR: failed to join cgroup\n");
+   return -EINVAL;
+   }
+
/* Tests basic commands and APIs with range of iov values */
txmsg_start = txmsg_end = 0;
err = test_txmsg(cg_fd);
-- 
2.17.0

[PATCH bpf v2 1/5] selftests/bpf: test_sockmap, check test failure

2018-05-27 Thread Prashant Bhole

Test failures are not identified because exit code of RX/TX threads
is not checked. Also threads are not returning correct exit code.

- Return exit code from threads depending on test execution status
- In main thread, check the exit code of RX/TX threads

Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
Signed-off-by: Prashant Bhole 
---
 tools/testing/selftests/bpf/test_sockmap.c | 25 --
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index eb17fae458e6..34feb74c95c4 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -429,8 +429,8 @@ static int sendmsg_test(struct sockmap_options *opt)
struct msg_stats s = {0};
int iov_count = opt->iov_count;
int iov_buf = opt->iov_length;
+   int rx_status, tx_status;
int cnt = opt->rate;
-   int status;
 
errno = 0;
 
@@ -442,7 +442,7 @@ static int sendmsg_test(struct sockmap_options *opt)
rxpid = fork();
if (rxpid == 0) {
if (opt->drop_expected)
-   exit(1);
+   exit(0);
 
if (opt->sendpage)
iov_count = 1;
@@ -463,7 +463,7 @@ static int sendmsg_test(struct sockmap_options *opt)
"rx_sendmsg: TX: %zuB %fB/s %fGB/s RX: %zuB 
%fB/s %fGB/s\n",
s.bytes_sent, sent_Bps, sent_Bps/giga,
s.bytes_recvd, recvd_Bps, recvd_Bps/giga);
-   exit(1);
+   exit(err ? 1 : 0);
} else if (rxpid == -1) {
perror("msg_loop_rx: ");
return errno;
@@ -491,14 +491,27 @@ static int sendmsg_test(struct sockmap_options *opt)
"tx_sendmsg: TX: %zuB %fB/s %f GB/s RX: %zuB 
%fB/s %fGB/s\n",
s.bytes_sent, sent_Bps, sent_Bps/giga,
s.bytes_recvd, recvd_Bps, recvd_Bps/giga);
-   exit(1);
+   exit(err ? 1 : 0);
} else if (txpid == -1) {
perror("msg_loop_tx: ");
return errno;
}
 
-   assert(waitpid(rxpid, &status, 0) == rxpid);
-   assert(waitpid(txpid, &status, 0) == txpid);
+   assert(waitpid(rxpid, &rx_status, 0) == rxpid);
+   assert(waitpid(txpid, &tx_status, 0) == txpid);
+   if (WIFEXITED(rx_status)) {
+   err = WEXITSTATUS(rx_status);
+   if (err) {
+   fprintf(stderr, "rx thread exited with err %d. ", err);
+   goto out;
+   }
+   }
+   if (WIFEXITED(tx_status)) {
+   err = WEXITSTATUS(tx_status);
+   if (err)
+   fprintf(stderr, "tx thread exited with err %d. ", err);
+   }
+out:
return err;
 }
 
-- 
2.17.0

[PATCH bpf v2 0/5] fix test_sockmap

2018-05-27 Thread Prashant Bhole

This series fixes error handling, timeout and data verification in
test_sockmap. Previously it was not able to detect failure/timeout in
RX/TX thread because error was not notified to the main thread.

Also slightly improved test output by printing parameter values (cork,
apply, start, end) so that parameters for all tests are displayed.

Prashant Bhole (5):
  selftests/bpf: test_sockmap, check test failure
  selftests/bpf: test_sockmap, join cgroup in selftest mode
  selftests/bpf: test_sockmap, fix test timeout
  selftests/bpf: test_sockmap, fix data verification
  selftests/bpf: test_sockmap, print additional test options

 tools/testing/selftests/bpf/test_sockmap.c | 76 +-
 1 file changed, 58 insertions(+), 18 deletions(-)

-- 
2.17.0

Re: [PATCH net] tun: Fix NULL pointer dereference in XDP redirect

2018-05-27 Thread Toshiaki Makita

On 2018/05/28 11:24, Jason Wang wrote:
> On 2018年05月25日 21:43, Toshiaki Makita wrote:
> 
> [...]
> 
 @@ -1917,16 +1923,22 @@ static ssize_t tun_get_user(struct
 tun_struct *tun, struct tun_file *tfile,
   struct bpf_prog *xdp_prog;
   int ret;
 +    local_bh_disable();
 +    preempt_disable();
   rcu_read_lock();
   xdp_prog = rcu_dereference(tun->xdp_prog);
   if (xdp_prog) {
   ret = do_xdp_generic(xdp_prog, skb);
   if (ret != XDP_PASS) {
   rcu_read_unlock();
 +    preempt_enable();
 +    local_bh_enable();
   return total_len;
   }
   }
   rcu_read_unlock();
 +    preempt_enable();
 +    local_bh_enable();
   }
   rcu_read_lock();
>>>
>>> Good catch, thanks.
>>>
>>> But I think we can just replace preempt_disable()/enable() with
>>> local_bh_disable()/local_bh_enable() ?
>>
>> I actually thought the same, but noticed this patch.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4c380066fbe
>>
>>
>> It looks like they do not think local_bh_disable() implies
>> preempt_disable(). But I'm not sure why..
>>
>> Toshiaki Makita 
> 
> I see, there're probably have some subtle differences and implications
> for e.g scheduler or others.
> 
> What we what here is to make sure the process is not moved to another
> CPU and bh is enabled. By checking preemptible() function, preemption
> should be disabled after local_bh_disable(). So I think we're safe here.

OK. I checked retint_kernel which IIUC is the entry point of preemption
process on x86, and confirmed it just checks if __preempt_count is zero.

I haven't checked other archs but I was probably worried too much.
Will send v2.

Thanks,
Toshiaki Makita

Re: [PATCH bpf-next 0/5] fix test_sockmap

2018-05-27 Thread Prashant Bhole


On 5/25/2018 11:01 PM, John Fastabend wrote:

On 05/25/2018 01:28 AM, Prashant Bhole wrote:



On 5/24/2018 1:58 PM, John Fastabend wrote:

On 05/23/2018 09:47 PM, Prashant Bhole wrote:



On 5/23/2018 6:44 PM, Prashant Bhole wrote:



On 5/22/2018 2:08 AM, John Fastabend wrote:

On 05/20/2018 10:13 PM, Prashant Bhole wrote:



On 5/19/2018 1:42 AM, John Fastabend wrote:

On 05/18/2018 12:17 AM, Prashant Bhole wrote:

This series fixes bugs in test_sockmap code. They weren't caught
previously because failure in RX/TX thread was not notified to the
main thread.

Also fixed data verification logic and slightly improved test
output
such that parameters values (cork, apply, start, end) of failed
test
can be easily seen.



Great, this was on my list so thanks for taking care of it.


Note: Even after fixing above problems there are issues with tests
which set cork parameter. Tests fail (RX thread timeout) when cork
value is non-zero and overall data sent by TX thread isn't
multiples
of cork value.



This is expected. When 'cork' is set the sender should only xmit
the data when 'cork' bytes are available. If the user doesn't
provide the N bytes the data is cork'ed waiting for the bytes and
if the socket is closed the state is cleaned up. What these tests
are testing is the cleanup path when a user doesn't provide the
N bytes. In practice this is used to validate headers and prevent
users from sending partial headers. We want to keep these tests
because
they verify a tear-down path in the code.


Ok.



After your changes do these get reported as failures? If so we
need to account for the above in the calculations.


Yes, cork related test are reported as failures because of RX thread
timeout.

So with your above description, I think we need to differentiate cork
tests with partial data and full data. In partial data test we can
have
something like "timeout_expected" flag. Any other way to fix it?



Adding a flag seems reasonable to me. Lets do this for now. Also I
plan to add more negative tests so we can either use the same
flag or a new one for those cases as well.



John,
I worked on this for some time and noticed that the RX-timeout of
tests with cork parameter is dependent on various parameters. So we
can not set a flag like the way 'drop_expected' flag is set before
executing the test.

So I decided to write a function which judges all parameters before
each test and decides whether a test with cork parameter will
timeout or not. Then the conditions in the function became
complicated. For example some tests fail if opt->rate < 17 (with
some other conditions). Here is 17 is related to FRAGS_PER_SKB.
Consider following two examples.

I'm sorry. Correction: s/FRAGS_PER_SKB/MAX_SKB_FRAGS/



./test_sockmap --cgroup /mnt/cgroup2 -r 16 -i 1 -l 30 -t sendpage
--txmsg --txmsg_cork 1024   # RX timeout occurs

./test_sockmap --cgroup /mnt/cgroup2 -r 17 -i 1 -l 30 -t sendpage
--txmsg --txmsg_cork 1024   # Success!



Ah yes this hits the buffer limit and flushes the queue. The kernel
side doesn't know how to merge those specific sendpage requests so
it gives each request its own buffer and when the limit is reached
we flush it.


Do we need to keep such tests? if yes, then I will continue with
adding such conditions in the function.



Yes, these tests are needed because they are testing the edge cases.
These are probably the most important tests because my normal usage
will catch any issues in the "good" cases its these types of things
that can go unnoticed (at least for a short while) if we don't have
specific tests for them.


I tried but it is difficult to come up with a right set of conditions
which lead to test failure.



Agreed, it can be yes. How about adding your logic for all tests except
"cork" cases. If there is a flag to set if the timeout is expected we
can always manually set it in the test invocation. Might not be as
nice as automatically learning the expected results but possibly easier
than building some complicated logic to figure it out.

Would you mind submitting your series again without the "cork" tests
being tracked? And if you want add a bit to tell if the "cork" tests are
going to timeout or not setting it per test manually. But I think
your series can just omit the cork test for now and still be useful.


Ok. I will submit the series again. Without any change in actual 
patches, but cover letter reorganized. Thanks.


-Prashant

Re: [PATCH net] mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG

2018-05-27 Thread Andrew Lunn

> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> index ca38a30fbe91..adc6ab2cf429 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
> @@ -4433,6 +4433,11 @@ static int mlxsw_sp_netdevice_port_upper_event(struct 
> net_device *lower_dev,
>   NL_SET_ERR_MSG_MOD(extack, "Can not put a VLAN on an 
> OVS port");
>   return -EINVAL;
>   }
> + if (is_vlan_dev(upper_dev) &&
> + vlan_dev_vlan_id(upper_dev) == 1) {
> + NL_SET_ERR_MSG_MOD(extack, "Creating a VLAN device with 
> VID 1 is unsupported: VLAN 1 carries untagged traffic");
> + return -EINVAL;
> + }

Hi Ido

Would ENOTSUPP be a better return code. VLAN 1 is valid, you just
don't support it.

Andrew

Re: [PATCH] drivers/net/phy/micrel: Fix for PHY KSZ8061 errrata: Potential link-up failure when Ethernet cable is connected slowly

2018-05-27 Thread Andrew Lunn

> This e-mail (including any attachments) is confidential and may be
> legally privileged. If you are not an intended recipient or an
> authorized representative of an intended recipient, you are
> prohibited from using, copying or distributing the information in
> this e-mail or its attachments. If you have received this e-mail in
> error, please notify the sender immediately by return e-mail and
> delete all copies of this message and any attachments. Thank you.

Once this has been removed, i will have a comment...

 Andrew

Re: [PATCH net] tun: Fix NULL pointer dereference in XDP redirect

2018-05-27 Thread Jason Wang




On 2018年05月25日 21:43, Toshiaki Makita wrote:

[...]

@@ -1917,16 +1923,22 @@ static ssize_t tun_get_user(struct 
tun_struct *tun, struct tun_file *tfile,

  struct bpf_prog *xdp_prog;
  int ret;
+    local_bh_disable();
+    preempt_disable();
  rcu_read_lock();
  xdp_prog = rcu_dereference(tun->xdp_prog);
  if (xdp_prog) {
  ret = do_xdp_generic(xdp_prog, skb);
  if (ret != XDP_PASS) {
  rcu_read_unlock();
+    preempt_enable();
+    local_bh_enable();
  return total_len;
  }
  }
  rcu_read_unlock();
+    preempt_enable();
+    local_bh_enable();
  }
  rcu_read_lock();


Good catch, thanks.

But I think we can just replace preempt_disable()/enable() with 
local_bh_disable()/local_bh_enable() ?


I actually thought the same, but noticed this patch.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4c380066fbe 



It looks like they do not think local_bh_disable() implies 
preempt_disable(). But I'm not sure why..


Toshiaki Makita 


I see, there're probably have some subtle differences and implications 
for e.g scheduler or others.


What we what here is to make sure the process is not moved to another 
CPU and bh is enabled. By checking preemptible() function, preemption 
should be disabled after local_bh_disable(). So I think we're safe here.


Thanks

[PATCH bpf-next 01/11] bpf: test case for map pointer poison with calls/branches

2018-05-27 Thread Daniel Borkmann

Add several test cases where the same or different map pointers
originate from different paths in the program and execute a map
lookup or tail call at a common location.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/filter.h  |  10 ++
 tools/include/linux/filter.h|  10 ++
 tools/testing/selftests/bpf/test_verifier.c | 185 
 3 files changed, 178 insertions(+), 27 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d358d18..b443f70 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -289,6 +289,16 @@ struct xdp_buff;
.off   = OFF,   \
.imm   = 0 })
 
+/* Relative call */
+
+#define BPF_CALL_REL(TGT)  \
+   ((struct bpf_insn) {\
+   .code  = BPF_JMP | BPF_CALL,\
+   .dst_reg = 0,   \
+   .src_reg = BPF_PSEUDO_CALL, \
+   .off   = 0, \
+   .imm   = TGT })
+
 /* Function call */
 
 #define BPF_EMIT_CALL(FUNC)\
diff --git a/tools/include/linux/filter.h b/tools/include/linux/filter.h
index c5e512d..af55acf 100644
--- a/tools/include/linux/filter.h
+++ b/tools/include/linux/filter.h
@@ -263,6 +263,16 @@
 #define BPF_LD_MAP_FD(DST, MAP_FD) \
BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
 
+/* Relative call */
+
+#define BPF_CALL_REL(TGT)  \
+   ((struct bpf_insn) {\
+   .code  = BPF_JMP | BPF_CALL,\
+   .dst_reg = 0,   \
+   .src_reg = BPF_PSEUDO_CALL, \
+   .off   = 0, \
+   .imm   = TGT })
+
 /* Program exit */
 
 #define BPF_EXIT_INSN()\
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 4b4f015..7cb1d74 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -50,7 +50,7 @@
 
 #define MAX_INSNS  BPF_MAXINSNS
 #define MAX_FIXUPS 8
-#define MAX_NR_MAPS4
+#define MAX_NR_MAPS7
 #define POINTER_VALUE  0xcafe4all
 #define TEST_DATA_LEN  64
 
@@ -66,7 +66,9 @@ struct bpf_test {
int fixup_map1[MAX_FIXUPS];
int fixup_map2[MAX_FIXUPS];
int fixup_map3[MAX_FIXUPS];
-   int fixup_prog[MAX_FIXUPS];
+   int fixup_map4[MAX_FIXUPS];
+   int fixup_prog1[MAX_FIXUPS];
+   int fixup_prog2[MAX_FIXUPS];
int fixup_map_in_map[MAX_FIXUPS];
const char *errstr;
const char *errstr_unpriv;
@@ -2769,7 +2771,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .fixup_prog = { 1 },
+   .fixup_prog1 = { 1 },
.errstr_unpriv = "R3 leaks addr into helper",
.result_unpriv = REJECT,
.result = ACCEPT,
@@ -2856,7 +2858,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
-   .fixup_prog = { 1 },
+   .fixup_prog1 = { 1 },
.result = ACCEPT,
.retval = 42,
},
@@ -2870,7 +2872,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
-   .fixup_prog = { 1 },
+   .fixup_prog1 = { 1 },
.result = ACCEPT,
.retval = 41,
},
@@ -2884,7 +2886,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
-   .fixup_prog = { 1 },
+   .fixup_prog1 = { 1 },
.result = ACCEPT,
.retval = 1,
},
@@ -2898,7 +2900,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 2),
BPF_EXIT_INSN(),
},
-   .fixup_prog = { 1 },
+   .fixup_prog1 = { 1 },
.result = ACCEPT,
.retval = 2,
},
@@ -2912,7 +2914,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 2),
BPF_EXIT_INSN(),
},
-   .fixup_prog = { 1 },
+   .fixup_prog1 = { 1 },
.result = ACCEPT,
.retval = 2,
},
@@ -2926,7 +2928,7

[PATCH bpf-next 10/11] bpf: sync bpf uapi header with tools

2018-05-27 Thread Daniel Borkmann

Pull in recent changes from include/uapi/linux/bpf.h.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9b8c6e3..7108711 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2004,6 +2004,20 @@ union bpf_attr {
  * direct packet access.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * uint64_t bpf_skb_cgroup_id(struct sk_buff *skb)
+ * Description
+ * Return the cgroup v2 id of the socket associated with the *skb*.
+ * This is roughly similar to the **bpf_get_cgroup_classid**\ ()
+ * helper for cgroup v1 by providing a tag resp. identifier that
+ * can be matched on or used for map lookups e.g. to implement
+ * policy. The cgroup v2 id of a given path in the hierarchy is
+ * exposed in user space through the f_handle API in order to get
+ * to the same 64-bit id.
+ *
+ * This helper can be used on TC egress path, but not on ingress.
+ * Return
+ * The id is returned or 0 in case the id could not be retrieved.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2082,7 +2096,8 @@ union bpf_attr {
FN(lwt_push_encap), \
FN(lwt_seg6_store_bytes),   \
FN(lwt_seg6_adjust_srh),\
-   FN(lwt_seg6_action),
+   FN(lwt_seg6_action),\
+   FN(skb_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2199,7 +2214,7 @@ struct bpf_tunnel_key {
};
__u8 tunnel_tos;
__u8 tunnel_ttl;
-   __u16 tunnel_ext;
+   __u16 tunnel_ext;   /* Padding, future use. */
__u32 tunnel_label;
 };
 
@@ -2210,6 +2225,7 @@ struct bpf_xfrm_state {
__u32 reqid;
__u32 spi;  /* Stored in network byte order */
__u16 family;
+   __u16 ext;  /* Padding, future use. */
union {
__u32 remote_ipv4;  /* Stored in network byte order */
__u32 remote_ipv6[4];   /* Stored in network byte order */
-- 
2.9.5

Re: [PATCH v4 net] stmmac: 802.1ad tag stripping support fix

2018-05-27 Thread Toshiaki Makita

On 2018/05/27 4:24, Elad Nachman wrote:
> stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before 
> calling napi_gro_receive().
> 
> The function assumes VLAN tagged frames are always tagged with 802.1Q 
> protocol,
> and assigns ETH_P_8021Q to the skb by hard-coding the parameter on call to 
> __vlan_hwaccel_put_tag() .
> 
> This causes packets not to be passed to the VLAN slave if it was created with 
> 802.1AD protocol
> (ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).
> 
> This fix passes the protocol from the VLAN header into 
> __vlan_hwaccel_put_tag()
> instead of using the hard-coded value of ETH_P_8021Q.
> NETIF_F_HW_VLAN_CTAG_RX check was removed to be in line with the driver 
> actual abilities.
> 
> Signed-off-by: Elad Nachman 
> 

I might have not been clear enough but you still need this hunk with
this change.

> - ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
> + ndev->features |= (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX);

Also I guess spaces are preferred around '|'. Would you check it by
checkpatch.pl?

-- 
Toshiaki Makita

[PATCH bpf-next 09/11] bpf: fix context access in tracing progs on 32 bit archs

2018-05-27 Thread Daniel Borkmann

Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
program type in test_verifier report the following errors on x86_32:

  172/p unpriv: spill/fill of different pointers ldx FAIL
  Unexpected error message!
  0: (bf) r6 = r10
  1: (07) r6 += -8
  2: (15) if r1 == 0x0 goto pc+3
  R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
  3: (bf) r2 = r10
  4: (07) r2 += -76
  5: (7b) *(u64 *)(r6 +0) = r2
  6: (55) if r1 != 0x0 goto pc+1
  R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 
fp-8=fp
  7: (7b) *(u64 *)(r6 +0) = r1
  8: (79) r1 = *(u64 *)(r6 +0)
  9: (79) r1 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

  378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (71) r0 = *(u8 *)(r1 +68)
  invalid bpf_context access off=68 size=1

  379/p check bpf_perf_event_data->sample_period half load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (69) r0 = *(u16 *)(r1 +68)
  invalid bpf_context access off=68 size=2

  380/p check bpf_perf_event_data->sample_period word load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (61) r0 = *(u32 *)(r1 +68)
  invalid bpf_context access off=68 size=4

  381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (79) r0 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte
boundary due to its size of 68 bytes.

Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that
off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the
case of sample_period access from struct bpf_perf_event_data, hence
verifier wrongly thinks we might be doing an unaligned access here.
Therefore adjust this down to machine size and check the offset for
narrow access on that basis.

We also need to fix pe_prog_is_valid_access(), since we hit the check
for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test.

Reported-by: Wang YanQing 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/filter.h   | 30 --
 kernel/trace/bpf_trace.c | 10 --
 2 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d407ede..89903d2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -639,16 +639,34 @@ static inline bool bpf_prog_was_classic(const struct 
bpf_prog *prog)
return prog->type == BPF_PROG_TYPE_UNSPEC;
 }
 
-static inline bool
-bpf_ctx_narrow_access_ok(u32 off, u32 size, const u32 size_default)
+static inline u32 bpf_ctx_off_adjust_machine(u32 size)
+{
+   const u32 size_machine = sizeof(unsigned long);
+
+   if (size > size_machine && size % size_machine == 0)
+   size = size_machine;
+
+   return size;
+}
+
+static inline bool bpf_ctx_narrow_align_ok(u32 off, u32 size_access,
+  u32 size_default)
 {
-   bool off_ok;
+   size_default = bpf_ctx_off_adjust_machine(size_default);
+   size_access  = bpf_ctx_off_adjust_machine(size_access);
+
 #ifdef __LITTLE_ENDIAN
-   off_ok = (off & (size_default - 1)) == 0;
+   return (off & (size_default - 1)) == 0;
 #else
-   off_ok = (off & (size_default - 1)) + size == size_default;
+   return (off & (size_default - 1)) + size_access == size_default;
 #endif
-   return off_ok && size <= size_default && (size & (size - 1)) == 0;
+}
+
+static inline bool
+bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default)
+{
+   return bpf_ctx_narrow_align_ok(off, size, size_default) &&
+  size <= size_default && (size & (size - 1)) == 0;
 }
 
 #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 81fdf2f..7269530 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -880,8 +880,14 @@ static bool pe_prog_is_valid_access(int off, int size, 
enum bpf_access_type type
return false;
if (type != BPF_READ)
return false;
-   if (off % size != 0)
-   return false;
+   if (off % size != 0) {
+   if (sizeof(unsigned long) != 4)
+   return false;
+   if (size != 8)
+   return false;
+   if (off % size != 4)
+   return false;
+   }
 
switch (off) {
case bpf_ctx_range(struct bpf_perf_event_data, sample_period):
-- 
2.9.5

[PATCH bpf-next 06/11] bpf: add bpf_skb_cgroup_id helper

2018-05-27 Thread Daniel Borkmann

Add a new bpf_skb_cgroup_id() helper that allows to retrieve the
cgroup id from the skb's socket. This is useful in particular to
enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in
cgroup v2 by allowing ID based matching on egress. This can in
particular be used in combination with applying policy e.g. from
map lookups, and also complements the older bpf_skb_under_cgroup()
interface. In user space the cgroup id for a given path can be
retrieved through the f_handle as demonstrated in [0] recently.

  [0] https://lkml.org/lkml/2018/5/22/1190

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 17 -
 net/core/filter.c| 29 +++--
 2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9b8c6e3..e2853aa 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2004,6 +2004,20 @@ union bpf_attr {
  * direct packet access.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * uint64_t bpf_skb_cgroup_id(struct sk_buff *skb)
+ * Description
+ * Return the cgroup v2 id of the socket associated with the *skb*.
+ * This is roughly similar to the **bpf_get_cgroup_classid**\ ()
+ * helper for cgroup v1 by providing a tag resp. identifier that
+ * can be matched on or used for map lookups e.g. to implement
+ * policy. The cgroup v2 id of a given path in the hierarchy is
+ * exposed in user space through the f_handle API in order to get
+ * to the same 64-bit id.
+ *
+ * This helper can be used on TC egress path, but not on ingress.
+ * Return
+ * The id is returned or 0 in case the id could not be retrieved.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2082,7 +2096,8 @@ union bpf_attr {
FN(lwt_push_encap), \
FN(lwt_seg6_store_bytes),   \
FN(lwt_seg6_adjust_srh),\
-   FN(lwt_seg6_action),
+   FN(lwt_seg6_action),\
+   FN(skb_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index acf1f4f..717c740 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3661,6 +3661,27 @@ static const struct bpf_func_proto 
bpf_skb_under_cgroup_proto = {
.arg3_type  = ARG_ANYTHING,
 };
 
+#ifdef CONFIG_SOCK_CGROUP_DATA
+BPF_CALL_1(bpf_skb_cgroup_id, const struct sk_buff *, skb)
+{
+   struct sock *sk = skb_to_full_sk(skb);
+   struct cgroup *cgrp;
+
+   if (!sk || !sk_fullsock(sk))
+   return 0;
+
+   cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+   return cgrp->kn->id.id;
+}
+
+static const struct bpf_func_proto bpf_skb_cgroup_id_proto = {
+   .func   = bpf_skb_cgroup_id,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+#endif
+
 static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
  unsigned long off, unsigned long len)
 {
@@ -4741,12 +4762,16 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_uid:
return &bpf_get_socket_uid_proto;
+   case BPF_FUNC_fib_lookup:
+   return &bpf_skb_fib_lookup_proto;
 #ifdef CONFIG_XFRM
case BPF_FUNC_skb_get_xfrm_state:
return &bpf_skb_get_xfrm_state_proto;
 #endif
-   case BPF_FUNC_fib_lookup:
-   return &bpf_skb_fib_lookup_proto;
+#ifdef CONFIG_SOCK_CGROUP_DATA
+   case BPF_FUNC_skb_cgroup_id:
+   return &bpf_skb_cgroup_id_proto;
+#endif
default:
return bpf_base_func_proto(func_id);
}
-- 
2.9.5

[PATCH bpf-next 07/11] bpf: make sure to clear unused fields in tunnel/xfrm state fetch

2018-05-27 Thread Daniel Borkmann

Since the remaining bits are not filled in struct bpf_tunnel_key
resp. struct bpf_xfrm_state and originate from uninitialized stack
space, we should make sure to clear them before handing control
back to the program.

Also add a padding element to struct bpf_xfrm_state for future use
similar as we have in struct bpf_tunnel_key and clear it as well.

  struct bpf_xfrm_state {
  __u32  reqid;/* 0 4 */
  __u32  spi;  /* 4 4 */
  __u16  family;   /* 8 2 */

  /* XXX 2 bytes hole, try to pack */

  union {
  __u32  remote_ipv4;  /*   4 */
  __u32  remote_ipv6[4];   /*  16 */
  };   /*1216 */

  /* size: 28, cachelines: 1, members: 4 */
  /* sum members: 26, holes: 1, sum holes: 2 */
  /* last cacheline: 28 bytes */
  };

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 3 ++-
 net/core/filter.c| 6 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2853aa..7108711 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2214,7 +2214,7 @@ struct bpf_tunnel_key {
};
__u8 tunnel_tos;
__u8 tunnel_ttl;
-   __u16 tunnel_ext;
+   __u16 tunnel_ext;   /* Padding, future use. */
__u32 tunnel_label;
 };
 
@@ -2225,6 +2225,7 @@ struct bpf_xfrm_state {
__u32 reqid;
__u32 spi;  /* Stored in network byte order */
__u16 family;
+   __u16 ext;  /* Padding, future use. */
union {
__u32 remote_ipv4;  /* Stored in network byte order */
__u32 remote_ipv6[4];   /* Stored in network byte order */
diff --git a/net/core/filter.c b/net/core/filter.c
index 717c740..5ceb5e6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3445,6 +3445,7 @@ BPF_CALL_4(bpf_skb_get_tunnel_key, struct sk_buff *, skb, 
struct bpf_tunnel_key
to->tunnel_id = be64_to_cpu(info->key.tun_id);
to->tunnel_tos = info->key.tos;
to->tunnel_ttl = info->key.ttl;
+   to->tunnel_ext = 0;
 
if (flags & BPF_F_TUNINFO_IPV6) {
memcpy(to->remote_ipv6, &info->key.u.ipv6.src,
@@ -3452,6 +3453,8 @@ BPF_CALL_4(bpf_skb_get_tunnel_key, struct sk_buff *, skb, 
struct bpf_tunnel_key
to->tunnel_label = be32_to_cpu(info->key.label);
} else {
to->remote_ipv4 = be32_to_cpu(info->key.u.ipv4.src);
+   memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3);
+   to->tunnel_label = 0;
}
 
if (unlikely(size != sizeof(struct bpf_tunnel_key)))
@@ -4047,11 +4050,14 @@ BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, 
skb, u32, index,
to->reqid = x->props.reqid;
to->spi = x->id.spi;
to->family = x->props.family;
+   to->ext = 0;
+
if (to->family == AF_INET6) {
memcpy(to->remote_ipv6, x->props.saddr.a6,
   sizeof(to->remote_ipv6));
} else {
to->remote_ipv4 = x->props.saddr.a4;
+   memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3);
}
 
return 0;
-- 
2.9.5

[PATCH bpf-next 02/11] bpf: add also cbpf long jump test cases with heavy expansion

2018-05-27 Thread Daniel Borkmann

We have one triggering on eBPF but lets also add a cBPF example to
make sure we keep tracking them. Also add anther cBPF test running
max number of MSH ops.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 lib/test_bpf.c | 63 ++
 1 file changed, 63 insertions(+)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 317f231..60aedc8 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -356,6 +356,52 @@ static int bpf_fill_maxinsns11(struct bpf_test *self)
return __bpf_fill_ja(self, BPF_MAXINSNS, 68);
 }
 
+static int bpf_fill_maxinsns12(struct bpf_test *self)
+{
+   unsigned int len = BPF_MAXINSNS;
+   struct sock_filter *insn;
+   int i = 0;
+
+   insn = kmalloc_array(len, sizeof(*insn), GFP_KERNEL);
+   if (!insn)
+   return -ENOMEM;
+
+   insn[0] = __BPF_JUMP(BPF_JMP | BPF_JA, len - 2, 0, 0);
+
+   for (i = 1; i < len - 1; i++)
+   insn[i] = __BPF_STMT(BPF_LDX | BPF_B | BPF_MSH, 0);
+
+   insn[len - 1] = __BPF_STMT(BPF_RET | BPF_K, 0xabababab);
+
+   self->u.ptr.insns = insn;
+   self->u.ptr.len = len;
+
+   return 0;
+}
+
+static int bpf_fill_maxinsns13(struct bpf_test *self)
+{
+   unsigned int len = BPF_MAXINSNS;
+   struct sock_filter *insn;
+   int i = 0;
+
+   insn = kmalloc_array(len, sizeof(*insn), GFP_KERNEL);
+   if (!insn)
+   return -ENOMEM;
+
+   for (i = 0; i < len - 3; i++)
+   insn[i] = __BPF_STMT(BPF_LDX | BPF_B | BPF_MSH, 0);
+
+   insn[len - 3] = __BPF_STMT(BPF_LD | BPF_IMM, 0xabababab);
+   insn[len - 2] = __BPF_STMT(BPF_ALU | BPF_XOR | BPF_X, 0);
+   insn[len - 1] = __BPF_STMT(BPF_RET | BPF_A, 0);
+
+   self->u.ptr.insns = insn;
+   self->u.ptr.len = len;
+
+   return 0;
+}
+
 static int bpf_fill_ja(struct bpf_test *self)
 {
/* Hits exactly 11 passes on x86_64 JIT. */
@@ -5290,6 +5336,23 @@ static struct bpf_test tests[] = {
.expected_errcode = -ENOTSUPP,
},
{
+   "BPF_MAXINSNS: jump over MSH",
+   { },
+   CLASSIC | FLAG_EXPECTED_FAIL,
+   { 0xfa, 0xfb, 0xfc, 0xfd, },
+   { { 4, 0xabababab } },
+   .fill_helper = bpf_fill_maxinsns12,
+   .expected_errcode = -EINVAL,
+   },
+   {
+   "BPF_MAXINSNS: exec all MSH",
+   { },
+   CLASSIC,
+   { 0xfa, 0xfb, 0xfc, 0xfd, },
+   { { 4, 0xababab83 } },
+   .fill_helper = bpf_fill_maxinsns13,
+   },
+   {
"BPF_MAXINSNS: ld_abs+get_processor_id",
{ },
CLASSIC,
-- 
2.9.5

[PATCH bpf-next 11/11] bpf, doc: add missing patchwork url and libbpf to maintainers

2018-05-27 Thread Daniel Borkmann

Add missing bits under tools/lib/bpf/ and also Q: entry in order to
make it easier for people to retrieve current patch queue.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index f492431..2fd51db 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2722,6 +2722,7 @@ L:netdev@vger.kernel.org
 L: linux-ker...@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+Q: https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
 S: Supported
 F: arch/x86/net/bpf_jit*
 F: Documentation/networking/filter.txt
@@ -2740,6 +2741,7 @@ F:net/sched/act_bpf.c
 F: net/sched/cls_bpf.c
 F: samples/bpf/
 F: tools/bpf/
+F: tools/lib/bpf/
 F: tools/testing/selftests/bpf/
 
 BROADCOM B44 10/100 ETHERNET DRIVER
-- 
2.9.5

[PATCH bpf-next 05/11] bpf: avoid retpoline for lookup/update/delete calls on maps

2018-05-27 Thread Daniel Borkmann

While some of the BPF map lookup helpers provide a ->map_gen_lookup()
callback for inlining the map lookup altogether it is not available
for every map, so the remaining ones have to call bpf_map_lookup_elem()
helper which does a dispatch to map->ops->map_lookup_elem(). In
times of retpolines, this will control and trap speculative execution
rather than letting it do its work for the indirect call and will
therefore cause a slowdown. Likewise, bpf_map_update_elem() and
bpf_map_delete_elem() do not have an inlined version and need to call
into their map->ops->map_update_elem() resp. map->ops->map_delete_elem()
handlers.

Before:

  # bpftool p d x i 1
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = map[id:1]
5: (85) call __htab_map_lookup_elem#232656
6: (15) if r0 == 0x0 goto pc+4
7: (71) r1 = *(u8 *)(r0 +35)
8: (55) if r1 != 0x0 goto pc+1
9: (72) *(u8 *)(r0 +35) = 1
   10: (07) r0 += 56
   11: (15) if r0 == 0x0 goto pc+4
   12: (bf) r2 = r0
   13: (18) r1 = map[id:1]
   15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
   16: (95) exit helper

After:

  # bpftool p d x i 1
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = map[id:1]
5: (85) call __htab_map_lookup_elem#233328
6: (15) if r0 == 0x0 goto pc+4
7: (71) r1 = *(u8 *)(r0 +35)
8: (55) if r1 != 0x0 goto pc+1
9: (72) *(u8 *)(r0 +35) = 1
   10: (07) r0 += 56
   11: (15) if r0 == 0x0 goto pc+4
   12: (bf) r2 = r0
   13: (18) r1 = map[id:1]
   15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
   16: (95) exit

In all three lookup/update/delete cases however we can use the actual
address of the map callback directly if we find that there's only a
single path with a map pointer leading to the helper call, meaning
when the map pointer has not been poisoned from verifier side.
Example code can be seen above for the delete case.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/filter.h |  3 +++
 kernel/bpf/hashtab.c   | 12 ++---
 kernel/bpf/verifier.c  | 67 +-
 3 files changed, 62 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index b443f70..d407ede 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -301,6 +301,9 @@ struct xdp_buff;
 
 /* Function call */
 
+#define BPF_CAST_CALL(x)   \
+   ((u64 (*)(u64, u64, u64, u64, u64))(x))
+
 #define BPF_EMIT_CALL(FUNC)\
((struct bpf_insn) {\
.code  = BPF_JMP | BPF_CALL,\
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b76828f..3ca2198 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -503,7 +503,9 @@ static u32 htab_map_gen_lookup(struct bpf_map *map, struct 
bpf_insn *insn_buf)
struct bpf_insn *insn = insn_buf;
const int ret = BPF_REG_0;
 
-   *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, 
u64))__htab_map_lookup_elem);
+   BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
+(void *(*)(struct bpf_map *map, void *key))NULL));
+   *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem));
*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
offsetof(struct htab_elem, key) +
@@ -530,7 +532,9 @@ static u32 htab_lru_map_gen_lookup(struct bpf_map *map,
const int ret = BPF_REG_0;
const int ref_reg = BPF_REG_1;
 
-   *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, 
u64))__htab_map_lookup_elem);
+   BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
+(void *(*)(struct bpf_map *map, void *key))NULL));
+   *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem));
*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 4);
*insn++ = BPF_LDX_MEM(BPF_B, ref_reg, ret,
  offsetof(struct htab_elem, lru_node) +
@@ -1369,7 +1373,9 @@ static u32 htab_of_map_gen_lookup(struct bpf_map *map,
struct bpf_insn *insn = insn_buf;
const int ret = BPF_REG_0;
 
-   *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, 
u64))__htab_map_lookup_elem);
+   BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
+(void *(*)(struct bpf_map *map, void *key))NULL));
+   *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem));
*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2);
*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
offsetof(struct htab_elem, key) +
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4f4786e..5684b15 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2421,8 +2421,11 @@ record_func_map(struct bpf_

[PATCH bpf-next 03/11] bpf: fixup error message from gpl helpers on license mismatch

2018-05-27 Thread Daniel Borkmann

Stating 'proprietary program' in the error is just silly since it
can also be a different open source license than that which is just
not compatible.

Reference: https://twitter.com/majek04/status/998531268039102465
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1fd9667b..4f4786e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2462,7 +2462,7 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
 
/* eBPF programs must be GPL compatible to use GPL-ed functions */
if (!env->prog->gpl_compatible && fn->gpl_only) {
-   verbose(env, "cannot call GPL only function from proprietary 
program\n");
+   verbose(env, "cannot call GPL-restricted function from non-GPL 
compatible program\n");
return -EINVAL;
}
 
-- 
2.9.5

[PATCH bpf-next 08/11] bpf: fix cbpf parser bug for octal numbers

2018-05-27 Thread Daniel Borkmann

Range is 0-7, not 0-9, otherwise parser silently excludes it from the
strtol() rather than throwing an error.

Reported-by: Marc Boschma 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/bpf/bpf_exp.l | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpf_exp.l b/tools/bpf/bpf_exp.l
index bd83149..4da8d05 100644
--- a/tools/bpf/bpf_exp.l
+++ b/tools/bpf/bpf_exp.l
@@ -175,7 +175,7 @@ extern void yyerror(const char *str);
yylval.number = strtol(yytext, NULL, 10);
return number;
}
-([0][0-9]+){
+([0][0-7]+){
yylval.number = strtol(yytext + 1, NULL, 8);
return number;
}
-- 
2.9.5

[PATCH bpf-next 00/11] Misc BPF improvements

2018-05-27 Thread Daniel Borkmann

This set adds various patches I still had in my queue, first two
are test cases to provide coverage for the recent two fixes that
went to bpf tree, then a small improvement on the error message
for gpl helpers. Next, we expose prog and map id into fdinfo in
order to allow for inspection of these objections currently used
in applications. Patch after that removes a retpoline call for
map lookup/update/delete helpers. A new helper is added in the
subsequent patch to lookup the skb's socket's cgroup v2 id which
can be used in an efficient way for e.g. lookups on egress side.
Next one is a fix to fully clear state info in tunnel/xfrm helpers.
Given this is full cap_sys_admin from init ns and has same priv
requirements like tracing, bpf-next should be okay. A small bug
fix for bpf_asm follows, and next a fix for context access in
tracing which was recently reported. Lastly, a small update in
the maintainer's file to add patchwork url and missing files.

Daniel Borkmann (11):
  bpf: test case for map pointer poison with calls/branches
  bpf: add also cbpf long jump test cases with heavy expansion
  bpf: fixup error message from gpl helpers on license mismatch
  bpf: show prog and map id in fdinfo
  bpf: avoid retpoline for lookup/update/delete calls on maps
  bpf: add bpf_skb_cgroup_id helper
  bpf: make sure to clear unused fields in tunnel/xfrm state fetch
  bpf: fix cbpf parser bug for octal numbers
  bpf: fix context access in tracing progs on 32 bit archs
  bpf: sync bpf uapi header with tools
  bpf, doc: add missing patchwork url and libbpf to maintainers

 MAINTAINERS |   2 +
 include/linux/filter.h  |  43 ++-
 include/uapi/linux/bpf.h|  20 ++-
 kernel/bpf/hashtab.c|  12 +-
 kernel/bpf/syscall.c|  12 +-
 kernel/bpf/verifier.c   |  69 ---
 kernel/trace/bpf_trace.c|  10 +-
 lib/test_bpf.c  |  63 ++
 net/core/filter.c   |  35 +-
 tools/bpf/bpf_exp.l |   2 +-
 tools/include/linux/filter.h|  10 ++
 tools/include/uapi/linux/bpf.h  |  20 ++-
 tools/testing/selftests/bpf/test_verifier.c | 185 
 13 files changed, 416 insertions(+), 67 deletions(-)

-- 
2.9.5

[PATCH bpf-next 04/11] bpf: show prog and map id in fdinfo

2018-05-27 Thread Daniel Borkmann

Its trivial and straight forward to expose it for scripts that can
then use it along with bpftool in order to inspect an individual
application's used maps and progs. Right now we dump some basic
information in the fdinfo file but with the help of the map/prog
id full introspection becomes possible now.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/syscall.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 388d4fe..79341e8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -326,13 +326,15 @@ static void bpf_map_show_fdinfo(struct seq_file *m, 
struct file *filp)
   "value_size:\t%u\n"
   "max_entries:\t%u\n"
   "map_flags:\t%#x\n"
-  "memlock:\t%llu\n",
+  "memlock:\t%llu\n"
+  "map_id:\t%u\n",
   map->map_type,
   map->key_size,
   map->value_size,
   map->max_entries,
   map->map_flags,
-  map->pages * 1ULL << PAGE_SHIFT);
+  map->pages * 1ULL << PAGE_SHIFT,
+  map->id);
 
if (owner_prog_type) {
seq_printf(m, "owner_prog_type:\t%u\n",
@@ -1069,11 +1071,13 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, 
struct file *filp)
   "prog_type:\t%u\n"
   "prog_jited:\t%u\n"
   "prog_tag:\t%s\n"
-  "memlock:\t%llu\n",
+  "memlock:\t%llu\n"
+  "prog_id:\t%u\n",
   prog->type,
   prog->jited,
   prog_tag,
-  prog->pages * 1ULL << PAGE_SHIFT);
+  prog->pages * 1ULL << PAGE_SHIFT,
+  prog->aux->id);
 }
 #endif
 
-- 
2.9.5

[PATCH] net: qmi_wwan: Add Netgear Aircard 779S

2018-05-27 Thread Josh Hill

Add support for Netgear Aircard 779S

Signed-off-by: Josh Hill 

---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 42565dd..0946808 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1103,6 +1103,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x05c6, 0x920d, 5)},
{QMI_QUIRK_SET_DTR(0x05c6, 0x9625, 4)}, /* YUGA CLM920-NC5 */
{QMI_FIXED_INTF(0x0846, 0x68a2, 8)},
+   {QMI_FIXED_INTF(0x0846, 0x68d3, 8)},/* Netgear Aircard 779S */
{QMI_FIXED_INTF(0x12d1, 0x140c, 1)},/* Huawei E173 */
{QMI_FIXED_INTF(0x12d1, 0x14ac, 1)},/* Huawei E1820 */
{QMI_FIXED_INTF(0x1435, 0xd181, 3)},/* Wistron NeWeb D18Q1 */
-- 
2.7.4

Re: WARNING in bpf_int_jit_compile

2018-05-27 Thread Daniel Borkmann

On 05/26/2018 11:29 AM, syzbot wrote:
> syzbot has found a reproducer for the following crash on:
> 
> HEAD commit:    62d18ecfa641 Merge tag 'arm64-fixes' of git://git.kernel.o..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=14c6bf5780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
> dashboard link: https://syzkaller.appspot.com/bug?extid=9e762b52dd17e616a7a5
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=130e42b780
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+9e762b52dd17e616a...@syzkaller.appspotmail.com
> 
> RAX: ffda RBX: 02542914 RCX: 00455a09
> RDX: 0048 RSI: 2240 RDI: 0005
> RBP: 0072bea0 R08:  R09: 
> R10:  R11: 0246 R12: 0003
> R13: 0046 R14: 006f4730 R15: 0023
> WARNING: CPU: 0 PID: 4752 at include/linux/filter.h:667 
> bpf_jit_binary_lock_ro include/linux/filter.h:667 [inline]
> WARNING: CPU: 0 PID: 4752 at include/linux/filter.h:667 
> bpf_int_jit_compile+0xbf7/0xef7 arch/x86/net/bpf_jit_comp.c:1271
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 0 PID: 4752 Comm: syz-executor0 Not tainted 4.17.0-rc6+ #67
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  panic+0x22f/0x4de kernel/panic.c:184
>  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
>  report_bug+0x252/0x2d0 lib/bug.c:186
>  fixup_bug arch/x86/kernel/traps.c:178 [inline]
>  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
> RIP: 0010:bpf_jit_binary_lock_ro include/linux/filter.h:667 [inline]

Been looking into this last Friday already. What seems to happen here is that
there's fault injection from inside set_memory_ro(), meaning it will eventually
return an error there, and we throw a WARN_ON_ONCE() to bark that making the
memory read-only didn't work out. I'd be in preference to notify the user on
such issue rather than keeping completely silent about it so that there's
awareness that read-only protections are not in place / guaranteed.

> RIP: 0010:bpf_int_jit_compile+0xbf7/0xef7 arch/x86/net/bpf_jit_comp.c:1271
> RSP: 0018:8801d85ff920 EFLAGS: 00010293
> RAX: 8801d78c40c0 RBX: 0046 RCX: 81445d89
> RDX:  RSI: 81445d97 RDI: 0005
> RBP: 8801d85ffa40 R08: 8801d78c40c0 R09: 
> R10:  R11:  R12: c9000194e002
> R13: 8801d85ffa18 R14: fff4 R15: 0003
>  bpf_prog_select_runtime+0x131/0x640 kernel/bpf/core.c:1541
>  bpf_prog_load+0x16c2/0x2070 kernel/bpf/syscall.c:1333
>  __do_sys_bpf kernel/bpf/syscall.c:2073 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:2035 [inline]
>  __x64_sys_bpf+0x389/0x4c0 kernel/bpf/syscall.c:2035
>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x455a09
> RSP: 002b:7ffec3da2868 EFLAGS: 0246 ORIG_RAX: 0141
> RAX: ffda RBX: 02542914 RCX: 00455a09
> RDX: 0048 RSI: 2240 RDI: 0005
> RBP: 0072bea0 R08:  R09: 
> R10:  R11: 0246 R12: 0003
> R13: 0046 R14: 006f4730 R15: 0023
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>

Re: [PATCH v3 bpf-next 0/5] bpf: Hooks for sys_sendmsg

2018-05-27 Thread Daniel Borkmann

On 05/28/2018 12:56 AM, Daniel Borkmann wrote:
> On 05/25/2018 07:09 AM, Andrey Ignatov wrote:
>> v2 -> v3:
>> * place BPF logic under static key in udp_sendmsg, udpv6_sendmsg;
>> * rebase.
[...]
> 
> Applied to bpf-next, thanks Andrey!

Woops, and I meant of course v4 [0] of the series not v3. ;-)

  [0] https://patchwork.ozlabs.org/project/netdev/list/?series=46691&state=*

Re: [PATCH v3 bpf-next 0/5] bpf: Hooks for sys_sendmsg

2018-05-27 Thread Daniel Borkmann

On 05/25/2018 07:09 AM, Andrey Ignatov wrote:
> v2 -> v3:
> * place BPF logic under static key in udp_sendmsg, udpv6_sendmsg;
> * rebase.
> 
> v1 -> v2:
> * return ENOTSUPP if bpf_prog rewrote IPv6-only with IPv4-mapped IPv6;
> * add test for IPv4-mapped IPv6 use-case;
> * fix build for CONFIG_CGROUP_BPF=n;
> * rebase.
> 
> This path set adds BPF hooks for sys_sendmsg similar to existing hooks for
> sys_bind and sys_connect.
> 
> Hooks allow to override source IP (including the case when it's set via
> cmsg(3)) and destination IP:port for unconnected UDP (slow path). TCP and
> connected UDP (fast path) are not affected. This makes UDP support
> complete: connected UDP is handled by sys_connect hooks, unconnected by
> sys_sendmsg ones.
> 
> Similar to sys_connect hooks, sys_sendmsg ones can be used to make system
> calls such as sendmsg(2) and sendto(2) return EPERM.
> 
> Please see patch 0001 for more details.
> 
> 
> Andrey Ignatov (5):
>   bpf: Hooks for sys_sendmsg
>   bpf: Sync bpf.h to tools/
>   libbpf: Support guessing sendmsg{4,6} progs
>   selftests/bpf: Prepare test_sock_addr for extension
>   selftests/bpf: Selftest for sys_sendmsg hooks
> 
>  include/linux/bpf-cgroup.h   |   23 +-
>  include/linux/filter.h   |1 +
>  include/uapi/linux/bpf.h |8 +
>  kernel/bpf/cgroup.c  |   11 +-
>  kernel/bpf/syscall.c |8 +
>  net/core/filter.c|   39 +
>  net/ipv4/udp.c   |   20 +-
>  net/ipv6/udp.c   |   24 +
>  tools/include/uapi/linux/bpf.h   |8 +
>  tools/lib/bpf/libbpf.c   |2 +
>  tools/testing/selftests/bpf/Makefile |2 +-
>  tools/testing/selftests/bpf/sendmsg4_prog.c  |   49 ++
>  tools/testing/selftests/bpf/sendmsg6_prog.c  |   60 ++
>  tools/testing/selftests/bpf/test_sock_addr.c | 1155 
> +-
>  14 files changed, 1214 insertions(+), 196 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/sendmsg4_prog.c
>  create mode 100644 tools/testing/selftests/bpf/sendmsg6_prog.c
> 

Applied to bpf-next, thanks Andrey!

Re: [PATCH, net-next 2/2] bpf: avoid -Wmaybe-uninitialized warning

2018-05-27 Thread Daniel Borkmann

On 05/25/2018 11:33 PM, Arnd Bergmann wrote:
> The stack_map_get_build_id_offset() function is too long for gcc to track
> whether 'work' may or may not be initialized at the end of it, leading
> to a false-positive warning:
> 
> kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset':
> kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
> 
> This removes the 'in_nmi_ctx' flag and uses the state of that variable
> itself to see if it got initialized.
> 
> Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
> Signed-off-by: Arnd Bergmann 

Applied to bpf-next, thanks Arnd!

Re: [PATCH, net-next 1/2] bpf: btf: avoid -Wreturn-type warning

2018-05-27 Thread Daniel Borkmann

On 05/25/2018 11:33 PM, Arnd Bergmann wrote:
> gcc warns about a noreturn function possibly returning in
> some configurations:
> 
> kernel/bpf/btf.c: In function 'env_type_is_resolve_sink':
> kernel/bpf/btf.c:729:1: error: control reaches end of non-void function 
> [-Werror=return-type]
> 
> Using BUG() instead of BUG_ON() avoids that warning and otherwise
> does the exact same thing.
> 
> Fixes: eb3f595dab40 ("bpf: btf: Validate type reference")
> Signed-off-by: Arnd Bergmann 

Applied to bpf-next, thanks Arnd!

Re: [PATCH bpf-next v2] selftests/bpf: missing headers test_lwt_seg6local

2018-05-27 Thread Daniel Borkmann

On 05/26/2018 04:44 PM, Mathieu Xhonneux wrote:
> Previous patch "selftests/bpf: test for seg6local End.BPF action" lacks
> some UAPI headers in tools/.
> 
> clang -I. -I./include/uapi -I../../../include/uapi -idirafter
> /usr/local/include -idirafter
> /data/users/yhs/work/llvm/build/install/lib/clang/7.0.0/include
> -idirafter /usr/include -Wno-compare-distinct-pointer-types \
>  -O2 -target bpf -emit-llvm -c test_lwt_seg6local.c -o - |  \
> llc -march=bpf -mcpu=generic  -filetype=obj -o
> [...]/net-next/tools/testing/selftests/bpf/test_lwt_seg6local.o
> test_lwt_seg6local.c:4:10: fatal error: 'linux/seg6_local.h' file not found
>  ^~~~
> 1 error generated.
> make: Leaving directory
> `/data/users/yhs/work/net-next/tools/testing/selftests/bpf'
> 
> v2: moving the headers to tools/include/uapi/.
> 
> Reported-by: Y Song 
> Signed-off-by: Mathieu Xhonneux 

Applied to bpf-next, thanks Mathieu!

Re: [bpf-next PATCH] bpf: sockhash fix race with bpf_tcp_close and map delete

2018-05-27 Thread Daniel Borkmann

On 05/25/2018 07:37 PM, John Fastabend wrote:
> syzbot reported two related splats, a use after free and null
> pointer dereference, when a TCP socket is closed while the map is
> also being removed.
> 
> The psock keeps a reference to all map slots that have a reference
> to the sock so that when the sock is closed we can clean up any
> outstanding sock{map|hash} entries. This avoids pinning a sock
> forever if the map owner fails to do proper cleanup. However, the
> result is we have two paths that can free an entry in the map. Even
> the comment in the sock{map|hash} tear down function, sock_hash_free()
> notes this:
> 
>  At this point no update, lookup or delete operations can happen.
>  However, be aware we can still get a socket state event updates,
>  and data ready callbacks that reference the psock from sk_user_data.
> 
> Both removal paths omitted taking the hash bucket lock resulting
> in the case where we have two references that are in the process
> of being free'd.
> 
> Reported-by: syzbot+a761b81c211794fa1...@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend 

Applied to bpf-next, thanks John!

Re: [PATCH bpf-next] libbpf: Install btf.h with libbpf

2018-05-27 Thread Daniel Borkmann

On 05/25/2018 07:23 PM, Andrey Ignatov wrote:
> install_headers target should contain all headers that are part of
> libbpf. Add missing btf.h
> 
> Signed-off-by: Andrey Ignatov 

Applied to bpf-next, thanks Andrey!

Proposal

2018-05-27 Thread Miss Zeliha Omer Faruk




--
Hello

I have been trying to contact you. Did you get my business proposal?

Best Regards,
Miss.Zeliha ömer faruk
Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
Sisli - Istanbul, Turke

Re: aio poll and a new in-kernel poll API V13

2018-05-27 Thread Al Viro

OK, it's in -next now; there are several cleanups I'd put
into vfs.git#work.aio:
  aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED 
the same way
  aio_read_events_ring(): make a bit more readable
  aio: shift copyin of iocb into io_submit_one()
  aio: fold do_io_submit() into callers
Those are *not* on -next yet and if anybody has objections against
any of those, please yell.  Individual patches in followups...

Re: KASAN: use-after-free Read in bpf_tcp_close

2018-05-27 Thread Daniel Borkmann

[ +John ]

On 05/26/2018 10:54 AM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    3fb48d881dbe Merge branch 'bpf-fib-mtu-check'
> git tree:   bpf-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15fc197780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
> dashboard link: https://syzkaller.appspot.com/bug?extid=fce8f2462c403d02af98
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1310c85780
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17de717780
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+fce8f2462c403d02a...@syzkaller.appspotmail.com

Should be fixed by: https://patchwork.ozlabs.org/patch/920695/

> ==
> BUG: KASAN: use-after-free in hlist_del_rcu include/linux/rculist.h:427 
> [inline]
> BUG: KASAN: use-after-free in bpf_tcp_close+0xd7f/0xf80 
> kernel/bpf/sockmap.c:271
> Read of size 8 at addr 8801c884cf90 by task syz-executor330/11778
> 
> CPU: 1 PID: 11778 Comm: syz-executor330 Not tainted 4.17.0-rc4+ #18
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  print_address_description+0x6c/0x20b mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>  hlist_del_rcu include/linux/rculist.h:427 [inline]
>  bpf_tcp_close+0xd7f/0xf80 kernel/bpf/sockmap.c:271
>  inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427
>  inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459
>  sock_release+0x96/0x1b0 net/socket.c:594
>  sock_close+0x16/0x20 net/socket.c:1149
>  __fput+0x34d/0x890 fs/file_table.c:209
>  fput+0x15/0x20 fs/file_table.c:243
>  task_work_run+0x1e4/0x290 kernel/task_work.c:113
>  exit_task_work include/linux/task_work.h:22 [inline]
>  do_exit+0x1aee/0x2730 kernel/exit.c:865
>  do_group_exit+0x16f/0x430 kernel/exit.c:968
>  get_signal+0x886/0x1960 kernel/signal.c:2469
>  do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
>  exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
>  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>  do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x445ed9
> RSP: 002b:7f0078c0adb8 EFLAGS: 0246 ORIG_RAX: 00ca
> RAX: fe00 RBX: 006dbc24 RCX: 00445ed9
> RDX:  RSI:  RDI: 006dbc24
> RBP: 006dbc20 R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 7ffcd147dbef R14: 7f0078c0b9c0 R15: 0007
> 
> Allocated by task 11787:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>  set_track mm/kasan/kasan.c:460 [inline]
>  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
>  __do_kmalloc_node mm/slab.c:3682 [inline]
>  __kmalloc_node+0x47/0x70 mm/slab.c:3689
>  kmalloc_node include/linux/slab.h:554 [inline]
>  alloc_sock_hash_elem kernel/bpf/sockmap.c:2114 [inline]
>  sock_hash_ctx_update_elem.isra.23+0xa57/0x1560 kernel/bpf/sockmap.c:2245
>  sock_hash_update_elem+0x14f/0x2d0 kernel/bpf/sockmap.c:2303
>  map_update_elem+0x5c4/0xc90 kernel/bpf/syscall.c:760
>  __do_sys_bpf kernel/bpf/syscall.c:2134 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:2105 [inline]
>  __x64_sys_bpf+0x32a/0x4f0 kernel/bpf/syscall.c:2105
>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Freed by task 8998:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>  set_track mm/kasan/kasan.c:460 [inline]
>  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
>  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
>  __cache_free mm/slab.c:3498 [inline]
>  kfree+0xd9/0x260 mm/slab.c:3813
>  sock_hash_free+0x24e/0x6e0 kernel/bpf/sockmap.c:2093
>  bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:259
>  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
>  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
>  kthread+0x345/0x410 kernel/kthread.c:238
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> 
> The buggy address belongs to the object at 8801c884cf80
>  which belongs to the cache kmalloc-64 of size 64
> The buggy address is located 16 bytes inside of
>  64-byte region [8801c884cf80, 8801c884cfc0)
> The buggy address belongs to the page:
> page:ea0007221300 count:1 mapcount:0 mapping:8801c884c000 index:0x0
> flags: 0x2fffc000100(slab)
> raw: 02fffc000100 8801c884c000  00010020
> raw: ea00072e08e0 ea0006e9

Re: general protection fault in bpf_tcp_close

2018-05-27 Thread Daniel Borkmann

[ +John ]

On 05/26/2018 11:13 AM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    fd0bfa8d6e04 Merge branch 'bpf-af-xdp-cleanups'
> git tree:   bpf-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=11da942780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
> dashboard link: https://syzkaller.appspot.com/bug?extid=0ce137753c78f7b6acc1
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com

Should be fixed by: https://patchwork.ozlabs.org/patch/920695/

> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 12139 Comm: syz-executor2 Not tainted 4.17.0-rc4+ #17
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:__hlist_del include/linux/list.h:649 [inline]
> RIP: 0010:hlist_del_rcu include/linux/rculist.h:427 [inline]
> RIP: 0010:bpf_tcp_close+0x7d2/0xf80 kernel/bpf/sockmap.c:271
> RSP: 0018:8801a8f8ef70 EFLAGS: 00010a02
> RAX: ed00351f1dfd RBX: dc00 RCX: dead0200
> RDX:  RSI: 1bd5a040 RDI: 8801cb710910
> RBP: 8801a8f8f110 R08: ed003350ac9d R09: ed003350ac9c
> R10: ed003350ac9c R11: 88019a8564e3 R12: 8801cb710380
> R13: 8801b17ea6e0 R14: 8801cb710398 R15: 8801cb710900
> FS:  7f9890c43700() GS:8801dae0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fde1a668000 CR3: 00019dca2000 CR4: 001406f0
> DR0: 21c0 DR1: 21c0 DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0600
> Call Trace:
>  inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427
>  inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459
>  sock_release+0x96/0x1b0 net/socket.c:594
>  sock_close+0x16/0x20 net/socket.c:1149
>  __fput+0x34d/0x890 fs/file_table.c:209
>  fput+0x15/0x20 fs/file_table.c:243
>  task_work_run+0x1e4/0x290 kernel/task_work.c:113
>  exit_task_work include/linux/task_work.h:22 [inline]
>  do_exit+0x1aee/0x2730 kernel/exit.c:865
>  do_group_exit+0x16f/0x430 kernel/exit.c:968
>  get_signal+0x886/0x1960 kernel/signal.c:2469
>  do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
>  exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
>  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>  do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x455a09
> RSP: 002b:7f9890c42ce8 EFLAGS: 0246 ORIG_RAX: 00ca
> RAX: fe00 RBX: 0072bec8 RCX: 00455a09
> RDX:  RSI:  RDI: 0072bec8
> RBP: 0072bec8 R08:  R09: 0072bea0
> R10:  R11: 0246 R12: 
> R13: 7ffcb48ac3ff R14: 7f9890c439c0 R15: 
> Code: ff 48 c1 e9 03 80 3c 19 00 0f 85 a9 05 00 00 49 8b 4f 18 48 8b 85 98 fe 
> ff ff 48 89 ce c6 00 00 48 c1 ee 03 48 89 95 d8 fe ff ff <80> 3c 1e 00 0f 85 
> c6 05 00 00 48 8b 85 98 fe ff ff 48 85 d2 48
> RIP: __hlist_del include/linux/list.h:649 [inline] RSP: 8801a8f8ef70
> RIP: hlist_del_rcu include/linux/rculist.h:427 [inline] RSP: 8801a8f8ef70
> RIP: bpf_tcp_close+0x7d2/0xf80 kernel/bpf/sockmap.c:271 RSP: 8801a8f8ef70
> ---[ end trace e81227e93c7e7b75 ]---
> 
> 
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with syzbot.

Re: KASAN: use-after-free Write in bpf_tcp_close

2018-05-27 Thread Daniel Borkmann

[ +John ]

On 05/27/2018 10:06 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    ff4fb475cea8 Merge branch 'btf-uapi-cleanups'
> git tree:   bpf-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=12b3d57780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
> dashboard link: https://syzkaller.appspot.com/bug?extid=31025a5f3f7650081204
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=109a2f3780
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=171a727b80
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+31025a5f3f7650081...@syzkaller.appspotmail.com

Should be fixed by: https://patchwork.ozlabs.org/patch/920695/

> ==
> BUG: KASAN: use-after-free in cmpxchg_size 
> include/asm-generic/atomic-instrumented.h:355 [inline]
> BUG: KASAN: use-after-free in bpf_tcp_close+0x6f5/0xf80 
> kernel/bpf/sockmap.c:265
> Write of size 8 at addr 8801ca277680 by task syz-executor749/9723
> 
> CPU: 0 PID: 9723 Comm: syz-executor749 Not tainted 4.17.0-rc4+ #19
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  print_address_description+0x6c/0x20b mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>  check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
>  kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
>  cmpxchg_size include/asm-generic/atomic-instrumented.h:355 [inline]
>  bpf_tcp_close+0x6f5/0xf80 kernel/bpf/sockmap.c:265
>  inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427
>  inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459
>  sock_release+0x96/0x1b0 net/socket.c:594
>  sock_close+0x16/0x20 net/socket.c:1149
>  __fput+0x34d/0x890 fs/file_table.c:209
>  fput+0x15/0x20 fs/file_table.c:243
>  task_work_run+0x1e4/0x290 kernel/task_work.c:113
>  exit_task_work include/linux/task_work.h:22 [inline]
>  do_exit+0x1aee/0x2730 kernel/exit.c:865
>  do_group_exit+0x16f/0x430 kernel/exit.c:968
>  __do_sys_exit_group kernel/exit.c:979 [inline]
>  __se_sys_exit_group kernel/exit.c:977 [inline]
>  __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:977
>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x440a59
> RSP: 002b:7ffdadf92488 EFLAGS: 0206 ORIG_RAX: 00e7
> RAX: ffda RBX:  RCX: 00440a59
> RDX: 00440a59 RSI: 0020 RDI: 
> RBP:  R08: 004002c8 R09: 00401ea0
> R10: 004002c8 R11: 0206 R12: 0001b5ac
> R13: 00401ea0 R14:  R15: 
> 
> Allocated by task 9723:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>  set_track mm/kasan/kasan.c:460 [inline]
>  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
>  __do_kmalloc_node mm/slab.c:3682 [inline]
>  __kmalloc_node+0x47/0x70 mm/slab.c:3689
>  kmalloc_node include/linux/slab.h:554 [inline]
>  bpf_map_area_alloc+0x3f/0x90 kernel/bpf/syscall.c:144
>  sock_map_alloc+0x376/0x410 kernel/bpf/sockmap.c:1555
>  find_and_alloc_map kernel/bpf/syscall.c:126 [inline]
>  map_create+0x393/0x1010 kernel/bpf/syscall.c:448
>  __do_sys_bpf kernel/bpf/syscall.c:2128 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:2105 [inline]
>  __x64_sys_bpf+0x300/0x4f0 kernel/bpf/syscall.c:2105
>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Freed by task 4521:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>  set_track mm/kasan/kasan.c:460 [inline]
>  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
>  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
>  __cache_free mm/slab.c:3498 [inline]
>  kfree+0xd9/0x260 mm/slab.c:3813
>  kvfree+0x61/0x70 mm/util.c:440
>  bpf_map_area_free+0x15/0x20 kernel/bpf/syscall.c:155
>  sock_map_remove_complete kernel/bpf/sockmap.c:1443 [inline]
>  sock_map_free+0x408/0x540 kernel/bpf/sockmap.c:1619
>  bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:259
>  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
>  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
>  kthread+0x345/0x410 kernel/kthread.c:238
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> 
> The buggy address belongs to the object at 8801ca277680
>  which belongs to the cache kmalloc-1024 of size 1024
> The buggy address is located 0 bytes inside of
>  1024-byte region [8801ca277680, 8801ca277a80)
> The buggy address belongs to the page:
> page:ea0007289d80 count:1 mapcount:0 mapping:8801ca276000 index:0x0 
> compound_mapcount: 0
> f

[PATCH v3 07/11] net: sched: implement reference counted action release

2018-05-27 Thread Vlad Buslov

Implement helper delete function that uses new action ops 'delete', instead
of destroying action directly. This is required so act API could delete
actions by index, without holding any references to action that is being
deleted.

Implement function __tcf_action_put() that releases reference to action and
frees it, if necessary. Refactor action deletion code to use new put
function and not to rely on rtnl lock. Remove rtnl lock assertions that are
no longer needed.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Removed redundant actions ops lookup during delete.
- Assume all actions have delete implemented and don't check for it
  explicitly.
- Rearrange variable definitions in tcf_action_delete.

 net/sched/act_api.c | 84 +++--
 net/sched/cls_api.c |  1 -
 2 files changed, 62 insertions(+), 23 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 0f31f09946ab..a023873db713 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -90,21 +90,39 @@ static void free_tcf(struct tc_action *p)
kfree(p);
 }
 
-static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p)
+static void tcf_action_cleanup(struct tc_action *p)
 {
-   spin_lock(&idrinfo->lock);
-   idr_remove(&idrinfo->action_idr, p->tcfa_index);
-   spin_unlock(&idrinfo->lock);
+   if (p->ops->cleanup)
+   p->ops->cleanup(p);
+
gen_kill_estimator(&p->tcfa_rate_est);
free_tcf(p);
 }
 
+static int __tcf_action_put(struct tc_action *p, bool bind)
+{
+   struct tcf_idrinfo *idrinfo = p->idrinfo;
+
+   if (refcount_dec_and_lock(&p->tcfa_refcnt, &idrinfo->lock)) {
+   if (bind)
+   atomic_dec(&p->tcfa_bindcnt);
+   idr_remove(&idrinfo->action_idr, p->tcfa_index);
+   spin_unlock(&idrinfo->lock);
+
+   tcf_action_cleanup(p);
+   return 1;
+   }
+
+   if (bind)
+   atomic_dec(&p->tcfa_bindcnt);
+
+   return 0;
+}
+
 int __tcf_idr_release(struct tc_action *p, bool bind, bool strict)
 {
int ret = 0;
 
-   ASSERT_RTNL();
-
/* Release with strict==1 and bind==0 is only called through act API
 * interface (classifiers always bind). Only case when action with
 * positive reference count and zero bind count can exist is when it was
@@ -118,18 +136,11 @@ int __tcf_idr_release(struct tc_action *p, bool bind, 
bool strict)
 * are acceptable.
 */
if (p) {
-   if (bind)
-   atomic_dec(&p->tcfa_bindcnt);
-   else if (strict && atomic_read(&p->tcfa_bindcnt) > 0)
+   if (!bind && strict && atomic_read(&p->tcfa_bindcnt) > 0)
return -EPERM;
 
-   if (atomic_read(&p->tcfa_bindcnt) <= 0 &&
-   refcount_dec_and_test(&p->tcfa_refcnt)) {
-   if (p->ops->cleanup)
-   p->ops->cleanup(p);
-   tcf_idr_remove(p->idrinfo, p);
+   if (__tcf_action_put(p, bind))
ret = ACT_P_DELETED;
-   }
}
 
return ret;
@@ -340,11 +351,7 @@ int tcf_idr_delete_index(struct tc_action_net *tn, u32 
index)
p->tcfa_index));
spin_unlock(&idrinfo->lock);
 
-   if (p->ops->cleanup)
-   p->ops->cleanup(p);
-
-   gen_kill_estimator(&p->tcfa_rate_est);
-   free_tcf(p);
+   tcf_action_cleanup(p);
module_put(owner);
return 0;
}
@@ -615,6 +622,11 @@ int tcf_action_destroy(struct list_head *actions, int bind)
return ret;
 }
 
+static int tcf_action_put(struct tc_action *p)
+{
+   return __tcf_action_put(p, false);
+}
+
 int
 tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int 
ref)
 {
@@ -1092,6 +1104,35 @@ static int tca_action_flush(struct net *net, struct 
nlattr *nla,
return err;
 }
 
+static int tcf_action_delete(struct net *net, struct list_head *actions,
+struct netlink_ext_ack *extack)
+{
+   struct tc_action *a, *tmp;
+   u32 act_index;
+   int ret;
+
+   list_for_each_entry_safe(a, tmp, actions, list) {
+   const struct tc_action_ops *ops = a->ops;
+
+   /* Actions can be deleted concurrently so we must save their
+* type and id to search again after reference is released.
+*/
+   act_index = a->tcfa_index;
+
+   list_del(&a->list);
+   if (tcf_action_put(a)) {
+   /* last reference, action was deleted concurrently */
+   module_put(ops->owner);
+   } else  {
+   /* no

[PATCH v3 10/11] net: sched: atomically check-allocate action

2018-05-27 Thread Vlad Buslov

Implement function that atomically checks if action exists and either takes
reference to it, or allocates idr slot for action index to prevent
concurrent allocations of actions with same index. Use EBUSY error pointer
to indicate that idr slot is reserved.

Implement cleanup helper function that removes temporary error pointer from
idr. (in case of error between idr allocation and insertion of newly
created action to specified index)

Refactor all action init functions to insert new action to idr using this
API.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Remove unique idr insertion function. Change original idr insert to do
  the same thing.
- Refactor action check-alloc code into standalone function.

 include/net/act_api.h  |  3 ++
 net/sched/act_api.c| 92 --
 net/sched/act_bpf.c| 11 --
 net/sched/act_connmark.c   | 10 +++--
 net/sched/act_csum.c   | 11 --
 net/sched/act_gact.c   | 11 --
 net/sched/act_ife.c|  6 ++-
 net/sched/act_ipt.c| 13 ++-
 net/sched/act_mirred.c | 16 ++--
 net/sched/act_nat.c| 11 --
 net/sched/act_pedit.c  | 15 ++--
 net/sched/act_police.c |  9 -
 net/sched/act_sample.c | 11 --
 net/sched/act_simple.c | 11 +-
 net/sched/act_skbedit.c| 11 +-
 net/sched/act_skbmod.c | 11 +-
 net/sched/act_tunnel_key.c |  9 -
 net/sched/act_vlan.c   | 17 -
 18 files changed, 218 insertions(+), 60 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index d256e20507b9..cd4547476074 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -154,6 +154,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
   int bind, bool cpustats);
 void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a);
 
+void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
+int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
+   struct tc_action **a, int bind);
 int tcf_idr_delete_index(struct tc_action_net *tn, u32 index);
 int __tcf_idr_release(struct tc_action *a, bool bind, bool strict);
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index eefe8c2fe667..9511502e1cbb 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -303,7 +303,9 @@ static bool __tcf_idr_check(struct tc_action_net *tn, u32 
index,
 
spin_lock(&idrinfo->lock);
p = idr_find(&idrinfo->action_idr, index);
-   if (p) {
+   if (IS_ERR(p)) {
+   p = NULL;
+   } else if (p) {
refcount_inc(&p->tcfa_refcnt);
if (bind)
atomic_inc(&p->tcfa_bindcnt);
@@ -371,7 +373,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
 {
struct tc_action *p = kzalloc(ops->size, GFP_KERNEL);
struct tcf_idrinfo *idrinfo = tn->idrinfo;
-   struct idr *idr = &idrinfo->action_idr;
int err = -ENOMEM;
 
if (unlikely(!p))
@@ -389,20 +390,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
goto err2;
}
spin_lock_init(&p->tcfa_lock);
-   idr_preload(GFP_KERNEL);
-   spin_lock(&idrinfo->lock);
-   /* user doesn't specify an index */
-   if (!index) {
-   index = 1;
-   err = idr_alloc_u32(idr, NULL, &index, UINT_MAX, GFP_ATOMIC);
-   } else {
-   err = idr_alloc_u32(idr, NULL, &index, index, GFP_ATOMIC);
-   }
-   spin_unlock(&idrinfo->lock);
-   idr_preload_end();
-   if (err)
-   goto err3;
-
p->tcfa_index = index;
p->tcfa_tm.install = jiffies;
p->tcfa_tm.lastuse = jiffies;
@@ -412,7 +399,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
&p->tcfa_rate_est,
&p->tcfa_lock, NULL, est);
if (err)
-   goto err4;
+   goto err3;
}
 
p->idrinfo = idrinfo;
@@ -420,8 +407,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
INIT_LIST_HEAD(&p->list);
*a = p;
return 0;
-err4:
-   idr_remove(idr, index);
 err3:
free_percpu(p->cpu_qstats);
 err2:
@@ -437,11 +422,78 @@ void tcf_idr_insert(struct tc_action_net *tn, struct 
tc_action *a)
struct tcf_idrinfo *idrinfo = tn->idrinfo;
 
spin_lock(&idrinfo->lock);
-   idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
+   /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
+   WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index)));
spin_unlock(&idrinfo->lock);
 }
 EXPORT_SYMBOL(tcf_idr_insert);
 
+/* Cleanup idr index that was allocated but not initialized. */
+
+void tcf_idr_cleanup(struct tc_acti

[PATCH v3 08/11] net: sched: don't release reference on action overwrite

2018-05-27 Thread Vlad Buslov

Return from action init function with reference to action taken,
even when overwriting existing action.

Action init API initializes its fourth argument (pointer to pointer to tc
action) to either existing action with same index or newly created action.
In case of existing index(and bind argument is zero), init function returns
without incrementing action reference counter. Caller of action init then
proceeds working with action, without actually holding reference to it.
This means that action could be deleted concurrently.

Change action init behavior to always take reference to action before
returning successfully, in order to protect from concurrent deletion.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Resplit action lookup/release code to prevent memory leaks in
  individual patches.
- Change convoluted commit message.

 net/sched/act_api.c|  2 --
 net/sched/act_bpf.c|  8 
 net/sched/act_connmark.c   |  5 +++--
 net/sched/act_csum.c   |  8 
 net/sched/act_gact.c   |  5 +++--
 net/sched/act_ife.c| 12 +---
 net/sched/act_ipt.c|  5 +++--
 net/sched/act_mirred.c |  5 ++---
 net/sched/act_nat.c|  5 +++--
 net/sched/act_pedit.c  |  5 +++--
 net/sched/act_police.c |  8 +++-
 net/sched/act_sample.c |  8 +++-
 net/sched/act_simple.c |  5 +++--
 net/sched/act_skbedit.c|  5 +++--
 net/sched/act_skbmod.c |  8 +++-
 net/sched/act_tunnel_key.c |  8 +++-
 net/sched/act_vlan.c   |  8 +++-
 17 files changed, 51 insertions(+), 59 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index a023873db713..f019f0464cec 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -870,8 +870,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
}
act->order = i;
sz += tcf_action_fill_size(act);
-   if (ovr)
-   refcount_inc(&act->tcfa_refcnt);
list_add_tail(&act->list, actions);
}
 
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 7941dd66ff83..d3f4ac6f2c4b 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -311,9 +311,10 @@ static int tcf_bpf_init(struct net *net, struct nlattr 
*nla,
if (bind)
return 0;
 
-   tcf_idr_release(*act, bind);
-   if (!replace)
+   if (!replace) {
+   tcf_idr_release(*act, bind);
return -EEXIST;
+   }
}
 
is_bpf = tb[TCA_ACT_BPF_OPS_LEN] && tb[TCA_ACT_BPF_OPS];
@@ -356,8 +357,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 
return res;
 out:
-   if (res == ACT_P_CREATED)
-   tcf_idr_release(*act, bind);
+   tcf_idr_release(*act, bind);
 
return ret;
 }
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 143c2d3de723..701e90244eff 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -135,9 +135,10 @@ static int tcf_connmark_init(struct net *net, struct 
nlattr *nla,
ci = to_connmark(*a);
if (bind)
return 0;
-   tcf_idr_release(*a, bind);
-   if (!ovr)
+   if (!ovr) {
+   tcf_idr_release(*a, bind);
return -EEXIST;
+   }
/* replacing action and zone */
ci->tcf_action = parm->action;
ci->zone = parm->zone;
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 3768539340e0..5dbee136b0a1 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -76,9 +76,10 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
} else {
if (bind)/* dont override defaults */
return 0;
-   tcf_idr_release(*a, bind);
-   if (!ovr)
+   if (!ovr) {
+   tcf_idr_release(*a, bind);
return -EEXIST;
+   }
}
 
p = to_tcf_csum(*a);
@@ -86,8 +87,7 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 
params_new = kzalloc(sizeof(*params_new), GFP_KERNEL);
if (unlikely(!params_new)) {
-   if (ret == ACT_P_CREATED)
-   tcf_idr_release(*a, bind);
+   tcf_idr_release(*a, bind);
return -ENOMEM;
}
params_old = rtnl_dereference(p->params);
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index a431a711f0dd..11c4de3f344e 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -100,9 +100,10 @@ static int tcf_gact_init(struct net *net, struct nlattr 
*nla,
} else {
if (bind)/* dont override defaults */
return 0;
-   tcf_idr_release(*a, bind);
-

[PATCH v3 06/11] net: sched: add 'delete' function to action ops

2018-05-27 Thread Vlad Buslov

Extend action ops with 'delete' function. Each action type to implements
its own delete function that doesn't depend on rtnl lock.

Implement delete function that is required to delete actions without
holding rtnl lock. Use action API function that atomically deletes action
only if it is still in action idr. This implementation prevents concurrent
threads from deleting same action twice.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Merge action ops delete definition and implementation.

 include/net/act_api.h  |  1 +
 net/sched/act_bpf.c|  8 
 net/sched/act_connmark.c   |  8 
 net/sched/act_csum.c   |  8 
 net/sched/act_gact.c   |  8 
 net/sched/act_ife.c|  8 
 net/sched/act_ipt.c| 16 
 net/sched/act_mirred.c |  8 
 net/sched/act_nat.c|  8 
 net/sched/act_pedit.c  |  8 
 net/sched/act_police.c |  8 
 net/sched/act_sample.c |  8 
 net/sched/act_simple.c |  8 
 net/sched/act_skbedit.c|  8 
 net/sched/act_skbmod.c |  8 
 net/sched/act_tunnel_key.c |  8 
 net/sched/act_vlan.c   |  8 
 17 files changed, 137 insertions(+)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index d94ec6400673..d256e20507b9 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -101,6 +101,7 @@ struct tc_action_ops {
void(*stats_update)(struct tc_action *, u64, u32, u64);
size_t  (*get_fill_size)(const struct tc_action *act);
struct net_device *(*get_dev)(const struct tc_action *a);
+   int (*delete)(struct net *net, u32 index);
 };
 
 struct tc_action_net {
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 8ebf40a3506c..7941dd66ff83 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -388,6 +388,13 @@ static int tcf_bpf_search(struct net *net, struct 
tc_action **a, u32 index,
return tcf_idr_search(tn, a, index);
 }
 
+static int tcf_bpf_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, bpf_net_id);
+
+   return tcf_idr_delete_index(tn, index);
+}
+
 static struct tc_action_ops act_bpf_ops __read_mostly = {
.kind   =   "bpf",
.type   =   TCA_ACT_BPF,
@@ -398,6 +405,7 @@ static struct tc_action_ops act_bpf_ops __read_mostly = {
.init   =   tcf_bpf_init,
.walk   =   tcf_bpf_walker,
.lookup =   tcf_bpf_search,
+   .delete =   tcf_bpf_delete,
.size   =   sizeof(struct tcf_bpf),
 };
 
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index e3787aa0025a..143c2d3de723 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -193,6 +193,13 @@ static int tcf_connmark_search(struct net *net, struct 
tc_action **a, u32 index,
return tcf_idr_search(tn, a, index);
 }
 
+static int tcf_connmark_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, connmark_net_id);
+
+   return tcf_idr_delete_index(tn, index);
+}
+
 static struct tc_action_ops act_connmark_ops = {
.kind   =   "connmark",
.type   =   TCA_ACT_CONNMARK,
@@ -202,6 +209,7 @@ static struct tc_action_ops act_connmark_ops = {
.init   =   tcf_connmark_init,
.walk   =   tcf_connmark_walker,
.lookup =   tcf_connmark_search,
+   .delete =   tcf_connmark_delete,
.size   =   sizeof(struct tcf_connmark_info),
 };
 
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 334261943f9f..3768539340e0 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -654,6 +654,13 @@ static size_t tcf_csum_get_fill_size(const struct 
tc_action *act)
return nla_total_size(sizeof(struct tc_csum));
 }
 
+static int tcf_csum_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, csum_net_id);
+
+   return tcf_idr_delete_index(tn, index);
+}
+
 static struct tc_action_ops act_csum_ops = {
.kind   = "csum",
.type   = TCA_ACT_CSUM,
@@ -665,6 +672,7 @@ static struct tc_action_ops act_csum_ops = {
.walk   = tcf_csum_walker,
.lookup = tcf_csum_search,
.get_fill_size  = tcf_csum_get_fill_size,
+   .delete = tcf_csum_delete,
.size   = sizeof(struct tcf_csum),
 };
 
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index b4dfb2b4addc..a431a711f0dd 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -231,6 +231,13 @@ static size_t tcf_gact_get_fill_size(const struct 
tc_action *act)
return sz;
 }
 
+static int tcf_gact_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, gact_net_id);
+
+   return tcf_idr

[PATCH v3 09/11] net: sched: use reference counting action init

2018-05-27 Thread Vlad Buslov

Change action API to assume that action init function always takes
reference to action, even when overwriting existing action. This is
necessary because action API continues to use action pointer after init
function is done. At this point action becomes accessible for concurrent
modifications, so user must always hold reference to it.

Implement helper put list function to atomically release list of actions
after action API init code is done using them.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Resplit action lookup/release code to prevent memory leaks in
  individual patches.

 net/sched/act_api.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f019f0464cec..eefe8c2fe667 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -627,6 +627,18 @@ static int tcf_action_put(struct tc_action *p)
return __tcf_action_put(p, false);
 }
 
+static void tcf_action_put_lst(struct list_head *actions)
+{
+   struct tc_action *a, *tmp;
+
+   list_for_each_entry_safe(a, tmp, actions, list) {
+   const struct tc_action_ops *ops = a->ops;
+
+   if (tcf_action_put(a))
+   module_put(ops->owner);
+   }
+}
+
 int
 tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int 
ref)
 {
@@ -835,17 +847,6 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
return ERR_PTR(err);
 }
 
-static void cleanup_a(struct list_head *actions, int ovr)
-{
-   struct tc_action *a;
-
-   if (!ovr)
-   return;
-
-   list_for_each_entry(a, actions, list)
-   refcount_dec(&a->tcfa_refcnt);
-}
-
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
struct list_head *actions, size_t *attr_size,
@@ -874,11 +875,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
}
 
*attr_size = tcf_action_full_attrs_size(sz);
-
-   /* Remove the temp refcnt which was necessary to protect against
-* destroying an existing action which was being replaced
-*/
-   cleanup_a(actions, ovr);
return 0;
 
 err:
@@ -1209,7 +1205,7 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct 
nlmsghdr *n,
return ret;
}
 err:
-   tcf_action_destroy(&actions, 0);
+   tcf_action_put_lst(&actions);
return ret;
 }
 
@@ -1251,8 +1247,11 @@ static int tcf_action_add(struct net *net, struct nlattr 
*nla,
  &attr_size, true, extack);
if (ret)
return ret;
+   ret = tcf_add_notify(net, n, &actions, portid, attr_size, extack);
+   if (ovr)
+   tcf_action_put_lst(&actions);
 
-   return tcf_add_notify(net, n, &actions, portid, attr_size, extack);
+   return ret;
 }
 
 static u32 tcaa_root_flags_allowed = TCA_FLAG_LARGE_DUMP_ON;
-- 
2.7.5

[PATCH v3 11/11] net: sched: change action API to use array of pointers to actions

2018-05-27 Thread Vlad Buslov

Act API used linked list to pass set of actions to functions. It is
intrusive data structure that stores list nodes inside action structure
itself, which means it is not safe to modify such list concurrently.
However, action API doesn't use any linked list specific operations on this
set of actions, so it can be safely refactored into plain pointer array.

Refactor action API to use array of pointers to tc_actions instead of
linked list. Change argument 'actions' type of exported action init,
destroy and dump functions.

Signed-off-by: Vlad Buslov 
---
 include/net/act_api.h |  7 ++---
 net/sched/act_api.c   | 74 ---
 net/sched/cls_api.c   | 21 +--
 3 files changed, 50 insertions(+), 52 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index cd4547476074..43dfa5e1b3b3 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -168,19 +168,20 @@ static inline int tcf_idr_release(struct tc_action *a, 
bool bind)
 int tcf_register_action(struct tc_action_ops *a, struct pernet_operations 
*ops);
 int tcf_unregister_action(struct tc_action_ops *a,
  struct pernet_operations *ops);
-int tcf_action_destroy(struct list_head *actions, int bind);
+int tcf_action_destroy(struct tc_action *actions[], int bind);
 int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
int nr_actions, struct tcf_result *res);
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
-   struct list_head *actions, size_t *attr_size,
+   struct tc_action *actions[], size_t *attr_size,
bool rtnl_held, struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
bool rtnl_held,
struct netlink_ext_ack *extack);
-int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int);
+int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind,
+   int ref);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_copy_stats(struct sk_buff *, struct tc_action *, int);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 9511502e1cbb..7f904bb84aab 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -657,13 +657,14 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action 
**actions,
 }
 EXPORT_SYMBOL(tcf_action_exec);
 
-int tcf_action_destroy(struct list_head *actions, int bind)
+int tcf_action_destroy(struct tc_action *actions[], int bind)
 {
const struct tc_action_ops *ops;
-   struct tc_action *a, *tmp;
-   int ret = 0;
+   struct tc_action *a;
+   int ret = 0, i;
 
-   list_for_each_entry_safe(a, tmp, actions, list) {
+   for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+   a = actions[i];
ops = a->ops;
ret = __tcf_idr_release(a, bind, true);
if (ret == ACT_P_DELETED)
@@ -679,11 +680,12 @@ static int tcf_action_put(struct tc_action *p)
return __tcf_action_put(p, false);
 }
 
-static void tcf_action_put_lst(struct list_head *actions)
+static void tcf_action_put_many(struct tc_action *actions[])
 {
-   struct tc_action *a, *tmp;
+   int i;
 
-   list_for_each_entry_safe(a, tmp, actions, list) {
+   for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+   struct tc_action *a = actions[i];
const struct tc_action_ops *ops = a->ops;
 
if (tcf_action_put(a))
@@ -735,14 +737,15 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action 
*a, int bind, int ref)
 }
 EXPORT_SYMBOL(tcf_action_dump_1);
 
-int tcf_action_dump(struct sk_buff *skb, struct list_head *actions,
+int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[],
int bind, int ref)
 {
struct tc_action *a;
-   int err = -EINVAL;
+   int err = -EINVAL, i;
struct nlattr *nest;
 
-   list_for_each_entry(a, actions, list) {
+   for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+   a = actions[i];
nest = nla_nest_start(skb, a->order);
if (nest == NULL)
goto nla_put_failure;
@@ -878,10 +881,9 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) {
err = tcf_action_goto_chain_init(a, tp);
if (err) {
-   LIST_HEAD(actions);
+

[PATCH v3 02/11] net: sched: change type of reference and bind counters

2018-05-27 Thread Vlad Buslov

Change type of action reference counter to refcount_t.

Change type of action bind counter to atomic_t.
This type is used to allow decrementing bind counter without testing
for 0 result.

Signed-off-by: Vlad Buslov 
Signed-off-by: Jiri Pirko 
---
 include/net/act_api.h  |  5 +++--
 net/sched/act_api.c| 32 ++--
 net/sched/act_bpf.c|  4 ++--
 net/sched/act_connmark.c   |  4 ++--
 net/sched/act_csum.c   |  4 ++--
 net/sched/act_gact.c   |  4 ++--
 net/sched/act_ife.c|  4 ++--
 net/sched/act_ipt.c|  4 ++--
 net/sched/act_mirred.c |  4 ++--
 net/sched/act_nat.c|  4 ++--
 net/sched/act_pedit.c  |  4 ++--
 net/sched/act_police.c |  4 ++--
 net/sched/act_sample.c |  4 ++--
 net/sched/act_simple.c |  4 ++--
 net/sched/act_skbedit.c|  4 ++--
 net/sched/act_skbmod.c |  4 ++--
 net/sched/act_tunnel_key.c |  4 ++--
 net/sched/act_vlan.c   |  4 ++--
 18 files changed, 57 insertions(+), 44 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index f7b59ef7303d..e634014605cb 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -6,6 +6,7 @@
  * Public action API for classifiers/qdiscs
 */
 
+#include 
 #include 
 #include 
 #include 
@@ -26,8 +27,8 @@ struct tc_action {
struct tcf_idrinfo  *idrinfo;
 
u32 tcfa_index;
-   int tcfa_refcnt;
-   int tcfa_bindcnt;
+   refcount_t  tcfa_refcnt;
+   atomic_ttcfa_bindcnt;
u32 tcfa_capab;
int tcfa_action;
struct tcf_ttcfa_tm;
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 02670c7489e3..4f064ecab882 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -105,14 +105,26 @@ int __tcf_idr_release(struct tc_action *p, bool bind, 
bool strict)
 
ASSERT_RTNL();
 
+   /* Release with strict==1 and bind==0 is only called through act API
+* interface (classifiers always bind). Only case when action with
+* positive reference count and zero bind count can exist is when it was
+* also created with act API (unbinding last classifier will destroy the
+* action if it was created by classifier). So only case when bind count
+* can be changed after initial check is when unbound action is
+* destroyed by act API while classifier binds to action with same id
+* concurrently. This result either creation of new action(same behavior
+* as before), or reusing existing action if concurrent process
+* increments reference count before action is deleted. Both scenarios
+* are acceptable.
+*/
if (p) {
if (bind)
-   p->tcfa_bindcnt--;
-   else if (strict && p->tcfa_bindcnt > 0)
+   atomic_dec(&p->tcfa_bindcnt);
+   else if (strict && atomic_read(&p->tcfa_bindcnt) > 0)
return -EPERM;
 
-   p->tcfa_refcnt--;
-   if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) {
+   if (atomic_read(&p->tcfa_bindcnt) <= 0 &&
+   refcount_dec_and_test(&p->tcfa_refcnt)) {
if (p->ops->cleanup)
p->ops->cleanup(p);
tcf_idr_remove(p->idrinfo, p);
@@ -304,8 +316,8 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, 
struct tc_action **a,
 
if (index && p) {
if (bind)
-   p->tcfa_bindcnt++;
-   p->tcfa_refcnt++;
+   atomic_inc(&p->tcfa_bindcnt);
+   refcount_inc(&p->tcfa_refcnt);
*a = p;
return true;
}
@@ -324,9 +336,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
 
if (unlikely(!p))
return -ENOMEM;
-   p->tcfa_refcnt = 1;
+   refcount_set(&p->tcfa_refcnt, 1);
if (bind)
-   p->tcfa_bindcnt = 1;
+   atomic_set(&p->tcfa_bindcnt, 1);
 
if (cpustats) {
p->cpu_bstats = netdev_alloc_pcpu_stats(struct 
gnet_stats_basic_cpu);
@@ -782,7 +794,7 @@ static void cleanup_a(struct list_head *actions, int ovr)
return;
 
list_for_each_entry(a, actions, list)
-   a->tcfa_refcnt--;
+   refcount_dec(&a->tcfa_refcnt);
 }
 
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
@@ -810,7 +822,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
act->order = i;
sz += tcf_action_fill_size(act);
if (ovr)
-   act->tcfa_refcnt++;
+   refcount_inc(&act->tcf

[PATCH v3 05/11] net: sched: implement action API that deletes action by index

2018-05-27 Thread Vlad Buslov

Implement new action API function that atomically finds and deletes action
from idr by index. Intended to be used by lockless actions that do not rely
on rtnl lock.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Rename tcf_idr_find_delete to tcf_idr_delete_index.

 include/net/act_api.h |  1 +
 net/sched/act_api.c   | 39 +++
 2 files changed, 40 insertions(+)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 888ff471bbf6..d94ec6400673 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -153,6 +153,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
   int bind, bool cpustats);
 void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a);
 
+int tcf_idr_delete_index(struct tc_action_net *tn, u32 index);
 int __tcf_idr_release(struct tc_action *a, bool bind, bool strict);
 
 static inline int tcf_idr_release(struct tc_action *a, bool bind)
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index aa304d36fee0..0f31f09946ab 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -319,6 +319,45 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, 
struct tc_action **a,
 }
 EXPORT_SYMBOL(tcf_idr_check);
 
+int tcf_idr_delete_index(struct tc_action_net *tn, u32 index)
+{
+   struct tcf_idrinfo *idrinfo = tn->idrinfo;
+   struct tc_action *p;
+   int ret = 0;
+
+   spin_lock(&idrinfo->lock);
+   p = idr_find(&idrinfo->action_idr, index);
+   if (!p) {
+   spin_unlock(&idrinfo->lock);
+   return -ENOENT;
+   }
+
+   if (!atomic_read(&p->tcfa_bindcnt)) {
+   if (refcount_dec_and_test(&p->tcfa_refcnt)) {
+   struct module *owner = p->ops->owner;
+
+   WARN_ON(p != idr_remove(&idrinfo->action_idr,
+   p->tcfa_index));
+   spin_unlock(&idrinfo->lock);
+
+   if (p->ops->cleanup)
+   p->ops->cleanup(p);
+
+   gen_kill_estimator(&p->tcfa_rate_est);
+   free_tcf(p);
+   module_put(owner);
+   return 0;
+   }
+   ret = 0;
+   } else {
+   ret = -EPERM;
+   }
+
+   spin_unlock(&idrinfo->lock);
+   return ret;
+}
+EXPORT_SYMBOL(tcf_idr_delete_index);
+
 int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
   struct tc_action **a, const struct tc_action_ops *ops,
   int bind, bool cpustats)
-- 
2.7.5

[PATCH v3 04/11] net: sched: always take reference to action

2018-05-27 Thread Vlad Buslov

Without rtnl lock protection it is no longer safe to use pointer to tc
action without holding reference to it. (it can be destroyed concurrently)

Remove unsafe action idr lookup function. Instead of it, implement safe tcf
idr check function that atomically looks up action in idr and increments
its reference and bind counters. Implement both action search and check
using new safe function

Reference taken by idr check is temporal and should not be accounted by
userspace clients (both logically and to preserver current API behavior).
Subtract temporal reference when dumping action to userspace using existing
tca_get_fill function arguments.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Make __tcf_idr_check function static
- Merge changes that take reference to action when performing lookup and
  changes that account for this additional reference when dumping action
  to user space into single patch.

 net/sched/act_api.c | 46 --
 1 file changed, 20 insertions(+), 26 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 256b0c93916c..aa304d36fee0 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -284,44 +284,38 @@ int tcf_generic_walker(struct tc_action_net *tn, struct 
sk_buff *skb,
 }
 EXPORT_SYMBOL(tcf_generic_walker);
 
-static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo)
+static bool __tcf_idr_check(struct tc_action_net *tn, u32 index,
+   struct tc_action **a, int bind)
 {
-   struct tc_action *p = NULL;
+   struct tcf_idrinfo *idrinfo = tn->idrinfo;
+   struct tc_action *p;
 
spin_lock(&idrinfo->lock);
p = idr_find(&idrinfo->action_idr, index);
+   if (p) {
+   refcount_inc(&p->tcfa_refcnt);
+   if (bind)
+   atomic_inc(&p->tcfa_bindcnt);
+   }
spin_unlock(&idrinfo->lock);
 
-   return p;
+   if (p) {
+   *a = p;
+   return true;
+   }
+   return false;
 }
 
 int tcf_idr_search(struct tc_action_net *tn, struct tc_action **a, u32 index)
 {
-   struct tcf_idrinfo *idrinfo = tn->idrinfo;
-   struct tc_action *p = tcf_idr_lookup(index, idrinfo);
-
-   if (p) {
-   *a = p;
-   return 1;
-   }
-   return 0;
+   return __tcf_idr_check(tn, index, a, 0);
 }
 EXPORT_SYMBOL(tcf_idr_search);
 
 bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a,
   int bind)
 {
-   struct tcf_idrinfo *idrinfo = tn->idrinfo;
-   struct tc_action *p = tcf_idr_lookup(index, idrinfo);
-
-   if (index && p) {
-   if (bind)
-   atomic_inc(&p->tcfa_bindcnt);
-   refcount_inc(&p->tcfa_refcnt);
-   *a = p;
-   return true;
-   }
-   return false;
+   return __tcf_idr_check(tn, index, a, bind);
 }
 EXPORT_SYMBOL(tcf_idr_check);
 
@@ -932,7 +926,7 @@ tcf_get_notify(struct net *net, u32 portid, struct nlmsghdr 
*n,
if (!skb)
return -ENOBUFS;
if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, event,
-0, 0) <= 0) {
+0, 1) <= 0) {
NL_SET_ERR_MSG(extack, "Failed to fill netlink attributes while 
adding TC action");
kfree_skb(skb);
return -EINVAL;
@@ -1072,7 +1066,7 @@ tcf_del_notify(struct net *net, struct nlmsghdr *n, 
struct list_head *actions,
return -ENOBUFS;
 
if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, RTM_DELACTION,
-0, 1) <= 0) {
+0, 2) <= 0) {
NL_SET_ERR_MSG(extack, "Failed to fill netlink TC action 
attributes");
kfree_skb(skb);
return -EINVAL;
@@ -1131,14 +1125,14 @@ tca_action_gd(struct net *net, struct nlattr *nla, 
struct nlmsghdr *n,
if (event == RTM_GETACTION)
ret = tcf_get_notify(net, portid, n, &actions, event, extack);
else { /* delete */
+   cleanup_a(&actions, 1); /* lookup took reference */
ret = tcf_del_notify(net, n, &actions, portid, attr_size, 
extack);
if (ret)
goto err;
return ret;
}
 err:
-   if (event != RTM_GETACTION)
-   tcf_action_destroy(&actions, 0);
+   tcf_action_destroy(&actions, 0);
return ret;
 }
 
-- 
2.7.5

[PATCH v3 00/11] Modify action API for implementing lockless actions

2018-05-27 Thread Vlad Buslov

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a first step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) to be registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in action API that
prevents it from being executed concurrently.

As a preparation for executing TC rules update handlers without rtnl
lock, action API code was audited to determine areas that assume
external synchronization with rtnl lock and must be changed to allow
safe concurrent access with following results:

1. Action idr is already protected with spinlock. However, some code
   paths assume that idr state is not changes between several
   consecutive tcf_idr_* function calls.
2. tc_action reference and bind counters are implemented as plain
   integers. They purpose was to allow single actions to be shared
   between multiple filters, not to provide means for concurrent
   modification.
3. tc_action 'cookie' pointer field is not protected against
   modification.
4. Action API functions, that work with set of actions, use intrusive
   linked list, which cannot be used concurrently without additional
   synchronization.
5. Action API functions don't take reference to actions while using
   them, assuming external synchronization with rtnl lock.

Following solutions to these problems are implemented:

1. To remove assumption that idr state doesn't change between tcf_idr_*
   calls, implement new functions that atomically perform several
   operations on idr without releasing idr spinlock. (function to
   atomically lookup and delete action by index, function to atomically
   check if action exists and allocate new one if necessary, etc.)
2. Use atomic operations on counters to make them suitable for
   concurrent get/put operations.
3. Data that 'cookie' points to is never modified, so it enough to
   refactor it to rcu pointer to prevent concurrent de-allocation.
4. Action API doesn't actually use any linked list specific operations
   on actions intrusive linked list, so it can be refactored to array in
   straightforward manner.
5. Always take reference to action while accessing it in action API.
   tcf_idr_search function modified to take reference to action before
   returning it, so there is no way to lookup an action without
   incrementing its reference counter. All users of this function are
   modified to release the reference, after they done using action. With
   all users using reference counting, it is now safe to concurrently
   delete actions.

Since only shared state in action API module are actions themselves and
action idr, these changes are sufficient to not to rely on global rtnl
lock for protection of internal action API data structures.

Changes from V1 to V2:
- Removed redundant actions ops lookup during delete.
- Merge action ops delete definition and implementation.
- Assume all actions have delete implemented and don't check for it
  explicitly.
- Resplit action lookup/release code to prevent memory leaks in
  individual patches.
- Make __tcf_idr_check function static
- Remove unique idr insertion function. Change original idr insert to do
  the same thing.
- Merge changes that take reference to action when performing lookup and
  changes that account for this additional reference when dumping action
  to user space into single patch.
- Change convoluted commit message.
- Rename "unlocked" to "rtnl_held" for clarity.
- Remove estimator lock add patch.
- Refactor action check-alloc code into standalone function.
- Rename tcf_idr_find_delete to tcf_idr_delete_index.
- Rearrange variable definitions in tc_action_delete.
- Add patch that refactors action API code to use array of pointers to
  actions instead of intrusive linked list.
- Expand cover letter.

Vlad Buslov (11):
  net: sched: use rcu for action cookie update
  net: sched: change type of reference and bind counters
  net: sched: implement unlocked action init API
  net: sched: always take reference to action
  net: sched: implement action API that deletes action by index
  net: sched: add 'delete' function to action ops
  net: sched: implement reference counted action release
  net: sched: don't release reference on action overwrite
  net: sched: use reference counting action init
  net: sched: atomically check-allocate action
  net: sched: change action API to use array of pointers to actions

 include/net/act_api.h  |  25 ++-
 include/net/pkt_cls.h  |   1 +
 net/sched/act_api.c| 408 +++--
 net/sched/act_bpf.

[PATCH v3 01/11] net: sched: use rcu for action cookie update

2018-05-27 Thread Vlad Buslov

Implement functions to atomically update and free action cookie
using rcu mechanism.

Signed-off-by: Vlad Buslov 
Signed-off-by: Jiri Pirko 
---
 include/net/act_api.h |  2 +-
 include/net/pkt_cls.h |  1 +
 net/sched/act_api.c   | 44 ++--
 3 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9e59ebfded62..f7b59ef7303d 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -37,7 +37,7 @@ struct tc_action {
spinlock_t  tcfa_lock;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
-   struct tc_cookie*act_cookie;
+   struct tc_cookie__rcu *act_cookie;
struct tcf_chain*goto_chain;
 };
 #define tcf_index  common.tcfa_index
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e828d31be5da..3068cc8aa0f1 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -769,6 +769,7 @@ struct tc_mqprio_qopt_offload {
 struct tc_cookie {
u8  *data;
u32 len;
+   struct rcu_head rcu;
 };
 
 struct tc_qopt_offload_stats {
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3f4cf930f809..02670c7489e3 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -55,6 +55,24 @@ static void tcf_action_goto_chain_exec(const struct 
tc_action *a,
res->goto_tp = rcu_dereference_bh(chain->filter_chain);
 }
 
+static void tcf_free_cookie_rcu(struct rcu_head *p)
+{
+   struct tc_cookie *cookie = container_of(p, struct tc_cookie, rcu);
+
+   kfree(cookie->data);
+   kfree(cookie);
+}
+
+static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie,
+ struct tc_cookie *new_cookie)
+{
+   struct tc_cookie *old;
+
+   old = xchg(old_cookie, new_cookie);
+   if (old)
+   call_rcu(&old->rcu, tcf_free_cookie_rcu);
+}
+
 /* XXX: For standalone actions, we don't need a RCU grace period either, 
because
  * actions are always connected to filters and filters are already destroyed in
  * RCU callbacks, so after a RCU grace period actions are already disconnected
@@ -65,10 +83,7 @@ static void free_tcf(struct tc_action *p)
free_percpu(p->cpu_bstats);
free_percpu(p->cpu_qstats);
 
-   if (p->act_cookie) {
-   kfree(p->act_cookie->data);
-   kfree(p->act_cookie);
-   }
+   tcf_set_action_cookie(&p->act_cookie, NULL);
if (p->goto_chain)
tcf_action_goto_chain_fini(p);
 
@@ -567,16 +582,22 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action 
*a, int bind, int ref)
int err = -EINVAL;
unsigned char *b = skb_tail_pointer(skb);
struct nlattr *nest;
+   struct tc_cookie *cookie;
 
if (nla_put_string(skb, TCA_KIND, a->ops->kind))
goto nla_put_failure;
if (tcf_action_copy_stats(skb, a, 0))
goto nla_put_failure;
-   if (a->act_cookie) {
-   if (nla_put(skb, TCA_ACT_COOKIE, a->act_cookie->len,
-   a->act_cookie->data))
+
+   rcu_read_lock();
+   cookie = rcu_dereference(a->act_cookie);
+   if (cookie) {
+   if (nla_put(skb, TCA_ACT_COOKIE, cookie->len, cookie->data)) {
+   rcu_read_unlock();
goto nla_put_failure;
+   }
}
+   rcu_read_unlock();
 
nest = nla_nest_start(skb, TCA_OPTIONS);
if (nest == NULL)
@@ -719,13 +740,8 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
if (err < 0)
goto err_mod;
 
-   if (name == NULL && tb[TCA_ACT_COOKIE]) {
-   if (a->act_cookie) {
-   kfree(a->act_cookie->data);
-   kfree(a->act_cookie);
-   }
-   a->act_cookie = cookie;
-   }
+   if (!name && tb[TCA_ACT_COOKIE])
+   tcf_set_action_cookie(&a->act_cookie, cookie);
 
/* module count goes up only when brand new policy is created
 * if it exists and is only bound to in a_o->init() then
-- 
2.7.5

[PATCH v3 03/11] net: sched: implement unlocked action init API

2018-05-27 Thread Vlad Buslov

Add additional 'rtnl_held' argument to act API init functions. It is
required to implement actions that need to release rtnl lock before loading
kernel module and reacquire if afterwards.

Signed-off-by: Vlad Buslov 
---
Changes from V1 to V2:
- Rename "unlocked" to "rtnl_held" for clarity.

 include/net/act_api.h  |  6 --
 net/sched/act_api.c| 18 +++---
 net/sched/act_bpf.c|  3 ++-
 net/sched/act_connmark.c   |  2 +-
 net/sched/act_csum.c   |  3 ++-
 net/sched/act_gact.c   |  3 ++-
 net/sched/act_ife.c|  3 ++-
 net/sched/act_ipt.c|  6 --
 net/sched/act_mirred.c |  5 +++--
 net/sched/act_nat.c|  2 +-
 net/sched/act_pedit.c  |  3 ++-
 net/sched/act_police.c |  2 +-
 net/sched/act_sample.c |  3 ++-
 net/sched/act_simple.c |  3 ++-
 net/sched/act_skbedit.c|  3 ++-
 net/sched/act_skbmod.c |  3 ++-
 net/sched/act_tunnel_key.c |  3 ++-
 net/sched/act_vlan.c   |  3 ++-
 net/sched/cls_api.c|  5 +++--
 19 files changed, 50 insertions(+), 29 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index e634014605cb..888ff471bbf6 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -92,7 +92,8 @@ struct tc_action_ops {
  struct netlink_ext_ack *extack);
int (*init)(struct net *net, struct nlattr *nla,
struct nlattr *est, struct tc_action **act, int ovr,
-   int bind, struct netlink_ext_ack *extack);
+   int bind, bool rtnl_held,
+   struct netlink_ext_ack *extack);
int (*walk)(struct net *, struct sk_buff *,
struct netlink_callback *, int,
const struct tc_action_ops *,
@@ -168,10 +169,11 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action 
**actions,
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
struct list_head *actions, size_t *attr_size,
-   struct netlink_ext_ack *extack);
+   bool rtnl_held, struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
+   bool rtnl_held,
struct netlink_ext_ack *extack);
 int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 4f064ecab882..256b0c93916c 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -671,6 +671,7 @@ static struct tc_cookie *nla_memdup_cookie(struct nlattr 
**tb)
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
+   bool rtnl_held,
struct netlink_ext_ack *extack)
 {
struct tc_action *a;
@@ -721,9 +722,11 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
a_o = tc_lookup_action_n(act_name);
if (a_o == NULL) {
 #ifdef CONFIG_MODULES
-   rtnl_unlock();
+   if (rtnl_held)
+   rtnl_unlock();
request_module("act_%s", act_name);
-   rtnl_lock();
+   if (rtnl_held)
+   rtnl_lock();
 
a_o = tc_lookup_action_n(act_name);
 
@@ -746,9 +749,10 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
/* backward compatibility for policer */
if (name == NULL)
err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind,
-   extack);
+   rtnl_held, extack);
else
-   err = a_o->init(net, nla, est, &a, ovr, bind, extack);
+   err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held,
+   extack);
if (err < 0)
goto err_mod;
 
@@ -800,7 +804,7 @@ static void cleanup_a(struct list_head *actions, int ovr)
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
struct list_head *actions, size_t *attr_size,
-   struct netlink_ext_ack *extack)
+   bool rtnl_held, struct netlink_ext_ack *extack)
 {
struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
struct tc_action *act;
@@ -814,7 +818,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
s

[PATCH] ath10k: fix memory leak of tpc_stats

2018-05-27 Thread Colin King

From: Colin Ian King 

Currently tpc_stats is allocated and is leaked on the return
path if num_tx_chain is greater than WMI_TPC_TX_N_CHAIN. Avoid
this leak by performing the check on num_tx_chain before the
allocation of tpc_stats.

Detected by CoverityScan, CID#1469422 ("Resource Leak")
Fixes: 4b190675ad06 ("ath10k: fix kernel panic while reading tpc_stats")

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/ath/ath10k/wmi.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index 46fb96ee5852..694f46fa9e14 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -4470,10 +4470,6 @@ void ath10k_wmi_event_pdev_tpc_config(struct ath10k *ar, 
struct sk_buff *skb)
 
ev = (struct wmi_pdev_tpc_config_event *)skb->data;
 
-   tpc_stats = kzalloc(sizeof(*tpc_stats), GFP_ATOMIC);
-   if (!tpc_stats)
-   return;
-
num_tx_chain = __le32_to_cpu(ev->num_tx_chain);
 
if (num_tx_chain > WMI_TPC_TX_N_CHAIN) {
@@ -4482,6 +4478,10 @@ void ath10k_wmi_event_pdev_tpc_config(struct ath10k *ar, 
struct sk_buff *skb)
return;
}
 
+   tpc_stats = kzalloc(sizeof(*tpc_stats), GFP_ATOMIC);
+   if (!tpc_stats)
+   return;
+
ath10k_wmi_tpc_config_get_rate_code(rate_code, pream_table,
num_tx_chain);
 
-- 
2.17.0

KASAN: use-after-free Write in bpf_tcp_close

2018-05-27 Thread syzbot


Hello,

syzbot found the following crash on:

HEAD commit:ff4fb475cea8 Merge branch 'btf-uapi-cleanups'
git tree:   bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12b3d57780
kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
dashboard link: https://syzkaller.appspot.com/bug?extid=31025a5f3f7650081204
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=109a2f3780
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=171a727b80

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+31025a5f3f7650081...@syzkaller.appspotmail.com

==
BUG: KASAN: use-after-free in cmpxchg_size  
include/asm-generic/atomic-instrumented.h:355 [inline]
BUG: KASAN: use-after-free in bpf_tcp_close+0x6f5/0xf80  
kernel/bpf/sockmap.c:265

Write of size 8 at addr 8801ca277680 by task syz-executor749/9723

CPU: 0 PID: 9723 Comm: syz-executor749 Not tainted 4.17.0-rc4+ #19
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
 cmpxchg_size include/asm-generic/atomic-instrumented.h:355 [inline]
 bpf_tcp_close+0x6f5/0xf80 kernel/bpf/sockmap.c:265
 inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427
 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459
 sock_release+0x96/0x1b0 net/socket.c:594
 sock_close+0x16/0x20 net/socket.c:1149
 __fput+0x34d/0x890 fs/file_table.c:209
 fput+0x15/0x20 fs/file_table.c:243
 task_work_run+0x1e4/0x290 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x1aee/0x2730 kernel/exit.c:865
 do_group_exit+0x16f/0x430 kernel/exit.c:968
 __do_sys_exit_group kernel/exit.c:979 [inline]
 __se_sys_exit_group kernel/exit.c:977 [inline]
 __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:977
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440a59
RSP: 002b:7ffdadf92488 EFLAGS: 0206 ORIG_RAX: 00e7
RAX: ffda RBX:  RCX: 00440a59
RDX: 00440a59 RSI: 0020 RDI: 
RBP:  R08: 004002c8 R09: 00401ea0
R10: 004002c8 R11: 0206 R12: 0001b5ac
R13: 00401ea0 R14:  R15: 

Allocated by task 9723:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 __do_kmalloc_node mm/slab.c:3682 [inline]
 __kmalloc_node+0x47/0x70 mm/slab.c:3689
 kmalloc_node include/linux/slab.h:554 [inline]
 bpf_map_area_alloc+0x3f/0x90 kernel/bpf/syscall.c:144
 sock_map_alloc+0x376/0x410 kernel/bpf/sockmap.c:1555
 find_and_alloc_map kernel/bpf/syscall.c:126 [inline]
 map_create+0x393/0x1010 kernel/bpf/syscall.c:448
 __do_sys_bpf kernel/bpf/syscall.c:2128 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:2105 [inline]
 __x64_sys_bpf+0x300/0x4f0 kernel/bpf/syscall.c:2105
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 4521:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xd9/0x260 mm/slab.c:3813
 kvfree+0x61/0x70 mm/util.c:440
 bpf_map_area_free+0x15/0x20 kernel/bpf/syscall.c:155
 sock_map_remove_complete kernel/bpf/sockmap.c:1443 [inline]
 sock_map_free+0x408/0x540 kernel/bpf/sockmap.c:1619
 bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:259
 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
 kthread+0x345/0x410 kernel/kthread.c:238
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

The buggy address belongs to the object at 8801ca277680
 which belongs to the cache kmalloc-1024 of size 1024
The buggy address is located 0 bytes inside of
 1024-byte region [8801ca277680, 8801ca277a80)
The buggy address belongs to the page:
page:ea0007289d80 count:1 mapcount:0 mapping:8801ca276000 index:0x0  
compound_mapcount: 0

flags: 0x2fffc008100(slab|head)
raw: 02fffc008100 8801ca276000  00010007
raw: ea0006d12b20 ea000763bba0 8801da800ac0 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 8801ca277580: fc fc fc fc fc fc f

[PATCH] net: sched: split tc_ctl_tfilter into three handlers

2018-05-27 Thread Vlad Buslov

tc_ctl_tfilter handles three netlink message types: RTM_NEWTFILTER,
RTM_DELTFILTER, RTM_GETTFILTER. However, implementation of this function
involves a lot of branching on specific message type because most of the
code is message-specific. This significantly complicates adding new
functionality and doesn't provide much benefit of code reuse.

Split tc_ctl_tfilter to three standalone functions that handle filter new,
delete and get requests.

The only truly protocol independent part of tc_ctl_tfilter is code that
looks up queue, class, and block. Refactor this code to standalone
tcf_block_find function that is used by all three new handlers.

Signed-off-by: Vlad Buslov 
---
 net/sched/cls_api.c | 438 +++-
 1 file changed, 293 insertions(+), 145 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1ed673f501c2..8dfbad1bddf5 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -436,6 +436,78 @@ static struct tcf_block *tcf_block_lookup(struct net *net, 
u32 block_index)
return idr_find(&tn->idr, block_index);
 }
 
+/* Find tcf block.
+ * Set q, parent, cl when appropriate.
+ */
+
+static struct tcf_block *tcf_block_find(struct net *net, struct Qdisc **q,
+   u32 *parent, unsigned long *cl,
+   int ifindex, u32 block_index,
+   struct netlink_ext_ack *extack)
+{
+   struct tcf_block *block;
+
+   if (ifindex == TCM_IFINDEX_MAGIC_BLOCK) {
+   block = tcf_block_lookup(net, block_index);
+   if (!block) {
+   NL_SET_ERR_MSG(extack, "Block of given index was not 
found");
+   return ERR_PTR(-EINVAL);
+   }
+   } else {
+   const struct Qdisc_class_ops *cops;
+   struct net_device *dev;
+
+   /* Find link */
+   dev = __dev_get_by_index(net, ifindex);
+   if (!dev)
+   return ERR_PTR(-ENODEV);
+
+   /* Find qdisc */
+   if (!*parent) {
+   *q = dev->qdisc;
+   *parent = (*q)->handle;
+   } else {
+   *q = qdisc_lookup(dev, TC_H_MAJ(*parent));
+   if (!*q) {
+   NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't 
exists");
+   return ERR_PTR(-EINVAL);
+   }
+   }
+
+   /* Is it classful? */
+   cops = (*q)->ops->cl_ops;
+   if (!cops) {
+   NL_SET_ERR_MSG(extack, "Qdisc not classful");
+   return ERR_PTR(-EINVAL);
+   }
+
+   if (!cops->tcf_block) {
+   NL_SET_ERR_MSG(extack, "Class doesn't support blocks");
+   return ERR_PTR(-EOPNOTSUPP);
+   }
+
+   /* Do we search for filter, attached to class? */
+   if (TC_H_MIN(*parent)) {
+   *cl = cops->find(*q, *parent);
+   if (*cl == 0) {
+   NL_SET_ERR_MSG(extack, "Specified class doesn't 
exist");
+   return ERR_PTR(-ENOENT);
+   }
+   }
+
+   /* And the last stroke */
+   block = cops->tcf_block(*q, *cl, extack);
+   if (!block)
+   return ERR_PTR(-EINVAL);
+   if (tcf_block_shared(block)) {
+   NL_SET_ERR_MSG(extack, "This filter block is shared. 
Please use the block index to manipulate the filters");
+   return ERR_PTR(-EOPNOTSUPP);
+   }
+   }
+
+   return block;
+}
+
 static struct tcf_chain *tcf_block_chain_zero(struct tcf_block *block)
 {
return list_first_entry(&block->chain_list, struct tcf_chain, list);
@@ -1002,9 +1074,7 @@ static void tfilter_notify_chain(struct net *net, struct 
sk_buff *oskb,
   q, parent, 0, event, false);
 }
 
-/* Add/change/delete/get a filter node */
-
-static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
+static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
  struct netlink_ext_ack *extack)
 {
struct net *net = sock_net(skb->sk);
@@ -1025,8 +1095,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n,
int err;
int tp_created;
 
-   if ((n->nlmsg_type != RTM_GETTFILTER) &&
-   !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
+   if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
 replay:
@@ -1044,24 +1113,13 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n,
cl = 0;
 
if (prio == 0) {
-   switch (n->nlmsg_type) {
-

Re: [PATCH v2 00/11] Modify action API for implementing lockless actions

2018-05-27 Thread Jiri Pirko

Sun, May 27, 2018 at 06:41:27PM CEST, vla...@mellanox.com wrote:

[...]

>Changes from V1 to V2:
>- Removed redundant actions ops lookup during delete.
>- Merge action ops delete definition and implementation.
>- Assume all actions have delete implemented and don't check for it
>  explicitly.
>- Resplit action lookup/release code to prevent memory leaks in
>  individual patches.
>- Make __tcf_idr_check function static
>- Remove unique idr insertion function. Change original idr insert to do
>  the same thing.
>- Merge changes that take reference to action when performing lookup and
>  changes that account for this additional reference when dumping action
>  to user space into single patch.
>- Change convoluted commit message.
>- Rename "unlocked" to "rtnl_held" for clarity.
>- Remove estimator lock add patch.
>- Refactor action check-alloc code into standalone function.
>- Rename tcf_idr_find_delete to tcf_idr_delete_index.
>- Rearrange variable definitions in tc_action_delete.
>- Add patch that refactors action API code to use array of pointers to
>  actions instead of intrusive linked list.
>- Expand cover letter.

With 11 patches, I'm unable to follow the changes in this description.
Could you please write the changes to each individual patch?
What I do is I write changes for each patch under "---" (there are two
"---" lines in each patch then. Then I also copy the changes descriptions
to the cover letter.

[...]

[PATCH v2 01/11] net: sched: use rcu for action cookie update

2018-05-27 Thread Vlad Buslov

Implement functions to atomically update and free action cookie
using rcu mechanism.

Signed-off-by: Vlad Buslov 
Signed-off-by: Jiri Pirko 
---
 include/net/act_api.h |  2 +-
 include/net/pkt_cls.h |  1 +
 net/sched/act_api.c   | 44 ++--
 3 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9e59ebfded62..f7b59ef7303d 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -37,7 +37,7 @@ struct tc_action {
spinlock_t  tcfa_lock;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
-   struct tc_cookie*act_cookie;
+   struct tc_cookie__rcu *act_cookie;
struct tcf_chain*goto_chain;
 };
 #define tcf_index  common.tcfa_index
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e828d31be5da..3068cc8aa0f1 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -769,6 +769,7 @@ struct tc_mqprio_qopt_offload {
 struct tc_cookie {
u8  *data;
u32 len;
+   struct rcu_head rcu;
 };
 
 struct tc_qopt_offload_stats {
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3f4cf930f809..02670c7489e3 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -55,6 +55,24 @@ static void tcf_action_goto_chain_exec(const struct 
tc_action *a,
res->goto_tp = rcu_dereference_bh(chain->filter_chain);
 }
 
+static void tcf_free_cookie_rcu(struct rcu_head *p)
+{
+   struct tc_cookie *cookie = container_of(p, struct tc_cookie, rcu);
+
+   kfree(cookie->data);
+   kfree(cookie);
+}
+
+static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie,
+ struct tc_cookie *new_cookie)
+{
+   struct tc_cookie *old;
+
+   old = xchg(old_cookie, new_cookie);
+   if (old)
+   call_rcu(&old->rcu, tcf_free_cookie_rcu);
+}
+
 /* XXX: For standalone actions, we don't need a RCU grace period either, 
because
  * actions are always connected to filters and filters are already destroyed in
  * RCU callbacks, so after a RCU grace period actions are already disconnected
@@ -65,10 +83,7 @@ static void free_tcf(struct tc_action *p)
free_percpu(p->cpu_bstats);
free_percpu(p->cpu_qstats);
 
-   if (p->act_cookie) {
-   kfree(p->act_cookie->data);
-   kfree(p->act_cookie);
-   }
+   tcf_set_action_cookie(&p->act_cookie, NULL);
if (p->goto_chain)
tcf_action_goto_chain_fini(p);
 
@@ -567,16 +582,22 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action 
*a, int bind, int ref)
int err = -EINVAL;
unsigned char *b = skb_tail_pointer(skb);
struct nlattr *nest;
+   struct tc_cookie *cookie;
 
if (nla_put_string(skb, TCA_KIND, a->ops->kind))
goto nla_put_failure;
if (tcf_action_copy_stats(skb, a, 0))
goto nla_put_failure;
-   if (a->act_cookie) {
-   if (nla_put(skb, TCA_ACT_COOKIE, a->act_cookie->len,
-   a->act_cookie->data))
+
+   rcu_read_lock();
+   cookie = rcu_dereference(a->act_cookie);
+   if (cookie) {
+   if (nla_put(skb, TCA_ACT_COOKIE, cookie->len, cookie->data)) {
+   rcu_read_unlock();
goto nla_put_failure;
+   }
}
+   rcu_read_unlock();
 
nest = nla_nest_start(skb, TCA_OPTIONS);
if (nest == NULL)
@@ -719,13 +740,8 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
if (err < 0)
goto err_mod;
 
-   if (name == NULL && tb[TCA_ACT_COOKIE]) {
-   if (a->act_cookie) {
-   kfree(a->act_cookie->data);
-   kfree(a->act_cookie);
-   }
-   a->act_cookie = cookie;
-   }
+   if (!name && tb[TCA_ACT_COOKIE])
+   tcf_set_action_cookie(&a->act_cookie, cookie);
 
/* module count goes up only when brand new policy is created
 * if it exists and is only bound to in a_o->init() then
-- 
2.7.5

[PATCH v2 02/11] net: sched: change type of reference and bind counters

2018-05-27 Thread Vlad Buslov

Change type of action reference counter to refcount_t.

Change type of action bind counter to atomic_t.
This type is used to allow decrementing bind counter without testing
for 0 result.

Signed-off-by: Vlad Buslov 
Signed-off-by: Jiri Pirko 
---
 include/net/act_api.h  |  5 +++--
 net/sched/act_api.c| 32 ++--
 net/sched/act_bpf.c|  4 ++--
 net/sched/act_connmark.c   |  4 ++--
 net/sched/act_csum.c   |  4 ++--
 net/sched/act_gact.c   |  4 ++--
 net/sched/act_ife.c|  4 ++--
 net/sched/act_ipt.c|  4 ++--
 net/sched/act_mirred.c |  4 ++--
 net/sched/act_nat.c|  4 ++--
 net/sched/act_pedit.c  |  4 ++--
 net/sched/act_police.c |  4 ++--
 net/sched/act_sample.c |  4 ++--
 net/sched/act_simple.c |  4 ++--
 net/sched/act_skbedit.c|  4 ++--
 net/sched/act_skbmod.c |  4 ++--
 net/sched/act_tunnel_key.c |  4 ++--
 net/sched/act_vlan.c   |  4 ++--
 18 files changed, 57 insertions(+), 44 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index f7b59ef7303d..e634014605cb 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -6,6 +6,7 @@
  * Public action API for classifiers/qdiscs
 */
 
+#include 
 #include 
 #include 
 #include 
@@ -26,8 +27,8 @@ struct tc_action {
struct tcf_idrinfo  *idrinfo;
 
u32 tcfa_index;
-   int tcfa_refcnt;
-   int tcfa_bindcnt;
+   refcount_t  tcfa_refcnt;
+   atomic_ttcfa_bindcnt;
u32 tcfa_capab;
int tcfa_action;
struct tcf_ttcfa_tm;
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 02670c7489e3..4f064ecab882 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -105,14 +105,26 @@ int __tcf_idr_release(struct tc_action *p, bool bind, 
bool strict)
 
ASSERT_RTNL();
 
+   /* Release with strict==1 and bind==0 is only called through act API
+* interface (classifiers always bind). Only case when action with
+* positive reference count and zero bind count can exist is when it was
+* also created with act API (unbinding last classifier will destroy the
+* action if it was created by classifier). So only case when bind count
+* can be changed after initial check is when unbound action is
+* destroyed by act API while classifier binds to action with same id
+* concurrently. This result either creation of new action(same behavior
+* as before), or reusing existing action if concurrent process
+* increments reference count before action is deleted. Both scenarios
+* are acceptable.
+*/
if (p) {
if (bind)
-   p->tcfa_bindcnt--;
-   else if (strict && p->tcfa_bindcnt > 0)
+   atomic_dec(&p->tcfa_bindcnt);
+   else if (strict && atomic_read(&p->tcfa_bindcnt) > 0)
return -EPERM;
 
-   p->tcfa_refcnt--;
-   if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) {
+   if (atomic_read(&p->tcfa_bindcnt) <= 0 &&
+   refcount_dec_and_test(&p->tcfa_refcnt)) {
if (p->ops->cleanup)
p->ops->cleanup(p);
tcf_idr_remove(p->idrinfo, p);
@@ -304,8 +316,8 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, 
struct tc_action **a,
 
if (index && p) {
if (bind)
-   p->tcfa_bindcnt++;
-   p->tcfa_refcnt++;
+   atomic_inc(&p->tcfa_bindcnt);
+   refcount_inc(&p->tcfa_refcnt);
*a = p;
return true;
}
@@ -324,9 +336,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
 
if (unlikely(!p))
return -ENOMEM;
-   p->tcfa_refcnt = 1;
+   refcount_set(&p->tcfa_refcnt, 1);
if (bind)
-   p->tcfa_bindcnt = 1;
+   atomic_set(&p->tcfa_bindcnt, 1);
 
if (cpustats) {
p->cpu_bstats = netdev_alloc_pcpu_stats(struct 
gnet_stats_basic_cpu);
@@ -782,7 +794,7 @@ static void cleanup_a(struct list_head *actions, int ovr)
return;
 
list_for_each_entry(a, actions, list)
-   a->tcfa_refcnt--;
+   refcount_dec(&a->tcfa_refcnt);
 }
 
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
@@ -810,7 +822,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
act->order = i;
sz += tcf_action_fill_size(act);
if (ovr)
-   act->tcfa_refcnt++;
+   refcount_inc(&act->tcf

[PATCH v2 03/11] net: sched: implement unlocked action init API

2018-05-27 Thread Vlad Buslov

Add additional 'rtnl_held' argument to act API init functions. It is
required to implement actions that need to release rtnl lock before loading
kernel module and reacquire if afterwards.

Signed-off-by: Vlad Buslov 
---
 include/net/act_api.h  |  6 --
 net/sched/act_api.c| 18 +++---
 net/sched/act_bpf.c|  3 ++-
 net/sched/act_connmark.c   |  2 +-
 net/sched/act_csum.c   |  3 ++-
 net/sched/act_gact.c   |  3 ++-
 net/sched/act_ife.c|  3 ++-
 net/sched/act_ipt.c|  6 --
 net/sched/act_mirred.c |  5 +++--
 net/sched/act_nat.c|  2 +-
 net/sched/act_pedit.c  |  3 ++-
 net/sched/act_police.c |  2 +-
 net/sched/act_sample.c |  3 ++-
 net/sched/act_simple.c |  3 ++-
 net/sched/act_skbedit.c|  3 ++-
 net/sched/act_skbmod.c |  3 ++-
 net/sched/act_tunnel_key.c |  3 ++-
 net/sched/act_vlan.c   |  3 ++-
 net/sched/cls_api.c|  5 +++--
 19 files changed, 50 insertions(+), 29 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index e634014605cb..888ff471bbf6 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -92,7 +92,8 @@ struct tc_action_ops {
  struct netlink_ext_ack *extack);
int (*init)(struct net *net, struct nlattr *nla,
struct nlattr *est, struct tc_action **act, int ovr,
-   int bind, struct netlink_ext_ack *extack);
+   int bind, bool rtnl_held,
+   struct netlink_ext_ack *extack);
int (*walk)(struct net *, struct sk_buff *,
struct netlink_callback *, int,
const struct tc_action_ops *,
@@ -168,10 +169,11 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action 
**actions,
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
struct list_head *actions, size_t *attr_size,
-   struct netlink_ext_ack *extack);
+   bool rtnl_held, struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
+   bool rtnl_held,
struct netlink_ext_ack *extack);
 int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 4f064ecab882..256b0c93916c 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -671,6 +671,7 @@ static struct tc_cookie *nla_memdup_cookie(struct nlattr 
**tb)
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
+   bool rtnl_held,
struct netlink_ext_ack *extack)
 {
struct tc_action *a;
@@ -721,9 +722,11 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
a_o = tc_lookup_action_n(act_name);
if (a_o == NULL) {
 #ifdef CONFIG_MODULES
-   rtnl_unlock();
+   if (rtnl_held)
+   rtnl_unlock();
request_module("act_%s", act_name);
-   rtnl_lock();
+   if (rtnl_held)
+   rtnl_lock();
 
a_o = tc_lookup_action_n(act_name);
 
@@ -746,9 +749,10 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
/* backward compatibility for policer */
if (name == NULL)
err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind,
-   extack);
+   rtnl_held, extack);
else
-   err = a_o->init(net, nla, est, &a, ovr, bind, extack);
+   err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held,
+   extack);
if (err < 0)
goto err_mod;
 
@@ -800,7 +804,7 @@ static void cleanup_a(struct list_head *actions, int ovr)
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
struct list_head *actions, size_t *attr_size,
-   struct netlink_ext_ack *extack)
+   bool rtnl_held, struct netlink_ext_ack *extack)
 {
struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
struct tc_action *act;
@@ -814,7 +818,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
 
for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i];

[PATCH v2 06/11] net: sched: add 'delete' function to action ops

2018-05-27 Thread Vlad Buslov

Extend action ops with 'delete' function. Each action type to implements
its own delete function that doesn't depend on rtnl lock.

Implement delete function that is required to delete actions without
holding rtnl lock. Use action API function that atomically deletes action
only if it is still in action idr. This implementation prevents concurrent
threads from deleting same action twice.

Signed-off-by: Vlad Buslov 
---
 include/net/act_api.h  |  1 +
 net/sched/act_bpf.c|  8 
 net/sched/act_connmark.c   |  8 
 net/sched/act_csum.c   |  8 
 net/sched/act_gact.c   |  8 
 net/sched/act_ife.c|  8 
 net/sched/act_ipt.c| 16 
 net/sched/act_mirred.c |  8 
 net/sched/act_nat.c|  8 
 net/sched/act_pedit.c  |  8 
 net/sched/act_police.c |  8 
 net/sched/act_sample.c |  8 
 net/sched/act_simple.c |  8 
 net/sched/act_skbedit.c|  8 
 net/sched/act_skbmod.c |  8 
 net/sched/act_tunnel_key.c |  8 
 net/sched/act_vlan.c   |  8 
 17 files changed, 137 insertions(+)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index d94ec6400673..d256e20507b9 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -101,6 +101,7 @@ struct tc_action_ops {
void(*stats_update)(struct tc_action *, u64, u32, u64);
size_t  (*get_fill_size)(const struct tc_action *act);
struct net_device *(*get_dev)(const struct tc_action *a);
+   int (*delete)(struct net *net, u32 index);
 };
 
 struct tc_action_net {
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 8ebf40a3506c..7941dd66ff83 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -388,6 +388,13 @@ static int tcf_bpf_search(struct net *net, struct 
tc_action **a, u32 index,
return tcf_idr_search(tn, a, index);
 }
 
+static int tcf_bpf_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, bpf_net_id);
+
+   return tcf_idr_delete_index(tn, index);
+}
+
 static struct tc_action_ops act_bpf_ops __read_mostly = {
.kind   =   "bpf",
.type   =   TCA_ACT_BPF,
@@ -398,6 +405,7 @@ static struct tc_action_ops act_bpf_ops __read_mostly = {
.init   =   tcf_bpf_init,
.walk   =   tcf_bpf_walker,
.lookup =   tcf_bpf_search,
+   .delete =   tcf_bpf_delete,
.size   =   sizeof(struct tcf_bpf),
 };
 
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index e3787aa0025a..143c2d3de723 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -193,6 +193,13 @@ static int tcf_connmark_search(struct net *net, struct 
tc_action **a, u32 index,
return tcf_idr_search(tn, a, index);
 }
 
+static int tcf_connmark_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, connmark_net_id);
+
+   return tcf_idr_delete_index(tn, index);
+}
+
 static struct tc_action_ops act_connmark_ops = {
.kind   =   "connmark",
.type   =   TCA_ACT_CONNMARK,
@@ -202,6 +209,7 @@ static struct tc_action_ops act_connmark_ops = {
.init   =   tcf_connmark_init,
.walk   =   tcf_connmark_walker,
.lookup =   tcf_connmark_search,
+   .delete =   tcf_connmark_delete,
.size   =   sizeof(struct tcf_connmark_info),
 };
 
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 334261943f9f..3768539340e0 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -654,6 +654,13 @@ static size_t tcf_csum_get_fill_size(const struct 
tc_action *act)
return nla_total_size(sizeof(struct tc_csum));
 }
 
+static int tcf_csum_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, csum_net_id);
+
+   return tcf_idr_delete_index(tn, index);
+}
+
 static struct tc_action_ops act_csum_ops = {
.kind   = "csum",
.type   = TCA_ACT_CSUM,
@@ -665,6 +672,7 @@ static struct tc_action_ops act_csum_ops = {
.walk   = tcf_csum_walker,
.lookup = tcf_csum_search,
.get_fill_size  = tcf_csum_get_fill_size,
+   .delete = tcf_csum_delete,
.size   = sizeof(struct tcf_csum),
 };
 
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index b4dfb2b4addc..a431a711f0dd 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -231,6 +231,13 @@ static size_t tcf_gact_get_fill_size(const struct 
tc_action *act)
return sz;
 }
 
+static int tcf_gact_delete(struct net *net, u32 index)
+{
+   struct tc_action_net *tn = net_generic(net, gact_net_id);
+
+   return tcf_idr_delete_index(tn, index);
+}
+
 static struct tc_action_ops act_gact_ops = {

[PATCH v2 07/11] net: sched: implement reference counted action release

2018-05-27 Thread Vlad Buslov

Implement helper delete function that uses new action ops 'delete', instead
of destroying action directly. This is required so act API could delete
actions by index, without holding any references to action that is being
deleted.

Implement function __tcf_action_put() that releases reference to action and
frees it, if necessary. Refactor action deletion code to use new put
function and not to rely on rtnl lock. Remove rtnl lock assertions that are
no longer needed.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_api.c | 84 +++--
 net/sched/cls_api.c |  1 -
 2 files changed, 62 insertions(+), 23 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 0f31f09946ab..a023873db713 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -90,21 +90,39 @@ static void free_tcf(struct tc_action *p)
kfree(p);
 }
 
-static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p)
+static void tcf_action_cleanup(struct tc_action *p)
 {
-   spin_lock(&idrinfo->lock);
-   idr_remove(&idrinfo->action_idr, p->tcfa_index);
-   spin_unlock(&idrinfo->lock);
+   if (p->ops->cleanup)
+   p->ops->cleanup(p);
+
gen_kill_estimator(&p->tcfa_rate_est);
free_tcf(p);
 }
 
+static int __tcf_action_put(struct tc_action *p, bool bind)
+{
+   struct tcf_idrinfo *idrinfo = p->idrinfo;
+
+   if (refcount_dec_and_lock(&p->tcfa_refcnt, &idrinfo->lock)) {
+   if (bind)
+   atomic_dec(&p->tcfa_bindcnt);
+   idr_remove(&idrinfo->action_idr, p->tcfa_index);
+   spin_unlock(&idrinfo->lock);
+
+   tcf_action_cleanup(p);
+   return 1;
+   }
+
+   if (bind)
+   atomic_dec(&p->tcfa_bindcnt);
+
+   return 0;
+}
+
 int __tcf_idr_release(struct tc_action *p, bool bind, bool strict)
 {
int ret = 0;
 
-   ASSERT_RTNL();
-
/* Release with strict==1 and bind==0 is only called through act API
 * interface (classifiers always bind). Only case when action with
 * positive reference count and zero bind count can exist is when it was
@@ -118,18 +136,11 @@ int __tcf_idr_release(struct tc_action *p, bool bind, 
bool strict)
 * are acceptable.
 */
if (p) {
-   if (bind)
-   atomic_dec(&p->tcfa_bindcnt);
-   else if (strict && atomic_read(&p->tcfa_bindcnt) > 0)
+   if (!bind && strict && atomic_read(&p->tcfa_bindcnt) > 0)
return -EPERM;
 
-   if (atomic_read(&p->tcfa_bindcnt) <= 0 &&
-   refcount_dec_and_test(&p->tcfa_refcnt)) {
-   if (p->ops->cleanup)
-   p->ops->cleanup(p);
-   tcf_idr_remove(p->idrinfo, p);
+   if (__tcf_action_put(p, bind))
ret = ACT_P_DELETED;
-   }
}
 
return ret;
@@ -340,11 +351,7 @@ int tcf_idr_delete_index(struct tc_action_net *tn, u32 
index)
p->tcfa_index));
spin_unlock(&idrinfo->lock);
 
-   if (p->ops->cleanup)
-   p->ops->cleanup(p);
-
-   gen_kill_estimator(&p->tcfa_rate_est);
-   free_tcf(p);
+   tcf_action_cleanup(p);
module_put(owner);
return 0;
}
@@ -615,6 +622,11 @@ int tcf_action_destroy(struct list_head *actions, int bind)
return ret;
 }
 
+static int tcf_action_put(struct tc_action *p)
+{
+   return __tcf_action_put(p, false);
+}
+
 int
 tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int 
ref)
 {
@@ -1092,6 +1104,35 @@ static int tca_action_flush(struct net *net, struct 
nlattr *nla,
return err;
 }
 
+static int tcf_action_delete(struct net *net, struct list_head *actions,
+struct netlink_ext_ack *extack)
+{
+   struct tc_action *a, *tmp;
+   u32 act_index;
+   int ret;
+
+   list_for_each_entry_safe(a, tmp, actions, list) {
+   const struct tc_action_ops *ops = a->ops;
+
+   /* Actions can be deleted concurrently so we must save their
+* type and id to search again after reference is released.
+*/
+   act_index = a->tcfa_index;
+
+   list_del(&a->list);
+   if (tcf_action_put(a)) {
+   /* last reference, action was deleted concurrently */
+   module_put(ops->owner);
+   } else  {
+   /* now do the delete */
+   ret = ops->delete(net, act_index);
+   if (ret < 0)
+   return ret;
+   }
+   }
+   return 0;
+}
+
 stat

[PATCH v2 08/11] net: sched: don't release reference on action overwrite

2018-05-27 Thread Vlad Buslov

Return from action init function with reference to action taken,
even when overwriting existing action.

Action init API initializes its fourth argument (pointer to pointer to tc
action) to either existing action with same index or newly created action.
In case of existing index(and bind argument is zero), init function returns
without incrementing action reference counter. Caller of action init then
proceeds working with action, without actually holding reference to it.
This means that action could be deleted concurrently.

Change action init behavior to always take reference to action before
returning successfully, in order to protect from concurrent deletion.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_api.c|  2 --
 net/sched/act_bpf.c|  8 
 net/sched/act_connmark.c   |  5 +++--
 net/sched/act_csum.c   |  8 
 net/sched/act_gact.c   |  5 +++--
 net/sched/act_ife.c| 12 +---
 net/sched/act_ipt.c|  5 +++--
 net/sched/act_mirred.c |  5 ++---
 net/sched/act_nat.c|  5 +++--
 net/sched/act_pedit.c  |  5 +++--
 net/sched/act_police.c |  8 +++-
 net/sched/act_sample.c |  8 +++-
 net/sched/act_simple.c |  5 +++--
 net/sched/act_skbedit.c|  5 +++--
 net/sched/act_skbmod.c |  8 +++-
 net/sched/act_tunnel_key.c |  8 +++-
 net/sched/act_vlan.c   |  8 +++-
 17 files changed, 51 insertions(+), 59 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index a023873db713..f019f0464cec 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -870,8 +870,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
}
act->order = i;
sz += tcf_action_fill_size(act);
-   if (ovr)
-   refcount_inc(&act->tcfa_refcnt);
list_add_tail(&act->list, actions);
}
 
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 7941dd66ff83..d3f4ac6f2c4b 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -311,9 +311,10 @@ static int tcf_bpf_init(struct net *net, struct nlattr 
*nla,
if (bind)
return 0;
 
-   tcf_idr_release(*act, bind);
-   if (!replace)
+   if (!replace) {
+   tcf_idr_release(*act, bind);
return -EEXIST;
+   }
}
 
is_bpf = tb[TCA_ACT_BPF_OPS_LEN] && tb[TCA_ACT_BPF_OPS];
@@ -356,8 +357,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 
return res;
 out:
-   if (res == ACT_P_CREATED)
-   tcf_idr_release(*act, bind);
+   tcf_idr_release(*act, bind);
 
return ret;
 }
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 143c2d3de723..701e90244eff 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -135,9 +135,10 @@ static int tcf_connmark_init(struct net *net, struct 
nlattr *nla,
ci = to_connmark(*a);
if (bind)
return 0;
-   tcf_idr_release(*a, bind);
-   if (!ovr)
+   if (!ovr) {
+   tcf_idr_release(*a, bind);
return -EEXIST;
+   }
/* replacing action and zone */
ci->tcf_action = parm->action;
ci->zone = parm->zone;
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 3768539340e0..5dbee136b0a1 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -76,9 +76,10 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
} else {
if (bind)/* dont override defaults */
return 0;
-   tcf_idr_release(*a, bind);
-   if (!ovr)
+   if (!ovr) {
+   tcf_idr_release(*a, bind);
return -EEXIST;
+   }
}
 
p = to_tcf_csum(*a);
@@ -86,8 +87,7 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 
params_new = kzalloc(sizeof(*params_new), GFP_KERNEL);
if (unlikely(!params_new)) {
-   if (ret == ACT_P_CREATED)
-   tcf_idr_release(*a, bind);
+   tcf_idr_release(*a, bind);
return -ENOMEM;
}
params_old = rtnl_dereference(p->params);
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index a431a711f0dd..11c4de3f344e 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -100,9 +100,10 @@ static int tcf_gact_init(struct net *net, struct nlattr 
*nla,
} else {
if (bind)/* dont override defaults */
return 0;
-   tcf_idr_release(*a, bind);
-   if (!ovr)
+   if (!ovr) {
+   tcf_idr_release(*a, bind);
return -EEXIST;
+

[PATCH v2 10/11] net: sched: atomically check-allocate action

2018-05-27 Thread Vlad Buslov

Implement function that atomically checks if action exists and either takes
reference to it, or allocates idr slot for action index to prevent
concurrent allocations of actions with same index. Use EBUSY error pointer
to indicate that idr slot is reserved.

Implement cleanup helper function that removes temporary error pointer from
idr. (in case of error between idr allocation and insertion of newly
created action to specified index)

Refactor all action init functions to insert new action to idr using this
API.

Signed-off-by: Vlad Buslov 
---
 include/net/act_api.h  |  3 ++
 net/sched/act_api.c| 92 --
 net/sched/act_bpf.c| 11 --
 net/sched/act_connmark.c   | 10 +++--
 net/sched/act_csum.c   | 11 --
 net/sched/act_gact.c   | 11 --
 net/sched/act_ife.c|  6 ++-
 net/sched/act_ipt.c| 13 ++-
 net/sched/act_mirred.c | 16 ++--
 net/sched/act_nat.c| 11 --
 net/sched/act_pedit.c  | 15 ++--
 net/sched/act_police.c |  9 -
 net/sched/act_sample.c | 11 --
 net/sched/act_simple.c | 11 +-
 net/sched/act_skbedit.c| 11 +-
 net/sched/act_skbmod.c | 11 +-
 net/sched/act_tunnel_key.c |  9 -
 net/sched/act_vlan.c   | 17 -
 18 files changed, 218 insertions(+), 60 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index d256e20507b9..cd4547476074 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -154,6 +154,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
   int bind, bool cpustats);
 void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a);
 
+void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
+int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
+   struct tc_action **a, int bind);
 int tcf_idr_delete_index(struct tc_action_net *tn, u32 index);
 int __tcf_idr_release(struct tc_action *a, bool bind, bool strict);
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index eefe8c2fe667..9511502e1cbb 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -303,7 +303,9 @@ static bool __tcf_idr_check(struct tc_action_net *tn, u32 
index,
 
spin_lock(&idrinfo->lock);
p = idr_find(&idrinfo->action_idr, index);
-   if (p) {
+   if (IS_ERR(p)) {
+   p = NULL;
+   } else if (p) {
refcount_inc(&p->tcfa_refcnt);
if (bind)
atomic_inc(&p->tcfa_bindcnt);
@@ -371,7 +373,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
 {
struct tc_action *p = kzalloc(ops->size, GFP_KERNEL);
struct tcf_idrinfo *idrinfo = tn->idrinfo;
-   struct idr *idr = &idrinfo->action_idr;
int err = -ENOMEM;
 
if (unlikely(!p))
@@ -389,20 +390,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
goto err2;
}
spin_lock_init(&p->tcfa_lock);
-   idr_preload(GFP_KERNEL);
-   spin_lock(&idrinfo->lock);
-   /* user doesn't specify an index */
-   if (!index) {
-   index = 1;
-   err = idr_alloc_u32(idr, NULL, &index, UINT_MAX, GFP_ATOMIC);
-   } else {
-   err = idr_alloc_u32(idr, NULL, &index, index, GFP_ATOMIC);
-   }
-   spin_unlock(&idrinfo->lock);
-   idr_preload_end();
-   if (err)
-   goto err3;
-
p->tcfa_index = index;
p->tcfa_tm.install = jiffies;
p->tcfa_tm.lastuse = jiffies;
@@ -412,7 +399,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
&p->tcfa_rate_est,
&p->tcfa_lock, NULL, est);
if (err)
-   goto err4;
+   goto err3;
}
 
p->idrinfo = idrinfo;
@@ -420,8 +407,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
INIT_LIST_HEAD(&p->list);
*a = p;
return 0;
-err4:
-   idr_remove(idr, index);
 err3:
free_percpu(p->cpu_qstats);
 err2:
@@ -437,11 +422,78 @@ void tcf_idr_insert(struct tc_action_net *tn, struct 
tc_action *a)
struct tcf_idrinfo *idrinfo = tn->idrinfo;
 
spin_lock(&idrinfo->lock);
-   idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
+   /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
+   WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index)));
spin_unlock(&idrinfo->lock);
 }
 EXPORT_SYMBOL(tcf_idr_insert);
 
+/* Cleanup idr index that was allocated but not initialized. */
+
+void tcf_idr_cleanup(struct tc_action_net *tn, u32 index)
+{
+   struct tcf_idrinfo *idrinfo = tn->idrinfo;
+
+   spin_lock(&idrinfo->lock);
+   /* Remove ERR_PTR(-EBUSY) allocated by tcf_idr_check_a

[PATCH v2 09/11] net: sched: use reference counting action init

2018-05-27 Thread Vlad Buslov

Change action API to assume that action init function always takes
reference to action, even when overwriting existing action. This is
necessary because action API continues to use action pointer after init
function is done. At this point action becomes accessible for concurrent
modifications, so user must always hold reference to it.

Implement helper put list function to atomically release list of actions
after action API init code is done using them.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_api.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f019f0464cec..eefe8c2fe667 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -627,6 +627,18 @@ static int tcf_action_put(struct tc_action *p)
return __tcf_action_put(p, false);
 }
 
+static void tcf_action_put_lst(struct list_head *actions)
+{
+   struct tc_action *a, *tmp;
+
+   list_for_each_entry_safe(a, tmp, actions, list) {
+   const struct tc_action_ops *ops = a->ops;
+
+   if (tcf_action_put(a))
+   module_put(ops->owner);
+   }
+}
+
 int
 tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int 
ref)
 {
@@ -835,17 +847,6 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
return ERR_PTR(err);
 }
 
-static void cleanup_a(struct list_head *actions, int ovr)
-{
-   struct tc_action *a;
-
-   if (!ovr)
-   return;
-
-   list_for_each_entry(a, actions, list)
-   refcount_dec(&a->tcfa_refcnt);
-}
-
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
struct list_head *actions, size_t *attr_size,
@@ -874,11 +875,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
}
 
*attr_size = tcf_action_full_attrs_size(sz);
-
-   /* Remove the temp refcnt which was necessary to protect against
-* destroying an existing action which was being replaced
-*/
-   cleanup_a(actions, ovr);
return 0;
 
 err:
@@ -1209,7 +1205,7 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct 
nlmsghdr *n,
return ret;
}
 err:
-   tcf_action_destroy(&actions, 0);
+   tcf_action_put_lst(&actions);
return ret;
 }
 
@@ -1251,8 +1247,11 @@ static int tcf_action_add(struct net *net, struct nlattr 
*nla,
  &attr_size, true, extack);
if (ret)
return ret;
+   ret = tcf_add_notify(net, n, &actions, portid, attr_size, extack);
+   if (ovr)
+   tcf_action_put_lst(&actions);
 
-   return tcf_add_notify(net, n, &actions, portid, attr_size, extack);
+   return ret;
 }
 
 static u32 tcaa_root_flags_allowed = TCA_FLAG_LARGE_DUMP_ON;
-- 
2.7.5

[PATCH v2 11/11] net: sched: change action API to use array of pointers to actions

2018-05-27 Thread Vlad Buslov

Act API used linked list to pass set of actions to functions. It is
intrusive data structure that stores list nodes inside action structure
itself, which means it is not safe to modify such list concurrently.
However, action API doesn't use any linked list specific operations on this
set of actions, so it can be safely refactored into plain pointer array.

Refactor action API to use array of pointers to tc_actions instead of
linked list. Change argument 'actions' type of exported action init,
destroy and dump functions.

Signed-off-by: Vlad Buslov 
---
 include/net/act_api.h |  7 ++---
 net/sched/act_api.c   | 74 ---
 net/sched/cls_api.c   | 21 +--
 3 files changed, 50 insertions(+), 52 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index cd4547476074..43dfa5e1b3b3 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -168,19 +168,20 @@ static inline int tcf_idr_release(struct tc_action *a, 
bool bind)
 int tcf_register_action(struct tc_action_ops *a, struct pernet_operations 
*ops);
 int tcf_unregister_action(struct tc_action_ops *a,
  struct pernet_operations *ops);
-int tcf_action_destroy(struct list_head *actions, int bind);
+int tcf_action_destroy(struct tc_action *actions[], int bind);
 int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
int nr_actions, struct tcf_result *res);
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
-   struct list_head *actions, size_t *attr_size,
+   struct tc_action *actions[], size_t *attr_size,
bool rtnl_held, struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
bool rtnl_held,
struct netlink_ext_ack *extack);
-int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int);
+int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind,
+   int ref);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_copy_stats(struct sk_buff *, struct tc_action *, int);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 9511502e1cbb..7f904bb84aab 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -657,13 +657,14 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action 
**actions,
 }
 EXPORT_SYMBOL(tcf_action_exec);
 
-int tcf_action_destroy(struct list_head *actions, int bind)
+int tcf_action_destroy(struct tc_action *actions[], int bind)
 {
const struct tc_action_ops *ops;
-   struct tc_action *a, *tmp;
-   int ret = 0;
+   struct tc_action *a;
+   int ret = 0, i;
 
-   list_for_each_entry_safe(a, tmp, actions, list) {
+   for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+   a = actions[i];
ops = a->ops;
ret = __tcf_idr_release(a, bind, true);
if (ret == ACT_P_DELETED)
@@ -679,11 +680,12 @@ static int tcf_action_put(struct tc_action *p)
return __tcf_action_put(p, false);
 }
 
-static void tcf_action_put_lst(struct list_head *actions)
+static void tcf_action_put_many(struct tc_action *actions[])
 {
-   struct tc_action *a, *tmp;
+   int i;
 
-   list_for_each_entry_safe(a, tmp, actions, list) {
+   for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+   struct tc_action *a = actions[i];
const struct tc_action_ops *ops = a->ops;
 
if (tcf_action_put(a))
@@ -735,14 +737,15 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action 
*a, int bind, int ref)
 }
 EXPORT_SYMBOL(tcf_action_dump_1);
 
-int tcf_action_dump(struct sk_buff *skb, struct list_head *actions,
+int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[],
int bind, int ref)
 {
struct tc_action *a;
-   int err = -EINVAL;
+   int err = -EINVAL, i;
struct nlattr *nest;
 
-   list_for_each_entry(a, actions, list) {
+   for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+   a = actions[i];
nest = nla_nest_start(skb, a->order);
if (nest == NULL)
goto nla_put_failure;
@@ -878,10 +881,9 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) {
err = tcf_action_goto_chain_init(a, tp);
if (err) {
-   LIST_HEAD(actions);
+

[PATCH v2 05/11] net: sched: implement action API that deletes action by index

2018-05-27 Thread Vlad Buslov

Implement new action API function that atomically finds and deletes action
from idr by index. Intended to be used by lockless actions that do not rely
on rtnl lock.

Signed-off-by: Vlad Buslov 
---
 include/net/act_api.h |  1 +
 net/sched/act_api.c   | 39 +++
 2 files changed, 40 insertions(+)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 888ff471bbf6..d94ec6400673 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -153,6 +153,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, 
struct nlattr *est,
   int bind, bool cpustats);
 void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a);
 
+int tcf_idr_delete_index(struct tc_action_net *tn, u32 index);
 int __tcf_idr_release(struct tc_action *a, bool bind, bool strict);
 
 static inline int tcf_idr_release(struct tc_action *a, bool bind)
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index aa304d36fee0..0f31f09946ab 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -319,6 +319,45 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, 
struct tc_action **a,
 }
 EXPORT_SYMBOL(tcf_idr_check);
 
+int tcf_idr_delete_index(struct tc_action_net *tn, u32 index)
+{
+   struct tcf_idrinfo *idrinfo = tn->idrinfo;
+   struct tc_action *p;
+   int ret = 0;
+
+   spin_lock(&idrinfo->lock);
+   p = idr_find(&idrinfo->action_idr, index);
+   if (!p) {
+   spin_unlock(&idrinfo->lock);
+   return -ENOENT;
+   }
+
+   if (!atomic_read(&p->tcfa_bindcnt)) {
+   if (refcount_dec_and_test(&p->tcfa_refcnt)) {
+   struct module *owner = p->ops->owner;
+
+   WARN_ON(p != idr_remove(&idrinfo->action_idr,
+   p->tcfa_index));
+   spin_unlock(&idrinfo->lock);
+
+   if (p->ops->cleanup)
+   p->ops->cleanup(p);
+
+   gen_kill_estimator(&p->tcfa_rate_est);
+   free_tcf(p);
+   module_put(owner);
+   return 0;
+   }
+   ret = 0;
+   } else {
+   ret = -EPERM;
+   }
+
+   spin_unlock(&idrinfo->lock);
+   return ret;
+}
+EXPORT_SYMBOL(tcf_idr_delete_index);
+
 int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
   struct tc_action **a, const struct tc_action_ops *ops,
   int bind, bool cpustats)
-- 
2.7.5

[PATCH v2 04/11] net: sched: always take reference to action

2018-05-27 Thread Vlad Buslov

Without rtnl lock protection it is no longer safe to use pointer to tc
action without holding reference to it. (it can be destroyed concurrently)

Remove unsafe action idr lookup function. Instead of it, implement safe tcf
idr check function that atomically looks up action in idr and increments
its reference and bind counters. Implement both action search and check
using new safe function

Reference taken by idr check is temporal and should not be accounted by
userspace clients (both logically and to preserver current API behavior).
Subtract temporal reference when dumping action to userspace using existing
tca_get_fill function arguments.

Signed-off-by: Vlad Buslov 
---
 net/sched/act_api.c | 46 --
 1 file changed, 20 insertions(+), 26 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 256b0c93916c..aa304d36fee0 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -284,44 +284,38 @@ int tcf_generic_walker(struct tc_action_net *tn, struct 
sk_buff *skb,
 }
 EXPORT_SYMBOL(tcf_generic_walker);
 
-static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo)
+static bool __tcf_idr_check(struct tc_action_net *tn, u32 index,
+   struct tc_action **a, int bind)
 {
-   struct tc_action *p = NULL;
+   struct tcf_idrinfo *idrinfo = tn->idrinfo;
+   struct tc_action *p;
 
spin_lock(&idrinfo->lock);
p = idr_find(&idrinfo->action_idr, index);
+   if (p) {
+   refcount_inc(&p->tcfa_refcnt);
+   if (bind)
+   atomic_inc(&p->tcfa_bindcnt);
+   }
spin_unlock(&idrinfo->lock);
 
-   return p;
+   if (p) {
+   *a = p;
+   return true;
+   }
+   return false;
 }
 
 int tcf_idr_search(struct tc_action_net *tn, struct tc_action **a, u32 index)
 {
-   struct tcf_idrinfo *idrinfo = tn->idrinfo;
-   struct tc_action *p = tcf_idr_lookup(index, idrinfo);
-
-   if (p) {
-   *a = p;
-   return 1;
-   }
-   return 0;
+   return __tcf_idr_check(tn, index, a, 0);
 }
 EXPORT_SYMBOL(tcf_idr_search);
 
 bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a,
   int bind)
 {
-   struct tcf_idrinfo *idrinfo = tn->idrinfo;
-   struct tc_action *p = tcf_idr_lookup(index, idrinfo);
-
-   if (index && p) {
-   if (bind)
-   atomic_inc(&p->tcfa_bindcnt);
-   refcount_inc(&p->tcfa_refcnt);
-   *a = p;
-   return true;
-   }
-   return false;
+   return __tcf_idr_check(tn, index, a, bind);
 }
 EXPORT_SYMBOL(tcf_idr_check);
 
@@ -932,7 +926,7 @@ tcf_get_notify(struct net *net, u32 portid, struct nlmsghdr 
*n,
if (!skb)
return -ENOBUFS;
if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, event,
-0, 0) <= 0) {
+0, 1) <= 0) {
NL_SET_ERR_MSG(extack, "Failed to fill netlink attributes while 
adding TC action");
kfree_skb(skb);
return -EINVAL;
@@ -1072,7 +1066,7 @@ tcf_del_notify(struct net *net, struct nlmsghdr *n, 
struct list_head *actions,
return -ENOBUFS;
 
if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, RTM_DELACTION,
-0, 1) <= 0) {
+0, 2) <= 0) {
NL_SET_ERR_MSG(extack, "Failed to fill netlink TC action 
attributes");
kfree_skb(skb);
return -EINVAL;
@@ -1131,14 +1125,14 @@ tca_action_gd(struct net *net, struct nlattr *nla, 
struct nlmsghdr *n,
if (event == RTM_GETACTION)
ret = tcf_get_notify(net, portid, n, &actions, event, extack);
else { /* delete */
+   cleanup_a(&actions, 1); /* lookup took reference */
ret = tcf_del_notify(net, n, &actions, portid, attr_size, 
extack);
if (ret)
goto err;
return ret;
}
 err:
-   if (event != RTM_GETACTION)
-   tcf_action_destroy(&actions, 0);
+   tcf_action_destroy(&actions, 0);
return ret;
 }
 
-- 
2.7.5

[PATCH v2 00/11] Modify action API for implementing lockless actions

2018-05-27 Thread Vlad Buslov

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a first step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) to be registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in action API that
prevents it from being executed concurrently.

As a preparation for executing TC rules update handlers without rtnl
lock, action API code was audited to determine areas that assume
external synchronization with rtnl lock and must be changed to allow
safe concurrent access with following results:

1. Action idr is already protected with spinlock. However, some code
   paths assume that idr state is not changes between several
   consecutive tcf_idr_* function calls.
2. tc_action reference and bind counters are implemented as plain
   integers. They purpose was to allow single actions to be shared
   between multiple filters, not to provide means for concurrent
   modification.
3. tc_action 'cookie' pointer field is not protected against
   modification.
4. Action API functions, that work with set of actions, use intrusive
   linked list, which cannot be used concurrently without additional
   synchronization.
5. Action API functions don't take reference to actions while using
   them, assuming external synchronization with rtnl lock.

Following solutions to these problems are implemented:

1. To remove assumption that idr state doesn't change between tcf_idr_*
   calls, implement new functions that atomically perform several
   operations on idr without releasing idr spinlock. (function to
   atomically lookup and delete action by index, function to atomically
   check if action exists and allocate new one if necessary, etc.)
2. Use atomic operations on counters to make them suitable for
   concurrent get/put operations.
3. Data that 'cookie' points to is never modified, so it enough to
   refactor it to rcu pointer to prevent concurrent de-allocation.
4. Action API doesn't actually use any linked list specific operations
   on actions intrusive linked list, so it can be refactored to array in
   straightforward manner.
5. Always take reference to action while accessing it in action API.
   tcf_idr_search function modified to take reference to action before
   returning it, so there is no way to lookup an action without
   incrementing its reference counter. All users of this function are
   modified to release the reference, after they done using action. With
   all users using reference counting, it is now safe to concurrently
   delete actions.

Since only shared state in action API module are actions themselves and
action idr, these changes are sufficient to not to rely on global rtnl
lock for protection of internal action API data structures.

Changes from V1 to V2:
- Removed redundant actions ops lookup during delete.
- Merge action ops delete definition and implementation.
- Assume all actions have delete implemented and don't check for it
  explicitly.
- Resplit action lookup/release code to prevent memory leaks in
  individual patches.
- Make __tcf_idr_check function static
- Remove unique idr insertion function. Change original idr insert to do
  the same thing.
- Merge changes that take reference to action when performing lookup and
  changes that account for this additional reference when dumping action
  to user space into single patch.
- Change convoluted commit message.
- Rename "unlocked" to "rtnl_held" for clarity.
- Remove estimator lock add patch.
- Refactor action check-alloc code into standalone function.
- Rename tcf_idr_find_delete to tcf_idr_delete_index.
- Rearrange variable definitions in tc_action_delete.
- Add patch that refactors action API code to use array of pointers to
  actions instead of intrusive linked list.
- Expand cover letter.

Vlad Buslov (11):
  net: sched: use rcu for action cookie update
  net: sched: change type of reference and bind counters
  net: sched: implement unlocked action init API
  net: sched: always take reference to action
  net: sched: implement action API that deletes action by index
  net: sched: add 'delete' function to action ops
  net: sched: implement reference counted action release
  net: sched: don't release reference on action overwrite
  net: sched: use reference counting action init
  net: sched: atomically check-allocate action
  net: sched: change action API to use array of pointers to actions

 include/net/act_api.h  |  25 ++-
 include/net/pkt_cls.h  |   1 +
 net/sched/act_api.c| 408 +++--
 net/sched/act_bpf.

Re: [PATCH rdma-next v1 00/13] Verbs flow counters support

2018-05-27 Thread Leon Romanovsky

On Sun, May 27, 2018 at 01:23:33PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
>
> Changelog v0->v1:
>  * Decouple from DevX submission
>  * Use uverbs_attr_get_obj at counters read method
>  * Added define for max read buffer size (MAX_COUNTERS_BUFF_SIZE)
>  * Removed the struct mlx5_ib_flow_counter basic_flow_cnts and
>the related structs used, used define instead
>  * Took Matan's patch from DevX
>  * uverbs_free_counters removed void* casting
>  * Added check to bound ncounters value (added define
>  * Changed user supplied data buffer structure to be array of
>struct  pair (applied this change to user space also)
>
> Not changed:
>  * UAPI files
>  * Addition of uhw to flow
>
> Thanks

Sorry for the noise, but please drop this series.

We found improper locking sequence in one of error paths which can be
triggered by the user.

Thanks


signature.asc
Description: PGP signature

[PATCH v2 0/5] build warnings cleanup

2018-05-27 Thread Atul Gupta

Build warnings cleanup reported for
- using only 128b key
- wait for buffer in sendmsg/sendpage
- check for null before using skb
- free rspq_skb_cache in error path
- indentation

v2:
  Added bug report description for 0002
  Incorported comments from Dan Carpenter

Atul Gupta (5):
  crypto:chtls: key len correction
  crypto: chtls: wait for memory sendmsg, sendpage
  crypto: chtls: dereference null variable
  crypto: chtls: kbuild warnings
  crypto: chtls: free beyond end rspq_skb_cache

 drivers/crypto/chelsio/chtls/chtls.h  |   1 +
 drivers/crypto/chelsio/chtls/chtls_hw.c   |   6 +-
 drivers/crypto/chelsio/chtls/chtls_io.c   | 104 +++---
 drivers/crypto/chelsio/chtls/chtls_main.c |   3 +-
 4 files changed, 98 insertions(+), 16 deletions(-)

-- 
1.8.3.1

[PATCH v2 1/5] crypto:chtls: key len correction

2018-05-27 Thread Atul Gupta

corrected the key length to copy 128b key. Removed 192b and 256b
key as user input supports key of size 128b in gcm_ctx

Reported-by: Dan Carpenter 
Signed-off-by: Atul Gupta 
---
 drivers/crypto/chelsio/chtls/chtls_hw.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/crypto/chelsio/chtls/chtls_hw.c 
b/drivers/crypto/chelsio/chtls/chtls_hw.c
index 54a13aa9..55d5014 100644
--- a/drivers/crypto/chelsio/chtls/chtls_hw.c
+++ b/drivers/crypto/chelsio/chtls/chtls_hw.c
@@ -213,7 +213,7 @@ static int chtls_key_info(struct chtls_sock *csk,
  struct _key_ctx *kctx,
  u32 keylen, u32 optname)
 {
-   unsigned char key[CHCR_KEYCTX_CIPHER_KEY_SIZE_256];
+   unsigned char key[AES_KEYSIZE_128];
struct tls12_crypto_info_aes_gcm_128 *gcm_ctx;
unsigned char ghash_h[AEAD_H_SIZE];
struct crypto_cipher *cipher;
@@ -228,10 +228,6 @@ static int chtls_key_info(struct chtls_sock *csk,
 
if (keylen == AES_KEYSIZE_128) {
ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_128;
-   } else if (keylen == AES_KEYSIZE_192) {
-   ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_192;
-   } else if (keylen == AES_KEYSIZE_256) {
-   ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_256;
} else {
pr_err("GCM: Invalid key length %d\n", keylen);
return -EINVAL;
-- 
1.8.3.1

[PATCH v2 3/5] crypto: chtls: dereference null variable

2018-05-27 Thread Atul Gupta

skb dereferenced before check in sendpage

Reported-by: Dan Carpenter 
Signed-off-by: Atul Gupta 
---
 drivers/crypto/chelsio/chtls/chtls_io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/crypto/chelsio/chtls/chtls_io.c 
b/drivers/crypto/chelsio/chtls/chtls_io.c
index 7aa5d90..8cfc27b 100644
--- a/drivers/crypto/chelsio/chtls/chtls_io.c
+++ b/drivers/crypto/chelsio/chtls/chtls_io.c
@@ -1230,9 +1230,8 @@ int chtls_sendpage(struct sock *sk, struct page *page,
struct sk_buff *skb = skb_peek_tail(&csk->txq);
int copy, i;
 
-   copy = mss - skb->len;
if (!skb || (ULP_SKB_CB(skb)->flags & ULPCB_FLAG_NO_APPEND) ||
-   copy <= 0) {
+   (copy = mss - skb->len) <= 0) {
 new_buf:
if (!csk_mem_free(cdev, sk))
goto wait_for_sndbuf;
-- 
1.8.3.1

[PATCH v2 2/5] crypto: chtls: wait for memory sendmsg, sendpage

2018-05-27 Thread Atul Gupta

address suspicious code 

1210   set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
1211   }

The issue is that in the code above, set_bit is never reached
due to the 'continue' statement at line 1208.

Also reported by bug report:
1210   set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
   ^^^
Not reachable.

Its required to wait for buffer in the send path and takes care of
unaddress and un-handled SOCK_NOSPACE.

v2: use csk_mem_free where appropriate
proper indent of goto do_nonblock
replace out with do_rm_wq

Reported-by: Gustavo A. R. Silva 
Reported-by: Dan Carpenter 
Signed-off-by: Atul Gupta 
---
 drivers/crypto/chelsio/chtls/chtls.h  |  1 +
 drivers/crypto/chelsio/chtls/chtls_io.c   | 90 +--
 drivers/crypto/chelsio/chtls/chtls_main.c |  1 +
 3 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/chelsio/chtls/chtls.h 
b/drivers/crypto/chelsio/chtls/chtls.h
index 1b2f43c..a53a0e6 100644
--- a/drivers/crypto/chelsio/chtls/chtls.h
+++ b/drivers/crypto/chelsio/chtls/chtls.h
@@ -144,6 +144,7 @@ struct chtls_dev {
struct list_head rcu_node;
struct list_head na_node;
unsigned int send_page_order;
+   int max_host_sndbuf;
struct key_map kmap;
 };
 
diff --git a/drivers/crypto/chelsio/chtls/chtls_io.c 
b/drivers/crypto/chelsio/chtls/chtls_io.c
index 840dd01..7aa5d90 100644
--- a/drivers/crypto/chelsio/chtls/chtls_io.c
+++ b/drivers/crypto/chelsio/chtls/chtls_io.c
@@ -914,6 +914,78 @@ static u16 tls_header_read(struct tls_hdr *thdr, struct 
iov_iter *from)
return (__force u16)cpu_to_be16(thdr->length);
 }
 
+static int csk_mem_free(struct chtls_dev *cdev, struct sock *sk)
+{
+   return (cdev->max_host_sndbuf - sk->sk_wmem_queued);
+}
+
+static int csk_wait_memory(struct chtls_dev *cdev,
+  struct sock *sk, long *timeo_p)
+{
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
+   int sndbuf, err = 0;
+   long current_timeo;
+   long vm_wait = 0;
+   bool noblock;
+
+   current_timeo = *timeo_p;
+   noblock = (*timeo_p ? false : true);
+   sndbuf = cdev->max_host_sndbuf;
+   if (csk_mem_free(cdev, sk)) {
+   current_timeo = (prandom_u32() % (HZ / 5)) + 2;
+   vm_wait = (prandom_u32() % (HZ / 5)) + 2;
+   }
+
+   add_wait_queue(sk_sleep(sk), &wait);
+   while (1) {
+   sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
+
+   if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
+   goto do_error;
+   if (!*timeo_p) {
+   if (noblock)
+   set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+   goto do_nonblock;
+   }
+   if (signal_pending(current))
+   goto do_interrupted;
+   sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
+   if (csk_mem_free(cdev, sk) && !vm_wait)
+   break;
+
+   set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+   sk->sk_write_pending++;
+   sk_wait_event(sk, ¤t_timeo, sk->sk_err ||
+ (sk->sk_shutdown & SEND_SHUTDOWN) ||
+ (csk_mem_free(cdev, sk) && !vm_wait), &wait);
+   sk->sk_write_pending--;
+
+   if (vm_wait) {
+   vm_wait -= current_timeo;
+   current_timeo = *timeo_p;
+   if (current_timeo != MAX_SCHEDULE_TIMEOUT) {
+   current_timeo -= vm_wait;
+   if (current_timeo < 0)
+   current_timeo = 0;
+   }
+   vm_wait = 0;
+   }
+   *timeo_p = current_timeo;
+   }
+do_rm_wq:
+   remove_wait_queue(sk_sleep(sk), &wait);
+   return err;
+do_error:
+   err = -EPIPE;
+   goto do_rm_wq;
+do_nonblock:
+   err = -EAGAIN;
+   goto do_rm_wq;
+do_interrupted:
+   err = sock_intr_errno(*timeo_p);
+   goto do_rm_wq;
+}
+
 int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
struct chtls_sock *csk = rcu_dereference_sk_user_data(sk);
@@ -952,6 +1024,8 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
copy = mss - skb->len;
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
+   if (!csk_mem_free(cdev, sk))
+   goto wait_for_sndbuf;
 
if (is_tls_tx(csk) && !csk->tlshws.txleft) {
struct tls_hdr hdr;
@@ -1099,8 +1173,10 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
if (ULP_SKB_CB(skb)->flags & ULPCB_FLAG_NO_APPEND)
push_frames_if_head(sk);
continue;
+wait_for_sndbuf

[PATCH v2 4/5] crypto: chtls: kbuild warnings

2018-05-27 Thread Atul Gupta

- unindented continue
- check for null page
- signed return

Reported-by: Dan Carpenter 
Signed-off-by: Atul Gupta 
---
 drivers/crypto/chelsio/chtls/chtls_io.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/chelsio/chtls/chtls_io.c 
b/drivers/crypto/chelsio/chtls/chtls_io.c
index 8cfc27b..51fc682 100644
--- a/drivers/crypto/chelsio/chtls/chtls_io.c
+++ b/drivers/crypto/chelsio/chtls/chtls_io.c
@@ -907,11 +907,11 @@ static int chtls_skb_copy_to_page_nocache(struct sock *sk,
 }
 
 /* Read TLS header to find content type and data length */
-static u16 tls_header_read(struct tls_hdr *thdr, struct iov_iter *from)
+static int tls_header_read(struct tls_hdr *thdr, struct iov_iter *from)
 {
if (copy_from_iter(thdr, sizeof(*thdr), from) != sizeof(*thdr))
return -EFAULT;
-   return (__force u16)cpu_to_be16(thdr->length);
+   return (__force int)cpu_to_be16(thdr->length);
 }
 
 static int csk_mem_free(struct chtls_dev *cdev, struct sock *sk)
@@ -1083,9 +1083,10 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
int off = TCP_OFF(sk);
bool merge;
 
-   if (page)
-   pg_size <<= compound_order(page);
+   if (!page)
+   goto wait_for_memory;
 
+   pg_size <<= compound_order(page);
if (off < pg_size &&
skb_can_coalesce(skb, i, page, off)) {
merge = 1;
@@ -1492,7 +1493,7 @@ static int chtls_pt_recvmsg(struct sock *sk, struct 
msghdr *msg, size_t len,
break;
chtls_cleanup_rbuf(sk, copied);
sk_wait_data(sk, &timeo, NULL);
-   continue;
+   continue;
 found_ok_skb:
if (!skb->len) {
skb_dst_set(skb, NULL);
-- 
1.8.3.1

[PATCH v2 5/5] crypto: chtls: free beyond end rspq_skb_cache

2018-05-27 Thread Atul Gupta

Reported-by: Dan Carpenter 
Signed-off-by: Atul Gupta 
---
 drivers/crypto/chelsio/chtls/chtls_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/chelsio/chtls/chtls_main.c 
b/drivers/crypto/chelsio/chtls/chtls_main.c
index 273afd3..9b07f91 100644
--- a/drivers/crypto/chelsio/chtls/chtls_main.c
+++ b/drivers/crypto/chelsio/chtls/chtls_main.c
@@ -250,7 +250,7 @@ static void *chtls_uld_add(const struct cxgb4_lld_info 
*info)
 
return cdev;
 out_rspq_skb:
-   for (j = 0; j <= i; j++)
+   for (j = 0; j < i; j++)
kfree_skb(cdev->rspq_skb_cache[j]);
kfree_skb(cdev->askb);
 out_skb:
-- 
1.8.3.1

Re: [PATCH net] VSOCK: check sk state before receive

2018-05-27 Thread Hangbin Liu

Hmm...Although I won't reproduce this bug with my reproducer after
apply my patch. I could still get a similiar issue with syzkaller sock vnet 
test.

It looks this patch is not complete. Here is the KASAN call trace with my patch.
I can also reproduce it without my patch.

==
BUG: KASAN: use-after-free in vmci_transport_allow_dgram.part.7+0x155/0x1a0 
[vmw_vsock_vmci_transport]
Read of size 4 at addr 880026a3a914 by task kworker/0:2/96

CPU: 0 PID: 96 Comm: kworker/0:2 Not tainted 4.17.0-rc6.vsock+ #28
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Workqueue: events dg_delayed_dispatch [vmw_vmci]
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0xdd/0x18e lib/dump_stack.c:113
 print_address_description+0x7a/0x3e0 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x1dd/0x460 mm/kasan/report.c:412
 vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
 vmci_transport_recv_dgram_cb+0x5d/0x200 [vmw_vsock_vmci_transport]
 dg_delayed_dispatch+0x99/0x1b0 [vmw_vmci]
 process_one_work+0xa4e/0x1720 kernel/workqueue.c:2145
 worker_thread+0x1df/0x1400 kernel/workqueue.c:2279
 kthread+0x343/0x4b0 kernel/kthread.c:240
 ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:412

Allocated by task 2684:
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
 slab_post_alloc_hook mm/slab.h:444 [inline]
 slab_alloc_node mm/slub.c:2741 [inline]
 slab_alloc mm/slub.c:2749 [inline]
 kmem_cache_alloc+0x105/0x330 mm/slub.c:2754
 sk_prot_alloc+0x6a/0x2c0 net/core/sock.c:1468
 sk_alloc+0xc9/0xbb0 net/core/sock.c:1528
 __vsock_create+0xc8/0x9b0 [vsock]
 vsock_create+0xfd/0x1a0 [vsock]
 __sock_create+0x310/0x690 net/socket.c:1285
 sock_create net/socket.c:1325 [inline]
 __sys_socket+0x101/0x240 net/socket.c:1355
 __do_sys_socket net/socket.c:1364 [inline]
 __se_sys_socket net/socket.c:1362 [inline]
 __x64_sys_socket+0x7d/0xd0 net/socket.c:1362
 do_syscall_64+0x175/0x630 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 2684:
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x130/0x180 mm/kasan/kasan.c:521
 slab_free_hook mm/slub.c:1388 [inline]
 slab_free_freelist_hook mm/slub.c:1415 [inline]
 slab_free mm/slub.c:2988 [inline]
 kmem_cache_free+0xce/0x410 mm/slub.c:3004
 sk_prot_free net/core/sock.c:1509 [inline]
 __sk_destruct+0x629/0x940 net/core/sock.c:1593
 sk_destruct+0x4e/0x90 net/core/sock.c:1601
 __sk_free+0xd3/0x320 net/core/sock.c:1612
 sk_free+0x2a/0x30 net/core/sock.c:1623
 __vsock_release+0x431/0x610 [vsock]
 vsock_release+0x3c/0xc0 [vsock]
 sock_release+0x91/0x200 net/socket.c:594
 sock_close+0x17/0x20 net/socket.c:1149
 __fput+0x368/0xa20 fs/file_table.c:209
 task_work_run+0x1c5/0x2a0 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x1876/0x26c0 kernel/exit.c:865
 do_group_exit+0x159/0x3e0 kernel/exit.c:968
 get_signal+0x65a/0x1780 kernel/signal.c:2482
 do_signal+0xa4/0x1fe0 arch/x86/kernel/signal.c:810
 exit_to_usermode_loop+0x1b8/0x260 arch/x86/entry/common.c:162
 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
 do_syscall_64+0x505/0x630 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at 880026a3a600
 which belongs to the cache AF_VSOCK of size 1056
The buggy address is located 788 bytes inside of
 1056-byte region [880026a3a600, 880026a3aa20)
The buggy address belongs to the page:
page:ea9a8e00 count:1 mapcount:0 mapping: index:0x0 
compound_mapcount: 0
flags: 0xfc0008100(slab|head)
raw: 000fc0008100   0001000d000d
raw: dead0100 dead0200 880034471a40 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 880026a3a800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 880026a3a880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>880026a3a900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ^
 880026a3a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 880026a3aa00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
==

[PATCH net-next 3/7] net/ipv6: Pass ifa6_config struct to inet6_addr_modify

2018-05-27 Thread dsahern

From: David Ahern 

Update inet6_addr_modify to take ifa6_config argument versus a parameter
list. This is an argument move only; no functional change intended.

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2db1acf70610..1b1e63a4520b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4527,8 +4527,7 @@ inet6_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr 
*nlh,
  ifm->ifa_prefixlen);
 }
 
-static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags,
-u32 prefered_lft, u32 valid_lft)
+static int inet6_addr_modify(struct inet6_ifaddr *ifp, struct ifa6_config *cfg)
 {
u32 flags;
clock_t expires;
@@ -4538,32 +4537,32 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, 
u32 ifa_flags,
 
ASSERT_RTNL();
 
-   if (!valid_lft || (prefered_lft > valid_lft))
+   if (!cfg->valid_lft || cfg->preferred_lft > cfg->valid_lft)
return -EINVAL;
 
-   if (ifa_flags & IFA_F_MANAGETEMPADDR &&
+   if (cfg->ifa_flags & IFA_F_MANAGETEMPADDR &&
(ifp->flags & IFA_F_TEMPORARY || ifp->prefix_len != 64))
return -EINVAL;
 
if (!(ifp->flags & IFA_F_TENTATIVE) || ifp->flags & IFA_F_DADFAILED)
-   ifa_flags &= ~IFA_F_OPTIMISTIC;
+   cfg->ifa_flags &= ~IFA_F_OPTIMISTIC;
 
-   timeout = addrconf_timeout_fixup(valid_lft, HZ);
+   timeout = addrconf_timeout_fixup(cfg->valid_lft, HZ);
if (addrconf_finite_timeout(timeout)) {
expires = jiffies_to_clock_t(timeout * HZ);
-   valid_lft = timeout;
+   cfg->valid_lft = timeout;
flags = RTF_EXPIRES;
} else {
expires = 0;
flags = 0;
-   ifa_flags |= IFA_F_PERMANENT;
+   cfg->ifa_flags |= IFA_F_PERMANENT;
}
 
-   timeout = addrconf_timeout_fixup(prefered_lft, HZ);
+   timeout = addrconf_timeout_fixup(cfg->preferred_lft, HZ);
if (addrconf_finite_timeout(timeout)) {
if (timeout == 0)
-   ifa_flags |= IFA_F_DEPRECATED;
-   prefered_lft = timeout;
+   cfg->ifa_flags |= IFA_F_DEPRECATED;
+   cfg->preferred_lft = timeout;
}
 
spin_lock_bh(&ifp->lock);
@@ -4573,16 +4572,16 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, 
u32 ifa_flags,
ifp->flags &= ~(IFA_F_DEPRECATED | IFA_F_PERMANENT | IFA_F_NODAD |
IFA_F_HOMEADDRESS | IFA_F_MANAGETEMPADDR |
IFA_F_NOPREFIXROUTE);
-   ifp->flags |= ifa_flags;
+   ifp->flags |= cfg->ifa_flags;
ifp->tstamp = jiffies;
-   ifp->valid_lft = valid_lft;
-   ifp->prefered_lft = prefered_lft;
+   ifp->valid_lft = cfg->valid_lft;
+   ifp->prefered_lft = cfg->preferred_lft;
 
spin_unlock_bh(&ifp->lock);
if (!(ifp->flags&IFA_F_TENTATIVE))
ipv6_ifa_notify(0, ifp);
 
-   if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) {
+   if (!(cfg->ifa_flags & IFA_F_NOPREFIXROUTE)) {
addrconf_prefix_route(&ifp->addr, ifp->prefix_len,
  ifp->idev->dev, expires, flags,
  GFP_KERNEL);
@@ -4601,10 +4600,14 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, 
u32 ifa_flags,
}
 
if (was_managetempaddr || ifp->flags & IFA_F_MANAGETEMPADDR) {
-   if (was_managetempaddr && !(ifp->flags & IFA_F_MANAGETEMPADDR))
-   valid_lft = prefered_lft = 0;
-   manage_tempaddrs(ifp->idev, ifp, valid_lft, prefered_lft,
-!was_managetempaddr, jiffies);
+   if (was_managetempaddr &&
+   !(ifp->flags & IFA_F_MANAGETEMPADDR)) {
+   cfg->valid_lft = 0;
+   cfg->preferred_lft = 0;
+   }
+   manage_tempaddrs(ifp->idev, ifp, cfg->valid_lft,
+cfg->preferred_lft, !was_managetempaddr,
+jiffies);
}
 
addrconf_verify_rtnl();
@@ -4691,8 +4694,7 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr 
*nlh,
!(nlh->nlmsg_flags & NLM_F_REPLACE))
err = -EEXIST;
else
-   err = inet6_addr_modify(ifa, cfg.ifa_flags, cfg.preferred_lft,
-   cfg.valid_lft);
+   err = inet6_addr_modify(ifa, &cfg);
 
in6_ifa_put(ifa);
 
-- 
2.11.0

[PATCH iproute2-next] ipaddress: Add support for address metric

2018-05-27 Thread dsahern

From: David Ahern 

Add support for IFA_RT_PRIORITY using the same keywords as iproute for
RTA_PRIORITY.

Signed-off-by: David Ahern 
---
 include/uapi/linux/if_addr.h |  1 +
 ip/ipaddress.c   | 15 ++-
 man/man8/ip-address.8.in |  6 ++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index c4899e22cdaa..a924606f36e5 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -33,6 +33,7 @@ enum {
IFA_CACHEINFO,
IFA_MULTICAST,
IFA_FLAGS,
+   IFA_RT_PRIORITY,  /* u32, priority/metric for prefix route */
__IFA_MAX,
 };
 
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 75539e057f6a..6b53b753ead9 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -63,7 +63,7 @@ static void usage(void)
fprintf(stderr, "   ip address {showdump|restore}\n");
fprintf(stderr, "IFADDR := PREFIX | ADDR peer PREFIX\n");
fprintf(stderr, "  [ broadcast ADDR ] [ anycast ADDR ]\n");
-   fprintf(stderr, "  [ label IFNAME ] [ scope SCOPE-ID ]\n");
+   fprintf(stderr, "  [ label IFNAME ] [ scope SCOPE-ID ] [ metric 
METRIC ]\n");
fprintf(stderr, "SCOPE-ID := [ host | link | global | NUMBER ]\n");
fprintf(stderr, "FLAG-LIST := [ FLAG-LIST ] FLAG\n");
fprintf(stderr, "FLAG  := [ permanent | dynamic | secondary | primary 
|\n");
@@ -1328,6 +1328,10 @@ int print_addrinfo(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
   
rta_tb[IFA_ADDRESS]));
}
print_int(PRINT_ANY, "prefixlen", "/%d ", ifa->ifa_prefixlen);
+
+   if (rta_tb[IFA_RT_PRIORITY])
+   print_uint(PRINT_ANY, "metric", "metric %u ",
+  rta_getattr_u32(rta_tb[IFA_RT_PRIORITY]));
}
 
if (brief)
@@ -2119,6 +2123,15 @@ static int ipaddr_modify(int cmd, int flags, int argc, 
char **argv)
NEXT_ARG();
l = *argv;
addattr_l(&req.n, sizeof(req), IFA_LABEL, l, 
strlen(l)+1);
+   } else if (matches(*argv, "metric") == 0 ||
+  matches(*argv, "priority") == 0 ||
+  matches(*argv, "preference") == 0) {
+   __u32 metric;
+
+   NEXT_ARG();
+   if (get_u32(&metric, *argv, 0))
+   invarg("\"metric\" value is invalid\n", *argv);
+   addattr32(&req.n, sizeof(req), IFA_RT_PRIORITY, metric);
} else if (matches(*argv, "valid_lft") == 0) {
if (valid_lftp)
duparg("valid_lft", *argv);
diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in
index 7ebf0bc98dcd..c3861b3725cc 100644
--- a/man/man8/ip-address.8.in
+++ b/man/man8/ip-address.8.in
@@ -27,6 +27,8 @@ ip-address \- protocol address management
 .IR IFNAME " ] [ "
 .B  scope
 .IR SCOPE-ID " ] [ "
+.B  metric
+.IR METRIC " ] [ "
 .B  to
 .IR PREFIX " ] [ " FLAG-LIST " ] [ "
 .B  label
@@ -215,6 +217,10 @@ valid inside this site.
 .in -8
 
 .TP
+.BI metric " NUMBER"
+priority of prefix route associated with address.
+
+.TP
 .BI valid_lft " LFT"
 the valid lifetime of this address; see section 5.5.4 of
 RFC 4862. When it expires, the address is removed by the kernel.
-- 
2.11.0

[PATCH net-next 1/7] net/ipv6: Convert ipv6_add_addr to struct ifa6_config

2018-05-27 Thread dsahern

From: David Ahern 

Move config parameters for adding an ipv6 address to a struct. struct
names stem from inet6_rtm_newaddr which is the modern handler for
adding an address.

Start the conversion to ifa6_config with ipv6_add_addr. This is an argument
move only; no functional change intended. Mapping of variable changes:

addr  -->  cfg->pfx
peer_addr -->  cfg->peer_pfx
pfxlen-->  cfg->plen
flags -->  cfg->ifa_flags

scope, valid_lft, prefered_lft have the same names within cfg
(with corrected spelling).

Signed-off-by: David Ahern 
---
 include/net/addrconf.h |  12 +
 net/ipv6/addrconf.c| 134 +++--
 2 files changed, 87 insertions(+), 59 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index c07d4dd09361..f766af2cd1a4 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -59,6 +59,18 @@ struct in6_validator_info {
struct netlink_ext_ack  *extack;
 };
 
+struct ifa6_config {
+   const struct in6_addr   *pfx;
+   unsigned intplen;
+
+   const struct in6_addr   *peer_pfx;
+
+   u32 ifa_flags;
+   u32 preferred_lft;
+   u32 valid_lft;
+   u16 scope;
+};
+
 int addrconf_init(void);
 void addrconf_cleanup(void);
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index fbfd71a2d9c8..4988f2265882 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -986,17 +986,15 @@ static int ipv6_add_addr_hash(struct net_device *dev, 
struct inet6_ifaddr *ifa)
 /* On success it returns ifp with increased reference count */
 
 static struct inet6_ifaddr *
-ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
- const struct in6_addr *peer_addr, int pfxlen,
- int scope, u32 flags, u32 valid_lft, u32 prefered_lft,
+ipv6_add_addr(struct inet6_dev *idev, struct ifa6_config *cfg,
  bool can_block, struct netlink_ext_ack *extack)
 {
gfp_t gfp_flags = can_block ? GFP_KERNEL : GFP_ATOMIC;
+   int addr_type = ipv6_addr_type(cfg->pfx);
struct net *net = dev_net(idev->dev);
struct inet6_ifaddr *ifa = NULL;
struct fib6_info *f6i = NULL;
int err = 0;
-   int addr_type = ipv6_addr_type(addr);
 
if (addr_type == IPV6_ADDR_ANY ||
addr_type & IPV6_ADDR_MULTICAST ||
@@ -1019,7 +1017,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
 */
if (can_block) {
struct in6_validator_info i6vi = {
-   .i6vi_addr = *addr,
+   .i6vi_addr = *cfg->pfx,
.i6vi_dev = idev,
.extack = extack,
};
@@ -1036,7 +1034,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
goto out;
}
 
-   f6i = addrconf_f6i_alloc(net, idev, addr, false, gfp_flags);
+   f6i = addrconf_f6i_alloc(net, idev, cfg->pfx, false, gfp_flags);
if (IS_ERR(f6i)) {
err = PTR_ERR(f6i);
f6i = NULL;
@@ -1049,21 +1047,21 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
 
neigh_parms_data_state_setall(idev->nd_parms);
 
-   ifa->addr = *addr;
-   if (peer_addr)
-   ifa->peer_addr = *peer_addr;
+   ifa->addr = *cfg->pfx;
+   if (cfg->peer_pfx)
+   ifa->peer_addr = *cfg->peer_pfx;
 
spin_lock_init(&ifa->lock);
INIT_DELAYED_WORK(&ifa->dad_work, addrconf_dad_work);
INIT_HLIST_NODE(&ifa->addr_lst);
-   ifa->scope = scope;
-   ifa->prefix_len = pfxlen;
-   ifa->flags = flags;
+   ifa->scope = cfg->scope;
+   ifa->prefix_len = cfg->plen;
+   ifa->flags = cfg->ifa_flags;
/* No need to add the TENTATIVE flag for addresses with NODAD */
-   if (!(flags & IFA_F_NODAD))
+   if (!(cfg->ifa_flags & IFA_F_NODAD))
ifa->flags |= IFA_F_TENTATIVE;
-   ifa->valid_lft = valid_lft;
-   ifa->prefered_lft = prefered_lft;
+   ifa->valid_lft = cfg->valid_lft;
+   ifa->prefered_lft = cfg->preferred_lft;
ifa->cstamp = ifa->tstamp = jiffies;
ifa->tokenized = false;
 
@@ -1260,11 +1258,10 @@ static int ipv6_create_tempaddr(struct inet6_ifaddr 
*ifp,
 {
struct inet6_dev *idev = ifp->idev;
struct in6_addr addr, *tmpaddr;
-   unsigned long tmp_prefered_lft, tmp_valid_lft, tmp_tstamp, age;
+   unsigned long tmp_tstamp, age;
unsigned long regen_advance;
-   int tmp_plen;
+   struct ifa6_config cfg;
int ret = 0;
-   u32 addr_flags;
unsigned long now = jiffies;
long max_desync_factor;
s32 cnf_temp_preferred_lft;
@@ -1326,13 +1323,12 @@ static int ipv6_create_tempaddr(struct inet6_ifaddr 
*ifp,
}
}
 
-   tmp_valid_lft = min_t(__u32,
-

[PATCH net-next 6/7] net/ipv6: Add support for specifying metric of connected routes

2018-05-27 Thread dsahern

From: David Ahern 

Add support for IFA_RT_PRIORITY to ipv6 addresses.

If the metric is changed on an existing address then the new route
is inserted before removing the old one. Since the metric is one
of the route keys, the prefix route can not be atomically replaced.

Signed-off-by: David Ahern 
---
 include/net/addrconf.h |  1 +
 include/net/if_inet6.h |  1 +
 net/ipv6/addrconf.c| 93 --
 3 files changed, 77 insertions(+), 18 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index f766af2cd1a4..5f43f7a70fe6 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -65,6 +65,7 @@ struct ifa6_config {
 
const struct in6_addr   *peer_pfx;
 
+   u32 rt_priority;
u32 ifa_flags;
u32 preferred_lft;
u32 valid_lft;
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index db389253dc2a..d7578cf49c3a 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -42,6 +42,7 @@ enum {
 struct inet6_ifaddr {
struct in6_addr addr;
__u32   prefix_len;
+   __u32   rt_priority;
 
/* In seconds, relative to tstamp. Expiry is at tstamp + HZ * lft. */
__u32   valid_lft;
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1b1e63a4520b..f09afc232da4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1056,6 +1056,7 @@ ipv6_add_addr(struct inet6_dev *idev, struct ifa6_config 
*cfg,
INIT_HLIST_NODE(&ifa->addr_lst);
ifa->scope = cfg->scope;
ifa->prefix_len = cfg->plen;
+   ifa->rt_priority = cfg->rt_priority;
ifa->flags = cfg->ifa_flags;
/* No need to add the TENTATIVE flag for addresses with NODAD */
if (!(cfg->ifa_flags & IFA_F_NODAD))
@@ -1356,6 +1357,7 @@ static int ipv6_create_tempaddr(struct inet6_ifaddr *ifp,
 
cfg.pfx = &addr;
cfg.scope = ipv6_addr_scope(cfg.pfx);
+   cfg.rt_priority = 0;
 
ift = ipv6_add_addr(idev, &cfg, block, NULL);
if (IS_ERR(ift)) {
@@ -2314,12 +2316,13 @@ static void  ipv6_try_regen_rndid(struct inet6_dev 
*idev, struct in6_addr *tmpad
  */
 
 static void
-addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev,
- unsigned long expires, u32 flags, gfp_t gfp_flags)
+addrconf_prefix_route(struct in6_addr *pfx, int plen, u32 metric,
+ struct net_device *dev, unsigned long expires,
+ u32 flags, gfp_t gfp_flags)
 {
struct fib6_config cfg = {
.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_PREFIX,
-   .fc_metric = IP6_RT_PRIO_ADDRCONF,
+   .fc_metric = metric ? : IP6_RT_PRIO_ADDRCONF,
.fc_ifindex = dev->ifindex,
.fc_expires = expires,
.fc_dst_len = plen,
@@ -2683,7 +2686,8 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, 
int len, bool sllao)
expires = jiffies_to_clock_t(rt_expires);
}
addrconf_prefix_route(&pinfo->prefix, pinfo->prefix_len,
- dev, expires, flags, GFP_ATOMIC);
+ 0, dev, expires, flags,
+ GFP_ATOMIC);
}
fib6_info_release(rt);
}
@@ -2891,8 +2895,9 @@ static int inet6_addr_add(struct net *net, int ifindex,
ifp = ipv6_add_addr(idev, cfg, true, extack);
if (!IS_ERR(ifp)) {
if (!(cfg->ifa_flags & IFA_F_NOPREFIXROUTE)) {
-   addrconf_prefix_route(&ifp->addr, ifp->prefix_len, dev,
- expires, flags, GFP_KERNEL);
+   addrconf_prefix_route(&ifp->addr, ifp->prefix_len,
+ ifp->rt_priority, dev, expires,
+ flags, GFP_KERNEL);
}
 
/* Send a netlink notification if DAD is enabled and
@@ -3056,7 +3061,7 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
 
if (addr.s6_addr32[3]) {
add_addr(idev, &addr, plen, scope);
-   addrconf_prefix_route(&addr, plen, idev->dev, 0, pflags,
+   addrconf_prefix_route(&addr, plen, 0, idev->dev, 0, pflags,
  GFP_ATOMIC);
return;
}
@@ -3081,8 +3086,8 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
}
 
add_addr(idev, &addr, plen, flag);
-   addrconf_prefix_route(&addr, plen, idev->dev, 0,
- pflags, GFP_ATOMIC);
+

[PATCH net-next 2/7] net/ipv6: Pass ifa6_config struct to inet6_addr_add

2018-05-27 Thread dsahern

From: David Ahern 

Move the creation of struct ifa6_config up to callers of inet6_addr_add.

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 112 ++--
 1 file changed, 57 insertions(+), 55 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4988f2265882..2db1acf70610 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2830,20 +2830,9 @@ static int ipv6_mc_config(struct sock *sk, bool join,
  * Manual configuration of address on an interface
  */
 static int inet6_addr_add(struct net *net, int ifindex,
- const struct in6_addr *pfx,
- const struct in6_addr *peer_pfx,
- unsigned int plen, __u32 ifa_flags,
- __u32 prefered_lft, __u32 valid_lft,
+ struct ifa6_config *cfg,
  struct netlink_ext_ack *extack)
 {
-   struct ifa6_config cfg = {
-   .pfx = pfx,
-   .plen = plen,
-   .peer_pfx = peer_pfx,
-   .ifa_flags = ifa_flags,
-   .preferred_lft = prefered_lft,
-   .valid_lft = valid_lft,
-   };
struct inet6_ifaddr *ifp;
struct inet6_dev *idev;
struct net_device *dev;
@@ -2853,14 +2842,14 @@ static int inet6_addr_add(struct net *net, int ifindex,
 
ASSERT_RTNL();
 
-   if (plen > 128)
+   if (cfg->plen > 128)
return -EINVAL;
 
/* check the lifetime */
-   if (!valid_lft || prefered_lft > valid_lft)
+   if (!cfg->valid_lft || cfg->preferred_lft > cfg->valid_lft)
return -EINVAL;
 
-   if (ifa_flags & IFA_F_MANAGETEMPADDR && plen != 64)
+   if (cfg->ifa_flags & IFA_F_MANAGETEMPADDR && cfg->plen != 64)
return -EINVAL;
 
dev = __dev_get_by_index(net, ifindex);
@@ -2871,37 +2860,37 @@ static int inet6_addr_add(struct net *net, int ifindex,
if (IS_ERR(idev))
return PTR_ERR(idev);
 
-   if (ifa_flags & IFA_F_MCAUTOJOIN) {
+   if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) {
int ret = ipv6_mc_config(net->ipv6.mc_autojoin_sk,
-true, pfx, ifindex);
+true, cfg->pfx, ifindex);
 
if (ret < 0)
return ret;
}
 
-   cfg.scope = ipv6_addr_scope(pfx);
+   cfg->scope = ipv6_addr_scope(cfg->pfx);
 
-   timeout = addrconf_timeout_fixup(valid_lft, HZ);
+   timeout = addrconf_timeout_fixup(cfg->valid_lft, HZ);
if (addrconf_finite_timeout(timeout)) {
expires = jiffies_to_clock_t(timeout * HZ);
-   valid_lft = timeout;
+   cfg->valid_lft = timeout;
flags = RTF_EXPIRES;
} else {
expires = 0;
flags = 0;
-   ifa_flags |= IFA_F_PERMANENT;
+   cfg->ifa_flags |= IFA_F_PERMANENT;
}
 
-   timeout = addrconf_timeout_fixup(prefered_lft, HZ);
+   timeout = addrconf_timeout_fixup(cfg->preferred_lft, HZ);
if (addrconf_finite_timeout(timeout)) {
if (timeout == 0)
-   ifa_flags |= IFA_F_DEPRECATED;
-   prefered_lft = timeout;
+   cfg->ifa_flags |= IFA_F_DEPRECATED;
+   cfg->preferred_lft = timeout;
}
 
-   ifp = ipv6_add_addr(idev, &cfg, true, extack);
+   ifp = ipv6_add_addr(idev, cfg, true, extack);
if (!IS_ERR(ifp)) {
-   if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) {
+   if (!(cfg->ifa_flags & IFA_F_NOPREFIXROUTE)) {
addrconf_prefix_route(&ifp->addr, ifp->prefix_len, dev,
  expires, flags, GFP_KERNEL);
}
@@ -2917,15 +2906,15 @@ static int inet6_addr_add(struct net *net, int ifindex,
 * manually configured addresses
 */
addrconf_dad_start(ifp);
-   if (ifa_flags & IFA_F_MANAGETEMPADDR)
-   manage_tempaddrs(idev, ifp, valid_lft, prefered_lft,
-true, jiffies);
+   if (cfg->ifa_flags & IFA_F_MANAGETEMPADDR)
+   manage_tempaddrs(idev, ifp, cfg->valid_lft,
+cfg->preferred_lft, true, jiffies);
in6_ifa_put(ifp);
addrconf_verify_rtnl();
return 0;
-   } else if (ifa_flags & IFA_F_MCAUTOJOIN) {
-   ipv6_mc_config(net->ipv6.mc_autojoin_sk,
-  false, pfx, ifindex);
+   } else if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) {
+   ipv6_mc_config(net->ipv6.mc_autojoin_sk, false,
+  cfg->pfx, ifindex);
}
 
return PTR_ERR(ifp);
@@ -2976,6 +2965,11 @@ static int inet6_addr_del(st

[PATCH net-next 4/7] net: Add IFA_RT_PRIORITY address attribute

2018-05-27 Thread dsahern

From: David Ahern 

Currently, if two interfaces have addresses in the same connected route,
then the order of the prefix route entries is determined by the order in
which the addresses are assigned or the links brought up.

Add IFA_RT_PRIORITY to allow user to specify the metric of the prefix
route associated with an address giving interface managers better
control of the order of prefix routes.

Signed-off-by: David Ahern 
---
 include/uapi/linux/if_addr.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index 2ef053d265de..ebaf5701c9db 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -33,6 +33,7 @@ enum {
IFA_CACHEINFO,
IFA_MULTICAST,
IFA_FLAGS,
+   IFA_RT_PRIORITY,  /* u32, priority/metric for prefix route */
__IFA_MAX,
 };
 
-- 
2.11.0

[PATCH net-next 0/7] net: Add address attribute to control metric of prefix route

2018-05-27 Thread dsahern

From: David Ahern 

For use cases such as VRR (Virtual Router Redundancy) interface managers
want efficient control over the order of prefix routes when multiple
interfaces have addresses with overlapping/duplicate subnets.

Currently, if two interfaces have addresses in the same subnet, the order
of the prefix route entries is determined by the order in which the
addresses are assigned or the links brought up. Any actions like cycling
an interface up and down changes that order. This set adds a new attribute
for addresses to allow a user to specify the metric of the prefix route
associated with an address giving interface managers better and more
efficient control of the order of prefix routes.

Patches 1-3 refactor IPv6 address add functions to pass an ifa6_config
struct. The functions currently have a long list of arguments and adding
the metric just makes it worse. Because of the overall diff size in
moving the arguments to a struct, the change is done in stages to make
it easier to review starting with the bottom function and pushing the
struct up to callers in each successive patch.

Patch 4 introduces the new attribute.

Patches 5 and 6 add support for the new attribute to IPv4 and IPv6
addresses.

Patch 7 adds a set of test cases.

Patch 8 adds support to iproute2

Changes since RFC
- collapsed patches 1 and 3 into patch 2
- simplified stack variables in fib_modify_prefix_metric in patch 5

David Ahern (7):
  net/ipv6: Convert ipv6_add_addr to struct ifa6_config
  net/ipv6: Pass ifa6_config struct to inet6_addr_add
  net/ipv6: Pass ifa6_config struct to inet6_addr_modify
  net: Add IFA_RT_PRIORITY address attribute
  net/ipv4: Add support for specifying metric of connected routes
  net/ipv6: Add support for specifying metric of connected routes
  selftests: fib_tests: Add prefix route tests with metric

 include/linux/inetdevice.h   |   1 +
 include/net/addrconf.h   |  13 ++
 include/net/if_inet6.h   |   1 +
 include/net/route.h  |   1 +
 include/uapi/linux/if_addr.h |   1 +
 net/ipv4/devinet.c   |  15 ++
 net/ipv4/fib_frontend.c  |  53 -
 net/ipv6/addrconf.c  | 359 +++
 tools/testing/selftests/net/fib_tests.sh | 181 +++-
 9 files changed, 472 insertions(+), 153 deletions(-)
 mode change 100755 => 100644 tools/testing/selftests/net/fib_tests.sh

-- 
2.11.0

[PATCH net-next 7/7] selftests: fib_tests: Add prefix route tests with metric

2018-05-27 Thread dsahern

From: David Ahern 

Add tests verifying prefix routes are inserted with expected metric.

IPv6 prefix route tests
TEST: Default metric  [ OK ]
TEST: User specified metric on first device   [ OK ]
TEST: User specified metric on second device  [ OK ]
TEST: Delete of address on first device   [ OK ]
TEST: Modify metric of address[ OK ]
TEST: Prefix route removed on link down   [ OK ]
TEST: Prefix route with metric on link up [ OK ]

IPv4 prefix route tests
TEST: Default metric  [ OK ]
TEST: User specified metric on first device   [ OK ]
TEST: User specified metric on second device  [ OK ]
TEST: Delete of address on first device   [ OK ]
TEST: Modify metric of address[ OK ]
TEST: Prefix route removed on link down   [ OK ]
TEST: Prefix route with metric on link up [ OK ]

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 181 ++-
 1 file changed, 180 insertions(+), 1 deletion(-)
 mode change 100755 => 100644 tools/testing/selftests/net/fib_tests.sh

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
old mode 100755
new mode 100644
index e7d76fbc36e9..9780c5a2d73f
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,7 +6,8 @@
 
 ret=0
 
-TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt"
+# all tests in this script. Can be overridden with -t option
+TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt ipv6_addr_metric 
ipv4_addr_metric"
 VERBOSE=0
 PAUSE_ON_FAIL=no
 PAUSE=no
@@ -642,6 +643,8 @@ check_route6()
local rc=0
 
out=$($IP -6 ro ls match ${pfx} | sed -e 's/ pref medium//')
+   [ "${out}" = "${expected}" ] && return 0
+
if [ -z "${out}" ]; then
if [ "$VERBOSE" = "1" ]; then
printf "\nNo route entry found\n"
@@ -911,6 +914,98 @@ ipv6_route_test()
route_cleanup
 }
 
+ip_addr_metric_check()
+{
+   ip addr help 2>&1 | grep -q metric
+   if [ $? -ne 0 ]; then
+   echo "iproute2 command does not support metric for addresses. 
Skipping test"
+   return 1
+   fi
+
+   return 0
+}
+
+ipv6_addr_metric_test()
+{
+   local rc
+
+   echo
+   echo "IPv6 prefix route tests"
+
+   ip_addr_metric_check || return 1
+
+   setup
+
+   set -e
+   $IP li add dummy1 type dummy
+   $IP li add dummy2 type dummy
+   $IP li set dummy1 up
+   $IP li set dummy2 up
+
+   # default entry is metric 256
+   run_cmd "$IP -6 addr add dev dummy1 2001:db8:104::1/64"
+   run_cmd "$IP -6 addr add dev dummy2 2001:db8:104::2/64"
+   set +e
+
+   check_route6 "2001:db8:104::/64 dev dummy1 proto kernel metric 256 
2001:db8:104::/64 dev dummy2 proto kernel metric 256"
+   log_test $? 0 "Default metric"
+
+   set -e
+   run_cmd "$IP -6 addr flush dev dummy1"
+   run_cmd "$IP -6 addr add dev dummy1 2001:db8:104::1/64 metric 257"
+   set +e
+
+   check_route6 "2001:db8:104::/64 dev dummy2 proto kernel metric 256 
2001:db8:104::/64 dev dummy1 proto kernel metric 257"
+   log_test $? 0 "User specified metric on first device"
+
+   set -e
+   run_cmd "$IP -6 addr flush dev dummy2"
+   run_cmd "$IP -6 addr add dev dummy2 2001:db8:104::2/64 metric 258"
+   set +e
+
+   check_route6 "2001:db8:104::/64 dev dummy1 proto kernel metric 257 
2001:db8:104::/64 dev dummy2 proto kernel metric 258"
+   log_test $? 0 "User specified metric on second device"
+
+   run_cmd "$IP -6 addr del dev dummy1 2001:db8:104::1/64 metric 257"
+   rc=$?
+   if [ $rc -eq 0 ]; then
+   check_route6 "2001:db8:104::/64 dev dummy2 proto kernel metric 
258"
+   rc=$?
+   fi
+   log_test $rc 0 "Delete of address on first device"
+
+   run_cmd "$IP -6 addr change dev dummy2 2001:db8:104::2/64 metric 259"
+   rc=$?
+   if [ $rc -eq 0 ]; then
+   check_route6 "2001:db8:104::/64 dev dummy2 proto kernel metric 
259"
+   rc=$?
+   fi
+   log_test $rc 0 "Modify metric of address"
+
+   # verify prefix route removed on down
+   run_cmd "ip netns exec testns sysctl -qw 
net.ipv6.conf.all.keep_addr_on_down=1"
+   run_cmd "$IP li set dev dummy2 down"
+   rc=$?
+   if [ $rc -eq 0 ]; then
+   check_route6 ""
+   rc=$?
+   fi
+   log_test $rc 0 "Prefix route removed on link down"
+
+   # verify prefix route re-inserted with assigned metric
+   run_cmd "$IP li set dev dummy2 up"
+   rc=$?
+   if [ $rc -eq 0 ]; then
+   check_route6 "2001:db8

[PATCH net-next 5/7] net/ipv4: Add support for specifying metric of connected routes

2018-05-27 Thread dsahern

From: David Ahern 

Add support for IFA_RT_PRIORITY to ipv4 addresses.

If the metric is changed on an existing address then the new route
is inserted before removing the old one. Since the metric is one
of the route keys, the prefix route can not be replaced.

Signed-off-by: David Ahern 
---
 include/linux/inetdevice.h |  1 +
 include/net/route.h|  1 +
 net/ipv4/devinet.c | 15 +
 net/ipv4/fib_frontend.c| 53 --
 4 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index e16fe7d44a71..27650f1bff3d 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -139,6 +139,7 @@ struct in_ifaddr {
__be32  ifa_local;
__be32  ifa_address;
__be32  ifa_mask;
+   __u32   ifa_rt_priority;
__be32  ifa_broadcast;
unsigned char   ifa_scope;
unsigned char   ifa_prefixlen;
diff --git a/include/net/route.h b/include/net/route.h
index dbb032d5921b..bb53cdba38dc 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -225,6 +225,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev,
 struct in_ifaddr;
 void fib_add_ifaddr(struct in_ifaddr *);
 void fib_del_ifaddr(struct in_ifaddr *, struct in_ifaddr *);
+void fib_modify_prefix_metric(struct in_ifaddr *ifa, u32 new_metric);
 
 void rt_add_uncached_list(struct rtable *rt);
 void rt_del_uncached_list(struct rtable *rt);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 40f001782c1b..d7585ab1a77a 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -99,6 +99,7 @@ static const struct nla_policy ifa_ipv4_policy[IFA_MAX+1] = {
[IFA_LABEL] = { .type = NLA_STRING, .len = IFNAMSIZ - 1 },
[IFA_CACHEINFO] = { .len = sizeof(struct ifa_cacheinfo) },
[IFA_FLAGS] = { .type = NLA_U32 },
+   [IFA_RT_PRIORITY]   = { .type = NLA_U32 },
 };
 
 #define IN4_ADDR_HSIZE_SHIFT   8
@@ -835,6 +836,9 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, 
struct nlmsghdr *nlh,
else
memcpy(ifa->ifa_label, dev->name, IFNAMSIZ);
 
+   if (tb[IFA_RT_PRIORITY])
+   ifa->ifa_rt_priority = nla_get_u32(tb[IFA_RT_PRIORITY]);
+
if (tb[IFA_CACHEINFO]) {
struct ifa_cacheinfo *ci;
 
@@ -906,12 +910,20 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct 
nlmsghdr *nlh,
return __inet_insert_ifa(ifa, nlh, NETLINK_CB(skb).portid,
 extack);
} else {
+   u32 new_metric = ifa->ifa_rt_priority;
+
inet_free_ifa(ifa);
 
if (nlh->nlmsg_flags & NLM_F_EXCL ||
!(nlh->nlmsg_flags & NLM_F_REPLACE))
return -EEXIST;
ifa = ifa_existing;
+
+   if (ifa->ifa_rt_priority != new_metric) {
+   fib_modify_prefix_metric(ifa, new_metric);
+   ifa->ifa_rt_priority = new_metric;
+   }
+
set_ifa_lifetime(ifa, valid_lft, prefered_lft);
cancel_delayed_work(&check_lifetime_work);
queue_delayed_work(system_power_efficient_wq,
@@ -1549,6 +1561,7 @@ static size_t inet_nlmsg_size(void)
   + nla_total_size(4) /* IFA_BROADCAST */
   + nla_total_size(IFNAMSIZ) /* IFA_LABEL */
   + nla_total_size(4)  /* IFA_FLAGS */
+  + nla_total_size(4)  /* IFA_RT_PRIORITY */
   + nla_total_size(sizeof(struct ifa_cacheinfo)); /* IFA_CACHEINFO 
*/
 }
 
@@ -1618,6 +1631,8 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct 
in_ifaddr *ifa,
(ifa->ifa_label[0] &&
 nla_put_string(skb, IFA_LABEL, ifa->ifa_label)) ||
nla_put_u32(skb, IFA_FLAGS, ifa->ifa_flags) ||
+   (ifa->ifa_rt_priority &&
+nla_put_u32(skb, IFA_RT_PRIORITY, ifa->ifa_rt_priority)) ||
put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
  preferred, valid))
goto nla_put_failure;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 045c43a27c12..07989d9ab2f2 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -846,7 +846,8 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
  * to fib engine. It is legal, because all events occur
  * only when netlink is already locked.
  */
-static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct 
in_ifaddr *ifa)
+static void fib_magic(int cmd, int type, __be32 dst, int dst_len,
+ struct in_ifaddr *ifa, u32 rt_priority)
 {
struct net *net = dev_net(ifa->ifa_dev->dev);
u32 tb_id = l3mdev_fib_table(ifa->ifa_dev->dev);
@@ -856,6 +857,7 @@ static void fib_magic(int

[PATCH] iptables-compat: homogenize error message

2018-05-27 Thread Arushi Singhal

There is a difference between error messages in iptables and
iptables-compat:

#sudo iptables-compat -D INPUT 4
iptables: No chain/target/match by that name.

#sudo iptables -D INPUT 4
iptables: Index of deletion too big.

Now, will show same error message.

Signed-off-by: Arushi Singhal 
---
 iptables/nft.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/iptables/nft.c b/iptables/nft.c
index e33d00f..40646f4 100644
--- a/iptables/nft.c
+++ b/iptables/nft.c
@@ -2603,7 +2603,7 @@ const char *nft_strerror(int err)
{ nft_rule_add, E2BIG, "Index of insertion too big" },
{ nft_rule_check, ENOENT, "Bad rule (does a matching rule exist in 
that chain?)" },
{ nft_rule_replace, ENOENT, "Index of replacement too big" },
-   { nft_rule_delete_num, E2BIG, "Index of deletion too big" },
+   { nft_rule_delete_num, ENOENT, "Index of deletion too big" },
 /* { TC_READ_COUNTER, E2BIG, "Index of counter too big" },
{ TC_ZERO_COUNTER, E2BIG, "Index of counter too big" }, */
{ nft_rule_add, ELOOP, "Loop found in table" },
-- 
2.7.4

Re: [PATCH net] mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG

2018-05-27 Thread Jiri Pirko

Sun, May 27, 2018 at 08:48:41AM CEST, ido...@mellanox.com wrote:
>From: Petr Machata 
>
>VLAN 1 is internally used for untagged traffic. Prevent creation of
>explicit netdevice for that VLAN, because that currently isn't supported
>and leads to the NULL pointer dereference cited below.
>
>Fix by preventing creation of VLAN devices with VID of 1 over mlxsw
>devices or LAG devices that involve mlxsw devices.
>
>[  327.175816] 
>
>[  327.184544] UBSAN: Undefined behaviour in 
>drivers/net/ethernet/mellanox/mlxsw/spectrum_fid.c:200:12
>[  327.193667] member access within null pointer of type 'const struct 
>mlxsw_sp_fid'
>[  327.201226] CPU: 0 PID: 8983 Comm: ip Not tainted 
>4.17.0-rc4-petrm_net_ip6gre_headroom-custom-140 #11
>[  327.210496] Hardware name: Mellanox Technologies Ltd. 
>"MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016
>[  327.219872] Call Trace:
>[  327.222384]  dump_stack+0xc3/0x12b
>[  327.234007]  ubsan_epilogue+0x9/0x49
>[  327.237638]  ubsan_type_mismatch_common+0x1f9/0x2d0
>[  327.255769]  __ubsan_handle_type_mismatch+0x90/0xa7
>[  327.264716]  mlxsw_sp_fid_type+0x35/0x50 [mlxsw_spectrum]
>[  327.270255]  mlxsw_sp_port_vlan_router_leave+0x46/0xc0 [mlxsw_spectrum]
>[  327.277019]  mlxsw_sp_inetaddr_port_vlan_event+0xe1/0x340 [mlxsw_spectrum]
>[  327.315031]  mlxsw_sp_netdevice_vrf_event+0xa8/0x100 [mlxsw_spectrum]
>[  327.321626]  mlxsw_sp_netdevice_event+0x276/0x430 [mlxsw_spectrum]
>[  327.367863]  notifier_call_chain+0x4c/0x150
>[  327.372128]  __netdev_upper_dev_link+0x1b3/0x260
>[  327.399450]  vrf_add_slave+0xce/0x170 [vrf]
>[  327.403703]  do_setlink+0x658/0x1d70
>[  327.508998]  rtnl_newlink+0x908/0xf20
>[  327.559128]  rtnetlink_rcv_msg+0x50c/0x720
>[  327.571720]  netlink_rcv_skb+0x16a/0x1f0
>[  327.583450]  netlink_unicast+0x2ca/0x3e0
>[  327.599305]  netlink_sendmsg+0x3e2/0x7f0
>[  327.616655]  sock_sendmsg+0x76/0xc0
>[  327.620207]  ___sys_sendmsg+0x494/0x5d0
>[  327.666117]  __sys_sendmsg+0xc2/0x130
>[  327.690953]  do_syscall_64+0x66/0x370
>[  327.694677]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>[  327.699782] RIP: 0033:0x7f4c2f3f8037
>[  327.703393] RSP: 002b:7ffe8c389708 EFLAGS: 0246 ORIG_RAX: 
>002e
>[  327.711035] RAX: ffda RBX: 5b03f53e RCX: 
>7f4c2f3f8037
>[  327.718229] RDX:  RSI: 7ffe8c389760 RDI: 
>0003
>[  327.725431] RBP: 7ffe8c389760 R08:  R09: 
>7f4c2f443630
>[  327.732632] R10: 05eb R11: 0246 R12: 
>
>[  327.739833] R13: 006774e0 R14: 7ffe8c3897e8 R15: 
>
>[  327.747096] 
>
>
>Fixes: 9589a7b5d7d9 ("mlxsw: spectrum: Handle VLAN devices linking / 
>unlinking")
>Suggested-by: Ido Schimmel 
>Signed-off-by: Petr Machata 
>Signed-off-by: Ido Schimmel 

Acked-by: Jiri Pirko

[PATCH bpf 0/2] Use __aligned_u64 in UAPI fields

2018-05-27 Thread Eugene Syromiatnikov

Hello.

It was discovered during strace development that struct bpf_map_info and
struct bpf_prog_info now have different layouts of i386/compat and x86_64.
Since it's already broken and bpf syscall has no separate compat (as
far as I can see), and the affecting change was introduced recently (in
Linux 4.16), it's proposed to change the layout of these structures
on 32-bit architectures by using __aligned_u64.

In order to somewhat future-proof from this problem in future, an
approach similar to the one implemented in RDMA subsystem recently
is proposed: use __aligned_u64 consistently throughout the UAPI header.

Eugene Syromiatnikov (2):
  bpf: fix alignment of netns_dev/netns_ino fields in
bpf_{map,prog}_info
  bpf: enforce usage of __aligned_u64 in the UAPI header

 include/uapi/linux/bpf.h   | 30 +++---
 tools/include/uapi/linux/bpf.h | 30 +++---
 2 files changed, 30 insertions(+), 30 deletions(-)

-- 
2.1.4

[PATCH bpf 2/2] bpf: enforce usage of __aligned_u64 in the UAPI header

2018-05-27 Thread Eugene Syromiatnikov

Use __aligned_u64 instead of __u64 everywhere in the UAPI header,
similarly to v4.17-rc1~94^2~58^2 "RDMA: Change all uapi headers to use
__aligned_u64 instead of __u64".

This commit doesn't change structure layouts, but imposes additional
alignment requirements on struct bpf_stack_build_id, struct
bpf_sock_ops, struct bpf_perf_event_value, and struct
bpf_raw_tracepoint_args; of them only struct bpf_sock_ops and struct
bpf_perf_event_value have 64-bit fields that were present in a released
kernel without this additional alignment requirement (bytes_received
and bytes_acked were added to struct bpf_sock_ops in commit
v4.16-rc1~123^2~33^2~5^2~4, struct bpf_perf_event_value was added
in commit v4.15-rc1~84^2~532^2~3).

Signed-off-by: Eugene Syromiatnikov 
---
 include/uapi/linux/bpf.h   | 22 +++---
 tools/include/uapi/linux/bpf.h | 22 +++---
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 903010a..174e99a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -259,8 +259,8 @@ struct bpf_stack_build_id {
__s32   status;
unsigned char   build_id[BPF_BUILD_ID_SIZE];
union {
-   __u64   offset;
-   __u64   ip;
+   __aligned_u64   offset;
+   __aligned_u64   ip;
};
 };
 
@@ -288,7 +288,7 @@ union bpf_attr {
__aligned_u64 value;
__aligned_u64 next_key;
};
-   __u64   flags;
+   __aligned_u64   flags;
};
 
struct { /* anonymous struct used by BPF_PROG_LOAD command */
@@ -360,7 +360,7 @@ union bpf_attr {
} query;
 
struct {
-   __u64 name;
+   __aligned_u64 name;
__u32 prog_fd;
} raw_tracepoint;
 } __attribute__((aligned(8)));
@@ -1011,7 +1011,7 @@ struct bpf_prog_info {
__u32 xlated_prog_len;
__aligned_u64 jited_prog_insns;
__aligned_u64 xlated_prog_insns;
-   __u64 load_time;/* ns since boottime */
+   __aligned_u64 load_time;/* ns since boottime */
__u32 created_by_uid;
__u32 nr_map_ids;
__aligned_u64 map_ids;
@@ -1101,8 +1101,8 @@ struct bpf_sock_ops {
__u32 lost_out;
__u32 sacked_out;
__u32 sk_txhash;
-   __u64 bytes_received;
-   __u64 bytes_acked;
+   __aligned_u64 bytes_received;
+   __aligned_u64 bytes_acked;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -1189,9 +1189,9 @@ enum {
 #define TCP_BPF_SNDCWND_CLAMP  1002/* Set sndcwnd_clamp */
 
 struct bpf_perf_event_value {
-   __u64 counter;
-   __u64 enabled;
-   __u64 running;
+   __aligned_u64 counter;
+   __aligned_u64 enabled;
+   __aligned_u64 running;
 };
 
 #define BPF_DEVCG_ACC_MKNOD(1ULL << 0)
@@ -1209,7 +1209,7 @@ struct bpf_cgroup_dev_ctx {
 };
 
 struct bpf_raw_tracepoint_args {
-   __u64 args[0];
+   __aligned_u64 args[0];
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 903010a..174e99a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -259,8 +259,8 @@ struct bpf_stack_build_id {
__s32   status;
unsigned char   build_id[BPF_BUILD_ID_SIZE];
union {
-   __u64   offset;
-   __u64   ip;
+   __aligned_u64   offset;
+   __aligned_u64   ip;
};
 };
 
@@ -288,7 +288,7 @@ union bpf_attr {
__aligned_u64 value;
__aligned_u64 next_key;
};
-   __u64   flags;
+   __aligned_u64   flags;
};
 
struct { /* anonymous struct used by BPF_PROG_LOAD command */
@@ -360,7 +360,7 @@ union bpf_attr {
} query;
 
struct {
-   __u64 name;
+   __aligned_u64 name;
__u32 prog_fd;
} raw_tracepoint;
 } __attribute__((aligned(8)));
@@ -1011,7 +1011,7 @@ struct bpf_prog_info {
__u32 xlated_prog_len;
__aligned_u64 jited_prog_insns;
__aligned_u64 xlated_prog_insns;
-   __u64 load_time;/* ns since boottime */
+   __aligned_u64 load_time;/* ns since boottime */
__u32 created_by_uid;
__u32 nr_map_ids;
__aligned_u64 map_ids;
@@ -1101,8 +1101,8 @@ struct bpf_sock_ops {
__u32 lost_out;
__u32 sacked_out;
__u32 sk_txhash;
-   __u64 bytes_received;
-   __u64 bytes_acked;
+   __aligned_u64 bytes_received;
+   __aligned_u64 bytes_acked;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -1189,9 +1189,9 @@ enum {
 #define TCP_BPF_SNDCWND_CLAMP  1002/* Set sndcwnd_clamp */
 
 struct bpf_perf_event_value {
-   __u64 counter;
-   __u64 enabled;
-   __u64 running;

[PATCH bpf 1/2] bpf: fix alignment of netns_dev/netns_ino fields in bpf_{map,prog}_info

2018-05-27 Thread Eugene Syromiatnikov

Recent introduction of netns_dev/netns_ino to bpf_map_info/bpf_prog info
has broken compat, as offsets of these fields are different in 32-bit
and 64-bit ABIs.  One fix (other than implementing compat support in
syscall in order to handle this discrepancy) is to use __aligned_u64
instead of __u64 for these fields.

Reported-by: Dmitry V. Levin 
Fixes: 52775b33bb507 ("bpf: offload: report device information about
offloaded maps")
Fixes: 675fc275a3a2d ("bpf: offload: report device information for
offloaded programs")

Signed-off-by: Eugene Syromiatnikov 
---
 include/uapi/linux/bpf.h   | 8 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec897..903010a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1017,8 +1017,8 @@ struct bpf_prog_info {
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
__u32 ifindex;
-   __u64 netns_dev;
-   __u64 netns_ino;
+   __aligned_u64 netns_dev;
+   __aligned_u64 netns_ino;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
@@ -1030,8 +1030,8 @@ struct bpf_map_info {
__u32 map_flags;
char  name[BPF_OBJ_NAME_LEN];
__u32 ifindex;
-   __u64 netns_dev;
-   __u64 netns_ino;
+   __aligned_u64 netns_dev;
+   __aligned_u64 netns_ino;
 } __attribute__((aligned(8)));
 
 /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c5ec897..903010a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1017,8 +1017,8 @@ struct bpf_prog_info {
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
__u32 ifindex;
-   __u64 netns_dev;
-   __u64 netns_ino;
+   __aligned_u64 netns_dev;
+   __aligned_u64 netns_ino;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
@@ -1030,8 +1030,8 @@ struct bpf_map_info {
__u32 map_flags;
char  name[BPF_OBJ_NAME_LEN];
__u32 ifindex;
-   __u64 netns_dev;
-   __u64 netns_ino;
+   __aligned_u64 netns_dev;
+   __aligned_u64 netns_ino;
 } __attribute__((aligned(8)));
 
 /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
-- 
2.1.4

1 2 >

1 - 100 of 123 matches

Mail list logo