Re: [PATCH net] mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG
On Mon, May 28, 2018 at 05:55:58AM +0200, Andrew Lunn wrote: > > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > > b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > > index ca38a30fbe91..adc6ab2cf429 100644 > > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > > @@ -4433,6 +4433,11 @@ static int > > mlxsw_sp_netdevice_port_upper_event(struct net_device *lower_dev, > > NL_SET_ERR_MSG_MOD(extack, "Can not put a VLAN on an > > OVS port"); > > return -EINVAL; > > } > > + if (is_vlan_dev(upper_dev) && > > + vlan_dev_vlan_id(upper_dev) == 1) { > > + NL_SET_ERR_MSG_MOD(extack, "Creating a VLAN device with > > VID 1 is unsupported: VLAN 1 carries untagged traffic"); > > + return -EINVAL; > > + } > > Hi Ido > > Would ENOTSUPP be a better return code. VLAN 1 is valid, you just > don't support it. OK, makes sense. We currently use EINVAL for such errors, but we can convert to EOPNOTSUPP in net-next. Thanks
Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag
On Mon, 28 May 2018, Christoph Hellwig wrote: > On Mon, May 28, 2018 at 08:18:46AM +0200, Thomas Gleixner wrote: > > > Which other two? The boot optional removal patches? They just remove > > > the visible interface, but keep the implementation which is converted > > > to the better mechanism here, so I think the order makes sense. > > > But I might be missing something.. > > > > They remove the commandline switch before having the replacement in place > > unless I'm misreading something. > > The command line switch to force 32-bit dma is removed without > replacement. The PCI quirk for force the 32-bit dma for VIA bridges > is kept in place, and switch to a different mechanism in this patch. Fair enough
Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag
On Mon, May 28, 2018 at 08:18:46AM +0200, Thomas Gleixner wrote: > > Which other two? The boot optional removal patches? They just remove > > the visible interface, but keep the implementation which is converted > > to the better mechanism here, so I think the order makes sense. > > But I might be missing something.. > > They remove the commandline switch before having the replacement in place > unless I'm misreading something. The command line switch to force 32-bit dma is removed without replacement. The PCI quirk for force the 32-bit dma for VIA bridges is kept in place, and switch to a different mechanism in this patch.
Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag
On Mon, 28 May 2018, Christoph Hellwig wrote: > On Mon, May 28, 2018 at 08:10:40AM +0200, Thomas Gleixner wrote: > > n Fri, 25 May 2018, Christoph Hellwig wrote: > > > > x86/pci-dma: ... > > > > Please > > > > > Instead of globally disabling > 32bit DMA using the arch_dma_supported > > > hook walk the PCI bus under the actually affected bridge and mark every > > > device with the dma_32bit_limit flag. This also gets rid of the > > > arch_dma_supported hook entirely. > > > > Shouldn't this go before the other two? > > Which other two? The boot optional removal patches? They just remove > the visible interface, but keep the implementation which is converted > to the better mechanism here, so I think the order makes sense. > But I might be missing something.. They remove the commandline switch before having the replacement in place unless I'm misreading something. Thanks, tglx
Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag
On Mon, May 28, 2018 at 08:10:40AM +0200, Thomas Gleixner wrote: > n Fri, 25 May 2018, Christoph Hellwig wrote: > > x86/pci-dma: ... > > Please > > > Instead of globally disabling > 32bit DMA using the arch_dma_supported > > hook walk the PCI bus under the actually affected bridge and mark every > > device with the dma_32bit_limit flag. This also gets rid of the > > arch_dma_supported hook entirely. > > Shouldn't this go before the other two? Which other two? The boot optional removal patches? They just remove the visible interface, but keep the implementation which is converted to the better mechanism here, so I think the order makes sense. But I might be missing something..
Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag
n Fri, 25 May 2018, Christoph Hellwig wrote: x86/pci-dma: ... Please > Instead of globally disabling > 32bit DMA using the arch_dma_supported > hook walk the PCI bus under the actually affected bridge and mark every > device with the dma_32bit_limit flag. This also gets rid of the > arch_dma_supported hook entirely. Shouldn't this go before the other two? > Signed-off-by: Christoph Hellwig Reviewed-by: Thomas Gleixner
Re: [PATCH 6/7] x86: remove the explicit nodac and allowdac option
On Fri, 25 May 2018, Christoph Hellwig wrote: x86/pci-dma: ... Please > This is something drivers should decide (modulo chipset quirks like > for VIA), which as far as I can tell is how things have been handled > for the last 15 years. > > Note that we keep the usedac option for now, as it is used in the wild > to override the too generic VIA quirk. > > Signed-off-by: Christoph Hellwig Reviewed-by: Thomas Gleixner
Re: [PATCH 5/7] x86: remove the experimental forcesac boot option
On Fri, 25 May 2018, Christoph Hellwig wrote: x86/pci-dma: ... Please > Limiting the dma mask to avoid PCI (pre-PCIe) DAC cycles while paying > the huge overhead of an IOMMU is rather pointless, and this seriously > gets in the way of dma mapping work. > > Signed-off-by: Christoph Hellwig Reviewed-by: Thomas Gleixner
[PATCH iproute2] net:sched: add action inheritdsfield to skbedit
The new action inheritdsfield copies the field DS of IPv4 and IPv6 packets into skb->priority. This enables later classification of packets based on the DS field. Original idea by Jamal Hadi Salim Signed-off-by: Qiaobin Fu Reviewed-by: Michel Machado --- Note that the motivation for this patch is found in the following discussion: https://www.spinics.net/lists/netdev/msg501061.html --- diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h index fbcfe27..432ad2f 100644 --- a/include/uapi/linux/tc_act/tc_skbedit.h +++ b/include/uapi/linux/tc_act/tc_skbedit.h @@ -30,9 +30,11 @@ #define SKBEDIT_F_MARK 0x4 #define SKBEDIT_F_PTYPE0x8 #define SKBEDIT_F_MASK 0x10 +#define SKBEDIT_F_INHERITDSFIELD 0x20 struct tc_skbedit { tc_gen; + __u64 flags; }; enum { diff --git a/tc/m_skbedit.c b/tc/m_skbedit.c index db5c64c..7553a40 100644 --- a/tc/m_skbedit.c +++ b/tc/m_skbedit.c @@ -30,16 +30,18 @@ static void explain(void) { - fprintf(stderr, "Usage: ... skbedit <[QM] [PM] [MM] [PT]>\n" + fprintf(stderr, "Usage: ... skbedit <[QM] [PM] [MM] [PT] [IF]>\n" "QM = queue_mapping QUEUE_MAPPING\n" "PM = priority PRIORITY\n" "MM = mark MARK\n" "PT = ptype PACKETYPE\n" + "IF = inheritdsfield\n" "PACKETYPE = is one of:\n" " host, otherhost, broadcast, multicast\n" "QUEUE_MAPPING = device transmit queue to use\n" "PRIORITY = classID to assign to priority field\n" - "MARK = firewall mark to set\n"); + "MARK = firewall mark to set\n" + "note: inheritdsfield maps DS field to skb->priority\n"); } static void @@ -59,7 +61,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, struct rtattr *tail; unsigned int tmp; __u16 queue_mapping, ptype; - __u32 flags = 0, priority, mark; + __u32 priority, mark; struct tc_skbedit sel = { 0 }; if (matches(*argv, "skbedit") != 0) @@ -69,7 +71,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, while (argc > 0) { if (matches(*argv, "queue_mapping") == 0) { - flags |= SKBEDIT_F_QUEUE_MAPPING; + sel.flags |= SKBEDIT_F_QUEUE_MAPPING; NEXT_ARG(); if (get_unsigned(&tmp, *argv, 10) || tmp > 65535) { fprintf(stderr, "Illegal queue_mapping\n"); @@ -78,7 +80,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, queue_mapping = tmp; ok++; } else if (matches(*argv, "priority") == 0) { - flags |= SKBEDIT_F_PRIORITY; + sel.flags |= SKBEDIT_F_PRIORITY; NEXT_ARG(); if (get_tc_classid(&priority, *argv)) { fprintf(stderr, "Illegal priority\n"); @@ -86,7 +88,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, } ok++; } else if (matches(*argv, "mark") == 0) { - flags |= SKBEDIT_F_MARK; + sel.flags |= SKBEDIT_F_MARK; NEXT_ARG(); if (get_u32(&mark, *argv, 0)) { fprintf(stderr, "Illegal mark\n"); @@ -109,7 +111,10 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, *argv); return -1; } - flags |= SKBEDIT_F_PTYPE; + sel.flags |= SKBEDIT_F_PTYPE; + ok++; + } else if (matches(*argv, "inheritdsfield") == 0) { + sel.flags |= SKBEDIT_F_INHERITDSFIELD; ok++; } else if (matches(*argv, "help") == 0) { usage(); @@ -144,16 +149,16 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, tail = addattr_nest(n, MAX_MSG, tca_id); addattr_l(n, MAX_MSG, TCA_SKBEDIT_PARMS, &sel, sizeof(sel)); - if (flags & SKBEDIT_F_QUEUE_MAPPING) + if (sel.flags & SKBEDIT_F_QUEUE_MAPPING) addattr_l(n, MAX_MSG, TCA_SKBEDIT_QUEUE_MAPPING, &queue_mapping, sizeof(queue_mapping)); - if (flags & SKBEDIT_F_PRIORITY) + if (sel.flags & SKBEDIT_F_PRIORITY) addattr_l(n, MAX_MSG, TCA_SKBEDIT_PRIORITY, &priority, sizeof(priority)); - if (flags & SKBEDIT_F_MARK) + if (sel.flags & SKBEDIT_F_MARK)
[PATCH v2 net-next] net:sched: add action inheritdsfield to skbedit
The new action inheritdsfield copies the field DS of IPv4 and IPv6 packets into skb->priority. This enables later classification of packets based on the DS field. Original idea by Jamal Hadi Salim Signed-off-by: Qiaobin Fu Reviewed-by: Michel Machado --- Note that the motivation for this patch is found in the following discussion: https://www.spinics.net/lists/netdev/msg501061.html --- diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h index fbcfe27..432ad2f 100644 --- a/include/uapi/linux/tc_act/tc_skbedit.h +++ b/include/uapi/linux/tc_act/tc_skbedit.h @@ -30,9 +30,11 @@ #define SKBEDIT_F_MARK 0x4 #define SKBEDIT_F_PTYPE0x8 #define SKBEDIT_F_MASK 0x10 +#define SKBEDIT_F_INHERITDSFIELD 0x20 struct tc_skbedit { tc_gen; + __u64 flags; }; enum { diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c index 6138d1d7..c3e9d03 100644 --- a/net/sched/act_skbedit.c +++ b/net/sched/act_skbedit.c @@ -23,6 +23,9 @@ #include #include #include +#include +#include +#include #include #include @@ -39,8 +42,32 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, tcf_lastuse_update(&d->tcf_tm); bstats_update(&d->tcf_bstats, skb); - if (d->flags & SKBEDIT_F_PRIORITY) - skb->priority = d->priority; + if (d->flags & SKBEDIT_F_INHERITDSFIELD) { + int wlen = skb_network_offset(skb); + + switch (tc_skb_protocol(skb)) { + case htons(ETH_P_IP): + wlen += sizeof(struct iphdr); + if (!pskb_may_pull(skb, wlen)) + goto err; + skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2; + break; + + case htons(ETH_P_IPV6): + wlen += sizeof(struct ipv6hdr); + if (!pskb_may_pull(skb, wlen)) + goto err; + skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2; + break; + + default: + goto default_priority; + } + } else { +default_priority: + if (d->flags & SKBEDIT_F_PRIORITY) + skb->priority = d->priority; + } if (d->flags & SKBEDIT_F_QUEUE_MAPPING && skb->dev->real_num_tx_queues > d->queue_mapping) skb_set_queue_mapping(skb, d->queue_mapping); @@ -53,6 +80,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, spin_unlock(&d->tcf_lock); return d->tcf_action; + +err: + spin_unlock(&d->tcf_lock); + return TC_ACT_SHOT; } static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = { @@ -115,6 +146,8 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, } parm = nla_data(tb[TCA_SKBEDIT_PARMS]); + if (parm->flags & SKBEDIT_F_INHERITDSFIELD) + flags |= SKBEDIT_F_INHERITDSFIELD; exists = tcf_idr_check(tn, parm->index, a, bind); if (exists && bind)
[PATCH net] be2net: Fix error detection logic for BE3
Check for 0xE00 (RECOVERABLE_ERR) along with ARMFW UE (0x0) in be_detect_error() to know whether the error is valid error or not Fixes: 673c96e5a ("be2net: Fix UE detection logic for BE3") Signed-off-by: Suresh Reddy --- drivers/net/ethernet/emulex/benet/be_main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c index c697e79..8f75500 100644 --- a/drivers/net/ethernet/emulex/benet/be_main.c +++ b/drivers/net/ethernet/emulex/benet/be_main.c @@ -3309,7 +3309,9 @@ void be_detect_error(struct be_adapter *adapter) if ((val & POST_STAGE_FAT_LOG_START) != POST_STAGE_FAT_LOG_START && (val & POST_STAGE_ARMFW_UE) -!= POST_STAGE_ARMFW_UE) +!= POST_STAGE_ARMFW_UE && + (val & POST_STAGE_RECOVERABLE_ERR) +!= POST_STAGE_RECOVERABLE_ERR) return; } -- 1.8.3.1
Re: [PATCH v2] Revert "alx: remove WoL support"
Hi all, Just inform you a news reported by a user who confirmed the wake up issue can't be reproduce by the new kernel. https://bugzilla.kernel.org/show_bug.cgi?id=61651#c126 Guillaume de Jabrun 2018-05-27 15:12:54 UTC I am using this patch for a long time. I was experiencing the "wake up twice" bug, but with recent kernel version, I don't have this issue anymore. I can't tell which version actually fix the bug (I don't remember..). Best regards, AceLan Kao. 2018-05-21 11:18 GMT+08:00 David Miller : > From: AceLan Kao > Date: Mon, 21 May 2018 11:14:00 +0800 > >> We are willing to fix the issue, but we don't have a machine to >> reproduce it, and the WoL feature has been removed 5 years ago, it's >> hard to find those buggy machines. > > Have you bothered to ask the person who did the revert? > >> WoL is a feature that is only used by a very small group of people, >> and the wake up issue looks like only happens on some >> platforms. Which means only small part of the group of people are >> affected. > > One of those people was the wireless networking stack maintainer. > >> So, it's not a serious issue worth to remove it from alx driver. > > I disagree. > > You must fix the regression solved by the revert, before adding > WoL support back to the driver. > > I'm not going to say this again. > > Thank you. >
[PATCH bpf v2 5/5] selftests/bpf: test_sockmap, print additional test options
Print values of test options like apply, cork, start, end so that individual failed tests can be identified for manual run Acked-by: John Fastabend Signed-off-by: Prashant Bhole --- tools/testing/selftests/bpf/test_sockmap.c | 28 +++--- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c index e8faf4596dac..7970d48c4acb 100644 --- a/tools/testing/selftests/bpf/test_sockmap.c +++ b/tools/testing/selftests/bpf/test_sockmap.c @@ -869,6 +869,8 @@ static char *test_to_str(int test) #define OPTSTRING 60 static void test_options(char *options) { + char tstr[OPTSTRING]; + memset(options, 0, OPTSTRING); if (txmsg_pass) @@ -881,14 +883,22 @@ static void test_options(char *options) strncat(options, "redir_noisy,", OPTSTRING); if (txmsg_drop) strncat(options, "drop,", OPTSTRING); - if (txmsg_apply) - strncat(options, "apply,", OPTSTRING); - if (txmsg_cork) - strncat(options, "cork,", OPTSTRING); - if (txmsg_start) - strncat(options, "start,", OPTSTRING); - if (txmsg_end) - strncat(options, "end,", OPTSTRING); + if (txmsg_apply) { + snprintf(tstr, OPTSTRING, "apply %d,", txmsg_apply); + strncat(options, tstr, OPTSTRING); + } + if (txmsg_cork) { + snprintf(tstr, OPTSTRING, "cork %d,", txmsg_cork); + strncat(options, tstr, OPTSTRING); + } + if (txmsg_start) { + snprintf(tstr, OPTSTRING, "start %d,", txmsg_start); + strncat(options, tstr, OPTSTRING); + } + if (txmsg_end) { + snprintf(tstr, OPTSTRING, "end %d,", txmsg_end); + strncat(options, tstr, OPTSTRING); + } if (txmsg_ingress) strncat(options, "ingress,", OPTSTRING); if (txmsg_skb) @@ -897,7 +907,7 @@ static void test_options(char *options) static int __test_exec(int cgrp, int test, struct sockmap_options *opt) { - char *options = calloc(60, sizeof(char)); + char *options = calloc(OPTSTRING, sizeof(char)); int err; if (test == SENDPAGE) -- 2.17.0
[PATCH bpf v2 4/5] selftests/bpf: test_sockmap, fix data verification
When data verification is enabled, some tests fail because verification is done incorrectly. Following changes fix it. - Identify the size of data block to be verified - Reset verification counter when data block size is reached - Fixed the value printed in case of verfication failure Fixes: 16962b2404ac ("bpf: sockmap, add selftests") Acked-by: John Fastabend Signed-off-by: Prashant Bhole --- tools/testing/selftests/bpf/test_sockmap.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c index f79397513362..e8faf4596dac 100644 --- a/tools/testing/selftests/bpf/test_sockmap.c +++ b/tools/testing/selftests/bpf/test_sockmap.c @@ -337,8 +337,15 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, int fd_flags = O_NONBLOCK; struct timeval timeout; float total_bytes; + int bytes_cnt = 0; + int chunk_sz; fd_set w; + if (opt->sendpage) + chunk_sz = iov_length * cnt; + else + chunk_sz = iov_length * iov_count; + fcntl(fd, fd_flags); total_bytes = (float)iov_count * (float)iov_length * (float)cnt; err = clock_gettime(CLOCK_MONOTONIC, &s->start); @@ -388,9 +395,14 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, errno = -EIO; fprintf(stderr, "detected data corruption @iov[%i]:%i %02x != %02x, %02x ?= %02x\n", - i, j, d[j], k - 1, d[j+1], k + 1); + i, j, d[j], k - 1, d[j+1], k); goto out_errno; } + bytes_cnt++; + if (bytes_cnt == chunk_sz) { + k = 0; + bytes_cnt = 0; + } recv--; } } -- 2.17.0
[PATCH bpf v2 3/5] selftests/bpf: test_sockmap, fix test timeout
In order to reduce runtime of tests, recently timout for select() call was reduced from 1sec to 10usec. This was causing many tests failures. It was caught with failure handling commits in this series. Restoring the timeout from 10usec to 1sec Fixes: a18fda1a62c3 ("bpf: reduce runtime of test_sockmap tests") Signed-off-by: Prashant Bhole --- tools/testing/selftests/bpf/test_sockmap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c index 8a81ea0e9fb6..f79397513362 100644 --- a/tools/testing/selftests/bpf/test_sockmap.c +++ b/tools/testing/selftests/bpf/test_sockmap.c @@ -345,8 +345,8 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, if (err < 0) perror("recv start time: "); while (s->bytes_recvd < total_bytes) { - timeout.tv_sec = 0; - timeout.tv_usec = 10; + timeout.tv_sec = 1; + timeout.tv_usec = 0; /* FD sets */ FD_ZERO(&w); -- 2.17.0
[PATCH bpf v2 2/5] selftests/bpf: test_sockmap, join cgroup in selftest mode
In case of selftest mode, temporary cgroup environment is created but cgroup is not joined. It causes test failures. Fixed by joining the cgroup Fixes: 16962b2404ac ("bpf: sockmap, add selftests") Acked-by: John Fastabend Signed-off-by: Prashant Bhole --- tools/testing/selftests/bpf/test_sockmap.c | 5 + 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c index 34feb74c95c4..8a81ea0e9fb6 100644 --- a/tools/testing/selftests/bpf/test_sockmap.c +++ b/tools/testing/selftests/bpf/test_sockmap.c @@ -1342,6 +1342,11 @@ static int __test_suite(char *bpf_file) return cg_fd; } + if (join_cgroup(CG_PATH)) { + fprintf(stderr, "ERROR: failed to join cgroup\n"); + return -EINVAL; + } + /* Tests basic commands and APIs with range of iov values */ txmsg_start = txmsg_end = 0; err = test_txmsg(cg_fd); -- 2.17.0
[PATCH bpf v2 1/5] selftests/bpf: test_sockmap, check test failure
Test failures are not identified because exit code of RX/TX threads is not checked. Also threads are not returning correct exit code. - Return exit code from threads depending on test execution status - In main thread, check the exit code of RX/TX threads Fixes: 16962b2404ac ("bpf: sockmap, add selftests") Signed-off-by: Prashant Bhole --- tools/testing/selftests/bpf/test_sockmap.c | 25 -- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c index eb17fae458e6..34feb74c95c4 100644 --- a/tools/testing/selftests/bpf/test_sockmap.c +++ b/tools/testing/selftests/bpf/test_sockmap.c @@ -429,8 +429,8 @@ static int sendmsg_test(struct sockmap_options *opt) struct msg_stats s = {0}; int iov_count = opt->iov_count; int iov_buf = opt->iov_length; + int rx_status, tx_status; int cnt = opt->rate; - int status; errno = 0; @@ -442,7 +442,7 @@ static int sendmsg_test(struct sockmap_options *opt) rxpid = fork(); if (rxpid == 0) { if (opt->drop_expected) - exit(1); + exit(0); if (opt->sendpage) iov_count = 1; @@ -463,7 +463,7 @@ static int sendmsg_test(struct sockmap_options *opt) "rx_sendmsg: TX: %zuB %fB/s %fGB/s RX: %zuB %fB/s %fGB/s\n", s.bytes_sent, sent_Bps, sent_Bps/giga, s.bytes_recvd, recvd_Bps, recvd_Bps/giga); - exit(1); + exit(err ? 1 : 0); } else if (rxpid == -1) { perror("msg_loop_rx: "); return errno; @@ -491,14 +491,27 @@ static int sendmsg_test(struct sockmap_options *opt) "tx_sendmsg: TX: %zuB %fB/s %f GB/s RX: %zuB %fB/s %fGB/s\n", s.bytes_sent, sent_Bps, sent_Bps/giga, s.bytes_recvd, recvd_Bps, recvd_Bps/giga); - exit(1); + exit(err ? 1 : 0); } else if (txpid == -1) { perror("msg_loop_tx: "); return errno; } - assert(waitpid(rxpid, &status, 0) == rxpid); - assert(waitpid(txpid, &status, 0) == txpid); + assert(waitpid(rxpid, &rx_status, 0) == rxpid); + assert(waitpid(txpid, &tx_status, 0) == txpid); + if (WIFEXITED(rx_status)) { + err = WEXITSTATUS(rx_status); + if (err) { + fprintf(stderr, "rx thread exited with err %d. ", err); + goto out; + } + } + if (WIFEXITED(tx_status)) { + err = WEXITSTATUS(tx_status); + if (err) + fprintf(stderr, "tx thread exited with err %d. ", err); + } +out: return err; } -- 2.17.0
[PATCH bpf v2 0/5] fix test_sockmap
This series fixes error handling, timeout and data verification in test_sockmap. Previously it was not able to detect failure/timeout in RX/TX thread because error was not notified to the main thread. Also slightly improved test output by printing parameter values (cork, apply, start, end) so that parameters for all tests are displayed. Prashant Bhole (5): selftests/bpf: test_sockmap, check test failure selftests/bpf: test_sockmap, join cgroup in selftest mode selftests/bpf: test_sockmap, fix test timeout selftests/bpf: test_sockmap, fix data verification selftests/bpf: test_sockmap, print additional test options tools/testing/selftests/bpf/test_sockmap.c | 76 +- 1 file changed, 58 insertions(+), 18 deletions(-) -- 2.17.0
Re: [PATCH net] tun: Fix NULL pointer dereference in XDP redirect
On 2018/05/28 11:24, Jason Wang wrote: > On 2018年05月25日 21:43, Toshiaki Makita wrote: > > [...] > @@ -1917,16 +1923,22 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, struct bpf_prog *xdp_prog; int ret; + local_bh_disable(); + preempt_disable(); rcu_read_lock(); xdp_prog = rcu_dereference(tun->xdp_prog); if (xdp_prog) { ret = do_xdp_generic(xdp_prog, skb); if (ret != XDP_PASS) { rcu_read_unlock(); + preempt_enable(); + local_bh_enable(); return total_len; } } rcu_read_unlock(); + preempt_enable(); + local_bh_enable(); } rcu_read_lock(); >>> >>> Good catch, thanks. >>> >>> But I think we can just replace preempt_disable()/enable() with >>> local_bh_disable()/local_bh_enable() ? >> >> I actually thought the same, but noticed this patch. >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4c380066fbe >> >> >> It looks like they do not think local_bh_disable() implies >> preempt_disable(). But I'm not sure why.. >> >> Toshiaki Makita > > I see, there're probably have some subtle differences and implications > for e.g scheduler or others. > > What we what here is to make sure the process is not moved to another > CPU and bh is enabled. By checking preemptible() function, preemption > should be disabled after local_bh_disable(). So I think we're safe here. OK. I checked retint_kernel which IIUC is the entry point of preemption process on x86, and confirmed it just checks if __preempt_count is zero. I haven't checked other archs but I was probably worried too much. Will send v2. Thanks, Toshiaki Makita
Re: [PATCH bpf-next 0/5] fix test_sockmap
On 5/25/2018 11:01 PM, John Fastabend wrote: On 05/25/2018 01:28 AM, Prashant Bhole wrote: On 5/24/2018 1:58 PM, John Fastabend wrote: On 05/23/2018 09:47 PM, Prashant Bhole wrote: On 5/23/2018 6:44 PM, Prashant Bhole wrote: On 5/22/2018 2:08 AM, John Fastabend wrote: On 05/20/2018 10:13 PM, Prashant Bhole wrote: On 5/19/2018 1:42 AM, John Fastabend wrote: On 05/18/2018 12:17 AM, Prashant Bhole wrote: This series fixes bugs in test_sockmap code. They weren't caught previously because failure in RX/TX thread was not notified to the main thread. Also fixed data verification logic and slightly improved test output such that parameters values (cork, apply, start, end) of failed test can be easily seen. Great, this was on my list so thanks for taking care of it. Note: Even after fixing above problems there are issues with tests which set cork parameter. Tests fail (RX thread timeout) when cork value is non-zero and overall data sent by TX thread isn't multiples of cork value. This is expected. When 'cork' is set the sender should only xmit the data when 'cork' bytes are available. If the user doesn't provide the N bytes the data is cork'ed waiting for the bytes and if the socket is closed the state is cleaned up. What these tests are testing is the cleanup path when a user doesn't provide the N bytes. In practice this is used to validate headers and prevent users from sending partial headers. We want to keep these tests because they verify a tear-down path in the code. Ok. After your changes do these get reported as failures? If so we need to account for the above in the calculations. Yes, cork related test are reported as failures because of RX thread timeout. So with your above description, I think we need to differentiate cork tests with partial data and full data. In partial data test we can have something like "timeout_expected" flag. Any other way to fix it? Adding a flag seems reasonable to me. Lets do this for now. Also I plan to add more negative tests so we can either use the same flag or a new one for those cases as well. John, I worked on this for some time and noticed that the RX-timeout of tests with cork parameter is dependent on various parameters. So we can not set a flag like the way 'drop_expected' flag is set before executing the test. So I decided to write a function which judges all parameters before each test and decides whether a test with cork parameter will timeout or not. Then the conditions in the function became complicated. For example some tests fail if opt->rate < 17 (with some other conditions). Here is 17 is related to FRAGS_PER_SKB. Consider following two examples. I'm sorry. Correction: s/FRAGS_PER_SKB/MAX_SKB_FRAGS/ ./test_sockmap --cgroup /mnt/cgroup2 -r 16 -i 1 -l 30 -t sendpage --txmsg --txmsg_cork 1024 # RX timeout occurs ./test_sockmap --cgroup /mnt/cgroup2 -r 17 -i 1 -l 30 -t sendpage --txmsg --txmsg_cork 1024 # Success! Ah yes this hits the buffer limit and flushes the queue. The kernel side doesn't know how to merge those specific sendpage requests so it gives each request its own buffer and when the limit is reached we flush it. Do we need to keep such tests? if yes, then I will continue with adding such conditions in the function. Yes, these tests are needed because they are testing the edge cases. These are probably the most important tests because my normal usage will catch any issues in the "good" cases its these types of things that can go unnoticed (at least for a short while) if we don't have specific tests for them. I tried but it is difficult to come up with a right set of conditions which lead to test failure. Agreed, it can be yes. How about adding your logic for all tests except "cork" cases. If there is a flag to set if the timeout is expected we can always manually set it in the test invocation. Might not be as nice as automatically learning the expected results but possibly easier than building some complicated logic to figure it out. Would you mind submitting your series again without the "cork" tests being tracked? And if you want add a bit to tell if the "cork" tests are going to timeout or not setting it per test manually. But I think your series can just omit the cork test for now and still be useful. Ok. I will submit the series again. Without any change in actual patches, but cover letter reorganized. Thanks. -Prashant
Re: [PATCH net] mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > index ca38a30fbe91..adc6ab2cf429 100644 > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > @@ -4433,6 +4433,11 @@ static int mlxsw_sp_netdevice_port_upper_event(struct > net_device *lower_dev, > NL_SET_ERR_MSG_MOD(extack, "Can not put a VLAN on an > OVS port"); > return -EINVAL; > } > + if (is_vlan_dev(upper_dev) && > + vlan_dev_vlan_id(upper_dev) == 1) { > + NL_SET_ERR_MSG_MOD(extack, "Creating a VLAN device with > VID 1 is unsupported: VLAN 1 carries untagged traffic"); > + return -EINVAL; > + } Hi Ido Would ENOTSUPP be a better return code. VLAN 1 is valid, you just don't support it. Andrew
Re: [PATCH] drivers/net/phy/micrel: Fix for PHY KSZ8061 errrata: Potential link-up failure when Ethernet cable is connected slowly
> This e-mail (including any attachments) is confidential and may be > legally privileged. If you are not an intended recipient or an > authorized representative of an intended recipient, you are > prohibited from using, copying or distributing the information in > this e-mail or its attachments. If you have received this e-mail in > error, please notify the sender immediately by return e-mail and > delete all copies of this message and any attachments. Thank you. Once this has been removed, i will have a comment... Andrew
Re: [PATCH net] tun: Fix NULL pointer dereference in XDP redirect
On 2018年05月25日 21:43, Toshiaki Makita wrote: [...] @@ -1917,16 +1923,22 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, struct bpf_prog *xdp_prog; int ret; + local_bh_disable(); + preempt_disable(); rcu_read_lock(); xdp_prog = rcu_dereference(tun->xdp_prog); if (xdp_prog) { ret = do_xdp_generic(xdp_prog, skb); if (ret != XDP_PASS) { rcu_read_unlock(); + preempt_enable(); + local_bh_enable(); return total_len; } } rcu_read_unlock(); + preempt_enable(); + local_bh_enable(); } rcu_read_lock(); Good catch, thanks. But I think we can just replace preempt_disable()/enable() with local_bh_disable()/local_bh_enable() ? I actually thought the same, but noticed this patch. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4c380066fbe It looks like they do not think local_bh_disable() implies preempt_disable(). But I'm not sure why.. Toshiaki Makita I see, there're probably have some subtle differences and implications for e.g scheduler or others. What we what here is to make sure the process is not moved to another CPU and bh is enabled. By checking preemptible() function, preemption should be disabled after local_bh_disable(). So I think we're safe here. Thanks
[PATCH bpf-next 01/11] bpf: test case for map pointer poison with calls/branches
Add several test cases where the same or different map pointers originate from different paths in the program and execute a map lookup or tail call at a common location. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/linux/filter.h | 10 ++ tools/include/linux/filter.h| 10 ++ tools/testing/selftests/bpf/test_verifier.c | 185 3 files changed, 178 insertions(+), 27 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index d358d18..b443f70 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -289,6 +289,16 @@ struct xdp_buff; .off = OFF, \ .imm = 0 }) +/* Relative call */ + +#define BPF_CALL_REL(TGT) \ + ((struct bpf_insn) {\ + .code = BPF_JMP | BPF_CALL,\ + .dst_reg = 0, \ + .src_reg = BPF_PSEUDO_CALL, \ + .off = 0, \ + .imm = TGT }) + /* Function call */ #define BPF_EMIT_CALL(FUNC)\ diff --git a/tools/include/linux/filter.h b/tools/include/linux/filter.h index c5e512d..af55acf 100644 --- a/tools/include/linux/filter.h +++ b/tools/include/linux/filter.h @@ -263,6 +263,16 @@ #define BPF_LD_MAP_FD(DST, MAP_FD) \ BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD) +/* Relative call */ + +#define BPF_CALL_REL(TGT) \ + ((struct bpf_insn) {\ + .code = BPF_JMP | BPF_CALL,\ + .dst_reg = 0, \ + .src_reg = BPF_PSEUDO_CALL, \ + .off = 0, \ + .imm = TGT }) + /* Program exit */ #define BPF_EXIT_INSN()\ diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 4b4f015..7cb1d74 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -50,7 +50,7 @@ #define MAX_INSNS BPF_MAXINSNS #define MAX_FIXUPS 8 -#define MAX_NR_MAPS4 +#define MAX_NR_MAPS7 #define POINTER_VALUE 0xcafe4all #define TEST_DATA_LEN 64 @@ -66,7 +66,9 @@ struct bpf_test { int fixup_map1[MAX_FIXUPS]; int fixup_map2[MAX_FIXUPS]; int fixup_map3[MAX_FIXUPS]; - int fixup_prog[MAX_FIXUPS]; + int fixup_map4[MAX_FIXUPS]; + int fixup_prog1[MAX_FIXUPS]; + int fixup_prog2[MAX_FIXUPS]; int fixup_map_in_map[MAX_FIXUPS]; const char *errstr; const char *errstr_unpriv; @@ -2769,7 +2771,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .fixup_prog = { 1 }, + .fixup_prog1 = { 1 }, .errstr_unpriv = "R3 leaks addr into helper", .result_unpriv = REJECT, .result = ACCEPT, @@ -2856,7 +2858,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 1), BPF_EXIT_INSN(), }, - .fixup_prog = { 1 }, + .fixup_prog1 = { 1 }, .result = ACCEPT, .retval = 42, }, @@ -2870,7 +2872,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 1), BPF_EXIT_INSN(), }, - .fixup_prog = { 1 }, + .fixup_prog1 = { 1 }, .result = ACCEPT, .retval = 41, }, @@ -2884,7 +2886,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 1), BPF_EXIT_INSN(), }, - .fixup_prog = { 1 }, + .fixup_prog1 = { 1 }, .result = ACCEPT, .retval = 1, }, @@ -2898,7 +2900,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 2), BPF_EXIT_INSN(), }, - .fixup_prog = { 1 }, + .fixup_prog1 = { 1 }, .result = ACCEPT, .retval = 2, }, @@ -2912,7 +2914,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 2), BPF_EXIT_INSN(), }, - .fixup_prog = { 1 }, + .fixup_prog1 = { 1 }, .result = ACCEPT, .retval = 2, }, @@ -2926,7 +2928,7
[PATCH bpf-next 10/11] bpf: sync bpf uapi header with tools
Pull in recent changes from include/uapi/linux/bpf.h. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- tools/include/uapi/linux/bpf.h | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9b8c6e3..7108711 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2004,6 +2004,20 @@ union bpf_attr { * direct packet access. * Return * 0 on success, or a negative error in case of failure. + * + * uint64_t bpf_skb_cgroup_id(struct sk_buff *skb) + * Description + * Return the cgroup v2 id of the socket associated with the *skb*. + * This is roughly similar to the **bpf_get_cgroup_classid**\ () + * helper for cgroup v1 by providing a tag resp. identifier that + * can be matched on or used for map lookups e.g. to implement + * policy. The cgroup v2 id of a given path in the hierarchy is + * exposed in user space through the f_handle API in order to get + * to the same 64-bit id. + * + * This helper can be used on TC egress path, but not on ingress. + * Return + * The id is returned or 0 in case the id could not be retrieved. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2082,7 +2096,8 @@ union bpf_attr { FN(lwt_push_encap), \ FN(lwt_seg6_store_bytes), \ FN(lwt_seg6_adjust_srh),\ - FN(lwt_seg6_action), + FN(lwt_seg6_action),\ + FN(skb_cgroup_id), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -2199,7 +2214,7 @@ struct bpf_tunnel_key { }; __u8 tunnel_tos; __u8 tunnel_ttl; - __u16 tunnel_ext; + __u16 tunnel_ext; /* Padding, future use. */ __u32 tunnel_label; }; @@ -2210,6 +2225,7 @@ struct bpf_xfrm_state { __u32 reqid; __u32 spi; /* Stored in network byte order */ __u16 family; + __u16 ext; /* Padding, future use. */ union { __u32 remote_ipv4; /* Stored in network byte order */ __u32 remote_ipv6[4]; /* Stored in network byte order */ -- 2.9.5
Re: [PATCH v4 net] stmmac: 802.1ad tag stripping support fix
On 2018/05/27 4:24, Elad Nachman wrote: > stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before > calling napi_gro_receive(). > > The function assumes VLAN tagged frames are always tagged with 802.1Q > protocol, > and assigns ETH_P_8021Q to the skb by hard-coding the parameter on call to > __vlan_hwaccel_put_tag() . > > This causes packets not to be passed to the VLAN slave if it was created with > 802.1AD protocol > (ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100). > > This fix passes the protocol from the VLAN header into > __vlan_hwaccel_put_tag() > instead of using the hard-coded value of ETH_P_8021Q. > NETIF_F_HW_VLAN_CTAG_RX check was removed to be in line with the driver > actual abilities. > > Signed-off-by: Elad Nachman > I might have not been clear enough but you still need this hunk with this change. > - ndev->features |= NETIF_F_HW_VLAN_CTAG_RX; > + ndev->features |= (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX); Also I guess spaces are preferred around '|'. Would you check it by checkpatch.pl? -- Toshiaki Makita
[PATCH bpf-next 09/11] bpf: fix context access in tracing progs on 32 bit archs
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT program type in test_verifier report the following errors on x86_32: 172/p unpriv: spill/fill of different pointers ldx FAIL Unexpected error message! 0: (bf) r6 = r10 1: (07) r6 += -8 2: (15) if r1 == 0x0 goto pc+3 R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1 3: (bf) r2 = r10 4: (07) r2 += -76 5: (7b) *(u64 *)(r6 +0) = r2 6: (55) if r1 != 0x0 goto pc+1 R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp 7: (7b) *(u64 *)(r6 +0) = r1 8: (79) r1 = *(u64 *)(r6 +0) 9: (79) r1 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (71) r0 = *(u8 *)(r1 +68) invalid bpf_context access off=68 size=1 379/p check bpf_perf_event_data->sample_period half load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (69) r0 = *(u16 *)(r1 +68) invalid bpf_context access off=68 size=2 380/p check bpf_perf_event_data->sample_period word load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (61) r0 = *(u32 *)(r1 +68) invalid bpf_context access off=68 size=4 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (79) r0 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the case of sample_period access from struct bpf_perf_event_data, hence verifier wrongly thinks we might be doing an unaligned access here. Therefore adjust this down to machine size and check the offset for narrow access on that basis. We also need to fix pe_prog_is_valid_access(), since we hit the check for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test. Reported-by: Wang YanQing Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/linux/filter.h | 30 -- kernel/trace/bpf_trace.c | 10 -- 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index d407ede..89903d2 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -639,16 +639,34 @@ static inline bool bpf_prog_was_classic(const struct bpf_prog *prog) return prog->type == BPF_PROG_TYPE_UNSPEC; } -static inline bool -bpf_ctx_narrow_access_ok(u32 off, u32 size, const u32 size_default) +static inline u32 bpf_ctx_off_adjust_machine(u32 size) +{ + const u32 size_machine = sizeof(unsigned long); + + if (size > size_machine && size % size_machine == 0) + size = size_machine; + + return size; +} + +static inline bool bpf_ctx_narrow_align_ok(u32 off, u32 size_access, + u32 size_default) { - bool off_ok; + size_default = bpf_ctx_off_adjust_machine(size_default); + size_access = bpf_ctx_off_adjust_machine(size_access); + #ifdef __LITTLE_ENDIAN - off_ok = (off & (size_default - 1)) == 0; + return (off & (size_default - 1)) == 0; #else - off_ok = (off & (size_default - 1)) + size == size_default; + return (off & (size_default - 1)) + size_access == size_default; #endif - return off_ok && size <= size_default && (size & (size - 1)) == 0; +} + +static inline bool +bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default) +{ + return bpf_ctx_narrow_align_ok(off, size, size_default) && + size <= size_default && (size & (size - 1)) == 0; } #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 81fdf2f..7269530 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -880,8 +880,14 @@ static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type return false; if (type != BPF_READ) return false; - if (off % size != 0) - return false; + if (off % size != 0) { + if (sizeof(unsigned long) != 4) + return false; + if (size != 8) + return false; + if (off % size != 4) + return false; + } switch (off) { case bpf_ctx_range(struct bpf_perf_event_data, sample_period): -- 2.9.5
[PATCH bpf-next 06/11] bpf: add bpf_skb_cgroup_id helper
Add a new bpf_skb_cgroup_id() helper that allows to retrieve the cgroup id from the skb's socket. This is useful in particular to enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in cgroup v2 by allowing ID based matching on egress. This can in particular be used in combination with applying policy e.g. from map lookups, and also complements the older bpf_skb_under_cgroup() interface. In user space the cgroup id for a given path can be retrieved through the f_handle as demonstrated in [0] recently. [0] https://lkml.org/lkml/2018/5/22/1190 Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 17 - net/core/filter.c| 29 +++-- 2 files changed, 43 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 9b8c6e3..e2853aa 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2004,6 +2004,20 @@ union bpf_attr { * direct packet access. * Return * 0 on success, or a negative error in case of failure. + * + * uint64_t bpf_skb_cgroup_id(struct sk_buff *skb) + * Description + * Return the cgroup v2 id of the socket associated with the *skb*. + * This is roughly similar to the **bpf_get_cgroup_classid**\ () + * helper for cgroup v1 by providing a tag resp. identifier that + * can be matched on or used for map lookups e.g. to implement + * policy. The cgroup v2 id of a given path in the hierarchy is + * exposed in user space through the f_handle API in order to get + * to the same 64-bit id. + * + * This helper can be used on TC egress path, but not on ingress. + * Return + * The id is returned or 0 in case the id could not be retrieved. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2082,7 +2096,8 @@ union bpf_attr { FN(lwt_push_encap), \ FN(lwt_seg6_store_bytes), \ FN(lwt_seg6_adjust_srh),\ - FN(lwt_seg6_action), + FN(lwt_seg6_action),\ + FN(skb_cgroup_id), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/net/core/filter.c b/net/core/filter.c index acf1f4f..717c740 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3661,6 +3661,27 @@ static const struct bpf_func_proto bpf_skb_under_cgroup_proto = { .arg3_type = ARG_ANYTHING, }; +#ifdef CONFIG_SOCK_CGROUP_DATA +BPF_CALL_1(bpf_skb_cgroup_id, const struct sk_buff *, skb) +{ + struct sock *sk = skb_to_full_sk(skb); + struct cgroup *cgrp; + + if (!sk || !sk_fullsock(sk)) + return 0; + + cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); + return cgrp->kn->id.id; +} + +static const struct bpf_func_proto bpf_skb_cgroup_id_proto = { + .func = bpf_skb_cgroup_id, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; +#endif + static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff, unsigned long off, unsigned long len) { @@ -4741,12 +4762,16 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_socket_cookie_proto; case BPF_FUNC_get_socket_uid: return &bpf_get_socket_uid_proto; + case BPF_FUNC_fib_lookup: + return &bpf_skb_fib_lookup_proto; #ifdef CONFIG_XFRM case BPF_FUNC_skb_get_xfrm_state: return &bpf_skb_get_xfrm_state_proto; #endif - case BPF_FUNC_fib_lookup: - return &bpf_skb_fib_lookup_proto; +#ifdef CONFIG_SOCK_CGROUP_DATA + case BPF_FUNC_skb_cgroup_id: + return &bpf_skb_cgroup_id_proto; +#endif default: return bpf_base_func_proto(func_id); } -- 2.9.5
[PATCH bpf-next 07/11] bpf: make sure to clear unused fields in tunnel/xfrm state fetch
Since the remaining bits are not filled in struct bpf_tunnel_key resp. struct bpf_xfrm_state and originate from uninitialized stack space, we should make sure to clear them before handing control back to the program. Also add a padding element to struct bpf_xfrm_state for future use similar as we have in struct bpf_tunnel_key and clear it as well. struct bpf_xfrm_state { __u32 reqid;/* 0 4 */ __u32 spi; /* 4 4 */ __u16 family; /* 8 2 */ /* XXX 2 bytes hole, try to pack */ union { __u32 remote_ipv4; /* 4 */ __u32 remote_ipv6[4]; /* 16 */ }; /*1216 */ /* size: 28, cachelines: 1, members: 4 */ /* sum members: 26, holes: 1, sum holes: 2 */ /* last cacheline: 28 bytes */ }; Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 3 ++- net/core/filter.c| 6 ++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index e2853aa..7108711 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2214,7 +2214,7 @@ struct bpf_tunnel_key { }; __u8 tunnel_tos; __u8 tunnel_ttl; - __u16 tunnel_ext; + __u16 tunnel_ext; /* Padding, future use. */ __u32 tunnel_label; }; @@ -2225,6 +2225,7 @@ struct bpf_xfrm_state { __u32 reqid; __u32 spi; /* Stored in network byte order */ __u16 family; + __u16 ext; /* Padding, future use. */ union { __u32 remote_ipv4; /* Stored in network byte order */ __u32 remote_ipv6[4]; /* Stored in network byte order */ diff --git a/net/core/filter.c b/net/core/filter.c index 717c740..5ceb5e6 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3445,6 +3445,7 @@ BPF_CALL_4(bpf_skb_get_tunnel_key, struct sk_buff *, skb, struct bpf_tunnel_key to->tunnel_id = be64_to_cpu(info->key.tun_id); to->tunnel_tos = info->key.tos; to->tunnel_ttl = info->key.ttl; + to->tunnel_ext = 0; if (flags & BPF_F_TUNINFO_IPV6) { memcpy(to->remote_ipv6, &info->key.u.ipv6.src, @@ -3452,6 +3453,8 @@ BPF_CALL_4(bpf_skb_get_tunnel_key, struct sk_buff *, skb, struct bpf_tunnel_key to->tunnel_label = be32_to_cpu(info->key.label); } else { to->remote_ipv4 = be32_to_cpu(info->key.u.ipv4.src); + memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3); + to->tunnel_label = 0; } if (unlikely(size != sizeof(struct bpf_tunnel_key))) @@ -4047,11 +4050,14 @@ BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index, to->reqid = x->props.reqid; to->spi = x->id.spi; to->family = x->props.family; + to->ext = 0; + if (to->family == AF_INET6) { memcpy(to->remote_ipv6, x->props.saddr.a6, sizeof(to->remote_ipv6)); } else { to->remote_ipv4 = x->props.saddr.a4; + memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3); } return 0; -- 2.9.5
[PATCH bpf-next 02/11] bpf: add also cbpf long jump test cases with heavy expansion
We have one triggering on eBPF but lets also add a cBPF example to make sure we keep tracking them. Also add anther cBPF test running max number of MSH ops. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- lib/test_bpf.c | 63 ++ 1 file changed, 63 insertions(+) diff --git a/lib/test_bpf.c b/lib/test_bpf.c index 317f231..60aedc8 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -356,6 +356,52 @@ static int bpf_fill_maxinsns11(struct bpf_test *self) return __bpf_fill_ja(self, BPF_MAXINSNS, 68); } +static int bpf_fill_maxinsns12(struct bpf_test *self) +{ + unsigned int len = BPF_MAXINSNS; + struct sock_filter *insn; + int i = 0; + + insn = kmalloc_array(len, sizeof(*insn), GFP_KERNEL); + if (!insn) + return -ENOMEM; + + insn[0] = __BPF_JUMP(BPF_JMP | BPF_JA, len - 2, 0, 0); + + for (i = 1; i < len - 1; i++) + insn[i] = __BPF_STMT(BPF_LDX | BPF_B | BPF_MSH, 0); + + insn[len - 1] = __BPF_STMT(BPF_RET | BPF_K, 0xabababab); + + self->u.ptr.insns = insn; + self->u.ptr.len = len; + + return 0; +} + +static int bpf_fill_maxinsns13(struct bpf_test *self) +{ + unsigned int len = BPF_MAXINSNS; + struct sock_filter *insn; + int i = 0; + + insn = kmalloc_array(len, sizeof(*insn), GFP_KERNEL); + if (!insn) + return -ENOMEM; + + for (i = 0; i < len - 3; i++) + insn[i] = __BPF_STMT(BPF_LDX | BPF_B | BPF_MSH, 0); + + insn[len - 3] = __BPF_STMT(BPF_LD | BPF_IMM, 0xabababab); + insn[len - 2] = __BPF_STMT(BPF_ALU | BPF_XOR | BPF_X, 0); + insn[len - 1] = __BPF_STMT(BPF_RET | BPF_A, 0); + + self->u.ptr.insns = insn; + self->u.ptr.len = len; + + return 0; +} + static int bpf_fill_ja(struct bpf_test *self) { /* Hits exactly 11 passes on x86_64 JIT. */ @@ -5290,6 +5336,23 @@ static struct bpf_test tests[] = { .expected_errcode = -ENOTSUPP, }, { + "BPF_MAXINSNS: jump over MSH", + { }, + CLASSIC | FLAG_EXPECTED_FAIL, + { 0xfa, 0xfb, 0xfc, 0xfd, }, + { { 4, 0xabababab } }, + .fill_helper = bpf_fill_maxinsns12, + .expected_errcode = -EINVAL, + }, + { + "BPF_MAXINSNS: exec all MSH", + { }, + CLASSIC, + { 0xfa, 0xfb, 0xfc, 0xfd, }, + { { 4, 0xababab83 } }, + .fill_helper = bpf_fill_maxinsns13, + }, + { "BPF_MAXINSNS: ld_abs+get_processor_id", { }, CLASSIC, -- 2.9.5
[PATCH bpf-next 11/11] bpf, doc: add missing patchwork url and libbpf to maintainers
Add missing bits under tools/lib/bpf/ and also Q: entry in order to make it easier for people to retrieve current patch queue. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index f492431..2fd51db 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2722,6 +2722,7 @@ L:netdev@vger.kernel.org L: linux-ker...@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git +Q: https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 S: Supported F: arch/x86/net/bpf_jit* F: Documentation/networking/filter.txt @@ -2740,6 +2741,7 @@ F:net/sched/act_bpf.c F: net/sched/cls_bpf.c F: samples/bpf/ F: tools/bpf/ +F: tools/lib/bpf/ F: tools/testing/selftests/bpf/ BROADCOM B44 10/100 ETHERNET DRIVER -- 2.9.5
[PATCH bpf-next 05/11] bpf: avoid retpoline for lookup/update/delete calls on maps
While some of the BPF map lookup helpers provide a ->map_gen_lookup() callback for inlining the map lookup altogether it is not available for every map, so the remaining ones have to call bpf_map_lookup_elem() helper which does a dispatch to map->ops->map_lookup_elem(). In times of retpolines, this will control and trap speculative execution rather than letting it do its work for the indirect call and will therefore cause a slowdown. Likewise, bpf_map_update_elem() and bpf_map_delete_elem() do not have an inlined version and need to call into their map->ops->map_update_elem() resp. map->ops->map_delete_elem() handlers. Before: # bpftool p d x i 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#232656 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call bpf_map_delete_elem#215008 <-- indirect call via 16: (95) exit helper After: # bpftool p d x i 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#233328 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call htab_lru_map_delete_elem#238240 <-- direct call 16: (95) exit In all three lookup/update/delete cases however we can use the actual address of the map callback directly if we find that there's only a single path with a map pointer leading to the helper call, meaning when the map pointer has not been poisoned from verifier side. Example code can be seen above for the delete case. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/linux/filter.h | 3 +++ kernel/bpf/hashtab.c | 12 ++--- kernel/bpf/verifier.c | 67 +- 3 files changed, 62 insertions(+), 20 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index b443f70..d407ede 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -301,6 +301,9 @@ struct xdp_buff; /* Function call */ +#define BPF_CAST_CALL(x) \ + ((u64 (*)(u64, u64, u64, u64, u64))(x)) + #define BPF_EMIT_CALL(FUNC)\ ((struct bpf_insn) {\ .code = BPF_JMP | BPF_CALL,\ diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index b76828f..3ca2198 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -503,7 +503,9 @@ static u32 htab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) struct bpf_insn *insn = insn_buf; const int ret = BPF_REG_0; - *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem); + BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem, +(void *(*)(struct bpf_map *map, void *key))NULL)); + *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem)); *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1); *insn++ = BPF_ALU64_IMM(BPF_ADD, ret, offsetof(struct htab_elem, key) + @@ -530,7 +532,9 @@ static u32 htab_lru_map_gen_lookup(struct bpf_map *map, const int ret = BPF_REG_0; const int ref_reg = BPF_REG_1; - *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem); + BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem, +(void *(*)(struct bpf_map *map, void *key))NULL)); + *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem)); *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 4); *insn++ = BPF_LDX_MEM(BPF_B, ref_reg, ret, offsetof(struct htab_elem, lru_node) + @@ -1369,7 +1373,9 @@ static u32 htab_of_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn = insn_buf; const int ret = BPF_REG_0; - *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem); + BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem, +(void *(*)(struct bpf_map *map, void *key))NULL)); + *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem)); *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2); *insn++ = BPF_ALU64_IMM(BPF_ADD, ret, offsetof(struct htab_elem, key) + diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 4f4786e..5684b15 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -2421,8 +2421,11 @@ record_func_map(struct bpf_
[PATCH bpf-next 03/11] bpf: fixup error message from gpl helpers on license mismatch
Stating 'proprietary program' in the error is just silly since it can also be a different open source license than that which is just not compatible. Reference: https://twitter.com/majek04/status/998531268039102465 Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1fd9667b..4f4786e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -2462,7 +2462,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn /* eBPF programs must be GPL compatible to use GPL-ed functions */ if (!env->prog->gpl_compatible && fn->gpl_only) { - verbose(env, "cannot call GPL only function from proprietary program\n"); + verbose(env, "cannot call GPL-restricted function from non-GPL compatible program\n"); return -EINVAL; } -- 2.9.5
[PATCH bpf-next 08/11] bpf: fix cbpf parser bug for octal numbers
Range is 0-7, not 0-9, otherwise parser silently excludes it from the strtol() rather than throwing an error. Reported-by: Marc Boschma Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- tools/bpf/bpf_exp.l | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/bpf/bpf_exp.l b/tools/bpf/bpf_exp.l index bd83149..4da8d05 100644 --- a/tools/bpf/bpf_exp.l +++ b/tools/bpf/bpf_exp.l @@ -175,7 +175,7 @@ extern void yyerror(const char *str); yylval.number = strtol(yytext, NULL, 10); return number; } -([0][0-9]+){ +([0][0-7]+){ yylval.number = strtol(yytext + 1, NULL, 8); return number; } -- 2.9.5
[PATCH bpf-next 00/11] Misc BPF improvements
This set adds various patches I still had in my queue, first two are test cases to provide coverage for the recent two fixes that went to bpf tree, then a small improvement on the error message for gpl helpers. Next, we expose prog and map id into fdinfo in order to allow for inspection of these objections currently used in applications. Patch after that removes a retpoline call for map lookup/update/delete helpers. A new helper is added in the subsequent patch to lookup the skb's socket's cgroup v2 id which can be used in an efficient way for e.g. lookups on egress side. Next one is a fix to fully clear state info in tunnel/xfrm helpers. Given this is full cap_sys_admin from init ns and has same priv requirements like tracing, bpf-next should be okay. A small bug fix for bpf_asm follows, and next a fix for context access in tracing which was recently reported. Lastly, a small update in the maintainer's file to add patchwork url and missing files. Daniel Borkmann (11): bpf: test case for map pointer poison with calls/branches bpf: add also cbpf long jump test cases with heavy expansion bpf: fixup error message from gpl helpers on license mismatch bpf: show prog and map id in fdinfo bpf: avoid retpoline for lookup/update/delete calls on maps bpf: add bpf_skb_cgroup_id helper bpf: make sure to clear unused fields in tunnel/xfrm state fetch bpf: fix cbpf parser bug for octal numbers bpf: fix context access in tracing progs on 32 bit archs bpf: sync bpf uapi header with tools bpf, doc: add missing patchwork url and libbpf to maintainers MAINTAINERS | 2 + include/linux/filter.h | 43 ++- include/uapi/linux/bpf.h| 20 ++- kernel/bpf/hashtab.c| 12 +- kernel/bpf/syscall.c| 12 +- kernel/bpf/verifier.c | 69 --- kernel/trace/bpf_trace.c| 10 +- lib/test_bpf.c | 63 ++ net/core/filter.c | 35 +- tools/bpf/bpf_exp.l | 2 +- tools/include/linux/filter.h| 10 ++ tools/include/uapi/linux/bpf.h | 20 ++- tools/testing/selftests/bpf/test_verifier.c | 185 13 files changed, 416 insertions(+), 67 deletions(-) -- 2.9.5
[PATCH bpf-next 04/11] bpf: show prog and map id in fdinfo
Its trivial and straight forward to expose it for scripts that can then use it along with bpftool in order to inspect an individual application's used maps and progs. Right now we dump some basic information in the fdinfo file but with the help of the map/prog id full introspection becomes possible now. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 388d4fe..79341e8 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -326,13 +326,15 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) "value_size:\t%u\n" "max_entries:\t%u\n" "map_flags:\t%#x\n" - "memlock:\t%llu\n", + "memlock:\t%llu\n" + "map_id:\t%u\n", map->map_type, map->key_size, map->value_size, map->max_entries, map->map_flags, - map->pages * 1ULL << PAGE_SHIFT); + map->pages * 1ULL << PAGE_SHIFT, + map->id); if (owner_prog_type) { seq_printf(m, "owner_prog_type:\t%u\n", @@ -1069,11 +1071,13 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp) "prog_type:\t%u\n" "prog_jited:\t%u\n" "prog_tag:\t%s\n" - "memlock:\t%llu\n", + "memlock:\t%llu\n" + "prog_id:\t%u\n", prog->type, prog->jited, prog_tag, - prog->pages * 1ULL << PAGE_SHIFT); + prog->pages * 1ULL << PAGE_SHIFT, + prog->aux->id); } #endif -- 2.9.5
[PATCH] net: qmi_wwan: Add Netgear Aircard 779S
Add support for Netgear Aircard 779S Signed-off-by: Josh Hill --- drivers/net/usb/qmi_wwan.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index 42565dd..0946808 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1103,6 +1103,7 @@ static const struct usb_device_id products[] = { {QMI_FIXED_INTF(0x05c6, 0x920d, 5)}, {QMI_QUIRK_SET_DTR(0x05c6, 0x9625, 4)}, /* YUGA CLM920-NC5 */ {QMI_FIXED_INTF(0x0846, 0x68a2, 8)}, + {QMI_FIXED_INTF(0x0846, 0x68d3, 8)},/* Netgear Aircard 779S */ {QMI_FIXED_INTF(0x12d1, 0x140c, 1)},/* Huawei E173 */ {QMI_FIXED_INTF(0x12d1, 0x14ac, 1)},/* Huawei E1820 */ {QMI_FIXED_INTF(0x1435, 0xd181, 3)},/* Wistron NeWeb D18Q1 */ -- 2.7.4
Re: WARNING in bpf_int_jit_compile
On 05/26/2018 11:29 AM, syzbot wrote: > syzbot has found a reproducer for the following crash on: > > HEAD commit: 62d18ecfa641 Merge tag 'arm64-fixes' of git://git.kernel.o.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=14c6bf5780 > kernel config: https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02 > dashboard link: https://syzkaller.appspot.com/bug?extid=9e762b52dd17e616a7a5 > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=130e42b780 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+9e762b52dd17e616a...@syzkaller.appspotmail.com > > RAX: ffda RBX: 02542914 RCX: 00455a09 > RDX: 0048 RSI: 2240 RDI: 0005 > RBP: 0072bea0 R08: R09: > R10: R11: 0246 R12: 0003 > R13: 0046 R14: 006f4730 R15: 0023 > WARNING: CPU: 0 PID: 4752 at include/linux/filter.h:667 > bpf_jit_binary_lock_ro include/linux/filter.h:667 [inline] > WARNING: CPU: 0 PID: 4752 at include/linux/filter.h:667 > bpf_int_jit_compile+0xbf7/0xef7 arch/x86/net/bpf_jit_comp.c:1271 > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 4752 Comm: syz-executor0 Not tainted 4.17.0-rc6+ #67 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x1b9/0x294 lib/dump_stack.c:113 > panic+0x22f/0x4de kernel/panic.c:184 > __warn.cold.8+0x163/0x1b3 kernel/panic.c:536 > report_bug+0x252/0x2d0 lib/bug.c:186 > fixup_bug arch/x86/kernel/traps.c:178 [inline] > do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296 > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 > invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 > RIP: 0010:bpf_jit_binary_lock_ro include/linux/filter.h:667 [inline] Been looking into this last Friday already. What seems to happen here is that there's fault injection from inside set_memory_ro(), meaning it will eventually return an error there, and we throw a WARN_ON_ONCE() to bark that making the memory read-only didn't work out. I'd be in preference to notify the user on such issue rather than keeping completely silent about it so that there's awareness that read-only protections are not in place / guaranteed. > RIP: 0010:bpf_int_jit_compile+0xbf7/0xef7 arch/x86/net/bpf_jit_comp.c:1271 > RSP: 0018:8801d85ff920 EFLAGS: 00010293 > RAX: 8801d78c40c0 RBX: 0046 RCX: 81445d89 > RDX: RSI: 81445d97 RDI: 0005 > RBP: 8801d85ffa40 R08: 8801d78c40c0 R09: > R10: R11: R12: c9000194e002 > R13: 8801d85ffa18 R14: fff4 R15: 0003 > bpf_prog_select_runtime+0x131/0x640 kernel/bpf/core.c:1541 > bpf_prog_load+0x16c2/0x2070 kernel/bpf/syscall.c:1333 > __do_sys_bpf kernel/bpf/syscall.c:2073 [inline] > __se_sys_bpf kernel/bpf/syscall.c:2035 [inline] > __x64_sys_bpf+0x389/0x4c0 kernel/bpf/syscall.c:2035 > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x455a09 > RSP: 002b:7ffec3da2868 EFLAGS: 0246 ORIG_RAX: 0141 > RAX: ffda RBX: 02542914 RCX: 00455a09 > RDX: 0048 RSI: 2240 RDI: 0005 > RBP: 0072bea0 R08: R09: > R10: R11: 0246 R12: 0003 > R13: 0046 R14: 006f4730 R15: 0023 > Dumping ftrace buffer: > (ftrace buffer empty) > Kernel Offset: disabled > Rebooting in 86400 seconds.. >
Re: [PATCH v3 bpf-next 0/5] bpf: Hooks for sys_sendmsg
On 05/28/2018 12:56 AM, Daniel Borkmann wrote: > On 05/25/2018 07:09 AM, Andrey Ignatov wrote: >> v2 -> v3: >> * place BPF logic under static key in udp_sendmsg, udpv6_sendmsg; >> * rebase. [...] > > Applied to bpf-next, thanks Andrey! Woops, and I meant of course v4 [0] of the series not v3. ;-) [0] https://patchwork.ozlabs.org/project/netdev/list/?series=46691&state=*
Re: [PATCH v3 bpf-next 0/5] bpf: Hooks for sys_sendmsg
On 05/25/2018 07:09 AM, Andrey Ignatov wrote: > v2 -> v3: > * place BPF logic under static key in udp_sendmsg, udpv6_sendmsg; > * rebase. > > v1 -> v2: > * return ENOTSUPP if bpf_prog rewrote IPv6-only with IPv4-mapped IPv6; > * add test for IPv4-mapped IPv6 use-case; > * fix build for CONFIG_CGROUP_BPF=n; > * rebase. > > This path set adds BPF hooks for sys_sendmsg similar to existing hooks for > sys_bind and sys_connect. > > Hooks allow to override source IP (including the case when it's set via > cmsg(3)) and destination IP:port for unconnected UDP (slow path). TCP and > connected UDP (fast path) are not affected. This makes UDP support > complete: connected UDP is handled by sys_connect hooks, unconnected by > sys_sendmsg ones. > > Similar to sys_connect hooks, sys_sendmsg ones can be used to make system > calls such as sendmsg(2) and sendto(2) return EPERM. > > Please see patch 0001 for more details. > > > Andrey Ignatov (5): > bpf: Hooks for sys_sendmsg > bpf: Sync bpf.h to tools/ > libbpf: Support guessing sendmsg{4,6} progs > selftests/bpf: Prepare test_sock_addr for extension > selftests/bpf: Selftest for sys_sendmsg hooks > > include/linux/bpf-cgroup.h | 23 +- > include/linux/filter.h |1 + > include/uapi/linux/bpf.h |8 + > kernel/bpf/cgroup.c | 11 +- > kernel/bpf/syscall.c |8 + > net/core/filter.c| 39 + > net/ipv4/udp.c | 20 +- > net/ipv6/udp.c | 24 + > tools/include/uapi/linux/bpf.h |8 + > tools/lib/bpf/libbpf.c |2 + > tools/testing/selftests/bpf/Makefile |2 +- > tools/testing/selftests/bpf/sendmsg4_prog.c | 49 ++ > tools/testing/selftests/bpf/sendmsg6_prog.c | 60 ++ > tools/testing/selftests/bpf/test_sock_addr.c | 1155 > +- > 14 files changed, 1214 insertions(+), 196 deletions(-) > create mode 100644 tools/testing/selftests/bpf/sendmsg4_prog.c > create mode 100644 tools/testing/selftests/bpf/sendmsg6_prog.c > Applied to bpf-next, thanks Andrey!
Re: [PATCH, net-next 2/2] bpf: avoid -Wmaybe-uninitialized warning
On 05/25/2018 11:33 PM, Arnd Bergmann wrote: > The stack_map_get_build_id_offset() function is too long for gcc to track > whether 'work' may or may not be initialized at the end of it, leading > to a false-positive warning: > > kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset': > kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this > function [-Werror=maybe-uninitialized] > > This removes the 'in_nmi_ctx' flag and uses the state of that variable > itself to see if it got initialized. > > Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") > Signed-off-by: Arnd Bergmann Applied to bpf-next, thanks Arnd!
Re: [PATCH, net-next 1/2] bpf: btf: avoid -Wreturn-type warning
On 05/25/2018 11:33 PM, Arnd Bergmann wrote: > gcc warns about a noreturn function possibly returning in > some configurations: > > kernel/bpf/btf.c: In function 'env_type_is_resolve_sink': > kernel/bpf/btf.c:729:1: error: control reaches end of non-void function > [-Werror=return-type] > > Using BUG() instead of BUG_ON() avoids that warning and otherwise > does the exact same thing. > > Fixes: eb3f595dab40 ("bpf: btf: Validate type reference") > Signed-off-by: Arnd Bergmann Applied to bpf-next, thanks Arnd!
Re: [PATCH bpf-next v2] selftests/bpf: missing headers test_lwt_seg6local
On 05/26/2018 04:44 PM, Mathieu Xhonneux wrote: > Previous patch "selftests/bpf: test for seg6local End.BPF action" lacks > some UAPI headers in tools/. > > clang -I. -I./include/uapi -I../../../include/uapi -idirafter > /usr/local/include -idirafter > /data/users/yhs/work/llvm/build/install/lib/clang/7.0.0/include > -idirafter /usr/include -Wno-compare-distinct-pointer-types \ > -O2 -target bpf -emit-llvm -c test_lwt_seg6local.c -o - | \ > llc -march=bpf -mcpu=generic -filetype=obj -o > [...]/net-next/tools/testing/selftests/bpf/test_lwt_seg6local.o > test_lwt_seg6local.c:4:10: fatal error: 'linux/seg6_local.h' file not found > ^~~~ > 1 error generated. > make: Leaving directory > `/data/users/yhs/work/net-next/tools/testing/selftests/bpf' > > v2: moving the headers to tools/include/uapi/. > > Reported-by: Y Song > Signed-off-by: Mathieu Xhonneux Applied to bpf-next, thanks Mathieu!
Re: [bpf-next PATCH] bpf: sockhash fix race with bpf_tcp_close and map delete
On 05/25/2018 07:37 PM, John Fastabend wrote: > syzbot reported two related splats, a use after free and null > pointer dereference, when a TCP socket is closed while the map is > also being removed. > > The psock keeps a reference to all map slots that have a reference > to the sock so that when the sock is closed we can clean up any > outstanding sock{map|hash} entries. This avoids pinning a sock > forever if the map owner fails to do proper cleanup. However, the > result is we have two paths that can free an entry in the map. Even > the comment in the sock{map|hash} tear down function, sock_hash_free() > notes this: > > At this point no update, lookup or delete operations can happen. > However, be aware we can still get a socket state event updates, > and data ready callbacks that reference the psock from sk_user_data. > > Both removal paths omitted taking the hash bucket lock resulting > in the case where we have two references that are in the process > of being free'd. > > Reported-by: syzbot+a761b81c211794fa1...@syzkaller.appspotmail.com > Signed-off-by: John Fastabend Applied to bpf-next, thanks John!
Re: [PATCH bpf-next] libbpf: Install btf.h with libbpf
On 05/25/2018 07:23 PM, Andrey Ignatov wrote: > install_headers target should contain all headers that are part of > libbpf. Add missing btf.h > > Signed-off-by: Andrey Ignatov Applied to bpf-next, thanks Andrey!
Proposal
-- Hello I have been trying to contact you. Did you get my business proposal? Best Regards, Miss.Zeliha ömer faruk Esentepe Mahallesi Büyükdere Caddesi Kristal Kule Binasi No:215 Sisli - Istanbul, Turke
Re: aio poll and a new in-kernel poll API V13
OK, it's in -next now; there are several cleanups I'd put into vfs.git#work.aio: aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way aio_read_events_ring(): make a bit more readable aio: shift copyin of iocb into io_submit_one() aio: fold do_io_submit() into callers Those are *not* on -next yet and if anybody has objections against any of those, please yell. Individual patches in followups...
Re: KASAN: use-after-free Read in bpf_tcp_close
[ +John ] On 05/26/2018 10:54 AM, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit: 3fb48d881dbe Merge branch 'bpf-fib-mtu-check' > git tree: bpf-next > console output: https://syzkaller.appspot.com/x/log.txt?x=15fc197780 > kernel config: https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1 > dashboard link: https://syzkaller.appspot.com/bug?extid=fce8f2462c403d02af98 > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1310c85780 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17de717780 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+fce8f2462c403d02a...@syzkaller.appspotmail.com Should be fixed by: https://patchwork.ozlabs.org/patch/920695/ > == > BUG: KASAN: use-after-free in hlist_del_rcu include/linux/rculist.h:427 > [inline] > BUG: KASAN: use-after-free in bpf_tcp_close+0xd7f/0xf80 > kernel/bpf/sockmap.c:271 > Read of size 8 at addr 8801c884cf90 by task syz-executor330/11778 > > CPU: 1 PID: 11778 Comm: syz-executor330 Not tainted 4.17.0-rc4+ #18 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x1b9/0x294 lib/dump_stack.c:113 > print_address_description+0x6c/0x20b mm/kasan/report.c:256 > kasan_report_error mm/kasan/report.c:354 [inline] > kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 > __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 > hlist_del_rcu include/linux/rculist.h:427 [inline] > bpf_tcp_close+0xd7f/0xf80 kernel/bpf/sockmap.c:271 > inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427 > inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459 > sock_release+0x96/0x1b0 net/socket.c:594 > sock_close+0x16/0x20 net/socket.c:1149 > __fput+0x34d/0x890 fs/file_table.c:209 > fput+0x15/0x20 fs/file_table.c:243 > task_work_run+0x1e4/0x290 kernel/task_work.c:113 > exit_task_work include/linux/task_work.h:22 [inline] > do_exit+0x1aee/0x2730 kernel/exit.c:865 > do_group_exit+0x16f/0x430 kernel/exit.c:968 > get_signal+0x886/0x1960 kernel/signal.c:2469 > do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810 > exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162 > prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] > syscall_return_slowpath arch/x86/entry/common.c:265 [inline] > do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x445ed9 > RSP: 002b:7f0078c0adb8 EFLAGS: 0246 ORIG_RAX: 00ca > RAX: fe00 RBX: 006dbc24 RCX: 00445ed9 > RDX: RSI: RDI: 006dbc24 > RBP: 006dbc20 R08: R09: > R10: R11: 0246 R12: > R13: 7ffcd147dbef R14: 7f0078c0b9c0 R15: 0007 > > Allocated by task 11787: > save_stack+0x43/0xd0 mm/kasan/kasan.c:448 > set_track mm/kasan/kasan.c:460 [inline] > kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 > __do_kmalloc_node mm/slab.c:3682 [inline] > __kmalloc_node+0x47/0x70 mm/slab.c:3689 > kmalloc_node include/linux/slab.h:554 [inline] > alloc_sock_hash_elem kernel/bpf/sockmap.c:2114 [inline] > sock_hash_ctx_update_elem.isra.23+0xa57/0x1560 kernel/bpf/sockmap.c:2245 > sock_hash_update_elem+0x14f/0x2d0 kernel/bpf/sockmap.c:2303 > map_update_elem+0x5c4/0xc90 kernel/bpf/syscall.c:760 > __do_sys_bpf kernel/bpf/syscall.c:2134 [inline] > __se_sys_bpf kernel/bpf/syscall.c:2105 [inline] > __x64_sys_bpf+0x32a/0x4f0 kernel/bpf/syscall.c:2105 > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > Freed by task 8998: > save_stack+0x43/0xd0 mm/kasan/kasan.c:448 > set_track mm/kasan/kasan.c:460 [inline] > __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 > kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 > __cache_free mm/slab.c:3498 [inline] > kfree+0xd9/0x260 mm/slab.c:3813 > sock_hash_free+0x24e/0x6e0 kernel/bpf/sockmap.c:2093 > bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:259 > process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145 > worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279 > kthread+0x345/0x410 kernel/kthread.c:238 > ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 > > The buggy address belongs to the object at 8801c884cf80 > which belongs to the cache kmalloc-64 of size 64 > The buggy address is located 16 bytes inside of > 64-byte region [8801c884cf80, 8801c884cfc0) > The buggy address belongs to the page: > page:ea0007221300 count:1 mapcount:0 mapping:8801c884c000 index:0x0 > flags: 0x2fffc000100(slab) > raw: 02fffc000100 8801c884c000 00010020 > raw: ea00072e08e0 ea0006e9
Re: general protection fault in bpf_tcp_close
[ +John ] On 05/26/2018 11:13 AM, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit: fd0bfa8d6e04 Merge branch 'bpf-af-xdp-cleanups' > git tree: bpf-next > console output: https://syzkaller.appspot.com/x/log.txt?x=11da942780 > kernel config: https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1 > dashboard link: https://syzkaller.appspot.com/bug?extid=0ce137753c78f7b6acc1 > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > Unfortunately, I don't have any reproducer for this crash yet. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com Should be fixed by: https://patchwork.ozlabs.org/patch/920695/ > kasan: CONFIG_KASAN_INLINE enabled > kasan: GPF could be caused by NULL-ptr deref or user memory access > general protection fault: [#1] SMP KASAN > Dumping ftrace buffer: > (ftrace buffer empty) > Modules linked in: > CPU: 0 PID: 12139 Comm: syz-executor2 Not tainted 4.17.0-rc4+ #17 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:__hlist_del include/linux/list.h:649 [inline] > RIP: 0010:hlist_del_rcu include/linux/rculist.h:427 [inline] > RIP: 0010:bpf_tcp_close+0x7d2/0xf80 kernel/bpf/sockmap.c:271 > RSP: 0018:8801a8f8ef70 EFLAGS: 00010a02 > RAX: ed00351f1dfd RBX: dc00 RCX: dead0200 > RDX: RSI: 1bd5a040 RDI: 8801cb710910 > RBP: 8801a8f8f110 R08: ed003350ac9d R09: ed003350ac9c > R10: ed003350ac9c R11: 88019a8564e3 R12: 8801cb710380 > R13: 8801b17ea6e0 R14: 8801cb710398 R15: 8801cb710900 > FS: 7f9890c43700() GS:8801dae0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fde1a668000 CR3: 00019dca2000 CR4: 001406f0 > DR0: 21c0 DR1: 21c0 DR2: > DR3: DR6: fffe0ff0 DR7: 0600 > Call Trace: > inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427 > inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459 > sock_release+0x96/0x1b0 net/socket.c:594 > sock_close+0x16/0x20 net/socket.c:1149 > __fput+0x34d/0x890 fs/file_table.c:209 > fput+0x15/0x20 fs/file_table.c:243 > task_work_run+0x1e4/0x290 kernel/task_work.c:113 > exit_task_work include/linux/task_work.h:22 [inline] > do_exit+0x1aee/0x2730 kernel/exit.c:865 > do_group_exit+0x16f/0x430 kernel/exit.c:968 > get_signal+0x886/0x1960 kernel/signal.c:2469 > do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810 > exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162 > prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] > syscall_return_slowpath arch/x86/entry/common.c:265 [inline] > do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x455a09 > RSP: 002b:7f9890c42ce8 EFLAGS: 0246 ORIG_RAX: 00ca > RAX: fe00 RBX: 0072bec8 RCX: 00455a09 > RDX: RSI: RDI: 0072bec8 > RBP: 0072bec8 R08: R09: 0072bea0 > R10: R11: 0246 R12: > R13: 7ffcb48ac3ff R14: 7f9890c439c0 R15: > Code: ff 48 c1 e9 03 80 3c 19 00 0f 85 a9 05 00 00 49 8b 4f 18 48 8b 85 98 fe > ff ff 48 89 ce c6 00 00 48 c1 ee 03 48 89 95 d8 fe ff ff <80> 3c 1e 00 0f 85 > c6 05 00 00 48 8b 85 98 fe ff ff 48 85 d2 48 > RIP: __hlist_del include/linux/list.h:649 [inline] RSP: 8801a8f8ef70 > RIP: hlist_del_rcu include/linux/rculist.h:427 [inline] RSP: 8801a8f8ef70 > RIP: bpf_tcp_close+0x7d2/0xf80 kernel/bpf/sockmap.c:271 RSP: 8801a8f8ef70 > ---[ end trace e81227e93c7e7b75 ]--- > > > --- > This bug is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this bug report. See: > https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with syzbot.
Re: KASAN: use-after-free Write in bpf_tcp_close
[ +John ] On 05/27/2018 10:06 PM, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit: ff4fb475cea8 Merge branch 'btf-uapi-cleanups' > git tree: bpf-next > console output: https://syzkaller.appspot.com/x/log.txt?x=12b3d57780 > kernel config: https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1 > dashboard link: https://syzkaller.appspot.com/bug?extid=31025a5f3f7650081204 > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=109a2f3780 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=171a727b80 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+31025a5f3f7650081...@syzkaller.appspotmail.com Should be fixed by: https://patchwork.ozlabs.org/patch/920695/ > == > BUG: KASAN: use-after-free in cmpxchg_size > include/asm-generic/atomic-instrumented.h:355 [inline] > BUG: KASAN: use-after-free in bpf_tcp_close+0x6f5/0xf80 > kernel/bpf/sockmap.c:265 > Write of size 8 at addr 8801ca277680 by task syz-executor749/9723 > > CPU: 0 PID: 9723 Comm: syz-executor749 Not tainted 4.17.0-rc4+ #19 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x1b9/0x294 lib/dump_stack.c:113 > print_address_description+0x6c/0x20b mm/kasan/report.c:256 > kasan_report_error mm/kasan/report.c:354 [inline] > kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 > check_memory_region_inline mm/kasan/kasan.c:260 [inline] > check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 > kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278 > cmpxchg_size include/asm-generic/atomic-instrumented.h:355 [inline] > bpf_tcp_close+0x6f5/0xf80 kernel/bpf/sockmap.c:265 > inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427 > inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459 > sock_release+0x96/0x1b0 net/socket.c:594 > sock_close+0x16/0x20 net/socket.c:1149 > __fput+0x34d/0x890 fs/file_table.c:209 > fput+0x15/0x20 fs/file_table.c:243 > task_work_run+0x1e4/0x290 kernel/task_work.c:113 > exit_task_work include/linux/task_work.h:22 [inline] > do_exit+0x1aee/0x2730 kernel/exit.c:865 > do_group_exit+0x16f/0x430 kernel/exit.c:968 > __do_sys_exit_group kernel/exit.c:979 [inline] > __se_sys_exit_group kernel/exit.c:977 [inline] > __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:977 > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x440a59 > RSP: 002b:7ffdadf92488 EFLAGS: 0206 ORIG_RAX: 00e7 > RAX: ffda RBX: RCX: 00440a59 > RDX: 00440a59 RSI: 0020 RDI: > RBP: R08: 004002c8 R09: 00401ea0 > R10: 004002c8 R11: 0206 R12: 0001b5ac > R13: 00401ea0 R14: R15: > > Allocated by task 9723: > save_stack+0x43/0xd0 mm/kasan/kasan.c:448 > set_track mm/kasan/kasan.c:460 [inline] > kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 > __do_kmalloc_node mm/slab.c:3682 [inline] > __kmalloc_node+0x47/0x70 mm/slab.c:3689 > kmalloc_node include/linux/slab.h:554 [inline] > bpf_map_area_alloc+0x3f/0x90 kernel/bpf/syscall.c:144 > sock_map_alloc+0x376/0x410 kernel/bpf/sockmap.c:1555 > find_and_alloc_map kernel/bpf/syscall.c:126 [inline] > map_create+0x393/0x1010 kernel/bpf/syscall.c:448 > __do_sys_bpf kernel/bpf/syscall.c:2128 [inline] > __se_sys_bpf kernel/bpf/syscall.c:2105 [inline] > __x64_sys_bpf+0x300/0x4f0 kernel/bpf/syscall.c:2105 > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > Freed by task 4521: > save_stack+0x43/0xd0 mm/kasan/kasan.c:448 > set_track mm/kasan/kasan.c:460 [inline] > __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 > kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 > __cache_free mm/slab.c:3498 [inline] > kfree+0xd9/0x260 mm/slab.c:3813 > kvfree+0x61/0x70 mm/util.c:440 > bpf_map_area_free+0x15/0x20 kernel/bpf/syscall.c:155 > sock_map_remove_complete kernel/bpf/sockmap.c:1443 [inline] > sock_map_free+0x408/0x540 kernel/bpf/sockmap.c:1619 > bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:259 > process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145 > worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279 > kthread+0x345/0x410 kernel/kthread.c:238 > ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 > > The buggy address belongs to the object at 8801ca277680 > which belongs to the cache kmalloc-1024 of size 1024 > The buggy address is located 0 bytes inside of > 1024-byte region [8801ca277680, 8801ca277a80) > The buggy address belongs to the page: > page:ea0007289d80 count:1 mapcount:0 mapping:8801ca276000 index:0x0 > compound_mapcount: 0 > f
[PATCH v3 07/11] net: sched: implement reference counted action release
Implement helper delete function that uses new action ops 'delete', instead of destroying action directly. This is required so act API could delete actions by index, without holding any references to action that is being deleted. Implement function __tcf_action_put() that releases reference to action and frees it, if necessary. Refactor action deletion code to use new put function and not to rely on rtnl lock. Remove rtnl lock assertions that are no longer needed. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Removed redundant actions ops lookup during delete. - Assume all actions have delete implemented and don't check for it explicitly. - Rearrange variable definitions in tcf_action_delete. net/sched/act_api.c | 84 +++-- net/sched/cls_api.c | 1 - 2 files changed, 62 insertions(+), 23 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 0f31f09946ab..a023873db713 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -90,21 +90,39 @@ static void free_tcf(struct tc_action *p) kfree(p); } -static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p) +static void tcf_action_cleanup(struct tc_action *p) { - spin_lock(&idrinfo->lock); - idr_remove(&idrinfo->action_idr, p->tcfa_index); - spin_unlock(&idrinfo->lock); + if (p->ops->cleanup) + p->ops->cleanup(p); + gen_kill_estimator(&p->tcfa_rate_est); free_tcf(p); } +static int __tcf_action_put(struct tc_action *p, bool bind) +{ + struct tcf_idrinfo *idrinfo = p->idrinfo; + + if (refcount_dec_and_lock(&p->tcfa_refcnt, &idrinfo->lock)) { + if (bind) + atomic_dec(&p->tcfa_bindcnt); + idr_remove(&idrinfo->action_idr, p->tcfa_index); + spin_unlock(&idrinfo->lock); + + tcf_action_cleanup(p); + return 1; + } + + if (bind) + atomic_dec(&p->tcfa_bindcnt); + + return 0; +} + int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) { int ret = 0; - ASSERT_RTNL(); - /* Release with strict==1 and bind==0 is only called through act API * interface (classifiers always bind). Only case when action with * positive reference count and zero bind count can exist is when it was @@ -118,18 +136,11 @@ int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) * are acceptable. */ if (p) { - if (bind) - atomic_dec(&p->tcfa_bindcnt); - else if (strict && atomic_read(&p->tcfa_bindcnt) > 0) + if (!bind && strict && atomic_read(&p->tcfa_bindcnt) > 0) return -EPERM; - if (atomic_read(&p->tcfa_bindcnt) <= 0 && - refcount_dec_and_test(&p->tcfa_refcnt)) { - if (p->ops->cleanup) - p->ops->cleanup(p); - tcf_idr_remove(p->idrinfo, p); + if (__tcf_action_put(p, bind)) ret = ACT_P_DELETED; - } } return ret; @@ -340,11 +351,7 @@ int tcf_idr_delete_index(struct tc_action_net *tn, u32 index) p->tcfa_index)); spin_unlock(&idrinfo->lock); - if (p->ops->cleanup) - p->ops->cleanup(p); - - gen_kill_estimator(&p->tcfa_rate_est); - free_tcf(p); + tcf_action_cleanup(p); module_put(owner); return 0; } @@ -615,6 +622,11 @@ int tcf_action_destroy(struct list_head *actions, int bind) return ret; } +static int tcf_action_put(struct tc_action *p) +{ + return __tcf_action_put(p, false); +} + int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { @@ -1092,6 +1104,35 @@ static int tca_action_flush(struct net *net, struct nlattr *nla, return err; } +static int tcf_action_delete(struct net *net, struct list_head *actions, +struct netlink_ext_ack *extack) +{ + struct tc_action *a, *tmp; + u32 act_index; + int ret; + + list_for_each_entry_safe(a, tmp, actions, list) { + const struct tc_action_ops *ops = a->ops; + + /* Actions can be deleted concurrently so we must save their +* type and id to search again after reference is released. +*/ + act_index = a->tcfa_index; + + list_del(&a->list); + if (tcf_action_put(a)) { + /* last reference, action was deleted concurrently */ + module_put(ops->owner); + } else { + /* no
[PATCH v3 10/11] net: sched: atomically check-allocate action
Implement function that atomically checks if action exists and either takes reference to it, or allocates idr slot for action index to prevent concurrent allocations of actions with same index. Use EBUSY error pointer to indicate that idr slot is reserved. Implement cleanup helper function that removes temporary error pointer from idr. (in case of error between idr allocation and insertion of newly created action to specified index) Refactor all action init functions to insert new action to idr using this API. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Remove unique idr insertion function. Change original idr insert to do the same thing. - Refactor action check-alloc code into standalone function. include/net/act_api.h | 3 ++ net/sched/act_api.c| 92 -- net/sched/act_bpf.c| 11 -- net/sched/act_connmark.c | 10 +++-- net/sched/act_csum.c | 11 -- net/sched/act_gact.c | 11 -- net/sched/act_ife.c| 6 ++- net/sched/act_ipt.c| 13 ++- net/sched/act_mirred.c | 16 ++-- net/sched/act_nat.c| 11 -- net/sched/act_pedit.c | 15 ++-- net/sched/act_police.c | 9 - net/sched/act_sample.c | 11 -- net/sched/act_simple.c | 11 +- net/sched/act_skbedit.c| 11 +- net/sched/act_skbmod.c | 11 +- net/sched/act_tunnel_key.c | 9 - net/sched/act_vlan.c | 17 - 18 files changed, 218 insertions(+), 60 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index d256e20507b9..cd4547476074 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -154,6 +154,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, int bind, bool cpustats); void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a); +void tcf_idr_cleanup(struct tc_action_net *tn, u32 index); +int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index, + struct tc_action **a, int bind); int tcf_idr_delete_index(struct tc_action_net *tn, u32 index); int __tcf_idr_release(struct tc_action *a, bool bind, bool strict); diff --git a/net/sched/act_api.c b/net/sched/act_api.c index eefe8c2fe667..9511502e1cbb 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -303,7 +303,9 @@ static bool __tcf_idr_check(struct tc_action_net *tn, u32 index, spin_lock(&idrinfo->lock); p = idr_find(&idrinfo->action_idr, index); - if (p) { + if (IS_ERR(p)) { + p = NULL; + } else if (p) { refcount_inc(&p->tcfa_refcnt); if (bind) atomic_inc(&p->tcfa_bindcnt); @@ -371,7 +373,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, { struct tc_action *p = kzalloc(ops->size, GFP_KERNEL); struct tcf_idrinfo *idrinfo = tn->idrinfo; - struct idr *idr = &idrinfo->action_idr; int err = -ENOMEM; if (unlikely(!p)) @@ -389,20 +390,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, goto err2; } spin_lock_init(&p->tcfa_lock); - idr_preload(GFP_KERNEL); - spin_lock(&idrinfo->lock); - /* user doesn't specify an index */ - if (!index) { - index = 1; - err = idr_alloc_u32(idr, NULL, &index, UINT_MAX, GFP_ATOMIC); - } else { - err = idr_alloc_u32(idr, NULL, &index, index, GFP_ATOMIC); - } - spin_unlock(&idrinfo->lock); - idr_preload_end(); - if (err) - goto err3; - p->tcfa_index = index; p->tcfa_tm.install = jiffies; p->tcfa_tm.lastuse = jiffies; @@ -412,7 +399,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, &p->tcfa_rate_est, &p->tcfa_lock, NULL, est); if (err) - goto err4; + goto err3; } p->idrinfo = idrinfo; @@ -420,8 +407,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, INIT_LIST_HEAD(&p->list); *a = p; return 0; -err4: - idr_remove(idr, index); err3: free_percpu(p->cpu_qstats); err2: @@ -437,11 +422,78 @@ void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a) struct tcf_idrinfo *idrinfo = tn->idrinfo; spin_lock(&idrinfo->lock); - idr_replace(&idrinfo->action_idr, a, a->tcfa_index); + /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */ + WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index))); spin_unlock(&idrinfo->lock); } EXPORT_SYMBOL(tcf_idr_insert); +/* Cleanup idr index that was allocated but not initialized. */ + +void tcf_idr_cleanup(struct tc_acti
[PATCH v3 08/11] net: sched: don't release reference on action overwrite
Return from action init function with reference to action taken, even when overwriting existing action. Action init API initializes its fourth argument (pointer to pointer to tc action) to either existing action with same index or newly created action. In case of existing index(and bind argument is zero), init function returns without incrementing action reference counter. Caller of action init then proceeds working with action, without actually holding reference to it. This means that action could be deleted concurrently. Change action init behavior to always take reference to action before returning successfully, in order to protect from concurrent deletion. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Resplit action lookup/release code to prevent memory leaks in individual patches. - Change convoluted commit message. net/sched/act_api.c| 2 -- net/sched/act_bpf.c| 8 net/sched/act_connmark.c | 5 +++-- net/sched/act_csum.c | 8 net/sched/act_gact.c | 5 +++-- net/sched/act_ife.c| 12 +--- net/sched/act_ipt.c| 5 +++-- net/sched/act_mirred.c | 5 ++--- net/sched/act_nat.c| 5 +++-- net/sched/act_pedit.c | 5 +++-- net/sched/act_police.c | 8 +++- net/sched/act_sample.c | 8 +++- net/sched/act_simple.c | 5 +++-- net/sched/act_skbedit.c| 5 +++-- net/sched/act_skbmod.c | 8 +++- net/sched/act_tunnel_key.c | 8 +++- net/sched/act_vlan.c | 8 +++- 17 files changed, 51 insertions(+), 59 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index a023873db713..f019f0464cec 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -870,8 +870,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, } act->order = i; sz += tcf_action_fill_size(act); - if (ovr) - refcount_inc(&act->tcfa_refcnt); list_add_tail(&act->list, actions); } diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 7941dd66ff83..d3f4ac6f2c4b 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -311,9 +311,10 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla, if (bind) return 0; - tcf_idr_release(*act, bind); - if (!replace) + if (!replace) { + tcf_idr_release(*act, bind); return -EEXIST; + } } is_bpf = tb[TCA_ACT_BPF_OPS_LEN] && tb[TCA_ACT_BPF_OPS]; @@ -356,8 +357,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla, return res; out: - if (res == ACT_P_CREATED) - tcf_idr_release(*act, bind); + tcf_idr_release(*act, bind); return ret; } diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index 143c2d3de723..701e90244eff 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -135,9 +135,10 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla, ci = to_connmark(*a); if (bind) return 0; - tcf_idr_release(*a, bind); - if (!ovr) + if (!ovr) { + tcf_idr_release(*a, bind); return -EEXIST; + } /* replacing action and zone */ ci->tcf_action = parm->action; ci->zone = parm->zone; diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index 3768539340e0..5dbee136b0a1 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -76,9 +76,10 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla, } else { if (bind)/* dont override defaults */ return 0; - tcf_idr_release(*a, bind); - if (!ovr) + if (!ovr) { + tcf_idr_release(*a, bind); return -EEXIST; + } } p = to_tcf_csum(*a); @@ -86,8 +87,7 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla, params_new = kzalloc(sizeof(*params_new), GFP_KERNEL); if (unlikely(!params_new)) { - if (ret == ACT_P_CREATED) - tcf_idr_release(*a, bind); + tcf_idr_release(*a, bind); return -ENOMEM; } params_old = rtnl_dereference(p->params); diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index a431a711f0dd..11c4de3f344e 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -100,9 +100,10 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla, } else { if (bind)/* dont override defaults */ return 0; - tcf_idr_release(*a, bind); -
[PATCH v3 06/11] net: sched: add 'delete' function to action ops
Extend action ops with 'delete' function. Each action type to implements its own delete function that doesn't depend on rtnl lock. Implement delete function that is required to delete actions without holding rtnl lock. Use action API function that atomically deletes action only if it is still in action idr. This implementation prevents concurrent threads from deleting same action twice. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Merge action ops delete definition and implementation. include/net/act_api.h | 1 + net/sched/act_bpf.c| 8 net/sched/act_connmark.c | 8 net/sched/act_csum.c | 8 net/sched/act_gact.c | 8 net/sched/act_ife.c| 8 net/sched/act_ipt.c| 16 net/sched/act_mirred.c | 8 net/sched/act_nat.c| 8 net/sched/act_pedit.c | 8 net/sched/act_police.c | 8 net/sched/act_sample.c | 8 net/sched/act_simple.c | 8 net/sched/act_skbedit.c| 8 net/sched/act_skbmod.c | 8 net/sched/act_tunnel_key.c | 8 net/sched/act_vlan.c | 8 17 files changed, 137 insertions(+) diff --git a/include/net/act_api.h b/include/net/act_api.h index d94ec6400673..d256e20507b9 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -101,6 +101,7 @@ struct tc_action_ops { void(*stats_update)(struct tc_action *, u64, u32, u64); size_t (*get_fill_size)(const struct tc_action *act); struct net_device *(*get_dev)(const struct tc_action *a); + int (*delete)(struct net *net, u32 index); }; struct tc_action_net { diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 8ebf40a3506c..7941dd66ff83 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -388,6 +388,13 @@ static int tcf_bpf_search(struct net *net, struct tc_action **a, u32 index, return tcf_idr_search(tn, a, index); } +static int tcf_bpf_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, bpf_net_id); + + return tcf_idr_delete_index(tn, index); +} + static struct tc_action_ops act_bpf_ops __read_mostly = { .kind = "bpf", .type = TCA_ACT_BPF, @@ -398,6 +405,7 @@ static struct tc_action_ops act_bpf_ops __read_mostly = { .init = tcf_bpf_init, .walk = tcf_bpf_walker, .lookup = tcf_bpf_search, + .delete = tcf_bpf_delete, .size = sizeof(struct tcf_bpf), }; diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index e3787aa0025a..143c2d3de723 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -193,6 +193,13 @@ static int tcf_connmark_search(struct net *net, struct tc_action **a, u32 index, return tcf_idr_search(tn, a, index); } +static int tcf_connmark_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, connmark_net_id); + + return tcf_idr_delete_index(tn, index); +} + static struct tc_action_ops act_connmark_ops = { .kind = "connmark", .type = TCA_ACT_CONNMARK, @@ -202,6 +209,7 @@ static struct tc_action_ops act_connmark_ops = { .init = tcf_connmark_init, .walk = tcf_connmark_walker, .lookup = tcf_connmark_search, + .delete = tcf_connmark_delete, .size = sizeof(struct tcf_connmark_info), }; diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index 334261943f9f..3768539340e0 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -654,6 +654,13 @@ static size_t tcf_csum_get_fill_size(const struct tc_action *act) return nla_total_size(sizeof(struct tc_csum)); } +static int tcf_csum_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, csum_net_id); + + return tcf_idr_delete_index(tn, index); +} + static struct tc_action_ops act_csum_ops = { .kind = "csum", .type = TCA_ACT_CSUM, @@ -665,6 +672,7 @@ static struct tc_action_ops act_csum_ops = { .walk = tcf_csum_walker, .lookup = tcf_csum_search, .get_fill_size = tcf_csum_get_fill_size, + .delete = tcf_csum_delete, .size = sizeof(struct tcf_csum), }; diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index b4dfb2b4addc..a431a711f0dd 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -231,6 +231,13 @@ static size_t tcf_gact_get_fill_size(const struct tc_action *act) return sz; } +static int tcf_gact_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, gact_net_id); + + return tcf_idr
[PATCH v3 09/11] net: sched: use reference counting action init
Change action API to assume that action init function always takes reference to action, even when overwriting existing action. This is necessary because action API continues to use action pointer after init function is done. At this point action becomes accessible for concurrent modifications, so user must always hold reference to it. Implement helper put list function to atomically release list of actions after action API init code is done using them. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Resplit action lookup/release code to prevent memory leaks in individual patches. net/sched/act_api.c | 35 +-- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index f019f0464cec..eefe8c2fe667 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -627,6 +627,18 @@ static int tcf_action_put(struct tc_action *p) return __tcf_action_put(p, false); } +static void tcf_action_put_lst(struct list_head *actions) +{ + struct tc_action *a, *tmp; + + list_for_each_entry_safe(a, tmp, actions, list) { + const struct tc_action_ops *ops = a->ops; + + if (tcf_action_put(a)) + module_put(ops->owner); + } +} + int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { @@ -835,17 +847,6 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, return ERR_PTR(err); } -static void cleanup_a(struct list_head *actions, int ovr) -{ - struct tc_action *a; - - if (!ovr) - return; - - list_for_each_entry(a, actions, list) - refcount_dec(&a->tcfa_refcnt); -} - int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions, size_t *attr_size, @@ -874,11 +875,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, } *attr_size = tcf_action_full_attrs_size(sz); - - /* Remove the temp refcnt which was necessary to protect against -* destroying an existing action which was being replaced -*/ - cleanup_a(actions, ovr); return 0; err: @@ -1209,7 +1205,7 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n, return ret; } err: - tcf_action_destroy(&actions, 0); + tcf_action_put_lst(&actions); return ret; } @@ -1251,8 +1247,11 @@ static int tcf_action_add(struct net *net, struct nlattr *nla, &attr_size, true, extack); if (ret) return ret; + ret = tcf_add_notify(net, n, &actions, portid, attr_size, extack); + if (ovr) + tcf_action_put_lst(&actions); - return tcf_add_notify(net, n, &actions, portid, attr_size, extack); + return ret; } static u32 tcaa_root_flags_allowed = TCA_FLAG_LARGE_DUMP_ON; -- 2.7.5
[PATCH v3 11/11] net: sched: change action API to use array of pointers to actions
Act API used linked list to pass set of actions to functions. It is intrusive data structure that stores list nodes inside action structure itself, which means it is not safe to modify such list concurrently. However, action API doesn't use any linked list specific operations on this set of actions, so it can be safely refactored into plain pointer array. Refactor action API to use array of pointers to tc_actions instead of linked list. Change argument 'actions' type of exported action init, destroy and dump functions. Signed-off-by: Vlad Buslov --- include/net/act_api.h | 7 ++--- net/sched/act_api.c | 74 --- net/sched/cls_api.c | 21 +-- 3 files changed, 50 insertions(+), 52 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index cd4547476074..43dfa5e1b3b3 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -168,19 +168,20 @@ static inline int tcf_idr_release(struct tc_action *a, bool bind) int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops); int tcf_unregister_action(struct tc_action_ops *a, struct pernet_operations *ops); -int tcf_action_destroy(struct list_head *actions, int bind); +int tcf_action_destroy(struct tc_action *actions[], int bind); int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions, int nr_actions, struct tcf_result *res); int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, - struct list_head *actions, size_t *attr_size, + struct tc_action *actions[], size_t *attr_size, bool rtnl_held, struct netlink_ext_ack *extack); struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, bool rtnl_held, struct netlink_ext_ack *extack); -int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int); +int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind, + int ref); int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int); int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int); int tcf_action_copy_stats(struct sk_buff *, struct tc_action *, int); diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 9511502e1cbb..7f904bb84aab 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -657,13 +657,14 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions, } EXPORT_SYMBOL(tcf_action_exec); -int tcf_action_destroy(struct list_head *actions, int bind) +int tcf_action_destroy(struct tc_action *actions[], int bind) { const struct tc_action_ops *ops; - struct tc_action *a, *tmp; - int ret = 0; + struct tc_action *a; + int ret = 0, i; - list_for_each_entry_safe(a, tmp, actions, list) { + for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) { + a = actions[i]; ops = a->ops; ret = __tcf_idr_release(a, bind, true); if (ret == ACT_P_DELETED) @@ -679,11 +680,12 @@ static int tcf_action_put(struct tc_action *p) return __tcf_action_put(p, false); } -static void tcf_action_put_lst(struct list_head *actions) +static void tcf_action_put_many(struct tc_action *actions[]) { - struct tc_action *a, *tmp; + int i; - list_for_each_entry_safe(a, tmp, actions, list) { + for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) { + struct tc_action *a = actions[i]; const struct tc_action_ops *ops = a->ops; if (tcf_action_put(a)) @@ -735,14 +737,15 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref) } EXPORT_SYMBOL(tcf_action_dump_1); -int tcf_action_dump(struct sk_buff *skb, struct list_head *actions, +int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind, int ref) { struct tc_action *a; - int err = -EINVAL; + int err = -EINVAL, i; struct nlattr *nest; - list_for_each_entry(a, actions, list) { + for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) { + a = actions[i]; nest = nla_nest_start(skb, a->order); if (nest == NULL) goto nla_put_failure; @@ -878,10 +881,9 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) { err = tcf_action_goto_chain_init(a, tp); if (err) { - LIST_HEAD(actions); +
[PATCH v3 02/11] net: sched: change type of reference and bind counters
Change type of action reference counter to refcount_t. Change type of action bind counter to atomic_t. This type is used to allow decrementing bind counter without testing for 0 result. Signed-off-by: Vlad Buslov Signed-off-by: Jiri Pirko --- include/net/act_api.h | 5 +++-- net/sched/act_api.c| 32 ++-- net/sched/act_bpf.c| 4 ++-- net/sched/act_connmark.c | 4 ++-- net/sched/act_csum.c | 4 ++-- net/sched/act_gact.c | 4 ++-- net/sched/act_ife.c| 4 ++-- net/sched/act_ipt.c| 4 ++-- net/sched/act_mirred.c | 4 ++-- net/sched/act_nat.c| 4 ++-- net/sched/act_pedit.c | 4 ++-- net/sched/act_police.c | 4 ++-- net/sched/act_sample.c | 4 ++-- net/sched/act_simple.c | 4 ++-- net/sched/act_skbedit.c| 4 ++-- net/sched/act_skbmod.c | 4 ++-- net/sched/act_tunnel_key.c | 4 ++-- net/sched/act_vlan.c | 4 ++-- 18 files changed, 57 insertions(+), 44 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index f7b59ef7303d..e634014605cb 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -6,6 +6,7 @@ * Public action API for classifiers/qdiscs */ +#include #include #include #include @@ -26,8 +27,8 @@ struct tc_action { struct tcf_idrinfo *idrinfo; u32 tcfa_index; - int tcfa_refcnt; - int tcfa_bindcnt; + refcount_t tcfa_refcnt; + atomic_ttcfa_bindcnt; u32 tcfa_capab; int tcfa_action; struct tcf_ttcfa_tm; diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 02670c7489e3..4f064ecab882 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -105,14 +105,26 @@ int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) ASSERT_RTNL(); + /* Release with strict==1 and bind==0 is only called through act API +* interface (classifiers always bind). Only case when action with +* positive reference count and zero bind count can exist is when it was +* also created with act API (unbinding last classifier will destroy the +* action if it was created by classifier). So only case when bind count +* can be changed after initial check is when unbound action is +* destroyed by act API while classifier binds to action with same id +* concurrently. This result either creation of new action(same behavior +* as before), or reusing existing action if concurrent process +* increments reference count before action is deleted. Both scenarios +* are acceptable. +*/ if (p) { if (bind) - p->tcfa_bindcnt--; - else if (strict && p->tcfa_bindcnt > 0) + atomic_dec(&p->tcfa_bindcnt); + else if (strict && atomic_read(&p->tcfa_bindcnt) > 0) return -EPERM; - p->tcfa_refcnt--; - if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) { + if (atomic_read(&p->tcfa_bindcnt) <= 0 && + refcount_dec_and_test(&p->tcfa_refcnt)) { if (p->ops->cleanup) p->ops->cleanup(p); tcf_idr_remove(p->idrinfo, p); @@ -304,8 +316,8 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, if (index && p) { if (bind) - p->tcfa_bindcnt++; - p->tcfa_refcnt++; + atomic_inc(&p->tcfa_bindcnt); + refcount_inc(&p->tcfa_refcnt); *a = p; return true; } @@ -324,9 +336,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, if (unlikely(!p)) return -ENOMEM; - p->tcfa_refcnt = 1; + refcount_set(&p->tcfa_refcnt, 1); if (bind) - p->tcfa_bindcnt = 1; + atomic_set(&p->tcfa_bindcnt, 1); if (cpustats) { p->cpu_bstats = netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu); @@ -782,7 +794,7 @@ static void cleanup_a(struct list_head *actions, int ovr) return; list_for_each_entry(a, actions, list) - a->tcfa_refcnt--; + refcount_dec(&a->tcfa_refcnt); } int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, @@ -810,7 +822,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, act->order = i; sz += tcf_action_fill_size(act); if (ovr) - act->tcfa_refcnt++; + refcount_inc(&act->tcf
[PATCH v3 05/11] net: sched: implement action API that deletes action by index
Implement new action API function that atomically finds and deletes action from idr by index. Intended to be used by lockless actions that do not rely on rtnl lock. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Rename tcf_idr_find_delete to tcf_idr_delete_index. include/net/act_api.h | 1 + net/sched/act_api.c | 39 +++ 2 files changed, 40 insertions(+) diff --git a/include/net/act_api.h b/include/net/act_api.h index 888ff471bbf6..d94ec6400673 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -153,6 +153,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, int bind, bool cpustats); void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a); +int tcf_idr_delete_index(struct tc_action_net *tn, u32 index); int __tcf_idr_release(struct tc_action *a, bool bind, bool strict); static inline int tcf_idr_release(struct tc_action *a, bool bind) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index aa304d36fee0..0f31f09946ab 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -319,6 +319,45 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, } EXPORT_SYMBOL(tcf_idr_check); +int tcf_idr_delete_index(struct tc_action_net *tn, u32 index) +{ + struct tcf_idrinfo *idrinfo = tn->idrinfo; + struct tc_action *p; + int ret = 0; + + spin_lock(&idrinfo->lock); + p = idr_find(&idrinfo->action_idr, index); + if (!p) { + spin_unlock(&idrinfo->lock); + return -ENOENT; + } + + if (!atomic_read(&p->tcfa_bindcnt)) { + if (refcount_dec_and_test(&p->tcfa_refcnt)) { + struct module *owner = p->ops->owner; + + WARN_ON(p != idr_remove(&idrinfo->action_idr, + p->tcfa_index)); + spin_unlock(&idrinfo->lock); + + if (p->ops->cleanup) + p->ops->cleanup(p); + + gen_kill_estimator(&p->tcfa_rate_est); + free_tcf(p); + module_put(owner); + return 0; + } + ret = 0; + } else { + ret = -EPERM; + } + + spin_unlock(&idrinfo->lock); + return ret; +} +EXPORT_SYMBOL(tcf_idr_delete_index); + int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, struct tc_action **a, const struct tc_action_ops *ops, int bind, bool cpustats) -- 2.7.5
[PATCH v3 04/11] net: sched: always take reference to action
Without rtnl lock protection it is no longer safe to use pointer to tc action without holding reference to it. (it can be destroyed concurrently) Remove unsafe action idr lookup function. Instead of it, implement safe tcf idr check function that atomically looks up action in idr and increments its reference and bind counters. Implement both action search and check using new safe function Reference taken by idr check is temporal and should not be accounted by userspace clients (both logically and to preserver current API behavior). Subtract temporal reference when dumping action to userspace using existing tca_get_fill function arguments. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Make __tcf_idr_check function static - Merge changes that take reference to action when performing lookup and changes that account for this additional reference when dumping action to user space into single patch. net/sched/act_api.c | 46 -- 1 file changed, 20 insertions(+), 26 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 256b0c93916c..aa304d36fee0 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -284,44 +284,38 @@ int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb, } EXPORT_SYMBOL(tcf_generic_walker); -static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo) +static bool __tcf_idr_check(struct tc_action_net *tn, u32 index, + struct tc_action **a, int bind) { - struct tc_action *p = NULL; + struct tcf_idrinfo *idrinfo = tn->idrinfo; + struct tc_action *p; spin_lock(&idrinfo->lock); p = idr_find(&idrinfo->action_idr, index); + if (p) { + refcount_inc(&p->tcfa_refcnt); + if (bind) + atomic_inc(&p->tcfa_bindcnt); + } spin_unlock(&idrinfo->lock); - return p; + if (p) { + *a = p; + return true; + } + return false; } int tcf_idr_search(struct tc_action_net *tn, struct tc_action **a, u32 index) { - struct tcf_idrinfo *idrinfo = tn->idrinfo; - struct tc_action *p = tcf_idr_lookup(index, idrinfo); - - if (p) { - *a = p; - return 1; - } - return 0; + return __tcf_idr_check(tn, index, a, 0); } EXPORT_SYMBOL(tcf_idr_search); bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, int bind) { - struct tcf_idrinfo *idrinfo = tn->idrinfo; - struct tc_action *p = tcf_idr_lookup(index, idrinfo); - - if (index && p) { - if (bind) - atomic_inc(&p->tcfa_bindcnt); - refcount_inc(&p->tcfa_refcnt); - *a = p; - return true; - } - return false; + return __tcf_idr_check(tn, index, a, bind); } EXPORT_SYMBOL(tcf_idr_check); @@ -932,7 +926,7 @@ tcf_get_notify(struct net *net, u32 portid, struct nlmsghdr *n, if (!skb) return -ENOBUFS; if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, event, -0, 0) <= 0) { +0, 1) <= 0) { NL_SET_ERR_MSG(extack, "Failed to fill netlink attributes while adding TC action"); kfree_skb(skb); return -EINVAL; @@ -1072,7 +1066,7 @@ tcf_del_notify(struct net *net, struct nlmsghdr *n, struct list_head *actions, return -ENOBUFS; if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, RTM_DELACTION, -0, 1) <= 0) { +0, 2) <= 0) { NL_SET_ERR_MSG(extack, "Failed to fill netlink TC action attributes"); kfree_skb(skb); return -EINVAL; @@ -1131,14 +1125,14 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n, if (event == RTM_GETACTION) ret = tcf_get_notify(net, portid, n, &actions, event, extack); else { /* delete */ + cleanup_a(&actions, 1); /* lookup took reference */ ret = tcf_del_notify(net, n, &actions, portid, attr_size, extack); if (ret) goto err; return ret; } err: - if (event != RTM_GETACTION) - tcf_action_destroy(&actions, 0); + tcf_action_destroy(&actions, 0); return ret; } -- 2.7.5
[PATCH v3 00/11] Modify action API for implementing lockless actions
Currently, all netlink protocol handlers for updating rules, actions and qdiscs are protected with single global rtnl lock which removes any possibility for parallelism. This patch set is a first step to remove rtnl lock dependency from TC rules update path. Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added. Handlers registered with this flag are called without RTNL taken. End goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER, etc.) to be registered with UNLOCKED flag to allow parallel execution. However, there is no intention to completely remove or split rtnl lock itself. This patch set addresses specific problems in action API that prevents it from being executed concurrently. As a preparation for executing TC rules update handlers without rtnl lock, action API code was audited to determine areas that assume external synchronization with rtnl lock and must be changed to allow safe concurrent access with following results: 1. Action idr is already protected with spinlock. However, some code paths assume that idr state is not changes between several consecutive tcf_idr_* function calls. 2. tc_action reference and bind counters are implemented as plain integers. They purpose was to allow single actions to be shared between multiple filters, not to provide means for concurrent modification. 3. tc_action 'cookie' pointer field is not protected against modification. 4. Action API functions, that work with set of actions, use intrusive linked list, which cannot be used concurrently without additional synchronization. 5. Action API functions don't take reference to actions while using them, assuming external synchronization with rtnl lock. Following solutions to these problems are implemented: 1. To remove assumption that idr state doesn't change between tcf_idr_* calls, implement new functions that atomically perform several operations on idr without releasing idr spinlock. (function to atomically lookup and delete action by index, function to atomically check if action exists and allocate new one if necessary, etc.) 2. Use atomic operations on counters to make them suitable for concurrent get/put operations. 3. Data that 'cookie' points to is never modified, so it enough to refactor it to rcu pointer to prevent concurrent de-allocation. 4. Action API doesn't actually use any linked list specific operations on actions intrusive linked list, so it can be refactored to array in straightforward manner. 5. Always take reference to action while accessing it in action API. tcf_idr_search function modified to take reference to action before returning it, so there is no way to lookup an action without incrementing its reference counter. All users of this function are modified to release the reference, after they done using action. With all users using reference counting, it is now safe to concurrently delete actions. Since only shared state in action API module are actions themselves and action idr, these changes are sufficient to not to rely on global rtnl lock for protection of internal action API data structures. Changes from V1 to V2: - Removed redundant actions ops lookup during delete. - Merge action ops delete definition and implementation. - Assume all actions have delete implemented and don't check for it explicitly. - Resplit action lookup/release code to prevent memory leaks in individual patches. - Make __tcf_idr_check function static - Remove unique idr insertion function. Change original idr insert to do the same thing. - Merge changes that take reference to action when performing lookup and changes that account for this additional reference when dumping action to user space into single patch. - Change convoluted commit message. - Rename "unlocked" to "rtnl_held" for clarity. - Remove estimator lock add patch. - Refactor action check-alloc code into standalone function. - Rename tcf_idr_find_delete to tcf_idr_delete_index. - Rearrange variable definitions in tc_action_delete. - Add patch that refactors action API code to use array of pointers to actions instead of intrusive linked list. - Expand cover letter. Vlad Buslov (11): net: sched: use rcu for action cookie update net: sched: change type of reference and bind counters net: sched: implement unlocked action init API net: sched: always take reference to action net: sched: implement action API that deletes action by index net: sched: add 'delete' function to action ops net: sched: implement reference counted action release net: sched: don't release reference on action overwrite net: sched: use reference counting action init net: sched: atomically check-allocate action net: sched: change action API to use array of pointers to actions include/net/act_api.h | 25 ++- include/net/pkt_cls.h | 1 + net/sched/act_api.c| 408 +++-- net/sched/act_bpf.
[PATCH v3 01/11] net: sched: use rcu for action cookie update
Implement functions to atomically update and free action cookie using rcu mechanism. Signed-off-by: Vlad Buslov Signed-off-by: Jiri Pirko --- include/net/act_api.h | 2 +- include/net/pkt_cls.h | 1 + net/sched/act_api.c | 44 ++-- 3 files changed, 32 insertions(+), 15 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index 9e59ebfded62..f7b59ef7303d 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -37,7 +37,7 @@ struct tc_action { spinlock_t tcfa_lock; struct gnet_stats_basic_cpu __percpu *cpu_bstats; struct gnet_stats_queue __percpu *cpu_qstats; - struct tc_cookie*act_cookie; + struct tc_cookie__rcu *act_cookie; struct tcf_chain*goto_chain; }; #define tcf_index common.tcfa_index diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index e828d31be5da..3068cc8aa0f1 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -769,6 +769,7 @@ struct tc_mqprio_qopt_offload { struct tc_cookie { u8 *data; u32 len; + struct rcu_head rcu; }; struct tc_qopt_offload_stats { diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 3f4cf930f809..02670c7489e3 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -55,6 +55,24 @@ static void tcf_action_goto_chain_exec(const struct tc_action *a, res->goto_tp = rcu_dereference_bh(chain->filter_chain); } +static void tcf_free_cookie_rcu(struct rcu_head *p) +{ + struct tc_cookie *cookie = container_of(p, struct tc_cookie, rcu); + + kfree(cookie->data); + kfree(cookie); +} + +static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie, + struct tc_cookie *new_cookie) +{ + struct tc_cookie *old; + + old = xchg(old_cookie, new_cookie); + if (old) + call_rcu(&old->rcu, tcf_free_cookie_rcu); +} + /* XXX: For standalone actions, we don't need a RCU grace period either, because * actions are always connected to filters and filters are already destroyed in * RCU callbacks, so after a RCU grace period actions are already disconnected @@ -65,10 +83,7 @@ static void free_tcf(struct tc_action *p) free_percpu(p->cpu_bstats); free_percpu(p->cpu_qstats); - if (p->act_cookie) { - kfree(p->act_cookie->data); - kfree(p->act_cookie); - } + tcf_set_action_cookie(&p->act_cookie, NULL); if (p->goto_chain) tcf_action_goto_chain_fini(p); @@ -567,16 +582,22 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref) int err = -EINVAL; unsigned char *b = skb_tail_pointer(skb); struct nlattr *nest; + struct tc_cookie *cookie; if (nla_put_string(skb, TCA_KIND, a->ops->kind)) goto nla_put_failure; if (tcf_action_copy_stats(skb, a, 0)) goto nla_put_failure; - if (a->act_cookie) { - if (nla_put(skb, TCA_ACT_COOKIE, a->act_cookie->len, - a->act_cookie->data)) + + rcu_read_lock(); + cookie = rcu_dereference(a->act_cookie); + if (cookie) { + if (nla_put(skb, TCA_ACT_COOKIE, cookie->len, cookie->data)) { + rcu_read_unlock(); goto nla_put_failure; + } } + rcu_read_unlock(); nest = nla_nest_start(skb, TCA_OPTIONS); if (nest == NULL) @@ -719,13 +740,8 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, if (err < 0) goto err_mod; - if (name == NULL && tb[TCA_ACT_COOKIE]) { - if (a->act_cookie) { - kfree(a->act_cookie->data); - kfree(a->act_cookie); - } - a->act_cookie = cookie; - } + if (!name && tb[TCA_ACT_COOKIE]) + tcf_set_action_cookie(&a->act_cookie, cookie); /* module count goes up only when brand new policy is created * if it exists and is only bound to in a_o->init() then -- 2.7.5
[PATCH v3 03/11] net: sched: implement unlocked action init API
Add additional 'rtnl_held' argument to act API init functions. It is required to implement actions that need to release rtnl lock before loading kernel module and reacquire if afterwards. Signed-off-by: Vlad Buslov --- Changes from V1 to V2: - Rename "unlocked" to "rtnl_held" for clarity. include/net/act_api.h | 6 -- net/sched/act_api.c| 18 +++--- net/sched/act_bpf.c| 3 ++- net/sched/act_connmark.c | 2 +- net/sched/act_csum.c | 3 ++- net/sched/act_gact.c | 3 ++- net/sched/act_ife.c| 3 ++- net/sched/act_ipt.c| 6 -- net/sched/act_mirred.c | 5 +++-- net/sched/act_nat.c| 2 +- net/sched/act_pedit.c | 3 ++- net/sched/act_police.c | 2 +- net/sched/act_sample.c | 3 ++- net/sched/act_simple.c | 3 ++- net/sched/act_skbedit.c| 3 ++- net/sched/act_skbmod.c | 3 ++- net/sched/act_tunnel_key.c | 3 ++- net/sched/act_vlan.c | 3 ++- net/sched/cls_api.c| 5 +++-- 19 files changed, 50 insertions(+), 29 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index e634014605cb..888ff471bbf6 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -92,7 +92,8 @@ struct tc_action_ops { struct netlink_ext_ack *extack); int (*init)(struct net *net, struct nlattr *nla, struct nlattr *est, struct tc_action **act, int ovr, - int bind, struct netlink_ext_ack *extack); + int bind, bool rtnl_held, + struct netlink_ext_ack *extack); int (*walk)(struct net *, struct sk_buff *, struct netlink_callback *, int, const struct tc_action_ops *, @@ -168,10 +169,11 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions, int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions, size_t *attr_size, - struct netlink_ext_ack *extack); + bool rtnl_held, struct netlink_ext_ack *extack); struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, + bool rtnl_held, struct netlink_ext_ack *extack); int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int); int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int); diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 4f064ecab882..256b0c93916c 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -671,6 +671,7 @@ static struct tc_cookie *nla_memdup_cookie(struct nlattr **tb) struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, + bool rtnl_held, struct netlink_ext_ack *extack) { struct tc_action *a; @@ -721,9 +722,11 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, a_o = tc_lookup_action_n(act_name); if (a_o == NULL) { #ifdef CONFIG_MODULES - rtnl_unlock(); + if (rtnl_held) + rtnl_unlock(); request_module("act_%s", act_name); - rtnl_lock(); + if (rtnl_held) + rtnl_lock(); a_o = tc_lookup_action_n(act_name); @@ -746,9 +749,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, /* backward compatibility for policer */ if (name == NULL) err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind, - extack); + rtnl_held, extack); else - err = a_o->init(net, nla, est, &a, ovr, bind, extack); + err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held, + extack); if (err < 0) goto err_mod; @@ -800,7 +804,7 @@ static void cleanup_a(struct list_head *actions, int ovr) int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions, size_t *attr_size, - struct netlink_ext_ack *extack) + bool rtnl_held, struct netlink_ext_ack *extack) { struct nlattr *tb[TCA_ACT_MAX_PRIO + 1]; struct tc_action *act; @@ -814,7 +818,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, s
[PATCH] ath10k: fix memory leak of tpc_stats
From: Colin Ian King Currently tpc_stats is allocated and is leaked on the return path if num_tx_chain is greater than WMI_TPC_TX_N_CHAIN. Avoid this leak by performing the check on num_tx_chain before the allocation of tpc_stats. Detected by CoverityScan, CID#1469422 ("Resource Leak") Fixes: 4b190675ad06 ("ath10k: fix kernel panic while reading tpc_stats") Signed-off-by: Colin Ian King --- drivers/net/wireless/ath/ath10k/wmi.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c index 46fb96ee5852..694f46fa9e14 100644 --- a/drivers/net/wireless/ath/ath10k/wmi.c +++ b/drivers/net/wireless/ath/ath10k/wmi.c @@ -4470,10 +4470,6 @@ void ath10k_wmi_event_pdev_tpc_config(struct ath10k *ar, struct sk_buff *skb) ev = (struct wmi_pdev_tpc_config_event *)skb->data; - tpc_stats = kzalloc(sizeof(*tpc_stats), GFP_ATOMIC); - if (!tpc_stats) - return; - num_tx_chain = __le32_to_cpu(ev->num_tx_chain); if (num_tx_chain > WMI_TPC_TX_N_CHAIN) { @@ -4482,6 +4478,10 @@ void ath10k_wmi_event_pdev_tpc_config(struct ath10k *ar, struct sk_buff *skb) return; } + tpc_stats = kzalloc(sizeof(*tpc_stats), GFP_ATOMIC); + if (!tpc_stats) + return; + ath10k_wmi_tpc_config_get_rate_code(rate_code, pream_table, num_tx_chain); -- 2.17.0
KASAN: use-after-free Write in bpf_tcp_close
Hello, syzbot found the following crash on: HEAD commit:ff4fb475cea8 Merge branch 'btf-uapi-cleanups' git tree: bpf-next console output: https://syzkaller.appspot.com/x/log.txt?x=12b3d57780 kernel config: https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1 dashboard link: https://syzkaller.appspot.com/bug?extid=31025a5f3f7650081204 compiler: gcc (GCC) 8.0.1 20180413 (experimental) syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=109a2f3780 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=171a727b80 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+31025a5f3f7650081...@syzkaller.appspotmail.com == BUG: KASAN: use-after-free in cmpxchg_size include/asm-generic/atomic-instrumented.h:355 [inline] BUG: KASAN: use-after-free in bpf_tcp_close+0x6f5/0xf80 kernel/bpf/sockmap.c:265 Write of size 8 at addr 8801ca277680 by task syz-executor749/9723 CPU: 0 PID: 9723 Comm: syz-executor749 Not tainted 4.17.0-rc4+ #19 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278 cmpxchg_size include/asm-generic/atomic-instrumented.h:355 [inline] bpf_tcp_close+0x6f5/0xf80 kernel/bpf/sockmap.c:265 inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:459 sock_release+0x96/0x1b0 net/socket.c:594 sock_close+0x16/0x20 net/socket.c:1149 __fput+0x34d/0x890 fs/file_table.c:209 fput+0x15/0x20 fs/file_table.c:243 task_work_run+0x1e4/0x290 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x1aee/0x2730 kernel/exit.c:865 do_group_exit+0x16f/0x430 kernel/exit.c:968 __do_sys_exit_group kernel/exit.c:979 [inline] __se_sys_exit_group kernel/exit.c:977 [inline] __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:977 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x440a59 RSP: 002b:7ffdadf92488 EFLAGS: 0206 ORIG_RAX: 00e7 RAX: ffda RBX: RCX: 00440a59 RDX: 00440a59 RSI: 0020 RDI: RBP: R08: 004002c8 R09: 00401ea0 R10: 004002c8 R11: 0206 R12: 0001b5ac R13: 00401ea0 R14: R15: Allocated by task 9723: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 __do_kmalloc_node mm/slab.c:3682 [inline] __kmalloc_node+0x47/0x70 mm/slab.c:3689 kmalloc_node include/linux/slab.h:554 [inline] bpf_map_area_alloc+0x3f/0x90 kernel/bpf/syscall.c:144 sock_map_alloc+0x376/0x410 kernel/bpf/sockmap.c:1555 find_and_alloc_map kernel/bpf/syscall.c:126 [inline] map_create+0x393/0x1010 kernel/bpf/syscall.c:448 __do_sys_bpf kernel/bpf/syscall.c:2128 [inline] __se_sys_bpf kernel/bpf/syscall.c:2105 [inline] __x64_sys_bpf+0x300/0x4f0 kernel/bpf/syscall.c:2105 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 4521: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xd9/0x260 mm/slab.c:3813 kvfree+0x61/0x70 mm/util.c:440 bpf_map_area_free+0x15/0x20 kernel/bpf/syscall.c:155 sock_map_remove_complete kernel/bpf/sockmap.c:1443 [inline] sock_map_free+0x408/0x540 kernel/bpf/sockmap.c:1619 bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:259 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279 kthread+0x345/0x410 kernel/kthread.c:238 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 The buggy address belongs to the object at 8801ca277680 which belongs to the cache kmalloc-1024 of size 1024 The buggy address is located 0 bytes inside of 1024-byte region [8801ca277680, 8801ca277a80) The buggy address belongs to the page: page:ea0007289d80 count:1 mapcount:0 mapping:8801ca276000 index:0x0 compound_mapcount: 0 flags: 0x2fffc008100(slab|head) raw: 02fffc008100 8801ca276000 00010007 raw: ea0006d12b20 ea000763bba0 8801da800ac0 page dumped because: kasan: bad access detected Memory state around the buggy address: 8801ca277580: fc fc fc fc fc fc f
[PATCH] net: sched: split tc_ctl_tfilter into three handlers
tc_ctl_tfilter handles three netlink message types: RTM_NEWTFILTER, RTM_DELTFILTER, RTM_GETTFILTER. However, implementation of this function involves a lot of branching on specific message type because most of the code is message-specific. This significantly complicates adding new functionality and doesn't provide much benefit of code reuse. Split tc_ctl_tfilter to three standalone functions that handle filter new, delete and get requests. The only truly protocol independent part of tc_ctl_tfilter is code that looks up queue, class, and block. Refactor this code to standalone tcf_block_find function that is used by all three new handlers. Signed-off-by: Vlad Buslov --- net/sched/cls_api.c | 438 +++- 1 file changed, 293 insertions(+), 145 deletions(-) diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 1ed673f501c2..8dfbad1bddf5 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -436,6 +436,78 @@ static struct tcf_block *tcf_block_lookup(struct net *net, u32 block_index) return idr_find(&tn->idr, block_index); } +/* Find tcf block. + * Set q, parent, cl when appropriate. + */ + +static struct tcf_block *tcf_block_find(struct net *net, struct Qdisc **q, + u32 *parent, unsigned long *cl, + int ifindex, u32 block_index, + struct netlink_ext_ack *extack) +{ + struct tcf_block *block; + + if (ifindex == TCM_IFINDEX_MAGIC_BLOCK) { + block = tcf_block_lookup(net, block_index); + if (!block) { + NL_SET_ERR_MSG(extack, "Block of given index was not found"); + return ERR_PTR(-EINVAL); + } + } else { + const struct Qdisc_class_ops *cops; + struct net_device *dev; + + /* Find link */ + dev = __dev_get_by_index(net, ifindex); + if (!dev) + return ERR_PTR(-ENODEV); + + /* Find qdisc */ + if (!*parent) { + *q = dev->qdisc; + *parent = (*q)->handle; + } else { + *q = qdisc_lookup(dev, TC_H_MAJ(*parent)); + if (!*q) { + NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists"); + return ERR_PTR(-EINVAL); + } + } + + /* Is it classful? */ + cops = (*q)->ops->cl_ops; + if (!cops) { + NL_SET_ERR_MSG(extack, "Qdisc not classful"); + return ERR_PTR(-EINVAL); + } + + if (!cops->tcf_block) { + NL_SET_ERR_MSG(extack, "Class doesn't support blocks"); + return ERR_PTR(-EOPNOTSUPP); + } + + /* Do we search for filter, attached to class? */ + if (TC_H_MIN(*parent)) { + *cl = cops->find(*q, *parent); + if (*cl == 0) { + NL_SET_ERR_MSG(extack, "Specified class doesn't exist"); + return ERR_PTR(-ENOENT); + } + } + + /* And the last stroke */ + block = cops->tcf_block(*q, *cl, extack); + if (!block) + return ERR_PTR(-EINVAL); + if (tcf_block_shared(block)) { + NL_SET_ERR_MSG(extack, "This filter block is shared. Please use the block index to manipulate the filters"); + return ERR_PTR(-EOPNOTSUPP); + } + } + + return block; +} + static struct tcf_chain *tcf_block_chain_zero(struct tcf_block *block) { return list_first_entry(&block->chain_list, struct tcf_chain, list); @@ -1002,9 +1074,7 @@ static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb, q, parent, 0, event, false); } -/* Add/change/delete/get a filter node */ - -static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n, +static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n, struct netlink_ext_ack *extack) { struct net *net = sock_net(skb->sk); @@ -1025,8 +1095,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n, int err; int tp_created; - if ((n->nlmsg_type != RTM_GETTFILTER) && - !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) + if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) return -EPERM; replay: @@ -1044,24 +1113,13 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n, cl = 0; if (prio == 0) { - switch (n->nlmsg_type) { -
Re: [PATCH v2 00/11] Modify action API for implementing lockless actions
Sun, May 27, 2018 at 06:41:27PM CEST, vla...@mellanox.com wrote: [...] >Changes from V1 to V2: >- Removed redundant actions ops lookup during delete. >- Merge action ops delete definition and implementation. >- Assume all actions have delete implemented and don't check for it > explicitly. >- Resplit action lookup/release code to prevent memory leaks in > individual patches. >- Make __tcf_idr_check function static >- Remove unique idr insertion function. Change original idr insert to do > the same thing. >- Merge changes that take reference to action when performing lookup and > changes that account for this additional reference when dumping action > to user space into single patch. >- Change convoluted commit message. >- Rename "unlocked" to "rtnl_held" for clarity. >- Remove estimator lock add patch. >- Refactor action check-alloc code into standalone function. >- Rename tcf_idr_find_delete to tcf_idr_delete_index. >- Rearrange variable definitions in tc_action_delete. >- Add patch that refactors action API code to use array of pointers to > actions instead of intrusive linked list. >- Expand cover letter. With 11 patches, I'm unable to follow the changes in this description. Could you please write the changes to each individual patch? What I do is I write changes for each patch under "---" (there are two "---" lines in each patch then. Then I also copy the changes descriptions to the cover letter. [...]
[PATCH v2 01/11] net: sched: use rcu for action cookie update
Implement functions to atomically update and free action cookie using rcu mechanism. Signed-off-by: Vlad Buslov Signed-off-by: Jiri Pirko --- include/net/act_api.h | 2 +- include/net/pkt_cls.h | 1 + net/sched/act_api.c | 44 ++-- 3 files changed, 32 insertions(+), 15 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index 9e59ebfded62..f7b59ef7303d 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -37,7 +37,7 @@ struct tc_action { spinlock_t tcfa_lock; struct gnet_stats_basic_cpu __percpu *cpu_bstats; struct gnet_stats_queue __percpu *cpu_qstats; - struct tc_cookie*act_cookie; + struct tc_cookie__rcu *act_cookie; struct tcf_chain*goto_chain; }; #define tcf_index common.tcfa_index diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index e828d31be5da..3068cc8aa0f1 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -769,6 +769,7 @@ struct tc_mqprio_qopt_offload { struct tc_cookie { u8 *data; u32 len; + struct rcu_head rcu; }; struct tc_qopt_offload_stats { diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 3f4cf930f809..02670c7489e3 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -55,6 +55,24 @@ static void tcf_action_goto_chain_exec(const struct tc_action *a, res->goto_tp = rcu_dereference_bh(chain->filter_chain); } +static void tcf_free_cookie_rcu(struct rcu_head *p) +{ + struct tc_cookie *cookie = container_of(p, struct tc_cookie, rcu); + + kfree(cookie->data); + kfree(cookie); +} + +static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie, + struct tc_cookie *new_cookie) +{ + struct tc_cookie *old; + + old = xchg(old_cookie, new_cookie); + if (old) + call_rcu(&old->rcu, tcf_free_cookie_rcu); +} + /* XXX: For standalone actions, we don't need a RCU grace period either, because * actions are always connected to filters and filters are already destroyed in * RCU callbacks, so after a RCU grace period actions are already disconnected @@ -65,10 +83,7 @@ static void free_tcf(struct tc_action *p) free_percpu(p->cpu_bstats); free_percpu(p->cpu_qstats); - if (p->act_cookie) { - kfree(p->act_cookie->data); - kfree(p->act_cookie); - } + tcf_set_action_cookie(&p->act_cookie, NULL); if (p->goto_chain) tcf_action_goto_chain_fini(p); @@ -567,16 +582,22 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref) int err = -EINVAL; unsigned char *b = skb_tail_pointer(skb); struct nlattr *nest; + struct tc_cookie *cookie; if (nla_put_string(skb, TCA_KIND, a->ops->kind)) goto nla_put_failure; if (tcf_action_copy_stats(skb, a, 0)) goto nla_put_failure; - if (a->act_cookie) { - if (nla_put(skb, TCA_ACT_COOKIE, a->act_cookie->len, - a->act_cookie->data)) + + rcu_read_lock(); + cookie = rcu_dereference(a->act_cookie); + if (cookie) { + if (nla_put(skb, TCA_ACT_COOKIE, cookie->len, cookie->data)) { + rcu_read_unlock(); goto nla_put_failure; + } } + rcu_read_unlock(); nest = nla_nest_start(skb, TCA_OPTIONS); if (nest == NULL) @@ -719,13 +740,8 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, if (err < 0) goto err_mod; - if (name == NULL && tb[TCA_ACT_COOKIE]) { - if (a->act_cookie) { - kfree(a->act_cookie->data); - kfree(a->act_cookie); - } - a->act_cookie = cookie; - } + if (!name && tb[TCA_ACT_COOKIE]) + tcf_set_action_cookie(&a->act_cookie, cookie); /* module count goes up only when brand new policy is created * if it exists and is only bound to in a_o->init() then -- 2.7.5
[PATCH v2 02/11] net: sched: change type of reference and bind counters
Change type of action reference counter to refcount_t. Change type of action bind counter to atomic_t. This type is used to allow decrementing bind counter without testing for 0 result. Signed-off-by: Vlad Buslov Signed-off-by: Jiri Pirko --- include/net/act_api.h | 5 +++-- net/sched/act_api.c| 32 ++-- net/sched/act_bpf.c| 4 ++-- net/sched/act_connmark.c | 4 ++-- net/sched/act_csum.c | 4 ++-- net/sched/act_gact.c | 4 ++-- net/sched/act_ife.c| 4 ++-- net/sched/act_ipt.c| 4 ++-- net/sched/act_mirred.c | 4 ++-- net/sched/act_nat.c| 4 ++-- net/sched/act_pedit.c | 4 ++-- net/sched/act_police.c | 4 ++-- net/sched/act_sample.c | 4 ++-- net/sched/act_simple.c | 4 ++-- net/sched/act_skbedit.c| 4 ++-- net/sched/act_skbmod.c | 4 ++-- net/sched/act_tunnel_key.c | 4 ++-- net/sched/act_vlan.c | 4 ++-- 18 files changed, 57 insertions(+), 44 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index f7b59ef7303d..e634014605cb 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -6,6 +6,7 @@ * Public action API for classifiers/qdiscs */ +#include #include #include #include @@ -26,8 +27,8 @@ struct tc_action { struct tcf_idrinfo *idrinfo; u32 tcfa_index; - int tcfa_refcnt; - int tcfa_bindcnt; + refcount_t tcfa_refcnt; + atomic_ttcfa_bindcnt; u32 tcfa_capab; int tcfa_action; struct tcf_ttcfa_tm; diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 02670c7489e3..4f064ecab882 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -105,14 +105,26 @@ int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) ASSERT_RTNL(); + /* Release with strict==1 and bind==0 is only called through act API +* interface (classifiers always bind). Only case when action with +* positive reference count and zero bind count can exist is when it was +* also created with act API (unbinding last classifier will destroy the +* action if it was created by classifier). So only case when bind count +* can be changed after initial check is when unbound action is +* destroyed by act API while classifier binds to action with same id +* concurrently. This result either creation of new action(same behavior +* as before), or reusing existing action if concurrent process +* increments reference count before action is deleted. Both scenarios +* are acceptable. +*/ if (p) { if (bind) - p->tcfa_bindcnt--; - else if (strict && p->tcfa_bindcnt > 0) + atomic_dec(&p->tcfa_bindcnt); + else if (strict && atomic_read(&p->tcfa_bindcnt) > 0) return -EPERM; - p->tcfa_refcnt--; - if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) { + if (atomic_read(&p->tcfa_bindcnt) <= 0 && + refcount_dec_and_test(&p->tcfa_refcnt)) { if (p->ops->cleanup) p->ops->cleanup(p); tcf_idr_remove(p->idrinfo, p); @@ -304,8 +316,8 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, if (index && p) { if (bind) - p->tcfa_bindcnt++; - p->tcfa_refcnt++; + atomic_inc(&p->tcfa_bindcnt); + refcount_inc(&p->tcfa_refcnt); *a = p; return true; } @@ -324,9 +336,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, if (unlikely(!p)) return -ENOMEM; - p->tcfa_refcnt = 1; + refcount_set(&p->tcfa_refcnt, 1); if (bind) - p->tcfa_bindcnt = 1; + atomic_set(&p->tcfa_bindcnt, 1); if (cpustats) { p->cpu_bstats = netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu); @@ -782,7 +794,7 @@ static void cleanup_a(struct list_head *actions, int ovr) return; list_for_each_entry(a, actions, list) - a->tcfa_refcnt--; + refcount_dec(&a->tcfa_refcnt); } int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, @@ -810,7 +822,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, act->order = i; sz += tcf_action_fill_size(act); if (ovr) - act->tcfa_refcnt++; + refcount_inc(&act->tcf
[PATCH v2 03/11] net: sched: implement unlocked action init API
Add additional 'rtnl_held' argument to act API init functions. It is required to implement actions that need to release rtnl lock before loading kernel module and reacquire if afterwards. Signed-off-by: Vlad Buslov --- include/net/act_api.h | 6 -- net/sched/act_api.c| 18 +++--- net/sched/act_bpf.c| 3 ++- net/sched/act_connmark.c | 2 +- net/sched/act_csum.c | 3 ++- net/sched/act_gact.c | 3 ++- net/sched/act_ife.c| 3 ++- net/sched/act_ipt.c| 6 -- net/sched/act_mirred.c | 5 +++-- net/sched/act_nat.c| 2 +- net/sched/act_pedit.c | 3 ++- net/sched/act_police.c | 2 +- net/sched/act_sample.c | 3 ++- net/sched/act_simple.c | 3 ++- net/sched/act_skbedit.c| 3 ++- net/sched/act_skbmod.c | 3 ++- net/sched/act_tunnel_key.c | 3 ++- net/sched/act_vlan.c | 3 ++- net/sched/cls_api.c| 5 +++-- 19 files changed, 50 insertions(+), 29 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index e634014605cb..888ff471bbf6 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -92,7 +92,8 @@ struct tc_action_ops { struct netlink_ext_ack *extack); int (*init)(struct net *net, struct nlattr *nla, struct nlattr *est, struct tc_action **act, int ovr, - int bind, struct netlink_ext_ack *extack); + int bind, bool rtnl_held, + struct netlink_ext_ack *extack); int (*walk)(struct net *, struct sk_buff *, struct netlink_callback *, int, const struct tc_action_ops *, @@ -168,10 +169,11 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions, int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions, size_t *attr_size, - struct netlink_ext_ack *extack); + bool rtnl_held, struct netlink_ext_ack *extack); struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, + bool rtnl_held, struct netlink_ext_ack *extack); int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int); int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int); diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 4f064ecab882..256b0c93916c 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -671,6 +671,7 @@ static struct tc_cookie *nla_memdup_cookie(struct nlattr **tb) struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, + bool rtnl_held, struct netlink_ext_ack *extack) { struct tc_action *a; @@ -721,9 +722,11 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, a_o = tc_lookup_action_n(act_name); if (a_o == NULL) { #ifdef CONFIG_MODULES - rtnl_unlock(); + if (rtnl_held) + rtnl_unlock(); request_module("act_%s", act_name); - rtnl_lock(); + if (rtnl_held) + rtnl_lock(); a_o = tc_lookup_action_n(act_name); @@ -746,9 +749,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, /* backward compatibility for policer */ if (name == NULL) err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind, - extack); + rtnl_held, extack); else - err = a_o->init(net, nla, est, &a, ovr, bind, extack); + err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held, + extack); if (err < 0) goto err_mod; @@ -800,7 +804,7 @@ static void cleanup_a(struct list_head *actions, int ovr) int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions, size_t *attr_size, - struct netlink_ext_ack *extack) + bool rtnl_held, struct netlink_ext_ack *extack) { struct nlattr *tb[TCA_ACT_MAX_PRIO + 1]; struct tc_action *act; @@ -814,7 +818,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i];
[PATCH v2 06/11] net: sched: add 'delete' function to action ops
Extend action ops with 'delete' function. Each action type to implements its own delete function that doesn't depend on rtnl lock. Implement delete function that is required to delete actions without holding rtnl lock. Use action API function that atomically deletes action only if it is still in action idr. This implementation prevents concurrent threads from deleting same action twice. Signed-off-by: Vlad Buslov --- include/net/act_api.h | 1 + net/sched/act_bpf.c| 8 net/sched/act_connmark.c | 8 net/sched/act_csum.c | 8 net/sched/act_gact.c | 8 net/sched/act_ife.c| 8 net/sched/act_ipt.c| 16 net/sched/act_mirred.c | 8 net/sched/act_nat.c| 8 net/sched/act_pedit.c | 8 net/sched/act_police.c | 8 net/sched/act_sample.c | 8 net/sched/act_simple.c | 8 net/sched/act_skbedit.c| 8 net/sched/act_skbmod.c | 8 net/sched/act_tunnel_key.c | 8 net/sched/act_vlan.c | 8 17 files changed, 137 insertions(+) diff --git a/include/net/act_api.h b/include/net/act_api.h index d94ec6400673..d256e20507b9 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -101,6 +101,7 @@ struct tc_action_ops { void(*stats_update)(struct tc_action *, u64, u32, u64); size_t (*get_fill_size)(const struct tc_action *act); struct net_device *(*get_dev)(const struct tc_action *a); + int (*delete)(struct net *net, u32 index); }; struct tc_action_net { diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 8ebf40a3506c..7941dd66ff83 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -388,6 +388,13 @@ static int tcf_bpf_search(struct net *net, struct tc_action **a, u32 index, return tcf_idr_search(tn, a, index); } +static int tcf_bpf_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, bpf_net_id); + + return tcf_idr_delete_index(tn, index); +} + static struct tc_action_ops act_bpf_ops __read_mostly = { .kind = "bpf", .type = TCA_ACT_BPF, @@ -398,6 +405,7 @@ static struct tc_action_ops act_bpf_ops __read_mostly = { .init = tcf_bpf_init, .walk = tcf_bpf_walker, .lookup = tcf_bpf_search, + .delete = tcf_bpf_delete, .size = sizeof(struct tcf_bpf), }; diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index e3787aa0025a..143c2d3de723 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -193,6 +193,13 @@ static int tcf_connmark_search(struct net *net, struct tc_action **a, u32 index, return tcf_idr_search(tn, a, index); } +static int tcf_connmark_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, connmark_net_id); + + return tcf_idr_delete_index(tn, index); +} + static struct tc_action_ops act_connmark_ops = { .kind = "connmark", .type = TCA_ACT_CONNMARK, @@ -202,6 +209,7 @@ static struct tc_action_ops act_connmark_ops = { .init = tcf_connmark_init, .walk = tcf_connmark_walker, .lookup = tcf_connmark_search, + .delete = tcf_connmark_delete, .size = sizeof(struct tcf_connmark_info), }; diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index 334261943f9f..3768539340e0 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -654,6 +654,13 @@ static size_t tcf_csum_get_fill_size(const struct tc_action *act) return nla_total_size(sizeof(struct tc_csum)); } +static int tcf_csum_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, csum_net_id); + + return tcf_idr_delete_index(tn, index); +} + static struct tc_action_ops act_csum_ops = { .kind = "csum", .type = TCA_ACT_CSUM, @@ -665,6 +672,7 @@ static struct tc_action_ops act_csum_ops = { .walk = tcf_csum_walker, .lookup = tcf_csum_search, .get_fill_size = tcf_csum_get_fill_size, + .delete = tcf_csum_delete, .size = sizeof(struct tcf_csum), }; diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index b4dfb2b4addc..a431a711f0dd 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -231,6 +231,13 @@ static size_t tcf_gact_get_fill_size(const struct tc_action *act) return sz; } +static int tcf_gact_delete(struct net *net, u32 index) +{ + struct tc_action_net *tn = net_generic(net, gact_net_id); + + return tcf_idr_delete_index(tn, index); +} + static struct tc_action_ops act_gact_ops = {
[PATCH v2 07/11] net: sched: implement reference counted action release
Implement helper delete function that uses new action ops 'delete', instead of destroying action directly. This is required so act API could delete actions by index, without holding any references to action that is being deleted. Implement function __tcf_action_put() that releases reference to action and frees it, if necessary. Refactor action deletion code to use new put function and not to rely on rtnl lock. Remove rtnl lock assertions that are no longer needed. Signed-off-by: Vlad Buslov --- net/sched/act_api.c | 84 +++-- net/sched/cls_api.c | 1 - 2 files changed, 62 insertions(+), 23 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 0f31f09946ab..a023873db713 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -90,21 +90,39 @@ static void free_tcf(struct tc_action *p) kfree(p); } -static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p) +static void tcf_action_cleanup(struct tc_action *p) { - spin_lock(&idrinfo->lock); - idr_remove(&idrinfo->action_idr, p->tcfa_index); - spin_unlock(&idrinfo->lock); + if (p->ops->cleanup) + p->ops->cleanup(p); + gen_kill_estimator(&p->tcfa_rate_est); free_tcf(p); } +static int __tcf_action_put(struct tc_action *p, bool bind) +{ + struct tcf_idrinfo *idrinfo = p->idrinfo; + + if (refcount_dec_and_lock(&p->tcfa_refcnt, &idrinfo->lock)) { + if (bind) + atomic_dec(&p->tcfa_bindcnt); + idr_remove(&idrinfo->action_idr, p->tcfa_index); + spin_unlock(&idrinfo->lock); + + tcf_action_cleanup(p); + return 1; + } + + if (bind) + atomic_dec(&p->tcfa_bindcnt); + + return 0; +} + int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) { int ret = 0; - ASSERT_RTNL(); - /* Release with strict==1 and bind==0 is only called through act API * interface (classifiers always bind). Only case when action with * positive reference count and zero bind count can exist is when it was @@ -118,18 +136,11 @@ int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) * are acceptable. */ if (p) { - if (bind) - atomic_dec(&p->tcfa_bindcnt); - else if (strict && atomic_read(&p->tcfa_bindcnt) > 0) + if (!bind && strict && atomic_read(&p->tcfa_bindcnt) > 0) return -EPERM; - if (atomic_read(&p->tcfa_bindcnt) <= 0 && - refcount_dec_and_test(&p->tcfa_refcnt)) { - if (p->ops->cleanup) - p->ops->cleanup(p); - tcf_idr_remove(p->idrinfo, p); + if (__tcf_action_put(p, bind)) ret = ACT_P_DELETED; - } } return ret; @@ -340,11 +351,7 @@ int tcf_idr_delete_index(struct tc_action_net *tn, u32 index) p->tcfa_index)); spin_unlock(&idrinfo->lock); - if (p->ops->cleanup) - p->ops->cleanup(p); - - gen_kill_estimator(&p->tcfa_rate_est); - free_tcf(p); + tcf_action_cleanup(p); module_put(owner); return 0; } @@ -615,6 +622,11 @@ int tcf_action_destroy(struct list_head *actions, int bind) return ret; } +static int tcf_action_put(struct tc_action *p) +{ + return __tcf_action_put(p, false); +} + int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { @@ -1092,6 +1104,35 @@ static int tca_action_flush(struct net *net, struct nlattr *nla, return err; } +static int tcf_action_delete(struct net *net, struct list_head *actions, +struct netlink_ext_ack *extack) +{ + struct tc_action *a, *tmp; + u32 act_index; + int ret; + + list_for_each_entry_safe(a, tmp, actions, list) { + const struct tc_action_ops *ops = a->ops; + + /* Actions can be deleted concurrently so we must save their +* type and id to search again after reference is released. +*/ + act_index = a->tcfa_index; + + list_del(&a->list); + if (tcf_action_put(a)) { + /* last reference, action was deleted concurrently */ + module_put(ops->owner); + } else { + /* now do the delete */ + ret = ops->delete(net, act_index); + if (ret < 0) + return ret; + } + } + return 0; +} + stat
[PATCH v2 08/11] net: sched: don't release reference on action overwrite
Return from action init function with reference to action taken, even when overwriting existing action. Action init API initializes its fourth argument (pointer to pointer to tc action) to either existing action with same index or newly created action. In case of existing index(and bind argument is zero), init function returns without incrementing action reference counter. Caller of action init then proceeds working with action, without actually holding reference to it. This means that action could be deleted concurrently. Change action init behavior to always take reference to action before returning successfully, in order to protect from concurrent deletion. Signed-off-by: Vlad Buslov --- net/sched/act_api.c| 2 -- net/sched/act_bpf.c| 8 net/sched/act_connmark.c | 5 +++-- net/sched/act_csum.c | 8 net/sched/act_gact.c | 5 +++-- net/sched/act_ife.c| 12 +--- net/sched/act_ipt.c| 5 +++-- net/sched/act_mirred.c | 5 ++--- net/sched/act_nat.c| 5 +++-- net/sched/act_pedit.c | 5 +++-- net/sched/act_police.c | 8 +++- net/sched/act_sample.c | 8 +++- net/sched/act_simple.c | 5 +++-- net/sched/act_skbedit.c| 5 +++-- net/sched/act_skbmod.c | 8 +++- net/sched/act_tunnel_key.c | 8 +++- net/sched/act_vlan.c | 8 +++- 17 files changed, 51 insertions(+), 59 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index a023873db713..f019f0464cec 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -870,8 +870,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, } act->order = i; sz += tcf_action_fill_size(act); - if (ovr) - refcount_inc(&act->tcfa_refcnt); list_add_tail(&act->list, actions); } diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 7941dd66ff83..d3f4ac6f2c4b 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -311,9 +311,10 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla, if (bind) return 0; - tcf_idr_release(*act, bind); - if (!replace) + if (!replace) { + tcf_idr_release(*act, bind); return -EEXIST; + } } is_bpf = tb[TCA_ACT_BPF_OPS_LEN] && tb[TCA_ACT_BPF_OPS]; @@ -356,8 +357,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla, return res; out: - if (res == ACT_P_CREATED) - tcf_idr_release(*act, bind); + tcf_idr_release(*act, bind); return ret; } diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index 143c2d3de723..701e90244eff 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -135,9 +135,10 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla, ci = to_connmark(*a); if (bind) return 0; - tcf_idr_release(*a, bind); - if (!ovr) + if (!ovr) { + tcf_idr_release(*a, bind); return -EEXIST; + } /* replacing action and zone */ ci->tcf_action = parm->action; ci->zone = parm->zone; diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index 3768539340e0..5dbee136b0a1 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -76,9 +76,10 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla, } else { if (bind)/* dont override defaults */ return 0; - tcf_idr_release(*a, bind); - if (!ovr) + if (!ovr) { + tcf_idr_release(*a, bind); return -EEXIST; + } } p = to_tcf_csum(*a); @@ -86,8 +87,7 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla, params_new = kzalloc(sizeof(*params_new), GFP_KERNEL); if (unlikely(!params_new)) { - if (ret == ACT_P_CREATED) - tcf_idr_release(*a, bind); + tcf_idr_release(*a, bind); return -ENOMEM; } params_old = rtnl_dereference(p->params); diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index a431a711f0dd..11c4de3f344e 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -100,9 +100,10 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla, } else { if (bind)/* dont override defaults */ return 0; - tcf_idr_release(*a, bind); - if (!ovr) + if (!ovr) { + tcf_idr_release(*a, bind); return -EEXIST; +
[PATCH v2 10/11] net: sched: atomically check-allocate action
Implement function that atomically checks if action exists and either takes reference to it, or allocates idr slot for action index to prevent concurrent allocations of actions with same index. Use EBUSY error pointer to indicate that idr slot is reserved. Implement cleanup helper function that removes temporary error pointer from idr. (in case of error between idr allocation and insertion of newly created action to specified index) Refactor all action init functions to insert new action to idr using this API. Signed-off-by: Vlad Buslov --- include/net/act_api.h | 3 ++ net/sched/act_api.c| 92 -- net/sched/act_bpf.c| 11 -- net/sched/act_connmark.c | 10 +++-- net/sched/act_csum.c | 11 -- net/sched/act_gact.c | 11 -- net/sched/act_ife.c| 6 ++- net/sched/act_ipt.c| 13 ++- net/sched/act_mirred.c | 16 ++-- net/sched/act_nat.c| 11 -- net/sched/act_pedit.c | 15 ++-- net/sched/act_police.c | 9 - net/sched/act_sample.c | 11 -- net/sched/act_simple.c | 11 +- net/sched/act_skbedit.c| 11 +- net/sched/act_skbmod.c | 11 +- net/sched/act_tunnel_key.c | 9 - net/sched/act_vlan.c | 17 - 18 files changed, 218 insertions(+), 60 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index d256e20507b9..cd4547476074 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -154,6 +154,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, int bind, bool cpustats); void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a); +void tcf_idr_cleanup(struct tc_action_net *tn, u32 index); +int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index, + struct tc_action **a, int bind); int tcf_idr_delete_index(struct tc_action_net *tn, u32 index); int __tcf_idr_release(struct tc_action *a, bool bind, bool strict); diff --git a/net/sched/act_api.c b/net/sched/act_api.c index eefe8c2fe667..9511502e1cbb 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -303,7 +303,9 @@ static bool __tcf_idr_check(struct tc_action_net *tn, u32 index, spin_lock(&idrinfo->lock); p = idr_find(&idrinfo->action_idr, index); - if (p) { + if (IS_ERR(p)) { + p = NULL; + } else if (p) { refcount_inc(&p->tcfa_refcnt); if (bind) atomic_inc(&p->tcfa_bindcnt); @@ -371,7 +373,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, { struct tc_action *p = kzalloc(ops->size, GFP_KERNEL); struct tcf_idrinfo *idrinfo = tn->idrinfo; - struct idr *idr = &idrinfo->action_idr; int err = -ENOMEM; if (unlikely(!p)) @@ -389,20 +390,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, goto err2; } spin_lock_init(&p->tcfa_lock); - idr_preload(GFP_KERNEL); - spin_lock(&idrinfo->lock); - /* user doesn't specify an index */ - if (!index) { - index = 1; - err = idr_alloc_u32(idr, NULL, &index, UINT_MAX, GFP_ATOMIC); - } else { - err = idr_alloc_u32(idr, NULL, &index, index, GFP_ATOMIC); - } - spin_unlock(&idrinfo->lock); - idr_preload_end(); - if (err) - goto err3; - p->tcfa_index = index; p->tcfa_tm.install = jiffies; p->tcfa_tm.lastuse = jiffies; @@ -412,7 +399,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, &p->tcfa_rate_est, &p->tcfa_lock, NULL, est); if (err) - goto err4; + goto err3; } p->idrinfo = idrinfo; @@ -420,8 +407,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, INIT_LIST_HEAD(&p->list); *a = p; return 0; -err4: - idr_remove(idr, index); err3: free_percpu(p->cpu_qstats); err2: @@ -437,11 +422,78 @@ void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a) struct tcf_idrinfo *idrinfo = tn->idrinfo; spin_lock(&idrinfo->lock); - idr_replace(&idrinfo->action_idr, a, a->tcfa_index); + /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */ + WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index))); spin_unlock(&idrinfo->lock); } EXPORT_SYMBOL(tcf_idr_insert); +/* Cleanup idr index that was allocated but not initialized. */ + +void tcf_idr_cleanup(struct tc_action_net *tn, u32 index) +{ + struct tcf_idrinfo *idrinfo = tn->idrinfo; + + spin_lock(&idrinfo->lock); + /* Remove ERR_PTR(-EBUSY) allocated by tcf_idr_check_a
[PATCH v2 09/11] net: sched: use reference counting action init
Change action API to assume that action init function always takes reference to action, even when overwriting existing action. This is necessary because action API continues to use action pointer after init function is done. At this point action becomes accessible for concurrent modifications, so user must always hold reference to it. Implement helper put list function to atomically release list of actions after action API init code is done using them. Signed-off-by: Vlad Buslov --- net/sched/act_api.c | 35 +-- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index f019f0464cec..eefe8c2fe667 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -627,6 +627,18 @@ static int tcf_action_put(struct tc_action *p) return __tcf_action_put(p, false); } +static void tcf_action_put_lst(struct list_head *actions) +{ + struct tc_action *a, *tmp; + + list_for_each_entry_safe(a, tmp, actions, list) { + const struct tc_action_ops *ops = a->ops; + + if (tcf_action_put(a)) + module_put(ops->owner); + } +} + int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { @@ -835,17 +847,6 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, return ERR_PTR(err); } -static void cleanup_a(struct list_head *actions, int ovr) -{ - struct tc_action *a; - - if (!ovr) - return; - - list_for_each_entry(a, actions, list) - refcount_dec(&a->tcfa_refcnt); -} - int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions, size_t *attr_size, @@ -874,11 +875,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, } *attr_size = tcf_action_full_attrs_size(sz); - - /* Remove the temp refcnt which was necessary to protect against -* destroying an existing action which was being replaced -*/ - cleanup_a(actions, ovr); return 0; err: @@ -1209,7 +1205,7 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n, return ret; } err: - tcf_action_destroy(&actions, 0); + tcf_action_put_lst(&actions); return ret; } @@ -1251,8 +1247,11 @@ static int tcf_action_add(struct net *net, struct nlattr *nla, &attr_size, true, extack); if (ret) return ret; + ret = tcf_add_notify(net, n, &actions, portid, attr_size, extack); + if (ovr) + tcf_action_put_lst(&actions); - return tcf_add_notify(net, n, &actions, portid, attr_size, extack); + return ret; } static u32 tcaa_root_flags_allowed = TCA_FLAG_LARGE_DUMP_ON; -- 2.7.5
[PATCH v2 11/11] net: sched: change action API to use array of pointers to actions
Act API used linked list to pass set of actions to functions. It is intrusive data structure that stores list nodes inside action structure itself, which means it is not safe to modify such list concurrently. However, action API doesn't use any linked list specific operations on this set of actions, so it can be safely refactored into plain pointer array. Refactor action API to use array of pointers to tc_actions instead of linked list. Change argument 'actions' type of exported action init, destroy and dump functions. Signed-off-by: Vlad Buslov --- include/net/act_api.h | 7 ++--- net/sched/act_api.c | 74 --- net/sched/cls_api.c | 21 +-- 3 files changed, 50 insertions(+), 52 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index cd4547476074..43dfa5e1b3b3 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -168,19 +168,20 @@ static inline int tcf_idr_release(struct tc_action *a, bool bind) int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops); int tcf_unregister_action(struct tc_action_ops *a, struct pernet_operations *ops); -int tcf_action_destroy(struct list_head *actions, int bind); +int tcf_action_destroy(struct tc_action *actions[], int bind); int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions, int nr_actions, struct tcf_result *res); int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, - struct list_head *actions, size_t *attr_size, + struct tc_action *actions[], size_t *attr_size, bool rtnl_held, struct netlink_ext_ack *extack); struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, bool rtnl_held, struct netlink_ext_ack *extack); -int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int); +int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind, + int ref); int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int); int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int); int tcf_action_copy_stats(struct sk_buff *, struct tc_action *, int); diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 9511502e1cbb..7f904bb84aab 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -657,13 +657,14 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions, } EXPORT_SYMBOL(tcf_action_exec); -int tcf_action_destroy(struct list_head *actions, int bind) +int tcf_action_destroy(struct tc_action *actions[], int bind) { const struct tc_action_ops *ops; - struct tc_action *a, *tmp; - int ret = 0; + struct tc_action *a; + int ret = 0, i; - list_for_each_entry_safe(a, tmp, actions, list) { + for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) { + a = actions[i]; ops = a->ops; ret = __tcf_idr_release(a, bind, true); if (ret == ACT_P_DELETED) @@ -679,11 +680,12 @@ static int tcf_action_put(struct tc_action *p) return __tcf_action_put(p, false); } -static void tcf_action_put_lst(struct list_head *actions) +static void tcf_action_put_many(struct tc_action *actions[]) { - struct tc_action *a, *tmp; + int i; - list_for_each_entry_safe(a, tmp, actions, list) { + for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) { + struct tc_action *a = actions[i]; const struct tc_action_ops *ops = a->ops; if (tcf_action_put(a)) @@ -735,14 +737,15 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref) } EXPORT_SYMBOL(tcf_action_dump_1); -int tcf_action_dump(struct sk_buff *skb, struct list_head *actions, +int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind, int ref) { struct tc_action *a; - int err = -EINVAL; + int err = -EINVAL, i; struct nlattr *nest; - list_for_each_entry(a, actions, list) { + for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) { + a = actions[i]; nest = nla_nest_start(skb, a->order); if (nest == NULL) goto nla_put_failure; @@ -878,10 +881,9 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) { err = tcf_action_goto_chain_init(a, tp); if (err) { - LIST_HEAD(actions); +
[PATCH v2 05/11] net: sched: implement action API that deletes action by index
Implement new action API function that atomically finds and deletes action from idr by index. Intended to be used by lockless actions that do not rely on rtnl lock. Signed-off-by: Vlad Buslov --- include/net/act_api.h | 1 + net/sched/act_api.c | 39 +++ 2 files changed, 40 insertions(+) diff --git a/include/net/act_api.h b/include/net/act_api.h index 888ff471bbf6..d94ec6400673 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -153,6 +153,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, int bind, bool cpustats); void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a); +int tcf_idr_delete_index(struct tc_action_net *tn, u32 index); int __tcf_idr_release(struct tc_action *a, bool bind, bool strict); static inline int tcf_idr_release(struct tc_action *a, bool bind) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index aa304d36fee0..0f31f09946ab 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -319,6 +319,45 @@ bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, } EXPORT_SYMBOL(tcf_idr_check); +int tcf_idr_delete_index(struct tc_action_net *tn, u32 index) +{ + struct tcf_idrinfo *idrinfo = tn->idrinfo; + struct tc_action *p; + int ret = 0; + + spin_lock(&idrinfo->lock); + p = idr_find(&idrinfo->action_idr, index); + if (!p) { + spin_unlock(&idrinfo->lock); + return -ENOENT; + } + + if (!atomic_read(&p->tcfa_bindcnt)) { + if (refcount_dec_and_test(&p->tcfa_refcnt)) { + struct module *owner = p->ops->owner; + + WARN_ON(p != idr_remove(&idrinfo->action_idr, + p->tcfa_index)); + spin_unlock(&idrinfo->lock); + + if (p->ops->cleanup) + p->ops->cleanup(p); + + gen_kill_estimator(&p->tcfa_rate_est); + free_tcf(p); + module_put(owner); + return 0; + } + ret = 0; + } else { + ret = -EPERM; + } + + spin_unlock(&idrinfo->lock); + return ret; +} +EXPORT_SYMBOL(tcf_idr_delete_index); + int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est, struct tc_action **a, const struct tc_action_ops *ops, int bind, bool cpustats) -- 2.7.5
[PATCH v2 04/11] net: sched: always take reference to action
Without rtnl lock protection it is no longer safe to use pointer to tc action without holding reference to it. (it can be destroyed concurrently) Remove unsafe action idr lookup function. Instead of it, implement safe tcf idr check function that atomically looks up action in idr and increments its reference and bind counters. Implement both action search and check using new safe function Reference taken by idr check is temporal and should not be accounted by userspace clients (both logically and to preserver current API behavior). Subtract temporal reference when dumping action to userspace using existing tca_get_fill function arguments. Signed-off-by: Vlad Buslov --- net/sched/act_api.c | 46 -- 1 file changed, 20 insertions(+), 26 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 256b0c93916c..aa304d36fee0 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -284,44 +284,38 @@ int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb, } EXPORT_SYMBOL(tcf_generic_walker); -static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo) +static bool __tcf_idr_check(struct tc_action_net *tn, u32 index, + struct tc_action **a, int bind) { - struct tc_action *p = NULL; + struct tcf_idrinfo *idrinfo = tn->idrinfo; + struct tc_action *p; spin_lock(&idrinfo->lock); p = idr_find(&idrinfo->action_idr, index); + if (p) { + refcount_inc(&p->tcfa_refcnt); + if (bind) + atomic_inc(&p->tcfa_bindcnt); + } spin_unlock(&idrinfo->lock); - return p; + if (p) { + *a = p; + return true; + } + return false; } int tcf_idr_search(struct tc_action_net *tn, struct tc_action **a, u32 index) { - struct tcf_idrinfo *idrinfo = tn->idrinfo; - struct tc_action *p = tcf_idr_lookup(index, idrinfo); - - if (p) { - *a = p; - return 1; - } - return 0; + return __tcf_idr_check(tn, index, a, 0); } EXPORT_SYMBOL(tcf_idr_search); bool tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, int bind) { - struct tcf_idrinfo *idrinfo = tn->idrinfo; - struct tc_action *p = tcf_idr_lookup(index, idrinfo); - - if (index && p) { - if (bind) - atomic_inc(&p->tcfa_bindcnt); - refcount_inc(&p->tcfa_refcnt); - *a = p; - return true; - } - return false; + return __tcf_idr_check(tn, index, a, bind); } EXPORT_SYMBOL(tcf_idr_check); @@ -932,7 +926,7 @@ tcf_get_notify(struct net *net, u32 portid, struct nlmsghdr *n, if (!skb) return -ENOBUFS; if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, event, -0, 0) <= 0) { +0, 1) <= 0) { NL_SET_ERR_MSG(extack, "Failed to fill netlink attributes while adding TC action"); kfree_skb(skb); return -EINVAL; @@ -1072,7 +1066,7 @@ tcf_del_notify(struct net *net, struct nlmsghdr *n, struct list_head *actions, return -ENOBUFS; if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, RTM_DELACTION, -0, 1) <= 0) { +0, 2) <= 0) { NL_SET_ERR_MSG(extack, "Failed to fill netlink TC action attributes"); kfree_skb(skb); return -EINVAL; @@ -1131,14 +1125,14 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n, if (event == RTM_GETACTION) ret = tcf_get_notify(net, portid, n, &actions, event, extack); else { /* delete */ + cleanup_a(&actions, 1); /* lookup took reference */ ret = tcf_del_notify(net, n, &actions, portid, attr_size, extack); if (ret) goto err; return ret; } err: - if (event != RTM_GETACTION) - tcf_action_destroy(&actions, 0); + tcf_action_destroy(&actions, 0); return ret; } -- 2.7.5
[PATCH v2 00/11] Modify action API for implementing lockless actions
Currently, all netlink protocol handlers for updating rules, actions and qdiscs are protected with single global rtnl lock which removes any possibility for parallelism. This patch set is a first step to remove rtnl lock dependency from TC rules update path. Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added. Handlers registered with this flag are called without RTNL taken. End goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER, etc.) to be registered with UNLOCKED flag to allow parallel execution. However, there is no intention to completely remove or split rtnl lock itself. This patch set addresses specific problems in action API that prevents it from being executed concurrently. As a preparation for executing TC rules update handlers without rtnl lock, action API code was audited to determine areas that assume external synchronization with rtnl lock and must be changed to allow safe concurrent access with following results: 1. Action idr is already protected with spinlock. However, some code paths assume that idr state is not changes between several consecutive tcf_idr_* function calls. 2. tc_action reference and bind counters are implemented as plain integers. They purpose was to allow single actions to be shared between multiple filters, not to provide means for concurrent modification. 3. tc_action 'cookie' pointer field is not protected against modification. 4. Action API functions, that work with set of actions, use intrusive linked list, which cannot be used concurrently without additional synchronization. 5. Action API functions don't take reference to actions while using them, assuming external synchronization with rtnl lock. Following solutions to these problems are implemented: 1. To remove assumption that idr state doesn't change between tcf_idr_* calls, implement new functions that atomically perform several operations on idr without releasing idr spinlock. (function to atomically lookup and delete action by index, function to atomically check if action exists and allocate new one if necessary, etc.) 2. Use atomic operations on counters to make them suitable for concurrent get/put operations. 3. Data that 'cookie' points to is never modified, so it enough to refactor it to rcu pointer to prevent concurrent de-allocation. 4. Action API doesn't actually use any linked list specific operations on actions intrusive linked list, so it can be refactored to array in straightforward manner. 5. Always take reference to action while accessing it in action API. tcf_idr_search function modified to take reference to action before returning it, so there is no way to lookup an action without incrementing its reference counter. All users of this function are modified to release the reference, after they done using action. With all users using reference counting, it is now safe to concurrently delete actions. Since only shared state in action API module are actions themselves and action idr, these changes are sufficient to not to rely on global rtnl lock for protection of internal action API data structures. Changes from V1 to V2: - Removed redundant actions ops lookup during delete. - Merge action ops delete definition and implementation. - Assume all actions have delete implemented and don't check for it explicitly. - Resplit action lookup/release code to prevent memory leaks in individual patches. - Make __tcf_idr_check function static - Remove unique idr insertion function. Change original idr insert to do the same thing. - Merge changes that take reference to action when performing lookup and changes that account for this additional reference when dumping action to user space into single patch. - Change convoluted commit message. - Rename "unlocked" to "rtnl_held" for clarity. - Remove estimator lock add patch. - Refactor action check-alloc code into standalone function. - Rename tcf_idr_find_delete to tcf_idr_delete_index. - Rearrange variable definitions in tc_action_delete. - Add patch that refactors action API code to use array of pointers to actions instead of intrusive linked list. - Expand cover letter. Vlad Buslov (11): net: sched: use rcu for action cookie update net: sched: change type of reference and bind counters net: sched: implement unlocked action init API net: sched: always take reference to action net: sched: implement action API that deletes action by index net: sched: add 'delete' function to action ops net: sched: implement reference counted action release net: sched: don't release reference on action overwrite net: sched: use reference counting action init net: sched: atomically check-allocate action net: sched: change action API to use array of pointers to actions include/net/act_api.h | 25 ++- include/net/pkt_cls.h | 1 + net/sched/act_api.c| 408 +++-- net/sched/act_bpf.
Re: [PATCH rdma-next v1 00/13] Verbs flow counters support
On Sun, May 27, 2018 at 01:23:33PM +0300, Leon Romanovsky wrote: > From: Leon Romanovsky > > Changelog v0->v1: > * Decouple from DevX submission > * Use uverbs_attr_get_obj at counters read method > * Added define for max read buffer size (MAX_COUNTERS_BUFF_SIZE) > * Removed the struct mlx5_ib_flow_counter basic_flow_cnts and >the related structs used, used define instead > * Took Matan's patch from DevX > * uverbs_free_counters removed void* casting > * Added check to bound ncounters value (added define > * Changed user supplied data buffer structure to be array of >struct pair (applied this change to user space also) > > Not changed: > * UAPI files > * Addition of uhw to flow > > Thanks Sorry for the noise, but please drop this series. We found improper locking sequence in one of error paths which can be triggered by the user. Thanks signature.asc Description: PGP signature
[PATCH v2 0/5] build warnings cleanup
Build warnings cleanup reported for - using only 128b key - wait for buffer in sendmsg/sendpage - check for null before using skb - free rspq_skb_cache in error path - indentation v2: Added bug report description for 0002 Incorported comments from Dan Carpenter Atul Gupta (5): crypto:chtls: key len correction crypto: chtls: wait for memory sendmsg, sendpage crypto: chtls: dereference null variable crypto: chtls: kbuild warnings crypto: chtls: free beyond end rspq_skb_cache drivers/crypto/chelsio/chtls/chtls.h | 1 + drivers/crypto/chelsio/chtls/chtls_hw.c | 6 +- drivers/crypto/chelsio/chtls/chtls_io.c | 104 +++--- drivers/crypto/chelsio/chtls/chtls_main.c | 3 +- 4 files changed, 98 insertions(+), 16 deletions(-) -- 1.8.3.1
[PATCH v2 1/5] crypto:chtls: key len correction
corrected the key length to copy 128b key. Removed 192b and 256b key as user input supports key of size 128b in gcm_ctx Reported-by: Dan Carpenter Signed-off-by: Atul Gupta --- drivers/crypto/chelsio/chtls/chtls_hw.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/crypto/chelsio/chtls/chtls_hw.c b/drivers/crypto/chelsio/chtls/chtls_hw.c index 54a13aa9..55d5014 100644 --- a/drivers/crypto/chelsio/chtls/chtls_hw.c +++ b/drivers/crypto/chelsio/chtls/chtls_hw.c @@ -213,7 +213,7 @@ static int chtls_key_info(struct chtls_sock *csk, struct _key_ctx *kctx, u32 keylen, u32 optname) { - unsigned char key[CHCR_KEYCTX_CIPHER_KEY_SIZE_256]; + unsigned char key[AES_KEYSIZE_128]; struct tls12_crypto_info_aes_gcm_128 *gcm_ctx; unsigned char ghash_h[AEAD_H_SIZE]; struct crypto_cipher *cipher; @@ -228,10 +228,6 @@ static int chtls_key_info(struct chtls_sock *csk, if (keylen == AES_KEYSIZE_128) { ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_128; - } else if (keylen == AES_KEYSIZE_192) { - ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_192; - } else if (keylen == AES_KEYSIZE_256) { - ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_256; } else { pr_err("GCM: Invalid key length %d\n", keylen); return -EINVAL; -- 1.8.3.1
[PATCH v2 3/5] crypto: chtls: dereference null variable
skb dereferenced before check in sendpage Reported-by: Dan Carpenter Signed-off-by: Atul Gupta --- drivers/crypto/chelsio/chtls/chtls_io.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/crypto/chelsio/chtls/chtls_io.c b/drivers/crypto/chelsio/chtls/chtls_io.c index 7aa5d90..8cfc27b 100644 --- a/drivers/crypto/chelsio/chtls/chtls_io.c +++ b/drivers/crypto/chelsio/chtls/chtls_io.c @@ -1230,9 +1230,8 @@ int chtls_sendpage(struct sock *sk, struct page *page, struct sk_buff *skb = skb_peek_tail(&csk->txq); int copy, i; - copy = mss - skb->len; if (!skb || (ULP_SKB_CB(skb)->flags & ULPCB_FLAG_NO_APPEND) || - copy <= 0) { + (copy = mss - skb->len) <= 0) { new_buf: if (!csk_mem_free(cdev, sk)) goto wait_for_sndbuf; -- 1.8.3.1
[PATCH v2 2/5] crypto: chtls: wait for memory sendmsg, sendpage
address suspicious code 1210 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); 1211 } The issue is that in the code above, set_bit is never reached due to the 'continue' statement at line 1208. Also reported by bug report: 1210 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); ^^^ Not reachable. Its required to wait for buffer in the send path and takes care of unaddress and un-handled SOCK_NOSPACE. v2: use csk_mem_free where appropriate proper indent of goto do_nonblock replace out with do_rm_wq Reported-by: Gustavo A. R. Silva Reported-by: Dan Carpenter Signed-off-by: Atul Gupta --- drivers/crypto/chelsio/chtls/chtls.h | 1 + drivers/crypto/chelsio/chtls/chtls_io.c | 90 +-- drivers/crypto/chelsio/chtls/chtls_main.c | 1 + 3 files changed, 89 insertions(+), 3 deletions(-) diff --git a/drivers/crypto/chelsio/chtls/chtls.h b/drivers/crypto/chelsio/chtls/chtls.h index 1b2f43c..a53a0e6 100644 --- a/drivers/crypto/chelsio/chtls/chtls.h +++ b/drivers/crypto/chelsio/chtls/chtls.h @@ -144,6 +144,7 @@ struct chtls_dev { struct list_head rcu_node; struct list_head na_node; unsigned int send_page_order; + int max_host_sndbuf; struct key_map kmap; }; diff --git a/drivers/crypto/chelsio/chtls/chtls_io.c b/drivers/crypto/chelsio/chtls/chtls_io.c index 840dd01..7aa5d90 100644 --- a/drivers/crypto/chelsio/chtls/chtls_io.c +++ b/drivers/crypto/chelsio/chtls/chtls_io.c @@ -914,6 +914,78 @@ static u16 tls_header_read(struct tls_hdr *thdr, struct iov_iter *from) return (__force u16)cpu_to_be16(thdr->length); } +static int csk_mem_free(struct chtls_dev *cdev, struct sock *sk) +{ + return (cdev->max_host_sndbuf - sk->sk_wmem_queued); +} + +static int csk_wait_memory(struct chtls_dev *cdev, + struct sock *sk, long *timeo_p) +{ + DEFINE_WAIT_FUNC(wait, woken_wake_function); + int sndbuf, err = 0; + long current_timeo; + long vm_wait = 0; + bool noblock; + + current_timeo = *timeo_p; + noblock = (*timeo_p ? false : true); + sndbuf = cdev->max_host_sndbuf; + if (csk_mem_free(cdev, sk)) { + current_timeo = (prandom_u32() % (HZ / 5)) + 2; + vm_wait = (prandom_u32() % (HZ / 5)) + 2; + } + + add_wait_queue(sk_sleep(sk), &wait); + while (1) { + sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk); + + if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) + goto do_error; + if (!*timeo_p) { + if (noblock) + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); + goto do_nonblock; + } + if (signal_pending(current)) + goto do_interrupted; + sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); + if (csk_mem_free(cdev, sk) && !vm_wait) + break; + + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); + sk->sk_write_pending++; + sk_wait_event(sk, ¤t_timeo, sk->sk_err || + (sk->sk_shutdown & SEND_SHUTDOWN) || + (csk_mem_free(cdev, sk) && !vm_wait), &wait); + sk->sk_write_pending--; + + if (vm_wait) { + vm_wait -= current_timeo; + current_timeo = *timeo_p; + if (current_timeo != MAX_SCHEDULE_TIMEOUT) { + current_timeo -= vm_wait; + if (current_timeo < 0) + current_timeo = 0; + } + vm_wait = 0; + } + *timeo_p = current_timeo; + } +do_rm_wq: + remove_wait_queue(sk_sleep(sk), &wait); + return err; +do_error: + err = -EPIPE; + goto do_rm_wq; +do_nonblock: + err = -EAGAIN; + goto do_rm_wq; +do_interrupted: + err = sock_intr_errno(*timeo_p); + goto do_rm_wq; +} + int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) { struct chtls_sock *csk = rcu_dereference_sk_user_data(sk); @@ -952,6 +1024,8 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) copy = mss - skb->len; skb->ip_summed = CHECKSUM_UNNECESSARY; } + if (!csk_mem_free(cdev, sk)) + goto wait_for_sndbuf; if (is_tls_tx(csk) && !csk->tlshws.txleft) { struct tls_hdr hdr; @@ -1099,8 +1173,10 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) if (ULP_SKB_CB(skb)->flags & ULPCB_FLAG_NO_APPEND) push_frames_if_head(sk); continue; +wait_for_sndbuf
[PATCH v2 4/5] crypto: chtls: kbuild warnings
- unindented continue - check for null page - signed return Reported-by: Dan Carpenter Signed-off-by: Atul Gupta --- drivers/crypto/chelsio/chtls/chtls_io.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/crypto/chelsio/chtls/chtls_io.c b/drivers/crypto/chelsio/chtls/chtls_io.c index 8cfc27b..51fc682 100644 --- a/drivers/crypto/chelsio/chtls/chtls_io.c +++ b/drivers/crypto/chelsio/chtls/chtls_io.c @@ -907,11 +907,11 @@ static int chtls_skb_copy_to_page_nocache(struct sock *sk, } /* Read TLS header to find content type and data length */ -static u16 tls_header_read(struct tls_hdr *thdr, struct iov_iter *from) +static int tls_header_read(struct tls_hdr *thdr, struct iov_iter *from) { if (copy_from_iter(thdr, sizeof(*thdr), from) != sizeof(*thdr)) return -EFAULT; - return (__force u16)cpu_to_be16(thdr->length); + return (__force int)cpu_to_be16(thdr->length); } static int csk_mem_free(struct chtls_dev *cdev, struct sock *sk) @@ -1083,9 +1083,10 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) int off = TCP_OFF(sk); bool merge; - if (page) - pg_size <<= compound_order(page); + if (!page) + goto wait_for_memory; + pg_size <<= compound_order(page); if (off < pg_size && skb_can_coalesce(skb, i, page, off)) { merge = 1; @@ -1492,7 +1493,7 @@ static int chtls_pt_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, break; chtls_cleanup_rbuf(sk, copied); sk_wait_data(sk, &timeo, NULL); - continue; + continue; found_ok_skb: if (!skb->len) { skb_dst_set(skb, NULL); -- 1.8.3.1
[PATCH v2 5/5] crypto: chtls: free beyond end rspq_skb_cache
Reported-by: Dan Carpenter Signed-off-by: Atul Gupta --- drivers/crypto/chelsio/chtls/chtls_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/crypto/chelsio/chtls/chtls_main.c b/drivers/crypto/chelsio/chtls/chtls_main.c index 273afd3..9b07f91 100644 --- a/drivers/crypto/chelsio/chtls/chtls_main.c +++ b/drivers/crypto/chelsio/chtls/chtls_main.c @@ -250,7 +250,7 @@ static void *chtls_uld_add(const struct cxgb4_lld_info *info) return cdev; out_rspq_skb: - for (j = 0; j <= i; j++) + for (j = 0; j < i; j++) kfree_skb(cdev->rspq_skb_cache[j]); kfree_skb(cdev->askb); out_skb: -- 1.8.3.1
Re: [PATCH net] VSOCK: check sk state before receive
Hmm...Although I won't reproduce this bug with my reproducer after apply my patch. I could still get a similiar issue with syzkaller sock vnet test. It looks this patch is not complete. Here is the KASAN call trace with my patch. I can also reproduce it without my patch. == BUG: KASAN: use-after-free in vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport] Read of size 4 at addr 880026a3a914 by task kworker/0:2/96 CPU: 0 PID: 96 Comm: kworker/0:2 Not tainted 4.17.0-rc6.vsock+ #28 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Workqueue: events dg_delayed_dispatch [vmw_vmci] Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xdd/0x18e lib/dump_stack.c:113 print_address_description+0x7a/0x3e0 mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report+0x1dd/0x460 mm/kasan/report.c:412 vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport] vmci_transport_recv_dgram_cb+0x5d/0x200 [vmw_vsock_vmci_transport] dg_delayed_dispatch+0x99/0x1b0 [vmw_vmci] process_one_work+0xa4e/0x1720 kernel/workqueue.c:2145 worker_thread+0x1df/0x1400 kernel/workqueue.c:2279 kthread+0x343/0x4b0 kernel/kthread.c:240 ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:412 Allocated by task 2684: set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553 slab_post_alloc_hook mm/slab.h:444 [inline] slab_alloc_node mm/slub.c:2741 [inline] slab_alloc mm/slub.c:2749 [inline] kmem_cache_alloc+0x105/0x330 mm/slub.c:2754 sk_prot_alloc+0x6a/0x2c0 net/core/sock.c:1468 sk_alloc+0xc9/0xbb0 net/core/sock.c:1528 __vsock_create+0xc8/0x9b0 [vsock] vsock_create+0xfd/0x1a0 [vsock] __sock_create+0x310/0x690 net/socket.c:1285 sock_create net/socket.c:1325 [inline] __sys_socket+0x101/0x240 net/socket.c:1355 __do_sys_socket net/socket.c:1364 [inline] __se_sys_socket net/socket.c:1362 [inline] __x64_sys_socket+0x7d/0xd0 net/socket.c:1362 do_syscall_64+0x175/0x630 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 2684: set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x130/0x180 mm/kasan/kasan.c:521 slab_free_hook mm/slub.c:1388 [inline] slab_free_freelist_hook mm/slub.c:1415 [inline] slab_free mm/slub.c:2988 [inline] kmem_cache_free+0xce/0x410 mm/slub.c:3004 sk_prot_free net/core/sock.c:1509 [inline] __sk_destruct+0x629/0x940 net/core/sock.c:1593 sk_destruct+0x4e/0x90 net/core/sock.c:1601 __sk_free+0xd3/0x320 net/core/sock.c:1612 sk_free+0x2a/0x30 net/core/sock.c:1623 __vsock_release+0x431/0x610 [vsock] vsock_release+0x3c/0xc0 [vsock] sock_release+0x91/0x200 net/socket.c:594 sock_close+0x17/0x20 net/socket.c:1149 __fput+0x368/0xa20 fs/file_table.c:209 task_work_run+0x1c5/0x2a0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x1876/0x26c0 kernel/exit.c:865 do_group_exit+0x159/0x3e0 kernel/exit.c:968 get_signal+0x65a/0x1780 kernel/signal.c:2482 do_signal+0xa4/0x1fe0 arch/x86/kernel/signal.c:810 exit_to_usermode_loop+0x1b8/0x260 arch/x86/entry/common.c:162 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] syscall_return_slowpath arch/x86/entry/common.c:265 [inline] do_syscall_64+0x505/0x630 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at 880026a3a600 which belongs to the cache AF_VSOCK of size 1056 The buggy address is located 788 bytes inside of 1056-byte region [880026a3a600, 880026a3aa20) The buggy address belongs to the page: page:ea9a8e00 count:1 mapcount:0 mapping: index:0x0 compound_mapcount: 0 flags: 0xfc0008100(slab|head) raw: 000fc0008100 0001000d000d raw: dead0100 dead0200 880034471a40 page dumped because: kasan: bad access detected Memory state around the buggy address: 880026a3a800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 880026a3a880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >880026a3a900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ 880026a3a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 880026a3aa00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc ==
[PATCH net-next 3/7] net/ipv6: Pass ifa6_config struct to inet6_addr_modify
From: David Ahern Update inet6_addr_modify to take ifa6_config argument versus a parameter list. This is an argument move only; no functional change intended. Signed-off-by: David Ahern --- net/ipv6/addrconf.c | 44 +++- 1 file changed, 23 insertions(+), 21 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 2db1acf70610..1b1e63a4520b 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4527,8 +4527,7 @@ inet6_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, ifm->ifa_prefixlen); } -static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags, -u32 prefered_lft, u32 valid_lft) +static int inet6_addr_modify(struct inet6_ifaddr *ifp, struct ifa6_config *cfg) { u32 flags; clock_t expires; @@ -4538,32 +4537,32 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags, ASSERT_RTNL(); - if (!valid_lft || (prefered_lft > valid_lft)) + if (!cfg->valid_lft || cfg->preferred_lft > cfg->valid_lft) return -EINVAL; - if (ifa_flags & IFA_F_MANAGETEMPADDR && + if (cfg->ifa_flags & IFA_F_MANAGETEMPADDR && (ifp->flags & IFA_F_TEMPORARY || ifp->prefix_len != 64)) return -EINVAL; if (!(ifp->flags & IFA_F_TENTATIVE) || ifp->flags & IFA_F_DADFAILED) - ifa_flags &= ~IFA_F_OPTIMISTIC; + cfg->ifa_flags &= ~IFA_F_OPTIMISTIC; - timeout = addrconf_timeout_fixup(valid_lft, HZ); + timeout = addrconf_timeout_fixup(cfg->valid_lft, HZ); if (addrconf_finite_timeout(timeout)) { expires = jiffies_to_clock_t(timeout * HZ); - valid_lft = timeout; + cfg->valid_lft = timeout; flags = RTF_EXPIRES; } else { expires = 0; flags = 0; - ifa_flags |= IFA_F_PERMANENT; + cfg->ifa_flags |= IFA_F_PERMANENT; } - timeout = addrconf_timeout_fixup(prefered_lft, HZ); + timeout = addrconf_timeout_fixup(cfg->preferred_lft, HZ); if (addrconf_finite_timeout(timeout)) { if (timeout == 0) - ifa_flags |= IFA_F_DEPRECATED; - prefered_lft = timeout; + cfg->ifa_flags |= IFA_F_DEPRECATED; + cfg->preferred_lft = timeout; } spin_lock_bh(&ifp->lock); @@ -4573,16 +4572,16 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags, ifp->flags &= ~(IFA_F_DEPRECATED | IFA_F_PERMANENT | IFA_F_NODAD | IFA_F_HOMEADDRESS | IFA_F_MANAGETEMPADDR | IFA_F_NOPREFIXROUTE); - ifp->flags |= ifa_flags; + ifp->flags |= cfg->ifa_flags; ifp->tstamp = jiffies; - ifp->valid_lft = valid_lft; - ifp->prefered_lft = prefered_lft; + ifp->valid_lft = cfg->valid_lft; + ifp->prefered_lft = cfg->preferred_lft; spin_unlock_bh(&ifp->lock); if (!(ifp->flags&IFA_F_TENTATIVE)) ipv6_ifa_notify(0, ifp); - if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) { + if (!(cfg->ifa_flags & IFA_F_NOPREFIXROUTE)) { addrconf_prefix_route(&ifp->addr, ifp->prefix_len, ifp->idev->dev, expires, flags, GFP_KERNEL); @@ -4601,10 +4600,14 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags, } if (was_managetempaddr || ifp->flags & IFA_F_MANAGETEMPADDR) { - if (was_managetempaddr && !(ifp->flags & IFA_F_MANAGETEMPADDR)) - valid_lft = prefered_lft = 0; - manage_tempaddrs(ifp->idev, ifp, valid_lft, prefered_lft, -!was_managetempaddr, jiffies); + if (was_managetempaddr && + !(ifp->flags & IFA_F_MANAGETEMPADDR)) { + cfg->valid_lft = 0; + cfg->preferred_lft = 0; + } + manage_tempaddrs(ifp->idev, ifp, cfg->valid_lft, +cfg->preferred_lft, !was_managetempaddr, +jiffies); } addrconf_verify_rtnl(); @@ -4691,8 +4694,7 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, !(nlh->nlmsg_flags & NLM_F_REPLACE)) err = -EEXIST; else - err = inet6_addr_modify(ifa, cfg.ifa_flags, cfg.preferred_lft, - cfg.valid_lft); + err = inet6_addr_modify(ifa, &cfg); in6_ifa_put(ifa); -- 2.11.0
[PATCH iproute2-next] ipaddress: Add support for address metric
From: David Ahern Add support for IFA_RT_PRIORITY using the same keywords as iproute for RTA_PRIORITY. Signed-off-by: David Ahern --- include/uapi/linux/if_addr.h | 1 + ip/ipaddress.c | 15 ++- man/man8/ip-address.8.in | 6 ++ 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h index c4899e22cdaa..a924606f36e5 100644 --- a/include/uapi/linux/if_addr.h +++ b/include/uapi/linux/if_addr.h @@ -33,6 +33,7 @@ enum { IFA_CACHEINFO, IFA_MULTICAST, IFA_FLAGS, + IFA_RT_PRIORITY, /* u32, priority/metric for prefix route */ __IFA_MAX, }; diff --git a/ip/ipaddress.c b/ip/ipaddress.c index 75539e057f6a..6b53b753ead9 100644 --- a/ip/ipaddress.c +++ b/ip/ipaddress.c @@ -63,7 +63,7 @@ static void usage(void) fprintf(stderr, " ip address {showdump|restore}\n"); fprintf(stderr, "IFADDR := PREFIX | ADDR peer PREFIX\n"); fprintf(stderr, " [ broadcast ADDR ] [ anycast ADDR ]\n"); - fprintf(stderr, " [ label IFNAME ] [ scope SCOPE-ID ]\n"); + fprintf(stderr, " [ label IFNAME ] [ scope SCOPE-ID ] [ metric METRIC ]\n"); fprintf(stderr, "SCOPE-ID := [ host | link | global | NUMBER ]\n"); fprintf(stderr, "FLAG-LIST := [ FLAG-LIST ] FLAG\n"); fprintf(stderr, "FLAG := [ permanent | dynamic | secondary | primary |\n"); @@ -1328,6 +1328,10 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n, rta_tb[IFA_ADDRESS])); } print_int(PRINT_ANY, "prefixlen", "/%d ", ifa->ifa_prefixlen); + + if (rta_tb[IFA_RT_PRIORITY]) + print_uint(PRINT_ANY, "metric", "metric %u ", + rta_getattr_u32(rta_tb[IFA_RT_PRIORITY])); } if (brief) @@ -2119,6 +2123,15 @@ static int ipaddr_modify(int cmd, int flags, int argc, char **argv) NEXT_ARG(); l = *argv; addattr_l(&req.n, sizeof(req), IFA_LABEL, l, strlen(l)+1); + } else if (matches(*argv, "metric") == 0 || + matches(*argv, "priority") == 0 || + matches(*argv, "preference") == 0) { + __u32 metric; + + NEXT_ARG(); + if (get_u32(&metric, *argv, 0)) + invarg("\"metric\" value is invalid\n", *argv); + addattr32(&req.n, sizeof(req), IFA_RT_PRIORITY, metric); } else if (matches(*argv, "valid_lft") == 0) { if (valid_lftp) duparg("valid_lft", *argv); diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in index 7ebf0bc98dcd..c3861b3725cc 100644 --- a/man/man8/ip-address.8.in +++ b/man/man8/ip-address.8.in @@ -27,6 +27,8 @@ ip-address \- protocol address management .IR IFNAME " ] [ " .B scope .IR SCOPE-ID " ] [ " +.B metric +.IR METRIC " ] [ " .B to .IR PREFIX " ] [ " FLAG-LIST " ] [ " .B label @@ -215,6 +217,10 @@ valid inside this site. .in -8 .TP +.BI metric " NUMBER" +priority of prefix route associated with address. + +.TP .BI valid_lft " LFT" the valid lifetime of this address; see section 5.5.4 of RFC 4862. When it expires, the address is removed by the kernel. -- 2.11.0
[PATCH net-next 1/7] net/ipv6: Convert ipv6_add_addr to struct ifa6_config
From: David Ahern Move config parameters for adding an ipv6 address to a struct. struct names stem from inet6_rtm_newaddr which is the modern handler for adding an address. Start the conversion to ifa6_config with ipv6_add_addr. This is an argument move only; no functional change intended. Mapping of variable changes: addr --> cfg->pfx peer_addr --> cfg->peer_pfx pfxlen--> cfg->plen flags --> cfg->ifa_flags scope, valid_lft, prefered_lft have the same names within cfg (with corrected spelling). Signed-off-by: David Ahern --- include/net/addrconf.h | 12 + net/ipv6/addrconf.c| 134 +++-- 2 files changed, 87 insertions(+), 59 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index c07d4dd09361..f766af2cd1a4 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -59,6 +59,18 @@ struct in6_validator_info { struct netlink_ext_ack *extack; }; +struct ifa6_config { + const struct in6_addr *pfx; + unsigned intplen; + + const struct in6_addr *peer_pfx; + + u32 ifa_flags; + u32 preferred_lft; + u32 valid_lft; + u16 scope; +}; + int addrconf_init(void); void addrconf_cleanup(void); diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index fbfd71a2d9c8..4988f2265882 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -986,17 +986,15 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa) /* On success it returns ifp with increased reference count */ static struct inet6_ifaddr * -ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, - const struct in6_addr *peer_addr, int pfxlen, - int scope, u32 flags, u32 valid_lft, u32 prefered_lft, +ipv6_add_addr(struct inet6_dev *idev, struct ifa6_config *cfg, bool can_block, struct netlink_ext_ack *extack) { gfp_t gfp_flags = can_block ? GFP_KERNEL : GFP_ATOMIC; + int addr_type = ipv6_addr_type(cfg->pfx); struct net *net = dev_net(idev->dev); struct inet6_ifaddr *ifa = NULL; struct fib6_info *f6i = NULL; int err = 0; - int addr_type = ipv6_addr_type(addr); if (addr_type == IPV6_ADDR_ANY || addr_type & IPV6_ADDR_MULTICAST || @@ -1019,7 +1017,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, */ if (can_block) { struct in6_validator_info i6vi = { - .i6vi_addr = *addr, + .i6vi_addr = *cfg->pfx, .i6vi_dev = idev, .extack = extack, }; @@ -1036,7 +1034,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, goto out; } - f6i = addrconf_f6i_alloc(net, idev, addr, false, gfp_flags); + f6i = addrconf_f6i_alloc(net, idev, cfg->pfx, false, gfp_flags); if (IS_ERR(f6i)) { err = PTR_ERR(f6i); f6i = NULL; @@ -1049,21 +1047,21 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, neigh_parms_data_state_setall(idev->nd_parms); - ifa->addr = *addr; - if (peer_addr) - ifa->peer_addr = *peer_addr; + ifa->addr = *cfg->pfx; + if (cfg->peer_pfx) + ifa->peer_addr = *cfg->peer_pfx; spin_lock_init(&ifa->lock); INIT_DELAYED_WORK(&ifa->dad_work, addrconf_dad_work); INIT_HLIST_NODE(&ifa->addr_lst); - ifa->scope = scope; - ifa->prefix_len = pfxlen; - ifa->flags = flags; + ifa->scope = cfg->scope; + ifa->prefix_len = cfg->plen; + ifa->flags = cfg->ifa_flags; /* No need to add the TENTATIVE flag for addresses with NODAD */ - if (!(flags & IFA_F_NODAD)) + if (!(cfg->ifa_flags & IFA_F_NODAD)) ifa->flags |= IFA_F_TENTATIVE; - ifa->valid_lft = valid_lft; - ifa->prefered_lft = prefered_lft; + ifa->valid_lft = cfg->valid_lft; + ifa->prefered_lft = cfg->preferred_lft; ifa->cstamp = ifa->tstamp = jiffies; ifa->tokenized = false; @@ -1260,11 +1258,10 @@ static int ipv6_create_tempaddr(struct inet6_ifaddr *ifp, { struct inet6_dev *idev = ifp->idev; struct in6_addr addr, *tmpaddr; - unsigned long tmp_prefered_lft, tmp_valid_lft, tmp_tstamp, age; + unsigned long tmp_tstamp, age; unsigned long regen_advance; - int tmp_plen; + struct ifa6_config cfg; int ret = 0; - u32 addr_flags; unsigned long now = jiffies; long max_desync_factor; s32 cnf_temp_preferred_lft; @@ -1326,13 +1323,12 @@ static int ipv6_create_tempaddr(struct inet6_ifaddr *ifp, } } - tmp_valid_lft = min_t(__u32, -
[PATCH net-next 6/7] net/ipv6: Add support for specifying metric of connected routes
From: David Ahern Add support for IFA_RT_PRIORITY to ipv6 addresses. If the metric is changed on an existing address then the new route is inserted before removing the old one. Since the metric is one of the route keys, the prefix route can not be atomically replaced. Signed-off-by: David Ahern --- include/net/addrconf.h | 1 + include/net/if_inet6.h | 1 + net/ipv6/addrconf.c| 93 -- 3 files changed, 77 insertions(+), 18 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index f766af2cd1a4..5f43f7a70fe6 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -65,6 +65,7 @@ struct ifa6_config { const struct in6_addr *peer_pfx; + u32 rt_priority; u32 ifa_flags; u32 preferred_lft; u32 valid_lft; diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index db389253dc2a..d7578cf49c3a 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -42,6 +42,7 @@ enum { struct inet6_ifaddr { struct in6_addr addr; __u32 prefix_len; + __u32 rt_priority; /* In seconds, relative to tstamp. Expiry is at tstamp + HZ * lft. */ __u32 valid_lft; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 1b1e63a4520b..f09afc232da4 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1056,6 +1056,7 @@ ipv6_add_addr(struct inet6_dev *idev, struct ifa6_config *cfg, INIT_HLIST_NODE(&ifa->addr_lst); ifa->scope = cfg->scope; ifa->prefix_len = cfg->plen; + ifa->rt_priority = cfg->rt_priority; ifa->flags = cfg->ifa_flags; /* No need to add the TENTATIVE flag for addresses with NODAD */ if (!(cfg->ifa_flags & IFA_F_NODAD)) @@ -1356,6 +1357,7 @@ static int ipv6_create_tempaddr(struct inet6_ifaddr *ifp, cfg.pfx = &addr; cfg.scope = ipv6_addr_scope(cfg.pfx); + cfg.rt_priority = 0; ift = ipv6_add_addr(idev, &cfg, block, NULL); if (IS_ERR(ift)) { @@ -2314,12 +2316,13 @@ static void ipv6_try_regen_rndid(struct inet6_dev *idev, struct in6_addr *tmpad */ static void -addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev, - unsigned long expires, u32 flags, gfp_t gfp_flags) +addrconf_prefix_route(struct in6_addr *pfx, int plen, u32 metric, + struct net_device *dev, unsigned long expires, + u32 flags, gfp_t gfp_flags) { struct fib6_config cfg = { .fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_PREFIX, - .fc_metric = IP6_RT_PRIO_ADDRCONF, + .fc_metric = metric ? : IP6_RT_PRIO_ADDRCONF, .fc_ifindex = dev->ifindex, .fc_expires = expires, .fc_dst_len = plen, @@ -2683,7 +2686,8 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao) expires = jiffies_to_clock_t(rt_expires); } addrconf_prefix_route(&pinfo->prefix, pinfo->prefix_len, - dev, expires, flags, GFP_ATOMIC); + 0, dev, expires, flags, + GFP_ATOMIC); } fib6_info_release(rt); } @@ -2891,8 +2895,9 @@ static int inet6_addr_add(struct net *net, int ifindex, ifp = ipv6_add_addr(idev, cfg, true, extack); if (!IS_ERR(ifp)) { if (!(cfg->ifa_flags & IFA_F_NOPREFIXROUTE)) { - addrconf_prefix_route(&ifp->addr, ifp->prefix_len, dev, - expires, flags, GFP_KERNEL); + addrconf_prefix_route(&ifp->addr, ifp->prefix_len, + ifp->rt_priority, dev, expires, + flags, GFP_KERNEL); } /* Send a netlink notification if DAD is enabled and @@ -3056,7 +3061,7 @@ static void sit_add_v4_addrs(struct inet6_dev *idev) if (addr.s6_addr32[3]) { add_addr(idev, &addr, plen, scope); - addrconf_prefix_route(&addr, plen, idev->dev, 0, pflags, + addrconf_prefix_route(&addr, plen, 0, idev->dev, 0, pflags, GFP_ATOMIC); return; } @@ -3081,8 +3086,8 @@ static void sit_add_v4_addrs(struct inet6_dev *idev) } add_addr(idev, &addr, plen, flag); - addrconf_prefix_route(&addr, plen, idev->dev, 0, - pflags, GFP_ATOMIC); +
[PATCH net-next 2/7] net/ipv6: Pass ifa6_config struct to inet6_addr_add
From: David Ahern Move the creation of struct ifa6_config up to callers of inet6_addr_add. Signed-off-by: David Ahern --- net/ipv6/addrconf.c | 112 ++-- 1 file changed, 57 insertions(+), 55 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 4988f2265882..2db1acf70610 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2830,20 +2830,9 @@ static int ipv6_mc_config(struct sock *sk, bool join, * Manual configuration of address on an interface */ static int inet6_addr_add(struct net *net, int ifindex, - const struct in6_addr *pfx, - const struct in6_addr *peer_pfx, - unsigned int plen, __u32 ifa_flags, - __u32 prefered_lft, __u32 valid_lft, + struct ifa6_config *cfg, struct netlink_ext_ack *extack) { - struct ifa6_config cfg = { - .pfx = pfx, - .plen = plen, - .peer_pfx = peer_pfx, - .ifa_flags = ifa_flags, - .preferred_lft = prefered_lft, - .valid_lft = valid_lft, - }; struct inet6_ifaddr *ifp; struct inet6_dev *idev; struct net_device *dev; @@ -2853,14 +2842,14 @@ static int inet6_addr_add(struct net *net, int ifindex, ASSERT_RTNL(); - if (plen > 128) + if (cfg->plen > 128) return -EINVAL; /* check the lifetime */ - if (!valid_lft || prefered_lft > valid_lft) + if (!cfg->valid_lft || cfg->preferred_lft > cfg->valid_lft) return -EINVAL; - if (ifa_flags & IFA_F_MANAGETEMPADDR && plen != 64) + if (cfg->ifa_flags & IFA_F_MANAGETEMPADDR && cfg->plen != 64) return -EINVAL; dev = __dev_get_by_index(net, ifindex); @@ -2871,37 +2860,37 @@ static int inet6_addr_add(struct net *net, int ifindex, if (IS_ERR(idev)) return PTR_ERR(idev); - if (ifa_flags & IFA_F_MCAUTOJOIN) { + if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) { int ret = ipv6_mc_config(net->ipv6.mc_autojoin_sk, -true, pfx, ifindex); +true, cfg->pfx, ifindex); if (ret < 0) return ret; } - cfg.scope = ipv6_addr_scope(pfx); + cfg->scope = ipv6_addr_scope(cfg->pfx); - timeout = addrconf_timeout_fixup(valid_lft, HZ); + timeout = addrconf_timeout_fixup(cfg->valid_lft, HZ); if (addrconf_finite_timeout(timeout)) { expires = jiffies_to_clock_t(timeout * HZ); - valid_lft = timeout; + cfg->valid_lft = timeout; flags = RTF_EXPIRES; } else { expires = 0; flags = 0; - ifa_flags |= IFA_F_PERMANENT; + cfg->ifa_flags |= IFA_F_PERMANENT; } - timeout = addrconf_timeout_fixup(prefered_lft, HZ); + timeout = addrconf_timeout_fixup(cfg->preferred_lft, HZ); if (addrconf_finite_timeout(timeout)) { if (timeout == 0) - ifa_flags |= IFA_F_DEPRECATED; - prefered_lft = timeout; + cfg->ifa_flags |= IFA_F_DEPRECATED; + cfg->preferred_lft = timeout; } - ifp = ipv6_add_addr(idev, &cfg, true, extack); + ifp = ipv6_add_addr(idev, cfg, true, extack); if (!IS_ERR(ifp)) { - if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) { + if (!(cfg->ifa_flags & IFA_F_NOPREFIXROUTE)) { addrconf_prefix_route(&ifp->addr, ifp->prefix_len, dev, expires, flags, GFP_KERNEL); } @@ -2917,15 +2906,15 @@ static int inet6_addr_add(struct net *net, int ifindex, * manually configured addresses */ addrconf_dad_start(ifp); - if (ifa_flags & IFA_F_MANAGETEMPADDR) - manage_tempaddrs(idev, ifp, valid_lft, prefered_lft, -true, jiffies); + if (cfg->ifa_flags & IFA_F_MANAGETEMPADDR) + manage_tempaddrs(idev, ifp, cfg->valid_lft, +cfg->preferred_lft, true, jiffies); in6_ifa_put(ifp); addrconf_verify_rtnl(); return 0; - } else if (ifa_flags & IFA_F_MCAUTOJOIN) { - ipv6_mc_config(net->ipv6.mc_autojoin_sk, - false, pfx, ifindex); + } else if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) { + ipv6_mc_config(net->ipv6.mc_autojoin_sk, false, + cfg->pfx, ifindex); } return PTR_ERR(ifp); @@ -2976,6 +2965,11 @@ static int inet6_addr_del(st
[PATCH net-next 4/7] net: Add IFA_RT_PRIORITY address attribute
From: David Ahern Currently, if two interfaces have addresses in the same connected route, then the order of the prefix route entries is determined by the order in which the addresses are assigned or the links brought up. Add IFA_RT_PRIORITY to allow user to specify the metric of the prefix route associated with an address giving interface managers better control of the order of prefix routes. Signed-off-by: David Ahern --- include/uapi/linux/if_addr.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h index 2ef053d265de..ebaf5701c9db 100644 --- a/include/uapi/linux/if_addr.h +++ b/include/uapi/linux/if_addr.h @@ -33,6 +33,7 @@ enum { IFA_CACHEINFO, IFA_MULTICAST, IFA_FLAGS, + IFA_RT_PRIORITY, /* u32, priority/metric for prefix route */ __IFA_MAX, }; -- 2.11.0
[PATCH net-next 0/7] net: Add address attribute to control metric of prefix route
From: David Ahern For use cases such as VRR (Virtual Router Redundancy) interface managers want efficient control over the order of prefix routes when multiple interfaces have addresses with overlapping/duplicate subnets. Currently, if two interfaces have addresses in the same subnet, the order of the prefix route entries is determined by the order in which the addresses are assigned or the links brought up. Any actions like cycling an interface up and down changes that order. This set adds a new attribute for addresses to allow a user to specify the metric of the prefix route associated with an address giving interface managers better and more efficient control of the order of prefix routes. Patches 1-3 refactor IPv6 address add functions to pass an ifa6_config struct. The functions currently have a long list of arguments and adding the metric just makes it worse. Because of the overall diff size in moving the arguments to a struct, the change is done in stages to make it easier to review starting with the bottom function and pushing the struct up to callers in each successive patch. Patch 4 introduces the new attribute. Patches 5 and 6 add support for the new attribute to IPv4 and IPv6 addresses. Patch 7 adds a set of test cases. Patch 8 adds support to iproute2 Changes since RFC - collapsed patches 1 and 3 into patch 2 - simplified stack variables in fib_modify_prefix_metric in patch 5 David Ahern (7): net/ipv6: Convert ipv6_add_addr to struct ifa6_config net/ipv6: Pass ifa6_config struct to inet6_addr_add net/ipv6: Pass ifa6_config struct to inet6_addr_modify net: Add IFA_RT_PRIORITY address attribute net/ipv4: Add support for specifying metric of connected routes net/ipv6: Add support for specifying metric of connected routes selftests: fib_tests: Add prefix route tests with metric include/linux/inetdevice.h | 1 + include/net/addrconf.h | 13 ++ include/net/if_inet6.h | 1 + include/net/route.h | 1 + include/uapi/linux/if_addr.h | 1 + net/ipv4/devinet.c | 15 ++ net/ipv4/fib_frontend.c | 53 - net/ipv6/addrconf.c | 359 +++ tools/testing/selftests/net/fib_tests.sh | 181 +++- 9 files changed, 472 insertions(+), 153 deletions(-) mode change 100755 => 100644 tools/testing/selftests/net/fib_tests.sh -- 2.11.0
[PATCH net-next 7/7] selftests: fib_tests: Add prefix route tests with metric
From: David Ahern Add tests verifying prefix routes are inserted with expected metric. IPv6 prefix route tests TEST: Default metric [ OK ] TEST: User specified metric on first device [ OK ] TEST: User specified metric on second device [ OK ] TEST: Delete of address on first device [ OK ] TEST: Modify metric of address[ OK ] TEST: Prefix route removed on link down [ OK ] TEST: Prefix route with metric on link up [ OK ] IPv4 prefix route tests TEST: Default metric [ OK ] TEST: User specified metric on first device [ OK ] TEST: User specified metric on second device [ OK ] TEST: Delete of address on first device [ OK ] TEST: Modify metric of address[ OK ] TEST: Prefix route removed on link down [ OK ] TEST: Prefix route with metric on link up [ OK ] Signed-off-by: David Ahern --- tools/testing/selftests/net/fib_tests.sh | 181 ++- 1 file changed, 180 insertions(+), 1 deletion(-) mode change 100755 => 100644 tools/testing/selftests/net/fib_tests.sh diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh old mode 100755 new mode 100644 index e7d76fbc36e9..9780c5a2d73f --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -6,7 +6,8 @@ ret=0 -TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt" +# all tests in this script. Can be overridden with -t option +TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt ipv6_addr_metric ipv4_addr_metric" VERBOSE=0 PAUSE_ON_FAIL=no PAUSE=no @@ -642,6 +643,8 @@ check_route6() local rc=0 out=$($IP -6 ro ls match ${pfx} | sed -e 's/ pref medium//') + [ "${out}" = "${expected}" ] && return 0 + if [ -z "${out}" ]; then if [ "$VERBOSE" = "1" ]; then printf "\nNo route entry found\n" @@ -911,6 +914,98 @@ ipv6_route_test() route_cleanup } +ip_addr_metric_check() +{ + ip addr help 2>&1 | grep -q metric + if [ $? -ne 0 ]; then + echo "iproute2 command does not support metric for addresses. Skipping test" + return 1 + fi + + return 0 +} + +ipv6_addr_metric_test() +{ + local rc + + echo + echo "IPv6 prefix route tests" + + ip_addr_metric_check || return 1 + + setup + + set -e + $IP li add dummy1 type dummy + $IP li add dummy2 type dummy + $IP li set dummy1 up + $IP li set dummy2 up + + # default entry is metric 256 + run_cmd "$IP -6 addr add dev dummy1 2001:db8:104::1/64" + run_cmd "$IP -6 addr add dev dummy2 2001:db8:104::2/64" + set +e + + check_route6 "2001:db8:104::/64 dev dummy1 proto kernel metric 256 2001:db8:104::/64 dev dummy2 proto kernel metric 256" + log_test $? 0 "Default metric" + + set -e + run_cmd "$IP -6 addr flush dev dummy1" + run_cmd "$IP -6 addr add dev dummy1 2001:db8:104::1/64 metric 257" + set +e + + check_route6 "2001:db8:104::/64 dev dummy2 proto kernel metric 256 2001:db8:104::/64 dev dummy1 proto kernel metric 257" + log_test $? 0 "User specified metric on first device" + + set -e + run_cmd "$IP -6 addr flush dev dummy2" + run_cmd "$IP -6 addr add dev dummy2 2001:db8:104::2/64 metric 258" + set +e + + check_route6 "2001:db8:104::/64 dev dummy1 proto kernel metric 257 2001:db8:104::/64 dev dummy2 proto kernel metric 258" + log_test $? 0 "User specified metric on second device" + + run_cmd "$IP -6 addr del dev dummy1 2001:db8:104::1/64 metric 257" + rc=$? + if [ $rc -eq 0 ]; then + check_route6 "2001:db8:104::/64 dev dummy2 proto kernel metric 258" + rc=$? + fi + log_test $rc 0 "Delete of address on first device" + + run_cmd "$IP -6 addr change dev dummy2 2001:db8:104::2/64 metric 259" + rc=$? + if [ $rc -eq 0 ]; then + check_route6 "2001:db8:104::/64 dev dummy2 proto kernel metric 259" + rc=$? + fi + log_test $rc 0 "Modify metric of address" + + # verify prefix route removed on down + run_cmd "ip netns exec testns sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1" + run_cmd "$IP li set dev dummy2 down" + rc=$? + if [ $rc -eq 0 ]; then + check_route6 "" + rc=$? + fi + log_test $rc 0 "Prefix route removed on link down" + + # verify prefix route re-inserted with assigned metric + run_cmd "$IP li set dev dummy2 up" + rc=$? + if [ $rc -eq 0 ]; then + check_route6 "2001:db8
[PATCH net-next 5/7] net/ipv4: Add support for specifying metric of connected routes
From: David Ahern Add support for IFA_RT_PRIORITY to ipv4 addresses. If the metric is changed on an existing address then the new route is inserted before removing the old one. Since the metric is one of the route keys, the prefix route can not be replaced. Signed-off-by: David Ahern --- include/linux/inetdevice.h | 1 + include/net/route.h| 1 + net/ipv4/devinet.c | 15 + net/ipv4/fib_frontend.c| 53 -- 4 files changed, 59 insertions(+), 11 deletions(-) diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h index e16fe7d44a71..27650f1bff3d 100644 --- a/include/linux/inetdevice.h +++ b/include/linux/inetdevice.h @@ -139,6 +139,7 @@ struct in_ifaddr { __be32 ifa_local; __be32 ifa_address; __be32 ifa_mask; + __u32 ifa_rt_priority; __be32 ifa_broadcast; unsigned char ifa_scope; unsigned char ifa_prefixlen; diff --git a/include/net/route.h b/include/net/route.h index dbb032d5921b..bb53cdba38dc 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -225,6 +225,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev, struct in_ifaddr; void fib_add_ifaddr(struct in_ifaddr *); void fib_del_ifaddr(struct in_ifaddr *, struct in_ifaddr *); +void fib_modify_prefix_metric(struct in_ifaddr *ifa, u32 new_metric); void rt_add_uncached_list(struct rtable *rt); void rt_del_uncached_list(struct rtable *rt); diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index 40f001782c1b..d7585ab1a77a 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -99,6 +99,7 @@ static const struct nla_policy ifa_ipv4_policy[IFA_MAX+1] = { [IFA_LABEL] = { .type = NLA_STRING, .len = IFNAMSIZ - 1 }, [IFA_CACHEINFO] = { .len = sizeof(struct ifa_cacheinfo) }, [IFA_FLAGS] = { .type = NLA_U32 }, + [IFA_RT_PRIORITY] = { .type = NLA_U32 }, }; #define IN4_ADDR_HSIZE_SHIFT 8 @@ -835,6 +836,9 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh, else memcpy(ifa->ifa_label, dev->name, IFNAMSIZ); + if (tb[IFA_RT_PRIORITY]) + ifa->ifa_rt_priority = nla_get_u32(tb[IFA_RT_PRIORITY]); + if (tb[IFA_CACHEINFO]) { struct ifa_cacheinfo *ci; @@ -906,12 +910,20 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, return __inet_insert_ifa(ifa, nlh, NETLINK_CB(skb).portid, extack); } else { + u32 new_metric = ifa->ifa_rt_priority; + inet_free_ifa(ifa); if (nlh->nlmsg_flags & NLM_F_EXCL || !(nlh->nlmsg_flags & NLM_F_REPLACE)) return -EEXIST; ifa = ifa_existing; + + if (ifa->ifa_rt_priority != new_metric) { + fib_modify_prefix_metric(ifa, new_metric); + ifa->ifa_rt_priority = new_metric; + } + set_ifa_lifetime(ifa, valid_lft, prefered_lft); cancel_delayed_work(&check_lifetime_work); queue_delayed_work(system_power_efficient_wq, @@ -1549,6 +1561,7 @@ static size_t inet_nlmsg_size(void) + nla_total_size(4) /* IFA_BROADCAST */ + nla_total_size(IFNAMSIZ) /* IFA_LABEL */ + nla_total_size(4) /* IFA_FLAGS */ + + nla_total_size(4) /* IFA_RT_PRIORITY */ + nla_total_size(sizeof(struct ifa_cacheinfo)); /* IFA_CACHEINFO */ } @@ -1618,6 +1631,8 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa, (ifa->ifa_label[0] && nla_put_string(skb, IFA_LABEL, ifa->ifa_label)) || nla_put_u32(skb, IFA_FLAGS, ifa->ifa_flags) || + (ifa->ifa_rt_priority && +nla_put_u32(skb, IFA_RT_PRIORITY, ifa->ifa_rt_priority)) || put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp, preferred, valid)) goto nla_put_failure; diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 045c43a27c12..07989d9ab2f2 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -846,7 +846,8 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) * to fib engine. It is legal, because all events occur * only when netlink is already locked. */ -static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa) +static void fib_magic(int cmd, int type, __be32 dst, int dst_len, + struct in_ifaddr *ifa, u32 rt_priority) { struct net *net = dev_net(ifa->ifa_dev->dev); u32 tb_id = l3mdev_fib_table(ifa->ifa_dev->dev); @@ -856,6 +857,7 @@ static void fib_magic(int
[PATCH] iptables-compat: homogenize error message
There is a difference between error messages in iptables and iptables-compat: #sudo iptables-compat -D INPUT 4 iptables: No chain/target/match by that name. #sudo iptables -D INPUT 4 iptables: Index of deletion too big. Now, will show same error message. Signed-off-by: Arushi Singhal --- iptables/nft.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iptables/nft.c b/iptables/nft.c index e33d00f..40646f4 100644 --- a/iptables/nft.c +++ b/iptables/nft.c @@ -2603,7 +2603,7 @@ const char *nft_strerror(int err) { nft_rule_add, E2BIG, "Index of insertion too big" }, { nft_rule_check, ENOENT, "Bad rule (does a matching rule exist in that chain?)" }, { nft_rule_replace, ENOENT, "Index of replacement too big" }, - { nft_rule_delete_num, E2BIG, "Index of deletion too big" }, + { nft_rule_delete_num, ENOENT, "Index of deletion too big" }, /* { TC_READ_COUNTER, E2BIG, "Index of counter too big" }, { TC_ZERO_COUNTER, E2BIG, "Index of counter too big" }, */ { nft_rule_add, ELOOP, "Loop found in table" }, -- 2.7.4
Re: [PATCH net] mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG
Sun, May 27, 2018 at 08:48:41AM CEST, ido...@mellanox.com wrote: >From: Petr Machata > >VLAN 1 is internally used for untagged traffic. Prevent creation of >explicit netdevice for that VLAN, because that currently isn't supported >and leads to the NULL pointer dereference cited below. > >Fix by preventing creation of VLAN devices with VID of 1 over mlxsw >devices or LAG devices that involve mlxsw devices. > >[ 327.175816] > >[ 327.184544] UBSAN: Undefined behaviour in >drivers/net/ethernet/mellanox/mlxsw/spectrum_fid.c:200:12 >[ 327.193667] member access within null pointer of type 'const struct >mlxsw_sp_fid' >[ 327.201226] CPU: 0 PID: 8983 Comm: ip Not tainted >4.17.0-rc4-petrm_net_ip6gre_headroom-custom-140 #11 >[ 327.210496] Hardware name: Mellanox Technologies Ltd. >"MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016 >[ 327.219872] Call Trace: >[ 327.222384] dump_stack+0xc3/0x12b >[ 327.234007] ubsan_epilogue+0x9/0x49 >[ 327.237638] ubsan_type_mismatch_common+0x1f9/0x2d0 >[ 327.255769] __ubsan_handle_type_mismatch+0x90/0xa7 >[ 327.264716] mlxsw_sp_fid_type+0x35/0x50 [mlxsw_spectrum] >[ 327.270255] mlxsw_sp_port_vlan_router_leave+0x46/0xc0 [mlxsw_spectrum] >[ 327.277019] mlxsw_sp_inetaddr_port_vlan_event+0xe1/0x340 [mlxsw_spectrum] >[ 327.315031] mlxsw_sp_netdevice_vrf_event+0xa8/0x100 [mlxsw_spectrum] >[ 327.321626] mlxsw_sp_netdevice_event+0x276/0x430 [mlxsw_spectrum] >[ 327.367863] notifier_call_chain+0x4c/0x150 >[ 327.372128] __netdev_upper_dev_link+0x1b3/0x260 >[ 327.399450] vrf_add_slave+0xce/0x170 [vrf] >[ 327.403703] do_setlink+0x658/0x1d70 >[ 327.508998] rtnl_newlink+0x908/0xf20 >[ 327.559128] rtnetlink_rcv_msg+0x50c/0x720 >[ 327.571720] netlink_rcv_skb+0x16a/0x1f0 >[ 327.583450] netlink_unicast+0x2ca/0x3e0 >[ 327.599305] netlink_sendmsg+0x3e2/0x7f0 >[ 327.616655] sock_sendmsg+0x76/0xc0 >[ 327.620207] ___sys_sendmsg+0x494/0x5d0 >[ 327.666117] __sys_sendmsg+0xc2/0x130 >[ 327.690953] do_syscall_64+0x66/0x370 >[ 327.694677] entry_SYSCALL_64_after_hwframe+0x49/0xbe >[ 327.699782] RIP: 0033:0x7f4c2f3f8037 >[ 327.703393] RSP: 002b:7ffe8c389708 EFLAGS: 0246 ORIG_RAX: >002e >[ 327.711035] RAX: ffda RBX: 5b03f53e RCX: >7f4c2f3f8037 >[ 327.718229] RDX: RSI: 7ffe8c389760 RDI: >0003 >[ 327.725431] RBP: 7ffe8c389760 R08: R09: >7f4c2f443630 >[ 327.732632] R10: 05eb R11: 0246 R12: > >[ 327.739833] R13: 006774e0 R14: 7ffe8c3897e8 R15: > >[ 327.747096] > > >Fixes: 9589a7b5d7d9 ("mlxsw: spectrum: Handle VLAN devices linking / >unlinking") >Suggested-by: Ido Schimmel >Signed-off-by: Petr Machata >Signed-off-by: Ido Schimmel Acked-by: Jiri Pirko
[PATCH bpf 0/2] Use __aligned_u64 in UAPI fields
Hello. It was discovered during strace development that struct bpf_map_info and struct bpf_prog_info now have different layouts of i386/compat and x86_64. Since it's already broken and bpf syscall has no separate compat (as far as I can see), and the affecting change was introduced recently (in Linux 4.16), it's proposed to change the layout of these structures on 32-bit architectures by using __aligned_u64. In order to somewhat future-proof from this problem in future, an approach similar to the one implemented in RDMA subsystem recently is proposed: use __aligned_u64 consistently throughout the UAPI header. Eugene Syromiatnikov (2): bpf: fix alignment of netns_dev/netns_ino fields in bpf_{map,prog}_info bpf: enforce usage of __aligned_u64 in the UAPI header include/uapi/linux/bpf.h | 30 +++--- tools/include/uapi/linux/bpf.h | 30 +++--- 2 files changed, 30 insertions(+), 30 deletions(-) -- 2.1.4
[PATCH bpf 2/2] bpf: enforce usage of __aligned_u64 in the UAPI header
Use __aligned_u64 instead of __u64 everywhere in the UAPI header, similarly to v4.17-rc1~94^2~58^2 "RDMA: Change all uapi headers to use __aligned_u64 instead of __u64". This commit doesn't change structure layouts, but imposes additional alignment requirements on struct bpf_stack_build_id, struct bpf_sock_ops, struct bpf_perf_event_value, and struct bpf_raw_tracepoint_args; of them only struct bpf_sock_ops and struct bpf_perf_event_value have 64-bit fields that were present in a released kernel without this additional alignment requirement (bytes_received and bytes_acked were added to struct bpf_sock_ops in commit v4.16-rc1~123^2~33^2~5^2~4, struct bpf_perf_event_value was added in commit v4.15-rc1~84^2~532^2~3). Signed-off-by: Eugene Syromiatnikov --- include/uapi/linux/bpf.h | 22 +++--- tools/include/uapi/linux/bpf.h | 22 +++--- 2 files changed, 22 insertions(+), 22 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 903010a..174e99a 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -259,8 +259,8 @@ struct bpf_stack_build_id { __s32 status; unsigned char build_id[BPF_BUILD_ID_SIZE]; union { - __u64 offset; - __u64 ip; + __aligned_u64 offset; + __aligned_u64 ip; }; }; @@ -288,7 +288,7 @@ union bpf_attr { __aligned_u64 value; __aligned_u64 next_key; }; - __u64 flags; + __aligned_u64 flags; }; struct { /* anonymous struct used by BPF_PROG_LOAD command */ @@ -360,7 +360,7 @@ union bpf_attr { } query; struct { - __u64 name; + __aligned_u64 name; __u32 prog_fd; } raw_tracepoint; } __attribute__((aligned(8))); @@ -1011,7 +1011,7 @@ struct bpf_prog_info { __u32 xlated_prog_len; __aligned_u64 jited_prog_insns; __aligned_u64 xlated_prog_insns; - __u64 load_time;/* ns since boottime */ + __aligned_u64 load_time;/* ns since boottime */ __u32 created_by_uid; __u32 nr_map_ids; __aligned_u64 map_ids; @@ -1101,8 +1101,8 @@ struct bpf_sock_ops { __u32 lost_out; __u32 sacked_out; __u32 sk_txhash; - __u64 bytes_received; - __u64 bytes_acked; + __aligned_u64 bytes_received; + __aligned_u64 bytes_acked; }; /* Definitions for bpf_sock_ops_cb_flags */ @@ -1189,9 +1189,9 @@ enum { #define TCP_BPF_SNDCWND_CLAMP 1002/* Set sndcwnd_clamp */ struct bpf_perf_event_value { - __u64 counter; - __u64 enabled; - __u64 running; + __aligned_u64 counter; + __aligned_u64 enabled; + __aligned_u64 running; }; #define BPF_DEVCG_ACC_MKNOD(1ULL << 0) @@ -1209,7 +1209,7 @@ struct bpf_cgroup_dev_ctx { }; struct bpf_raw_tracepoint_args { - __u64 args[0]; + __aligned_u64 args[0]; }; #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 903010a..174e99a 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -259,8 +259,8 @@ struct bpf_stack_build_id { __s32 status; unsigned char build_id[BPF_BUILD_ID_SIZE]; union { - __u64 offset; - __u64 ip; + __aligned_u64 offset; + __aligned_u64 ip; }; }; @@ -288,7 +288,7 @@ union bpf_attr { __aligned_u64 value; __aligned_u64 next_key; }; - __u64 flags; + __aligned_u64 flags; }; struct { /* anonymous struct used by BPF_PROG_LOAD command */ @@ -360,7 +360,7 @@ union bpf_attr { } query; struct { - __u64 name; + __aligned_u64 name; __u32 prog_fd; } raw_tracepoint; } __attribute__((aligned(8))); @@ -1011,7 +1011,7 @@ struct bpf_prog_info { __u32 xlated_prog_len; __aligned_u64 jited_prog_insns; __aligned_u64 xlated_prog_insns; - __u64 load_time;/* ns since boottime */ + __aligned_u64 load_time;/* ns since boottime */ __u32 created_by_uid; __u32 nr_map_ids; __aligned_u64 map_ids; @@ -1101,8 +1101,8 @@ struct bpf_sock_ops { __u32 lost_out; __u32 sacked_out; __u32 sk_txhash; - __u64 bytes_received; - __u64 bytes_acked; + __aligned_u64 bytes_received; + __aligned_u64 bytes_acked; }; /* Definitions for bpf_sock_ops_cb_flags */ @@ -1189,9 +1189,9 @@ enum { #define TCP_BPF_SNDCWND_CLAMP 1002/* Set sndcwnd_clamp */ struct bpf_perf_event_value { - __u64 counter; - __u64 enabled; - __u64 running;
[PATCH bpf 1/2] bpf: fix alignment of netns_dev/netns_ino fields in bpf_{map,prog}_info
Recent introduction of netns_dev/netns_ino to bpf_map_info/bpf_prog info has broken compat, as offsets of these fields are different in 32-bit and 64-bit ABIs. One fix (other than implementing compat support in syscall in order to handle this discrepancy) is to use __aligned_u64 instead of __u64 for these fields. Reported-by: Dmitry V. Levin Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps") Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs") Signed-off-by: Eugene Syromiatnikov --- include/uapi/linux/bpf.h | 8 tools/include/uapi/linux/bpf.h | 8 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c5ec897..903010a 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1017,8 +1017,8 @@ struct bpf_prog_info { __aligned_u64 map_ids; char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; - __u64 netns_dev; - __u64 netns_ino; + __aligned_u64 netns_dev; + __aligned_u64 netns_ino; } __attribute__((aligned(8))); struct bpf_map_info { @@ -1030,8 +1030,8 @@ struct bpf_map_info { __u32 map_flags; char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; - __u64 netns_dev; - __u64 netns_ino; + __aligned_u64 netns_dev; + __aligned_u64 netns_ino; } __attribute__((aligned(8))); /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index c5ec897..903010a 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1017,8 +1017,8 @@ struct bpf_prog_info { __aligned_u64 map_ids; char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; - __u64 netns_dev; - __u64 netns_ino; + __aligned_u64 netns_dev; + __aligned_u64 netns_ino; } __attribute__((aligned(8))); struct bpf_map_info { @@ -1030,8 +1030,8 @@ struct bpf_map_info { __u32 map_flags; char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; - __u64 netns_dev; - __u64 netns_ino; + __aligned_u64 netns_dev; + __aligned_u64 netns_ino; } __attribute__((aligned(8))); /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed -- 2.1.4