pull request: bluetooth-next 2018-01-11
Hi Dave,

Here's likely the last bluetooth-next pull request for the 4.16 kernel.

 - Added support for Bluetooth on 2015+ MacBook (Pro)
 - Fix to QCA Rome suspend/resume handling
 - Two new QCA_ROME USB IDs in btusb
 - A few other minor fixes

Please let me know if there are any issues pulling. Thanks.

Johan

---
The following changes since commit 18feb87105c3c16dc01e6981a6aafb175679b997:

  enic: add wq clean up budget (2017-12-26 13:10:07 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git for-upstream

for you to fetch changes up to ff8759609d021c0e85945fcc4a148a0e55ace70f:

  Bluetooth: btbcm: Fix sleep mode struct ordering (2018-01-10 19:00:14 +0100)

AceLan Kao (1):
      Bluetooth: btusb: Add support for 0cf3:e010

Arnd Bergmann (1):
      Bluetooth: hciuart: add nvmem dependency

Colin Ian King (2):
      Bluetooth: bpa10x: make array 'req' static, shrinks object size
      Bluetooth: btintel: make array 'param' static, shrinks object size

Hans de Goede (1):
      Bluetooth: btusb: Restore QCA Rome suspend/resume fix with a "rewritten" version

Ioan Moldovan (1):
      Bluetooth: Add a new 04ca:3015 QCA_ROME device

Kai-Heng Feng (1):
      Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

Lukas Wunner (15):
      Bluetooth: Avoid WARN splat due to missing GPIOLIB
      Bluetooth: hci_bcm: Streamline runtime PM code
      Bluetooth: Depend on rather than select GPIOLIB
      Bluetooth: hci_bcm: Mandate presence of shutdown and device wake GPIO
      Bluetooth: hci_bcm: Clean up unnecessary #ifdef
      Bluetooth: hci_bcm: Fix race on close
      Bluetooth: hci_bcm: Fix unbalanced pm_runtime_disable()
      Bluetooth: hci_bcm: Invalidate IRQ on request failure
      Bluetooth: hci_bcm: Document struct bcm_device
      Bluetooth: hci_bcm: Add callbacks to toggle GPIOs
      Bluetooth: hci_bcm: Handle errors properly
      Bluetooth: hci_bcm: Support Apple GPIO handling
      Bluetooth: hci_bcm: Silence IRQ printk
      Bluetooth: hci_bcm: Sleep instead of spinning
      Bluetooth: btbcm: Fix sleep mode struct ordering

Ronald Tschalär (1):
      Bluetooth: hci_bcm: Validate IRQ before using it

 drivers/bluetooth/Kconfig   |   4 +
 drivers/bluetooth/bpa10x.c  |   2 +-
 drivers/bluetooth/btbcm.h   |   2 +-
 drivers/bluetooth/btintel.c |   2 +-
 drivers/bluetooth/btusb.c   |  22 ++--
 drivers/bluetooth/hci_bcm.c | 239 +++-
 6 files changed, 207 insertions(+), 64 deletions(-)
RE: [PATCH net-next v2] xfrm: Add ESN support for IPSec HW offload
> From: Shannon Nelson [mailto:shannon.nel...@oracle.com] > Sent: Thursday, January 11, 2018 5:21 AM > > On 1/10/2018 3:09 PM, Yossi Kuperman wrote: > >> On 10 Jan 2018, at 19:36, Shannon Nelson wrote: > >> > >>> On 1/10/2018 2:34 AM, yoss...@mellanox.com wrote: > >>> From: Yossef Efraim > >>> This patch adds ESN support to IPsec device offload. > >>> Adding new xfrm device operation to synchronize device ESN. > >>> Signed-off-by: Yossef Efraim > >>> --- > >>> Changes from v1: > >>> - Added documentation > >>> --- > >>> Documentation/networking/xfrm_device.txt | 3 +++ > >>> include/linux/netdevice.h| 1 + > >>> include/net/xfrm.h | 12 > >>> net/xfrm/xfrm_device.c | 4 ++-- > >>> net/xfrm/xfrm_replay.c | 2 ++ > >>> 5 files changed, 20 insertions(+), 2 deletions(-) > > [...] > > >>> diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c > >>> index 7598250..704a055 100644 > >>> --- a/net/xfrm/xfrm_device.c > >>> +++ b/net/xfrm/xfrm_device.c > >>> @@ -147,8 +147,8 @@ int xfrm_dev_state_add(struct net *net, struct > >>> xfrm_state *x, > >>> if (!x->type_offload) > >>> return -EINVAL; > >>> -/* We don't yet support UDP encapsulation, TFC padding and ESN. */ > >>> -if (x->encap || x->tfcpad || (x->props.flags & XFRM_STATE_ESN)) > >>> +/* We don't yet support UDP encapsulation and TFC padding. */ > >>> +if (x->encap || x->tfcpad) > >> > >> As I mentioned before, this will cause issues when working with hardware > >> that has no ESN support, such as Intel's x540: the stack will > expect the driver to do ESN, and nothing actually happens but a rollover of > the numbers. Sure, the driver could look for the ESN attribute > and fail the add, but that's a mode where we have to update every driver to > fend off problems every time we add a new feature. Much > better is to only update drivers that actively support the new feature. > >> > > > > You are right. > > > > I’m not sure why this check is here in the first place. 
> > IMO it should take place in xdo_dev_state_add, a driver-specific callback.
> >
>
> If you say I'm right, then why do you say it should take place in the
> driver callback? I just wrote that it should *not*.
>

Sorry, I wasn't clear; you are right that this change will break Intel's
x540 driver. However, I do think that this is the purpose of
xdo_dev_state_add(). Again, as far as I can understand, and please correct
me if I'm wrong, this check shouldn't be here in the first place.

Please have a look at mlx5e_xfrm_validate_state(). Currently, it returns an
error if the user requests ESN, regardless of the underlying device's
capabilities. A subsequent patch to the mlx5 driver will allow such a
request if the device does support it, maintaining backward compatibility.

Here is a code snippet:

-	if (x->props.flags & XFRM_STATE_ESN) {
+	if (x->props.flags & XFRM_STATE_ESN &&
+	    !(mlx5_accel_ipsec_device_caps(priv->mdev) & MLX5_ACCEL_IPSEC_ESN)) {
		netdev_info(netdev, "Cannot offload ESN xfrm states\n");
		return -EINVAL;
	}

> This code seems to be assuming that all drivers/NICs with the offload
> will be able to do ESN, and this is not the case. If this code is put
> into place, suddenly the ixgbe driver's offload will have a failure
> case: the driver doesn't support ESN, and doesn't know to NAK the
> state_add if the ESN bit is on. This is a generic capabilities issue
> for which we already have a solution "pattern".
>

We weren't assuming that, please see above.

> > What do you suggest?
> >
>
> There should be a capabilities/feature flag for the driver to set and
> the XFRM code shouldn't try the state_add with ESN if the driver hasn't
> set an ESN bit in its capabilities. Other capabilities that might make
> sense here are IPv6, TSO, and CSUM; there may be others.
>
> >> Look at how feature bits are added to netdev->features to signify what the
> >> driver can do. I think that's a much better approach.
> >>
> >
> > It looks like an overkill?
> > Alternatively, just solve this by failing to add the SA that has ESN set > if the driver hasn't defined your new xdo_dev_state_advance_esn(). > > sln > > > > > >> sln > >> > >> > >>> return -EINVAL; > >>> dev = dev_get_by_index(net, xuo->ifindex); > >>> diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c > >>> index 0250181..1d38c6a 100644 > >>> --- a/net/xfrm/xfrm_replay.c > >>> +++ b/net/xfrm/xfrm_replay.c > >>> @@ -551,6 +551,8 @@ static void xfrm_replay_advance_esn(struct xfrm_state > >>> *x, __be32 net_seq) > >>> bitnr = replay_esn->replay_window - (diff - pos); > >>> } > >>> +xfrm_dev_state_advance_esn(x); > >>> + > >>> nr = bitnr >> 5; > >>> bitnr = bitnr & 0x1F; > >>> replay_esn->bmp[nr] |= (1U << bitnr);
Re: [PATCH 03/32] fs: introduce new ->get_poll_head and ->poll_mask methods
On Thu, Jan 11, 2018 at 05:22:00AM +, Al Viro wrote:
> Whee... The very first ->poll() instance in alphabetic order on pathnames:
> in arch/cris/arch-v10/drivers/gpio.c
>
> static __poll_t gpio_poll(struct file *file, poll_table *wait)
> {
> 	__poll_t mask = 0;
> 	struct gpio_private *priv = file->private_data;
> 	unsigned long data;
> 	unsigned long flags;
>
> 	spin_lock_irqsave(&gpio_lock, flags);
>
> 	poll_wait(file, &priv->alarm_wq, wait);
>
> IOW, we are doing poll_wait() (== possible GFP_KERNEL __get_free_page()) under
> a spinlock...

Yes. Another good reason to separate poll_wait and the actual event check
callback.
Re: general protection fault in sctp_v6_get_dst
On Thu, Jan 11, 2018 at 2:15 AM, syzbot wrote: > syzkaller has found reproducer for the following crash on > 61ad64080e039dce99a7f8d89b729bbea995e2f7 > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > C reproducer is attached > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > for information about syzkaller reproducers > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+7b7b518b1228d2743...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. > > device lo entered promiscuous mode > kasan: CONFIG_KASAN_INLINE enabled > kasan: GPF could be caused by NULL-ptr deref or user memory access > general protection fault: [#1] SMP KASAN > Dumping ftrace buffer: >(ftrace buffer empty) > Modules linked in: > CPU: 0 PID: 3506 Comm: syzkaller968983 Not tainted 4.15.0-rc7+ #181 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:__read_once_size include/linux/compiler.h:183 [inline] > RIP: 0010:sctp_v6_get_dst+0x59e/0x1c60 net/sctp/ipv6.c:271 > RSP: 0018:8801db205e20 EFLAGS: 00010206 > RAX: dc00 RBX: RCX: 8512e05b > RDX: 000f RSI: 67cf608c RDI: 8801db22376c > RBP: 8801db206190 R08: 11003b640b05 R09: 0002 > R10: 8801db205cf0 R11: 8512e008 R12: 8801bf884db0 > R13: 204e R14: 8801bfe3e680 R15: 8801bf884d80 > FS: 7f122e219700() GS:8801db20() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 20aaff09 CR3: 0001bfdf0005 CR4: 001606f0 > > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > > sctp_transport_route+0xa8/0x430 net/sctp/transport.c:293 > sctp_assoc_add_peer+0x4fe/0x1190 net/sctp/associola.c:655 > sctp_process_init+0x119/0x2440 net/sctp/sm_make_chunk.c:2341 > sctp_sf_do_5_1B_init+0x8c9/0xe80 net/sctp/sm_statefuns.c:414 > sctp_do_sm+0x192/0x6ed0 net/sctp/sm_sideeffect.c:1178 > sctp_endpoint_bh_rcv+0x379/0x8f0 
net/sctp/endpointola.c:456 > sctp_inq_push+0x23b/0x300 net/sctp/inqueue.c:95 > sctp_rcv+0x29f3/0x35c0 net/sctp/input.c:267 > sctp6_rcv+0x15/0x30 net/sctp/ipv6.c:1006 > ip6_input_finish+0x37e/0x17a0 net/ipv6/ip6_input.c:284 > NF_HOOK include/linux/netfilter.h:288 [inline] > ip6_input+0xdb/0x560 net/ipv6/ip6_input.c:327 > dst_input include/net/dst.h:449 [inline] > ip6_rcv_finish+0x1a9/0x7a0 net/ipv6/ip6_input.c:71 > NF_HOOK include/linux/netfilter.h:288 [inline] > ipv6_rcv+0xf37/0x1fa0 net/ipv6/ip6_input.c:208 > __netif_receive_skb_core+0x1a41/0x3460 net/core/dev.c:4538 > __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4603 > process_backlog+0x203/0x740 net/core/dev.c:5283 > napi_poll net/core/dev.c:5681 [inline] > net_rx_action+0x792/0x1910 net/core/dev.c:5747 > __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 > do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1133 > > do_softirq.part.21+0x14d/0x190 kernel/softirq.c:329 > do_softirq kernel/softirq.c:177 [inline] > __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182 > local_bh_enable include/linux/bottom_half.h:32 [inline] > rcu_read_unlock_bh include/linux/rcupdate.h:727 [inline] > ip6_finish_output2+0xba0/0x23a0 net/ipv6/ip6_output.c:121 > ip6_finish_output+0x698/0xaf0 net/ipv6/ip6_output.c:154 > NF_HOOK_COND include/linux/netfilter.h:277 [inline] > ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:171 > dst_output include/net/dst.h:443 [inline] > NF_HOOK include/linux/netfilter.h:288 [inline] > ip6_xmit+0xd84/0x2090 net/ipv6/ip6_output.c:277 > sctp_v6_xmit+0x438/0x630 net/sctp/ipv6.c:225 > sctp_packet_transmit+0x225e/0x3750 net/sctp/output.c:638 > sctp_outq_flush+0xabb/0x4060 net/sctp/outqueue.c:911 > sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:776 > sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1807 [inline] > sctp_side_effects net/sctp/sm_sideeffect.c:1210 [inline] > sctp_do_sm+0x4e0/0x6ed0 net/sctp/sm_sideeffect.c:1181 > sctp_primitive_ASSOCIATE+0x9d/0xd0 net/sctp/primitive.c:88 > sctp_sendmsg+0x1d2e/0x33f0 
net/sctp/socket.c:2018 > inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764 > sock_sendmsg_nosec net/socket.c:628 [inline] > sock_sendmsg+0xca/0x110 net/socket.c:638 > SYSC_sendto+0x361/0x5c0 net/socket.c:1719 > SyS_sendto+0x40/0x50 net/socket.c:1687 > entry_SYSCALL_64_fastpath+0x23/0x9a > RIP: 0033:0x4456c9 > RSP: 002b:7f122e218d98 EFLAGS: 0212 ORIG_RAX: 002c > RAX: ffda RBX: 006dac3c RCX: 004456c9 > RDX: 0001 RSI: 20aaff09 RDI: 0007 > RBP: R08: 20abf000 R09: 001c > R10: R11: 0212 R12: 006dac38 >
[PATCH net] net: ipv4: Make "ip route get" match iif lo rules again.
Commit 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu
versions of route lookup") broke "ip route get" in the presence
of rules that specify iif lo.

Host-originated traffic always has iif lo, because
ip_route_output_key_hash and ip6_route_output_flags set the flow
iif to LOOPBACK_IFINDEX. Thus, putting "iif lo" in an ip rule is a
convenient way to select only originated traffic and not forwarded
traffic.

inet_rtm_getroute used to match these rules correctly because
even though it sets the flow iif to 0, it called
ip_route_output_key which overwrites iif with LOOPBACK_IFINDEX.
But now that it calls ip_route_output_key_hash_rcu, the ifindex
will remain 0 and not match the iif lo in the rule. As a result,
"ip route get" will return ENETUNREACH.

Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Tested: https://android.googlesource.com/kernel/tests/+/master/net/test/multinetwork_test.py passes again
Signed-off-by: Lorenzo Colitti
---
 net/ipv4/route.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 43b69af242..4e153b23bc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2762,6 +2762,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 		if (err == 0 && rt->dst.error)
 			err = -rt->dst.error;
 	} else {
+		fl4.flowi4_iif = LOOPBACK_IFINDEX;
 		rt = ip_route_output_key_hash_rcu(net, &fl4, &res, skb);
 		err = 0;
 		if (IS_ERR(rt))
-- 
2.16.0.rc1.238.g530d649a79-goog
Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get
Wed, Jan 10, 2018 at 05:48:09PM CET, dsah...@gmail.com wrote:
>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>> index 9c026d9..038cde7 100644
>> --- a/include/uapi/linux/rtnetlink.h
>> +++ b/include/uapi/linux/rtnetlink.h
>> @@ -150,6 +150,12 @@ enum {
>>  	RTM_NEWCACHEREPORT = 96,
>>  #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
>>
>> +	RTM_NEWBLOCK = 100,
>> +#define RTM_NEWBLOCK RTM_NEWBLOCK
>> +	RTM_DELBLOCK,
>> +#define RTM_DELBLOCK RTM_DELBLOCK
>> +	RTM_GETBLOCK,
>> +#define RTM_GETBLOCK RTM_GETBLOCK
>>  	__RTM_MAX,
>>  #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
>>  };
>
>Seems like this is creating an inconsistency. RTM_GETBLOCK is used to
>dump the set of shared blocks, but RTM_NEWBLOCK / RTM_DELBLOCK are not
>used to create / delete one.

Why is it a problem? RTM_NEWBLOCK is used as a reply for RTM_GETBLOCK.
I plan to have block notifications as a follow-up, there the RTM_GETBLOCK
and RTM_DELBLOCK will be used.

The fact that the user cannot create and delete a block explicitly is not
a problem in my opinion. Block creation and deletion is done according to
the usage of qdiscs.
Re: [patch net-next v7 07/13] net: sched: use block index as a handle instead of qdisc when block is shared
Wed, Jan 10, 2018 at 07:12:44PM CET, dsah...@gmail.com wrote:
>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>> index 843e29a..9c026d9 100644
>> --- a/include/uapi/linux/rtnetlink.h
>> +++ b/include/uapi/linux/rtnetlink.h
>> @@ -541,9 +541,15 @@ struct tcmsg {
>>  	int		tcm_ifindex;
>>  	__u32		tcm_handle;
>>  	__u32		tcm_parent;
>> +/* tcm_block_index is used instead of tcm_parent
>> + * in case tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK
>> + */
>> +#define tcm_block_index tcm_parent
>>  	__u32		tcm_info;
>>  };
>>
>> +#define TCM_IFINDEX_MAGIC_BLOCK (0xU)
>> +
>>  enum {
>>  	TCA_UNSPEC,
>>  	TCA_KIND,
>
>
>This could be more clearly documented for anyone wanting to write an app
>against the API. Something like:
>
>For shared blocks, tcm_ifindex is set to TCM_IFINDEX_MAGIC_BLOCK, and
>tcm_parent is aliased to tcm_block_index which is the block index.

Okay, will add this comment here.
Re: [patch net-next v7 03/13] net: sched: avoid usage of tp->q in tcf_classify
Wed, Jan 10, 2018 at 05:17:28PM CET, dsah...@gmail.com wrote:
>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko
>>
>> Use block index in the messages instead.
>>
>> Signed-off-by: Jiri Pirko
>> ---
>>  net/sched/cls_api.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 9b45950..31e91dc 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -672,8 +672,9 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>>  #ifdef CONFIG_NET_CLS_ACT
>>  reset:
>>  	if (unlikely(limit++ >= max_reclassify_loop)) {
>> -		net_notice_ratelimited("%s: reclassify loop, rule prio %u, protocol %02x\n",
>> -				       tp->q->ops->id, tp->prio & 0x,
>> +		net_notice_ratelimited("%u: reclassify loop, rule prio %u, protocol %02x\n",
>
>if you are dumping index instead of prio shouldn't the 'rule prio' above
>be adjusted?

I'm not! Why do you think so?

"%u:" is tp->chain->block->index
"prio %u" is tp->prio & 0x
"%02x" is ntohs(tp->protocol)

>
>
>> +				       tp->chain->block->index,
>> +				       tp->prio & 0x,
>>  				       ntohs(tp->protocol));
>>  		return TC_ACT_SHOT;
>>  	}
>>
>
[PATCH 2/2] xen-netfront: Fix race between device setup and open
When a netfront device is set up it registers a netdev fairly early on,
before it has set up the queues and is actually usable. A userspace tool
like NetworkManager will immediately try to open it and access its state
as soon as it appears.

The bug can be reproduced by hotplugging VIFs until the VM runs out of
grant refs. It registers the netdev but fails to set up any queues (since
there are no more grant refs). In the meantime, NetworkManager opens the
device and the kernel crashes trying to access the queues (of which there
are none).

Fix this in two ways:
* For initial setup, register the netdev much later, after the queues
  are set up. This avoids the race entirely.
* During a suspend/resume cycle, the frontend reconnects to the backend
  and the queues are recreated. It is possible (though highly unlikely)
  to race with something opening the device and accessing the queues
  after they have been destroyed but before they have been recreated.
  Extend the region covered by the rtnl semaphore to protect against
  this race. There is a possibility that we fail to recreate the queues,
  so check for this in the open function.
Signed-off-by: Ross Lagerwall --- drivers/net/xen-netfront.c | 46 -- 1 file changed, 24 insertions(+), 22 deletions(-) diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c index 9bd7dde..8328d39 100644 --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -351,6 +351,9 @@ static int xennet_open(struct net_device *dev) unsigned int i = 0; struct netfront_queue *queue = NULL; + if (!np->queues) + return -ENODEV; + for (i = 0; i < num_queues; ++i) { queue = &np->queues[i]; napi_enable(&queue->napi); @@ -1358,18 +1361,8 @@ static int netfront_probe(struct xenbus_device *dev, #ifdef CONFIG_SYSFS info->netdev->sysfs_groups[0] = &xennet_dev_group; #endif - err = register_netdev(info->netdev); - if (err) { - pr_warn("%s: register_netdev err=%d\n", __func__, err); - goto fail; - } return 0; - - fail: - xennet_free_netdev(netdev); - dev_set_drvdata(&dev->dev, NULL); - return err; } static void xennet_end_access(int ref, void *page) @@ -1737,8 +1730,6 @@ static void xennet_destroy_queues(struct netfront_info *info) { unsigned int i; - rtnl_lock(); - for (i = 0; i < info->netdev->real_num_tx_queues; i++) { struct netfront_queue *queue = &info->queues[i]; @@ -1747,8 +1738,6 @@ static void xennet_destroy_queues(struct netfront_info *info) netif_napi_del(&queue->napi); } - rtnl_unlock(); - kfree(info->queues); info->queues = NULL; } @@ -1764,8 +1753,6 @@ static int xennet_create_queues(struct netfront_info *info, if (!info->queues) return -ENOMEM; - rtnl_lock(); - for (i = 0; i < *num_queues; i++) { struct netfront_queue *queue = &info->queues[i]; @@ -1774,7 +1761,7 @@ static int xennet_create_queues(struct netfront_info *info, ret = xennet_init_queue(queue); if (ret < 0) { - dev_warn(&info->netdev->dev, + dev_warn(&info->xbdev->dev, "only created %d queues\n", i); *num_queues = i; break; @@ -1788,10 +1775,8 @@ static int xennet_create_queues(struct netfront_info *info, netif_set_real_num_tx_queues(info->netdev, *num_queues); - rtnl_unlock(); - if 
(*num_queues == 0) { - dev_err(&info->netdev->dev, "no queues\n"); + dev_err(&info->xbdev->dev, "no queues\n"); return -EINVAL; } return 0; @@ -1828,6 +1813,7 @@ static int talk_to_netback(struct xenbus_device *dev, goto out; } + rtnl_lock(); if (info->queues) xennet_destroy_queues(info); @@ -1838,6 +1824,7 @@ static int talk_to_netback(struct xenbus_device *dev, info->queues = NULL; goto out; } + rtnl_unlock(); /* Create shared ring, alloc event channel -- for each queue */ for (i = 0; i < num_queues; ++i) { @@ -1934,8 +1921,10 @@ static int talk_to_netback(struct xenbus_device *dev, xenbus_transaction_end(xbt, 1); destroy_ring: xennet_disconnect_backend(info); + rtnl_lock(); xennet_destroy_queues(info); out: + rtnl_unlock(); device_unregister(&dev->dev); return err; } @@ -1965,6 +1954,15 @@ static int xennet_connect(struct net_device *dev) netdev_update_features(dev); rtnl_unlock(); + if (dev->reg_state == NETREG_UNINITIALIZED) { + err = register_netdev(dev); + if (err) { + pr_warn("
Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution
On Tue, 9 Jan 2018, Josh Poimboeuf wrote: > On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote: > > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina wrote: > > > On Fri, 5 Jan 2018, Dan Williams wrote: > > > > > > [ ... snip ... ] > > >> Andi Kleen (1): > > >> x86, barrier: stop speculation for failed access_ok > > >> > > >> Dan Williams (13): > > >> x86: implement nospec_barrier() > > >> [media] uvcvideo: prevent bounds-check bypass via speculative > > >> execution > > >> carl9170: prevent bounds-check bypass via speculative execution > > >> p54: prevent bounds-check bypass via speculative execution > > >> qla2xxx: prevent bounds-check bypass via speculative execution > > >> cw1200: prevent bounds-check bypass via speculative execution > > >> Thermal/int340x: prevent bounds-check bypass via speculative > > >> execution > > >> ipv6: prevent bounds-check bypass via speculative execution > > >> ipv4: prevent bounds-check bypass via speculative execution > > >> vfs, fdtable: prevent bounds-check bypass via speculative execution > > >> net: mpls: prevent bounds-check bypass via speculative execution > > >> udf: prevent bounds-check bypass via speculative execution > > >> userns: prevent bounds-check bypass via speculative execution > > >> > > >> Mark Rutland (4): > > >> asm-generic/barrier: add generic nospec helpers > > >> Documentation: document nospec helpers > > >> arm64: implement nospec_ptr() > > >> arm: implement nospec_ptr() > > > > > > So considering the recent publication of [1], how come we all of a sudden > > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and > > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT? > > > > > > Is this going to be handled in eBPF in some other way? > > > > > > Without that in place, and considering Jann Horn's paper, it would seem > > > like PTI doesn't really lock it down fully, right? 
> > > > Here is the latest (v3) bpf fix: > > > > https://patchwork.ozlabs.org/patch/856645/ > > > > I currently have v2 on my 'nospec' branch and will move that to v3 for > > the next update, unless it goes upstream before then. Daniel, I guess you're planning to send this still for 4.15? > That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall > the only attack vector? Or are there other ways to run bpf programs > that we should be worried about? Seems like Alexei is probably the only person in the whole universe who isn't CCed here ... let's fix that. Thanks, -- Jiri Kosina SUSE Labs
Re: [iptables] extensions: add support for 'srh' match
On Wed, 10 Jan 2018 16:32:24 +0100 Pablo Neira Ayuso wrote:

> On Fri, Dec 29, 2017 at 12:08:25PM +0100, Ahmed Abdelsalam wrote:
> > This patch adds a new extension to iptables to support 'srh' match
> > The implementation considers revision 7 of the SRH draft.
> > https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
> >
> > Signed-off-by: Ahmed Abdelsalam
> > ---
> >  extensions/libip6t_srh.c                | 283
> >  include/linux/netfilter_ipv6/ip6t_srh.h |  63 +++
>
> Please, add a extensions/libip6t_srh.t test file and send a v2.
>
> Thanks.

Ok. Are there minimum requirements for the test cases to be added to the
extensions/libip6t_srh.t file?

--
Ahmed
Re: [Patch net v2] tun: fix a memory leak for tfile->tx_array
On 2018-01-11 02:51, Cong Wang wrote:
> tfile->tun could be detached before we close the tun fd,
> via tun_detach_all(), so it should not be used to check for
> tfile->tx_array.
>
> As Jason suggested, we probably have to clean it up
> unconditionally, but this requires to check if it is initialized
> or not. Currently skb_array_cleanup() doesn't have such a check,
> so I check it in the caller, it is ugly but we can always
> improve it in net-next.

Thinking about this again, it looks like I was wrong.

The case I mentioned previously is

open
attach
detach
close

But during close, we will try to enable tfile through tun_enable_queue()
in __tun_detach(), which means we can do the cleanup for sure.

It looks to me that what is actually missing is the cleanup in
tun_detach_all(). For me the only case that could leak is

open
attach
ip link del link dev tap0
close or another set_iff()

So in this case, cleaning up during close is not sufficient since it could
be attached to another device.

Thanks

> Reported-by: Dmitry Vyukov
> Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
> Cc: Jason Wang
> Signed-off-by: Cong Wang
> ---
>  drivers/net/tun.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 4f4a842a1c9c..4c85474ffbaf 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -657,7 +657,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>  			    tun->dev->reg_state == NETREG_REGISTERED)
>  				unregister_netdevice(tun->dev);
>  		}
> -		if (tun)
> +		if (tfile->tx_array.ring.queue)
>  			skb_array_cleanup(&tfile->tx_array);
>  		sock_put(&tfile->sk);
>  	}
> @@ -2851,6 +2851,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>
>  	sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
>
> +	memset(&tfile->tx_array, 0, sizeof(tfile->tx_array));
> +
>  	return 0;
>  }
[patch net-next 4/5] mlxsw: spectrum: qdiscs: Support PRIO qdisc offload
From: Nogah Frankel Add support for offloading PRIO qdisc as root qdisc. The support is for up to 8 bands. Routed packets priority is determined by the DSCP field with the default translations. Bridged packets priority is determined by the PCP field, if exist, otherwise it is set to 0. Since both options have only priorities 0-7, higher priorities mapping are being ignored. Signed-off-by: Nogah Frankel Reviewed-by: Yuval Mintz Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 2 + drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 2 + .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 82 ++ 3 files changed, 86 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index 54c7d9202e81..f78bfe394966 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -1830,6 +1830,8 @@ static int mlxsw_sp_setup_tc(struct net_device *dev, enum tc_setup_type type, return mlxsw_sp_setup_tc_block(mlxsw_sp_port, type_data); case TC_SETUP_QDISC_RED: return mlxsw_sp_setup_tc_red(mlxsw_sp_port, type_data); + case TC_SETUP_QDISC_PRIO: + return mlxsw_sp_setup_tc_prio(mlxsw_sp_port, type_data); default: return -EOPNOTSUPP; } diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h index b6f475e83474..16f8fbda0891 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h @@ -565,6 +565,8 @@ int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port); void mlxsw_sp_tc_qdisc_fini(struct mlxsw_sp_port *mlxsw_sp_port); int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port, struct tc_red_qopt_offload *p); +int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_prio_qopt_offload *p); /* spectrum_fid.c */ int mlxsw_sp_fid_flood_set(struct mlxsw_sp_fid *fid, diff --git 
a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index 273300b75a68..9e83edde7b35 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -41,9 +41,12 @@ #include "spectrum.h" #include "reg.h" +#define MLXSW_SP_PRIO_BAND_TO_TCLASS(band) (IEEE_8021QAZ_MAX_TCS - band - 1) + enum mlxsw_sp_qdisc_type { MLXSW_SP_QDISC_NO_QDISC, MLXSW_SP_QDISC_RED, + MLXSW_SP_QDISC_PRIO, }; struct mlxsw_sp_qdisc_ops { @@ -402,6 +405,85 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port, } } +static int +mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port, + struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) +{ + int i; + + for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) + mlxsw_sp_port_prio_tc_set(mlxsw_sp_port, i, + MLXSW_SP_PORT_DEFAULT_TCLASS); + + return 0; +} + +static int +mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port, +struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, +void *params) +{ + struct tc_prio_qopt_offload_params *p = params; + + if (p->bands > IEEE_8021QAZ_MAX_TCS) + return -EOPNOTSUPP; + + return 0; +} + +static int +mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port, + struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, + void *params) +{ + struct tc_prio_qopt_offload_params *p = params; + int tclass, i; + int err; + + for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) { + tclass = MLXSW_SP_PRIO_BAND_TO_TCLASS(p->priomap[i]); + err = mlxsw_sp_port_prio_tc_set(mlxsw_sp_port, i, tclass); + if (err) + return err; + } + + return 0; +} + +static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = { + .type = MLXSW_SP_QDISC_PRIO, + .check_params = mlxsw_sp_qdisc_prio_check_params, + .replace = mlxsw_sp_qdisc_prio_replace, + .destroy = mlxsw_sp_qdisc_prio_destroy, +}; + +int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_prio_qopt_offload *p) +{ + struct mlxsw_sp_qdisc *mlxsw_sp_qdisc; + + if (p->parent != 
TC_H_ROOT) + return -EOPNOTSUPP; + + mlxsw_sp_qdisc = mlxsw_sp_port->root_qdisc; + if (p->command == TC_PRIO_REPLACE) + return mlxsw_sp_qdisc_replace(mlxsw_sp_port, p->handle, + mlxsw_sp_qdisc, + &mlxsw_sp_qdisc_ops_prio, + &p
[patch net-next 2/5] mlxsw: spectrum_router: Configure default routing priority
From: Yuval Mintz When routing ip packets, the kernel is setting the SKB's priority based on the tos field of the packet. Imitate this behavior in the mlxsw router, having the internal switch priority of a routed packet determined according to its DS field. Signed-off-by: Yuval Mintz Signed-off-by: Nogah Frankel Signed-off-by: Jiri Pirko --- .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 24 ++ 1 file changed, 24 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c index 434b3922b34f..8f115d1c7056 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c @@ -7008,6 +7008,24 @@ static int mlxsw_sp_mp_hash_init(struct mlxsw_sp *mlxsw_sp) } #endif +static int mlxsw_sp_dscp_init(struct mlxsw_sp *mlxsw_sp) +{ + char rdpm_pl[MLXSW_REG_RDPM_LEN]; + unsigned int i; + + MLXSW_REG_ZERO(rdpm, rdpm_pl); + + /* HW is determining switch priority based on DSCP-bits, but the +* kernel is still doing that based on the ToS. Since there's a +* mismatch in bits we need to make sure to translate the right +* value ToS would observe, skipping the 2 least-significant ECN bits. 
+*/ + for (i = 0; i < MLXSW_REG_RDPM_DSCP_ENTRY_REC_MAX_COUNT; i++) + mlxsw_reg_rdpm_pack(rdpm_pl, i, rt_tos2priority(i << 2)); + + return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rdpm), rdpm_pl); +} + static int __mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) { char rgcr_pl[MLXSW_REG_RGCR_LEN]; @@ -7020,6 +7038,7 @@ static int __mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) mlxsw_reg_rgcr_pack(rgcr_pl, true, true); mlxsw_reg_rgcr_max_router_interfaces_set(rgcr_pl, max_rifs); + mlxsw_reg_rgcr_usp_set(rgcr_pl, true); err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rgcr), rgcr_pl); if (err) return err; @@ -7095,6 +7114,10 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) if (err) goto err_mp_hash_init; + err = mlxsw_sp_dscp_init(mlxsw_sp); + if (err) + goto err_dscp_init; + mlxsw_sp->router->fib_nb.notifier_call = mlxsw_sp_router_fib_event; err = register_fib_notifier(&mlxsw_sp->router->fib_nb, mlxsw_sp_router_fib_dump_flush); @@ -7104,6 +7127,7 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) return 0; err_register_fib_notifier: +err_dscp_init: err_mp_hash_init: unregister_netevent_notifier(&mlxsw_sp->router->netevent_nb); err_register_netevent_notifier: -- 2.14.3
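The init loop above indexes the table by DSCP value and shifts each index left by two before the ToS lookup. A minimal userspace sketch of that index arithmetic follows. The names dscp_to_tos, fill_dscp_prio_table and example_tos_to_prio are made up for illustration, and example_tos_to_prio is only a stand-in: it does not reproduce what rt_tos2priority() returns.

```c
#include <stdint.h>

#define DSCP_ENTRY_COUNT 64	/* DSCP is a 6-bit field, so 64 table entries */

/* Hypothetical stand-in type for a ToS-to-priority lookup such as
 * rt_tos2priority(); the kernel's real table is not reproduced here.
 */
typedef uint8_t (*tos_to_prio_fn)(uint8_t tos);

/* ToS byte carried by a packet with this DSCP when both ECN bits are 0. */
static uint8_t dscp_to_tos(uint8_t dscp)
{
	return (uint8_t)(dscp << 2);
}

/* Fill one priority per DSCP value, walking the table like the init loop. */
static void fill_dscp_prio_table(uint8_t table[DSCP_ENTRY_COUNT],
				 tos_to_prio_fn tos_to_prio)
{
	unsigned int i;

	for (i = 0; i < DSCP_ENTRY_COUNT; i++)
		table[i] = tos_to_prio(dscp_to_tos((uint8_t)i));
}

/* Illustrative mapping only: take the top three ToS bits as the priority. */
static uint8_t example_tos_to_prio(uint8_t tos)
{
	return (uint8_t)(tos >> 5);
}
```

The point of the shift is that the lookup function expects a full ToS byte, while the hardware table is indexed by the upper six bits only.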
[patch net-next 5/5] mlxsw: spectrum: qdiscs: Support stats for PRIO qdisc
From: Nogah Frankel

Support basic stats for the PRIO qdisc: tx packet and byte counts, drop count and backlog size. The remaining stats are irrelevant for this qdisc offload.

Backlog is not purely incremental; it reflects a momentary value. When a qdisc stops being offloaded but is not destroyed, its backlog value therefore has to be adjusted to account for the un-offloading. For that reason an unoffload function is added to the ops struct.

Signed-off-by: Nogah Frankel
Reviewed-by: Yuval Mintz
Signed-off-by: Jiri Pirko
---
 .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 92 ++
 1 file changed, 92 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 9e83edde7b35..272c04951e5d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -66,6 +66,11 @@ struct mlxsw_sp_qdisc_ops {
 			  void *xstats_ptr);
 	void (*clean_stats)(struct mlxsw_sp_port *mlxsw_sp_port,
 			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc);
+	/* unoffload - to be used for a qdisc that stops being offloaded without
+	 * being destroyed.
+*/ + void (*unoffload)(struct mlxsw_sp_port *mlxsw_sp_port, + struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params); }; struct mlxsw_sp_qdisc { @@ -73,6 +78,9 @@ struct mlxsw_sp_qdisc { u8 tclass_num; union { struct red_stats red; + struct mlxsw_sp_qdisc_prio_stats { + u64 backlog; + } prio; } xstats_base; struct mlxsw_sp_qdisc_stats { u64 tx_bytes; @@ -144,6 +152,9 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle, err_bad_param: err_config: + if (mlxsw_sp_qdisc->handle == handle && ops->unoffload) + ops->unoffload(mlxsw_sp_port, mlxsw_sp_qdisc, params); + mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc); return err; } @@ -450,11 +461,88 @@ mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port, return 0; } +static void +mlxsw_sp_qdisc_prio_unoffload(struct mlxsw_sp_port *mlxsw_sp_port, + struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, + void *params) +{ + struct tc_prio_qopt_offload_params *p = params; + + *p->backlog -= mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp, + mlxsw_sp_qdisc->xstats_base.prio.backlog); +} + +static int +mlxsw_sp_qdisc_get_prio_stats(struct mlxsw_sp_port *mlxsw_sp_port, + struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, + struct tc_qopt_offload_stats *stats_ptr) +{ + u64 tx_bytes, tx_packets, drops = 0, backlog = 0; + struct mlxsw_sp_qdisc_prio_stats *prio_base; + struct mlxsw_sp_qdisc_stats *stats_base; + struct mlxsw_sp_port_xstats *xstats; + struct rtnl_link_stats64 *stats; + int i; + + prio_base = &mlxsw_sp_qdisc->xstats_base.prio; + xstats = &mlxsw_sp_port->periodic_hw_stats.xstats; + stats = &mlxsw_sp_port->periodic_hw_stats.stats; + stats_base = &mlxsw_sp_qdisc->stats_base; + + tx_bytes = stats->tx_bytes - stats_base->tx_bytes; + tx_packets = stats->tx_packets - stats_base->tx_packets; + + for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) { + drops += xstats->tail_drop[i]; + backlog += xstats->backlog[i]; + } + drops = drops - stats_base->drops; + + _bstats_update(stats_ptr->bstats, tx_bytes, tx_packets); + 
stats_ptr->qstats->drops += drops; + stats_ptr->qstats->backlog += + mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp, +backlog) - + mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp, +prio_base->backlog); + prio_base->backlog = backlog; + stats_base->drops += drops; + stats_base->tx_bytes += tx_bytes; + stats_base->tx_packets += tx_packets; + return 0; +} + +static void +mlxsw_sp_setup_tc_qdisc_prio_clean_stats(struct mlxsw_sp_port *mlxsw_sp_port, +struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) +{ + struct mlxsw_sp_qdisc_stats *stats_base; + struct mlxsw_sp_port_xstats *xstats; + struct rtnl_link_stats64 *stats; + int i; + + xstats = &mlxsw_sp_port->periodic_hw_stats.xstats; + stats = &mlxsw_sp_port->periodic_hw_stats.stats; + stats_base = &mlxsw_sp_qdisc->stats_base; + + stats_base->tx_packets = stats->tx_packets; + stats_base->tx_bytes = stats->tx_bytes; + + stats_base->drops = 0; + for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) +
[patch net-next 0/5] mlxsw: Offload PRIO qdisc
From: Jiri Pirko

Add offload support for the PRIO qdisc to the mlxsw driver.

The PRIO qdisc is offloaded using ndo_setup_tc. There are three commands: one to set or tune the qdisc, one to remove it and one to get its stats. As with RED offloading, the driver is not forced to offload this qdisc; its offload state is determined in the dump action, when the stats are updated.

In the driver, PRIO offload is supported for the root qdisc only. It supports only priorities 0-7 (the range used by the current static mapping of DSCP to skb prio and by the 1:1 mapping of PCP values) and up to 8 bands.

Patches 1-2 offload DSCP to priority mapping in the mlxsw_sp driver.
Patch 3 adds offload support for PRIO qdisc.
Patches 4-5 add PRIO offload support in the mlxsw_sp driver.

Nogah Frankel (3):
  net: sch: prio: Add offload ability to PRIO qdisc
  mlxsw: spectrum: qdiscs: Support PRIO qdisc offload
  mlxsw: spectrum: qdiscs: Support stats for PRIO qdisc

Yuval Mintz (2):
  mlxsw: reg: add rdpm register
  mlxsw: spectrum_router: Configure default routing priority

 drivers/net/ethernet/mellanox/mlxsw/item.h | 2 +-
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 37 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 2 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 2 +
 .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 174 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 24 +++
 include/linux/netdevice.h | 1 +
 include/net/pkt_cls.h | 25 +++
 net/sched/sch_prio.c | 59 +++
 9 files changed, 325 insertions(+), 1 deletion(-)

--
2.14.3
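The limits in the cover letter (priorities 0-7, at most 8 bands) combine with the band to traffic class inversion that patch 4 introduces as MLXSW_SP_PRIO_BAND_TO_TCLASS. The following is a rough userspace model of that check and mapping, not the driver code. The names MAX_TCS, PRIO_COUNT, band_to_tclass and prio_map_to_tclass are assumptions chosen for this sketch.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_TCS    8	/* mirrors IEEE_8021QAZ_MAX_TCS as used in the series */
#define PRIO_COUNT 8	/* priorities 0-7, per the cover letter */

/* Band 0 is the most urgent PRIO band; the series maps it to the highest
 * traffic class, so the band number is inverted.
 */
static int band_to_tclass(int band)
{
	return MAX_TCS - band - 1;
}

/* Reject configurations with too many bands, then translate each
 * priority's band into a traffic class.
 */
static int prio_map_to_tclass(int bands, const uint8_t priomap[PRIO_COUNT],
			      int tclass_out[PRIO_COUNT])
{
	int i;

	if (bands > MAX_TCS)
		return -EOPNOTSUPP;

	for (i = 0; i < PRIO_COUNT; i++)
		tclass_out[i] = band_to_tclass(priomap[i]);

	return 0;
}
```

With a three-band qdisc, a priority mapped to band 0 lands in traffic class 7 and one mapped to band 2 lands in traffic class 5.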
[patch net-next 3/5] net: sch: prio: Add offload ability to PRIO qdisc
From: Nogah Frankel Add the ability to offload PRIO qdisc by using ndo_setup_tc. There are three commands for PRIO offloading: * TC_PRIO_REPLACE: handles set and tune * TC_PRIO_DESTROY: handles qdisc destroy * TC_PRIO_STATS: updates the qdiscs counters (given as reference) Like RED qdisc, the indication of whether PRIO is being offloaded is being set and updated as part of the dump function. It is so because the driver could decide to offload or not based on the qdisc parent, which could change without notifying the qdisc. Signed-off-by: Nogah Frankel Reviewed-by: Yuval Mintz Signed-off-by: Jiri Pirko --- include/linux/netdevice.h | 1 + include/net/pkt_cls.h | 25 net/sched/sch_prio.c | 59 +++ 3 files changed, 85 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ef7b348e8498..6d95477b962c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -780,6 +780,7 @@ enum tc_setup_type { TC_SETUP_BLOCK, TC_SETUP_QDISC_CBS, TC_SETUP_QDISC_RED, + TC_SETUP_QDISC_PRIO, }; /* These structures hold the attributes of bpf state that are being passed diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index 0d1343cba84c..4ba8c3ba3dd4 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -761,4 +761,29 @@ struct tc_red_qopt_offload { }; }; +enum tc_prio_command { + TC_PRIO_REPLACE, + TC_PRIO_DESTROY, + TC_PRIO_STATS, +}; + +struct tc_prio_qopt_offload_params { + int bands; + u8 priomap[TC_PRIO_MAX + 1]; + /* In case that a prio qdisc is offloaded and now is changed to a +* non-offloadedable config, it needs to update the backlog value +* to negate the HW backlog value. 
+*/ + u32 *backlog; +}; + +struct tc_prio_qopt_offload { + enum tc_prio_command command; + u32 handle; + u32 parent; + union { + struct tc_prio_qopt_offload_params replace_params; + struct tc_qopt_offload_stats stats; + }; +}; #endif diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index fe1510eb111f..3f47a30ce72f 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -142,6 +142,31 @@ prio_reset(struct Qdisc *sch) sch->q.qlen = 0; } +static int prio_offload(struct Qdisc *sch, bool enable) +{ + struct prio_sched_data *q = qdisc_priv(sch); + struct net_device *dev = qdisc_dev(sch); + struct tc_prio_qopt_offload opt = { + .handle = sch->handle, + .parent = sch->parent, + }; + + if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) + return -EOPNOTSUPP; + + if (enable) { + opt.command = TC_PRIO_REPLACE; + opt.replace_params.bands = q->bands; + memcpy(&opt.replace_params.priomap, q->prio2band, + TC_PRIO_MAX + 1); + opt.replace_params.backlog = &sch->qstats.backlog; + } else { + opt.command = TC_PRIO_DESTROY; + } + + return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_PRIO, &opt); +} + static void prio_destroy(struct Qdisc *sch) { @@ -149,6 +174,7 @@ prio_destroy(struct Qdisc *sch) struct prio_sched_data *q = qdisc_priv(sch); tcf_block_put(q->block); + prio_offload(sch, false); for (prio = 0; prio < q->bands; prio++) qdisc_destroy(q->queues[prio]); } @@ -204,6 +230,7 @@ static int prio_tune(struct Qdisc *sch, struct nlattr *opt, } sch_tree_unlock(sch); + prio_offload(sch, true); return 0; } @@ -223,15 +250,47 @@ static int prio_init(struct Qdisc *sch, struct nlattr *opt, return prio_tune(sch, opt, extack); } +static int prio_dump_offload(struct Qdisc *sch) +{ + struct net_device *dev = qdisc_dev(sch); + struct tc_prio_qopt_offload hw_stats = { + .handle = sch->handle, + .parent = sch->parent, + .command = TC_PRIO_STATS, + .stats.bstats = &sch->bstats, + .stats.qstats = &sch->qstats, + }; + int err; + + sch->flags &= ~TCQ_F_OFFLOADED; + if 
(!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) + return 0; + + err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_PRIO, + &hw_stats); + if (err == -EOPNOTSUPP) + return 0; + + if (!err) + sch->flags |= TCQ_F_OFFLOADED; + + return err; +} + static int prio_dump(struct Qdisc *sch, struct sk_buff *skb) { struct prio_sched_data *q = qdisc_priv(sch); unsigned char *b = skb_tail_pointer(skb); struct tc_prio_qopt opt; + int err; opt.bands = q->bands; memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX + 1); + err = p
[patch net-next 1/5] mlxsw: reg: add rdpm register
From: Yuval Mintz Add rdpm definition - router DSCP to priority mapping register. Signed-off-by: Yuval Mintz Signed-off-by: Nogah Frankel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/item.h | 2 +- drivers/net/ethernet/mellanox/mlxsw/reg.h | 37 ++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h index 28427f0758c7..31c886edc791 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/item.h +++ b/drivers/net/ethernet/mellanox/mlxsw/item.h @@ -42,7 +42,7 @@ struct mlxsw_item { unsigned short offset; /* bytes in container */ - unsigned short step; /* step in bytes for indexed items */ + short step; /* step in bytes for indexed items */ unsigned short in_step_offset; /* offset within one step */ unsigned char shift; /* shift in bits */ unsigned char element_size; /* size of element in bit array */ diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h index 6c4e08b8058a..0e08be41c8e0 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/reg.h +++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h @@ -4827,6 +4827,42 @@ static inline void mlxsw_reg_ratr_counter_pack(char *payload, u64 counter_index, mlxsw_reg_ratr_counter_set_type_set(payload, set_type); } +/* RDPM - Router DSCP to Priority Mapping + * -- + * Controls the mapping from DSCP field to switch priority on routed packets + */ +#define MLXSW_REG_RDPM_ID 0x8009 +#define MLXSW_REG_RDPM_BASE_LEN 0x00 +#define MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN 0x01 +#define MLXSW_REG_RDPM_DSCP_ENTRY_REC_MAX_COUNT 64 +#define MLXSW_REG_RDPM_LEN 0x40 +#define MLXSW_REG_RDPM_LAST_ENTRY (MLXSW_REG_RDPM_BASE_LEN + \ + MLXSW_REG_RDPM_LEN - \ + MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN) + +MLXSW_REG_DEFINE(rdpm, MLXSW_REG_RDPM_ID, MLXSW_REG_RDPM_LEN); + +/* reg_dscp_entry_e + * Enable update of the specific entry + * Access: Index + */ +MLXSW_ITEM8_INDEXED(reg, rdpm, dscp_entry_e, 
MLXSW_REG_RDPM_LAST_ENTRY, 7, 1, + -MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN, 0x00, false); + +/* reg_dscp_entry_prio + * Switch Priority + * Access: RW + */ +MLXSW_ITEM8_INDEXED(reg, rdpm, dscp_entry_prio, MLXSW_REG_RDPM_LAST_ENTRY, 0, 4, + -MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN, 0x00, false); + +static inline void mlxsw_reg_rdpm_pack(char *payload, unsigned short index, + u8 prio) +{ + mlxsw_reg_rdpm_dscp_entry_e_set(payload, index, 1); + mlxsw_reg_rdpm_dscp_entry_prio_set(payload, index, prio); +} + /* RICNT - Router Interface Counter Register * - * The RICNT register retrieves per port performance counters @@ -7640,6 +7676,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = { MLXSW_REG(rtar), MLXSW_REG(ratr), MLXSW_REG(rtdp), + MLXSW_REG(rdpm), MLXSW_REG(ricnt), MLXSW_REG(rrcr), MLXSW_REG(ralta), -- 2.14.3
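The item definitions above use a base offset of MLXSW_REG_RDPM_LAST_ENTRY with a negative step, which appears to be why the step field in item.h becomes signed. As I read it, entry 0 sits in the last byte of the payload and entry 63 in the first, with bit 7 as the update-enable flag and bits 0-3 as the priority. A small sketch under that reading follows. The names rdpm_entry_offset and rdpm_pack_model are invented for illustration and are not the driver's helpers.

```c
#include <stdint.h>

#define RDPM_LEN        0x40
#define RDPM_ENTRY_LEN  0x01
#define RDPM_LAST_ENTRY (RDPM_LEN - RDPM_ENTRY_LEN)	/* 0x3f, base len is 0 */

/* Byte offset of a DSCP entry: base offset plus a step of -1 per index. */
static unsigned int rdpm_entry_offset(unsigned int dscp)
{
	return RDPM_LAST_ENTRY - dscp * RDPM_ENTRY_LEN;
}

/* Set the enable bit (bit 7) and the 4-bit priority (bits 0-3) of one entry. */
static void rdpm_pack_model(uint8_t payload[RDPM_LEN], unsigned int dscp,
			    uint8_t prio)
{
	payload[rdpm_entry_offset(dscp)] = (uint8_t)(0x80 | (prio & 0x0f));
}
```

An unsigned step could not express this downward walk, since adding it would only move the offset away from the start of the payload.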
Re: KASAN: use-after-free Read in __bpf_prog_put
On Thu, Jan 11, 2018 at 11:17 AM, syzbot wrote: > Hello, > > syzkaller hit the following crash on > 4147d50978df60f34d444c647dde9e5b34a4315e > git://git.cmpxchg.org/linux-mmots.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > Unfortunately, I don't have any reproducer for this bug yet. > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+d85bfb332db8f0794...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > netlink: 3 bytes leftover after parsing attributes in process > `syz-executor5'. > == > BUG: KASAN: use-after-free in __bpf_prog_put+0x5e8/0x640 > kernel/bpf/syscall.c:944 > netlink: 'syz-executor5': attribute type 5 has an invalid length. > Read of size 8 at addr 8801d3619658 by task syz-executor0/12398 > > CPU: 1 PID: 12398 Comm: syz-executor0 Not tainted 4.15.0-rc7-mm1+ #53 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:53 > print_address_description+0x73/0x250 mm/kasan/report.c:256 > kasan_report_error mm/kasan/report.c:354 [inline] > kasan_report+0x23b/0x360 mm/kasan/report.c:412 > __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 > __bpf_prog_put+0x5e8/0x640 kernel/bpf/syscall.c:944 > bpf_prog_put+0x1a/0x20 kernel/bpf/syscall.c:961 > prog_fd_array_put_ptr+0x15/0x20 kernel/bpf/arraymap.c:446 > fd_array_map_delete_elem+0xc8/0x110 kernel/bpf/arraymap.c:420 > map_delete_elem kernel/bpf/syscall.c:737 [inline] > SYSC_bpf kernel/bpf/syscall.c:1814 [inline] > SyS_bpf+0x22ea/0x4400 kernel/bpf/syscall.c:1782 > entry_SYSCALL_64_fastpath+0x29/0xa0 > RIP: 0033:0x452ac9 > RSP: 002b:7fb70df60c58 EFLAGS: 0212 ORIG_RAX: 0141 > RAX: ffda RBX: 0071bea0 RCX: 00452ac9 > RDX: 0010 RSI: 
20f02ff0 RDI: 0003 > RBP: 03aa R08: R09: > R10: R11: 0212 R12: 006f3890 > R13: R14: 7fb70df616d4 R15: > > Allocated by task 11996: > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552 > kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489 > kmem_cache_alloc+0x12e/0x760 mm/slab.c:3541 > kmem_cache_zalloc include/linux/slab.h:694 [inline] > get_empty_filp+0xfb/0x4f0 fs/file_table.c:122 > path_openat+0xed/0x3530 fs/namei.c:3514 > do_filp_open+0x25b/0x3b0 fs/namei.c:3572 > do_sys_open+0x502/0x6d0 fs/open.c:1059 > SYSC_open fs/open.c:1077 [inline] > SyS_open+0x2d/0x40 fs/open.c:1072 > entry_SYSCALL_64_fastpath+0x29/0xa0 > > Freed by task 11994: > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520 > kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527 > __cache_free mm/slab.c:3485 [inline] > kmem_cache_free+0x86/0x2b0 mm/slab.c:3743 > file_free_rcu+0x5c/0x70 fs/file_table.c:49 > __rcu_reclaim kernel/rcu/rcu.h:172 [inline] > rcu_do_batch kernel/rcu/tree.c:2675 [inline] > invoke_rcu_callbacks kernel/rcu/tree.c:2934 [inline] > __rcu_process_callbacks kernel/rcu/tree.c:2901 [inline] > rcu_process_callbacks+0xd6c/0x17f0 kernel/rcu/tree.c:2918 > __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 > > The buggy address belongs to the object at 8801d36195c0 > which belongs to the cache filp of size 456 > The buggy address is located 152 bytes inside of > 456-byte region [8801d36195c0, 8801d3619788) > The buggy address belongs to the page: > page:ea00074d8640 count:1 mapcount:0 mapping:8801d36190c0 index:0x0 > flags: 0x2fffc000100(slab) > raw: 02fffc000100 8801d36190c0 00010006 > raw: ea00074c49a0 ea000747a160 8801dae30180 > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > 8801d3619500: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > 8801d3619580: fc fc fc fc fc fc fc fc fb fb fb fb fb fb 
fb fb >> >> 8801d3619600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ^ > 8801d3619680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > 8801d3619700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > == Is it the same as "general protection fault in __bpf_prog_put"? https://groups.google.com/forum/#!topic/syzkaller-bugs/jUsNMmVgms0 The first stack looks similar, but alloc/free stacks looks unrelated. What's the root cause of this? Is prog->aux an unini
Re: [Patch net] tipc: fix a memory leak in tipc_nl_node_get_link()
On 01/11/2018 04:50 AM, Cong Wang wrote: > When tipc_node_find_by_name() fails, the nlmsg is not > freed. > > While on it, switch to a goto label to properly > free it. > > Fixes: be9c086715c ("tipc: narrow down exposure of struct tipc_node") > Reported-by: Dmitry Vyukov > Cc: Jon Maloy > Cc: Ying Xue > Signed-off-by: Cong Wang Acked-by: Ying Xue > --- > net/tipc/node.c | 26 ++ > 1 file changed, 14 insertions(+), 12 deletions(-) > > diff --git a/net/tipc/node.c b/net/tipc/node.c > index 507017fe0f1b..9036d8756e73 100644 > --- a/net/tipc/node.c > +++ b/net/tipc/node.c > @@ -1880,36 +1880,38 @@ int tipc_nl_node_get_link(struct sk_buff *skb, struct > genl_info *info) > > if (strcmp(name, tipc_bclink_name) == 0) { > err = tipc_nl_add_bc_link(net, &msg); > - if (err) { > - nlmsg_free(msg.skb); > - return err; > - } > + if (err) > + goto err_free; > } else { > int bearer_id; > struct tipc_node *node; > struct tipc_link *link; > > node = tipc_node_find_by_name(net, name, &bearer_id); > - if (!node) > - return -EINVAL; > + if (!node) { > + err = -EINVAL; > + goto err_free; > + } > > tipc_node_read_lock(node); > link = node->links[bearer_id].link; > if (!link) { > tipc_node_read_unlock(node); > - nlmsg_free(msg.skb); > - return -EINVAL; > + err = -EINVAL; > + goto err_free; > } > > err = __tipc_nl_add_link(net, &msg, link, 0); > tipc_node_read_unlock(node); > - if (err) { > - nlmsg_free(msg.skb); > - return err; > - } > + if (err) > + goto err_free; > } > > return genlmsg_reply(msg.skb, info); > + > +err_free: > + nlmsg_free(msg.skb); > + return err; > } > > int tipc_nl_node_reset_link_stats(struct sk_buff *skb, struct genl_info > *info) >
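The shape of the fix, with every failure path jumping to one label that frees the reply, can be modelled outside the kernel. This is only a sketch: reply_model, reply_free, reply_send and build_and_send are hypothetical names, and the two counters exist purely so the behaviour can be observed.

```c
#include <errno.h>
#include <stdlib.h>

struct reply_model {
	char *buf;
};

static int free_calls;	/* how often the error path released a reply */
static int send_calls;	/* how often a reply was handed off */

static void reply_free(struct reply_model *r)
{
	free(r->buf);
	r->buf = NULL;
	free_calls++;
}

/* Models a send routine that takes ownership of the buffer. */
static int reply_send(struct reply_model *r)
{
	free(r->buf);
	r->buf = NULL;
	send_calls++;
	return 0;
}

static int build_and_send(int lookup_ok, int fill_ok)
{
	struct reply_model r;
	int err;

	r.buf = malloc(64);
	if (!r.buf)
		return -ENOMEM;

	if (!lookup_ok) {
		/* This is the kind of early return that leaked before the fix. */
		err = -EINVAL;
		goto err_free;
	}

	if (!fill_ok) {
		err = -EMSGSIZE;
		goto err_free;
	}

	return reply_send(&r);

err_free:
	reply_free(&r);
	return err;
}
```

Keeping a single exit label means a new failure path added later only has to set err and jump, rather than remember to free the buffer itself.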
Re: [PATCH 34/38] arm: Implement thread_struct whitelist for hardened usercopy
On Wed, Jan 10, 2018 at 06:03:06PM -0800, Kees Cook wrote: > ARM does not carry FPU state in the thread structure, so it can declare > no usercopy whitelist at all. This comment seems to be misleading. We have stored FP state in the thread structure for a long time - for example, VFP state is stored in thread->vfpstate.hard, so we _do_ have floating point state in the thread structure. What I think this commit message needs to describe is why we don't need a whitelist _despite_ having FP state in the thread structure. At the moment, the commit message is making me think that this patch is wrong and will introduce a regression. Thanks. > > Cc: Russell King > Cc: Ingo Molnar > Cc: Christian Borntraeger > Cc: "Peter Zijlstra (Intel)" > Cc: linux-arm-ker...@lists.infradead.org > Signed-off-by: Kees Cook > --- > arch/arm/Kconfig | 1 + > arch/arm/include/asm/processor.h | 7 +++ > 2 files changed, 8 insertions(+) > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > index 51c8df561077..3ea00d65f35d 100644 > --- a/arch/arm/Kconfig > +++ b/arch/arm/Kconfig > @@ -50,6 +50,7 @@ config ARM > select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU > select HAVE_ARCH_MMAP_RND_BITS if MMU > select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT) > + select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_TRACEHOOK > select HAVE_ARM_SMCCC if CPU_V7 > select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32 > diff --git a/arch/arm/include/asm/processor.h > b/arch/arm/include/asm/processor.h > index 338cbe0a18ef..01a41be58d43 100644 > --- a/arch/arm/include/asm/processor.h > +++ b/arch/arm/include/asm/processor.h > @@ -45,6 +45,13 @@ struct thread_struct { > struct debug_info debug; > }; > > +/* Nothing needs to be usercopy-whitelisted from thread_struct. 
*/ > +static inline void arch_thread_struct_whitelist(unsigned long *offset, > + unsigned long *size) > +{ > + *offset = *size = 0; > +} > + > #define INIT_THREAD { } > > #define start_thread(regs,pc,sp) \ > -- > 2.7.4 > -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
[PATCH] [net-next] net: socionext: include linux/io.h to fix build
I ran into a randconfig build failure:

drivers/net/ethernet/socionext/netsec.c: In function 'netsec_probe':
drivers/net/ethernet/socionext/netsec.c:1583:17: error: implicit declaration of function 'devm_ioremap'; did you mean 'ioremap'? [-Werror=implicit-function-declaration]

Including linux/io.h directly fixes this.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Signed-off-by: Arnd Bergmann
---
 drivers/net/ethernet/socionext/netsec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index a8edcf387bba..af47147dd656 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include <linux/io.h>

 #include
 #include
--
2.9.0
Re: [PATCH] [net-next] net: socionext: include linux/io.h to fix build
On 11 January 2018 at 10:36, Arnd Bergmann wrote: > I ran into a randconfig build failure: > > drivers/net/ethernet/socionext/netsec.c: In function 'netsec_probe': > drivers/net/ethernet/socionext/netsec.c:1583:17: error: implicit declaration > of function 'devm_ioremap'; did you mean 'ioremap'? > [-Werror=implicit-function-declaration] > > Including linux/io.h directly fixes this. > > Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver") > Signed-off-by: Arnd Bergmann Thanks for fixing this. This is the same issue spotted by kbuild test robot. Acked-by: Ard Biesheuvel > --- > drivers/net/ethernet/socionext/netsec.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/ethernet/socionext/netsec.c > b/drivers/net/ethernet/socionext/netsec.c > index a8edcf387bba..af47147dd656 100644 > --- a/drivers/net/ethernet/socionext/netsec.c > +++ b/drivers/net/ethernet/socionext/netsec.c > @@ -8,6 +8,7 @@ > #include > #include > #include > +#include > > #include > #include > -- > 2.9.0 >
Re: KASAN: use-after-free Read in __bpf_prog_put
Hi Dmitry, On 01/11/2018 11:22 AM, Dmitry Vyukov wrote: > On Thu, Jan 11, 2018 at 11:17 AM, syzbot > wrote: >> Hello, >> >> syzkaller hit the following crash on >> 4147d50978df60f34d444c647dde9e5b34a4315e >> git://git.cmpxchg.org/linux-mmots.git/master >> compiler: gcc (GCC) 7.1.1 20170620 >> .config is attached >> Raw console output is attached. >> Unfortunately, I don't have any reproducer for this bug yet. >> >> >> IMPORTANT: if you fix the bug, please add the following tag to the commit: >> Reported-by: syzbot+d85bfb332db8f0794...@syzkaller.appspotmail.com >> It will help syzbot understand when the bug is fixed. See footer for >> details. >> If you forward the report, please keep this part and the footer. >> >> netlink: 3 bytes leftover after parsing attributes in process >> `syz-executor5'. >> == >> BUG: KASAN: use-after-free in __bpf_prog_put+0x5e8/0x640 >> kernel/bpf/syscall.c:944 >> netlink: 'syz-executor5': attribute type 5 has an invalid length. >> Read of size 8 at addr 8801d3619658 by task syz-executor0/12398 >> >> CPU: 1 PID: 12398 Comm: syz-executor0 Not tainted 4.15.0-rc7-mm1+ #53 >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS >> Google 01/01/2011 >> Call Trace: >> __dump_stack lib/dump_stack.c:17 [inline] >> dump_stack+0x194/0x257 lib/dump_stack.c:53 >> print_address_description+0x73/0x250 mm/kasan/report.c:256 >> kasan_report_error mm/kasan/report.c:354 [inline] >> kasan_report+0x23b/0x360 mm/kasan/report.c:412 >> __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 >> __bpf_prog_put+0x5e8/0x640 kernel/bpf/syscall.c:944 >> bpf_prog_put+0x1a/0x20 kernel/bpf/syscall.c:961 >> prog_fd_array_put_ptr+0x15/0x20 kernel/bpf/arraymap.c:446 >> fd_array_map_delete_elem+0xc8/0x110 kernel/bpf/arraymap.c:420 >> map_delete_elem kernel/bpf/syscall.c:737 [inline] >> SYSC_bpf kernel/bpf/syscall.c:1814 [inline] >> SyS_bpf+0x22ea/0x4400 kernel/bpf/syscall.c:1782 >> entry_SYSCALL_64_fastpath+0x29/0xa0 >> RIP: 0033:0x452ac9 >> RSP: 
002b:7fb70df60c58 EFLAGS: 0212 ORIG_RAX: 0141 >> RAX: ffda RBX: 0071bea0 RCX: 00452ac9 >> RDX: 0010 RSI: 20f02ff0 RDI: 0003 >> RBP: 03aa R08: R09: >> R10: R11: 0212 R12: 006f3890 >> R13: R14: 7fb70df616d4 R15: >> >> Allocated by task 11996: >> save_stack+0x43/0xd0 mm/kasan/kasan.c:447 >> set_track mm/kasan/kasan.c:459 [inline] >> kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552 >> kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489 >> kmem_cache_alloc+0x12e/0x760 mm/slab.c:3541 >> kmem_cache_zalloc include/linux/slab.h:694 [inline] >> get_empty_filp+0xfb/0x4f0 fs/file_table.c:122 >> path_openat+0xed/0x3530 fs/namei.c:3514 >> do_filp_open+0x25b/0x3b0 fs/namei.c:3572 >> do_sys_open+0x502/0x6d0 fs/open.c:1059 >> SYSC_open fs/open.c:1077 [inline] >> SyS_open+0x2d/0x40 fs/open.c:1072 >> entry_SYSCALL_64_fastpath+0x29/0xa0 >> >> Freed by task 11994: >> save_stack+0x43/0xd0 mm/kasan/kasan.c:447 >> set_track mm/kasan/kasan.c:459 [inline] >> __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520 >> kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527 >> __cache_free mm/slab.c:3485 [inline] >> kmem_cache_free+0x86/0x2b0 mm/slab.c:3743 >> file_free_rcu+0x5c/0x70 fs/file_table.c:49 >> __rcu_reclaim kernel/rcu/rcu.h:172 [inline] >> rcu_do_batch kernel/rcu/tree.c:2675 [inline] >> invoke_rcu_callbacks kernel/rcu/tree.c:2934 [inline] >> __rcu_process_callbacks kernel/rcu/tree.c:2901 [inline] >> rcu_process_callbacks+0xd6c/0x17f0 kernel/rcu/tree.c:2918 >> __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 >> >> The buggy address belongs to the object at 8801d36195c0 >> which belongs to the cache filp of size 456 >> The buggy address is located 152 bytes inside of >> 456-byte region [8801d36195c0, 8801d3619788) >> The buggy address belongs to the page: >> page:ea00074d8640 count:1 mapcount:0 mapping:8801d36190c0 index:0x0 >> flags: 0x2fffc000100(slab) >> raw: 02fffc000100 8801d36190c0 00010006 >> raw: ea00074c49a0 ea000747a160 8801dae30180 >> page dumped because: kasan: bad access detected >> >> Memory 
state around the buggy address: >> 8801d3619500: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >> 8801d3619580: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb >>> >>> 8801d3619600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> >> ^ >> 8801d3619680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> 8801d3619700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> == > > > Is it the same as "general protection fault in __bpf_prog_put"? > https://groups.goog
Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get
Thu, Jan 11, 2018 at 10:37:10AM CET, j...@resnulli.us wrote:
>Wed, Jan 10, 2018 at 05:48:09PM CET, dsah...@gmail.com wrote:
>>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>>> index 9c026d9..038cde7 100644
>>> --- a/include/uapi/linux/rtnetlink.h
>>> +++ b/include/uapi/linux/rtnetlink.h
>>> @@ -150,6 +150,12 @@ enum {
>>> RTM_NEWCACHEREPORT = 96,
>>> #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
>>>
>>> + RTM_NEWBLOCK = 100,
>>> +#define RTM_NEWBLOCK RTM_NEWBLOCK
>>> + RTM_DELBLOCK,
>>> +#define RTM_DELBLOCK RTM_DELBLOCK
>>> + RTM_GETBLOCK,
>>> +#define RTM_GETBLOCK RTM_GETBLOCK
>>> __RTM_MAX,
>>> #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
>>> };
>>
>>Seems like this is creating an inconsistency. RTM_GETBLOCK is used to
>>dump the set of shared blocks, but RTM_NEWBLOCK / RTM_DELBLOCK are not
>>used to create / delete one.
>
>Why is it a problem? RTM_NEWBLOCK is used as a reply for RTM_GETBLOCK.
>I plan to have block notifications as a follow-up, there the RTM_GETBLOCK

I mean RTM_NEWBLOCK and RTM_DELBLOCK of course.

>and RTM_DELBLOCK will be used. The fact the user cannot create and
>delete block explicitly is no problem in my opinion. The block creation
>and deletion is done according to usage of qdiscs.
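For reference, the arithmetic in the quoted RTM_MAX definition can be checked with a standalone copy of the enum tail. The EX_ prefixed names below are placeholders so the sketch does not clash with the real uapi header, and only the values visible in the quoted hunk are used.

```c
/* Standalone copy of the quoted enum tail, renamed to avoid the real header. */
enum {
	EX_RTM_NEWCACHEREPORT = 96,

	EX_RTM_NEWBLOCK = 100,
	EX_RTM_DELBLOCK,	/* 101 */
	EX_RTM_GETBLOCK,	/* 102 */
	EX_RTM_ENUM_END,	/* 103, plays the role of __RTM_MAX */
};

/* Round the end marker up to a multiple of four, then step back by one. */
#define EX_RTM_MAX (((EX_RTM_ENUM_END + 3) & ~3) - 1)

/* Position of a message type inside its group of four. */
static int ex_rtm_slot(int type)
{
	return type & 3;
}
```

Here the end marker is 103, so (103 + 3) & ~3 gives 104 and the maximum works out to 103. The new block starts at 100 rather than 97, which keeps NEW, DEL and GET in slots 0, 1 and 2 of a fresh group of four.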
[PATCH net-next] net: socionext: Fix error return code in netsec_netdev_open()
Fix to return error code -ENODEV from the of_phy_connect() error handling case instead of 0, as done elsewhere in this function. Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver") Signed-off-by: Wei Yongjun --- drivers/net/ethernet/socionext/netsec.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c index a8edcf3..78e4ff6 100644 --- a/drivers/net/ethernet/socionext/netsec.c +++ b/drivers/net/ethernet/socionext/netsec.c @@ -1292,6 +1292,7 @@ static int netsec_netdev_open(struct net_device *ndev) netsec_phy_adjust_link, 0, priv->phy_interface)) { netif_err(priv, link, priv->ndev, "missing PHY\n"); + ret = -ENODEV; goto err3; } } else {
[PATCH net-next] net: phy: mdio-bcm-unimac: fix potential NULL dereference in unimac_mdio_probe()
platform_get_resource() may fail and return NULL, so we should check its return value to avoid a NULL pointer dereference a bit later in the code.

This is detected by Coccinelle semantic patch.

@@
expression pdev, res, n, t, e, e1, e2;
@@

res = platform_get_resource(pdev, t, n);
+ if (!res)
+   return -EINVAL;
... when != res == NULL
e = devm_ioremap(e1, res->start, e2);

Signed-off-by: Wei Yongjun
---
 drivers/net/phy/mdio-bcm-unimac.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/phy/mdio-bcm-unimac.c b/drivers/net/phy/mdio-bcm-unimac.c
index 08e0647..8d37066 100644
--- a/drivers/net/phy/mdio-bcm-unimac.c
+++ b/drivers/net/phy/mdio-bcm-unimac.c
@@ -205,6 +205,8 @@ static int unimac_mdio_probe(struct platform_device *pdev)
 		return -ENOMEM;

 	r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!r)
+		return -EINVAL;

 	/* Just ioremap, as this MDIO block is usually integrated into an
 	 * Ethernet MAC controller register range
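The pattern the semantic patch enforces is "check the lookup result before touching its fields". A userspace model of it might look like the following. The names resource_model, lookup_resource and probe_model are hypothetical stand-ins and do not correspond to the platform device API.

```c
#include <errno.h>
#include <stddef.h>

struct resource_model {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical lookup that, like the call in the patch, may return NULL. */
static const struct resource_model *
lookup_resource(const struct resource_model *table, size_t count, size_t index)
{
	return index < count ? &table[index] : NULL;
}

static int probe_model(const struct resource_model *table, size_t count,
		       unsigned long *size_out)
{
	const struct resource_model *r = lookup_resource(table, count, 0);

	/* Without this check, the dereference below would fault on NULL. */
	if (!r)
		return -EINVAL;

	*size_out = r->end - r->start + 1;
	return 0;
}
```

Returning early with an error code keeps the failure visible to the caller instead of turning a missing resource into a crash.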
[PATCH net-next 03/11] net: hns3: add ethtool_ops.get_coalesce support to PF
From: Fuyun Liang This patch adds ethtool_ops.get_coalesce support to PF. Whilst our hardware supports per queue values, external interfaces support only a single shared value. As such we use the values for queue 0. Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 2 ++ drivers/net/ethernet/hisilicon/hns3/hns3_enet.h| 1 + drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 37 ++ 3 files changed, 40 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index adec88d..0bad0e3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -448,6 +448,8 @@ struct hnae3_knic_private_info { u16 num_tqps; /* total number of TQPs in this handle */ struct hnae3_queue **tqp; /* array base of all TQPs in this instance */ const struct hnae3_dcb_ops *dcb_ops; + + u16 int_rl_setting; }; struct hnae3_roce_private_info { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index a2a7ea3..24f6109 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -464,6 +464,7 @@ struct hns3_enet_ring_group { u16 count; enum hns3_flow_level_range flow_level; u16 int_gl; + u8 gl_adapt_enable; }; struct hns3_enet_tqp_vector { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index f44336c..81b4b3b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -887,6 +887,42 @@ static void hns3_get_channels(struct net_device *netdev, h->ae_algo->ops->get_channels(h, ch); } +static int hns3_get_coalesce_per_queue(struct net_device *netdev, u32 queue, + struct ethtool_coalesce *cmd) +{ + struct hns3_enet_tqp_vector *tx_vector, *rx_vector; + struct hns3_nic_priv *priv = netdev_priv(netdev); + struct 
hnae3_handle *h = priv->ae_handle; + u16 queue_num = h->kinfo.num_tqps; + + if (queue >= queue_num) { + netdev_err(netdev, + "Invalid queue value %d! Queue max id=%d\n", + queue, queue_num - 1); + return -EINVAL; + } + + tx_vector = priv->ring_data[queue].ring->tqp_vector; + rx_vector = priv->ring_data[queue_num + queue].ring->tqp_vector; + + cmd->use_adaptive_tx_coalesce = tx_vector->tx_group.gl_adapt_enable; + cmd->use_adaptive_rx_coalesce = rx_vector->rx_group.gl_adapt_enable; + + cmd->tx_coalesce_usecs = tx_vector->tx_group.int_gl; + cmd->rx_coalesce_usecs = rx_vector->rx_group.int_gl; + + cmd->tx_coalesce_usecs_high = h->kinfo.int_rl_setting; + cmd->rx_coalesce_usecs_high = h->kinfo.int_rl_setting; + + return 0; +} + +static int hns3_get_coalesce(struct net_device *netdev, +struct ethtool_coalesce *cmd) +{ + return hns3_get_coalesce_per_queue(netdev, 0, cmd); +} + static const struct ethtool_ops hns3vf_ethtool_ops = { .get_drvinfo = hns3_get_drvinfo, .get_ringparam = hns3_get_ringparam, @@ -925,6 +961,7 @@ static void hns3_get_channels(struct net_device *netdev, .nway_reset = hns3_nway_reset, .get_channels = hns3_get_channels, .set_channels = hns3_set_channels, + .get_coalesce = hns3_get_coalesce, }; void hns3_ethtool_set_ops(struct net_device *netdev) -- 1.9.1
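The per-queue lookup in the patch above relies on the TX rings occupying the first `num_tqps` slots of `ring_data` and the RX rings the following `num_tqps` slots. A minimal userspace sketch of that bounds check and index calculation follows; `demo_ring_indices` is a hypothetical helper, not a function from the driver.

```c
#include <errno.h>

/* Hypothetical helper: map a queue number to the TX and RX slots of a
 * ring array laid out as [TX 0..n-1][RX 0..n-1]. */
static int demo_ring_indices(unsigned int queue, unsigned int num_tqps,
			     unsigned int *tx_idx, unsigned int *rx_idx)
{
	/* Reject out-of-range queues before touching the array. */
	if (queue >= num_tqps)
		return -EINVAL;

	*tx_idx = queue;
	*rx_idx = num_tqps + queue;
	return 0;
}
```

Since `hns3_get_coalesce()` always passes queue 0, it reports the values of slot 0 and slot `num_tqps`.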
[PATCH net-next 06/11] net: hns3: refactor GL update function
From: Fuyun Liang

The GL update function uses the max GL value between tx_int_gl and
rx_int_gl to set both the new tx_int_gl and the new rx_int_gl.
Therefore, the user cannot enable TX GL self-adaptive or RX GL
self-adaptive individually.

This patch refactors the code to update the TX GL and the RX GL
separately, so that the user can enable TX GL self-adaptive or RX GL
self-adaptive individually.

Signed-off-by: Fuyun Liang
Signed-off-by: Peng Li
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 35 ++++++++++++++++-------------------
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 59d8d9f..2a139ef 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2459,25 +2459,22 @@ static bool hns3_get_new_int_gl(struct hns3_enet_ring_group *ring_group)
 
 static void hns3_update_new_int_gl(struct hns3_enet_tqp_vector *tqp_vector)
 {
-	u16 rx_int_gl, tx_int_gl;
-	bool rx, tx;
-
-	rx = hns3_get_new_int_gl(&tqp_vector->rx_group);
-	tx = hns3_get_new_int_gl(&tqp_vector->tx_group);
-	rx_int_gl = tqp_vector->rx_group.int_gl;
-	tx_int_gl = tqp_vector->tx_group.int_gl;
-	if (rx && tx) {
-		if (rx_int_gl > tx_int_gl) {
-			tqp_vector->tx_group.int_gl = rx_int_gl;
-			tqp_vector->tx_group.flow_level =
-				tqp_vector->rx_group.flow_level;
-			hns3_set_vector_coalesc_gl(tqp_vector, rx_int_gl);
-		} else {
-			tqp_vector->rx_group.int_gl = tx_int_gl;
-			tqp_vector->rx_group.flow_level =
-				tqp_vector->tx_group.flow_level;
-			hns3_set_vector_coalesc_gl(tqp_vector, tx_int_gl);
-		}
+	struct hns3_enet_ring_group *rx_group = &tqp_vector->rx_group;
+	struct hns3_enet_ring_group *tx_group = &tqp_vector->tx_group;
+	bool rx_update, tx_update;
+
+	if (rx_group->gl_adapt_enable) {
+		rx_update = hns3_get_new_int_gl(rx_group);
+		if (rx_update)
+			hns3_set_vector_coalesce_rx_gl(tqp_vector,
+						       rx_group->int_gl);
+	}
+
+	if (tx_group->gl_adapt_enable) {
+		tx_update = hns3_get_new_int_gl(&tqp_vector->tx_group);
+		if (tx_update)
+			hns3_set_vector_coalesce_tx_gl(tqp_vector,
+						       tx_group->int_gl);
 	}
 }
 
-- 
1.9.1
[PATCH net-next 00/11] add some new features and fix some bugs
This patchset adds some new features and fixes some bugs:
[patch 1/11] adds ethtool_ops.get_channels support for VF.
[patch 2/11] removes TSO config command from VF driver.
[patch 3/11] adds ethtool_ops.get_coalesce support to PF.
[patch 4/11] adds ethtool_ops.set_coalesce support to PF.
[patch 5/11 - 11/11] do some code improvements and fix some bugs.

Fuyun Liang (7):
  net: hns3: add ethtool_ops.get_coalesce support to PF
  net: hns3: add ethtool_ops.set_coalesce support to PF
  net: hns3: refactor interrupt coalescing init function
  net: hns3: refactor GL update function
  net: hns3: remove unused GL setup function
  net: hns3: change the unit of GL value macro
  net: hns3: add int_gl_idx setup for TX and RX queues

Jian Shen (2):
  net: hns3: fixes for feature changed checking
  net: hns3: fix possible NULL pointer in hns3_nic_set_features

Peng Li (2):
  net: hns3: add ethtool_ops.get_channels support for VF
  net: hns3: remove TSO config command from VF driver

 drivers/net/ethernet/hisilicon/hns3/hnae3.h        |   7 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    | 148 ++---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h    |  26 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 179 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    |   5 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h   |   8 -
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  50 +++---
 7 files changed, 336 insertions(+), 87 deletions(-)

-- 
1.9.1
[PATCH net-next 10/11] net: hns3: add feature check when feature changed
From: Jian Shen

The local variable "changed" was defined to indicate which features
changed, but was used only for the feature NETIF_F_HW_VLAN_CTAG_RX.
Add the same check for the other features.

Fixes: 052ece6dc19c ("net: hns3: add ethtool related offload command")
Signed-off-by: Jian Shen
Signed-off-by: Peng Li
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 34879c4..a7ae4f3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1118,25 +1118,28 @@ static int hns3_nic_net_set_mac_address(struct net_device *netdev, void *p)
 static int hns3_nic_set_features(struct net_device *netdev,
 				 netdev_features_t features)
 {
+	netdev_features_t changed = netdev->features ^ features;
 	struct hns3_nic_priv *priv = netdev_priv(netdev);
 	struct hnae3_handle *h = priv->ae_handle;
-	netdev_features_t changed;
 	int ret;
 
-	if (features & (NETIF_F_TSO | NETIF_F_TSO6)) {
-		priv->ops.fill_desc = hns3_fill_desc_tso;
-		priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tso;
-	} else {
-		priv->ops.fill_desc = hns3_fill_desc;
-		priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tx;
+	if (changed & (NETIF_F_TSO | NETIF_F_TSO6)) {
+		if (features & (NETIF_F_TSO | NETIF_F_TSO6)) {
+			priv->ops.fill_desc = hns3_fill_desc_tso;
+			priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tso;
+		} else {
+			priv->ops.fill_desc = hns3_fill_desc;
+			priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tx;
+		}
 	}
 
-	if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
-		h->ae_algo->ops->enable_vlan_filter(h, true);
-	else
-		h->ae_algo->ops->enable_vlan_filter(h, false);
+	if (changed & NETIF_F_HW_VLAN_CTAG_FILTER) {
+		if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
+			h->ae_algo->ops->enable_vlan_filter(h, true);
+		else
+			h->ae_algo->ops->enable_vlan_filter(h, false);
+	}
 
-	changed = netdev->features ^ features;
 	if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
 		if (features & NETIF_F_HW_VLAN_CTAG_RX)
 			ret = h->ae_algo->ops->enable_hw_strip_rxvtag(h, true);
-- 
1.9.1
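The check this patch extends to all features is the usual XOR idiom: a bit is set in `changed` only where the current and requested feature sets differ. A small userspace sketch, with hypothetical `DEMO_F_*` bits standing in for the kernel's `NETIF_F_*` flags (whose real values differ):

```c
#include <stdint.h>

/* Hypothetical feature bits for illustration only. */
#define DEMO_F_TSO		(1ULL << 0)
#define DEMO_F_VLAN_FILTER	(1ULL << 1)
#define DEMO_F_VLAN_RX		(1ULL << 2)

/* Bits that differ between the current and the requested feature set. */
static uint64_t demo_changed(uint64_t current, uint64_t requested)
{
	return current ^ requested;
}

/* Returns 1 if the feature is being switched on, -1 if it is being
 * switched off, and 0 if it is untouched by this request. */
static int demo_feature_action(uint64_t current, uint64_t requested,
			       uint64_t bit)
{
	if (!(demo_changed(current, requested) & bit))
		return 0;
	return (requested & bit) ? 1 : -1;
}
```

Only a non-zero result would lead to a hardware call, which is the effect of wrapping each block in `if (changed & ...)`.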
[PATCH net-next 11/11] net: hns3: check for NULL function pointer in hns3_nic_set_features
From: Jian Shen

It is necessary to check whether a hook is defined before calling it,
to improve reliability.

Signed-off-by: Jian Shen
Signed-off-by: Peng Li
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index a7ae4f3..ac84816 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1133,14 +1133,16 @@ static int hns3_nic_set_features(struct net_device *netdev,
 		}
 	}
 
-	if (changed & NETIF_F_HW_VLAN_CTAG_FILTER) {
+	if ((changed & NETIF_F_HW_VLAN_CTAG_FILTER) &&
+	    h->ae_algo->ops->enable_vlan_filter) {
 		if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
 			h->ae_algo->ops->enable_vlan_filter(h, true);
 		else
 			h->ae_algo->ops->enable_vlan_filter(h, false);
 	}
 
-	if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
+	if ((changed & NETIF_F_HW_VLAN_CTAG_RX) &&
+	    h->ae_algo->ops->enable_hw_strip_rxvtag) {
 		if (features & NETIF_F_HW_VLAN_CTAG_RX)
 			ret = h->ae_algo->ops->enable_hw_strip_rxvtag(h, true);
 		else
-- 
1.9.1
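The guard added here is the common pattern for optional callbacks in an ops table. A minimal userspace sketch, assuming a hypothetical `struct demo_ops` with a single optional hook (none of these names come from the driver):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table; a backend may leave the hook NULL. */
struct demo_ops {
	int (*enable_vlan_filter)(void *handle, bool enable);
};

static int demo_hook_calls;

/* Example hook that only counts how often it is invoked. */
static int demo_enable_vlan_filter(void *handle, bool enable)
{
	(void)handle;
	(void)enable;
	demo_hook_calls++;
	return 0;
}

/* Call the hook only if the backend provides it; treat a missing hook
 * as "nothing to do" rather than dereferencing a NULL pointer. */
static int demo_set_vlan_filter(const struct demo_ops *ops, void *handle,
				bool enable)
{
	if (!ops->enable_vlan_filter)
		return 0;
	return ops->enable_vlan_filter(handle, enable);
}
```

Whether a missing hook should be silently skipped or reported as an error is a design choice; the patch above skips it.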
[PATCH net-next 04/11] net: hns3: add ethtool_ops.set_coalesce support to PF
From: Fuyun Liang This patch adds ethtool_ops.set_coalesce support to PF. Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 34 - drivers/net/ethernet/hisilicon/hns3/hns3_enet.h| 17 +++ drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 141 + 3 files changed, 188 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 14c7625..32c9f88 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -170,14 +170,40 @@ static void hns3_set_vector_coalesc_gl(struct hns3_enet_tqp_vector *tqp_vector, writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL2_OFFSET); } -static void hns3_set_vector_coalesc_rl(struct hns3_enet_tqp_vector *tqp_vector, - u32 rl_value) +void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector, +u32 rl_value) { + u32 rl_reg = hns3_rl_usec_to_reg(rl_value); + /* this defines the configuration for RL (Interrupt Rate Limiter). * Rl defines rate of interrupts i.e. number of interrupts-per-second * GL and RL(Rate Limiter) are 2 ways to acheive interrupt coalescing */ - writel(rl_value, tqp_vector->mask_addr + HNS3_VECTOR_RL_OFFSET); + + if (rl_reg > 0 && !tqp_vector->tx_group.gl_adapt_enable && + !tqp_vector->rx_group.gl_adapt_enable) + /* According to the hardware, the range of rl_reg is +* 0-59 and the unit is 4. 
+*/ + rl_reg |= HNS3_INT_RL_ENABLE_MASK; + + writel(rl_reg, tqp_vector->mask_addr + HNS3_VECTOR_RL_OFFSET); +} + +void hns3_set_vector_coalesce_rx_gl(struct hns3_enet_tqp_vector *tqp_vector, + u32 gl_value) +{ + u32 rx_gl_reg = hns3_gl_usec_to_reg(gl_value); + + writel(rx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL0_OFFSET); +} + +void hns3_set_vector_coalesce_tx_gl(struct hns3_enet_tqp_vector *tqp_vector, + u32 gl_value) +{ + u32 tx_gl_reg = hns3_gl_usec_to_reg(gl_value); + + writel(tx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET); } static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector) @@ -194,7 +220,7 @@ static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector) /* for now we are disabling Interrupt RL - we * will re-enable later */ - hns3_set_vector_coalesc_rl(tqp_vector, 0); + hns3_set_vector_coalesce_rl(tqp_vector, 0); tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW; tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW; } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index 24f6109..7adbda8 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -451,11 +451,15 @@ enum hns3_link_mode_bits { HNS3_LM_COUNT = 15 }; +#define HNS3_INT_GL_MAX0x1FE0 #define HNS3_INT_GL_50K0x000A #define HNS3_INT_GL_20K0x0019 #define HNS3_INT_GL_18K0x001B #define HNS3_INT_GL_8K 0x003E +#define HNS3_INT_RL_MAX0x00EC +#define HNS3_INT_RL_ENABLE_MASK0x40 + struct hns3_enet_ring_group { /* array of pointers to rings */ struct hns3_enet_ring *ring; @@ -595,6 +599,12 @@ static inline void hns3_write_reg(void __iomem *base, u32 reg, u32 value) #define hns3_get_handle(ndev) \ (((struct hns3_nic_priv *)netdev_priv(ndev))->ae_handle) +#define hns3_gl_usec_to_reg(int_gl) (int_gl >> 1) +#define hns3_gl_round_down(int_gl) round_down(int_gl, 2) + +#define hns3_rl_usec_to_reg(int_rl) (int_rl >> 2) +#define 
hns3_rl_round_down(int_rl) round_down(int_rl, 4) + void hns3_ethtool_set_ops(struct net_device *netdev); int hns3_set_channels(struct net_device *netdev, struct ethtool_channels *ch); @@ -607,6 +617,13 @@ int hns3_clean_rx_ring( struct hns3_enet_ring *ring, int budget, void (*rx_fn)(struct hns3_enet_ring *, struct sk_buff *)); +void hns3_set_vector_coalesce_rx_gl(struct hns3_enet_tqp_vector *tqp_vector, + u32 gl_value); +void hns3_set_vector_coalesce_tx_gl(struct hns3_enet_tqp_vector *tqp_vector, + u32 gl_value); +void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector, +u32 rl_value); + #ifdef CONFIG_HNS3_DCB void hns3_dcbnl_setup(struct hnae3_handle *handle); #else diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 81b
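The conversion macros in this patch can be exercised in userspace to see how microsecond values map to register values. The sketch below mirrors the shifts and the RL enable logic shown in the diff; the `demo_` and `DEMO_` names are hypothetical, and the reading that the GL register counts in 2 us steps and the RL register in 4 us steps is inferred from the shifts and the in-code comment.

```c
#include <stdint.h>

#define DEMO_INT_RL_MAX		0x00EC	/* 236, value from the patch */
#define DEMO_INT_RL_ENABLE_MASK	0x40	/* value from the patch */

/* usec >> 1: GL register value, assuming 2 us per step. */
static uint32_t demo_gl_usec_to_reg(uint32_t usec)
{
	return usec >> 1;
}

/* usec >> 2: RL register value, assuming 4 us per step. */
static uint32_t demo_rl_usec_to_reg(uint32_t usec)
{
	return usec >> 2;
}

/* Round down to a power-of-two granularity (2 for GL, 4 for RL). */
static uint32_t demo_round_down(uint32_t value, uint32_t step)
{
	return value & ~(step - 1);
}

/* Mirrors the patch: the enable bit is set only for a non-zero RL value
 * and only when neither direction uses self-adaptive GL. */
static uint32_t demo_rl_reg(uint32_t rl_usec, int tx_adapt, int rx_adapt)
{
	uint32_t reg = demo_rl_usec_to_reg(rl_usec);

	if (reg > 0 && !tx_adapt && !rx_adapt)
		reg |= DEMO_INT_RL_ENABLE_MASK;
	return reg;
}
```

Note that `DEMO_INT_RL_MAX >> 2` is 59, which matches the "0-59" range mentioned in the driver comment.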
[PATCH net-next 09/11] net: hns3: add int_gl_idx setup for TX and RX queues
From: Fuyun Liang If int_gl_idx is not set, the default interrupt coalesce index is 0. The TX queues and the RX queues will then both use GL0 as the interrupt coalesce GL switch, but it should be GL1 for TX queues and GL0 for RX queues. This patch adds the int_gl_idx setup for TX queues and RX queues. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 5 + drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 11 +++ drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 5 + 3 files changed, 21 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 0bad0e3..634e932 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -133,11 +133,16 @@ struct hnae3_vector_info { #define HNAE3_RING_TYPE_B 0 #define HNAE3_RING_TYPE_TX 0 #define HNAE3_RING_TYPE_RX 1 +#define HNAE3_RING_GL_IDX_S 0 +#define HNAE3_RING_GL_IDX_M GENMASK(1, 0) +#define HNAE3_RING_GL_RX 0 +#define HNAE3_RING_GL_TX 1 struct hnae3_ring_chain_node { struct hnae3_ring_chain_node *next; u32 tqp_index; u32 flag; + u32 int_gl_idx; }; #define HNAE3_IS_TX_RING(node) \ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 2e9e61c..34879c4 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2523,6 +2523,8 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, cur_chain->tqp_index = tx_ring->tqp->tqp_index; hnae_set_bit(cur_chain->flag, HNAE3_RING_TYPE_B, HNAE3_RING_TYPE_TX); + hnae_set_field(cur_chain->int_gl_idx, HNAE3_RING_GL_IDX_M, + HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_TX); cur_chain->next = NULL; @@ -2538,6 +2540,10 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, 
chain->tqp_index = tx_ring->tqp->tqp_index; hnae_set_bit(chain->flag, HNAE3_RING_TYPE_B, HNAE3_RING_TYPE_TX); + hnae_set_field(chain->int_gl_idx, + HNAE3_RING_GL_IDX_M, + HNAE3_RING_GL_IDX_S, + HNAE3_RING_GL_TX); cur_chain = chain; } @@ -2549,6 +2555,8 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, cur_chain->tqp_index = rx_ring->tqp->tqp_index; hnae_set_bit(cur_chain->flag, HNAE3_RING_TYPE_B, HNAE3_RING_TYPE_RX); + hnae_set_field(cur_chain->int_gl_idx, HNAE3_RING_GL_IDX_M, + HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_RX); rx_ring = rx_ring->next; } @@ -2562,6 +2570,9 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, chain->tqp_index = rx_ring->tqp->tqp_index; hnae_set_bit(chain->flag, HNAE3_RING_TYPE_B, HNAE3_RING_TYPE_RX); + hnae_set_field(chain->int_gl_idx, HNAE3_RING_GL_IDX_M, + HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_RX); + cur_chain = chain; rx_ring = rx_ring->next; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index d7352f5..27f0ab6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3409,6 +3409,11 @@ int hclge_bind_ring_with_vector(struct hclge_vport *vport, hnae_get_bit(node->flag, HNAE3_RING_TYPE_B)); hnae_set_field(tqp_type_and_id, HCLGE_TQP_ID_M, HCLGE_TQP_ID_S, node->tqp_index); + hnae_set_field(tqp_type_and_id, HCLGE_INT_GL_IDX_M, + HCLGE_INT_GL_IDX_S, + hnae_get_field(node->int_gl_idx, + HNAE3_RING_GL_IDX_M, + HNAE3_RING_GL_IDX_S)); req->tqp_type_and_id[i] = cpu_to_le16(tqp_type_and_id); if (++i >= HCLGE_VECTOR_ELEMENTS_PER_CMD) { req->int_cause_num = HCLGE_VECTOR_ELEMENTS_PER_CMD; -- 1.9.1
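The GL index is stored as a two-bit field (mask `GENMASK(1, 0)`, shift 0) and later copied into the command word. A userspace sketch of that set-field / get-field pattern follows; the `demo_` helpers are hypothetical stand-ins for `hnae_set_field()` and `hnae_get_field()`, written here as functions under the assumption that the kernel macros clear the masked bits before inserting the shifted value.

```c
#include <stdint.h>

/* Userspace stand-in for GENMASK(h, l) on 32-bit values. */
#define DEMO_GENMASK(h, l)	(((~0u) >> (31 - (h))) & ((~0u) << (l)))

#define DEMO_RING_GL_IDX_S	0
#define DEMO_RING_GL_IDX_M	DEMO_GENMASK(1, 0)
#define DEMO_RING_GL_RX		0
#define DEMO_RING_GL_TX		1

/* Clear the field selected by mask, then insert val at shift. */
static uint32_t demo_set_field(uint32_t origin, uint32_t mask,
			       unsigned int shift, uint32_t val)
{
	origin &= ~mask;
	origin |= (val << shift) & mask;
	return origin;
}

/* Extract the field selected by mask and shift. */
static uint32_t demo_get_field(uint32_t origin, uint32_t mask,
			       unsigned int shift)
{
	return (origin & mask) >> shift;
}
```

A TX ring would therefore carry the value 1 in the low two bits and an RX ring the value 0, which is what the hardware command needs in order to pick GL1 or GL0.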
[PATCH net-next 01/11] net: hns3: add ethtool_ops.get_channels support for VF
This patch supports the ethtool's get_channels() for VF. Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 1 + .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 30 ++ 2 files changed, 31 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index d3cb3ec..f44336c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -900,6 +900,7 @@ static void hns3_get_channels(struct net_device *netdev, .get_rxfh = hns3_get_rss, .set_rxfh = hns3_set_rss, .get_link_ksettings = hns3_get_link_ksettings, + .get_channels = hns3_get_channels, }; static const struct ethtool_ops hns3_ethtool_ops = { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index 655f522..5f9afa6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -1433,6 +1433,35 @@ static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev *ae_dev) ae_dev->priv = NULL; } +static u32 hclgevf_get_max_channels(struct hclgevf_dev *hdev) +{ + struct hnae3_handle *nic = &hdev->nic; + struct hnae3_knic_private_info *kinfo = &nic->kinfo; + + return min_t(u32, hdev->rss_size_max * kinfo->num_tc, hdev->num_tqps); +} + +/** + * hclgevf_get_channels - Get the current channels enabled and max supported. + * @handle: hardware information for network interface + * @ch: ethtool channels structure + * + * We don't support separate tx and rx queues as channels. The other count + * represents how many queues are being used for control. max_combined counts + * how many queue pairs we can support. They may not be mapped 1 to 1 with + * q_vectors since we support a lot more queue pairs than q_vectors. 
+ **/ +static void hclgevf_get_channels(struct hnae3_handle *handle, +struct ethtool_channels *ch) +{ + struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle); + + ch->max_combined = hclgevf_get_max_channels(hdev); + ch->other_count = 0; + ch->max_other = 0; + ch->combined_count = hdev->num_tqps; +} + static const struct hnae3_ae_ops hclgevf_ops = { .init_ae_dev = hclgevf_init_ae_dev, .uninit_ae_dev = hclgevf_uninit_ae_dev, @@ -1462,6 +1491,7 @@ static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev *ae_dev) .get_tc_size = hclgevf_get_tc_size, .get_fw_version = hclgevf_get_fw_version, .set_vlan_filter = hclgevf_set_vlan_filter, + .get_channels = hclgevf_get_channels, }; static struct hnae3_ae_algo ae_algovf = { -- 1.9.1
[PATCH net-next 08/11] net: hns3: change the unit of GL value macro
From: Fuyun Liang

Previously, the driver used 2us as the GL unit. The time unit used by
the ethtool "-c" and "-C" commands is 1us, so the GL unit the driver
uses is now 1us as well.

This patch changes the unit of the GL value macros from 2us to 1us.

Signed-off-by: Fuyun Liang
Signed-off-by: Peng Li
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 7adbda8..213f501 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -452,10 +452,10 @@ enum hns3_link_mode_bits {
 };
 
 #define HNS3_INT_GL_MAX			0x1FE0
-#define HNS3_INT_GL_50K			0x000A
-#define HNS3_INT_GL_20K			0x0019
-#define HNS3_INT_GL_18K			0x001B
-#define HNS3_INT_GL_8K			0x003E
+#define HNS3_INT_GL_50K			0x0014
+#define HNS3_INT_GL_20K			0x0032
+#define HNS3_INT_GL_18K			0x0036
+#define HNS3_INT_GL_8K			0x007C
 
 #define HNS3_INT_RL_MAX			0x00EC
 #define HNS3_INT_RL_ENABLE_MASK		0x40
-- 
1.9.1
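The new macro values are the old ones doubled, which is what a change from 2us to 1us units implies. The sketch below checks that arithmetic in userspace; the `DEMO_` names are hypothetical, and reading the `_50K` style suffixes as approximate interrupts per second is an assumption based on the macro names, not something the patch states.

```c
#include <stdint.h>

/* Old macro values from the patch, expressed in 2 us units. */
#define DEMO_OLD_GL_50K	0x000A
#define DEMO_OLD_GL_20K	0x0019
#define DEMO_OLD_GL_18K	0x001B
#define DEMO_OLD_GL_8K	0x003E

/* Convert a value in 2 us units to the same gap in 1 us units. */
static uint32_t demo_gl_2us_to_1us(uint32_t value_2us)
{
	return value_2us * 2;
}

/* Approximate interrupt rate for a given gap in microseconds
 * (integer division, gap must be non-zero). */
static uint32_t demo_gap_usec_to_rate(uint32_t gap_usec)
{
	return 1000000u / gap_usec;
}
```

For example, the old `0x000A` becomes `0x0014` (20 us), and a 20 us gap corresponds to 50000 interrupts per second under the assumed naming.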
[PATCH net-next 02/11] net: hns3: remove TSO config command from VF driver
Only main PF can config TSO MSS length according to hardware. This patch removes TSO config command from VF driver. Signed-off-by: Peng Li --- .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h | 8 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c| 20 2 files changed, 28 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h index ad8adfe..2caca93 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h @@ -86,8 +86,6 @@ enum hclgevf_opcode_type { HCLGEVF_OPC_QUERY_TX_STATUS = 0x0B03, HCLGEVF_OPC_QUERY_RX_STATUS = 0x0B13, HCLGEVF_OPC_CFG_COM_TQP_QUEUE = 0x0B20, - /* TSO cmd */ - HCLGEVF_OPC_TSO_GENERIC_CONFIG = 0x0C01, /* RSS cmd */ HCLGEVF_OPC_RSS_GENERIC_CONFIG = 0x0D01, HCLGEVF_OPC_RSS_INDIR_TABLE = 0x0D07, @@ -202,12 +200,6 @@ struct hclgevf_cfg_tx_queue_pointer_cmd { u8 rsv[14]; }; -#define HCLGEVF_TSO_ENABLE_B 0 -struct hclgevf_cfg_tso_status_cmd { - u8 tso_enable; - u8 rsv[23]; -}; - #define HCLGEVF_TYPE_CRQ 0 #define HCLGEVF_TYPE_CSQ 1 #define HCLGEVF_NIC_CSQ_BASEADDR_L_REG 0x27000 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index 5f9afa6..3d2bc9a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -201,20 +201,6 @@ static int hclge_get_queue_info(struct hclgevf_dev *hdev) return 0; } -static int hclgevf_enable_tso(struct hclgevf_dev *hdev, int enable) -{ - struct hclgevf_cfg_tso_status_cmd *req; - struct hclgevf_desc desc; - - req = (struct hclgevf_cfg_tso_status_cmd *)desc.data; - - hclgevf_cmd_setup_basic_desc(&desc, HCLGEVF_OPC_TSO_GENERIC_CONFIG, -false); - hnae_set_bit(req->tso_enable, HCLGEVF_TSO_ENABLE_B, enable); - - return hclgevf_cmd_send(&hdev->hw, &desc, 1); -} - static int hclgevf_alloc_tqps(struct hclgevf_dev 
*hdev) { struct hclgevf_tqp *tqp; @@ -1375,12 +1361,6 @@ static int hclgevf_init_ae_dev(struct hnae3_ae_dev *ae_dev) goto err_config; } - ret = hclgevf_enable_tso(hdev, true); - if (ret) { - dev_err(&pdev->dev, "failed(%d) to enable tso\n", ret); - goto err_config; - } - /* Initialize VF's MTA */ hdev->accept_mta_mc = true; ret = hclgevf_cfg_func_mta_filter(&hdev->nic, hdev->accept_mta_mc); -- 1.9.1
[PATCH net-next 07/11] net: hns3: remove unused GL setup function
From: Fuyun Liang Since the TX GL and the RX GL need to be set separately, hns3_set_vector_coalesc_gl() has been replaced with hns3_set_vector_coalesce_rx_gl() and hns3_set_vector_coalesce_tx_gl(). This patch removes hns3_set_vector_coalesc_gl(). Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 12 1 file changed, 12 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 2a139ef..2e9e61c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -158,18 +158,6 @@ static void hns3_vector_disable(struct hns3_enet_tqp_vector *tqp_vector) napi_disable(&tqp_vector->napi); } -static void hns3_set_vector_coalesc_gl(struct hns3_enet_tqp_vector *tqp_vector, - u32 gl_value) -{ - /* this defines the configuration for GL (Interrupt Gap Limiter) -* GL defines inter interrupt gap. -* GL and RL(Rate Limiter) are 2 ways to acheive interrupt coalescing -*/ - writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL0_OFFSET); - writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET); - writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL2_OFFSET); -} - void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector, u32 rl_value) { -- 1.9.1
[PATCH net-next 05/11] net: hns3: refactor interrupt coalescing init function
From: Fuyun Liang In the hardware, the coalesce configurable registers include GL0, GL1, GL2. In the driver, the TX queues use the register GL1 and the RX queues use the register GL0. This function initializes the configuration of the interrupt coalescing, but does not distinguish between the TX direction and the RX direction. It will cause some confusion. This patch refactors the function to initialize the TX GL and the RX GL separately. And the initialization of related variables also is added to this patch. Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 29 + 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 32c9f88..59d8d9f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -206,21 +206,32 @@ void hns3_set_vector_coalesce_tx_gl(struct hns3_enet_tqp_vector *tqp_vector, writel(tx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET); } -static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector) +static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector, + struct hns3_nic_priv *priv) { + struct hnae3_handle *h = priv->ae_handle; + /* initialize the configuration for interrupt coalescing. * 1. GL (Interrupt Gap Limiter) * 2. 
RL (Interrupt Rate Limiter) */ - /* Default :enable interrupt coalesce */ - tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K; + /* Default: enable interrupt coalescing self-adaptive and GL */ + tqp_vector->tx_group.gl_adapt_enable = 1; + tqp_vector->rx_group.gl_adapt_enable = 1; + tqp_vector->tx_group.int_gl = HNS3_INT_GL_50K; - hns3_set_vector_coalesc_gl(tqp_vector, HNS3_INT_GL_50K); - /* for now we are disabling Interrupt RL - we -* will re-enable later -*/ - hns3_set_vector_coalesce_rl(tqp_vector, 0); + tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K; + + hns3_set_vector_coalesce_tx_gl(tqp_vector, + tqp_vector->tx_group.int_gl); + hns3_set_vector_coalesce_rx_gl(tqp_vector, + tqp_vector->rx_group.int_gl); + + /* Default: disable RL */ + h->kinfo.int_rl_setting = 0; + hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting); + tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW; tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW; } @@ -2654,7 +2665,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) tqp_vector->rx_group.total_packets = 0; tqp_vector->tx_group.total_bytes = 0; tqp_vector->tx_group.total_packets = 0; - hns3_vector_gl_rl_init(tqp_vector); + hns3_vector_gl_rl_init(tqp_vector, priv); tqp_vector->handle = h; ret = hns3_get_vector_ring_chain(tqp_vector, -- 1.9.1
Re: [PATCH V2] ipvlan: fix ipvlan MTU limits
On Wed, 10 Jan 2018 18:09:50 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> I still prefer the approach I had mentioned that uses 'mtu_adj'. In
> that approach you can leave those slaves which have changed their mtu
> to be lower than masters' but if master's mtu changes to larger value
> all other slaves will get updated mtu leaving behind the slaves who
> have opted to change their mtu on their own. Also the same thing is
> true when mtu get reduced at master.

The problem with this magic behavior is, well, that it's magic. There's
no way to tell what happens with a given slave when the master's MTU
gets changed just by looking at the current configuration. There's also
no way to switch the magic behavior back on once the slave's MTU is
changed.

At minimum, you'd need some kind of indication that the slave's MTU is
following the master. And a way to toggle this back.

Keefe's patch is much saner, the behavior is completely deterministic.

 Jiri
[PATCH V5 4/4] selinux: Add SCTP support
The SELinux SCTP implementation is explained in: Documentation/security/SELinux-sctp.rst Signed-off-by: Richard Haines --- V5 Change: Rework selinux_netlbl_socket_connect() and selinux_netlbl_socket_connect_locked as requested by Paul. Documentation/security/SELinux-sctp.rst | 157 ++ security/selinux/hooks.c| 280 +--- security/selinux/include/classmap.h | 2 +- security/selinux/include/netlabel.h | 21 ++- security/selinux/include/objsec.h | 4 + security/selinux/netlabel.c | 133 +-- 6 files changed, 565 insertions(+), 32 deletions(-) create mode 100644 Documentation/security/SELinux-sctp.rst diff --git a/Documentation/security/SELinux-sctp.rst b/Documentation/security/SELinux-sctp.rst new file mode 100644 index 000..2f66bf3 --- /dev/null +++ b/Documentation/security/SELinux-sctp.rst @@ -0,0 +1,157 @@ +SCTP SELinux Support += + +Security Hooks +=== + +``Documentation/security/LSM-sctp.rst`` describes the following SCTP security +hooks with the SELinux specifics expanded below:: + +security_sctp_assoc_request() +security_sctp_bind_connect() +security_sctp_sk_clone() +security_inet_conn_established() + + +security_sctp_assoc_request() +- +Passes the ``@ep`` and ``@chunk->skb`` of the association INIT packet to the +security module. Returns 0 on success, error on failure. +:: + +@ep - pointer to sctp endpoint structure. +@skb - pointer to skbuff of association packet. + +The security module performs the following operations: + IF this is the first association on ``@ep->base.sk``, then set the peer + sid to that in ``@skb``. This will ensure there is only one peer sid + assigned to ``@ep->base.sk`` that may support multiple associations. + + ELSE validate the ``@ep->base.sk peer_sid`` against the ``@skb peer sid`` + to determine whether the association should be allowed or denied. + + Set the sctp ``@ep sid`` to socket's sid (from ``ep->base.sk``) with + MLS portion taken from ``@skb peer sid``. 
This will be used by SCTP + TCP style sockets and peeled off connections as they cause a new socket + to be generated. + + If IP security options are configured (CIPSO/CALIPSO), then the ip + options are set on the socket. + + +security_sctp_bind_connect() +- +Checks permissions required for ipv4/ipv6 addresses based on the ``@optname`` +as follows:: + + -- + | BIND Permission Checks | + | @optname | @address contains | + ||---| + | SCTP_SOCKOPT_BINDX_ADD | One or more ipv4 / ipv6 addresses | + | SCTP_PRIMARY_ADDR | Single ipv4 or ipv6 address | + | SCTP_SET_PEER_PRIMARY_ADDR | Single ipv4 or ipv6 address | + -- + + -- + | CONNECT Permission Checks | + | @optname | @address contains | + ||---| + | SCTP_SOCKOPT_CONNECTX | One or more ipv4 / ipv6 addresses | + | SCTP_PARAM_ADD_IP | One or more ipv4 / ipv6 addresses | + | SCTP_SENDMSG_CONNECT | Single ipv4 or ipv6 address | + | SCTP_PARAM_SET_PRIMARY | Single ipv4 or ipv6 address | + -- + + +``Documentation/security/LSM-sctp.rst`` gives a summary of the ``@optname`` +entries and also describes ASCONF chunk processing when Dynamic Address +Reconfiguration is enabled. + + +security_sctp_sk_clone() +- +Called whenever a new socket is created by **accept**\(2) (i.e. a TCP style +socket) or when a socket is 'peeled off' e.g userspace calls +**sctp_peeloff**\(3). ``security_sctp_sk_clone()`` will set the new +sockets sid and peer sid to that contained in the ``@ep sid`` and +``@ep peer sid`` respectively. +:: + +@ep - pointer to current sctp endpoint structure. +@sk - pointer to current sock structure. +@sk - pointer to new sock structure. + + +security_inet_conn_established() +- +Called when a COOKIE ACK is received where it sets the connection's peer sid +to that in ``@skb``:: + +@sk - pointer to sock structure. +@skb - pointer to skbuff of the COOKIE ACK packet. 
+ + +Policy Statements +== +The following class and permissions to support SCTP are available within the +kernel:: + +class sctp_socket inherits socket { node_bind } + +whenever the following policy capability is enabled:: + +policycap extended_socket_class; + +S
Re: [PATCH 03/32] fs: introduce new ->get_poll_head and ->poll_mask methods
For other horrors that are even worse than any given ->poll instance take a look at scif_poll and friends..
Re: [PATCH 03/32] fs: introduce new ->get_poll_head and ->poll_mask methods
On Wed, Jan 10, 2018 at 09:04:16PM +, Al Viro wrote:
> There's another problem with that - currently ->poll() may tell you "sod off,
> I've got nothing for you to sleep on, eat your POLLHUP|POLLERR|something
> and don't pester me again". With your API that's hard to express sanely.

And what exactly can tell you 'sod off' right now? ->poll can only return the (E)POLL* mask. But what would probably be sane is to do in vfs_poll the same thing I already do in aio poll: call ->poll_mask a first time before calling poll_wait to clear any already pending events. That way any early error gets instantly propagated.

> Another piece of fun related to that is handling of disconnects in general -
>
> static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct *pts)
> {
>         struct proc_dir_entry *pde = PDE(file_inode(file));
>         __poll_t rv = DEFAULT_POLLMASK;
>         __poll_t (*poll)(struct file *, struct poll_table_struct *);
>         if (use_pde(pde)) {
>                 poll = pde->proc_fops->poll;
>                 if (poll)
>                         rv = poll(file, pts);
>                 unuse_pde(pde);
>         }
>         return rv;
> }
>
> and similar in sysfs.

Can't find anything in sysfs, but debugfs has an amazingly bad variant of the above, including confidence ensuring commit description bits like:

    In order not to pollute debugfs with wrapper definitions that aren't ever
    needed, I chose not to define a wrapper for every struct file_operations
    method possible. Instead, a wrapper is defined only for the subset of
    methods which are actually set by any debugfs users.
    Currently, these are:

        ->llseek()
        ->read()
        ->write()
        ->unlocked_ioctl()
        ->poll()

So anyone implementing, say, read_iter/write_iter or compat_ioctl silently doesn't get the magic protection..

Either way - those two will need updating for the new scheme if we add proc/debugfs ops, so I better do them now and convert at least one example each.

> Note, BTW, the places like wait->_qproc = NULL; in do_select() and its ilk.
> Some of them are "don't bother putting me on any queues, I won't be sleeping
> anyway". Some are "I'm already on all queues I care about, I'm going to
> sleep now and the query everything again once woken up". It would be nice
> to have the method splitup reflect that kind of logics...

Hmm. ->poll_mask already is a simple 'are these events pending' method, and thus should deal perfectly fine with both cases. What additional split do you think would be helpful?

> What about af_alg_poll(), BTW? Looks like you've missed that one...

Converted for the next iteration.

> Another thing: IMO file_can_poll() should use FMODE_CAN_POLL - either as
> "true if set, otherwise check ->f_op and set accordingly" or set in
> do_dentry_open() and just check it in file_can_poll()...

I don't really see the point of wasting a fmode bit for it. But if you really want that I can do it.
[PATCH 06/11] xfrm: Use __skb_queue_tail in xfrm_trans_queue
From: Herbert Xu We do not need locking in xfrm_trans_queue because it is designed to use per-CPU buffers. However, the original code incorrectly used skb_queue_tail which takes the lock. This patch switches it to __skb_queue_tail instead. Reported-and-tested-by: Artem Savkov Fixes: acf568ee859f ("xfrm: Reinject transport-mode packets...") Signed-off-by: Herbert Xu Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 3f6f6f8c9fa5..5b2409746ae0 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -518,7 +518,7 @@ int xfrm_trans_queue(struct sk_buff *skb, return -ENOBUFS; XFRM_TRANS_SKB_CB(skb)->finish = finish; - skb_queue_tail(&trans->queue, skb); + __skb_queue_tail(&trans->queue, skb); tasklet_schedule(&trans->tasklet); return 0; } -- 2.14.1
[PATCH 05/11] xfrm: fix rcu usage in xfrm_get_type_offload
From: Sabrina Dubroca request_module can sleep, thus we cannot hold rcu_read_lock() while calling it. The function also jumps back and takes rcu_read_lock() again (in xfrm_state_get_afinfo()), resulting in an imbalance. This codepath is triggered whenever a new offloaded state is created. Fixes: ffdb5211da1c ("xfrm: Auto-load xfrm offload modules") Reported-by: syzbot+ca425f44816d749e8eb49755567a75ee48cf4...@syzkaller.appspotmail.com Signed-off-by: Sabrina Dubroca Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_state.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 1e80f68e2266..429957412633 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -313,13 +313,14 @@ xfrm_get_type_offload(u8 proto, unsigned short family, bool try_load) if ((type && !try_module_get(type->owner))) type = NULL; + rcu_read_unlock(); + if (!type && try_load) { request_module("xfrm-offload-%d-%d", family, proto); try_load = 0; goto retry; } - rcu_read_unlock(); return type; } -- 2.14.1
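The locking shape this patch ends up with can be modelled in plain userspace C. The sketch below is only an illustration: every `model_*` name is hypothetical, the counters stand in for the real RCU read-side lock, and the real function takes the lock inside xfrm_state_get_afinfo() rather than directly. It shows the two properties the fix is after: the sleeping call runs with the read-side lock dropped, and lock/unlock stay balanced across the retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel primitives (hypothetical model). */
static int rcu_depth;         /* nesting depth of the modelled read-side lock */
static int sleeps_under_rcu;  /* times the modelled request_module() ran with the lock held */
static bool module_loaded;    /* becomes true once the "module" is loaded */
static int dummy_type;        /* object returned by a successful lookup */

static void model_rcu_read_lock(void)
{
	rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

/* May sleep in the real kernel, so it must run with rcu_depth == 0. */
static void model_request_module(void)
{
	if (rcu_depth != 0)
		sleeps_under_rcu++;
	module_loaded = true;
}

/* Lookup done under the read-side lock; NULL if nothing is registered. */
static const int *model_lookup(void)
{
	return module_loaded ? &dummy_type : NULL;
}

/* Same shape as the patched function: unlock before the sleeping call. */
static const int *model_get_type_offload(bool try_load)
{
	const int *type;

retry:
	model_rcu_read_lock();
	type = model_lookup();
	model_rcu_read_unlock();

	if (!type && try_load) {
		model_request_module();
		try_load = false;
		goto retry;
	}
	return type;
}
```

Moving the unlock below the `if` block, as in the unpatched code, would make the model count a sleep under the lock and leave `rcu_depth` at 1 after the retry.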
[PATCH 04/11] af_key: fix buffer overread in parse_exthdrs()
From: Eric Biggers If a message sent to a PF_KEY socket ended with an incomplete extension header (fewer than 4 bytes remaining), then parse_exthdrs() read past the end of the message, into uninitialized memory. Fix it by returning -EINVAL in this case. Reproducer: #include <linux/pfkeyv2.h> #include <sys/socket.h> #include <unistd.h> int main() { int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2); char buf[17] = { 0 }; struct sadb_msg *msg = (void *)buf; msg->sadb_msg_version = PF_KEY_V2; msg->sadb_msg_type = SADB_DELETE; msg->sadb_msg_len = 2; write(sock, buf, 17); } Cc: sta...@vger.kernel.org Signed-off-by: Eric Biggers Signed-off-by: Steffen Klassert --- net/key/af_key.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/key/af_key.c b/net/key/af_key.c index 596499cc8b2f..d40861a048fe 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -516,6 +516,9 @@ static int parse_exthdrs(struct sk_buff *skb, const struct sadb_msg *hdr, void * uint16_t ext_type; int ext_len; + if (len < sizeof(*ehdr)) + return -EINVAL; + ext_len = ehdr->sadb_ext_len; ext_len *= sizeof(uint64_t); ext_type = ehdr->sadb_ext_type; -- 2.14.1
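The added check is easier to see outside the kernel. Below is a minimal userspace sketch of an extension walk with the same guard, assuming a cut-down header (`struct model_ext`, hypothetical) that carries a length in 64-bit words followed by a type. It is not the af_key parser; the zero-length and over-long checks are included only so the loop terminates, and the names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an extension header: length in 64-bit words, then type. */
struct model_ext {
	uint16_t len;
	uint16_t type;
};

/* Walks the extensions in buf[0..len); returns how many were seen, or -EINVAL. */
static int model_parse_exthdrs(const uint8_t *buf, size_t len)
{
	int count = 0;

	assert(buf != NULL || len == 0);

	while (len > 0) {
		struct model_ext ehdr;
		size_t ext_len;

		/* The guard the patch adds: never read a partial header. */
		if (len < sizeof(ehdr))
			return -EINVAL;

		memcpy(&ehdr, buf, sizeof(ehdr));
		ext_len = (size_t)ehdr.len * sizeof(uint64_t);

		/* A zero-length or over-long extension is malformed as well. */
		if (ext_len == 0 || ext_len > len)
			return -EINVAL;

		buf += ext_len;
		len -= ext_len;
		count++;
	}
	return count;
}
```

A buffer holding one 8-byte extension followed by a single stray byte mirrors the reproducer above (17 bytes, of which 16 are the base header): the walk consumes the extension and then rejects the 1-byte tail instead of reading past it.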
[PATCH 10/11] af_key: Fix memory leak in key_notify_policy.
We leak the allocated out_skb in case pfkey_xfrm_policy2msg() fails. Fix this by freeing it on error. Reported-by: Dmitry Vyukov Signed-off-by: Steffen Klassert --- net/key/af_key.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index d40861a048fe..7e2e7188e7f4 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -2202,8 +2202,10 @@ static int key_notify_policy(struct xfrm_policy *xp, int dir, const struct km_ev return PTR_ERR(out_skb); err = pfkey_xfrm_policy2msg(out_skb, xp, dir); - if (err < 0) + if (err < 0) { + kfree_skb(out_skb); return err; + } out_hdr = (struct sadb_msg *) out_skb->data; out_hdr->sadb_msg_version = PF_KEY_V2; -- 2.14.1
[PATCH 09/11] esp: Fix GRO when the headers not fully in the linear part of the skb.
The GRO layer does not necessarily pull the complete headers into the linear part of the skb, a part may remain on the first page fragment. This can lead to a crash if we try to pull the headers, so make sure we have them on the linear part before pulling. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Reported-by: syzbot+82bbd65569c49c6c0...@syzkaller.appspotmail.com Signed-off-by: Steffen Klassert --- net/ipv4/esp4_offload.c | 3 ++- net/ipv6/esp6_offload.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c index f8b918c766b0..b1338e576d00 100644 --- a/net/ipv4/esp4_offload.c +++ b/net/ipv4/esp4_offload.c @@ -38,7 +38,8 @@ static struct sk_buff **esp4_gro_receive(struct sk_buff **head, __be32 spi; int err; - skb_pull(skb, offset); + if (!pskb_pull(skb, offset)) + return NULL; if ((err = xfrm_parse_spi(skb, IPPROTO_ESP, &spi, &seq)) != 0) goto out; diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c index 333a478aa161..dd9627490c7c 100644 --- a/net/ipv6/esp6_offload.c +++ b/net/ipv6/esp6_offload.c @@ -60,7 +60,8 @@ static struct sk_buff **esp6_gro_receive(struct sk_buff **head, int nhoff; int err; - skb_pull(skb, offset); + if (!pskb_pull(skb, offset)) + return NULL; if ((err = xfrm_parse_spi(skb, IPPROTO_ESP, &spi, &seq)) != 0) goto out; -- 2.14.1
[PATCH 11/11] xfrm: Fix a race in the xdst pcpu cache.
We need to run xfrm_resolve_and_create_bundle() with bottom halves off. Otherwise we may reuse an already released dst_entry when the xfrm lookup functions are called from process context. Fixes: ec30d78c14a813db39a647b6a348b428 ("xfrm: add xdst pcpu cache") Reported-by: Darius Ski Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_policy.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index bc5eae12fb09..bd6b0e7a0ee4 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2063,8 +2063,11 @@ xfrm_bundle_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir, if (num_xfrms <= 0) goto make_dummy_bundle; + local_bh_disable(); xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family, - xflo->dst_orig); + xflo->dst_orig); + local_bh_enable(); + if (IS_ERR(xdst)) { err = PTR_ERR(xdst); if (err != -EAGAIN) @@ -2151,9 +2154,12 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig, goto no_transform; } + local_bh_disable(); xdst = xfrm_resolve_and_create_bundle( pols, num_pols, fl, family, dst_orig); + local_bh_enable(); + if (IS_ERR(xdst)) { xfrm_pols_put(pols, num_pols); err = PTR_ERR(xdst); -- 2.14.1
[PATCH 08/11] xfrm: don't call xfrm_policy_cache_flush while holding spinlock
From: Florian Westphal xfrm_policy_cache_flush can sleep, so it cannot be called while holding a spinlock. We could release the lock first, but I don't see why we need to invoke this function here in the first place, the packet path won't reuse an xdst entry unless it's still valid. While at it, add an annotation to xfrm_policy_cache_flush, it would have probably caught this bug sooner. Fixes: ec30d78c14a813 ("xfrm: add xdst pcpu cache") Reported-by: syzbot+e149f7d1328c26f9c...@syzkaller.appspotmail.com Signed-off-by: Florian Westphal Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_policy.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 2ef6db98e9ba..bc5eae12fb09 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -975,8 +975,6 @@ int xfrm_policy_flush(struct net *net, u8 type, bool task_valid) } if (!cnt) err = -ESRCH; - else - xfrm_policy_cache_flush(); out: spin_unlock_bh(&net->xfrm.xfrm_policy_lock); return err; @@ -1744,6 +1742,8 @@ void xfrm_policy_cache_flush(void) bool found = 0; int cpu; + might_sleep(); + local_bh_disable(); rcu_read_lock(); for_each_possible_cpu(cpu) { -- 2.14.1
[PATCH 02/11] xfrm: skip policies marked as dead while rehashing
From: Florian Westphal syzkaller triggered the following KASAN splat: BUG: KASAN: slab-out-of-bounds in xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618 read of size 2 at addr 8801c8e92fe4 by task kworker/1:1/23 [..] Workqueue: events xfrm_hash_rebuild [..] __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428 xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618 process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112 worker_thread+0x223/0x1990 kernel/workqueue.c:2246 [..] The reproducer triggers: 1016 if (error) { 1017 list_move_tail(&walk->walk.all, &x->all); 1018 goto out; 1019 } in xfrm_policy_walk() via pfkey (it sets tiny rcv space, dump callback returns -ENOBUFS). In this case, *walk is located in the pfkey socket struct, so this socket becomes visible in the global policy list. It looks like this is intentional -- phony walker has walk.dead set to 1 and all other places skip such "policies". Ccing original authors of the two commits that seem to expose this issue (first patch missed ->dead check, second patch adds pfkey sockets to policies dumper list).
Fixes: 880a6fab8f6ba5b ("xfrm: configure policy hash table thresholds by netlink") Fixes: 12a169e7d8f4b1c ("ipsec: Put dumpers on the dump list") Cc: Herbert Xu Cc: Timo Teras Cc: Christophe Gouault Reported-by: syzbot Signed-off-by: Florian Westphal Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_policy.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 70aa5cb0c659..2ef6db98e9ba 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -609,7 +609,8 @@ static void xfrm_hash_rebuild(struct work_struct *work) /* re-insert all policies by order of creation */ list_for_each_entry_reverse(policy, &net->xfrm.policy_all, walk.all) { - if (xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) { + if (policy->walk.dead || + xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) { /* skip socket policies */ continue; } -- 2.14.1
[PATCH 07/11] xfrm: Return error on unknown encap_type in init_state
From: Herbert Xu Currently esp will happily create an xfrm state with an unknown encap type for IPv4, without setting the necessary state parameters. This patch fixes it by returning -EINVAL. There is a similar problem in IPv6 where if the mode is unknown we will skip initialisation while returning zero. However, this is harmless as the mode has already been checked further up the stack. This patch removes this anomaly by aligning the IPv6 behaviour with IPv4 and treating unknown modes (which cannot actually happen) as transport mode. Fixes: 38320c70d282 ("[IPSEC]: Use crypto_aead and authenc in ESP") Signed-off-by: Herbert Xu Signed-off-by: Steffen Klassert --- net/ipv4/esp4.c | 1 + net/ipv6/esp6.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index d57aa64fa7c7..61fe6e4d23fc 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -981,6 +981,7 @@ static int esp_init_state(struct xfrm_state *x) switch (encap->encap_type) { default: + err = -EINVAL; goto error; case UDP_ENCAP_ESPINUDP: x->props.header_len += sizeof(struct udphdr); diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index a902ff8f59be..1a7f00cd4803 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -890,13 +890,12 @@ static int esp6_init_state(struct xfrm_state *x) x->props.header_len += IPV4_BEET_PHMAXLEN + (sizeof(struct ipv6hdr) - sizeof(struct iphdr)); break; + default: case XFRM_MODE_TRANSPORT: break; case XFRM_MODE_TUNNEL: x->props.header_len += sizeof(struct ipv6hdr); break; - default: - goto error; } align = ALIGN(crypto_aead_blocksize(aead), 4); -- 2.14.1
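The IPv4 half of this change boils down to making the `default:` arm of a switch fail explicitly instead of falling into the exit path with a stale return code. A minimal sketch of that pattern follows; the enum, its value, and `model_init_encap()` are hypothetical and only model the shape of the code, not the real esp_init_state().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical encapsulation type; the numeric value is illustrative only. */
enum model_encap_type {
	MODEL_ENCAP_ESPINUDP = 1,
};

#define MODEL_UDP_HDR_LEN 8	/* size of a UDP header in bytes */

/* Adds the encapsulation overhead to *header_len, or fails on an unknown type. */
static int model_init_encap(int encap_type, int *header_len)
{
	assert(header_len != NULL);

	switch (encap_type) {
	case MODEL_ENCAP_ESPINUDP:
		*header_len += MODEL_UDP_HDR_LEN;
		return 0;
	default:
		/* Unknown type: report it and leave the state untouched. */
		return -EINVAL;
	}
}
```

With this shape a caller cannot end up with a state whose header length was never adjusted but whose setup still reported success.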
[PATCH 03/11] af_key: fix buffer overread in verify_address_len()
From: Eric Biggers If a message sent to a PF_KEY socket ended with one of the extensions that takes a 'struct sadb_address' but there were not enough bytes remaining in the message for the ->sa_family member of the 'struct sockaddr' which is supposed to follow, then verify_address_len() read past the end of the message, into uninitialized memory. Fix it by returning -EINVAL in this case. This bug was found using syzkaller with KMSAN. Reproducer: #include <linux/pfkeyv2.h> #include <sys/socket.h> #include <unistd.h> int main() { int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2); char buf[24] = { 0 }; struct sadb_msg *msg = (void *)buf; struct sadb_address *addr = (void *)(msg + 1); msg->sadb_msg_version = PF_KEY_V2; msg->sadb_msg_type = SADB_DELETE; msg->sadb_msg_len = 3; addr->sadb_address_len = 1; addr->sadb_address_exttype = SADB_EXT_ADDRESS_SRC; write(sock, buf, 24); } Reported-by: Alexander Potapenko Cc: sta...@vger.kernel.org Signed-off-by: Eric Biggers Signed-off-by: Steffen Klassert --- net/key/af_key.c | 5 + 1 file changed, 5 insertions(+) diff --git a/net/key/af_key.c b/net/key/af_key.c index 3dffb892d52c..596499cc8b2f 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -401,6 +401,11 @@ static int verify_address_len(const void *p) #endif int len; + if (sp->sadb_address_len < + DIV_ROUND_UP(sizeof(*sp) + offsetofend(typeof(*addr), sa_family), +sizeof(uint64_t))) + return -EINVAL; + switch (addr->sa_family) { case AF_INET: len = DIV_ROUND_UP(sizeof(*sp) + sizeof(*sin), sizeof(uint64_t)); -- 2.14.1
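The arithmetic in the new check can be worked through in userspace. The sketch below assumes two stand-in structs (`model_sadb_address`, `model_sockaddr`, both hypothetical) with an 8-byte address extension header and a 2-byte family field at offset 0, and re-creates the two helper macros locally. Under those assumptions the header plus the family field needs 10 bytes, which rounds up to 2 words of 64 bits, so a declared length of 1 (as in the reproducer) is rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local re-creations of the kernel helpers, for illustration only. */
#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define MODEL_OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Hypothetical stand-in for the address extension header (8 bytes). */
struct model_sadb_address {
	uint16_t len;		/* extension length in 64-bit words */
	uint16_t exttype;
	uint8_t  proto;
	uint8_t  prefixlen;
	uint16_t reserved;
};

/* Hypothetical stand-in for a generic socket address. */
struct model_sockaddr {
	uint16_t family;
	char     data[14];
};

/* Smallest length, in 64-bit words, that still contains the family field. */
static unsigned int model_min_address_len(void)
{
	assert(sizeof(struct model_sadb_address) == 8);

	return (unsigned int)MODEL_DIV_ROUND_UP(
		sizeof(struct model_sadb_address) +
		MODEL_OFFSETOFEND(struct model_sockaddr, family),
		sizeof(uint64_t));
}

/* Non-zero if the declared length is large enough to read the family. */
static int model_address_len_ok(uint16_t declared_len)
{
	return declared_len >= model_min_address_len();
}
```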
pull request (net): ipsec 2018-01-11
1) Don't allow changing the encap type on state updates.
   The encap type is set on state initialization and
   should not change anymore. From Herbert Xu.

2) Skip dead policies when rehashing to fix a
   slab-out-of-bounds bug in xfrm_hash_rebuild.
   From Florian Westphal.

3) Two buffer overread fixes in pfkey.
   From Eric Biggers.

4) Fix rcu usage in xfrm_get_type_offload,
   request_module can sleep, so can't be used
   under rcu_read_lock. From Sabrina Dubroca.

5) Fix an uninitialized lock in xfrm_trans_queue.
   Use __skb_queue_tail instead of skb_queue_tail
   in xfrm_trans_queue as we don't need the lock.
   From Herbert Xu.

6) Currently it is possible to create an xfrm state with an
   unknown encap type in ESP IPv4. Fix this by returning an
   error on unknown encap types. Also from Herbert Xu.

7) Fix sleeping inside a spinlock in xfrm_policy_cache_flush.
   From Florian Westphal.

8) Fix ESP GRO when the headers are not fully in the linear part
   of the skb. We need to pull before we can access them.

9) Fix a skb leak on error in key_notify_policy.

10) Fix a race in the xdst pcpu cache, we need to
    run the resolver routines with bottom halves
    off like the old flowcache did.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 2758b3e3e630ba304fc4aca434d591e70e528298:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-12-28 23:20:21 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 76a4201191814a0061cb5c861fafb9ecaa764846:

  xfrm: Fix a race in the xdst pcpu cache. (2018-01-10 12:14:28 +0100)

Eric Biggers (2):
      af_key: fix buffer overread in verify_address_len()
      af_key: fix buffer overread in parse_exthdrs()

Florian Westphal (2):
      xfrm: skip policies marked as dead while rehashing
      xfrm: don't call xfrm_policy_cache_flush while holding spinlock

Herbert Xu (3):
      xfrm: Forbid state updates from changing encap type
      xfrm: Use __skb_queue_tail in xfrm_trans_queue
      xfrm: Return error on unknown encap_type in init_state

Sabrina Dubroca (1):
      xfrm: fix rcu usage in xfrm_get_type_offload

Steffen Klassert (3):
      esp: Fix GRO when the headers not fully in the linear part of the skb.
      af_key: Fix memory leak in key_notify_policy.
      xfrm: Fix a race in the xdst pcpu cache.

 net/ipv4/esp4.c         |  1 +
 net/ipv4/esp4_offload.c |  3 ++-
 net/ipv6/esp6.c         |  3 +--
 net/ipv6/esp6_offload.c |  3 ++-
 net/key/af_key.c        | 12 +++-
 net/xfrm/xfrm_input.c   |  2 +-
 net/xfrm/xfrm_policy.c  | 15 +++
 net/xfrm/xfrm_state.c   | 11 +--
 8 files changed, 38 insertions(+), 12 deletions(-)
[PATCH 01/11] xfrm: Forbid state updates from changing encap type
From: Herbert Xu Currently we allow state updates to completely replace the contents of x->encap. This is bad because on the user side ESP only sets up header lengths depending on encap_type once when the state is first created. This could result in the header lengths getting out of sync with the actual state configuration. In practice key managers will never do a state update to change the encapsulation type. Only the port numbers need to be changed as the peer NAT entry is updated. Therefore this patch adds a check in xfrm_state_update to forbid any changes to the encap_type. Signed-off-by: Herbert Xu Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_state.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 500b3391f474..1e80f68e2266 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -1534,8 +1534,12 @@ int xfrm_state_update(struct xfrm_state *x) err = -EINVAL; spin_lock_bh(&x1->lock); if (likely(x1->km.state == XFRM_STATE_VALID)) { - if (x->encap && x1->encap) + if (x->encap && x1->encap && + x->encap->encap_type == x1->encap->encap_type) memcpy(x1->encap, x->encap, sizeof(*x1->encap)); + else if (x->encap || x1->encap) + goto fail; + if (x->coaddr && x1->coaddr) { memcpy(x1->coaddr, x->coaddr, sizeof(*x1->coaddr)); } @@ -1552,6 +1556,8 @@ int xfrm_state_update(struct xfrm_state *x) x->km.state = XFRM_STATE_DEAD; __xfrm_state_put(x); } + +fail: spin_unlock_bh(&x1->lock); xfrm_state_put(x1); -- 2.14.1
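The two-branch condition in this diff covers four cases, which a small userspace model makes explicit. The sketch below uses a hypothetical `struct model_encap` and `model_update_encap()`; it mirrors only the branch logic of the patched hunk (copy when both sides have an encap of the same type, fail when exactly one side has one or the types differ, do nothing when neither has one), not the surrounding locking or state handling.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cut-down encapsulation template. */
struct model_encap {
	int encap_type;
	unsigned short sport;
	unsigned short dport;
};

/*
 * Applies an update from 'new_encap' to 'cur'. Either pointer may be NULL,
 * meaning that side of the update carries no encapsulation.
 */
static int model_update_encap(struct model_encap *cur,
			      const struct model_encap *new_encap)
{
	if (cur && new_encap && cur->encap_type == new_encap->encap_type) {
		*cur = *new_encap;	/* same type: the ports may change */
		return 0;
	}
	if (cur || new_encap)
		return -EINVAL;		/* type changed, or encap added/removed */

	return 0;			/* neither side uses encapsulation */
}
```

The port-only update is the case the commit message describes as the one key managers actually need, for example when the peer's NAT mapping changes.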
Re: [iptables] extensions: add support for 'srh' match
On Thu, Jan 11, 2018 at 11:14:52AM +0100, Ahmed Abdelsalam wrote:
> On Wed, 10 Jan 2018 16:32:24 +0100
> Pablo Neira Ayuso wrote:
>
> > On Fri, Dec 29, 2017 at 12:08:25PM +0100, Ahmed Abdelsalam wrote:
> > > This patch adds a new exetension to iptables to supprt 'srh' match
> > > The implementation considers revision 7 of the SRH draft.
> > > https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
> > >
> > > Signed-off-by: Ahmed Abdelsalam
> > > ---
> > >  extensions/libip6t_srh.c                | 283
> > >  include/linux/netfilter_ipv6/ip6t_srh.h |  63 +++
> >
> > Please, add a extensions/libip6t_srh.t test file and send a v2.
> >
> > Thanks.
> Ok,
> Is there minimum requirements of the test cases to be added to the
> extensions/libip6t_srh.t file ?

I leave it up to you to decide what level of coverage you consider is good to make sure that future changes don't break your new feature.

Thanks!
BUG: using smp_processor_id() in preemptible
Greetings, I'm getting occasional video lock-ups, and while checking logs I found these: === [ 297.445296] BUG: using smp_processor_id() in preemptible [] code: claws-mail/1635 [ 297.445319] caller is jprobe_return+0x12/0x25 [ 297.445332] CPU: 1 PID: 1635 Comm: claws-mail Not tainted 4.14.0 #1 [ 297.445341] Hardware name: Micro-Star International Co., Ltd. GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011 [ 297.445349] Call Trace: [ 297.445372] dump_stack+0x9f/0xe1 [ 297.445392] check_preemption_disabled+0xec/0xf0 [ 297.445409] jprobe_return+0x12/0x25 [ 297.445425] tcp_v4_do_rcv+0x7f/0x1a0 [ 297.445443] __release_sock+0x6d/0x100 [ 297.445462] release_sock+0x2b/0xb0 [ 297.445475] tcp_recvmsg+0x300/0x8f0 [ 297.445504] ? __lock_acquire+0x3ee/0x1610 [ 297.445517] ? core_sys_select+0x240/0x3e0 [ 297.445541] inet_recvmsg+0x51/0x1b0 [ 297.445566] sock_read_iter+0x8c/0xd0 [ 297.445598] __vfs_read+0xd5/0x140 [ 297.445632] vfs_read+0x9e/0x150 [ 297.445652] SyS_read+0x45/0xa0 [ 297.445675] entry_SYSCALL_64_fastpath+0x23/0xc2 [ 297.445687] RIP: 0033:0x7ff2536001b8 [ 297.445696] RSP: 002b:7ff247152890 EFLAGS: 0246 ORIG_RAX: [ 297.445713] RAX: ffda RBX: 9cd088ccbff0 RCX: 7ff2536001b8 [ 297.445721] RDX: 0005 RSI: 7ff23c02bb43 RDI: 0013 [ 297.445730] RBP: 7ff23c02bb43 R08: R09: 7ff23c00e520 [ 297.445738] R10: 0010 R11: 0246 R12: 0086 [ 297.445746] R13: 002f R14: 7ff254d3c998 R15: 0001 ... [ 366.965766] BUG: using smp_processor_id() in preemptible [] code: Socket Thread/1435 [ 366.965769] caller is jprobe_return+0x12/0x25 [ 366.965773] CPU: 0 PID: 1435 Comm: Socket Thread Not tainted 4.14.0 #1 [ 366.965775] Hardware name: Micro-Star International Co., Ltd. 
GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011 [ 366.965777] Call Trace: [ 366.965780] dump_stack+0x9f/0xe1 [ 366.965786] check_preemption_disabled+0xec/0xf0 [ 366.965790] jprobe_return+0x12/0x25 [ 366.965793] tcp_v4_do_rcv+0x7f/0x1a0 [ 366.965797] __release_sock+0x6d/0x100 [ 366.965811] release_sock+0x2b/0xb0 [ 366.965813] tcp_recvmsg+0x300/0x8f0 [ 366.965826] inet_recvmsg+0x51/0x1b0 [ 366.965834] SYSC_recvfrom+0xc6/0x130 [ 366.965845] ? entry_SYSCALL_64_fastpath+0x5/0xc2 [ 366.965848] ? trace_hardirqs_on_caller+0xcb/0x200 [ 366.965851] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 366.965858] entry_SYSCALL_64_fastpath+0x23/0xc2 [ 366.965860] RIP: 0033:0x7f475ab7e5da [ 366.965862] RSP: 002b:7f47438fc8b0 EFLAGS: 0246 ORIG_RAX: 002d [ 366.965864] RAX: ffda RBX: 9cd088ae7ff0 RCX: 7f475ab7e5da [ 366.965865] RDX: 8000 RSI: 7f4721202000 RDI: 007c [ 366.965867] RBP: R08: R09: [ 366.965868] R10: R11: 0246 R12: 0086 [ 366.965869] R13: 7f47212025a8 R14: 7a58 R15: 7f474ba1e5f2 [ 366.966571] BUG: using smp_processor_id() in preemptible [] code: Socket Thread/1435 [ 366.966574] caller is jprobe_return+0x12/0x25 [ 366.966576] CPU: 0 PID: 1435 Comm: Socket Thread Not tainted 4.14.0 #1 [ 366.966577] Hardware name: Micro-Star International Co., Ltd. GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011 [ 366.966578] Call Trace: [ 366.966582] dump_stack+0x9f/0xe1 [ 366.966586] check_preemption_disabled+0xec/0xf0 [ 366.966592] jprobe_return+0x12/0x25 [ 366.966596] tcp_v4_do_rcv+0x7f/0x1a0 [ 366.966601] __release_sock+0x6d/0x100 [ 366.966606] release_sock+0x2b/0xb0 [ 366.966610] tcp_recvmsg+0x300/0x8f0 [ 366.966622] inet_recvmsg+0x51/0x1b0 [ 366.966630] SYSC_recvfrom+0xc6/0x130 [ 366.966643] ? entry_SYSCALL_64_fastpath+0x5/0xc2 [ 366.966647] ? trace_hardirqs_on_caller+0xcb/0x200 [ 366.966651] ? 
trace_hardirqs_on_thunk+0x1a/0x1c [ 366.97] entry_SYSCALL_64_fastpath+0x23/0xc2 [ 366.99] RIP: 0033:0x7f475ab7e5da [ 366.966670] RSP: 002b:7f47438fc8b0 EFLAGS: 0246 ORIG_RAX: 002d [ 366.966673] RAX: ffda RBX: 9cd088ae7ff0 RCX: 7f475ab7e5da [ 366.966674] RDX: 8000 RSI: 7f4721202000 RDI: 007c [ 366.966676] RBP: R08: R09: [ 366.966677] R10: R11: 0246 R12: 0086 [ 366.966679] R13: 7f47438fca70 R14: 05a8 R15: 7f4721202000 [ 366.979991] BUG: using smp_processor_id() in preemptible [] code: Socket Thread/1435 [ 366.97] caller is jprobe_return+0x12/0x25 [ 366.980004] CPU: 0 PID: 1435 Comm: Socket Thread Not tainted 4.14.0 #1 [ 366.980007] Hardware name: Micro-Star International Co., Ltd. GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011 [ 366.980012] Call Trace: [ 366.980023] dump_stack+0x9f/0xe1 [ 366.980033] check_
Re: general protection fault in sctp_v6_get_dst
On Thu, Jan 11, 2018 at 05:30:17PM +0800, Xin Long wrote: > On Thu, Jan 11, 2018 at 2:15 AM, syzbot > wrote: > > syzkaller has found reproducer for the following crash on > > 61ad64080e039dce99a7f8d89b729bbea995e2f7 > > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master > > compiler: gcc (GCC) 7.1.1 20170620 > > .config is attached > > Raw console output is attached. > > C reproducer is attached > > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > > for information about syzkaller reproducers > > > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: syzbot+7b7b518b1228d2743...@syzkaller.appspotmail.com > > It will help syzbot understand when the bug is fixed. > > > > device lo entered promiscuous mode > > kasan: CONFIG_KASAN_INLINE enabled > > kasan: GPF could be caused by NULL-ptr deref or user memory access > > general protection fault: [#1] SMP KASAN > > Dumping ftrace buffer: > >(ftrace buffer empty) > > Modules linked in: > > CPU: 0 PID: 3506 Comm: syzkaller968983 Not tainted 4.15.0-rc7+ #181 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > RIP: 0010:__read_once_size include/linux/compiler.h:183 [inline] > > RIP: 0010:sctp_v6_get_dst+0x59e/0x1c60 net/sctp/ipv6.c:271 > > RSP: 0018:8801db205e20 EFLAGS: 00010206 > > RAX: dc00 RBX: RCX: 8512e05b > > RDX: 000f RSI: 67cf608c RDI: 8801db22376c > > RBP: 8801db206190 R08: 11003b640b05 R09: 0002 > > R10: 8801db205cf0 R11: 8512e008 R12: 8801bf884db0 > > R13: 204e R14: 8801bfe3e680 R15: 8801bf884d80 > > FS: 7f122e219700() GS:8801db20() knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 20aaff09 CR3: 0001bfdf0005 CR4: 001606f0 > > > > DR0: DR1: DR2: > > DR3: DR6: fffe0ff0 DR7: 0400 > > Call Trace: > > > > sctp_transport_route+0xa8/0x430 net/sctp/transport.c:293 > > sctp_assoc_add_peer+0x4fe/0x1190 net/sctp/associola.c:655 > > sctp_process_init+0x119/0x2440 net/sctp/sm_make_chunk.c:2341 > > 
sctp_sf_do_5_1B_init+0x8c9/0xe80 net/sctp/sm_statefuns.c:414 > > sctp_do_sm+0x192/0x6ed0 net/sctp/sm_sideeffect.c:1178 > > sctp_endpoint_bh_rcv+0x379/0x8f0 net/sctp/endpointola.c:456 > > sctp_inq_push+0x23b/0x300 net/sctp/inqueue.c:95 > > sctp_rcv+0x29f3/0x35c0 net/sctp/input.c:267 > > sctp6_rcv+0x15/0x30 net/sctp/ipv6.c:1006 > > ip6_input_finish+0x37e/0x17a0 net/ipv6/ip6_input.c:284 > > NF_HOOK include/linux/netfilter.h:288 [inline] > > ip6_input+0xdb/0x560 net/ipv6/ip6_input.c:327 > > dst_input include/net/dst.h:449 [inline] > > ip6_rcv_finish+0x1a9/0x7a0 net/ipv6/ip6_input.c:71 > > NF_HOOK include/linux/netfilter.h:288 [inline] > > ipv6_rcv+0xf37/0x1fa0 net/ipv6/ip6_input.c:208 > > __netif_receive_skb_core+0x1a41/0x3460 net/core/dev.c:4538 > > __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4603 > > process_backlog+0x203/0x740 net/core/dev.c:5283 > > napi_poll net/core/dev.c:5681 [inline] > > net_rx_action+0x792/0x1910 net/core/dev.c:5747 > > __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 > > do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1133 > > > > do_softirq.part.21+0x14d/0x190 kernel/softirq.c:329 > > do_softirq kernel/softirq.c:177 [inline] > > __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182 > > local_bh_enable include/linux/bottom_half.h:32 [inline] > > rcu_read_unlock_bh include/linux/rcupdate.h:727 [inline] > > ip6_finish_output2+0xba0/0x23a0 net/ipv6/ip6_output.c:121 > > ip6_finish_output+0x698/0xaf0 net/ipv6/ip6_output.c:154 > > NF_HOOK_COND include/linux/netfilter.h:277 [inline] > > ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:171 > > dst_output include/net/dst.h:443 [inline] > > NF_HOOK include/linux/netfilter.h:288 [inline] > > ip6_xmit+0xd84/0x2090 net/ipv6/ip6_output.c:277 > > sctp_v6_xmit+0x438/0x630 net/sctp/ipv6.c:225 > > sctp_packet_transmit+0x225e/0x3750 net/sctp/output.c:638 > > sctp_outq_flush+0xabb/0x4060 net/sctp/outqueue.c:911 > > sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:776 > > sctp_cmd_interpreter 
net/sctp/sm_sideeffect.c:1807 [inline] > > sctp_side_effects net/sctp/sm_sideeffect.c:1210 [inline] > > sctp_do_sm+0x4e0/0x6ed0 net/sctp/sm_sideeffect.c:1181 > > sctp_primitive_ASSOCIATE+0x9d/0xd0 net/sctp/primitive.c:88 > > sctp_sendmsg+0x1d2e/0x33f0 net/sctp/socket.c:2018 > > inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764 > > sock_sendmsg_nosec net/socket.c:628 [inline] > > sock_sendmsg+0xca/0x110 net/socket.c:638 > > SYSC_sendto+0x361/0x5c0 net/socket.c:1719 > > SyS_sendto+0x40/0x50 net/socket.c:1687 > > entry_SYSCALL_64_fastpath+0x23/0x9a > > RIP: 0033:0x4456c9 > > RSP: 002b:7f122e218d98 EFLAGS: 0212 ORIG_RAX: 002c > > RAX: ffd
[PATCH net] ip6_gre: init dev->mtu and dev->hard_header_len correctly
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions") moved the dev->mtu initialization from ip6gre_tunnel_setup() to ip6gre_tunnel_init(). As a result, values set before ndo_init() are reset in the following cases:

* rtnl_create_link() can update dev->mtu from the IFLA_MTU parameter.

* ip6gre_tnl_link_config() is invoked before ndo_init() in the netlink and ioctl setup paths, so ndo_init() can also reset the adjustments derived from the lower device's MTU, i.e. dev->mtu and dev->hard_header_len.

This does not apply to ip6gretap, because it makes one more call to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init().

Since net_device is initially allocated with kvzalloc(), check that dev->mtu is still zero, i.e. not changed, before setting the default MTU inside ndo_init(), and invoke ip6gre_tnl_link_config() after setting the default values. For ip6gretap, reset dev->mtu to zero in ip6gre_tap_setup() after ether_setup(), so that it works with the new check in ip6gre_tunnel_init_common().

Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
Fixes: db2ec95d1ba4 ("ip6_gre: Fix MTU setting")
Signed-off-by: Alexey Kodanev
---
 net/ipv6/ip6_gre.c | 22 ++++++++++------------
 1 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 7726959..edf65d0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -337,7 +337,6 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, struct ip6_tnl *t)
 
 	nt->dev = dev;
 	nt->net = dev_net(dev);
-	ip6gre_tnl_link_config(nt, 1);
 
 	if (register_netdevice(dev) < 0)
 		goto failed_free;
@@ -1047,6 +1046,7 @@ static void ip6gre_tnl_init_features(struct net_device *dev)
 static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
+	int set_mtu = !dev->mtu;
 	int ret;
 	int t_hlen;
 
@@ -1072,13 +1072,16 @@ static int ip6gre_tunnel_init_common(struct net_device *dev)
 	t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
 
 	dev->hard_header_len = LL_MAX_HEADER + t_hlen;
-	dev->mtu = ETH_DATA_LEN - t_hlen;
-	if (dev->type == ARPHRD_ETHER)
-		dev->mtu -= ETH_HLEN;
-	if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
-		dev->mtu -= 8;
+	if (set_mtu) {
+		dev->mtu = ETH_DATA_LEN - t_hlen;
+		if (dev->type == ARPHRD_ETHER)
+			dev->mtu -= ETH_HLEN;
+		if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
+			dev->mtu -= 8;
+	}
 
 	ip6gre_tnl_init_features(dev);
+	ip6gre_tnl_link_config(tunnel, set_mtu);
 
 	return 0;
 }
@@ -1303,7 +1306,6 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 
 static int ip6gre_tap_init(struct net_device *dev)
 {
-	struct ip6_tnl *tunnel;
 	int ret;
 
 	ret = ip6gre_tunnel_init_common(dev);
@@ -1312,10 +1314,6 @@ static int ip6gre_tap_init(struct net_device *dev)
 
 	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
-	tunnel = netdev_priv(dev);
-
-	ip6gre_tnl_link_config(tunnel, 1);
-
 	return 0;
 }
 
@@ -1335,6 +1333,7 @@ static void ip6gre_tap_setup(struct net_device *dev)
 
 	ether_setup(dev);
 
+	dev->mtu = 0;
 	dev->max_mtu = 0;
 	dev->netdev_ops = &ip6gre_tap_netdev_ops;
 	dev->needs_free_netdev = true;
@@ -1408,7 +1407,6 @@ static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
 
 	nt->dev = dev;
 	nt->net = dev_net(dev);
-	ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
 
 	err = register_netdevice(dev);
 	if (err)
-- 
1.7.1
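
The "only set the default when dev->mtu is still zero" rule from the patch can be tried out in user space. The following is a minimal sketch, not the kernel code: ETH_DATA_LEN (1500) and ETH_HLEN (14) mirror the kernel constants, while struct fake_dev, default_gre_mtu() and init_mtu() are hypothetical names introduced only for this illustration.

```c
#include <stdbool.h>

#define ETH_DATA_LEN 1500 /* maximum Ethernet payload, as in <linux/if_ether.h> */
#define ETH_HLEN       14 /* Ethernet header length */

/* Hypothetical stand-in for the few net_device fields involved. */
struct fake_dev {
	unsigned int mtu; /* 0 means nobody has set it yet */
	bool is_ether;    /* models dev->type == ARPHRD_ETHER */
};

/* Default MTU, following the arithmetic in ip6gre_tunnel_init_common(). */
unsigned int default_gre_mtu(const struct fake_dev *dev, int t_hlen,
			     bool ign_encap_limit)
{
	unsigned int mtu = ETH_DATA_LEN - t_hlen;

	if (dev->is_ether)
		mtu -= ETH_HLEN;
	if (!ign_encap_limit)
		mtu -= 8; /* models the IP6_TNL_F_IGN_ENCAP_LIMIT check */
	return mtu;
}

/*
 * Apply the default only if the MTU still has its zero-initialised value.
 * Returns the equivalent of set_mtu from the patch.
 */
bool init_mtu(struct fake_dev *dev, int t_hlen, bool ign_encap_limit)
{
	bool set_mtu = !dev->mtu;

	if (set_mtu)
		dev->mtu = default_gre_mtu(dev, t_hlen, ign_encap_limit);
	return set_mtu;
}
```

With t_hlen = 44 (assuming a 4-byte GRE header plus the 40-byte IPv6 header), an untouched device ends up with an MTU of 1448, while a device whose MTU was already set, say to 1400 via IFLA_MTU, keeps its value.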
Re: [patch net-next v7 00/13] net: sched: allow qdiscs to share filter block instances
On 18-01-09 09:07 AM, Jiri Pirko wrote:
> From: Jiri Pirko
>
> Currently the filters added to qdiscs are independent. So for example if you
> have 2 netdevices and you create ingress qdisc on both and you want to add
> identical filter rules to both, you need to add them twice. This patchset
> makes this easier and mainly saves resources allowing to share all filters
> within a qdisc - I call it a "filter block". Also this helps to save
> resources when we do offload to hw for example to expensive TCAM.
>
> So back to the example. First, we create 2 qdiscs. Both will share
> block number 22. "22" is just an identification:
> $ tc qdisc add dev ens7 ingress block 22
> $ tc qdisc add dev ens8 ingress block 22
>
> If we don't specify "block" command line option, no shared block would
> be created:
> $ tc qdisc add dev ens9 ingress
>
> Now if we list the qdiscs, we will see the block index in the output:
>
> $ tc qdisc
> qdisc ingress : dev ens7 parent :fff1 block 22
> qdisc ingress : dev ens8 parent :fff1 block 22
> qdisc ingress : dev ens9 parent :fff1
>
> To make it more visual, the situation looks like this:
>
>    ens7 ingress qdisc          ens8 ingress qdisc
>            |                           |
>            |                           |
>            +-------> block 22 <--------+
>
> Unlimited number of qdiscs may share the same block.
>
> Now we can add filter using the block index:
>
> $ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop
>
> Note we cannot use the qdisc for filter manipulations of shared blocks:
>
> $ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip 192.168.100.2 action drop
> Error: This filter block is shared. Please use the block index to manipulate the filters.
>
> We will see the same output if we list filters for ingress qdisc of
> ens7 and ens8, also for the block 22:
>
> $ tc filter show block 22
> filter block 22 protocol ip pref 25 flower chain 0
> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
> ...
>
> $ tc filter show dev ens7 ingress
> filter block 22 protocol ip pref 25 flower chain 0
> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
> ...
>
> $ tc filter show dev ens8 ingress
> filter block 22 protocol ip pref 25 flower chain 0
> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
> ...

Somewhere here mention the egress issue we talked about, something
like:

At the moment only ingress and clsact_xxx are well supported by the
block infrastructure. For this to work well with egress qdisc,
all the ports/qdiscs sharing the block will have to be symmetric.
e.g. if ens8 and ens9 shared a block at their (egress) root qdiscs,
then those qdiscs would both need to have the same handle id.
An example of a symmetric shared block setup would look like:

tc qdisc add dev ens8 root block 22 handle 1:0 prio
tc qdisc add dev ens9 root block 22 handle 1:0 prio

I am confident the above would work. You said you are thinking of
getting this to always work (I can't think of a simple way to do it),
but for the moment the above is fine.
Most people who want this would probably use clsact egress and not
care about queues (so it may never be "fixed").

cheers,
jamal
[PATCH 0/3] ipsec: Add ESP over TCP encapsulation
Hi:

This series of patches adds basic support for ESP over TCP (RFC 8229). Note that this does not include TLS support but it could be added in future.

Here is an iproute patch to set up xfrm states with this:

diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index 12c2f72..f3fb1e2 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -738,6 +738,9 @@ void xfrm_xfrma_print(struct rtattr *tb[], __u16 family,
 		case 2:
 			fprintf(fp, "espinudp ");
 			break;
+		case 6:
+			fprintf(fp, "espintcp ");
+			break;
 		default:
 			fprintf(fp, "%u ", e->encap_type);
 			break;
@@ -1182,6 +1185,8 @@ int xfrm_encap_type_parse(__u16 *type, int *argcp, char ***argvp)
 		*type = 1;
 	else if (strcmp(*argv, "espinudp") == 0)
 		*type = 2;
+	else if (strcmp(*argv, "espintcp") == 0)
+		*type = 6;
 	else
 		invarg("ENCAP-TYPE value is invalid", *argv);
 

Here is a sample program for setting up the TCP socket to use this. Note that it doesn't do the magic word as required by RFC 8229 so you'll need to add that for a real key manager.

#include <arpa/inet.h>
#include <errno.h>
#include <error.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#define TCP_ENCAP 35

int main(int argc, char **argv)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(4500),
	};
	char buf[4096];
	int one = 1;
	int err;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		error(-1, errno, "socket");

	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		error(-1, errno, "bind");

	/* With an address argument act as the initiator, otherwise listen. */
	if (argc > 1) {
		addr.sin_addr.s_addr = inet_addr(argv[1]);
		if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
			error(-1, errno, "connect");
	} else {
		if (listen(s, 0) < 0)
			error(-1, errno, "listen");
		s = accept(s, NULL, 0);
		if (s < 0)
			error(-1, errno, "accept");
	}

	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
		error(-1, errno, "TCP_NODELAY");

	/* Switch the connected socket into ESP encapsulation mode. */
	if (setsockopt(s, SOL_TCP, TCP_ENCAP, NULL, 0) < 0)
		error(-1, errno, "TCP_ENCAP");

	/* Drain whatever the kernel passes up until EOF or error. */
	while ((err = read(s, buf, sizeof(buf))) > 0)
		;

	if (err < 0)
		error(-1, errno, "read");

	return 0;
}

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
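
The "magic word" that the sample program leaves out is the RFC 8229 stream prefix: the initiator starts the TCP stream with the six ASCII bytes "IKETCP". A minimal sketch of how a key manager might send and check it is shown below. The helper names send_stream_prefix() and expect_stream_prefix() are hypothetical and not part of the series, and where exactly the prefix has to be exchanged relative to enabling TCP_ENCAP is an assumption that would need checking against the kernel patches.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* RFC 8229 stream prefix, sent once by the initiator at stream start. */
static const unsigned char ike_tcp_prefix[6] = { 'I', 'K', 'E', 'T', 'C', 'P' };

/* Write the prefix to a connected stream socket; 0 on success, -1 on error. */
int send_stream_prefix(int fd)
{
	size_t off = 0;

	while (off < sizeof(ike_tcp_prefix)) {
		ssize_t n = send(fd, ike_tcp_prefix + off,
				 sizeof(ike_tcp_prefix) - off, 0);
		if (n < 0)
			return -1;
		off += (size_t)n;
	}
	return 0;
}

/*
 * Read exactly six bytes and compare them with the prefix.
 * Returns 1 on match, 0 on mismatch, -1 on error or early EOF.
 */
int expect_stream_prefix(int fd)
{
	unsigned char buf[sizeof(ike_tcp_prefix)];
	size_t off = 0;

	while (off < sizeof(buf)) {
		ssize_t n = recv(fd, buf + off, sizeof(buf) - off, 0);
		if (n <= 0)
			return -1;
		off += (size_t)n;
	}
	return memcmp(buf, ike_tcp_prefix, sizeof(buf)) == 0;
}
```

In the sample program above, the initiator branch would presumably call send_stream_prefix(s) right after connect(), and the listening branch expect_stream_prefix(s) right after accept().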
[PATCH 1/3] skbuff: Avoid sleeping in skb_send_sock_locked
For a function that needs to be called with the socket spinlock held, sleeping would seem to be a bad idea. This function does in fact avoid sleeping when calling kernel_sendpage_locked on the page part of the skb. However, it doesn't do that when sending the linear part, which results in sleeping when the socket send buffer is full.

This patch fixes it by setting the MSG_DONTWAIT flag when calling kernel_sendmsg_locked.

Signed-off-by: Herbert Xu
---
 net/core/skbuff.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6b0ff39..8197b7a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2279,6 +2279,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 		kv.iov_base = skb->data + offset;
 		kv.iov_len = slen;
 		memset(&msg, 0, sizeof(msg));
+		msg.msg_flags = MSG_DONTWAIT;
 
 		ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
 		if (ret <= 0)
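
The effect of MSG_DONTWAIT can be observed from user space, where the flag has the same meaning. The sketch below is only an analogue of the kernel change, not the kernel path itself: fill_without_blocking() is a hypothetical helper that keeps queueing data on a socket and stops as soon as the send buffer is full, which is the point where a send without the flag would go to sleep.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Queue data on fd without ever sleeping. Returns the number of bytes
 * accepted before the send buffer filled up, or -1 on an unexpected error.
 */
long fill_without_blocking(int fd)
{
	char chunk[4096];
	long total = 0;

	memset(chunk, 0, sizeof(chunk));
	for (;;) {
		ssize_t n = send(fd, chunk, sizeof(chunk), MSG_DONTWAIT);

		if (n < 0) {
			/* Buffer full: a send without the flag would sleep here. */
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return total;
			return -1;
		}
		total += n;
	}
}
```

Calling it on one end of a socketpair() whose other end is never read returns a positive byte count instead of hanging.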
[PATCH 2/3] tcp: Add ESP encapsulation support
This patch adds the plumbing in TCP for ESP encapsulation support per RFC8229.

The patch mostly deals with inbound processing, as well as enabling TCP encapsulation on a socket through setsockopt. The outbound processing is dealt with in the ESP code as is done for UDP.

The inbound processing is split into two halves. First of all, the softirq path directly intercepts ESP packets and feeds them into the IPsec stack. Most of the time the packet will be freed right away if it contains complete ESP packets. However, if the message is incomplete or it contains non-ESP data, then the skb will be added to the receive queue. We also add packets to the receive queue if it is currently non-empty, in order to preserve sequence number continuity and minimise the changes to the TCP code.

On the user-space facing side, packets marked as ESP-only are skipped and not visible to user-space. However, some ESP data may seep through. For example, if we receive a partial message then we will always give it to user-space regardless of whether it turns out to be ESP or not. So user-space should be prepared to skip ESP messages (SPI != 0).

There is a little bit of code dealing with the encapsulation side. In particular, if encapsulation data comes in while the socket is owned by user-space, the packets will be stored in tp->encap_out and processed during release_sock.

Signed-off-by: Herbert Xu --- include/linux/tcp.h | 15 ++ include/net/tcp.h| 27 +++ include/uapi/linux/tcp.h |1 include/uapi/linux/udp.h |1 net/ipv4/tcp.c | 68 + net/ipv4/tcp_input.c | 326 +-- net/ipv4/tcp_ipv4.c |1 net/ipv4/tcp_output.c| 48 ++ 8 files changed, 473 insertions(+), 14 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index ca4a636..1360a0e 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -225,7 +225,8 @@ struct tcp_sock { fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */ fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */ is_sack_reneg:1,/* in recovery from loss with SACK reneg? */ - unused:2; + encap:1,/* TCP IKE/ESP encapsulation */ + encap_lenhi_valid:1; u8 nonagle : 4,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ unused1 : 1, @@ -373,6 +374,16 @@ struct tcp_sock { */ struct request_sock *fastopen_rsk; u32 *saved_syn; + +#ifdef CONFIG_XFRM +/* TCP ESP encapsulation */ + struct sk_buff *encap_in; + struct sk_buff_head encap_out; + u32 encap_seq; + u32 encap_last; + u16 encap_backlog; + u8 encap_lenhi; +#endif }; enum tsq_enum { @@ -384,6 +395,7 @@ enum tsq_enum { TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call * tcp_v{4|6}_mtu_reduced() */ + TCP_ESP_DEFERRED, /* esp_output_tcp_encap2 queued packets */ }; enum tsq_flags { @@ -393,6 +405,7 @@ enum tsq_flags { TCPF_WRITE_TIMER_DEFERRED = (1UL << TCP_WRITE_TIMER_DEFERRED), TCPF_DELACK_TIMER_DEFERRED = (1UL << TCP_DELACK_TIMER_DEFERRED), TCPF_MTU_REDUCED_DEFERRED = (1UL << TCP_MTU_REDUCED_DEFERRED), + TCPF_ESP_DEFERRED = (1UL << TCP_ESP_DEFERRED), }; static inline struct tcp_sock *tcp_sk(const struct sock *sk) diff --git a/include/net/tcp.h b/include/net/tcp.h index 6da880d..6513ae2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -327,6 +327,7 @@ int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset, size_t size, int flags); ssize_t do_tcp_sendpages(struct sock *sk, struct 
page *page, int offset, size_t size, int flags); +int tcp_encap_output(struct sock *sk, struct sk_buff *skb); void tcp_release_cb(struct sock *sk); void tcp_wfree(struct sk_buff *skb); void tcp_write_timer_handler(struct sock *sk); @@ -399,6 +400,7 @@ int compat_tcp_setsockopt(struct sock *sk, int level, int optname, char __user *optval, unsigned int optlen); void tcp_set_keepalive(struct sock *sk, int val); void tcp_syn_ack_timeout(const struct request_sock *req); +void tcp_cleanup_rbuf(struct sock *sk, int copied); int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len); void tcp_parse_options(const struct net *net, const struct sk_buff *skb, @@ -789,7 +791,8 @@ struct tcp_skb_cb { __u8txstamp_ack:1, /* Record TX timestamp for ack? */ eor:1, /* Is skb MSG_EOR marked? */ has_rxtstamp:1, /* SKB has a RX timestamp */ -
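
The advice in the commit message that user-space "should be prepared to skip ESP messages (SPI != 0)" can be sketched as follows. This assumes the RFC 8229 framing, where every message starts with a 16-bit big-endian length that counts the length field itself, followed either by a four-byte zero Non-ESP Marker (IKE) or by a non-zero SPI (ESP). The enum and the function classify_message() are hypothetical names, and treating lengths below six as invalid is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

enum encap_kind {
	ENCAP_INCOMPLETE, /* need more bytes from the stream */
	ENCAP_IKE,        /* zero marker: hand to the key manager */
	ENCAP_ESP,        /* non-zero SPI: skip in user-space */
	ENCAP_INVALID     /* length too small to hold marker or SPI */
};

/*
 * Inspect the message at the start of buf (avail bytes buffered).
 * On ENCAP_IKE or ENCAP_ESP, *msg_len receives the total message length
 * including the two length bytes, so the caller knows how much to consume.
 */
enum encap_kind classify_message(const uint8_t *buf, size_t avail,
				 size_t *msg_len)
{
	size_t len;
	uint32_t spi;

	if (avail < 2)
		return ENCAP_INCOMPLETE;

	len = ((size_t)buf[0] << 8) | buf[1];
	if (len < 6)
		return ENCAP_INVALID;
	if (avail < len)
		return ENCAP_INCOMPLETE;

	spi = ((uint32_t)buf[2] << 24) | ((uint32_t)buf[3] << 16) |
	      ((uint32_t)buf[4] << 8) | (uint32_t)buf[5];

	*msg_len = len;
	return spi == 0 ? ENCAP_IKE : ENCAP_ESP;
}
```

A reader loop would call this on its receive buffer, drop msg_len bytes for ENCAP_ESP, pass them on for ENCAP_IKE, and read more data on ENCAP_INCOMPLETE.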
[PATCH 3/3] ipsec: Add ESP over TCP encapsulation support
This patch adds support for ESP over TCP encapsulation per RFC8229.

Most of the input processing is done in the TCP stack and not in this patch, which is similar to UDP encapsulation.

On the output side, there are two potential levels of indirection. Firstly all packets are fed through a tasklet in order to avoid TCP socket lock recursion. They're then processed directly if the TCP socket is not owned by user-space. If it is owned then we'll place the packet in a queue (tp->encap_out) for processing when the socket lock is released.

The first outbound packet will trigger a socket lookup for a matching TCP socket. If the TCP connection drops we will repeat the lookup as needed. The TCP socket is cached in the xfrm state and is read using RCU.

Note that unlike normal IPsec packets, once we hit a TCP xfrm state, the xfrm stack is short-circuited and the packet's journey will continue through the TCP stack, after which a new IPsec lookup will be done. This is different from how UDP encapsulation is done. This means that if you're doing nested IPsec then you will need to construct the policies with this in mind. That is, start with a new policy whenever TCP encapsulation is done.

Signed-off-by: Herbert Xu --- include/net/xfrm.h|7 + net/ipv4/esp4.c | 208 -- net/xfrm/xfrm_input.c | 21 +++-- net/xfrm/xfrm_state.c |3 4 files changed, 228 insertions(+), 11 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index ae35991..3694536 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -180,6 +180,7 @@ struct xfrm_state { /* Data for encapsulator */ struct xfrm_encap_tmpl *encap; + struct sock __rcu *encap_sk; /* Data for care-of address */ xfrm_address_t *coaddr; @@ -210,6 +211,9 @@ struct xfrm_state { u32 replay_maxage; u32 replay_maxdiff; + /* Copy of encap_type from encap to avoid locking. 
*/ + u16 encap_type; + /* Replay detection notification timer */ struct timer_list rtimer; @@ -1570,6 +1574,9 @@ struct xfrmk_spdinfo { int xfrm_prepare_input(struct xfrm_state *x, struct sk_buff *skb); int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type); int xfrm_input_resume(struct sk_buff *skb, int nexthdr); +int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb, +int (*finish)(struct net *, struct sock *, + struct sk_buff *)); int xfrm_trans_queue(struct sk_buff *skb, int (*finish)(struct net *, struct sock *, struct sk_buff *)); diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index 61fe6e4..0544e4e 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -9,13 +9,16 @@ #include #include #include +#include #include +#include #include #include #include #include #include #include +#include #include #include @@ -30,6 +33,11 @@ struct esp_output_extra { u32 esphoff; }; +struct esp_tcp_sk { + struct sock *sk; + struct rcu_head rcu; +}; + #define ESP_SKB_CB(__skb) ((struct esp_skb_cb *)&((__skb)->cb[0])) static u32 esp4_get_mtu(struct xfrm_state *x, int mtu); @@ -118,6 +126,143 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp) put_page(sg_page(sg)); } +static void esp_free_tcp_sk(struct rcu_head *head) +{ + struct esp_tcp_sk *esk = container_of(head, struct esp_tcp_sk, rcu); + + sock_put(esk->sk); + kfree(esk); +} + +static struct sock *esp_find_tcp_sk(struct xfrm_state *x) +{ + struct xfrm_encap_tmpl *encap = x->encap; + struct esp_tcp_sk *esk; + __be16 sport, dport; + struct sock *nsk; + struct sock *sk; + + sk = rcu_dereference(x->encap_sk); + if (sk && sk->sk_state == TCP_ESTABLISHED) + return sk; + + spin_lock_bh(&x->lock); + sport = encap->encap_sport; + dport = encap->encap_dport; + nsk = rcu_dereference_protected(x->encap_sk, + lockdep_is_held(&x->lock)); + if (sk && sk == nsk) { + esk = kmalloc(sizeof(*esk), GFP_ATOMIC); + if (!esk) { + spin_unlock_bh(&x->lock); + return ERR_PTR(-ENOMEM); + } + 
RCU_INIT_POINTER(x->encap_sk, NULL); + esk->sk = sk; + call_rcu(&esk->rcu, esp_free_tcp_sk); + } + spin_unlock_bh(&x->lock); + + /* XXX We don't support bound_dev_if. */ + sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4, +dport, x->props.saddr.a4, sport, 0); + + if (!sk) + return ERR_PTR(-ENOENT); + + if (!tcp_sk(sk)->encap) { + sock_put(sk); +
Re: [patch net-next v7 07/13] net: sched: use block index as a handle instead of qdisc when block is shared
On 18-01-09 09:07 AM, Jiri Pirko wrote:
> From: Jiri Pirko
>
> As the tcm_ifindex 0 is invalid ifindex, reuse it to indicate that we
> work with block, instead of qdisc. So if tcm_ifindex is 0, tcm_parent is
> used to carry block_index.

Commit log still refers to ifindex of 0 instead of TCM_IFINDEX_MAGIC_BLOCK

cheers,
jamal
Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get
On 18-01-09 09:07 AM, Jiri Pirko wrote:
> From: Jiri Pirko
>
> Add simple block get operation which primary purpose is to check the
> block existence by block index.

block_dump missing?

cheers,
jamal
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
On 18-01-09 09:07 AM, Jiri Pirko wrote:
> From: Jiri Pirko
>
> Benefit from the previously introduced shared filter blocks
> infrastructure and allow ingress and clsact qdisc instances to share
> filter blocks. The block index is coming from userspace as qdisc option.

Didn't quite follow why ingress is special and needs attributes to
set the block but other qdiscs didn't.
Will check again later after some coffee..

cheers,
jamal
Re: [PATCH 26/32] aio: refactor read/write iocb setup
On Wed, Jan 10, 2018 at 04:19:53PM -0500, Jeff Moyer wrote:
> > +static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
> > +{
> > +	int ret;
> > +
> > +	req->ki_filp = fget(iocb->aio_fildes);
> > +	if (unlikely(!req->ki_filp))
> > +		return -EBADF;
> > +	req->ki_complete = aio_complete_rw;
> > +	req->ki_flags = 0;
>
> The above assignment seems superfluous...

Thanks, fixed.
Re: [PATCH 30/32] aio: add delayed cancel support
On Wed, Jan 10, 2018 at 06:26:39PM -0500, Jeff Moyer wrote:
> >> The upcoming aio poll support would like to be able to complete the
> >> iocb inline from the cancellation context, but that would cause
> >> a lock order reversal. Add support for optionally moving the cancelation
> >> outside the context lock to avoid this reversal.
> >>
> >> Signed-off-by: Christoph Hellwig
> >
> > Acked-by: Jeff Moyer
>
> Actually, let's move these two defines:
>
> #define AIO_IOCB_DELAYED_CANCEL (1 << 0)
> #define AIO_IOCB_CANCELLED (1 << 1)
>
> to include/linux/aio.h so that drivers outside of fs/aio.c can make use
> of them.

struct aio_kiocb is private to aio.c, so just exposing them won't do
anything useful. If we really need these elsewhere we'll need to come
up with a proper interface.
Re: [PATCH net-next v2] xfrm: Add ESN support for IPSec HW offload
On 1/11/2018 10:28 AM, Yossi Kuperman wrote: From: Shannon Nelson [mailto:shannon.nel...@oracle.com] Sent: Thursday, January 11, 2018 5:21 AM On 1/10/2018 3:09 PM, Yossi Kuperman wrote: On 10 Jan 2018, at 19:36, Shannon Nelson wrote: On 1/10/2018 2:34 AM, yoss...@mellanox.com wrote: From: Yossef Efraim This patch adds ESN support to IPsec device offload. Adding new xfrm device operation to synchronize device ESN. Signed-off-by: Yossef Efraim --- Changes from v1: - Added documentation --- Documentation/networking/xfrm_device.txt | 3 +++ include/linux/netdevice.h| 1 + include/net/xfrm.h | 12 net/xfrm/xfrm_device.c | 4 ++-- net/xfrm/xfrm_replay.c | 2 ++ 5 files changed, 20 insertions(+), 2 deletions(-) [...] diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c index 7598250..704a055 100644 --- a/net/xfrm/xfrm_device.c +++ b/net/xfrm/xfrm_device.c @@ -147,8 +147,8 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x, if (!x->type_offload) return -EINVAL; -/* We don't yet support UDP encapsulation, TFC padding and ESN. */ -if (x->encap || x->tfcpad || (x->props.flags & XFRM_STATE_ESN)) +/* We don't yet support UDP encapsulation and TFC padding. */ +if (x->encap || x->tfcpad) As I mentioned before, this will cause issues when working with hardware that has no ESN support, such as Intel's x540: the stack will expect the driver to do ESN, and nothing actually happens but a rollover of the numbers. Sure, the driver could look for the ESN attribute and fail the add, but that's a mode where we have to update every driver to fend off problems every time we add a new feature. Much better is to only update drivers that actively support the new feature. You are right. I’m not sure why this check is here in the first place. IMO it should take place in xdo_dev_state_add—a driver-specific callback. If you say I'm right, then why do you say it should take place in the driver callback? I just wrote that it should *not*. 
Sorry, I wasn't clear; you are right with respect that this change will break Intel's x540 driver. However, I do think that this is the purpose of xdo_dev_state_add(). Again, As far as I can understand, and please correct me if I'm wrong, this shouldn’t be here in the first place. Please have a look at mlx5e_xfrm_validate_state(). Currently, it return an error if the user requests ESN, regardless of the underlying device's capabilities. Subsequent patch to mlx5 driver, will allow such a request if the device does support it; maintaining backward compatibility. Here is a code snippet: - if (x->props.flags & XFRM_STATE_ESN) { + if (x->props.flags & XFRM_STATE_ESN && + !(mlx5_accel_ipsec_device_caps(priv->mdev) & MLX5_ACCEL_IPSEC_ESN)) { netdev_info(netdev, "Cannot offload ESN xfrm states\n"); return -EINVAL; } This code seems to be assuming that all drivers/NICs with the offload will be able to do ESN, and this is not the case. If this code is put into place, suddenly the ixgbe driver's offload will have a failure case: the driver doesn't support ESN, and doesn't know to NAK the state_add if the ESN bit is on. This is a generic capabilities issue for which we already have a solution "pattern". I guess you are right but ixgbe driver is already checking many other caps during add_sa callback (below code from v3 patches for ixgbe ipsec): + if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) { + netdev_err(dev, "Unsupported protocol 0x%04x for ipsec offload\n", + xs->id.proto); + return -EINVAL; + } + + if (xs->xso.flags & XFRM_OFFLOAD_INBOUND) { + struct rx_sa rsa; + + if (xs->calg) { + netdev_err(dev, "Compression offload not supported\n"); + return -EINVAL; + } What is the difference for checking xs->calg exists in state to ESN? I think in long term we can refactor to cap mask declaration by the driver and call add_sa only if mask exists but this can be a totally different patch. We weren't assuming that, please see above. > What do you suggest? 
> There should be a capabilities/feature flag for the driver to set and the XFRM code shouldn't try the state_add with ESN if the driver hasn't set an ESN bit in its capabilities. Other capabilities that might make sense here are IPv6, TSO, and CSUM; there may be others. Look at how feature bits are added to netdev->features to signify what the driver can do. I think that's a much better approach. It looks like an overkill? Alternatively, just solve this by failing to add the SA that has ESN set if the driver hasn't defined your new xdo_dev_state_advance_esn(). sln sln return -EINVAL;
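
The capabilities approach suggested in this thread can be sketched in a few lines. Everything below is hypothetical and only illustrates the idea: the driver advertises what it supports in a bit mask, and the generic code refuses an ESN state before the driver callback is ever called. OFFLOAD_CAP_ESN, struct fake_offload_dev, struct fake_state and offload_state_add() are made-up names, not existing kernel symbols, and STATE_FLAG_ESN merely stands in for XFRM_STATE_ESN.

```c
#include <errno.h>
#include <stdint.h>

#define OFFLOAD_CAP_ESN (1u << 0) /* hypothetical driver capability bit */
#define STATE_FLAG_ESN  (1u << 0) /* stands in for XFRM_STATE_ESN */

/* Hypothetical device: only carries the advertised capability mask. */
struct fake_offload_dev {
	uint32_t caps;
};

/* Hypothetical SA state: only carries the requested flags. */
struct fake_state {
	uint32_t flags;
};

/*
 * Generic-layer check: reject ESN states on devices that did not
 * advertise ESN, so drivers without ESN support need no changes.
 */
int offload_state_add(const struct fake_offload_dev *dev,
		      const struct fake_state *x)
{
	if ((x->flags & STATE_FLAG_ESN) && !(dev->caps & OFFLOAD_CAP_ESN))
		return -EINVAL;

	/* A real implementation would call the driver's add callback here. */
	return 0;
}
```

The design point under discussion is where this test lives: in the generic code, as sketched here, a driver that knows nothing about ESN is protected automatically, whereas a check inside each driver's callback has to be added to every driver.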
Re: [PATCH net] net: ipv4: Make "ip route get" match iif lo rules again.
On 1/11/18 2:36 AM, Lorenzo Colitti wrote:
> Commit 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu
> versions of route lookup") broke "ip route get" in the presence
> of rules that specify iif lo.
>
> Host-originated traffic always has iif lo, because
> ip_route_output_key_hash and ip6_route_output_flags set the flow
> iif to LOOPBACK_IFINDEX. Thus, putting "iif lo" in an ip rule is a
> convenient way to select only originated traffic and not forwarded
> traffic.
>
> inet_rtm_getroute used to match these rules correctly because
> even though it sets the flow iif to 0, it called
> ip_route_output_key which overwrites iif with LOOPBACK_IFINDEX.
> But now that it calls ip_route_output_key_hash_rcu, the ifindex
> will remain 0 and not match the iif lo in the rule. As a result,
> "ip route get" will return ENETUNREACH.
>
> Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
> Tested: https://android.googlesource.com/kernel/tests/+/master/net/test/multinetwork_test.py passes again
> Signed-off-by: Lorenzo Colitti
> ---
>  net/ipv4/route.c | 1 +
>  1 file changed, 1 insertion(+)
>

Missed that. Thanks for fixing.

Acked-by: David Ahern
Re: [patch net-next v7 03/13] net: sched: avoid usage of tp->q in tcf_classify
On 1/11/18 2:40 AM, Jiri Pirko wrote:
> Wed, Jan 10, 2018 at 05:17:28PM CET, dsah...@gmail.com wrote:
>> On 1/9/18 7:07 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko
>>>
>>> Use block index in the messages instead.
>>>
>>> Signed-off-by: Jiri Pirko
>>> ---
>>>  net/sched/cls_api.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>>> index 9b45950..31e91dc 100644
>>> --- a/net/sched/cls_api.c
>>> +++ b/net/sched/cls_api.c
>>> @@ -672,8 +672,9 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>>>  #ifdef CONFIG_NET_CLS_ACT
>>>  reset:
>>>  	if (unlikely(limit++ >= max_reclassify_loop)) {
>>> -		net_notice_ratelimited("%s: reclassify loop, rule prio %u, protocol %02x\n",
>>> -				       tp->q->ops->id, tp->prio & 0x,
>>> +		net_notice_ratelimited("%u: reclassify loop, rule prio %u, protocol %02x\n",
>>
>> if you are dumping index instead of prio shouldn't the 'rule prio' above
>> be adjusted?
>
> I'm not! Why do you think so?
>
> "%u:" is tp->chain->block->index
> "prio %u" is tp->prio & 0x
> "%02x" is ntohs(tp->protocol)
>

Never mind. scanned that too quickly.
[PATCH] netfilter: nf_tables: fix odd_ptr_err.cocci warnings
tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head: b4464bcab38d3f7fe995a7cb960eeac6889bec08
commit: 3b49e2e94e6ebb8b23d0955d9e898254455734f8 [8286/9035] netfilter: nf_tables: add flow table netlink frontend

The following is a 0-day report generated by Coccinelle. But from the line before, it looks like the fix is backwards, and the test should be on flowtable.

julia

 nf_tables_api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5419,7 +5419,7 @@ static int nf_tables_getflowtable(struct
 	flowtable = nf_tables_flowtable_lookup(table, nla[NFTA_FLOWTABLE_NAME],
 					       genmask);
 	if (IS_ERR(table))
-		return PTR_ERR(flowtable);
+		return PTR_ERR(table);
 
 	skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
 	if (!skb2)
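
Why the pointer tested with IS_ERR() has to be the same one passed to PTR_ERR() can be shown with a small user-space imitation of the kernel helpers. The three macros-turned-functions below follow the idea of include/linux/err.h (error codes encoded in the top 4095 pointer values), but this is a sketch, not the kernel header; lookup_table(), lookup_flowtable(), get_mismatched() and get_matched() are hypothetical names for the illustration.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_table;

/* The table lookup succeeds, the flowtable lookup fails. */
void *lookup_table(void) { return &dummy_table; }
void *lookup_flowtable(void) { return ERR_PTR(-ENOENT); }

/* Tests one pointer but would report the error of the other. */
long get_mismatched(void)
{
	void *table = lookup_table();
	void *flowtable = lookup_flowtable();

	if (IS_ERR(table))
		return PTR_ERR(flowtable);
	return 0; /* the failed flowtable lookup goes unnoticed */
}

/* Tests and reports the same pointer. */
long get_matched(void)
{
	void *flowtable = lookup_flowtable();

	if (IS_ERR(flowtable))
		return PTR_ERR(flowtable);
	return 0;
}
```

Here get_mismatched() returns 0 even though the flowtable lookup failed, while get_matched() returns -ENOENT, which is the behaviour the remark about testing flowtable asks for.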
Re: [PATCH] netfilter: nf_tables: fix odd_ptr_err.cocci warnings
On Thu, Jan 11, 2018 at 03:02:12PM +0100, Julia Lawall wrote:
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> head: b4464bcab38d3f7fe995a7cb960eeac6889bec08
> commit: 3b49e2e94e6ebb8b23d0955d9e898254455734f8 [8286/9035] netfilter: nf_tables: add flow table netlink frontend
>
> The following is a 0-day report generated by Coccinelle. But from the
> line before, it looks like the fix is backwards, and the test should be on
> flowtable.

There's a fix for this in nf-next.git

https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git/commit/?id=03a0120f75dfb1807c0441376e26b36160087de4

Will pass it up to David asap.
Re: [patch net-next v7 07/13] net: sched: use block index as a handle instead of qdisc when block is shared
Thu, Jan 11, 2018 at 02:25:36PM CET, j...@mojatatu.com wrote:
>On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko
>>
>> As the tcm_ifindex 0 is invalid ifindex, reuse it to indicate that we
>> work with block, instead of qdisc. So if tcm_ifindex is 0, tcm_parent is
>> used to carry block_index.
>>
>
>Commit log still refers to ifindex of 0 instead of TCM_IFINDEX_MAGIC_BLOCK

Missed this. Will update, thanks!
Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get
Thu, Jan 11, 2018 at 02:27:11PM CET, j...@mojatatu.com wrote:
>On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko
>>
>> Add simple block get operation which primary purpose is to check the
>> block existence by block index.
>>
>
>block_dump missing?

It is not needed for anything now. You see all the blocks when you list
qdiscs. Yet, dump could be easily added if needed in the future.
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote:
>On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko
>>
>> Benefit from the previously introduced shared filter blocks
>> infrastructure and allow ingress and clsact qdisc instances to share
>> filter blocks. The block index is coming from userspace as qdisc option.
>
>Didn't quite follow why ingress is special and needs attributes to
>set the block but other qdiscs didn't.

Jamal, again, other qdiscs do not support block sharing. This patchset
only adds support for block sharing for ingress and clsact qdiscs.
Later on, other qdiscs could also support block sharing.

>Will check again later after some coffee..
>
>cheers,
>jamal
Re: [patch net-next v7 00/13] net: sched: allow qdiscs to share filter block instances
Thu, Jan 11, 2018 at 02:19:16PM CET, j...@mojatatu.com wrote: >On 18-01-09 09:07 AM, Jiri Pirko wrote: >> From: Jiri Pirko >> >> Currently the filters added to qdiscs are independent. So for example if you >> have 2 netdevices and you create ingress qdisc on both and you want to add >> identical filter rules both, you need to add them twice. This patchset >> makes this easier and mainly saves resources allowing to share all filters >> within a qdisc - I call it a "filter block". Also this helps to save >> resources when we do offload to hw for example to expensive TCAM. >> >> So back to the example. First, we create 2 qdiscs. Both will share >> block number 22. "22" is just an identification: >> $ tc qdisc add dev ens7 ingress block 22 >> >> $ tc qdisc add dev ens8 ingress block 22 >> >> >> If we don't specify "block" command line option, no shared block would >> be created: >> $ tc qdisc add dev ens9 ingress >> >> Now if we list the qdiscs, we will see the block index in the output: >> >> $ tc qdisc >> qdisc ingress : dev ens7 parent :fff1 block 22 >> qdisc ingress : dev ens8 parent :fff1 block 22 >> qdisc ingress : dev ens9 parent :fff1 >> >> >> To make is more visual, the situation looks like this: >> >> ens7 ingress qdisc ens7 ingress qdisc >>| | >>| | >>+--> block 22 <--+ >> >> Unlimited number of qdiscs may share the same block. >> >> Now we can add filter using the block index: >> >> $ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 >> action drop >> >> >> Note we cannot use the qdisc for filter manipulations of shared blocks: >> >> $ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip >> 192.168.100.2 action drop >> Error: This filter block is shared. Please use the block index to manipulate >> the filters. 
>> >> >> We will see the same output if we list filters for ingress qdisc of >> ens7 and ens8, also for the block 22: >> >> $ tc filter show block 22 >> filter block 22 protocol ip pref 25 flower chain 0 >> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1 >> ... >> >> $ tc filter show dev ens7 ingress >> filter block 22 protocol ip pref 25 flower chain 0 >> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1 >> ... >> >> $ tc filter show dev ens8 ingress >> filter block 22 protocol ip pref 25 flower chain 0 >> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1 >> ... >> > >Somewhere here mention the egress issue we talked about, something >like: I don't understand why to mention something that is not supported and future thinking and work needs to be done in order to support it. Let's leave that text for a cover letter of that patchset, could we? > >At the moment on ingress and clsact_xxx are well supported by the >block infrastructure. For this to work well with egress qdisc, >all the ports/qdiscs sharing the block will have to be symmetric. >e.g. if ens8 and ens9 root qdiscs shared a block at their (egress) >root qdiscs, then those qdiscs would both need to have the same >handle id. An example of a symettric shared block setup would like like: > >tc qdisc add dev ens8 root block 22 handle 1:0 prio >tc qdisc add dev ens9 root block 22 handle 1:0 prio > > >I am confident the above would work. You said you are thinking of >getting this to always work (I cant think of a simple way to do it), >but for the moment the above is fine. >Most people who want this would probably use clsact egress and not >care about queues (so it may never be "fixed") > >cheers, >jamal
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
On 18-01-11 09:24 AM, Jiri Pirko wrote: Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote: On 18-01-09 09:07 AM, Jiri Pirko wrote: From: Jiri Pirko Benefit from the previously introduced shared filter blocks infrastructure and allow ingress and clsact qdisc instances to share filter blocks. The block index is coming from userspace as qdisc option. Didnt quiet follow why ingress is special and needs attributes to set the block but other qdiscs didnt. Jamal, again, other qdiscs does not support block sharing. This patchset only adds support for sharing of block for ingress and clsact qdiscs. Later on, other qdiscs could also support block sharing. Can you stop a config which says: tc qdisc add dev ens9 root block 22 handle 1:0 prio ? cheers, jamal
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote: >On 18-01-11 09:24 AM, Jiri Pirko wrote: >> Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote: >> > On 18-01-09 09:07 AM, Jiri Pirko wrote: >> > > From: Jiri Pirko >> > > >> > > Benefit from the previously introduced shared filter blocks >> > > infrastructure and allow ingress and clsact qdisc instances to share >> > > filter blocks. The block index is coming from userspace as qdisc option. >> > >> > Didnt quiet follow why ingress is special and needs attributes to >> > set the block but other qdiscs didnt. >> >> Jamal, again, other qdiscs does not support block sharing. This patchset >> only adds support for sharing of block for ingress and clsact qdiscs. >> Later on, other qdiscs could also support block sharing. >> > >Can you stop a config which says: >tc qdisc add dev ens9 root block 22 handle 1:0 prio ? Please see the iproute2 patches. Parsing of "block" command line option is done inside q_ingress.c
Re: [PATCH bpf-next v4 5/5] error-injection: Support fault injection framework
2018-01-11 9:51 GMT+09:00 Masami Hiramatsu : > Support in-kernel fault-injection framework via debugfs. > This allows you to inject a conditional error to specified > function using debugfs interfaces. > > Here is the result of test script described in > Documentation/fault-injection/fault-injection.txt > > === > # ./test_fail_function.sh > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s > btrfs-progs v4.4 > See http://btrfs.wiki.kernel.org for more information. > > Label: (null) > UUID: bfa96010-12e9-4360-aed0-42eec7af5798 > Node size: 16384 > Sector size:4096 > Filesystem size:1001.00MiB > Block group profiles: > Data: single8.00MiB > Metadata: DUP 58.00MiB > System: DUP 12.00MiB > SSD detected: no > Incompat features: extref, skinny-metadata > Number of devices: 1 > Devices: > IDSIZE PATH > 1 1001.00MiB /dev/loop2 > > mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory > SUCCESS! > === > > > Signed-off-by: Masami Hiramatsu > Reviewed-by: Josef Bacik > --- > Changes in v3: >- Check and adjust error value for each target function >- Clear kporbe flag for reuse >- Add more documents and example > --- > Documentation/fault-injection/fault-injection.txt | 62 ++ > kernel/Makefile |1 > kernel/fail_function.c| 217 > + > lib/Kconfig.debug | 10 + > 4 files changed, 290 insertions(+) > create mode 100644 kernel/fail_function.c > > diff --git a/Documentation/fault-injection/fault-injection.txt > b/Documentation/fault-injection/fault-injection.txt > index 918972babcd8..4aecbceef9d2 100644 > --- a/Documentation/fault-injection/fault-injection.txt > +++ b/Documentation/fault-injection/fault-injection.txt > @@ -30,6 +30,12 @@ o fail_mmc_request >injects MMC data errors on devices permitted by setting >debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request > > +o fail_function > + > + injects error return on specific functions, which are marked by > + ALLOW_ERROR_INJECTION() macro, by setting debugfs 
entries > + under /sys/kernel/debug/fail_function. No boot option supported. > + > Configure fault-injection capabilities behavior > --- > > @@ -123,6 +129,24 @@ configuration of fault-injection capabilities. > default is 'N', setting it to 'Y' will disable failure injections > when dealing with private (address space) futexes. > > +- /sys/kernel/debug/fail_function/inject: > + > + specifies the target function of error injection by name. > + > +- /sys/kernel/debug/fail_function/retval: > + > + specifies the "error" return value to inject to the given > + function. > + Is it possible to inject errors into multiple functions at the same time? If so, it will be more useful to support it in the fault injection, too. Because some kind of bugs are caused by the combination of errors. (e.g. another error in an error path) I suggest the following interface. - /sys/kernel/debug/fail_function/inject: specifies the target function of error injection by name. /sys/kernel/debug/fail_function// directory will be created. - /sys/kernel/debug/fail_function/uninject: specifies the target function of error injection by name that is currently being injected. /sys/kernel/debug/fail_function// directory will be removed. - /sys/kernel/debug/fail_function//retval: specifies the "error" return value to inject to the given function.
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
On 18-01-11 09:41 AM, Jiri Pirko wrote: Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote: On 18-01-11 09:24 AM, Jiri Pirko wrote: Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote: On 18-01-09 09:07 AM, Jiri Pirko wrote: From: Jiri Pirko Benefit from the previously introduced shared filter blocks infrastructure and allow ingress and clsact qdisc instances to share filter blocks. The block index is coming from userspace as qdisc option. Didnt quiet follow why ingress is special and needs attributes to set the block but other qdiscs didnt. Jamal, again, other qdiscs does not support block sharing. This patchset only adds support for sharing of block for ingress and clsact qdiscs. Later on, other qdiscs could also support block sharing. Can you stop a config which says: tc qdisc add dev ens9 root block 22 handle 1:0 prio ? Please see the iproute2 patches. Parsing of "block" command line option is done inside q_ingress.c I only looked at the kernel code. Good you can stop it at tc but the API does not stop it (unless you expect the rest of the world to only use tc). Really - there is no reason for this API to be only via ingress qdisc attributes. You can add a check in cls api to reject any parent that is not either of the clsacts + ingress (depending on tc doesnt sound right). cheers, jamal
[PATCH] usbnet: silence an unnecessary warning
That a kevent could not be scheduled is not an error.
Such handlers must be able to deal with multiple events anyway.
As the successful scheduling of a work is a debug event, make
the failure debug priority, too.

Signed-off-by: Oliver Neukum
Reported-by: Cristian Caravena
---
 drivers/net/usb/usbnet.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index d56fe32bf48d..1e0bbe23f95c 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -458,8 +458,7 @@ void usbnet_defer_kevent (struct usbnet *dev, int work)
 {
 	set_bit (work, &dev->flags);
 	if (!schedule_work (&dev->kevent)) {
-		if (net_ratelimit())
-			netdev_err(dev->net, "kevent %d may have been dropped\n", work);
+		netdev_dbg(dev->net, "kevent %d may have been dropped\n", work);
 	} else {
 		netdev_dbg(dev->net, "kevent %d scheduled\n", work);
 	}
-- 
2.13.6
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
Thu, Jan 11, 2018 at 03:46:09PM CET, j...@mojatatu.com wrote: >On 18-01-11 09:41 AM, Jiri Pirko wrote: >> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote: >> > On 18-01-11 09:24 AM, Jiri Pirko wrote: >> > > Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote: >> > > > On 18-01-09 09:07 AM, Jiri Pirko wrote: >> > > > > From: Jiri Pirko >> > > > > >> > > > > Benefit from the previously introduced shared filter blocks >> > > > > infrastructure and allow ingress and clsact qdisc instances to share >> > > > > filter blocks. The block index is coming from userspace as qdisc >> > > > > option. >> > > > >> > > > Didnt quiet follow why ingress is special and needs attributes to >> > > > set the block but other qdiscs didnt. >> > > >> > > Jamal, again, other qdiscs does not support block sharing. This patchset >> > > only adds support for sharing of block for ingress and clsact qdiscs. >> > > Later on, other qdiscs could also support block sharing. >> > > >> > >> > Can you stop a config which says: >> > tc qdisc add dev ens9 root block 22 handle 1:0 prio ? >> >> Please see the iproute2 patches. Parsing of "block" command line option >> is done inside q_ingress.c >> > >I only looked at the kernel code. Good you can stop it at tc >but the API does not stop it (unless you expect the rest of the >world to only use tc). Jamal, apparently, you did not looked at the kernel code either :) Look at the changes done in net/sched/sch_ingress.c - there is where the parsing of block attr takes place. >Really - there is no reason for this API to be only via ingress qdisc >attributes. You can add a check in cls api to reject any parent that is >not either of the clsacts + ingress (depending on tc doesnt sound >right). I was thinking to take this direction originally. To have another generic attr called TCA_BLOCK or something that would be used when qdisc is created. For ingress, what would work. 
But for clsact, you need to be able to specify 2 block during qdisc creation - one for ingress, one for egress. That's when I realized this has to be per-qdisc-type attr.
Re: [patch iproute2 v8 1/2] lib/libnetlink: Add functions rtnl_talk_msg and rtnl_talk_iov
On Wed, Jan 10, 2018 at 09:12:45PM +0100, Phil Sutter wrote: > On Wed, Jan 10, 2018 at 12:20:36PM -0700, David Ahern wrote: > [...] > > 2. I am using a batch file with drop filters: > > > > filter add dev eth2 ingress protocol ip pref 273 flower dst_ip > > 192.168.253.0/16 action drop > > > > and for each command tc is trying to dlopen m_drop.so: > > > > open("/usr/lib/tc//m_drop.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such > > file or directory) > > [...] > > > Can you look at a follow on patch (not part of this set) to cache status > > of dlopen attempts? > > IMHO the logic used in get_action_kind() for gact is the culprit here: > After trying to dlopen m_drop.so, it dlopens m_gact.so although it is > present already. (Unless I missed something.) Not quite, m_gact.c is statically compiled in and there is logic around dlopen(NULL, ...) to prevent calling it twice. > I guess the better (and easier) fix would be to create some more struct > action_util instances in m_gact.c for the primitives it supports so that > the lookup in action_list succeeds for consecutive uses. Note that > parse_gact() even supports this already. Sadly, this doesn't fly: If a lookup for action 'drop' is successful, that value is set as TCA_ACT_KIND and the kernel doesn't know about it. I came up with an alternative solution, what do you think about attached patch? 
Thanks, Phil diff --git a/tc/m_action.c b/tc/m_action.c index fc4223648e8cf..d3df93c066a89 100644 --- a/tc/m_action.c +++ b/tc/m_action.c @@ -194,7 +194,10 @@ int parse_action(int *argc_p, char ***argv_p, int tca_id, struct nlmsghdr *n) } else { struct action_util *a = NULL; - strncpy(k, *argv, sizeof(k) - 1); + if (!action_a2n(*argv, NULL, false)) + strncpy(k, "gact", sizeof(k) - 1); + else + strncpy(k, *argv, sizeof(k) - 1); eap = 0; if (argc > 0) { a = get_action_kind(k); diff --git a/tc/tc_util.c b/tc/tc_util.c index ee9a70aa6830c..10e5aa91168a1 100644 --- a/tc/tc_util.c +++ b/tc/tc_util.c @@ -511,7 +511,7 @@ static const char *action_n2a(int action) * * In error case, returns -1 and does not touch @result. Otherwise returns 0. */ -static int action_a2n(char *arg, int *result, bool allow_num) +int action_a2n(char *arg, int *result, bool allow_num) { int n; char dummy; @@ -535,13 +535,15 @@ static int action_a2n(char *arg, int *result, bool allow_num) for (iter = a2n; iter->a; iter++) { if (matches(arg, iter->a) != 0) continue; - *result = iter->n; - return 0; + n = iter->n; + goto out_ok; } if (!allow_num || sscanf(arg, "%d%c", &n, &dummy) != 1) return -1; - *result = n; +out_ok: + if (result) + *result = n; return 0; } diff --git a/tc/tc_util.h b/tc/tc_util.h index 1218610d77092..e354765ff1ed0 100644 --- a/tc/tc_util.h +++ b/tc/tc_util.h @@ -132,4 +132,6 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt); int cls_names_init(char *path); void cls_names_uninit(void); +int action_a2n(char *arg, int *result, bool allow_num); + #endif
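The tc_util.c change above turns action_a2n() into a pure "is this a known action name?" test when the caller passes a NULL result pointer. A minimal sketch of that optional out-parameter pattern follows. It is not the iproute2 code: the names sketch_a2n and sketch_names, the table values, and the use of exact strcmp() matching (iproute2 uses its own matches() helper) are assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name table; the numeric values are illustrative only. */
static const struct {
	const char *name;
	int value;
} sketch_names[] = {
	{ "pass", 0 },
	{ "drop", 2 },
	{ "pipe", 3 },
};

/*
 * Returns 0 if arg names a known action (or is a plain number and
 * allow_num is set), -1 otherwise. On error *result is left untouched.
 * result may be NULL when the caller only wants the yes/no answer.
 */
static int sketch_a2n(const char *arg, int *result, bool allow_num)
{
	int n;
	char dummy;
	size_t i;

	for (i = 0; i < sizeof(sketch_names) / sizeof(sketch_names[0]); i++) {
		if (strcmp(arg, sketch_names[i].name) != 0)
			continue;
		n = sketch_names[i].value;
		goto out_ok;
	}
	/* Accept a bare integer only; trailing characters make it invalid. */
	if (!allow_num || sscanf(arg, "%d%c", &n, &dummy) != 1)
		return -1;
out_ok:
	if (result)
		*result = n;
	return 0;
}
```

With this shape a caller can write sketch_a2n(name, NULL, false) as a cheap membership test, which is how the m_action.c hunk decides to substitute "gact" for the action kind.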
Re: [PATCH] usbnet: silence an unnecessary warning
Oliver Neukum writes:

> That a kevent could not be scheduled is not an error.
> Such handlers must be able to deal with multiple events anyway.
> As the successful scheduling of a work is a debug event, make
> the failure debug priority, too.
>
> Signed-off-by: Oliver Neukum
> Reported-by: Cristian Caravena
> ---
>  drivers/net/usb/usbnet.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index d56fe32bf48d..1e0bbe23f95c 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -458,8 +458,7 @@ void usbnet_defer_kevent (struct usbnet *dev, int work)
>  {
>  	set_bit (work, &dev->flags);
>  	if (!schedule_work (&dev->kevent)) {
> -		if (net_ratelimit())
> -			netdev_err(dev->net, "kevent %d may have been dropped\n", work);
> +		netdev_dbg(dev->net, "kevent %d may have been dropped\n", work);
>  	} else {
>  		netdev_dbg(dev->net, "kevent %d scheduled\n", work);
>  	}

Great! But why do you drop the ratelimit? This can be very noisy when
it hits. I'd like to keep it ratelimited.

But if you do decide to drop the limit, then you'll have to clean up the
braces...

Bjørn
Re: [PATCH 2/2] xen-netfront: Fix race between device setup and open
From: Ross Lagerwall Date: Thu, 11 Jan 2018 09:36:38 + > When a netfront device is set up it registers a netdev fairly early on, > before it has set up the queues and is actually usable. A userspace tool > like NetworkManager will immediately try to open it and access its state > as soon as it appears. The bug can be reproduced by hotplugging VIFs > until the VM runs out of grant refs. It registers the netdev but fails > to set up any queues (since there are no more grant refs). In the > meantime, NetworkManager opens the device and the kernel crashes trying > to access the queues (of which there are none). > > Fix this in two ways: > * For initial setup, register the netdev much later, after the queues > are setup. This avoids the race entirely. > * During a suspend/resume cycle, the frontend reconnects to the backend > and the queues are recreated. It is possible (though highly unlikely) to > race with something opening the device and accessing the queues after > they have been destroyed but before they have been recreated. Extend the > region covered by the rtnl semaphore to protect against this race. There > is a possibility that we fail to recreate the queues so check for this > in the open function. > > Signed-off-by: Ross Lagerwall Where is patch 1/2 and the 0/2 header posting which explains what this patch series is doing, how it is doing it, and why it is doing it that way? Thanks.
Re: [PATCH 30/32] aio: add delayed cancel support
Christoph Hellwig writes: > On Wed, Jan 10, 2018 at 06:26:39PM -0500, Jeff Moyer wrote: >> >> The upcoming aio poll support would like to be able to complete the >> >> iocb inline from the cancellation context, but that would cause >> >> a lock order reversal. Add support for optionally moving the cancelation >> >> outside the context lock to avoid this reversal. >> >> >> >> Signed-off-by: Christoph Hellwig >> > >> > Acked-by: Jeff Moyer >> >> Actually, let's move these two defines: >> >> #define AIO_IOCB_DELAYED_CANCEL (1 << 0) >> #define AIO_IOCB_CANCELLED (1 << 1) >> >> to include/linux/aio.h so that drivers outside of fs/aio.c can make use >> of them. > > struct aio_kiocb is private to aio.c, so just exposing them won't > do anything useful. If we really need these elsewhere we'll need > to come up with a proper interface. Duh, good point. My main concern is that things like usb gadget will have to deal with races between cancellation and completion on their own. It would be nice if we had infrastructure for them to use. I'll have a look through that code to see if there's something we could or should be doing. Cheers, Jeff
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
On Thu, Jan 11, 2018 at 7:07 AM, Jiri Pirko wrote: > Thu, Jan 11, 2018 at 03:46:09PM CET, j...@mojatatu.com wrote: >>On 18-01-11 09:41 AM, Jiri Pirko wrote: >>> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote: >>> > On 18-01-11 09:24 AM, Jiri Pirko wrote: >>> > > Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote: >>> > > > On 18-01-09 09:07 AM, Jiri Pirko wrote: >>> > > > > From: Jiri Pirko >>> > > > > >>> > > > > Benefit from the previously introduced shared filter blocks >>> > > > > infrastructure and allow ingress and clsact qdisc instances to share >>> > > > > filter blocks. The block index is coming from userspace as qdisc >>> > > > > option. >>> > > > >>> > > > Didnt quiet follow why ingress is special and needs attributes to >>> > > > set the block but other qdiscs didnt. >>> > > >>> > > Jamal, again, other qdiscs does not support block sharing. This patchset >>> > > only adds support for sharing of block for ingress and clsact qdiscs. >>> > > Later on, other qdiscs could also support block sharing. >>> > > >>> > >>> > Can you stop a config which says: >>> > tc qdisc add dev ens9 root block 22 handle 1:0 prio ? >>> >>> Please see the iproute2 patches. Parsing of "block" command line option >>> is done inside q_ingress.c >>> >> >>I only looked at the kernel code. Good you can stop it at tc >>but the API does not stop it (unless you expect the rest of the >>world to only use tc). > > Jamal, apparently, you did not looked at the kernel code either :) > Look at the changes done in net/sched/sch_ingress.c - there is where the > parsing of block attr takes place. > > >>Really - there is no reason for this API to be only via ingress qdisc >>attributes. You can add a check in cls api to reject any parent that is >>not either of the clsacts + ingress (depending on tc doesnt sound >>right). > > I was thinking to take this direction originally. 
To have another > generic attr called TCA_BLOCK or something that would be used when qdisc > is created. For ingress, what would work. But for clsact, you need to be > able to specify 2 block during qdisc creation - one for ingress, one for > egress. That's when I realized this has to be per-qdisc-type attr. yeah, see the problem...but.., would it help if we just introduce two generic attrs TCA_BLOCK_INGRESS and TCA_BLOCK_EGRESS instead of having to duplicate these attrs at every qdisc ?. and add proper validation depending on qdisc type..
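The suggestion above (two generic attributes plus validation by qdisc type) could look roughly like the following. This is a sketch of the idea only, not code from the patchset: the enum, the function name and the choice of error codes are all hypothetical, and TCA_BLOCK_INGRESS / TCA_BLOCK_EGRESS are only the names proposed in this mail.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical qdisc kinds for this sketch; not kernel identifiers. */
enum sketch_qdisc_kind {
	SKETCH_QDISC_INGRESS,
	SKETCH_QDISC_CLSACT,
	SKETCH_QDISC_OTHER,
};

/*
 * Decide whether the block attributes supplied at qdisc creation are
 * acceptable for the given qdisc kind. Returns 0 or a negative errno.
 * has_ingress_block / has_egress_block stand for "the proposed
 * TCA_BLOCK_INGRESS / TCA_BLOCK_EGRESS attribute was present".
 */
static int sketch_validate_block_attrs(enum sketch_qdisc_kind kind,
				       bool has_ingress_block,
				       bool has_egress_block)
{
	switch (kind) {
	case SKETCH_QDISC_INGRESS:
		/* ingress has a single direction: an egress block is invalid */
		return has_egress_block ? -EINVAL : 0;
	case SKETCH_QDISC_CLSACT:
		/* clsact may take one block per direction, or none at all */
		return 0;
	default:
		/* other qdiscs do not support block sharing (yet) */
		return (has_ingress_block || has_egress_block) ? -EOPNOTSUPP : 0;
	}
}
```

Under this scheme a request such as "root block 22 handle 1:0 prio" would be rejected in one generic place rather than relying on each qdisc, or on the tc binary, to refuse it.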
Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks
On 18-01-11 10:07 AM, Jiri Pirko wrote:
> Thu, Jan 11, 2018 at 03:46:09PM CET, j...@mojatatu.com wrote:
>> On 18-01-11 09:41 AM, Jiri Pirko wrote:
>>> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote:

>> I only looked at the kernel code. Good you can stop it at tc
>> but the API does not stop it (unless you expect the rest of the
>> world to only use tc).
>
> Jamal, apparently, you did not looked at the kernel code either :)
> Look at the changes done in net/sched/sch_ingress.c - there is where the
> parsing of block attr takes place.

reason i raised it is from looking at tc_ctl_tfilter().
If i specify ifindex != TCM_IFINDEX_MAGIC_BLOCK, parent = 0X and block = 22
that should work, no? i.e regardless of whether parent is INGRESS etc.
And so i was confused why you had attributes in sch_ingress.c

>> Really - there is no reason for this API to be only via ingress qdisc
>> attributes. You can add a check in cls api to reject any parent that is
>> not either of the clsacts + ingress (depending on tc doesnt sound
>> right).
>
> I was thinking to take this direction originally. To have another
> generic attr called TCA_BLOCK or something that would be used when qdisc
> is created. For ingress, what would work. But for clsact, you need to be
> able to specify 2 block during qdisc creation - one for ingress, one for
> egress. That's when I realized this has to be per-qdisc-type attr.

ok for clsact - i can see that we dont have enough fields in the tcm
message. TCA_BLOCK sounds appealing - could be a special tlv with many
block ids maybe?
I really would like to use this for egress as well - and what i
described earlier should work for me.

cheers,
jamal
Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout
From: Geert Uytterhoeven
Date: Tue, 9 Jan 2018 12:11:21 +0100

> In case of success, the return values of (__)phy_write() and
> (__)phy_modify() are not compatible: (__)phy_write() returns 0, while
> (__)phy_modify() returns the old PHY register value.
>
> Apparently this change was catered for in drivers/net/phy/marvell.c, but
> not in other source files.
>
> Hence genphy_restart_aneg() now returns 4416 instead zero, which is
> considered an error:
>
>     ravb e680.ethernet eth0: failed to connect PHY
>     IP-Config: Failed to open eth0
>     IP-Config: No network devices available
>
> Fix this by converting positive values to zero in all callers of
> phy_modify().
>
> Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to phy_modify()")
> Signed-off-by: Geert Uytterhoeven
> ---
> Alternatively, __phy_modify() could be changed to follow __phy_write()
> semantics?

I really want a resolution to this quickly, this broke lots of stuff
for people.

__phy_modify() wants to return multiple values, so it should be coded
up to do so explicitly rather than trying to encode two values from
overlapping value spaces in one return value.

That means the original value should be returned by-reference. And
this will make the error/no-error return value unambiguous.

int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set,
		 u16 *orig_val);

Thank you.
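To make the by-reference proposal concrete, here is a small userspace sketch of a read-modify-write helper whose return value carries only success or a negative errno, while the old register value comes back through an optional pointer. Everything in it (struct fake_phy, the fake_phy_* functions, the 32-register array) is a hypothetical stand-in and not the kernel's phylib code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_PHY_NUM_REGS 32

/* Hypothetical stand-in for a PHY: just an array of 16-bit registers. */
struct fake_phy {
	uint16_t regs[FAKE_PHY_NUM_REGS];
};

static int fake_phy_read(const struct fake_phy *phy, uint32_t regnum,
			 uint16_t *val)
{
	if (regnum >= FAKE_PHY_NUM_REGS)
		return -EINVAL;
	*val = phy->regs[regnum];
	return 0;
}

static int fake_phy_write(struct fake_phy *phy, uint32_t regnum, uint16_t val)
{
	if (regnum >= FAKE_PHY_NUM_REGS)
		return -EINVAL;
	phy->regs[regnum] = val;
	return 0;
}

/*
 * Clear the bits in mask, then set the bits in set.
 * Returns 0 on success or a negative errno; never a register value.
 * If orig_val is non-NULL it receives the value read before the update.
 */
static int fake_phy_modify(struct fake_phy *phy, uint32_t regnum,
			   uint16_t mask, uint16_t set, uint16_t *orig_val)
{
	uint16_t old;
	int ret;

	ret = fake_phy_read(phy, regnum, &old);
	if (ret < 0)
		return ret;

	ret = fake_phy_write(phy, regnum, (uint16_t)((old & ~mask) | set));
	if (ret < 0)
		return ret;

	if (orig_val)
		*orig_val = old;
	return 0;
}
```

With this shape, a caller that treats any non-zero return as failure no longer trips over an old register value such as 4416 (0x1140), because that value can only arrive through orig_val.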
Re: [PATCH 0/2] Fix a couple of crashes in netfront
+CC netdev

On 01/11/2018 09:36 AM, Ross Lagerwall wrote:
> Here are a couple of patches to fix two crashes in netfront.
>
> Ross Lagerwall (2):
>   xen/grant-table: Use put_page instead of free_page
>   xen-netfront: Fix race between device setup and open
>
>  drivers/net/xen-netfront.c | 46 --
>  drivers/xen/grant-table.c  |  4 ++--
>  2 files changed, 26 insertions(+), 24 deletions(-)
Re: [PATCH 2/2] xen-netfront: Fix race between device setup and open
On 01/11/2018 03:26 PM, David Miller wrote:
> From: Ross Lagerwall
> Date: Thu, 11 Jan 2018 09:36:38 +
>
>> When a netfront device is set up it registers a netdev fairly early on,
>> before it has set up the queues and is actually usable. A userspace tool
>> like NetworkManager will immediately try to open it and access its state
>> as soon as it appears. The bug can be reproduced by hotplugging VIFs
>> until the VM runs out of grant refs. It registers the netdev but fails
>> to set up any queues (since there are no more grant refs). In the
>> meantime, NetworkManager opens the device and the kernel crashes trying
>> to access the queues (of which there are none).
>>
>> Fix this in two ways:
>> * For initial setup, register the netdev much later, after the queues
>> are setup. This avoids the race entirely.
>> * During a suspend/resume cycle, the frontend reconnects to the backend
>> and the queues are recreated. It is possible (though highly unlikely) to
>> race with something opening the device and accessing the queues after
>> they have been destroyed but before they have been recreated. Extend the
>> region covered by the rtnl semaphore to protect against this race. There
>> is a possibility that we fail to recreate the queues so check for this
>> in the open function.
>>
>> Signed-off-by: Ross Lagerwall
>
> Where is patch 1/2 and the 0/2 header posting which explains what this
> patch series is doing, how it is doing it, and why it is doing it that
> way?

I've now CC'd netdev on the other two.

Cheers,
-- 
Ross Lagerwall
Re: [PATCH 1/2] xen/grant-table: Use put_page instead of free_page
+CC netdev

On 01/11/2018 09:36 AM, Ross Lagerwall wrote:
> The page given to gnttab_end_foreign_access() to free could be a
> compound page so use put_page() instead of free_page() since it can
> handle both compound and single pages correctly.
>
> This bug was discovered when migrating a Xen VM with several VIFs and
> CONFIG_DEBUG_VM enabled. It hits a BUG usually after fewer than 10
> iterations. All netfront devices disconnect from the backend during a
> suspend/resume and this will call gnttab_end_foreign_access() if a
> netfront queue has an outstanding skb. The mismatch between calling
> get_page() and free_page() on a compound page causes a reference
> counting error which is detected when DEBUG_VM is enabled.
>
> Signed-off-by: Ross Lagerwall
> ---
>  drivers/xen/grant-table.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index f45114f..27be107 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -382,7 +382,7 @@ static void gnttab_handle_deferred(struct timer_list *unused)
>  		if (entry->page) {
>  			pr_debug("freeing g.e. %#x (pfn %#lx)\n",
>  				 entry->ref, page_to_pfn(entry->page));
> -			__free_page(entry->page);
> +			put_page(entry->page);
>  		} else
>  			pr_info("freeing g.e. %#x\n", entry->ref);
>  		kfree(entry);
> @@ -438,7 +438,7 @@ void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
>  	if (gnttab_end_foreign_access_ref(ref, readonly)) {
>  		put_free_entry(ref);
>  		if (page != 0)
> -			free_page(page);
> +			put_page(virt_to_page(page));
>  	} else
>  		gnttab_add_deferred(ref, readonly,
>  				    page ? virt_to_page(page) : NULL);
Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout
On Thu, Jan 11, 2018 at 4:53 PM, Russell King - ARM Linux wrote: > On Thu, Jan 11, 2018 at 10:48:35AM -0500, David Miller wrote: >> From: Geert Uytterhoeven >> Date: Tue, 9 Jan 2018 12:11:21 +0100 >> >> > In case of success, the return values of (__)phy_write() and >> > (__)phy_modify() are not compatible: (__)phy_write() returns 0, while >> > (__)phy_modify() returns the old PHY register value. >> > >> > Apparently this change was catered for in drivers/net/phy/marvell.c, but >> > not in other source files. >> > >> > Hence genphy_restart_aneg() now returns 4416 instead zero, which is >> > considered an error: >> > >> > ravb e680.ethernet eth0: failed to connect PHY >> > IP-Config: Failed to open eth0 >> > IP-Config: No network devices available >> > >> > Fix this by converting positive values to zero in all callers of >> > phy_modify(). >> > >> > Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to >> > phy_modify()") >> > Signed-off-by: Geert Uytterhoeven >> > --- >> > Alternatively, __phy_modify() could be changed to follow __phy_write() >> > semantics? >> >> I really want a resolution to this quickly, this broke lots of stuff >> for people. >> >> __phy_modify() wants to return multiple values, so it should be coded >> up to do so explicitly rather than trying to encode two values from >> overlapping value spaces in one return value. >> >> That means the original value should be returned by-reference. And >> this will make the error/no-error return value unambiguous. >> >> int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set, >>u16 *orig_val); > > I'm sorry I have no time to work on this right now due to the meltdown > and spectre stuff that hit last week. If you need to do something, > please revert both the mvneta series and the series containing this > patch. I'll have a look into it... 
Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout
On Thu, Jan 11, 2018 at 10:48:35AM -0500, David Miller wrote: > From: Geert Uytterhoeven > Date: Tue, 9 Jan 2018 12:11:21 +0100 > > > In case of success, the return values of (__)phy_write() and > > (__)phy_modify() are not compatible: (__)phy_write() returns 0, while > > (__)phy_modify() returns the old PHY register value. > > > > Apparently this change was catered for in drivers/net/phy/marvell.c, but > > not in other source files. > > > > Hence genphy_restart_aneg() now returns 4416 instead zero, which is > > considered an error: > > > > ravb e680.ethernet eth0: failed to connect PHY > > IP-Config: Failed to open eth0 > > IP-Config: No network devices available > > > > Fix this by converting positive values to zero in all callers of > > phy_modify(). > > > > Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to > > phy_modify()") > > Signed-off-by: Geert Uytterhoeven > > --- > > Alternatively, __phy_modify() could be changed to follow __phy_write() > > semantics? > > I really want a resolution to this quickly, this broke lots of stuff > for people. > > __phy_modify() wants to return multiple values, so it should be coded > up to do so explicitly rather than trying to encode two values from > overlapping value spaces in one return value. > > That means the original value should be returned by-reference. And > this will make the error/no-error return value unambiguous. > > int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set, >u16 *orig_val); I'm sorry I have no time to work on this right now due to the meltdown and spectre stuff that hit last week. If you need to do something, please revert both the mvneta series and the series containing this patch. Thanks. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution
On Thu, Jan 11, 2018 at 1:54 AM, Jiri Kosina wrote:
> On Tue, 9 Jan 2018, Josh Poimboeuf wrote:
>
>> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
>> > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina wrote:
>> > > On Fri, 5 Jan 2018, Dan Williams wrote:
>> > >
>> > > [ ... snip ... ]
>> > >> Andi Kleen (1):
>> > >>   x86, barrier: stop speculation for failed access_ok
>> > >>
>> > >> Dan Williams (13):
>> > >>   x86: implement nospec_barrier()
>> > >>   [media] uvcvideo: prevent bounds-check bypass via speculative
>> > >>     execution
>> > >>   carl9170: prevent bounds-check bypass via speculative execution
>> > >>   p54: prevent bounds-check bypass via speculative execution
>> > >>   qla2xxx: prevent bounds-check bypass via speculative execution
>> > >>   cw1200: prevent bounds-check bypass via speculative execution
>> > >>   Thermal/int340x: prevent bounds-check bypass via speculative
>> > >>     execution
>> > >>   ipv6: prevent bounds-check bypass via speculative execution
>> > >>   ipv4: prevent bounds-check bypass via speculative execution
>> > >>   vfs, fdtable: prevent bounds-check bypass via speculative
>> > >>     execution
>> > >>   net: mpls: prevent bounds-check bypass via speculative execution
>> > >>   udf: prevent bounds-check bypass via speculative execution
>> > >>   userns: prevent bounds-check bypass via speculative execution
>> > >>
>> > >> Mark Rutland (4):
>> > >>   asm-generic/barrier: add generic nospec helpers
>> > >>   Documentation: document nospec helpers
>> > >>   arm64: implement nospec_ptr()
>> > >>   arm: implement nospec_ptr()
>> > >
>> > > So considering the recent publication of [1], how come we all of a sudden
>> > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
>> > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>> > >
>> > > Is this going to be handled in eBPF in some other way?
>> > >
>> > > Without that in place, and considering Jann Horn's paper, it would seem
>> > > like PTI doesn't really lock it down fully, right?
>> >
>> > Here is the latest (v3) bpf fix:
>> >
>> > https://patchwork.ozlabs.org/patch/856645/
>> >
>> > I currently have v2 on my 'nospec' branch and will move that to v3 for
>> > the next update, unless it goes upstream before then.
>
> Daniel, I guess you're planning to send this still for 4.15?

It's pending in the bpf.git tree:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc9

>> That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall
>> the only attack vector? Or are there other ways to run bpf programs
>> that we should be worried about?
>
> Seems like Alexei is probably the only person in the whole universe who
> isn't CCed here ... let's fix that.

He will be cc'd on v2 of this series which will be available later today.
Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout
On Thu, Jan 11, 2018 at 4:54 PM, Geert Uytterhoeven wrote:
> On Thu, Jan 11, 2018 at 4:53 PM, Russell King - ARM Linux
> wrote:
>> On Thu, Jan 11, 2018 at 10:48:35AM -0500, David Miller wrote:
>>> From: Geert Uytterhoeven
>>> Date: Tue, 9 Jan 2018 12:11:21 +0100
>>>
>>> > In case of success, the return values of (__)phy_write() and
>>> > (__)phy_modify() are not compatible: (__)phy_write() returns 0, while
>>> > (__)phy_modify() returns the old PHY register value.
>>> >
>>> > Apparently this change was catered for in drivers/net/phy/marvell.c, but
>>> > not in other source files.
>>> >
>>> > Hence genphy_restart_aneg() now returns 4416 instead zero, which is
>>> > considered an error:
>>> >
>>> > ravb e680.ethernet eth0: failed to connect PHY
>>> > IP-Config: Failed to open eth0
>>> > IP-Config: No network devices available
>>> >
>>> > Fix this by converting positive values to zero in all callers of
>>> > phy_modify().
>>> >
>>> > Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to
>>> > phy_modify()")
>>> > Signed-off-by: Geert Uytterhoeven
>>> > ---
>>> > Alternatively, __phy_modify() could be changed to follow __phy_write()
>>> > semantics?
>>>
>>> I really want a resolution to this quickly, this broke lots of stuff
>>> for people.
>>>
>>> __phy_modify() wants to return multiple values, so it should be coded
>>> up to do so explicitly rather than trying to encode two values from
>>> overlapping value spaces in one return value.
>>>
>>> That means the original value should be returned by-reference. And
>>> this will make the error/no-error return value unambiguous.
>>>
>>> int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set,
>>>                  u16 *orig_val);
>>
>> I'm sorry I have no time to work on this right now due to the meltdown
>> and spectre stuff that hit last week. If you need to do something,
>> please revert both the mvneta series and the series containing this
>> patch.
>
> I'll have a look into it...

Sorry, the phy_restore_page() semantics are driving me crazy.
Let's revert.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
    -- Linus Torvalds
Re: [PATCH net-next 3/4] cxgb4: add support for vxlan segmentation offload
From: Ganesh Goudar
Date: Wed, 10 Jan 2018 18:15:26 +0530

> add changes to t4_eth_xmit to enable vxlan segmentation
> offload support.
>
> Original work by: Santosh Rastapur
> Signed-off-by: Ganesh Goudar

Applied.
Re: [PATCH net-next 1/4] cxgb4: add data structures to support vxlan
From: Ganesh Goudar
Date: Wed, 10 Jan 2018 18:14:49 +0530

> Add data structures and macros to be used in vxlan
> offload.
>
> Original work by: Santosh Rastapur
> Signed-off-by: Ganesh Goudar

Applied.