Re: [PATCH net-next 0/5] vrf: allow simultaneous service instances in default and other VRFs
On 9/20/18 1:58 AM, Mike Manning wrote:
> Services currently have to be VRF-aware if they are using an unbound
> socket. One cannot have multiple service instances running in the
> default and other VRFs for services that are not VRF-aware and listen
> on an unbound socket. This is because there is no way of isolating
> packets received in the default VRF from those arriving in other VRFs.
>
> This series provides this isolation subject to the existing kernel
> parameter net.ipv4.tcp_l3mdev_accept not being set, given that this is
> documented as allowing a single service instance to work across all
> VRF domains. The functionality applies to UDP & TCP services, for IPv4
> and IPv6, in particular adding VRF table handling for IPv6 multicast.
>
> Example of running ssh instances in default and blue VRF:
>
> $ /usr/sbin/sshd -D
> $ ip vrf exec vrf-blue /usr/sbin/sshd
> $ ss -ta | egrep 'State|ssh'
> State   Recv-Q  Send-Q  Local Address:Port        Peer Address:Port
> LISTEN  0       128     0.0.0.0%vrf-blue:ssh      0.0.0.0:*
> LISTEN  0       128     0.0.0.0:ssh               0.0.0.0:*
> ESTAB   0       0       192.168.122.220:ssh       192.168.122.1:50282
> LISTEN  0       128     [::]%vrf-blue:ssh         [::]:*
> LISTEN  0       128     [::]:ssh                  [::]:*
> ESTAB   0       0       [3000::2]%vrf-blue:ssh    [3000::9]:45896
> ESTAB   0       0       [2000::2]:ssh             [2000::9]:46398

Hi Dave: I need some time to review and, more importantly, test this patch
set before it is committed. I am traveling tomorrow afternoon through
Sunday evening, so I need a few days into next week to get to this.

Thanks,
David
[PATCH net-next v2] net/tls: Add support for async encryption of records for performance
In the current implementation, TLS records are encrypted and transmitted
serially. The implementation waits until the previously submitted user data
has been encrypted, and only then starts transmitting the record. This
encrypt-one-record-at-a-time approach is inefficient when asynchronous
crypto accelerators are used: for each record there is the overhead of
interrupts, driver softIRQ scheduling etc., and the crypto accelerator sits
idle most of the time while an encrypted record's pages are handed over to
the TCP stack for transmission.

This patch enables encryption of multiple records in parallel when an
async-capable crypto accelerator is present in the system. This is achieved
by allowing the user space application to send more data using sendmsg()
even while previously issued data is still being processed by the crypto
accelerator, i.e. by returning control to the application right after
submitting the encryption request to the accelerator. This also means that
the zero-copy mode of encryption cannot be used with an async accelerator,
as we must be done with the user space application's buffer before
returning from sendmsg().

There can be multiple records in flight to/from the accelerator. Each
record is represented by a 'struct tls_rec', which stores the memory pages
for the record. After the records are encrypted, they are added to a linked
list called tx_ready_list, which contains encrypted TLS records sorted by
TLS sequence number. The records on tx_ready_list are transmitted using a
newly introduced function called tls_tx_records(). The tx_ready_list is
polled for any record ready to be transmitted in sendmsg() and sendpage(),
after initiating encryption of new TLS records. This achieves parallel
encryption and transmission of records when an async accelerator is
present.

The crypto accelerator could complete encryption later than the polling of
tx_ready_list by sendmsg()/sendpage().
Therefore, we need a deferred work context to be able to transmit records
from tx_ready_list. The deferred work context gets scheduled if the
application is not sending much data through the socket. If the application
issues sendmsg()/sendpage() calls in quick succession, then the scheduling
of tx_work_handler gets cancelled, as the tx_ready_list would be polled
from the application's context itself. This saves the scheduling overhead
of deferred work.

The patch also brings a side benefit: we are able to get rid of the concept
of the CLOSED record. This is because records, once closed, are either
encrypted and then placed into tx_ready_list, or, if encryption fails, the
socket error is set. This simplifies the kernel TLS send path. However,
since tls_device.c is still using them, the macros and accessor functions
for CLOSED records have been retained.

Signed-off-by: Vakul Garg
---
Changes since v1: Addressed Dave Miller's comments.
- Removed an extra space between 'inline' and 'bool' in the 'is_tx_ready'
  declaration.
- Changed the order of variable declarations in several places to follow
  the 'longest to shortest' line suggestion.

 include/net/tls.h  |  70 +--
 net/tls/tls_main.c |  54 ++---
 net/tls/tls_sw.c   | 586 -
 3 files changed, 522 insertions(+), 188 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 9f3c4ea9ad6f..3aa73e2d8823 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -41,7 +41,7 @@
 #include
 #include
 #include
-
+#include
 #include
@@ -93,24 +93,47 @@ enum {
 	TLS_NUM_CONFIG,
 };

-struct tls_sw_context_tx {
-	struct crypto_aead *aead_send;
-	struct crypto_wait async_wait;
-
-	char aad_space[TLS_AAD_SPACE_SIZE];
-
-	unsigned int sg_plaintext_size;
-	int sg_plaintext_num_elem;
+/* TLS records are maintained in 'struct tls_rec'. It stores the memory pages
+ * allocated or mapped for each TLS record. After encryption, the records are
+ * stored in a linked list.
+ */
+struct tls_rec {
+	struct list_head list;
+	int tx_flags;
 	struct scatterlist sg_plaintext_data[MAX_SKB_FRAGS];
-
-	unsigned int sg_encrypted_size;
-	int sg_encrypted_num_elem;
 	struct scatterlist sg_encrypted_data[MAX_SKB_FRAGS];

 	/* AAD | sg_plaintext_data | sg_tag */
 	struct scatterlist sg_aead_in[2];
 	/* AAD | sg_encrypted_data (data contain overhead for hdr) */
 	struct scatterlist sg_aead_out[2];
+
+	unsigned int sg_plaintext_size;
+	unsigned int sg_encrypted_size;
+	int sg_plaintext_num_elem;
+	int sg_encrypted_num_elem;
+
+	char aad_space[TLS_AAD_SPACE_SIZE];
+	struct aead_request aead_req;
+	u8 aead_req_ctx[];
+};
+
+struct tx_work {
+	struct delayed_work work;
+	struct sock *sk;
+};
+
+struct tls_sw_context_tx {
+	struct crypto_aead *aead_send;
+	struct crypto_wait async_wait;
+	struct tx_work tx_work;
+	struct tls_rec *open_rec;
+	struct
Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices
> Subject: Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb
> when sending to many devices
>
> On 09/20/2018 06:43 AM, Eric Dumazet wrote:
>>> Sorry, I should cc to you.
>
> And lastly this patch looks way too complicated to me.
> You probably can write something much simpler.

But it should not increase the negative performance.

> Something like :
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 930d17fa906c9ebf1cf7b6031ce0a22f9f66c0e4..e0a81beb4f37751421dbbe794ccf3d5a46bdf900 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -278,22 +278,26 @@ static bool netlink_filter_tap(const struct sk_buff *skb)
>  	return false;
>  }
>
> -static int __netlink_deliver_tap_skb(struct sk_buff *skb,
> +static int __netlink_deliver_tap_skb(struct sk_buff **pskb,
>  				     struct net_device *dev)
>  {
> -	struct sk_buff *nskb;
> +	struct sk_buff *nskb, *skb = *pskb;
>  	struct sock *sk = skb->sk;
>  	int ret = -ENOMEM;
>
>  	if (!net_eq(dev_net(dev), sock_net(sk)))
>  		return 0;
>
> -	dev_hold(dev);
> -
> -	if (is_vmalloc_addr(skb->head))
> +	if (is_vmalloc_addr(skb->head)) {
>  		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
> -	else
> -		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +		consume_skb(skb);

The original skb cannot be freed here, since it is still used after being
delivered to the tap, in __netlink_sendskb().

> +		skb = nskb;
> +		*pskb = skb;
> +	}
> +	dev_hold(dev);
> +	nskb = skb_clone(skb, GFP_ATOMIC);

And since the original skb cannot be freed, the skb_clone() here will lead
to a leak.
>  	if (nskb) {
>  		nskb->dev = dev;
>  		nskb->protocol = htons((u16) sk->sk_protocol);
> @@ -318,7 +322,7 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *nn
>  		return;
>
>  	list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
> -		ret = __netlink_deliver_tap_skb(skb, tmp->dev);
> +		ret = __netlink_deliver_tap_skb(&skb, tmp->dev);
>  		if (unlikely(ret))
>  			break;
>  	}
>

The change below seems simple, but it adds one extra skb allocation and
free:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..b9631137f0fe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -290,10 +290,8 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 	dev_hold(dev);

-	if (is_vmalloc_addr(skb->head))
-		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
-	else
-		nskb = skb_clone(skb, GFP_ATOMIC);
+	nskb = skb_clone(skb, GFP_ATOMIC);
+
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -317,11 +315,20 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *nn
 	if (!netlink_filter_tap(skb))
 		return;

+	if (is_vmalloc_addr(skb->head)) {
+		skb = netlink_to_full_skb(skb, GFP_ATOMIC);
+		if (!skb)
+			return;
+		alloc = true;
+	}
+
 	list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
 		ret = __netlink_deliver_tap_skb(skb, tmp->dev);
 		if (unlikely(ret))
 			break;
 	}
+
+	if (alloc)
+		consume_skb(skb);
 }

-Q
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote: > On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote: > > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due > > to having to cope with munmapping parts of existing mmaps, etc. > > > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for > > now it would be used just in this clean case for undoing a > > PERF_RECORD_MMAP for a BPF program. > > > > The ABI is already complicated, starting to use something called > > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever, > > I think. > > Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult > part was getting the perf tool to dtrt for that use-case. But if we need > unmap events, doing the unmap record now is the right thing. Thanks for the pointers! The answer is a bit long. pls bear with me. I have considered adding MUNMAP to match existing MMAP, but went without it because I didn't want to introduce new bit in perf_event_attr and emit these new events in a misbalanced conditional way for prog load/unload. Like old perf is asking kernel for mmap events via mmap bit, so prog load events will be in perf.data, but old perf report won't recognize them anyway. Whereas new perf would certainly want to catch bpf events and will set both mmap and mumap bits. Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP conditionally based on single mmap bit they will confuse old perf and it will print warning about 'unknown events'. Both situations are ugly, hence I went with reuse of MMAP event for both load/unload. In such case old perf silently ignores them. Which is what I wanted. When we upgrade the kernel we cannot synchronize the kernel upgrade (or downgrade) with user space perf package upgrade. Hence not confusing old perf is important. With new kernel new bpf mmap events get into perf.data and new perf picks them up. 
Few more considerations: I consider synthetic perf events to be non-ABI. Meaning they're emitted by perf user space into perf.data and there is a convention on names, but it's not a kernel abi. Like RECORD_MMAP with event.filename == "[module_name]" is an indication for perf report to parse elf/build-id of dso==module_name. There is no such support in the kernel. Kernel doesn't emit such events for module load/unload. If in the future we decide to extend kernel with such events they don't have to match what user space perf does today. Why this is important? To get to next step. As Arnaldo pointed out this patch set is missing support for JITed prog annotations and displaying asm code. Absolutely correct. This set only helps perf to reveal the names of bpf progs that _were_ running at the time of perf record, but there is no way yet for perf report to show asm code of the prog that was running. In that sense bpf is drastically different from java, other jits and normal profiling. bpf JIT happens in the kernel and only kernel knows the mapping between original source and JITed code. In addition there are bpf2bpf functions. In the future there will be bpf libraries, more type info, line number support, etc. I strongly believe perf RECORD_* events should NOT care about the development that happens on the bpf side. The only thing kernel will be telling user space is that bpf prog with prog_id=X was loaded. Then based on prog_id the 'perf record' phase can query the kernel for bpf related information. There is already a way to fetch JITed image based on prog_id. Then perf will emit synthetic RECORD_FOOBAR into perf.data that will contain bpf related info (like complete JITed image) and perf report can process it later and annotate things in UI. It may seem that there is a race here. Like when 'perf record' see 'bpf prog was loaded with prog_id=X' event it will ask the kernel about prog_id=X, but that prog could be unloaded already. 
In such a case the prog_id will not exist and 'perf record' can ignore the
event. So no race.

The kernel doesn't need to pass all information about a bpf prog to user
space via RECORD_*. Instead 'perf record' can emit synthetic events into
perf.data. I was thinking of extending RECORD_MMAP with prog_id already
(instead of passing the kallsyms bpf prog name in event->mmap.filename),
but bpf functions don't have their own prog_id: multiple bpf funcs with
different JITed blobs are considered part of a single prog_id. So as step
one I'm only extending RECORD_MMAP with the addr and kallsym name of the
bpf function/prog. As step two, the plan is to add a notification
mechanism for prog load/unload that will include prog_id, and to design
new synthetic RECORD_* events in perf user space that will contain JITed
code, line info, BTF, etc.

TLDR:
step 1 (this patch set):
  a single bpf prog_load can call multiple bpf_prog_kallsyms_add()
    -> RECORD_MMAP with addr + kallsym name only
  similarly, unload calls multiple bpf_prog_kallsyms_del()
    -> RECORD_MMAP with addr only
step 2 (future work)
Re: [PATCH net] r8169: fix autoneg issue on resume with RTL8168E
From: Heiner Kallweit Date: Thu, 20 Sep 2018 22:47:09 +0200 > It was reported that chip version 33 (RTL8168E) ends up with > 10MBit/Half on a 1GBit link after resuming from S3 (with different > link partners). For whatever reason the PHY on this chip doesn't > properly start a renegotiation when soft-reset. > Explicitly requesting a renegotiation fixes this. > > Signed-off-by: Heiner Kallweit > Fixes: a2965f12fde6 ("r8169: remove rtl8169_set_speed_xmii") > Reported-by: Neil MacLeod > Tested-by: Neil MacLeod Applied, thank you. Please always make "Fixes: " the first tag in a set of tags. I fixed it up for you this time. Thanks.
[PATCH RESEND] PCI: hv: Fix return value check in hv_pci_assign_slots()
In case of error, the function pci_create_slot() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR(). Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information") Signed-off-by: Wei Yongjun --- Since the orig patch is merged from net tree, cc netdev@vger.kernel.org --- drivers/pci/controller/pci-hyperv.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index ee80e79..9ba4d12 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1484,8 +1484,10 @@ static void hv_pci_assign_slots(struct hv_pcibus_device *hbus) snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser); hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr, name, NULL); - if (!hpdev->pci_slot) + if (IS_ERR(hpdev->pci_slot)) { pr_warn("pci_create slot %s failed\n", name); + hpdev->pci_slot = NULL; + } } }
RE: [PATCH net-next] net/tls: Add support for async encryption of records for performance
> -Original Message- > From: David Miller > Sent: Thursday, September 20, 2018 11:49 PM > To: Vakul Garg > Cc: netdev@vger.kernel.org; bor...@mellanox.com; > avia...@mellanox.com; davejwat...@fb.com; doro...@fb.com > Subject: Re: [PATCH net-next] net/tls: Add support for async encryption of > records for performance > > From: Vakul Garg > Date: Wed, 19 Sep 2018 20:51:35 +0530 > > > This patch enables encryption of multiple records in parallel when an > > async capable crypto accelerator is present in system. > > This seems to be trading off zero copy with async support. > > Async crypto device support is not the common case at all, and synchronous > crypto via cpu crypto acceleration instructions is so much more likely. > > Oh I see, the new logic is only triggered with ASYNC_CAPABLE is set? > > > +static inline bool is_tx_ready(struct tls_context *tls_ctx, > > + struct tls_sw_context_tx *ctx) > > +{ > > Two space between "inline" and "bool", please make it one. Fixed. Seems checkpatch misses it. > > > static void tls_write_space(struct sock *sk) { > > struct tls_context *ctx = tls_get_ctx(sk); > > + struct tls_sw_context_tx *tx_ctx = tls_sw_ctx_tx(ctx); > > Longest to shortest line (reverse christmas tree) ordering for local variable > declarations please. Can't do this. The second variable assignment is dependent upon previous one. > > > > + list_for_each_prev(pos, >tx_ready_list) { > > + struct tls_rec *rec = (struct tls_rec *)pos; > > + u64 seq = be64_to_cpup((const __be64 *)>aad_space); > > Likewise. > I can split variable declaration 'seq' and its assignment into two separate lines. But I am not sure if increasing number of lines in order to comply reverse Christmas tree is a good thing for this case. 
> > -static int tls_do_encryption(struct tls_context *tls_ctx, > > +int tls_tx_records(struct sock *sk, int flags) { > > + struct tls_rec *rec, *tmp; > > + struct tls_context *tls_ctx = tls_get_ctx(sk); > > + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > > + int rc = 0; > > + int tx_flags; > > Likewise. Could partially address since ctx assignment depends upon tls_ctx assignment. > > > +static void tls_encrypt_done(struct crypto_async_request *req, int > > +err) { > > + struct aead_request *aead_req = (struct aead_request *)req; > > + struct sock *sk = req->data; > > + struct tls_context *tls_ctx = tls_get_ctx(sk); > > + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > > + struct tls_rec *rec; > > + int pending; > > + bool ready = false; > > Likewise. Placed 'ready' above pending 'pending'. Rest unchanged because of dependencies. > > > +static int tls_do_encryption(struct sock *sk, > > +struct tls_context *tls_ctx, > > struct tls_sw_context_tx *ctx, > > struct aead_request *aead_req, > > size_t data_len) > > { > > int rc; > > + struct tls_rec *rec = ctx->open_rec; > > Likewise. > > > @@ -473,11 +630,12 @@ static int memcopy_from_iter(struct sock *sk, > > struct iov_iter *from, { > > struct tls_context *tls_ctx = tls_get_ctx(sk); > > struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > > - struct scatterlist *sg = ctx->sg_plaintext_data; > > + struct tls_rec *rec = ctx->open_rec; > > + struct scatterlist *sg = rec->sg_plaintext_data; > > int copy, i, rc = 0; > > Likewise. Can't change because of dependencies. > > > +struct tls_rec *get_rec(struct sock *sk) { > > + int mem_size; > > + struct tls_context *tls_ctx = tls_get_ctx(sk); > > + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > > + struct tls_rec *rec; > > Likewise. > Declared 'mem_size' below 'rec'. 
> > @@ -510,21 +707,33 @@ int tls_sw_sendmsg(struct sock *sk, struct > msghdr *msg, size_t size) > > int record_room; > > bool full_record; > > int orig_size; > > + struct tls_rec *rec; > > bool is_kvec = msg->msg_iter.type & ITER_KVEC; > > + struct crypto_tfm *tfm = crypto_aead_tfm(ctx->aead_send); > > + bool async_capable = tfm->__crt_alg->cra_flags & > CRYPTO_ALG_ASYNC; > > + int num_async = 0; > > + int num_zc = 0; > > Likewise. Fixed > > @@ -661,6 +904,8 @@ int tls_sw_sendpage(struct sock *sk, struct page > *page, > > struct scatterlist *sg; > > bool full_record; > > int record_room; > > + struct tls_rec *rec; > > + int num_async = 0; > > Likewise. Fixed. Sending v2.
Re: [PATCH iproute2-next] iplink: add ipvtap support
On 9/18/18 8:03 PM, Hangbin Liu wrote: > IPVLAN and IPVTAP are using the same functions and parameters. So we can > just add a new link_util with id ipvtap. Others are the same. > > Signed-off-by: Hangbin Liu > --- > ip/iplink.c | 4 ++-- > ip/iplink_ipvlan.c| 28 ++-- > man/man8/ip-link.8.in | 4 > 3 files changed, 24 insertions(+), 12 deletions(-) > applied to iproute2-next. Thanks
[PATCH net] net/ipv6: Display all addresses in output of /proc/net/if_inet6
The backend handling for /proc/net/if_inet6 in addrconf.c doesn't properly
handle starting/stopping the iteration. The problem is that at some point
during the iteration, an overflow is detected and the process is
subsequently stopped. The item being shown via seq_printf() when the
overflow occurs is not actually shown, though. When start() is subsequently
called to resume iterating, it returns the next item, and thus the item
that was being processed when the overflow occurred never gets printed.

Alter the meaning of the private data member "offset". Currently, when it
is not 0 (which only happens at the very beginning), "offset" represents
the next hlist item to be printed. After this change, "offset" always
represents the current item. This is also consistent with the private data
member "bucket", which represents the current bucket, and also with the use
of "pos" as defined in seq_file.txt:

	The pos passed to start() will always be either zero, or the most
	recent pos used in the previous session.

Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
---
 net/ipv6/addrconf.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d51a8c0b3372..c63ccce6425f 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4201,7 +4201,6 @@ static struct inet6_ifaddr *if6_get_first(struct seq_file *seq, loff_t pos)
 				p++;
 				continue;
 			}
-			state->offset++;
 			return ifa;
 		}

@@ -4225,13 +4224,12 @@ static struct inet6_ifaddr *if6_get_next(struct seq_file *seq,
 		return ifa;
 	}

+	state->offset = 0;
 	while (++state->bucket < IN6_ADDR_HSIZE) {
-		state->offset = 0;
 		hlist_for_each_entry_rcu(ifa,
					 &inet6_addr_lst[state->bucket],
					 addr_lst) {
 			if (!net_eq(dev_net(ifa->idev->dev), net))
 				continue;
-			state->offset++;
 			return ifa;
 		}
 	}
--
2.14.1
Re: [PATCH net] ixgbe: check return value of napi_complete_done()
On 09/20/2018 04:43 PM, Song Liu wrote: > > I tried to totally skip ndo_poll_controller() here. It did avoid hitting > the issue. However, netpoll will drop (fail to send) more packets. > Why is it failing ? If you are under high memory pressure, then maybe if you absolutely want memory to send netpoll packets, you want to grab all NAPI contexts as a way to prevent other cpus from feeding incoming packets to the host and add more memory pressure ;)
Re: [PATCH net] ixgbe: check return value of napi_complete_done()
> On Sep 20, 2018, at 4:22 PM, Eric Dumazet wrote:
>
> On 09/20/2018 03:42 PM, Song Liu wrote:
>>
>>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher wrote:
>>>
>>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>>>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>>>> The NIC driver should only enable interrupts when napi_complete_done()
>>>>> returns true. This patch adds the check for ixgbe.
>>>>>
>>>>> Cc: sta...@vger.kernel.org # 4.10+
>>>>> Cc: Jeff Kirsher
>>>>> Suggested-by: Eric Dumazet
>>>>> Signed-off-by: Song Liu
>>>>> ---
>>>>
>>>> Well, unfortunately we do not know why this is needed,
>>>> this is why I have not yet sent this patch formally.
>>>>
>>>> netpoll has correct synchronization :
>>>>
>>>> poll_napi() places into napi->poll_owner current cpu number before
>>>> calling poll_one_napi()
>>>>
>>>> netpoll_poll_lock() does also use napi->poll_owner
>>>>
>>>> When netpoll calls ixgbe poll() method, it passed a budget of 0,
>>>> meaning napi_complete_done() is not called.
>>>>
>>>> As long as we can not explain the problem properly in the changelog,
>>>> we should investigate, otherwise we will probably see coming dozens
>>>> of patches trying to fix a 'potential hazard'.
>>>
>>> Agreed, which is why I have our validation and developers looking into
>>> it, while we test the current patch from Song.
>>
>> I figured out what is the issue here. And I have a proposal to fix it. I
>> have verified that this fixes the issue in our tests. But Alexei suggests
>> that it may not be the right way to fix.
>>
>> Here is what happened:
>>
>> netpoll tries to send skb with netpoll_start_xmit(). If that fails, it
>> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in
>> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs
>> within the same NIC.
>>
>> This is problematic, because at the end napi_schedule() calls:
>>
>>     ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>>
>> which attaches these NAPIs to softnet_data on THIS CPU. This is done
>> via napi->poll_list.
>>
>> Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
>> not give up the ownership until it calls napi_complete_done(). However,
>> for a very busy server, we usually use 16 CPUs to poll NAPI, so this
>> CPU can easily be overloaded. And as a result, each call of napi->poll()
>> will hit budget (of 64), and it will not call napi_complete_done(),
>> and the NAPI stays in the poll_list of this CPU.
>>
>> When this happens, the host usually cannot get out of this state until
>> we throttle/stop client traffic.
>>
>> I am pretty confident this is what happened. Please let me know if
>> anything above doesn't make sense.
>>
>> Here is my proposal to fix it: Instead of polling all NAPIs within one
>> NIC, I would have netpoll only poll the NAPI that will free space
>> for netpoll_start_xmit(). I attached my two RFC patches to the end of
>> this email.
>>
>> I chatted with Alexei about this. He thinks polling only one NAPI may
>> not guarantee netpoll makes progress with the TX queue we are aiming
>> for. Also, the bigger problem may be the fact that NAPIs could get
>> pinned to one CPU and cannot get released.
>>
>> At this point, I really don't know what is the best way to fix this.
>>
>> I will also work on a repro with netperf.
>
> Thanks !
>
>> Please let me know your suggestions.
>
> Yeah, maybe NICs using NAPI could not provide an ndo_poll_controller()
> method at all, since it is very risky (potentially grabbing many NAPIs,
> and ending up in this locked situation).
>
> poll_napi() could attempt to free skbs one napi at a time,
> without the current cpu stealing all NAPIs.
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
>  	}
>
>  	ops = dev->netdev_ops;
> -	if (!ops->ndo_poll_controller) {
> -		up(&ni->dev_lock);
> -		return;
> -	}
> -
> -	/* Process pending work on NIC */
> -	ops->ndo_poll_controller(dev);
> +	if (ops->ndo_poll_controller)
> +		ops->ndo_poll_controller(dev);
>
>  	poll_napi(dev);
>

I tried to totally skip ndo_poll_controller() here. It did avoid hitting
the issue. However, netpoll will drop (fail to send) more packets.

Thanks,
Song
Re: [PATCH net] ixgbe: check return value of napi_complete_done()
On 09/20/2018 03:42 PM, Song Liu wrote:
>
>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher wrote:
>>
>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>>> The NIC driver should only enable interrupts when napi_complete_done()
>>>> returns true. This patch adds the check for ixgbe.
>>>>
>>>> Cc: sta...@vger.kernel.org # 4.10+
>>>> Cc: Jeff Kirsher
>>>> Suggested-by: Eric Dumazet
>>>> Signed-off-by: Song Liu
>>>> ---
>>>
>>> Well, unfortunately we do not know why this is needed,
>>> this is why I have not yet sent this patch formally.
>>>
>>> netpoll has correct synchronization :
>>>
>>> poll_napi() places into napi->poll_owner current cpu number before
>>> calling poll_one_napi()
>>>
>>> netpoll_poll_lock() does also use napi->poll_owner
>>>
>>> When netpoll calls ixgbe poll() method, it passed a budget of 0,
>>> meaning napi_complete_done() is not called.
>>>
>>> As long as we can not explain the problem properly in the changelog,
>>> we should investigate, otherwise we will probably see coming dozens of
>>> patches trying to fix a 'potential hazard'.
>>
>> Agreed, which is why I have our validation and developers looking into it,
>> while we test the current patch from Song.
>
> I figured out what is the issue here. And I have a proposal to fix it. I
> have verified that this fixes the issue in our tests. But Alexei suggests
> that it may not be the right way to fix.
>
> Here is what happened:
>
> netpoll tries to send skb with netpoll_start_xmit(). If that fails, it
> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in
> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs
> within the same NIC.
>
> This is problematic, because at the end napi_schedule() calls:
>
>     ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>
> which attaches these NAPIs to softnet_data on THIS CPU. This is done
> via napi->poll_list.
>
> Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
> not give up the ownership until it calls napi_complete_done(). However,
> for a very busy server, we usually use 16 CPUs to poll NAPI, so this
> CPU can easily be overloaded. And as a result, each call of napi->poll()
> will hit budget (of 64), and it will not call napi_complete_done(),
> and the NAPI stays in the poll_list of this CPU.
>
> When this happens, the host usually cannot get out of this state until
> we throttle/stop client traffic.
>
> I am pretty confident this is what happened. Please let me know if
> anything above doesn't make sense.
>
> Here is my proposal to fix it: Instead of polling all NAPIs within one
> NIC, I would have netpoll only poll the NAPI that will free space
> for netpoll_start_xmit(). I attached my two RFC patches to the end of
> this email.
>
> I chatted with Alexei about this. He thinks polling only one NAPI may
> not guarantee netpoll makes progress with the TX queue we are aiming
> for. Also, the bigger problem may be the fact that NAPIs could get
> pinned to one CPU and cannot get released.
>
> At this point, I really don't know what is the best way to fix this.
>
> I will also work on a repro with netperf.

Thanks !

> Please let me know your suggestions.

Yeah, maybe NICs using NAPI could not provide an ndo_poll_controller()
method at all, since it is very risky (potentially grabbing many NAPIs,
and ending up in this locked situation).

poll_napi() could attempt to free skbs one napi at a time,
without the current cpu stealing all NAPIs.

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
 	}

 	ops = dev->netdev_ops;
-	if (!ops->ndo_poll_controller) {
-		up(&ni->dev_lock);
-		return;
-	}
-
-	/* Process pending work on NIC */
-	ops->ndo_poll_controller(dev);
+	if (ops->ndo_poll_controller)
+		ops->ndo_poll_controller(dev);

 	poll_napi(dev);
Re: array bounds warning in xfrm_output_resume
On 9/20/18 7:06 AM, Florian Westphal wrote: > David Ahern wrote: >>> $ make O=kbuild/perf -j 24 -s >>> In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0, >>> from /home/dsa/kernel-3.git/include/linux/list.h:9, >>> from /home/dsa/kernel-3.git/include/linux/module.h:9, >>> from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13: >>> /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function >>> ‘xfrm_output_resume’: >>> /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array >>> subscript is above array bounds [-Warray-bounds] >>>__read_once_size(&(x), __u.__c, sizeof(x)); \ > > Does this thing avoid the warning? nope, still see it.
Re: [PATCH net] ixgbe: check return value of napi_complete_done()
> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher wrote: > > On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote: >> On 09/20/2018 12:01 PM, Song Liu wrote: >>> The NIC driver should only enable interrupts when napi_complete_done() >>> returns true. This patch adds the check for ixgbe. >>> >>> Cc: sta...@vger.kernel.org # 4.10+ >>> Cc: Jeff Kirsher >>> Suggested-by: Eric Dumazet >>> Signed-off-by: Song Liu >>> --- >> >> >> Well, unfortunately we do not know why this is needed, >> this is why I have not yet sent this patch formally. >> >> netpoll has correct synchronization : >> >> poll_napi() places into napi->poll_owner current cpu number before >> calling poll_one_napi() >> >> netpoll_poll_lock() does also use napi->poll_owner >> >> When netpoll calls ixgbe poll() method, it passed a budget of 0, >> meaning napi_complete_done() is not called. >> >> As long as we can not explain the problem properly in the changelog, >> we should investigate, otherwise we will probably see coming dozens of >> patches >> trying to fix a 'potential hazard'. > > Agreed, which is why I have our validation and developers looking into it, > while we test the current patch from Song. I figured out what the issue is here. And I have a proposal to fix it. I have verified that this fixes the issue in our tests. But Alexei suggests that it may not be the right way to fix it. Here is what happened: netpoll tries to send skb with netpoll_start_xmit(). If that fails, it calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs within the same NIC. This is problematic, because at the end napi_schedule() calls: ____napi_schedule(this_cpu_ptr(&softnet_data), n); which attaches these NAPIs to softnet_data on THIS CPU. This is done via napi->poll_list. Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will not give up the ownership until it calls napi_complete_done().
However, for a very busy server, we usually use 16 CPUs to poll NAPI, so this CPU can easily be overloaded. And as a result, each call of napi->poll() will hit budget (of 64), and it will not call napi_complete_done(), and the NAPI stays in the poll_list of this CPU. When this happens, the host usually cannot get out of this state until we throttle/stop client traffic. I am pretty confident this is what happened. Please let me know if anything above doesn't make sense. Here is my proposal to fix it: Instead of polling all NAPIs within one NIC, I would have netpoll only poll the NAPI that will free space for netpoll_start_xmit(). I attached my two RFC patches to the end of this email. I chatted with Alexei about this. He thinks polling only one NAPI may not guarantee that netpoll makes progress with the TX queue we are aiming for. Also, the bigger problem may be the fact that NAPIs could get pinned to one CPU and cannot get released. At this point, I really don't know what is the best way to fix this. I will also work on a repro with netperf. Please let me know your suggestions. Thanks, Song = RFCs From 7b6d1d0e21bc7a0034dbcacfdaaec95a0e5f5b14 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Thu, 20 Sep 2018 13:09:26 -0700 Subject: [RFC net 1/2] net: netpoll: when failed to send a message, only poll the target NAPI Currently, netpoll polls all NAPIs in the NIC if it fails to send data to one TX queue. This is not necessary and sometimes problematic. This is because ndo_poll_controller() calls napi_schedule() for all NAPIs in the NIC, and attaches available NAPIs to this_cpu's softnet_data->poll_list. A modern high-end NIC has tens of NAPIs, and requires multiple CPUs to poll the data from the queues. Attaching multiple NAPIs to one CPU's poll_list fans in softirq work of multiple CPUs to one CPU. As a result, we see one CPU pegged in ksoftirqd, while other CPUs cannot get assigned work. This patch fixes this issue by polling only the target NAPI when send fails.
New hook ndo_netpoll_get_napi() is added for the driver to map busy TX queue to the NAPI to poll. If the driver implements this hook, netpoll will only poll the NAPI of the target TX queue. Signed-off-by: Song Liu --- include/linux/netdevice.h | 5 + net/core/netpoll.c| 30 -- 2 files changed, 29 insertions(+), 6 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 6a7ac01fa605..576414b150ff 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1033,6 +1033,9 @@ struct dev_ifalias { * * void (*ndo_poll_controller)(struct net_device *dev); * Run napi_schedule() for all NAPI within this device. + * struct napi_struct* (*ndo_netpoll_get_napi)(struct net_device *dev, + * struct netdev_queue *txq); + * Returns the NAPI to poll for the given TX queue. * * SR-IOV management
Re: [PATCH net] af_key: free SKBs under RCU protection
On 09/20/2018 03:10 PM, Eric Dumazet wrote: > > > On 09/20/2018 12:25 PM, stran...@codeaurora.org wrote: >>> >>> I do not believe the changelog or the patch makes sense. >>> >>> Having skb still referencing a socket prevents this socket being released. >>> >>> If you think about it, what would prevent the freeing happening >>> _before_ the rcu_read_lock() in pfkey_broadcast() ? >>> >>> Maybe the correct fix is that pfkey_broadcast_one() should ensure the >>> socket is still valid. >>> >>> I would suggest something like : >>> >>> diff --git a/net/key/af_key.c b/net/key/af_key.c >>> index >>> 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d >>> 100644 >>> --- a/net/key/af_key.c >>> +++ b/net/key/af_key.c >>> @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff >>> *skb, struct sk_buff **skb2, >>> { >>> int err = -ENOBUFS; >>> >>> - sock_hold(sk); >>> + if (!refcount_inc_not_zero(>sk_refcnt)) >>> + return -ENOENT; >>> + >>> if (*skb2 == NULL) { >>> if (refcount_read(>users) != 1) { >>> *skb2 = skb_clone(skb, allocation); >> >> Hi Eric, >> >> I'm not sure that the socket getting freed before the rcu_read_lock() would >> be an issue, since then it would no longer be in the net_pkey->table that >> we loop through (since we call pfkey_remove() from pfkey_relase()). Because >> of >> that, all the sockets processed in pfkey_broadcast_one() have valid >> refcounts, >> so checking for zero there doesn't prevent the crash that I'm seeing. >> >> However, after going over the call flow again, I see that the actual problem >> occurs because of pfkey_broadcast_one(). Specifically, because of this check: >> >> if (*skb2 == NULL) { >> if (refcount_read(>users) != 1) { >> *skb2 = skb_clone(skb, allocation); >> } else { >> *skb2 = skb; >> refcount_inc(>users); >> } >> } >> >> Since we always pass a freshly cloned SKB to this function, skb->users is >> always 1, and skb2 just becomes skb. 
We then set skb2 (and thus skb) to >> belong to the socket. >> >> If the socket we queue skb2 to frees this SKB (thereby decrementing its >> refcount to 1) and the socket is freed before pfkey_broadcast() can >> execute the kfree_skb(skb) on line 284, we will then attempt to run >> sock_rfree() on an SKB with a dangling reference to this socket. >> >> Perhaps a cleaner solution here is to always clone the SKB in >> pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls >> in pfkey_broadcast() will never be passed an SKB with sock_rfree() as >> its destructor, and we can avoid this race condition. > > As long as one skb has sock_rfree has its destructor, the socket attached to > this skb can not be released. There is no race here. > > Note that skb_clone() does not propagate the destructor. > > The issue here is that in the rcu lookup, we can find a socket that has been > dismantled, with a 0 refcount. > > We must not use sock_hold() in this case, since we are not sure the socket > refcount is not already 0. > > pfkey_broadcast() and pfkey_broadcast_one() violate basic RCU rules. > > When in a RCU lookup, one want to increment an object refcount, it needs > to be extra-careful, as I did in my proposal. > > Note that the race could be automatically detected with CONFIG_REFCOUNT_FULL=y Bug was added in commit 7f6b9dbd5afb ("af_key: locking change")
Re: [PATCH net] af_key: free SKBs under RCU protection
On 09/20/2018 12:25 PM, stran...@codeaurora.org wrote: >> >> I do not believe the changelog or the patch makes sense. >> >> Having skb still referencing a socket prevents this socket being released. >> >> If you think about it, what would prevent the freeing happening >> _before_ the rcu_read_lock() in pfkey_broadcast() ? >> >> Maybe the correct fix is that pfkey_broadcast_one() should ensure the >> socket is still valid. >> >> I would suggest something like : >> >> diff --git a/net/key/af_key.c b/net/key/af_key.c >> index >> 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d >> 100644 >> --- a/net/key/af_key.c >> +++ b/net/key/af_key.c >> @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff >> *skb, struct sk_buff **skb2, >> { >> int err = -ENOBUFS; >> >> - sock_hold(sk); >> + if (!refcount_inc_not_zero(>sk_refcnt)) >> + return -ENOENT; >> + >> if (*skb2 == NULL) { >> if (refcount_read(>users) != 1) { >> *skb2 = skb_clone(skb, allocation); > > Hi Eric, > > I'm not sure that the socket getting freed before the rcu_read_lock() would > be an issue, since then it would no longer be in the net_pkey->table that > we loop through (since we call pfkey_remove() from pfkey_relase()). Because of > that, all the sockets processed in pfkey_broadcast_one() have valid refcounts, > so checking for zero there doesn't prevent the crash that I'm seeing. > > However, after going over the call flow again, I see that the actual problem > occurs because of pfkey_broadcast_one(). Specifically, because of this check: > > if (*skb2 == NULL) { > if (refcount_read(>users) != 1) { > *skb2 = skb_clone(skb, allocation); > } else { > *skb2 = skb; > refcount_inc(>users); > } > } > > Since we always pass a freshly cloned SKB to this function, skb->users is > always 1, and skb2 just becomes skb. We then set skb2 (and thus skb) to > belong to the socket. 
> > If the socket we queue skb2 to frees this SKB (thereby decrementing its > refcount to 1) and the socket is freed before pfkey_broadcast() can > execute the kfree_skb(skb) on line 284, we will then attempt to run > sock_rfree() on an SKB with a dangling reference to this socket. > > Perhaps a cleaner solution here is to always clone the SKB in > pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls > in pfkey_broadcast() will never be passed an SKB with sock_rfree() as > its destructor, and we can avoid this race condition. As long as one skb has sock_rfree as its destructor, the socket attached to this skb cannot be released. There is no race here. Note that skb_clone() does not propagate the destructor. The issue here is that in the rcu lookup, we can find a socket that has been dismantled, with a 0 refcount. We must not use sock_hold() in this case, since we are not sure the socket refcount is not already 0. pfkey_broadcast() and pfkey_broadcast_one() violate basic RCU rules. When, in an RCU lookup, one wants to increment an object's refcount, it needs to be extra-careful, as I did in my proposal. Note that the race could be automatically detected with CONFIG_REFCOUNT_FULL=y
Re: [PATCH net] ixgbe: check return value of napi_complete_done()
On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote: > On 09/20/2018 12:01 PM, Song Liu wrote: > > The NIC driver should only enable interrupts when napi_complete_done() > > returns true. This patch adds the check for ixgbe. > > > > Cc: sta...@vger.kernel.org # 4.10+ > > Cc: Jeff Kirsher > > Suggested-by: Eric Dumazet > > Signed-off-by: Song Liu > > --- > > > Well, unfortunately we do not know why this is needed, > this is why I have not yet sent this patch formally. > > netpoll has correct synchronization : > > poll_napi() places into napi->poll_owner current cpu number before > calling poll_one_napi() > > netpoll_poll_lock() does also use napi->poll_owner > > When netpoll calls ixgbe poll() method, it passed a budget of 0, > meaning napi_complete_done() is not called. > > As long as we can not explain the problem properly in the changelog, > we should investigate, otherwise we will probably see coming dozens of > patches > trying to fix a 'potential hazard'. Agreed, which is why I have our validation and developers looking into it, while we test the current patch from Song. signature.asc Description: This is a digitally signed message part
[PATCH net] r8169: fix autoneg issue on resume with RTL8168E
It was reported that chip version 33 (RTL8168E) ends up with 10MBit/Half on a 1GBit link after resuming from S3 (with different link partners). For whatever reason the PHY on this chip doesn't properly start a renegotiation when soft-reset. Explicitly requesting a renegotiation fixes this. Signed-off-by: Heiner Kallweit Fixes: a2965f12fde6 ("r8169: remove rtl8169_set_speed_xmii") Reported-by: Neil MacLeod Tested-by: Neil MacLeod --- drivers/net/ethernet/realtek/r8169.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 1d8631303..7f4647c58 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -4069,6 +4069,15 @@ static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp) phy_speed_up(dev->phydev); genphy_soft_reset(dev->phydev); + + /* It was reported that chip version 33 ends up with 10MBit/Half on a +* 1GBit link after resuming from S3. For whatever reason the PHY on +* this chip doesn't properly start a renegotiation when soft-reset. +* Explicitly requesting a renegotiation fixes this. +*/ + if (tp->mac_version == RTL_GIGA_MAC_VER_33 && + dev->phydev->autoneg == AUTONEG_ENABLE) + phy_restart_aneg(dev->phydev); } static void rtl_rar_set(struct rtl8169_private *tp, u8 *addr) -- 2.19.0
Re: [PATCH net] ixgbe: check return value of napi_complete_done()
On 09/20/2018 12:01 PM, Song Liu wrote: > The NIC driver should only enable interrupts when napi_complete_done() > returns true. This patch adds the check for ixgbe. > > Cc: sta...@vger.kernel.org # 4.10+ > Cc: Jeff Kirsher > Suggested-by: Eric Dumazet > Signed-off-by: Song Liu > --- Well, unfortunately we do not know why this is needed, this is why I have not yet sent this patch formally. netpoll has correct synchronization: poll_napi() places the current cpu number into napi->poll_owner before calling poll_one_napi() netpoll_poll_lock() does also use napi->poll_owner When netpoll calls the ixgbe poll() method, it passes a budget of 0, meaning napi_complete_done() is not called. As long as we cannot explain the problem properly in the changelog, we should investigate, otherwise we will probably see dozens of patches coming that try to fix a 'potential hazard'. Thanks.
[PATCH net-next v2] net: phy: don't reschedule state machine when PHY is halted
When being in state PHY_HALTED we don't have to reschedule the state machine, phy_start() will start it again. Signed-off-by: Heiner Kallweit --- v2: - added comment --- drivers/net/phy/phy.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index a5e0f0721..a1f8e4816 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -1123,9 +1123,13 @@ void phy_state_machine(struct work_struct *work) /* Only re-schedule a PHY state machine change if we are polling the * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving -* between states from phy_mac_interrupt() +* between states from phy_mac_interrupt(). +* +* In state PHY_HALTED the PHY gets suspended, so rescheduling the +* state machine would be pointless and possibly error prone when +* called from phy_disconnect() synchronously. */ - if (phy_polling_mode(phydev)) + if (phy_polling_mode(phydev) && old_state != PHY_HALTED) queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, PHY_STATE_TIME * HZ); } -- 2.19.0
[PATCH net] netlabel: check for IPV4MASK in addrinfo_get
netlbl_unlabel_addrinfo_get() assumes that if it finds the NLBL_UNLABEL_A_IPV4ADDR attribute, it must also have the NLBL_UNLABEL_A_IPV4MASK attribute as well. However, this is not necessarily the case as the current checks in netlbl_unlabel_staticadd() and friends are not sufficient to enforce this. If passed a netlink message with NLBL_UNLABEL_A_IPV4ADDR, NLBL_UNLABEL_A_IPV6ADDR, and NLBL_UNLABEL_A_IPV6MASK attributes, these functions will all call netlbl_unlabel_addrinfo_get() which will then attempt to dereference NULL when fetching the non-existent NLBL_UNLABEL_A_IPV4MASK attribute: Unable to handle kernel NULL pointer dereference at virtual address 0 Process unlab (pid: 31762, stack limit = 0xff80502d8000) Call trace: netlbl_unlabel_addrinfo_get+0x44/0xd8 netlbl_unlabel_staticremovedef+0x98/0xe0 genl_rcv_msg+0x354/0x388 netlink_rcv_skb+0xac/0x118 genl_rcv+0x34/0x48 netlink_unicast+0x158/0x1f0 netlink_sendmsg+0x32c/0x338 sock_sendmsg+0x44/0x60 ___sys_sendmsg+0x1d0/0x2a8 __sys_sendmsg+0x64/0xb4 SyS_sendmsg+0x34/0x4c el0_svc_naked+0x34/0x38 Code: 51001149 7100113f 54a0 f9401508 (79400108) ---[ end trace f6438a488e737143 ]--- Kernel panic - not syncing: Fatal exception Signed-off-by: Sean Tranchetti --- net/netlabel/netlabel_unlabeled.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/netlabel/netlabel_unlabeled.c b/net/netlabel/netlabel_unlabeled.c index c070dfc..c92894c 100644 --- a/net/netlabel/netlabel_unlabeled.c +++ b/net/netlabel/netlabel_unlabeled.c @@ -781,7 +781,8 @@ static int netlbl_unlabel_addrinfo_get(struct genl_info *info, { u32 addr_len; - if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR]) { + if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR] && + info->attrs[NLBL_UNLABEL_A_IPV4MASK]) { addr_len = nla_len(info->attrs[NLBL_UNLABEL_A_IPV4ADDR]); if (addr_len != sizeof(struct in_addr) && addr_len != nla_len(info->attrs[NLBL_UNLABEL_A_IPV4MASK])) -- 1.9.1
Re: [PATCH rdma-next v1 0/7] Preparation to DevX extension series
On Thu, Sep 20, 2018 at 9:35 PM, Leon Romanovsky wrote: > From: Leon Romanovsky > Set uid as part of various IB commands so that the firmware can manage > the IB object in a secured way. This series deals with FS core and commands used by the Ethernet driver > net/mlx5: Update mlx5_ifc with DEVX UID bits > net/mlx5: Set uid as part of CQ commands > net/mlx5: Set uid as part of RQ commands > net/mlx5: Set uid as part of SQ commands such as the life-cycle (create, modify, destroy) of objects like: RQ/SQ/CQ/etc -- saying IB commands is inaccurate, please be more precise > > The firmware should mark this IB object with the given uid so that it > can be used later on only by objects with the same uid. these are not IB objects, but rather Ethernet objects > > Upon DEVX flows that use this objec, the pointed object must have > the same uid as of the issuer uid command. typo - objec --> object > > When a command is issued with uid=0 it means that the issuer of the > command is trusted (i.e. kernel), in that case any pointed object > can be used regardless of its uid. > > Thanks > > net/mlx5: Set uid as part of SRQ commands > net/mlx5: Set uid as part of DCT commands > net/mlx5: Set uid as part of QP commands
Re: [PATCH mlx5-next v1 7/7] net/mlx5: Update mlx5_ifc with DEVX UID bits
On Thu, Sep 20, 2018 at 9:35 PM, Leon Romanovsky wrote: > Add DEVX information to WQ, SRQ, CQ, TRI, TIS, QP, > RQ, XRCD, PD, MKEY and MCG. typo - TRI --> TIR "DEVX information" is a bit cryptic - what does xxx_valid means for the devx use case (fw command is issued by devx process) - and what semantics does it have for non-devx callers -- e.g the Eth driver when they deal with tir/tis/etc Explain and elaborate further at the change log Or.
Re: [PATCH mlx5-next v1 7/7] net/mlx5: Update mlx5_ifc with DEVX UID bits
On Thu, Sep 20, 2018 at 9:35 PM, Leon Romanovsky wrote: > From: Leon Romanovsky > > Add DEVX information to WQ, SRQ, CQ, TRI, TIS, QP, > RQ, XRCD, PD, MKEY and MCG. > > Signed-off-by: Leon Romanovsky > --- > include/linux/mlx5/mlx5_ifc.h | 67 > +++ > 1 file changed, 43 insertions(+), 24 deletions(-) > > diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h > index efa4a60431d4..0f460fb22c31 100644 > --- a/include/linux/mlx5/mlx5_ifc.h > +++ b/include/linux/mlx5/mlx5_ifc.h > @@ -1291,7 +1291,9 @@ struct mlx5_ifc_wq_bits { > u8 reserved_at_118[0x3]; > u8 log_wq_sz[0x5]; > > - u8 reserved_at_120[0x3]; > + u8 dbr_umem_valid[0x1]; > + u8 wq_umem_valid[0x1]; > + u8 reserved_at_122[0x1]; > u8 log_hairpin_num_packets[0x5]; > u8 reserved_at_128[0x3]; > u8 log_hairpin_data_sz[0x5]; > @@ -2365,7 +2367,10 @@ struct mlx5_ifc_qpc_bits { > > u8 dc_access_key[0x40]; > > - u8 reserved_at_680[0xc0]; > + u8 reserved_at_680[0x3]; > + u8 dbr_umem_valid[0x1]; > + > + u8 reserved_at_684[0xbc]; > }; > > struct mlx5_ifc_roce_addr_layout_bits { > @@ -2465,7 +2470,7 @@ struct mlx5_ifc_xrc_srqc_bits { > > u8 wq_signature[0x1]; > u8 cont_srq[0x1]; > - u8 reserved_at_22[0x1]; > + u8 dbr_umem_valid[0x1]; > u8 rlky[0x1]; > u8 basic_cyclic_rcv_wqe[0x1]; > u8 log_rq_stride[0x3]; > @@ -3129,7 +3134,9 @@ enum { > > struct mlx5_ifc_cqc_bits { > u8 status[0x4]; > - u8 reserved_at_4[0x4]; > + u8 reserved_at_4[0x2]; > + u8 dbr_umem_valid[0x1]; > + u8 reserved_at_7[0x1]; > u8 cqe_sz[0x3]; > u8 cc[0x1]; > u8 reserved_at_c[0x1]; > @@ -5315,7 +5322,7 @@ struct mlx5_ifc_modify_tis_bitmask_bits { > > struct mlx5_ifc_modify_tis_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -5354,7 +5361,7 @@ struct mlx5_ifc_modify_tir_out_bits { > > struct mlx5_ifc_modify_tir_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -5455,7 +5462,7 @@ 
struct mlx5_ifc_rqt_bitmask_bits { > > struct mlx5_ifc_modify_rqt_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -5642,7 +5649,10 @@ struct mlx5_ifc_modify_cq_in_bits { > > struct mlx5_ifc_cqc_bits cq_context; > > - u8 reserved_at_280[0x600]; > + u8 reserved_at_280[0x40]; > + > + u8 cq_umem_valid[0x1]; > + u8 reserved_at_2c1[0x5bf]; > > u8 pas[0][0x40]; > }; > @@ -5963,7 +5973,7 @@ struct mlx5_ifc_detach_from_mcg_out_bits { > > struct mlx5_ifc_detach_from_mcg_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -6031,7 +6041,7 @@ struct mlx5_ifc_destroy_tis_out_bits { > > struct mlx5_ifc_destroy_tis_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -6053,7 +6063,7 @@ struct mlx5_ifc_destroy_tir_out_bits { > > struct mlx5_ifc_destroy_tir_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -6143,7 +6153,7 @@ struct mlx5_ifc_destroy_rqt_out_bits { > > struct mlx5_ifc_destroy_rqt_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -6508,7 +6518,7 @@ struct mlx5_ifc_dealloc_xrcd_out_bits { > > struct mlx5_ifc_dealloc_xrcd_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8 op_mod[0x10]; > @@ -6596,7 +6606,7 @@ struct mlx5_ifc_dealloc_pd_out_bits { > > struct mlx5_ifc_dealloc_pd_in_bits { > u8 opcode[0x10]; > - u8 reserved_at_10[0x10]; > + u8 uid[0x10]; > > u8 reserved_at_20[0x10]; > u8
Re: [PATCH net] af_key: free SKBs under RCU protection
I do not believe the changelog or the patch makes sense. Having skb still referencing a socket prevents this socket being released. If you think about it, what would prevent the freeing happening _before_ the rcu_read_lock() in pfkey_broadcast()? Maybe the correct fix is that pfkey_broadcast_one() should ensure the socket is still valid. I would suggest something like: diff --git a/net/key/af_key.c b/net/key/af_key.c index 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff *skb, struct sk_buff **skb2, { int err = -ENOBUFS; - sock_hold(sk); + if (!refcount_inc_not_zero(&sk->sk_refcnt)) + return -ENOENT; + if (*skb2 == NULL) { if (refcount_read(&skb->users) != 1) { *skb2 = skb_clone(skb, allocation); Hi Eric, I'm not sure that the socket getting freed before the rcu_read_lock() would be an issue, since then it would no longer be in the net_pkey->table that we loop through (since we call pfkey_remove() from pfkey_release()). Because of that, all the sockets processed in pfkey_broadcast_one() have valid refcounts, so checking for zero there doesn't prevent the crash that I'm seeing. However, after going over the call flow again, I see that the actual problem occurs because of pfkey_broadcast_one(). Specifically, because of this check: if (*skb2 == NULL) { if (refcount_read(&skb->users) != 1) { *skb2 = skb_clone(skb, allocation); } else { *skb2 = skb; refcount_inc(&skb->users); } } Since we always pass a freshly cloned SKB to this function, skb->users is always 1, and skb2 just becomes skb. We then set skb2 (and thus skb) to belong to the socket. If the socket we queue skb2 to frees this SKB (thereby decrementing its refcount to 1) and the socket is freed before pfkey_broadcast() can execute the kfree_skb(skb) on line 284, we will then attempt to run sock_rfree() on an SKB with a dangling reference to this socket.
Perhaps a cleaner solution here is to always clone the SKB in pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls in pfkey_broadcast() will never be passed an SKB with sock_rfree() as its destructor, and we can avoid this race condition.
[PATCH net] ixgbe: check return value of napi_complete_done()
The NIC driver should only enable interrupts when napi_complete_done() returns true. This patch adds the check for ixgbe. Cc: sta...@vger.kernel.org # 4.10+ Cc: Jeff Kirsher Suggested-by: Eric Dumazet Signed-off-by: Song Liu --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 9a23d33a47ed..1497df6a8dff 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -3196,11 +3196,13 @@ int ixgbe_poll(struct napi_struct *napi, int budget) return budget; /* all work done, exit the polling mode */ - napi_complete_done(napi, work_done); - if (adapter->rx_itr_setting & 1) - ixgbe_set_itr(q_vector); - if (!test_bit(__IXGBE_DOWN, &adapter->state)) - ixgbe_irq_enable_queues(adapter, BIT_ULL(q_vector->v_idx)); + if (likely(napi_complete_done(napi, work_done))) { + if (adapter->rx_itr_setting & 1) + ixgbe_set_itr(q_vector); + if (!test_bit(__IXGBE_DOWN, &adapter->state)) + ixgbe_irq_enable_queues(adapter, + BIT_ULL(q_vector->v_idx)); + } return min(work_done, budget - 1); } -- 2.17.1
[PATCH mlx5-next v1 6/7] net/mlx5: Set uid as part of DCT commands
From: Yishai Hadas Set uid as part of DCT commands so that the firmware can manage the DCT object in a secured way. That will enable using a DCT that was created by verbs application to be used by the DEVX flow in case the uid is equal. Signed-off-by: Yishai Hadas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx5/core/qp.c | 4 include/linux/mlx5/mlx5_ifc.h| 6 +++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c index d9b12136cbfd..91b8139a388d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c @@ -211,6 +211,7 @@ int mlx5_core_create_dct(struct mlx5_core_dev *dev, } qp->qpn = MLX5_GET(create_dct_out, out, dctn); + qp->uid = MLX5_GET(create_dct_in, in, uid); err = create_resource_common(dev, qp, MLX5_RES_DCT); if (err) goto err_cmd; @@ -219,6 +220,7 @@ int mlx5_core_create_dct(struct mlx5_core_dev *dev, err_cmd: MLX5_SET(destroy_dct_in, din, opcode, MLX5_CMD_OP_DESTROY_DCT); MLX5_SET(destroy_dct_in, din, dctn, qp->qpn); + MLX5_SET(destroy_dct_in, din, uid, qp->uid); mlx5_cmd_exec(dev, (void *)&din, sizeof(din), (void *)&dout, sizeof(dout)); return err; @@ -277,6 +279,7 @@ static int mlx5_core_drain_dct(struct mlx5_core_dev *dev, MLX5_SET(drain_dct_in, in, opcode, MLX5_CMD_OP_DRAIN_DCT); MLX5_SET(drain_dct_in, in, dctn, qp->qpn); + MLX5_SET(drain_dct_in, in, uid, qp->uid); return mlx5_cmd_exec(dev, (void *)&in, sizeof(in), (void *)&out, sizeof(out)); } @@ -303,6 +306,7 @@ int mlx5_core_destroy_dct(struct mlx5_core_dev *dev, destroy_resource_common(dev, &dct->mqp); MLX5_SET(destroy_dct_in, in, opcode, MLX5_CMD_OP_DESTROY_DCT); MLX5_SET(destroy_dct_in, in, dctn, qp->qpn); + MLX5_SET(destroy_dct_in, in, uid, qp->uid); err = mlx5_cmd_exec(dev, (void *)&in, sizeof(in), (void *)&out, sizeof(out)); return err; diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 5a2f0b02483a..efa4a60431d4
100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -5919,7 +5919,7 @@ struct mlx5_ifc_drain_dct_out_bits { struct mlx5_ifc_drain_dct_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6383,7 +6383,7 @@ struct mlx5_ifc_destroy_dct_out_bits { struct mlx5_ifc_destroy_dct_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -7139,7 +7139,7 @@ struct mlx5_ifc_create_dct_out_bits { struct mlx5_ifc_create_dct_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; -- 2.14.4
[PATCH mlx5-next v1 7/7] net/mlx5: Update mlx5_ifc with DEVX UID bits
From: Leon Romanovsky Add DEVX information to WQ, SRQ, CQ, TRI, TIS, QP, RQ, XRCD, PD, MKEY and MCG. Signed-off-by: Leon Romanovsky --- include/linux/mlx5/mlx5_ifc.h | 67 +++ 1 file changed, 43 insertions(+), 24 deletions(-) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index efa4a60431d4..0f460fb22c31 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -1291,7 +1291,9 @@ struct mlx5_ifc_wq_bits { u8 reserved_at_118[0x3]; u8 log_wq_sz[0x5]; - u8 reserved_at_120[0x3]; + u8 dbr_umem_valid[0x1]; + u8 wq_umem_valid[0x1]; + u8 reserved_at_122[0x1]; u8 log_hairpin_num_packets[0x5]; u8 reserved_at_128[0x3]; u8 log_hairpin_data_sz[0x5]; @@ -2365,7 +2367,10 @@ struct mlx5_ifc_qpc_bits { u8 dc_access_key[0x40]; - u8 reserved_at_680[0xc0]; + u8 reserved_at_680[0x3]; + u8 dbr_umem_valid[0x1]; + + u8 reserved_at_684[0xbc]; }; struct mlx5_ifc_roce_addr_layout_bits { @@ -2465,7 +2470,7 @@ struct mlx5_ifc_xrc_srqc_bits { u8 wq_signature[0x1]; u8 cont_srq[0x1]; - u8 reserved_at_22[0x1]; + u8 dbr_umem_valid[0x1]; u8 rlky[0x1]; u8 basic_cyclic_rcv_wqe[0x1]; u8 log_rq_stride[0x3]; @@ -3129,7 +3134,9 @@ enum { struct mlx5_ifc_cqc_bits { u8 status[0x4]; - u8 reserved_at_4[0x4]; + u8 reserved_at_4[0x2]; + u8 dbr_umem_valid[0x1]; + u8 reserved_at_7[0x1]; u8 cqe_sz[0x3]; u8 cc[0x1]; u8 reserved_at_c[0x1]; @@ -5315,7 +5322,7 @@ struct mlx5_ifc_modify_tis_bitmask_bits { struct mlx5_ifc_modify_tis_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -5354,7 +5361,7 @@ struct mlx5_ifc_modify_tir_out_bits { struct mlx5_ifc_modify_tir_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -5455,7 +5462,7 @@ struct mlx5_ifc_rqt_bitmask_bits { struct mlx5_ifc_modify_rqt_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -5642,7 +5649,10 @@ struct 
mlx5_ifc_modify_cq_in_bits { struct mlx5_ifc_cqc_bits cq_context; - u8 reserved_at_280[0x600]; + u8 reserved_at_280[0x40]; + + u8 cq_umem_valid[0x1]; + u8 reserved_at_2c1[0x5bf]; u8 pas[0][0x40]; }; @@ -5963,7 +5973,7 @@ struct mlx5_ifc_detach_from_mcg_out_bits { struct mlx5_ifc_detach_from_mcg_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6031,7 +6041,7 @@ struct mlx5_ifc_destroy_tis_out_bits { struct mlx5_ifc_destroy_tis_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6053,7 +6063,7 @@ struct mlx5_ifc_destroy_tir_out_bits { struct mlx5_ifc_destroy_tir_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6143,7 +6153,7 @@ struct mlx5_ifc_destroy_rqt_out_bits { struct mlx5_ifc_destroy_rqt_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6508,7 +6518,7 @@ struct mlx5_ifc_dealloc_xrcd_out_bits { struct mlx5_ifc_dealloc_xrcd_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6596,7 +6606,7 @@ struct mlx5_ifc_dealloc_pd_out_bits { struct mlx5_ifc_dealloc_pd_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6675,7 +6685,9 @@ struct mlx5_ifc_create_xrc_srq_in_bits { struct mlx5_ifc_xrc_srqc_bits xrc_srq_context_entry; - u8 reserved_at_280[0x600]; + u8 reserved_at_280[0x40]; + u8 xrc_srq_umem_valid[0x1]; + u8 reserved_at_2c1[0x5bf]; u8
[PATCH mlx5-next v1 5/7] net/mlx5: Set uid as part of SRQ commands
From: Yishai Hadas Set uid as part of SRQ commands so that the firmware can manage the SRQ object in a secured way. That will enable using an SRQ that was created by verbs application to be used by the DEVX flow in case the uid is equal. Signed-off-by: Yishai Hadas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx5/core/srq.c | 30 --- include/linux/mlx5/driver.h | 1 + include/linux/mlx5/mlx5_ifc.h | 22 ++-- include/linux/mlx5/srq.h | 1 + 4 files changed, 40 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/srq.c b/drivers/net/ethernet/mellanox/mlx5/core/srq.c index 23cc337a96c9..5c519615fb1c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/srq.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/srq.c @@ -166,6 +166,7 @@ static int create_srq_cmd(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, if (!create_in) return -ENOMEM; + MLX5_SET(create_srq_in, create_in, uid, in->uid); srqc = MLX5_ADDR_OF(create_srq_in, create_in, srq_context_entry); pas = MLX5_ADDR_OF(create_srq_in, create_in, pas); @@ -178,8 +179,10 @@ static int create_srq_cmd(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, err = mlx5_cmd_exec(dev, create_in, inlen, create_out, sizeof(create_out)); kvfree(create_in); - if (!err) + if (!err) { srq->srqn = MLX5_GET(create_srq_out, create_out, srqn); + srq->uid = in->uid; + } return err; } @@ -193,6 +196,7 @@ static int destroy_srq_cmd(struct mlx5_core_dev *dev, MLX5_SET(destroy_srq_in, srq_in, opcode, MLX5_CMD_OP_DESTROY_SRQ); MLX5_SET(destroy_srq_in, srq_in, srqn, srq->srqn); + MLX5_SET(destroy_srq_in, srq_in, uid, srq->uid); return mlx5_cmd_exec(dev, srq_in, sizeof(srq_in), srq_out, sizeof(srq_out)); @@ -208,6 +212,7 @@ static int arm_srq_cmd(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, MLX5_SET(arm_rq_in, srq_in, op_mod, MLX5_ARM_RQ_IN_OP_MOD_SRQ); MLX5_SET(arm_rq_in, srq_in, srq_number, srq->srqn); MLX5_SET(arm_rq_in, srq_in, lwm, lwm); + MLX5_SET(arm_rq_in, srq_in, uid, srq->uid); 
return mlx5_cmd_exec(dev, srq_in, sizeof(srq_in), srq_out, sizeof(srq_out)); @@ -260,6 +265,7 @@ static int create_xrc_srq_cmd(struct mlx5_core_dev *dev, if (!create_in) return -ENOMEM; + MLX5_SET(create_xrc_srq_in, create_in, uid, in->uid); xrc_srqc = MLX5_ADDR_OF(create_xrc_srq_in, create_in, xrc_srq_context_entry); pas = MLX5_ADDR_OF(create_xrc_srq_in, create_in, pas); @@ -277,6 +283,7 @@ static int create_xrc_srq_cmd(struct mlx5_core_dev *dev, goto out; srq->srqn = MLX5_GET(create_xrc_srq_out, create_out, xrc_srqn); + srq->uid = in->uid; out: kvfree(create_in); return err; @@ -291,6 +298,7 @@ static int destroy_xrc_srq_cmd(struct mlx5_core_dev *dev, MLX5_SET(destroy_xrc_srq_in, xrcsrq_in, opcode, MLX5_CMD_OP_DESTROY_XRC_SRQ); MLX5_SET(destroy_xrc_srq_in, xrcsrq_in, xrc_srqn, srq->srqn); + MLX5_SET(destroy_xrc_srq_in, xrcsrq_in, uid, srq->uid); return mlx5_cmd_exec(dev, xrcsrq_in, sizeof(xrcsrq_in), xrcsrq_out, sizeof(xrcsrq_out)); @@ -306,6 +314,7 @@ static int arm_xrc_srq_cmd(struct mlx5_core_dev *dev, MLX5_SET(arm_xrc_srq_in, xrcsrq_in, op_mod, MLX5_ARM_XRC_SRQ_IN_OP_MOD_XRC_SRQ); MLX5_SET(arm_xrc_srq_in, xrcsrq_in, xrc_srqn, srq->srqn); MLX5_SET(arm_xrc_srq_in, xrcsrq_in, lwm, lwm); + MLX5_SET(arm_xrc_srq_in, xrcsrq_in, uid, srq->uid); return mlx5_cmd_exec(dev, xrcsrq_in, sizeof(xrcsrq_in), xrcsrq_out, sizeof(xrcsrq_out)); @@ -365,10 +374,13 @@ static int create_rmp_cmd(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, wq = MLX5_ADDR_OF(rmpc, rmpc, wq); MLX5_SET(rmpc, rmpc, state, MLX5_RMPC_STATE_RDY); + MLX5_SET(create_rmp_in, create_in, uid, in->uid); set_wq(wq, in); memcpy(MLX5_ADDR_OF(rmpc, rmpc, wq.pas), in->pas, pas_size); err = mlx5_core_create_rmp(dev, create_in, inlen, >srqn); + if (!err) + srq->uid = in->uid; kvfree(create_in); return err; @@ -377,7 +389,13 @@ static int create_rmp_cmd(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, static int destroy_rmp_cmd(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq) { - return 
mlx5_core_destroy_rmp(dev, srq->srqn); + u32 in[MLX5_ST_SZ_DW(destroy_rmp_in)] = {}; + u32 out[MLX5_ST_SZ_DW(destroy_rmp_out)] = {}; + + MLX5_SET(destroy_rmp_in, in,
[PATCH mlx5-next v1 2/7] net/mlx5: Set uid as part of QP commands
From: Yishai Hadas Set uid as part of QP commands so that the firmware can manage the QP object in a secured way. That will enable using a QP that was created by verbs application to be used by the DEVX flow in case the uid is equal. Signed-off-by: Yishai Hadas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx5/core/qp.c | 45 +--- include/linux/mlx5/mlx5_ifc.h| 22 +++--- include/linux/mlx5/qp.h | 1 + 3 files changed, 39 insertions(+), 29 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c index 4ca07bfb6b14..4e2ab3c916bf 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c @@ -240,6 +240,7 @@ int mlx5_core_create_qp(struct mlx5_core_dev *dev, if (err) return err; + qp->uid = MLX5_GET(create_qp_in, in, uid); qp->qpn = MLX5_GET(create_qp_out, out, qpn); mlx5_core_dbg(dev, "qpn = 0x%x\n", qp->qpn); @@ -261,6 +262,7 @@ int mlx5_core_create_qp(struct mlx5_core_dev *dev, memset(dout, 0, sizeof(dout)); MLX5_SET(destroy_qp_in, din, opcode, MLX5_CMD_OP_DESTROY_QP); MLX5_SET(destroy_qp_in, din, qpn, qp->qpn); + MLX5_SET(destroy_qp_in, din, uid, qp->uid); mlx5_cmd_exec(dev, din, sizeof(din), dout, sizeof(dout)); return err; } @@ -320,6 +322,7 @@ int mlx5_core_destroy_qp(struct mlx5_core_dev *dev, MLX5_SET(destroy_qp_in, in, opcode, MLX5_CMD_OP_DESTROY_QP); MLX5_SET(destroy_qp_in, in, qpn, qp->qpn); + MLX5_SET(destroy_qp_in, in, uid, qp->uid); err = mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); if (err) return err; @@ -373,7 +376,7 @@ static void mbox_free(struct mbox_info *mbox) static int modify_qp_mbox_alloc(struct mlx5_core_dev *dev, u16 opcode, int qpn, u32 opt_param_mask, void *qpc, - struct mbox_info *mbox) + struct mbox_info *mbox, u16 uid) { mbox->out = NULL; mbox->in = NULL; @@ -381,26 +384,32 @@ static int modify_qp_mbox_alloc(struct mlx5_core_dev *dev, u16 opcode, int qpn, #define MBOX_ALLOC(mbox, typ) \ 
mbox_alloc(mbox, MLX5_ST_SZ_BYTES(typ##_in), MLX5_ST_SZ_BYTES(typ##_out)) -#define MOD_QP_IN_SET(typ, in, _opcode, _qpn) \ - MLX5_SET(typ##_in, in, opcode, _opcode); \ - MLX5_SET(typ##_in, in, qpn, _qpn) - -#define MOD_QP_IN_SET_QPC(typ, in, _opcode, _qpn, _opt_p, _qpc) \ - MOD_QP_IN_SET(typ, in, _opcode, _qpn); \ - MLX5_SET(typ##_in, in, opt_param_mask, _opt_p); \ - memcpy(MLX5_ADDR_OF(typ##_in, in, qpc), _qpc, MLX5_ST_SZ_BYTES(qpc)) +#define MOD_QP_IN_SET(typ, in, _opcode, _qpn, _uid) \ + do { \ + MLX5_SET(typ##_in, in, opcode, _opcode); \ + MLX5_SET(typ##_in, in, qpn, _qpn); \ + MLX5_SET(typ##_in, in, uid, _uid); \ + } while (0) + +#define MOD_QP_IN_SET_QPC(typ, in, _opcode, _qpn, _opt_p, _qpc, _uid) \ + do { \ + MOD_QP_IN_SET(typ, in, _opcode, _qpn, _uid); \ + MLX5_SET(typ##_in, in, opt_param_mask, _opt_p);\ + memcpy(MLX5_ADDR_OF(typ##_in, in, qpc), _qpc, \ + MLX5_ST_SZ_BYTES(qpc)); \ + } while (0) switch (opcode) { /* 2RST & 2ERR */ case MLX5_CMD_OP_2RST_QP: if (MBOX_ALLOC(mbox, qp_2rst)) return -ENOMEM; - MOD_QP_IN_SET(qp_2rst, mbox->in, opcode, qpn); + MOD_QP_IN_SET(qp_2rst, mbox->in, opcode, qpn, uid); break; case MLX5_CMD_OP_2ERR_QP: if (MBOX_ALLOC(mbox, qp_2err)) return -ENOMEM; - MOD_QP_IN_SET(qp_2err, mbox->in, opcode, qpn); + MOD_QP_IN_SET(qp_2err, mbox->in, opcode, qpn, uid); break; /* MODIFY with QPC */ @@ -408,37 +417,37 @@ static int modify_qp_mbox_alloc(struct mlx5_core_dev *dev, u16 opcode, int qpn, if (MBOX_ALLOC(mbox, rst2init_qp)) return -ENOMEM; MOD_QP_IN_SET_QPC(rst2init_qp, mbox->in, opcode, qpn, - opt_param_mask, qpc); + opt_param_mask, qpc, uid); break; case MLX5_CMD_OP_INIT2RTR_QP: if (MBOX_ALLOC(mbox, init2rtr_qp)) return -ENOMEM; MOD_QP_IN_SET_QPC(init2rtr_qp, mbox->in, opcode, qpn, -
[PATCH rdma-next v1 0/7] Preparation to DevX extension series
From: Leon Romanovsky Changelog v0->v1: * Update commit messages * Split the DevX series into small sub-series. * Change static initialization from {0} to {} --- From Yishai, Set uid as part of various IB commands so that the firmware can manage the IB object in a secured way. The firmware should mark this IB object with the given uid so that it can be used later on only by objects with the same uid. Upon DEVX flows that use this object, the pointed object must have the same uid as the issuer of the command. When a command is issued with uid=0 it means that the issuer of the command is trusted (i.e. kernel); in that case any pointed object can be used regardless of its uid. Thanks Leon Romanovsky (1): net/mlx5: Update mlx5_ifc with DEVX UID bits Yishai Hadas (6): net/mlx5: Set uid as part of CQ commands net/mlx5: Set uid as part of QP commands net/mlx5: Set uid as part of RQ commands net/mlx5: Set uid as part of SQ commands net/mlx5: Set uid as part of SRQ commands net/mlx5: Set uid as part of DCT commands drivers/net/ethernet/mellanox/mlx5/core/cq.c | 4 + drivers/net/ethernet/mellanox/mlx5/core/qp.c | 81 +++- drivers/net/ethernet/mellanox/mlx5/core/srq.c | 30 +- include/linux/mlx5/cq.h | 1 + include/linux/mlx5/driver.h | 1 + include/linux/mlx5/mlx5_ifc.h | 135 +++--- include/linux/mlx5/qp.h | 1 + include/linux/mlx5/srq.h | 1 + 8 files changed, 171 insertions(+), 83 deletions(-) -- 2.14.4
[PATCH mlx5-next v1 1/7] net/mlx5: Set uid as part of CQ commands
From: Yishai Hadas Set uid as part of CQ commands so that the firmware can manage the CQ object in a secured way. The firmware should mark this CQ with the given uid so that it can be used later on only by objects with the same uid. Upon DEVX flows that use this CQ (e.g. create QP command), the pointed CQ must have the same uid as of the issuer uid command. When a command is issued with uid=0 it means that the issuer of the command is trusted (i.e. kernel), in that case any pointed object can be used regardless of its uid. Signed-off-by: Yishai Hadas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx5/core/cq.c | 4 include/linux/mlx5/cq.h | 1 + include/linux/mlx5/mlx5_ifc.h| 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c index a4179122a279..4b85abb5c9f7 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c @@ -109,6 +109,7 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq, cq->cons_index = 0; cq->arm_sn = 0; cq->eq = eq; + cq->uid = MLX5_GET(create_cq_in, in, uid); refcount_set(>refcount, 1); init_completion(>free); if (!cq->comp) @@ -144,6 +145,7 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq, memset(dout, 0, sizeof(dout)); MLX5_SET(destroy_cq_in, din, opcode, MLX5_CMD_OP_DESTROY_CQ); MLX5_SET(destroy_cq_in, din, cqn, cq->cqn); + MLX5_SET(destroy_cq_in, din, uid, cq->uid); mlx5_cmd_exec(dev, din, sizeof(din), dout, sizeof(dout)); return err; } @@ -165,6 +167,7 @@ int mlx5_core_destroy_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq) MLX5_SET(destroy_cq_in, in, opcode, MLX5_CMD_OP_DESTROY_CQ); MLX5_SET(destroy_cq_in, in, cqn, cq->cqn); + MLX5_SET(destroy_cq_in, in, uid, cq->uid); err = mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); if (err) return err; @@ -196,6 +199,7 @@ int mlx5_core_modify_cq(struct 
mlx5_core_dev *dev, struct mlx5_core_cq *cq, u32 out[MLX5_ST_SZ_DW(modify_cq_out)] = {0}; MLX5_SET(modify_cq_in, in, opcode, MLX5_CMD_OP_MODIFY_CQ); + MLX5_SET(modify_cq_in, in, uid, cq->uid); return mlx5_cmd_exec(dev, in, inlen, out, sizeof(out)); } EXPORT_SYMBOL(mlx5_core_modify_cq); diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h index 0ef6138eca49..31a750570c38 100644 --- a/include/linux/mlx5/cq.h +++ b/include/linux/mlx5/cq.h @@ -61,6 +61,7 @@ struct mlx5_core_cq { int reset_notify_added; struct list_headreset_notify; struct mlx5_eq *eq; + u16 uid; }; diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index a14c4eaff53f..e62a0825d35c 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -5630,7 +5630,7 @@ enum { struct mlx5_ifc_modify_cq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6405,7 +6405,7 @@ struct mlx5_ifc_destroy_cq_out_bits { struct mlx5_ifc_destroy_cq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -7165,7 +7165,7 @@ struct mlx5_ifc_create_cq_out_bits { struct mlx5_ifc_create_cq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; -- 2.14.4
[PATCH mlx5-next v1 3/7] net/mlx5: Set uid as part of RQ commands
From: Yishai Hadas Set uid as part of RQ commands so that the firmware can manage the RQ object in a secured way. That will enable using an RQ that was created by verbs application to be used by the DEVX flow in case the uid is equal. Signed-off-by: Yishai Hadas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx5/core/qp.c | 16 ++-- include/linux/mlx5/mlx5_ifc.h| 6 +++--- 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c index 4e2ab3c916bf..f57e08d4f970 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c @@ -540,6 +540,17 @@ int mlx5_core_xrcd_dealloc(struct mlx5_core_dev *dev, u32 xrcdn) } EXPORT_SYMBOL_GPL(mlx5_core_xrcd_dealloc); +static void destroy_rq_tracked(struct mlx5_core_dev *dev, u32 rqn, u16 uid) +{ + u32 in[MLX5_ST_SZ_DW(destroy_rq_in)] = {}; + u32 out[MLX5_ST_SZ_DW(destroy_rq_out)] = {}; + + MLX5_SET(destroy_rq_in, in, opcode, MLX5_CMD_OP_DESTROY_RQ); + MLX5_SET(destroy_rq_in, in, rqn, rqn); + MLX5_SET(destroy_rq_in, in, uid, uid); + mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); +} + int mlx5_core_create_rq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen, struct mlx5_core_qp *rq) { @@ -550,6 +561,7 @@ int mlx5_core_create_rq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen, if (err) return err; + rq->uid = MLX5_GET(create_rq_in, in, uid); rq->qpn = rqn; err = create_resource_common(dev, rq, MLX5_RES_RQ); if (err) @@ -558,7 +570,7 @@ int mlx5_core_create_rq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen, return 0; err_destroy_rq: - mlx5_core_destroy_rq(dev, rq->qpn); + destroy_rq_tracked(dev, rq->qpn, rq->uid); return err; } @@ -568,7 +580,7 @@ void mlx5_core_destroy_rq_tracked(struct mlx5_core_dev *dev, struct mlx5_core_qp *rq) { destroy_resource_common(dev, rq); - mlx5_core_destroy_rq(dev, rq->qpn); + destroy_rq_tracked(dev, rq->qpn, rq->uid); } 
EXPORT_SYMBOL(mlx5_core_destroy_rq_tracked); diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index e5a0d3ecfaad..01b707666fb4 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -5489,7 +5489,7 @@ enum { struct mlx5_ifc_modify_rq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6165,7 +6165,7 @@ struct mlx5_ifc_destroy_rq_out_bits { struct mlx5_ifc_destroy_rq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6848,7 +6848,7 @@ struct mlx5_ifc_create_rq_out_bits { struct mlx5_ifc_create_rq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; -- 2.14.4
[PATCH mlx5-next v1 4/7] net/mlx5: Set uid as part of SQ commands
From: Yishai Hadas Set uid as part of SQ commands so that the firmware can manage the SQ object in a secured way. That will enable using an SQ that was created by verbs application to be used by the DEVX flow in case the uid is equal. Signed-off-by: Yishai Hadas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx5/core/qp.c | 16 ++-- include/linux/mlx5/mlx5_ifc.h| 6 +++--- 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c index f57e08d4f970..d9b12136cbfd 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c @@ -584,6 +584,17 @@ void mlx5_core_destroy_rq_tracked(struct mlx5_core_dev *dev, } EXPORT_SYMBOL(mlx5_core_destroy_rq_tracked); +static void destroy_sq_tracked(struct mlx5_core_dev *dev, u32 sqn, u16 uid) +{ + u32 in[MLX5_ST_SZ_DW(destroy_sq_in)] = {}; + u32 out[MLX5_ST_SZ_DW(destroy_sq_out)] = {}; + + MLX5_SET(destroy_sq_in, in, opcode, MLX5_CMD_OP_DESTROY_SQ); + MLX5_SET(destroy_sq_in, in, sqn, sqn); + MLX5_SET(destroy_sq_in, in, uid, uid); + mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); +} + int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen, struct mlx5_core_qp *sq) { @@ -594,6 +605,7 @@ int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen, if (err) return err; + sq->uid = MLX5_GET(create_sq_in, in, uid); sq->qpn = sqn; err = create_resource_common(dev, sq, MLX5_RES_SQ); if (err) @@ -602,7 +614,7 @@ int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen, return 0; err_destroy_sq: - mlx5_core_destroy_sq(dev, sq->qpn); + destroy_sq_tracked(dev, sq->qpn, sq->uid); return err; } @@ -612,7 +624,7 @@ void mlx5_core_destroy_sq_tracked(struct mlx5_core_dev *dev, struct mlx5_core_qp *sq) { destroy_resource_common(dev, sq); - mlx5_core_destroy_sq(dev, sq->qpn); + destroy_sq_tracked(dev, sq->qpn, sq->uid); } 
EXPORT_SYMBOL(mlx5_core_destroy_sq_tracked); diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 01b707666fb4..8151488f6570 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -5382,7 +5382,7 @@ struct mlx5_ifc_modify_sq_out_bits { struct mlx5_ifc_modify_sq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6097,7 +6097,7 @@ struct mlx5_ifc_destroy_sq_out_bits { struct mlx5_ifc_destroy_sq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; @@ -6770,7 +6770,7 @@ struct mlx5_ifc_create_sq_out_bits { struct mlx5_ifc_create_sq_in_bits { u8 opcode[0x10]; - u8 reserved_at_10[0x10]; + u8 uid[0x10]; u8 reserved_at_20[0x10]; u8 op_mod[0x10]; -- 2.14.4
Re: [PATCH net] sctp: update dst pmtu with the correct daddr
From: Xin Long Date: Thu, 20 Sep 2018 17:27:28 +0800 > When processing a pmtu update from an icmp packet, it calls .update_pmtu > with sk instead of skb in sctp_transport_update_pmtu. > > However for sctp, the daddr in the transport might be different from > inet_sock->inet_daddr or sk->sk_v6_daddr, which is used to update or > create the route cache. The incorrect daddr will cause a different > route cache to be created for the path. > > So before calling .update_pmtu, inet_sock->inet_daddr/sk->sk_v6_daddr > should be updated with the daddr in the transport, and updated back > after it's done. > > The issue has existed since the introduction of route exceptions. > > Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") > Reported-by: ian.per...@dialogic.com > Signed-off-by: Xin Long Applied and queued up for -stable. Although are you sure it's OK to temporarily change the socket's address like this? What if an asynchronous context looks at the socket state and sees the temporarily set address? Thanks.
[PATCH net-next 0/3] bnx2x: enhancements
From: Shahed Shaikh Hi Dave, This series adds the below changes - - support for VF spoof-check configuration through .ndo_set_vf_spoofchk. - a workaround for an MFW bug regarding an unexpected bandwidth notification in single function mode. - supplying the VF link status as part of get VF config handling. Please apply this to the net-next tree. Thanks, Shahed Shahed Shaikh (3): bnx2x: Add VF spoof-checking configuration bnx2x: Ignore bandwidth attention in single function mode bnx2x: Provide VF link status in ndo_get_vf_config drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 1 + drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 11 +++ drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 81 ++- drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h | 2 + 4 files changed, 93 insertions(+), 2 deletions(-) -- 1.8.3.1
[PATCH net-next 1/3] bnx2x: Add VF spoof-checking configuration
Add support for `ndo_set_vf_spoofchk' to allow PF control over its VF spoof-checking configuration. Signed-off-by: Shahed Shaikh Signed-off-by: Ariel Elior --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 1 + drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 + drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 80 ++- drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h | 2 + 4 files changed, 82 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h index 0e508e5..142bc11 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h @@ -494,6 +494,7 @@ int bnx2x_get_vf_config(struct net_device *dev, int vf, int bnx2x_set_vf_mac(struct net_device *dev, int queue, u8 *mac); int bnx2x_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan, u8 qos, __be16 vlan_proto); +int bnx2x_set_vf_spoofchk(struct net_device *dev, int idx, bool val); /* select_queue callback */ u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb, diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index 71362b7..faf64ba 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -13121,6 +13121,7 @@ static int bnx2x_vlan_rx_kill_vid(struct net_device *dev, __be16 proto, u16 vid) .ndo_set_vf_mac = bnx2x_set_vf_mac, .ndo_set_vf_vlan= bnx2x_set_vf_vlan, .ndo_get_vf_config = bnx2x_get_vf_config, + .ndo_set_vf_spoofchk= bnx2x_set_vf_spoofchk, #endif #ifdef NETDEV_FCOE_WWNN .ndo_fcoe_get_wwn = bnx2x_fcoe_get_wwn, diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c index 62da465..2f76b37 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c @@ -209,7 +209,10 @@ void bnx2x_vfop_qctor_prep(struct bnx2x 
*bp, */ __set_bit(BNX2X_Q_FLG_TX_SWITCH, _p->flags); __set_bit(BNX2X_Q_FLG_TX_SEC, _p->flags); - __set_bit(BNX2X_Q_FLG_ANTI_SPOOF, _p->flags); + if (vf->spoofchk) + __set_bit(BNX2X_Q_FLG_ANTI_SPOOF, _p->flags); + else + __clear_bit(BNX2X_Q_FLG_ANTI_SPOOF, _p->flags); /* Setup-op rx parameters */ if (test_bit(BNX2X_Q_TYPE_HAS_RX, _type)) { @@ -1269,6 +1272,8 @@ int bnx2x_iov_init_one(struct bnx2x *bp, int int_mode_param, bnx2x_vf(bp, i, state) = VF_FREE; mutex_init(_vf(bp, i, op_mutex)); bnx2x_vf(bp, i, op_current) = CHANNEL_TLV_NONE; + /* enable spoofchk by default */ + bnx2x_vf(bp, i, spoofchk) = 1; } /* re-read the IGU CAM for VFs - index and abs_vfid must be set */ @@ -2632,7 +2637,7 @@ int bnx2x_get_vf_config(struct net_device *dev, int vfidx, ivi->qos = 0; ivi->max_tx_rate = 1; /* always 10G. TBA take from link struct */ ivi->min_tx_rate = 0; - ivi->spoofchk = 1; /*always enabled */ + ivi->spoofchk = vf->spoofchk ? 1 : 0; if (vf->state == VF_ENABLED) { /* mac and vlan are in vlan_mac objects */ if (bnx2x_validate_vf_sp_objs(bp, vf, false)) { @@ -2950,6 +2955,77 @@ int bnx2x_set_vf_vlan(struct net_device *dev, int vfidx, u16 vlan, u8 qos, return rc; } +int bnx2x_set_vf_spoofchk(struct net_device *dev, int idx, bool val) +{ + struct bnx2x *bp = netdev_priv(dev); + struct bnx2x_virtf *vf; + int i, rc = 0; + + vf = BP_VF(bp, idx); + if (!vf) + return -EINVAL; + + /* nothing to do */ + if (vf->spoofchk == val) + return 0; + + vf->spoofchk = val ? 1 : 0; + + DP(BNX2X_MSG_IOV, "%s spoofchk for VF %d\n", + val ? "enabling" : "disabling", idx); + + /* is vf initialized and queue set up? 
*/ + if (vf->state != VF_ENABLED || + bnx2x_get_q_logical_state(bp, _leading_vfq(vf, sp_obj)) != + BNX2X_Q_LOGICAL_STATE_ACTIVE) + return rc; + + /* User should be able to see error in system logs */ + if (!bnx2x_validate_vf_sp_objs(bp, vf, true)) + return -EINVAL; + + /* send queue update ramrods to configure spoofchk */ + for_each_vfq(vf, i) { + struct bnx2x_queue_state_params q_params = {NULL}; + struct bnx2x_queue_update_params *update_params; + + q_params.q_obj = _vfq(vf, i, sp_obj); + + /* validate the Q is UP */ + if (bnx2x_get_q_logical_state(bp, q_params.q_obj) != + BNX2X_Q_LOGICAL_STATE_ACTIVE) + continue; + + __set_bit(RAMROD_COMP_WAIT, _params.ramrod_flags);
[PATCH net-next 2/3] bnx2x: Ignore bandwidth attention in single function mode
From: Shahed Shaikh This is a workaround for an MFW bug - MFW generates a bandwidth attention in single function mode, although it is only expected to be generated in multi function mode. This undesired attention in SF mode results in an incorrect HW configuration and, in turn, a Tx timeout. Signed-off-by: Shahed Shaikh Signed-off-by: Ariel Elior --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index faf64ba..16f64c6 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -3536,6 +3536,16 @@ static void bnx2x_drv_info_iscsi_stat(struct bnx2x *bp) */ static void bnx2x_config_mf_bw(struct bnx2x *bp) { + /* Workaround for MFW bug. +* MFW is not supposed to generate BW attention in +* single function mode. +*/ + if (!IS_MF(bp)) { + DP(BNX2X_MSG_MCP, + "Ignoring MF BW config in single function mode\n"); + return; + } + if (bp->link_vars.link_up) { bnx2x_cmng_fns_init(bp, true, CMNG_FNS_MINMAX); bnx2x_link_sync_notify(bp); -- 1.8.3.1
[PATCH net-next 3/3] bnx2x: Provide VF link status in ndo_get_vf_config
From: Shahed Shaikh Provide current link status of VF in ndo_get_vf_config handler. Signed-off-by: Shahed Shaikh Signed-off-by: Ariel Elior --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c index 2f76b37..c835f6c 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c @@ -2638,6 +2638,7 @@ int bnx2x_get_vf_config(struct net_device *dev, int vfidx, ivi->max_tx_rate = 1; /* always 10G. TBA take from link struct */ ivi->min_tx_rate = 0; ivi->spoofchk = vf->spoofchk ? 1 : 0; + ivi->linkstate = vf->link_cfg; if (vf->state == VF_ENABLED) { /* mac and vlan are in vlan_mac objects */ if (bnx2x_validate_vf_sp_objs(bp, vf, false)) { -- 1.8.3.1
Re: [PATCH net-next] net/tls: Add support for async encryption of records for performance
From: Vakul Garg Date: Wed, 19 Sep 2018 20:51:35 +0530 > This patch enables encryption of multiple records in parallel when an > async capable crypto accelerator is present in system. This seems to be trading off zero copy with async support. Async crypto device support is not the common case at all, and synchronous crypto via cpu crypto acceleration instructions is so much more likely. Oh I see, the new logic is only triggered with ASYNC_CAPABLE is set? > +static inline bool is_tx_ready(struct tls_context *tls_ctx, > + struct tls_sw_context_tx *ctx) > +{ Two space between "inline" and "bool", please make it one. > static void tls_write_space(struct sock *sk) > { > struct tls_context *ctx = tls_get_ctx(sk); > + struct tls_sw_context_tx *tx_ctx = tls_sw_ctx_tx(ctx); Longest to shortest line (reverse christmas tree) ordering for local variable declarations please. > > + list_for_each_prev(pos, >tx_ready_list) { > + struct tls_rec *rec = (struct tls_rec *)pos; > + u64 seq = be64_to_cpup((const __be64 *)>aad_space); Likewise. > -static int tls_do_encryption(struct tls_context *tls_ctx, > +int tls_tx_records(struct sock *sk, int flags) > +{ > + struct tls_rec *rec, *tmp; > + struct tls_context *tls_ctx = tls_get_ctx(sk); > + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > + int rc = 0; > + int tx_flags; Likewise. > +static void tls_encrypt_done(struct crypto_async_request *req, int err) > +{ > + struct aead_request *aead_req = (struct aead_request *)req; > + struct sock *sk = req->data; > + struct tls_context *tls_ctx = tls_get_ctx(sk); > + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > + struct tls_rec *rec; > + int pending; > + bool ready = false; Likewise. > +static int tls_do_encryption(struct sock *sk, > + struct tls_context *tls_ctx, >struct tls_sw_context_tx *ctx, >struct aead_request *aead_req, >size_t data_len) > { > int rc; > + struct tls_rec *rec = ctx->open_rec; Likewise. 
> @@ -473,11 +630,12 @@ static int memcopy_from_iter(struct sock *sk, struct > iov_iter *from, > { > struct tls_context *tls_ctx = tls_get_ctx(sk); > struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > - struct scatterlist *sg = ctx->sg_plaintext_data; > + struct tls_rec *rec = ctx->open_rec; > + struct scatterlist *sg = rec->sg_plaintext_data; > int copy, i, rc = 0; Likewise. > +struct tls_rec *get_rec(struct sock *sk) > +{ > + int mem_size; > + struct tls_context *tls_ctx = tls_get_ctx(sk); > + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); > + struct tls_rec *rec; Likewise. > @@ -510,21 +707,33 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, > size_t size) > int record_room; > bool full_record; > int orig_size; > + struct tls_rec *rec; > bool is_kvec = msg->msg_iter.type & ITER_KVEC; > + struct crypto_tfm *tfm = crypto_aead_tfm(ctx->aead_send); > + bool async_capable = tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC; > + int num_async = 0; > + int num_zc = 0; Likewise. > @@ -661,6 +904,8 @@ int tls_sw_sendpage(struct sock *sk, struct page *page, > struct scatterlist *sg; > bool full_record; > int record_room; > + struct tls_rec *rec; > + int num_async = 0; Likewise.
Re: [PATCH bpf-next] samples/bpf: fix compilation failure
On Thu, Sep 20, 2018 at 12:53 AM Prashant Bhole wrote: > > following commit: > commit d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") > added struct bpf_flow_keys which conflicts with the struct with > same name in sockex2_kern.c and sockex3_kern.c > > similar to commit: > commit 534e0e52bc23 ("samples/bpf: fix a compilation failure") > we tried the rename it "flow_keys" but it also conflicted with struct > having same name in include/net/flow_dissector.h. Hence renaming the > struct to "flow_key_record". Also, this commit doesn't fix the > compilation error completely because the similar struct is present in > sockex3_kern.c. Hence renaming it in both files sockex3_user.c and > sockex3_kern.c > > Signed-off-by: Prashant Bhole Acked-by: Song Liu > --- > samples/bpf/sockex2_kern.c | 11 ++- > samples/bpf/sockex3_kern.c | 8 > samples/bpf/sockex3_user.c | 4 ++-- > 3 files changed, 12 insertions(+), 11 deletions(-) > > diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c > index f58acfc92556..f2f9dbc021b0 100644 > --- a/samples/bpf/sockex2_kern.c > +++ b/samples/bpf/sockex2_kern.c > @@ -14,7 +14,7 @@ struct vlan_hdr { > __be16 h_vlan_encapsulated_proto; > }; > > -struct bpf_flow_keys { > +struct flow_key_record { > __be32 src; > __be32 dst; > union { > @@ -59,7 +59,7 @@ static inline __u32 ipv6_addr_hash(struct __sk_buff *ctx, > __u64 off) > } > > static inline __u64 parse_ip(struct __sk_buff *skb, __u64 nhoff, __u64 > *ip_proto, > -struct bpf_flow_keys *flow) > +struct flow_key_record *flow) > { > __u64 verlen; > > @@ -83,7 +83,7 @@ static inline __u64 parse_ip(struct __sk_buff *skb, __u64 > nhoff, __u64 *ip_proto > } > > static inline __u64 parse_ipv6(struct __sk_buff *skb, __u64 nhoff, __u64 > *ip_proto, > - struct bpf_flow_keys *flow) > + struct flow_key_record *flow) > { > *ip_proto = load_byte(skb, > nhoff + offsetof(struct ipv6hdr, nexthdr)); > @@ -96,7 +96,8 @@ static inline __u64 parse_ipv6(struct __sk_buff *skb, __u64 > 
nhoff, __u64 *ip_pro > return nhoff; > } > > -static inline bool flow_dissector(struct __sk_buff *skb, struct > bpf_flow_keys *flow) > +static inline bool flow_dissector(struct __sk_buff *skb, > + struct flow_key_record *flow) > { > __u64 nhoff = ETH_HLEN; > __u64 ip_proto; > @@ -198,7 +199,7 @@ struct bpf_map_def SEC("maps") hash_map = { > SEC("socket2") > int bpf_prog2(struct __sk_buff *skb) > { > - struct bpf_flow_keys flow = {}; > + struct flow_key_record flow = {}; > struct pair *value; > u32 key; > > diff --git a/samples/bpf/sockex3_kern.c b/samples/bpf/sockex3_kern.c > index 95907f8d2b17..c527b57d3ec8 100644 > --- a/samples/bpf/sockex3_kern.c > +++ b/samples/bpf/sockex3_kern.c > @@ -61,7 +61,7 @@ struct vlan_hdr { > __be16 h_vlan_encapsulated_proto; > }; > > -struct bpf_flow_keys { > +struct flow_key_record { > __be32 src; > __be32 dst; > union { > @@ -88,7 +88,7 @@ static inline __u32 ipv6_addr_hash(struct __sk_buff *ctx, > __u64 off) > } > > struct globals { > - struct bpf_flow_keys flow; > + struct flow_key_record flow; > }; > > struct bpf_map_def SEC("maps") percpu_map = { > @@ -114,14 +114,14 @@ struct pair { > > struct bpf_map_def SEC("maps") hash_map = { > .type = BPF_MAP_TYPE_HASH, > - .key_size = sizeof(struct bpf_flow_keys), > + .key_size = sizeof(struct flow_key_record), > .value_size = sizeof(struct pair), > .max_entries = 1024, > }; > > static void update_stats(struct __sk_buff *skb, struct globals *g) > { > - struct bpf_flow_keys key = g->flow; > + struct flow_key_record key = g->flow; > struct pair *value; > > value = bpf_map_lookup_elem(_map, ); > diff --git a/samples/bpf/sockex3_user.c b/samples/bpf/sockex3_user.c > index 22f74d0e1493..9d02e0404719 100644 > --- a/samples/bpf/sockex3_user.c > +++ b/samples/bpf/sockex3_user.c > @@ -13,7 +13,7 @@ > #define PARSE_IP_PROG_FD (prog_fd[0]) > #define PROG_ARRAY_FD (map_fd[0]) > > -struct flow_keys { > +struct flow_key_record { > __be32 src; > __be32 dst; > union { > @@ -64,7 +64,7 @@ int main(int 
argc, char **argv) > (void) f; > > for (i = 0; i < 5; i++) { > - struct flow_keys key = {}, next_key; > + struct flow_key_record key = {}, next_key; > struct pair value; > > sleep(1); > -- > 2.17.1 > >
Re: [RFC 4/5] netlink: prepare validate extack setting for recursion
On Thu, Sep 20, 2018 at 10:14:10AM +0200, Johannes Berg wrote: > Anyway - we got into this discussion because of all the extra recursion > stuff I was adding. With the change suggested by David we don't need > that now at all, so I guess it'd be better to propose a patch if you (or > perhaps I will see a need later) need such a facility for multiple > messages or multiple message levels? Yep! Thanks, Marcelo
Re: [Patch net-next] ipv4: initialize ra_mutex in inet_init_net()
On Thu, Sep 20, 2018 at 2:04 AM Kirill Tkhai wrote: > > On 20.09.2018 0:28, Cong Wang wrote: > You added me to CC, so you probably want to know my opinion about this. > Since it's not a real problem fix, but just a refactoring, I'll tell you > my opinion on how this refactoring may be made better. If you don't want > to know my opinion, you may consider not CCing me. Sure, with respect to your opinions, I prefer to drop it happily. Thanks.
Re: [PATCH 2/3] net: Register af_inet_ops earlier
On Thu, Sep 20, 2018 at 2:12 AM Kirill Tkhai wrote: > > This function just initializes locks and defaults. > Let's register it before the other pernet operations, > since some of them may potentially rely on that. > > Signed-off-by: Kirill Tkhai It adds no benefit but potential risks in error path ordering; it is never too late to bring this up again when a future change needs it. Until then: Nacked-by: Cong Wang
Re: [PATCH 3/3] ipv4: initialize ra_mutex in inet_init_net()
On Thu, Sep 20, 2018 at 2:12 AM Kirill Tkhai wrote: > > From: Cong Wang > > ra_mutex is an IPv4-specific mutex, it is inside struct netns_ipv4, > but its initialization is in the generic netns code, setup_net(). > > Move it to IPv4 specific net init code, inet_init_net(). > > Fixes: d9ff3049739e ("net: Replace ip_ra_lock with per-net mutex") > Signed-off-by: Cong Wang > Acked-by: Kirill Tkhai I regret wasting my time on this, so: Nacked-by: Cong Wang Let's just leave the current code as it is.
Re: [PATCH net-next 4/5] ipv6: do not drop vrf udp multicast packets
On 20/09/2018 14:02, Paolo Abeni wrote: > Hi, > > On Thu, 2018-09-20 at 09:58 +0100, Mike Manning wrote: >> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c >> index 108f5f88ec98..fc60f297d95b 100644 >> --- a/net/ipv6/ip6_input.c >> +++ b/net/ipv6/ip6_input.c >> @@ -325,9 +325,12 @@ static int ip6_input_finish(struct net *net, struct >> sock *sk, struct sk_buff *sk >> { >> const struct inet6_protocol *ipprot; >> struct inet6_dev *idev; >> +struct net_device *dev; >> unsigned int nhoff; >> +int sdif = inet6_sdif(skb); >> int nexthdr; >> bool raw; >> +bool deliver; >> bool have_final = false; > Please, try instead to sort the variables in reverse x-mas tree order. Will do. >> >> /* >> @@ -371,9 +374,27 @@ static int ip6_input_finish(struct net *net, struct >> sock *sk, struct sk_buff *sk >> skb_postpull_rcsum(skb, skb_network_header(skb), >> skb_network_header_len(skb)); >> hdr = ipv6_hdr(skb); >> -if (ipv6_addr_is_multicast(&hdr->daddr) && >> -!ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, >> &hdr->saddr) && >> + >> +/* skb->dev passed may be master dev for vrfs. */ >> +if (sdif) { >> +rcu_read_lock(); > AFAICS, the rcu lock is already acquired at the beginning of > ip6_input_finish(), no need to acquire it here again. Nice catch, I will remove this. >> +dev = dev_get_by_index_rcu(dev_net(skb->dev), >> + sdif); >> +if (!dev) { >> +rcu_read_unlock(); >> +kfree_skb(skb); >> +return -ENODEV; >> +} >> +} else { >> +dev = skb->dev; > The above fragment of code is a recurring pattern in this series, > perhaps adding a helper for it would reduce code duplication? This pattern of checking the secondary device index is used only twice, both in this file. But now with one instance having the rcu lock handling and the other not, I cannot refactor this. > > Cheers, > > Paolo > Thanks for the review! I will wait for further comments before producing a v1 of the series. Regards, Mike
Re: [PATCH v2 0/2] hv_netvsc: associate VF and PV device by serial number
On Thu, 20 Sep 2018 15:18:20 +0100 Lorenzo Pieralisi wrote: > On Fri, Sep 14, 2018 at 12:54:55PM -0700, Stephen Hemminger wrote: > > The Hyper-V implementation of PCI controller has concept of 32 bit serial > > number > > (not to be confused with PCI-E serial number). This value is sent in the > > protocol > > from the host to indicate SR-IOV VF device is attached to a synthetic NIC. > > > > Using the serial number (instead of MAC address) to associate the two > > devices > > avoids lots of potential problems when there are duplicate MAC addresses > > from > > tunnels or layered devices. > > > > The patch set is broken into two parts, one is for the PCI controller > > and the other is for the netvsc device. Normally, these go through different > > trees but sending them together here for better review. The PCI changes > > were submitted previously, but the main review comment was "why do you > > need this?". This is why. > > The question was more whether we should convert this serial number into > a PCI slot number (that has user space visibility and that is what you are > after) to improve the current matching, I do not question why you need > it, just for the records. The name slot is way overloaded in this context: there is the windows slot number, which comes from the Hyper-V PCI address; the slot that pci-hyperv sets from the windows slot; and the pci slot api value, which for normal devices comes from ACPI and which this patch gets from the serial number. The netvsc driver needed to be able to find a PCI device based on the serial number. The serial number was not visible in any current pci-hyperv controller values. The windows slot (wslot) is not the same as the serial number.
Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name
On Thu, Sep 20, 2018 at 07:40:39PM +0300, Leon Romanovsky wrote: > > > The protection is done with global device_lock because it is used in > > > allocation and deallocation phases. At this stage, this lock is not > > > busy and easily can be moved to be per-device, once it will be needed. > > > > > > Signed-off-by: Leon Romanovsky > > > drivers/infiniband/core/device.c | 24 +++- > > > include/rdma/ib_verbs.h | 8 +++- > > > 2 files changed, 30 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/infiniband/core/device.c > > > b/drivers/infiniband/core/device.c > > > index 5a680a88aa87..3270cde6d806 100644 > > > +++ b/drivers/infiniband/core/device.c > > > @@ -170,6 +170,14 @@ static struct ib_device > > > *__ib_device_get_by_name(const char *name) > > > return NULL; > > > } > > > > > > +void ib_device_get_name(struct ib_device *ibdev, char *name) > > > +{ > > > + down_read(_rwsem); > > > + strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX); > > > + up_read(_rwsem); > > > +} > > > +EXPORT_SYMBOL(ib_device_get_name); > > > > I think we have to follow netdev and just rely on device_rename() > > being 'good enough'. > > > > Switch everything to use dev_name()/etc rather than try and do > > something like this so the responsibility is on the device core to > > keep this working, not us. > > > > Turns out I have a series for that for unrelated reasons.. > > And what should I do now with this knowledge? Maybe I can send it today.. Jason
Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name
On Thu, Sep 20, 2018 at 09:15:41AM -0600, Jason Gunthorpe wrote: > On Thu, Sep 20, 2018 at 02:21:58PM +0300, Leon Romanovsky wrote: > > From: Leon Romanovsky > > > > Prepare IB device name field to rename operation by ensuring that all > > accesses to it are protected with lock and users don't see part of name. > > Oh dear, no, that isn't going to work, there is too much stuff using > dev_name.. Did you read the comment on device_rename?? > > https://elixir.bootlin.com/linux/v4.19-rc4/source/drivers/base/core.c#L2715 Yes, I read, it was mentioned in the cover letter. > > > The protection is done with global device_lock because it is used in > > allocation and deallocation phases. At this stage, this lock is not > > busy and easily can be moved to be per-device, once it will be needed. > > > > Signed-off-by: Leon Romanovsky > > drivers/infiniband/core/device.c | 24 +++- > > include/rdma/ib_verbs.h | 8 +++- > > 2 files changed, 30 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/infiniband/core/device.c > > b/drivers/infiniband/core/device.c > > index 5a680a88aa87..3270cde6d806 100644 > > +++ b/drivers/infiniband/core/device.c > > @@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const > > char *name) > > return NULL; > > } > > > > +void ib_device_get_name(struct ib_device *ibdev, char *name) > > +{ > > + down_read(_rwsem); > > + strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX); > > + up_read(_rwsem); > > +} > > +EXPORT_SYMBOL(ib_device_get_name); > > I think we have to follow netdev and just rely on device_rename() > being 'good enough'. > > Switch everything to use dev_name()/etc rather than try and do > something like this so the responsibility is on the device core to > keep this working, not us. > > Turns out I have a series for that for unrelated reasons.. And what should I do now with this knowledge? 
> > > static int alloc_name(char *name) > > { > > unsigned long *inuse; > > @@ -202,6 +210,21 @@ static int alloc_name(char *name) > > return 0; > > } > > > > +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern) > > +{ > > + int ret = 0; > > + > > + mutex_lock(_mutex); > > + strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX); > > + if (strchr(ibdev->name, '%')) > > + ret = alloc_name(ibdev->name); > > + > > + mutex_unlock(_mutex); > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(ib_device_alloc_name); > > Can't call alloc_name() without also adding to the list, this will > allow duplicates. I planned to change it in the future by moving to different name scheme with unique naming. > > Jason signature.asc Description: PGP signature
Re: [RFC bpf-next 4/4] tools/bpf: handle EOPNOTSUPP when map lookup is failed
On Thu, 20 Sep 2018 14:04:19 +0900, Prashant Bhole wrote: > On 9/20/2018 12:29 AM, Jakub Kicinski wrote: > > On Wed, 19 Sep 2018 16:51:43 +0900, Prashant Bhole wrote: > >> Let's add a check for EOPNOTSUPP error when map lookup is failed. > >> Also in case map doesn't support lookup, the output of map dump is > >> changed from "can't lookup element" to "lookup not supported for > >> this map". > >> > >> Patch adds function print_entry_error() function to print the error > >> value. > >> > >> Following example dumps a map which does not support lookup. > >> > >> Output before: > >> root# bpftool map -jp dump id 40 > >> [ > >> "key": ["0x0a","0x00","0x00","0x00" > >> ], > >> "value": { > >> "error": "can\'t lookup element" > >> }, > >> "key": ["0x0b","0x00","0x00","0x00" > >> ], > >> "value": { > >> "error": "can\'t lookup element" > >> } > >> ] > >> > >> root# bpftool map dump id 40 > >> can't lookup element with key: > >> 0a 00 00 00 > >> can't lookup element with key: > >> 0b 00 00 00 > >> Found 0 elements > >> > >> Output after changes: > >> root# bpftool map dump -jp id 45 > >> [ > >> "key": ["0x0a","0x00","0x00","0x00" > >> ], > >> "value": { > >> "error": "lookup not supported for this map" > >> }, > >> "key": ["0x0b","0x00","0x00","0x00" > >> ], > >> "value": { > >> "error": "lookup not supported for this map" > >> } > >> ] > >> > >> root# bpftool map dump id 45 > >> key: > >> 0a 00 00 00 > >> value: > >> lookup not supported for this map > >> key: > >> 0b 00 00 00 > >> value: > >> lookup not supported for this map > >> Found 0 elements > > > > Nice improvement, thanks for the changes! I wonder what your thoughts > > would be on just printing some form of "lookup not supported for this > > map" only once? It seems slightly like repeated information - if > > lookup is not supported for one key it likely won't be for other keys > > too, so we could shorten the output. Would that make sense? 
> > > >> Signed-off-by: Prashant Bhole > >> --- > >> tools/bpf/bpftool/main.h | 5 + > >> tools/bpf/bpftool/map.c | 35 ++- > >> 2 files changed, 35 insertions(+), 5 deletions(-) > >> > >> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h > >> index 40492cdc4e53..1a8c683f949b 100644 > >> --- a/tools/bpf/bpftool/main.h > >> +++ b/tools/bpf/bpftool/main.h > >> @@ -46,6 +46,11 @@ > >> > >> #include "json_writer.h" > >> > >> +#define ERR_CANNOT_LOOKUP \ > >> + "can't lookup element" > >> +#define ERR_LOOKUP_NOT_SUPPORTED \ > >> + "lookup not supported for this map" > > > > Do we need these? Are we going to reuse them in more parts of the > > code? > > These are used only once. These can be used in do_lookup(). Currently > do_lookup() prints strerror(errno) when lookup fails. Shall I change > that do_lookup() output? I actually prefer to stick to strerror(); the standard errors more clearly correlate with what happened in my mind (i.e. "Operation not supported" == kernel sent EOPNOTSUPP). strerror() may also print in local language if translation/localization matters. We could even use strerror() in dump_map_elem() but up to you. The one in do_lookup() I'd prefer to leave be ;)
Re: [PATCH] PCI: hv: Fix return value check in hv_pci_assign_slots()
[+DaveM, netdev] Removed erroneously added disclaimer, apologies. On Thu, Sep 20, 2018 at 06:40:31AM +, Wei Yongjun wrote: > In case of error, the function pci_create_slot() returns ERR_PTR() and > never returns NULL. The NULL test in the return value check should be > replaced with IS_ERR(). > > Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot > information") > Signed-off-by: Wei Yongjun > --- > drivers/pci/controller/pci-hyperv.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/pci-hyperv.c > b/drivers/pci/controller/pci-hyperv.c > index ee80e79..9ba4d12 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -1484,8 +1484,10 @@ static void hv_pci_assign_slots(struct > hv_pcibus_device *hbus) > snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser); > hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr, > name, NULL); > - if (!hpdev->pci_slot) > + if (IS_ERR(hpdev->pci_slot)) { > pr_warn("pci_create slot %s failed\n", name); > + hpdev->pci_slot = NULL; > + } > } > } I am dropping this patch from the PCI queue since the original series is now queued in net-next. Lorenzo
Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name
On Thu, Sep 20, 2018 at 02:21:58PM +0300, Leon Romanovsky wrote: > From: Leon Romanovsky > > Prepare IB device name field to rename operation by ensuring that all > accesses to it are protected with lock and users don't see part of name. Oh dear, no, that isn't going to work, there is too much stuff using dev_name.. Did you read the comment on device_rename?? https://elixir.bootlin.com/linux/v4.19-rc4/source/drivers/base/core.c#L2715 > The protection is done with global device_lock because it is used in > allocation and deallocation phases. At this stage, this lock is not > busy and easily can be moved to be per-device, once it will be needed. > > Signed-off-by: Leon Romanovsky > drivers/infiniband/core/device.c | 24 +++- > include/rdma/ib_verbs.h | 8 +++- > 2 files changed, 30 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/device.c > b/drivers/infiniband/core/device.c > index 5a680a88aa87..3270cde6d806 100644 > +++ b/drivers/infiniband/core/device.c > @@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const > char *name) > return NULL; > } > > +void ib_device_get_name(struct ib_device *ibdev, char *name) > +{ > + down_read(_rwsem); > + strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX); > + up_read(_rwsem); > +} > +EXPORT_SYMBOL(ib_device_get_name); I think we have to follow netdev and just rely on device_rename() being 'good enough'. Switch everything to use dev_name()/etc rather than try and do something like this so the responsibility is on the device core to keep this working, not us. Turns out I have a series for that for unrelated reasons.. 
> static int alloc_name(char *name) > { > unsigned long *inuse; > @@ -202,6 +210,21 @@ static int alloc_name(char *name) > return 0; > } > > +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern) > +{ > + int ret = 0; > + > + mutex_lock(_mutex); > + strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX); > + if (strchr(ibdev->name, '%')) > + ret = alloc_name(ibdev->name); > + > + mutex_unlock(_mutex); > + > + return ret; > +} > +EXPORT_SYMBOL(ib_device_alloc_name); Can't call alloc_name() without also adding to the list, this will allow duplicates. Jason
RE: [PATCH v2 1/5] netlink: remove NLA_NESTED_COMPAT
From: Johannes Berg > Sent: 19 September 2018 20:49 > This isn't used anywhere, so we might as well get rid of it. > > Reviewed-by: David Ahern > Signed-off-by: Johannes Berg > --- > include/net/netlink.h | 2 -- > lib/nlattr.c | 11 --- > 2 files changed, 13 deletions(-) > > diff --git a/include/net/netlink.h b/include/net/netlink.h > index 318b1ded3833..b680fe365e91 100644 > --- a/include/net/netlink.h > +++ b/include/net/netlink.h > @@ -172,7 +172,6 @@ enum { > NLA_FLAG, > NLA_MSECS, > NLA_NESTED, > - NLA_NESTED_COMPAT, > NLA_NUL_STRING, > NLA_BINARY, > NLA_S8, ... Is it safe to remove an item from this enum? David
Re: [PATCH RFT net-next 0/2] net: phy: Eliminate unnecessary soft
On Tue, Sep 18, 2018 at 06:35:03PM -0700, Florian Fainelli wrote: > Hi all, > > This patch series eliminates unnecessary software resets of the PHY. > This should hopefully not break anybody's hardware; but I would > appreciate testing to make sure this is the case. > > Sorry for this long email list, I wanted to make sure I reached out to > all people who made changes to the Marvell PHY driver. Hi Florian I want to test this, but I'm battling other issues at the moment and have not got around to it. But I see it has been tested by two people. So I think we should merge it, and see if anybody screams. Andrew
Re: [PATCH] smc: generic netlink family should be __ro_after_init
From: Johannes Berg Date: Thu, 20 Sep 2018 09:27:30 +0200 > From: Johannes Berg > > The generic netlink family is only initialized during module init, > so it should be __ro_after_init like all other generic netlink > families. > > Signed-off-by: Johannes Berg Looks good, applied.
Re: [PATCH net] mlxsw: spectrum: Bump required firmware version
From: Ido Schimmel Date: Thu, 20 Sep 2018 09:31:45 +0300 > From: Petr Machata > > MC-aware mode was introduced to mlxsw in commit 7b8195306694 ("mlxsw: > spectrum: > Configure MC-aware mode on mlxsw ports") and fixed up later in commit > 3a3539cd3632 ("mlxsw: spectrum_buffers: Set up a dedicated pool for BUM > traffic"). As the final piece of puzzle, a firmware issue whereby a wrong > priority was assigned to BUM traffic was corrected in FW version 13.1703.4. > Therefore require this FW version in the driver. > > Fixes: 7b8195306694 ("mlxsw: spectrum: Configure MC-aware mode on mlxsw > ports") > Signed-off-by: Petr Machata > Reviewed-by: Jiri Pirko > Signed-off-by: Ido Schimmel Applied.
Re: [PATCH net-next 00/13] mlxsw: Further MC-awareness configuration
From: Ido Schimmel Date: Thu, 20 Sep 2018 09:21:23 +0300 > Petr says: > > Due to an issue in Spectrum chips, when unicast traffic shares the same > queue as BUM traffic, and there is congestion, the BUM traffic is > admitted to the queue anyway, thus pushing out all UC traffic. In order > to give unicast traffic precedence over BUM traffic, multicast-aware > mode is now configured on all ports. Under MC-aware mode, egress TCs > 8..15 are used for BUM traffic, which has its own dedicated pool. > > This patch set improves the way that the MC pool and the higher-order > TCs are integrated into the system. Series applied, thanks. > Then in patch #13 the selftest itself is added. Just wanted to say I'm really happy with the selftests that exist for all of the problems that have been fixed recently in mlxsw.
Re: [PATCH bpf-next 2/2] xsk: fix bug when trying to use both copy and zero-copy on one queue id
On Thu, 20 Sep 2018 11:17:17 +0200, Magnus Karlsson wrote: > On Wed, Sep 19, 2018 at 9:18 AM Magnus Karlsson wrote: > > On Wed, Sep 19, 2018 at 3:58 AM Jakub Kicinski wrote: > > > > > > On Tue, 18 Sep 2018 10:22:11 -0700, Y Song wrote: > > > > > +/* The umem is stored both in the _rx struct and the _tx struct as > > > > > we do > > > > > + * not know if the device has more tx queues than rx, or the > > > > > opposite. > > > > > + * This might also change during run time. > > > > > + */ > > > > > +static void xdp_reg_umem_at_qid(struct net_device *dev, struct > > > > > xdp_umem *umem, > > > > > + u16 queue_id) > > > > > +{ > > > > > + if (queue_id < dev->real_num_rx_queues) > > > > > + dev->_rx[queue_id].umem = umem; > > > > > + if (queue_id < dev->real_num_tx_queues) > > > > > + dev->_tx[queue_id].umem = umem; > > > > > +} > > > > > + > > > > > +static struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, > > > > > + u16 queue_id) > > > > > +{ > > > > > + if (queue_id < dev->real_num_rx_queues) > > > > > + return dev->_rx[queue_id].umem; > > > > > + if (queue_id < dev->real_num_tx_queues) > > > > > + return dev->_tx[queue_id].umem; > > > > > + > > > > > + return NULL; > > > > > +} > > > > > + > > > > > +static void xdp_clear_umem_at_qid(struct net_device *dev, u16 > > > > > queue_id) > > > > > +{ > > > > > + /* Zero out the entry independent on how many queues are > > > > > configured > > > > > +* at this point in time, as it might be used in the future. > > > > > +*/ > > > > > + if (queue_id < dev->num_rx_queues) > > > > > + dev->_rx[queue_id].umem = NULL; > > > > > + if (queue_id < dev->num_tx_queues) > > > > > + dev->_tx[queue_id].umem = NULL; > > > > > +} > > > > > + > > > > > > > > I am sure whether the following scenario can happen or not. > > > > Could you clarify? > > > >1. suppose initially we have num_rx_queues = num_tx_queues = 10 > > > >xdp_reg_umem_at_qid() set umem1 to queue_id = 8 > > > >2. num_tx_queues is changed to 5 > > > >3. 
xdp_clear_umem_at_qid() is called for queue_id = 8, > > > >and dev->_rx[8].umum = 0. > > > >4. xdp_reg_umem_at_qid() is called gain to set for queue_id = 8 > > > >dev->_rx[8].umem = umem2 > > > >5. num_tx_queues is changed to 10 > > > > Now dev->_rx[8].umem != dev->_tx[8].umem, is this possible and is it > > > > a problem? > > > > > > Plus IIRC the check of qid vs real_num_[rt]x_queues in xsk_bind() is > > > not under rtnl_lock so it doesn't count for much. Why not do all the > > > checks against num_[rt]x_queues here, instead of real_..? > > > > You are correct, two separate rtnl_lock regions is broken. Will spin a > > v2 tomorrow when I am back in the office. > > > > Thanks Jakub for catching this. I really appreciate you reviewing my code. > > Sorry, forgot to answer your question about why real_num_ instead of > num_. If we used num_ instead, we would get a behavior where we can > open a socket on a queue id, let us say 10, that will not be active if > real_num_ is below 10. So no traffic will flow on this socket until we > issue an ethtool command to state that we have 10 or more queues > configured. While this behavior is sane (and consistent), I believe it > will lead to a more complicated driver implementation, break the > current uapi (since we will return an error if you try to bind to a > queue id that is not active at the moment) and I like my suggestion > below better :-). > > What I would like to suggest is that the current model at bind() is > kept, that is it will fail if you try to bind to a non-active queue > id. But we add some code in ethool_set_channels in net/core/ethtool.c > to check if you have an active af_xdp socket on a queue id and try to > make it inactive, we return an error and disallow the change. This > would IMHO be like not allowing a load module to be unloaded if > someone is using it or not allowing unmounting a file system if files > are in use. What do you think? 
If you think this is the right way to > go, I can implement this in a follow up patch or in this patch set if > that makes more sense, let me know. This suggestion would actually > make it simpler to implement zero-copy support in the driver. Some of > the complications we are having today is due to the fact that ethtool > can come in from the side and change things for us when an AF_XDP > socket is active on a queue id. And implementing zero copy support in > the driver needs to get simpler IMO. > > Please let me know what you think. I think I'd like that: https://patchwork.ozlabs.org/project/netdev/list/?series=57843=* in particular: https://patchwork.ozlabs.org/patch/949891/ ;) If we fix the resize indeed you could stick to real_* but to be honest, I'm not entirely clear
Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name
On 9/20/2018 6:21 AM, Leon Romanovsky wrote: > From: Leon Romanovsky > > Prepare IB device name field to rename operation by ensuring that all > accesses to it are protected with lock and users don't see part of name. > > The protection is done with global device_lock because it is used in > allocation and deallocation phases. At this stage, this lock is not > busy and easily can be moved to be per-device, once it will be needed. > > Signed-off-by: Leon Romanovsky > --- > drivers/infiniband/core/device.c | 24 +++- > include/rdma/ib_verbs.h | 8 +++- > 2 files changed, 30 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/device.c > b/drivers/infiniband/core/device.c > index 5a680a88aa87..3270cde6d806 100644 > --- a/drivers/infiniband/core/device.c > +++ b/drivers/infiniband/core/device.c > @@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const > char *name) > return NULL; > } > > +void ib_device_get_name(struct ib_device *ibdev, char *name) > +{ > + down_read(_rwsem); > + strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX); > + up_read(_rwsem); > +} > +EXPORT_SYMBOL(ib_device_get_name); > + > static int alloc_name(char *name) > { > unsigned long *inuse; > @@ -202,6 +210,21 @@ static int alloc_name(char *name) > return 0; > } > > +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern) > +{ > + int ret = 0; > + > + mutex_lock(_mutex); > + strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX); > + if (strchr(ibdev->name, '%')) > + ret = alloc_name(ibdev->name); > + > + mutex_unlock(_mutex); > + > + return ret; > +} > +EXPORT_SYMBOL(ib_device_alloc_name); > + > static void ib_device_release(struct device *device) > { > struct ib_device *dev = container_of(device, struct ib_device, dev); > @@ -499,7 +522,6 @@ int ib_register_device(struct ib_device *device, > ret = alloc_name(device->name); > if (ret) > goto out; > - } I don't think this is correct... 
> > if (ib_device_check_mandatory(device)) { > ret = -EINVAL; > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index e764ed1f6025..0e7b9854 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -2260,6 +2260,11 @@ struct ib_device { > /* Do not access @dma_device directly from ULP nor from HW drivers. */ > struct device*dma_device; > > + /* > + * Do not access @name directly, > + * use ib_device_get_name()/ib_device_alloc_name() > + * and don't assume that it can't change after access. > + */ > char name[IB_DEVICE_NAME_MAX]; > > struct list_head event_handler_list; > @@ -2638,7 +2643,8 @@ struct ib_device *ib_alloc_device(size_t size); > void ib_dealloc_device(struct ib_device *device); > > void ib_get_device_fw_str(struct ib_device *device, char *str); > - > +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern); > +void ib_device_get_name(struct ib_device *ibdev, char *name); > int ib_register_device(struct ib_device *device, > int (*port_callback)(struct ib_device *, > u8, struct kobject *)); > -- > 2.14.4 >
Re: [PATCH 00/21] SMMU enablement for NXP LS1043A and LS1046A
On 20.09.2018 14:49, Robin Murphy wrote: > On 20/09/18 11:38, Laurentiu Tudor wrote: >> >> >> On 19.09.2018 17:37, Robin Murphy wrote: >>> On 19/09/18 15:18, Laurentiu Tudor wrote: Hi Robin, On 19.09.2018 16:25, Robin Murphy wrote: > Hi Laurentiu, > > On 19/09/18 13:35, laurentiu.tu...@nxp.com wrote: >> From: Laurentiu Tudor >> >> This patch series adds SMMU support for NXP LS1043A and LS1046A chips >> and consists mostly in important driver fixes and the required device >> tree updates. It touches several subsystems and consists of three >> main >> parts: >> - changes in soc/drivers/fsl/qbman drivers adding iommu >> mapping of >> reserved memory areas, fixes and defered probe support >> - changes in drivers/net/ethernet/freescale/dpaa_eth drivers >> consisting in misc dma mapping related fixes and probe ordering >> - addition of the actual arm smmu device tree node together with >> various adjustments to the device trees >> >> Performance impact >> >> Running iperf benchmarks in a back-to-back setup (both sides >> having smmu enabled) on a 10GBps port show an important >> networking performance degradation of around %40 (9.48Gbps >> linerate vs 5.45Gbps). If you need performance but without >> SMMU support you can use "iommu.passthrough=1" to disable >> SMMU. > > I should have said before - thanks for the numbers there as well. Always > good to add another datapoint to my collection. If you're interested > I've added SMMUv2 support to the "non-strict mode" series (of which I > should be posting v8 soon), so it might be fun to see how well that > works on MMU-500 in the real world. Hmm, I think I gave those a try some weeks ago and vaguely remember that I did see improvements. Can't remember the numbers off the top of my head but I'll re-test with the latest spin and update the numbers. 
>> >> USB issue and workaround >> >> There's a problem with the usb controllers in these chips >> generating smaller, 40-bit wide dma addresses instead of the >> 48-bit >> supported at the smmu input. So you end up in a situation >> where the >> smmu is mapped with 48-bit address translations, but the >> device >> generates transactions with clipped 40-bit addresses, thus >> smmu >> context faults are triggered. I encountered a similar >> situation for >> mmc that I managed to fix in software [1] however for USB I >> did not >> find a proper place in the code to add a similar fix. The only >> workaround I found was to add this kernel parameter which >> limits the >> usb dma to 32-bit size: "xhci-hcd.quirks=0x80". >> This workaround if far from ideal, so any suggestions for a >> code >> based workaround in this area would be greatly appreciated. > > If you have a nominally-64-bit device with a > narrower-than-the-main-interconnect link in front of it, that should > already be fixed in 4.19-rc by bus_dma_mask picking up DT dma-ranges, > provided the interconnect hierarchy can be described appropriately (or > at least massaged sufficiently to satisfy the binding), e.g.: > > / { > ... > > soc { > ranges; > dma-ranges = <0 0 1 0>; > > dev_48bit { ... }; > > periph_bus { > ranges; > dma-ranges = <0 0 100 0>; > > dev_40bit { ... }; > }; > }; > }; > > and if that fails to work as expected (except for PCI hosts where > handling dma-ranges properly still needs sorting out), please do > let us > know ;) > Just to confirm, Is this [1] the change I was supposed to test? >>> >>> Not quite - dma-ranges is only valid for nodes representing a bus, so >>> putting it directly in the USB device nodes doesn't work (FWIW that's >>> why PCI is broken, because the parser doesn't expect the >>> bus-as-leaf-node case). 
That's the point of that intermediate simple-bus >>> node represented by "periph_bus" in my example (sorry, I should have put >>> compatibles in to make it clearer) - often that's actually true to life >>> (i.e. "soc" is something like a CCI and "periph_bus" is something like >>> an AXI NIC gluing a bunch of lower-bandwidth DMA masters to one of the >>> CCI ports) but at worst it's just a necessary evil to make the binding >>> happy (if it literally only represents the point-to-point link between >>> the device master port and interconnect slave port). >>> >> >> Quick update: so I adjusted the device tree according to your example and >> it works so now I can get rid of that nasty kernel arg based
Re: [PATCH v2 0/2] hv_netvsc: associate VF and PV device by serial number
On Fri, Sep 14, 2018 at 12:54:55PM -0700, Stephen Hemminger wrote: > The Hyper-V implementation of PCI controller has concept of 32 bit serial > number > (not to be confused with PCI-E serial number). This value is sent in the > protocol > from the host to indicate SR-IOV VF device is attached to a synthetic NIC. > > Using the serial number (instead of MAC address) to associate the two devices > avoids lots of potential problems when there are duplicate MAC addresses from > tunnels or layered devices. > > The patch set is broken into two parts, one is for the PCI controller > and the other is for the netvsc device. Normally, these go through different > trees but sending them together here for better review. The PCI changes > were submitted previously, but the main review comment was "why do you > need this?". This is why. The question was more whether we should convert this serial number into a PCI slot number (that has user space visibility and that is what you are after) to improve the current matching; I do not question why you need it, just for the record. Lorenzo > v2 - slot name can be shorter. > remove locking when creating pci_slots; see comment for explanation > > Stephen Hemminger (2): > PCI: hv: support reporting serial number as slot information > hv_netvsc: pair VF based on serial number > > drivers/net/hyperv/netvsc.c | 3 ++ > drivers/net/hyperv/netvsc_drv.c | 58 - > drivers/pci/controller/pci-hyperv.c | 37 ++ > 3 files changed, 73 insertions(+), 25 deletions(-) > > -- > 2.18.0 >
Re: [PATCH net] sctp: update dst pmtu with the correct daddr
On Thu, Sep 20, 2018 at 05:27:28PM +0800, Xin Long wrote: > When processing pmtu update from an icmp packet, it calls .update_pmtu > with sk instead of skb in sctp_transport_update_pmtu. > > However for sctp, the daddr in the transport might be different from > inet_sock->inet_daddr or sk->sk_v6_daddr, which is used to update or > create the route cache. The incorrect daddr will cause a different > route cache created for the path. > > So before calling .update_pmtu, inet_sock->inet_daddr/sk->sk_v6_daddr > should be updated with the daddr in the transport, and update it back > after it's done. > > The issue has existed since route exceptions introduction. > > Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") > Reported-by: ian.per...@dialogic.com > Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner > --- > net/sctp/transport.c | 12 ++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/net/sctp/transport.c b/net/sctp/transport.c > index 12cac85..033696e 100644 > --- a/net/sctp/transport.c > +++ b/net/sctp/transport.c > @@ -260,6 +260,7 @@ void sctp_transport_pmtu(struct sctp_transport > *transport, struct sock *sk) > bool sctp_transport_update_pmtu(struct sctp_transport *t, u32 pmtu) > { > struct dst_entry *dst = sctp_transport_dst_check(t); > + struct sock *sk = t->asoc->base.sk; > bool change = true; > > if (unlikely(pmtu < SCTP_DEFAULT_MINSEGMENT)) { > @@ -271,12 +272,19 @@ bool sctp_transport_update_pmtu(struct sctp_transport > *t, u32 pmtu) > pmtu = SCTP_TRUNC4(pmtu); > > if (dst) { > - dst->ops->update_pmtu(dst, t->asoc->base.sk, NULL, pmtu); > + struct sctp_pf *pf = sctp_get_pf_specific(dst->ops->family); > + union sctp_addr addr; > + > + pf->af->from_sk(, sk); > + pf->to_sk_daddr(>ipaddr, sk); > + dst->ops->update_pmtu(dst, sk, NULL, pmtu); > + pf->to_sk_daddr(, sk); > + > dst = sctp_transport_dst_check(t); > } > > if (!dst) { > - t->af_specific->get_dst(t, >saddr, >fl, t->asoc->base.sk); > + t->af_specific->get_dst(t, 
>saddr, >fl, sk); > dst = t->dst; > } > > -- > 2.1.0 >
Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC actions
On Thu, 20 Sep 2018 09:14:08 +0200, Eelco Chaudron wrote: > Is there anything else blocking from getting this into net-next? > > I still think this patch is beneficial for the full user experience, and > I’ve got requests from QA and others for this. Not from my perspective, the numbers look okay, feel free to repost!
Re: array bounds warning in xfrm_output_resume
David Ahern wrote: > > $ make O=kbuild/perf -j 24 -s > > In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0, > > from /home/dsa/kernel-3.git/include/linux/list.h:9, > > from /home/dsa/kernel-3.git/include/linux/module.h:9, > > from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13: > > /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function > > ‘xfrm_output_resume’: > > /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array > > subscript is above array bounds [-Warray-bounds] > >__read_once_size(&(x), __u.__c, sizeof(x)); \ Does this thing avoid the warning? diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c --- a/net/xfrm/xfrm_output.c +++ b/net/xfrm/xfrm_output.c @@ -143,6 +143,8 @@ int xfrm_output_resume(struct sk_buff *skb, int err) struct net *net = xs_net(skb_dst(skb)->xfrm); while (likely((err = xfrm_output_one(skb, err)) == 0)) { + unsigned short pf; + nf_reset(skb); err = skb_dst(skb)->ops->local_out(net, skb->sk, skb); @@ -152,11 +154,19 @@ int xfrm_output_resume(struct sk_buff *skb, int err) if (!skb_dst(skb)->xfrm) return dst_output(net, skb->sk, skb); - err = nf_hook(skb_dst(skb)->ops->family, - NF_INET_POST_ROUTING, net, skb->sk, skb, - NULL, skb_dst(skb)->dev, xfrm_output2); - if (unlikely(err != 1)) - goto out; + pf = skb_dst(skb)->ops->family; + switch (pf) { + case AF_INET: /* fallthrough */ + case AF_INET6: + err = nf_hook(pf, + NF_INET_POST_ROUTING, net, skb->sk, skb, + NULL, skb_dst(skb)->dev, xfrm_output2); + if (unlikely(err != 1)) + goto out; + break; + default: + break; + } } if (err == -EINPROGRESS)
Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs
Em Thu, Sep 20, 2018 at 10:36:17AM -0300, Arnaldo Carvalho de Melo escreveu: > Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu: > > Recognize JITed bpf prog load/unload events. > > Add/remove kernel symbols accordingly. > > > > Signed-off-by: Alexei Starovoitov > > --- > > tools/perf/util/machine.c | 27 +++ > > tools/perf/util/symbol.c | 13 + > > tools/perf/util/symbol.h | 1 + > > 3 files changed, 41 insertions(+) > > > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c > > index c4acd2001db0..ae4f8a0fdc7e 100644 > > --- a/tools/perf/util/machine.c > > +++ b/tools/perf/util/machine.c > > @@ -25,6 +25,7 @@ > > #include "sane_ctype.h" > > #include > > #include > > +#include > > > > static void __machine__remove_thread(struct machine *machine, struct > > thread *th, bool lock); > > > > @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct > > machine *machine, > > enum dso_kernel_type kernel_type; > > bool is_kernel_mmap; > > > > + /* process JITed bpf programs load/unload events */ > > + if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) { > > > So, this would be in machine__process_kernel_munmap-event(machine), etc, > no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP. > > > + struct symbol *sym; > > + u64 ip; > > + > > + map = map_groups__find(>kmaps, event->mmap.start); > > + if (!map) { > > + pr_err("No kernel map for IP %lx\n", event->mmap.start); > > + goto out_problem; > > + } > > + ip = event->mmap.start - map->start + map->pgoff; > > + if (event->mmap.filename[0]) { > > + sym = symbol__new(ip, event->mmap.len, 0, 0, > > + event->mmap.filename); > > Humm, so the bpf program would be just one symbol... bpf-to-bpf calls > will be to a different bpf program, right? 
> > /me goes to read https://lwn.net/Articles/741773/ > "[PATCH bpf-next 00/13] bpf: introduce function calls" After reading it, yeah, I think we need some way to access a symbol table for a BPF program, and also its binary so that we can do annotation, static (perf annotate) and live (perf top), was this already considered? I think one can get the binary for a program giving sufficient perms somehow, right? One other thing I need to catch up 8-) - Arnaldo > > + dso__insert_symbol(map->dso, sym); > > + } else { > > + if (symbols__erase(>dso->symbols, ip)) { > > + pr_err("No bpf prog at IP %lx/%lx\n", > > + event->mmap.start, ip); > > + goto out_problem; > > + } > > + dso__reset_find_symbol_cache(map->dso); > > + } > > + return 0; > > + } > > + > > /* If we have maps from kcore then we do not need or want any others */ > > if (machine__uses_kcore(machine)) > > return 0; > > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c > > index d188b7588152..0653f313661d 100644 > > --- a/tools/perf/util/symbol.c > > +++ b/tools/perf/util/symbol.c > > @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root > > *symbols, u64 ip) > > return NULL; > > } > > > > +int symbols__erase(struct rb_root *symbols, u64 ip) > > +{ > > + struct symbol *s; > > + > > + s = symbols__find(symbols, ip); > > + if (!s) > > + return -ENOENT; > > + > > + rb_erase(>rb_node, symbols); > > + symbol__delete(s); > > + return 0; > > +} > > + > > static struct symbol *symbols__first(struct rb_root *symbols) > > { > > struct rb_node *n = rb_first(symbols); > > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h > > index f25fae4b5743..92ef31953d9a 100644 > > --- a/tools/perf/util/symbol.h > > +++ b/tools/perf/util/symbol.h > > @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, > > const char *elf_name); > > > > void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool > > kernel); > > void symbols__insert(struct rb_root *symbols, 
struct symbol *sym); > > +int symbols__erase(struct rb_root *symbols, u64 ip); > > void symbols__fixup_duplicate(struct rb_root *symbols); > > void symbols__fixup_end(struct rb_root *symbols); > > void map_groups__fixup_end(struct map_groups *mg); > > -- > > 2.17.1
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote: > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due > to having to cope with munmapping parts of existing mmaps, etc. > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for > now it would be used just in this clean case for undoing a > PERF_RECORD_MMAP for a BPF program. > > The ABI is already complicated, starting to use something called > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever, > I think. Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult part was getting the perf tool to dtrt for that use-case. But if we need unmap events, doing the unmap record now is the right thing.
Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices
On 09/20/2018 06:43 AM, Eric Dumazet wrote: > > And lastly this patch looks way too complicated to me. > You probably can write something much simpler. Something like : diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index 930d17fa906c9ebf1cf7b6031ce0a22f9f66c0e4..e0a81beb4f37751421dbbe794ccf3d5a46bdf900 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -278,22 +278,26 @@ static bool netlink_filter_tap(const struct sk_buff *skb) return false; } -static int __netlink_deliver_tap_skb(struct sk_buff *skb, +static int __netlink_deliver_tap_skb(struct sk_buff **pskb, struct net_device *dev) { - struct sk_buff *nskb; + struct sk_buff *nskb, *skb = *pskb; struct sock *sk = skb->sk; int ret = -ENOMEM; if (!net_eq(dev_net(dev), sock_net(sk))) return 0; - dev_hold(dev); - - if (is_vmalloc_addr(skb->head)) + if (is_vmalloc_addr(skb->head)) { nskb = netlink_to_full_skb(skb, GFP_ATOMIC); - else - nskb = skb_clone(skb, GFP_ATOMIC); + if (!nskb) + return -ENOMEM; + consume_skb(skb); + skb = nskb; + *pskb = skb; + } + dev_hold(dev); + nskb = skb_clone(skb, GFP_ATOMIC); if (nskb) { nskb->dev = dev; nskb->protocol = htons((u16) sk->sk_protocol); @@ -318,7 +322,7 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *n return; list_for_each_entry_rcu(tmp, >netlink_tap_all, list) { - ret = __netlink_deliver_tap_skb(skb, tmp->dev); + ret = __netlink_deliver_tap_skb(, tmp->dev); if (unlikely(ret)) break; }
Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes
On Wed, Sep 19, 2018 at 05:06:25PM +0100, Edward Cree wrote: > Instead of commas, just have them as separate argvs. > > The parsing state machine might look heavyweight, but it makes it easy to add > more parameters later and distinguish parameter names from encoding names. > > Suggested-by: Michal Kubecek > Signed-off-by: Edward Cree > --- Looks good, thank you. Reviewed-by: Michal Kubecek > ethtool.8.in | 6 +++--- > ethtool.c | 63 > -- > test-cmdline.c | 10 +- > 3 files changed, 25 insertions(+), 54 deletions(-) > > diff --git a/ethtool.8.in b/ethtool.8.in > index 414eaa1..7ea2cc0 100644 > --- a/ethtool.8.in > +++ b/ethtool.8.in > @@ -390,7 +390,7 @@ ethtool \- query or control network driver and hardware > settings > .B ethtool \-\-set\-fec > .I devname > .B encoding > -.BR auto | off | rs | baser [ , ...] > +.BR auto | off | rs | baser \ [...] > . > .\" Adjust lines (i.e. full justification) and hyphenate. > .ad > @@ -1120,11 +1120,11 @@ current FEC mode, the driver or firmware must take > the link down > administratively and report the problem in the system logs for users to > correct. > .RS 4 > .TP > -.BR encoding\ auto | off | rs | baser [ , ...] > +.BR encoding\ auto | off | rs | baser \ [...] > > Sets the FEC encoding for the device. Combinations of options are specified > as > e.g. > -.B auto,rs > +.B encoding auto rs > ; the semantics of such combinations vary between drivers. > .TS > nokeep; > diff --git a/ethtool.c b/ethtool.c > index 9997930..2f7e96b 100644 > --- a/ethtool.c > +++ b/ethtool.c > @@ -4979,39 +4979,6 @@ static int fecmode_str_to_type(const char *str) > return 0; > } > > -/* Takes a comma-separated list of FEC modes, returns the bitwise OR of their > - * corresponding ETHTOOL_FEC_* constants. > - * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,'). 
> - */ > -static int parse_fecmode(const char *str) > -{ > - int fecmode = 0; > - char buf[6]; > - > - if (!str) > - return 0; > - while (*str) { > - size_t next; > - int mode; > - > - next = strcspn(str, ","); > - if (next >= 6) /* Bad mode, longest name is 5 chars */ > - return 0; > - /* Copy into temp buffer and nul-terminate */ > - memcpy(buf, str, next); > - buf[next] = 0; > - mode = fecmode_str_to_type(buf); > - if (!mode) /* Bad mode encountered */ > - return 0; > - fecmode |= mode; > - str += next; > - /* Skip over ',' (but not nul) */ > - if (*str) > - str++; > - } > - return fecmode; > -} > - > static int do_gfec(struct cmd_context *ctx) > { > struct ethtool_fecparam feccmd = { 0 }; > @@ -5041,22 +5008,26 @@ static int do_gfec(struct cmd_context *ctx) > > static int do_sfec(struct cmd_context *ctx) > { > - char *fecmode_str = NULL; > + enum { ARG_NONE, ARG_ENCODING } state = ARG_NONE; > struct ethtool_fecparam feccmd; > - struct cmdline_info cmdline_fec[] = { > - { "encoding", CMDL_STR, _str, }, > - }; > - int changed; > - int fecmode; > - int rv; > + int fecmode = 0, newmode; > + int rv, i; > > - parse_generic_cmdline(ctx, , cmdline_fec, > - ARRAY_SIZE(cmdline_fec)); > - > - if (!fecmode_str) > + for (i = 0; i < ctx->argc; i++) { > + if (!strcmp(ctx->argp[i], "encoding")) { > + state = ARG_ENCODING; > + continue; > + } > + if (state == ARG_ENCODING) { > + newmode = fecmode_str_to_type(ctx->argp[i]); > + if (!newmode) > + exit_bad_args(); > + fecmode |= newmode; > + continue; > + } > exit_bad_args(); > + } > > - fecmode = parse_fecmode(fecmode_str); > if (!fecmode) > exit_bad_args(); > > @@ -5265,7 +5236,7 @@ static const struct option { > " [ all ]\n"}, > { "--show-fec", 1, do_gfec, "Show FEC settings"}, > { "--set-fec", 1, do_sfec, "Set FEC settings", > - " [ encoding auto|off|rs|baser ]\n"}, > + " [ encoding auto|off|rs|baser [...]]\n"}, > { "-h|--help", 0, show_usage, "Show this help" }, > { "--version", 0, do_version, "Show version number" }, > {} > 
diff --git a/test-cmdline.c b/test-cmdline.c > index 9c51cca..84630a5 100644 > --- a/test-cmdline.c > +++ b/test-cmdline.c > @@ -268,12 +268,12 @@ static struct test_case { > { 1, "--set-eee devname advertise foo" }, > { 1, "--set-fec devname" }, > { 0, "--set-fec devname encoding auto" }, > - { 0, "--set-fec devname encoding off," }, > -
Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices
On 09/20/2018 01:54 AM, Li RongQing wrote: > if skb->head is vmalloc address, when this skb is delivered, full > allocation for this skb is required, if there are many devices, > the full allocation will be called for every device > > now if it is vmalloc, allocate a new skb, whose data is not vmalloc > address, and use new allocated skb to clone and send, to avoid to > allocate full skb every time. > > Signed-off-by: Zhang Yu > Signed-off-by: Li RongQing This looks like a broken Signed-off-by chain to me. Who really wrote this patch? Also, given that I gave feedback to your first version, it is customary to CC me for future versions, unless you want me to not notice them. And lastly this patch looks way too complicated to me. You probably can write something much simpler. Thanks.
Re: [PATCH v2 2/2] hv_netvsc: pair VF based on serial number
On Fri, Sep 14, 2018 at 12:54:57PM -0700, Stephen Hemminger wrote: > Matching network device based on MAC address is problematic > since a non VF network device can be creted with a duplicate MAC > address causing confusion and problems. The VMBus API does provide > a serial number that is a better matching method. > > Signed-off-by: Stephen Hemminger > --- > drivers/net/hyperv/netvsc.c | 3 ++ > drivers/net/hyperv/netvsc_drv.c | 58 +++-- > 2 files changed, 36 insertions(+), 25 deletions(-) > > diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c > index 31c3d77b4733..fe01e141c8f8 100644 > --- a/drivers/net/hyperv/netvsc.c > +++ b/drivers/net/hyperv/netvsc.c > @@ -1203,6 +1203,9 @@ static void netvsc_send_vf(struct net_device *ndev, > > net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated; > net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial; > + netdev_info(ndev, "VF slot %u %s\n", > + net_device_ctx->vf_serial, > + net_device_ctx->vf_alloc ? "added" : "removed"); > } > > static void netvsc_receive_inband(struct net_device *ndev, > diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c > index 1121a1ec407c..9dedc1463e88 100644 > --- a/drivers/net/hyperv/netvsc_drv.c > +++ b/drivers/net/hyperv/netvsc_drv.c > @@ -1894,20 +1894,6 @@ static void netvsc_link_change(struct work_struct *w) > rtnl_unlock(); > } > > -static struct net_device *get_netvsc_bymac(const u8 *mac) > -{ > - struct net_device_context *ndev_ctx; > - > - list_for_each_entry(ndev_ctx, _dev_list, list) { > - struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx); > - > - if (ether_addr_equal(mac, dev->perm_addr)) > - return dev; > - } > - > - return NULL; > -} > - > static struct net_device *get_netvsc_byref(struct net_device *vf_netdev) > { > struct net_device_context *net_device_ctx; > @@ -2036,26 +2022,48 @@ static void netvsc_vf_setup(struct work_struct *w) > rtnl_unlock(); > } > > +/* Find netvsc by VMBus serial number. 
> + * The PCI hyperv controller records the serial number as the slot. > + */ > +static struct net_device *get_netvsc_byslot(const struct net_device > *vf_netdev) > +{ > + struct device *parent = vf_netdev->dev.parent; > + struct net_device_context *ndev_ctx; > + struct pci_dev *pdev; > + > + if (!parent || !dev_is_pci(parent)) > + return NULL; /* not a PCI device */ > + > + pdev = to_pci_dev(parent); > + if (!pdev->slot) { > + netdev_notice(vf_netdev, "no PCI slot information\n"); > + return NULL; > + } > + > + list_for_each_entry(ndev_ctx, _dev_list, list) { > + if (!ndev_ctx->vf_alloc) > + continue; > + > + if (ndev_ctx->vf_serial == pdev->slot->number) > + return hv_get_drvdata(ndev_ctx->device_ctx); In patch 1, pdev->slot->number is set to: PCI_SLOT(wslot_to_devfn(hpdev->desc.win_slot.slot)) so I assume vf_serial is initialized (I have no knowledge of VMBUS and hyper-V internals) to that value, somehow. I also do not know how the wslot stuff is handled but I assume 5 bits (ie dev bits in devfn) are enough. BTW, I have noticed this patch (and patch 1) are already in -next so I will drop them from the PCI patch queue. Lorenzo > + } > + > + netdev_notice(vf_netdev, > + "no netdev found for slot %u\n", pdev->slot->number); > + return NULL; > +} > + > static int netvsc_register_vf(struct net_device *vf_netdev) > { > - struct net_device *ndev; > struct net_device_context *net_device_ctx; > - struct device *pdev = vf_netdev->dev.parent; > struct netvsc_device *netvsc_dev; > + struct net_device *ndev; > int ret; > > if (vf_netdev->addr_len != ETH_ALEN) > return NOTIFY_DONE; > > - if (!pdev || !dev_is_pci(pdev) || dev_is_pf(pdev)) > - return NOTIFY_DONE; > - > - /* > - * We will use the MAC address to locate the synthetic interface to > - * associate with the VF interface. If we don't find a matching > - * synthetic interface, move on. 
> - */ > - ndev = get_netvsc_bymac(vf_netdev->perm_addr); > + ndev = get_netvsc_byslot(vf_netdev); > if (!ndev) > return NOTIFY_DONE; > > -- > 2.18.0 >
Re: [PATCH v2 2/4] dt-bindings: net: qcom: Add binding for shared mdio bus
On 9/19/18 10:20 AM, Andrew Lunn wrote: > I suspect that is not going to be easy. Last time i looked, the ACPI > standard had nothing about MDIO busses or PHYs. Marcin Wojtas did some > work in this area a while back for the mvpp2, but if i remember correctly, > he worked around this by simply not having a PHY when using ACPI, and > making use of a MAC interrupt which indicated when there was link. > Whoever implements this first needs to be an ACPI expert and probably > needs to write it up and submit it as an amendment to the ACPI standard. If that's what it takes, then so be it. But adding DT support for a device that is only used on ACPI platforms is not a worthwhile endeavor. After ACPI support is merged, Dongsheng can choose to add DT support to maintain parity, if he wants. Maybe one day the MSM developers will use the upstream driver on one of their SOCs.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote: > On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote: > > void bpf_prog_kallsyms_del(struct bpf_prog *fp) > > { > > + unsigned long symbol_start, symbol_end; > > + /* mmap_record.filename cannot be NULL and has to be u64 aligned */ > > + char buf[sizeof(u64)] = {}; > > + > > if (!bpf_prog_kallsyms_candidate(fp)) > > return; > > > > spin_lock_bh(_lock); > > bpf_prog_ksym_node_del(fp->aux); > > spin_unlock_bh(_lock); > > + bpf_get_prog_addr_region(fp, _start, _end); > > + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start, > > +buf, sizeof(buf)); > > } > > So perf doesn't normally issue unmap events.. We've talked about doing > that, but so far it's never really been needed, I think. > > It feels a bit weird to start issuing unmap events for this. FWIW: https://lkml.kernel.org/r/20170127130702.gi6...@twins.programming.kicks-ass.net has talk of PERF_RECORD_MUNMAP
Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs
Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu: > Recognize JITed bpf prog load/unload events. > Add/remove kernel symbols accordingly. > > Signed-off-by: Alexei Starovoitov > --- > tools/perf/util/machine.c | 27 +++ > tools/perf/util/symbol.c | 13 + > tools/perf/util/symbol.h | 1 + > 3 files changed, 41 insertions(+) > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c > index c4acd2001db0..ae4f8a0fdc7e 100644 > --- a/tools/perf/util/machine.c > +++ b/tools/perf/util/machine.c > @@ -25,6 +25,7 @@ > #include "sane_ctype.h" > #include > #include > +#include > > static void __machine__remove_thread(struct machine *machine, struct thread > *th, bool lock); > > @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct > machine *machine, > enum dso_kernel_type kernel_type; > bool is_kernel_mmap; > > + /* process JITed bpf programs load/unload events */ > + if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) { So, this would be in machine__process_kernel_munmap-event(machine), etc, no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP. > + struct symbol *sym; > + u64 ip; > + > + map = map_groups__find(>kmaps, event->mmap.start); > + if (!map) { > + pr_err("No kernel map for IP %lx\n", event->mmap.start); > + goto out_problem; > + } > + ip = event->mmap.start - map->start + map->pgoff; > + if (event->mmap.filename[0]) { > + sym = symbol__new(ip, event->mmap.len, 0, 0, > + event->mmap.filename); Humm, so the bpf program would be just one symbol... bpf-to-bpf calls will be to a different bpf program, right? 
/me goes to read https://lwn.net/Articles/741773/ "[PATCH bpf-next 00/13] bpf: introduce function calls" > + dso__insert_symbol(map->dso, sym); > + } else { > + if (symbols__erase(>dso->symbols, ip)) { > + pr_err("No bpf prog at IP %lx/%lx\n", > +event->mmap.start, ip); > + goto out_problem; > + } > + dso__reset_find_symbol_cache(map->dso); > + } > + return 0; > + } > + > /* If we have maps from kcore then we do not need or want any others */ > if (machine__uses_kcore(machine)) > return 0; > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c > index d188b7588152..0653f313661d 100644 > --- a/tools/perf/util/symbol.c > +++ b/tools/perf/util/symbol.c > @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root > *symbols, u64 ip) > return NULL; > } > > +int symbols__erase(struct rb_root *symbols, u64 ip) > +{ > + struct symbol *s; > + > + s = symbols__find(symbols, ip); > + if (!s) > + return -ENOENT; > + > + rb_erase(>rb_node, symbols); > + symbol__delete(s); > + return 0; > +} > + > static struct symbol *symbols__first(struct rb_root *symbols) > { > struct rb_node *n = rb_first(symbols); > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h > index f25fae4b5743..92ef31953d9a 100644 > --- a/tools/perf/util/symbol.h > +++ b/tools/perf/util/symbol.h > @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, > const char *elf_name); > > void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool > kernel); > void symbols__insert(struct rb_root *symbols, struct symbol *sym); > +int symbols__erase(struct rb_root *symbols, u64 ip); > void symbols__fixup_duplicate(struct rb_root *symbols); > void symbols__fixup_end(struct rb_root *symbols); > void map_groups__fixup_end(struct map_groups *mg); > -- > 2.17.1
Re: [PATCH net] af_key: free SKBs under RCU protection
On 09/19/2018 05:18 PM, Sean Tranchetti wrote:
> pfkey_broadcast() can make calls to pfkey_broadcast_one() which
> will clone or copy the passed in SKB. This new SKB will be assigned
> the sock_rfree() function as its destructor, which requires that
> the socket reference the SKB contains is valid when the SKB is freed.
>
> If this SKB is ever passed to pfkey_broadcast() again by some other
> function (such as pfkey_dump() or pfkey_promisc) it will then be
> freed there. However, since this free occurs outside of RCU protection,
> it is possible that userspace could close the socket and trigger
> pfkey_release() to free the socket before sock_rfree() can run, creating
> the following race condition:
>
> 1: An SKB belonging to the pfkey socket is passed to pfkey_broadcast().
>    It performs the broadcast to any other sockets, and calls
>    rcu_read_unlock(), but does not yet reach kfree_skb().
>
> 2: Userspace closes the socket, triggering pfkey_release(). Since no one
>    holds the RCU lock, synchronize_rcu() returns and it is allowed to
>    continue. It calls sock_put() to free the socket.
>
> 3: pfkey_broadcast() now calls kfree_skb() on the original SKB it was
>    passed, triggering a call to sock_rfree(). This function now accesses
>    the freed struct sock * via skb->sk, and attempts to update invalid
>    memory.
>
> By ensuring that pfkey_broadcast() also frees the SKBs while it holds
> the RCU lock, we can ensure that the socket will remain valid when the SKB
> is freed, avoiding crashes like the following:
>
> Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
> [006b6b6b6b6b6c4b] address between user and kernel address ranges
> Internal error: Oops: 9604 [#1] PREEMPT SMP
> task: fff78f65b380 task.stack: ff8049a88000
> pc : sock_rfree+0x38/0x6c
> lr : skb_release_head_state+0x6c/0xcc
> Process repro (pid: 7117, stack limit = 0xff8049a88000)
> Call trace:
>  sock_rfree+0x38/0x6c
>  skb_release_head_state+0x6c/0xcc
>  skb_release_all+0x1c/0x38
>  __kfree_skb+0x1c/0x30
>  kfree_skb+0xd0/0xf4
>  pfkey_broadcast+0x14c/0x18c
>  pfkey_sendmsg+0x1d8/0x408
>  sock_sendmsg+0x44/0x60
>  ___sys_sendmsg+0x1d0/0x2a8
>  __sys_sendmsg+0x64/0xb4
>  SyS_sendmsg+0x34/0x4c
>  el0_svc_naked+0x34/0x38
> Kernel panic - not syncing: Fatal exception
>
> Signed-off-by: Sean Tranchetti
> ---
>  net/key/af_key.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/key/af_key.c b/net/key/af_key.c
> index 9d61266..dd257c7 100644
> --- a/net/key/af_key.c
> +++ b/net/key/af_key.c
> @@ -275,13 +275,13 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
>  		if ((broadcast_flags & BROADCAST_REGISTERED) && err)
>  			err = err2;
>  	}
> -	rcu_read_unlock();
>
>  	if (one_sk != NULL)
>  		err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk);
>
>  	kfree_skb(skb2);
>  	kfree_skb(skb);
> +	rcu_read_unlock();
>  	return err;
>  }
>

I do not believe the changelog or the patch makes sense.

Having skb still referencing a socket prevents this socket being released.

If you think about it, what would prevent the freeing happening _before_ the rcu_read_lock() in pfkey_broadcast() ?

Maybe the correct fix is that pfkey_broadcast_one() should ensure the socket is still valid.
I would suggest something like :

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff *skb, struct sk_buff **skb2,
 {
 	int err = -ENOBUFS;
 
-	sock_hold(sk);
+	if (!refcount_inc_not_zero(&sk->sk_refcnt))
+		return -ENOENT;
+
 	if (*skb2 == NULL) {
 		if (refcount_read(&skb->users) != 1) {
 			*skb2 = skb_clone(skb, allocation);
Re: [PATCH v3 net-next 07/12] net: ethernet: Add helper to remove a supported link mode
> 1. net-next: cf7d97e1e54d ("net: mdio: remove duplicated include from mdio_bus.c")
>
> # mii-tool -vv eth0
> Using SIOCGMIIPHY=0x8947
> eth0: no link
> registers for MII PHY 0:
>   1140 7949 0022 1622 0d81 c1e1 000f 0300 3000 0078 7002 0200 0528
> product info: vendor 00:08:85, model 34 rev 2
> basic mode: autonegotiation enabled
> basic status: no link
> capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
> advertising: 100baseTx-FD 100baseTx-HD flow-control
> link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
>
> 2. net-next with this patch reverted
>
> # mii-tool -vv eth0
> Using SIOCGMIIPHY=0x8947
> eth0: negotiated 100baseTx-FD, link ok
> registers for MII PHY 0:
>   1140 796d 0022 1622 0181 c1e1 000f

Hi Simon

Word 5 is what we are advertising. Bits 10 and 11 are Pause and Asym Pause. In the good case here, neither is set. In the bad case above, both bits are set. The patch I asked you to try only cleared the Pause bit, not the Asymmetric Pause bit. mii-tool only saying 'flow-control' did not help.

Word 6 is what the partner is advertising. c1e1 indicates the partner does not support flow control, both bits are 0. I don't see why this is preventing auto-neg though. But in the bad case, the status register indicates auto-neg has not completed.

Anyway, please can you try this patch, which also removes Asym Pause.

Thanks
	Andrew

From 00a061304af51831ca1dc86bf6ce23d01f724229 Mon Sep 17 00:00:00 2001
From: Andrew Lunn
Date: Tue, 18 Sep 2018 18:12:54 -0500
Subject: [PATCH] ravb: Disable Pause Advertisement

The previous commit to ravb had the side effect of making the PHY advertise Pause. This previously did not happen, and it appears the MAC does not support Pause. By default, phydev->supported has Pause enabled, but phydev->advertising does not. Rather than rely on this, be explicit, and remove the Pause link mode.
Reported-by: Simon Horman
Fixes: 41124fa64d4b ("net: ethernet: Add helper to remove a supported link mode")
Signed-off-by: Andrew Lunn
---
 drivers/net/ethernet/renesas/ravb_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index fb2a1125780d..b0f2612ad226 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1073,9 +1073,11 @@ static int ravb_phy_init(struct net_device *ndev)
 		netdev_info(ndev, "limited PHY to 100Mbit/s\n");
 	}
 
-	/* 10BASE is not supported */
+	/* 10BASE, Pause and Asym Pause is not supported */
 	phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_10baseT_Half_BIT);
 	phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_10baseT_Full_BIT);
+	phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_Pause_BIT);
+	phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_Asym_Pause_BIT);
 
 	phy_attached_info(phydev);
-- 
2.19.0.rc1
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +	unsigned long symbol_start, symbol_end;
> > +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +	char buf[sizeof(u64)] = {};
> > +
> >  	if (!bpf_prog_kallsyms_candidate(fp))
> >  		return;
> >
> >  	spin_lock_bh(&bpf_lock);
> >  	bpf_prog_ksym_node_del(fp->aux);
> >  	spin_unlock_bh(&bpf_lock);
> > +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +				 buf, sizeof(buf));
> >  }
>
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed I think.
>
> It feels a bit weird to start issuing unmap events for this.

For reference, this surfaced here: https://lkml.org/lkml/2017/1/27/452

Start of the thread, that involves postgresql, JIT, LLVM, perf, is here: https://lkml.org/lkml/2016/12/10/1

PeterZ provided a patch introducing PERF_RECORD_MUNMAP; it went nowhere due to having to cope with munmapping parts of existing mmaps, etc.

I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for now it would be used just in this clean case for undoing a PERF_RECORD_MMAP for a BPF program.

The ABI is already complicated; starting to use something called PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever, I think.

- Arnaldo
Re: [PATCH net-next 4/5] ipv6: do not drop vrf udp multicast packets
Hi,

On Thu, 2018-09-20 at 09:58 +0100, Mike Manning wrote:
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 108f5f88ec98..fc60f297d95b 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -325,9 +325,12 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
>  	const struct inet6_protocol *ipprot;
>  	struct inet6_dev *idev;
> +	struct net_device *dev;
>  	unsigned int nhoff;
> +	int sdif = inet6_sdif(skb);
>  	int nexthdr;
>  	bool raw;
> +	bool deliver;
>  	bool have_final = false;

Please, try instead to sort the variables in reverse x-mas tree order.

>
>  	/*
> @@ -371,9 +374,27 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  		skb_postpull_rcsum(skb, skb_network_header(skb),
>  				   skb_network_header_len(skb));
>  		hdr = ipv6_hdr(skb);
> -		if (ipv6_addr_is_multicast(&hdr->daddr) &&
> -		    !ipv6_chk_mcast_addr(skb->dev, &hdr->daddr,
> -					 &hdr->saddr) &&
> +
> +		/* skb->dev passed may be master dev for vrfs. */
> +		if (sdif) {
> +			rcu_read_lock();

AFAICS, the rcu lock is already acquired at the beginning of ip6_input_finish(), no need to acquire it here again.

> +			dev = dev_get_by_index_rcu(dev_net(skb->dev), sdif);
> +			if (!dev) {
> +				rcu_read_unlock();
> +				kfree_skb(skb);
> +				return -ENODEV;
> +			}
> +		} else {
> +			dev = skb->dev;

The above fragment of code is a recurring pattern in this series; perhaps adding a helper for it would reduce code duplication?

Cheers,

Paolo
[RFC PATCH iproute2-next 2/3] rdma: Introduce command execution helper with required device name
From: Leon Romanovsky

In contrast to the various show commands, the set command explicitly requires a device name as an argument. Provide a new command execution helper which enforces it.

Signed-off-by: Leon Romanovsky
---
 rdma/rdma.h  |  1 +
 rdma/utils.c | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/rdma/rdma.h b/rdma/rdma.h
index d4b7ba19..dde9e128 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -90,6 +90,7 @@ int cmd_link(struct rd *rd);
 int cmd_res(struct rd *rd);
 int rd_exec_cmd(struct rd *rd, const struct rd_cmd *c, const char *str);
 int rd_exec_dev(struct rd *rd, int (*cb)(struct rd *rd));
+int rd_exec_require_dev(struct rd *rd, int (*cb)(struct rd *rd));
 int rd_exec_link(struct rd *rd, int (*cb)(struct rd *rd), bool strict_port);
 void rd_free(struct rd *rd);
 int rd_set_arg_to_devname(struct rd *rd);
diff --git a/rdma/utils.c b/rdma/utils.c
index 4840bf22..61f4aeb1 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -577,6 +577,16 @@ out:
 	return ret;
 }
 
+int rd_exec_require_dev(struct rd *rd, int (*cb)(struct rd *rd))
+{
+	if (rd_no_arg(rd)) {
+		pr_err("Please provide device name.\n");
+		return -EINVAL;
+	}
+
+	return rd_exec_dev(rd, cb);
+}
+
 int rd_exec_cmd(struct rd *rd, const struct rd_cmd *cmds, const char *str)
 {
 	const struct rd_cmd *c;
-- 
2.14.4
[RFC PATCH iproute2-next 1/3] rdma: Update kernel include file to support IB device renaming
From: Leon Romanovsky

Bring kernel header file changes up to commit 908b44308eda ("RDMA/nldev: Allow IB device rename through RDMA netlink").

Signed-off-by: Leon Romanovsky
---
 rdma/include/uapi/rdma/rdma_netlink.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
index 6513fb89..e2228c09 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -227,8 +227,9 @@ enum rdma_nldev_command {
 	RDMA_NLDEV_CMD_UNSPEC,
 
 	RDMA_NLDEV_CMD_GET, /* can dump */
+	RDMA_NLDEV_CMD_SET,
 
-	/* 2 - 4 are free to use */
+	/* 3 - 4 are free to use */
 
 	RDMA_NLDEV_CMD_PORT_GET = 5, /* can dump */
-- 
2.14.4
[RFC PATCH iproute2-next 3/3] rdma: Add an option to rename IB device interface
From: Leon Romanovsky

Enrich rdmatool with an option to rename IB devices; the command interface follows the iproute2 convention:
"rdma dev set [OLD-DEVNAME] name NEW-DEVNAME"

Signed-off-by: Leon Romanovsky
---
 rdma/dev.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/rdma/dev.c b/rdma/dev.c
index e2eafe47..760b7fb3 100644
--- a/rdma/dev.c
+++ b/rdma/dev.c
@@ -14,6 +14,7 @@ static int dev_help(struct rd *rd)
 {
 	pr_out("Usage: %s dev show [DEV]\n", rd->filename);
+	pr_out("       %s dev set [DEV] name DEVNAME\n", rd->filename);
 	return 0;
 }
 
@@ -240,17 +241,51 @@ static int dev_one_show(struct rd *rd)
 	return rd_exec_cmd(rd, cmds, "parameter");
 }
 
+static int dev_set_name(struct rd *rd)
+{
+	uint32_t seq;
+
+	if (rd_no_arg(rd)) {
+		pr_err("Please provide device new name.\n");
+		return -EINVAL;
+	}
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_SET, &seq,
+		       (NLM_F_REQUEST | NLM_F_ACK));
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, rd_argv(rd));
+
+	return rd_send_msg(rd);
+}
+
+static int dev_one_set(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		dev_help },
+		{ "name",	dev_set_name },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
 static int dev_show(struct rd *rd)
 {
 	return rd_exec_dev(rd, dev_one_show);
 }
 
+static int dev_set(struct rd *rd)
+{
+	return rd_exec_require_dev(rd, dev_one_set);
+}
+
 int cmd_dev(struct rd *rd)
 {
 	const struct rd_cmd cmds[] = {
 		{ NULL,		dev_show },
 		{ "show",	dev_show },
 		{ "list",	dev_show },
+		{ "set",	dev_set },
 		{ "help",	dev_help },
 		{ 0 }
 	};
-- 
2.14.4
[RFC PATCH iproute2-next 0/3] rdma: IB device rename
From: Leon Romanovsky

Hi,

This is the companion part of the kernel series posted earlier. The kernel part is not accepted yet, so the first patch will have a different commit message with a different commit SHA1. This is why it is marked as RFC.

An example:

[leonro@server /]$ lspci | grep -i Ether
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:09.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
[leonro@server /]$ sudo rdma dev
1: mlx5_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 sys_image_guid 5254:00c0:fe12:3455
[leonro@server /]$ sudo rdma dev set mlx5_0 name hfi1_0
[leonro@server /]$ sudo rdma dev
1: hfi1_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 sys_image_guid 5254:00c0:fe12:3455

Thanks

Leon Romanovsky (3):
  rdma: Update kernel include file to support IB device renaming
  rdma: Introduce command execution helper with required device name
  rdma: Add an option to rename IB device interface

 rdma/dev.c                            | 35 +++
 rdma/include/uapi/rdma/rdma_netlink.h |  3 ++-
 rdma/rdma.h                           |  1 +
 rdma/utils.c                          | 10 ++
 4 files changed, 48 insertions(+), 1 deletion(-)
-- 
2.14.4
[PATCH rdma-next 3/5] RDMA: Convert IB drivers to name allocation routine
From: Leon Romanovsky

Move internal implementation of device name to the IB/core.

Signed-off-by: Leon Romanovsky
---
 drivers/infiniband/core/device.c               | 5 -
 drivers/infiniband/hw/bnxt_re/main.c           | 6 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c    | 5 -
 drivers/infiniband/hw/cxgb4/provider.c         | 5 -
 drivers/infiniband/hw/hns/hns_roce_main.c      | 4 +++-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c      | 7 ++-
 drivers/infiniband/hw/mlx4/main.c              | 7 ++-
 drivers/infiniband/hw/mlx5/main.c              | 4 +++-
 drivers/infiniband/hw/mthca/mthca_provider.c   | 5 -
 drivers/infiniband/hw/nes/nes_verbs.c          | 6 +-
 drivers/infiniband/hw/ocrdma/ocrdma_main.c     | 7 ++-
 drivers/infiniband/hw/qedr/main.c              | 4 +++-
 drivers/infiniband/hw/usnic/usnic_ib_main.c    | 5 -
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c | 9 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.c          | 5 -
 15 files changed, 63 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 3270cde6d806..a9bc2a3f490c 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -518,11 +518,6 @@ int ib_register_device(struct ib_device *device,
 
 	mutex_lock(&device_mutex);
 
-	if (strchr(device->name, '%')) {
-		ret = alloc_name(device->name);
-		if (ret)
-			goto out;
-
 	if (ib_device_check_mandatory(device)) {
 		ret = -EINVAL;
 		goto out;
diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index 20b9f31052bf..5255e5ffd6f4 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -575,11 +575,15 @@ static void bnxt_re_unregister_ib(struct bnxt_re_dev *rdev)
 static int bnxt_re_register_ib(struct bnxt_re_dev *rdev)
 {
 	struct ib_device *ibdev = &rdev->ibdev;
+	int ret;
 
 	/* ib device init */
 	ibdev->owner = THIS_MODULE;
 	ibdev->node_type = RDMA_NODE_IB_CA;
-	strlcpy(ibdev->name, "bnxt_re%d", IB_DEVICE_NAME_MAX);
+	ret = ib_device_alloc_name(ibdev, "bnxt_re%d");
+	if (ret)
+		return ret;
+
 	strlcpy(ibdev->node_desc, BNXT_RE_DESC " HCA",
 		strlen(BNXT_RE_DESC) + 5);
 	ibdev->phys_port_cnt = 1;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 1b9ff21aa1d5..3680d80036eb 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1319,7 +1319,10 @@ int iwch_register_device(struct iwch_dev *dev)
 	int i;
 
 	pr_debug("%s iwch_dev %p\n", __func__, dev);
-	strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX);
+	ret = ib_device_alloc_name(&dev->ibdev, "cxgb3_%d");
+	if (ret)
+		return ret;
+
 	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
 	memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
 	dev->ibdev.owner = THIS_MODULE;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 4eda6872e617..37d40cbdf0dc 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -535,7 +535,10 @@ void c4iw_register_device(struct work_struct *work)
 	struct c4iw_dev *dev = ctx->dev;
 
 	pr_debug("c4iw_dev %p\n", dev);
-	strlcpy(dev->ibdev.name, "cxgb4_%d", IB_DEVICE_NAME_MAX);
+	ret = ib_device_alloc_name(&dev->ibdev, "cxgb4_%d");
+	if (ret)
+		return;
+
 	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
 	memcpy(&dev->ibdev.node_guid, dev->rdev.lldi.ports[0]->dev_addr, 6);
 	dev->ibdev.owner = THIS_MODULE;
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 6edb547baee8..e16f72692c83 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -449,7 +449,9 @@ static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
 	spin_lock_init(&iboe->lock);
 
 	ib_dev = &hr_dev->ib_dev;
-	strlcpy(ib_dev->name, "hns_%d", IB_DEVICE_NAME_MAX);
+	ret = ib_device_alloc_name(ib_dev, "hns_%d");
+	if (ret)
+		return ret;
 
 	ib_dev->owner = THIS_MODULE;
 	ib_dev->node_type = RDMA_NODE_IB_CA;
diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
index e2e6c74a7452..829be11da4e7 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
@@ -2746,13 +2746,18 @@ static struct i40iw_ib_device *i40iw_init_rdma_device(struct i40iw_device *iwdev)
 	struct i40iw_ib_device *iwibdev;
 	struct net_device *netdev = iwdev->netdev;
[PATCH rdma-next 2/5] net/smc: Use IB device index instead of name
From: Leon Romanovsky

IB device name is not stable and will be possible to rename in the following patches; update SMC code to use the IB index as a stable identification.

Signed-off-by: Leon Romanovsky
---
 net/smc/smc_diag.c |  6 +++---
 net/smc/smc_pnet.c | 27 ---
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index dbf64a93d68a..ba8b5a5671ec 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -156,9 +156,9 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
 			.lnk[0].link_id = smc->conn.lgr->lnk[0].link_id,
 		};
 
-		memcpy(linfo.lnk[0].ibname,
-		       smc->conn.lgr->lnk[0].smcibdev->ibdev->name,
-		       sizeof(smc->conn.lgr->lnk[0].smcibdev->ibdev->name));
+		ib_device_get_name(smc->conn.lgr->lnk[0].smcibdev->ibdev,
+				   linfo.lnk[0].ibname);
+
 		smc_gid_be16_convert(linfo.lnk[0].gid, smc->conn.lgr->lnk[0].gid);
 		smc_gid_be16_convert(linfo.lnk[0].peer_gid,
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index 01c6ce042a1c..67aec5ce6112 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -70,15 +70,14 @@ struct smc_pnetentry {
 	u8 ib_port;
 };
 
-/* Check if two RDMA device entries are identical. Use device name and port
+/* Check if two RDMA device entries are identical. Use device index and port
  * number for comparison.
  */
-static bool smc_pnet_same_ibname(struct smc_pnetentry *pnetelem, char *ibname,
-				 u8 ibport)
+static bool smc_pnet_same_ibindex(struct smc_pnetentry *pnetelem, u32 ibindex,
+				  u8 ibport)
 {
 	return pnetelem->ib_port == ibport &&
-	       !strncmp(pnetelem->smcibdev->ibdev->name, ibname,
-			sizeof(pnetelem->smcibdev->ibdev->name));
+	       pnetelem->smcibdev->ibdev->index == ibindex;
 }
 
 /* Find a pnetid in the pnet table.
@@ -179,9 +178,9 @@ static int smc_pnet_enter(struct smc_pnetentry *new_pnetelem)
 			     sizeof(new_pnetelem->pnet_name)) ||
 		    !strncmp(pnetelem->ndev->name, new_pnetelem->ndev->name,
 			     sizeof(new_pnetelem->ndev->name)) ||
-		    smc_pnet_same_ibname(pnetelem,
-					 new_pnetelem->smcibdev->ibdev->name,
-					 new_pnetelem->ib_port)) {
+		    smc_pnet_same_ibindex(pnetelem,
+					  new_pnetelem->smcibdev->ibdev->index,
+					  new_pnetelem->ib_port)) {
 			dev_put(pnetelem->ndev);
 			goto found;
 		}
@@ -227,10 +226,11 @@ static struct smc_ib_device *smc_pnet_find_ib(char *ib_name)
 
 	spin_lock(&smc_ib_devices.lock);
 	list_for_each_entry(ibdev, &smc_ib_devices.list, list) {
-		if (!strncmp(ibdev->ibdev->name, ib_name,
-			     sizeof(ibdev->ibdev->name))) {
+		char name[IB_DEVICE_NAME_MAX] = {};
+
+		ib_device_get_name(ibdev->ibdev, name);
+		if (!strncmp(name, ib_name, IB_DEVICE_NAME_MAX))
 			goto out;
-		}
 	}
 	ibdev = NULL;
 out:
@@ -267,6 +267,11 @@ static int smc_pnet_fill_entry(struct net *net, struct smc_pnetentry *pnetelem,
 		goto error;
 
 	rc = -EINVAL;
+	/* NOTE !!!: Sadly enough, but this is part of ABI.
+	 * From day one, the accesses are performed with device names and not
+	 * device indexes for both ETH and IB. It means that this function isn't
+	 * reliable after device renaming.
+	 */
 	if (!tb[SMC_PNETID_IBNAME])
 		goto error;
 	rc = -ENOENT;
-- 
2.14.4
[PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name
From: Leon Romanovsky

Prepare IB device name field to rename operation by ensuring that all accesses to it are protected with lock and users don't see part of name.

The protection is done with global device lock because it is used in allocation and deallocation phases. At this stage, this lock is not busy and easily can be moved to be per-device, once it will be needed.

Signed-off-by: Leon Romanovsky
---
 drivers/infiniband/core/device.c | 24 +++-
 include/rdma/ib_verbs.h          |  8 +++-
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 5a680a88aa87..3270cde6d806 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const char *name)
 	return NULL;
 }
 
+void ib_device_get_name(struct ib_device *ibdev, char *name)
+{
+	down_read(&lists_rwsem);
+	strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX);
+	up_read(&lists_rwsem);
+}
+EXPORT_SYMBOL(ib_device_get_name);
+
 static int alloc_name(char *name)
 {
 	unsigned long *inuse;
@@ -202,6 +210,21 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern)
+{
+	int ret = 0;
+
+	mutex_lock(&device_mutex);
+	strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX);
+	if (strchr(ibdev->name, '%'))
+		ret = alloc_name(ibdev->name);
+
+	mutex_unlock(&device_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL(ib_device_alloc_name);
+
 static void ib_device_release(struct device *device)
 {
 	struct ib_device *dev = container_of(device, struct ib_device, dev);
@@ -499,7 +522,6 @@ int ib_register_device(struct ib_device *device,
 		ret = alloc_name(device->name);
 		if (ret)
 			goto out;
-	}
 
 	if (ib_device_check_mandatory(device)) {
 		ret = -EINVAL;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e764ed1f6025..0e7b9854 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2260,6 +2260,11 @@ struct ib_device {
 	/* Do not access @dma_device directly from ULP nor from HW drivers. */
 	struct device *dma_device;
 
+	/*
+	 * Do not access @name directly,
+	 * use ib_device_get_name()/ib_device_alloc_name()
+	 * and don't assume that it can't change after access.
+	 */
 	char name[IB_DEVICE_NAME_MAX];
 
 	struct list_head event_handler_list;
@@ -2638,7 +2643,8 @@ struct ib_device *ib_alloc_device(size_t size);
 void ib_dealloc_device(struct ib_device *device);
 
 void ib_get_device_fw_str(struct ib_device *device, char *str);
-
+int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern);
+void ib_device_get_name(struct ib_device *ibdev, char *name);
 int ib_register_device(struct ib_device *device,
 		       int (*port_callback)(struct ib_device *, u8, struct kobject *));
-- 
2.14.4
[PATCH rdma-next 4/5] RDMA/core: Implement IB device rename function
From: Leon Romanovsky

Generic implementation of IB device rename function.

Signed-off-by: Leon Romanovsky
---
 drivers/infiniband/core/core_priv.h |  1 +
 drivers/infiniband/core/device.c    | 23 +++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index d7399d5b1cb6..c5881756b799 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -87,6 +87,7 @@ int ib_device_register_sysfs(struct ib_device *device,
 			     int (*port_callback)(struct ib_device *, u8, struct kobject *));
 void ib_device_unregister_sysfs(struct ib_device *device);
+int ib_device_rename(struct ib_device *ibdev, const char *name);
 
 typedef void (*roce_netdev_callback)(struct ib_device *device, u8 port,
 				     struct net_device *idev, void *cookie);
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index a9bc2a3f490c..36b79cd6b620 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -178,6 +178,29 @@ void ib_device_get_name(struct ib_device *ibdev, char *name)
 }
 EXPORT_SYMBOL(ib_device_get_name);
 
+int ib_device_rename(struct ib_device *ibdev, const char *name)
+{
+	struct ib_device *device;
+	int ret = 0;
+
+	if (!strcmp(name, ibdev->name))
+		return ret;
+
+	mutex_lock(&device_mutex);
+	list_for_each_entry(device, &device_list, core_list) {
+		if (!strcmp(name, device->name)) {
+			ret = -EEXIST;
+			goto out;
+		}
+	}
+
+	strlcpy(ibdev->name, name, IB_DEVICE_NAME_MAX);
+	ret = device_rename(&ibdev->dev, name);
+out:
+	mutex_unlock(&device_mutex);
+	return ret;
+}
+
 static int alloc_name(char *name)
 {
 	unsigned long *inuse;
-- 
2.14.4
[PATCH rdma-next 5/5] RDMA/nldev: Allow IB device rename through RDMA netlink
From: Leon Romanovsky Provide an option to rename IB device name through RDMA netlink and limit it to users with ADMIN capability only. Signed-off-by: Leon Romanovsky --- drivers/infiniband/core/nldev.c | 33 + include/uapi/rdma/rdma_netlink.h | 3 ++- 2 files changed, 35 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c index 0385ab438320..64e1d12e9678 100644 --- a/drivers/infiniband/core/nldev.c +++ b/drivers/infiniband/core/nldev.c @@ -645,6 +645,35 @@ static int nldev_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh, return err; } +static int nldev_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh, + struct netlink_ext_ack *extack) +{ + struct nlattr *tb[RDMA_NLDEV_ATTR_MAX]; + struct ib_device *device; + u32 index; + int err; + + err = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1, nldev_policy, + extack); + if (err || !tb[RDMA_NLDEV_ATTR_DEV_INDEX]) + return -EINVAL; + + index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]); + device = ib_device_get_by_index(index); + if (!device) + return -EINVAL; + + if (tb[RDMA_NLDEV_ATTR_DEV_NAME]) { + char name[IB_DEVICE_NAME_MAX] = {}; + + nla_strlcpy(name, tb[RDMA_NLDEV_ATTR_DEV_NAME], + IB_DEVICE_NAME_MAX); + return ib_device_rename(device, name); + } + + return 0; +} + static int _nldev_get_dumpit(struct ib_device *device, struct sk_buff *skb, struct netlink_callback *cb, @@ -1077,6 +1106,10 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = { .doit = nldev_get_doit, .dump = nldev_get_dumpit, }, + [RDMA_NLDEV_CMD_SET] = { + .doit = nldev_set_doit, + .flags = RDMA_NL_ADMIN_PERM, + }, [RDMA_NLDEV_CMD_PORT_GET] = { .doit = nldev_port_get_doit, .dump = nldev_port_get_dumpit, diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index edba6351ac13..f9c41bf59efc 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h @@ -227,8 +227,9 @@ enum rdma_nldev_command { RDMA_NLDEV_CMD_UNSPEC, 
RDMA_NLDEV_CMD_GET, /* can dump */ + RDMA_NLDEV_CMD_SET, - /* 2 - 4 are free to use */ + /* 3 - 4 are free to use */ RDMA_NLDEV_CMD_PORT_GET = 5, /* can dump */ -- 2.14.4
[PATCH rdma-next 0/5] IB device rename support
From: Leon Romanovsky

Hi,

This series introduces a long-awaited feature - "IB device rename". Such a feature gives an option to rename the user-visible IB device name from a vendor-specific name (e.g. mlx5_0) to anything else. The user space component through rdmatool will follow this series.

[leonro@server /]$ lspci | grep -i Ether
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:09.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
[leonro@server /]$ sudo rdma dev
1: mlx5_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 sys_image_guid 5254:00c0:fe12:3455
[leonro@server /]$ sudo rdma dev set mlx5_0 name hfi1_0
[leonro@server /]$ sudo rdma dev
1: hfi1_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 sys_image_guid 5254:00c0:fe12:3455

The first patch introduces a getter/setter to access names. I didn't convert all drivers to stop using the name directly, because they don't base their decisions on "name"; they use it for printing only, and can print a truncated name if renaming is done at the same time as logging.

The second patch updates SMC to use the IB device index instead of the name.

The third patch globally converts all drivers to the new name allocation routine.

The fourth and fifth patches actually implement and export the rename routines through RDMA netlink. This uses the exported device_rename() function; despite the comment from 2010, which warns about the downsides of this function, netdev still uses it, so we will use it too.

There is one patch/series which was dropped from this submission - conversion of SELinux from being IB device name based to being IB device index based. It simply needs more special care and more testing.

This series was tested with mlx5 devices with/without traffic and with non-modified rdma-core.

Dennis, I didn't touch hfi1, but I'm not sure if it is needed.
Thanks Leon Romanovsky (5): RDMA/core: Provide getter and setter to access IB device name net/smc: Use IB device index instead of name RDMA: Convert IB drivers to name allocation routine RDMA/core: Implement IB device rename function RDMA/nldev: Allow IB device rename through RDMA netlink drivers/infiniband/core/core_priv.h| 1 + drivers/infiniband/core/device.c | 52 +++--- drivers/infiniband/core/nldev.c| 33 drivers/infiniband/hw/bnxt_re/main.c | 6 ++- drivers/infiniband/hw/cxgb3/iwch_provider.c| 5 ++- drivers/infiniband/hw/cxgb4/provider.c | 5 ++- drivers/infiniband/hw/hns/hns_roce_main.c | 4 +- drivers/infiniband/hw/i40iw/i40iw_verbs.c | 7 +++- drivers/infiniband/hw/mlx4/main.c | 7 +++- drivers/infiniband/hw/mlx5/main.c | 4 +- drivers/infiniband/hw/mthca/mthca_provider.c | 5 ++- drivers/infiniband/hw/nes/nes_verbs.c | 6 ++- drivers/infiniband/hw/ocrdma/ocrdma_main.c | 7 +++- drivers/infiniband/hw/qedr/main.c | 4 +- drivers/infiniband/hw/usnic/usnic_ib_main.c| 5 ++- drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c | 9 +++-- drivers/infiniband/sw/rxe/rxe_verbs.c | 5 ++- include/rdma/ib_verbs.h| 8 +++- include/uapi/rdma/rdma_netlink.h | 3 +- net/smc/smc_diag.c | 6 +-- net/smc/smc_pnet.c | 27 +++-- 21 files changed, 171 insertions(+), 38 deletions(-) -- 2.14.4
Re: [PATCH iproute2 v2 0/3] testsuite: make alltests fixes
On Thu, 2018-09-20 at 01:36 +0200, Petr Vorel wrote:
> Hi,
>
> here are simply fixes to restore 'make alltests'.
> Currently it does not run.
>
> Kind regards,
> Petr
>
> Petr Vorel (3):
>   testsuite: Fix missing generate_nlmsg
>   testsuite: Generate generate_nlmsg when needed
>   testsuite: Warn about empty $(IPVERS)
>
>  testsuite/Makefile | 21 ++---
>  1 file changed, 14 insertions(+), 7 deletions(-)

Series-tested-by: Luca Boccassi

--
Kind regards,
Luca Boccassi
[PATCH net] sctp: update dst pmtu with the correct daddr
When processing a pmtu update from an icmp packet, it calls .update_pmtu with sk instead of skb in sctp_transport_update_pmtu. However for sctp, the daddr in the transport might be different from inet_sock->inet_daddr or sk->sk_v6_daddr, which is used to update or create the route cache. The incorrect daddr will cause a different route cache to be created for the path.

So before calling .update_pmtu, inet_sock->inet_daddr/sk->sk_v6_daddr should be updated with the daddr in the transport, and updated back after it's done.

The issue has existed since the introduction of route exceptions.

Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
Reported-by: ian.per...@dialogic.com
Signed-off-by: Xin Long
---
 net/sctp/transport.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index 12cac85..033696e 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -260,6 +260,7 @@ void sctp_transport_pmtu(struct sctp_transport *transport, struct sock *sk)
 bool sctp_transport_update_pmtu(struct sctp_transport *t, u32 pmtu)
 {
 	struct dst_entry *dst = sctp_transport_dst_check(t);
+	struct sock *sk = t->asoc->base.sk;
 	bool change = true;
 
 	if (unlikely(pmtu < SCTP_DEFAULT_MINSEGMENT)) {
@@ -271,12 +272,19 @@ bool sctp_transport_update_pmtu(struct sctp_transport *t, u32 pmtu)
 	pmtu = SCTP_TRUNC4(pmtu);
 
 	if (dst) {
-		dst->ops->update_pmtu(dst, t->asoc->base.sk, NULL, pmtu);
+		struct sctp_pf *pf = sctp_get_pf_specific(dst->ops->family);
+		union sctp_addr addr;
+
+		pf->af->from_sk(&addr, sk);
+		pf->to_sk_daddr(&t->ipaddr, sk);
+		dst->ops->update_pmtu(dst, sk, NULL, pmtu);
+		pf->to_sk_daddr(&addr, sk);
+
 		dst = sctp_transport_dst_check(t);
 	}
 
 	if (!dst) {
-		t->af_specific->get_dst(t, &t->saddr, &t->fl, t->asoc->base.sk);
+		t->af_specific->get_dst(t, &t->saddr, &t->fl, sk);
 		dst = t->dst;
 	}
--
2.1.0
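The fix above temporarily installs the transport's daddr into the socket around the .update_pmtu call and restores the saved copy afterwards. A toy user-space model of that save/swap/restore pattern (all types and names here are illustrative stand-ins, not the sctp code):

```c
#include <assert.h>

/* Toy model: the "socket" caches one destination address, but the
 * operation must run with the *transport's* address. Save the socket's
 * copy, install the transport address, call the op, then restore. */
struct toy_sock  { unsigned int daddr; };
struct toy_trans { unsigned int ipaddr; struct toy_sock *sk; };

static unsigned int g_seen_daddr; /* what the op observed */

static void toy_update_pmtu(struct toy_sock *sk)
{
	g_seen_daddr = sk->daddr; /* a route cache would be keyed on this */
}

static void toy_transport_update_pmtu(struct toy_trans *t)
{
	struct toy_sock *sk = t->sk;
	unsigned int saved = sk->daddr; /* like pf->af->from_sk(&addr, sk) */

	sk->daddr = t->ipaddr;          /* like pf->to_sk_daddr(&t->ipaddr, sk) */
	toy_update_pmtu(sk);            /* like dst->ops->update_pmtu(...) */
	sk->daddr = saved;              /* like pf->to_sk_daddr(&addr, sk) */
}
```

The invariant the patch cares about is visible here: the pmtu update sees the transport's daddr (so the right cache entry is touched), while the socket's own daddr is unchanged once the call returns.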
Re: [PATCH bpf-next 2/2] xsk: fix bug when trying to use both copy and zero-copy on one queue id
On Wed, Sep 19, 2018 at 9:18 AM Magnus Karlsson wrote:
>
> On Wed, Sep 19, 2018 at 3:58 AM Jakub Kicinski wrote:
> >
> > On Tue, 18 Sep 2018 10:22:11 -0700, Y Song wrote:
> > > > +/* The umem is stored both in the _rx struct and the _tx struct as we do
> > > > + * not know if the device has more tx queues than rx, or the opposite.
> > > > + * This might also change during run time.
> > > > + */
> > > > +static void xdp_reg_umem_at_qid(struct net_device *dev, struct xdp_umem *umem,
> > > > +				u16 queue_id)
> > > > +{
> > > > +	if (queue_id < dev->real_num_rx_queues)
> > > > +		dev->_rx[queue_id].umem = umem;
> > > > +	if (queue_id < dev->real_num_tx_queues)
> > > > +		dev->_tx[queue_id].umem = umem;
> > > > +}
> > > > +
> > > > +static struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
> > > > +					      u16 queue_id)
> > > > +{
> > > > +	if (queue_id < dev->real_num_rx_queues)
> > > > +		return dev->_rx[queue_id].umem;
> > > > +	if (queue_id < dev->real_num_tx_queues)
> > > > +		return dev->_tx[queue_id].umem;
> > > > +
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +static void xdp_clear_umem_at_qid(struct net_device *dev, u16 queue_id)
> > > > +{
> > > > +	/* Zero out the entry independent on how many queues are configured
> > > > +	 * at this point in time, as it might be used in the future.
> > > > +	 */
> > > > +	if (queue_id < dev->num_rx_queues)
> > > > +		dev->_rx[queue_id].umem = NULL;
> > > > +	if (queue_id < dev->num_tx_queues)
> > > > +		dev->_tx[queue_id].umem = NULL;
> > > > +}
> > > > +
> > >
> > > I am not sure whether the following scenario can happen or not.
> > > Could you clarify?
> > > 1. suppose initially we have num_rx_queues = num_tx_queues = 10
> > >    xdp_reg_umem_at_qid() set umem1 to queue_id = 8
> > > 2. num_tx_queues is changed to 5
> > > 3. xdp_clear_umem_at_qid() is called for queue_id = 8,
> > >    and dev->_rx[8].umem = 0.
> > > 4.
xdp_reg_umem_at_qid() is called again to set for queue_id = 8
> > >    dev->_rx[8].umem = umem2
> > > 5. num_tx_queues is changed to 10
> > > Now dev->_rx[8].umem != dev->_tx[8].umem, is this possible and is it
> > > a problem?
> >
> > Plus IIRC the check of qid vs real_num_[rt]x_queues in xsk_bind() is
> > not under rtnl_lock so it doesn't count for much. Why not do all the
> > checks against num_[rt]x_queues here, instead of real_..?
>
> You are correct, two separate rtnl_lock regions is broken. Will spin a
> v2 tomorrow when I am back in the office.
>
> Thanks Jakub for catching this. I really appreciate you reviewing my code.

Sorry, forgot to answer your question about why real_num_ instead of num_. If we used num_ instead, we would get a behavior where we can open a socket on a queue id, let us say 10, that will not be active if real_num_ is below 10. So no traffic will flow on this socket until we issue an ethtool command stating that we have 10 or more queues configured. While this behavior is sane (and consistent), I believe it will lead to a more complicated driver implementation, break the current uapi (since we currently return an error if you try to bind to a queue id that is not active at the moment), and I like my suggestion below better :-).

What I would like to suggest is that the current model at bind() is kept, that is, it will fail if you try to bind to a non-active queue id. But we add some code in ethtool_set_channels in net/core/ethtool.c to check if there is an active af_xdp socket on any queue id that the change would make inactive; if so, we return an error and disallow the change. This would IMHO be like not allowing a module to be unloaded while someone is using it, or not allowing a file system to be unmounted while files are in use. What do you think? If you think this is the right way to go, I can implement this in a follow-up patch or in this patch set if that makes more sense, let me know.
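Y Song's five-step scenario can be walked through in a small user-space model of the helpers quoted above (toy arrays instead of a net_device; as in the patch, registration checks the real_* counts while clearing uses the maximum counts):

```c
#include <assert.h>
#include <stddef.h>

#define NQ 16 /* max rx/tx queues in this toy device */

struct toy_dev {
	const void *rx_umem[NQ], *tx_umem[NQ];
	int real_num_rx_queues, real_num_tx_queues;
};

/* mirrors xdp_reg_umem_at_qid(): only active queues are written */
static void reg_umem_at_qid(struct toy_dev *d, const void *umem, int qid)
{
	if (qid < d->real_num_rx_queues)
		d->rx_umem[qid] = umem;
	if (qid < d->real_num_tx_queues)
		d->tx_umem[qid] = umem;
}

/* mirrors xdp_clear_umem_at_qid(): clears against the max counts */
static void clear_umem_at_qid(struct toy_dev *d, int qid)
{
	if (qid < NQ) {
		d->rx_umem[qid] = NULL;
		d->tx_umem[qid] = NULL;
	}
}
```

Running the five steps leaves `rx_umem[8]` pointing at umem2 while `tx_umem[8]` stays NULL once the tx count is raised back, i.e. exactly the rx/tx mismatch the question describes.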
This suggestion would actually make it simpler to implement zero-copy support in the driver. Some of the complications we are having today are due to the fact that ethtool can come in from the side and change things for us while an AF_XDP socket is active on a queue id. And implementing zero-copy support in the driver needs to get simpler IMO. Please let me know what you think.

Thanks: Magnus

> /Magnus
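The ethtool_set_channels check Magnus proposes could look roughly like this user-space sketch (a hypothetical model, not the actual net/core/ethtool.c code): refuse a channel-count reduction if any queue id that would become inactive still has an active AF_XDP umem.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NQ 16

struct toy_dev2 {
	const void *umem[NQ];  /* non-NULL = active AF_XDP socket */
	int real_num_queues;
};

/* Sketch of the proposal: before applying a new channel count, scan
 * the queue ids that would become inactive and refuse the change if
 * any of them is still in use by an AF_XDP socket. */
static int set_channels(struct toy_dev2 *d, int new_count)
{
	int q;

	for (q = new_count; q < d->real_num_queues; q++)
		if (d->umem[q])
			return -EBUSY; /* queue q is in use, disallow */

	d->real_num_queues = new_count;
	return 0;
}
```

This is the "module in use" / "filesystem busy" semantic from the mail: the reconfiguration fails with -EBUSY instead of silently pulling the queue out from under the socket.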
[PATCH net-next 0/3] Refactorings on af_inet pernet initialization
This patch set makes several cleanups around inet_init_net().
---
Cong Wang (1):
  ipv4: initialize ra_mutex in inet_init_net()

Kirill Tkhai (2):
  net: Remove inet_exit_net()
  net: Register af_inet_ops earlier

 net/core/net_namespace.c |  1 -
 net/ipv4/af_inet.c       | 13 +
 2 files changed, 5 insertions(+), 9 deletions(-)
--
Signed-off-by: Kirill Tkhai
[PATCH 2/3] net: Register af_inet_ops earlier
This function just initializes locks and defaults. Let's register it before the other pernet operations, since some of them may potentially rely on it.

Signed-off-by: Kirill Tkhai
---
 net/ipv4/af_inet.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f4ecbe0aaf1a..bbd3a072ffea 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1938,6 +1938,9 @@ static int __init inet_init(void)
 	for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
 		inet_register_protosw(q);
 
+	if (init_inet_pernet_ops())
+		pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__);
+
 	/*
 	 *	Set the ARP module up
 	 */
@@ -1975,9 +1978,6 @@ static int __init inet_init(void)
 	if (ip_mr_init())
 		pr_crit("%s: Cannot init ipv4 mroute\n", __func__);
 #endif
-
-	if (init_inet_pernet_ops())
-		pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__);
 	/*
 	 *	Initialise per-cpu ipv4 mibs
 	 */
[PATCH 3/3] ipv4: initialize ra_mutex in inet_init_net()
From: Cong Wang

ra_mutex is an IPv4-specific mutex: it lives inside struct netns_ipv4, but its initialization is in the generic netns code, setup_net(). Move it to the IPv4-specific net init code, inet_init_net().

Fixes: d9ff3049739e ("net: Replace ip_ra_lock with per-net mutex")
Signed-off-by: Cong Wang
Acked-by: Kirill Tkhai
---
 net/core/net_namespace.c |    1 -
 net/ipv4/af_inet.c       |    2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 670c84b1bfc2..b272ccfcbf63 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -308,7 +308,6 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	net->user_ns = user_ns;
 	idr_init(&net->netns_ids);
 	spin_lock_init(&net->nsid_lock);
-	mutex_init(&net->ipv4.ra_mutex);
 
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index bbd3a072ffea..d4623144e237 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1818,6 +1818,8 @@ static __net_init int inet_init_net(struct net *net)
 	net->ipv4.sysctl_igmp_llm_reports = 1;
 	net->ipv4.sysctl_igmp_qrv = 2;
 
+	mutex_init(&net->ipv4.ra_mutex);
+
 	return 0;
 }
[PATCH 1/3] net: Remove inet_exit_net()
This function does nothing, and since ops_exit_list() checks for a NULL ->exit method, we do not need a stub here.

Signed-off-by: Kirill Tkhai
---
 net/ipv4/af_inet.c |    5 -----
 1 file changed, 5 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1fbe2f815474..f4ecbe0aaf1a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1821,13 +1821,8 @@ static __net_init int inet_init_net(struct net *net)
 	return 0;
 }
 
-static __net_exit void inet_exit_net(struct net *net)
-{
-}
-
 static __net_initdata struct pernet_operations af_inet_ops = {
 	.init = inet_init_net,
-	.exit = inet_exit_net,
 };
 
 static int __init init_inet_pernet_ops(void)
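A toy model of why the empty stub is unnecessary: the unwind path in setup_net(), like ops_exit_list(), simply skips a NULL ->exit. The sketch below (names and structure heavily simplified from the pernet code, as an assumption-laden illustration) runs init methods in registration order and, on failure, runs exit methods of already-initialized ops in reverse, skipping NULL ones:

```c
#include <assert.h>
#include <stddef.h>

struct toy_ops {
	int  (*init)(void);
	void (*exit)(void);
};

static int log_buf[8], log_len;

static int  a_init(void) { log_buf[log_len++] = 1;  return 0; }
static int  b_init(void) { log_buf[log_len++] = 2;  return -1; } /* fails */
static void a_exit(void) { log_buf[log_len++] = -1; }

/* simplified setup_net(): init in order, unwind in reverse on error */
static int toy_setup_net(struct toy_ops *ops, int n)
{
	int i, err = 0;

	for (i = 0; i < n; i++) {
		err = ops[i].init ? ops[i].init() : 0;
		if (err)
			break;
	}
	if (!err)
		return 0;
	/* like ops_exit_list(): a NULL ->exit is simply skipped */
	for (i--; i >= 0; i--)
		if (ops[i].exit)
			ops[i].exit();
	return err;
}
```

This is also the shape of the concern in the thread below: ops registered earlier are initialized first and torn down last, so an earlier-registered defaults-initializer is safe for later ops' exit methods to rely on.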
Re: [Patch net-next] ipv4: initialize ra_mutex in inet_init_net()
On 20.09.2018 0:28, Cong Wang wrote:
> On Wed, Sep 19, 2018 at 1:25 AM Kirill Tkhai wrote:
>>
>> On 18.09.2018 23:17, Cong Wang wrote:
>>> On Mon, Sep 17, 2018 at 12:25 AM Kirill Tkhai wrote:
>>>> In inet_init() the order of registration is:
>>>>
>>>>   ip_mr_init();
>>>>   init_inet_pernet_ops();
>>>>
>>>> This means ipmr_net_ops pernet operations are before af_inet_ops
>>>> in pernet_list. So there is a theoretical probability that, sometime
>>>> in the future, we will have a problem during a failure of net
>>>> initialization.
>>>>
>>>> Say, setup_net():
>>>>   ipmr_net_ops->init() returns 0
>>>>   xxx->init() returns error
>>>> and then we do: ipmr_net_ops->exit(), which could touch ra_mutex
>>>> (theoretically).
>>>
>>> How could ra_mutex be touched in this scenario?
>>>
>>> ra_mutex is only used in ip_ra_control() which is called
>>> only by {get,set}sockopt(). I don't see anything related
>>> to netns exit() path here.
>>
>> Currently, it is not touched. But it's an ordinary practice that
>> someone closes sockets in pernet ->exit methods. For example,
>> we close percpu icmp sockets in icmp_sk_exit(), which are
>> also of RAW type, and ip_ra_control() is also called
>> for them. Yes, they differ by their protocol; icmp sockets
>> are of IPPROTO_ICMP protocol, while ip_ra_control() acts
>> on IPPROTO_RAW sockets, but it's not good anyway. This does
>> not look reliable for the future. In case someone changes
>> something here, we may not notice it for a long time,
>> while some users will meet bugs in their setups.
>
> First of all, we only consider the current code base. Even if you
> really planned to change this in the future, it would still be your
> responsibility to take care of it. Why? Simple, FIFO. My patch
> comes ahead of any future changes here, obviously.
>
> Secondly, it is certainly not hard to notice. I am pretty sure
> you would get a warning/crash if this bug triggered.
>
>> Problems on error paths are not easy to detect in testing,
>> while users may meet them.
We had an issue of the same type with
>> uninitialized xfrm_policy_lock. It was introduced in 2013,
>> while the problem was found only in 2017:
>>
>>   introduced by 283bc9f35bbb
>>   fixed by c28a45cb
>>
>> (Last week I met it on a RH7 kernel, which still has no fix.
>> But this talk is not about distribution kernels, just about
>> the approach.)
>>
>> I just want to say that if someone does some creative work on top
>> of this code, it will be friendlier of us not to force that person
>> to think about such non-obvious details, but to implement a nice
>> architecture right now.
>
> You keep saying nice architecture, how did you architect
> ipv4.ra_mutex into net/core/net_namespace.c? It is apparently
> not nice.
>
> Good luck with your future.
>
> I am tired, let's just drop it. I have no interest to fix your mess.

You added me to CC, so you probably want to know my opinion about this. Since it's not a real problem fix, but just a refactoring, I am telling you my opinion on how this refactoring may be made better. If you don't want to know my opinion, you may consider not CCing me. Just that, no reason to take offense.

Thanks,
Kirill
[PATCH net-next 0/5] vrf: allow simultaneous service instances in default and other VRFs
Services currently have to be VRF-aware if they are using an unbound socket. One cannot have multiple service instances running in the default and other VRFs for services that are not VRF-aware and listen on an unbound socket. This is because there is no way of isolating packets received in the default VRF from those arriving in other VRFs.

This series provides this isolation subject to the existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set, given that this is documented as allowing a single service instance to work across all VRF domains. The functionality applies to UDP & TCP services, for IPv4 and IPv6, in particular adding VRF table handling for IPv6 multicast.

Example of running ssh instances in default and blue VRF:

$ /usr/sbin/sshd -D
$ ip vrf exec vrf-blue /usr/sbin/sshd
$ ss -ta | egrep 'State|ssh'
State   Recv-Q  Send-Q  Local Address:Port        Peer Address:Port
LISTEN  0       128     0.0.0.0%vrf-blue:ssh      0.0.0.0:*
LISTEN  0       128     0.0.0.0:ssh               0.0.0.0:*
ESTAB   0       0       192.168.122.220:ssh       192.168.122.1:50282
LISTEN  0       128     [::]%vrf-blue:ssh         [::]:*
LISTEN  0       128     [::]:ssh                  [::]:*
ESTAB   0       0       [3000::2]%vrf-blue:ssh    [3000::9]:45896
ESTAB   0       0       [2000::2]:ssh             [2000::9]:46398

Dewi Morgan (1):
  ipv6: do not drop vrf udp multicast packets

Mike Manning (1):
  ipv6: allow link-local and multicast packets inside vrf

Patrick Ruddy (1):
  ipv6: add vrf table handling code for ipv6 mcast

Robert Shearman (2):
  net: allow binding socket in a VRF when there's an unbound socket
  ipv4: Allow sending multicast packets on specific i/f using VRF socket

 Documentation/networking/vrf.txt |  9
 drivers/net/vrf.c                | 30
 include/net/inet6_hashtables.h   |  5 ++--
 include/net/inet_hashtables.h    | 21 +++--
 include/net/inet_sock.h          | 13 +++
 net/core/sock.c                  |  2 ++
 net/ipv4/datagram.c              |  2 +-
 net/ipv4/inet_connection_sock.c  | 13 ---
 net/ipv4/inet_hashtables.c       | 34 +---
 net/ipv4/ip_sockglue.c           |  3 +++
 net/ipv4/ping.c                  |  2 +-
 net/ipv4/raw.c                   |  6 ++---
 net/ipv4/udp.c                   | 17 ++
 net/ipv6/datagram.c              |  5 +++-
 net/ipv6/inet6_hashtables.c      | 14 +---
 net/ipv6/ip6_input.c             | 46 +
 net/ipv6/ip6mr.c                 | 49 ++--
 net/ipv6/ipv6_sockglue.c         |  5 +++-
 net/ipv6/raw.c                   |  6 ++---
 net/ipv6/udp.c                   | 22 --
 20 files changed, 208 insertions(+), 96 deletions(-)
--
2.11.0
[PATCH net-next 2/5] ipv6: allow link-local and multicast packets inside vrf
Packets that are multicast or to link-local addresses are not enslaved to the vrf of the socket that they are received on. This is needed for NDISC, but breaks applications that rely on receiving such packets when in a VRF. Also, to make IPv6 consistent with IPv4, which does handle multicast packets as being enslaved, modify the VRF driver to do the same for IPv6. As a result, the multicast address check needs to verify the address against the enslaved rather than the l3mdev device.

Signed-off-by: Mike Manning
---
 drivers/net/vrf.c        | 19 +--
 net/ipv6/ip6_input.c     | 19 ++-
 net/ipv6/ipv6_sockglue.c |  2 +-
 3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index f93547f257fb..9d817c19f3b4 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -981,24 +981,23 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
 				   struct sk_buff *skb)
 {
 	int orig_iif = skb->skb_iif;
-	bool need_strict;
+	bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
+	bool is_ndisc = ipv6_ndisc_frame(skb);
 
-	/* loopback traffic; do not push through packet taps again.
-	 * Reset pkt_type for upper layers to process skb
+	/* loopback, multicast & non-ND link-local traffic; do not push through
+	 * packet taps again. Reset pkt_type for upper layers to process skb
 	 */
-	if (skb->pkt_type == PACKET_LOOPBACK) {
+	if (skb->pkt_type == PACKET_LOOPBACK || (need_strict && !is_ndisc)) {
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
 		IP6CB(skb)->flags |= IP6SKB_L3SLAVE;
-		skb->pkt_type = PACKET_HOST;
+		if (skb->pkt_type == PACKET_LOOPBACK)
+			skb->pkt_type = PACKET_HOST;
 		goto out;
 	}
 
-	/* if packet is NDISC or addressed to multicast or link-local
-	 * then keep the ingress interface
-	 */
-	need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
-	if (!ipv6_ndisc_frame(skb) && !need_strict) {
+	/* if packet is NDISC then keep the ingress interface */
+	if (!is_ndisc) {
 		vrf_rx_stats(vrf_dev, skb->len);
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..108f5f88ec98 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -432,15 +432,32 @@ EXPORT_SYMBOL_GPL(ip6_input);
 
 int ip6_mc_input(struct sk_buff *skb)
 {
+	int sdif = inet6_sdif(skb);
 	const struct ipv6hdr *hdr;
+	struct net_device *dev;
 	bool deliver;
 
 	__IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev),
 			 __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST,
 			 skb->len);
 
+	/* skb->dev passed may be master dev for vrfs. */
+	if (sdif) {
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(dev_net(skb->dev), sdif);
+		if (!dev) {
+			rcu_read_unlock();
+			kfree_skb(skb);
+			return -ENODEV;
+		}
+	} else {
+		dev = skb->dev;
+	}
+
 	hdr = ipv6_hdr(skb);
-	deliver = ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, NULL);
+	deliver = ipv6_chk_mcast_addr(dev, &hdr->daddr, NULL);
+	if (sdif)
+		rcu_read_unlock();
 
 #ifdef CONFIG_IPV6_MROUTE
 	/*
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 7dfbc797b130..4ebd395dd3df 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -486,7 +486,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 			retv = -EFAULT;
 			break;
 		}
-		if (sk->sk_bound_dev_if && pkt.ipi6_ifindex != sk->sk_bound_dev_if)
+		if (!sk_dev_equal_l3scope(sk, pkt.ipi6_ifindex))
 			goto e_inval;
 		np->sticky_pktinfo.ipi6_ifindex = pkt.ipi6_ifindex;
--
2.11.0