Re: [PATCH net-next 0/5] vrf: allow simultaneous service instances in default and other VRFs

2018-09-20 Thread David Ahern
On 9/20/18 1:58 AM, Mike Manning wrote:
> Services currently have to be VRF-aware if they are using an unbound
> socket. One cannot have multiple service instances running in the
> default and other VRFs for services that are not VRF-aware and listen
> on an unbound socket. This is because there is no way of isolating
> packets received in the default VRF from those arriving in other VRFs.
> 
> This series provides this isolation subject to the existing kernel
> parameter net.ipv4.tcp_l3mdev_accept not being set, given that this is
> documented as allowing a single service instance to work across all
> VRF domains. The functionality applies to UDP & TCP services, for IPv4
> and IPv6, in particular adding VRF table handling for IPv6 multicast.
> 
> Example of running ssh instances in default and blue VRF:
> 
> $ /usr/sbin/sshd -D
> $ ip vrf exec vrf-blue /usr/sbin/sshd
> $ ss -ta | egrep 'State|ssh'
> State   Recv-Q  Send-Q  Local Address:Port         Peer Address:Port
> LISTEN  0       128     0.0.0.0%vrf-blue:ssh       0.0.0.0:*
> LISTEN  0       128     0.0.0.0:ssh                0.0.0.0:*
> ESTAB   0       0       192.168.122.220:ssh        192.168.122.1:50282
> LISTEN  0       128     [::]%vrf-blue:ssh          [::]:*
> LISTEN  0       128     [::]:ssh                   [::]:*
> ESTAB   0       0       [3000::2]%vrf-blue:ssh     [3000::9]:45896
> ESTAB   0       0       [2000::2]:ssh              [2000::9]:46398
> 
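The isolation described above can be pictured as a device-match check on the receive path. The sketch below is illustrative only — the helper name and its arguments are assumptions, loosely modeled on the kernel's inet_bound_dev_eq()-style helpers, not the series' actual code: an unbound listener matches a packet from a VRF only if tcp_l3mdev_accept is set, while a listener bound to a VRF device matches only packets arriving over that VRF.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the l3mdev listener-match logic; the name
 * listener_matches() is hypothetical. bound_if == 0 means an unbound
 * (wildcard) listener; sdif is the l3mdev (VRF) ifindex the packet
 * arrived over, 0 for the default VRF.
 */
static bool listener_matches(bool l3mdev_accept, int bound_if, int sdif)
{
    if (bound_if)                 /* socket bound to a VRF device */
        return bound_if == sdif;  /* only packets from that VRF */
    if (l3mdev_accept)            /* single instance spans all VRFs */
        return true;
    return sdif == 0;             /* unbound listener: default VRF only */
}
```

Under this model, the default-VRF sshd (bound_if == 0) no longer sees packets with sdif != 0, so the vrf-blue instance can coexist with it — exactly the two LISTEN rows per family shown above.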

Hi Dave:

I need some time to review and more importantly test this patch set
before it is committed. I am traveling tomorrow afternoon through Sunday
evening, so I need a few days into next week to get to this.

Thanks,
David


[PATCH net-next v2] net/tls: Add support for async encryption of records for performance

2018-09-20 Thread Vakul Garg
In the current implementation, TLS records are encrypted and transmitted
serially: the implementation waits until the previously submitted user data
has been encrypted and only then starts transmitting the record. This
encrypt-one-record-at-a-time approach is inefficient when asynchronous
crypto accelerators are used. Each record incurs the overhead of
interrupts, driver softIRQ scheduling, etc., and the crypto accelerator
sits idle most of the time while an encrypted record's pages are handed
over to the TCP stack for transmission.

This patch enables encryption of multiple records in parallel when an
async-capable crypto accelerator is present in the system. This is achieved
by allowing the user space application to send more data using sendmsg()
even while previously issued data is still being processed by the crypto
accelerator, which requires returning control to the application right
after submitting the encryption request to the accelerator. It also means
that the zero-copy mode of encryption cannot be used with an async
accelerator, since we must be done with the user space application's buffer
before returning from sendmsg().
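The zero-copy restriction can be seen in miniature: before the encryption request is handed to the accelerator, the user's buffer must be copied into pages owned by the record, because sendmsg() returns before encryption completes. A hedged userspace sketch of that invariant — all names here are invented for illustration, not the patch's code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: owns a private copy of the caller's data so the
 * caller's buffer can be reused as soon as "sendmsg" returns, even though
 * encryption finishes later, asynchronously.
 */
struct rec {
    unsigned char *data;
    size_t len;
};

/* Copy user data into the record before submitting it for encryption. */
static struct rec *rec_from_user(const void *ubuf, size_t len)
{
    struct rec *r = malloc(sizeof(*r));

    if (!r)
        return NULL;
    r->data = malloc(len);
    if (!r->data) {
        free(r);
        return NULL;
    }
    memcpy(r->data, ubuf, len);   /* the copy that zero-copy mode avoids */
    r->len = len;
    return r;
}
```

Once the copy is made, the caller may scribble over its buffer immediately; the in-flight record is unaffected.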

There can be multiple records in flight to/from the accelerator. Each
record is represented by a 'struct tls_rec', which stores the memory pages
for the record.

After the records are encrypted, they are added to a linked list called
tx_ready_list, which holds encrypted TLS records sorted by TLS sequence
number. Records on tx_ready_list are transmitted using a newly introduced
function, tls_tx_records(). In sendmsg() and sendpage(), tx_ready_list is
polled for records ready to be transmitted after encryption of new TLS
records has been initiated. This achieves parallel encryption and
transmission of records when an async accelerator is present.
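The tx_ready_list bookkeeping can be sketched in userspace: completed records are inserted in sequence-number order, and transmission proceeds from the head only while the next expected sequence number is present, since on an async accelerator a later record may finish encrypting before an earlier one. This is an illustrative model, not the patch's code:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for a tls_rec on the ready list, keyed by TLS seq. */
struct ready_rec {
    unsigned long long seq;
    struct ready_rec *next;
};

/* Insert in ascending sequence order (the list stays sorted). */
static void ready_insert(struct ready_rec **head, struct ready_rec *r)
{
    while (*head && (*head)->seq < r->seq)
        head = &(*head)->next;
    r->next = *head;
    *head = r;
}

/* Transmit in-order records; stop at the first gap. Returns how many
 * records were sent and advances *next_seq past them.
 */
static int tx_records(struct ready_rec **head, unsigned long long *next_seq)
{
    int sent = 0;

    while (*head && (*head)->seq == *next_seq) {
        struct ready_rec *r = *head;

        *head = r->next;   /* "transmit" = pop from the ready list */
        free(r);
        (*next_seq)++;
        sent++;
    }
    return sent;
}
```

The gap check is what keeps TLS records on the wire in sequence-number order even when encryption completes out of order.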

There can be situations where the crypto accelerator completes encryption
later than the polling of tx_ready_list by sendmsg()/sendpage(). Therefore
we need a deferred work context to be able to transmit records from
tx_ready_list. The deferred work context is scheduled if the application is
not sending much data through the socket. If the application issues
sendmsg()/sendpage() calls in quick succession, the scheduled
tx_work_handler is cancelled, as tx_ready_list would be polled from the
application's context itself, saving the scheduling overhead of the
deferred work.

The patch also brings a side benefit: we are able to get rid of the concept
of a CLOSED record, because once a record is closed it is either encrypted
and placed on tx_ready_list or, if encryption fails, a socket error is set.
This simplifies the kernel TLS send path. However, since tls_device.c still
uses the macros, the accessor functions for CLOSED records have been
retained.

Signed-off-by: Vakul Garg 
---

Changes since v1: Addressed Dave Miller's comments.
- Removed an extra space between 'inline' and 'bool' in
  'is_tx_ready' declaration.
- Changed the order of variable declarations in several places to
  follow the 'longest to shortest' line suggestion.

 include/net/tls.h  |  70 +--
 net/tls/tls_main.c |  54 ++---
 net/tls/tls_sw.c   | 586 -
 3 files changed, 522 insertions(+), 188 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 9f3c4ea9ad6f..3aa73e2d8823 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -41,7 +41,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 
@@ -93,24 +93,47 @@ enum {
TLS_NUM_CONFIG,
 };
 
-struct tls_sw_context_tx {
-   struct crypto_aead *aead_send;
-   struct crypto_wait async_wait;
-
-   char aad_space[TLS_AAD_SPACE_SIZE];
-
-   unsigned int sg_plaintext_size;
-   int sg_plaintext_num_elem;
+/* TLS records are maintained in 'struct tls_rec'. It stores the memory pages
+ * allocated or mapped for each TLS record. After encryption, the records are
+ * stored in a linked list.
+ */
+struct tls_rec {
+   struct list_head list;
+   int tx_flags;
struct scatterlist sg_plaintext_data[MAX_SKB_FRAGS];
-
-   unsigned int sg_encrypted_size;
-   int sg_encrypted_num_elem;
struct scatterlist sg_encrypted_data[MAX_SKB_FRAGS];
 
/* AAD | sg_plaintext_data | sg_tag */
struct scatterlist sg_aead_in[2];
/* AAD | sg_encrypted_data (data contain overhead for hdr) */
struct scatterlist sg_aead_out[2];
+
+   unsigned int sg_plaintext_size;
+   unsigned int sg_encrypted_size;
+   int sg_plaintext_num_elem;
+   int sg_encrypted_num_elem;
+
+   char aad_space[TLS_AAD_SPACE_SIZE];
+   struct aead_request aead_req;
+   u8 aead_req_ctx[];
+};
+
+struct tx_work {
+   struct delayed_work work;
+   struct sock *sk;
+};
+
+struct tls_sw_context_tx {
+   struct crypto_aead *aead_send;
+   struct crypto_wait async_wait;
+   struct tx_work tx_work;
+   struct tls_rec *open_rec;
+   struct 

Reply: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices

2018-09-20 Thread Li,Rongqing
> Subject: Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when
> sending to many devices
> 
> 
> 
> On 09/20/2018 06:43 AM, Eric Dumazet wrote:
> >
> 

Sorry, I should cc to you.

> > And lastly this patch looks way too complicated to me.
> > You probably can write something much simpler.
> 

But it should not hurt performance.

> Something like :
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index
> 930d17fa906c9ebf1cf7b6031ce0a22f9f66c0e4..e0a81beb4f37751421dbbe794c
> cf3d5a46bdf900 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -278,22 +278,26 @@ static bool netlink_filter_tap(const struct sk_buff
> *skb)
> return false;
>  }
> 
> -static int __netlink_deliver_tap_skb(struct sk_buff *skb,
> +static int __netlink_deliver_tap_skb(struct sk_buff **pskb,
>  struct net_device *dev)  {
> -   struct sk_buff *nskb;
> +   struct sk_buff *nskb, *skb = *pskb;
> struct sock *sk = skb->sk;
> int ret = -ENOMEM;
> 
> if (!net_eq(dev_net(dev), sock_net(sk)))
> return 0;
> 
> -   dev_hold(dev);
> -
> -   if (is_vmalloc_addr(skb->head))
> +   if (is_vmalloc_addr(skb->head)) {
> nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
> -   else
> -   nskb = skb_clone(skb, GFP_ATOMIC);
> +   if (!nskb)
> +   return -ENOMEM;
> +   consume_skb(skb);

The original skb cannot be freed here, since it will be used after being
sent to the tap, in __netlink_sendskb.

> +   skb = nskb;
> +   *pskb = skb;
> +   }
> +   dev_hold(dev);
> +   nskb = skb_clone(skb, GFP_ATOMIC);

Since the original skb cannot be freed, the skb_clone() will lead to a leak.

> if (nskb) {
> nskb->dev = dev;
> nskb->protocol = htons((u16) sk->sk_protocol);
> @@ -318,7 +322,7 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *n
> return;
> 
> list_for_each_entry_rcu(tmp, &n->netlink_tap_all, list) {
> -   ret = __netlink_deliver_tap_skb(skb, tmp->dev);
> +   ret = __netlink_deliver_tap_skb(&skb, tmp->dev);
> if (unlikely(ret))
> break;
> }
> 


The change below seems simpler, but it adds one extra skb allocation and
free each time:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..b9631137f0fe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -290,10 +290,8 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 
dev_hold(dev);
 
-   if (is_vmalloc_addr(skb->head))
-   nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
-   else
-   nskb = skb_clone(skb, GFP_ATOMIC);
+   nskb = skb_clone(skb, GFP_ATOMIC);
+
if (nskb) {
nskb->dev = dev;
nskb->protocol = htons((u16) sk->sk_protocol);
@@ -317,11 +315,20 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *n
+   bool alloc = false;
+
if (!netlink_filter_tap(skb))
return;
 
+   if (is_vmalloc_addr(skb->head)) {
+   skb = netlink_to_full_skb(skb, GFP_ATOMIC);
+   if (!skb)
+   return;
+   alloc = true;
+   }
+
+
list_for_each_entry_rcu(tmp, &n->netlink_tap_all, list) {
+
ret = __netlink_deliver_tap_skb(skb, tmp->dev);
if (unlikely(ret))
break;
}
+
+   if (alloc)
+   consume_skb(skb);
 }

-Q


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Alexei Starovoitov
On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> > to having to cope with munmapping parts of existing mmaps, etc.
> > 
> > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> > now it would be used just in this clean case for undoing a
> > PERF_RECORD_MMAP for a BPF program.
> > 
> > The ABI is already complicated, starting to use something called
> > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> > I think.
> 
> Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> part was getting the perf tool to dtrt for that use-case. But if we need
> unmap events, doing the unmap record now is the right thing.

Thanks for the pointers!
The answer is a bit long, please bear with me.

I considered adding MUNMAP to match the existing MMAP, but went without it
because I didn't want to introduce a new bit in perf_event_attr and emit
the new events in an imbalanced, conditional way for prog load/unload.
Old perf asks the kernel for mmap events via the mmap bit, so prog load
events would land in perf.data, but old perf report won't recognize them
anyway, whereas new perf would certainly want to catch bpf events and
would set both the mmap and munmap bits.

Then, if I add a MUNMAP event without a new bit and emit MMAP/MUNMAP
conditionally based on the single mmap bit, they will confuse old perf
and it will print warnings about 'unknown events'.

Both situations are ugly, hence I went with reusing the MMAP event for
both load and unload. In that case old perf silently ignores them, which
is what I wanted. When we upgrade the kernel we cannot synchronize the
kernel upgrade (or downgrade) with the user space perf package upgrade,
so not confusing old perf is important. With a new kernel, new bpf mmap
events get into perf.data and new perf picks them up.

A few more considerations:

I consider synthetic perf events to be non-ABI, meaning they're emitted by
perf user space into perf.data and there is a convention on names, but it's
not a kernel ABI. For example, RECORD_MMAP with
event.filename == "[module_name]" is an indication for perf report to
parse the elf/build-id of dso==module_name. There is no such support in
the kernel: the kernel doesn't emit such events for module load/unload.
If in the future we decide to extend the kernel with such events, they
don't have to match what user space perf does today.

Why is this important? To get to the next step.
As Arnaldo pointed out this patch set is missing support for
JITed prog annotations and displaying asm code. Absolutely correct.
This set only helps perf to reveal the names of bpf progs that _were_
running at the time of perf record, but there is no way yet for
perf report to show asm code of the prog that was running.
In that sense bpf is drastically different from java, other jits
and normal profiling.
bpf JIT happens in the kernel and only kernel knows the mapping
between original source and JITed code.
In addition there are bpf2bpf functions. In the future there will
be bpf libraries, more type info, line number support, etc.
I strongly believe perf RECORD_* events should NOT care about
the development that happens on the bpf side.
The only thing kernel will be telling user space is that bpf prog
with prog_id=X was loaded.
Then based on prog_id the 'perf record' phase can query the kernel
for bpf related information. There is already a way to fetch
JITed image based on prog_id.
Then perf will emit synthetic RECORD_FOOBAR into perf.data
that will contain bpf related info (like complete JITed image)
and perf report can process it later and annotate things in UI.

It may seem that there is a race here: when 'perf record' sees a
'bpf prog was loaded with prog_id=X' event, it will ask the kernel about
prog_id=X, but that prog could already have been unloaded. In that case
prog_id will not exist and perf record can ignore the event.
So there is no race.
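The "no race" argument boils down to treating a failed lookup as "prog already gone". A sketch of what a perf-record-side helper might do, using the real BPF_PROG_GET_FD_BY_ID command of the bpf(2) syscall — the helper name is invented, and error handling is simplified: any negative return (ENOENT because the prog was unloaded, or EPERM for an unprivileged caller) just means the event is skipped.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Try to get an fd for a prog_id seen in a load event. A negative return
 * means the event is simply skipped; there is no window in which stale
 * data can be read, because the fd pins the prog while it is held.
 */
static int fd_for_prog_id(unsigned int prog_id)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_id = prog_id;
    return (int)syscall(__NR_bpf, BPF_PROG_GET_FD_BY_ID, &attr, sizeof(attr));
}
```

If the call succeeds, the returned fd can then be used to fetch the JITed image and other prog info for the synthetic records described below; if it fails, the load event is dropped.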
The kernel doesn't need to pass all information about bpf prog to
the user space via RECORD_*. Instead 'perf record' can emit
synthetic events into perf.data.
I was thinking of extending RECORD_MMAP with prog_id already
(instead of passing kallsyms's bpf prog name in event->mmap.filename),
but bpf functions don't have their own prog_id: multiple bpf funcs with
different JITed blobs are considered part of a single prog_id.
So as step one I'm only extending RECORD_MMAP with the addr and kallsyms
name of the bpf function/prog.
As a step two the plan is to add notification mechanism for prog load/unload
that will include prog_id and design new synthetic RECORD_* events in
perf user space that will contain JITed code, line info, BTF, etc.

TLDR:
step 1 (this patch set)
Single bpf prog_load can call multiple
bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only
Similarly unload calls multiple
bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only

step 2 (future work)

Re: [PATCH net] r8169: fix autoneg issue on resume with RTL8168E

2018-09-20 Thread David Miller
From: Heiner Kallweit 
Date: Thu, 20 Sep 2018 22:47:09 +0200

> It was reported that chip version 33 (RTL8168E) ends up with
> 10MBit/Half on a 1GBit link after resuming from S3 (with different
> link partners). For whatever reason the PHY on this chip doesn't
> properly start a renegotiation when soft-reset.
> Explicitly requesting a renegotiation fixes this.
> 
> Signed-off-by: Heiner Kallweit 
> Fixes: a2965f12fde6 ("r8169: remove rtl8169_set_speed_xmii")
> Reported-by: Neil MacLeod 
> Tested-by: Neil MacLeod 

Applied, thank you.

Please always make "Fixes: " the first tag in a set of tags.

I fixed it up for you this time.

Thanks.


[PATCH RESEND] PCI: hv: Fix return value check in hv_pci_assign_slots()

2018-09-20 Thread Wei Yongjun
In case of error, the function pci_create_slot() returns ERR_PTR() and
never returns NULL. The NULL test in the return value check should be
replaced with IS_ERR().
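The pattern being fixed is easy to model in userspace: ERR_PTR() packs a negative errno into a pointer value, so a failing pci_create_slot() never returns NULL, and a NULL check passes the error straight through. A simplified re-creation of the kernel's err.h helpers, for illustration only:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ERRNO 4095

/* Userspace re-creation of the kernel's ERR_PTR()/IS_ERR() helpers:
 * errors are encoded as the top MAX_ERRNO values of the address space.
 */
static inline void *ERR_PTR(long error)
{
    return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

A failure such as ERR_PTR(-ENOMEM) is non-NULL, so `if (!hpdev->pci_slot)` never fires — which is exactly why the check must be IS_ERR() instead.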

Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
Signed-off-by: Wei Yongjun 
---
Since the orig patch is merged from net tree, cc netdev@vger.kernel.org
---
 drivers/pci/controller/pci-hyperv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index ee80e79..9ba4d12 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1484,8 +1484,10 @@ static void hv_pci_assign_slots(struct hv_pcibus_device *hbus)
snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser);
hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr,
  name, NULL);
-   if (!hpdev->pci_slot)
+   if (IS_ERR(hpdev->pci_slot)) {
pr_warn("pci_create slot %s failed\n", name);
+   hpdev->pci_slot = NULL;
+   }
}
 }



RE: [PATCH net-next] net/tls: Add support for async encryption of records for performance

2018-09-20 Thread Vakul Garg



> -Original Message-
> From: David Miller 
> Sent: Thursday, September 20, 2018 11:49 PM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; bor...@mellanox.com;
> avia...@mellanox.com; davejwat...@fb.com; doro...@fb.com
> Subject: Re: [PATCH net-next] net/tls: Add support for async encryption of
> records for performance
> 
> From: Vakul Garg 
> Date: Wed, 19 Sep 2018 20:51:35 +0530
> 
> > This patch enables encryption of multiple records in parallel when an
> > async capable crypto accelerator is present in system.
> 
> This seems to be trading off zero copy with async support.
> 
> Async crypto device support is not the common case at all, and synchronous
> crypto via cpu crypto acceleration instructions is so much more likely.
> 
> Oh I see, the new logic is only triggered with ASYNC_CAPABLE is set?
> 
> > +static inline  bool is_tx_ready(struct tls_context *tls_ctx,
> > +   struct tls_sw_context_tx *ctx)
> > +{
> 
> Two space between "inline" and "bool", please make it one.
 
Fixed.
Checkpatch seems to miss it.

> 
> >  static void tls_write_space(struct sock *sk)  {
> > struct tls_context *ctx = tls_get_ctx(sk);
> > +   struct tls_sw_context_tx *tx_ctx = tls_sw_ctx_tx(ctx);
> 
> Longest to shortest line (reverse christmas tree) ordering for local variable
> declarations please.
 
Can't do this. The second variable's assignment depends on the previous one.


> >
> > +   list_for_each_prev(pos, &ctx->tx_ready_list) {
> > +   struct tls_rec *rec = (struct tls_rec *)pos;
> > +   u64 seq = be64_to_cpup((const __be64 *)&rec->aad_space);
> 
> Likewise.
> 

I can split the declaration of 'seq' and its assignment into two separate
lines. But I am not sure that increasing the number of lines just to comply
with reverse Christmas tree ordering is a good thing in this case.

> > -static int tls_do_encryption(struct tls_context *tls_ctx,
> > +int tls_tx_records(struct sock *sk, int flags)
> > +{
> > +   struct tls_rec *rec, *tmp;
> > +   struct tls_context *tls_ctx = tls_get_ctx(sk);
> > +   struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> > +   int rc = 0;
> > +   int tx_flags;
> 
> Likewise.

Could partially address this, since the ctx assignment depends upon the
tls_ctx assignment.
 
> 
> > +static void tls_encrypt_done(struct crypto_async_request *req, int err)
> > +{
> > +   struct aead_request *aead_req = (struct aead_request *)req;
> > +   struct sock *sk = req->data;
> > +   struct tls_context *tls_ctx = tls_get_ctx(sk);
> > +   struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> > +   struct tls_rec *rec;
> > +   int pending;
> > +   bool ready = false;
> 
> Likewise.

Placed 'ready' above 'pending'. Rest unchanged because of dependencies.

> 
> > +static int tls_do_encryption(struct sock *sk,
> > +struct tls_context *tls_ctx,
> >  struct tls_sw_context_tx *ctx,
> >  struct aead_request *aead_req,
> >  size_t data_len)
> >  {
> > int rc;
> > +   struct tls_rec *rec = ctx->open_rec;
> 
> Likewise.
> 
> > @@ -473,11 +630,12 @@ static int memcopy_from_iter(struct sock *sk,
> > struct iov_iter *from,  {
> > struct tls_context *tls_ctx = tls_get_ctx(sk);
> > struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> > -   struct scatterlist *sg = ctx->sg_plaintext_data;
> > +   struct tls_rec *rec = ctx->open_rec;
> > +   struct scatterlist *sg = rec->sg_plaintext_data;
> > int copy, i, rc = 0;
> 
> Likewise.
 
Can't change because of dependencies.

> 
> > +struct tls_rec *get_rec(struct sock *sk)
> > +{
> > +   int mem_size;
> > +   struct tls_context *tls_ctx = tls_get_ctx(sk);
> > +   struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> > +   struct tls_rec *rec;
> 
> Likewise.
> 
 
Declared 'mem_size' below 'rec'.

> > @@ -510,21 +707,33 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> > int record_room;
> > bool full_record;
> > int orig_size;
> > +   struct tls_rec *rec;
> > bool is_kvec = msg->msg_iter.type & ITER_KVEC;
> > +   struct crypto_tfm *tfm = crypto_aead_tfm(ctx->aead_send);
> > +   bool async_capable = tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC;
> > +   int num_async = 0;
> > +   int num_zc = 0;
> 
> Likewise.

Fixed.

> > @@ -661,6 +904,8 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
> > struct scatterlist *sg;
> > bool full_record;
> > int record_room;
> > +   struct tls_rec *rec;
> > +   int num_async = 0;
> 
> Likewise.
 
Fixed.

Sending v2.


Re: [PATCH iproute2-next] iplink: add ipvtap support

2018-09-20 Thread David Ahern
On 9/18/18 8:03 PM, Hangbin Liu wrote:
> IPVLAN and IPVTAP use the same functions and parameters, so we can just
> add a new link_util with id ipvtap. Everything else is the same.
> 
> Signed-off-by: Hangbin Liu 
> ---
>  ip/iplink.c   |  4 ++--
>  ip/iplink_ipvlan.c| 28 ++--
>  man/man8/ip-link.8.in |  4 
>  3 files changed, 24 insertions(+), 12 deletions(-)
> 

applied to iproute2-next. Thanks




[PATCH net] net/ipv6: Display all addresses in output of /proc/net/if_inet6

2018-09-20 Thread Jeff Barnhill
The backend handling for /proc/net/if_inet6 in addrconf.c doesn't properly
handle starting/stopping the iteration.  The problem is that at some point
during the iteration, an overflow is detected and the process is
subsequently stopped.  The item being shown via seq_printf() when the
overflow occurs is not actually shown, though.  When start() is
subsequently called to resume iterating, it returns the next item, and
thus the item that was being processed when the overflow occurred never
gets printed.

Alter the meaning of the private data member "offset".  Currently, when it
is not 0 (which only happens at the very beginning), "offset" represents
the next hlist item to be printed.  After this change, "offset" always
represents the current item.

This is also consistent with the private data member "bucket", which
represents the current bucket, and also the use of "pos" as defined in
seq_file.txt:
The pos passed to start() will always be either zero, or the most
recent pos used in the previous session.
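The new "offset names the current item" semantics can be modeled with a toy two-level iterator — this is an illustrative model with invented names, not addrconf.c itself. Because resume returns the item at (bucket, offset) rather than the one after it, the entry that was being formatted when the overflow hit is emitted again in the next session instead of being lost.

```c
#include <assert.h>

/* Toy model of the if6 iterator: NBUCKETS buckets of NITEMS items each,
 * with iterator state (bucket, offset) where, per the patch, "offset"
 * names the CURRENT item rather than the next one.
 */
#define NBUCKETS 2
#define NITEMS   3

struct iter_state {
    int bucket;
    int offset;
};

/* Resume a session: return the current item, or -1 at the end. */
static int get_first(const struct iter_state *st)
{
    if (st->bucket >= NBUCKETS)
        return -1;
    return st->bucket * NITEMS + st->offset;
}

/* Advance to the next item and return it, or -1 at the end. */
static int get_next(struct iter_state *st)
{
    if (++st->offset >= NITEMS) {
        st->offset = 0;
        if (++st->bucket >= NBUCKETS)
            return -1;
    }
    return st->bucket * NITEMS + st->offset;
}
```

If the seq_file buffer overflows while an item is being formatted, stop() runs without that item having reached user space; since offset still names it, the next session's get_first() returns the same item instead of silently skipping it.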

Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
---
 net/ipv6/addrconf.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d51a8c0b3372..c63ccce6425f 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4201,7 +4201,6 @@ static struct inet6_ifaddr *if6_get_first(struct seq_file *seq, loff_t pos)
p++;
continue;
}
-   state->offset++;
return ifa;
}
 
@@ -4225,13 +4224,12 @@ static struct inet6_ifaddr *if6_get_next(struct seq_file *seq,
return ifa;
}
 
+   state->offset = 0;
while (++state->bucket < IN6_ADDR_HSIZE) {
-   state->offset = 0;
hlist_for_each_entry_rcu(ifa,
 &inet6_addr_lst[state->bucket], addr_lst) {
if (!net_eq(dev_net(ifa->idev->dev), net))
continue;
-   state->offset++;
return ifa;
}
}
-- 
2.14.1



Re: [PATCH net] ixgbe: check return value of napi_complete_done()

2018-09-20 Thread Eric Dumazet



On 09/20/2018 04:43 PM, Song Liu wrote:
> 

> I tried to totally skip ndo_poll_controller() here. It did avoid hitting
> the issue. However, netpoll will drop (fail to send) more packets. 
>

Why is it failing ?

If you are under high memory pressure, then maybe, if you absolutely want
memory to send netpoll packets, you want to grab all NAPI contexts as a way
to prevent other cpus from feeding incoming packets to the host and adding
more memory pressure ;)



Re: [PATCH net] ixgbe: check return value of napi_complete_done()

2018-09-20 Thread Song Liu



> On Sep 20, 2018, at 4:22 PM, Eric Dumazet  wrote:
> 
> 
> 
> On 09/20/2018 03:42 PM, Song Liu wrote:
>> 
>> 
>>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher  
>>> wrote:
>>> 
>>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
 On 09/20/2018 12:01 PM, Song Liu wrote:
> The NIC driver should only enable interrupts when napi_complete_done()
> returns true. This patch adds the check for ixgbe.
> 
> Cc: sta...@vger.kernel.org # 4.10+
> Cc: Jeff Kirsher 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Song Liu 
> ---
 
 
 Well, unfortunately we do not know why this is needed,
 this is why I have not yet sent this patch formally.
 
 netpoll has correct synchronization :
 
 poll_napi() places into napi->poll_owner current cpu number before
 calling poll_one_napi()
 
 netpoll_poll_lock() does also use napi->poll_owner
 
 When netpoll calls ixgbe poll() method, it passed a budget of 0,
 meaning napi_complete_done() is not called.
 
 As long as we can not explain the problem properly in the changelog,
 we should investigate, otherwise we will probably see coming dozens of
 patches
 trying to fix a 'potential hazard'.
>>> 
>>> Agreed, which is why I have our validation and developers looking into it,
>>> while we test the current patch from Song.
>> 
>> I figured out what is the issue here. And I have a proposal to fix it. I 
>> have verified that this fixes the issue in our tests. But Alexei suggests
>> that it may not be the right way to fix. 
>> 
>> Here is what happened:
>> 
>> netpoll tries to send skb with netpoll_start_xmit(). If that fails, it 
>> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in 
>> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs 
>> within the same NIC. 
>> 
>> This is problematic, because at the end napi_schedule() calls:
>> 
>>____napi_schedule(this_cpu_ptr(&softnet_data), n);
>> 
>> which attached these NAPIs to softnet_data on THIS CPU. This is done
>> via napi->poll_list. 
>> 
>> Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
>> not give up the ownership until it calls napi_complete_done(). However, 
>> for a very busy server, we usually use 16 CPUs to poll NAPI, so this
>> CPU can easily be overloaded. And as a result, each call of napi->poll() 
>> will hit budget (of 64), and it will not call napi_complete_done(), 
>> and the NAPI stays in the poll_list of this CPU. 
>> 
>> When this happens, the host usually cannot get out of this state until
>> we throttle/stop client traffic. 
>> 
>> 
>> I am pretty confident this is what happened. Please let me know if 
>> anything above doesn't make sense. 
>> 
>> 
>> Here is my proposal to fix it: Instead of polling all NAPIs within one
>> NIC, I would have netpoll to only poll the NAPI that will free space
>> for netpoll_start_xmit(). I attached my two RFC patches to the end of 
>> this email. 
>> 
>> I chatted with Alexei about this. He thinks polling only one NAPI may 
>> not guarantee netpoll make progress with the TX queue we are aiming 
>> for. Also, the bigger problem may be the fact that NAPIs could get 
>> pinned to one CPU and cannot get released. 
>> 
>> At this point, I really don't know what is the best way to fix this. 
>> 
>> I will also work on a repro with netperf. 
> 
> Thanks !
> 
>> 
>> Please let me know your suggestions. 
>> 
> 
> Yeah, maybe that NICs using NAPI could not provide an ndo_poll_controller() 
> method at all,
> since it is very risky (potentially grab many NAPI, and end up in this locked 
> situation)
> 
> poll_napi() could attempt to free skbs one napi at a time,
> without the current cpu stealing all NAPI.
> 
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 
> 57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf
>  100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
>}
> 
>ops = dev->netdev_ops;
> -   if (!ops->ndo_poll_controller) {
> -   up(&ni->dev_lock);
> -   return;
> -   }
> -
> -   /* Process pending work on NIC */
> -   ops->ndo_poll_controller(dev);
> +   if (ops->ndo_poll_controller)
> +   ops->ndo_poll_controller(dev);
> 
>poll_napi(dev);
> 

I tried to totally skip ndo_poll_controller() here. It did avoid hitting
the issue. However, netpoll will drop (fail to send) more packets. 

Thanks,
Song





Re: [PATCH net] ixgbe: check return value of napi_complete_done()

2018-09-20 Thread Eric Dumazet



On 09/20/2018 03:42 PM, Song Liu wrote:
> 
> 
>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher  
>> wrote:
>>
>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>>> On 09/20/2018 12:01 PM, Song Liu wrote:
 The NIC driver should only enable interrupts when napi_complete_done()
 returns true. This patch adds the check for ixgbe.

 Cc: sta...@vger.kernel.org # 4.10+
 Cc: Jeff Kirsher 
 Suggested-by: Eric Dumazet 
 Signed-off-by: Song Liu 
 ---
>>>
>>>
>>> Well, unfortunately we do not know why this is needed,
>>> this is why I have not yet sent this patch formally.
>>>
>>> netpoll has correct synchronization :
>>>
>>> poll_napi() places into napi->poll_owner current cpu number before
>>> calling poll_one_napi()
>>>
>>> netpoll_poll_lock() does also use napi->poll_owner
>>>
>>> When netpoll calls ixgbe poll() method, it passed a budget of 0,
>>> meaning napi_complete_done() is not called.
>>>
>>> As long as we can not explain the problem properly in the changelog,
>>> we should investigate, otherwise we will probably see coming dozens of
>>> patches
>>> trying to fix a 'potential hazard'.
>>
>> Agreed, which is why I have our validation and developers looking into it,
>> while we test the current patch from Song.
> 
> I figured out what is the issue here. And I have a proposal to fix it. I 
> have verified that this fixes the issue in our tests. But Alexei suggests
> that it may not be the right way to fix. 
> 
> Here is what happened:
> 
> netpoll tries to send skb with netpoll_start_xmit(). If that fails, it 
> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in 
> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs 
> within the same NIC. 
> 
> This is problematic, because at the end napi_schedule() calls:
> 
> ____napi_schedule(this_cpu_ptr(&softnet_data), n);
> 
> which attached these NAPIs to softnet_data on THIS CPU. This is done
> via napi->poll_list. 
> 
> Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
> not give up the ownership until it calls napi_complete_done(). However, 
> for a very busy server, we usually use 16 CPUs to poll NAPI, so this
> CPU can easily be overloaded. And as a result, each call of napi->poll() 
> will hit budget (of 64), and it will not call napi_complete_done(), 
> and the NAPI stays in the poll_list of this CPU. 
> 
> When this happens, the host usually cannot get out of this state until
> we throttle/stop client traffic. 
> 
> 
> I am pretty confident this is what happened. Please let me know if 
> anything above doesn't make sense. 
> 
> 
> Here is my proposal to fix it: Instead of polling all NAPIs within one
> NIC, I would have netpoll to only poll the NAPI that will free space
> for netpoll_start_xmit(). I attached my two RFC patches to the end of 
> this email. 
> 
> I chatted with Alexei about this. He thinks polling only one NAPI may 
> not guarantee netpoll make progress with the TX queue we are aiming 
> for. Also, the bigger problem may be the fact that NAPIs could get 
> pinned to one CPU and cannot get released. 
> 
> At this point, I really don't know what is the best way to fix this. 
> 
> I will also work on a repro with netperf. 

Thanks !

> 
> Please let me know your suggestions. 
> 

Yeah, maybe NICs using NAPI should not provide an ndo_poll_controller()
method at all, since it is very risky (it can potentially grab many NAPIs
and end up in this locked situation).

poll_napi() could attempt to free skbs one napi at a time,
without the current cpu stealing all NAPIs.


diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 
57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf
 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
}
 
ops = dev->netdev_ops;
-   if (!ops->ndo_poll_controller) {
-   up(&ni->dev_lock);
-   return;
-   }
-
-   /* Process pending work on NIC */
-   ops->ndo_poll_controller(dev);
+   if (ops->ndo_poll_controller)
+   ops->ndo_poll_controller(dev);
 
poll_napi(dev);



Re: array bounds warning in xfrm_output_resume

2018-09-20 Thread David Ahern
On 9/20/18 7:06 AM, Florian Westphal wrote:
> David Ahern  wrote:
>>> $ make O=kbuild/perf -j 24 -s
>>> In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
>>>  from /home/dsa/kernel-3.git/include/linux/list.h:9,
>>>  from /home/dsa/kernel-3.git/include/linux/module.h:9,
>>>  from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
>>> /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
>>> ‘xfrm_output_resume’:
>>> /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
>>> subscript is above array bounds [-Warray-bounds]
>>>__read_once_size(&(x), __u.__c, sizeof(x));  \
> 
> Does this thing avoid the warning?

nope, still see it.



Re: [PATCH net] ixgbe: check return value of napi_complete_done()

2018-09-20 Thread Song Liu



> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher  wrote:
> 
> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>> The NIC driver should only enable interrupts when napi_complete_done()
>>> returns true. This patch adds the check for ixgbe.
>>> 
>>> Cc: sta...@vger.kernel.org # 4.10+
>>> Cc: Jeff Kirsher 
>>> Suggested-by: Eric Dumazet 
>>> Signed-off-by: Song Liu 
>>> ---
>> 
>> 
>> Well, unfortunately we do not know why this is needed,
>> this is why I have not yet sent this patch formally.
>> 
>> netpoll has correct synchronization :
>> 
>> poll_napi() places into napi->poll_owner current cpu number before
>> calling poll_one_napi()
>> 
>> netpoll_poll_lock() does also use napi->poll_owner
>> 
>> When netpoll calls ixgbe poll() method, it passed a budget of 0,
>> meaning napi_complete_done() is not called.
>> 
>> As long as we can not explain the problem properly in the changelog,
>> we should investigate, otherwise we will probably see coming dozens of
>> patches
>> trying to fix a 'potential hazard'.
> 
> Agreed, which is why I have our validation and developers looking into it,
> while we test the current patch from Song.

I figured out what the issue is here, and I have a proposal to fix it. I 
have verified that this fixes the issue in our tests. But Alexei suggests
that it may not be the right way to fix. 

Here is what happened:

netpoll tries to send skb with netpoll_start_xmit(). If that fails, it 
calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in 
the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs 
within the same NIC. 

This is problematic, because at the end napi_schedule() calls:

____napi_schedule(this_cpu_ptr(&softnet_data), n);

which attached these NAPIs to softnet_data on THIS CPU. This is done
via napi->poll_list. 

Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
not give up the ownership until it calls napi_complete_done(). However, 
for a very busy server, we usually use 16 CPUs to poll NAPI, so this
CPU can easily be overloaded. And as a result, each call of napi->poll() 
will hit budget (of 64), and it will not call napi_complete_done(), 
and the NAPI stays in the poll_list of this CPU. 

When this happens, the host usually cannot get out of this state until
we throttle/stop client traffic. 


I am pretty confident this is what happened. Please let me know if 
anything above doesn't make sense. 


Here is my proposal to fix it: Instead of polling all NAPIs within one
NIC, I would have netpoll to only poll the NAPI that will free space
for netpoll_start_xmit(). I attached my two RFC patches to the end of 
this email. 

I chatted with Alexei about this. He thinks polling only one NAPI may 
not guarantee that netpoll makes progress with the TX queue we are aiming 
for. Also, the bigger problem may be the fact that NAPIs could get 
pinned to one CPU and cannot get released. 

At this point, I really don't know what is the best way to fix this. 

I will also work on a repro with netperf. 

Please let me know your suggestions. 

Thanks,
Song


= RFCs 


From 7b6d1d0e21bc7a0034dbcacfdaaec95a0e5f5b14 Mon Sep 17 00:00:00 2001
From: Song Liu 
Date: Thu, 20 Sep 2018 13:09:26 -0700
Subject: [RFC net 1/2] net: netpoll: when failed to send a message, only poll
 the target NAPI

Currently, netpoll polls all NAPIs in the NIC if it fails to send data to
one TX queue. This is not necessary and sometimes problematic. This is
because ndo_poll_controller() calls napi_schedule() for all NAPIs in the
NIC, and attach available NAPIs to this_cpu's softnet_data->poll_list.
A modern high-end NIC has tens of NAPIs and requires multiple CPUs to poll
the data from the queues. Attaching multiple NAPIs to one CPU's poll_list
fans in the softirq work of multiple CPUs to one CPU. As a result, we see one
CPU pegged in ksoftirqd, while other CPUs cannot get assigned work.

This patch fixes the issue by polling only the target NAPI when a send fails.
New hook ndo_netpoll_get_napi() is added for the driver to map a busy
TX queue to the NAPI to poll. If the driver implements this hook, netpoll
will only poll the NAPI of the target TX queue.

Signed-off-by: Song Liu 
---
 include/linux/netdevice.h |  5 +++++
 net/core/netpoll.c        | 30 ++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6a7ac01fa605..576414b150ff 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1033,6 +1033,9 @@ struct dev_ifalias {
  *
  * void (*ndo_poll_controller)(struct net_device *dev);
  * Run napi_schedule() for all NAPI within this device.
+ * struct napi_struct* (*ndo_netpoll_get_napi)(struct net_device *dev,
+ * struct netdev_queue *txq);
+ * Returns the NAPI to poll for the given TX queue.
  *
  * SR-IOV management 

Re: [PATCH net] af_key: free SKBs under RCU protection

2018-09-20 Thread Eric Dumazet



On 09/20/2018 03:10 PM, Eric Dumazet wrote:
> 
> 
> On 09/20/2018 12:25 PM, stran...@codeaurora.org wrote:
>>>
>>> I do not believe the changelog or the patch makes sense.
>>>
>>> Having skb still referencing a socket prevents this socket being released.
>>>
>>> If you think about it, what would prevent the freeing happening
>>> _before_ the rcu_read_lock() in pfkey_broadcast() ?
>>>
>>> Maybe the correct fix is that pfkey_broadcast_one() should ensure the
>>> socket is still valid.
>>>
>>> I would suggest something like :
>>>
>>> diff --git a/net/key/af_key.c b/net/key/af_key.c
>>> index
>>> 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d
>>> 100644
>>> --- a/net/key/af_key.c
>>> +++ b/net/key/af_key.c
>>> @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff
>>> *skb, struct sk_buff **skb2,
>>>  {
>>>     int err = -ENOBUFS;
>>>
>>> -   sock_hold(sk);
>>> +   if (!refcount_inc_not_zero(&sk->sk_refcnt))
>>> +   return -ENOENT;
>>> +
>>>     if (*skb2 == NULL) {
>>>     if (refcount_read(&skb->users) != 1) {
>>>     *skb2 = skb_clone(skb, allocation);
>>
>> Hi Eric,
>>
>> I'm not sure that the socket getting freed before the rcu_read_lock() would
>> be an issue, since then it would no longer be in the net_pkey->table that
>> we loop through (since we call pfkey_remove() from pfkey_release()). Because of
>> that, all the sockets processed in pfkey_broadcast_one() have valid refcounts,
>> so checking for zero there doesn't prevent the crash that I'm seeing.
>>
>> However, after going over the call flow again, I see that the actual problem
>> occurs because of pfkey_broadcast_one(). Specifically, because of this check:
>>
>> if (*skb2 == NULL) {
>>     if (refcount_read(&skb->users) != 1) {
>>     *skb2 = skb_clone(skb, allocation);
>>     } else {
>>     *skb2 = skb;
>>     refcount_inc(&skb->users);
>>     }
>> }
>>
>> Since we always pass a freshly cloned SKB to this function, skb->users is
>> always 1, and skb2 just becomes skb. We then set skb2 (and thus skb) to
>> belong to the socket.
>>
>> If the socket we queue skb2 to frees this SKB (thereby decrementing its
>> refcount to 1) and the socket is freed before pfkey_broadcast() can
>> execute the kfree_skb(skb) on line 284, we will then attempt to run
>> sock_rfree() on an SKB with a dangling reference to this socket.
>>
>> Perhaps a cleaner solution here is to always clone the SKB in
>> pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls
>> in pfkey_broadcast() will never be passed an SKB with sock_rfree() as
>> its destructor, and we can avoid this race condition.
> 
> As long as one skb has sock_rfree as its destructor, the socket attached to
> this skb can not be released. There is no race here.
> 
> Note that skb_clone() does not propagate the destructor.
> 
> The issue here is that in the rcu lookup, we can find a socket that has been
> dismantled, with a 0 refcount.
> 
> We must not use sock_hold() in this case, since we are not sure the socket 
> refcount is not already 0.
> 
> pfkey_broadcast() and pfkey_broadcast_one() violate basic RCU rules.
> 
> When in a RCU lookup, one want to increment an object refcount, it needs
> to be extra-careful, as I did in my proposal.
> 
> Note that the race could be automatically detected with CONFIG_REFCOUNT_FULL=y

Bug was added in commit 7f6b9dbd5afb ("af_key: locking change")



Re: [PATCH net] af_key: free SKBs under RCU protection

2018-09-20 Thread Eric Dumazet



On 09/20/2018 12:25 PM, stran...@codeaurora.org wrote:
>>
>> I do not believe the changelog or the patch makes sense.
>>
>> Having skb still referencing a socket prevents this socket being released.
>>
>> If you think about it, what would prevent the freeing happening
>> _before_ the rcu_read_lock() in pfkey_broadcast() ?
>>
>> Maybe the correct fix is that pfkey_broadcast_one() should ensure the
>> socket is still valid.
>>
>> I would suggest something like :
>>
>> diff --git a/net/key/af_key.c b/net/key/af_key.c
>> index
>> 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d
>> 100644
>> --- a/net/key/af_key.c
>> +++ b/net/key/af_key.c
>> @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff
>> *skb, struct sk_buff **skb2,
>>  {
>>     int err = -ENOBUFS;
>>
>> -   sock_hold(sk);
>> +   if (!refcount_inc_not_zero(&sk->sk_refcnt))
>> +   return -ENOENT;
>> +
>>     if (*skb2 == NULL) {
>>     if (refcount_read(&skb->users) != 1) {
>>     *skb2 = skb_clone(skb, allocation);
> 
> Hi Eric,
> 
> I'm not sure that the socket getting freed before the rcu_read_lock() would
> be an issue, since then it would no longer be in the net_pkey->table that
> we loop through (since we call pfkey_remove() from pfkey_release()). Because of
> that, all the sockets processed in pfkey_broadcast_one() have valid refcounts,
> so checking for zero there doesn't prevent the crash that I'm seeing.
> 
> However, after going over the call flow again, I see that the actual problem
> occurs because of pfkey_broadcast_one(). Specifically, because of this check:
> 
> if (*skb2 == NULL) {
>     if (refcount_read(&skb->users) != 1) {
>     *skb2 = skb_clone(skb, allocation);
>     } else {
>     *skb2 = skb;
>     refcount_inc(&skb->users);
>     }
> }
> 
> Since we always pass a freshly cloned SKB to this function, skb->users is
> always 1, and skb2 just becomes skb. We then set skb2 (and thus skb) to
> belong to the socket.
> 
> If the socket we queue skb2 to frees this SKB (thereby decrementing its
> refcount to 1) and the socket is freed before pfkey_broadcast() can
> execute the kfree_skb(skb) on line 284, we will then attempt to run
> sock_rfree() on an SKB with a dangling reference to this socket.
> 
> Perhaps a cleaner solution here is to always clone the SKB in
> pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls
> in pfkey_broadcast() will never be passed an SKB with sock_rfree() as
> its destructor, and we can avoid this race condition.

As long as one skb has sock_rfree as its destructor, the socket attached to
this skb can not be released. There is no race here.

Note that skb_clone() does not propagate the destructor.

The issue here is that in the rcu lookup, we can find a socket that has been
dismantled, with a 0 refcount.

We must not use sock_hold() in this case, since we are not sure the socket 
refcount is not already 0.

pfkey_broadcast() and pfkey_broadcast_one() violate basic RCU rules.

When in a RCU lookup, one want to increment an object refcount, it needs
to be extra-careful, as I did in my proposal.

Note that the race could be automatically detected with CONFIG_REFCOUNT_FULL=y




Re: [PATCH net] ixgbe: check return value of napi_complete_done()

2018-09-20 Thread Jeff Kirsher
On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
> On 09/20/2018 12:01 PM, Song Liu wrote:
> > The NIC driver should only enable interrupts when napi_complete_done()
> > returns true. This patch adds the check for ixgbe.
> > 
> > Cc: sta...@vger.kernel.org # 4.10+
> > Cc: Jeff Kirsher 
> > Suggested-by: Eric Dumazet 
> > Signed-off-by: Song Liu 
> > ---
> 
> 
> Well, unfortunately we do not know why this is needed,
> this is why I have not yet sent this patch formally.
> 
> netpoll has correct synchronization :
> 
> poll_napi() places into napi->poll_owner current cpu number before
> calling poll_one_napi()
> 
> netpoll_poll_lock() does also use napi->poll_owner
> 
> When netpoll calls ixgbe poll() method, it passed a budget of 0,
> meaning napi_complete_done() is not called.
> 
> As long as we can not explain the problem properly in the changelog,
> we should investigate, otherwise we will probably see coming dozens of
> patches
> trying to fix a 'potential hazard'.

Agreed, which is why I have our validation and developers looking into it,
while we test the current patch from Song.


signature.asc
Description: This is a digitally signed message part


[PATCH net] r8169: fix autoneg issue on resume with RTL8168E

2018-09-20 Thread Heiner Kallweit
It was reported that chip version 33 (RTL8168E) ends up with
10MBit/Half on a 1GBit link after resuming from S3 (with different
link partners). For whatever reason the PHY on this chip doesn't
properly start a renegotiation when soft-reset.
Explicitly requesting a renegotiation fixes this.

Signed-off-by: Heiner Kallweit 
Fixes: a2965f12fde6 ("r8169: remove rtl8169_set_speed_xmii")
Reported-by: Neil MacLeod 
Tested-by: Neil MacLeod 
---
 drivers/net/ethernet/realtek/r8169.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 1d8631303..7f4647c58 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4069,6 +4069,15 @@ static void rtl8169_init_phy(struct net_device *dev, 
struct rtl8169_private *tp)
phy_speed_up(dev->phydev);
 
genphy_soft_reset(dev->phydev);
+
+   /* It was reported that chip version 33 ends up with 10MBit/Half on a
+* 1GBit link after resuming from S3. For whatever reason the PHY on
+* this chip doesn't properly start a renegotiation when soft-reset.
+* Explicitly requesting a renegotiation fixes this.
+*/
+   if (tp->mac_version == RTL_GIGA_MAC_VER_33 &&
+   dev->phydev->autoneg == AUTONEG_ENABLE)
+   phy_restart_aneg(dev->phydev);
 }
 
 static void rtl_rar_set(struct rtl8169_private *tp, u8 *addr)
-- 
2.19.0



Re: [PATCH net] ixgbe: check return value of napi_complete_done()

2018-09-20 Thread Eric Dumazet



On 09/20/2018 12:01 PM, Song Liu wrote:
> The NIC driver should only enable interrupts when napi_complete_done()
> returns true. This patch adds the check for ixgbe.
> 
> Cc: sta...@vger.kernel.org # 4.10+
> Cc: Jeff Kirsher 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Song Liu 
> ---


Well, unfortunately we do not know why this is needed,
this is why I have not yet sent this patch formally.

netpoll has correct synchronization :

poll_napi() places into napi->poll_owner current cpu number before calling 
poll_one_napi()

netpoll_poll_lock() does also use napi->poll_owner

When netpoll calls ixgbe poll() method, it passed a budget of 0,
meaning napi_complete_done() is not called.

As long as we can not explain the problem properly in the changelog,
we should investigate, otherwise we will probably see coming dozens of patches
trying to fix a 'potential hazard'.

Thanks.




[PATCH net-next v2] net: phy: don't reschedule state machine when PHY is halted

2018-09-20 Thread Heiner Kallweit
When being in state PHY_HALTED we don't have to reschedule the
state machine, phy_start() will start it again.

Signed-off-by: Heiner Kallweit 
---
v2:
- added comment
---
 drivers/net/phy/phy.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index a5e0f0721..a1f8e4816 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1123,9 +1123,13 @@ void phy_state_machine(struct work_struct *work)
 
/* Only re-schedule a PHY state machine change if we are polling the
 * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
-* between states from phy_mac_interrupt()
+* between states from phy_mac_interrupt().
+*
+* In state PHY_HALTED the PHY gets suspended, so rescheduling the
+* state machine would be pointless and possibly error prone when
+* called from phy_disconnect() synchronously.
 */
-   if (phy_polling_mode(phydev))
+   if (phy_polling_mode(phydev) && old_state != PHY_HALTED)
queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
   PHY_STATE_TIME * HZ);
 }
-- 
2.19.0




[PATCH net] netlabel: check for IPV4MASK in addrinfo_get

2018-09-20 Thread Sean Tranchetti
netlbl_unlabel_addrinfo_get() assumes that if it finds the
NLBL_UNLABEL_A_IPV4ADDR attribute, it must also have the
NLBL_UNLABEL_A_IPV4MASK attribute as well. However, this is
not necessarily the case as the current checks in
netlbl_unlabel_staticadd() and friends are not sufficient to
enforce this.

If passed a netlink message with NLBL_UNLABEL_A_IPV4ADDR,
NLBL_UNLABEL_A_IPV6ADDR, and NLBL_UNLABEL_A_IPV6MASK attributes,
these functions will all call netlbl_unlabel_addrinfo_get() which
will then attempt to dereference NULL when fetching the non-existent
NLBL_UNLABEL_A_IPV4MASK attribute:

Unable to handle kernel NULL pointer dereference at virtual address 0
Process unlab (pid: 31762, stack limit = 0xff80502d8000)
Call trace:
netlbl_unlabel_addrinfo_get+0x44/0xd8
netlbl_unlabel_staticremovedef+0x98/0xe0
genl_rcv_msg+0x354/0x388
netlink_rcv_skb+0xac/0x118
genl_rcv+0x34/0x48
netlink_unicast+0x158/0x1f0
netlink_sendmsg+0x32c/0x338
sock_sendmsg+0x44/0x60
___sys_sendmsg+0x1d0/0x2a8
__sys_sendmsg+0x64/0xb4
SyS_sendmsg+0x34/0x4c
el0_svc_naked+0x34/0x38
Code: 51001149 7100113f 54a0 f9401508 (79400108)
---[ end trace f6438a488e737143 ]---
Kernel panic - not syncing: Fatal exception

Signed-off-by: Sean Tranchetti 
---
 net/netlabel/netlabel_unlabeled.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netlabel/netlabel_unlabeled.c 
b/net/netlabel/netlabel_unlabeled.c
index c070dfc..c92894c 100644
--- a/net/netlabel/netlabel_unlabeled.c
+++ b/net/netlabel/netlabel_unlabeled.c
@@ -781,7 +781,8 @@ static int netlbl_unlabel_addrinfo_get(struct genl_info 
*info,
 {
u32 addr_len;
 
-   if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR]) {
+   if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR] &&
+   info->attrs[NLBL_UNLABEL_A_IPV4MASK]) {
addr_len = nla_len(info->attrs[NLBL_UNLABEL_A_IPV4ADDR]);
if (addr_len != sizeof(struct in_addr) &&
addr_len != nla_len(info->attrs[NLBL_UNLABEL_A_IPV4MASK]))
-- 
1.9.1



Re: [PATCH rdma-next v1 0/7] Preparation to DevX extension series

2018-09-20 Thread Or Gerlitz
On Thu, Sep 20, 2018 at 9:35 PM, Leon Romanovsky wrote:
> From: Leon Romanovsky 

> Set uid as part of various IB commands so that the firmware can manage
> the IB object in a secured way.

This series deals with FS core and commands used by the Ethernet driver

>   net/mlx5: Update mlx5_ifc with DEVX UID bits
>   net/mlx5: Set uid as part of CQ commands
>   net/mlx5: Set uid as part of RQ commands
>   net/mlx5: Set uid as part of SQ commands

such as the life-cycle (create, modify, destroy) of objects like:
RQ/SQ/CQ/etc --
saying IB commands is inaccurate, please be more precise


>
> The firmware should mark this IB object with the given uid so that it
> can be used later on only by objects with the same uid.

these are not IB objects, but rather Ethernet objects

>
> Upon DEVX flows that use this objec, the pointed object must have
> the same uid as of the issuer uid command.

typo - objec --> object

>
> When a command is issued with uid=0 it means that the issuer of the
> command is trusted (i.e. kernel), in that case any pointed object
> can be used regardless of its uid.
>
> Thanks
>
>   net/mlx5: Set uid as part of SRQ commands
>   net/mlx5: Set uid as part of DCT commands
>   net/mlx5: Set uid as part of QP commands


Re: [PATCH mlx5-next v1 7/7] net/mlx5: Update mlx5_ifc with DEVX UID bits

2018-09-20 Thread Or Gerlitz
On Thu, Sep 20, 2018 at 9:35 PM, Leon Romanovsky wrote:
> Add DEVX information to WQ, SRQ, CQ, TRI, TIS, QP,
> RQ, XRCD, PD, MKEY and MCG.

typo - TRI --> TIR

"DEVX information" is a bit cryptic - what does xxx_valid mean
for the devx use case (fw command is issued by devx process) - and
what semantics does it have for non-devx callers -- e.g. the Eth driver when
it deals with tir/tis/etc

Explain and elaborate further in the change log

Or.


Re: [PATCH mlx5-next v1 7/7] net/mlx5: Update mlx5_ifc with DEVX UID bits

2018-09-20 Thread Or Gerlitz
On Thu, Sep 20, 2018 at 9:35 PM, Leon Romanovsky  wrote:
> From: Leon Romanovsky 
>
> Add DEVX information to WQ, SRQ, CQ, TRI, TIS, QP,
> RQ, XRCD, PD, MKEY and MCG.
>
> Signed-off-by: Leon Romanovsky 
> ---
>  include/linux/mlx5/mlx5_ifc.h | 67 +++++++++++++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 43 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index efa4a60431d4..0f460fb22c31 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -1291,7 +1291,9 @@ struct mlx5_ifc_wq_bits {
> u8 reserved_at_118[0x3];
> u8 log_wq_sz[0x5];
>
> -   u8 reserved_at_120[0x3];
> +   u8 dbr_umem_valid[0x1];
> +   u8 wq_umem_valid[0x1];
> +   u8 reserved_at_122[0x1];
> u8 log_hairpin_num_packets[0x5];
> u8 reserved_at_128[0x3];
> u8 log_hairpin_data_sz[0x5];
> @@ -2365,7 +2367,10 @@ struct mlx5_ifc_qpc_bits {
>
> u8 dc_access_key[0x40];
>
> -   u8 reserved_at_680[0xc0];
> +   u8 reserved_at_680[0x3];
> +   u8 dbr_umem_valid[0x1];
> +
> +   u8 reserved_at_684[0xbc];
>  };
>
>  struct mlx5_ifc_roce_addr_layout_bits {
> @@ -2465,7 +2470,7 @@ struct mlx5_ifc_xrc_srqc_bits {
>
> u8 wq_signature[0x1];
> u8 cont_srq[0x1];
> -   u8 reserved_at_22[0x1];
> +   u8 dbr_umem_valid[0x1];
> u8 rlky[0x1];
> u8 basic_cyclic_rcv_wqe[0x1];
> u8 log_rq_stride[0x3];
> @@ -3129,7 +3134,9 @@ enum {
>
>  struct mlx5_ifc_cqc_bits {
> u8 status[0x4];
> -   u8 reserved_at_4[0x4];
> +   u8 reserved_at_4[0x2];
> +   u8 dbr_umem_valid[0x1];
> +   u8 reserved_at_7[0x1];
> u8 cqe_sz[0x3];
> u8 cc[0x1];
> u8 reserved_at_c[0x1];
> @@ -5315,7 +5322,7 @@ struct mlx5_ifc_modify_tis_bitmask_bits {
>
>  struct mlx5_ifc_modify_tis_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -5354,7 +5361,7 @@ struct mlx5_ifc_modify_tir_out_bits {
>
>  struct mlx5_ifc_modify_tir_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -5455,7 +5462,7 @@ struct mlx5_ifc_rqt_bitmask_bits {
>
>  struct mlx5_ifc_modify_rqt_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -5642,7 +5649,10 @@ struct mlx5_ifc_modify_cq_in_bits {
>
> struct mlx5_ifc_cqc_bits cq_context;
>
> -   u8 reserved_at_280[0x600];
> +   u8 reserved_at_280[0x40];
> +
> +   u8 cq_umem_valid[0x1];
> +   u8 reserved_at_2c1[0x5bf];
>
> u8 pas[0][0x40];
>  };
> @@ -5963,7 +5973,7 @@ struct mlx5_ifc_detach_from_mcg_out_bits {
>
>  struct mlx5_ifc_detach_from_mcg_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -6031,7 +6041,7 @@ struct mlx5_ifc_destroy_tis_out_bits {
>
>  struct mlx5_ifc_destroy_tis_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -6053,7 +6063,7 @@ struct mlx5_ifc_destroy_tir_out_bits {
>
>  struct mlx5_ifc_destroy_tir_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -6143,7 +6153,7 @@ struct mlx5_ifc_destroy_rqt_out_bits {
>
>  struct mlx5_ifc_destroy_rqt_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -6508,7 +6518,7 @@ struct mlx5_ifc_dealloc_xrcd_out_bits {
>
>  struct mlx5_ifc_dealloc_xrcd_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8 op_mod[0x10];
> @@ -6596,7 +6606,7 @@ struct mlx5_ifc_dealloc_pd_out_bits {
>
>  struct mlx5_ifc_dealloc_pd_in_bits {
> u8 opcode[0x10];
> -   u8 reserved_at_10[0x10];
> +   u8 uid[0x10];
>
> u8 reserved_at_20[0x10];
> u8   


Re: [PATCH net] af_key: free SKBs under RCU protection

2018-09-20 Thread stranche


I do not believe the changelog or the patch makes sense.

Having skb still referencing a socket prevents this socket being released.


If you think about it, what would prevent the freeing happening
_before_ the rcu_read_lock() in pfkey_broadcast() ?

Maybe the correct fix is that pfkey_broadcast_one() should ensure the
socket is still valid.

I would suggest something like :

diff --git a/net/key/af_key.c b/net/key/af_key.c
index
9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d
100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff
*skb, struct sk_buff **skb2,
 {
int err = -ENOBUFS;

-   sock_hold(sk);
+   if (!refcount_inc_not_zero(&sk->sk_refcnt))
+   return -ENOENT;
+
if (*skb2 == NULL) {
if (refcount_read(&skb->users) != 1) {
*skb2 = skb_clone(skb, allocation);


Hi Eric,

I'm not sure that the socket getting freed before the rcu_read_lock() would
be an issue, since then it would no longer be in the net_pkey->table that
we loop through (since we call pfkey_remove() from pfkey_release()). Because of
that, all the sockets processed in pfkey_broadcast_one() have valid refcounts,

so checking for zero there doesn't prevent the crash that I'm seeing.

However, after going over the call flow again, I see that the actual problem
occurs because of pfkey_broadcast_one(). Specifically, because of this check:


if (*skb2 == NULL) {
if (refcount_read(&skb->users) != 1) {
*skb2 = skb_clone(skb, allocation);
} else {
*skb2 = skb;
refcount_inc(&skb->users);
}
}

Since we always pass a freshly cloned SKB to this function, skb->users is
always 1, and skb2 just becomes skb. We then set skb2 (and thus skb) to
belong to the socket.

If the socket we queue skb2 to frees this SKB (thereby decrementing its
refcount to 1) and the socket is freed before pfkey_broadcast() can
execute the kfree_skb(skb) on line 284, we will then attempt to run
sock_rfree() on an SKB with a dangling reference to this socket.

Perhaps a cleaner solution here is to always clone the SKB in
pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls
in pfkey_broadcast() will never be passed an SKB with sock_rfree() as
its destructor, and we can avoid this race condition.


[PATCH net] ixgbe: check return value of napi_complete_done()

2018-09-20 Thread Song Liu
The NIC driver should only enable interrupts when napi_complete_done()
returns true. This patch adds the check for ixgbe.

Cc: sta...@vger.kernel.org # 4.10+
Cc: Jeff Kirsher 
Suggested-by: Eric Dumazet 
Signed-off-by: Song Liu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9a23d33a47ed..1497df6a8dff 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3196,11 +3196,13 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
return budget;
 
/* all work done, exit the polling mode */
-   napi_complete_done(napi, work_done);
-   if (adapter->rx_itr_setting & 1)
-   ixgbe_set_itr(q_vector);
-   if (!test_bit(__IXGBE_DOWN, &adapter->state))
-   ixgbe_irq_enable_queues(adapter, BIT_ULL(q_vector->v_idx));
+   if (likely(napi_complete_done(napi, work_done))) {
+   if (adapter->rx_itr_setting & 1)
+   ixgbe_set_itr(q_vector);
+   if (!test_bit(__IXGBE_DOWN, &adapter->state))
+   ixgbe_irq_enable_queues(adapter,
+   BIT_ULL(q_vector->v_idx));
+   }
 
return min(work_done, budget - 1);
 }
-- 
2.17.1



[PATCH mlx5-next v1 6/7] net/mlx5: Set uid as part of DCT commands

2018-09-20 Thread Leon Romanovsky
From: Yishai Hadas 

Set uid as part of DCT commands so that the firmware can manage the
DCT object in a secured way.

That will enable using a DCT that was created by verbs application
to be used by the DEVX flow in case the uid is equal.

Signed-off-by: Yishai Hadas 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 4 ++++
 include/linux/mlx5/mlx5_ifc.h                | 6 +++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index d9b12136cbfd..91b8139a388d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -211,6 +211,7 @@ int mlx5_core_create_dct(struct mlx5_core_dev *dev,
}

qp->qpn = MLX5_GET(create_dct_out, out, dctn);
+   qp->uid = MLX5_GET(create_dct_in, in, uid);
err = create_resource_common(dev, qp, MLX5_RES_DCT);
if (err)
goto err_cmd;
@@ -219,6 +220,7 @@ int mlx5_core_create_dct(struct mlx5_core_dev *dev,
 err_cmd:
MLX5_SET(destroy_dct_in, din, opcode, MLX5_CMD_OP_DESTROY_DCT);
MLX5_SET(destroy_dct_in, din, dctn, qp->qpn);
+   MLX5_SET(destroy_dct_in, din, uid, qp->uid);
mlx5_cmd_exec(dev, (void *)&din, sizeof(din),
  (void *)&dout, sizeof(dout));
return err;
@@ -277,6 +279,7 @@ static int mlx5_core_drain_dct(struct mlx5_core_dev *dev,

MLX5_SET(drain_dct_in, in, opcode, MLX5_CMD_OP_DRAIN_DCT);
MLX5_SET(drain_dct_in, in, dctn, qp->qpn);
+   MLX5_SET(drain_dct_in, in, uid, qp->uid);
return mlx5_cmd_exec(dev, (void *)&in, sizeof(in),
 (void *)&out, sizeof(out));
 }
@@ -303,6 +306,7 @@ int mlx5_core_destroy_dct(struct mlx5_core_dev *dev,
destroy_resource_common(dev, &qp->mqp);
MLX5_SET(destroy_dct_in, in, opcode, MLX5_CMD_OP_DESTROY_DCT);
MLX5_SET(destroy_dct_in, in, dctn, qp->qpn);
+   MLX5_SET(destroy_dct_in, in, uid, qp->uid);
err = mlx5_cmd_exec(dev, (void *)&in, sizeof(in),
(void *)&out, sizeof(out));
return err;
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 5a2f0b02483a..efa4a60431d4 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -5919,7 +5919,7 @@ struct mlx5_ifc_drain_dct_out_bits {

 struct mlx5_ifc_drain_dct_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6383,7 +6383,7 @@ struct mlx5_ifc_destroy_dct_out_bits {

 struct mlx5_ifc_destroy_dct_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -7139,7 +7139,7 @@ struct mlx5_ifc_create_dct_out_bits {

 struct mlx5_ifc_create_dct_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
--
2.14.4



[PATCH mlx5-next v1 7/7] net/mlx5: Update mlx5_ifc with DEVX UID bits

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Add DEVX information to WQ, SRQ, CQ, TIR, TIS, QP,
RQ, XRCD, PD, MKEY and MCG.

Signed-off-by: Leon Romanovsky 
---
 include/linux/mlx5/mlx5_ifc.h | 67 +++
 1 file changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index efa4a60431d4..0f460fb22c31 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1291,7 +1291,9 @@ struct mlx5_ifc_wq_bits {
u8 reserved_at_118[0x3];
u8 log_wq_sz[0x5];

-   u8 reserved_at_120[0x3];
+   u8 dbr_umem_valid[0x1];
+   u8 wq_umem_valid[0x1];
+   u8 reserved_at_122[0x1];
u8 log_hairpin_num_packets[0x5];
u8 reserved_at_128[0x3];
u8 log_hairpin_data_sz[0x5];
@@ -2365,7 +2367,10 @@ struct mlx5_ifc_qpc_bits {

u8 dc_access_key[0x40];

-   u8 reserved_at_680[0xc0];
+   u8 reserved_at_680[0x3];
+   u8 dbr_umem_valid[0x1];
+
+   u8 reserved_at_684[0xbc];
 };

 struct mlx5_ifc_roce_addr_layout_bits {
@@ -2465,7 +2470,7 @@ struct mlx5_ifc_xrc_srqc_bits {

u8 wq_signature[0x1];
u8 cont_srq[0x1];
-   u8 reserved_at_22[0x1];
+   u8 dbr_umem_valid[0x1];
u8 rlky[0x1];
u8 basic_cyclic_rcv_wqe[0x1];
u8 log_rq_stride[0x3];
@@ -3129,7 +3134,9 @@ enum {

 struct mlx5_ifc_cqc_bits {
u8 status[0x4];
-   u8 reserved_at_4[0x4];
+   u8 reserved_at_4[0x2];
+   u8 dbr_umem_valid[0x1];
+   u8 reserved_at_7[0x1];
u8 cqe_sz[0x3];
u8 cc[0x1];
u8 reserved_at_c[0x1];
@@ -5315,7 +5322,7 @@ struct mlx5_ifc_modify_tis_bitmask_bits {

 struct mlx5_ifc_modify_tis_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -5354,7 +5361,7 @@ struct mlx5_ifc_modify_tir_out_bits {

 struct mlx5_ifc_modify_tir_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -5455,7 +5462,7 @@ struct mlx5_ifc_rqt_bitmask_bits {

 struct mlx5_ifc_modify_rqt_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -5642,7 +5649,10 @@ struct mlx5_ifc_modify_cq_in_bits {

struct mlx5_ifc_cqc_bits cq_context;

-   u8 reserved_at_280[0x600];
+   u8 reserved_at_280[0x40];
+
+   u8 cq_umem_valid[0x1];
+   u8 reserved_at_2c1[0x5bf];

u8 pas[0][0x40];
 };
@@ -5963,7 +5973,7 @@ struct mlx5_ifc_detach_from_mcg_out_bits {

 struct mlx5_ifc_detach_from_mcg_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6031,7 +6041,7 @@ struct mlx5_ifc_destroy_tis_out_bits {

 struct mlx5_ifc_destroy_tis_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6053,7 +6063,7 @@ struct mlx5_ifc_destroy_tir_out_bits {

 struct mlx5_ifc_destroy_tir_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6143,7 +6153,7 @@ struct mlx5_ifc_destroy_rqt_out_bits {

 struct mlx5_ifc_destroy_rqt_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6508,7 +6518,7 @@ struct mlx5_ifc_dealloc_xrcd_out_bits {

 struct mlx5_ifc_dealloc_xrcd_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6596,7 +6606,7 @@ struct mlx5_ifc_dealloc_pd_out_bits {

 struct mlx5_ifc_dealloc_pd_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6675,7 +6685,9 @@ struct mlx5_ifc_create_xrc_srq_in_bits {

struct mlx5_ifc_xrc_srqc_bits xrc_srq_context_entry;

-   u8 reserved_at_280[0x600];
+   u8 reserved_at_280[0x40];
+   u8 xrc_srq_umem_valid[0x1];
+   u8 reserved_at_2c1[0x5bf];

u8 

[PATCH mlx5-next v1 5/7] net/mlx5: Set uid as part of SRQ commands

2018-09-20 Thread Leon Romanovsky
From: Yishai Hadas 

Set uid as part of SRQ commands so that the firmware can manage the
SRQ object in a secure way.

That will enable an SRQ created by a verbs application to be used by
the DEVX flow when the uids match.

Signed-off-by: Yishai Hadas 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/srq.c | 30 ---
 include/linux/mlx5/driver.h   |  1 +
 include/linux/mlx5/mlx5_ifc.h | 22 ++--
 include/linux/mlx5/srq.h  |  1 +
 4 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/srq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/srq.c
index 23cc337a96c9..5c519615fb1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/srq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/srq.c
@@ -166,6 +166,7 @@ static int create_srq_cmd(struct mlx5_core_dev *dev, struct 
mlx5_core_srq *srq,
if (!create_in)
return -ENOMEM;

+   MLX5_SET(create_srq_in, create_in, uid, in->uid);
srqc = MLX5_ADDR_OF(create_srq_in, create_in, srq_context_entry);
pas = MLX5_ADDR_OF(create_srq_in, create_in, pas);

@@ -178,8 +179,10 @@ static int create_srq_cmd(struct mlx5_core_dev *dev, 
struct mlx5_core_srq *srq,
err = mlx5_cmd_exec(dev, create_in, inlen, create_out,
sizeof(create_out));
kvfree(create_in);
-   if (!err)
+   if (!err) {
srq->srqn = MLX5_GET(create_srq_out, create_out, srqn);
+   srq->uid = in->uid;
+   }

return err;
 }
@@ -193,6 +196,7 @@ static int destroy_srq_cmd(struct mlx5_core_dev *dev,
MLX5_SET(destroy_srq_in, srq_in, opcode,
 MLX5_CMD_OP_DESTROY_SRQ);
MLX5_SET(destroy_srq_in, srq_in, srqn, srq->srqn);
+   MLX5_SET(destroy_srq_in, srq_in, uid, srq->uid);

return mlx5_cmd_exec(dev, srq_in, sizeof(srq_in),
 srq_out, sizeof(srq_out));
@@ -208,6 +212,7 @@ static int arm_srq_cmd(struct mlx5_core_dev *dev, struct 
mlx5_core_srq *srq,
MLX5_SET(arm_rq_in, srq_in, op_mod, MLX5_ARM_RQ_IN_OP_MOD_SRQ);
MLX5_SET(arm_rq_in, srq_in, srq_number, srq->srqn);
MLX5_SET(arm_rq_in, srq_in, lwm,  lwm);
+   MLX5_SET(arm_rq_in, srq_in, uid, srq->uid);

return  mlx5_cmd_exec(dev, srq_in, sizeof(srq_in),
  srq_out, sizeof(srq_out));
@@ -260,6 +265,7 @@ static int create_xrc_srq_cmd(struct mlx5_core_dev *dev,
if (!create_in)
return -ENOMEM;

+   MLX5_SET(create_xrc_srq_in, create_in, uid, in->uid);
xrc_srqc = MLX5_ADDR_OF(create_xrc_srq_in, create_in,
xrc_srq_context_entry);
pas  = MLX5_ADDR_OF(create_xrc_srq_in, create_in, pas);
@@ -277,6 +283,7 @@ static int create_xrc_srq_cmd(struct mlx5_core_dev *dev,
goto out;

srq->srqn = MLX5_GET(create_xrc_srq_out, create_out, xrc_srqn);
+   srq->uid = in->uid;
 out:
kvfree(create_in);
return err;
@@ -291,6 +298,7 @@ static int destroy_xrc_srq_cmd(struct mlx5_core_dev *dev,
MLX5_SET(destroy_xrc_srq_in, xrcsrq_in, opcode,
 MLX5_CMD_OP_DESTROY_XRC_SRQ);
MLX5_SET(destroy_xrc_srq_in, xrcsrq_in, xrc_srqn, srq->srqn);
+   MLX5_SET(destroy_xrc_srq_in, xrcsrq_in, uid, srq->uid);

return mlx5_cmd_exec(dev, xrcsrq_in, sizeof(xrcsrq_in),
 xrcsrq_out, sizeof(xrcsrq_out));
@@ -306,6 +314,7 @@ static int arm_xrc_srq_cmd(struct mlx5_core_dev *dev,
MLX5_SET(arm_xrc_srq_in, xrcsrq_in, op_mod,   
MLX5_ARM_XRC_SRQ_IN_OP_MOD_XRC_SRQ);
MLX5_SET(arm_xrc_srq_in, xrcsrq_in, xrc_srqn, srq->srqn);
MLX5_SET(arm_xrc_srq_in, xrcsrq_in, lwm,  lwm);
+   MLX5_SET(arm_xrc_srq_in, xrcsrq_in, uid, srq->uid);

return  mlx5_cmd_exec(dev, xrcsrq_in, sizeof(xrcsrq_in),
  xrcsrq_out, sizeof(xrcsrq_out));
@@ -365,10 +374,13 @@ static int create_rmp_cmd(struct mlx5_core_dev *dev, 
struct mlx5_core_srq *srq,
wq = MLX5_ADDR_OF(rmpc, rmpc, wq);

MLX5_SET(rmpc, rmpc, state, MLX5_RMPC_STATE_RDY);
+   MLX5_SET(create_rmp_in, create_in, uid, in->uid);
set_wq(wq, in);
memcpy(MLX5_ADDR_OF(rmpc, rmpc, wq.pas), in->pas, pas_size);

err = mlx5_core_create_rmp(dev, create_in, inlen, &srq->srqn);
+   if (!err)
+   srq->uid = in->uid;

kvfree(create_in);
return err;
@@ -377,7 +389,13 @@ static int create_rmp_cmd(struct mlx5_core_dev *dev, 
struct mlx5_core_srq *srq,
 static int destroy_rmp_cmd(struct mlx5_core_dev *dev,
   struct mlx5_core_srq *srq)
 {
-   return mlx5_core_destroy_rmp(dev, srq->srqn);
+   u32 in[MLX5_ST_SZ_DW(destroy_rmp_in)]   = {};
+   u32 out[MLX5_ST_SZ_DW(destroy_rmp_out)] = {};
+
+   MLX5_SET(destroy_rmp_in, in, 

[PATCH mlx5-next v1 2/7] net/mlx5: Set uid as part of QP commands

2018-09-20 Thread Leon Romanovsky
From: Yishai Hadas 

Set uid as part of QP commands so that the firmware can manage the
QP object in a secure way.

That will enable a QP created by a verbs application to be used by
the DEVX flow when the uids match.

Signed-off-by: Yishai Hadas 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 45 +---
 include/linux/mlx5/mlx5_ifc.h| 22 +++---
 include/linux/mlx5/qp.h  |  1 +
 3 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index 4ca07bfb6b14..4e2ab3c916bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -240,6 +240,7 @@ int mlx5_core_create_qp(struct mlx5_core_dev *dev,
if (err)
return err;

+   qp->uid = MLX5_GET(create_qp_in, in, uid);
qp->qpn = MLX5_GET(create_qp_out, out, qpn);
mlx5_core_dbg(dev, "qpn = 0x%x\n", qp->qpn);

@@ -261,6 +262,7 @@ int mlx5_core_create_qp(struct mlx5_core_dev *dev,
memset(dout, 0, sizeof(dout));
MLX5_SET(destroy_qp_in, din, opcode, MLX5_CMD_OP_DESTROY_QP);
MLX5_SET(destroy_qp_in, din, qpn, qp->qpn);
+   MLX5_SET(destroy_qp_in, din, uid, qp->uid);
mlx5_cmd_exec(dev, din, sizeof(din), dout, sizeof(dout));
return err;
 }
@@ -320,6 +322,7 @@ int mlx5_core_destroy_qp(struct mlx5_core_dev *dev,

MLX5_SET(destroy_qp_in, in, opcode, MLX5_CMD_OP_DESTROY_QP);
MLX5_SET(destroy_qp_in, in, qpn, qp->qpn);
+   MLX5_SET(destroy_qp_in, in, uid, qp->uid);
err = mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
if (err)
return err;
@@ -373,7 +376,7 @@ static void mbox_free(struct mbox_info *mbox)

 static int modify_qp_mbox_alloc(struct mlx5_core_dev *dev, u16 opcode, int qpn,
u32 opt_param_mask, void *qpc,
-   struct mbox_info *mbox)
+   struct mbox_info *mbox, u16 uid)
 {
mbox->out = NULL;
mbox->in = NULL;
@@ -381,26 +384,32 @@ static int modify_qp_mbox_alloc(struct mlx5_core_dev 
*dev, u16 opcode, int qpn,
 #define MBOX_ALLOC(mbox, typ)  \
mbox_alloc(mbox, MLX5_ST_SZ_BYTES(typ##_in), 
MLX5_ST_SZ_BYTES(typ##_out))

-#define MOD_QP_IN_SET(typ, in, _opcode, _qpn) \
-   MLX5_SET(typ##_in, in, opcode, _opcode); \
-   MLX5_SET(typ##_in, in, qpn, _qpn)
-
-#define MOD_QP_IN_SET_QPC(typ, in, _opcode, _qpn, _opt_p, _qpc) \
-   MOD_QP_IN_SET(typ, in, _opcode, _qpn); \
-   MLX5_SET(typ##_in, in, opt_param_mask, _opt_p); \
-   memcpy(MLX5_ADDR_OF(typ##_in, in, qpc), _qpc, MLX5_ST_SZ_BYTES(qpc))
+#define MOD_QP_IN_SET(typ, in, _opcode, _qpn, _uid) \
+   do {   \
+   MLX5_SET(typ##_in, in, opcode, _opcode);   \
+   MLX5_SET(typ##_in, in, qpn, _qpn); \
+   MLX5_SET(typ##_in, in, uid, _uid); \
+   } while (0)
+
+#define MOD_QP_IN_SET_QPC(typ, in, _opcode, _qpn, _opt_p, _qpc, _uid) \
+   do {   \
+   MOD_QP_IN_SET(typ, in, _opcode, _qpn, _uid);   \
+   MLX5_SET(typ##_in, in, opt_param_mask, _opt_p);\
+   memcpy(MLX5_ADDR_OF(typ##_in, in, qpc), _qpc,  \
+  MLX5_ST_SZ_BYTES(qpc)); \
+   } while (0)

switch (opcode) {
/* 2RST & 2ERR */
case MLX5_CMD_OP_2RST_QP:
if (MBOX_ALLOC(mbox, qp_2rst))
return -ENOMEM;
-   MOD_QP_IN_SET(qp_2rst, mbox->in, opcode, qpn);
+   MOD_QP_IN_SET(qp_2rst, mbox->in, opcode, qpn, uid);
break;
case MLX5_CMD_OP_2ERR_QP:
if (MBOX_ALLOC(mbox, qp_2err))
return -ENOMEM;
-   MOD_QP_IN_SET(qp_2err, mbox->in, opcode, qpn);
+   MOD_QP_IN_SET(qp_2err, mbox->in, opcode, qpn, uid);
break;

/* MODIFY with QPC */
@@ -408,37 +417,37 @@ static int modify_qp_mbox_alloc(struct mlx5_core_dev 
*dev, u16 opcode, int qpn,
if (MBOX_ALLOC(mbox, rst2init_qp))
return -ENOMEM;
MOD_QP_IN_SET_QPC(rst2init_qp, mbox->in, opcode, qpn,
- opt_param_mask, qpc);
+ opt_param_mask, qpc, uid);
break;
case MLX5_CMD_OP_INIT2RTR_QP:
if (MBOX_ALLOC(mbox, init2rtr_qp))
return -ENOMEM;
MOD_QP_IN_SET_QPC(init2rtr_qp, mbox->in, opcode, qpn,
-  

[PATCH rdma-next v1 0/7] Preparation to DevX extension series

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Changelog v0->v1:
 * Update commit messages
 * Split DevX series to small sub-series.
 * Change static initialization from {0} to be {}

--- From Yishai,

Set uid as part of various IB commands so that the firmware can manage
the IB object in a secure way.

The firmware should mark this IB object with the given uid so that it
can be used later on only by objects with the same uid.

Upon DEVX flows that use this object, the pointed object must have
the same uid as that of the issuing command.

When a command is issued with uid=0 it means that the issuer of the
command is trusted (i.e. the kernel); in that case any pointed object
can be used regardless of its uid.

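The matching rule described above can be condensed into a few lines of C
(a sketch of the rule the firmware is expected to apply; the helper name is
hypothetical, not part of the driver):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the DEVX uid matching rule:
 * uid 0 (the kernel) may use any object; otherwise uids must match. */
static bool uid_may_use(uint16_t issuer_uid, uint16_t object_uid)
{
    return issuer_uid == 0 || issuer_uid == object_uid;
}

int main(void)
{
    assert(uid_may_use(0, 17));   /* kernel command: always allowed */
    assert(uid_may_use(5, 5));    /* matching DEVX uids: allowed */
    assert(!uid_may_use(5, 7));   /* mismatched uids: rejected */
    return 0;
}
```
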
Thanks

Leon Romanovsky (1):
  net/mlx5: Update mlx5_ifc with DEVX UID bits

Yishai Hadas (6):
  net/mlx5: Set uid as part of CQ commands
  net/mlx5: Set uid as part of QP commands
  net/mlx5: Set uid as part of RQ commands
  net/mlx5: Set uid as part of SQ commands
  net/mlx5: Set uid as part of SRQ commands
  net/mlx5: Set uid as part of DCT commands

 drivers/net/ethernet/mellanox/mlx5/core/cq.c  |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/qp.c  |  81 +++-
 drivers/net/ethernet/mellanox/mlx5/core/srq.c |  30 +-
 include/linux/mlx5/cq.h   |   1 +
 include/linux/mlx5/driver.h   |   1 +
 include/linux/mlx5/mlx5_ifc.h | 135 +++---
 include/linux/mlx5/qp.h   |   1 +
 include/linux/mlx5/srq.h  |   1 +
 8 files changed, 171 insertions(+), 83 deletions(-)

--
2.14.4



[PATCH mlx5-next v1 1/7] net/mlx5: Set uid as part of CQ commands

2018-09-20 Thread Leon Romanovsky
From: Yishai Hadas 

Set uid as part of CQ commands so that the firmware can manage the CQ
object in a secure way.

The firmware should mark this CQ with the given uid so that it can
be used later on only by objects with the same uid.

Upon DEVX flows that use this CQ (e.g. create QP command), the
pointed CQ must have the same uid as that of the issuing command.

When a command is issued with uid=0 it means that the issuer of the
command is trusted (i.e. the kernel); in that case any pointed object
can be used regardless of its uid.

Signed-off-by: Yishai Hadas 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/cq.c | 4 
 include/linux/mlx5/cq.h  | 1 +
 include/linux/mlx5/mlx5_ifc.h| 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index a4179122a279..4b85abb5c9f7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -109,6 +109,7 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq,
cq->cons_index = 0;
cq->arm_sn = 0;
cq->eq = eq;
+   cq->uid = MLX5_GET(create_cq_in, in, uid);
refcount_set(&cq->refcount, 1);
init_completion(&cq->free);
if (!cq->comp)
@@ -144,6 +145,7 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq,
memset(dout, 0, sizeof(dout));
MLX5_SET(destroy_cq_in, din, opcode, MLX5_CMD_OP_DESTROY_CQ);
MLX5_SET(destroy_cq_in, din, cqn, cq->cqn);
+   MLX5_SET(destroy_cq_in, din, uid, cq->uid);
mlx5_cmd_exec(dev, din, sizeof(din), dout, sizeof(dout));
return err;
 }
@@ -165,6 +167,7 @@ int mlx5_core_destroy_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq)

MLX5_SET(destroy_cq_in, in, opcode, MLX5_CMD_OP_DESTROY_CQ);
MLX5_SET(destroy_cq_in, in, cqn, cq->cqn);
+   MLX5_SET(destroy_cq_in, in, uid, cq->uid);
err = mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
if (err)
return err;
@@ -196,6 +199,7 @@ int mlx5_core_modify_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq,
u32 out[MLX5_ST_SZ_DW(modify_cq_out)] = {0};

MLX5_SET(modify_cq_in, in, opcode, MLX5_CMD_OP_MODIFY_CQ);
+   MLX5_SET(modify_cq_in, in, uid, cq->uid);
return mlx5_cmd_exec(dev, in, inlen, out, sizeof(out));
 }
 EXPORT_SYMBOL(mlx5_core_modify_cq);
diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
index 0ef6138eca49..31a750570c38 100644
--- a/include/linux/mlx5/cq.h
+++ b/include/linux/mlx5/cq.h
@@ -61,6 +61,7 @@ struct mlx5_core_cq {
int reset_notify_added;
struct list_headreset_notify;
struct mlx5_eq  *eq;
+   u16 uid;
 };


diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index a14c4eaff53f..e62a0825d35c 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -5630,7 +5630,7 @@ enum {

 struct mlx5_ifc_modify_cq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6405,7 +6405,7 @@ struct mlx5_ifc_destroy_cq_out_bits {

 struct mlx5_ifc_destroy_cq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -7165,7 +7165,7 @@ struct mlx5_ifc_create_cq_out_bits {

 struct mlx5_ifc_create_cq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
--
2.14.4



[PATCH mlx5-next v1 3/7] net/mlx5: Set uid as part of RQ commands

2018-09-20 Thread Leon Romanovsky
From: Yishai Hadas 

Set uid as part of RQ commands so that the firmware can manage the
RQ object in a secure way.

That will enable an RQ created by a verbs application to be used by
the DEVX flow when the uids match.

Signed-off-by: Yishai Hadas 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 16 ++--
 include/linux/mlx5/mlx5_ifc.h|  6 +++---
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index 4e2ab3c916bf..f57e08d4f970 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -540,6 +540,17 @@ int mlx5_core_xrcd_dealloc(struct mlx5_core_dev *dev, u32 
xrcdn)
 }
 EXPORT_SYMBOL_GPL(mlx5_core_xrcd_dealloc);

+static void destroy_rq_tracked(struct mlx5_core_dev *dev, u32 rqn, u16 uid)
+{
+   u32 in[MLX5_ST_SZ_DW(destroy_rq_in)]   = {};
+   u32 out[MLX5_ST_SZ_DW(destroy_rq_out)] = {};
+
+   MLX5_SET(destroy_rq_in, in, opcode, MLX5_CMD_OP_DESTROY_RQ);
+   MLX5_SET(destroy_rq_in, in, rqn, rqn);
+   MLX5_SET(destroy_rq_in, in, uid, uid);
+   mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
 int mlx5_core_create_rq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen,
struct mlx5_core_qp *rq)
 {
@@ -550,6 +561,7 @@ int mlx5_core_create_rq_tracked(struct mlx5_core_dev *dev, 
u32 *in, int inlen,
if (err)
return err;

+   rq->uid = MLX5_GET(create_rq_in, in, uid);
rq->qpn = rqn;
err = create_resource_common(dev, rq, MLX5_RES_RQ);
if (err)
@@ -558,7 +570,7 @@ int mlx5_core_create_rq_tracked(struct mlx5_core_dev *dev, 
u32 *in, int inlen,
return 0;

 err_destroy_rq:
-   mlx5_core_destroy_rq(dev, rq->qpn);
+   destroy_rq_tracked(dev, rq->qpn, rq->uid);

return err;
 }
@@ -568,7 +580,7 @@ void mlx5_core_destroy_rq_tracked(struct mlx5_core_dev *dev,
  struct mlx5_core_qp *rq)
 {
destroy_resource_common(dev, rq);
-   mlx5_core_destroy_rq(dev, rq->qpn);
+   destroy_rq_tracked(dev, rq->qpn, rq->uid);
 }
 EXPORT_SYMBOL(mlx5_core_destroy_rq_tracked);

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index e5a0d3ecfaad..01b707666fb4 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -5489,7 +5489,7 @@ enum {

 struct mlx5_ifc_modify_rq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6165,7 +6165,7 @@ struct mlx5_ifc_destroy_rq_out_bits {

 struct mlx5_ifc_destroy_rq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6848,7 +6848,7 @@ struct mlx5_ifc_create_rq_out_bits {

 struct mlx5_ifc_create_rq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
--
2.14.4



[PATCH mlx5-next v1 4/7] net/mlx5: Set uid as part of SQ commands

2018-09-20 Thread Leon Romanovsky
From: Yishai Hadas 

Set uid as part of SQ commands so that the firmware can manage the
SQ object in a secure way.

That will enable an SQ created by a verbs application to be used by
the DEVX flow when the uids match.

Signed-off-by: Yishai Hadas 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 16 ++--
 include/linux/mlx5/mlx5_ifc.h|  6 +++---
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index f57e08d4f970..d9b12136cbfd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -584,6 +584,17 @@ void mlx5_core_destroy_rq_tracked(struct mlx5_core_dev 
*dev,
 }
 EXPORT_SYMBOL(mlx5_core_destroy_rq_tracked);

+static void destroy_sq_tracked(struct mlx5_core_dev *dev, u32 sqn, u16 uid)
+{
+   u32 in[MLX5_ST_SZ_DW(destroy_sq_in)]   = {};
+   u32 out[MLX5_ST_SZ_DW(destroy_sq_out)] = {};
+
+   MLX5_SET(destroy_sq_in, in, opcode, MLX5_CMD_OP_DESTROY_SQ);
+   MLX5_SET(destroy_sq_in, in, sqn, sqn);
+   MLX5_SET(destroy_sq_in, in, uid, uid);
+   mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
 int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen,
struct mlx5_core_qp *sq)
 {
@@ -594,6 +605,7 @@ int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, 
u32 *in, int inlen,
if (err)
return err;

+   sq->uid = MLX5_GET(create_sq_in, in, uid);
sq->qpn = sqn;
err = create_resource_common(dev, sq, MLX5_RES_SQ);
if (err)
@@ -602,7 +614,7 @@ int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, 
u32 *in, int inlen,
return 0;

 err_destroy_sq:
-   mlx5_core_destroy_sq(dev, sq->qpn);
+   destroy_sq_tracked(dev, sq->qpn, sq->uid);

return err;
 }
@@ -612,7 +624,7 @@ void mlx5_core_destroy_sq_tracked(struct mlx5_core_dev *dev,
  struct mlx5_core_qp *sq)
 {
destroy_resource_common(dev, sq);
-   mlx5_core_destroy_sq(dev, sq->qpn);
+   destroy_sq_tracked(dev, sq->qpn, sq->uid);
 }
 EXPORT_SYMBOL(mlx5_core_destroy_sq_tracked);

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 01b707666fb4..8151488f6570 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -5382,7 +5382,7 @@ struct mlx5_ifc_modify_sq_out_bits {

 struct mlx5_ifc_modify_sq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6097,7 +6097,7 @@ struct mlx5_ifc_destroy_sq_out_bits {

 struct mlx5_ifc_destroy_sq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
@@ -6770,7 +6770,7 @@ struct mlx5_ifc_create_sq_out_bits {

 struct mlx5_ifc_create_sq_in_bits {
u8 opcode[0x10];
-   u8 reserved_at_10[0x10];
+   u8 uid[0x10];

u8 reserved_at_20[0x10];
u8 op_mod[0x10];
--
2.14.4



Re: [PATCH net] sctp: update dst pmtu with the correct daddr

2018-09-20 Thread David Miller
From: Xin Long 
Date: Thu, 20 Sep 2018 17:27:28 +0800

> When processing pmtu update from an icmp packet, it calls .update_pmtu
> with sk instead of skb in sctp_transport_update_pmtu.
> 
> However for sctp, the daddr in the transport might be different from
> inet_sock->inet_daddr or sk->sk_v6_daddr, which is used to update or
> create the route cache. The incorrect daddr will cause a different
> route cache created for the path.
> 
> So before calling .update_pmtu, inet_sock->inet_daddr/sk->sk_v6_daddr
> should be updated with the daddr in the transport, and update it back
> after it's done.
> 
> The issue has existed since route exceptions introduction.
> 
> Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
> Reported-by: ian.per...@dialogic.com
> Signed-off-by: Xin Long 

Applied and queued up for -stable.

Although are you sure it's OK to temporarily change the socket's address
like this?  What if an asynchronous context looks at the socket state
and sees the temporarily set address?

Thanks.


[PATCH net-next 0/3] bnx2x: enhancements

2018-09-20 Thread Shahed Shaikh
From: Shahed Shaikh 

Hi Dave,

This series adds the following changes:
- support for VF spoof-check configuration through .ndo_set_vf_spoofchk.
- workaround for an MFW bug regarding unexpected bandwidth notification
  in single function mode.
- supply VF link status as part of get VF config handling.

Please apply this to net-next tree.

Thanks,
Shahed

Shahed Shaikh (3):
  bnx2x: Add VF spoof-checking configuration
  bnx2x: Ignore bandwidth attention in single function mode
  bnx2x: Provide VF link status in ndo_get_vf_config

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |  1 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  | 11 +++
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 81 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h |  2 +
 4 files changed, 93 insertions(+), 2 deletions(-)

-- 
1.8.3.1



[PATCH net-next 1/3] bnx2x: Add VF spoof-checking configuration

2018-09-20 Thread Shahed Shaikh
Add support for `ndo_set_vf_spoofchk' to allow PF control over
its VF spoof-checking configuration.

Signed-off-by: Shahed Shaikh 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |  1 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  1 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 80 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h |  2 +
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index 0e508e5..142bc11 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -494,6 +494,7 @@ int bnx2x_get_vf_config(struct net_device *dev, int vf,
 int bnx2x_set_vf_mac(struct net_device *dev, int queue, u8 *mac);
 int bnx2x_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan, u8 qos,
  __be16 vlan_proto);
+int bnx2x_set_vf_spoofchk(struct net_device *dev, int idx, bool val);
 
 /* select_queue callback */
 u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb,
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 71362b7..faf64ba 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13121,6 +13121,7 @@ static int bnx2x_vlan_rx_kill_vid(struct net_device 
*dev, __be16 proto, u16 vid)
.ndo_set_vf_mac = bnx2x_set_vf_mac,
.ndo_set_vf_vlan= bnx2x_set_vf_vlan,
.ndo_get_vf_config  = bnx2x_get_vf_config,
+   .ndo_set_vf_spoofchk= bnx2x_set_vf_spoofchk,
 #endif
 #ifdef NETDEV_FCOE_WWNN
.ndo_fcoe_get_wwn   = bnx2x_fcoe_get_wwn,
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 62da465..2f76b37 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -209,7 +209,10 @@ void bnx2x_vfop_qctor_prep(struct bnx2x *bp,
 */
__set_bit(BNX2X_Q_FLG_TX_SWITCH, &setup_p->flags);
__set_bit(BNX2X_Q_FLG_TX_SEC, &setup_p->flags);
-   __set_bit(BNX2X_Q_FLG_ANTI_SPOOF, &setup_p->flags);
+   if (vf->spoofchk)
+   __set_bit(BNX2X_Q_FLG_ANTI_SPOOF, &setup_p->flags);
+   else
+   __clear_bit(BNX2X_Q_FLG_ANTI_SPOOF, &setup_p->flags);
 
/* Setup-op rx parameters */
if (test_bit(BNX2X_Q_TYPE_HAS_RX, &q_type)) {
@@ -1269,6 +1272,8 @@ int bnx2x_iov_init_one(struct bnx2x *bp, int 
int_mode_param,
bnx2x_vf(bp, i, state) = VF_FREE;
mutex_init(&bnx2x_vf(bp, i, op_mutex));
bnx2x_vf(bp, i, op_current) = CHANNEL_TLV_NONE;
+   /* enable spoofchk by default */
+   bnx2x_vf(bp, i, spoofchk) = 1;
}
 
/* re-read the IGU CAM for VFs - index and abs_vfid must be set */
@@ -2632,7 +2637,7 @@ int bnx2x_get_vf_config(struct net_device *dev, int vfidx,
ivi->qos = 0;
ivi->max_tx_rate = 1; /* always 10G. TBA take from link struct */
ivi->min_tx_rate = 0;
-   ivi->spoofchk = 1; /*always enabled */
+   ivi->spoofchk = vf->spoofchk ? 1 : 0;
if (vf->state == VF_ENABLED) {
/* mac and vlan are in vlan_mac objects */
if (bnx2x_validate_vf_sp_objs(bp, vf, false)) {
@@ -2950,6 +2955,77 @@ int bnx2x_set_vf_vlan(struct net_device *dev, int vfidx, 
u16 vlan, u8 qos,
return rc;
 }
 
+int bnx2x_set_vf_spoofchk(struct net_device *dev, int idx, bool val)
+{
+   struct bnx2x *bp = netdev_priv(dev);
+   struct bnx2x_virtf *vf;
+   int i, rc = 0;
+
+   vf = BP_VF(bp, idx);
+   if (!vf)
+   return -EINVAL;
+
+   /* nothing to do */
+   if (vf->spoofchk == val)
+   return 0;
+
+   vf->spoofchk = val ? 1 : 0;
+
+   DP(BNX2X_MSG_IOV, "%s spoofchk for VF %d\n",
+  val ? "enabling" : "disabling", idx);
+
+   /* is vf initialized and queue set up? */
+   if (vf->state != VF_ENABLED ||
bnx2x_get_q_logical_state(bp, &bnx2x_leading_vfq(vf, sp_obj)) !=
+   BNX2X_Q_LOGICAL_STATE_ACTIVE)
+   return rc;
+
+   /* User should be able to see error in system logs */
+   if (!bnx2x_validate_vf_sp_objs(bp, vf, true))
+   return -EINVAL;
+
+   /* send queue update ramrods to configure spoofchk */
+   for_each_vfq(vf, i) {
+   struct bnx2x_queue_state_params q_params = {NULL};
+   struct bnx2x_queue_update_params *update_params;
+
q_params.q_obj = &bnx2x_vfq(vf, i, sp_obj);
+
+   /* validate the Q is UP */
+   if (bnx2x_get_q_logical_state(bp, q_params.q_obj) !=
+   BNX2X_Q_LOGICAL_STATE_ACTIVE)
+   continue;
+
+   __set_bit(RAMROD_COMP_WAIT, &q_params.ramrod_flags);

[PATCH net-next 2/3] bnx2x: Ignore bandwidth attention in single function mode

2018-09-20 Thread Shahed Shaikh
From: Shahed Shaikh 

This is a workaround for a FW bug -
MFW generates a bandwidth attention in single function mode, which
is only expected to be generated in multi function mode.
This undesired attention in SF mode results in incorrect HW
configuration and, consequently, a Tx timeout.

Signed-off-by: Shahed Shaikh 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index faf64ba..16f64c6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -3536,6 +3536,16 @@ static void bnx2x_drv_info_iscsi_stat(struct bnx2x *bp)
  */
 static void bnx2x_config_mf_bw(struct bnx2x *bp)
 {
+   /* Workaround for MFW bug.
+* MFW is not supposed to generate BW attention in
+* single function mode.
+*/
+   if (!IS_MF(bp)) {
+   DP(BNX2X_MSG_MCP,
+  "Ignoring MF BW config in single function mode\n");
+   return;
+   }
+
if (bp->link_vars.link_up) {
bnx2x_cmng_fns_init(bp, true, CMNG_FNS_MINMAX);
bnx2x_link_sync_notify(bp);
-- 
1.8.3.1



[PATCH net-next 3/3] bnx2x: Provide VF link status in ndo_get_vf_config

2018-09-20 Thread Shahed Shaikh
From: Shahed Shaikh 

Provide current link status of VF in ndo_get_vf_config
handler.

Signed-off-by: Shahed Shaikh 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 2f76b37..c835f6c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -2638,6 +2638,7 @@ int bnx2x_get_vf_config(struct net_device *dev, int vfidx,
ivi->max_tx_rate = 1; /* always 10G. TBA take from link struct */
ivi->min_tx_rate = 0;
ivi->spoofchk = vf->spoofchk ? 1 : 0;
+   ivi->linkstate = vf->link_cfg;
if (vf->state == VF_ENABLED) {
/* mac and vlan are in vlan_mac objects */
if (bnx2x_validate_vf_sp_objs(bp, vf, false)) {
-- 
1.8.3.1



Re: [PATCH net-next] net/tls: Add support for async encryption of records for performance

2018-09-20 Thread David Miller
From: Vakul Garg 
Date: Wed, 19 Sep 2018 20:51:35 +0530

> This patch enables encryption of multiple records in parallel when an
> async capable crypto accelerator is present in system.

This seems to be trading off zero copy with async support.

Async crypto device support is not the common case at all, and
synchronous crypto via cpu crypto acceleration instructions is
so much more likely.

Oh I see, the new logic is only triggered when ASYNC_CAPABLE is
set?
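For context, that gating works off the transform's algorithm flags. A minimal user-space sketch follows; the CRYPTO_ALG_ASYNC value is the real one from include/linux/crypto.h, but the helper name is invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* CRYPTO_ALG_ASYNC is the kernel's flag bit for ciphers that complete
 * requests asynchronously (include/linux/crypto.h). */
#define CRYPTO_ALG_ASYNC 0x00000080

/* Hypothetical helper: the parallel-record path is taken only when the
 * cipher implementation advertises asynchronous completion. */
static bool tfm_is_async(unsigned int cra_flags)
{
	return (cra_flags & CRYPTO_ALG_ASYNC) != 0;
}
```

With a purely synchronous CPU cipher the flag is clear and the old serial path is used unchanged.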

> +static inline  bool is_tx_ready(struct tls_context *tls_ctx,
> + struct tls_sw_context_tx *ctx)
> +{

Two space between "inline" and "bool", please make it one.

>  static void tls_write_space(struct sock *sk)
>  {
>   struct tls_context *ctx = tls_get_ctx(sk);
> + struct tls_sw_context_tx *tx_ctx = tls_sw_ctx_tx(ctx);

Longest to shortest line (reverse christmas tree) ordering for
local variable declarations please.
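For readers unfamiliar with the convention, a reverse-christmas-tree declaration block looks like this (a compilable user-space illustration, not code from the patch; all names are invented):

```c
#include <assert.h>
#include <string.h>

static int count_bytes(const char *message)
{
	/* netdev style: declarations ordered longest line first,
	 * shortest last ("reverse christmas tree"). */
	unsigned long accumulated_total = 0;
	const char *cursor = message;
	size_t len;
	int rc;

	len = strlen(cursor);
	accumulated_total += len;
	rc = (int)accumulated_total;
	return rc;
}
```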
>  
> + list_for_each_prev(pos, &ctx->tx_ready_list) {
> + struct tls_rec *rec = (struct tls_rec *)pos;
> + u64 seq = be64_to_cpup((const __be64 *)&rec->aad_space);

Likewise.

> -static int tls_do_encryption(struct tls_context *tls_ctx,
> +int tls_tx_records(struct sock *sk, int flags)
> +{
> + struct tls_rec *rec, *tmp;
> + struct tls_context *tls_ctx = tls_get_ctx(sk);
> + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> + int rc = 0;
> + int tx_flags;

Likewise.

> +static void tls_encrypt_done(struct crypto_async_request *req, int err)
> +{
> + struct aead_request *aead_req = (struct aead_request *)req;
> + struct sock *sk = req->data;
> + struct tls_context *tls_ctx = tls_get_ctx(sk);
> + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> + struct tls_rec *rec;
> + int pending;
> + bool ready = false;

Likewise.

> +static int tls_do_encryption(struct sock *sk,
> +  struct tls_context *tls_ctx,
>struct tls_sw_context_tx *ctx,
>struct aead_request *aead_req,
>size_t data_len)
>  {
>   int rc;
> + struct tls_rec *rec = ctx->open_rec;

Likewise.

> @@ -473,11 +630,12 @@ static int memcopy_from_iter(struct sock *sk, struct iov_iter *from,
>  {
>   struct tls_context *tls_ctx = tls_get_ctx(sk);
>   struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> - struct scatterlist *sg = ctx->sg_plaintext_data;
> + struct tls_rec *rec = ctx->open_rec;
> + struct scatterlist *sg = rec->sg_plaintext_data;
>   int copy, i, rc = 0;

Likewise.

> +struct tls_rec *get_rec(struct sock *sk)
> +{
> + int mem_size;
> + struct tls_context *tls_ctx = tls_get_ctx(sk);
> + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
> + struct tls_rec *rec;

Likewise.

> @@ -510,21 +707,33 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>   int record_room;
>   bool full_record;
>   int orig_size;
> + struct tls_rec *rec;
>   bool is_kvec = msg->msg_iter.type & ITER_KVEC;
> + struct crypto_tfm *tfm = crypto_aead_tfm(ctx->aead_send);
> + bool async_capable = tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC;
> + int num_async = 0;
> + int num_zc = 0;

Likewise.
> @@ -661,6 +904,8 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
>   struct scatterlist *sg;
>   bool full_record;
>   int record_room;
> + struct tls_rec *rec;
> + int num_async = 0;

Likewise.


Re: [PATCH bpf-next] samples/bpf: fix compilation failure

2018-09-20 Thread Song Liu
On Thu, Sep 20, 2018 at 12:53 AM Prashant Bhole
 wrote:
>
> following commit:
> commit d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
> added struct bpf_flow_keys, which conflicts with the struct of the
> same name in sockex2_kern.c and sockex3_kern.c
>
> similar to commit:
> commit 534e0e52bc23 ("samples/bpf: fix a compilation failure")
> we tried to rename it to "flow_keys", but that also conflicted with the
> struct of the same name in include/net/flow_dissector.h. Hence renaming
> the struct to "flow_key_record". Also, that commit doesn't fix the
> compilation error completely because a similar struct is present in
> sockex3_kern.c. Hence renaming it in both sockex3_user.c and
> sockex3_kern.c
>
> Signed-off-by: Prashant Bhole 

Acked-by: Song Liu 


> ---
>  samples/bpf/sockex2_kern.c | 11 ++-
>  samples/bpf/sockex3_kern.c |  8 
>  samples/bpf/sockex3_user.c |  4 ++--
>  3 files changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c
> index f58acfc92556..f2f9dbc021b0 100644
> --- a/samples/bpf/sockex2_kern.c
> +++ b/samples/bpf/sockex2_kern.c
> @@ -14,7 +14,7 @@ struct vlan_hdr {
> __be16 h_vlan_encapsulated_proto;
>  };
>
> -struct bpf_flow_keys {
> +struct flow_key_record {
> __be32 src;
> __be32 dst;
> union {
> @@ -59,7 +59,7 @@ static inline __u32 ipv6_addr_hash(struct __sk_buff *ctx, __u64 off)
>  }
>
>  static inline __u64 parse_ip(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
> -struct bpf_flow_keys *flow)
> +struct flow_key_record *flow)
>  {
> __u64 verlen;
>
> @@ -83,7 +83,7 @@ static inline __u64 parse_ip(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_proto
>  }
>
>  static inline __u64 parse_ipv6(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
> -  struct bpf_flow_keys *flow)
> +  struct flow_key_record *flow)
>  {
> *ip_proto = load_byte(skb,
>   nhoff + offsetof(struct ipv6hdr, nexthdr));
> @@ -96,7 +96,8 @@ static inline __u64 parse_ipv6(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_pro
> return nhoff;
>  }
>
> -static inline bool flow_dissector(struct __sk_buff *skb, struct bpf_flow_keys *flow)
> +static inline bool flow_dissector(struct __sk_buff *skb,
> + struct flow_key_record *flow)
>  {
> __u64 nhoff = ETH_HLEN;
> __u64 ip_proto;
> @@ -198,7 +199,7 @@ struct bpf_map_def SEC("maps") hash_map = {
>  SEC("socket2")
>  int bpf_prog2(struct __sk_buff *skb)
>  {
> -   struct bpf_flow_keys flow = {};
> +   struct flow_key_record flow = {};
> struct pair *value;
> u32 key;
>
> diff --git a/samples/bpf/sockex3_kern.c b/samples/bpf/sockex3_kern.c
> index 95907f8d2b17..c527b57d3ec8 100644
> --- a/samples/bpf/sockex3_kern.c
> +++ b/samples/bpf/sockex3_kern.c
> @@ -61,7 +61,7 @@ struct vlan_hdr {
> __be16 h_vlan_encapsulated_proto;
>  };
>
> -struct bpf_flow_keys {
> +struct flow_key_record {
> __be32 src;
> __be32 dst;
> union {
> @@ -88,7 +88,7 @@ static inline __u32 ipv6_addr_hash(struct __sk_buff *ctx, __u64 off)
>  }
>
>  struct globals {
> -   struct bpf_flow_keys flow;
> +   struct flow_key_record flow;
>  };
>
>  struct bpf_map_def SEC("maps") percpu_map = {
> @@ -114,14 +114,14 @@ struct pair {
>
>  struct bpf_map_def SEC("maps") hash_map = {
> .type = BPF_MAP_TYPE_HASH,
> -   .key_size = sizeof(struct bpf_flow_keys),
> +   .key_size = sizeof(struct flow_key_record),
> .value_size = sizeof(struct pair),
> .max_entries = 1024,
>  };
>
>  static void update_stats(struct __sk_buff *skb, struct globals *g)
>  {
> -   struct bpf_flow_keys key = g->flow;
> +   struct flow_key_record key = g->flow;
> struct pair *value;
>
> value = bpf_map_lookup_elem(&hash_map, &key);
> diff --git a/samples/bpf/sockex3_user.c b/samples/bpf/sockex3_user.c
> index 22f74d0e1493..9d02e0404719 100644
> --- a/samples/bpf/sockex3_user.c
> +++ b/samples/bpf/sockex3_user.c
> @@ -13,7 +13,7 @@
>  #define PARSE_IP_PROG_FD (prog_fd[0])
>  #define PROG_ARRAY_FD (map_fd[0])
>
> -struct flow_keys {
> +struct flow_key_record {
> __be32 src;
> __be32 dst;
> union {
> @@ -64,7 +64,7 @@ int main(int argc, char **argv)
> (void) f;
>
> for (i = 0; i < 5; i++) {
> -   struct flow_keys key = {}, next_key;
> +   struct flow_key_record key = {}, next_key;
> struct pair value;
>
> sleep(1);
> --
> 2.17.1
>
>


Re: [RFC 4/5] netlink: prepare validate extack setting for recursion

2018-09-20 Thread Marcelo Ricardo Leitner
On Thu, Sep 20, 2018 at 10:14:10AM +0200, Johannes Berg wrote:
> Anyway - we got into this discussion because of all the extra recursion
> stuff I was adding. With the change suggested by David we don't need
> that now at all, so I guess it'd be better to propose a patch if you (or
> perhaps I will see a need later) need such a facility for multiple
> messages or multiple message levels?

Yep! 

Thanks,
  Marcelo


Re: [Patch net-next] ipv4: initialize ra_mutex in inet_init_net()

2018-09-20 Thread Cong Wang
On Thu, Sep 20, 2018 at 2:04 AM Kirill Tkhai  wrote:
>
> On 20.09.2018 0:28, Cong Wang wrote:
> You added me to CC, so you probably want to know my opinion about this.
> Since it's not a real problem fix, but just a refactoring, I gave you
> my opinion on how this refactoring could be made better. If you don't
> want to know my opinion, you may consider not CCing me.

Sure, with respect to your opinions, I prefer to drop it happily.

Thanks.


Re: [PATCH 2/3] net: Register af_inet_ops earlier

2018-09-20 Thread Cong Wang
On Thu, Sep 20, 2018 at 2:12 AM Kirill Tkhai  wrote:
>
> This function just initializes locks and defaults.
> Let's register it before other pernet operations,
> since some of them may potentially rely on that.
>
> Signed-off-by: Kirill Tkhai 

It adds no benefit but does add potential risk in error-path
ordering. It is never too late to bring this up again when a
future change actually needs it; until then:
Nacked-by: Cong Wang 


Re: [PATCH 3/3] ipv4: initialize ra_mutex in inet_init_net()

2018-09-20 Thread Cong Wang
On Thu, Sep 20, 2018 at 2:12 AM Kirill Tkhai  wrote:
>
> From: Cong Wang 
>
> ra_mutex is a IPv4 specific mutex, it is inside struct netns_ipv4,
> but its initialization is in the generic netns code, setup_net().
>
> Move it to IPv4 specific net init code, inet_init_net().
>
> Fixes: d9ff3049739e ("net: Replace ip_ra_lock with per-net mutex")
> Signed-off-by: Cong Wang 
> Acked-by: Kirill Tkhai 

I regret for wasting my time on this, so:

Nacked-by: Cong Wang 

Let's just leave the current code as it is.


Re: [PATCH net-next 4/5] ipv6: do not drop vrf udp multicast packets

2018-09-20 Thread Mike Manning
On 20/09/2018 14:02, Paolo Abeni wrote:
> Hi,
>
> On Thu, 2018-09-20 at 09:58 +0100, Mike Manning wrote:
>> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
>> index 108f5f88ec98..fc60f297d95b 100644
>> --- a/net/ipv6/ip6_input.c
>> +++ b/net/ipv6/ip6_input.c
>> @@ -325,9 +325,12 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
>>  {
>>  const struct inet6_protocol *ipprot;
>>  struct inet6_dev *idev;
>> +struct net_device *dev;
>>  unsigned int nhoff;
>> +int sdif = inet6_sdif(skb);
>>  int nexthdr;
>>  bool raw;
>> +bool deliver;
>>  bool have_final = false;
> Please, try instead to sort the variables in reverse x-mas tree order.
Will do.
>>  
>>  /*
>> @@ -371,9 +374,27 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
>>  skb_postpull_rcsum(skb, skb_network_header(skb),
>> skb_network_header_len(skb));
>>  hdr = ipv6_hdr(skb);
>> -if (ipv6_addr_is_multicast(&hdr->daddr) &&
>> -!ipv6_chk_mcast_addr(skb->dev, &hdr->daddr,
>> -&hdr->saddr) &&
>> +
>> +/* skb->dev passed may be master dev for vrfs. */
>> +if (sdif) {
>> +rcu_read_lock();
> AFAICS, the rcu lock is already acquired at the beginning of
> ip6_input_finish(), not need to acquire it here again.
Nice catch, I will remove this.
>> +dev = dev_get_by_index_rcu(dev_net(skb->dev),
>> +   sdif);
>> +if (!dev) {
>> +rcu_read_unlock();
>> +kfree_skb(skb);
>> +return -ENODEV;
>> +}
>> +} else {
>> +dev = skb->dev;
> The above fragment of code is a recurring pattern in this series,
> perhaps adding an helper for it would reduce code duplication ?

This pattern of checking the secondary device index is used only twice, both in 
this file.

But now that one instance has the RCU lock handling and the other does
not, I cannot refactor this.
>
> Cheers,
>
> Paolo
>
Thanks for the review! I will wait for further comments before producing a v1 
of the series.

Regards, Mike




Re: [PATCH v2 0/2] hv_netvsc: associate VF and PV device by serial number

2018-09-20 Thread Stephen Hemminger
On Thu, 20 Sep 2018 15:18:20 +0100
Lorenzo Pieralisi  wrote:

> On Fri, Sep 14, 2018 at 12:54:55PM -0700, Stephen Hemminger wrote:
> > The Hyper-V implementation of PCI controller has concept of 32 bit serial 
> > number
> > (not to be confused with PCI-E serial number).  This value is sent in the 
> > protocol
> > from the host to indicate SR-IOV VF device is attached to a synthetic NIC.
> > 
> > Using the serial number (instead of MAC address) to associate the two 
> > devices
> > avoids lots of potential problems when there are duplicate MAC addresses 
> > from
> > tunnels or layered devices.
> > 
> > The patch set is broken into two parts, one is for the PCI controller
> > and the other is for the netvsc device. Normally, these go through different
> > trees but sending them together here for better review. The PCI changes
> > were submitted previously, but the main review comment was "why do you
> > need this?". This is why.  
> 
> The question was more whether we should convert this serial number into
> a PCI slot number (that has user space visibility and is what you are
> after) to improve the current matching. I do not question why you need
> it, just for the record.

The name "slot" is way overloaded in this context. There is:
  - the Windows slot number, which comes from Hyper-V
  - the PCI address slot, which pci-hyperv sets from the Windows slot
  - the PCI slot API value, which for normal devices comes from ACPI;
    this patch gets it from the serial number instead


The netvsc driver needs to be able to find a PCI device based on the
serial number. The serial number is not visible in any current
pci-hyperv controller values. The Windows slot (wslot) is not the same
as the serial number.


Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name

2018-09-20 Thread Jason Gunthorpe
On Thu, Sep 20, 2018 at 07:40:39PM +0300, Leon Romanovsky wrote:

> > > The protection is done with global device_lock because it is used in
> > > allocation and deallocation phases. At this stage, this lock is not
> > > busy and easily can be moved to be per-device, once it will be needed.
> > >
> > > Signed-off-by: Leon Romanovsky 
> > >  drivers/infiniband/core/device.c | 24 +++-
> > >  include/rdma/ib_verbs.h  |  8 +++-
> > >  2 files changed, 30 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > > index 5a680a88aa87..3270cde6d806 100644
> > > +++ b/drivers/infiniband/core/device.c
> > > @@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const char *name)
> > >   return NULL;
> > >  }
> > >
> > > +void ib_device_get_name(struct ib_device *ibdev, char *name)
> > > +{
> > > + down_read(&lists_rwsem);
> > > + strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX);
> > > + up_read(&lists_rwsem);
> > > +}
> > > +EXPORT_SYMBOL(ib_device_get_name);
> >
> > I think we have to follow netdev and just rely on device_rename()
> > being 'good enough'.
> >
> > Switch everything to use dev_name()/etc rather than try and do
> > something like this so the responsibility is on the device core to
> > keep this working, not us.
> >
> > Turns out I have a series for that for unrelated reasons..
> 
> And what should I do now with this knowledge?

Maybe I can send it today..

Jason


Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name

2018-09-20 Thread Leon Romanovsky
On Thu, Sep 20, 2018 at 09:15:41AM -0600, Jason Gunthorpe wrote:
> On Thu, Sep 20, 2018 at 02:21:58PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> >
> > Prepare IB device name field to rename operation by ensuring that all
> > accesses to it are protected with lock and users don't see part of name.
>
> Oh dear, no, that isn't going to work, there is too much stuff using
> dev_name.. Did you read the comment on device_rename??
>
> https://elixir.bootlin.com/linux/v4.19-rc4/source/drivers/base/core.c#L2715

Yes, I read, it was mentioned in the cover letter.

>
> > The protection is done with global device_lock because it is used in
> > allocation and deallocation phases. At this stage, this lock is not
> > busy and easily can be moved to be per-device, once it will be needed.
> >
> > Signed-off-by: Leon Romanovsky 
> >  drivers/infiniband/core/device.c | 24 +++-
> >  include/rdma/ib_verbs.h  |  8 +++-
> >  2 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > index 5a680a88aa87..3270cde6d806 100644
> > +++ b/drivers/infiniband/core/device.c
> > @@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const char *name)
> > return NULL;
> >  }
> >
> > +void ib_device_get_name(struct ib_device *ibdev, char *name)
> > +{
> > +   down_read(&lists_rwsem);
> > +   strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX);
> > +   up_read(&lists_rwsem);
> > +}
> > +EXPORT_SYMBOL(ib_device_get_name);
>
> I think we have to follow netdev and just rely on device_rename()
> being 'good enough'.
>
> Switch everything to use dev_name()/etc rather than try and do
> something like this so the responsibility is on the device core to
> keep this working, not us.
>
> Turns out I have a series for that for unrelated reasons..

And what should I do now with this knowledge?

>
> >  static int alloc_name(char *name)
> >  {
> > unsigned long *inuse;
> > @@ -202,6 +210,21 @@ static int alloc_name(char *name)
> > return 0;
> >  }
> >
> > +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern)
> > +{
> > +   int ret = 0;
> > +
> > +   mutex_lock(&device_mutex);
> > +   strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX);
> > +   if (strchr(ibdev->name, '%'))
> > +   ret = alloc_name(ibdev->name);
> > +
> > +   mutex_unlock(&device_mutex);
> > +
> > +   return ret;
> > +}
> > +EXPORT_SYMBOL(ib_device_alloc_name);
>
> Can't call alloc_name() without also adding to the list, this will
> allow duplicates.

I planned to change it in the future by moving to a different naming
scheme with unique names.

>
> Jason


signature.asc
Description: PGP signature


Re: [RFC bpf-next 4/4] tools/bpf: handle EOPNOTSUPP when map lookup is failed

2018-09-20 Thread Jakub Kicinski
On Thu, 20 Sep 2018 14:04:19 +0900, Prashant Bhole wrote:
> On 9/20/2018 12:29 AM, Jakub Kicinski wrote:
> > On Wed, 19 Sep 2018 16:51:43 +0900, Prashant Bhole wrote:  
> >> Let's add a check for EOPNOTSUPP error when map lookup is failed.
> >> Also in case map doesn't support lookup, the output of map dump is
> >> changed from "can't lookup element" to "lookup not supported for
> >> this map".
> >>
> >> Patch adds function print_entry_error() function to print the error
> >> value.
> >>
> >> Following example dumps a map which does not support lookup.
> >>
> >> Output before:
> >> root# bpftool map -jp dump id 40
> >> [
> >>  "key": ["0x0a","0x00","0x00","0x00"
> >>  ],
> >>  "value": {
> >>  "error": "can\'t lookup element"
> >>  },
> >>  "key": ["0x0b","0x00","0x00","0x00"
> >>  ],
> >>  "value": {
> >>  "error": "can\'t lookup element"
> >>  }
> >> ]
> >>
> >> root# bpftool map dump id 40
> >> can't lookup element with key:
> >> 0a 00 00 00
> >> can't lookup element with key:
> >> 0b 00 00 00
> >> Found 0 elements
> >>
> >> Output after changes:
> >> root# bpftool map dump -jp  id 45
> >> [
> >>  "key": ["0x0a","0x00","0x00","0x00"
> >>  ],
> >>  "value": {
> >>  "error": "lookup not supported for this map"
> >>  },
> >>  "key": ["0x0b","0x00","0x00","0x00"
> >>  ],
> >>  "value": {
> >>  "error": "lookup not supported for this map"
> >>  }
> >> ]
> >>
> >> root# bpftool map dump id 45
> >> key:
> >> 0a 00 00 00
> >> value:
> >> lookup not supported for this map
> >> key:
> >> 0b 00 00 00
> >> value:
> >> lookup not supported for this map
> >> Found 0 elements  
> > 
> > Nice improvement, thanks for the changes!  I wonder what your thoughts
> > would be on just printing some form of "lookup not supported for this
> > map" only once?  It seems slightly like repeated information - if
> > lookup is not supported for one key it likely won't be for other keys
> > too, so we could shorten the output.  Would that make sense?
> >   
> >> Signed-off-by: Prashant Bhole 
> >> ---
> >>   tools/bpf/bpftool/main.h |  5 +
> >>   tools/bpf/bpftool/map.c  | 35 ++-
> >>   2 files changed, 35 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
> >> index 40492cdc4e53..1a8c683f949b 100644
> >> --- a/tools/bpf/bpftool/main.h
> >> +++ b/tools/bpf/bpftool/main.h
> >> @@ -46,6 +46,11 @@
> >>   
> >>   #include "json_writer.h"
> >>   
> >> +#define ERR_CANNOT_LOOKUP \
> >> +  "can't lookup element"
> >> +#define ERR_LOOKUP_NOT_SUPPORTED \
> >> +  "lookup not supported for this map"  
> > 
> > Do we need these?  Are we going to reused them in more parts of the
> > code?  
> 
> These are used only once. They could also be used in do_lookup(). Currently
> do_lookup() prints strerror(errno) when lookup fails. Shall I change
> the do_lookup() output?

I actually prefer to stick to strerror(), the standard errors more
clearly correlate with what happened in my mind (i.e. "Operation not
supported" == kernel sent EOPNOTSUPP).   strerror() may also print in
local language if translation/localization matters.

We could even use strerror() in dump_map_elem() but up to you.  The one
in do_lookup() I'd prefer to leave be ;)
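As a side note, the strerror() behavior being discussed can be modeled with a small user-space sketch (the wrapper name is invented; bpftool itself simply calls strerror(errno) after the failed syscall):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical wrapper showing what sticking with strerror() buys:
 * the C library supplies the message for whatever errno the kernel
 * returned (e.g. EOPNOTSUPP), possibly localized, with no
 * tool-specific string tables to maintain. */
static const char *lookup_error_msg(int err)
{
	return strerror(err);
}
```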


Re: [PATCH] PCI: hv: Fix return value check in hv_pci_assign_slots()

2018-09-20 Thread Lorenzo Pieralisi
[+DaveM, netdev]

Removed erroneously added disclaimer, apologies.

On Thu, Sep 20, 2018 at 06:40:31AM +, Wei Yongjun wrote:
> In case of error, the function pci_create_slot() returns ERR_PTR() and
> never returns NULL. The NULL test in the return value check should be
> replaced with IS_ERR().
> 
> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/pci/controller/pci-hyperv.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index ee80e79..9ba4d12 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1484,8 +1484,10 @@ static void hv_pci_assign_slots(struct hv_pcibus_device *hbus)
>   snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser);
>   hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr,
> name, NULL);
> - if (!hpdev->pci_slot)
> + if (IS_ERR(hpdev->pci_slot)) {
>   pr_warn("pci_create slot %s failed\n", name);
> + hpdev->pci_slot = NULL;
> + }
>   }
>  }
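For context, the ERR_PTR convention the fix relies on can be modeled in plain user-space C (a sketch; the kernel's real macros live in include/linux/err.h):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Errors travel as pointers in the top 4095 addresses, so an error
 * pointer compares unequal to NULL -- a plain if (!ptr) check silently
 * misses it, which is exactly the bug this patch fixes. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long error)
{
	return (void *)error;
}

static int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```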

I am dropping this patch from the PCI queue since the original series
is now queued in net-next.

Lorenzo



Re: [PATCH] PCI: hv: Fix return value check in hv_pci_assign_slots()

2018-09-20 Thread Lorenzo Pieralisi
[+DaveM, netdev]

On Thu, Sep 20, 2018 at 06:40:31AM +, Wei Yongjun wrote:
> In case of error, the function pci_create_slot() returns ERR_PTR() and
> never returns NULL. The NULL test in the return value check should be
> replaced with IS_ERR().
>
> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/pci/controller/pci-hyperv.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index ee80e79..9ba4d12 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1484,8 +1484,10 @@ static void hv_pci_assign_slots(struct hv_pcibus_device *hbus)
>   snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser);
>   hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr,
> name, NULL);
> - if (!hpdev->pci_slot)
> + if (IS_ERR(hpdev->pci_slot)) {
>   pr_warn("pci_create slot %s failed\n", name);
> + hpdev->pci_slot = NULL;
> + }
>   }
>  }
>

FYI

I am dropping this patch from the PCI queue since the original series
is now queued in net-next.

Lorenzo
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name

2018-09-20 Thread Jason Gunthorpe
On Thu, Sep 20, 2018 at 02:21:58PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Prepare IB device name field to rename operation by ensuring that all
> accesses to it are protected with lock and users don't see part of name.

Oh dear, no, that isn't going to work, there is too much stuff using
dev_name.. Did you read the comment on device_rename??

https://elixir.bootlin.com/linux/v4.19-rc4/source/drivers/base/core.c#L2715

> The protection is done with global device_lock because it is used in
> allocation and deallocation phases. At this stage, this lock is not
> busy and easily can be moved to be per-device, once it will be needed.
> 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/core/device.c | 24 +++-
>  include/rdma/ib_verbs.h  |  8 +++-
>  2 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 5a680a88aa87..3270cde6d806 100644
> +++ b/drivers/infiniband/core/device.c
> @@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const char *name)
>   return NULL;
>  }
> 
> +void ib_device_get_name(struct ib_device *ibdev, char *name)
> +{
> + down_read(&lists_rwsem);
> + strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX);
> + up_read(&lists_rwsem);
> +}
> +EXPORT_SYMBOL(ib_device_get_name);

I think we have to follow netdev and just rely on device_rename()
being 'good enough'.  

Switch everything to use dev_name()/etc rather than try and do
something like this so the responsibility is on the device core to
keep this working, not us.

Turns out I have a series for that for unrelated reasons..

>  static int alloc_name(char *name)
>  {
>   unsigned long *inuse;
> @@ -202,6 +210,21 @@ static int alloc_name(char *name)
>   return 0;
>  }
> 
> +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern)
> +{
> + int ret = 0;
> +
> + mutex_lock(&device_mutex);
> + strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX);
> + if (strchr(ibdev->name, '%'))
> + ret = alloc_name(ibdev->name);
> +
> + mutex_unlock(&device_mutex);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(ib_device_alloc_name);

Can't call alloc_name() without also adding to the list, this will
allow duplicates.

Jason


RE: [PATCH v2 1/5] netlink: remove NLA_NESTED_COMPAT

2018-09-20 Thread David Laight
From: Johannes Berg
> Sent: 19 September 2018 20:49
> This isn't used anywhere, so we might as well get rid of it.
> 
> Reviewed-by: David Ahern 
> Signed-off-by: Johannes Berg 
> ---
>  include/net/netlink.h |  2 --
>  lib/nlattr.c  | 11 ---
>  2 files changed, 13 deletions(-)
> 
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 318b1ded3833..b680fe365e91 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -172,7 +172,6 @@ enum {
>   NLA_FLAG,
>   NLA_MSECS,
>   NLA_NESTED,
> - NLA_NESTED_COMPAT,
>   NLA_NUL_STRING,
>   NLA_BINARY,
>   NLA_S8,
...

Is it safe to remove an item from this enum?

David
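
The concern is concrete: deleting a middle member silently renumbers every later member. A hypothetical pair of enums (names invented, not the real NLA_* values) shows the effect; removal is generally safe only when the enum is kernel-internal rather than part of a stable ABI, as the policy types in include/net/netlink.h are:

```c
#include <assert.h>

/* Same list before and after dropping the fourth member: every member
 * after the removed one shifts down by one, breaking any persisted
 * data or out-of-tree consumer that relied on the old numeric values. */
enum with_member    { V1_FLAG, V1_MSECS, V1_NESTED, V1_NESTED_COMPAT, V1_NUL_STRING };
enum without_member { V2_FLAG, V2_MSECS, V2_NESTED,                   V2_NUL_STRING };
```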




Re: [PATCH RFT net-next 0/2] net: phy: Eliminate unnecessary soft

2018-09-20 Thread Andrew Lunn
On Tue, Sep 18, 2018 at 06:35:03PM -0700, Florian Fainelli wrote:
> Hi all,
> 
> This patch series eliminates unnecessary software resets of the PHY.
> This should hopefully not break anybody's hardware; but I would
> appreciate testing to make sure this is is the case.
> 
> Sorry for this long email list, I wanted to make sure I reached out to
> all people who made changes to the Marvell PHY driver.

Hi Florian

I want to test this, but I'm battling other issues at the moment and
have not got around to it.

But I see it has been tested by two people. So I think we should merge
it, and see if anybody screams.

  Andrew


Re: [PATCH] smc: generic netlink family should be __ro_after_init

2018-09-20 Thread David Miller
From: Johannes Berg 
Date: Thu, 20 Sep 2018 09:27:30 +0200

> From: Johannes Berg 
> 
> The generic netlink family is only initialized during module init,
> so it should be __ro_after_init like all other generic netlink
> families.
> 
> Signed-off-by: Johannes Berg 

Looks good, applied.


Re: [PATCH net] mlxsw: spectrum: Bump required firmware version

2018-09-20 Thread David Miller
From: Ido Schimmel 
Date: Thu, 20 Sep 2018 09:31:45 +0300

> From: Petr Machata 
> 
> MC-aware mode was introduced to mlxsw in commit 7b8195306694 ("mlxsw:
> spectrum: Configure MC-aware mode on mlxsw ports") and fixed up later
> in commit
> 3a3539cd3632 ("mlxsw: spectrum_buffers: Set up a dedicated pool for BUM
> traffic"). As the final piece of puzzle, a firmware issue whereby a wrong
> priority was assigned to BUM traffic was corrected in FW version 13.1703.4.
> Therefore require this FW version in the driver.
> 
> Fixes: 7b8195306694 ("mlxsw: spectrum: Configure MC-aware mode on mlxsw ports")
> Signed-off-by: Petr Machata 
> Reviewed-by: Jiri Pirko 
> Signed-off-by: Ido Schimmel 

Applied.


Re: [PATCH net-next 00/13] mlxsw: Further MC-awareness configuration

2018-09-20 Thread David Miller
From: Ido Schimmel 
Date: Thu, 20 Sep 2018 09:21:23 +0300

> Petr says:
> 
> Due to an issue in Spectrum chips, when unicast traffic shares the same
> queue as BUM traffic, and there is congestion, the BUM traffic is
> admitted to the queue anyway, thus pushing out all UC traffic. In order
> to give unicast traffic precedence over BUM traffic, multicast-aware
> mode is now configured on all ports. Under MC-aware mode, egress TCs
> 8..15 are used for BUM traffic, which has its own dedicated pool.
> 
> This patch set improves the way that the MC pool and the higher-order
> TCs are integrated into the system.

Series applied, thanks.

> Then in patch #13 the selftest itself is added.

Just wanted to say I'm really happy with the selftests that exist
for all of the problems that have been fixed recently in mlxsw.



Re: [PATCH bpf-next 2/2] xsk: fix bug when trying to use both copy and zero-copy on one queue id

2018-09-20 Thread Jakub Kicinski
On Thu, 20 Sep 2018 11:17:17 +0200, Magnus Karlsson wrote:
> On Wed, Sep 19, 2018 at 9:18 AM Magnus Karlsson wrote:
> > On Wed, Sep 19, 2018 at 3:58 AM Jakub Kicinski wrote:  
> > >
> > > On Tue, 18 Sep 2018 10:22:11 -0700, Y Song wrote:  
> > > > > +/* The umem is stored both in the _rx struct and the _tx struct as 
> > > > > we do
> > > > > + * not know if the device has more tx queues than rx, or the 
> > > > > opposite.
> > > > > + * This might also change during run time.
> > > > > + */
> > > > > +static void xdp_reg_umem_at_qid(struct net_device *dev, struct 
> > > > > xdp_umem *umem,
> > > > > +   u16 queue_id)
> > > > > +{
> > > > > +   if (queue_id < dev->real_num_rx_queues)
> > > > > +   dev->_rx[queue_id].umem = umem;
> > > > > +   if (queue_id < dev->real_num_tx_queues)
> > > > > +   dev->_tx[queue_id].umem = umem;
> > > > > +}
> > > > > +
> > > > > +static struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
> > > > > + u16 queue_id)
> > > > > +{
> > > > > +   if (queue_id < dev->real_num_rx_queues)
> > > > > +   return dev->_rx[queue_id].umem;
> > > > > +   if (queue_id < dev->real_num_tx_queues)
> > > > > +   return dev->_tx[queue_id].umem;
> > > > > +
> > > > > +   return NULL;
> > > > > +}
> > > > > +
> > > > > +static void xdp_clear_umem_at_qid(struct net_device *dev, u16 
> > > > > queue_id)
> > > > > +{
> > > > > +   /* Zero out the entry independent on how many queues are 
> > > > > configured
> > > > > +* at this point in time, as it might be used in the future.
> > > > > +*/
> > > > > +   if (queue_id < dev->num_rx_queues)
> > > > > +   dev->_rx[queue_id].umem = NULL;
> > > > > +   if (queue_id < dev->num_tx_queues)
> > > > > +   dev->_tx[queue_id].umem = NULL;
> > > > > +}
> > > > > +  
> > > >
> > > > I am not sure whether the following scenario can happen or not.
> > > > Could you clarify?
> > > >1. suppose initially we have num_rx_queues = num_tx_queues = 10
> > > >xdp_reg_umem_at_qid() set umem1 to queue_id = 8
> > > >2. num_tx_queues is changed to 5
> > > >3. xdp_clear_umem_at_qid() is called for queue_id = 8,
> > > >and dev->_rx[8].umem = 0.
> > > >4. xdp_reg_umem_at_qid() is called again to set for queue_id = 8
> > > >dev->_rx[8].umem = umem2
> > > >5. num_tx_queues is changed to 10
> > > >   Now dev->_rx[8].umem != dev->_tx[8].umem, is this possible and is it
> > > > a problem?  
> > >
> > > Plus IIRC the check of qid vs real_num_[rt]x_queues in xsk_bind() is
> > > not under rtnl_lock so it doesn't count for much.  Why not do all the
> > > checks against num_[rt]x_queues here, instead of real_..?  
> >
> > You are correct, two separate rtnl_lock regions is broken. Will spin a
> > v2 tomorrow when I am back in the office.
> >
> > Thanks Jakub for catching this. I really appreciate you reviewing my code.  
> 
> Sorry, forgot to answer your question about why real_num_ instead of
> num_. If we used num_ instead, we would get a behavior where we can
> open a socket on a queue id, let us say 10, that will not be active if
> real_num_ is below 10. So no traffic will flow on this socket until we
> issue an ethtool command to state that we have 10 or more queues
> configured. While this behavior is sane (and consistent), I believe it
> will lead to a more complicated driver implementation, break the
> current uapi (since we will return an error if you try to bind to a
> queue id that is not active at the moment) and I like my suggestion
> below better :-).
> 
> What I would like to suggest is that the current model at bind() is
> kept, that is it will fail if you try to bind to a non-active queue
> id. But we add some code in ethtool_set_channels in net/core/ethtool.c
> to check if you have an active af_xdp socket on a queue id and try to
> make it inactive, we return an error and disallow the change. This
> would IMHO be like not allowing a loaded module to be unloaded if
> someone is using it or not allowing unmounting a file system if files
> are in use. What do you think? If you think this is the right way to
> go, I can implement this in a follow up patch or in this patch set if
> that makes more sense, let me know. This suggestion would actually
> make it simpler to implement zero-copy support in the driver. Some of
> the complications we are having today is due to the fact that ethtool
> can come in from the side and change things for us when an AF_XDP
> socket is active on a queue id. And implementing zero copy support in
> the driver needs to get simpler IMO.
> 
> Please let me know what you think.

I think I'd like that:
https://patchwork.ozlabs.org/project/netdev/list/?series=57843&state=*
in particular:
https://patchwork.ozlabs.org/patch/949891/
;)

If we fix the resize indeed you could stick to real_* but to be honest,
I'm not entirely clear 

Re: [PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name

2018-09-20 Thread Steve Wise



On 9/20/2018 6:21 AM, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Prepare IB device name field to rename operation by ensuring that all
> accesses to it are protected with lock and users don't see part of name.
> 
> The protection is done with global device_lock because it is used in
> allocation and deallocation phases. At this stage, this lock is not
> busy and easily can be moved to be per-device, once it will be needed.
> 
> Signed-off-by: Leon Romanovsky 
> ---
>  drivers/infiniband/core/device.c | 24 +++-
>  include/rdma/ib_verbs.h  |  8 +++-
>  2 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index 5a680a88aa87..3270cde6d806 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const 
> char *name)
>   return NULL;
>  }
> 
> +void ib_device_get_name(struct ib_device *ibdev, char *name)
> +{
> + down_read(&lists_rwsem);
> + strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX);
> + up_read(&lists_rwsem);
> +}
> +EXPORT_SYMBOL(ib_device_get_name);
> +
>  static int alloc_name(char *name)
>  {
>   unsigned long *inuse;
> @@ -202,6 +210,21 @@ static int alloc_name(char *name)
>   return 0;
>  }
> 
> +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern)
> +{
> + int ret = 0;
> +
> + mutex_lock(&device_mutex);
> + strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX);
> + if (strchr(ibdev->name, '%'))
> + ret = alloc_name(ibdev->name);
> +
> + mutex_unlock(&device_mutex);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(ib_device_alloc_name);
> +
>  static void ib_device_release(struct device *device)
>  {
>   struct ib_device *dev = container_of(device, struct ib_device, dev);
> @@ -499,7 +522,6 @@ int ib_register_device(struct ib_device *device,
>   ret = alloc_name(device->name);
>   if (ret)
>   goto out;
> - }

I don't think this is correct...


> 
>   if (ib_device_check_mandatory(device)) {
>   ret = -EINVAL;
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index e764ed1f6025..0e7b9854 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -2260,6 +2260,11 @@ struct ib_device {
>   /* Do not access @dma_device directly from ULP nor from HW drivers. */
>   struct device*dma_device;
> 
> + /*
> +  * Do not access @name directly,
> +  * use ib_device_get_name()/ib_device_alloc_name()
> +  * and don't assume that it can't change after access.
> +  */
>   char  name[IB_DEVICE_NAME_MAX];
> 
>   struct list_head  event_handler_list;
> @@ -2638,7 +2643,8 @@ struct ib_device *ib_alloc_device(size_t size);
>  void ib_dealloc_device(struct ib_device *device);
> 
>  void ib_get_device_fw_str(struct ib_device *device, char *str);
> -
> +int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern);
> +void ib_device_get_name(struct ib_device *ibdev, char *name);
>  int ib_register_device(struct ib_device *device,
>  int (*port_callback)(struct ib_device *,
>   u8, struct kobject *));
> --
> 2.14.4
> 


Re: [PATCH 00/21] SMMU enablement for NXP LS1043A and LS1046A

2018-09-20 Thread Laurentiu Tudor


On 20.09.2018 14:49, Robin Murphy wrote:
> On 20/09/18 11:38, Laurentiu Tudor wrote:
>>
>>
>> On 19.09.2018 17:37, Robin Murphy wrote:
>>> On 19/09/18 15:18, Laurentiu Tudor wrote:
 Hi Robin,

 On 19.09.2018 16:25, Robin Murphy wrote:
> Hi Laurentiu,
>
> On 19/09/18 13:35, laurentiu.tu...@nxp.com wrote:
>> From: Laurentiu Tudor 
>>
>> This patch series adds SMMU support for NXP LS1043A and LS1046A chips
>> and consists mostly in important driver fixes and the required device
>> tree updates. It touches several subsystems and consists of three 
>> main
>> parts:
>>     - changes in soc/drivers/fsl/qbman drivers adding iommu 
>> mapping of
>>   reserved memory areas, fixes and defered probe support
>>     - changes in drivers/net/ethernet/freescale/dpaa_eth drivers
>>   consisting in misc dma mapping related fixes and probe ordering
>>     - addition of the actual arm smmu device tree node together with
>>   various adjustments to the device trees
>>
>> Performance impact
>>
>>    Running iperf benchmarks in a back-to-back setup (both sides
>>    having smmu enabled) on a 10Gbps port shows an important
>>    networking performance degradation of around 40% (9.48Gbps
>>    linerate vs 5.45Gbps). If you need performance but without
>>    SMMU support you can use "iommu.passthrough=1" to disable
>>    SMMU.
> 
> I should have said before - thanks for the numbers there as well. Always 
> good to add another datapoint to my collection. If you're interested 
> I've added SMMUv2 support to the "non-strict mode" series (of which I 
> should be posting v8 soon), so it might be fun to see how well that 
> works on MMU-500 in the real world.

Hmm, I think I gave those a try some weeks ago and vaguely remember that 
I did see improvements. Can't remember the numbers off the top of my 
head but I'll re-test with the latest spin and update the numbers.

>>
>> USB issue and workaround
>>
>>    There's a problem with the usb controllers in these chips
>>    generating smaller, 40-bit wide dma addresses instead of the
>> 48-bit
>>    supported at the smmu input. So you end up in a situation
>> where the
>>    smmu is mapped with 48-bit address translations, but the 
>> device
>>    generates transactions with clipped 40-bit addresses, thus 
>> smmu
>>    context faults are triggered. I encountered a similar
>> situation for
>>    mmc that I  managed to fix in software [1] however for USB I
>> did not
>>    find a proper place in the code to add a similar fix. The only
>>    workaround I found was to add this kernel parameter which
>> limits the
>>    usb dma to 32-bit size: "xhci-hcd.quirks=0x80".
>>    This workaround if far from ideal, so any suggestions for a 
>> code
>>    based workaround in this area would be greatly appreciated.
>
> If you have a nominally-64-bit device with a
> narrower-than-the-main-interconnect link in front of it, that should
> already be fixed in 4.19-rc by bus_dma_mask picking up DT dma-ranges,
> provided the interconnect hierarchy can be described appropriately (or
> at least massaged sufficiently to satisfy the binding), e.g.:
>
> / {
>    ...
>
>    soc {
>        ranges;
>        dma-ranges = <0 0 1 0>;
>
>        dev_48bit { ... };
>
>        periph_bus {
>        ranges;
>        dma-ranges = <0 0 100 0>;
>
>        dev_40bit { ... };
>        };
>    };
> };
>
> and if that fails to work as expected (except for PCI hosts where
> handling dma-ranges properly still needs sorting out), please do 
> let us
> know ;)
>

Just to confirm, is this [1] the change I was supposed to test?
>>>
>>> Not quite - dma-ranges is only valid for nodes representing a bus, so
>>> putting it directly in the USB device nodes doesn't work (FWIW that's
>>> why PCI is broken, because the parser doesn't expect the
>>> bus-as-leaf-node case). That's teh point of that intermediate simple-bus
>>> node represented by "periph_bus" in my example (sorry, I should have put
>>> compatibles in to make it clearer) - often that's actually true to life
>>> (i.e. "soc" is something like a CCI and "periph_bus" is something like
>>> an AXI NIC gluing a bunch of lower-bandwidth DMA masters to one of the
>>> CCI ports) but at worst it's just a necessary evil to make the binding
>>> happy (if it literally only represents the point-to-point link between
>>> the device master port and interconnect slave port).
>>>
>>
>> Quick update: so I adjusted to device tree according to your example and
>> it works so now I can get rid of that nasty kernel arg based 

Re: [PATCH v2 0/2] hv_netvsc: associate VF and PV device by serial number

2018-09-20 Thread Lorenzo Pieralisi
On Fri, Sep 14, 2018 at 12:54:55PM -0700, Stephen Hemminger wrote:
> The Hyper-V implementation of the PCI controller has a concept of a 32-bit
> serial number (not to be confused with the PCI-E serial number). This value
> is sent in the protocol from the host to indicate an SR-IOV VF device is
> attached to a synthetic NIC.
> 
> Using the serial number (instead of MAC address) to associate the two devices
> avoids lots of potential problems when there are duplicate MAC addresses from
> tunnels or layered devices.
> 
> The patch set is broken into two parts, one is for the PCI controller
> and the other is for the netvsc device. Normally, these go through different
> trees but sending them together here for better review. The PCI changes
> were submitted previously, but the main review comment was "why do you
> need this?". This is why.

The question was more whether we should convert this serial number into
a PCI slot number (which has user-space visibility and is what you are
after) to improve the current matching; I do not question why you need
it, just for the record.

Lorenzo

> v2 - slot name can be shorter.
>  remove locking when creating pci_slots; see comment for explanation
> 
> Stephen Hemminger (2):
>   PCI: hv: support reporting serial number as slot information
>   hv_netvsc: pair VF based on serial number
> 
>  drivers/net/hyperv/netvsc.c |  3 ++
>  drivers/net/hyperv/netvsc_drv.c | 58 -
>  drivers/pci/controller/pci-hyperv.c | 37 ++
>  3 files changed, 73 insertions(+), 25 deletions(-)
> 
> -- 
> 2.18.0
> 


Re: [PATCH net] sctp: update dst pmtu with the correct daddr

2018-09-20 Thread Marcelo Ricardo Leitner
On Thu, Sep 20, 2018 at 05:27:28PM +0800, Xin Long wrote:
> When processing pmtu update from an icmp packet, it calls .update_pmtu
> with sk instead of skb in sctp_transport_update_pmtu.
> 
> However for sctp, the daddr in the transport might be different from
> inet_sock->inet_daddr or sk->sk_v6_daddr, which is used to update or
> create the route cache. The incorrect daddr will cause a different
> route cache created for the path.
> 
> So before calling .update_pmtu, inet_sock->inet_daddr/sk->sk_v6_daddr
> should be updated with the daddr in the transport, and update it back
> after it's done.
> 
> The issue has existed since route exceptions introduction.
> 
> Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
> Reported-by: ian.per...@dialogic.com
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  net/sctp/transport.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index 12cac85..033696e 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -260,6 +260,7 @@ void sctp_transport_pmtu(struct sctp_transport 
> *transport, struct sock *sk)
>  bool sctp_transport_update_pmtu(struct sctp_transport *t, u32 pmtu)
>  {
>   struct dst_entry *dst = sctp_transport_dst_check(t);
> + struct sock *sk = t->asoc->base.sk;
>   bool change = true;
>  
>   if (unlikely(pmtu < SCTP_DEFAULT_MINSEGMENT)) {
> @@ -271,12 +272,19 @@ bool sctp_transport_update_pmtu(struct sctp_transport 
> *t, u32 pmtu)
>   pmtu = SCTP_TRUNC4(pmtu);
>  
>   if (dst) {
> - dst->ops->update_pmtu(dst, t->asoc->base.sk, NULL, pmtu);
> + struct sctp_pf *pf = sctp_get_pf_specific(dst->ops->family);
> + union sctp_addr addr;
> +
> + pf->af->from_sk(&addr, sk);
> + pf->to_sk_daddr(&t->ipaddr, sk);
> + dst->ops->update_pmtu(dst, sk, NULL, pmtu);
> + pf->to_sk_daddr(&addr, sk);
> +
>   dst = sctp_transport_dst_check(t);
>   }
>  
>   if (!dst) {
> - t->af_specific->get_dst(t, &t->saddr, &t->fl, t->asoc->base.sk);
> + t->af_specific->get_dst(t, &t->saddr, &t->fl, sk);
>   dst = t->dst;
>   }
>  
> -- 
> 2.1.0
> 


Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC actions

2018-09-20 Thread Jakub Kicinski
On Thu, 20 Sep 2018 09:14:08 +0200, Eelco Chaudron wrote:
> Is there anything else blocking from getting this into net-next?
> 
> I still think this patch is beneficial for the full user experience, and 
> I’ve got requests from QA and others for this.

Not from my perspective, the numbers look okay, feel free to repost!


Re: array bounds warning in xfrm_output_resume

2018-09-20 Thread Florian Westphal
David Ahern  wrote:
> > $ make O=kbuild/perf -j 24 -s
> > In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
> >  from /home/dsa/kernel-3.git/include/linux/list.h:9,
> >  from /home/dsa/kernel-3.git/include/linux/module.h:9,
> >  from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
> > /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
> > ‘xfrm_output_resume’:
> > /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
> > subscript is above array bounds [-Warray-bounds]
> >__read_once_size(&(x), __u.__c, sizeof(x));  \

Does this thing avoid the warning?

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -143,6 +143,8 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
struct net *net = xs_net(skb_dst(skb)->xfrm);
 
while (likely((err = xfrm_output_one(skb, err)) == 0)) {
+   unsigned short pf;
+
nf_reset(skb);
 
err = skb_dst(skb)->ops->local_out(net, skb->sk, skb);
@@ -152,11 +154,19 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
if (!skb_dst(skb)->xfrm)
return dst_output(net, skb->sk, skb);
 
-   err = nf_hook(skb_dst(skb)->ops->family,
- NF_INET_POST_ROUTING, net, skb->sk, skb,
- NULL, skb_dst(skb)->dev, xfrm_output2);
-   if (unlikely(err != 1))
-   goto out;
+   pf = skb_dst(skb)->ops->family;
+   switch (pf) {
+   case AF_INET: /* fallthrough */
+   case AF_INET6:
+   err = nf_hook(pf,
+ NF_INET_POST_ROUTING, net, skb->sk, skb,
+ NULL, skb_dst(skb)->dev, xfrm_output2);
+   if (unlikely(err != 1))
+   goto out;
+   break;
+   default:
+   break;
+   }
}
 
if (err == -EINPROGRESS)


Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs

2018-09-20 Thread Arnaldo Carvalho de Melo
Em Thu, Sep 20, 2018 at 10:36:17AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu:
> > Recognize JITed bpf prog load/unload events.
> > Add/remove kernel symbols accordingly.
> > 
> > Signed-off-by: Alexei Starovoitov 
> > ---
> >  tools/perf/util/machine.c | 27 +++
> >  tools/perf/util/symbol.c  | 13 +
> >  tools/perf/util/symbol.h  |  1 +
> >  3 files changed, 41 insertions(+)
> > 
> > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > index c4acd2001db0..ae4f8a0fdc7e 100644
> > --- a/tools/perf/util/machine.c
> > +++ b/tools/perf/util/machine.c
> > @@ -25,6 +25,7 @@
> >  #include "sane_ctype.h"
> >  #include 
> >  #include 
> > +#include <linux/magic.h>
> >  
> >  static void __machine__remove_thread(struct machine *machine, struct 
> > thread *th, bool lock);
> >  
> > @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct 
> > machine *machine,
> > enum dso_kernel_type kernel_type;
> > bool is_kernel_mmap;
> >  
> > +   /* process JITed bpf programs load/unload events */
> > +   if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) {
> 
> 
> So, this would be in machine__process_kernel_munmap-event(machine), etc,
> no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP.
> 
> > +   struct symbol *sym;
> > +   u64 ip;
> > +
> > +   map = map_groups__find(&machine->kmaps, event->mmap.start);
> > +   if (!map) {
> > +   pr_err("No kernel map for IP %lx\n", event->mmap.start);
> > +   goto out_problem;
> > +   }
> > +   ip = event->mmap.start - map->start + map->pgoff;
> > +   if (event->mmap.filename[0]) {
> > +   sym = symbol__new(ip, event->mmap.len, 0, 0,
> > + event->mmap.filename);
> 
> Humm, so the bpf program would be just one symbol... bpf-to-bpf calls
> will be to a different bpf program, right? 
> 
> /me goes to read https://lwn.net/Articles/741773/
>  "[PATCH bpf-next 00/13] bpf: introduce function calls"

After reading it, yeah, I think we need some way to access a symbol
table for a BPF program, and also its binary so that we can do
annotation, static (perf annotate) and live (perf top), was this already
considered? I think one can get the binary for a program giving
sufficient perms somehow, right? One other thing I need to catch up 8-)

- Arnaldo
 
> > +   dso__insert_symbol(map->dso, sym);
> > +   } else {
> > +   if (symbols__erase(&map->dso->symbols, ip)) {
> > +   pr_err("No bpf prog at IP %lx/%lx\n",
> > +  event->mmap.start, ip);
> > +   goto out_problem;
> > +   }
> > +   dso__reset_find_symbol_cache(map->dso);
> > +   }
> > +   return 0;
> > +   }
> > +
> > /* If we have maps from kcore then we do not need or want any others */
> > if (machine__uses_kcore(machine))
> > return 0;
> > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> > index d188b7588152..0653f313661d 100644
> > --- a/tools/perf/util/symbol.c
> > +++ b/tools/perf/util/symbol.c
> > @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root 
> > *symbols, u64 ip)
> > return NULL;
> >  }
> >  
> > +int symbols__erase(struct rb_root *symbols, u64 ip)
> > +{
> > +   struct symbol *s;
> > +
> > +   s = symbols__find(symbols, ip);
> > +   if (!s)
> > +   return -ENOENT;
> > +
> > +   rb_erase(&s->rb_node, symbols);
> > +   symbol__delete(s);
> > +   return 0;
> > +}
> > +
> >  static struct symbol *symbols__first(struct rb_root *symbols)
> >  {
> > struct rb_node *n = rb_first(symbols);
> > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> > index f25fae4b5743..92ef31953d9a 100644
> > --- a/tools/perf/util/symbol.h
> > +++ b/tools/perf/util/symbol.h
> > @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, 
> > const char *elf_name);
> >  
> >  void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool 
> > kernel);
> >  void symbols__insert(struct rb_root *symbols, struct symbol *sym);
> > +int symbols__erase(struct rb_root *symbols, u64 ip);
> >  void symbols__fixup_duplicate(struct rb_root *symbols);
> >  void symbols__fixup_end(struct rb_root *symbols);
> >  void map_groups__fixup_end(struct map_groups *mg);
> > -- 
> > 2.17.1


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> to having to cope with munmapping parts of existing mmaps, etc.
> 
> I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> now it would be used just in this clean case for undoing a
> PERF_RECORD_MMAP for a BPF program.
> 
> The ABI is already complicated, starting to use something called
> PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> I think.

Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
part was getting the perf tool to dtrt for that use-case. But if we need
unmap events, doing the unmap record now is the right thing.




Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices

2018-09-20 Thread Eric Dumazet



On 09/20/2018 06:43 AM, Eric Dumazet wrote:
> 

> And lastly this patch looks way too complicated to me.
> You probably can write something much simpler.

Something like :

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 
930d17fa906c9ebf1cf7b6031ce0a22f9f66c0e4..e0a81beb4f37751421dbbe794ccf3d5a46bdf900
 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -278,22 +278,26 @@ static bool netlink_filter_tap(const struct sk_buff *skb)
return false;
 }
 
-static int __netlink_deliver_tap_skb(struct sk_buff *skb,
+static int __netlink_deliver_tap_skb(struct sk_buff **pskb,
 struct net_device *dev)
 {
-   struct sk_buff *nskb;
+   struct sk_buff *nskb, *skb = *pskb;
struct sock *sk = skb->sk;
int ret = -ENOMEM;
 
if (!net_eq(dev_net(dev), sock_net(sk)))
return 0;
 
-   dev_hold(dev);
-
-   if (is_vmalloc_addr(skb->head))
+   if (is_vmalloc_addr(skb->head)) {
nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
-   else
-   nskb = skb_clone(skb, GFP_ATOMIC);
+   if (!nskb)
+   return -ENOMEM;
+   consume_skb(skb);
+   skb = nskb;
+   *pskb = skb;
+   }
+   dev_hold(dev);
+   nskb = skb_clone(skb, GFP_ATOMIC);
if (nskb) {
nskb->dev = dev;
nskb->protocol = htons((u16) sk->sk_protocol);
@@ -318,7 +322,7 @@ static void __netlink_deliver_tap(struct sk_buff *skb, 
struct netlink_tap_net *n
return;
 
 list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
-   ret = __netlink_deliver_tap_skb(skb, tmp->dev);
+   ret = __netlink_deliver_tap_skb(&skb, tmp->dev);
if (unlikely(ret))
break;
}




Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes

2018-09-20 Thread Michal Kubecek
On Wed, Sep 19, 2018 at 05:06:25PM +0100, Edward Cree wrote:
> Instead of commas, just have them as separate argvs.
> 
> The parsing state machine might look heavyweight, but it makes it easy to add
>  more parameters later and distinguish parameter names from encoding names.
> 
> Suggested-by: Michal Kubecek 
> Signed-off-by: Edward Cree 
> ---

Looks good, thank you.

Reviewed-by: Michal Kubecek 


>  ethtool.8.in   |  6 +++---
>  ethtool.c  | 63 
> --
>  test-cmdline.c | 10 +-
>  3 files changed, 25 insertions(+), 54 deletions(-)
> 
> diff --git a/ethtool.8.in b/ethtool.8.in
> index 414eaa1..7ea2cc0 100644
> --- a/ethtool.8.in
> +++ b/ethtool.8.in
> @@ -390,7 +390,7 @@ ethtool \- query or control network driver and hardware 
> settings
>  .B ethtool \-\-set\-fec
>  .I devname
>  .B encoding
> -.BR auto | off | rs | baser [ , ...]
> +.BR auto | off | rs | baser \ [...]
>  .
>  .\" Adjust lines (i.e. full justification) and hyphenate.
>  .ad
> @@ -1120,11 +1120,11 @@ current FEC mode, the driver or firmware must take 
> the link down
>  administratively and report the problem in the system logs for users to 
> correct.
>  .RS 4
>  .TP
> -.BR encoding\ auto | off | rs | baser [ , ...]
> +.BR encoding\ auto | off | rs | baser \ [...]
>  
>  Sets the FEC encoding for the device.  Combinations of options are specified 
> as
>  e.g.
> -.B auto,rs
> +.B encoding auto rs
>  ; the semantics of such combinations vary between drivers.
>  .TS
>  nokeep;
> diff --git a/ethtool.c b/ethtool.c
> index 9997930..2f7e96b 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -4979,39 +4979,6 @@ static int fecmode_str_to_type(const char *str)
>   return 0;
>  }
>  
> -/* Takes a comma-separated list of FEC modes, returns the bitwise OR of their
> - * corresponding ETHTOOL_FEC_* constants.
> - * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,').
> - */
> -static int parse_fecmode(const char *str)
> -{
> - int fecmode = 0;
> - char buf[6];
> -
> - if (!str)
> - return 0;
> - while (*str) {
> - size_t next;
> - int mode;
> -
> - next = strcspn(str, ",");
> - if (next >= 6) /* Bad mode, longest name is 5 chars */
> - return 0;
> - /* Copy into temp buffer and nul-terminate */
> - memcpy(buf, str, next);
> - buf[next] = 0;
> - mode = fecmode_str_to_type(buf);
> - if (!mode) /* Bad mode encountered */
> - return 0;
> - fecmode |= mode;
> - str += next;
> - /* Skip over ',' (but not nul) */
> - if (*str)
> - str++;
> - }
> - return fecmode;
> -}
> -
>  static int do_gfec(struct cmd_context *ctx)
>  {
>   struct ethtool_fecparam feccmd = { 0 };
> @@ -5041,22 +5008,26 @@ static int do_gfec(struct cmd_context *ctx)
>  
>  static int do_sfec(struct cmd_context *ctx)
>  {
> - char *fecmode_str = NULL;
> + enum { ARG_NONE, ARG_ENCODING } state = ARG_NONE;
>   struct ethtool_fecparam feccmd;
> - struct cmdline_info cmdline_fec[] = {
> - { "encoding", CMDL_STR,  _str,  },
> - };
> - int changed;
> - int fecmode;
> - int rv;
> + int fecmode = 0, newmode;
> + int rv, i;
>  
> - parse_generic_cmdline(ctx, &changed, cmdline_fec,
> -   ARRAY_SIZE(cmdline_fec));
> -
> - if (!fecmode_str)
> + for (i = 0; i < ctx->argc; i++) {
> + if (!strcmp(ctx->argp[i], "encoding")) {
> + state = ARG_ENCODING;
> + continue;
> + }
> + if (state == ARG_ENCODING) {
> + newmode = fecmode_str_to_type(ctx->argp[i]);
> + if (!newmode)
> + exit_bad_args();
> + fecmode |= newmode;
> + continue;
> + }
>   exit_bad_args();
> + }
>  
> - fecmode = parse_fecmode(fecmode_str);
>   if (!fecmode)
>   exit_bad_args();
>  
> @@ -5265,7 +5236,7 @@ static const struct option {
> " [ all ]\n"},
>   { "--show-fec", 1, do_gfec, "Show FEC settings"},
>   { "--set-fec", 1, do_sfec, "Set FEC settings",
> -   " [ encoding auto|off|rs|baser ]\n"},
> +   " [ encoding auto|off|rs|baser [...]]\n"},
>   { "-h|--help", 0, show_usage, "Show this help" },
>   { "--version", 0, do_version, "Show version number" },
>   {}
> diff --git a/test-cmdline.c b/test-cmdline.c
> index 9c51cca..84630a5 100644
> --- a/test-cmdline.c
> +++ b/test-cmdline.c
> @@ -268,12 +268,12 @@ static struct test_case {
>   { 1, "--set-eee devname advertise foo" },
>   { 1, "--set-fec devname" },
>   { 0, "--set-fec devname encoding auto" },
> - { 0, "--set-fec devname encoding off," },
> -

Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices

2018-09-20 Thread Eric Dumazet



On 09/20/2018 01:54 AM, Li RongQing wrote:
> if skb->head is vmalloc address, when this skb is delivered, full
> allocation for this skb is required, if there are many devices,
> the full allocation will be called for every device
> 
> now if it is vmalloc, allocate a new skb whose data is not a vmalloc
> address, and use the newly allocated skb to clone and send, to avoid
> a full skb allocation every time.
> 
> Signed-off-by: Zhang Yu 
> Signed-off-by: Li RongQing 


This looks like a broken Signed-off-by chain to me.

Who really wrote this patch ?

Also, given that I gave feedback to your first version,
it is customary to CC me for future versions, unless you want me to not notice 
them.

And lastly this patch looks way too complicated to me.
You probably can write something much simpler.

Thanks.


Re: [PATCH v2 2/2] hv_netvsc: pair VF based on serial number

2018-09-20 Thread Lorenzo Pieralisi
On Fri, Sep 14, 2018 at 12:54:57PM -0700, Stephen Hemminger wrote:
> Matching network device based on MAC address is problematic
> since a non-VF network device can be created with a duplicate MAC
> address causing confusion and problems.  The VMBus API does provide
> a serial number that is a better matching method.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  drivers/net/hyperv/netvsc.c |  3 ++
>  drivers/net/hyperv/netvsc_drv.c | 58 +++--
>  2 files changed, 36 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 31c3d77b4733..fe01e141c8f8 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -1203,6 +1203,9 @@ static void netvsc_send_vf(struct net_device *ndev,
>  
>   net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated;
>   net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial;
> + netdev_info(ndev, "VF slot %u %s\n",
> + net_device_ctx->vf_serial,
> + net_device_ctx->vf_alloc ? "added" : "removed");
>  }
>  
>  static  void netvsc_receive_inband(struct net_device *ndev,
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 1121a1ec407c..9dedc1463e88 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -1894,20 +1894,6 @@ static void netvsc_link_change(struct work_struct *w)
>   rtnl_unlock();
>  }
>  
> -static struct net_device *get_netvsc_bymac(const u8 *mac)
> -{
> - struct net_device_context *ndev_ctx;
> -
> - list_for_each_entry(ndev_ctx, _dev_list, list) {
> - struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
> -
> - if (ether_addr_equal(mac, dev->perm_addr))
> - return dev;
> - }
> -
> - return NULL;
> -}
> -
>  static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
>  {
>   struct net_device_context *net_device_ctx;
> @@ -2036,26 +2022,48 @@ static void netvsc_vf_setup(struct work_struct *w)
>   rtnl_unlock();
>  }
>  
> +/* Find netvsc by VMBus serial number.
> + * The PCI hyperv controller records the serial number as the slot.
> + */
> +static struct net_device *get_netvsc_byslot(const struct net_device 
> *vf_netdev)
> +{
> + struct device *parent = vf_netdev->dev.parent;
> + struct net_device_context *ndev_ctx;
> + struct pci_dev *pdev;
> +
> + if (!parent || !dev_is_pci(parent))
> + return NULL; /* not a PCI device */
> +
> + pdev = to_pci_dev(parent);
> + if (!pdev->slot) {
> + netdev_notice(vf_netdev, "no PCI slot information\n");
> + return NULL;
> + }
> +
> + list_for_each_entry(ndev_ctx, _dev_list, list) {
> + if (!ndev_ctx->vf_alloc)
> + continue;
> +
> + if (ndev_ctx->vf_serial == pdev->slot->number)
> + return hv_get_drvdata(ndev_ctx->device_ctx);

In patch 1, pdev->slot->number is set to:

PCI_SLOT(wslot_to_devfn(hpdev->desc.win_slot.slot))

so I assume vf_serial is initialized (I have no knowledge of VMBUS and
hyper-V internals) to that value, somehow.

I also do not know how the wslot stuff is handled, but I assume 5 bits
(i.e. the dev bits in devfn) are enough.

BTW, I have noticed this patch (and patch 1) are already in -next so I will
drop them from the PCI patch queue.

Lorenzo

> + }
> +
> + netdev_notice(vf_netdev,
> +   "no netdev found for slot %u\n", pdev->slot->number);
> + return NULL;
> +}
> +
>  static int netvsc_register_vf(struct net_device *vf_netdev)
>  {
> - struct net_device *ndev;
>   struct net_device_context *net_device_ctx;
> - struct device *pdev = vf_netdev->dev.parent;
>   struct netvsc_device *netvsc_dev;
> + struct net_device *ndev;
>   int ret;
>  
>   if (vf_netdev->addr_len != ETH_ALEN)
>   return NOTIFY_DONE;
>  
> - if (!pdev || !dev_is_pci(pdev) || dev_is_pf(pdev))
> - return NOTIFY_DONE;
> -
> - /*
> -  * We will use the MAC address to locate the synthetic interface to
> -  * associate with the VF interface. If we don't find a matching
> -  * synthetic interface, move on.
> -  */
> - ndev = get_netvsc_bymac(vf_netdev->perm_addr);
> + ndev = get_netvsc_byslot(vf_netdev);
>   if (!ndev)
>   return NOTIFY_DONE;
>  
> -- 
> 2.18.0
> 


Re: [PATCH v2 2/4] dt-bindings: net: qcom: Add binding for shared mdio bus

2018-09-20 Thread Timur Tabi

On 9/19/18 10:20 AM, Andrew Lunn wrote:

I suspect that is not going to be easy. Last time I looked, the ACPI
standard had nothing about MDIO busses or PHYs. Marcin Wojtas did some
work in this area a while back for the mvpp2, but if I remember
correctly, he worked around this by simply not having a PHY when using
ACPI, and making use of a MAC interrupt which indicated when there was
link.

Whoever implements this first needs to be an ACPI expert and probably
needs to write it up and submit it as an amendment to the ACPI
standard.


If that's what it takes, then so be it.  But adding DT support for a 
device that is only used on ACPI platforms is not a worthwhile endeavor.


After ACPI support is merged, Dongsheng can choose to add DT support to 
maintain parity, if he wants.  Maybe one day the MSM developers will use 
the upstream driver on one of their SOCs.


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +   unsigned long symbol_start, symbol_end;
> > +   /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +   char buf[sizeof(u64)] = {};
> > +
> > if (!bpf_prog_kallsyms_candidate(fp))
> > return;
> >  
> > spin_lock_bh(_lock);
> > bpf_prog_ksym_node_del(fp->aux);
> > spin_unlock_bh(_lock);
> > +   bpf_get_prog_addr_region(fp, _start, _end);
> > +   perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +buf, sizeof(buf));
> >  }
> 
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed, I think.
> 
> It feels a bit weird to start issuing unmap events for this.

FWIW:

  
https://lkml.kernel.org/r/20170127130702.gi6...@twins.programming.kicks-ass.net

has talk of PERF_RECORD_MUNMAP


Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs

2018-09-20 Thread Arnaldo Carvalho de Melo
Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu:
> Recognize JITed bpf prog load/unload events.
> Add/remove kernel symbols accordingly.
> 
> Signed-off-by: Alexei Starovoitov 
> ---
>  tools/perf/util/machine.c | 27 +++
>  tools/perf/util/symbol.c  | 13 +
>  tools/perf/util/symbol.h  |  1 +
>  3 files changed, 41 insertions(+)
> 
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index c4acd2001db0..ae4f8a0fdc7e 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -25,6 +25,7 @@
>  #include "sane_ctype.h"
>  #include 
>  #include 
> +#include 
>  
>  static void __machine__remove_thread(struct machine *machine, struct thread 
> *th, bool lock);
>  
> @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct 
> machine *machine,
>   enum dso_kernel_type kernel_type;
>   bool is_kernel_mmap;
>  
> + /* process JITed bpf programs load/unload events */
> + if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) {


So, this would be in machine__process_kernel_munmap_event(machine), etc;
no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP.

> + struct symbol *sym;
> + u64 ip;
> +
> + map = map_groups__find(>kmaps, event->mmap.start);
> + if (!map) {
> + pr_err("No kernel map for IP %lx\n", event->mmap.start);
> + goto out_problem;
> + }
> + ip = event->mmap.start - map->start + map->pgoff;
> + if (event->mmap.filename[0]) {
> + sym = symbol__new(ip, event->mmap.len, 0, 0,
> +   event->mmap.filename);

Humm, so the bpf program would be just one symbol... bpf-to-bpf calls
will be to a different bpf program, right? 

/me goes to read https://lwn.net/Articles/741773/
 "[PATCH bpf-next 00/13] bpf: introduce function calls"

> + dso__insert_symbol(map->dso, sym);
> + } else {
> + if (symbols__erase(>dso->symbols, ip)) {
> + pr_err("No bpf prog at IP %lx/%lx\n",
> +event->mmap.start, ip);
> + goto out_problem;
> + }
> + dso__reset_find_symbol_cache(map->dso);
> + }
> + return 0;
> + }
> +
>   /* If we have maps from kcore then we do not need or want any others */
>   if (machine__uses_kcore(machine))
>   return 0;
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index d188b7588152..0653f313661d 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root 
> *symbols, u64 ip)
>   return NULL;
>  }
>  
> +int symbols__erase(struct rb_root *symbols, u64 ip)
> +{
> + struct symbol *s;
> +
> + s = symbols__find(symbols, ip);
> + if (!s)
> + return -ENOENT;
> +
> + rb_erase(>rb_node, symbols);
> + symbol__delete(s);
> + return 0;
> +}
> +
>  static struct symbol *symbols__first(struct rb_root *symbols)
>  {
>   struct rb_node *n = rb_first(symbols);
> diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> index f25fae4b5743..92ef31953d9a 100644
> --- a/tools/perf/util/symbol.h
> +++ b/tools/perf/util/symbol.h
> @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, 
> const char *elf_name);
>  
>  void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool 
> kernel);
>  void symbols__insert(struct rb_root *symbols, struct symbol *sym);
> +int symbols__erase(struct rb_root *symbols, u64 ip);
>  void symbols__fixup_duplicate(struct rb_root *symbols);
>  void symbols__fixup_end(struct rb_root *symbols);
>  void map_groups__fixup_end(struct map_groups *mg);
> -- 
> 2.17.1


Re: [PATCH net] af_key: free SKBs under RCU protection

2018-09-20 Thread Eric Dumazet



On 09/19/2018 05:18 PM, Sean Tranchetti wrote:
> pfkey_broadcast() can make calls to pfkey_broadcast_one() which
> will clone or copy the passed in SKB. This new SKB will be assigned
> the sock_rfree() function as its destructor, which requires that
> the socket reference the SKB contains is valid when the SKB is freed.
> 
> If this SKB is ever passed to pfkey_broadcast() again by some other
> function (such as pkfey_dump() or pfkey_promisc) it will then be
> freed there. However, since this free occurs outside of RCU protection,
> it is possible that userspace could close the socket and trigger
> pfkey_release() to free the socket before sock_rfree() can run, creating
> the following race condition:
> 
> 1: An SKB belonging to the pfkey socket is passed to pfkey_broadcast().
>It performs the broadcast to any other sockets, and calls
>rcu_read_unlock(), but does not yet reach kfree_skb().
> 2: Userspace closes the socket, triggering pfkey_release(). Since no one
>holds the RCU lock, synchronize_rcu() returns and it is allowed to
>continue. It calls sock_put() to free the socket.
> 3: pfkey_broadcast() now calls kfree_skb() on the original SKB it was
>passed, triggering a call to sock_rfree(). This function now accesses
>the freed struct sock * via skb->sk, and attempts to update invalid
>memory.
> 
> By ensuring that the pfkey_broadcast() also frees the SKBs while it holds
> the RCU lock, we can ensure that the socket will remain valid when the SKB
> is freed, avoiding crashes like the following:
> 
> Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
> [006b6b6b6b6b6c4b] address between user and kernel address ranges
> Internal error: Oops: 9604 [#1] PREEMPT SMP
> task: fff78f65b380 task.stack: ff8049a88000
> pc : sock_rfree+0x38/0x6c
> lr : skb_release_head_state+0x6c/0xcc
> Process repro (pid: 7117, stack limit = 0xff8049a88000)
> Call trace:
>   sock_rfree+0x38/0x6c
>   skb_release_head_state+0x6c/0xcc
>   skb_release_all+0x1c/0x38
>   __kfree_skb+0x1c/0x30
>   kfree_skb+0xd0/0xf4
>   pfkey_broadcast+0x14c/0x18c
>   pfkey_sendmsg+0x1d8/0x408
>   sock_sendmsg+0x44/0x60
>   ___sys_sendmsg+0x1d0/0x2a8
>   __sys_sendmsg+0x64/0xb4
>   SyS_sendmsg+0x34/0x4c
>   el0_svc_naked+0x34/0x38
> Kernel panic - not syncing: Fatal exception
> 
> Signed-off-by: Sean Tranchetti 
> ---
>  net/key/af_key.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/key/af_key.c b/net/key/af_key.c
> index 9d61266..dd257c7 100644
> --- a/net/key/af_key.c
> +++ b/net/key/af_key.c
> @@ -275,13 +275,13 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t 
> allocation,
>   if ((broadcast_flags & BROADCAST_REGISTERED) && err)
>   err = err2;
>   }
> - rcu_read_unlock();
>  
>   if (one_sk != NULL)
>   err = pfkey_broadcast_one(skb, , allocation, one_sk);
>  
>   kfree_skb(skb2);
>   kfree_skb(skb);
> + rcu_read_unlock();
>   return err;
>  }
>  
> 

I do not believe the changelog or the patch makes sense.

Having an skb still referencing a socket prevents that socket from being 
released.

If you think about it, what would prevent the freeing from happening 
_before_ the rcu_read_lock() in pfkey_broadcast() ?

Maybe the correct fix is that pfkey_broadcast_one() should ensure the socket is 
still valid.

I would suggest something like :

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 
9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d
 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff *skb, struct 
sk_buff **skb2,
 {
int err = -ENOBUFS;
 
-   sock_hold(sk);
+   if (!refcount_inc_not_zero(>sk_refcnt))
+   return -ENOENT;
+
if (*skb2 == NULL) {
if (refcount_read(>users) != 1) {
*skb2 = skb_clone(skb, allocation);





Re: [PATCH v3 net-next 07/12] net: ethernet: Add helper to remove a supported link mode

2018-09-20 Thread Andrew Lunn
> 1. net-next: cf7d97e1e54d ("net: mdio: remove duplicated include from 
> mdio_bus.c")
> 
> # mii-tool -vv eth0
> Using SIOCGMIIPHY=0x8947
> eth0: no link
>   registers for MII PHY 0: 
> 1140 7949 0022 1622 0d81 c1e1 000f 
>  0300      3000
>    0078 7002   0200
>    0528    
>   product info: vendor 00:08:85, model 34 rev 2
>   basic mode:   autonegotiation enabled
>   basic status: no link
>   capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 
> 10baseT-FD 10baseT-HD
>   advertising:  100baseTx-FD 100baseTx-HD flow-control
>   link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
> 
> 2. net-next with this patch reverted
> 
> # mii-tool -vv eth0
> Using SIOCGMIIPHY=0x8947
> eth0: negotiated 100baseTx-FD, link ok
>   registers for MII PHY 0: 
> 1140 796d 0022 1622 0181 c1e1 000f 

Hi Simon

Word 5 is what we are advertising. Bits 10 and 11 are Pause and Asym
Pause. In the good case here, neither are set. In this bad case above,
both bits are set.

The patch i asked you to try only cleared the Pause bit, not the
Asymmetric Pause bit. mii-tool only saying 'flow-control' did not
help.

Word 6 is what the partner is advertising. c1e1 indicates the partner
does not support flow control; both bits are 0. I don't see why this
is preventing auto-neg though. But in the bad case, the status
register indicates auto-neg has not completed.

Anyway, please can you try this patch, which also removes Asym Pause.

Thanks
Andrew

From 00a061304af51831ca1dc86bf6ce23d01f724229 Mon Sep 17 00:00:00 2001
From: Andrew Lunn 
Date: Tue, 18 Sep 2018 18:12:54 -0500
Subject: [PATCH] ravb: Disable Pause Advertisement

The previous commit to ravb had the side effect of making the PHY
advertise Pause. This previously did not happen, and it appears the
MAC does not support Pause. By default, phydev->supported has Pause
enabled, but phydev->advertising does not. Rather than rely on this,
be explicit, and remove the Pause link mode.

Reported-by: Simon Horman 
Fixes: 41124fa64d4b ("net: ethernet: Add helper to remove a supported link 
mode")
Signed-off-by: Andrew Lunn 
---
 drivers/net/ethernet/renesas/ravb_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c 
b/drivers/net/ethernet/renesas/ravb_main.c
index fb2a1125780d..b0f2612ad226 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1073,9 +1073,11 @@ static int ravb_phy_init(struct net_device *ndev)
netdev_info(ndev, "limited PHY to 100Mbit/s\n");
}
 
-   /* 10BASE is not supported */
+   /* 10BASE, Pause and Asym Pause is not supported */
phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_10baseT_Half_BIT);
phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_10baseT_Full_BIT);
+   phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_Pause_BIT);
+   phy_remove_link_mode(phydev, ETHTOOL_LINK_MODE_Asym_Pause_BIT);
 
phy_attached_info(phydev);
 
-- 
2.19.0.rc1



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Arnaldo Carvalho de Melo
Em Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra escreveu:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +   unsigned long symbol_start, symbol_end;
> > +   /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +   char buf[sizeof(u64)] = {};
> > +
> > if (!bpf_prog_kallsyms_candidate(fp))
> > return;
> >  
> > spin_lock_bh(_lock);
> > bpf_prog_ksym_node_del(fp->aux);
> > spin_unlock_bh(_lock);
> > +   bpf_get_prog_addr_region(fp, _start, _end);
> > +   perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +buf, sizeof(buf));
> >  }
> 
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed, I think.
> 
> It feels a bit weird to start issuing unmap events for this.

For reference, this surfaced here:

https://lkml.org/lkml/2017/1/27/452

Start of the thread, that involves postgresql, JIT, LLVM, perf is here:

https://lkml.org/lkml/2016/12/10/1

PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
to having to cope with munmapping parts of existing mmaps, etc.

I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
now it would be used just in this clean case for undoing a
PERF_RECORD_MMAP for a BPF program.

The ABI is already complicated; starting to use something called
PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever,
I think.

- Arnaldo


Re: [PATCH net-next 4/5] ipv6: do not drop vrf udp multicast packets

2018-09-20 Thread Paolo Abeni
Hi,

On Thu, 2018-09-20 at 09:58 +0100, Mike Manning wrote:
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 108f5f88ec98..fc60f297d95b 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -325,9 +325,12 @@ static int ip6_input_finish(struct net *net, struct sock 
> *sk, struct sk_buff *sk
>  {
>   const struct inet6_protocol *ipprot;
>   struct inet6_dev *idev;
> + struct net_device *dev;
>   unsigned int nhoff;
> + int sdif = inet6_sdif(skb);
>   int nexthdr;
>   bool raw;
> + bool deliver;
>   bool have_final = false;

Please try instead to sort the variables in reverse x-mas tree order.
>  
>   /*
> @@ -371,9 +374,27 @@ static int ip6_input_finish(struct net *net, struct sock 
> *sk, struct sk_buff *sk
>   skb_postpull_rcsum(skb, skb_network_header(skb),
>  skb_network_header_len(skb));
>   hdr = ipv6_hdr(skb);
> - if (ipv6_addr_is_multicast(>daddr) &&
> - !ipv6_chk_mcast_addr(skb->dev, >daddr,
> - >saddr) &&
> +
> + /* skb->dev passed may be master dev for vrfs. */
> + if (sdif) {
> + rcu_read_lock();

AFAICS, the RCU lock is already acquired at the beginning of
ip6_input_finish(); no need to acquire it here again.

> +   dev = dev_get_by_index_rcu(dev_net(skb->dev),
> +sdif);
> + if (!dev) {
> + rcu_read_unlock();
> + kfree_skb(skb);
> + return -ENODEV;
> + }
> + } else {
> + dev = skb->dev;

The above fragment of code is a recurring pattern in this series;
perhaps adding a helper for it would reduce code duplication?

Cheers,

Paolo



[RFC PATCH iproute2-next 2/3] rdma: Introduce command execution helper with required device name

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

In contrast to the various show commands, the set command explicitly
requires a device name as an argument. Provide a new command
execution helper which enforces this.

Signed-off-by: Leon Romanovsky 
---
 rdma/rdma.h  |  1 +
 rdma/utils.c | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/rdma/rdma.h b/rdma/rdma.h
index d4b7ba19..dde9e128 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -90,6 +90,7 @@ int cmd_link(struct rd *rd);
 int cmd_res(struct rd *rd);
 int rd_exec_cmd(struct rd *rd, const struct rd_cmd *c, const char *str);
 int rd_exec_dev(struct rd *rd, int (*cb)(struct rd *rd));
+int rd_exec_require_dev(struct rd *rd, int (*cb)(struct rd *rd));
 int rd_exec_link(struct rd *rd, int (*cb)(struct rd *rd), bool strict_port);
 void rd_free(struct rd *rd);
 int rd_set_arg_to_devname(struct rd *rd);
diff --git a/rdma/utils.c b/rdma/utils.c
index 4840bf22..61f4aeb1 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -577,6 +577,16 @@ out:
return ret;
 }

+int rd_exec_require_dev(struct rd *rd, int (*cb)(struct rd *rd))
+{
+   if (rd_no_arg(rd)) {
+   pr_err("Please provide device name.\n");
+   return -EINVAL;
+   }
+
+   return rd_exec_dev(rd, cb);
+}
+
 int rd_exec_cmd(struct rd *rd, const struct rd_cmd *cmds, const char *str)
 {
const struct rd_cmd *c;
--
2.14.4



[RFC PATCH iproute2-next 1/3] rdma: Update kernel include file to support IB device renaming

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Bring kernel header file changes up to commit 908b44308eda
("RDMA/nldev: Allow IB device rename through RDMA netlink")

Signed-off-by: Leon Romanovsky 
---
 rdma/include/uapi/rdma/rdma_netlink.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rdma/include/uapi/rdma/rdma_netlink.h 
b/rdma/include/uapi/rdma/rdma_netlink.h
index 6513fb89..e2228c09 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -227,8 +227,9 @@ enum rdma_nldev_command {
RDMA_NLDEV_CMD_UNSPEC,

RDMA_NLDEV_CMD_GET, /* can dump */
+   RDMA_NLDEV_CMD_SET,

-   /* 2 - 4 are free to use */
+   /* 3 - 4 are free to use */

RDMA_NLDEV_CMD_PORT_GET = 5, /* can dump */

--
2.14.4



[RFC PATCH iproute2-next 3/3] rdma: Add an option to rename IB device interface

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Enrich rdmatool with an option to rename IB devices. The command
interface follows the iproute2 convention:
"rdma dev set [OLD-DEVNAME] name NEW-DEVNAME"

Signed-off-by: Leon Romanovsky 
---
 rdma/dev.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/rdma/dev.c b/rdma/dev.c
index e2eafe47..760b7fb3 100644
--- a/rdma/dev.c
+++ b/rdma/dev.c
@@ -14,6 +14,7 @@
 static int dev_help(struct rd *rd)
 {
pr_out("Usage: %s dev show [DEV]\n", rd->filename);
+   pr_out("   %s dev set [DEV] name DEVNAME\n", rd->filename);
return 0;
 }

@@ -240,17 +241,51 @@ static int dev_one_show(struct rd *rd)
return rd_exec_cmd(rd, cmds, "parameter");
 }

+static int dev_set_name(struct rd *rd)
+{
+   uint32_t seq;
+
+   if (rd_no_arg(rd)) {
+   pr_err("Please provide device new name.\n");
+   return -EINVAL;
+   }
+
+   rd_prepare_msg(rd, RDMA_NLDEV_CMD_SET,
+  , (NLM_F_REQUEST | NLM_F_ACK));
+   mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+   mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, rd_argv(rd));
+
+   return rd_send_msg(rd);
+}
+
+static int dev_one_set(struct rd *rd)
+{
+   const struct rd_cmd cmds[] = {
+   { NULL, dev_help},
+   { "name",   dev_set_name},
+   { 0 }
+   };
+
+   return rd_exec_cmd(rd, cmds, "parameter");
+}
+
 static int dev_show(struct rd *rd)
 {
return rd_exec_dev(rd, dev_one_show);
 }

+static int dev_set(struct rd *rd)
+{
+   return rd_exec_require_dev(rd, dev_one_set);
+}
+
 int cmd_dev(struct rd *rd)
 {
const struct rd_cmd cmds[] = {
{ NULL, dev_show },
{ "show",   dev_show },
{ "list",   dev_show },
+   { "set",dev_set },
{ "help",   dev_help },
{ 0 }
};
--
2.14.4



[RFC PATCH iproute2-next 0/3] rdma: IB device rename

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Hi,

This is the companion part of the kernel series posted earlier. The kernel
part is not accepted yet, so the first patch will have a different commit
message with a different commit SHA1. This is why it is marked as RFC.

An example:

[leonro@server /]$ lspci |grep -i Ether
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:09.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
[leonro@server /]$ sudo rdma dev
1: mlx5_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 
sys_image_guid 5254:00c0:fe12:3455
[leonro@server /]$ sudo rdma dev set mlx5_0 name hfi1_0
[leonro@server /]$ sudo rdma dev
1: hfi1_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 
sys_image_guid 5254:00c0:fe12:3455

Thanks

Leon Romanovsky (3):
  rdma: Update kernel include file to support IB device renaming
  rdma: Introduce command execution helper with required device name
  rdma: Add an option to rename IB device interface

 rdma/dev.c| 35 +++
 rdma/include/uapi/rdma/rdma_netlink.h |  3 ++-
 rdma/rdma.h   |  1 +
 rdma/utils.c  | 10 ++
 4 files changed, 48 insertions(+), 1 deletion(-)

--
2.14.4



[PATCH rdma-next 3/5] RDMA: Convert IB drivers to name allocation routine

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Move the internal implementation of device-name allocation into the IB core.

Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/device.c   | 5 -
 drivers/infiniband/hw/bnxt_re/main.c   | 6 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c| 5 -
 drivers/infiniband/hw/cxgb4/provider.c | 5 -
 drivers/infiniband/hw/hns/hns_roce_main.c  | 4 +++-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c  | 7 ++-
 drivers/infiniband/hw/mlx4/main.c  | 7 ++-
 drivers/infiniband/hw/mlx5/main.c  | 4 +++-
 drivers/infiniband/hw/mthca/mthca_provider.c   | 5 -
 drivers/infiniband/hw/nes/nes_verbs.c  | 6 +-
 drivers/infiniband/hw/ocrdma/ocrdma_main.c | 7 ++-
 drivers/infiniband/hw/qedr/main.c  | 4 +++-
 drivers/infiniband/hw/usnic/usnic_ib_main.c| 5 -
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c | 9 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.c  | 5 -
 15 files changed, 63 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 3270cde6d806..a9bc2a3f490c 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -518,11 +518,6 @@ int ib_register_device(struct ib_device *device,

mutex_lock(_mutex);

-   if (strchr(device->name, '%')) {
-   ret = alloc_name(device->name);
-   if (ret)
-   goto out;
-
if (ib_device_check_mandatory(device)) {
ret = -EINVAL;
goto out;
diff --git a/drivers/infiniband/hw/bnxt_re/main.c 
b/drivers/infiniband/hw/bnxt_re/main.c
index 20b9f31052bf..5255e5ffd6f4 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -575,11 +575,15 @@ static void bnxt_re_unregister_ib(struct bnxt_re_dev 
*rdev)
 static int bnxt_re_register_ib(struct bnxt_re_dev *rdev)
 {
struct ib_device *ibdev = >ibdev;
+   int ret;

/* ib device init */
ibdev->owner = THIS_MODULE;
ibdev->node_type = RDMA_NODE_IB_CA;
-   strlcpy(ibdev->name, "bnxt_re%d", IB_DEVICE_NAME_MAX);
+   ret = ib_device_alloc_name(ibdev, "bnxt_re%d");
+   if (ret)
+   return ret;
+
strlcpy(ibdev->node_desc, BNXT_RE_DESC " HCA",
strlen(BNXT_RE_DESC) + 5);
ibdev->phys_port_cnt = 1;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 1b9ff21aa1d5..3680d80036eb 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1319,7 +1319,10 @@ int iwch_register_device(struct iwch_dev *dev)
int i;

pr_debug("%s iwch_dev %p\n", __func__, dev);
-   strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX);
+   ret = ib_device_alloc_name(>ibdev, "cxgb3_%d");
+   if (ret)
+   return ret;
+
memset(>ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
memcpy(>ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
dev->ibdev.owner = THIS_MODULE;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c 
b/drivers/infiniband/hw/cxgb4/provider.c
index 4eda6872e617..37d40cbdf0dc 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -535,7 +535,10 @@ void c4iw_register_device(struct work_struct *work)
struct c4iw_dev *dev = ctx->dev;

pr_debug("c4iw_dev %p\n", dev);
-   strlcpy(dev->ibdev.name, "cxgb4_%d", IB_DEVICE_NAME_MAX);
+   ret = ib_device_alloc_name(>ibdev, "cxgb4_%d");
+   if (ret)
+   return;
+
memset(>ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
memcpy(>ibdev.node_guid, dev->rdev.lldi.ports[0]->dev_addr, 6);
dev->ibdev.owner = THIS_MODULE;
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c 
b/drivers/infiniband/hw/hns/hns_roce_main.c
index 6edb547baee8..e16f72692c83 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -449,7 +449,9 @@ static int hns_roce_register_device(struct hns_roce_dev 
*hr_dev)
spin_lock_init(>lock);

ib_dev = _dev->ib_dev;
-   strlcpy(ib_dev->name, "hns_%d", IB_DEVICE_NAME_MAX);
+   ret = ib_device_alloc_name(ib_dev, "hns_%d");
+   if (ret)
+   return ret;

ib_dev->owner   = THIS_MODULE;
ib_dev->node_type   = RDMA_NODE_IB_CA;
diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c 
b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
index e2e6c74a7452..829be11da4e7 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
@@ -2746,13 +2746,18 @@ static struct i40iw_ib_device 
*i40iw_init_rdma_device(struct i40iw_device *iwdev
struct i40iw_ib_device *iwibdev;
struct net_device *netdev = iwdev->netdev;

[PATCH rdma-next 2/5] net/smc: Use IB device index instead of name

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

The IB device name is not stable and will become renamable in the
following patches, so update the SMC code to use the IB device index as
a stable identifier.

Signed-off-by: Leon Romanovsky 
---
 net/smc/smc_diag.c |  6 +++---
 net/smc/smc_pnet.c | 27 ---
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index dbf64a93d68a..ba8b5a5671ec 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -156,9 +156,9 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff 
*skb,
.lnk[0].link_id = smc->conn.lgr->lnk[0].link_id,
};

-   memcpy(linfo.lnk[0].ibname,
-  smc->conn.lgr->lnk[0].smcibdev->ibdev->name,
-  sizeof(smc->conn.lgr->lnk[0].smcibdev->ibdev->name));
+   ib_device_get_name(smc->conn.lgr->lnk[0].smcibdev->ibdev,
+  linfo.lnk[0].ibname);
+
smc_gid_be16_convert(linfo.lnk[0].gid,
 smc->conn.lgr->lnk[0].gid);
smc_gid_be16_convert(linfo.lnk[0].peer_gid,
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index 01c6ce042a1c..67aec5ce6112 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -70,15 +70,14 @@ struct smc_pnetentry {
u8 ib_port;
 };

-/* Check if two RDMA device entries are identical. Use device name and port
+/* Check if two RDMA device entries are identical. Use device index and port
  * number for comparison.
  */
-static bool smc_pnet_same_ibname(struct smc_pnetentry *pnetelem, char *ibname,
-u8 ibport)
+static bool smc_pnet_same_ibindex(struct smc_pnetentry *pnetelem, u32 ibindex,
+ u8 ibport)
 {
return pnetelem->ib_port == ibport &&
-  !strncmp(pnetelem->smcibdev->ibdev->name, ibname,
-   sizeof(pnetelem->smcibdev->ibdev->name));
+  pnetelem->smcibdev->ibdev->index == ibindex;
 }

 /* Find a pnetid in the pnet table.
@@ -179,9 +178,9 @@ static int smc_pnet_enter(struct smc_pnetentry 
*new_pnetelem)
 sizeof(new_pnetelem->pnet_name)) ||
!strncmp(pnetelem->ndev->name, new_pnetelem->ndev->name,
 sizeof(new_pnetelem->ndev->name)) ||
-   smc_pnet_same_ibname(pnetelem,
-new_pnetelem->smcibdev->ibdev->name,
-new_pnetelem->ib_port)) {
+   smc_pnet_same_ibindex(pnetelem,
+ new_pnetelem->smcibdev->ibdev->index,
+ new_pnetelem->ib_port)) {
dev_put(pnetelem->ndev);
goto found;
}
@@ -227,10 +226,11 @@ static struct smc_ib_device *smc_pnet_find_ib(char 
*ib_name)

spin_lock(_ib_devices.lock);
list_for_each_entry(ibdev, _ib_devices.list, list) {
-   if (!strncmp(ibdev->ibdev->name, ib_name,
-sizeof(ibdev->ibdev->name))) {
+   char name[IB_DEVICE_NAME_MAX] = {};
+
+   ib_device_get_name(ibdev->ibdev, name);
+   if (!strncmp(name, ib_name, IB_DEVICE_NAME_MAX))
goto out;
-   }
}
ibdev = NULL;
 out:
@@ -267,6 +267,11 @@ static int smc_pnet_fill_entry(struct net *net, struct smc_pnetentry *pnetelem,
goto error;

rc = -EINVAL;
+   /* NOTE !!!: Sadly enough, this is part of the ABI.
+    * From day one, the accesses are performed with device names and not
+    * device indexes for both ETH and IB. It means that this function isn't
+    * reliable after device renaming.
+    */
if (!tb[SMC_PNETID_IBNAME])
goto error;
rc = -ENOENT;
--
2.14.4



[PATCH rdma-next 1/5] RDMA/core: Provide getter and setter to access IB device name

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Prepare the IB device name field for rename operations by ensuring that
all accesses to it are protected with a lock and that users don't see a
partially written name.

The protection is done with the global device lock because it is already
taken in the allocation and deallocation phases. At this stage the lock
is not contended, and it can easily be made per-device once that is needed.

Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/device.c | 24 +++-
 include/rdma/ib_verbs.h  |  8 +++-
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 5a680a88aa87..3270cde6d806 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -170,6 +170,14 @@ static struct ib_device *__ib_device_get_by_name(const char *name)
return NULL;
 }

+void ib_device_get_name(struct ib_device *ibdev, char *name)
+{
+   down_read(_rwsem);
+   strlcpy(name, ibdev->name, IB_DEVICE_NAME_MAX);
+   up_read(_rwsem);
+}
+EXPORT_SYMBOL(ib_device_get_name);
+
 static int alloc_name(char *name)
 {
unsigned long *inuse;
@@ -202,6 +210,21 @@ static int alloc_name(char *name)
return 0;
 }

+int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern)
+{
+   int ret = 0;
+
+   mutex_lock(&device_mutex);
+   strlcpy(ibdev->name, pattern, IB_DEVICE_NAME_MAX);
+   if (strchr(ibdev->name, '%'))
+   ret = alloc_name(ibdev->name);
+
+   mutex_unlock(&device_mutex);
+
+   return ret;
+}
+EXPORT_SYMBOL(ib_device_alloc_name);
+
 static void ib_device_release(struct device *device)
 {
struct ib_device *dev = container_of(device, struct ib_device, dev);
@@ -499,7 +522,6 @@ int ib_register_device(struct ib_device *device,
ret = alloc_name(device->name);
if (ret)
goto out;
-   }

if (ib_device_check_mandatory(device)) {
ret = -EINVAL;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e764ed1f6025..0e7b9854 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2260,6 +2260,11 @@ struct ib_device {
/* Do not access @dma_device directly from ULP nor from HW drivers. */
struct device*dma_device;

+   /*
+* Do not access @name directly,
+* use ib_device_get_name()/ib_device_alloc_name()
+* and don't assume that it can't change after access.
+*/
char  name[IB_DEVICE_NAME_MAX];

struct list_head  event_handler_list;
@@ -2638,7 +2643,8 @@ struct ib_device *ib_alloc_device(size_t size);
 void ib_dealloc_device(struct ib_device *device);

 void ib_get_device_fw_str(struct ib_device *device, char *str);
-
+int ib_device_alloc_name(struct ib_device *ibdev, const char *pattern);
+void ib_device_get_name(struct ib_device *ibdev, char *name);
 int ib_register_device(struct ib_device *device,
   int (*port_callback)(struct ib_device *,
u8, struct kobject *));
--
2.14.4



[PATCH rdma-next 4/5] RDMA/core: Implement IB device rename function

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Generic implementation of IB device rename function.

Signed-off-by: Leon Romanovsky 
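A minimal user-space model of the rename logic in this patch may help review it: reject a name already taken by another device, otherwise copy the new name in. The `model_*` names and `MODEL_*` constants are made up for the sketch (the real code walks the device list under the global mutex and also calls device_rename()):

```c
#include <assert.h>
#include <string.h>

#define MODEL_NAME_MAX 64
#define MODEL_EEXIST 17

struct model_ibdev { char name[MODEL_NAME_MAX]; };

static int model_device_rename(struct model_ibdev *devs, int ndevs,
			       struct model_ibdev *target, const char *name)
{
	int i;

	/* renaming to the current name is a no-op */
	if (!strcmp(name, target->name))
		return 0;

	/* the new name must be unique across all devices */
	for (i = 0; i < ndevs; i++)
		if (!strcmp(name, devs[i].name))
			return -MODEL_EEXIST;

	strncpy(target->name, name, MODEL_NAME_MAX - 1);
	target->name[MODEL_NAME_MAX - 1] = '\0';
	return 0;
}
```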
---
 drivers/infiniband/core/core_priv.h |  1 +
 drivers/infiniband/core/device.c| 23 +++
 2 files changed, 24 insertions(+)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index d7399d5b1cb6..c5881756b799 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -87,6 +87,7 @@ int  ib_device_register_sysfs(struct ib_device *device,
  int (*port_callback)(struct ib_device *,
   u8, struct kobject *));
 void ib_device_unregister_sysfs(struct ib_device *device);
+int ib_device_rename(struct ib_device *ibdev, const char *name);

 typedef void (*roce_netdev_callback)(struct ib_device *device, u8 port,
  struct net_device *idev, void *cookie);
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index a9bc2a3f490c..36b79cd6b620 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -178,6 +178,29 @@ void ib_device_get_name(struct ib_device *ibdev, char *name)
 }
 EXPORT_SYMBOL(ib_device_get_name);

+int ib_device_rename(struct ib_device *ibdev, const char *name)
+{
+   struct ib_device *device;
+   int ret = 0;
+
+   if (!strcmp(name, ibdev->name))
+   return ret;
+
+   mutex_lock(&device_mutex);
+   list_for_each_entry(device, &device_list, core_list) {
+   if (!strcmp(name, device->name)) {
+   ret = -EEXIST;
+   goto out;
+   }
+   }
+
+   strlcpy(ibdev->name, name, IB_DEVICE_NAME_MAX);
+   ret = device_rename(&ibdev->dev, name);
+out:
+   mutex_unlock(&device_mutex);
+   return ret;
+}
+
 static int alloc_name(char *name)
 {
unsigned long *inuse;
--
2.14.4



[PATCH rdma-next 5/5] RDMA/nldev: Allow IB device rename through RDMA netlink

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Provide an option to rename IB device name through RDMA netlink and
limit it to users with ADMIN capability only.

Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/nldev.c  | 33 +
 include/uapi/rdma/rdma_netlink.h |  3 ++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 0385ab438320..64e1d12e9678 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -645,6 +645,35 @@ static int nldev_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
return err;
 }

+static int nldev_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
+ struct netlink_ext_ack *extack)
+{
+   struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
+   struct ib_device *device;
+   u32 index;
+   int err;
+
+   err = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1, nldev_policy,
+ extack);
+   if (err || !tb[RDMA_NLDEV_ATTR_DEV_INDEX])
+   return -EINVAL;
+
+   index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+   device = ib_device_get_by_index(index);
+   if (!device)
+   return -EINVAL;
+
+   if (tb[RDMA_NLDEV_ATTR_DEV_NAME]) {
+   char name[IB_DEVICE_NAME_MAX] = {};
+
+   nla_strlcpy(name, tb[RDMA_NLDEV_ATTR_DEV_NAME],
+   IB_DEVICE_NAME_MAX);
+   return ib_device_rename(device, name);
+   }
+
+   return 0;
+}
+
 static int _nldev_get_dumpit(struct ib_device *device,
 struct sk_buff *skb,
 struct netlink_callback *cb,
@@ -1077,6 +1106,10 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
.doit = nldev_get_doit,
.dump = nldev_get_dumpit,
},
+   [RDMA_NLDEV_CMD_SET] = {
+   .doit = nldev_set_doit,
+   .flags = RDMA_NL_ADMIN_PERM,
+   },
[RDMA_NLDEV_CMD_PORT_GET] = {
.doit = nldev_port_get_doit,
.dump = nldev_port_get_dumpit,
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index edba6351ac13..f9c41bf59efc 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -227,8 +227,9 @@ enum rdma_nldev_command {
RDMA_NLDEV_CMD_UNSPEC,

RDMA_NLDEV_CMD_GET, /* can dump */
+   RDMA_NLDEV_CMD_SET,

-   /* 2 - 4 are free to use */
+   /* 3 - 4 are free to use */

RDMA_NLDEV_CMD_PORT_GET = 5, /* can dump */

--
2.14.4



[PATCH rdma-next 0/5] IB device rename support

2018-09-20 Thread Leon Romanovsky
From: Leon Romanovsky 

Hi,

This series introduces a long-awaited feature - "IB device rename".
It gives an option to rename the user-visible IB device name from the
vendor-specific name (e.g. mlx5_0) to anything else.
The user space component through rdmatool will follow this series.

[leonro@server /]$ lspci |grep -i Ether
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:09.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
[leonro@server /]$ sudo rdma dev
1: mlx5_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 sys_image_guid 5254:00c0:fe12:3455
[leonro@server /]$ sudo rdma dev set mlx5_0 name hfi1_0
[leonro@server /]$ sudo rdma dev
1: hfi1_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455 sys_image_guid 5254:00c0:fe12:3455

The first patch introduces a getter/setter to access names. I didn't
convert all the drivers to stop using the name directly, because they
don't base any decisions on the name and only use it for printing; at
worst they can print a truncated name if renaming happens at the same
time as logging.

The second patch updates SMC to use the IB device index instead of the
name.

The third patch globally converts all drivers to the new name
allocation routine.

The fourth and fifth patches implement the rename routine and export
it through RDMA netlink.

It uses the exported device_rename() function; despite the comment
from 2010 warning about the downsides of this function, netdev still
uses it, so we will too.

There is one patch/series which was dropped from this submission -
conversion of SELinux from being IB-device-name based to being
IB-device-index based. It simply needs more special care and more testing.

This series was tested with mlx5 devices with/without traffic and with
non-modified rdma-core.

Dennis,
I didn't touch hfi1, but I'm not sure if it is needed.

Thanks

Leon Romanovsky (5):
  RDMA/core: Provide getter and setter to access IB device name
  net/smc: Use IB device index instead of name
  RDMA: Convert IB drivers to name allocation routine
  RDMA/core: Implement IB device rename function
  RDMA/nldev: Allow IB device rename through RDMA netlink

 drivers/infiniband/core/core_priv.h|  1 +
 drivers/infiniband/core/device.c   | 52 +++---
 drivers/infiniband/core/nldev.c| 33 
 drivers/infiniband/hw/bnxt_re/main.c   |  6 ++-
 drivers/infiniband/hw/cxgb3/iwch_provider.c|  5 ++-
 drivers/infiniband/hw/cxgb4/provider.c |  5 ++-
 drivers/infiniband/hw/hns/hns_roce_main.c  |  4 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c  |  7 +++-
 drivers/infiniband/hw/mlx4/main.c  |  7 +++-
 drivers/infiniband/hw/mlx5/main.c  |  4 +-
 drivers/infiniband/hw/mthca/mthca_provider.c   |  5 ++-
 drivers/infiniband/hw/nes/nes_verbs.c  |  6 ++-
 drivers/infiniband/hw/ocrdma/ocrdma_main.c |  7 +++-
 drivers/infiniband/hw/qedr/main.c  |  4 +-
 drivers/infiniband/hw/usnic/usnic_ib_main.c|  5 ++-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c |  9 +++--
 drivers/infiniband/sw/rxe/rxe_verbs.c  |  5 ++-
 include/rdma/ib_verbs.h|  8 +++-
 include/uapi/rdma/rdma_netlink.h   |  3 +-
 net/smc/smc_diag.c |  6 +--
 net/smc/smc_pnet.c | 27 +++--
 21 files changed, 171 insertions(+), 38 deletions(-)

--
2.14.4



Re: [PATCH iproute2 v2 0/3] testsuite: make alltests fixes

2018-09-20 Thread Luca Boccassi
On Thu, 2018-09-20 at 01:36 +0200, Petr Vorel wrote:
> Hi,
> 
> here are simply fixes to restore 'make alltests'.
> Currently it does not run.
> 
> Kind regards,
> Petr
> 
> Petr Vorel (3):
>   testsuite: Fix missing generate_nlmsg
>   testsuite: Generate generate_nlmsg when needed
>   testsuite: Warn about empty $(IPVERS)
> 
>  testsuite/Makefile | 21 ++---
>  1 file changed, 14 insertions(+), 7 deletions(-)

Series-tested-by: Luca Boccassi 

-- 
Kind regards,
Luca Boccassi



[PATCH net] sctp: update dst pmtu with the correct daddr

2018-09-20 Thread Xin Long
When processing a pmtu update from an icmp packet,
sctp_transport_update_pmtu() calls .update_pmtu with sk instead of skb.

However for sctp, the daddr in the transport might be different from
inet_sock->inet_daddr or sk->sk_v6_daddr, which is used to update or
create the route cache. The incorrect daddr will cause a different
route cache created for the path.

So before calling .update_pmtu, inet_sock->inet_daddr/sk->sk_v6_daddr
should be updated with the daddr in the transport, and update it back
after it's done.

The issue has existed since the introduction of route exceptions.

Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
Reported-by: ian.per...@dialogic.com
Signed-off-by: Xin Long 
---
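As a side note, the save/swap/restore of the socket's daddr around the
.update_pmtu call can be sketched in a tiny user-space model. The
`model_*` types below are hypothetical stand-ins, not the real kernel
structures; the comments map each step to the patch:

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for sock/sctp_transport. */
struct model_sock { unsigned int daddr; };
struct model_transport { unsigned int daddr; struct model_sock *sk; };

/* Records which daddr the "route update" actually keyed on. */
static unsigned int updated_for_daddr;

static void model_update_pmtu(struct model_sock *sk, unsigned int pmtu)
{
	(void)pmtu;
	updated_for_daddr = sk->daddr;
}

/* Mirror of the fix: temporarily point the socket at the transport's
 * daddr so the route-cache update uses the right destination. */
static void model_transport_update_pmtu(struct model_transport *t,
					unsigned int pmtu)
{
	unsigned int saved = t->sk->daddr;	/* pf->af->from_sk(&addr, sk) */

	t->sk->daddr = t->daddr;		/* pf->to_sk_daddr(&t->ipaddr, sk) */
	model_update_pmtu(t->sk, pmtu);
	t->sk->daddr = saved;			/* pf->to_sk_daddr(&addr, sk) */
}
```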
 net/sctp/transport.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index 12cac85..033696e 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -260,6 +260,7 @@ void sctp_transport_pmtu(struct sctp_transport *transport, struct sock *sk)
 bool sctp_transport_update_pmtu(struct sctp_transport *t, u32 pmtu)
 {
struct dst_entry *dst = sctp_transport_dst_check(t);
+   struct sock *sk = t->asoc->base.sk;
bool change = true;
 
if (unlikely(pmtu < SCTP_DEFAULT_MINSEGMENT)) {
@@ -271,12 +272,19 @@ bool sctp_transport_update_pmtu(struct sctp_transport *t, u32 pmtu)
pmtu = SCTP_TRUNC4(pmtu);
 
if (dst) {
-   dst->ops->update_pmtu(dst, t->asoc->base.sk, NULL, pmtu);
+   struct sctp_pf *pf = sctp_get_pf_specific(dst->ops->family);
+   union sctp_addr addr;
+
+   pf->af->from_sk(&addr, sk);
+   pf->to_sk_daddr(&t->ipaddr, sk);
+   dst->ops->update_pmtu(dst, sk, NULL, pmtu);
+   pf->to_sk_daddr(&addr, sk);
+
dst = sctp_transport_dst_check(t);
}
 
if (!dst) {
-   t->af_specific->get_dst(t, &t->saddr, &t->fl, t->asoc->base.sk);
+   t->af_specific->get_dst(t, &t->saddr, &t->fl, sk);
dst = t->dst;
}
 
-- 
2.1.0



Re: [PATCH bpf-next 2/2] xsk: fix bug when trying to use both copy and zero-copy on one queue id

2018-09-20 Thread Magnus Karlsson
On Wed, Sep 19, 2018 at 9:18 AM Magnus Karlsson
 wrote:
>
> On Wed, Sep 19, 2018 at 3:58 AM Jakub Kicinski
>  wrote:
> >
> > On Tue, 18 Sep 2018 10:22:11 -0700, Y Song wrote:
> > > > +/* The umem is stored both in the _rx struct and the _tx struct as we 
> > > > do
> > > > + * not know if the device has more tx queues than rx, or the opposite.
> > > > + * This might also change during run time.
> > > > + */
> > > > +static void xdp_reg_umem_at_qid(struct net_device *dev, struct 
> > > > xdp_umem *umem,
> > > > +   u16 queue_id)
> > > > +{
> > > > +   if (queue_id < dev->real_num_rx_queues)
> > > > +   dev->_rx[queue_id].umem = umem;
> > > > +   if (queue_id < dev->real_num_tx_queues)
> > > > +   dev->_tx[queue_id].umem = umem;
> > > > +}
> > > > +
> > > > +static struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
> > > > + u16 queue_id)
> > > > +{
> > > > +   if (queue_id < dev->real_num_rx_queues)
> > > > +   return dev->_rx[queue_id].umem;
> > > > +   if (queue_id < dev->real_num_tx_queues)
> > > > +   return dev->_tx[queue_id].umem;
> > > > +
> > > > +   return NULL;
> > > > +}
> > > > +
> > > > +static void xdp_clear_umem_at_qid(struct net_device *dev, u16 queue_id)
> > > > +{
> > > > +   /* Zero out the entry independent on how many queues are 
> > > > configured
> > > > +* at this point in time, as it might be used in the future.
> > > > +*/
> > > > +   if (queue_id < dev->num_rx_queues)
> > > > +   dev->_rx[queue_id].umem = NULL;
> > > > +   if (queue_id < dev->num_tx_queues)
> > > > +   dev->_tx[queue_id].umem = NULL;
> > > > +}
> > > > +
> > >
> > > I am sure whether the following scenario can happen or not.
> > > Could you clarify?
> > >1. suppose initially we have num_rx_queues = num_tx_queues = 10
> > >xdp_reg_umem_at_qid() set umem1 to queue_id = 8
> > >2. num_tx_queues is changed to 5
> > >3. xdp_clear_umem_at_qid() is called for queue_id = 8,
> > >and dev->_rx[8].umum = 0.
> > >4. xdp_reg_umem_at_qid() is called gain to set for queue_id = 8
> > >dev->_rx[8].umem = umem2
> > >5. num_tx_queues is changed to 10
> > >   Now dev->_rx[8].umem != dev->_tx[8].umem, is this possible and is it
> > > a problem?
> >
> > Plus IIRC the check of qid vs real_num_[rt]x_queues in xsk_bind() is
> > not under rtnl_lock so it doesn't count for much.  Why not do all the
> > checks against num_[rt]x_queues here, instead of real_..?
>
> You are correct, two separate rtnl_lock regions is broken. Will spin a
> v2 tomorrow when I am back in the office.
>
> Thanks Jakub for catching this. I really appreciate you reviewing my code.

Sorry, forgot to answer your question about why real_num_ instead of
num_. If we used num_ instead, we would get a behavior where we can
open a socket on a queue id, let us say 10, that will not be active if
real_num_ is below 10. So no traffic will flow on this socket until we
issue an ethtool command to state that we have 10 or more queues
configured. While this behavior is sane (and consistent), I believe it
will lead to a more complicated driver implementation, break the
current uapi (since we will return an error if you try to bind to a
queue id that is not active at the moment) and I like my suggestion
below better :-).

What I would like to suggest is that the current model at bind() is
kept, that is, it will fail if you try to bind to a non-active queue
id. But we add some code in ethtool_set_channels in net/core/ethtool.c
to check if there is an active af_xdp socket on a queue id that the
change would make inactive; if so, we return an error and disallow the
change. This would IMHO be like not allowing a module to be unloaded
while someone is using it, or not allowing a file system to be
unmounted while files are in use. What do you think? If you think this
is the right way to
go, I can implement this in a follow up patch or in this patch set if
that makes more sense, let me know. This suggestion would actually
make it simpler to implement zero-copy support in the driver. Some of
the complications we are having today is due to the fact that ethtool
can come in from the side and change things for us when an AF_XDP
socket is active on a queue id. And implementing zero copy support in
the driver needs to get simpler IMO.
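As a rough illustration of the proposed check (all names here are made
up for the sketch; the real hook would live in ethtool_set_channels()
and return -EBUSY):

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EBUSY 16

struct model_dev {
	bool xsk_active[64];		/* per-queue-id AF_XDP socket flag */
	unsigned int real_num_queues;
};

/* Reject a channel-count change that would deactivate a queue id
 * with an active AF_XDP socket bound to it. */
static int model_set_channels(struct model_dev *dev, unsigned int new_count)
{
	unsigned int q;

	for (q = new_count; q < dev->real_num_queues; q++)
		if (dev->xsk_active[q])
			return -MODEL_EBUSY;

	dev->real_num_queues = new_count;
	return 0;
}
```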

Please let me know what you think.

Thanks: Magnus

> /Magnus


[PATCH net-next 0/3] Refactorings on af_inet pernet initialization

2018-09-20 Thread Kirill Tkhai
This patch set makes several cleanups around inet_init_net().

---

Cong Wang (1):
  ipv4: initialize ra_mutex in inet_init_net()

Kirill Tkhai (2):
  net: Remove inet_exit_net()
  net: Register af_inet_ops earlier


 net/core/net_namespace.c |1 -
 net/ipv4/af_inet.c   |   13 +
 2 files changed, 5 insertions(+), 9 deletions(-)

--
Signed-off-by: Kirill Tkhai 


[PATCH 2/3] net: Register af_inet_ops earlier

2018-09-20 Thread Kirill Tkhai
This function just initializes locks and defaults.
Let's register it before the other pernet operations,
since some of them may potentially rely on that.

Signed-off-by: Kirill Tkhai 
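The ordering concern behind this reshuffle can be illustrated with a toy model of setup_net(): when a later pernet init fails, the exit methods of the already-initialized ops are run, so state set up by earlier-registered ops must already be usable. All names below are illustrative, not the real pernet_operations machinery:

```c
#include <assert.h>
#include <stddef.h>

struct model_ops {
	int (*init)(void);
	void (*exit)(void);
};

/* Run inits in registration order; on failure, unwind the ops that
 * already succeeded (skipping the one that failed). */
static int model_setup_net(struct model_ops *ops, int n)
{
	int i, err = 0;

	for (i = 0; i < n; i++) {
		if (!ops[i].init)
			continue;
		err = ops[i].init();
		if (err)
			goto undo;
	}
	return 0;
undo:
	for (i--; i >= 0; i--)
		if (ops[i].exit)
			ops[i].exit();
	return err;
}

/* demo ops used in the example below */
static int ok_init_calls, exit_calls;
static int ok_init(void) { ok_init_calls++; return 0; }
static void ok_exit(void) { exit_calls++; }
static int failing_init(void) { return -1; }
```

With ops registered as {ok, ok, failing}, the failing third init causes both earlier exits to run, which is exactly the window the discussion above worries about.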
---
 net/ipv4/af_inet.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f4ecbe0aaf1a..bbd3a072ffea 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1938,6 +1938,9 @@ static int __init inet_init(void)
	for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
inet_register_protosw(q);
 
+   if (init_inet_pernet_ops())
+   pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__);
+
/*
 *  Set the ARP module up
 */
@@ -1975,9 +1978,6 @@ static int __init inet_init(void)
if (ip_mr_init())
pr_crit("%s: Cannot init ipv4 mroute\n", __func__);
 #endif
-
-   if (init_inet_pernet_ops())
-   pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__);
/*
 *  Initialise per-cpu ipv4 mibs
 */



[PATCH 3/3] ipv4: initialize ra_mutex in inet_init_net()

2018-09-20 Thread Kirill Tkhai
From: Cong Wang 

ra_mutex is a IPv4 specific mutex, it is inside struct netns_ipv4,
but its initialization is in the generic netns code, setup_net().

Move it to IPv4 specific net init code, inet_init_net().

Fixes: d9ff3049739e ("net: Replace ip_ra_lock with per-net mutex")
Signed-off-by: Cong Wang 
Acked-by: Kirill Tkhai 
---
 net/core/net_namespace.c |1 -
 net/ipv4/af_inet.c   |2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 670c84b1bfc2..b272ccfcbf63 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -308,7 +308,6 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
net->user_ns = user_ns;
	idr_init(&net->netns_ids);
	spin_lock_init(&net->nsid_lock);
-   mutex_init(&net->ipv4.ra_mutex);
 
list_for_each_entry(ops, _list, list) {
error = ops_init(ops, net);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index bbd3a072ffea..d4623144e237 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1818,6 +1818,8 @@ static __net_init int inet_init_net(struct net *net)
net->ipv4.sysctl_igmp_llm_reports = 1;
net->ipv4.sysctl_igmp_qrv = 2;
 
+   mutex_init(&net->ipv4.ra_mutex);
+
return 0;
 }
 



[PATCH 1/3] net: Remove inet_exit_net()

2018-09-20 Thread Kirill Tkhai
This function does nothing, and since ops_exit_list()
checks for NULL ->exit method, we do not need stub here.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/af_inet.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1fbe2f815474..f4ecbe0aaf1a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1821,13 +1821,8 @@ static __net_init int inet_init_net(struct net *net)
return 0;
 }
 
-static __net_exit void inet_exit_net(struct net *net)
-{
-}
-
 static __net_initdata struct pernet_operations af_inet_ops = {
.init = inet_init_net,
-   .exit = inet_exit_net,
 };
 
 static int __init init_inet_pernet_ops(void)



Re: [Patch net-next] ipv4: initialize ra_mutex in inet_init_net()

2018-09-20 Thread Kirill Tkhai
On 20.09.2018 0:28, Cong Wang wrote:
> On Wed, Sep 19, 2018 at 1:25 AM Kirill Tkhai  wrote:
>>
>> On 18.09.2018 23:17, Cong Wang wrote:
>>> On Mon, Sep 17, 2018 at 12:25 AM Kirill Tkhai  wrote:
 In inet_init() the order of registration is:

 ip_mr_init();
 init_inet_pernet_ops();

 This means, ipmr_net_ops pernet operations are before af_inet_ops
 in pernet_list. So, there is a theoretical probability, sometimes
 in the future, we will have a problem during a fail of net initialization.

 Say,

 setup_net():
 ipmr_net_ops->init() returns 0
 xxx->init()  returns error
 and then we do:
 ipmr_net_ops->exit(),

 which could touch ra_mutex (theoretically).
>>>
>>> How could ra_mutex be touched in this scenario?
>>>
>>> ra_mutex is only used in ip_ra_control() which is called
>>> only by {get,set}sockopt(). I don't see anything related
>>> to netns exit() path here.
>>
>> Currently, it is not touched. But it's an ordinary practice,
>> someone closes sockets in pernet ->exit methods. For example,
>> we close percpu icmp sockets in icmp_sk_exit(), which are
>> also of RAW type, and there is also called ip_ra_control()
>> for them. Yes, they differ by their protocol; icmp sockets
>> are of IPPROTO_ICMP protocol, while ip_ra_control() acts
>> on IPPROTO_RAW sockets, but it's not good anyway. This does
>> not look reliable for the future. In case of someone changes
>> something here, we may do not notice this for the long time,
>> while some users will meet bugs on their places.
> 
> First of all, we only consider current code base. Even if you
> really planned to changed this in the future, it would be still your
> responsibility to take care of it. Why? Simple, FIFO. My patch
> comes ahead of any future changes here, obviously.
> 
> Secondly, it is certainly not hard to notice. I am pretty sure
> you would get a warning/crash if this bug triggered.
> 
> 
> 
>>
>> Problems on error paths is not easy to detect on testing,
>> while user may meet them. We had issue of same type with
>> uninitialized xfrm_policy_lock. It was introduced in 2013,
>> while the problem was found only in 2017:
> 
> Been there, done that, I've fixed multiple IPv6 init code
> error path. This is not where we disagree, by the way.
> 
> 
>>
>> introduced by 283bc9f35bbb
>> fixed  by c28a45cb
>>
>> (Last week I met it on RH7 kernel, which still has no a fix.
>>  But this talk is not about distribution kernels, just about
>>  the way).
>>
>> I just want to say if someone makes some creativity on top
>> of this code, it will be to more friendly from us to him/her
>> to not force this person to think about such not obvious details,
>> but just to implement nice architecture right now.
> 
> You keep saying nice architecture, how did you architect
> ipv4.ra_mutex into net/core/net_namespace.c? It is apparently
> not nice.
> 
> Good luck with your future.
> 
> I am tired, let's just drop it. I have no interest to fix your mess.

You added me to CC, so you probably want to know my opinion about this.
Since it's not a fix for a real problem, but just a refactoring, I'm
giving you my opinion on how this refactoring may be made better. If
you don't want to know my opinion, you may consider not CCing me.

Just this; no reason to take offense.

Thanks,
Kirill




[PATCH net-next 0/5] vrf: allow simultaneous service instances in default and other VRFs

2018-09-20 Thread Mike Manning
Services currently have to be VRF-aware if they are using an unbound
socket. One cannot have multiple service instances running in the
default and other VRFs for services that are not VRF-aware and listen
on an unbound socket. This is because there is no way of isolating
packets received in the default VRF from those arriving in other VRFs.

This series provides this isolation subject to the existing kernel
parameter net.ipv4.tcp_l3mdev_accept not being set, given that this is
documented as allowing a single service instance to work across all
VRF domains. The functionality applies to UDP & TCP services, for IPv4
and IPv6, in particular adding VRF table handling for IPv6 multicast.

Example of running ssh instances in default and blue VRF:

$ /usr/sbin/sshd -D
$ ip vrf exec vrf-blue /usr/sbin/sshd
$ ss -ta | egrep 'State|ssh'
State   Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN  0128   0.0.0.0%vrf-blue:ssh 0.0.0.0:*
LISTEN  01280.0.0.0:ssh 0.0.0.0:*
ESTAB   00  192.168.122.220:ssh   192.168.122.1:50282
LISTEN  0128  [::]%vrf-blue:ssh[::]:*
LISTEN  0128   [::]:ssh[::]:*
ESTAB   00   [3000::2]%vrf-blue:ssh   [3000::9]:45896
ESTAB   00[2000::2]:ssh   [2000::9]:46398

Dewi Morgan (1):
  ipv6: do not drop vrf udp multicast packets

Mike Manning (1):
  ipv6: allow link-local and multicast packets inside vrf

Patrick Ruddy (1):
  ipv6: add vrf table handling code for ipv6 mcast

Robert Shearman (2):
  net: allow binding socket in a VRF when there's an unbound socket
  ipv4: Allow sending multicast packets on specific i/f using VRF socket

 Documentation/networking/vrf.txt |  9 
 drivers/net/vrf.c| 30 
 include/net/inet6_hashtables.h   |  5 ++--
 include/net/inet_hashtables.h| 21 +++--
 include/net/inet_sock.h  | 13 +++
 net/core/sock.c  |  2 ++
 net/ipv4/datagram.c  |  2 +-
 net/ipv4/inet_connection_sock.c  | 13 ---
 net/ipv4/inet_hashtables.c   | 34 +---
 net/ipv4/ip_sockglue.c   |  3 +++
 net/ipv4/ping.c  |  2 +-
 net/ipv4/raw.c   |  6 ++---
 net/ipv4/udp.c   | 17 ++
 net/ipv6/datagram.c  |  5 +++-
 net/ipv6/inet6_hashtables.c  | 14 +---
 net/ipv6/ip6_input.c | 46 +
 net/ipv6/ip6mr.c | 49 ++--
 net/ipv6/ipv6_sockglue.c |  5 +++-
 net/ipv6/raw.c   |  6 ++---
 net/ipv6/udp.c   | 22 --
 20 files changed, 208 insertions(+), 96 deletions(-)

-- 
2.11.0



[PATCH net-next 2/5] ipv6: allow link-local and multicast packets inside vrf

2018-09-20 Thread Mike Manning
Packets that are multicast or to link-local addresses are not enslaved
to the vrf of the socket that they are received on. This is needed for
NDISC, but breaks applications that rely on receiving such packets when
in a VRF. Also, to make IPv6 consistent with IPv4, which does treat
multicast packets as enslaved, modify the VRF driver to do the
same for IPv6. As a result, the multicast address check needs to verify
the address against the enslaved rather than the l3mdev device.

Signed-off-by: Mike Manning 
---
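A simplified model of the delivery decision this series adds to ip6_mc_input() may help review: when the packet came in through an L3 slave (sdif != 0), the multicast-membership check must use the original slave device, not the VRF master that skb->dev points at. Types and names below are illustrative assumptions (the real code uses dev_get_by_index_rcu() under rcu_read_lock() and returns -ENODEV):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_netdev {
	int ifindex;
	bool joined;		/* device has the multicast group joined */
};

static struct model_netdev *model_dev_get_by_index(struct model_netdev *devs,
						   int n, int ifindex)
{
	int i;

	for (i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* Deliver iff the device the packet really arrived on joined the group. */
static bool model_mc_deliver(struct model_netdev *devs, int n,
			     struct model_netdev *skb_dev, int sdif)
{
	struct model_netdev *dev = skb_dev;

	if (sdif) {
		dev = model_dev_get_by_index(devs, n, sdif);
		if (!dev)
			return false;	/* -ENODEV in the real code */
	}
	return dev->joined;
}
```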
 drivers/net/vrf.c| 19 +--
 net/ipv6/ip6_input.c | 19 ++-
 net/ipv6/ipv6_sockglue.c |  2 +-
 3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index f93547f257fb..9d817c19f3b4 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -981,24 +981,23 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
   struct sk_buff *skb)
 {
int orig_iif = skb->skb_iif;
-   bool need_strict;
+   bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
+   bool is_ndisc = ipv6_ndisc_frame(skb);
 
-   /* loopback traffic; do not push through packet taps again.
-* Reset pkt_type for upper layers to process skb
+   /* loopback, multicast & non-ND link-local traffic; do not push through
+* packet taps again. Reset pkt_type for upper layers to process skb
 */
-   if (skb->pkt_type == PACKET_LOOPBACK) {
+   if (skb->pkt_type == PACKET_LOOPBACK || (need_strict && !is_ndisc)) {
skb->dev = vrf_dev;
skb->skb_iif = vrf_dev->ifindex;
IP6CB(skb)->flags |= IP6SKB_L3SLAVE;
-   skb->pkt_type = PACKET_HOST;
+   if (skb->pkt_type == PACKET_LOOPBACK)
+   skb->pkt_type = PACKET_HOST;
goto out;
}
 
-   /* if packet is NDISC or addressed to multicast or link-local
-* then keep the ingress interface
-*/
-   need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
-   if (!ipv6_ndisc_frame(skb) && !need_strict) {
+   /* if packet is NDISC then keep the ingress interface */
+   if (!is_ndisc) {
vrf_rx_stats(vrf_dev, skb->len);
skb->dev = vrf_dev;
skb->skb_iif = vrf_dev->ifindex;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..108f5f88ec98 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -432,15 +432,32 @@ EXPORT_SYMBOL_GPL(ip6_input);
 
 int ip6_mc_input(struct sk_buff *skb)
 {
+   int sdif = inet6_sdif(skb);
const struct ipv6hdr *hdr;
+   struct net_device *dev;
bool deliver;
 
__IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev),
 __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST,
 skb->len);
 
+   /* skb->dev passed may be master dev for vrfs. */
+   if (sdif) {
+   rcu_read_lock();
+   dev = dev_get_by_index_rcu(dev_net(skb->dev), sdif);
+   if (!dev) {
+   rcu_read_unlock();
+   kfree_skb(skb);
+   return -ENODEV;
+   }
+   } else {
+   dev = skb->dev;
+   }
+
hdr = ipv6_hdr(skb);
-   deliver = ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, NULL);
+   deliver = ipv6_chk_mcast_addr(dev, &hdr->daddr, NULL);
+   if (sdif)
+   rcu_read_unlock();
 
 #ifdef CONFIG_IPV6_MROUTE
/*
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 7dfbc797b130..4ebd395dd3df 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -486,7 +486,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
retv = -EFAULT;
break;
}
-   if (sk->sk_bound_dev_if && pkt.ipi6_ifindex != 
sk->sk_bound_dev_if)
+   if (!sk_dev_equal_l3scope(sk, pkt.ipi6_ifindex))
goto e_inval;
 
np->sticky_pktinfo.ipi6_ifindex = pkt.ipi6_ifindex;
-- 
2.11.0


