date:20170312

Re: [PATCH v3 1/2] net: sched: make default fifo qdiscs appear in the dump

2017-03-12 Thread David Miller

From: Jiri Kosina 
Date: Wed, 8 Mar 2017 16:03:32 +0100 (CET)

> From: Jiri Kosina 
> 
> The original reason [1] for having hidden qdiscs (potential scalability
> issues in qdisc_match_from_root() with single linked list in case of large
> amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
> qdisc linked list to hashtable").
> 
> This allows us to bring more clarity and determinism into the dump by
> making default pfifo qdiscs visible.
> 
> We're not turning this on by default though, as it was deemed [2] too
> intrusive / unnecessary change of default behavior towards userspace.
> Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
> applications to request complete qdisc hierarchy dump, including the
> ones that have always been implicit/invisible.
> 
> Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
> about singletons would require quite some surgery with very little gain
> (seeing no qdisc or seeing noop qdisc in the dump is probably setting
> the same user expectation).
> 
> [1] 
> http://lkml.kernel.org/r/1460732328.10638.74.ca...@edumazet-glaptop3.roam.corp.google.com
> [2] 
> http://lkml.kernel.org/r/20161021.105935.1907696543877061916.da...@davemloft.net
> 
> Signed-off-by: Jiri Kosina 

Applied, thanks Jiri.

Re: [PATCH net-next v2] net: ipv6: Add early demux handler for UDP unicast

2017-03-12 Thread David Miller

From: Subash Abhinov Kasiviswanathan 
Date: Wed,  8 Mar 2017 16:36:49 -0700

> While running a single stream UDPv6 test, we observed that amount
> of CPU spent in NET_RX softirq was much greater than UDPv4 for an
> equivalent receive rate. The test here was run on an ARM64 based
> Android system. On further analysis with perf, we found that UDPv6
> was spending significant time in the statistics netfilter targets
> which did socket lookup per packet. These statistics rules perform
> a lookup when there is no socket associated with the skb. Since
> there are multiple instances of these rules based on UID, there
> will be equal number of lookups per skb.
> 
> By introducing early demux for UDPv6, we avoid the redundant lookups.
> This also helped to improve the performance (800Mbps -> 870Mbps) on a
> CPU limited system in a single stream UDPv6 receive test with 1450
> byte sized datagrams using iperf.
> 
> v1->v2: Use IPv6 cookie to validate dst instead of 0 as suggested
> by Eric
> 
> Signed-off-by: Subash Abhinov Kasiviswanathan 

Applied, thanks.

[PATCHv3 4/4] rds: ib: unmap the scatter/gather list when error

2017-03-12 Thread Zhu Yanjun

When some errors occur, the scatter/gather list mapped to DMA addresses
should be handled.

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_fmr.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index c936b0d..86ef907 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -112,29 +112,39 @@ static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev,
u64 dma_addr = ib_sg_dma_address(dev, [i]);
 
if (dma_addr & ~PAGE_MASK) {
-   if (i > 0)
+   if (i > 0) {
+   ib_dma_unmap_sg(dev, sg, nents,
+   DMA_BIDIRECTIONAL);
return -EINVAL;
-   else
+   } else {
++page_cnt;
+   }
}
if ((dma_addr + dma_len) & ~PAGE_MASK) {
-   if (i < sg_dma_len - 1)
+   if (i < sg_dma_len - 1) {
+   ib_dma_unmap_sg(dev, sg, nents,
+   DMA_BIDIRECTIONAL);
return -EINVAL;
-   else
+   } else {
++page_cnt;
+   }
}
 
len += dma_len;
}
 
page_cnt += len >> PAGE_SHIFT;
-   if (page_cnt > ibmr->pool->fmr_attr.max_pages)
+   if (page_cnt > ibmr->pool->fmr_attr.max_pages) {
+   ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
return -EINVAL;
+   }
 
dma_pages = kmalloc_node(sizeof(u64) * page_cnt, GFP_ATOMIC,
 rdsibdev_to_node(rds_ibdev));
-   if (!dma_pages)
+   if (!dma_pages) {
+   ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
return -ENOMEM;
+   }
 
page_cnt = 0;
for (i = 0; i < sg_dma_len; ++i) {
@@ -147,8 +157,10 @@ static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev,
}
 
ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
-   if (ret)
+   if (ret) {
+   ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
goto out;
+   }
 
/* Success - we successfully remapped the MR, so we can
 * safely tear down the old mapping.
-- 
2.7.4

[PATCHv3 3/4] rds: ib: add the static type to the function

2017-03-12 Thread Zhu Yanjun

The function rds_ib_map_fmr is used only in the ib_fmr.c
file. As such, the static type is added to limit it in this file.

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_fmr.c | 5 +++--
 net/rds/ib_mr.h  | 2 --
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 249ae1c..c936b0d 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -84,8 +84,9 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
return ERR_PTR(err);
 }
 
-int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
-  struct scatterlist *sg, unsigned int nents)
+static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev,
+ struct rds_ib_mr *ibmr, struct scatterlist *sg,
+ unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
struct rds_ib_fmr *fmr = &ibmr->u.fmr;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 5d6e98a..0ea4ab0 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -125,8 +125,6 @@ void rds_ib_mr_exit(void);
 void __rds_ib_teardown_mr(struct rds_ib_mr *);
 void rds_ib_teardown_mr(struct rds_ib_mr *);
 struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *, int);
-int rds_ib_map_fmr(struct rds_ib_device *, struct rds_ib_mr *,
-  struct scatterlist *, unsigned int);
 struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *);
 int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *, int, struct rds_ib_mr **);
 struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *, struct scatterlist *,
-- 
2.7.4

[PATCHv3 1/4] rds: ib: drop unnecessary rdma_reject

2017-03-12 Thread Zhu Yanjun

When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_cm.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index ce3775a..4b9405c 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -677,9 +677,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
event->param.conn.initiator_depth);
 
/* rdma_accept() calls rdma_reject() internally if it fails */
-   err = rdma_accept(cm_id, _param);
-   if (err)
-   rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);
+   if (rdma_accept(cm_id, _param))
+   rds_ib_conn_error(conn, "rdma_accept failed\n");
 
 out:
if (conn)
-- 
2.7.4

[PATCHv3 0/4] rds: ib: trivial patches

2017-03-12 Thread Zhu Yanjun

v2 -> v3
remove err from messages.

Zhu Yanjun (4):
  rds: ib: drop unnecessary rdma_reject
  rds: ib: remove redundant ib_dealloc_fmr
  rds: ib: add the static type to the function
  rds: ib: unmap the scatter/gather list when error

 net/rds/ib_cm.c  |  5 ++---
 net/rds/ib_fmr.c | 38 --
 net/rds/ib_mr.h  |  2 --
 3 files changed, 26 insertions(+), 19 deletions(-)

-- 
2.7.4

[PATCHv3 2/4] rds: ib: remove redundant ib_dealloc_fmr

2017-03-12 Thread Zhu Yanjun

The function ib_dealloc_fmr will never be called. As such, it should
be removed.

Cc: Joe Jin 
Cc: Junxiao Bi 
Reviewed-by: Yuval Shaia 
Reviewed-by: Johannes Thumshirn 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_fmr.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 4fe8f4f..249ae1c 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -78,12 +78,9 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
return ibmr;
 
 out_no_cigar:
-   if (ibmr) {
-   if (fmr->fmr)
-   ib_dealloc_fmr(fmr->fmr);
-   kfree(ibmr);
-   }
+   kfree(ibmr);
atomic_dec(>item_count);
+
return ERR_PTR(err);
 }
 
-- 
2.7.4

[PATCH] net: use net->count to check whether a netns is alive or not

2017-03-12 Thread Andrei Vagin

The previous idea was to check whether a net namespace is in
net_exit_list or not. It doesn't work, because net->exit_list is used in
__register_pernet_operations and __unregister_pernet_operations where
all namespaces are added to a temporary list to make cleanup in an error
case, so list_empty(&net->exit_list) always returns false.

Reported-by: Mantas Mikulėnas 
Fixes: 002d8a1a6c11 ("net: skip genenerating uevents for network namespaces that are exiting")
Signed-off-by: Andrei Vagin 
---
 net/core/net-sysfs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b0c04cf..1004418 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -952,7 +952,7 @@ net_rx_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
while (--i >= new_num) {
struct kobject *kobj = &dev->_rx[i].kobj;
 
-   if (!list_empty(&dev_net(dev)->exit_list))
+   if (!atomic_read(&dev_net(dev)->count))
kobj->uevent_suppress = 1;
if (dev->sysfs_rx_queue_group)
sysfs_remove_group(kobj, dev->sysfs_rx_queue_group);
@@ -1370,7 +1370,7 @@ netdev_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
while (--i >= new_num) {
struct netdev_queue *queue = dev->_tx + i;
 
-   if (!list_empty(&dev_net(dev)->exit_list))
+   if (!atomic_read(&dev_net(dev)->count))
queue->kobj.uevent_suppress = 1;
 #ifdef CONFIG_BQL
sysfs_remove_group(&queue->kobj, &dql_group);
@@ -1557,7 +1557,7 @@ void netdev_unregister_kobject(struct net_device *ndev)
 {
struct device *dev = &(ndev->dev);
 
-   if (!list_empty(&dev_net(ndev)->exit_list))
+   if (!atomic_read(&dev_net(ndev)->count))
dev_set_uevent_suppress(dev, 1);
 
kobject_get(&dev->kobj);
-- 
2.9.3

Re: [PATCH] net: tundra: tsi108: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread David Miller

From: Philippe Reynes 
Date: Mon,  6 Mar 2017 23:26:09 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH] net: hyperv: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread David Miller

From: Philippe Reynes 
Date: Wed,  8 Mar 2017 23:41:04 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH] net: via: via-velocity: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread David Miller

From: Philippe Reynes 
Date: Wed,  8 Mar 2017 22:20:20 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH] net: fjes: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread David Miller

From: Philippe Reynes 
Date: Wed,  8 Mar 2017 23:16:11 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH v2] net: intel: ixgbe: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread David Miller

From: Philippe Reynes 
Date: Tue,  7 Mar 2017 23:32:25 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 
> ---
> Changelog:
> v2:
> - fix compilation (thanks andrewx bowers)

Applied.

Re: [PATCH] net: via: via-rhine: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread David Miller

From: Philippe Reynes 
Date: Tue,  7 Mar 2017 23:46:16 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: tun offloads bug

2017-03-12 Thread David Miller

From: Yaroslav Isakov 
Date: Wed, 8 Mar 2017 23:29:53 +0300

> Hello!
> I've found a bug in TUN/TAP driver with offloads - when Qemu is trying
> to set offloads on tap device, there is no error, but offloads are not
> applied.
> 
> The cause of this is that udev in recent systemd is using ethtool to
> disable offloads. So, udev is setting tun->dev->wanted_features via
> ethtool ioctl to disable TSO, but when qemu is trying to set offloads,
> it's using TUN/TAP ioctl which is not setting wanted_features, so
> netdev_update_features will not see features qemu wants.
> 
> This can be easily reproduced - just run qemu with
> guest_tso4=on,guest_tso6=on on systemd with systemd>=226 and after
> booting the VM, ethtool -k tap0 will show that TSO4 is disabled and
> TSO6 is enabled (systemd is not touching TSO6, that's why it's not
> affected at all)
> I've attached pretty trivial patch to fix this problem.

Please read Documentation/SubmittingPatches for the proper way to
submit a patch.

In particular you have to provide a proper signoff tag.

Thank you.

Re: [PATCH net-next v2] net: ipv4: add support for ECMP hash policy choice

2017-03-12 Thread David Miller

From: Jakub Sitnicki 
Date: Wed, 08 Mar 2017 17:00:05 +0100

> On Wed, Mar 08, 2017 at 12:43 PM GMT, Nikolay Aleksandrov wrote:
>> On 08/03/17 14:05, Jakub Sitnicki wrote:
>>> On Tue, Mar 07, 2017 at 11:01 AM GMT, Nikolay Aleksandrov wrote:
>>>> This patch adds support for ECMP hash policy choice via a new sysctl
>>>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>>>> The current values for fib_multipath_hash_policy are:
>>>>  0 - layer 3 (default)
>>>>  1 - layer 4
>>>> If there's an skb hash already set and it matches the chosen policy then it
>>>> will be used instead of being calculated. The ICMP inner IP addresses use
>>>> is removed.
>>>>
>>>> Signed-off-by: Nikolay Aleksandrov 
>>>> ---
>>>> v2:
>>>>  - removed the output_key_hash as it's not needed anymore
>>>>  - reverted to my original/internal patch with L3 as default hash
>>> 
>>> What about ICMP PTB (Fragmentation Needed) forwarding that makes PMTUD
>>> work with ECMP in setups like described in RFC7690 [1]?
>>> 
>>>   ptb -> router ecmp -> next hop L4/L7 load balancer -> destination
>>> 
>>>router --> load balancer 1 --->
>>> \\--> load balancer 2 ---> load-balanced service
>>>  \--> load balancer N --->
>>> 
>>> Removing special treatment of ICMP errors will break it, won't it?
>>> 
>>
>> Yes, I am aware and this decision was made with that in mind.
>> We'd like to use the HW hash when available and IIRC that doesn't play well
>> with special-casing ICMP errors for anycast as it may not match also. Another
>> thing, again if I remember correctly, was that this behaviour is closer to how
>> hardware handles ECMP.
> 
> OK, I wanted to make sure that is not an oversight that ECMP routing in
> ipv4 stack is to be dumbed down to match the hardware behavior. I
> thought that it was an advantage that we want to have over hardware
> routers. (To be fair, I should mention that we don't have it in ipv6
> stack ATM.)
> 
>>
>> One thing we can do is leave the current L3 behaviour with ICMP error
>> handling and add a new L3 mode that tries to use the skb hash when available
>> and doesn't care about the packet type.
>>
>> What do you think ?
> 
> Sounds good to me. Would be good to hear other opinions also.

This would be so much less of an issue with symmetric hashing.  A quick glance
seems to indicate that Microsoft didn't specify the Toeplitz hash to order the
ports and the addresses so that it would be symmetric, so we can guess what
every piece of hardware out there computing a hash does :-/

We could solve this ICMP problem by using a symmetric hash (flow
dissector already supports this), but then we're back to the problem
that this behaves differently from card computed hashes and hardware
offload of ECMP.

I have to say that losing this ICMP handling makes the current code a
non-starter.  The existing code explicitly was written to handle this
case properly, so just undoing it and making it stop working is not
really something we can do.

Re: [PATCH] x86-32: fix tlb flushing when lguest clears PGE

2017-03-12 Thread Kees Cook

Are there nominations for most comprehensive changelog of the year? :)
This is awesome.

-Kees

On Fri, Mar 10, 2017 at 6:31 PM, Daniel Borkmann  wrote:
> Fengguang reported [1] random corruptions from various locations on
> x86-32 after commits d2852a224050 ("arch: add ARCH_HAS_SET_MEMORY
> config") and 9d876e79df6a ("bpf: fix unlocking of jited image when
> module ronx not set") that uses the former. While x86-32 doesn't
> have a JIT like x86_64, the bpf_prog_lock_ro() and bpf_prog_unlock_ro()
> got enabled due to ARCH_HAS_SET_MEMORY, whereas Fengguang's test
> kernel doesn't have module support built in and therefore never
> had the DEBUG_SET_MODULE_RONX setting enabled.
>
> After investigating the crashes further, it turned out that using
> set_memory_ro() and set_memory_rw() didn't have the desired effect,
> for example, setting the pages as read-only on x86-32 would still
> let probe_kernel_write() succeed without error. This behavior would
> manifest itself in situations where the vmalloc'ed buffer was accessed
> prior to set_memory_*() such as in case of bpf_prog_alloc(). In
> cases where it wasn't, the page attribute changes seemed to have
> taken effect, leading to the conclusion that a TLB invalidate
> didn't happen. Moreover, it turned out that this issue reproduced
> with qemu in "-cpu kvm64" mode, but not for "-cpu host". When the
> issue occurs, change_page_attr_set_clr() did trigger a TLB flush
> as expected via __flush_tlb_all() through cpa_flush_range(), though.
>
> There are 3 variants for issuing a TLB flush: invpcid_flush_all()
> (depends on CPU feature bits X86_FEATURE_INVPCID, X86_FEATURE_PGE),
> cr4 based flush (depends on X86_FEATURE_PGE), and cr3 based flush.
> For "-cpu host" case in my setup, the flush used invpcid_flush_all()
> variant, whereas for "-cpu kvm64", the flush was cr4 based. Switching
> the kvm64 case to cr3 manually worked fine, and further investigating
> the cr4 one turned out that X86_CR4_PGE bit was not set in cr4
> register, meaning the __native_flush_tlb_global_irq_disabled() wrote
> cr4 twice with the same value instead of clearing X86_CR4_PGE in the
> first write to trigger the flush.
>
> It turned out that X86_CR4_PGE was cleared from cr4 during init
> from lguest_arch_host_init() via adjust_pge(). The X86_FEATURE_PGE
> bit is also cleared from there due to concerns of using PGE in
> guest kernel that can lead to hard to trace bugs (see bff672e630a0
> ("lguest: documentation V: Host") in init()). The CPU feature bits
> are cleared in dynamic boot_cpu_data, but they never propagated to
> __flush_tlb_all() as it uses static_cpu_has() instead of boot_cpu_has()
> for testing which variant of TLB flushing to use, meaning they still
> used the old setting of the host kernel.
>
> Clearing via setup_clear_cpu_cap(X86_FEATURE_PGE) so this would
> propagate to static_cpu_has() checks is too late at this point as
> sections have been patched already, so for now, it seems reasonable
> to switch back to boot_cpu_has(X86_FEATURE_PGE) as it was prior to
> commit c109bf95992b ("x86/cpufeature: Remove cpu_has_pge"). This
> lets the TLB flush trigger via cr3 as originally intended, properly
> makes the new page attributes visible and thus fixes the crashes
> seen by Fengguang.
>
>   [1] https://lkml.org/lkml/2017/3/1/344
>
> Fixes: c109bf95992b ("x86/cpufeature: Remove cpu_has_pge")
> Reported-by: Fengguang Wu 
> Signed-off-by: Daniel Borkmann 
> Cc: Borislav Petkov 
> Cc: Linus Torvalds 
> Cc: Thomas Gleixner 
> Cc: Kees Cook 
> Cc: Laura Abbott 
> Cc: Ingo Molnar 
> Cc: H. Peter Anvin 
> Cc: Rusty Russell 
> Cc: Alexei Starovoitov 
> Cc: David S. Miller 
> ---
>  arch/x86/include/asm/tlbflush.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6fa8594..fc5abff 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -188,7 +188,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>
>  static inline void __flush_tlb_all(void)
>  {
> -   if (static_cpu_has(X86_FEATURE_PGE))
> +   if (boot_cpu_has(X86_FEATURE_PGE))
> __flush_tlb_global();
> else
> __flush_tlb();
> --
> 1.9.3
>



-- 
Kees Cook
Pixel Security

Re: [PATCH] net: usb: asix88179_178a: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread Chris Roth

I can test it tomorrow. I'll pull a clean copy of 4.10.2, or do you
suggest a different version than that?

Chris

On 2017-03-12 11:02 AM, Philippe Reynes wrote:
> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
>
> As I don't have the hardware, I'd be very pleased if
> someone may test this patch.
>
> Signed-off-by: Philippe Reynes 
> ---
>  drivers/net/usb/ax88179_178a.c |   14 --
>  1 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c
> index a3a7db0..4a0ae7c 100644
> --- a/drivers/net/usb/ax88179_178a.c
> +++ b/drivers/net/usb/ax88179_178a.c
> @@ -620,16 +620,18 @@ static int ax88179_get_eeprom_len(struct net_device *net)
>   return 0;
>  }
>  
> -static int ax88179_get_settings(struct net_device *net, struct ethtool_cmd *cmd)
> +static int ax88179_get_link_ksettings(struct net_device *net,
> +   struct ethtool_link_ksettings *cmd)
>  {
>   struct usbnet *dev = netdev_priv(net);
> - return mii_ethtool_gset(&dev->mii, cmd);
> + return mii_ethtool_get_link_ksettings(&dev->mii, cmd);
>  }
>  
> -static int ax88179_set_settings(struct net_device *net, struct ethtool_cmd *cmd)
> +static int ax88179_set_link_ksettings(struct net_device *net,
> +   const struct ethtool_link_ksettings *cmd)
>  {
>   struct usbnet *dev = netdev_priv(net);
> - return mii_ethtool_sset(&dev->mii, cmd);
> + return mii_ethtool_set_link_ksettings(&dev->mii, cmd);
>  }
>  
>  static int
> @@ -826,11 +828,11 @@ static int ax88179_ioctl(struct net_device *net, struct ifreq *rq, int cmd)
>   .set_wol= ax88179_set_wol,
>   .get_eeprom_len = ax88179_get_eeprom_len,
>   .get_eeprom = ax88179_get_eeprom,
> - .get_settings   = ax88179_get_settings,
> - .set_settings   = ax88179_set_settings,
>   .get_eee= ax88179_get_eee,
>   .set_eee= ax88179_set_eee,
>   .nway_reset = usbnet_nway_reset,
> + .get_link_ksettings = ax88179_get_link_ksettings,
> + .set_link_ksettings = ax88179_set_link_ksettings,
>  };
>  
>  static void ax88179_set_multicast(struct net_device *net)

RE: [PATCH v2] fjes: Do not load fjes driver if system does not have extended socket device.

2017-03-12 Thread Izumi, Taku

Ishimatsu-san,

 Sorry for my late response.

>
> Which tree did you apply the patch to?
> 
> The patch can apply to net-next tree with no conflicts as follows:

  Not net-next but net tree:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

  I'll review and test your patch soon.

  Sincerely,
  Taku Izumi

> 
> # git clone
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
> Cloning into 'net-next'...
> remote: Counting objects: 5265118, done.
> remote: Compressing objects: 100% (805485/805485), done.
> Receiving objects: 100% (5265118/5265118), 910.11 MiB | 23.42 MiB/s, done.
> remote: Total 5265118 (delta 4419240), reused 5264459 (delta 4418809)
> Resolving deltas: 100% (4419240/4419240), done.
> Checking out files: 100% (58005/58005), done.
> # head -n 30 fjes.patch
> Subject: [PATCH v2] fjes: Do not load fjes driver if system does not have
> extended socket device.
> Date: Wed, 8 Mar 2017 16:05:18 -0500
> From: Yasuaki Ishimatsu 
> To: netdev@vger.kernel.org
> CC: David Miller , izumi.t...@jp.fujitsu.com
> 
> The fjes driver is used only by FUJITSU servers and almost of all servers
> in the world never use it. But currently if ACPI PNP0C02 is defined in the
> ACPI table, the following message is always shown:
> 
>   "FUJITSU Extended Socket Network Device Driver - version 1.2
>- Copyright (c) 2015 FUJITSU LIMITED"
> 
> The message makes users confused because there is no reason that the message
> is shown in other vendor servers.
> 
> To avoid the confusion, the patch adds a check for whether the server has an
> extended socket device.
> 
> Signed-off-by: Yasuaki Ishimatsu 
> CC: Taku Izumi 
> ---
> v2:
>   - Order local variable declarations from longest to shortest line
> 
>   drivers/net/fjes/fjes_main.c | 52
> +++-
>   1 file changed, 47 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
> index b77e4ecf..a57c2cb 100644
> # cd net-next/
> # git am ../fjes.patch
> Applying: fjes: Do not load fjes driver if system does not have extended
> socket device.
> #
> 
> 
> Thanks,
> Yasuaki Ishimatsu

[PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-12 Thread Eric Dumazet

When adding order-0 pages allocations and page recycling in receive path,
I added issues on PowerPC, or more generally on arches with large pages.

A GRO packet, aggregating 45 segments, ended up using 45 page frags
on 45 different pages. Before my changes we were very likely packing
up to 42 Ethernet frames per 64KB page.

1) At skb freeing time, all put_page() on the skb frags now touch 45
   different 'struct page' and this adds more cache line misses.
   Too bad that standard Ethernet MTU is so small :/

2) Using one order-0 page per ring slot consumes ~42 times more memory
   on PowerPC.

3) Allocating order-0 pages is very likely to use pages from very
   different locations, increasing TLB pressure on hosts with more
   than 256 GB of memory after days of uptime.

This patch uses a refined strategy, addressing these points.

We still use order-0 pages, but the page recycling technique is modified
so that we have better chances to lower number of pages containing the
frags for a given GRO skb (factor of 2 on x86, and 21 on PowerPC)

Page allocations are split in two halves :
- One currently visible by the NIC for DMA operations.
- The other contains pages that were already added to old skbs, put in
  a quarantine.

When we receive a frame, we look at the oldest entry in the pool and
check if the page count is back to one, meaning old skbs/frags were
consumed and the page can be recycled.

Page allocations are attempted using high order ones, trying
to lower TLB pressure.

On x86, memory allocations stay the same. (One page per RX slot for MTU=1500)
But on PowerPC, this patch considerably reduces the allocated memory.

Performance gain on PowerPC is about 50% for a single TCP flow.

On x86, I could not measure the difference, my test machine being
limited by the sender (33 Gbit per TCP flow).
22 fewer cache line misses per 64 KB GRO packet is probably in the order
of 2 % or so.

Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
Cc: Saeed Mahameed 
Cc: Alexander Duyck 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 462 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 +++-
 3 files changed, 310 insertions(+), 221 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..de455c8a2dec389cfeca6b6d474a6184d6acf618 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -31,7 +31,6 @@
  *
  */
 
-#include 
 #include 
 #include 
 #include 
@@ -50,59 +49,41 @@
 
 #include "mlx4_en.h"
 
-static int mlx4_alloc_page(struct mlx4_en_priv *priv,
-  struct mlx4_en_rx_alloc *frag,
-  gfp_t gfp)
+static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
+   struct mlx4_en_rx_ring *ring,
+   dma_addr_t *dma,
+   unsigned int node, gfp_t gfp)
 {
struct page *page;
-   dma_addr_t dma;
-
-   page = alloc_page(gfp);
-   if (unlikely(!page))
-   return -ENOMEM;
-   dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
-   if (unlikely(dma_mapping_error(priv->ddev, dma))) {
-   __free_page(page);
-   return -ENOMEM;
-   }
-   frag->page = page;
-   frag->dma = dma;
-   frag->page_offset = priv->rx_headroom;
-   return 0;
-}
 
-static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
-  struct mlx4_en_rx_ring *ring,
-  struct mlx4_en_rx_desc *rx_desc,
-  struct mlx4_en_rx_alloc *frags,
-  gfp_t gfp)
-{
-   int i;
+   if (unlikely(!ring->pre_allocated_count)) {
+   unsigned int order = READ_ONCE(ring->rx_alloc_order);
 
-   for (i = 0; i < priv->num_frags; i++, frags++) {
-   if (!frags->page) {
-   if (mlx4_alloc_page(priv, frags, gfp))
-   return -ENOMEM;
-   ring->rx_alloc_pages++;
+   page = __alloc_pages_node(node, gfp | __GFP_NOMEMALLOC |
+   __GFP_NOWARN | __GFP_NORETRY,
+ order);
+   if (page) {
+   split_page(page, order);
+   ring->pre_allocated_count = 1U << order;
+   } else {
+   if (order > 1)
+   ring->rx_alloc_order--;
+   page = __alloc_pages_node(node, gfp, 0);
+   if (unlikely(!page))
+   return NULL;
+   ring->pre_allocated_count = 1U;

Re: [PATCHv2 1/4] rds: ib: drop unnecessary rdma_reject

2017-03-12 Thread Yanjun Zhu




On 2017/3/13 3:43, santosh.shilim...@oracle.com wrote:

On 3/12/17 12:33 PM, Leon Romanovsky wrote:

On Sun, Mar 12, 2017 at 04:07:55AM -0400, Zhu Yanjun wrote:

When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
Change from v1 to v2:
  Add the acker.

 net/rds/ib_cm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index ce3775a..eca3d5f 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -677,8 +677,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,

 event->param.conn.initiator_depth);

 /* rdma_accept() calls rdma_reject() internally if it fails */
-err = rdma_accept(cm_id, _param);
-if (err)
+if (rdma_accept(cm_id, _param))
 rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);


You omitted initialization of "err" variable which you print here ^.


Its inited by rds_ib_setup_qp() but you are right. It will print
failed with error = 0. :-)

Zhu, please drop that 'err' from the message.

OK. I will do.

Zhu Yanjun

Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-12 Thread Hannes Frederic Sowa

Hi,

On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Mon, 13 Mar 2017 00:01:24 +0100
> 
> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
> > can work with afnetns with one limitation: one cannot cross the realm
> > of a network namespace while changing the afnetns compartment. To get
> > into a new afnetns in a different net namespace, one must first change
> > to the net namespace and afterwards switch to the desired afnetns.
> 
> Please explain why this is useful, who wants this kind of facility,
> and how it will be used.

Yes, I have to enhance the cover letter:

The work behind all this is to provide more dense container hosting.
Right now we lose performance, because all packets need to be forwarded
through either a bridge or must be routed until they reach the
containers. For example, we can't make use of early demuxing for the
incoming packets. We basically pass the networking stack twice for
every packet.

The usage is very much in line with how network namespaces are used
nowadays:

ip afnetns add afns-1
ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
ip afnetns exec afns-1 /usr/sbin/httpd

This spawns the command in afns-1, where it and all of its child
processes will only have access to the specific IP addresses, even when
they do a wildcard bind. Source address selection will also use only the
IP addresses available to the children.

In some sense it has lots of characteristics like ipvlan, allowing a
single MAC address to host lots of IP addresses which will end up in
different namespaces. Unlike ipvlan, however, it will also solve the
problem around duplicate address detection and multiplexing packets to
the IGMP or MLD state machines.

The resource consumption in comparison with ordinary namespaces will be
much lower. All in all, we will have far fewer networking subsystems to
cross compared to normal netns solutions.

Some more information is also in the first patch, which adds
documentation.

Bye,
Hannes
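
The address visibility described in this reply can be modelled in a few lines of userspace C. This is only a sketch under assumptions: struct model_ifaddr, the integer afnetns ids and both helper functions are hypothetical and do not appear in the patch series. It shows how tagging each address with an afnetns restricts both source address selection and what a wildcard bind can receive:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: every configured address is tagged with an afnetns id. */
struct model_ifaddr {
    uint32_t addr;     /* IPv4 address in host byte order */
    int afnetns_id;    /* afnetns the address was assigned to */
};

/* Source address selection: first address of the caller's afnetns, 0 if none. */
uint32_t model_select_source(const struct model_ifaddr *addrs, size_t n,
                             int caller_afnetns)
{
    for (size_t i = 0; i < n; i++)
        if (addrs[i].afnetns_id == caller_afnetns)
            return addrs[i].addr;
    return 0;
}

/* A wildcard bind only covers destination addresses of the caller's afnetns. */
int model_wildcard_covers(const struct model_ifaddr *addrs, size_t n,
                          int caller_afnetns, uint32_t daddr)
{
    for (size_t i = 0; i < n; i++)
        if (addrs[i].addr == daddr)
            return addrs[i].afnetns_id == caller_afnetns;
    return 0;
}

/* Usage: 10.0.0.1 stays in afnetns 0, 192.168.1.1 is given to afnetns 1. */
void model_visibility_example(void)
{
    const struct model_ifaddr addrs[] = {
        { 0x0A000001u, 0 },
        { 0xC0A80101u, 1 },
    };

    assert(model_select_source(addrs, 2, 1) == 0xC0A80101u);
    assert(model_wildcard_covers(addrs, 2, 1, 0xC0A80101u));
    assert(!model_wildcard_covers(addrs, 2, 1, 0x0A000001u));
}
```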

Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-12 Thread David Miller

From: Hannes Frederic Sowa 
Date: Mon, 13 Mar 2017 00:01:24 +0100

> afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
> can work with afnetns with one limitation: one cannot cross the realm
> of a network namespace while changing the afnetns compartment. To get
> into a new afnetns in a different net namespace, one must first change
> to the net namespace and afterwards switch to the desired afnetns.

Please explain why this is useful, who wants this kind of facility,
and how it will be used.

Thank you.

[PATCH net-next RFC v1 18/27] afnetns: afnetns should influence source address selection

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 drivers/target/iscsi/cxgbit/cxgbit_cm.c |  2 +-
 include/linux/inetdevice.h  |  5 +++--
 include/net/route.h | 10 ++
 net/ipv4/devinet.c  | 19 ---
 net/ipv4/icmp.c |  4 ++--
 net/ipv4/igmp.c |  2 +-
 net/ipv4/route.c| 21 -
 net/ipv4/xfrm4_policy.c |  2 +-
 net/sctp/protocol.c |  4 ++--
 net/tipc/udp_media.c|  2 +-
 10 files changed, 45 insertions(+), 26 deletions(-)

diff --git a/drivers/target/iscsi/cxgbit/cxgbit_cm.c b/drivers/target/iscsi/cxgbit/cxgbit_cm.c
index 37a05185dcbe0e..4ae59d20d8e260 100644
--- a/drivers/target/iscsi/cxgbit/cxgbit_cm.c
+++ b/drivers/target/iscsi/cxgbit/cxgbit_cm.c
@@ -266,7 +266,7 @@ static struct net_device *cxgbit_ipv4_netdev(__be32 saddr)
 {
struct net_device *ndev;
 
-   ndev = __ip_dev_find(&init_net, saddr, false);
+   ndev = __ip_dev_find(&init_net, NULL, saddr, false);
if (!ndev)
return NULL;
 
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index a41bfce099e0a1..9411270cb0fe64 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -160,10 +160,11 @@ void inet_netconf_notify_devconf(struct net *net, int type, int ifindex,
 struct ipv4_devconf *devconf);
 
 struct in_ifaddr *ifa_find_rcu(struct net *net, __be32 addr);
-struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref);
+struct net_device *__ip_dev_find(struct net *net, struct afnetns *afnetns,
+__be32 addr, bool devref);
 static inline struct net_device *ip_dev_find(struct net *net, __be32 addr)
 {
-   return __ip_dev_find(net, addr, true);
+   return __ip_dev_find(net, NULL, addr, true);
 }
 
 int inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b);
diff --git a/include/net/route.h b/include/net/route.h
index c0874c87c17371..d29449d1863636 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -113,13 +113,15 @@ struct in_device;
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
-struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
- int mp_hash);
+struct rtable *__ip_route_output_key_hash(struct net *net,
+ struct afnetns *afnetns,
+ struct flowi4 *flp, int mp_hash);
 
 static inline struct rtable *__ip_route_output_key(struct net *net,
+  struct afnetns *afnetns,
   struct flowi4 *flp)
 {
-   return __ip_route_output_key_hash(net, flp, -1);
+   return __ip_route_output_key_hash(net, afnetns, flp, -1);
 }
 
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
@@ -286,7 +288,7 @@ static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
  sport, dport, sk);
 
if (!dst || !src) {
-   rt = __ip_route_output_key(net, fl4);
+   rt = __ip_route_output_key(net, NULL, fl4);
if (IS_ERR(rt))
return rt;
ip_rt_put(rt);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 0844d917aa8d7d..82a7389ec86faa 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -150,14 +150,27 @@ struct in_ifaddr *ifa_find_rcu(struct net *net, __be32 addr)
  *
  * If a caller uses devref=false, it should be protected by RCU, or RTNL
  */
-struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
+struct net_device *__ip_dev_find(struct net *net, struct afnetns *afnetns,
+__be32 addr, bool devref)
 {
-   struct net_device *result;
+   struct net_device *result = NULL;
struct in_ifaddr *ifa;
 
rcu_read_lock();
ifa = ifa_find_rcu(net, addr);
-   result = ifa ? ifa->ifa_dev->dev : NULL;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   if (afnetns && afnetns != net->afnet_ns) {
+   /* we are in a child namespace, thus only allow to
+* explicitly configured addresses
+*/
+   if (!ifa || ifa->afnetns != afnetns) {
+   rcu_read_unlock();
+   return NULL;
+   }
+   }
+#endif
+   if (ifa)
+   result = ifa->ifa_dev->dev;
if (!result) {
struct flowi4 fl4 = { .daddr = addr };
struct fib_result res = { 0 };
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index fc310db2708bf6..74261d6b86e4fc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -505,7 +505,7 @@ static struct rtable

[PATCH net-next RFC v1 25/27] afnetns: ipv4: inherit afnetns from calling application

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv4/devinet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 82a7389ec86faa..01bdff8a957ae1 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -838,7 +838,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh,
goto errout_free;
}
} else {
-   ifa->afnetns = afnetns_get(net->afnet_ns);
+   ifa->afnetns = afnetns_get(current->nsproxy->afnet_ns);
}
 #else
if (tb[IFA_AFNETNS_FD]) {
-- 
2.9.3

[PATCH net-next RFC v1 21/27] afnetns: add support for tcpv6

2017-03-12 Thread Hannes Frederic Sowa

Same as the support for tcpv4, we simply add the necessary checks so we
just look at our own sockets.

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv6/inet6_hashtables.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 8570e0e3016b65..05b71f0937e676 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -87,6 +87,7 @@ struct sock *__inet6_lookup_established(struct net *net,
   const u16 hnum,
   const int dif)
 {
+   struct afnetns *afnetns;
struct sock *sk;
const struct hlist_nulls_node *node;
const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
@@ -97,11 +98,15 @@ struct sock *__inet6_lookup_established(struct net *net,
unsigned int slot = hash & hashinfo->ehash_mask;
struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 
+   afnetns = ipv6_get_ifaddr_afnetns_rcu(net, daddr,
+ dev_get_by_index_rcu(net, dif));
 
 begin:
sk_nulls_for_each_rcu(sk, node, &head->chain) {
if (sk->sk_hash != hash)
continue;
+   if (sock_afnetns(sk) != afnetns)
+   continue;
if (!INET6_MATCH(sk, net, saddr, daddr, ports, dif))
continue;
if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
@@ -123,14 +128,15 @@ struct sock *__inet6_lookup_established(struct net *net,
 EXPORT_SYMBOL(__inet6_lookup_established);
 
 static inline int compute_score(struct sock *sk, struct net *net,
+   struct afnetns *afnetns,
const unsigned short hnum,
const struct in6_addr *daddr,
const int dif, bool exact_dif)
 {
int score = -1;
 
-   if (net_eq(sock_net(sk), net) && inet_sk(sk)->inet_num == hnum &&
-   sk->sk_family == PF_INET6) {
+   if (net_eq(sock_net(sk), net) && sock_afnetns(sk) == afnetns &&
+   inet_sk(sk)->inet_num == hnum && sk->sk_family == PF_INET6) {
 
score = 1;
if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
@@ -162,10 +168,14 @@ struct sock *inet6_lookup_listener(struct net *net,
int score, hiscore = 0, matches = 0, reuseport = 0;
bool exact_dif = inet6_exact_dif_match(net, skb);
struct sock *sk, *result = NULL;
+   struct afnetns *afnetns;
u32 phash = 0;
 
+   afnetns = ipv6_get_ifaddr_afnetns_rcu(net, daddr, skb->dev);
+
sk_for_each(sk, &ilb->head) {
-   score = compute_score(sk, net, hnum, daddr, dif, exact_dif);
+   score = compute_score(sk, net, afnetns, hnum, daddr, dif,
+ exact_dif);
if (score > hiscore) {
reuseport = sk->sk_reuseport;
if (reuseport) {
-- 
2.9.3
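
As a rough illustration of "we just look at our own sockets", the following userspace sketch models a listener lookup that scores a socket only if its afnetns matches the one derived from the destination address. The types and functions (struct model_sock, model_compute_score(), model_lookup_listener()) are hypothetical simplifications, not the kernel's inet6 hashtable code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced view of a listening socket. */
struct model_sock {
    int net_id;           /* network namespace */
    int afnetns_id;       /* afnetns the socket was created in */
    unsigned short port;  /* local port */
};

/* A socket only scores if netns, afnetns and port all match. */
int model_compute_score(const struct model_sock *sk, int net_id,
                        int afnetns_id, unsigned short hnum)
{
    if (sk->net_id == net_id && sk->afnetns_id == afnetns_id &&
        sk->port == hnum)
        return 1;
    return -1;
}

/* Walk all sockets and keep the best-scoring one, NULL if nothing matches. */
const struct model_sock *model_lookup_listener(const struct model_sock *socks,
                                               size_t n, int net_id,
                                               int afnetns_id,
                                               unsigned short hnum)
{
    const struct model_sock *result = NULL;
    int hiscore = 0;

    for (size_t i = 0; i < n; i++) {
        int score = model_compute_score(&socks[i], net_id, afnetns_id, hnum);

        if (score > hiscore) {
            hiscore = score;
            result = &socks[i];
        }
    }
    return result;
}
```

Two listeners on the same port in the same net namespace no longer collide in this model, because the afnetns of the destination address decides which one is eligible.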

[PATCH net-next RFC v1 26/27] afnetns: ipv6: inherit afnetns from calling application

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv6/addrconf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 319f83a7d29dd5..3d9d24ec066a67 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4542,7 +4542,7 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh)
if (IS_ERR(afnetns))
return PTR_ERR(afnetns);
} else {
-   afnetns = afnetns_get(net_afnetns(net));
+   afnetns = afnetns_get(current->nsproxy->afnet_ns);
}
 #else
if (tb[IFA_AFNETNS_FD])
-- 
2.9.3

[PATCH net-next RFC v1 23/27] afnetns: use user_ns from afnetns for checking for binding to port < 1024

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/inet_common.h |  2 +-
 net/ipv4/af_inet.c| 37 ++---
 net/ipv6/af_inet6.c   |  2 +-
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 4ac8229dca6af4..16dfbb02296be6 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -30,7 +30,7 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
-int inet_allow_bind(struct sock *sk, __be32 addr);
+int inet_allow_bind(struct sock *sk, __be32 addr, unsigned short snum);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr, int *uaddr_len,
 int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5f11399bafd16f..da7e6299073743 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -428,12 +428,14 @@ int inet_release(struct socket *sock)
 }
 EXPORT_SYMBOL(inet_release);
 
-int inet_allow_bind(struct sock *sk, __be32 addr)
+int inet_allow_bind(struct sock *sk, __be32 addr, unsigned short snum)
 {
struct inet_sock *inet = inet_sk(sk);
struct net *net = sock_net(sk);
+   struct afnetns *afnetns = NULL;
u32 tb_id = RT_TABLE_LOCAL;
int chk_addr_ret;
+   int err = 0;
 
tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id;
chk_addr_ret = inet_addr_type_table(net, addr, tb_id);
@@ -453,18 +455,29 @@ int inet_allow_bind(struct sock *sk, __be32 addr)
chk_addr_ret != RTN_BROADCAST)
return -EADDRNOTAVAIL;
 
+   rcu_read_lock();
if (chk_addr_ret == RTN_LOCAL &&
net_afnetns(net) != sock_afnetns(sk)) {
-   struct afnetns *afnetns;
-
-   rcu_read_lock();
afnetns = ifa_find_afnetns_rcu(net, addr);
if (afnetns != sock_afnetns(sk))
-   chk_addr_ret = -EADDRNOTAVAIL;
-   rcu_read_unlock();
+   err = -EADDRNOTAVAIL;
+   }
+
+   if (!err && snum && snum < inet_prot_sock(net)) {
+   struct user_namespace *user_ns;
+
+#if IS_ENABLED(CONFIG_AFNETNS)
+   user_ns = afnetns ? afnetns->user_ns : net->user_ns;
+#else
+   user_ns = net->user_ns;
+#endif
+   if (!ns_capable(user_ns, CAP_NET_BIND_SERVICE))
+   err = -EACCES;
}
 
-   return chk_addr_ret;
+   rcu_read_unlock();
+
+   return err;
 }
 EXPORT_SYMBOL(inet_allow_bind);
 
@@ -473,7 +486,6 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
struct sock *sk = sock->sk;
struct inet_sock *inet = inet_sk(sk);
-   struct net *net = sock_net(sk);
unsigned short snum;
int chk_addr_ret;
int err;
@@ -497,18 +509,13 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
goto out;
}
 
-   chk_addr_ret = inet_allow_bind(sk, addr->sin_addr.s_addr);
+   snum = ntohs(addr->sin_port);
+   chk_addr_ret = inet_allow_bind(sk, addr->sin_addr.s_addr, snum);
if (chk_addr_ret < 0) {
err = chk_addr_ret;
goto out;
}
 
-   snum = ntohs(addr->sin_port);
-   err = -EACCES;
-   if (snum && snum < inet_prot_sock(net) &&
-   !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
-   goto out;
-
/*  We keep a pair of addresses. rcv_saddr is the one
 *  used by hash lookups, and saddr is used for transmit.
 *
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index ffb116297c0950..30aff01eba5be0 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -324,7 +324,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
goto out;
}
 
-   err = inet_allow_bind(sk, addr->sin6_addr.s6_addr32[3]);
+   err = inet_allow_bind(sk, addr->sin6_addr.s6_addr32[3], snum);
if (err < 0)
goto out;
else
-- 
2.9.3
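
The privileged-port decision made in inet_allow_bind() by this patch can be summarised with a userspace model. Everything below is a sketch under assumptions: the model_* types, the boolean capability flag and the prot_sock field are hypothetical placeholders for ns_capable(), CAP_NET_BIND_SERVICE and inet_prot_sock(). It shows the choice of user namespace: the one attached to the afnetns if an afnetns was looked up, otherwise the one of the net namespace:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace and its capability check. */
struct model_user_ns {
    bool has_net_bind_service;
};

struct model_afnetns {
    const struct model_user_ns *user_ns;
};

struct model_net {
    const struct model_user_ns *user_ns;
    unsigned short prot_sock;  /* first unprivileged port, usually 1024 */
};

/*
 * Returns 0 if binding to snum is allowed, -EACCES otherwise.
 * afnetns may be NULL, in which case the net namespace decides.
 */
int model_allow_bind_port(const struct model_net *net,
                          const struct model_afnetns *afnetns,
                          unsigned short snum)
{
    const struct model_user_ns *user_ns;

    user_ns = afnetns ? afnetns->user_ns : net->user_ns;
    assert(user_ns != NULL);

    if (snum && snum < net->prot_sock && !user_ns->has_net_bind_service)
        return -EACCES;
    return 0;
}
```

In this model a process that is unprivileged in the net namespace's user namespace can still bind to port 80 if it holds the capability in the user namespace recorded for its afnetns.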

[PATCH net-next RFC v1 15/27] afnetns: add ipv6_get_ifaddr_afnetns_rcu

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/addrconf.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index e3f1920ca57968..644fa68bb4ddef 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -104,6 +104,23 @@ int addrconf_prefix_rcv_add_addr(struct net *net, struct net_device *dev,
 u32 addr_flags, bool sllao, bool tokenized,
 __u32 valid_lft, u32 prefered_lft);
 
+static inline
+struct afnetns *ipv6_get_ifaddr_afnetns_rcu(struct net *net,
+   const struct in6_addr *addr,
+   struct net_device *dev)
+{
+#if IS_ENABLED(CONFIG_AFNETNS)
+   struct inet6_ifaddr *ifp;
+
+   ifp = ipv6_get_ifaddr(net, addr, dev, 1);
+   if (ifp)
+   return ifp->afnetns;
+   return net->afnet_ns;
+#else
+   return NULL;
+#endif
+}
+
 static inline int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
 {
if (dev->addr_len != ETH_ALEN)
-- 
2.9.3
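
The helper added here resolves an address to the afnetns it was configured in and falls back to the net namespace's own afnetns when the address is unknown. A minimal userspace model of that fallback, with hypothetical names (struct model_ifaddr6, model_get_ifaddr_afnetns()) and integer ids instead of pointers, could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical address table entry: IPv6 address plus owning afnetns id. */
struct model_ifaddr6 {
    unsigned char addr[16];
    int afnetns_id;
};

/*
 * Return the afnetns id of the matching address, or the default afnetns
 * of the net namespace if the address is not configured.
 */
int model_get_ifaddr_afnetns(const struct model_ifaddr6 *tbl, size_t n,
                             const unsigned char *addr, int net_afnetns_id)
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(tbl[i].addr, addr, sizeof(tbl[i].addr)) == 0)
            return tbl[i].afnetns_id;
    return net_afnetns_id;
}

/* Usage: ::1 is tagged with afnetns 5, ::2 is not configured at all. */
void model_fallback_example(void)
{
    struct model_ifaddr6 tbl[1];
    unsigned char unknown[16] = { 0 };

    memset(tbl[0].addr, 0, sizeof(tbl[0].addr));
    tbl[0].addr[15] = 1;
    tbl[0].afnetns_id = 5;
    unknown[15] = 2;

    assert(model_get_ifaddr_afnetns(tbl, 1, tbl[0].addr, 0) == 5);
    assert(model_get_ifaddr_afnetns(tbl, 1, unknown, 0) == 0);
}
```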

[PATCH net-next RFC v1 16/27] afnetns: add udpv6 support

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv6/datagram.c |  6 --
 net/ipv6/udp.c  | 18 +-
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index eec27f87efaca1..cd811e8b1ba824 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -804,8 +804,10 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
if (addr_type != IPV6_ADDR_ANY) {
int strict = __ipv6_addr_src_scope(addr_type) <= IPV6_ADDR_SCOPE_LINKLOCAL;
if (!(inet_sk(sk)->freebind || inet_sk(sk)->transparent) &&
-   !ipv6_chk_addr(net, &src_info->ipi6_addr,
-                  strict ? dev : NULL, 0) &&
+   !ipv6_chk_addr_and_flags(net, sock_afnetns(sk),
+                            &src_info->ipi6_addr,
+                            strict ? dev : NULL, 0,
+                            IFA_F_TENTATIVE) &&
!ipv6_chk_acast_addr_src(net, dev,
&src_info->ipi6_addr))
err = -EINVAL;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4e4c401e3bc690..d63e0e362fe72b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -126,6 +126,7 @@ static void udp_v6_rehash(struct sock *sk)
 }
 
 static int compute_score(struct sock *sk, struct net *net,
+struct afnetns *afnetns,
 const struct in6_addr *saddr, __be16 sport,
 const struct in6_addr *daddr, unsigned short hnum,
 int dif, bool exact_dif)
@@ -138,6 +139,9 @@ static int compute_score(struct sock *sk, struct net *net,
sk->sk_family != PF_INET6)
return -1;
 
+   if (sock_afnetns(sk) != afnetns)
+   return -1;
+
score = 0;
inet = inet_sk(sk);
 
@@ -173,6 +177,7 @@ static int compute_score(struct sock *sk, struct net *net,
 
 /* called with rcu_read_lock() */
 static struct sock *udp6_lib_lookup2(struct net *net,
+   struct afnetns *afnetns,
const struct in6_addr *saddr, __be16 sport,
const struct in6_addr *daddr, unsigned int hnum, int dif,
bool exact_dif, struct udp_hslot *hslot2,
@@ -185,7 +190,7 @@ static struct sock *udp6_lib_lookup2(struct net *net,
result = NULL;
badness = -1;
udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-   score = compute_score(sk, net, saddr, sport,
+   score = compute_score(sk, net, afnetns, saddr, sport,
  daddr, hnum, dif, exact_dif);
if (score > badness) {
reuseport = sk->sk_reuseport;
@@ -224,8 +229,11 @@ struct sock *__udp6_lib_lookup(struct net *net,
struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
bool exact_dif = udp6_lib_exact_dif_match(net, skb);
int score, badness, matches = 0, reuseport = 0;
+   struct afnetns *afnetns;
u32 hash = 0;
 
+   afnetns = ipv6_get_ifaddr_afnetns_rcu(net, daddr, skb->dev);
+
if (hslot->count > 10) {
hash2 = udp6_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
@@ -233,7 +241,7 @@ struct sock *__udp6_lib_lookup(struct net *net,
if (hslot->count < hslot2->count)
goto begin;
 
-   result = udp6_lib_lookup2(net, saddr, sport,
+   result = udp6_lib_lookup2(net, afnetns, saddr, sport,
  daddr, hnum, dif, exact_dif,
  hslot2, skb);
if (!result) {
@@ -248,7 +256,7 @@ struct sock *__udp6_lib_lookup(struct net *net,
if (hslot->count < hslot2->count)
goto begin;
 
-   result = udp6_lib_lookup2(net, saddr, sport,
+   result = udp6_lib_lookup2(net, afnetns, saddr, sport,
  daddr, hnum, dif,
  exact_dif, hslot2,
  skb);
@@ -259,8 +267,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
result = NULL;
badness = -1;
sk_for_each_rcu(sk, &hslot->head) {
-   score = compute_score(sk, net, saddr, sport, daddr, hnum, dif,
- exact_dif);
+   score = compute_score(sk, net, afnetns, saddr, sport, daddr,
+ hnum, dif, exact_dif);
if (score > badness) {
reuseport =

[PATCH net-next RFC v1 27/27] afnetns: allow only whitelisted protocols to operate inside afnetns

2017-03-12 Thread Hannes Frederic Sowa

We only care about inet protocols (that is, IPv4 and IPv6). Other
protocols, like netlink, are not under the control of afnetns and thus
must be hardened with capabilities.

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/protocol.h |  1 +
 net/ipv4/af_inet.c | 20 +++-
 net/ipv4/udplite.c |  3 ++-
 net/ipv6/af_inet6.c| 14 +++---
 net/ipv6/tcp_ipv6.c|  3 ++-
 net/ipv6/udp.c |  3 ++-
 net/ipv6/udplite.c |  3 ++-
 7 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/include/net/protocol.h b/include/net/protocol.h
index bf36ca34af7ad2..7b64f71b16ccc0 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -91,6 +91,7 @@ struct inet_protosw {
 #define INET_PROTOSW_REUSE 0x01 /* Are ports automatically reusable? */
 #define INET_PROTOSW_PERMANENT 0x02  /* Permanent protocols are unremovable. */
 #define INET_PROTOSW_ICSK  0x04  /* Is this an inet_connection_sock? */
+#define INET_PROTOSW_AFNETNS_OK 0x08 /* Is this proto afnetns compatible? */
 
 extern const struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS];
 extern const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS];
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index da7e6299073743..1eb8a8ea49f56c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -302,14 +302,22 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
goto out_rcu_unlock;
}
 
+   sock->ops = answer->ops;
+   answer_prot = answer->prot;
+   answer_flags = answer->flags;
+
err = -EPERM;
if (sock->type == SOCK_RAW && !kern &&
!ns_capable(net->user_ns, CAP_NET_RAW))
goto out_rcu_unlock;
 
-   sock->ops = answer->ops;
-   answer_prot = answer->prot;
-   answer_flags = answer->flags;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   if (unlikely(!kern &&
+current->nsproxy->afnet_ns != net->afnet_ns &&
+!(answer_flags & INET_PROTOSW_AFNETNS_OK)))
+   goto out_rcu_unlock;
+#endif
+
rcu_read_unlock();
 
WARN_ON(!answer_prot->slab);
@@ -1060,7 +1068,8 @@ static struct inet_protosw inetsw_array[] =
.prot =   &tcp_prot,
.ops =    &inet_stream_ops,
.flags =  INET_PROTOSW_PERMANENT |
- INET_PROTOSW_ICSK,
+ INET_PROTOSW_ICSK |
+ INET_PROTOSW_AFNETNS_OK,
},
 
{
@@ -1068,7 +1077,8 @@ static struct inet_protosw inetsw_array[] =
.protocol =   IPPROTO_UDP,
.prot =   &udp_prot,
.ops =    &inet_dgram_ops,
-   .flags =  INET_PROTOSW_PERMANENT,
+   .flags =  INET_PROTOSW_PERMANENT |
+ INET_PROTOSW_AFNETNS_OK,
},
 
{
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index 59f10fe9782e57..fbdb4208ebc483 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -69,7 +69,8 @@ static struct inet_protosw udplite4_protosw = {
.protocol   =  IPPROTO_UDPLITE,
.prot   =  &udplite_prot,
.ops    =  &inet_dgram_ops,
-   .flags  =  INET_PROTOSW_PERMANENT,
+   .flags  =  INET_PROTOSW_PERMANENT |
+  INET_PROTOSW_AFNETNS_OK,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 4aa221826e753c..e21804b24be408 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -167,14 +167,22 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
goto out_rcu_unlock;
}
 
+   sock->ops = answer->ops;
+   answer_prot = answer->prot;
+   answer_flags = answer->flags;
+
err = -EPERM;
if (sock->type == SOCK_RAW && !kern &&
!ns_capable(net->user_ns, CAP_NET_RAW))
goto out_rcu_unlock;
 
-   sock->ops = answer->ops;
-   answer_prot = answer->prot;
-   answer_flags = answer->flags;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   if (unlikely(!kern &&
+current->nsproxy->afnet_ns != net->afnet_ns &&
+!(answer_flags & INET_PROTOSW_AFNETNS_OK)))
+   goto out_rcu_unlock;
+#endif
+
rcu_read_unlock();
 
WARN_ON(!answer_prot->slab);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 56f742fff96723..5b3b34495d4538 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1944,7 +1944,8 @@ static struct inet_protosw tcpv6_protosw = {
.prot   =   &tcpv6_prot,
.ops    =   &inet6_stream_ops,
.flags  =   INET_PROTOSW_PERMANENT |
-   INET_PROTOSW_ICSK,
+   INET_PROTOSW_ICSK |
+   INET_PROTOSW_AFNETNS_OK,
 };
 
 static int
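
The whitelist check that this patch adds to inet_create() and inet6_create() boils down to one condition. The sketch below restates it in userspace C. The function name, the integer afnetns ids and the MODEL_ prefixed flag values are assumptions made for illustration (the flag values mirror the ones visible in the hunk for include/net/protocol.h):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PROTOSW_PERMANENT  0x02  /* mirrors INET_PROTOSW_PERMANENT */
#define MODEL_PROTOSW_AFNETNS_OK 0x08  /* mirrors INET_PROTOSW_AFNETNS_OK */

/*
 * Socket creation is refused with -EPERM only if all three hold:
 *  - the socket is created by userspace (not a kernel socket),
 *  - the caller runs in a non-default afnetns of the net namespace,
 *  - the protocol is not marked as afnetns compatible.
 */
int model_create_allowed(bool kern, int caller_afnetns, int net_afnetns,
                         unsigned int proto_flags)
{
    if (!kern && caller_afnetns != net_afnetns &&
        !(proto_flags & MODEL_PROTOSW_AFNETNS_OK))
        return -EPERM;
    return 0;
}

/* Usage: a whitelisted protocol works everywhere, others only by default. */
void model_whitelist_example(void)
{
    unsigned int tcp_like = MODEL_PROTOSW_PERMANENT | MODEL_PROTOSW_AFNETNS_OK;
    unsigned int raw_like = MODEL_PROTOSW_PERMANENT;

    assert(model_create_allowed(false, 1, 0, tcp_like) == 0);
    assert(model_create_allowed(false, 1, 0, raw_like) == -EPERM);
    assert(model_create_allowed(false, 0, 0, raw_like) == 0);
}
```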

[PATCH net-next RFC v1 22/27] afnetns: track owning namespace for inet_bind

2017-03-12 Thread Hannes Frederic Sowa

In order for a newly created afnetns to allow its processes to bind to
ports lower than 1024, we need to track the user namespace that is being
created, so that the bind permission check can be done against it.

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/afnetns.h|  7 +--
 kernel/nsproxy.c |  2 +-
 net/core/afnetns.c   | 18 --
 net/core/net_namespace.c |  2 +-
 4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/net/afnetns.h b/include/net/afnetns.h
index 9039086717c356..9db49551fff714 100644
--- a/include/net/afnetns.h
+++ b/include/net/afnetns.h
@@ -8,6 +8,7 @@
 struct afnetns {
 #if IS_ENABLED(CONFIG_AFNETNS)
refcount_t ref;
+   struct user_namespace *user_ns;
struct ns_common ns;
struct net *net;
 #endif
@@ -17,8 +18,10 @@ extern struct afnetns init_afnetns;
 
 int afnet_ns_init(void);
 
-struct afnetns *afnetns_new(struct net *net);
-struct afnetns *copy_afnet_ns(unsigned long flags, struct nsproxy *old);
+struct afnetns *afnetns_new(struct net *net, struct user_namespace *user_ns);
+struct afnetns *copy_afnet_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct nsproxy *old);
 struct afnetns *afnetns_get_by_fd(int fd);
 unsigned int afnetns_to_inode(struct afnetns *afnetns);
 void afnetns_free(struct afnetns *afnetns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f99ecbdd506137..90462012aecf78 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -114,7 +114,7 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
}
 
 #if IS_ENABLED(CONFIG_AFNETNS)
-   new_nsp->afnet_ns = copy_afnet_ns(flags, tsk->nsproxy);
+   new_nsp->afnet_ns = copy_afnet_ns(flags, user_ns, tsk->nsproxy);
if (IS_ERR(new_nsp->afnet_ns)) {
err = PTR_ERR(new_nsp->afnet_ns);
goto out_afnet;
diff --git a/net/core/afnetns.c b/net/core/afnetns.c
index b96c25b5ebe30d..69d776564c69be 100644
--- a/net/core/afnetns.c
+++ b/net/core/afnetns.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 const struct proc_ns_operations afnetns_operations;
 
@@ -17,7 +18,8 @@ static struct afnetns *ns_to_afnet(struct ns_common *ns)
return container_of(ns, struct afnetns, ns);
 }
 
-static int afnet_setup(struct afnetns *afnetns, struct net *net)
+static int afnet_setup(struct afnetns *afnetns, struct net *net,
+  struct user_namespace *user_ns)
 {
int err;
 
@@ -28,11 +30,12 @@ static int afnet_setup(struct afnetns *afnetns, struct net *net)
 
refcount_set(&afnetns->ref, 1);
afnetns->net = get_net(net);
+   afnetns->user_ns = get_user_ns(user_ns);
 
return err;
 }
 
-struct afnetns *afnetns_new(struct net *net)
+struct afnetns *afnetns_new(struct net *net, struct user_namespace *user_ns)
 {
int err;
struct afnetns *afnetns;
@@ -41,7 +44,7 @@ struct afnetns *afnetns_new(struct net *net)
if (!afnetns)
return ERR_PTR(-ENOMEM);
 
-   err = afnet_setup(afnetns, net);
+   err = afnet_setup(afnetns, net, user_ns);
if (err) {
kfree(afnetns);
return ERR_PTR(err);
@@ -54,6 +57,7 @@ void afnetns_free(struct afnetns *afnetns)
 {
ns_free_inum(&afnetns->ns);
put_net(afnetns->net);
+   put_user_ns(afnetns->user_ns);
kfree(afnetns);
 }
 EXPORT_SYMBOL(afnetns_free);
@@ -85,7 +89,9 @@ unsigned int afnetns_to_inode(struct afnetns *afnetns)
 }
 EXPORT_SYMBOL(afnetns_to_inode);
 
-struct afnetns *copy_afnet_ns(unsigned long flags, struct nsproxy *old)
+struct afnetns *copy_afnet_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct nsproxy *old)
 {
if (flags & CLONE_NEWNET)
return afnetns_get(old->net_ns->afnet_ns);
@@ -93,7 +99,7 @@ struct afnetns *copy_afnet_ns(unsigned long flags, struct nsproxy *old)
if (!(flags & CLONE_NEWAFNET))
return afnetns_get(old->afnet_ns);
 
-   return afnetns_new(old->net_ns);
+   return afnetns_new(old->net_ns, user_ns);
 }
 
 static struct ns_common *afnet_get(struct task_struct *task)
@@ -144,7 +150,7 @@ int __init afnet_ns_init(void)
 {
int err;
 
-   err = afnet_setup(&init_afnetns, &init_net);
+   err = afnet_setup(&init_afnetns, &init_net, &init_user_ns);
if (err)
return err;
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1b11883d8cdbbd..6bb1c87e72dcc0 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -287,7 +287,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 
 #if IS_ENABLED(CONFIG_AFNETNS)
if (likely(!net_eq(&init_net, net))) {
-   net->afnet_ns = afnetns_new(net);
+   net->afnet_ns = afnetns_new(net, user_ns);
if (IS_ERR(net->afnet_ns)) {

[PATCH net-next RFC v1 14/27] afnetns: check for afnetns in inet6_bind

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/addrconf.h |  3 ++-
 net/ipv6/addrconf.c| 12 ++--
 net/ipv6/af_inet6.c|  7 +--
 net/ipv6/ndisc.c   |  4 ++--
 net/ipv6/route.c   |  2 +-
 5 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 17c6fd84e28780..e3f1920ca57968 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -63,7 +63,8 @@ int addrconf_set_dstaddr(struct net *net, void __user *arg);
 
 int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
  const struct net_device *dev, int strict);
-int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
+int ipv6_chk_addr_and_flags(struct net *net, struct afnetns *afnetns,
+   const struct in6_addr *addr,
const struct net_device *dev, int strict,
u32 banned_flags);
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c67f6d3c5b9a7a..2e546584695118 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1776,11 +1776,13 @@ static int ipv6_count_addresses(struct inet6_dev *idev)
 int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
  const struct net_device *dev, int strict)
 {
-   return ipv6_chk_addr_and_flags(net, addr, dev, strict, IFA_F_TENTATIVE);
+   return ipv6_chk_addr_and_flags(net, NULL, addr, dev, strict,
+  IFA_F_TENTATIVE);
 }
 EXPORT_SYMBOL(ipv6_chk_addr);
 
-int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
+int ipv6_chk_addr_and_flags(struct net *net, struct afnetns *afnetns,
+   const struct in6_addr *addr,
const struct net_device *dev, int strict,
u32 banned_flags)
 {
@@ -1792,6 +1794,12 @@ int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) {
if (!net_eq(dev_net(ifp->idev->dev), net))
continue;
+
+#if IS_ENABLED(CONFIG_AFNETNS)
+   if (afnetns && ifp->afnetns != afnetns)
+   continue;
+#endif
+
/* Decouple optimistic from tentative for evaluation here.
 * Ban optimistic addresses explicitly, when required.
 */
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index f9367c507573bc..ffb116297c0950 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -362,8 +362,11 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
if (!(addr_type & IPV6_ADDR_MULTICAST)) {
if (!net->ipv6.sysctl.ip_nonlocal_bind &&
!(inet->freebind || inet->transparent) &&
-   !ipv6_chk_addr(net, &addr->sin6_addr,
-                  dev, 0)) {
+   !ipv6_chk_addr_and_flags(net,
+                            sock_afnetns(sk),
+                            &addr->sin6_addr,
+                            dev, 0,
+                            IFA_F_TENTATIVE)) {
err = -EADDRNOTAVAIL;
goto out_unlock;
}
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 7ebac630d3c603..4415659f8cfb0d 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -693,8 +693,8 @@ static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb)
struct in6_addr *target = (struct in6_addr *)&neigh->primary_key;
int probes = atomic_read(&neigh->probes);
 
-   if (skb && ipv6_chk_addr_and_flags(dev_net(dev), &ipv6_hdr(skb)->saddr,
-                                      dev, 1,
+   if (skb && ipv6_chk_addr_and_flags(dev_net(dev), NULL,
+                                      &ipv6_hdr(skb)->saddr, dev, 1,
   IFA_F_TENTATIVE|IFA_F_OPTIMISTIC))
saddr = &ipv6_hdr(skb)->saddr;
probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 229bfcc451ef50..87d87c5413d71e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2007,7 +2007,7 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
 * prefix route was assigned to, which might be non-loopback.
 */
err = -EINVAL;
-   if (ipv6_chk_addr_and_flags(net, gw_addr,
+   if (ipv6_chk_addr_and_flags(net, NULL, gw_addr,
gwa_type & IPV6_ADDR_LINKLOCAL ?
dev : NULL, 0, 0))

[PATCH net-next RFC v1 24/27] afnetns: check afnetns user_ns in inet6_bind

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv6/af_inet6.c | 40 
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 30aff01eba5be0..4aa221826e753c 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -273,6 +273,26 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
goto out;
 }
 
+static int inet6_allow_bind(struct net *net, struct in6_addr *addr,
+   unsigned short snum, struct net_device *dev)
+{
+   struct user_namespace *user_ns;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   struct afnetns *afnetns;
+
+   afnetns = ipv6_get_ifaddr_afnetns_rcu(net, addr, dev);
+   user_ns = afnetns ? afnetns->user_ns : net->user_ns;
+#else
+   user_ns = net->user_ns;
+#endif
+
+   if (snum && snum < inet_prot_sock(net) &&
+   !ns_capable(user_ns, CAP_NET_BIND_SERVICE))
+   return -EADDRNOTAVAIL;
+
+   return 0;
+}
+
 
 /* bind for INET6 API */
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
@@ -301,11 +321,6 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
if ((addr_type & IPV6_ADDR_MULTICAST) && sock->type == SOCK_STREAM)
return -EINVAL;
 
-   snum = ntohs(addr->sin6_port);
-   if (snum && snum < inet_prot_sock(net) &&
-   !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
-   return -EACCES;
-
lock_sock(sk);
 
/* Check these errors (active socket, double bind). */
@@ -314,6 +329,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
goto out;
}
 
+   snum = ntohs(addr->sin6_port);
+
/* Check if the address belongs to the host. */
if (addr_type == IPV6_ADDR_MAPPED) {
/* Binding to v4-mapped address on a v6-only socket
@@ -330,10 +347,12 @@ int inet6_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
else
err = 0;
} else {
+   struct net_device *dev = NULL;
+
+   rcu_read_lock();
+
if (addr_type != IPV6_ADDR_ANY) {
-   struct net_device *dev = NULL;
 
-   rcu_read_lock();
if (__ipv6_addr_needs_scope_id(addr_type)) {
if (addr_len >= sizeof(struct sockaddr_in6) &&
addr->sin6_scope_id) {
@@ -371,8 +390,13 @@ int inet6_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
goto out_unlock;
}
}
-   rcu_read_unlock();
}
+
+   err = inet6_allow_bind(net, &addr->sin6_addr, snum, dev);
+   if (err)
+   goto out_unlock;
+
+   rcu_read_unlock();
}
 
inet->inet_rcv_saddr = v4addr;
-- 
2.9.3

[PATCH net-next RFC v1 08/27] afnetns: factor out inet_allow_bind

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/inet_common.h |  1 +
 net/ipv4/af_inet.c| 51 ++-
 2 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index b7952d55b9c000..4ac8229dca6af4 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -30,6 +30,7 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+int inet_allow_bind(struct sock *sk, __be32 addr);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr, int *uaddr_len,
 int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 602d40f43687c9..aee599e23137e7 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -428,6 +428,35 @@ int inet_release(struct socket *sock)
 }
 EXPORT_SYMBOL(inet_release);
 
+int inet_allow_bind(struct sock *sk, __be32 addr)
+{
+   struct inet_sock *inet = inet_sk(sk);
+   struct net *net = sock_net(sk);
+   u32 tb_id = RT_TABLE_LOCAL;
+   int chk_addr_ret;
+
+   tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id;
+   chk_addr_ret = inet_addr_type_table(net, addr, tb_id);
+
+   /* Not specified by any standard per-se, however it breaks too
+* many applications when removed.  It is unfortunate since
+* allowing applications to make a non-local bind solves
+* several problems with systems using dynamic addressing.
+* (ie. your servers still start up even if your ISDN link
+*  is temporarily down)
+*/
+   if (!net->ipv4.sysctl_ip_nonlocal_bind &&
+   !(inet->freebind || inet->transparent) &&
+   addr != htonl(INADDR_ANY) &&
+   chk_addr_ret != RTN_LOCAL &&
+   chk_addr_ret != RTN_MULTICAST &&
+   chk_addr_ret != RTN_BROADCAST)
+   return -EADDRNOTAVAIL;
+
+   return chk_addr_ret;
+}
+EXPORT_SYMBOL(inet_allow_bind);
+
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
@@ -436,7 +465,6 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
struct net *net = sock_net(sk);
unsigned short snum;
int chk_addr_ret;
-   u32 tb_id = RT_TABLE_LOCAL;
int err;
 
/* If the socket has its own bind function then use it. (RAW) */
@@ -458,24 +486,11 @@ int inet_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
goto out;
}
 
-   tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id;
-   chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
-
-   /* Not specified by any standard per-se, however it breaks too
-* many applications when removed.  It is unfortunate since
-* allowing applications to make a non-local bind solves
-* several problems with systems using dynamic addressing.
-* (ie. your servers still start up even if your ISDN link
-*  is temporarily down)
-*/
-   err = -EADDRNOTAVAIL;
-   if (!net->ipv4.sysctl_ip_nonlocal_bind &&
-   !(inet->freebind || inet->transparent) &&
-   addr->sin_addr.s_addr != htonl(INADDR_ANY) &&
-   chk_addr_ret != RTN_LOCAL &&
-   chk_addr_ret != RTN_MULTICAST &&
-   chk_addr_ret != RTN_BROADCAST)
+   chk_addr_ret = inet_allow_bind(sk, addr->sin_addr.s_addr);
+   if (chk_addr_ret < 0) {
+   err = chk_addr_ret;
goto out;
+   }
 
snum = ntohs(addr->sin_port);
err = -EACCES;
-- 
2.9.3

[PATCH net-next RFC v1 19/27] afnetns: add afnetns support for tcpv4

2017-03-12 Thread Hannes Frederic Sowa

This commit adds the necessary checks to inet_hashtables, so that
sockets also have to match the corresponding afnetns.

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/inet_sock.h|  1 +
 net/ipv4/inet_hashtables.c | 17 +++--
 net/ipv4/tcp_input.c   |  3 +++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index aa95053dfc78d3..d348f150e8e2c9 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -81,6 +81,7 @@ struct inet_request_sock {
 #define ir_iif req.__req_common.skc_bound_dev_if
 #define ir_cookie  req.__req_common.skc_cookie
 #define ireq_net   req.__req_common.skc_net
+#define ireq_afnet req.__req_common.skc_afnet
 #define ireq_state req.__req_common.skc_state
 #define ireq_familyreq.__req_common.skc_family
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 8bea74298173f5..813a8fa1331944 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 
+#include 
+
 static u32 inet_ehashfn(const struct net *net, const __be32 laddr,
const __u16 lport, const __be32 faddr,
const __be16 fport)
@@ -169,6 +171,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock 
*child)
 EXPORT_SYMBOL_GPL(__inet_inherit_port);
 
 static inline int compute_score(struct sock *sk, struct net *net,
+   struct afnetns *afnetns,
const unsigned short hnum, const __be32 daddr,
const int dif, bool exact_dif)
 {
@@ -176,7 +179,7 @@ static inline int compute_score(struct sock *sk, struct net 
*net,
struct inet_sock *inet = inet_sk(sk);
 
if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
-   !ipv6_only_sock(sk)) {
+   afnetns == sock_afnetns(sk) && !ipv6_only_sock(sk)) {
__be32 rcv_saddr = inet->inet_rcv_saddr;
score = sk->sk_family == PF_INET ? 2 : 1;
if (rcv_saddr) {
@@ -215,10 +218,14 @@ struct sock *__inet_lookup_listener(struct net *net,
int score, hiscore = 0, matches = 0, reuseport = 0;
bool exact_dif = inet_exact_dif_match(net, skb);
struct sock *sk, *result = NULL;
+   struct afnetns *afnetns;
u32 phash = 0;
 
+   afnetns = ifa_find_afnetns_rcu(net, daddr);
+
sk_for_each_rcu(sk, &ilb->head) {
-   score = compute_score(sk, net, hnum, daddr, dif, exact_dif);
+   score = compute_score(sk, net, afnetns, hnum, daddr, dif,
+ exact_dif);
if (score > hiscore) {
reuseport = sk->sk_reuseport;
if (reuseport) {
@@ -272,6 +279,7 @@ struct sock *__inet_lookup_established(struct net *net,
 {
INET_ADDR_COOKIE(acookie, saddr, daddr);
const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
+   struct afnetns *afnetns;
struct sock *sk;
const struct hlist_nulls_node *node;
/* Optimize here for direct hit, only listening connections can
@@ -281,10 +289,14 @@ struct sock *__inet_lookup_established(struct net *net,
unsigned int slot = hash & hashinfo->ehash_mask;
struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 
+   afnetns = ifa_find_afnetns_rcu(net, daddr);
+
 begin:
sk_nulls_for_each_rcu(sk, node, &head->chain) {
if (sk->sk_hash != hash)
continue;
+   if (afnetns != sock_afnetns(sk))
+   continue;
if (likely(INET_MATCH(sk, net, acookie,
  saddr, daddr, ports, dif))) {
if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
@@ -445,6 +457,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
inet_csk(sk2)->icsk_bind_hash == tb &&
sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
+   sock_afnetns(sk) == sock_afnetns(sk2) &&
inet_rcv_saddr_equal(sk, sk2, false))
return reuseport_add_sock(sk, sk2);
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96b67a8b18c3c3..0fc69a32c9faea 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6211,6 +6211,9 @@ struct request_sock *inet_reqsk_alloc(const struct 
request_sock_ops *ops,
atomic64_set(&ireq->ir_cookie, 0);
ireq->ireq_state = TCP_NEW_SYN_RECV;
write_pnet(&ireq->ireq_net, sock_net(sk_listener));
+#if IS_ENABLED(CONFIG_AFNETNS)
+   ireq->ireq_afnet = sock_afnetns(sk_listener);
+#endif
ireq->ireq_family = sk_listener->sk_family;

[PATCH net-next RFC v1 07/27] ipv4: introduce ifa_find_rcu

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/linux/inetdevice.h |  1 +
 net/ipv4/devinet.c | 29 +
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index d5ac959e90baa1..eb1b662f62626f 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -159,6 +159,7 @@ int unregister_inetaddr_notifier(struct notifier_block *nb);
 void inet_netconf_notify_devconf(struct net *net, int type, int ifindex,
 struct ipv4_devconf *devconf);
 
+struct in_ifaddr *ifa_find_rcu(struct net *net, __be32 addr);
 struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref);
 static inline struct net_device *ip_dev_find(struct net *net, __be32 addr)
 {
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index d4a38b6e9adb79..cc15afefa1df0a 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -128,6 +128,20 @@ static void inet_hash_remove(struct in_ifaddr *ifa)
hlist_del_init_rcu(&ifa->hash);
 }
 
+struct in_ifaddr *ifa_find_rcu(struct net *net, __be32 addr)
+{
+   u32 hash = inet_addr_hash(net, addr);
+   struct in_ifaddr *ifa;
+
+   hlist_for_each_entry_rcu(ifa, &inet_addr_lst[hash], hash) {
+   if (ifa->ifa_local == addr &&
+   net_eq(dev_net(ifa->ifa_dev->dev), net))
+   return ifa;
+   }
+
+   return NULL;
+}
+
 /**
  * __ip_dev_find - find the first device with a given source address.
  * @net: the net namespace
@@ -138,21 +152,12 @@ static void inet_hash_remove(struct in_ifaddr *ifa)
  */
 struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 {
-   u32 hash = inet_addr_hash(net, addr);
-   struct net_device *result = NULL;
+   struct net_device *result;
struct in_ifaddr *ifa;
 
rcu_read_lock();
-   hlist_for_each_entry_rcu(ifa, &inet_addr_lst[hash], hash) {
-   if (ifa->ifa_local == addr) {
-   struct net_device *dev = ifa->ifa_dev->dev;
-
-   if (!net_eq(dev_net(dev), net))
-   continue;
-   result = dev;
-   break;
-   }
-   }
+   ifa = ifa_find_rcu(net, addr);
+   result = ifa ? ifa->ifa_dev->dev : NULL;
if (!result) {
struct flowi4 fl4 = { .daddr = addr };
struct fib_result res = { 0 };
-- 
2.9.3

[PATCH net-next RFC v1 17/27] afnetns: introduce __inet_select_addr

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/linux/inetdevice.h |  2 ++
 net/ipv4/devinet.c | 27 ---
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 01cbcfe93383b7..a41bfce099e0a1 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -170,6 +170,8 @@ int inet_addr_onlink(struct in_device *in_dev, __be32 a, 
__be32 b);
 int devinet_ioctl(struct net *net, unsigned int cmd, void __user *);
 void devinet_init(void);
 struct in_device *inetdev_by_index(struct net *, int);
+__be32 __inet_select_addr(const struct net_device *dev, __be32 dst, int scope,
+ struct afnetns *afnetns);
 __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope);
 __be32 inet_confirm_addr(struct net *net, struct in_device *in_dev, __be32 dst,
 __be32 local, int scope);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index cc15afefa1df0a..0844d917aa8d7d 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1224,7 +1224,17 @@ static int inet_gifconf(struct net_device *dev, char 
__user *buf, int len)
return done;
 }
 
-__be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope)
+static struct afnetns *ifa_afnetns(struct in_ifaddr *ifa)
+{
+#if IS_ENABLED(CONFIG_AFNETNS)
+   return ifa->afnetns;
+#else
+   return NULL;
+#endif
+}
+
+__be32 __inet_select_addr(const struct net_device *dev, __be32 dst,
+ int scope, struct afnetns *afnetns)
 {
__be32 addr = 0;
struct in_device *in_dev;
@@ -1237,6 +1247,8 @@ __be32 inet_select_addr(const struct net_device *dev, 
__be32 dst, int scope)
goto no_in_dev;
 
for_primary_ifa(in_dev) {
+   if (afnetns && afnetns != ifa_afnetns(ifa))
+   continue;
if (ifa->ifa_scope > scope)
continue;
if (!dst || inet_ifa_match(dst, ifa)) {
@@ -1262,7 +1274,8 @@ __be32 inet_select_addr(const struct net_device *dev, 
__be32 dst, int scope)
(in_dev = __in_dev_get_rcu(dev))) {
for_primary_ifa(in_dev) {
if (ifa->ifa_scope != RT_SCOPE_LINK &&
-   ifa->ifa_scope <= scope) {
+   ifa->ifa_scope <= scope &&
+   (!afnetns || afnetns == ifa_afnetns(ifa))) {
addr = ifa->ifa_local;
goto out_unlock;
}
@@ -1283,7 +1296,8 @@ __be32 inet_select_addr(const struct net_device *dev, 
__be32 dst, int scope)
 
for_primary_ifa(in_dev) {
if (ifa->ifa_scope != RT_SCOPE_LINK &&
-   ifa->ifa_scope <= scope) {
+   ifa->ifa_scope <= scope &&
+   (!afnetns || afnetns == ifa_afnetns(ifa))) {
addr = ifa->ifa_local;
goto out_unlock;
}
@@ -1293,6 +1307,13 @@ __be32 inet_select_addr(const struct net_device *dev, 
__be32 dst, int scope)
rcu_read_unlock();
return addr;
 }
+EXPORT_SYMBOL(__inet_select_addr);
+
+__be32 inet_select_addr(const struct net_device *dev, __be32 dst,
+   int scope)
+{
+   return __inet_select_addr(dev, dst, scope, NULL);
+}
 EXPORT_SYMBOL(inet_select_addr);
 
 static __be32 confirm_addr_indev(struct in_device *in_dev, __be32 dst,
-- 
2.9.3

[PATCH net-next RFC v1 20/27] ipv6: move ipv6_get_ifaddr to vmlinux in case ipv6 is built as module

2017-03-12 Thread Hannes Frederic Sowa

inet6_hashtables is built into vmlinux even when ipv6 is built as a
module. As the inet6_hashtables functions depend on ipv6_get_ifaddr
via ipv6_get_ifaddr_afnetns_rcu, we need to make the lookup function
always available.

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/addrconf.h  |  6 ++
 net/ipv6/addrconf.c | 35 +--
 net/ipv6/inet6_hashtables.c | 39 +++
 3 files changed, 46 insertions(+), 34 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 644fa68bb4ddef..dcb17f88fd2875 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -78,6 +78,7 @@ bool ipv6_chk_custom_prefix(const struct in6_addr *addr,
 
 int ipv6_chk_prefix(const struct in6_addr *addr, struct net_device *dev);
 
+extern struct hlist_head inet6_addr_lst[IN6_ADDR_HSIZE];
 struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net,
 const struct in6_addr *addr,
 struct net_device *dev, int strict);
@@ -416,6 +417,11 @@ static inline bool ipv6_addr_is_solict_mult(const struct 
in6_addr *addr)
 #endif
 }
 
+static inline u32 inet6_addr_hash(const struct in6_addr *addr)
+{
+   return hash_32(ipv6_addr_hash(addr), IN6_ADDR_HSIZE_SHIFT);
+}
+
 #ifdef CONFIG_PROC_FS
 int if6_proc_init(void);
 void if6_proc_exit(void);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2e546584695118..319f83a7d29dd5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -160,7 +160,6 @@ static int ipv6_generate_stable_address(struct in6_addr 
*addr,
 /*
  * Configured unicast address hash table
  */
-static struct hlist_head inet6_addr_lst[IN6_ADDR_HSIZE];
 static DEFINE_SPINLOCK(addrconf_hash_lock);
 
 static void addrconf_verify(void);
@@ -936,11 +935,6 @@ ipv6_link_dev_addr(struct inet6_dev *idev, struct 
inet6_ifaddr *ifp)
list_add_tail(&ifp->if_list, p);
 }
 
-static u32 inet6_addr_hash(const struct in6_addr *addr)
-{
-   return hash_32(ipv6_addr_hash(addr), IN6_ADDR_HSIZE_SHIFT);
-}
-
 /* On success it returns ifp with increased reference count */
 
 static struct inet6_ifaddr *
@@ -1888,30 +1882,6 @@ int ipv6_chk_prefix(const struct in6_addr *addr, struct 
net_device *dev)
 }
 EXPORT_SYMBOL(ipv6_chk_prefix);
 
-struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net, const struct in6_addr 
*addr,
-struct net_device *dev, int strict)
-{
-   struct inet6_ifaddr *ifp, *result = NULL;
-   unsigned int hash = inet6_addr_hash(addr);
-
-   rcu_read_lock_bh();
-   hlist_for_each_entry_rcu_bh(ifp, &inet6_addr_lst[hash], addr_lst) {
-   if (!net_eq(dev_net(ifp->idev->dev), net))
-   continue;
-   if (ipv6_addr_equal(&ifp->addr, addr)) {
-   if (!dev || ifp->idev->dev == dev ||
-   !(ifp->scope&(IFA_LINK|IFA_HOST) || strict)) {
-   result = ifp;
-   in6_ifa_hold(ifp);
-   break;
-   }
-   }
-   }
-   rcu_read_unlock_bh();
-
-   return result;
-}
-
 /* Gets referenced address, destroys ifaddr */
 
 static void addrconf_dad_stop(struct inet6_ifaddr *ifp, int dad_failed)
@@ -6518,7 +6488,7 @@ static struct rtnl_af_ops inet6_ops __read_mostly = {
 int __init addrconf_init(void)
 {
struct inet6_dev *idev;
-   int i, err;
+   int err;
 
err = ipv6_addr_label_init();
if (err < 0) {
@@ -6563,9 +6533,6 @@ int __init addrconf_init(void)
goto errlo;
}
 
-   for (i = 0; i < IN6_ADDR_HSIZE; i++)
-   INIT_HLIST_HEAD(&inet6_addr_lst[i]);
-
register_netdevice_notifier(&ipv6_dev_notf);
 
addrconf_verify();
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index d0900918a19e5e..8570e0e3016b65 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -25,6 +25,9 @@
 #include 
 #include 
 
+struct hlist_head inet6_addr_lst[IN6_ADDR_HSIZE];
+EXPORT_SYMBOL(inet6_addr_lst);
+
 u32 inet6_ehashfn(const struct net *net,
  const struct in6_addr *laddr, const u16 lport,
  const struct in6_addr *faddr, const __be16 fport)
@@ -44,6 +47,32 @@ u32 inet6_ehashfn(const struct net *net,
   inet6_ehash_secret + net_hash_mix(net));
 }
 
+struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net,
+const struct in6_addr *addr,
+struct net_device *dev, int strict)
+{
+   struct inet6_ifaddr *ifp, *result = NULL;
+   unsigned int hash = inet6_addr_hash(addr);
+
+   rcu_read_lock_bh();
+   hlist_for_each_entry_rcu_bh(ifp, &inet6_addr_lst[hash], addr_lst) {
+   if (!net_eq(dev_net(ifp->idev->dev), net))
+   continue;
+

[PATCH net-next RFC v1 09/27] afnetns: add sock_afnetns

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/sock.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1e05d497db2520..aa204bf3537ba0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2293,6 +2293,15 @@ struct net *sock_net(const struct sock *sk)
return read_pnet(&sk->sk_net);
 }
 
+static inline struct afnetns *sock_afnetns(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_AFNETNS)
+   return sk->sk_afnet;
+#else
+   return NULL;
+#endif
+}
+
 static inline
 void sock_net_set(struct sock *sk, struct net *net)
 {
-- 
2.9.3

[PATCH net-next RFC v1 04/27] afnetns: add net_afnetns

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/net_namespace.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index c59fb018da5e46..9be39b8315a6f9 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -244,6 +244,14 @@ int net_eq(const struct net *net1, const struct net *net2)
 #define net_drop_ns NULL
 #endif
 
+static inline struct afnetns *net_afnetns(struct net *net)
+{
+#if IS_ENABLED(CONFIG_AFNETNS)
+   return net->afnet_ns;
+#else
+   return NULL;
+#endif
+}
 
 typedef struct {
 #ifdef CONFIG_NET_NS
-- 
2.9.3

[PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-12 Thread Hannes Frederic Sowa

--- >8 ---
Note:
* BE CAREFUL SOURCE ADDRESS SELECTION 
--- >8 ---

afnetns behaves like the other namespaces: the clone, unshare and setns
syscalls all work with afnetns, with one limitation: one cannot cross the
boundary of a network namespace while changing the afnetns compartment. To
get into an afnetns in a different net namespace, one must first switch
to that net namespace and afterwards switch to the desired afnetns.
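
The ordering rule above can be sketched as a small userspace model. This
is only an illustration: struct model_net, struct model_afnetns, struct
model_task and both helper functions are hypothetical names made up for
this sketch, not kernel structures or syscalls from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types exist in the kernel. */
struct model_net {
    int id;
};

struct model_afnetns {
    int id;
    const struct model_net *net;    /* net namespace this afnetns lives in */
};

struct model_task {
    const struct model_net *net;
    const struct model_afnetns *afnetns;
};

/* Switch the net namespace. What happens to the task's afnetns at this
 * point is not described above, so the model leaves it untouched. */
static void model_enter_net(struct model_task *t, const struct model_net *net)
{
    assert(t != NULL && net != NULL);
    t->net = net;
}

/* Switching the afnetns only works inside the task's current net namespace.
 * Returns 0 on success, -1 if the target lives in another net namespace. */
static int model_enter_afnetns(struct model_task *t,
                               const struct model_afnetns *target)
{
    assert(t != NULL && target != NULL);
    if (target->net != t->net)
        return -1;
    t->afnetns = target;
    return 0;
}
```

In this model a direct jump into an afnetns of a foreign net namespace
fails, while model_enter_net() followed by model_enter_afnetns() succeeds.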

The primitive kernel objects an afnetns relates to are:
- process
- socket
- ipv4 address
- ipv6 address

An afnetns basically forms a namespace around socket binds. While not
strictly necessary, it also affects source address selection, so firewall
rules are easier to maintain. It does in no way deal with the reception and
handling of multicast or broadcast sockets. As all afnetns namespaces
are connected to the same L2 network, it does not make sense to try to
build up separation rules here, as they can be broken anyway.
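
As a rough illustration of "a namespace around socket binds", here is a
userspace model loosely following the afnetns check in inet_allow_bind
from this series. It is a sketch under assumptions: struct model_ifaddr,
the integer afnetns ids and both functions are hypothetical, addresses are
plain host-order integers, and an address that is not configured is simply
treated as belonging to the net namespace's default afnetns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a configured address and its owning afnetns. */
struct model_ifaddr {
    uint32_t addr;
    int afnetns_id;
};

/* Owning afnetns of an address; falls back to the net namespace's default
 * afnetns when the address is not configured. */
static int model_addr_afnetns(const struct model_ifaddr *tbl, size_t n,
                              uint32_t addr, int default_id)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (tbl[i].addr == addr)
            return tbl[i].afnetns_id;
    return default_id;
}

/* Returns 0 if the bind is allowed, -EADDRNOTAVAIL otherwise. */
static int model_allow_bind(const struct model_ifaddr *tbl, size_t n,
                            uint32_t addr, int sock_afnetns_id,
                            int default_id)
{
    if (addr == 0)                      /* wildcard bind is not restricted */
        return 0;
    if (sock_afnetns_id == default_id)  /* socket in the net's own afnetns */
        return 0;
    if (model_addr_afnetns(tbl, n, addr, default_id) != sock_afnetns_id)
        return -EADDRNOTAVAIL;
    return 0;
}
```

A socket in afnetns 7 can bind an address tagged with afnetns 7, but not
an address that belongs to another afnetns.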

In comparison to ipvlan, afnetns allows the use of early socket
demuxing.

Loopback is not possible within an afnetns until its own loopback device
is added or its private ip address is used.

The easiest way to use afnetns is to use the iproute2 interface, which
very much follows the style of ip-netns.

$ ip afnetns help
Usage: ip afnetns list
   ip afnetns add NAME
   ip afnetns del NAME
   ip afnetns exec NAME cmd ...

IP addresses carry an afnetns identifier, too. It is visible with the -d
(details) option:

$ ip -d a l dev lo
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 numtxqueues 1 numrxqueues 1
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever afnet afnet:[4026531958],self
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever afnet afnet:[4026531958],self

This shows the afnetns inode number, and the "self" marker indicates that
we are currently in the same namespace as the two listed IP addresses. In
case we added a name for the namespace with ip-afnetns, it will be visible
here, too.

$ ip a a 10.0.0.1/24 dev lo afnetns test

This command adds a new ip address to the loopback device and makes it
available in the test afnetns. Commands in this namespace can use this
IP address and use it for outgoing communication.
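
How an address tagged with an afnetns ends up being used for outgoing
traffic can be sketched with a simplified userspace model of the filter
that __inet_select_addr adds in this series. Everything here is an
assumption for illustration: struct model_ifa and model_select_addr are
hypothetical, addresses are host-order integers, an afnetns id of 0 stands
in for the NULL "no filter" case, and the kernel's fallback passes over
other devices are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an interface address. */
struct model_ifa {
    uint32_t local;     /* address, host byte order in this model */
    uint32_t mask;      /* netmask */
    int scope;          /* smaller value means wider scope */
    int afnetns_id;     /* owning afnetns */
};

/* Pick the first address that belongs to the requested afnetns, has a
 * wide enough scope and, if dst is given, shares a subnet with dst.
 * Returns 0 when no address qualifies. */
static uint32_t model_select_addr(const struct model_ifa *ifas, size_t n,
                                  uint32_t dst, int scope, int afnetns_id)
{
    assert(ifas != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (afnetns_id && ifas[i].afnetns_id != afnetns_id)
            continue;
        if (ifas[i].scope > scope)
            continue;
        if (!dst || ((dst ^ ifas[i].local) & ifas[i].mask) == 0)
            return ifas[i].local;
    }
    return 0;
}
```

With the filter set, a process in the "test" afnetns would only ever be
handed addresses that were added to that afnetns.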

Changelog:
v1) first published version

The same commands work for IPv6; I only used IPv4 as an example.

This is still work in progress.

Hannes Frederic Sowa (27):
  afnetns: add CLONE_NEWAFNET flag
  afnetns: basic namespace operations and representations
  afnetns: prepare for integration into ipv4
  afnetns: add net_afnetns
  afnetns: ipv6 integration
  afnetns: put afnetns pointer into struct sock
  ipv4: introduce ifa_find_rcu
  afnetns: factor out inet_allow_bind
  afnetns: add sock_afnetns
  afnetns: add ifa_find_afnetns_rcu
  afnetns: validate afnetns in inet_allow_bind
  afnetns: ipv4/udp integration
  afnetns: use inet_allow_bind in inet6_bind
  afnetns: check for afnetns in inet6_bind
  afnetns: add ipv6_get_ifaddr_afnetns_rcu
  afnetns: add udpv6 support
  afnetns: introduce __inet_select_addr
  afnetns: afnetns should influence source address selection
  afnetns: add afnetns support for tcpv4
  ipv6: move ipv6_get_ifaddr to vmlinux in case ipv6 is built as module
  afnetns: add support for tcpv6
  afnetns: track owning namespace for inet_bind
  afnetns: use user_ns from afnetns for checking for binding to port < 1024
  afnetns: check afnetns user_ns in inet6_bind
  afnetns: ipv4: inherit afnetns from calling application
  afnetns: ipv6: inherit afnetns from calling application
  afnetns: allow only whitelisted protocols to operate inside afnetns

 Documentation/networking/afnetns.txt|  64 +
 drivers/target/iscsi/cxgbit/cxgbit_cm.c |   2 +-
 fs/proc/namespaces.c|   3 +
 include/linux/inetdevice.h  |  22 -
 include/linux/nsproxy.h |   3 +
 include/linux/proc_ns.h |   1 +
 include/net/addrconf.h  |  26 +-
 include/net/afnetns.h   |  47 ++
 include/net/if_inet6.h  |   3 +
 include/net/inet_common.h   |   1 +
 include/net/inet_sock.h |   1 +
 include/net/net_namespace.h |  12 +++
 include/net/protocol.h  |   1 +
 include/net/route.h |  10 +-
 include/net/sock.h  |  13 +++
 include/uapi/linux/if_addr.h|   2 +
 include/uapi/linux/sched.h  |   1 +
 kernel/fork.c   |  12 ++-
 kernel/nsproxy.c|  24 -
 net/Kconfig |  10 ++
 net/core/Makefile   |   1 +
 net/core/afnetns.c  | 159 
 net/core/net_namespace.c|  25 +
 net/core/sock.c

[PATCH net-next RFC v1 11/27] afnetns: validate afnetns in inet_allow_bind

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv4/af_inet.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index aee599e23137e7..5f11399bafd16f 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -453,6 +453,17 @@ int inet_allow_bind(struct sock *sk, __be32 addr)
chk_addr_ret != RTN_BROADCAST)
return -EADDRNOTAVAIL;
 
+   if (chk_addr_ret == RTN_LOCAL &&
+   net_afnetns(net) != sock_afnetns(sk)) {
+   struct afnetns *afnetns;
+
+   rcu_read_lock();
+   afnetns = ifa_find_afnetns_rcu(net, addr);
+   if (afnetns != sock_afnetns(sk))
+   chk_addr_ret = -EADDRNOTAVAIL;
+   rcu_read_unlock();
+   }
+
return chk_addr_ret;
 }
 EXPORT_SYMBOL(inet_allow_bind);
-- 
2.9.3

[PATCH net-next RFC v1 12/27] afnetns: ipv4/udp integration

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv4/udp.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ea6e4cff9fafe9..5bfe2d9f5583da 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -155,6 +155,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 
sk_for_each(sk2, &hslot->head) {
if (net_eq(sock_net(sk2), net) &&
+   sock_afnetns(sk) == sock_afnetns(sk2) &&
sk2 != sk &&
(bitmap || udp_sk(sk2)->udp_port_hash == num) &&
(!sk2->sk_reuse || !sk->sk_reuse) &&
@@ -192,6 +193,7 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num,
spin_lock(&hslot2->lock);
udp_portaddr_for_each_entry(sk2, &hslot2->head) {
if (net_eq(sock_net(sk2), net) &&
+   sock_afnetns(sk) == sock_afnetns(sk2) &&
sk2 != sk &&
(udp_sk(sk2)->udp_port_hash == num) &&
(!sk2->sk_reuse || !sk->sk_reuse) &&
@@ -220,6 +222,7 @@ static int udp_reuseport_add_sock(struct sock *sk, struct 
udp_hslot *hslot)
 
sk_for_each(sk2, &hslot->head) {
if (net_eq(sock_net(sk2), net) &&
+   sock_afnetns(sk) == sock_afnetns(sk2) &&
sk2 != sk &&
sk2->sk_family == sk->sk_family &&
ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
@@ -379,6 +382,7 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
 }
 
 static int compute_score(struct sock *sk, struct net *net,
+struct afnetns *afnetns,
 __be32 saddr, __be16 sport,
 __be32 daddr, unsigned short hnum, int dif,
 bool exact_dif)
@@ -391,6 +395,9 @@ static int compute_score(struct sock *sk, struct net *net,
ipv6_only_sock(sk))
return -1;
 
+   if (sock_afnetns(sk) != afnetns)
+   return -1;
+
score = (sk->sk_family == PF_INET) ? 2 : 1;
inet = inet_sk(sk);
 
@@ -436,6 +443,7 @@ static u32 udp_ehashfn(const struct net *net, const __be32 
laddr,
 
 /* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
+   struct afnetns *afnetns,
__be32 saddr, __be16 sport,
__be32 daddr, unsigned int hnum, int dif, bool exact_dif,
struct udp_hslot *hslot2,
@@ -448,7 +456,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
result = NULL;
badness = 0;
udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-   score = compute_score(sk, net, saddr, sport,
+   score = compute_score(sk, net, afnetns, saddr, sport,
  daddr, hnum, dif, exact_dif);
if (score > badness) {
reuseport = sk->sk_reuseport;
@@ -486,8 +494,11 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
saddr,
struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
bool exact_dif = udp_lib_exact_dif_match(net, skb);
int score, badness, matches = 0, reuseport = 0;
+   struct afnetns *afnetns;
u32 hash = 0;
 
+   afnetns = ifa_find_afnetns_rcu(net, daddr);
+
if (hslot->count > 10) {
hash2 = udp4_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
@@ -495,7 +506,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
saddr,
if (hslot->count < hslot2->count)
goto begin;
 
-   result = udp4_lib_lookup2(net, saddr, sport,
+   result = udp4_lib_lookup2(net, afnetns, saddr, sport,
  daddr, hnum, dif,
  exact_dif, hslot2, skb);
if (!result) {
@@ -510,7 +521,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
saddr,
if (hslot->count < hslot2->count)
goto begin;
 
-   result = udp4_lib_lookup2(net, saddr, sport,
+   result = udp4_lib_lookup2(net, afnetns, saddr, sport,
  daddr, hnum, dif,
  exact_dif, hslot2, skb);
}
@@ -520,7 +531,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
saddr,
result = NULL;
badness = 0;
sk_for_each_rcu(sk, &hslot->head) {
-   score = compute_score(sk, net, saddr, sport,
+   score = compute_score(sk, net, afnetns, saddr, sport,
  daddr, hnum, dif, exact_dif);
if (score > badness) {
reuseport = sk->sk_reuseport;
@@ -2031,9 +2042,12 @@ static struct sock *__udp4_lib_demux_lookup(struct net 
*net,

[PATCH net-next RFC v1 10/27] afnetns: add ifa_find_afnetns_rcu

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 include/linux/inetdevice.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index eb1b662f62626f..01cbcfe93383b7 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -180,6 +180,17 @@ static __inline__ bool inet_ifa_match(__be32 addr, struct 
in_ifaddr *ifa)
return !((addr^ifa->ifa_address)&ifa->ifa_mask);
 }
 
+static inline struct afnetns *ifa_find_afnetns_rcu(struct net *net, __be32 
addr)
+{
+#if IS_ENABLED(CONFIG_AFNETNS)
+   struct in_ifaddr *ifa = ifa_find_rcu(net, addr);
+
+   return ifa ? ifa->afnetns : net->afnet_ns;
+#else
+   return NULL;
+#endif
+}
+
 /*
  * Check if a mask is acceptable.
  */
-- 
2.9.3

[PATCH net-next RFC v1 13/27] afnetns: use inet_allow_bind in inet6_bind

2017-03-12 Thread Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv6/af_inet6.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 04db40620ea65c..f9367c507573bc 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -316,8 +316,6 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
 
/* Check if the address belongs to the host. */
if (addr_type == IPV6_ADDR_MAPPED) {
-   int chk_addr_ret;
-
/* Binding to v4-mapped address on a v6-only socket
 * makes no sense
 */
@@ -326,18 +324,11 @@ int inet6_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
goto out;
}
 
-   /* Reproduce AF_INET checks to make the bindings consistent */
-   v4addr = addr->sin6_addr.s6_addr32[3];
-   chk_addr_ret = inet_addr_type(net, v4addr);
-   if (!net->ipv4.sysctl_ip_nonlocal_bind &&
-   !(inet->freebind || inet->transparent) &&
-   v4addr != htonl(INADDR_ANY) &&
-   chk_addr_ret != RTN_LOCAL &&
-   chk_addr_ret != RTN_MULTICAST &&
-   chk_addr_ret != RTN_BROADCAST) {
-   err = -EADDRNOTAVAIL;
+   err = inet_allow_bind(sk, addr->sin6_addr.s6_addr32[3]);
+   if (err < 0)
goto out;
-   }
+   else
+   err = 0;
} else {
if (addr_type != IPV6_ADDR_ANY) {
struct net_device *dev = NULL;
-- 
2.9.3

[PATCH net-next RFC v1 06/27] afnetns: put afnetns pointer into struct sock

2017-03-12 Thread Hannes Frederic Sowa

Every socket is associated with its creator's afnet namespace.

Some care must be taken with in-kernel socket creation. Kernel sockets
are associated with the afnet namespace of their net namespace and do
not use the afnetns of the current process context.

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/sock.h |  4 
 net/core/sock.c| 18 --
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 6db7693b9e6185..1e05d497db2520 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -183,6 +183,9 @@ struct sock_common {
};
struct proto*skc_prot;
possible_net_t  skc_net;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   struct afnetns  *skc_afnet;
+#endif
 
 #if IS_ENABLED(CONFIG_IPV6)
struct in6_addr skc_v6_daddr;
@@ -337,6 +340,7 @@ struct sock {
 #define sk_bind_node   __sk_common.skc_bind_node
 #define sk_prot__sk_common.skc_prot
 #define sk_net __sk_common.skc_net
+#define sk_afnet   __sk_common.skc_afnet
 #define sk_v6_daddr__sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr__sk_common.skc_v6_rcv_saddr
 #define sk_cookie  __sk_common.skc_cookie
diff --git a/net/core/sock.c b/net/core/sock.c
index 768aedf238f5b4..542d496858f993 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1458,6 +1458,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t 
priority,
if (likely(sk->sk_net_refcnt))
get_net(net);
sock_net_set(sk, net);
+#if IS_ENABLED(CONFIG_AFNETNS)
+   if (likely(sk->sk_net_refcnt))
+   sk->sk_afnet = afnetns_get(current->nsproxy->afnet_ns);
+   else
+   sk->sk_afnet = net->afnet_ns;
+#endif
atomic_set(&sk->sk_wmem_alloc, 1);
 
mem_cgroup_sk_alloc(sk);
@@ -1499,8 +1505,12 @@ static void __sk_destruct(struct rcu_head *head)
if (sk->sk_peer_cred)
put_cred(sk->sk_peer_cred);
put_pid(sk->sk_peer_pid);
-   if (likely(sk->sk_net_refcnt))
+   if (likely(sk->sk_net_refcnt)) {
put_net(sock_net(sk));
+#if IS_ENABLED(CONFIG_AFNETNS)
+   afnetns_put(sk->sk_afnet);
+#endif
+   }
sk_prot_free(sk->sk_prot_creator, sk);
 }
 
@@ -1572,8 +1582,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
sock_copy(newsk, sk);
 
/* SANITY */
-   if (likely(newsk->sk_net_refcnt))
+   if (likely(newsk->sk_net_refcnt)) {
get_net(sock_net(newsk));
+#if IS_ENABLED(CONFIG_AFNETNS)
+   afnetns_get(newsk->sk_afnet);
+#endif
+   }
sk_node_init(&newsk->sk_node);
sock_lock_init(newsk);
bh_lock_sock(newsk);
-- 
2.9.3

[PATCH net-next RFC v1 05/27] afnetns: ipv6 integration

2017-03-12 Thread Hannes Frederic Sowa

Like the previous IPv4 counterpart, this patch associates every IPv6
address with a corresponding afnet namespace. The namespace can be set
via file descriptor and the inode gets reported during dumping.

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/if_inet6.h |  3 +++
 net/core/afnetns.c |  3 +++
 net/ipv6/addrconf.c| 70 +-
 3 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index f656f9051acafa..cad645851501f4 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -41,6 +41,9 @@ enum {
 struct inet6_ifaddr {
struct in6_addr addr;
__u32   prefix_len;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   struct afnetns  *afnetns;
+#endif
 
/* In seconds, relative to tstamp. Expiry is at tstamp + HZ * lft. */
__u32   valid_lft;
diff --git a/net/core/afnetns.c b/net/core/afnetns.c
index 12b823ae780796..b96c25b5ebe30d 100644
--- a/net/core/afnetns.c
+++ b/net/core/afnetns.c
@@ -56,6 +56,7 @@ void afnetns_free(struct afnetns *afnetns)
put_net(afnetns->net);
kfree(afnetns);
 }
+EXPORT_SYMBOL(afnetns_free);
 
 struct afnetns *afnetns_get_by_fd(int fd)
 {
@@ -76,11 +77,13 @@ struct afnetns *afnetns_get_by_fd(int fd)
fput(file);
return afnetns;
 }
+EXPORT_SYMBOL(afnetns_get_by_fd);
 
 unsigned int afnetns_to_inode(struct afnetns *afnetns)
 {
return afnetns->ns.inum;
 }
+EXPORT_SYMBOL(afnetns_to_inode);
 
 struct afnetns *copy_afnet_ns(unsigned long flags, struct nsproxy *old)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 8c69768a5c4606..c67f6d3c5b9a7a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -910,7 +910,9 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
return;
}
ip6_rt_put(ifp->rt);
-
+#if IS_ENABLED(CONFIG_AFNETNS)
+   afnetns_put(ifp->afnetns);
+#endif
kfree_rcu(ifp, rcu);
 }
 
@@ -942,9 +944,10 @@ static u32 inet6_addr_hash(const struct in6_addr *addr)
 /* On success it returns ifp with increased reference count */
 
 static struct inet6_ifaddr *
-ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
- const struct in6_addr *peer_addr, int pfxlen,
- int scope, u32 flags, u32 valid_lft, u32 prefered_lft)
+__ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
+   const struct in6_addr *peer_addr, int pfxlen,
+   int scope, u32 flags, u32 valid_lft, u32 prefered_lft,
+   struct afnetns *afnetns)
 {
struct net *net = dev_net(idev->dev);
struct inet6_ifaddr *ifa = NULL;
@@ -1002,7 +1005,9 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
ifa->addr = *addr;
if (peer_addr)
ifa->peer_addr = *peer_addr;
-
+#if IS_ENABLED(CONFIG_AFNETNS)
+   ifa->afnetns = afnetns_get(afnetns);
+#endif
spin_lock_init(&ifa->lock);
INIT_DELAYED_WORK(&ifa->dad_work, addrconf_dad_work);
INIT_HLIST_NODE(&ifa->addr_lst);
@@ -1054,6 +1059,17 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
goto out2;
 }
 
+static struct inet6_ifaddr *ipv6_add_addr(struct inet6_dev *idev,
+ const struct in6_addr *addr,
+ const struct in6_addr *peer_addr,
+ int pfxlen, int scope, u32 flags,
+ u32 valid_lft, u32 prefered_lft)
+{
+   return __ipv6_add_addr(idev, addr, peer_addr, pfxlen, scope, flags,
+  valid_lft, prefered_lft,
+  net_afnetns(dev_net(idev->dev)));
+}
+
 enum cleanup_prefix_rt_t {
CLEANUP_PREFIX_RT_NOP,/* no cleanup action for prefix route */
CLEANUP_PREFIX_RT_DEL,/* delete the prefix route */
@@ -2741,7 +2757,8 @@ static int inet6_addr_add(struct net *net, int ifindex,
  const struct in6_addr *pfx,
  const struct in6_addr *peer_pfx,
  unsigned int plen, __u32 ifa_flags,
- __u32 prefered_lft, __u32 valid_lft)
+ __u32 prefered_lft, __u32 valid_lft,
+ struct afnetns *afnetns)
 {
struct inet6_ifaddr *ifp;
struct inet6_dev *idev;
@@ -2799,8 +2816,8 @@ static int inet6_addr_add(struct net *net, int ifindex,
prefered_lft = timeout;
}
 
-   ifp = ipv6_add_addr(idev, pfx, peer_pfx, plen, scope, ifa_flags,
-   valid_lft, prefered_lft);
+   ifp = __ipv6_add_addr(idev, pfx, peer_pfx, plen, scope, ifa_flags,
+ valid_lft, prefered_lft, afnetns);
 
if (!IS_ERR(ifp)) {
if (!(ifa_flags &

[PATCH net-next RFC v1 01/27] afnetns: add CLONE_NEWAFNET flag

2017-03-12 Thread Hannes Frederic Sowa

This patch adds a new clone flag. It is used when a clone should also
create a new afnetns namespace. The only restriction placed on this new
flag is that it cannot be used together with CLONE_NEWNET.

Flag 0x1000 was previously used for CLONE_IDLETASK until 2004. It was
only allowed to be used by the kernel itself, thus I consider reusing
it safe.
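
As an illustration of the restriction, a caller could check the flag
combination up front. This is a minimal sketch: check_afnet_clone_flags()
is a hypothetical helper, and returning -EINVAL is an assumption, since
the error code used by the kernel is not stated in this message.

```c
#include <errno.h>

/* CLONE_NEWAFNET as proposed in this patch; CLONE_NEWNET as defined in
 * include/uapi/linux/sched.h. */
#define CLONE_NEWAFNET 0x00001000UL
#define CLONE_NEWNET   0x40000000UL

/* Hypothetical helper: returns 0 if the combination is acceptable and
 * -EINVAL (assumed error code) if both namespaces are requested at once. */
static int check_afnet_clone_flags(unsigned long flags)
{
	if ((flags & CLONE_NEWAFNET) && (flags & CLONE_NEWNET))
		return -EINVAL;
	return 0;
}
```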

Signed-off-by: Hannes Frederic Sowa 
---
 include/uapi/linux/sched.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe019a7204e..b48dea58f55524 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -9,6 +9,7 @@
 #define CLONE_FS   0x0200  /* set if fs info shared between 
processes */
 #define CLONE_FILES0x0400  /* set if open files shared between 
processes */
 #define CLONE_SIGHAND  0x0800  /* set if signal handlers and blocked 
signals shared */
+#define CLONE_NEWAFNET 0x1000  /* Clone new afnet context */
 #define CLONE_PTRACE   0x2000  /* set if we want to let tracing 
continue on the child too */
 #define CLONE_VFORK0x4000  /* set if the parent wants the child to 
wake it up on mm_release */
 #define CLONE_PARENT   0x8000  /* set if we want to have the same 
parent as the cloner */
-- 
2.9.3

[PATCH RFC iproute v1 4/4] afnetns: only show afnetns when show_details

2017-03-12 Thread Hannes Frederic Sowa

Only show afnetns details when details are requested.

Signed-off-by: Hannes Frederic Sowa 
---
 ip/ipaddress.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index d954f3ea5bff40..cfb58e70e4f29f 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1219,7 +1219,7 @@ int print_addrinfo(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
fprintf(fp, "%usec", ci->ifa_prefered);
}
}
-   if (rta_tb[IFA_AFNETNS_INODE]) {
+   if (show_details && rta_tb[IFA_AFNETNS_INODE]) {
ino_t inode;
char *name;
 
-- 
2.9.3

[PATCH net-next RFC v1 03/27] afnetns: prepare for integration into ipv4

2017-03-12 Thread Hannes Frederic Sowa

Each IPv4 address has an associated afnet namespace, so it is only going
to be used by applications in the same afnet namespace.

One can open a file descriptor and pass it to the newaddr rtnetlink
functions to put an IP address into a specific afnet namespace.

Dumping the addresses also returns the appropriate afnetns inode number,
so a match with the appropriate afnet namespace can be done in user space.
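
A user space match on the inode number could look roughly like the sketch
below. The helper names are hypothetical; the path passed in would be
something like /proc/self/ns/afnet on a kernel carrying this series, and
the dumped inode is the value of the IFA_AFNETNS_INODE attribute.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: fetch the inode number behind a namespace file. */
static int ns_file_inode(const char *path, ino_t *ino)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	*ino = st.st_ino;
	return 0;
}

/* Returns 1 if the dumped inode refers to the namespace file at path,
 * 0 if it refers to a different one, and -1 if the file cannot be read. */
static int afnetns_inode_matches(const char *path, ino_t dumped_inode)
{
	ino_t ino;

	if (ns_file_inode(path, &ino) < 0)
		return -1;
	return ino == dumped_inode;
}
```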

Signed-off-by: Hannes Frederic Sowa 
---
 include/linux/inetdevice.h   |  3 +++
 include/net/afnetns.h|  2 ++
 include/uapi/linux/if_addr.h |  2 ++
 net/core/afnetns.c   | 26 ++
 net/ipv4/devinet.c   | 39 ++-
 5 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index ee971f335a8b65..d5ac959e90baa1 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -141,6 +141,9 @@ struct in_ifaddr {
unsigned char   ifa_scope;
unsigned char   ifa_prefixlen;
__u32   ifa_flags;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   struct afnetns  *afnetns;
+#endif
charifa_label[IFNAMSIZ];
 
/* In seconds, relative to tstamp. Expiry is at tstamp + HZ * lft. */
diff --git a/include/net/afnetns.h b/include/net/afnetns.h
index d5fbb83023acd6..9039086717c356 100644
--- a/include/net/afnetns.h
+++ b/include/net/afnetns.h
@@ -19,6 +19,8 @@ int afnet_ns_init(void);
 
 struct afnetns *afnetns_new(struct net *net);
 struct afnetns *copy_afnet_ns(unsigned long flags, struct nsproxy *old);
+struct afnetns *afnetns_get_by_fd(int fd);
+unsigned int afnetns_to_inode(struct afnetns *afnetns);
 void afnetns_free(struct afnetns *afnetns);
 
 static inline struct afnetns *afnetns_get(struct afnetns *afnetns)
diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index 4318ab1635cedf..c67703808584eb 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -32,6 +32,8 @@ enum {
IFA_CACHEINFO,
IFA_MULTICAST,
IFA_FLAGS,
+   IFA_AFNETNS_FD,
+   IFA_AFNETNS_INODE,
__IFA_MAX,
 };
 
diff --git a/net/core/afnetns.c b/net/core/afnetns.c
index 997623e4dc5078..12b823ae780796 100644
--- a/net/core/afnetns.c
+++ b/net/core/afnetns.c
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -56,6 +57,31 @@ void afnetns_free(struct afnetns *afnetns)
kfree(afnetns);
 }
 
+struct afnetns *afnetns_get_by_fd(int fd)
+{
+   struct file *file;
+   struct ns_common *ns;
+   struct afnetns *afnetns;
+
+   file = proc_ns_fget(fd);
+   if (IS_ERR(file))
+   return ERR_CAST(file);
+
+   ns = get_proc_ns(file_inode(file));
+   if (ns->ops == _operations)
+   afnetns = afnetns_get(ns_to_afnet(ns));
+   else
+   afnetns = ERR_PTR(-EINVAL);
+
+   fput(file);
+   return afnetns;
+}
+
+unsigned int afnetns_to_inode(struct afnetns *afnetns)
+{
+   return afnetns->ns.inum;
+}
+
 struct afnetns *copy_afnet_ns(unsigned long flags, struct nsproxy *old)
 {
if (flags & CLONE_NEWNET)
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index cebedd545e5e28..d4a38b6e9adb79 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -99,6 +99,7 @@ static const struct nla_policy ifa_ipv4_policy[IFA_MAX+1] = {
[IFA_LABEL] = { .type = NLA_STRING, .len = IFNAMSIZ - 1 },
[IFA_CACHEINFO] = { .len = sizeof(struct ifa_cacheinfo) },
[IFA_FLAGS] = { .type = NLA_U32 },
+   [IFA_AFNETNS_FD]= { .type = NLA_S32 },
 };
 
 #define IN4_ADDR_HSIZE_SHIFT   8
@@ -203,6 +204,9 @@ static void inet_rcu_free_ifa(struct rcu_head *head)
struct in_ifaddr *ifa = container_of(head, struct in_ifaddr, rcu_head);
if (ifa->ifa_dev)
in_dev_put(ifa->ifa_dev);
+#if IS_ENABLED(CONFIG_AFNETNS)
+   afnetns_put(ifa->afnetns);
+#endif
kfree(ifa);
 }
 
@@ -805,6 +809,26 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, 
struct nlmsghdr *nlh,
else
memcpy(ifa->ifa_label, dev->name, IFNAMSIZ);
 
+#if IS_ENABLED(CONFIG_AFNETNS)
+   if (tb[IFA_AFNETNS_FD]) {
+   int fd = nla_get_s32(tb[IFA_AFNETNS_FD]);
+
+   ifa->afnetns = afnetns_get_by_fd(fd);
+   if (IS_ERR(ifa->afnetns)) {
+   err = PTR_ERR(ifa->afnetns);
+   ifa->afnetns = afnetns_get(net->afnet_ns);
+   goto errout_free;
+   }
+   } else {
+   ifa->afnetns = afnetns_get(net->afnet_ns);
+   }
+#else
+   if (tb[IFA_AFNETNS_FD]) {
+   err = -EOPNOTSUPP;
+   goto errout_free;
+   }
+#endif
+
if (tb[IFA_CACHEINFO]) {
struct ifa_cacheinfo *ci;
 
@@ -1089,6 +1113,9

[PATCH RFC iproute v1 0/4] afnetns: add support for afnetns

2017-03-12 Thread Hannes Frederic Sowa

This patchset adds support for afnetns to iproute.

For more information on afnetns please look at the kernel patchset.

Patches for util-linux commands, namely nsenter and unshare, are
available here: 

Hannes Frederic Sowa (4):
  afnetns: add iproute bits for afnetns
  afnetns: support for ipv4/v6 address management
  afnetns: introduce lib/afnetns.c and a name cache
  afnetns: only show afnetns when show_details

 include/afnetns.h   |   6 ++
 include/libnetlink.h|   7 ++
 include/linux/if_addr.h |   2 +
 include/namespace.h |   4 +
 include/utils.h |   1 +
 ip/Makefile |   2 +-
 ip/ip.c |   5 +-
 ip/ip_common.h  |   1 +
 ip/ipaddress.c  |  38 
 ip/ipafnetns.c  | 216 +
 lib/Makefile|   2 +-
 lib/afnetns.c   | 226 
 lib/utils.c |  36 
 13 files changed, 542 insertions(+), 4 deletions(-)
 create mode 100644 include/afnetns.h
 create mode 100644 ip/ipafnetns.c
 create mode 100644 lib/afnetns.c

-- 
2.9.3

[PATCH net] dccp: fix memory leak during tear-down of unsuccessful connection request

2017-03-12 Thread Hannes Frederic Sowa

This patch fixes a memory leak, which happens if the connection request
is not fulfilled between parsing the DCCP options and handling the SYN
(because e.g. the backlog is full), because we forgot to free the
list of ack vectors.

Reported-by: Jianwen Ji 
Signed-off-by: Hannes Frederic Sowa 
---
 net/dccp/ccids/ccid2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
index f053198e730c48..5e3a7302f7747e 100644
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -749,6 +749,7 @@ static void ccid2_hc_tx_exit(struct sock *sk)
for (i = 0; i < hc->tx_seqbufc; i++)
kfree(hc->tx_seqbuf[i]);
hc->tx_seqbufc = 0;
+   dccp_ackvec_parsed_cleanup(&hc->tx_av_chunks);
 }
 
 static void ccid2_hc_rx_packet_recv(struct sock *sk, struct sk_buff *skb)
-- 
2.9.3

[PATCH RFC iproute v1 2/4] afnetns: support for ipv4/v6 address management

2017-03-12 Thread Hannes Frederic Sowa

Support ip address add xxx.yyy.zzz.lll/kk dev eth0 afnetns 

Signed-off-by: Hannes Frederic Sowa 
---
 include/libnetlink.h|  7 +++
 include/linux/if_addr.h |  2 ++
 include/namespace.h |  2 ++
 ip/ipaddress.c  | 32 
 ip/ipafnetns.c  | 26 +++---
 lib/namespace.c | 21 +
 6 files changed, 71 insertions(+), 19 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index bd0267dfcc02ad..81ba0d3a032360 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -152,10 +152,17 @@ static inline __u32 rta_getattr_u32(const struct rtattr 
*rta)
 {
return *(__u32 *)RTA_DATA(rta);
 }
+
+static inline __s32 rta_getattr_s32(const struct rtattr *rta)
+{
+   return *(__s32 *)RTA_DATA(rta);
+}
+
 static inline __be32 rta_getattr_be32(const struct rtattr *rta)
 {
return ntohl(rta_getattr_u32(rta));
 }
+
 static inline __u64 rta_getattr_u64(const struct rtattr *rta)
 {
__u64 tmp;
diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
index 26f0ecff9f13dd..dea1abe593ab29 100644
--- a/include/linux/if_addr.h
+++ b/include/linux/if_addr.h
@@ -32,6 +32,8 @@ enum {
IFA_CACHEINFO,
IFA_MULTICAST,
IFA_FLAGS,
+   IFA_AFNETNS_FD,
+   IFA_AFNETNS_INODE,
__IFA_MAX,
 };
 
diff --git a/include/namespace.h b/include/namespace.h
index acecc8c1f0d2b8..e0745ab0b50972 100644
--- a/include/namespace.h
+++ b/include/namespace.h
@@ -52,6 +52,8 @@ int netns_switch(char *netns);
 int netns_get_fd(const char *netns);
 int netns_foreach(int (*func)(char *nsname, void *arg), void *arg);
 
+int afnetns_open(const char *name);
+
 struct netns_func {
int (*func)(char *nsname, void *arg);
void *arg;
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index b8d9c7d917fe8d..2994b6a3e0a154 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -37,6 +37,7 @@
 #include "ip_common.h"
 #include "xdp.h"
 #include "color.h"
+#include "namespace.h"
 
 enum {
IPADD_LIST,
@@ -999,6 +1000,18 @@ static int set_lifetime(unsigned int *lifetime, char 
*argv)
return 0;
 }
 
+static int afnetns_get_fd(const char *name)
+{
+   int ns = -1;
+
+   if (name[0] == '/')
+   ns = open(name, O_RDONLY | O_CLOEXEC);
+   else
+   ns = afnetns_open(name);
+
+   return ns;
+}
+
 static unsigned int get_ifa_flags(struct ifaddrmsg *ifa,
  struct rtattr *ifa_flags_attr)
 {
@@ -1205,6 +1218,10 @@ int print_addrinfo(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
fprintf(fp, "%usec", ci->ifa_prefered);
}
}
+   if (rta_tb[IFA_AFNETNS_INODE]) {
+   fprintf(fp, " afnet:[%u]",
+   rta_getattr_u32(rta_tb[IFA_AFNETNS_INODE]));
+   }
fprintf(fp, "\n");
 brief_exit:
fflush(fp);
@@ -1883,6 +1900,7 @@ static int ipaddr_modify(int cmd, int flags, int argc, 
char **argv)
int brd_len = 0;
int any_len = 0;
int scoped = 0;
+   int afnetns_fd = -1;
__u32 preferred_lft = INFINITY_LIFE_TIME;
__u32 valid_lft = INFINITY_LIFE_TIME;
unsigned int ifa_flags = 0;
@@ -1958,6 +1976,14 @@ static int ipaddr_modify(int cmd, int flags, int argc, 
char **argv)
preferred_lftp = *argv;
if (set_lifetime(&preferred_lft, *argv))
invarg("preferred_lft value", *argv);
+   } else if (strcmp(*argv, "afnetns") == 0) {
+   if (afnetns_fd != -1)
+   duparg("afnetns", *argv);
+
+   NEXT_ARG();
+   afnetns_fd = afnetns_get_fd(*argv);
+   if (afnetns_fd < 0)
+   invarg("afnetns", *argv);
} else if (strcmp(*argv, "home") == 0) {
ifa_flags |= IFA_F_HOMEADDRESS;
} else if (strcmp(*argv, "nodad") == 0) {
@@ -2064,9 +2090,15 @@ static int ipaddr_modify(int cmd, int flags, int argc, 
char **argv)
return -1;
}
 
+   if (afnetns_fd != -1)
+   addattr32(&req.n, sizeof(req), IFA_AFNETNS_FD, afnetns_fd);
+
if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
return -2;
 
+   if (afnetns_fd > 0)
+   close(afnetns_fd);
+
return 0;
 }
 
diff --git a/ip/ipafnetns.c b/ip/ipafnetns.c
index 5b7a7e59bc947a..5a197ad3866d18 100644
--- a/ip/ipafnetns.c
+++ b/ip/ipafnetns.c
@@ -148,37 +148,25 @@ out_delete:
 static int afnetns_switch(const char *name)
 {
int err, ns;
-   char *path;
 
-   err = asprintf(&path, "%s/%s", AFNETNS_RUN_DIR, name);
-   if (err < 0) {
-   perror("asprintf");
-   return err;
-   };
-
-   ns = open(path, O_RDONLY | O_CLOEXEC);
-   if (ns < 0) {
-

[PATCH RFC iproute v1 3/4] afnetns: introduce lib/afnetns.c and a name cache

2017-03-12 Thread Hannes Frederic Sowa

This patch adds a name cache for afnetns, so we don't need to scan the
inodes again for every lookup. This speeds up address listing when many
afnetns and IP addresses are configured.
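
The idea of the cache can be sketched as a small chained hash table keyed
by inode number. The struct layout and the 64 buckets follow the patch;
cache_insert(), cache_lookup() and the modulo hash are assumptions made
for illustration, not the code from lib/afnetns.c.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define CACHE_BUCKETS 64

struct inode_cache {
	struct inode_cache *next;
	ino_t inode;
	char name[];            /* flexible array holding the afnetns name */
};

static struct inode_cache *cache[CACHE_BUCKETS];

/* Remember the name belonging to an inode; returns 0 on success. */
static int cache_insert(ino_t inode, const char *name)
{
	size_t len = strlen(name) + 1;
	struct inode_cache *e = malloc(sizeof(*e) + len);

	if (!e)
		return -1;
	e->inode = inode;
	memcpy(e->name, name, len);
	/* Push onto the head of the bucket's chain. */
	e->next = cache[inode % CACHE_BUCKETS];
	cache[inode % CACHE_BUCKETS] = e;
	return 0;
}

/* Return the cached name for an inode, or NULL if it is unknown. */
static const char *cache_lookup(ino_t inode)
{
	struct inode_cache *e;

	for (e = cache[inode % CACHE_BUCKETS]; e; e = e->next)
		if (e->inode == inode)
			return e->name;
	return NULL;
}
```

With such a cache the directory of named afnetns only has to be scanned
once per ip invocation instead of once per printed address.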

Signed-off-by: Hannes Frederic Sowa 
---
 include/afnetns.h   |   6 ++
 include/namespace.h |   3 -
 ip/ipaddress.c  |  12 ++-
 ip/ipafnetns.c  |   1 +
 lib/Makefile|   2 +-
 lib/afnetns.c   | 226 
 lib/namespace.c |  21 -
 7 files changed, 243 insertions(+), 28 deletions(-)
 create mode 100644 include/afnetns.h
 create mode 100644 lib/afnetns.c

diff --git a/include/afnetns.h b/include/afnetns.h
new file mode 100644
index 00..287bcb6153611b
--- /dev/null
+++ b/include/afnetns.h
@@ -0,0 +1,6 @@
+#pragma once
+
+#define AFNETNS_RUN_DIR "/var/run/afnetns"
+
+int afnetns_open(const char *name);
+char *afnetns_lookup_name(ino_t inode);
diff --git a/include/namespace.h b/include/namespace.h
index e0745ab0b50972..8193e474a75f98 100644
--- a/include/namespace.h
+++ b/include/namespace.h
@@ -7,7 +7,6 @@
 #include 
 #include 
 
-#define AFNETNS_RUN_DIR "/var/run/afnetns"
 #define NETNS_RUN_DIR "/var/run/netns"
 #define NETNS_ETC_DIR "/etc/netns"
 
@@ -52,8 +51,6 @@ int netns_switch(char *netns);
 int netns_get_fd(const char *netns);
 int netns_foreach(int (*func)(char *nsname, void *arg), void *arg);
 
-int afnetns_open(const char *name);
-
 struct netns_func {
int (*func)(char *nsname, void *arg);
void *arg;
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 2994b6a3e0a154..d954f3ea5bff40 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -38,6 +38,7 @@
 #include "xdp.h"
 #include "color.h"
 #include "namespace.h"
+#include "afnetns.h"
 
 enum {
IPADD_LIST,
@@ -1004,7 +1005,7 @@ static int afnetns_get_fd(const char *name)
 {
int ns = -1;
 
-   if (name[0] == '/')
+   if (strnlen(name, 1) && name[0] == '/')
ns = open(name, O_RDONLY | O_CLOEXEC);
else
ns = afnetns_open(name);
@@ -1219,8 +1220,13 @@ int print_addrinfo(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
}
}
if (rta_tb[IFA_AFNETNS_INODE]) {
-   fprintf(fp, " afnet:[%u]",
-   rta_getattr_u32(rta_tb[IFA_AFNETNS_INODE]));
+   ino_t inode;
+   char *name;
+
+   inode = rta_getattr_u32(rta_tb[IFA_AFNETNS_INODE]);
+   name = afnetns_lookup_name(inode);
+   if (name)
+   fprintf(fp, " afnet %s", name);
}
fprintf(fp, "\n");
 brief_exit:
diff --git a/ip/ipafnetns.c b/ip/ipafnetns.c
index 5a197ad3866d18..2fd749a3f20628 100644
--- a/ip/ipafnetns.c
+++ b/ip/ipafnetns.c
@@ -7,6 +7,7 @@
 #include "utils.h"
 #include "ip_common.h"
 #include "namespace.h"
+#include "afnetns.h"
 
 static void usage(void)
 {
diff --git a/lib/Makefile b/lib/Makefile
index 1d24ca24b9a39f..7825021ea3cfa8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -fPIC
 
 UTILOBJ = utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o \
inet_proto.o namespace.o json_writer.o \
-   names.o color.o bpf.o exec.o fs.o
+   names.o color.o bpf.o exec.o fs.o afnetns.o
 
 NLOBJ=libgenl.o ll_map.o libnetlink.o
 
diff --git a/lib/afnetns.c b/lib/afnetns.c
new file mode 100644
index 00..d58a55df46daa7
--- /dev/null
+++ b/lib/afnetns.c
@@ -0,0 +1,226 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include 
+
+#include "list.h"
+#include "afnetns.h"
+
+#define ULONG_CHARS ((int)ceill(log10l(ULONG_MAX)))
+
+static struct inode_cache {
+   struct inode_cache *next;
+   ino_t inode;
+   char name[];
+} *cache[64];
+
+static int self_inode(ino_t *me)
+{
+   static bool initialized;
+   static ino_t inode;
+   long path_size;
+   char *path;
+   int err;
+
+   if (initialized) {
+   *me = inode;
+   return 0;
+   }
+
+   errno = 0;
+   path_size = pathconf("/proc/self/ns/afnet", _PC_PATH_MAX);
+   if (path_size < 0) {
+   if (errno)
+   perror("pathconf");
+   else
+   fprintf(stderr,
+   "couldn't determine _PC_PATH_MAX for procfs: 
%zd\n",
+   path_size);
+   return -1;
+   }
+
+   path = malloc(path_size);
+   if (!path) {
+   perror("malloc");
+   return -1;
+   }
+
+   err = readlink("/proc/self/ns/afnet", path, path_size);
+   if (err < 0) {
+   perror("readlink");
+   goto out;
+   } else if (err >= path_size) {
+   fprintf(stderr, "readlink(\"/proc/self/ns/afnet\") exceeded 
maximum path length: %d >= %ld",
+   err, path_size);
+

[PATCH net-next RFC v1 02/27] afnetns: basic namespace operations and representations

2017-03-12 Thread Hannes Frederic Sowa

This patch adds the basic afnetns operations. Specifically, it implements
the /proc/self/ns/afnet operations, which allow managing afnetns
namespaces, plus support for clone, unshare and setns.

The afnetns is tracked in the nsproxy structure for each task_struct.

Signed-off-by: Hannes Frederic Sowa 
---
 Documentation/networking/afnetns.txt |  64 ++
 fs/proc/namespaces.c |   3 +
 include/linux/nsproxy.h  |   3 +
 include/linux/proc_ns.h  |   1 +
 include/net/afnetns.h|  42 
 include/net/net_namespace.h  |   4 ++
 kernel/fork.c|  12 +++-
 kernel/nsproxy.c |  24 ++-
 net/Kconfig  |  10 +++
 net/core/Makefile|   1 +
 net/core/afnetns.c   | 124 +++
 net/core/net_namespace.c |  25 +++
 12 files changed, 308 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/networking/afnetns.txt
 create mode 100644 include/net/afnetns.h
 create mode 100644 net/core/afnetns.c

diff --git a/Documentation/networking/afnetns.txt 
b/Documentation/networking/afnetns.txt
new file mode 100644
index 00..cede4564f8c396
--- /dev/null
+++ b/Documentation/networking/afnetns.txt
@@ -0,0 +1,64 @@
+Address-family net namespace
+===
+
+Support for afnetns is enabled in the kernel via CONFIG_AFNETNS.
+
+afnetns allows to put address family addresses into separate
+namespaces.
+
+afnetns behaves like all other namespaces: clone, unshare, setns
+syscalls can work with afnetns with one limitation: one cannot cross
+the realm of a network namespace while changing the afnetns
+compartment. To get into a new afnetns in a different net namespace,
+one must first change to the net namespace and afterwards switch to
+the desired afnetns.
+
+The primitive objects in the kernel an afnetns relates to are:
+- process
+- socket
+- ipv4 address
+- ipv6 address.
+
+An afnetns basically forms a namespace around socket binds. While not
+strictly necessary, it also affects source routing, so firewall rules
+are easier to maintain. It does in no way deal with the reception and
+handling of multicast or broadcast sockets. As the afnetns namespaces
+are connecting to the same L2 network, it does not make sense to try
+to build up separation rules here, as they can be broken anyway.
+
+afnetns doesn't allow sharing of the 127.0.0.1/32 loopback
+address. Instead each afnetns must be provided with a loopback address
+from the 127.0.0.0/8 range if needed.
+
+The easiest way to use afnetns is to use the iproute2 interface, which
+very much follows the style of ip-netns.
+
+$ ip afnetns help
+Usage: ip afnetns list
+   ip afnetns add NAME
+   ip afnetns del NAME
+   ip afnetns exec NAME cmd ...
+
+IP addresses carry a afnetns identifier, too. It is visible with the
+-d (details) option:
+
+$ ip -d a l dev lo
+1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1
+link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 
numtxqueues 1 numrxqueues 1 
+inet 127.0.0.1/8 scope host lo
+   valid_lft forever preferred_lft forever afnet afnet:[4026531958],self
+inet6 ::1/128 scope host 
+   valid_lft forever preferred_lft forever afnet afnet:[4026531958],self
+
+This shows the afnetns inode number, as well as that we are currently
+in the same namespace as the two specified ip addresses. In case we
+added a name for the namespace with ip-afnetns, it will be visible
+here, too.
+
+$ ip a a 10.0.0.1/24 dev lo afnetns test
+
+This command adds a new ip address to the loopback device and makes it
+available in the "test" afnetns. Commands in this namespace can use
+this IP address and use it for outgoing communication.
+
+The same commands work for IPv6, I only used IPv4 as an example.
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 766f0c637ad1b4..f1ccef97ce9861 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -31,6 +31,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_CGROUPS
&cgroupns_operations,
 #endif
+#if IS_ENABLED(CONFIG_AFNETNS)
+   _operations,
+#endif
 };
 
 static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index ac0d65bef5d086..0c0e48dca4b744 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -35,6 +35,9 @@ struct nsproxy {
struct pid_namespace *pid_ns_for_children;
struct net   *net_ns;
struct cgroup_namespace *cgroup_ns;
+#if IS_ENABLED(CONFIG_AFNETNS)
+   struct afnetns *afnet_ns;
+#endif
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 12cb8bd81d2d12..45f103098ab0c1 100644
--- a/include/linux/proc_ns.h
+++

[PATCH RFC iproute v1 1/4] afnetns: add iproute bits for afnetns

2017-03-12 Thread Hannes Frederic Sowa

Like ip netns, ip afnetns ... provides the basic utility features to
create and delete afnet namespaces and to execute commands inside an
afnetns.

Signed-off-by: Hannes Frederic Sowa 
---
 include/namespace.h |   5 ++
 include/utils.h |   1 +
 ip/Makefile |   2 +-
 ip/ip.c |   5 +-
 ip/ip_common.h  |   1 +
 ip/ipafnetns.c  | 227 
 lib/utils.c |  36 +
 7 files changed, 274 insertions(+), 3 deletions(-)
 create mode 100644 ip/ipafnetns.c

diff --git a/include/namespace.h b/include/namespace.h
index 51324b21ba0cd5..acecc8c1f0d2b8 100644
--- a/include/namespace.h
+++ b/include/namespace.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 
+#define AFNETNS_RUN_DIR "/var/run/afnetns"
 #define NETNS_RUN_DIR "/var/run/netns"
 #define NETNS_ETC_DIR "/etc/netns"
 
@@ -14,6 +15,10 @@
 #define CLONE_NEWNET 0x40000000 /* New network namespace (lo, device, 
names sockets, etc) */
 #endif
 
+#ifndef CLONE_NEWAFNET
+#define CLONE_NEWAFNET 0x1000  /* Clone new afnet context */
+#endif
+
 #ifndef MNT_DETACH
 #define MNT_DETACH 0x0002  /* Just detach from the tree */
 #endif /* MNT_DETACH */
diff --git a/include/utils.h b/include/utils.h
index 22369e0b4e0374..59fdd76b502b3c 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -256,6 +256,7 @@ int do_each_netns(int (*func)(char *nsname, void *arg), 
void *arg,
 char *int_to_str(int val, char *buf);
 int get_guid(__u64 *guid, const char *arg);
 int get_real_family(int rtm_type, int rtm_family);
+int cmd_exec(const char *cmd, char **argv, bool do_fork);
 
 int cmd_exec(const char *cmd, char **argv, bool do_fork);
 int make_path(const char *path, mode_t mode);
diff --git a/ip/Makefile b/ip/Makefile
index 4276a34b529e3f..4da6f33968ffe1 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -8,7 +8,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o 
ipnetns.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
 iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o \
-ipvrf.o iplink_xstats.o
+ipvrf.o iplink_xstats.o ipafnetns.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/ip.c b/ip/ip.c
index 07050b07592ac1..6aa8aaab4c03f9 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -51,8 +51,8 @@ static void usage(void)
 "   ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | address | addrlabel | route | rule | neigh | ntable 
|\n"
 "   tunnel | tuntap | maddress | mroute | mrule | monitor | 
xfrm |\n"
-"   netns | l2tp | fou | macsec | tcp_metrics | token | 
netconf | ila |\n"
-"   vrf }\n"
+"   netns | afnetns | l2tp | fou | macsec | tcp_metrics | 
token |\n"
+"   netconf | ila | vrf }\n"
 "   OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "-h[uman-readable] | -iec |\n"
 "-f[amily] { inet | inet6 | ipx | dnet | mpls | bridge | 
link } |\n"
@@ -99,6 +99,7 @@ static const struct cmd {
{ "mroute", do_multiroute },
{ "mrule",  do_multirule },
{ "netns",  do_netns },
+   { "afnetns",do_afnetns },
{ "netconf",do_ipnetconf },
{ "vrf",do_ipvrf},
{ "help",   do_help },
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 5a39623aa21d9f..1f59db40038ef2 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -50,6 +50,7 @@ int do_multiaddr(int argc, char **argv);
 int do_multiroute(int argc, char **argv);
 int do_multirule(int argc, char **argv);
 int do_netns(int argc, char **argv);
+int do_afnetns(int argc, char **argv);
 int do_xfrm(int argc, char **argv);
 int do_ipl2tp(int argc, char **argv);
 int do_ipfou(int argc, char **argv);
diff --git a/ip/ipafnetns.c b/ip/ipafnetns.c
new file mode 100644
index 00..5b7a7e59bc947a
--- /dev/null
+++ b/ip/ipafnetns.c
@@ -0,0 +1,227 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "ip_common.h"
+#include "namespace.h"
+
+static void usage(void)
+{
+   static const char *help =
+   "Usage: ip afnetns list\n"
+   "   ip afnetns add NAME\n"
+   "   ip afnetns del NAME\n"
+   "   ip afnetns exec NAME cmd ...\n";
+   fputs(help, stderr);
+}
+
+static int afnetns_list(void)
+{
+   struct dirent *entry;
+   DIR *dir;
+
+   dir = opendir(AFNETNS_RUN_DIR);
+   if (!dir)
+   return 0;
+
+   while ((entry = readdir(dir))) {
+   if (!strcmp(entry->d_name, ".") ||
+   !strcmp(entry->d_name, ".."))
+   continue;
+   printf("%s\n", entry->d_name);
+   }
+   closedir(dir);
+
+   return 0;
+}
+
+static int create_afnetns_dir(void)
+{
+   int err;
+   const mode_t mode =

[PATCH net] tun: fix premature POLLOUT notification on tun devices

2017-03-12 Thread Hannes Frederic Sowa

aszlig observed failing ssh tunnels (-w) during initialization since
commit cc9da6cc4f56e0 ("ipv6: addrconf: use stable address generator for
ARPHRD_NONE"). We already had reports that the mentioned commit breaks
Juniper VPN connections. I can't clearly say that the Juniper VPN client
has the same problem, but it is worth a try to hint to this patch.

Because of the early generation of link local addresses, the kernel now
can start asking for routers on the local subnet much earlier than usual.
Those router solicitation packets arrive inside the ssh channels and
should be transmitted to the tun fd before the configuration scripts
might have upped the interface and made it ready for transmission.

ssh polls on the interface and receives back a POLLOUT. It tries to send
the early router solicitation packet to the tun interface. Unfortunately
it hasn't been brought up yet by config scripts, thus failing with -EIO. ssh
doesn't retry and considers the tun interface broken forever.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131
Fixes: cc9da6cc4f56 ("ipv6: addrconf: use stable address generator for 
ARPHRD_NONE")
Cc: Bjørn Mork 
Reported-by: Valdis Kletnieks 
Cc: Valdis Kletnieks 
Reported-by: Jonas Lippuner 
Cc: Jonas Lippuner 
Reported-by: aszlig 
Cc: aszlig 
Signed-off-by: Hannes Frederic Sowa 
---
 drivers/net/tun.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index f58b7d850114b0..34cc3c590aa5c5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -822,7 +822,18 @@ static void tun_net_uninit(struct net_device *dev)
 /* Net device open. */
 static int tun_net_open(struct net_device *dev)
 {
+   struct tun_struct *tun = netdev_priv(dev);
+   int i;
+
netif_tx_start_all_queues(dev);
+
+   for (i = 0; i < tun->numqueues; i++) {
+   struct tun_file *tfile;
+
+   tfile = rtnl_dereference(tun->tfiles[i]);
+   tfile->socket.sk->sk_write_space(tfile->socket.sk);
+   }
+
return 0;
 }
 
@@ -1103,9 +1114,10 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
if (!skb_array_empty(&tfile->tx_array))
mask |= POLLIN | POLLRDNORM;
 
-   if (sock_writeable(sk) ||
-   (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
-sock_writeable(sk)))
+   if (tun->dev->flags & IFF_UP &&
+   (sock_writeable(sk) ||
+(!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+ sock_writeable(sk))))
mask |= POLLOUT | POLLWRNORM;
 
if (tun->dev->reg_state != NETREG_REGISTERED)
-- 
2.9.3
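
The poll-side change in the tun patch above can be illustrated with a small userspace sketch. This is not the kernel code: the struct and function names are hypothetical, and only the POLLIN/POLLOUT bits are modelled. It shows the idea that writability is reported only once the interface is up.

```c
#include <poll.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state that tun_chr_poll() inspects. */
struct tun_poll_state {
	bool rx_pending;	/* packets queued for userspace to read */
	bool dev_up;		/* interface has IFF_UP set */
	bool writeable;		/* socket has send-buffer space */
};

/* Compute the poll mask the way the patched logic is described. */
static short tun_poll_mask_sketch(const struct tun_poll_state *s)
{
	short mask = 0;

	if (s->rx_pending)
		mask |= POLLIN;

	/* Report writability only when the interface is up. */
	if (s->dev_up && s->writeable)
		mask |= POLLOUT;

	return mask;
}
```

With this gating, a poller such as ssh would not see POLLOUT before the configuration scripts have brought the interface up, so it would not attempt the write that fails with -EIO.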

[PATCH] net: usb: rtl8150: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/rtl8150.c |   35 ---
 1 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/net/usb/rtl8150.c b/drivers/net/usb/rtl8150.c
index c81c791..daaa88a 100644
--- a/drivers/net/usb/rtl8150.c
+++ b/drivers/net/usb/rtl8150.c
@@ -791,47 +791,52 @@ static void rtl8150_get_drvinfo(struct net_device 
*netdev, struct ethtool_drvinf
usb_make_path(dev->udev, info->bus_info, sizeof(info->bus_info));
 }
 
-static int rtl8150_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+static int rtl8150_get_link_ksettings(struct net_device *netdev,
+ struct ethtool_link_ksettings *ecmd)
 {
rtl8150_t *dev = netdev_priv(netdev);
short lpa, bmcr;
+   u32 supported;
 
-   ecmd->supported = (SUPPORTED_10baseT_Half |
+   supported = (SUPPORTED_10baseT_Half |
  SUPPORTED_10baseT_Full |
  SUPPORTED_100baseT_Half |
  SUPPORTED_100baseT_Full |
  SUPPORTED_Autoneg |
  SUPPORTED_TP | SUPPORTED_MII);
-   ecmd->port = PORT_TP;
-   ecmd->transceiver = XCVR_INTERNAL;
-   ecmd->phy_address = dev->phy;
+   ecmd->base.port = PORT_TP;
+   ecmd->base.phy_address = dev->phy;
get_registers(dev, BMCR, 2, &bmcr);
get_registers(dev, ANLP, 2, &lpa);
if (bmcr & BMCR_ANENABLE) {
u32 speed = ((lpa & (LPA_100HALF | LPA_100FULL)) ?
 SPEED_100 : SPEED_10);
-   ethtool_cmd_speed_set(ecmd, speed);
-   ecmd->autoneg = AUTONEG_ENABLE;
+   ecmd->base.speed = speed;
+   ecmd->base.autoneg = AUTONEG_ENABLE;
if (speed == SPEED_100)
-   ecmd->duplex = (lpa & LPA_100FULL) ?
+   ecmd->base.duplex = (lpa & LPA_100FULL) ?
DUPLEX_FULL : DUPLEX_HALF;
else
-   ecmd->duplex = (lpa & LPA_10FULL) ?
+   ecmd->base.duplex = (lpa & LPA_10FULL) ?
DUPLEX_FULL : DUPLEX_HALF;
} else {
-   ecmd->autoneg = AUTONEG_DISABLE;
-   ethtool_cmd_speed_set(ecmd, ((bmcr & BMCR_SPEED100) ?
-SPEED_100 : SPEED_10));
-   ecmd->duplex = (bmcr & BMCR_FULLDPLX) ?
+   ecmd->base.autoneg = AUTONEG_DISABLE;
+   ecmd->base.speed = ((bmcr & BMCR_SPEED100) ?
+SPEED_100 : SPEED_10);
+   ecmd->base.duplex = (bmcr & BMCR_FULLDPLX) ?
DUPLEX_FULL : DUPLEX_HALF;
}
+
+   ethtool_convert_legacy_u32_to_link_mode(ecmd->link_modes.supported,
+   supported);
+
return 0;
 }
 
 static const struct ethtool_ops ops = {
.get_drvinfo = rtl8150_get_drvinfo,
-   .get_settings = rtl8150_get_settings,
-   .get_link = ethtool_op_get_link
+   .get_link = ethtool_op_get_link,
+   .get_link_ksettings = rtl8150_get_link_ksettings,
 };
 
 static int rtl8150_ioctl(struct net_device *netdev, struct ifreq *rq, int cmd)
-- 
1.7.4.4
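
The speed/duplex resolution that the rtl8150 conversion above carries over unchanged can be sketched as a standalone function. The SK_-prefixed names and the struct are hypothetical; the bit values are the standard MII ones as found in linux/mii.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard MII register bits (same values as in linux/mii.h). */
#define SK_BMCR_FULLDPLX 0x0100
#define SK_BMCR_ANENABLE 0x1000
#define SK_BMCR_SPEED100 0x2000
#define SK_LPA_10FULL    0x0040
#define SK_LPA_100HALF   0x0080
#define SK_LPA_100FULL   0x0100

/* Hypothetical result type; the driver fills ecmd->base instead. */
struct sk_link {
	int speed_mbps;
	bool full_duplex;
	bool autoneg;
};

/* Derive the link settings from the BMCR and link-partner registers. */
static struct sk_link sk_resolve_link(uint16_t bmcr, uint16_t lpa)
{
	struct sk_link l;

	l.autoneg = (bmcr & SK_BMCR_ANENABLE) != 0;
	if (l.autoneg) {
		/* Autonegotiation: trust what the link partner advertises. */
		l.speed_mbps = (lpa & (SK_LPA_100HALF | SK_LPA_100FULL)) ? 100 : 10;
		if (l.speed_mbps == 100)
			l.full_duplex = (lpa & SK_LPA_100FULL) != 0;
		else
			l.full_duplex = (lpa & SK_LPA_10FULL) != 0;
	} else {
		/* Forced mode: read speed and duplex from BMCR itself. */
		l.speed_mbps = (bmcr & SK_BMCR_SPEED100) ? 100 : 10;
		l.full_duplex = (bmcr & SK_BMCR_FULLDPLX) != 0;
	}
	return l;
}
```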

[PATCH 4/4] uapi/if_ether.h: prevent redefinition of struct ethhdr

2017-03-12 Thread Hauke Mehrtens

From: David Heidelberger 

Musl provides its own ethhdr struct definition. Add a guard to prevent
its definition if the appropriate musl header has already been included.

Signed-off-by: John Spencer 
Tested-by: David Heidelberger 
Signed-off-by: Jonas Gorski 
---
 include/uapi/linux/if_ether.h|  3 +++
 include/uapi/linux/libc-compat.h | 11 +++
 2 files changed, 14 insertions(+)

diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 5bc9bfd816b7..fb5ab8c1e753 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -22,6 +22,7 @@
 #define _UAPI_LINUX_IF_ETHER_H
 
 #include <linux/types.h>
+#include <linux/libc-compat.h>
 
 /*
  * IEEE 802.3 Ethernet magic constants.  The frame sizes omit the preamble
@@ -142,11 +143,13 @@
  * This is an Ethernet frame header.
  */
 
+#if __UAPI_DEF_ETHHDR
 struct ethhdr {
unsigned char   h_dest[ETH_ALEN];   /* destination eth addr */
unsigned char   h_source[ETH_ALEN]; /* source ether addr*/
__be16  h_proto;/* packet type ID field */
 } __attribute__((packed));
+#endif
 
 
 #endif /* _UAPI_LINUX_IF_ETHER_H */
diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index ce2fa8a4ced6..c92d32f213d1 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -87,6 +87,14 @@
 
 #endif /* _NET_IF_H */
 
+/* musl defines the ethhdr struct itself in its netinet/if_ether.h.
+ * Glibc just includes the kernel header and uses a different guard. */
+#if defined(_NETINET_IF_ETHER_H)
+#define __UAPI_DEF_ETHHDR  0
+#else
+#define __UAPI_DEF_ETHHDR  1
+#endif
+
 /* Coordinate with glibc netinet/in.h header. */
 #if defined(_NETINET_IN_H)
 
@@ -182,6 +190,9 @@
 /* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
 #define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
 
+/* Definitions for if_ether.h */
+#define __UAPI_DEF_ETHHDR  1
+
 /* Definitions for in.h */
 #define __UAPI_DEF_IN_ADDR 1
 #define __UAPI_DEF_IN_IPPROTO  1
-- 
2.11.0
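
The guard pattern used by the ethhdr patch above can be shown in isolation. In this sketch all SKETCH_/sketch_ names are hypothetical stand-ins: SKETCH_LIBC_IF_ETHER_H plays the role of the libc header guard (_NETINET_IF_ETHER_H in the patch) and SKETCH_DEF_ETHHDR plays the role of __UAPI_DEF_ETHHDR.

```c
#include <stdint.h>

/*
 * If the (hypothetical) libc header was included first, it already
 * defined the struct, so the definition below must be suppressed.
 */
#if defined(SKETCH_LIBC_IF_ETHER_H)
#define SKETCH_DEF_ETHHDR 0
#else
#define SKETCH_DEF_ETHHDR 1
#endif

#if SKETCH_DEF_ETHHDR
/* Same layout as an Ethernet header: two 6-byte addresses and a type. */
struct sketch_ethhdr {
	unsigned char h_dest[6];	/* destination address */
	unsigned char h_source[6];	/* source address */
	uint16_t h_proto;		/* packet type ID */
} __attribute__((packed));
#endif
```

Compiling the same file with -DSKETCH_LIBC_IF_ETHER_H would skip the struct, which is the behaviour the patch aims for when musl's header comes first.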

[PATCH 0/4] uapi glibc compat: fix musl libc compatibility

2017-03-12 Thread Hauke Mehrtens

The code from libc-compat.h depends on some glibc specific defines and 
causes compile problems with the musl libc. These patches remove some 
of the glibc dependencies. With these patches the LEDE (OpenWrt) base 
user space applications can be built with unmodified kernel headers and 
musl libc.

This was compile tested with the user space from LEDE (OpenWrt) with 
musl 1.1.16, glibc 2.25 and uClibc-ng 1.0.22.

David Heidelberger (1):
  uapi/if_ether.h: prevent redefinition of struct ethhdr

Hauke Mehrtens (3):
  uapi glibc compat: add libc compat code when not build for kernel
  uapi glibc compat: fix build if libc defines IFF_ECHO
  uapi glibc compat: Do not check for __USE_MISC

 include/uapi/linux/if_ether.h|  3 +++
 include/uapi/linux/libc-compat.h | 25 +++--
 2 files changed, 22 insertions(+), 6 deletions(-)

-- 
2.11.0

[PATCH 2/4] uapi glibc compat: fix build if libc defines IFF_ECHO

2017-03-12 Thread Hauke Mehrtens

musl 1.1.15 defines IFF_ECHO and the other net_device_flags options.
When a user application includes linux/if.h and net/if.h the compile
will fail.

Activate __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO only when
it is needed. This should also make this work in case glibc will add
these defines.

Signed-off-by: Hauke Mehrtens 
---
 include/uapi/linux/libc-compat.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index 7c1fead03c50..49a8cc3138ae 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -64,9 +64,11 @@
 /* Everything up to IFF_DYNAMIC, matches net/if.h until glibc 2.23 */
 #define __UAPI_DEF_IF_NET_DEVICE_FLAGS 0
 /* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
+#ifndef IFF_ECHO
 #ifndef __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO
 #define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
 #endif /* __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO */
+#endif /* IFF_ECHO */
 
 #else /* _NET_IF_H */
 
-- 
2.11.0

[PATCH 1/4] uapi glibc compat: add libc compat code when not build for kernel

2017-03-12 Thread Hauke Mehrtens

Instead of checking if this header file is used in the glibc, check if
it is not used in kernel context, this way it will also work with
other libc implementations like musl.

Signed-off-by: Hauke Mehrtens 
---
 include/uapi/linux/libc-compat.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index 44b8a6bd5fe1..7c1fead03c50 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -48,8 +48,8 @@
 #ifndef _UAPI_LIBC_COMPAT_H
 #define _UAPI_LIBC_COMPAT_H
 
-/* We have included glibc headers... */
-#if defined(__GLIBC__)
+/* We have included libc headers... */
+#if !defined(__KERNEL__)
 
 /* Coordinate with glibc net/if.h header. */
 #if defined(_NET_IF_H) && defined(__USE_MISC)
@@ -168,7 +168,7 @@
 /* If we did not see any headers from any supported C libraries,
  * or we are being included in the kernel, then define everything
  * that we need. */
-#else /* !defined(__GLIBC__) */
+#else /* defined(__KERNEL__) */
 
 /* Definitions for if.h */
 #define __UAPI_DEF_IF_IFCONF 1
@@ -208,6 +208,6 @@
 /* Definitions for xattr.h */
 #define __UAPI_DEF_XATTR   1
 
-#endif /* __GLIBC__ */
+#endif /* __KERNEL__ */
 
 #endif /* _UAPI_LIBC_COMPAT_H */
-- 
2.11.0

[PATCH 3/4] uapi glibc compat: Do not check for __USE_MISC

2017-03-12 Thread Hauke Mehrtens

__USE_MISC is glibc specific and not available in musl libc. Only do
this check when glibc is used. This fixes a problem with musl libc.

Signed-off-by: Hauke Mehrtens 
---
 include/uapi/linux/libc-compat.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index 49a8cc3138ae..ce2fa8a4ced6 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -51,8 +51,8 @@
 /* We have included libc headers... */
 #if !defined(__KERNEL__)
 
-/* Coordinate with glibc net/if.h header. */
-#if defined(_NET_IF_H) && defined(__USE_MISC)
+/* Coordinate with libc net/if.h header. */
+#if defined(_NET_IF_H) && (!defined(__GLIBC__) || defined(__USE_MISC))
 
 /* GLIBC headers included first so don't define anything
  * that would already be defined. */
-- 
2.11.0

[PATCH v3] Enable tx timestamping on loopback and dummy

2017-03-12 Thread Ezequiel Lara Gomez


From f3483202625648be728df39622360508f2fb03e1 Mon Sep 17 00:00:00 2001
From: Ezequiel Lara Gomez 
Date: Sat, 11 Mar 2017 20:06:54 +
Subject: [PATCH v3] Enable tx timestamping on loopback and dummy

This enables testing of SO_TIMESTAMPING options by targeting localhost
addresses.

Tested on qemu using txtimestamping.c from the kernel selftests, and
ethtool -T.
---
Changes:
 * Added ethtool flags to both drivers to report they support timestamping.

 drivers/net/dummy.c| 15 +++
 drivers/net/loopback.c | 15 +++
 2 files changed, 30 insertions(+)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index 2c80611..149244a 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -125,6 +126,7 @@ static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev)
dstats->tx_bytes += skb->len;
u64_stats_update_end(&dstats->syncp);
 
+   skb_tx_timestamp(skb);
dev_kfree_skb(skb);
return NETDEV_TX_OK;
 }
@@ -304,8 +306,21 @@ static void dummy_get_drvinfo(struct net_device *dev,
strlcpy(info->version, DRV_VERSION, sizeof(info->version));
 }
 
+static int dummy_get_ts_info(struct net_device *dev,
+ struct ethtool_ts_info *ts_info)
+{
+   ts_info->so_timestamping = SOF_TIMESTAMPING_TX_SOFTWARE |
+  SOF_TIMESTAMPING_RX_SOFTWARE |
+  SOF_TIMESTAMPING_SOFTWARE;
+
+   ts_info->phc_index = -1;
+
+   return 0;
+};
+
 static const struct ethtool_ops dummy_ethtool_ops = {
.get_drvinfo= dummy_get_drvinfo,
+   .get_ts_info= dummy_get_ts_info,
 };
 
 static void dummy_free_netdev(struct net_device *dev)
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 122cc2d..3a60d27 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -74,6 +75,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
struct pcpu_lstats *lb_stats;
int len;
 
+   skb_tx_timestamp(skb);
skb_orphan(skb);
 
/* Before queueing this packet to netif_rx(),
@@ -129,8 +131,21 @@ static u32 always_on(struct net_device *dev)
return 1;
 }
 
+static int loopback_get_ts_info(struct net_device *netdev,
+   struct ethtool_ts_info *ts_info)
+{
+   ts_info->so_timestamping = SOF_TIMESTAMPING_TX_SOFTWARE |
+  SOF_TIMESTAMPING_RX_SOFTWARE |
+  SOF_TIMESTAMPING_SOFTWARE;
+
+   ts_info->phc_index = -1;
+
+   return 0;
+};
+
 static const struct ethtool_ops loopback_ethtool_ops = {
.get_link   = always_on,
+   .get_ts_info= loopback_get_ts_info,
 };
 
 static int loopback_dev_init(struct net_device *dev)
-- 
1.9.1

Amazon Data Services Ireland Limited registered office: One Burlington Plaza, 
Burlington Road, Dublin 4, Ireland. Registered in Ireland. Registration number 
390566.
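
For testing the loopback/dummy timestamping patch above from userspace, a program would typically request software TX timestamps on its socket. The following is a minimal sketch, assuming a Linux system with kernel headers installed; the two helper function names are hypothetical, while the flags and the socket option are the standard ones.

```c
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Flags asking for software TX timestamps to be generated and reported. */
static unsigned int sw_tx_timestamping_flags(void)
{
	return SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;
}

/* Enable software TX timestamping on an already-open socket. */
static int enable_sw_tx_timestamping(int fd)
{
	unsigned int flags = sw_tx_timestamping_flags();

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

After sending a packet, the timestamp is read from the socket error queue with recvmsg() and MSG_ERRQUEUE; whether one arrives depends on the driver calling skb_tx_timestamp(), which is what the patch adds for lo and dummy.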

[PATCH] net: usb: r8152: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/r8152.c |   21 -
 1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 986243c..227e1fd 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3800,7 +3800,8 @@ static void rtl8152_get_drvinfo(struct net_device *netdev,
 }
 
 static
-int rtl8152_get_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
+int rtl8152_get_link_ksettings(struct net_device *netdev,
+  struct ethtool_link_ksettings *cmd)
 {
struct r8152 *tp = netdev_priv(netdev);
int ret;
@@ -3814,7 +3815,7 @@ int rtl8152_get_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
 
mutex_lock(&tp->control);
 
-   ret = mii_ethtool_gset(&tp->mii, cmd);
+   ret = mii_ethtool_get_link_ksettings(&tp->mii, cmd);
 
mutex_unlock(&tp->control);
 
@@ -3824,7 +3825,8 @@ int rtl8152_get_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
return ret;
 }
 
-static int rtl8152_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int rtl8152_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
 {
struct r8152 *tp = netdev_priv(dev);
int ret;
@@ -3835,11 +3837,12 @@ static int rtl8152_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 
mutex_lock(&tp->control);
 
-   ret = rtl8152_set_speed(tp, cmd->autoneg, cmd->speed, cmd->duplex);
+   ret = rtl8152_set_speed(tp, cmd->base.autoneg, cmd->base.speed,
+   cmd->base.duplex);
if (!ret) {
-   tp->autoneg = cmd->autoneg;
-   tp->speed = cmd->speed;
-   tp->duplex = cmd->duplex;
+   tp->autoneg = cmd->base.autoneg;
+   tp->speed = cmd->base.speed;
+   tp->duplex = cmd->base.duplex;
}
 
mutex_unlock(&tp->control);
@@ -4117,8 +4120,6 @@ static int rtl8152_set_coalesce(struct net_device *netdev,
 
 static const struct ethtool_ops ops = {
.get_drvinfo = rtl8152_get_drvinfo,
-   .get_settings = rtl8152_get_settings,
-   .set_settings = rtl8152_set_settings,
.get_link = ethtool_op_get_link,
.nway_reset = rtl8152_nway_reset,
.get_msglevel = rtl8152_get_msglevel,
@@ -4132,6 +4133,8 @@ static int rtl8152_set_coalesce(struct net_device *netdev,
.set_coalesce = rtl8152_set_coalesce,
.get_eee = rtl_ethtool_get_eee,
.set_eee = rtl_ethtool_set_eee,
+   .get_link_ksettings = rtl8152_get_link_ksettings,
+   .set_link_ksettings = rtl8152_set_link_ksettings,
 };
 
 static int rtl8152_ioctl(struct net_device *netdev, struct ifreq *rq, int cmd)
-- 
1.7.4.4

Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

2017-03-12 Thread Borislav Petkov

On Thu, Mar 09, 2017 at 03:26:02PM -0800, Linus Torvalds wrote:
> Maybe it's the lguest games with PGE that need to be removed?

Btw, tglx suggested something else the other day: warn when we're
changing boot_cpu_data x86_capability bits *after* alternatives have
run. The reasoning behind it being that potentially some patching
static_cpu_has() has done won't be correct anymore.

And it is pretty cheap to do it, it fires nicely on the 32-bit config
with LGUEST=y.

---
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index d59c15c3defd..f06c3dc6db70 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -124,8 +124,18 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 
 #define boot_cpu_has(bit)  cpu_has(&boot_cpu_data, bit)
 
-#define set_cpu_cap(c, bit)set_bit(bit, (unsigned long *)((c)->x86_capability))
-#define clear_cpu_cap(c, bit)  clear_bit(bit, (unsigned long *)((c)->x86_capability))
+#define set_cpu_cap(c, bit)\
+({ \
+   WARN_ON(c == &boot_cpu_data && alternatives_patched);   \
+   set_bit(bit, (unsigned long *)((c)->x86_capability));   \
+})
+
+#define clear_cpu_cap(c, bit)  \
+({ \
+   WARN_ON(c == &boot_cpu_data && alternatives_patched);   \
+   clear_bit(bit, (unsigned long *)((c)->x86_capability)); \
+})
+
 #define setup_clear_cpu_cap(bit) do { \
clear_cpu_cap(&boot_cpu_data, bit); \
set_bit(bit, (unsigned long *)cpu_caps_cleared); \

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
--
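
The idea in the cpufeature diff above, warning when the boot CPU's capability bits change after alternatives have been patched, can be modelled in plain C. Everything here is a hypothetical stand-in (sk_ names, a single-word capability mask, a counter instead of WARN_ON); it only illustrates the check, not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified CPU descriptor with one word of capabilities. */
struct sk_cpuinfo {
	unsigned long caps;
};

static struct sk_cpuinfo sk_boot_cpu;	/* stands in for boot_cpu_data */
static bool sk_alternatives_patched;	/* set once code patching is done */
static int sk_warn_count;		/* counts warnings instead of WARN_ON */

/* Set a capability bit, warning on late changes to the boot CPU only. */
static void sk_set_cap(struct sk_cpuinfo *c, unsigned int bit)
{
	if (c == &sk_boot_cpu && sk_alternatives_patched) {
		sk_warn_count++;
		fprintf(stderr,
			"warning: boot CPU capability %u changed after patching\n",
			bit);
	}
	c->caps |= 1UL << bit;
}
```

The warning is tied to the boot CPU descriptor because that is the one consulted when code is patched; changes to other descriptors are left alone in this sketch.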

[PATCH] net: usb: catc: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/catc.c |   31 ++-
 1 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/drivers/net/usb/catc.c b/drivers/net/usb/catc.c
index 0acc9b64..fce92f0 100644
--- a/drivers/net/usb/catc.c
+++ b/drivers/net/usb/catc.c
@@ -688,29 +688,34 @@ static void catc_get_drvinfo(struct net_device *dev,
usb_make_path(catc->usbdev, info->bus_info, sizeof(info->bus_info));
 }
 
-static int catc_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int catc_get_link_ksettings(struct net_device *dev,
+  struct ethtool_link_ksettings *cmd)
 {
struct catc *catc = netdev_priv(dev);
if (!catc->is_f5u011)
return -EOPNOTSUPP;
 
-   cmd->supported = SUPPORTED_10baseT_Half | SUPPORTED_TP;
-   cmd->advertising = ADVERTISED_10baseT_Half | ADVERTISED_TP;
-   ethtool_cmd_speed_set(cmd, SPEED_10);
-   cmd->duplex = DUPLEX_HALF;
-   cmd->port = PORT_TP; 
-   cmd->phy_address = 0;
-   cmd->transceiver = XCVR_INTERNAL;
-   cmd->autoneg = AUTONEG_DISABLE;
-   cmd->maxtxpkt = 1;
-   cmd->maxrxpkt = 1;
+   ethtool_link_ksettings_zero_link_mode(cmd, supported);
+   ethtool_link_ksettings_add_link_mode(cmd, supported, 10baseT_Half);
+   ethtool_link_ksettings_add_link_mode(cmd, supported, TP);
+
+   ethtool_link_ksettings_zero_link_mode(cmd, advertising);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, 10baseT_Half);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, TP);
+
+   cmd->base.speed = SPEED_10;
+   cmd->base.duplex = DUPLEX_HALF;
+   cmd->base.port = PORT_TP;
+   cmd->base.phy_address = 0;
+   cmd->base.autoneg = AUTONEG_DISABLE;
+
return 0;
 }
 
 static const struct ethtool_ops ops = {
.get_drvinfo = catc_get_drvinfo,
-   .get_settings = catc_get_settings,
-   .get_link = ethtool_op_get_link
+   .get_link = ethtool_op_get_link,
+   .get_link_ksettings = catc_get_link_ksettings,
 };
 
 /*
-- 
1.7.4.4
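
The catc conversion above replaces legacy u32 masks with link-mode bitmaps. A minimal userspace model of the zero/add helpers is sketched below; the sk_ names and the two-word bitmap size are assumptions for illustration, while the bit positions 0 (10baseT_Half) and 7 (TP) follow the ethtool link-mode numbering.

```c
#include <string.h>

/* Bit positions as in the ethtool link-mode enumeration. */
#define SK_LINK_MODE_10baseT_Half_BIT 0
#define SK_LINK_MODE_TP_BIT           7

/* Hypothetical bitmap size; the kernel sizes it from the number of modes. */
#define SK_LINK_MODE_WORDS 2

struct sk_link_modes {
	unsigned long supported[SK_LINK_MODE_WORDS];
};

/* Clear every bit in a link-mode bitmap. */
static void sk_zero_link_mode(unsigned long *map)
{
	memset(map, 0, SK_LINK_MODE_WORDS * sizeof(*map));
}

/* Set one link-mode bit, selecting the word and the bit within it. */
static void sk_add_link_mode(unsigned long *map, unsigned int bit)
{
	const unsigned int bits_per_word = 8 * sizeof(unsigned long);

	map[bit / bits_per_word] |= 1UL << (bit % bits_per_word);
}
```

A bitmap removes the 32-mode ceiling of the legacy u32 fields, which is the motivation for the new get/set_link_ksettings API.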

[PATCH iproute2] man: add examples to ip.8

2017-03-12 Thread Alexander Alemayhu

Having some examples in the top level man page might make it a little bit easier
for new users to get started. Reused some words / sentences from the existing
man pages.

Suggested-by: 積丹尼 Dan Jacobson 
Signed-off-by: Alexander Alemayhu 
---
This is my first man page patch, hopefully I've done everything correctly.
If not please let me know, thanks.

 man/man8/ip.8 | 28 
 1 file changed, 28 insertions(+)

diff --git a/man/man8/ip.8 b/man/man8/ip.8
index 8ecb1996da92..1c5a7419e4fc 100644
--- a/man/man8/ip.8
+++ b/man/man8/ip.8
@@ -319,6 +319,34 @@ or, if the objects of this class cannot be listed,
 Exit status is 0 if command was successful, and 1 if there is a syntax error.
 If an error was reported by the kernel exit status is 2.
 
+.SH "EXAMPLES"
+.PP
+ip addr
+.RS 4
+Shows addresses assigned to all network interfaces.
+.RE
+.PP
+ip neigh
+.RS 4
+Shows the current neighbour table in kernel.
+.RE
+.PP
+ip link set x up
+.RS 4
+Bring up interface x.
+.RE
+.PP
+ip link set x down
+.RE
+.RS 4
+Bring down interface x.
+.RE
+.PP
+ip route
+.RS 4
+Show table routes.
+.RE
+
 .SH HISTORY
 .B ip
 was written by Alexey N. Kuznetsov and added in Linux 2.2.
-- 
2.9.3

Re: [PATCH] Enable tx timestamping on loopback and dummy

2017-03-12 Thread Ezequiel Lara Gomez

On Sun, Mar 12, 2017 at 05:30:00PM +0100, Richard Cochran wrote:
> On Sun, Mar 12, 2017 at 12:17:30AM +0100, Oliver Hartkopp wrote:
> > in fact you're doing three different things here:
> > 
> > 1. introduce tx timestamping
> > 2. silently change an include:  -> 
> > 3. fix some whitespace and empty line issues
> > 
> > You'd better provide one patch for 1 & 2 and explain why 2 is needed.
> 
> And while you are at it, explain why #1 is needed.

Certainly - 2 (and 3) come from checkpatch.pl suggestions 
(I'm mostly following through https://kernelnewbies.org/FirstKernelPatch ), 
so as suggested I've broken them into a separate patch, being
styling related only (at least according to that script output).

The intent behind 1 is helping developing code that does TX timestamping
- I lost a good chunk of time toying with it until I realised 127.0.0.1
could not do TX timestamping because lo did not implement it.

>
> 
> Thanks,
> Richard
> 

Re: [PATCHv2 1/4] rds: ib: drop unnecessary rdma_reject

2017-03-12 Thread santosh.shilim...@oracle.com

On 3/12/17 12:33 PM, Leon Romanovsky wrote:
> On Sun, Mar 12, 2017 at 04:07:55AM -0400, Zhu Yanjun wrote:
>> When rdma_accept fails, rdma_reject is called in it. As such, it is
>> not necessary to execute rdma_reject again.
>>
>> Cc: Joe Jin 
>> Cc: Junxiao Bi 
>> Acked-by: Santosh Shilimkar 
>> Signed-off-by: Zhu Yanjun 
>> ---
>> Change from v1 to v2:
>>   Add the acker.
>>
>>  net/rds/ib_cm.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
>> index ce3775a..eca3d5f 100644
>> --- a/net/rds/ib_cm.c
>> +++ b/net/rds/ib_cm.c
>> @@ -677,8 +677,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
>> event->param.conn.initiator_depth);
>>
>> /* rdma_accept() calls rdma_reject() internally if it fails */
>> -   err = rdma_accept(cm_id, &conn_param);
>> -   if (err)
>> +   if (rdma_accept(cm_id, &conn_param))
>> rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);
>
> You omitted initialization of "err" variable which you print here ^.

It's initialized by rds_ib_setup_qp() but you are right. It will print
failed with error = 0. :-)

Zhu, please drop that 'err' from the message.

Re: [PATCHv2 3/4] rds: ib: add the static type to the function

2017-03-12 Thread Leon Romanovsky

On Sun, Mar 12, 2017 at 04:07:57AM -0400, Zhu Yanjun wrote:
> The function rds_ib_map_fmr is used only in the ib_fmr.c
> file. As such, the static type is added to limit it in this file.
>
> Cc: Joe Jin 
> Cc: Junxiao Bi 
> Acked-by: Santosh Shilimkar 
> Signed-off-by: Zhu Yanjun 
> ---

Thanks,
Reviewed-by: Leon Romanovsky 



Re: [PATCHv2 1/4] rds: ib: drop unnecessary rdma_reject

2017-03-12 Thread Leon Romanovsky

On Sun, Mar 12, 2017 at 04:07:55AM -0400, Zhu Yanjun wrote:
> When rdma_accept fails, rdma_reject is called in it. As such, it is
> not necessary to execute rdma_reject again.
>
> Cc: Joe Jin 
> Cc: Junxiao Bi 
> Acked-by: Santosh Shilimkar 
> Signed-off-by: Zhu Yanjun 
> ---
> Change from v1 to v2:
>   Add the acker.
>
>  net/rds/ib_cm.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
> index ce3775a..eca3d5f 100644
> --- a/net/rds/ib_cm.c
> +++ b/net/rds/ib_cm.c
> @@ -677,8 +677,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
>   event->param.conn.initiator_depth);
>
>   /* rdma_accept() calls rdma_reject() internally if it fails */
> - err = rdma_accept(cm_id, &conn_param);
> - if (err)
> + if (rdma_accept(cm_id, &conn_param))
>   rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);

You omitted initialization of "err" variable which you print here ^.

>
>  out:
> --
> 2.7.4
>



Re: [PATCH v2] fjes: Do not load fjes driver if system does not have extended socket device.

2017-03-12 Thread Bjørn Mork

Yasuaki Ishimatsu  writes:

> The fjes driver is used only by FUJITSU servers and almost of all
> servers in the world never use it. But currently if ACPI PNP0C02
> is defined in the ACPI table, the following message is always shown:
>
>  "FUJITSU Extended Socket Network Device Driver - version 1.2
>   - Copyright (c) 2015 FUJITSU LIMITED"

Matching on PNP0C02 is fundamentally wrong. It's a way to load a device
driver on all ACPI systems.  You should not do that. I don't think it is
fair to make everyone suffer because of your inability to properly
narrow down the driver matching rules.

Could we please just delete the whole MODULE_DEVICE_TABLE() from this
driver until a proper solution is found? That way we don't need to
blacklist the driver everywhere.

Bjørn

Re: [PATCH v2 06/23] MAINTAINERS: Add file patterns for dsa device tree bindings

2017-03-12 Thread Andrew Lunn

On Sun, Mar 12, 2017 at 02:16:50PM +0100, Geert Uytterhoeven wrote:
> Submitters of device tree binding documentation may forget to CC
> the subsystem maintainer if this is missing.

> diff --git a/MAINTAINERS b/MAINTAINERS
> index ce461fefec6c9463..c0822d4fa5c3ade5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7812,6 +7812,7 @@ M:  Andrew Lunn 
>  M:   Vivien Didelot 
>  L:   netdev@vger.kernel.org
>  S:   Maintained
> +F:   Documentation/devicetree/bindings/net/dsa/
>  F:   drivers/net/dsa/mv88e6xxx/
>  F:   Documentation/devicetree/bindings/net/dsa/marvell.txt

Hi Geert

If I'm reading this patch correctly, you are putting it under

MARVELL 88E6XXX ETHERNET SWITCH FABRIC DRIVER
M:  Andrew Lunn 
M:  Vivien Didelot 
L:  netdev@vger.kernel.org
S:  Maintained
F:  drivers/net/dsa/mv88e6xxx/
F:  Documentation/devicetree/bindings/net/dsa/marvell.txt

I would say the correct place for
Documentation/devicetree/bindings/net/dsa/ is

NETWORKING [DSA]
M:  Andrew Lunn 
M:  Vivien Didelot 
M:  Florian Fainelli 
S:  Maintained
F:  net/dsa/
F:  include/net/dsa.h
F:  drivers/net/dsa/

Andrew

[PATCH] net: usb: asix88179_178a: use new api ethtool_{get|set}_link_ksettings

2017-03-12 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/ax88179_178a.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c
index a3a7db0..4a0ae7c 100644
--- a/drivers/net/usb/ax88179_178a.c
+++ b/drivers/net/usb/ax88179_178a.c
@@ -620,16 +620,18 @@ static int ax88179_get_eeprom_len(struct net_device *net)
return 0;
 }
 
-static int ax88179_get_settings(struct net_device *net, struct ethtool_cmd *cmd)
+static int ax88179_get_link_ksettings(struct net_device *net,
+ struct ethtool_link_ksettings *cmd)
 {
struct usbnet *dev = netdev_priv(net);
-   return mii_ethtool_gset(&dev->mii, cmd);
+   return mii_ethtool_get_link_ksettings(&dev->mii, cmd);
 }
 
-static int ax88179_set_settings(struct net_device *net, struct ethtool_cmd *cmd)
+static int ax88179_set_link_ksettings(struct net_device *net,
+ const struct ethtool_link_ksettings *cmd)
 {
struct usbnet *dev = netdev_priv(net);
-   return mii_ethtool_sset(&dev->mii, cmd);
+   return mii_ethtool_set_link_ksettings(&dev->mii, cmd);
 }
 }
 
 static int
@@ -826,11 +828,11 @@ static int ax88179_ioctl(struct net_device *net, struct ifreq *rq, int cmd)
.set_wol= ax88179_set_wol,
.get_eeprom_len = ax88179_get_eeprom_len,
.get_eeprom = ax88179_get_eeprom,
-   .get_settings   = ax88179_get_settings,
-   .set_settings   = ax88179_set_settings,
.get_eee= ax88179_get_eee,
.set_eee= ax88179_set_eee,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = ax88179_get_link_ksettings,
+   .set_link_ksettings = ax88179_set_link_ksettings,
 };
 
 static void ax88179_set_multicast(struct net_device *net)
-- 
1.7.4.4

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Eric Dumazet

On Sun, 2017-03-12 at 17:49 +0200, Saeed Mahameed wrote:
> On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet  wrote:
> > On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
> >
> >> Problem is XDP TX :
> >>
> >> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
> >> RX is owned by current cpu.
> >>
> >> Since TX completion is using a different NAPI, I really do not believe
> >> we can avoid an atomic operation, like a spinlock, to protect the list
> >> of pages ( ring->page_cache )
> >
> > A quick fix for net-next would be :
> >
> 
> Hi Eric, Good catch.
> 
> I don't think we need to complicate with an expensive spinlock,
>  we can simply fix this by not enabling interrupts on XDP TX CQ (not
> arm this CQ at all).
> and handle XDP TX CQ completion from the RX NAPI context, in a serial
> (Atomic) manner before handling RX completions themselves.
> This way locking is not required since all page cache handling is done
> from the same context (RX NAPI).
> 
> This is how we do this in mlx5, and this is the best approach
> (performance wise) since we delay XDP TX CQ completions handling
> until we really need the space they hold (On new RX packets).

SGTM, can you provide the patch for mlx4 ?

Thanks !

Re: [PATCH] Enable tx timestamping on loopback and dummy

2017-03-12 Thread Richard Cochran

On Sun, Mar 12, 2017 at 12:17:30AM +0100, Oliver Hartkopp wrote:
> in fact you're doing three different things here:
> 
> 1. introduce tx timestamping
> 2. silently change an include:  -> 
> 3. fix some whitespace and empty line issues
> 
> You'd better provide one patch for 1 & 2 and explain why 2 is needed.

And while you are at it, explain why #1 is needed.

Thanks,
Richard

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Saeed Mahameed

On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet  wrote:
> On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
>
>> Problem is XDP TX :
>>
>> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
>> RX is owned by current cpu.
>>
>> Since TX completion is using a different NAPI, I really do not believe
>> we can avoid an atomic operation, like a spinlock, to protect the list
>> of pages ( ring->page_cache )
>
> A quick fix for net-next would be :
>

Hi Eric, Good catch.

I don't think we need to complicate with an expensive spinlock,
 we can simply fix this by not enabling interrupts on XDP TX CQ (not
arm this CQ at all).
and handle XDP TX CQ completion from the RX NAPI context, in a serial
(Atomic) manner before handling RX completions themselves.
This way locking is not required since all page cache handling is done
from the same context (RX NAPI).

This is how we do this in mlx5, and this is the best approach
(performance wise) since we delay XDP TX CQ completions handling
until we really need the space they hold (On new RX packets).
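
For illustration only (not part of any posted patch): a minimal userspace
sketch of the idea above, assuming the XDP TX completion queue is never
armed and is drained only from the RX poll routine. All names here
(page_cache, xdp_tx_cq, rx_poll, CACHE_SIZE, TXQ_SIZE) are hypothetical
stand-ins, not mlx4 or mlx5 symbols. Because both the recycle path and
the refill path run in the same context, the cache needs no lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE 8   /* hypothetical; stands in for MLX4_EN_CACHE_SIZE */
#define TXQ_SIZE   16  /* hypothetical completion queue depth */

struct page_cache {
	size_t index;            /* number of cached pages */
	void *buf[CACHE_SIZE];
};

struct xdp_tx_cq {
	size_t head;             /* consumer position (RX poll) */
	size_t tail;             /* producer position (simulated hardware) */
	void *done[TXQ_SIZE];    /* pages whose transmit has completed */
};

/* Simulated hardware: post one completed page; false if the queue is full. */
static bool tx_complete(struct xdp_tx_cq *cq, void *page)
{
	if (cq->tail - cq->head >= TXQ_SIZE)
		return false;
	cq->done[cq->tail % TXQ_SIZE] = page;
	cq->tail++;
	return true;
}

/* Push a page into the cache; false if the cache is full. */
static bool cache_put(struct page_cache *c, void *page)
{
	if (c->index >= CACHE_SIZE)
		return false;
	c->buf[c->index++] = page;
	return true;
}

/* Pop the most recently cached page, or NULL if the cache is empty. */
static void *cache_get(struct page_cache *c)
{
	return c->index ? c->buf[--c->index] : NULL;
}

/*
 * Drain pending XDP TX completions into the page cache. Only ever called
 * from rx_poll(), so it never races with cache_get(). A real driver would
 * free the pages that do not fit; this sketch simply drops them.
 */
static size_t drain_xdp_tx(struct xdp_tx_cq *cq, struct page_cache *c)
{
	size_t recycled = 0;

	assert(cq->tail - cq->head <= TXQ_SIZE);
	while (cq->head != cq->tail) {
		void *page = cq->done[cq->head % TXQ_SIZE];

		cq->head++;
		if (cache_put(c, page))
			recycled++;
	}
	return recycled;
}

/* RX poll: reap TX completions first, then refill empty RX slots. */
static size_t rx_poll(struct xdp_tx_cq *cq, struct page_cache *c,
		      void **rx_slots, size_t nslots)
{
	size_t refilled = 0;

	drain_xdp_tx(cq, c);
	for (size_t i = 0; i < nslots; i++) {
		if (rx_slots[i])
			continue;
		rx_slots[i] = cache_get(c);
		if (!rx_slots[i])
			break;
		refilled++;
	}
	return refilled;
}
```

The completions are only looked at when RX actually needs pages, which
is the "delay until we need the space" behaviour described above.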

> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 
> aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..e0b2ea8cefd6beef093c41bade199e3ec4f0291c
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -137,13 +137,17 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv 
> *priv,
> struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
> struct mlx4_en_rx_alloc *frags = ring->rx_info +
> (index << priv->log_rx_info);
> +
> if (ring->page_cache.index > 0) {
> +   spin_lock(>page_cache.lock);
> +
> /* XDP uses a single page per frame */
> if (!frags->page) {
> ring->page_cache.index--;
> frags->page = 
> ring->page_cache.buf[ring->page_cache.index].page;
> frags->dma  = 
> ring->page_cache.buf[ring->page_cache.index].dma;
> }
> +   spin_unlock(>page_cache.lock);
> frags->page_offset = XDP_PACKET_HEADROOM;
> rx_desc->data[0].addr = cpu_to_be64(frags->dma +
> XDP_PACKET_HEADROOM);
> @@ -277,6 +281,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
> }
> }
>
> +   spin_lock_init(>page_cache.lock);
> ring->prod = 0;
> ring->cons = 0;
> ring->size = size;
> @@ -419,10 +424,13 @@ bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>
> if (cache->index >= MLX4_EN_CACHE_SIZE)
> return false;
> -
> -   cache->buf[cache->index].page = frame->page;
> -   cache->buf[cache->index].dma = frame->dma;
> -   cache->index++;
> +   spin_lock(>lock);
> +   if (cache->index < MLX4_EN_CACHE_SIZE) {
> +   cache->buf[cache->index].page = frame->page;
> +   cache->buf[cache->index].dma = frame->dma;
> +   cache->index++;
> +   }
> +   spin_unlock(>lock);
> return true;
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
> b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 
> 39f401aa30474e61c0b0029463b23a829ec35fa3..090a08020d13d8e11cc163ac9fc6ac6affccc463
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -258,7 +258,8 @@ struct mlx4_en_rx_alloc {
>  #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
>
>  struct mlx4_en_page_cache {
> -   u32 index;
> +   u32 index;
> +   spinlock_t  lock;
> struct {
> struct page *page;
> dma_addr_t  dma;
>
>

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Eric Dumazet

On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:

> Problem is XDP TX :
> 
> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
> RX is owned by current cpu.
> 
> Since TX completion is using a different NAPI, I really do not believe
> we can avoid an atomic operation, like a spinlock, to protect the list
> of pages ( ring->page_cache )

A quick fix for net-next would be :

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..e0b2ea8cefd6beef093c41bade199e3ec4f0291c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -137,13 +137,17 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
struct mlx4_en_rx_alloc *frags = ring->rx_info +
(index << priv->log_rx_info);
+
if (ring->page_cache.index > 0) {
+   spin_lock(>page_cache.lock);
+
/* XDP uses a single page per frame */
if (!frags->page) {
ring->page_cache.index--;
frags->page = 
ring->page_cache.buf[ring->page_cache.index].page;
frags->dma  = 
ring->page_cache.buf[ring->page_cache.index].dma;
}
+   spin_unlock(>page_cache.lock);
frags->page_offset = XDP_PACKET_HEADROOM;
rx_desc->data[0].addr = cpu_to_be64(frags->dma +
XDP_PACKET_HEADROOM);
@@ -277,6 +281,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
}
}
 
+   spin_lock_init(>page_cache.lock);
ring->prod = 0;
ring->cons = 0;
ring->size = size;
@@ -419,10 +424,13 @@ bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
 
if (cache->index >= MLX4_EN_CACHE_SIZE)
return false;
-
-   cache->buf[cache->index].page = frame->page;
-   cache->buf[cache->index].dma = frame->dma;
-   cache->index++;
+   spin_lock(>lock);
+   if (cache->index < MLX4_EN_CACHE_SIZE) {
+   cache->buf[cache->index].page = frame->page;
+   cache->buf[cache->index].dma = frame->dma;
+   cache->index++;
+   }
+   spin_unlock(>lock);
return true;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 39f401aa30474e61c0b0029463b23a829ec35fa3..090a08020d13d8e11cc163ac9fc6ac6affccc463 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -258,7 +258,8 @@ struct mlx4_en_rx_alloc {
 #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
 
 struct mlx4_en_page_cache {
-   u32 index;
+   u32 index;
+   spinlock_t  lock;
struct {
struct page *page;
dma_addr_t  dma;

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Eric Dumazet

On Wed, 2017-02-22 at 18:06 -0800, Eric Dumazet wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
> 
> > 
> > Right but you were talking about using both halves one after the
> > other.  If that occurs you have nothing left that you can reuse.  That
> > was what I was getting at.  If you use up both halves you end up
> > having to unmap the page.
> > 
> 
> You must have misunderstood me.
> 
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
> 
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
> 
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
> 
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
> 
> If not, _then_ we unmap and release the page.
> 
> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
> 
> To recap on x86 :
> 
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
> 
> And 2048 pages would be in the reserve.
> 
> 
> > The whole idea behind using only half the page per descriptor is to
> > allow us to loop through the ring before we end up reusing it again.
> > That buys us enough time that usually the stack has consumed the frame
> > before we need it again.
> 
> 
> The same will happen really.
> 
> Best maybe is for me to send the patch ;)
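
For illustration only: a minimal userspace sketch of the 'quarantine'
ring described in the quoted text, assuming a plain integer stands in
for the page reference count and a boolean for the DMA mapping. The
names (fake_page, quarantine, QUARANTINE_SIZE) are hypothetical and do
not come from the mlx4 driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUARANTINE_SIZE 4  /* hypothetical; the thread discusses 2048 pages */

struct fake_page {
	int refcount;  /* 1 means only the driver still holds a reference */
	bool mapped;   /* stands in for the DMA mapping */
};

struct quarantine {
	struct fake_page *ring[QUARANTINE_SIZE];
	size_t head;   /* index of the oldest entry */
	size_t count;  /* number of parked pages */
};

/* Park a page whose two halves were both used; false if the ring is full. */
static bool quarantine_push(struct quarantine *q, struct fake_page *p)
{
	assert(p != NULL);
	if (q->count == QUARANTINE_SIZE)
		return false;
	q->ring[(q->head + q->count) % QUARANTINE_SIZE] = p;
	q->count++;
	return true;
}

/*
 * Look at the oldest parked page when an RX descriptor needs replenishing.
 * If the stack has dropped its references (refcount back to 1), the page is
 * returned for immediate reuse with its mapping intact. Otherwise the page
 * is unmapped, the driver's reference is dropped, and NULL is returned so
 * the caller allocates a fresh page.
 */
static struct fake_page *quarantine_pop_reusable(struct quarantine *q)
{
	struct fake_page *p;

	if (q->count == 0)
		return NULL;

	p = q->ring[q->head];
	q->head = (q->head + 1) % QUARANTINE_SIZE;
	q->count--;

	if (p->refcount == 1)
		return p;

	p->mapped = false;
	p->refcount--;
	return NULL;
}
```

The FIFO order is the point of the design: the oldest page has had the
longest time for the stack to consume both halves, so it is the most
likely candidate for reuse.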

Excellent results so far, performance on PowerPC is back, and x86 gets a
gain as well.

Problem is XDP TX :

I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
RX is owned by current cpu.

Since TX completion is using a different NAPI, I really do not believe
we can avoid an atomic operation, like a spinlock, to protect the list
of pages ( ring->page_cache )

[PATCH v2 06/23] MAINTAINERS: Add file patterns for dsa device tree bindings

2017-03-12 Thread Geert Uytterhoeven

Submitters of device tree binding documentation may forget to CC
the subsystem maintainer if this is missing.

Signed-off-by: Geert Uytterhoeven 
Cc: Andrew Lunn 
Cc: Vivien Didelot 
Cc: netdev@vger.kernel.org
---
Please apply this patch directly if you want to be involved in device
tree binding documentation for your subsystem.

v2:
  - Rebased on top of commit 0d3cd4b6b49865e8 ("net: dsa: mv88e6xxx:
move driver in its own folder").

Impact on next-20170310:

+Andrew Lunn  (maintainer:MARVELL 88E6XXX ETHERNET SWITCH 
FABRIC DRIVER)
+Vivien Didelot  (maintainer:MARVELL 
88E6XXX ETHERNET SWITCH FABRIC DRIVER)
 Rob Herring  (maintainer:OPEN FIRMWARE AND FLATTENED 
DEVICE TREE BINDINGS)
 Mark Rutland  (maintainer:OPEN FIRMWARE AND FLATTENED 
DEVICE TREE BINDINGS)
-"David S. Miller"  (commit_signer:11/12=92%)
-Andrew Lunn  (commit_signer:8/12=67%,authored:7/12=58%)
-Florian Fainelli  
(commit_signer:4/12=33%,authored:2/12=17%)
-Vivien Didelot  
(commit_signer:2/12=17%,authored:1/12=8%)
-John Crispin  (commit_signer:1/12=8%,authored:1/12=8%)
-"Otto Kekäläinen"  (authored:1/12=8%)
-netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
+netdev@vger.kernel.org (open list:MARVELL 88E6XXX ETHERNET SWITCH FABRIC 
DRIVER)
 devicet...@vger.kernel.org (open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE 
BINDINGS)
 linux-ker...@vger.kernel.org (open list)
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ce461fefec6c9463..c0822d4fa5c3ade5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7812,6 +7812,7 @@ M:Andrew Lunn 
 M: Vivien Didelot 
 L: netdev@vger.kernel.org
 S: Maintained
+F: Documentation/devicetree/bindings/net/dsa/
 F: drivers/net/dsa/mv88e6xxx/
 F: Documentation/devicetree/bindings/net/dsa/marvell.txt
 
-- 
2.7.4

Re: regression (4.10) - interface remove uevents not generated

2017-03-12 Thread Mantas Mikulėnas

On 2017-03-12 08:51, Andrei Vagin wrote:
> On Sat, Mar 11, 2017 at 11:24:34PM +0200, Mantas Mikulėnas wrote:
>> On 2017-03-11 21:50, Andrei Vagin wrote:
>>> Hi Mantas,
>>>
>>> Thank you for the report. Could you try out the attached patch?
>>
>> Thanks, I tested it on current master but it doesn't seem to help; there
>> still aren't any uevents for removed interfaces.
> 
> I reproduced the issue on my host and the correct patch is attached to
> this message. Thanks!

Just tested, the new patch seems to be working fine here.

Thanks for the quick fix; any chance of getting it into 4.10.x as well?

-- 
Mantas Mikulėnas

[PATCH 1/1] r8169: replace init_timer with setup_timer

2017-03-12 Thread Zhu Yanjun

Replace init_timer with setup_timer to simplify the source code.

Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/realtek/r8169.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 81f18a8..44cc422 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -8444,9 +8444,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
tp->opts1_mask = (tp->mac_version != RTL_GIGA_MAC_VER_01) ?
~(RxBOVF | RxFOVF) : ~0;
 
-   init_timer(>timer);
-   tp->timer.data = (unsigned long) dev;
-   tp->timer.function = rtl8169_phy_timer;
+   setup_timer(>timer, rtl8169_phy_timer, (unsigned long)dev);
 
tp->rtl_fw = RTL_FIRMWARE_UNKNOWN;
 
-- 
2.7.4

[PATCH v7 4/6] ipv6: addrconf: fix 48 bit 6lowpan autoconfiguration

2017-03-12 Thread Luiz Augusto von Dentz

From: Alexander Aring 

This patch adds support for 48 bit 6LoWPAN address length
autoconfiguration which is the case for BTLE 6LoWPAN.

Signed-off-by: Alexander Aring 
Signed-off-by: Luiz Augusto von Dentz 
Reviewed-by: Stefan Schmidt 
Acked-by: David S. Miller 
---
 net/ipv6/addrconf.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3a2025f..7756640 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2052,12 +2052,19 @@ static void addrconf_leave_anycast(struct inet6_ifaddr *ifp)
__ipv6_dev_ac_dec(ifp->idev, );
 }
 
-static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev)
+static int addrconf_ifid_6lowpan(u8 *eui, struct net_device *dev)
 {
-   if (dev->addr_len != EUI64_ADDR_LEN)
+   switch (dev->addr_len) {
+   case ETH_ALEN:
+   return addrconf_ifid_eui48(eui, dev);
+   case EUI64_ADDR_LEN:
+   memcpy(eui, dev->dev_addr, EUI64_ADDR_LEN);
+   eui[0] ^= 2;
+   break;
+   default:
return -1;
-   memcpy(eui, dev->dev_addr, EUI64_ADDR_LEN);
-   eui[0] ^= 2;
+   }
+
return 0;
 }
 
@@ -2149,7 +2156,7 @@ static int ipv6_generate_eui64(u8 *eui, struct net_device *dev)
case ARPHRD_TUNNEL:
return addrconf_ifid_gre(eui, dev);
case ARPHRD_6LOWPAN:
-   return addrconf_ifid_eui64(eui, dev);
+   return addrconf_ifid_6lowpan(eui, dev);
case ARPHRD_IEEE1394:
return addrconf_ifid_ieee1394(eui, dev);
case ARPHRD_TUNNEL6:
-- 
2.9.3

[PATCH v7 2/6] 6lowpan: Set MAC address length according to LOWPAN_LLTYPE

2017-03-12 Thread Luiz Augusto von Dentz

From: Patrik Flykt 

Set MAC address length according to the 6LoWPAN link layer in use.
Bluetooth Low Energy uses 48 bit addressing while IEEE802.15.4 uses
64 bits.

Signed-off-by: Patrik Flykt 
Reviewed-by: Stefan Schmidt 
---
 net/6lowpan/core.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/6lowpan/core.c b/net/6lowpan/core.c
index 5945f7e..5f9909a 100644
--- a/net/6lowpan/core.c
+++ b/net/6lowpan/core.c
@@ -23,7 +23,16 @@ int lowpan_register_netdevice(struct net_device *dev,
 {
int i, ret;
 
-   dev->addr_len = EUI64_ADDR_LEN;
+   switch (lltype) {
+   case LOWPAN_LLTYPE_IEEE802154:
+   dev->addr_len = EUI64_ADDR_LEN;
+   break;
+
+   case LOWPAN_LLTYPE_BTLE:
+   dev->addr_len = ETH_ALEN;
+   break;
+   }
+
dev->type = ARPHRD_6LOWPAN;
dev->mtu = IPV6_MIN_MTU;
dev->priv_flags |= IFF_NO_QUEUE;
-- 
2.9.3

[PATCH v7 5/6] 6lowpan: Use netdev addr_len to determine lladdr len

2017-03-12 Thread Luiz Augusto von Dentz

From: Luiz Augusto von Dentz 

This allows technologies such as Bluetooth to use their native lladdr,
which is eui48, instead of the eui64 that was expected by functions like
lowpan_header_decompress and lowpan_header_compress.

Signed-off-by: Luiz Augusto von Dentz 
Reviewed-by: Stefan Schmidt 
---
 include/net/6lowpan.h   | 19 +++
 net/6lowpan/iphc.c  | 49 ++---
 net/bluetooth/6lowpan.c | 42 ++
 3 files changed, 63 insertions(+), 47 deletions(-)

diff --git a/include/net/6lowpan.h b/include/net/6lowpan.h
index 5ab4c99..c5792cb 100644
--- a/include/net/6lowpan.h
+++ b/include/net/6lowpan.h
@@ -198,6 +198,25 @@ static inline void 
lowpan_iphc_uncompress_eui64_lladdr(struct in6_addr *ipaddr,
ipaddr->s6_addr[8] ^= 0x02;
 }
 
+static inline void lowpan_iphc_uncompress_eui48_lladdr(struct in6_addr *ipaddr,
+  const void *lladdr)
+{
+   /* fe:80:::XXff:feXX:
+*\_/
+*  hwaddr
+*/
+   ipaddr->s6_addr[0] = 0xFE;
+   ipaddr->s6_addr[1] = 0x80;
+   memcpy(>s6_addr[8], lladdr, 3);
+   ipaddr->s6_addr[11] = 0xFF;
+   ipaddr->s6_addr[12] = 0xFE;
+   memcpy(>s6_addr[13], lladdr + 3, 3);
+   /* second bit-flip (Universe/Local)
+* is done according RFC2464
+*/
+   ipaddr->s6_addr[8] ^= 0x02;
+}
+
 #ifdef DEBUG
 /* print data in line */
 static inline void raw_dump_inline(const char *caller, char *msg,
diff --git a/net/6lowpan/iphc.c b/net/6lowpan/iphc.c
index fb5f6fa..6b1042e 100644
--- a/net/6lowpan/iphc.c
+++ b/net/6lowpan/iphc.c
@@ -278,6 +278,23 @@ lowpan_iphc_ctx_get_by_mcast_addr(const struct net_device 
*dev,
return ret;
 }
 
+static void lowpan_iphc_uncompress_lladdr(const struct net_device *dev,
+ struct in6_addr *ipaddr,
+ const void *lladdr)
+{
+   switch (dev->addr_len) {
+   case ETH_ALEN:
+   lowpan_iphc_uncompress_eui48_lladdr(ipaddr, lladdr);
+   break;
+   case EUI64_ADDR_LEN:
+   lowpan_iphc_uncompress_eui64_lladdr(ipaddr, lladdr);
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   break;
+   }
+}
+
 /* Uncompress address function for source and
  * destination address(non-multicast).
  *
@@ -320,7 +337,7 @@ static int lowpan_iphc_uncompress_addr(struct sk_buff *skb,
lowpan_iphc_uncompress_802154_lladdr(ipaddr, lladdr);
break;
default:
-   lowpan_iphc_uncompress_eui64_lladdr(ipaddr, lladdr);
+   lowpan_iphc_uncompress_lladdr(dev, ipaddr, lladdr);
break;
}
break;
@@ -381,7 +398,7 @@ static int lowpan_iphc_uncompress_ctx_addr(struct sk_buff 
*skb,
lowpan_iphc_uncompress_802154_lladdr(ipaddr, lladdr);
break;
default:
-   lowpan_iphc_uncompress_eui64_lladdr(ipaddr, lladdr);
+   lowpan_iphc_uncompress_lladdr(dev, ipaddr, lladdr);
break;
}
ipv6_addr_prefix_copy(ipaddr, >pfx, ctx->plen);
@@ -810,6 +827,21 @@ lowpan_iphc_compress_ctx_802154_lladdr(const struct 
in6_addr *ipaddr,
return lladdr_compress;
 }
 
+static bool lowpan_iphc_addr_equal(const struct net_device *dev,
+  const struct lowpan_iphc_ctx *ctx,
+  const struct in6_addr *ipaddr,
+  const void *lladdr)
+{
+   struct in6_addr tmp = {};
+
+   lowpan_iphc_uncompress_lladdr(dev, , lladdr);
+
+   if (ctx)
+   ipv6_addr_prefix_copy(, >pfx, ctx->plen);
+
+   return ipv6_addr_equal(, ipaddr);
+}
+
 static u8 lowpan_compress_ctx_addr(u8 **hc_ptr, const struct net_device *dev,
   const struct in6_addr *ipaddr,
   const struct lowpan_iphc_ctx *ctx,
@@ -827,13 +859,7 @@ static u8 lowpan_compress_ctx_addr(u8 **hc_ptr, const 
struct net_device *dev,
}
break;
default:
-   /* check for SAM/DAM = 11 */
-   memcpy(_addr[8], lladdr, EUI64_ADDR_LEN);
-   /* second bit-flip (Universe/Local) is done according RFC2464 */
-   tmp.s6_addr[8] ^= 0x02;
-   /* context information are always used */
-   ipv6_addr_prefix_copy(, >pfx, ctx->plen);
-   if (ipv6_addr_equal(, ipaddr)) {
+   if (lowpan_iphc_addr_equal(dev, ctx, ipaddr, lladdr)) {
dam = LOWPAN_IPHC_DAM_11;
goto

[PATCH v7 3/6] 6lowpan: iphc: override l2 packet information

2017-03-12 Thread Luiz Augusto von Dentz

From: Alexander Aring 

The skb->pkt_type needs to be set by L2, but 6LoWPAN also runs over link
layers, e.g. BTLE, that have no multicast addressing. Whether a packet is
multicast is detected from the multicast bit in the IPHC header. The IPv6
layer evaluates pkt_type, so we force-set it while uncompressing. This
should be okay for 802.15.4 as well.
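
For illustration only: a minimal userspace sketch of the classification
described above, assuming the M and DAC bit positions of the second IPHC
byte from the RFC 6282 base format. The enum values and the function
name are hypothetical stand-ins for the kernel's PACKET_* constants and
are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in the second IPHC byte (RFC 6282 base format). */
#define IPHC_M   0x08  /* destination address is multicast */
#define IPHC_DAC 0x04  /* destination uses context-based compression */

/* Hypothetical stand-ins for the kernel's PACKET_* packet types. */
enum pkt_type { PKT_HOST, PKT_BROADCAST };

/*
 * Derive the packet type from the IPHC header instead of from L2.
 * DAC only selects how the address is decompressed; the M bit alone
 * decides whether the packet is treated as multicast.
 */
static enum pkt_type pkt_type_from_iphc(uint8_t iphc1)
{
	return (iphc1 & IPHC_M) ? PKT_BROADCAST : PKT_HOST;
}
```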

Signed-off-by: Alexander Aring 
Reviewed-by: Stefan Schmidt 
---
 net/6lowpan/iphc.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/6lowpan/iphc.c b/net/6lowpan/iphc.c
index 79f1fa2..fb5f6fa 100644
--- a/net/6lowpan/iphc.c
+++ b/net/6lowpan/iphc.c
@@ -666,6 +666,8 @@ int lowpan_header_decompress(struct sk_buff *skb, const 
struct net_device *dev,
 
switch (iphc1 & (LOWPAN_IPHC_M | LOWPAN_IPHC_DAC)) {
case LOWPAN_IPHC_M | LOWPAN_IPHC_DAC:
+   skb->pkt_type = PACKET_BROADCAST;
+
spin_lock_bh(_dev(dev)->ctx.lock);
ci = lowpan_iphc_ctx_get_by_id(dev, LOWPAN_IPHC_CID_DCI(cid));
if (!ci) {
@@ -681,11 +683,15 @@ int lowpan_header_decompress(struct sk_buff *skb, const 
struct net_device *dev,
spin_unlock_bh(_dev(dev)->ctx.lock);
break;
case LOWPAN_IPHC_M:
+   skb->pkt_type = PACKET_BROADCAST;
+
/* multicast */
err = lowpan_uncompress_multicast_daddr(skb, ,
iphc1 & 
LOWPAN_IPHC_DAM_MASK);
break;
case LOWPAN_IPHC_DAC:
+   skb->pkt_type = PACKET_HOST;
+
spin_lock_bh(_dev(dev)->ctx.lock);
ci = lowpan_iphc_ctx_get_by_id(dev, LOWPAN_IPHC_CID_DCI(cid));
if (!ci) {
@@ -701,6 +707,8 @@ int lowpan_header_decompress(struct sk_buff *skb, const 
struct net_device *dev,
spin_unlock_bh(_dev(dev)->ctx.lock);
break;
default:
+   skb->pkt_type = PACKET_HOST;
+
err = lowpan_iphc_uncompress_addr(skb, dev, ,
  iphc1 & LOWPAN_IPHC_DAM_MASK,
  daddr);
-- 
2.9.3

[PATCH v7 1/6] bluetooth: Set 6 byte device addresses

2017-03-12 Thread Luiz Augusto von Dentz

From: Patrik Flykt 

Set BTLE MAC addresses that are 6 bytes long and not 8 bytes
that are used in other places with 6lowpan.

Signed-off-by: Patrik Flykt 
Signed-off-by: Luiz Augusto von Dentz 
Reviewed-by: Stefan Schmidt 
---
 net/bluetooth/6lowpan.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index 1904a93..1456b01 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -80,6 +80,8 @@ struct lowpan_btle_dev {
struct delayed_work notify_peers;
 };
 
+static void set_addr(u8 *eui, u8 *addr, u8 addr_type);
+
 static inline struct lowpan_btle_dev *
 lowpan_btle_dev(const struct net_device *netdev)
 {
@@ -272,9 +274,10 @@ static int give_skb_to_upper(struct sk_buff *skb, struct 
net_device *dev)
 static int iphc_decompress(struct sk_buff *skb, struct net_device *netdev,
   struct l2cap_chan *chan)
 {
-   const u8 *saddr, *daddr;
+   const u8 *saddr;
struct lowpan_btle_dev *dev;
struct lowpan_peer *peer;
+   unsigned char eui64_daddr[EUI64_ADDR_LEN];
 
dev = lowpan_btle_dev(netdev);
 
@@ -285,9 +288,9 @@ static int iphc_decompress(struct sk_buff *skb, struct 
net_device *netdev,
return -EINVAL;
 
saddr = peer->eui64_addr;
-   daddr = dev->netdev->dev_addr;
+   set_addr(_daddr[0], chan->src.b, chan->src_type);
 
-   return lowpan_header_decompress(skb, netdev, daddr, saddr);
+   return lowpan_header_decompress(skb, netdev, _daddr, saddr);
 }
 
 static int recv_pkt(struct sk_buff *skb, struct net_device *dev,
@@ -681,13 +684,6 @@ static void set_addr(u8 *eui, u8 *addr, u8 addr_type)
BT_DBG("type %d addr %*phC", addr_type, 8, eui);
 }
 
-static void set_dev_addr(struct net_device *netdev, bdaddr_t *addr,
-u8 addr_type)
-{
-   netdev->addr_assign_type = NET_ADDR_PERM;
-   set_addr(netdev->dev_addr, addr->b, addr_type);
-}
-
 static void ifup(struct net_device *netdev)
 {
int err;
@@ -803,7 +799,8 @@ static int setup_netdev(struct l2cap_chan *chan, struct 
lowpan_btle_dev **dev)
if (!netdev)
return -ENOMEM;
 
-   set_dev_addr(netdev, >src, chan->src_type);
+   netdev->addr_assign_type = NET_ADDR_PERM;
+   baswap((void *)netdev->dev_addr, >src);
 
netdev->netdev_ops = _ops;
SET_NETDEV_DEV(netdev, >conn->hcon->hdev->dev);
-- 
2.9.3

[PATCH v7 6/6] 6lowpan: Fix IID format for Bluetooth

2017-03-12 Thread Luiz Augusto von Dentz

From: Luiz Augusto von Dentz 

According to RFC 7668 U/L bit shall not be used:

https://wiki.tools.ietf.org/html/rfc7668#section-3.2.2 [Page 10]:

   In the figure, letter 'b' represents a bit from the
   Bluetooth device address, copied as is without any changes on any
   bit.  This means that no bit in the IID indicates whether the
   underlying Bluetooth device address is public or random.

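   |0              1|1              3|3              4|4              6|
   |0              5|6              1|2              7|8              3|
   +----------------+----------------+----------------+----------------+
   |bbbbbbbbbbbbbbbb|bbbbbbbb11111111|11111110bbbbbbbb|bbbbbbbbbbbbbbbb|
   +----------------+----------------+----------------+----------------+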

Because of this, the code can no longer figure out the address type from
the IP address, so it makes no sense to use peer_lookup_ba, which needs
the peer address type.
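
For illustration only: a minimal userspace sketch of the IID formation
described above, assuming the 48-bit Bluetooth device address is given
most significant byte first (the kernel stores bdaddr_t in the reverse
order). The function and macro names are hypothetical. The 0xFF 0xFE
filler is inserted in the middle and no bit of the address is modified,
so the U/L bit is left untouched.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BT_ADDR_LEN 6  /* 48-bit Bluetooth device address */
#define IID_LEN     8  /* 64-bit interface identifier */

/* Form the IID: three address bytes, 0xFF 0xFE, three address bytes. */
static void btle_iid_from_addr(uint8_t iid[IID_LEN],
			       const uint8_t addr[BT_ADDR_LEN])
{
	assert(iid != NULL && addr != NULL);
	memcpy(iid, addr, 3);
	iid[3] = 0xFF;
	iid[4] = 0xFE;
	memcpy(iid + 5, addr + 3, 3);
	/* Deliberately no "iid[0] ^= 0x02": the U/L bit is not flipped. */
}

/* Build the link-local address fe80::/64 followed by the IID. */
static void link_local_from_iid(uint8_t ip6[16], const uint8_t iid[IID_LEN])
{
	memset(ip6, 0, 16);
	ip6[0] = 0xFE;
	ip6[1] = 0x80;
	memcpy(ip6 + 8, iid, IID_LEN);
}
```

With this layout, an address of 00:11:22:33:44:55 yields the IID
00:11:22:ff:fe:33:44:55.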

Signed-off-by: Luiz Augusto von Dentz 
Reviewed-by: Stefan Schmidt 
Acked-by: Jukka Rissanen 
---
 include/net/6lowpan.h   |  4 ---
 net/bluetooth/6lowpan.c | 79 -
 net/ipv6/addrconf.c |  6 +++-
 3 files changed, 17 insertions(+), 72 deletions(-)

diff --git a/include/net/6lowpan.h b/include/net/6lowpan.h
index c5792cb..a713780 100644
--- a/include/net/6lowpan.h
+++ b/include/net/6lowpan.h
@@ -211,10 +211,6 @@ static inline void 
lowpan_iphc_uncompress_eui48_lladdr(struct in6_addr *ipaddr,
ipaddr->s6_addr[11] = 0xFF;
ipaddr->s6_addr[12] = 0xFE;
memcpy(>s6_addr[13], lladdr + 3, 3);
-   /* second bit-flip (Universe/Local)
-* is done according RFC2464
-*/
-   ipaddr->s6_addr[8] ^= 0x02;
 }
 
 #ifdef DEBUG
diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index 0b68cfc..ec89c55 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -398,37 +398,6 @@ static int chan_recv_cb(struct l2cap_chan *chan, struct 
sk_buff *skb)
return err;
 }
 
-static u8 get_addr_type_from_eui64(u8 byte)
-{
-   /* Is universal(0) or local(1) bit */
-   return ((byte & 0x02) ? BDADDR_LE_RANDOM : BDADDR_LE_PUBLIC);
-}
-
-static void copy_to_bdaddr(struct in6_addr *ip6_daddr, bdaddr_t *addr)
-{
-   u8 *eui64 = ip6_daddr->s6_addr + 8;
-
-   addr->b[0] = eui64[7];
-   addr->b[1] = eui64[6];
-   addr->b[2] = eui64[5];
-   addr->b[3] = eui64[2];
-   addr->b[4] = eui64[1];
-   addr->b[5] = eui64[0];
-}
-
-static void convert_dest_bdaddr(struct in6_addr *ip6_daddr,
-   bdaddr_t *addr, u8 *addr_type)
-{
-   copy_to_bdaddr(ip6_daddr, addr);
-
-   /* We need to toggle the U/L bit that we got from IPv6 address
-* so that we get the proper address and type of the BD address.
-*/
-   addr->b[5] ^= 0x02;
-
-   *addr_type = get_addr_type_from_eui64(addr->b[5]);
-}
-
 static int setup_header(struct sk_buff *skb, struct net_device *netdev,
bdaddr_t *peer_addr, u8 *peer_addr_type)
 {
@@ -436,8 +405,7 @@ static int setup_header(struct sk_buff *skb, struct 
net_device *netdev,
struct ipv6hdr *hdr;
struct lowpan_btle_dev *dev;
struct lowpan_peer *peer;
-   bdaddr_t addr, *any = BDADDR_ANY;
-   u8 *daddr = any->b;
+   u8 *daddr;
int err, status = 0;
 
hdr = ipv6_hdr(skb);
@@ -448,34 +416,24 @@ static int setup_header(struct sk_buff *skb, struct 
net_device *netdev,
 
if (ipv6_addr_is_multicast(_daddr)) {
lowpan_cb(skb)->chan = NULL;
+   daddr = NULL;
} else {
-   u8 addr_type;
+   BT_DBG("dest IP %pI6c", _daddr);
 
-   /* Get destination BT device from skb.
-* If there is no such peer then discard the packet.
+   /* The packet might be sent to 6lowpan interface
+* because of routing (either via default route
+* or user set route) so get peer according to
+* the destination address.
 */
-   convert_dest_bdaddr(_daddr, , _type);
-
-   BT_DBG("dest addr %pMR type %d IP %pI6c", ,
-  addr_type, _daddr);
-
-   peer = peer_lookup_ba(dev, , addr_type);
+   peer = peer_lookup_dst(dev, _daddr, skb);
if (!peer) {
-   /* The packet might be sent to 6lowpan interface
-* because of routing (either via default route
-* or user set route) so get peer according to
-* the destination address.
-*/
-   peer = peer_lookup_dst(dev, _daddr, skb);
-   if (!peer) {
-   BT_DBG("no such peer %pMR found", );
-

[PATCH v7 0/6] Bluetooth: 6LoWPAN: Fix lladdr length

2017-03-12 Thread Luiz Augusto von Dentz

From: Luiz Augusto von Dentz 

These patches fix the lladdr length to be 6 bytes long instead of 8, which
caused neighbor advertisements to be sent with a wrong lladdr that included
the FF:FE filler bytes for eui64.

Note: This does not fix some of the existing crashes which I hope to address
in a different set.

v2: Make all code paths that generate a link-local from lladdr use the same
code.
v3: Use lowpan_iphc_uncompress_eui48_lladdr to generate the remote ip address.
v4: Handle comments from Stefan Schmidt.
v5: Add patch to fix IID format for Bluetooth
v6: Fix addrconf_ifid_eui48 to follow IID format for Bluetooth
v7: Rework addrconf_ifid_6lowpan so it doesn't use addrconf_ifid_eui48

Alexander Aring (2):
  6lowpan: iphc: override l2 packet information
  ipv6: addrconf: fix 48 bit 6lowpan autoconfiguration

Luiz Augusto von Dentz (2):
  6lowpan: Use netdev addr_len to determine lladdr len
  6lowpan: Fix IID format for Bluetooth

Patrik Flykt (2):
  bluetooth: Set 6 byte device addresses
  6lowpan: Set MAC address length according to LOWPAN_LLTYPE

 include/net/6lowpan.h   |  15 ++
 net/6lowpan/core.c  |  11 +++-
 net/6lowpan/iphc.c  |  57 +
 net/bluetooth/6lowpan.c | 130 
 net/ipv6/addrconf.c |  23 ++---
 5 files changed, 109 insertions(+), 127 deletions(-)

-- 
2.9.3

[PATCHv2 1/4] rds: ib: drop unnecessary rdma_reject

2017-03-12 Thread Zhu Yanjun

When rdma_accept() fails, it calls rdma_reject() internally. As such, it
is not necessary to call rdma_reject() again.

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
Change from v1 to v2:
  Add the acker.

 net/rds/ib_cm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index ce3775a..eca3d5f 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -677,8 +677,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
event->param.conn.initiator_depth);
 
/* rdma_accept() calls rdma_reject() internally if it fails */
-   err = rdma_accept(cm_id, _param);
-   if (err)
+   if (rdma_accept(cm_id, _param))
rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);
 
 out:
-- 
2.7.4

[PATCHv2 2/4] rds: ib: remove redundant ib_dealloc_fmr

2017-03-12 Thread Zhu Yanjun

The call to ib_dealloc_fmr in the error path will never be reached. As
such, it should be removed.

Cc: Joe Jin 
Cc: Junxiao Bi 
Reviewed-by: Yuval Shaia 
Reviewed-by: Johannes Thumshirn 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
Change from v1 to v2:
  Add the reviewer and acker.
  Remove ibmr NULL test.

 net/rds/ib_fmr.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 4fe8f4f..249ae1c 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -78,12 +78,9 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
return ibmr;
 
 out_no_cigar:
-   if (ibmr) {
-   if (fmr->fmr)
-   ib_dealloc_fmr(fmr->fmr);
-   kfree(ibmr);
-   }
+   kfree(ibmr);
atomic_dec(>item_count);
+
return ERR_PTR(err);
 }
 
-- 
2.7.4

[PATCHv2 3/4] rds: ib: add the static type to the function

2017-03-12 Thread Zhu Yanjun

The function rds_ib_map_fmr is used only in the ib_fmr.c file. As such,
it is declared static to limit its scope to that file.

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
Change from v1 to v2:
  Add the acker.

 net/rds/ib_fmr.c | 5 +++--
 net/rds/ib_mr.h  | 2 --
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 249ae1c..c936b0d 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -84,8 +84,9 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
return ERR_PTR(err);
 }
 
-int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
-  struct scatterlist *sg, unsigned int nents)
+static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev,
+ struct rds_ib_mr *ibmr, struct scatterlist *sg,
+ unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
struct rds_ib_fmr *fmr = >u.fmr;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 5d6e98a..0ea4ab0 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -125,8 +125,6 @@ void rds_ib_mr_exit(void);
 void __rds_ib_teardown_mr(struct rds_ib_mr *);
 void rds_ib_teardown_mr(struct rds_ib_mr *);
 struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *, int);
-int rds_ib_map_fmr(struct rds_ib_device *, struct rds_ib_mr *,
-  struct scatterlist *, unsigned int);
 struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *);
 int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *, int, struct rds_ib_mr **);
 struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *, struct scatterlist *,
-- 
2.7.4

[PATCHv2 4/4] rds: ib: unmap the scatter/gather list when error

2017-03-12 Thread Zhu Yanjun

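rds_ib_map_fmr() returns on several error paths without unmapping the
scatter/gather list that has already been mapped to DMA addresses.
Call ib_dma_unmap_sg() on those paths before returning.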

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
Change from v1 to v2:
  Add the acker.
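
Note on the pattern: the patch repeats the unmap call on each error
path. The same rule ("once the list is mapped, every failure exit must
unmap it") can also be written with a single cleanup label. The sketch
below is a userspace illustration only, not the RDS code. mock_map_sg(),
mock_unmap_sg(), map_with_cleanup() and the fail_at parameter are
hypothetical names made up for this example; they stand in for the
DMA map/unmap calls and for the individual checks in rds_ib_map_fmr().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the DMA map/unmap calls. They only track
 * whether the list is currently mapped. */
static int mapped;

static int mock_map_sg(int nents)
{
	if (nents <= 0)
		return 0;	/* mapping failed, nothing to undo */
	mapped = 1;
	return nents;
}

static void mock_unmap_sg(void)
{
	mapped = 0;
}

/* fail_at selects which step fails (assumption, for illustration):
 * 0 = none, 1 = alignment check, 2 = page-count limit, 3 = allocation.
 */
int map_with_cleanup(int nents, int fail_at)
{
	unsigned long long *pages = NULL;
	int ret = 0;

	if (mock_map_sg(nents) == 0)
		return -EINVAL;

	/* From here on the list is mapped: every error exit must unmap. */
	if (fail_at == 1 || fail_at == 2) {
		ret = -EINVAL;
		goto out_unmap;
	}

	pages = malloc(sizeof(*pages) * (size_t)nents);
	if (!pages || fail_at == 3) {
		ret = -ENOMEM;
		goto out_free;
	}

	free(pages);
	return 0;		/* success: the mapping stays in place */

out_free:
	free(pages);		/* free(NULL) is a no-op */
out_unmap:
	mock_unmap_sg();
	return ret;
}
```

Calling map_with_cleanup(4, 0) leaves mapped set, while any non-zero
fail_at returns a negative errno with mapped cleared again. Whether a
single label reads better than the per-site calls in this patch is a
matter of taste; the behaviour on the error paths is meant to be the
same.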

 net/rds/ib_fmr.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index c936b0d..86ef907 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -112,29 +112,39 @@ static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev,
u64 dma_addr = ib_sg_dma_address(dev, &scat[i]);
 
if (dma_addr & ~PAGE_MASK) {
-   if (i > 0)
+   if (i > 0) {
+   ib_dma_unmap_sg(dev, sg, nents,
+   DMA_BIDIRECTIONAL);
return -EINVAL;
-   else
+   } else {
++page_cnt;
+   }
}
if ((dma_addr + dma_len) & ~PAGE_MASK) {
-   if (i < sg_dma_len - 1)
+   if (i < sg_dma_len - 1) {
+   ib_dma_unmap_sg(dev, sg, nents,
+   DMA_BIDIRECTIONAL);
return -EINVAL;
-   else
+   } else {
++page_cnt;
+   }
}
 
len += dma_len;
}
 
page_cnt += len >> PAGE_SHIFT;
-   if (page_cnt > ibmr->pool->fmr_attr.max_pages)
+   if (page_cnt > ibmr->pool->fmr_attr.max_pages) {
+   ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
return -EINVAL;
+   }
 
dma_pages = kmalloc_node(sizeof(u64) * page_cnt, GFP_ATOMIC,
 rdsibdev_to_node(rds_ibdev));
-   if (!dma_pages)
+   if (!dma_pages) {
+   ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
return -ENOMEM;
+   }
 
page_cnt = 0;
for (i = 0; i < sg_dma_len; ++i) {
@@ -147,8 +157,10 @@ static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev,
}
 
ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
-   if (ret)
+   if (ret) {
+   ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
goto out;
+   }
 
/* Success - we successfully remapped the MR, so we can
 * safely tear down the old mapping.
-- 
2.7.4
