Re: Kernel 4.1.12 crash

2015-11-21 Thread Andrew
Memory corruption, if it is happening, IMHO shouldn't be hardware-related:
almost all of these boxes, except the H61M-based box from the 1st log, have
run for a long time with uptimes of more than a year, and only the software
was changed on them; the H61M-based box ran memtest86 for tens of hours
without any error. If the crashes were caused by hardware, the boxes should
have crashed even earlier.


Rarely, on different servers, I have seen 'zram decompression error' messages
(in this case I got such a message on the H61M-based box).


Also, other people who use accel-ppp as BRAS software see different
kernel panics/bugs/oopses on fresh kernels.


I'll try to apply these patches, and I'll try to switch back to kernels 
that were stable on some boxes.


21.11.2015 01:13, Alexander Duyck wrote:

On 11/20/2015 05:58 AM, Andrew wrote:

Hi all.

Today some BRASes running the 4.1.12 kernel crashed.

Here's crash traces: http://pastebin.com/p68hNS8R
http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6

On the 3.2 kernel the same hardware works OK; the trouble was noticed after
the kernel upgrade.

What additional info is needed?


Looking over the traces there seem to be two areas called out.

The first is the fib_trie resize BUG_ON that was triggered due to the 
parent and child not being associated.  I think that might be due to 
memory corruption as I cannot find any spots where we are resizing 
without correctly setting up the parent-child relationship of the 
nodes first.


The other spot that is showing up is ppp_shutdown_interface and its 
related path.  It looks like there are a couple of patches you could 
try back-porting to see if they resolve the issue.  If they do then 
perhaps they should be considered candidates for stable:


8cb775bc0a3 ("ppp: fix device unregistration upon netns deletion")
58a89ecaca5 ("ppp: fix lockdep splat in ppp_dev_uninit()")

- Alex


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch] net/hsr: fix a warning message

2015-11-21 Thread Dan Carpenter
WARN_ON_ONCE() takes a condition; it doesn't take an error message.  I
have converted this to WARN_ONCE() instead.

Signed-off-by: Dan Carpenter 

diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index 35a9788..c7d1adc 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -312,7 +312,7 @@ static void send_hsr_supervision_frame(struct hsr_port 
*master, u8 type)
return;
 
 out:
-   WARN_ON_ONCE("HSR: Could not send supervision frame\n");
+   WARN_ONCE(1, "HSR: Could not send supervision frame\n");
kfree_skb(skb);
 }
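
Editorial note: the effect described in this patch can be reproduced outside the kernel with a minimal userspace sketch. The two macros below are simplified stand-ins written for this note (hypothetical SKETCH_ names, not the kernel's definitions). They only assume that the one-argument form treats its argument as a condition and that the two-argument form takes a condition followed by a message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int sketch_warnings;		/* warnings emitted so far */
static const char *sketch_last_msg;	/* message of the last warning, if any */

/* Stand-in for a WARN_ON_ONCE-style macro: the argument is a condition. */
#define SKETCH_WARN_ON_ONCE(cond)					\
	do {								\
		static bool warned;					\
		if ((cond) && !warned) {				\
			warned = true;					\
			sketch_warnings++;				\
			sketch_last_msg = NULL;				\
			fprintf(stderr, "WARNING at %s:%d\n",		\
				__FILE__, __LINE__);			\
		}							\
	} while (0)

/* Stand-in for a WARN_ONCE-style macro: condition first, then message. */
#define SKETCH_WARN_ONCE(cond, msg)					\
	do {								\
		static bool warned;					\
		if ((cond) && !warned) {				\
			warned = true;					\
			sketch_warnings++;				\
			sketch_last_msg = (msg);			\
			fprintf(stderr, "WARNING at %s:%d: %s",		\
				__FILE__, __LINE__, (msg));		\
		}							\
	} while (0)

/* Old call: the string literal is a non-NULL pointer, so as a condition
 * it is always true and its text is never printed.
 */
static void sketch_old_call(void)
{
	SKETCH_WARN_ON_ONCE("HSR: Could not send supervision frame\n");
}

/* New call: explicit condition, and the message is actually reported. */
static void sketch_new_call(void)
{
	SKETCH_WARN_ONCE(1, "HSR: Could not send supervision frame\n");
}

/* Exercise both calls and check what each one reported. */
static void sketch_demo(void)
{
	sketch_old_call();
	assert(sketch_warnings == 1 && sketch_last_msg == NULL);
	sketch_new_call();
	assert(sketch_warnings == 2 && sketch_last_msg != NULL);
	sketch_new_call();		/* "once": no third warning */
	assert(sketch_warnings == 2);
}
```

In this sketch the old call still warns, but without the text, which matches the patch description: the string was being used as the condition rather than as the message.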
 


Linux 3.16.7: WARNING at kernfs_get+0x2a/0x30() and ida_remove+0xdd/0x120() in loop

2015-11-21 Thread Eugene A. Kravtsov

Last week I ran into a problem after updating the kernel on my PPPoE BRAS server
from Linux 3.2.0-4-686-pae #1 SMP Debian 3.2.68-1+deb7u1 i686 GNU/Linux
to 3.16.0-4-686-pae #1 Debian 3.16.7-ckt11-1+deb8u5.
At any time, for no apparent reason and under any load, the server goes down
with an oops with the following trace: http://spec.oborona.net/bras_panic

The server software is: rp-pppoe + pppoe + tc htb shapers on ppp interfaces,
called from the ip-ip script. No special settings are used.

I tried installing the new Debian kernel 4.2.6, but that did not solve the
problem; things got worse: the server ran 4.2.6 for about 15 minutes and then
panicked (no trace - the network rsyslog is empty).

After that, I decided to try the stable kernel from kernel.org, 4.2.3,
and saw the same panic with no network logs as with 4.2.6.

Many technical staff from Russian and Ukrainian ISPs have confirmed these
problems with ppp in all new kernels (accel-ppp and rp-pppoe):
https://translate.google.ru/translate?hl=ru&sl=ru&tl=en&u=http%3A%2F%2Fforum.nag.ru%2Fforum%2Findex.php%3Fshowtopic%3D45266%26st%3D6220
This is the same problem as ours, reported by a colleague from UA:
http://www.spinics.net/lists/netdev/msg352992.html

The last kernel without problems for me is 3.2.68-1+deb7u1 i686 GNU/Linux.
With the 3.2 kernel the server stays up indefinitely.

Can I help with additional information?




[PATCH net-next 9/9] net: ipmr: factor out common vif init code

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Factor out common vif init code used in both tunnel and pimreg
initialization and create ipmr_init_vif_indev() function.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 40 +++-
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index f2ea97fbb8a8..73361c908e5e 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -391,6 +391,23 @@ static void ipmr_del_tunnel(struct net_device *dev, struct 
vifctl *v)
}
 }
 
+/* Initialize ipmr pimreg/tunnel in_device */
+static bool ipmr_init_vif_indev(const struct net_device *dev)
+{
+   struct in_device *in_dev;
+
+   ASSERT_RTNL();
+
+   in_dev = __in_dev_get_rtnl(dev);
+   if (!in_dev)
+   return false;
+   ipv4_devconf_setall(in_dev);
+   neigh_parms_data_state_setall(in_dev->arp_parms);
+   IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0;
+
+   return true;
+}
+
 static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v)
 {
struct net_device  *dev;
@@ -402,7 +419,6 @@ static struct net_device *ipmr_new_tunnel(struct net *net, 
struct vifctl *v)
int err;
struct ifreq ifr;
struct ip_tunnel_parm p;
-   struct in_device  *in_dev;
 
memset(&p, 0, sizeof(p));
p.iph.daddr = v->vifc_rmt_addr.s_addr;
@@ -427,15 +443,8 @@ static struct net_device *ipmr_new_tunnel(struct net *net, 
struct vifctl *v)
if (err == 0 &&
(dev = __dev_get_by_name(net, p.name)) != NULL) {
dev->flags |= IFF_MULTICAST;
-
-   in_dev = __in_dev_get_rtnl(dev);
-   if (!in_dev)
+   if (!ipmr_init_vif_indev(dev))
goto failure;
-
-   ipv4_devconf_setall(in_dev);
-   neigh_parms_data_state_setall(in_dev->arp_parms);
-   IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0;
-
if (dev_open(dev))
goto failure;
dev_hold(dev);
@@ -502,7 +511,6 @@ static void reg_vif_setup(struct net_device *dev)
 static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt)
 {
struct net_device *dev;
-   struct in_device *in_dev;
char name[IFNAMSIZ];
 
if (mrt->id == RT_TABLE_DEFAULT)
@@ -522,18 +530,8 @@ static struct net_device *ipmr_reg_vif(struct net *net, 
struct mr_table *mrt)
return NULL;
}
 
-   rcu_read_lock();
-   in_dev = __in_dev_get_rcu(dev);
-   if (!in_dev) {
-   rcu_read_unlock();
+   if (!ipmr_init_vif_indev(dev))
goto failure;
-   }
-
-   ipv4_devconf_setall(in_dev);
-   neigh_parms_data_state_setall(in_dev->arp_parms);
-   IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0;
-   rcu_read_unlock();
-
if (dev_open(dev))
goto failure;
 
-- 
2.4.3



[PATCH net-next 7/9] net: ipmr: remove SLAB_PANIC

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

It's not necessary to panic upon allocation failure, returning an error
at that point is okay because user-space won't be able to use any of the
ops since they didn't get registered and the default table is null.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a006d96d6cd9..2c7fa584a274 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2675,7 +2675,7 @@ int __init ip_mr_init(void)
 
mrt_cachep = kmem_cache_create("ip_mrt_cache",
   sizeof(struct mfc_cache),
-  0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
+  0, SLAB_HWCACHE_ALIGN,
   NULL);
if (!mrt_cachep)
return -ENOMEM;
-- 
2.4.3
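
Editorial note: the reasoning in this patch (fail the init with an error instead of panicking, because nothing has been registered yet) can be illustrated with a small userspace sketch. Everything here is hypothetical: the sketch_ names, the injectable allocator and the 64-byte size are assumptions for illustration and are not part of ipmr.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*sketch_alloc_fn)(size_t);

static void *sketch_cache;		/* stands in for the object cache */
static int sketch_ops_registered;	/* set only after a successful init */

/* Allocator that always fails, used to simulate memory pressure. */
static void *sketch_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Init routine: on allocation failure return -ENOMEM instead of aborting.
 * Nothing has been registered at that point, so there is nothing to undo.
 */
static int sketch_init(sketch_alloc_fn alloc)
{
	sketch_cache = alloc(64);
	if (!sketch_cache)
		return -ENOMEM;

	sketch_ops_registered = 1;
	return 0;
}

/* Exercise the failure path first, then the success path. */
static void sketch_demo(void)
{
	assert(sketch_init(sketch_failing_alloc) == -ENOMEM);
	assert(!sketch_ops_registered);
	assert(sketch_init(malloc) == 0);
	assert(sketch_ops_registered);
	free(sketch_cache);
	sketch_cache = NULL;
}
```

The point of the ordering is that the failure happens before any operation becomes reachable, so the caller can simply propagate the error.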



[PATCH net-next 2/9] net: ipmr: always define mroute_reg_vif_num

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Previously, mroute_reg_vif_num was defined only if one of the CONFIG_IP_PIMSM_
options was set, but that's not really necessary: the size of the
struct is the same in both cases (checked with pahole; in both cases the size
is 3256 bytes), so we can remove some unnecessary ifdefs to simplify the
code.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5271e2eee110..dd2462f70d34 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -84,9 +84,7 @@ struct mr_table {
atomic_t cache_resolve_queue_len;
bool mroute_do_assert;
bool mroute_do_pim;
-#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
int mroute_reg_vif_num;
-#endif
 };
 
 struct ipmr_rule {
@@ -347,9 +345,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 
id)
setup_timer(&mrt->ipmr_expire_timer, ipmr_expire_process,
(unsigned long)mrt);
 
-#ifdef CONFIG_IP_PIMSM
mrt->mroute_reg_vif_num = -1;
-#endif
 #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
list_add_tail_rcu(&mrt->list, &net->ipv4.mr_tables);
 #endif
@@ -584,10 +580,8 @@ static int vif_delete(struct mr_table *mrt, int vifi, int 
notify,
return -EADDRNOTAVAIL;
}
 
-#ifdef CONFIG_IP_PIMSM
if (vifi == mrt->mroute_reg_vif_num)
mrt->mroute_reg_vif_num = -1;
-#endif
 
if (vifi + 1 == mrt->maxvif) {
int tmp;
@@ -824,10 +818,8 @@ static int vif_add(struct net *net, struct mr_table *mrt,
/* And finish update writing critical data */
write_lock_bh(&mrt_lock);
v->dev = dev;
-#ifdef CONFIG_IP_PIMSM
if (v->flags & VIFF_REGISTER)
mrt->mroute_reg_vif_num = vifi;
-#endif
if (vifi+1 > mrt->maxvif)
mrt->maxvif = vifi+1;
write_unlock_bh(&mrt_lock);
-- 
2.4.3



[PATCH net-next 5/9] net: ipmr: make ip_mroute_getsockopt more understandable

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Use a switch to determine if optname is correct and set val accordingly.
This produces much more straightforward and readable code.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 286ede3716ee..694fecf7838e 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1443,29 +1443,29 @@ int ip_mroute_getsockopt(struct sock *sk, int optname, 
char __user *optval, int
if (!mrt)
return -ENOENT;
 
-   if (optname != MRT_VERSION &&
-  optname != MRT_PIM &&
-  optname != MRT_ASSERT)
+   switch (optname) {
+   case MRT_VERSION:
+   val = 0x0305;
+   break;
+   case MRT_PIM:
+   if (!pimsm_enabled())
+   return -ENOPROTOOPT;
+   val = mrt->mroute_do_pim;
+   break;
+   case MRT_ASSERT:
+   val = mrt->mroute_do_assert;
+   break;
+   default:
return -ENOPROTOOPT;
+   }
 
if (get_user(olr, optlen))
return -EFAULT;
-
olr = min_t(unsigned int, olr, sizeof(int));
if (olr < 0)
return -EINVAL;
-
if (put_user(olr, optlen))
return -EFAULT;
-   if (optname == MRT_VERSION) {
-   val = 0x0305;
-   } else if (optname == MRT_PIM) {
-   if (!pimsm_enabled())
-   return -ENOPROTOOPT;
-   val = mrt->mroute_do_pim;
-   } else {
-   val = mrt->mroute_do_assert;
-   }
if (copy_to_user(optval, &val, olr))
return -EFAULT;
return 0;
-- 
2.4.3
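
Editorial note: the restructuring in this patch (validate the option name and pick the value in one switch, before touching the user buffer) can be sketched in plain C. The enum values, the struct and the pim_enabled parameter below are hypothetical stand-ins; only the control flow is meant to mirror the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical option names, not the real MRT_* values. */
enum sketch_opt {
	SKETCH_OPT_VERSION = 1,
	SKETCH_OPT_PIM,
	SKETCH_OPT_ASSERT,
};

struct sketch_table {
	bool do_pim;
	bool do_assert;
};

/* One switch both rejects unknown options and selects the value. */
static int sketch_getopt(const struct sketch_table *t, int optname,
			 bool pim_enabled, int *val)
{
	switch (optname) {
	case SKETCH_OPT_VERSION:
		*val = 0x0305;
		break;
	case SKETCH_OPT_PIM:
		if (!pim_enabled)
			return -ENOPROTOOPT;
		*val = t->do_pim;
		break;
	case SKETCH_OPT_ASSERT:
		*val = t->do_assert;
		break;
	default:
		return -ENOPROTOOPT;
	}
	return 0;
}

/* Exercise each branch once. */
static void sketch_demo(void)
{
	struct sketch_table t = { .do_pim = true, .do_assert = false };
	int val = -1;

	assert(sketch_getopt(&t, SKETCH_OPT_VERSION, true, &val) == 0);
	assert(val == 0x0305);
	assert(sketch_getopt(&t, SKETCH_OPT_PIM, true, &val) == 0 && val == 1);
	assert(sketch_getopt(&t, SKETCH_OPT_PIM, false, &val) == -ENOPROTOOPT);
	assert(sketch_getopt(&t, SKETCH_OPT_ASSERT, true, &val) == 0 && val == 0);
	assert(sketch_getopt(&t, 99, true, &val) == -ENOPROTOOPT);
}
```

Compared with a chain of comparisons followed by a second if/else ladder, each option is handled in exactly one place.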



[PATCH net-next 8/9] net: ipmr: rearrange and cleanup setsockopt

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Take rtnl in the beginning unconditionally as most options already need
it (one exception - MRT_DONE, see the comment inside), make the
lock/unlock places central and move out the switch() local variables.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 191 +++-
 1 file changed, 107 insertions(+), 84 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2c7fa584a274..f2ea97fbb8a8 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1276,38 +1276,45 @@ static void mrtsock_destruct(struct sock *sk)
  * MOSPF/PIM router set up we can clean this up.
  */
 
-int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, 
unsigned int optlen)
+int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
+unsigned int optlen)
 {
-   int ret, parent = 0;
-   struct vifctl vif;
-   struct mfcctl mfc;
struct net *net = sock_net(sk);
+   int val, ret = 0, parent = 0;
struct mr_table *mrt;
+   struct vifctl vif;
+   struct mfcctl mfc;
+   u32 uval;
 
+   /* There's one exception to the lock - MRT_DONE which needs to unlock */
+   rtnl_lock();
if (sk->sk_type != SOCK_RAW ||
-   inet_sk(sk)->inet_num != IPPROTO_IGMP)
-   return -EOPNOTSUPP;
+   inet_sk(sk)->inet_num != IPPROTO_IGMP) {
+   ret = -EOPNOTSUPP;
+   goto out_unlock;
+   }
 
mrt = ipmr_get_table(net, raw_sk(sk)->ipmr_table ? : RT_TABLE_DEFAULT);
-   if (!mrt)
-   return -ENOENT;
-
+   if (!mrt) {
+   ret = -ENOENT;
+   goto out_unlock;
+   }
if (optname != MRT_INIT) {
if (sk != rcu_access_pointer(mrt->mroute_sk) &&
-   !ns_capable(net->user_ns, CAP_NET_ADMIN))
-   return -EACCES;
+   !ns_capable(net->user_ns, CAP_NET_ADMIN)) {
+   ret = -EACCES;
+   goto out_unlock;
+   }
}
 
switch (optname) {
case MRT_INIT:
if (optlen != sizeof(int))
-   return -EINVAL;
-
-   rtnl_lock();
-   if (rtnl_dereference(mrt->mroute_sk)) {
-   rtnl_unlock();
-   return -EADDRINUSE;
-   }
+   ret = -EINVAL;
+   if (rtnl_dereference(mrt->mroute_sk))
+   ret = -EADDRINUSE;
+   if (ret)
+   break;
 
ret = ip_ra_control(sk, 1, mrtsock_destruct);
if (ret == 0) {
@@ -1317,30 +1324,41 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, unsi
NETCONFA_IFINDEX_ALL,
net->ipv4.devconf_all);
}
-   rtnl_unlock();
-   return ret;
+   break;
case MRT_DONE:
-   if (sk != rcu_access_pointer(mrt->mroute_sk))
-   return -EACCES;
-   return ip_ra_control(sk, 0, NULL);
+   if (sk != rcu_access_pointer(mrt->mroute_sk)) {
+   ret = -EACCES;
+   } else {
+   /* We need to unlock here because mrtsock_destruct takes
+* care of rtnl itself and we can't change that due to
+* the IP_ROUTER_ALERT setsockopt which runs without it.
+*/
+   rtnl_unlock();
+   ret = ip_ra_control(sk, 0, NULL);
+   goto out;
+   }
+   break;
case MRT_ADD_VIF:
case MRT_DEL_VIF:
-   if (optlen != sizeof(vif))
-   return -EINVAL;
-   if (copy_from_user(&vif, optval, sizeof(vif)))
-   return -EFAULT;
-   if (vif.vifc_vifi >= MAXVIFS)
-   return -ENFILE;
-   rtnl_lock();
+   if (optlen != sizeof(vif)) {
+   ret = -EINVAL;
+   break;
+   }
+   if (copy_from_user(&vif, optval, sizeof(vif))) {
+   ret = -EFAULT;
+   break;
+   }
+   if (vif.vifc_vifi >= MAXVIFS) {
+   ret = -ENFILE;
+   break;
+   }
if (optname == MRT_ADD_VIF) {
ret = vif_add(net, mrt, &vif,
  sk == rtnl_dereference(mrt->mroute_sk));
} else {
ret = vif_delete(mrt, vif.vifc_vifi, 0, NULL);
}
-   rtnl_unlock();
-   return ret;
-
+   break;
/* Manipula
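
Editorial note: the message above is cut off in the archive. The locking shape it describes (take the lock once at the top, release it at one central place, with a single case that must drop the lock early because its callee takes the lock itself) can be sketched as follows. The counter-based lock and all sketch_ names are assumptions for illustration; this is not the rtnl implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_opt { SKETCH_OPT_INIT = 1, SKETCH_OPT_DONE };

/* Toy non-recursive lock: depth must never exceed 1. */
static int sketch_lock_depth;

static void sketch_lock(void)
{
	assert(sketch_lock_depth == 0);
	sketch_lock_depth++;
}

static void sketch_unlock(void)
{
	assert(sketch_lock_depth == 1);
	sketch_lock_depth--;
}

/* Stand-in for a teardown path that takes the lock on its own. */
static int sketch_teardown(void)
{
	sketch_lock();
	sketch_unlock();
	return 0;
}

static int sketch_setopt(int optname, size_t optlen)
{
	int ret = 0;

	sketch_lock();
	switch (optname) {
	case SKETCH_OPT_INIT:
		if (optlen != sizeof(int))
			ret = -EINVAL;
		break;
	case SKETCH_OPT_DONE:
		/* The one exception: unlock before calling the teardown. */
		sketch_unlock();
		ret = sketch_teardown();
		goto out;
	default:
		ret = -ENOPROTOOPT;
		break;
	}
	sketch_unlock();	/* central unlock for every other case */
out:
	return ret;
}
```

Every error path sets ret and breaks out of the switch, so no branch has to remember to unlock on its own.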

[PATCH net-next 0/9] net: ipmr: cleanups and minor improvements

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Hi,
Since I'll have to work with ipmr, I decided to clean it up and do some
minor improvements. Functionally there're almost no changes except the
SLAB_PANIC removal. Most of the patches just re-design some functions to
be clearer and more concise and try to remove the ifdef web that was
inside. There's more information in each commit. This is the first set,
the end goal is to introduce complete netlink support and control over
the mfc and vif devices.
I've tried to test all of the setsockopt/getsockopt options, and also
made builds with various ipmr kconfig options turned on and off.

Thank you,
 Nik

Nikolay Aleksandrov (9):
  net: ipmr: move the tbl id check in ipmr_new_table
  net: ipmr: always define mroute_reg_vif_num
  net: ipmr: remove some pimsm ifdefs and simplify
  net: ipmr: fix code and comment style
  net: ipmr: make ip_mroute_getsockopt more understandable
  net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES
  net: ipmr: remove SLAB_PANIC
  net: ipmr: rearrange and cleanup setsockopt
  net: ipmr: factor out common vif init code

 include/uapi/linux/mroute.h |  59 ++---
 net/ipv4/ipmr.c | 597 
 2 files changed, 285 insertions(+), 371 deletions(-)

-- 
2.4.3



[PATCH net-next 4/9] net: ipmr: fix code and comment style

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Trivial code and comment style fixes, also removed some extra newlines,
spaces and tabs.

Signed-off-by: Nikolay Aleksandrov 
---
 include/uapi/linux/mroute.h |  59 ++
 net/ipv4/ipmr.c | 142 
 2 files changed, 54 insertions(+), 147 deletions(-)

diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
index a382d2c04a42..cf943016930f 100644
--- a/include/uapi/linux/mroute.h
+++ b/include/uapi/linux/mroute.h
@@ -4,15 +4,13 @@
 #include <linux/sockios.h>
 #include <linux/types.h>
 
-/*
- * Based on the MROUTING 3.5 defines primarily to keep
- * source compatibility with BSD.
+/* Based on the MROUTING 3.5 defines primarily to keep
+ * source compatibility with BSD.
  *
- * See the mrouted code for the original history.
- *
- *  Protocol Independent Multicast (PIM) data structures included
- *  Carlos Picoto (c...@di.fc.ul.pt)
+ * See the mrouted code for the original history.
  *
+ * Protocol Independent Multicast (PIM) data structures included
+ * Carlos Picoto (c...@di.fc.ul.pt)
  */
 
 #define MRT_BASE   200
@@ -34,15 +32,13 @@
 #define SIOCGETSGCNT   (SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF (SIOCPROTOPRIVATE+2)
 
-#define MAXVIFS 32  
+#define MAXVIFS 32
 typedef unsigned long vifbitmap_t; /* User mode code depends on this lot */
 typedef unsigned short vifi_t;
 #define ALL_VIFS   ((vifi_t)(-1))
 
-/*
- * Same idea as select
- */
- 
+/* Same idea as select */
+
 #define VIFM_SET(n,m)  ((m)|=(1<<(n)))
 #define VIFM_CLR(n,m)  ((m)&=~(1<<(n)))
 #define VIFM_ISSET(n,m)((m)&(1<<(n)))
@@ -50,11 +46,9 @@ typedef unsigned short vifi_t;
 #define VIFM_COPY(mfrom,mto)   ((mto)=(mfrom))
 #define VIFM_SAME(m1,m2)   ((m1)==(m2))
 
-/*
- * Passed by mrouted for an MRT_ADD_VIF - again we use the
- * mrouted 3.6 structures for compatibility
+/* Passed by mrouted for an MRT_ADD_VIF - again we use the
+ * mrouted 3.6 structures for compatibility
  */
- 
 struct vifctl {
vifi_t  vifc_vifi;  /* Index of VIF */
unsigned char vifc_flags;   /* VIFF_ flags */
@@ -73,10 +67,7 @@ struct vifctl {
 #define VIFF_USE_IFINDEX   0x8 /* use vifc_lcl_ifindex instead of
   vifc_lcl_addr to find an interface */
 
-/*
- * Cache manipulation structures for mrouted and PIMd
- */
- 
+/* Cache manipulation structures for mrouted and PIMd */
 struct mfcctl {
struct in_addr mfcc_origin; /* Origin of mcast  */
struct in_addr mfcc_mcastgrp;   /* Group in question*/
@@ -88,10 +79,7 @@ struct mfcctl {
int  mfcc_expire;
 };
 
-/* 
- * Group count retrieval for mrouted
- */
- 
+/*  Group count retrieval for mrouted */
 struct sioc_sg_req {
struct in_addr src;
struct in_addr grp;
@@ -100,10 +88,7 @@ struct sioc_sg_req {
unsigned long wrong_if;
 };
 
-/*
- * To get vif packet counts
- */
-
+/* To get vif packet counts */
 struct sioc_vif_req {
vifi_t  vifi;   /* Which iface */
unsigned long icount;   /* In packets */
@@ -112,11 +97,9 @@ struct sioc_vif_req {
unsigned long obytes;   /* Out bytes */
 };
 
-/*
- * This is the format the mroute daemon expects to see IGMP control
- * data. Magically happens to be like an IP packet as per the original
+/* This is the format the mroute daemon expects to see IGMP control
+ * data. Magically happens to be like an IP packet as per the original
  */
- 
 struct igmpmsg {
__u32 unused1,unused2;
unsigned char im_msgtype;   /* What is this */
@@ -126,21 +109,13 @@ struct igmpmsg {
struct in_addr im_src,im_dst;
 };
 
-/*
- * That's all usermode folks
- */
-
-
+/* That's all usermode folks */
 
 #define MFC_ASSERT_THRESH (3*HZ)   /* Maximal freq. of asserts */
 
-/*
- * Pseudo messages used by mrouted
- */
-
+/* Pseudo messages used by mrouted */
 #define IGMPMSG_NOCACHE 1   /* Kern cache fill request to mrouted */
 #define IGMPMSG_WRONGVIF   2   /* For PIM assert processing (unused) */
 #define IGMPMSG_WHOLEPKT   3   /* For PIM Register processing */
 
-
 #endif /* _UAPI__LINUX_MROUTE_H */
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index e153ab7b17a1..286ede3716ee 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -102,9 +102,7 @@ static inline bool pimsm_enabled(void)
 
 static DEFINE_RWLOCK(mrt_lock);
 
-/*
- * Multicast router control variables
- */
+/* Multicast router control variables */
 
 #define VIF_EXISTS(_mrt, _idx) ((_mrt)->vif_table[_idx].dev != NULL)
 
@@ -393,8 +391,7 @@ static void ipmr_del_tunnel(struct net_device *dev, struct 
vifctl *v)
}
 }
 
-static
-struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v)
+static struct net_device *ipmr_new_tunnel(struct net *net, stru

[PATCH net-next 1/9] net: ipmr: move the tbl id check in ipmr_new_table

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Move the table id check into ipmr_new_table() and make it return an error
pointer. We need this change for the upcoming netlink table manipulation
support in order to avoid code duplication and a race condition.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 92dd4b74d513..5271e2eee110 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -252,8 +252,8 @@ static int __net_init ipmr_rules_init(struct net *net)
INIT_LIST_HEAD(&net->ipv4.mr_tables);
 
mrt = ipmr_new_table(net, RT_TABLE_DEFAULT);
-   if (!mrt) {
-   err = -ENOMEM;
+   if (IS_ERR(mrt)) {
+   err = PTR_ERR(mrt);
goto err1;
}
 
@@ -301,8 +301,13 @@ static int ipmr_fib_lookup(struct net *net, struct flowi4 
*flp4,
 
 static int __net_init ipmr_rules_init(struct net *net)
 {
-   net->ipv4.mrt = ipmr_new_table(net, RT_TABLE_DEFAULT);
-   return net->ipv4.mrt ? 0 : -ENOMEM;
+   struct mr_table *mrt;
+
+   mrt = ipmr_new_table(net, RT_TABLE_DEFAULT);
+   if (IS_ERR(mrt))
+   return PTR_ERR(mrt);
+   net->ipv4.mrt = mrt;
+   return 0;
 }
 
 static void __net_exit ipmr_rules_exit(struct net *net)
@@ -319,13 +324,17 @@ static struct mr_table *ipmr_new_table(struct net *net, 
u32 id)
struct mr_table *mrt;
unsigned int i;
 
+   /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */
+   if (id != RT_TABLE_DEFAULT && id >= 10)
+   return ERR_PTR(-EINVAL);
+
mrt = ipmr_get_table(net, id);
if (mrt)
return mrt;
 
mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
if (!mrt)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
write_pnet(&mrt->net, net);
mrt->id = id;
 
@@ -1407,17 +1416,14 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, unsi
if (get_user(v, (u32 __user *)optval))
return -EFAULT;
 
-   /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */
-   if (v != RT_TABLE_DEFAULT && v >= 10)
-   return -EINVAL;
-
rtnl_lock();
ret = 0;
if (sk == rtnl_dereference(mrt->mroute_sk)) {
ret = -EBUSY;
} else {
-   if (!ipmr_new_table(net, v))
-   ret = -ENOMEM;
+   mrt = ipmr_new_table(net, v);
+   if (IS_ERR(mrt))
+   ret = PTR_ERR(mrt);
else
raw_sk(sk)->ipmr_table = v;
}
-- 
2.4.3
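
Editorial note: the error-pointer convention this patch switches to can be shown with a userspace sketch. The helpers below are simplified stand-ins with hypothetical sketch_ names, not the kernel's err.h. The id bound is also an assumption of this sketch: it is derived from the 16-byte name limit mentioned in the diff comment ("pimreg" is 6 characters, plus at most 9 digits, plus the terminating NUL), and the value 253 for the default table is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO	4095
#define SKETCH_TABLE_DEFAULT	253u		/* assumed default table id */
#define SKETCH_MAX_ID		999999999u	/* 9 digits: 6 + 9 + 1 = 16 */

/* Encode a negative errno in a pointer value near the top of the
 * address space, where no valid object can live.
 */
static void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_table {
	uint32_t id;
};

/* The id check lives in the constructor, so every caller gets it. */
static struct sketch_table *sketch_new_table(uint32_t id)
{
	struct sketch_table *t;

	if (id != SKETCH_TABLE_DEFAULT && id > SKETCH_MAX_ID)
		return sketch_err_ptr(-EINVAL);

	t = calloc(1, sizeof(*t));
	if (!t)
		return sketch_err_ptr(-ENOMEM);
	t->id = id;
	return t;
}

/* Caller: distinguishes the failure reasons instead of assuming ENOMEM. */
static int sketch_set_table(uint32_t id)
{
	struct sketch_table *t = sketch_new_table(id);

	if (sketch_is_err(t))
		return (int)sketch_ptr_err(t);
	free(t);
	return 0;
}

static void sketch_demo(void)
{
	assert(sketch_set_table(5) == 0);
	assert(sketch_set_table(SKETCH_TABLE_DEFAULT) == 0);
	assert(sketch_set_table(1000000000u) == -EINVAL);
}
```

With a plain NULL return the caller could only guess at -ENOMEM; returning an encoded error lets the invalid-id case surface as -EINVAL from the same call.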



[PATCH net-next 6/9] net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Trivial replacement of an ifdef with IS_BUILTIN().

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 694fecf7838e..a006d96d6cd9 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1396,11 +1396,12 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, unsi
rtnl_unlock();
return ret;
}
-#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
case MRT_TABLE:
{
u32 v;
 
+   if (!IS_BUILTIN(CONFIG_IP_MROUTE_MULTIPLE_TABLES))
+   return -ENOPROTOOPT;
if (optlen != sizeof(u32))
return -EINVAL;
if (get_user(v, (u32 __user *)optval))
@@ -1420,7 +1421,6 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, unsi
rtnl_unlock();
return ret;
}
-#endif
/* Spurious command, or MRT_VERSION which you cannot set. */
default:
return -ENOPROTOOPT;
-- 
2.4.3
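
Editorial note: the idea behind replacing an #ifdef with a compile-time test inside ordinary C code can be sketched with a small preprocessor trick. The macros below use hypothetical SKETCH_ names and are written for this note; they follow the general placeholder technique (an option defined to 1 expands to an extra argument, an undefined one does not) and should not be read as the kernel's kconfig.h.

```c
#include <assert.h>
#include <errno.h>

/* If the option is defined as 1, token pasting yields this placeholder,
 * which expands to "0," and shifts a 1 into the second argument slot.
 */
#define SKETCH_ARG_PLACEHOLDER_1 0,
#define SKETCH_TAKE_SECOND(ignored, val, ...) val
#define SKETCH_ENABLED_3(arg1_or_junk) \
	SKETCH_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define SKETCH_ENABLED_2(value) \
	SKETCH_ENABLED_3(SKETCH_ARG_PLACEHOLDER_##value)
#define SKETCH_ENABLED(option) SKETCH_ENABLED_2(option)

/* Hypothetical build options: one set, one deliberately left undefined. */
#define SKETCH_CONFIG_MULTIPLE_TABLES 1
/* SKETCH_CONFIG_OTHER is not defined */

/* Both branches are always parsed and type-checked by the compiler;
 * the disabled one is simply a constant-false test.
 */
static int sketch_table_opt(void)
{
	if (!SKETCH_ENABLED(SKETCH_CONFIG_MULTIPLE_TABLES))
		return -ENOPROTOOPT;
	return 0;
}

static int sketch_other_opt(void)
{
	if (!SKETCH_ENABLED(SKETCH_CONFIG_OTHER))
		return -ENOPROTOOPT;
	return 0;
}

static void sketch_demo(void)
{
	assert(SKETCH_ENABLED(SKETCH_CONFIG_MULTIPLE_TABLES) == 1);
	assert(SKETCH_ENABLED(SKETCH_CONFIG_OTHER) == 0);
	assert(sketch_table_opt() == 0);
	assert(sketch_other_opt() == -ENOPROTOOPT);
}
```

One visible consequence, also present in the patch, is that the case label stays in the switch in every configuration and the disabled build answers with -ENOPROTOOPT at run time instead of compiling the case out.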



[PATCH net-next 3/9] net: ipmr: remove some pimsm ifdefs and simplify

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Add the helper pimsm_enabled() which replaces the old CONFIG_IP_PIMSM
define and is used to check if any version of PIM-SM has been enabled.
Use a single #if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
for the pim-sm shared code. This is okay w.r.t. IGMPMSG_WHOLEPKT because
only a VIFF_REGISTER device can send such a packet, and it can't be
created if pimsm_enabled() is false.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 180 ++--
 1 file changed, 84 insertions(+), 96 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index dd2462f70d34..e153ab7b17a1 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -67,10 +67,6 @@
 #include <net/fib_rules.h>
 #include <linux/netconf.h>
 
-#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
-#define CONFIG_IP_PIMSM 1
-#endif
-
 struct mr_table {
struct list_head list;
possible_net_t  net;
@@ -95,6 +91,11 @@ struct ipmr_result {
struct mr_table *mrt;
 };
 
+static inline bool pimsm_enabled(void)
+{
+   return IS_BUILTIN(CONFIG_IP_PIMSM_V1) || IS_BUILTIN(CONFIG_IP_PIMSM_V2);
+}
+
 /* Big lock, protecting vif table, mrt cache and mroute socket state.
  * Note that the changes are semaphored via rtnl_lock.
  */
@@ -454,8 +455,7 @@ failure:
return NULL;
 }
 
-#ifdef CONFIG_IP_PIMSM
-
+#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
 static netdev_tx_t reg_vif_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct net *net = dev_net(dev);
@@ -552,6 +552,51 @@ failure:
unregister_netdevice(dev);
return NULL;
 }
+
+/* called with rcu_read_lock() */
+static int __pim_rcv(struct mr_table *mrt, struct sk_buff *skb,
+unsigned int pimlen)
+{
+   struct net_device *reg_dev = NULL;
+   struct iphdr *encap;
+
+   encap = (struct iphdr *)(skb_transport_header(skb) + pimlen);
+   /*
+* Check that:
+* a. packet is really sent to a multicast group
+* b. packet is not a NULL-REGISTER
+* c. packet is not truncated
+*/
+   if (!ipv4_is_multicast(encap->daddr) ||
+   encap->tot_len == 0 ||
+   ntohs(encap->tot_len) + pimlen > skb->len)
+   return 1;
+
+   read_lock(&mrt_lock);
+   if (mrt->mroute_reg_vif_num >= 0)
+   reg_dev = mrt->vif_table[mrt->mroute_reg_vif_num].dev;
+   read_unlock(&mrt_lock);
+
+   if (!reg_dev)
+   return 1;
+
+   skb->mac_header = skb->network_header;
+   skb_pull(skb, (u8 *)encap - skb->data);
+   skb_reset_network_header(skb);
+   skb->protocol = htons(ETH_P_IP);
+   skb->ip_summed = CHECKSUM_NONE;
+
+   skb_tunnel_rx(skb, reg_dev, dev_net(reg_dev));
+
+   netif_rx(skb);
+
+   return NET_RX_SUCCESS;
+}
+#else
+static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt)
+{
+   return NULL;
+}
 #endif
 
 /**
@@ -734,10 +779,10 @@ static int vif_add(struct net *net, struct mr_table *mrt,
return -EADDRINUSE;
 
switch (vifc->vifc_flags) {
-#ifdef CONFIG_IP_PIMSM
case VIFF_REGISTER:
-   /*
-* Special Purpose VIF in PIM
+   if (!pimsm_enabled())
+   return -EINVAL;
+   /* Special Purpose VIF in PIM
 * All the packets will be sent to the daemon
 */
if (mrt->mroute_reg_vif_num >= 0)
@@ -752,7 +797,6 @@ static int vif_add(struct net *net, struct mr_table *mrt,
return err;
}
break;
-#endif
case VIFF_TUNNEL:
dev = ipmr_new_tunnel(net, vifc);
if (!dev)
@@ -942,34 +986,29 @@ static void ipmr_cache_resolve(struct net *net, struct 
mr_table *mrt,
}
 }
 
-/*
- * Bounce a cache query up to mrouted. We could use netlink for this but 
mrouted
- * expects the following bizarre scheme.
+/* Bounce a cache query up to mrouted. We could use netlink for this but 
mrouted
+ * expects the following bizarre scheme.
  *
- * Called under mrt_lock.
+ * Called under mrt_lock.
  */
-
 static int ipmr_cache_report(struct mr_table *mrt,
 struct sk_buff *pkt, vifi_t vifi, int assert)
 {
-   struct sk_buff *skb;
const int ihl = ip_hdrlen(pkt);
+   struct sock *mroute_sk;
struct igmphdr *igmp;
struct igmpmsg *msg;
-   struct sock *mroute_sk;
+   struct sk_buff *skb;
int ret;
 
-#ifdef CONFIG_IP_PIMSM
if (assert == IGMPMSG_WHOLEPKT)
skb = skb_realloc_headroom(pkt, sizeof(struct iphdr));
else
-#endif
skb = alloc_skb(128, GFP_ATOMIC);
 
if (!skb)
return -ENOBUFS;
 
-#ifdef CONFIG_IP_PIMSM
if (assert == IGMPMSG_WHOLEPKT) {
/* Ugly, but we have no choice with this interface.
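
Editorial note: the message above is cut off in the archive. The three checks listed in the __pim_rcv() comment earlier in this patch (multicast destination, not a NULL-REGISTER, not truncated) can be sketched as a standalone predicate. The function names and the flat parameter list are hypothetical, and all values are assumed to be in host byte order, unlike the packet fields in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-byte-order test for the IPv4 multicast range 224.0.0.0/4. */
static bool sketch_is_multicast(uint32_t daddr_host)
{
	return (daddr_host & 0xf0000000u) == 0xe0000000u;
}

/* Returns true when the encapsulated packet passes all three checks. */
static bool sketch_register_ok(uint32_t daddr_host, uint16_t tot_len,
			       size_t pimlen, size_t skb_len)
{
	if (!sketch_is_multicast(daddr_host))
		return false;	/* a. not sent to a multicast group */
	if (tot_len == 0)
		return false;	/* b. NULL-REGISTER */
	if ((size_t)tot_len + pimlen > skb_len)
		return false;	/* c. truncated */
	return true;
}

static void sketch_demo(void)
{
	/* 224.0.0.1, 100-byte inner packet, 8-byte PIM header */
	assert(sketch_register_ok(0xe0000001u, 100, 8, 108));
	assert(!sketch_register_ok(0xe0000001u, 100, 8, 107));
	assert(!sketch_register_ok(0x0a000001u, 100, 8, 108));
	assert(!sketch_register_ok(0xe0000001u, 0, 8, 108));
}
```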
 

yet another uninterruptable hang in sendfile

2015-11-21 Thread Dmitry Vyukov
Hello,

On commit 8005c49d9aea74d382f474ce11afbbc7d7130bec (Nov 15).

The program is:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>

int main()
{
long r0 = syscall(SYS_socket, 0x10ul, 0x2ul, 0x0ul, 0, 0, 0);
long r1 = syscall(SYS_mmap, 0x2000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
long r2 = syscall(SYS_mmap, 0x20001000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
*(uint64_t*)0x2000153f = 0x20001f99;
*(uint64_t*)0x20001547 = 0x67;
*(uint64_t*)0x2000154f = 0x20001fa5;
*(uint64_t*)0x20001557 = 0x5b;
*(uint64_t*)0x2000155f = 0x20001000;
*(uint64_t*)0x20001567 = 0x6;
long r9 = syscall(SYS_readv, r0, 0x2000153ful, 0x3ul, 0, 0, 0);
long r10 = syscall(SYS_mmap, 0x20002000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
memcpy((void*)0x20002000, "\x65\x74\x68\x31\x00", 5);
long r12 = syscall(SYS_memfd_create, 0x20002000ul, 0x1ul, 0, 0, 0, 0);
long r13 = syscall(SYS_fallocate, r12, 0x0ul, 0x5616e07ul, 0x1ul, 0, 0);
memcpy((void*)0x2da2,
"\x02\xbe\x98\x59\x88\xb1\x7b\xfd\xe6\x27\x95\xdc\x18\x4e\x04\x87\x28\x1a\xd0\x30\x52\xcd\xa5\xee\x09\x7f\xfa\x7a\x9b\x72\x17\xfa\x2a\xa1\xe1\x60\x09\xbb\xaf\xdd\x0b\x5c\xa8\x18\x81\x4b\x6d\x42\x11\x20\x4a\xd7\x9e\x86\x8b\x63\xd2\x36\xbf\x5f\xb0\x36\x13\x82\x79\xc8\x31\x3b\x3b\x1e",
70);
memcpy((void*)0x28b7,
"\x0a\x00\x33\xe8\x3d\xe7\x4a\xcc\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xbf\xce\xa1\x60",
28);
long r16 = syscall(SYS_sendto, r0, 0x2da2ul, 0x46ul,
0x8000ul, 0x28b7ul, 0x1cul);
long r17 = syscall(SYS_sendfile, r0, r12, 0x2000ul,
0x4785d2c1ul, 0, 0);
return 0;
}


It hangs in an unkillable state. It is probably a similar issue to the
other reported issues related to sendfile:
https://groups.google.com/forum/#!topic/syzkaller/zfuHHRXL7Zg
https://groups.google.com/forum/#!topic/syzkaller/sjA9DrBQviw

However this one also blankets dmesg with zillions of:

[ 1682.801412] SELinux: unrecognized netlink message: protocol=0
nlmsg_type=0 sclass=netlink_route_socket
[ 1682.803565] SELinux: unrecognized netlink message: protocol=0
nlmsg_type=0 sclass=netlink_route_socket
[ 1682.804991] SELinux: unrecognized netlink message: protocol=0
nlmsg_type=0 sclass=netlink_route_socket

The program should be killable.

Thank you


Re: [PATCH v2 06/27] brcm80211: move under broadcom vendor directory

2015-11-21 Thread Hauke Mehrtens
On 11/20/2015 10:53 PM, Arend van Spriel wrote:
> On 11/19/2015 08:48 AM, Kalle Valo wrote:
>> Hauke Mehrtens  writes:
>>
>>> On 11/18/2015 03:45 PM, Kalle Valo wrote:
>>>> Part of reorganising wireless drivers directory and Kconfig. Note
>>>> that I had to
>>>> edit Makefiles from subdirectories to use the new location.
>>>>
>>>> Signed-off-by: Kalle Valo 
>>>> ---
>>>
>>> I would prefer to remove the brcm80211 directory in this process and
>>> create:
>>> drivers/net/wireless/broadcom/brcmfmac
>>> drivers/net/wireless/broadcom/brcmsmac
>>> drivers/net/wireless/broadcom/brcmutil
>>> drivers/net/wireless/broadcom/include
>>>
>>> This way we have one directory less.
>>
>> I think this could be done separately. This patchset is big enough
>> already, I would not like to make it anymore complicated.
>>
>> And I actually like the brcm80211 directory, I would not mind keeping it
>> still.
> 
> I prefer to keep it as brcmsmac and brcmfmac rely on brcmutil module so
> I want to keep them together under brcm80211.
> 
> So does this patch go in before or after the patches I submitted before
> the merge window. I hope after :-p

OK, then leave it as Kalle proposed. Backports should work with both
versions.

Hauke



Re: [PATCH net-next 7/9] net: ipmr: remove SLAB_PANIC

2015-11-21 Thread Eric Dumazet
On Sat, 2015-11-21 at 14:01 +0100, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov 
> 
> It's not necessary to panic upon allocation failure, returning an error
> at that point is okay because user-space won't be able to use any of the
> ops since they didn't get registered and the default table is null.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  net/ipv4/ipmr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index a006d96d6cd9..2c7fa584a274 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -2675,7 +2675,7 @@ int __init ip_mr_init(void)
>  
>   mrt_cachep = kmem_cache_create("ip_mrt_cache",
>  sizeof(struct mfc_cache),
> -0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
> +0, SLAB_HWCACHE_ALIGN,
>  NULL);
>   if (!mrt_cachep)
>   return -ENOMEM;

This runs at boot time.

I very much prefer a panic instead of having to deal with a host with a
probable bug in mm layer at this point.

For IPv6, it is a different matter as it might be a module, thus
ip6_mr_init() does not use SLAB_PANIC.
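
For readers less familiar with the trade-off discussed here, the following is a minimal userspace sketch, not kernel code. It assumes a hypothetical flag DEMO_CACHE_PANIC standing in for SLAB_PANIC and a hypothetical demo_cache_create() standing in for kmem_cache_create(). With the flag set, an allocation failure terminates the program, so the caller never sees NULL; without it, the caller has to handle the error itself.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag, standing in for SLAB_PANIC in this userspace sketch. */
#define DEMO_CACHE_PANIC 0x1u

struct demo_cache {
	const char *name;
	size_t obj_size;
};

/* Allocation hook so that a failure can be simulated; defaults to malloc(). */
static void *(*demo_alloc)(size_t) = malloc;

/* Replacement hook that always fails, for exercising the error path. */
static void *demo_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Create a cache descriptor. With DEMO_CACHE_PANIC, a failed allocation
 * aborts the program (the analogue of a boot-time panic). Without the
 * flag, NULL is returned and the caller must check for it.
 */
static struct demo_cache *demo_cache_create(const char *name, size_t obj_size,
					    unsigned int flags)
{
	struct demo_cache *c;

	assert(name != NULL);

	c = demo_alloc(sizeof(*c));
	if (!c) {
		if (flags & DEMO_CACHE_PANIC) {
			fprintf(stderr, "cannot create cache %s\n", name);
			abort();
		}
		return NULL;
	}
	c->name = name;
	c->obj_size = obj_size;
	return c;
}
```

Setting demo_alloc to demo_failing_alloc and calling demo_cache_create() without the flag returns NULL, which is the case a caller-side null check would cover. With the flag set, that check can never be reached.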





Re: [PATCH net-next 7/9] net: ipmr: remove SLAB_PANIC

2015-11-21 Thread Nikolay Aleksandrov
On 11/21/2015 03:23 PM, Eric Dumazet wrote:
> On Sat, 2015-11-21 at 14:01 +0100, Nikolay Aleksandrov wrote:
>> From: Nikolay Aleksandrov 
>>
>> It's not necessary to panic upon allocation failure, returning an error
>> at that point is okay because user-space won't be able to use any of the
>> ops since they didn't get registered and the default table is null.
>>
>> Signed-off-by: Nikolay Aleksandrov 
>> ---
>>  net/ipv4/ipmr.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index a006d96d6cd9..2c7fa584a274 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -2675,7 +2675,7 @@ int __init ip_mr_init(void)
>>  
>>  mrt_cachep = kmem_cache_create("ip_mrt_cache",
>> sizeof(struct mfc_cache),
>> -   0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
>> +   0, SLAB_HWCACHE_ALIGN,
>> NULL);
>>  if (!mrt_cachep)
>>  return -ENOMEM;
> 
> This runs at boot time.
> 
> I very much prefer a panic instead of having to deal with a host with a
> probable bug in mm layer at this point.
> 
Right, I was unsure about this one. I tried to rely on ipmr_get_table()
returning NULL, but I guess it's better to be safe than sorry.
I'll respin with the panic in place and the null check removed.

> For IPv6, it is a different matter as it might be a module, thus
> ip6_mr_init() does not use SLAB_PANIC
Yep, I saw. It wasn't the reason I removed this one.

Thanks for the review!



[PATCH net-next v2 7/9] net: ipmr: drop ip_mr_init() mrt_cachep null check as we'll panic if it fails

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

It's not necessary to check for null as SLAB_PANIC is used and we'll
panic if the alloc fails, so just drop it.

Signed-off-by: Nikolay Aleksandrov 
---
v2: new patch, keep SLAB_PANIC and drop the unnecessary null check

 net/ipv4/ipmr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a006d96d6cd9..50aec313119d 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2677,8 +2677,6 @@ int __init ip_mr_init(void)
   sizeof(struct mfc_cache),
   0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
   NULL);
-   if (!mrt_cachep)
-   return -ENOMEM;
 
err = register_pernet_subsys(&ipmr_net_ops);
if (err)
-- 
2.4.3



[PATCH net-next v2 9/9] net: ipmr: factor out common vif init code

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Factor out common vif init code used in both tunnel and pimreg
initialization and create ipmr_init_vif_indev() function.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 40 +++-
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index e384f39202cb..a2d248d9c35c 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -391,6 +391,23 @@ static void ipmr_del_tunnel(struct net_device *dev, struct vifctl *v)
}
 }
 
+/* Initialize ipmr pimreg/tunnel in_device */
+static bool ipmr_init_vif_indev(const struct net_device *dev)
+{
+   struct in_device *in_dev;
+
+   ASSERT_RTNL();
+
+   in_dev = __in_dev_get_rtnl(dev);
+   if (!in_dev)
+   return false;
+   ipv4_devconf_setall(in_dev);
+   neigh_parms_data_state_setall(in_dev->arp_parms);
+   IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0;
+
+   return true;
+}
+
 static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v)
 {
struct net_device  *dev;
@@ -402,7 +419,6 @@ static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v)
int err;
struct ifreq ifr;
struct ip_tunnel_parm p;
-   struct in_device  *in_dev;
 
memset(&p, 0, sizeof(p));
p.iph.daddr = v->vifc_rmt_addr.s_addr;
@@ -427,15 +443,8 @@ static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v)
if (err == 0 &&
(dev = __dev_get_by_name(net, p.name)) != NULL) {
dev->flags |= IFF_MULTICAST;
-
-   in_dev = __in_dev_get_rtnl(dev);
-   if (!in_dev)
+   if (!ipmr_init_vif_indev(dev))
goto failure;
-
-   ipv4_devconf_setall(in_dev);
-   neigh_parms_data_state_setall(in_dev->arp_parms);
-   IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0;
-
if (dev_open(dev))
goto failure;
dev_hold(dev);
@@ -502,7 +511,6 @@ static void reg_vif_setup(struct net_device *dev)
 static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt)
 {
struct net_device *dev;
-   struct in_device *in_dev;
char name[IFNAMSIZ];
 
if (mrt->id == RT_TABLE_DEFAULT)
@@ -522,18 +530,8 @@ static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt)
return NULL;
}
 
-   rcu_read_lock();
-   in_dev = __in_dev_get_rcu(dev);
-   if (!in_dev) {
-   rcu_read_unlock();
+   if (!ipmr_init_vif_indev(dev))
goto failure;
-   }
-
-   ipv4_devconf_setall(in_dev);
-   neigh_parms_data_state_setall(in_dev->arp_parms);
-   IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0;
-   rcu_read_unlock();
-
if (dev_open(dev))
goto failure;
 
-- 
2.4.3



[PATCH net-next v2 4/9] net: ipmr: fix code and comment style

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Trivial code and comment style fixes, also removed some extra newlines,
spaces and tabs.

Signed-off-by: Nikolay Aleksandrov 
---
 include/uapi/linux/mroute.h |  59 ++
 net/ipv4/ipmr.c | 142 
 2 files changed, 54 insertions(+), 147 deletions(-)

diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
index a382d2c04a42..cf943016930f 100644
--- a/include/uapi/linux/mroute.h
+++ b/include/uapi/linux/mroute.h
@@ -4,15 +4,13 @@
 #include 
 #include 
 
-/*
- * Based on the MROUTING 3.5 defines primarily to keep
- * source compatibility with BSD.
+/* Based on the MROUTING 3.5 defines primarily to keep
+ * source compatibility with BSD.
  *
- * See the mrouted code for the original history.
- *
- *  Protocol Independent Multicast (PIM) data structures included
- *  Carlos Picoto (c...@di.fc.ul.pt)
+ * See the mrouted code for the original history.
  *
+ * Protocol Independent Multicast (PIM) data structures included
+ * Carlos Picoto (c...@di.fc.ul.pt)
  */
 
 #define MRT_BASE   200
@@ -34,15 +32,13 @@
 #define SIOCGETSGCNT   (SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF (SIOCPROTOPRIVATE+2)
 
-#define MAXVIFS    32  
+#define MAXVIFS    32
 typedef unsigned long vifbitmap_t; /* User mode code depends on this lot */
 typedef unsigned short vifi_t;
 #define ALL_VIFS   ((vifi_t)(-1))
 
-/*
- * Same idea as select
- */
- 
+/* Same idea as select */
+
 #define VIFM_SET(n,m)  ((m)|=(1<<(n)))
 #define VIFM_CLR(n,m)  ((m)&=~(1<<(n)))
 #define VIFM_ISSET(n,m)    ((m)&(1<<(n)))
@@ -50,11 +46,9 @@ typedef unsigned short vifi_t;
 #define VIFM_COPY(mfrom,mto)   ((mto)=(mfrom))
 #define VIFM_SAME(m1,m2)   ((m1)==(m2))
 
-/*
- * Passed by mrouted for an MRT_ADD_VIF - again we use the
- * mrouted 3.6 structures for compatibility
+/* Passed by mrouted for an MRT_ADD_VIF - again we use the
+ * mrouted 3.6 structures for compatibility
  */
- 
 struct vifctl {
vifi_t  vifc_vifi;  /* Index of VIF */
unsigned char vifc_flags;   /* VIFF_ flags */
@@ -73,10 +67,7 @@ struct vifctl {
 #define VIFF_USE_IFINDEX   0x8 /* use vifc_lcl_ifindex instead of
   vifc_lcl_addr to find an interface */
 
-/*
- * Cache manipulation structures for mrouted and PIMd
- */
- 
+/* Cache manipulation structures for mrouted and PIMd */
 struct mfcctl {
struct in_addr mfcc_origin; /* Origin of mcast  */
struct in_addr mfcc_mcastgrp;   /* Group in question*/
@@ -88,10 +79,7 @@ struct mfcctl {
int  mfcc_expire;
 };
 
-/* 
- * Group count retrieval for mrouted
- */
- 
+/*  Group count retrieval for mrouted */
 struct sioc_sg_req {
struct in_addr src;
struct in_addr grp;
@@ -100,10 +88,7 @@ struct sioc_sg_req {
unsigned long wrong_if;
 };
 
-/*
- * To get vif packet counts
- */
-
+/* To get vif packet counts */
 struct sioc_vif_req {
vifi_t  vifi;   /* Which iface */
unsigned long icount;   /* In packets */
@@ -112,11 +97,9 @@ struct sioc_vif_req {
unsigned long obytes;   /* Out bytes */
 };
 
-/*
- * This is the format the mroute daemon expects to see IGMP control
- * data. Magically happens to be like an IP packet as per the original
+/* This is the format the mroute daemon expects to see IGMP control
+ * data. Magically happens to be like an IP packet as per the original
  */
- 
 struct igmpmsg {
__u32 unused1,unused2;
unsigned char im_msgtype;   /* What is this */
@@ -126,21 +109,13 @@ struct igmpmsg {
struct in_addr im_src,im_dst;
 };
 
-/*
- * That's all usermode folks
- */
-
-
+/* That's all usermode folks */
 
 #define MFC_ASSERT_THRESH (3*HZ)   /* Maximal freq. of asserts */
 
-/*
- * Pseudo messages used by mrouted
- */
-
+/* Pseudo messages used by mrouted */
 #define IGMPMSG_NOCACHE    1   /* Kern cache fill request to mrouted */
 #define IGMPMSG_WRONGVIF   2   /* For PIM assert processing (unused) */
 #define IGMPMSG_WHOLEPKT   3   /* For PIM Register processing */
 
-
 #endif /* _UAPI__LINUX_MROUTE_H */
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index e153ab7b17a1..286ede3716ee 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -102,9 +102,7 @@ static inline bool pimsm_enabled(void)
 
 static DEFINE_RWLOCK(mrt_lock);
 
-/*
- * Multicast router control variables
- */
+/* Multicast router control variables */
 
 #define VIF_EXISTS(_mrt, _idx) ((_mrt)->vif_table[_idx].dev != NULL)
 
@@ -393,8 +391,7 @@ static void ipmr_del_tunnel(struct net_device *dev, struct vifctl *v)
}
 }
 
-static
-struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v)
+static struct net_device *ipmr_new_tunnel(struct net *net, stru

[PATCH net-next v2 2/9] net: ipmr: always define mroute_reg_vif_num

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Previously, mroute_reg_vif_num was defined only if one of the
CONFIG_IP_PIMSM_ options was set, but that's not really necessary as the
size of the struct is the same in both cases (checked with pahole, both
cases size is 3256 bytes) and we can remove some unnecessary ifdefs to
simplify the code.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5271e2eee110..dd2462f70d34 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -84,9 +84,7 @@ struct mr_table {
atomic_t        cache_resolve_queue_len;
bool            mroute_do_assert;
bool            mroute_do_pim;
-#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
int mroute_reg_vif_num;
-#endif
 };
 
 struct ipmr_rule {
@@ -347,9 +345,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
setup_timer(&mrt->ipmr_expire_timer, ipmr_expire_process,
(unsigned long)mrt);
 
-#ifdef CONFIG_IP_PIMSM
mrt->mroute_reg_vif_num = -1;
-#endif
 #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
list_add_tail_rcu(&mrt->list, &net->ipv4.mr_tables);
 #endif
@@ -584,10 +580,8 @@ static int vif_delete(struct mr_table *mrt, int vifi, int notify,
return -EADDRNOTAVAIL;
}
 
-#ifdef CONFIG_IP_PIMSM
if (vifi == mrt->mroute_reg_vif_num)
mrt->mroute_reg_vif_num = -1;
-#endif
 
if (vifi + 1 == mrt->maxvif) {
int tmp;
@@ -824,10 +818,8 @@ static int vif_add(struct net *net, struct mr_table *mrt,
/* And finish update writing critical data */
write_lock_bh(&mrt_lock);
v->dev = dev;
-#ifdef CONFIG_IP_PIMSM
if (v->flags & VIFF_REGISTER)
mrt->mroute_reg_vif_num = vifi;
-#endif
if (vifi+1 > mrt->maxvif)
mrt->maxvif = vifi+1;
write_unlock_bh(&mrt_lock);
-- 
2.4.3



[PATCH net-next v2 6/9] net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Trivial replacement of an ifdef with IS_BUILTIN().

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 694fecf7838e..a006d96d6cd9 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1396,11 +1396,12 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
rtnl_unlock();
return ret;
}
-#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
case MRT_TABLE:
{
u32 v;
 
+   if (!IS_BUILTIN(CONFIG_IP_MROUTE_MULTIPLE_TABLES))
+   return -ENOPROTOOPT;
if (optlen != sizeof(u32))
return -EINVAL;
if (get_user(v, (u32 __user *)optval))
@@ -1420,7 +1421,6 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
rtnl_unlock();
return ret;
}
-#endif
/* Spurious command, or MRT_VERSION which you cannot set. */
default:
return -ENOPROTOOPT;
-- 
2.4.3
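
To illustrate why an in-code check can replace the #ifdef, here is a minimal userspace sketch of a preprocessor trick of the same general shape as the kernel's kconfig helpers. All DEMO_ names are hypothetical and are not taken from the patch. DEMO_IS_SET() evaluates to 1 when the option macro is defined as 1 and to 0 when it is not defined at all, so the guarded branch becomes a compile-time constant that the compiler can drop, while the code inside still gets parsed and type-checked.

```c
#include <assert.h>
#include <errno.h>

/*
 * If the option expands to 1, the pasted name DEMO_ARG_PLACEHOLDER_1
 * expands to "0," and shifts the literal 1 into the second argument
 * slot. Otherwise the pasted name is junk and the second slot holds 0.
 */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val
#define DEMO_IS_SET3(arg1_or_junk) DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define DEMO_IS_SET2(val) DEMO_IS_SET3(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_SET(option) DEMO_IS_SET2(option)

/* Hypothetical build options: the first is enabled, the second is absent. */
#define DEMO_CONFIG_MULTIPLE_TABLES 1
/* DEMO_CONFIG_OTHER_FEATURE is intentionally left undefined. */

/* Option handler guarded by a constant condition instead of an #ifdef. */
static int demo_set_table(unsigned int optlen)
{
	if (!DEMO_IS_SET(DEMO_CONFIG_MULTIPLE_TABLES))
		return -ENOPROTOOPT;
	if (optlen != sizeof(unsigned int))
		return -EINVAL;
	return 0;
}

/* Same guard for an option that is not configured: always rejected. */
static int demo_other_feature(void)
{
	if (!DEMO_IS_SET(DEMO_CONFIG_OTHER_FEATURE))
		return -ENOPROTOOPT;
	return 0;
}
```

The design point is that both branches are always visible to the compiler, so a build with the option disabled still catches type errors in the guarded code.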



[PATCH net-next v2 8/9] net: ipmr: rearrange and cleanup setsockopt

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Take rtnl in the beginning unconditionally as most options already need
it (one exception - MRT_DONE, see the comment inside), make the
lock/unlock places central and move out the switch() local variables.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 191 +++-
 1 file changed, 107 insertions(+), 84 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 50aec313119d..e384f39202cb 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1276,38 +1276,45 @@ static void mrtsock_destruct(struct sock *sk)
  * MOSPF/PIM router set up we can clean this up.
  */
 
-int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsigned int optlen)
+int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
+unsigned int optlen)
 {
-   int ret, parent = 0;
-   struct vifctl vif;
-   struct mfcctl mfc;
struct net *net = sock_net(sk);
+   int val, ret = 0, parent = 0;
struct mr_table *mrt;
+   struct vifctl vif;
+   struct mfcctl mfc;
+   u32 uval;
 
+   /* There's one exception to the lock - MRT_DONE which needs to unlock */
+   rtnl_lock();
if (sk->sk_type != SOCK_RAW ||
-   inet_sk(sk)->inet_num != IPPROTO_IGMP)
-   return -EOPNOTSUPP;
+   inet_sk(sk)->inet_num != IPPROTO_IGMP) {
+   ret = -EOPNOTSUPP;
+   goto out_unlock;
+   }
 
mrt = ipmr_get_table(net, raw_sk(sk)->ipmr_table ? : RT_TABLE_DEFAULT);
-   if (!mrt)
-   return -ENOENT;
-
+   if (!mrt) {
+   ret = -ENOENT;
+   goto out_unlock;
+   }
if (optname != MRT_INIT) {
if (sk != rcu_access_pointer(mrt->mroute_sk) &&
-   !ns_capable(net->user_ns, CAP_NET_ADMIN))
-   return -EACCES;
+   !ns_capable(net->user_ns, CAP_NET_ADMIN)) {
+   ret = -EACCES;
+   goto out_unlock;
+   }
}
 
switch (optname) {
case MRT_INIT:
if (optlen != sizeof(int))
-   return -EINVAL;
-
-   rtnl_lock();
-   if (rtnl_dereference(mrt->mroute_sk)) {
-   rtnl_unlock();
-   return -EADDRINUSE;
-   }
+   ret = -EINVAL;
+   if (rtnl_dereference(mrt->mroute_sk))
+   ret = -EADDRINUSE;
+   if (ret)
+   break;
 
ret = ip_ra_control(sk, 1, mrtsock_destruct);
if (ret == 0) {
@@ -1317,30 +1324,41 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
NETCONFA_IFINDEX_ALL,
net->ipv4.devconf_all);
}
-   rtnl_unlock();
-   return ret;
+   break;
case MRT_DONE:
-   if (sk != rcu_access_pointer(mrt->mroute_sk))
-   return -EACCES;
-   return ip_ra_control(sk, 0, NULL);
+   if (sk != rcu_access_pointer(mrt->mroute_sk)) {
+   ret = -EACCES;
+   } else {
+   /* We need to unlock here because mrtsock_destruct takes
+* care of rtnl itself and we can't change that due to
+* the IP_ROUTER_ALERT setsockopt which runs without it.
+*/
+   rtnl_unlock();
+   ret = ip_ra_control(sk, 0, NULL);
+   goto out;
+   }
+   break;
case MRT_ADD_VIF:
case MRT_DEL_VIF:
-   if (optlen != sizeof(vif))
-   return -EINVAL;
-   if (copy_from_user(&vif, optval, sizeof(vif)))
-   return -EFAULT;
-   if (vif.vifc_vifi >= MAXVIFS)
-   return -ENFILE;
-   rtnl_lock();
+   if (optlen != sizeof(vif)) {
+   ret = -EINVAL;
+   break;
+   }
+   if (copy_from_user(&vif, optval, sizeof(vif))) {
+   ret = -EFAULT;
+   break;
+   }
+   if (vif.vifc_vifi >= MAXVIFS) {
+   ret = -ENFILE;
+   break;
+   }
if (optname == MRT_ADD_VIF) {
ret = vif_add(net, mrt, &vif,
  sk == rtnl_dereference(mrt->mroute_sk));
} else {
ret = vif_delete(mrt, vif.vifc_vifi, 0, NULL);
}
-   rtnl_unlock();
-   return ret;
-
+   break;
/* Manipula

[PATCH net-next v2 1/9] net: ipmr: move the tbl id check in ipmr_new_table

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Move the table id check into ipmr_new_table() and make it return an error
pointer. We need this change for the upcoming netlink table manipulation
support in order to avoid code duplication and a race condition.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 92dd4b74d513..5271e2eee110 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -252,8 +252,8 @@ static int __net_init ipmr_rules_init(struct net *net)
INIT_LIST_HEAD(&net->ipv4.mr_tables);
 
mrt = ipmr_new_table(net, RT_TABLE_DEFAULT);
-   if (!mrt) {
-   err = -ENOMEM;
+   if (IS_ERR(mrt)) {
+   err = PTR_ERR(mrt);
goto err1;
}
 
@@ -301,8 +301,13 @@ static int ipmr_fib_lookup(struct net *net, struct flowi4 *flp4,
 
 static int __net_init ipmr_rules_init(struct net *net)
 {
-   net->ipv4.mrt = ipmr_new_table(net, RT_TABLE_DEFAULT);
-   return net->ipv4.mrt ? 0 : -ENOMEM;
+   struct mr_table *mrt;
+
+   mrt = ipmr_new_table(net, RT_TABLE_DEFAULT);
+   if (IS_ERR(mrt))
+   return PTR_ERR(mrt);
+   net->ipv4.mrt = mrt;
+   return 0;
 }
 
 static void __net_exit ipmr_rules_exit(struct net *net)
@@ -319,13 +324,17 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
struct mr_table *mrt;
unsigned int i;
 
+   /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */
+   if (id != RT_TABLE_DEFAULT && id >= 10)
+   return ERR_PTR(-EINVAL);
+
mrt = ipmr_get_table(net, id);
if (mrt)
return mrt;
 
mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
if (!mrt)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
write_pnet(&mrt->net, net);
mrt->id = id;
 
@@ -1407,17 +1416,14 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
if (get_user(v, (u32 __user *)optval))
return -EFAULT;
 
-   /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */
-   if (v != RT_TABLE_DEFAULT && v >= 10)
-   return -EINVAL;
-
rtnl_lock();
ret = 0;
if (sk == rtnl_dereference(mrt->mroute_sk)) {
ret = -EBUSY;
} else {
-   if (!ipmr_new_table(net, v))
-   ret = -ENOMEM;
+   mrt = ipmr_new_table(net, v);
+   if (IS_ERR(mrt))
+   ret = PTR_ERR(mrt);
else
raw_sk(sk)->ipmr_table = v;
}
-- 
2.4.3
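
The error-pointer convention this patch switches to can be sketched in userspace as follows. This is an illustration only: the demo_ helpers are hypothetical re-creations of the ERR_PTR()/IS_ERR()/PTR_ERR() idea, and the id limit is an assumption chosen so that "pimreg%u" always fits in a 16-byte name buffer, as the comment in the patch requires. The point is that one return value can carry either a valid pointer or a specific errno, so the caller can tell -EINVAL from -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO       4095
#define DEMO_IFNAMSIZ        16
#define DEMO_TABLE_DEFAULT   253u         /* assumption: stands in for RT_TABLE_DEFAULT */
#define DEMO_TABLE_ID_LIMIT  1000000000u  /* assumption: smaller ids have at most 9 digits */

/* Encode a negative errno in the top page of the address space. */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_table {
	unsigned int id;
	char reg_name[DEMO_IFNAMSIZ];
};

/* Validate the id first, then allocate; every failure has its own errno. */
static struct demo_table *demo_new_table(unsigned int id)
{
	struct demo_table *t;

	if (id != DEMO_TABLE_DEFAULT && id >= DEMO_TABLE_ID_LIMIT)
		return demo_err_ptr(-EINVAL);

	t = calloc(1, sizeof(*t));
	if (!t)
		return demo_err_ptr(-ENOMEM);

	t->id = id;
	if (id == DEMO_TABLE_DEFAULT)
		snprintf(t->reg_name, sizeof(t->reg_name), "pimreg");
	else
		snprintf(t->reg_name, sizeof(t->reg_name), "pimreg%u", id);
	return t;
}
```

A caller then checks demo_is_err() instead of comparing against NULL and propagates demo_ptr_err() unchanged, which is the shape the IS_ERR()/PTR_ERR() hunks above follow.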



[PATCH net-next v2 3/9] net: ipmr: remove some pimsm ifdefs and simplify

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Add the helper pimsm_enabled() which replaces the old CONFIG_IP_PIMSM
define and is used to check if any version of PIM-SM has been enabled.
Use a single if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
for the pim-sm shared code. This is okay w.r.t. IGMPMSG_WHOLEPKT because
only a VIFF_REGISTER device can send such a packet, and it can't be
created if pimsm_enabled() is false.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 180 ++--
 1 file changed, 84 insertions(+), 96 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index dd2462f70d34..e153ab7b17a1 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -67,10 +67,6 @@
 #include 
 #include 
 
-#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
-#define CONFIG_IP_PIMSM 1
-#endif
-
 struct mr_table {
struct list_head        list;
possible_net_t  net;
@@ -95,6 +91,11 @@ struct ipmr_result {
struct mr_table *mrt;
 };
 
+static inline bool pimsm_enabled(void)
+{
+   return IS_BUILTIN(CONFIG_IP_PIMSM_V1) || IS_BUILTIN(CONFIG_IP_PIMSM_V2);
+}
+
 /* Big lock, protecting vif table, mrt cache and mroute socket state.
  * Note that the changes are semaphored via rtnl_lock.
  */
@@ -454,8 +455,7 @@ failure:
return NULL;
 }
 
-#ifdef CONFIG_IP_PIMSM
-
+#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
 static netdev_tx_t reg_vif_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct net *net = dev_net(dev);
@@ -552,6 +552,51 @@ failure:
unregister_netdevice(dev);
return NULL;
 }
+
+/* called with rcu_read_lock() */
+static int __pim_rcv(struct mr_table *mrt, struct sk_buff *skb,
+unsigned int pimlen)
+{
+   struct net_device *reg_dev = NULL;
+   struct iphdr *encap;
+
+   encap = (struct iphdr *)(skb_transport_header(skb) + pimlen);
+   /*
+* Check that:
+* a. packet is really sent to a multicast group
+* b. packet is not a NULL-REGISTER
+* c. packet is not truncated
+*/
+   if (!ipv4_is_multicast(encap->daddr) ||
+   encap->tot_len == 0 ||
+   ntohs(encap->tot_len) + pimlen > skb->len)
+   return 1;
+
+   read_lock(&mrt_lock);
+   if (mrt->mroute_reg_vif_num >= 0)
+   reg_dev = mrt->vif_table[mrt->mroute_reg_vif_num].dev;
+   read_unlock(&mrt_lock);
+
+   if (!reg_dev)
+   return 1;
+
+   skb->mac_header = skb->network_header;
+   skb_pull(skb, (u8 *)encap - skb->data);
+   skb_reset_network_header(skb);
+   skb->protocol = htons(ETH_P_IP);
+   skb->ip_summed = CHECKSUM_NONE;
+
+   skb_tunnel_rx(skb, reg_dev, dev_net(reg_dev));
+
+   netif_rx(skb);
+
+   return NET_RX_SUCCESS;
+}
+#else
+static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt)
+{
+   return NULL;
+}
 #endif
 
 /**
@@ -734,10 +779,10 @@ static int vif_add(struct net *net, struct mr_table *mrt,
return -EADDRINUSE;
 
switch (vifc->vifc_flags) {
-#ifdef CONFIG_IP_PIMSM
case VIFF_REGISTER:
-   /*
-* Special Purpose VIF in PIM
+   if (!pimsm_enabled())
+   return -EINVAL;
+   /* Special Purpose VIF in PIM
 * All the packets will be sent to the daemon
 */
if (mrt->mroute_reg_vif_num >= 0)
@@ -752,7 +797,6 @@ static int vif_add(struct net *net, struct mr_table *mrt,
return err;
}
break;
-#endif
case VIFF_TUNNEL:
dev = ipmr_new_tunnel(net, vifc);
if (!dev)
@@ -942,34 +986,29 @@ static void ipmr_cache_resolve(struct net *net, struct mr_table *mrt,
}
 }
 
-/*
- * Bounce a cache query up to mrouted. We could use netlink for this but mrouted
- * expects the following bizarre scheme.
+/* Bounce a cache query up to mrouted. We could use netlink for this but mrouted
+ * expects the following bizarre scheme.
  *
- * Called under mrt_lock.
+ * Called under mrt_lock.
  */
-
 static int ipmr_cache_report(struct mr_table *mrt,
 struct sk_buff *pkt, vifi_t vifi, int assert)
 {
-   struct sk_buff *skb;
const int ihl = ip_hdrlen(pkt);
+   struct sock *mroute_sk;
struct igmphdr *igmp;
struct igmpmsg *msg;
-   struct sock *mroute_sk;
+   struct sk_buff *skb;
int ret;
 
-#ifdef CONFIG_IP_PIMSM
if (assert == IGMPMSG_WHOLEPKT)
skb = skb_realloc_headroom(pkt, sizeof(struct iphdr));
else
-#endif
skb = alloc_skb(128, GFP_ATOMIC);
 
if (!skb)
return -ENOBUFS;
 
-#ifdef CONFIG_IP_PIMSM
if (assert == IGMPMSG_WHOLEPKT) {
/* Ugly, but we have no choice with this interface.
 

[PATCH net-next v2 0/9] net: ipmr: cleanups and minor improvements

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Hi,
Since I'll have to work with ipmr, I decided to clean it up and do some
minor improvements. Functionally there're almost no changes except the
SLAB_PANIC removal. Most of the patches just re-design some functions to
be clearer and more concise and try to remove the ifdef web that was
inside. There's more information in each commit. This is the first set,
the end goal is to introduce complete netlink support and control over
the mfc and vif devices.
I've tried to test all of the setsockopt/getsockopt options, and also
made builds with various ipmr kconfig options turned on and off.

v2: change patch 7 to keep SLAB_PANIC and just drop the unnecessary null
check

Thank you,
 Nik


Nikolay Aleksandrov (9):
  net: ipmr: move the tbl id check in ipmr_new_table
  net: ipmr: always define mroute_reg_vif_num
  net: ipmr: remove some pimsm ifdefs and simplify
  net: ipmr: fix code and comment style
  net: ipmr: make ip_mroute_getsockopt more understandable
  net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES
  net: ipmr: drop ip_mr_init() mrt_cachep null check as we'll panic if
it fails
  net: ipmr: rearrange and cleanup setsockopt
  net: ipmr: factor out common vif init code

 include/uapi/linux/mroute.h |  59 ++---
 net/ipv4/ipmr.c | 597 
 2 files changed, 284 insertions(+), 372 deletions(-)

-- 
2.4.3



[PATCH net-next v2 5/9] net: ipmr: make ip_mroute_getsockopt more understandable

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

Use a switch to determine if optname is correct and set val accordingly.
This produces much more straightforward and readable code.

Signed-off-by: Nikolay Aleksandrov 
---
 net/ipv4/ipmr.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 286ede3716ee..694fecf7838e 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1443,29 +1443,29 @@ int ip_mroute_getsockopt(struct sock *sk, int optname, char __user *optval, int
if (!mrt)
return -ENOENT;
 
-   if (optname != MRT_VERSION &&
-  optname != MRT_PIM &&
-  optname != MRT_ASSERT)
+   switch (optname) {
+   case MRT_VERSION:
+   val = 0x0305;
+   break;
+   case MRT_PIM:
+   if (!pimsm_enabled())
+   return -ENOPROTOOPT;
+   val = mrt->mroute_do_pim;
+   break;
+   case MRT_ASSERT:
+   val = mrt->mroute_do_assert;
+   break;
+   default:
return -ENOPROTOOPT;
+   }
 
if (get_user(olr, optlen))
return -EFAULT;
-
olr = min_t(unsigned int, olr, sizeof(int));
if (olr < 0)
return -EINVAL;
-
if (put_user(olr, optlen))
return -EFAULT;
-   if (optname == MRT_VERSION) {
-   val = 0x0305;
-   } else if (optname == MRT_PIM) {
-   if (!pimsm_enabled())
-   return -ENOPROTOOPT;
-   val = mrt->mroute_do_pim;
-   } else {
-   val = mrt->mroute_do_assert;
-   }
if (copy_to_user(optval, &val, olr))
return -EFAULT;
return 0;
-- 
2.4.3
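
The control-flow change can be shown in isolation with a small userspace sketch. The option names, the state struct and demo_getopt() are hypothetical stand-ins, not the kernel interface. A single switch both validates the option name and selects the value, so an unknown option is rejected before anything is copied out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option names standing in for MRT_VERSION, MRT_PIM, MRT_ASSERT. */
enum demo_opt {
	DEMO_OPT_VERSION = 1,
	DEMO_OPT_PIM,
	DEMO_OPT_ASSERT,
};

struct demo_state {
	bool do_pim;
	bool do_assert;
	bool pim_built_in;	/* stands in for pimsm_enabled() */
};

/* Validate optname and pick the value in one place. */
static int demo_getopt(const struct demo_state *st, int optname, int *val)
{
	assert(st != NULL && val != NULL);

	switch (optname) {
	case DEMO_OPT_VERSION:
		*val = 0x0305;
		break;
	case DEMO_OPT_PIM:
		if (!st->pim_built_in)
			return -ENOPROTOOPT;
		*val = st->do_pim;
		break;
	case DEMO_OPT_ASSERT:
		*val = st->do_assert;
		break;
	default:
		return -ENOPROTOOPT;
	}
	return 0;
}
```

Compared with a chain of != comparisons followed later by an if/else ladder, each option is named once here, so adding a new one touches a single place.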



[PATCH 2/9] kernfs: implement kernfs_walk_and_get()

2015-11-21 Thread Tejun Heo
Implement kernfs_walk_and_get() which is similar to
kernfs_find_and_get() but can walk a path instead of just a name.

v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
David.

Signed-off-by: Tejun Heo 
Acked-by: Greg Kroah-Hartman 
Cc: David Miller 
---
 fs/kernfs/dir.c| 46 ++
 include/linux/kernfs.h | 12 
 2 files changed, 58 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..742bf4a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -694,6 +694,29 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
return NULL;
 }
 
+static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
+ const unsigned char *path,
+ const void *ns)
+{
+   static char path_buf[PATH_MAX]; /* protected by kernfs_mutex */
+   size_t len = strlcpy(path_buf, path, PATH_MAX);
+   char *p = path_buf;
+   char *name;
+
+   lockdep_assert_held(&kernfs_mutex);
+
+   if (len >= PATH_MAX)
+   return NULL;
+
+   while ((name = strsep(&p, "/")) && parent) {
+   if (*name == '\0')
+   continue;
+   parent = kernfs_find_ns(parent, name, ns);
+   }
+
+   return parent;
+}
+
 /**
  * kernfs_find_and_get_ns - find and get kernfs_node with the given name
  * @parent: kernfs_node to search under
@@ -719,6 +742,29 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
 EXPORT_SYMBOL_GPL(kernfs_find_and_get_ns);
 
 /**
+ * kernfs_walk_and_get_ns - find and get kernfs_node with the given path
+ * @parent: kernfs_node to search under
+ * @path: path to look for
+ * @ns: the namespace tag to use
+ *
+ * Look for kernfs_node with path @path under @parent and get a reference
+ * if found.  This function may sleep and returns pointer to the found
+ * kernfs_node on success, %NULL on failure.
+ */
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+  const char *path, const void *ns)
+{
+   struct kernfs_node *kn;
+
+   mutex_lock(&kernfs_mutex);
+   kn = kernfs_walk_ns(parent, path, ns);
+   kernfs_get(kn);
+   mutex_unlock(&kernfs_mutex);
+
+   return kn;
+}
+
+/**
  * kernfs_create_root - create a new kernfs hierarchy
  * @scops: optional syscall operations for the hierarchy
  * @flags: KERNFS_ROOT_* flags
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5d4e9c4..af51df3 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn);
 struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn);
 struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
   const char *name, const void *ns);
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+  const char *path, const void *ns);
 void kernfs_get(struct kernfs_node *kn);
 void kernfs_put(struct kernfs_node *kn);
 
@@ -350,6 +352,10 @@ static inline struct kernfs_node *
 kernfs_find_and_get_ns(struct kernfs_node *parent, const char *name,
   const void *ns)
 { return NULL; }
+static inline struct kernfs_node *
+kernfs_walk_and_get_ns(struct kernfs_node *parent, const char *path,
+  const void *ns)
+{ return NULL; }
 
 static inline void kernfs_get(struct kernfs_node *kn) { }
 static inline void kernfs_put(struct kernfs_node *kn) { }
@@ -431,6 +437,12 @@ kernfs_find_and_get(struct kernfs_node *kn, const char *name)
 }
 
 static inline struct kernfs_node *
+kernfs_walk_and_get(struct kernfs_node *kn, const char *path)
+{
+   return kernfs_walk_and_get_ns(kn, path, NULL);
+}
+
+static inline struct kernfs_node *
 kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
  void *priv)
 {
-- 
2.5.0



[PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

2015-11-21 Thread Tejun Heo
Hello,

This is v3 of the xt_cgroup2 patchset.  Changes from the last take are

* Folded cgroup2 path matching into xt_cgroup as a new revision rather
  than a separate xt_cgroup2 match as suggested by Pablo.

* Refreshed on top of Nina's net_cls dynamic config update fix patch.
  I included the fix patch as part of this series to ease reviewing.

The changes from v1 to v2 are

* Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
  carries either (prioidx, classid) pair or cgroup2 pointer.  This
  avoids inflating struct sock with yet another cgroup related field.
  Unfortunately, this does add some complexity but that's the
  trade-off and the complexity is contained in cgroup proper.

* Various small updates as per David and Jan's reviews.


In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbounded.  As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pulling in configuration knobs from other subsystems so
that cgroup membership tests can be avoided.

net_cls and net_prio controllers are examples of the latter.  They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical.  Both inherit
configuration from the parent on creation but there's no interaction
afterwards.  An ancestor doesn't restrict the behavior in its subtree
in any way and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration is
implemented at the system level.  net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.

While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.

This patchset includes the following nine patches.

 0001-cgroup-record-ancestor-IDs-and-reimplement-cgroup_is.patch
 0002-kernfs-implement-kernfs_walk_and_get.patch
 0003-cgroup-implement-cgroup_get_from_path-and-expose-cgr.patch
 0004-cgroups-Allow-dynamically-changing-net_classid.patch
 0005-netprio_cgroup-limit-the-maximum-css-id-to-USHRT_MAX.patch
 0006-net-wrap-sock-sk_cgrp_prioidx-and-sk_classid-inside-.patch
 0007-sock-cgroup-add-sock-sk_cgroup.patch
 0008-netfilter-prepare-xt_cgroup-for-multi-revisions.patch
 0009-netfilter-implement-xt_cgroup-cgroup2-path-match.patch

0001-0003 are preparatory patches in kernfs and cgroup.  These patches
are available in the following branch which will stay stable.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test

0004 is the following net_cls config update fix patch included in this
series to ease reviewing as it causes a conflict with a later patch in
this series.

 http://lkml.kernel.org/g/1448051499-1885574-1-git-send-email-nin...@fb.com

0005-0007 consolidate two cgroup related fields in struct sock into
cgroup_sock_data and update it so that it can alternatively carry a
cgroup pointer.

0008-0009 implement cgroup2 path matching in xt_cgroup.

This patchset is on top of v4.4-rc1 and also available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-xt_cgroup2

I'll post iptables extension as a reply.  diffstat follows.  Thanks.

 fs/kernfs/dir.c                          |   46 +++
 include/linux/cgroup-defs.h              |  126 +++
 include/linux/cgroup.h                   |   66 +++-
 include/linux/kernfs.h                   |   12 ++
 include/net/cls_cgroup.h                 |   11 +-
 include/net/netprio_cgroup.h             |   16 +++
 include/net/sock.h                       |   13 ---
 include/uapi/linux/netfilter/xt_cgroup.h |   15 +++
 kernel/cgroup.c                          |  126 ---
 net/Kconfig                              |    6 +
 net/core/dev.c                           |    3
 net/core/netclassid_cgroup.c             |   37 ++---
 net/core/netprio_cgroup.c                |   19
 net/core/scm.c                           |    4
 net/core/sock.c                          |   17
 net/netfilter/nft_meta.c                 |    2
 net/netfilter/xt_cgroup.c                |  108 ++
 17 files changed, 531 insertions(+

[PATCH 6/9] net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct

2015-11-21 Thread Tejun Heo
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgrp_prioidx and ->sk_classid are moved into it.  The struct
and its accessors are defined in cgroup-defs.h.  This is to prepare
for overloading the fields with a cgroup pointer.

This patch mostly performs equivalent conversions but the following
are noteworthy.

* Equality test before updating classid is removed from
  sock_update_classid().  This shouldn't make any noticeable
  difference and a similar test will be implemented on the helper side
  later.

* sock_update_netprioidx() now takes struct sock_cgroup_data and can
  be moved to netprio_cgroup.h without causing include dependency
  loop.  Moved.

* The dummy version of sock_update_netprioidx() is converted to a static
  inline function while at it.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup-defs.h  | 36
 include/net/cls_cgroup.h     | 11 +--
 include/net/netprio_cgroup.h | 16 +---
 include/net/sock.h           | 11 +++
 net/Kconfig                  |  6 ++
 net/core/dev.c               |  3 ++-
 net/core/netclassid_cgroup.c |  4 ++--
 net/core/netprio_cgroup.c    |  3 ++-
 net/core/scm.c               |  4 ++--
 net/core/sock.c              | 15 ++-
 net/netfilter/nft_meta.c     |  2 +-
 net/netfilter/xt_cgroup.c    |  3 ++-
 12 files changed, 76 insertions(+), 38 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 504d859..ed128fed 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -542,4 +542,40 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
 
 #endif /* CONFIG_CGROUPS */
 
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+struct sock_cgroup_data {
+   u16 prioidx;
+   u32 classid;
+};
+
+static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
+{
+   return skcd->prioidx;
+}
+
+static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
+{
+   return skcd->classid;
+}
+
+static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
+  u16 prioidx)
+{
+   skcd->prioidx = prioidx;
+}
+
+static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
+  u32 classid)
+{
+   skcd->classid = classid;
+}
+
+#else  /* CONFIG_SOCK_CGROUP_DATA */
+
+struct sock_cgroup_data {
+};
+
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+
 #endif /* _LINUX_CGROUP_DEFS_H */
diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
index ccd6d8b..c0a92e2 100644
--- a/include/net/cls_cgroup.h
+++ b/include/net/cls_cgroup.h
@@ -41,13 +41,12 @@ static inline u32 task_cls_classid(struct task_struct *p)
return classid;
 }
 
-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
 {
u32 classid;
 
classid = task_cls_classid(current);
-   if (classid != sk->sk_classid)
-   sk->sk_classid = classid;
+   sock_cgroup_set_classid(skcd, classid);
 }
 
 static inline u32 task_get_classid(const struct sk_buff *skb)
@@ -64,17 +63,17 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
 * softirqs always disables bh.
 */
if (in_serving_softirq()) {
-   /* If there is an sk_classid we'll use that. */
+   /* If there is an sock_cgroup_classid we'll use that. */
if (!skb->sk)
return 0;
 
-   classid = skb->sk->sk_classid;
+   classid = sock_cgroup_classid(&skb->sk->sk_cgrp_data);
}
 
return classid;
 }
 #else /* !CONFIG_CGROUP_NET_CLASSID */
-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
 {
 }
 
diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
index f2a9597..6041905 100644
--- a/include/net/netprio_cgroup.h
+++ b/include/net/netprio_cgroup.h
@@ -25,8 +25,6 @@ struct netprio_map {
u32 priomap[];
 };
 
-void sock_update_netprioidx(struct sock *sk);
-
 static inline u32 task_netprioidx(struct task_struct *p)
 {
struct cgroup_subsys_state *css;
@@ -38,13 +36,25 @@ static inline u32 task_netprioidx(struct task_struct *p)
rcu_read_unlock();
return idx;
 }
+
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+   if (in_interrupt())
+   return;
+
+   sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
+}
+
 #else /* !CONFIG_CGROUP_NET_PRIO */
+
 static inline u32 task_netprioidx(struct task_struct *p)
 {
return 0;
 }
 
-#define sock_update_netprioidx(sk)
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+}
 
 #endif /* CONFIG_CGROUP_NET_PRIO */
 #endif  /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h

[PATCH 8/9] netfilter: prepare xt_cgroup for multi revisions

2015-11-21 Thread Tejun Heo
xt_cgroup will grow cgroup2 path based match.  Postfix existing
symbols with _v0 and prepare for multi revision registration.

Signed-off-by: Tejun Heo 
Cc: Daniel Borkmann 
Cc: Daniel Wagner 
CC: Neil Horman 
Cc: Jan Engelhardt 
Cc: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/xt_cgroup.h |  2 +-
 net/netfilter/xt_cgroup.c| 36 +---
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
index 43acb7e..577c9e0 100644
--- a/include/uapi/linux/netfilter/xt_cgroup.h
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -3,7 +3,7 @@
 
 #include <linux/types.h>
 
-struct xt_cgroup_info {
+struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
 };
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 54eaeb4..1730025 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -24,9 +24,9 @@ MODULE_DESCRIPTION("Xtables: process control group matching");
 MODULE_ALIAS("ipt_cgroup");
 MODULE_ALIAS("ip6t_cgroup");
 
-static int cgroup_mt_check(const struct xt_mtchk_param *par)
+static int cgroup_mt_check_v0(const struct xt_mtchk_param *par)
 {
-   struct xt_cgroup_info *info = par->matchinfo;
+   struct xt_cgroup_info_v0 *info = par->matchinfo;
 
if (info->invert & ~1)
return -EINVAL;
@@ -35,9 +35,9 @@ static int cgroup_mt_check(const struct xt_mtchk_param *par)
 }
 
 static bool
-cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
+cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
 {
-   const struct xt_cgroup_info *info = par->matchinfo;
+   const struct xt_cgroup_info_v0 *info = par->matchinfo;
 
if (skb->sk == NULL || !sk_fullsock(skb->sk))
return false;
@@ -46,27 +46,29 @@ cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
info->invert;
 }
 
-static struct xt_match cgroup_mt_reg __read_mostly = {
-   .name   = "cgroup",
-   .revision   = 0,
-   .family = NFPROTO_UNSPEC,
-   .checkentry = cgroup_mt_check,
-   .match  = cgroup_mt,
-   .matchsize  = sizeof(struct xt_cgroup_info),
-   .me = THIS_MODULE,
-   .hooks  = (1 << NF_INET_LOCAL_OUT) |
- (1 << NF_INET_POST_ROUTING) |
- (1 << NF_INET_LOCAL_IN),
+static struct xt_match cgroup_mt_reg[] __read_mostly = {
+   {
+   .name   = "cgroup",
+   .revision   = 0,
+   .family = NFPROTO_UNSPEC,
+   .checkentry = cgroup_mt_check_v0,
+   .match  = cgroup_mt_v0,
+   .matchsize  = sizeof(struct xt_cgroup_info_v0),
+   .me = THIS_MODULE,
+   .hooks  = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING) |
+ (1 << NF_INET_LOCAL_IN),
+   },
 };
 
 static int __init cgroup_mt_init(void)
 {
-   return xt_register_match(&cgroup_mt_reg);
+   return xt_register_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg));
 }
 
 static void __exit cgroup_mt_exit(void)
 {
-   xt_unregister_match(&cgroup_mt_reg);
+   xt_unregister_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg));
 }
 
 module_init(cgroup_mt_init);
-- 
2.5.0



[PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

2015-11-21 Thread Tejun Heo
This patch implements xt_cgroup path match which matches cgroup2
membership of the associated socket.  The match is recursive and
invertible.

For rationales on introducing another cgroup based match, please refer
to a preceding commit "sock, cgroup: add sock->sk_cgroup".

v3: Folded into xt_cgroup as a new revision interface as suggested by
Pablo.

v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX.  Added
explicit alignment to the priv field.  Both suggested by Jan.

Signed-off-by: Tejun Heo 
Cc: Daniel Borkmann 
Cc: Daniel Wagner 
CC: Neil Horman 
Cc: Jan Engelhardt 
Cc: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/xt_cgroup.h | 13 ++
 net/netfilter/xt_cgroup.c| 69 
 2 files changed, 82 insertions(+)

diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
index 577c9e0..1e4b37b 100644
--- a/include/uapi/linux/netfilter/xt_cgroup.h
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -2,10 +2,23 @@
 #define _UAPI_XT_CGROUP_H
 
 #include <linux/types.h>
+#include <linux/limits.h>
 
 struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
 };
 
+struct xt_cgroup_info_v1 {
+   __u8has_path;
+   __u8has_classid;
+   __u8invert_path;
+   __u8invert_classid;
+   charpath[PATH_MAX];
+   __u32   classid;
+
+   /* kernel internal data */
+   void*priv __attribute__((aligned(8)));
+};
+
 #endif /* _UAPI_XT_CGROUP_H */
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 1730025..a086a91 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -34,6 +34,37 @@ static int cgroup_mt_check_v0(const struct xt_mtchk_param *par)
return 0;
 }
 
+static int cgroup_mt_check_v1(const struct xt_mtchk_param *par)
+{
+   struct xt_cgroup_info_v1 *info = par->matchinfo;
+   struct cgroup *cgrp;
+
+   if ((info->invert_path & ~1) || (info->invert_classid & ~1))
+   return -EINVAL;
+
+   if (!info->has_path && !info->has_classid) {
+   pr_info("xt_cgroup: no path or classid specified\n");
+   return -EINVAL;
+   }
+
+   if (info->has_path && info->has_classid) {
+   pr_info("xt_cgroup: both path and classid specified\n");
+   return -EINVAL;
+   }
+
+   if (info->has_path) {
+   cgrp = cgroup_get_from_path(info->path);
+   if (IS_ERR(cgrp)) {
+   pr_info("xt_cgroup: invalid path, errno=%ld\n",
+   PTR_ERR(cgrp));
+   return -EINVAL;
+   }
+   info->priv = cgrp;
+   }
+
+   return 0;
+}
+
 static bool
 cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
 {
@@ -46,6 +77,31 @@ cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
info->invert;
 }
 
+static bool cgroup_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
+{
+   const struct xt_cgroup_info_v1 *info = par->matchinfo;
+   struct sock_cgroup_data *skcd = &skb->sk->sk_cgrp_data;
+   struct cgroup *ancestor = info->priv;
+
+   if (!skb->sk || !sk_fullsock(skb->sk))
+   return false;
+
+   if (ancestor)
+   return cgroup_is_descendant(sock_cgroup_ptr(skcd), ancestor) ^
+   info->invert_path;
+   else
+   return (info->classid == sock_cgroup_classid(skcd)) ^
+   info->invert_classid;
+}
+
+static void cgroup_mt_destroy_v1(const struct xt_mtdtor_param *par)
+{
+   struct xt_cgroup_info_v1 *info = par->matchinfo;
+
+   if (info->priv)
+   cgroup_put(info->priv);
+}
+
 static struct xt_match cgroup_mt_reg[] __read_mostly = {
{
.name   = "cgroup",
@@ -59,6 +115,19 @@ static struct xt_match cgroup_mt_reg[] __read_mostly = {
  (1 << NF_INET_POST_ROUTING) |
  (1 << NF_INET_LOCAL_IN),
},
+   {
+   .name   = "cgroup",
+   .revision   = 1,
+   .family = NFPROTO_UNSPEC,
+   .checkentry = cgroup_mt_check_v1,
+   .match  = cgroup_mt_v1,
+   .matchsize  = sizeof(struct xt_cgroup_info_v1),
+   .destroy= cgroup_mt_destroy_v1,
+   .me = THIS_MODULE,
+   .hooks  = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING) |
+ (1 << NF_INET_LOCAL_IN),
+   },
 };
 
 static int __init cgroup_mt_init(void)
-- 
2.5.0



[PATCH 4/9] cgroups: Allow dynamically changing net_classid

2015-11-21 Thread Tejun Heo
From: Nina Schiff 

The classid of a process is changed either when a process is moved to
or from a cgroup or when the net_cls.classid file is updated.
Previously net_cls only supported propagating these changes to the
cgroup's related sockets when a process was added or removed from the
cgroup. This means it was necessary to remove and re-add all processes
to a cgroup in order to update its classid. This change introduces
support for doing this dynamically - i.e. when the value is changed in
the net_cls.classid file, this will also trigger an update to the
classid associated with all sockets controlled by the cgroup.
This mimics the behaviour of other cgroup subsystems.
net_prio circumvents this issue by storing an index into a table with
each socket (and so any updates to the table don't require updating
the value associated with the socket). net_cls, however, passes the
socket the classid directly, and so this additional step is needed.
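
The difference between the two controllers can be pictured with the
simplified userspace sketch below.  It is only an illustration under
assumptions: struct sock_sketch, prio_table[], group_classid[] and the
flat socket array are hypothetical, and the real net_prio map is more
involved than a single global table.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_GROUPS 4

/* Hypothetical per-socket state. */
struct sock_sketch {
	uint16_t prioidx;	/* net_prio style: index into prio_table[] */
	uint32_t classid;	/* net_cls style: copy of the group's value */
	int group;		/* owning group, for this sketch only */
};

static uint32_t prio_table[NR_GROUPS];		/* priority per group */
static uint32_t group_classid[NR_GROUPS];	/* configured classid per group */

/* Resolved through the table on use, so table updates are seen at once. */
static uint32_t sock_prio(const struct sock_sketch *sk)
{
	return prio_table[sk->prioidx];
}

/* A config write must also push the new value to the group's sockets. */
static void write_classid(int group, uint32_t classid,
			  struct sock_sketch *socks, size_t nr_socks)
{
	group_classid[group] = classid;

	for (size_t i = 0; i < nr_socks; i++)
		if (socks[i].group == group)
			socks[i].classid = classid;
}
```

Changing prio_table[] needs no per-socket work, whereas without the
loop in write_classid() existing sockets would keep the stale classid.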

Signed-off-by: Nina Schiff 
Signed-off-by: Tejun Heo 
---
 net/core/netclassid_cgroup.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 6441f47..2e4df84 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -56,7 +56,7 @@ static void cgrp_css_free(struct cgroup_subsys_state *css)
kfree(css_cls_state(css));
 }
 
-static int update_classid(const void *v, struct file *file, unsigned n)
+static int update_classid_sock(const void *v, struct file *file, unsigned n)
 {
int err;
struct socket *sock = sock_from_file(file, &err);
@@ -67,18 +67,25 @@ static int update_classid(const void *v, struct file *file, unsigned n)
return 0;
 }
 
-static void cgrp_attach(struct cgroup_subsys_state *css,
-   struct cgroup_taskset *tset)
+static void update_classid(struct cgroup_subsys_state *css, void *v)
 {
-   struct cgroup_cls_state *cs = css_cls_state(css);
-   void *v = (void *)(unsigned long)cs->classid;
+   struct css_task_iter it;
struct task_struct *p;
 
-   cgroup_taskset_for_each(p, tset) {
+   css_task_iter_start(css, &it);
+   while ((p = css_task_iter_next(&it))) {
task_lock(p);
-   iterate_fd(p->files, 0, update_classid, v);
+   iterate_fd(p->files, 0, update_classid_sock, v);
task_unlock(p);
}
+   css_task_iter_end(&it);
+}
+
+static void cgrp_attach(struct cgroup_subsys_state *css,
+   struct cgroup_taskset *tset)
+{
+   update_classid(css,
+  (void *)(unsigned long)css_cls_state(css)->classid);
 }
 
 static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
@@ -89,8 +96,11 @@ static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
 static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
 u64 value)
 {
-   css_cls_state(css)->classid = (u32) value;
+   struct cgroup_cls_state *cs = css_cls_state(css);
+
+   cs->classid = (u32)value;
 
+   update_classid(css, (void *)(unsigned long)cs->classid);
return 0;
 }
 
-- 
2.5.0



[PATCH 7/9] sock, cgroup: add sock->sk_cgroup

2015-11-21 Thread Tejun Heo
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbounded.  As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pulling in configuration knobs from other subsystems so
that cgroup membership tests can be avoided.

net_cls and net_prio controllers are examples of the latter.  They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical.  Both inherit
configuration from the parent on creation but there's no interaction
afterwards.  An ancestor doesn't restrict the behavior in its subtree
in any way and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration is
implemented at the system level.  net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.

While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.

In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used.  Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid.  This is to avoid adding yet another cgroup related field
to struct sock.
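
The overloading can be pictured as a single tagged word; the comment
added in cgroup-defs.h by this patch says the two modes are told apart
by the lowest bit.  The userspace sketch below is only an illustration
of that idea: struct skcd_sketch, the bit positions and the helper
names are assumptions, and both accessors fall back to 0 here for
simplicity although the real ones use different fallback values (see
the v3 note below).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: bit 0 is the mode tag, bits 16-31 hold prioidx
 * and bits 32-63 hold classid.  With the tag clear, the whole word is
 * a pointer to an object aligned to at least two bytes.
 */
struct skcd_sketch {
	uint64_t val;
};

#define SKCD_DATA_FLAG 1ULL

static bool skcd_is_data(const struct skcd_sketch *s)
{
	return s->val & SKCD_DATA_FLAG;
}

/* Pointer mode: aligned pointers always have bit 0 clear. */
static void skcd_set_ptr(struct skcd_sketch *s, const void *cgrp)
{
	s->val = (uint64_t)(uintptr_t)cgrp;
}

static const void *skcd_ptr(const struct skcd_sketch *s)
{
	return skcd_is_data(s) ? NULL : (const void *)(uintptr_t)s->val;
}

/* Data mode: overwrite the pointer with (prioidx, classid) and set the tag. */
static void skcd_set_data(struct skcd_sketch *s, uint16_t prioidx,
			  uint32_t classid)
{
	s->val = SKCD_DATA_FLAG | ((uint64_t)prioidx << 16) |
		 ((uint64_t)classid << 32);
}

static uint16_t skcd_prioidx(const struct skcd_sketch *s)
{
	return skcd_is_data(s) ? (uint16_t)(s->val >> 16) : 0;
}

static uint32_t skcd_classid(const struct skcd_sketch *s)
{
	return skcd_is_data(s) ? (uint32_t)(s->val >> 32) : 0;
}
```

Testing one bit keeps the hot-path check cheap, which fits the stated
aim of lowering hot path overhead.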

As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead.  It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs.  Non-critical inaccuracies from small race windows won't
make any noticeable difference.

This patch doesn't make use of the pointer yet.  The following patch
will implement netfilter match for cgroup2 membership.

v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.

v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.

Signed-off-by: Tejun Heo 
Cc: Daniel Borkmann 
Cc: Daniel Wagner 
CC: Neil Horman 
---
 include/linux/cgroup-defs.h  | 88 +---
 include/linux/cgroup.h       | 41 +
 kernel/cgroup.c              | 55 ++-
 net/core/netclassid_cgroup.c |  7 +++-
 net/core/netprio_cgroup.c    |  7 +++-
 net/core/sock.c              |  2 +
 6 files changed, 191 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ed128fed..9dc2263 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -544,31 +544,107 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
 
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
+/*
+ * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
+ * per-socket cgroup information except for memcg association.
+ *
+ * On legacy hierarchies, net_prio and net_cls controllers directly set
+ * attributes on each sock which can then be tested by the network layer.
+ * On the default hierarchy, each sock is associated with the cgroup it was
+ * created in and the networking layer can match the cgroup directly.
+ *
+ * To avoid carrying all three cgroup related fields separately in sock,
+ * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
+ * On boot, sock_cgroup_data records the cgroup that the sock was created
+ * in so that cgroup2 matches can be made; however, once either net_prio or
+ * net_cls starts being used, the area is overriden to carry prioidx and/or
+ * classid.  The two modes are distinguished by whether the lowest bit is
+ * set.  Clear bit indicates cgroup pointer while set bit prioidx and
+ * classid.
+ *
+ * While userland may start using net_prio or net_cls at any time, once
+ * either is used, cgroup2 matching no longer works.  There is no reason to
+ * mix the two and this is in line with how legacy and v2 compatibility is
+ * handled.  On mode switch, cgroup references whi

[PATCH 1/2 iptables] libxt_cgroup: prepare for multi revisions

2015-11-21 Thread Tejun Heo
libxt_cgroup will grow cgroup2 path based match.  Postfix existing
symbols with _v0 and prepare for multi revision registration.  While
at it, rename O_CGROUP to O_CLASSID and fwid to classid.

Signed-off-by: Tejun Heo 
Cc: Daniel Borkmann 
Cc: Jan Engelhardt 
Cc: Pablo Neira Ayuso 
---
 extensions/libxt_cgroup.c   |   51 +++-
 include/linux/netfilter/xt_cgroup.h |2 -
 2 files changed, 28 insertions(+), 25 deletions(-)

--- a/extensions/libxt_cgroup.c
+++ b/extensions/libxt_cgroup.c
@@ -3,30 +3,30 @@
 #include 
 
 enum {
-   O_CGROUP = 0,
+   O_CLASSID = 0,
 };
 
-static void cgroup_help(void)
+static void cgroup_help_v0(void)
 {
printf(
 "cgroup match options:\n"
-"[!] --cgroup fwid  Match cgroup fwid\n");
+"[!] --cgroup classidMatch cgroup classid\n");
 }
 
-static const struct xt_option_entry cgroup_opts[] = {
+static const struct xt_option_entry cgroup_opts_v0[] = {
{
.name = "cgroup",
-   .id = O_CGROUP,
+   .id = O_CLASSID,
.type = XTTYPE_UINT32,
.flags = XTOPT_INVERT | XTOPT_MAND | XTOPT_PUT,
-   XTOPT_POINTER(struct xt_cgroup_info, id)
+   XTOPT_POINTER(struct xt_cgroup_info_v0, id)
},
XTOPT_TABLEEND,
 };
 
-static void cgroup_parse(struct xt_option_call *cb)
+static void cgroup_parse_v0(struct xt_option_call *cb)
 {
-   struct xt_cgroup_info *cgroupinfo = cb->data;
+   struct xt_cgroup_info_v0 *cgroupinfo = cb->data;
 
xtables_option_parse(cb);
if (cb->invert)
@@ -34,34 +34,37 @@ static void cgroup_parse(struct xt_optio
 }
 
 static void
-cgroup_print(const void *ip, const struct xt_entry_match *match, int numeric)
+cgroup_print_v0(const void *ip, const struct xt_entry_match *match, int numeric)
 {
-   const struct xt_cgroup_info *info = (void *) match->data;
+   const struct xt_cgroup_info_v0 *info = (void *) match->data;
 
printf(" cgroup %s%u", info->invert ? "! ":"", info->id);
 }
 
-static void cgroup_save(const void *ip, const struct xt_entry_match *match)
+static void cgroup_save_v0(const void *ip, const struct xt_entry_match *match)
 {
-   const struct xt_cgroup_info *info = (void *) match->data;
+   const struct xt_cgroup_info_v0 *info = (void *) match->data;
 
printf("%s --cgroup %u", info->invert ? " !" : "", info->id);
 }
 
-static struct xtables_match cgroup_match = {
-   .family = NFPROTO_UNSPEC,
-   .name   = "cgroup",
-   .version= XTABLES_VERSION,
-   .size   = XT_ALIGN(sizeof(struct xt_cgroup_info)),
-   .userspacesize  = XT_ALIGN(sizeof(struct xt_cgroup_info)),
-   .help   = cgroup_help,
-   .print  = cgroup_print,
-   .save   = cgroup_save,
-   .x6_parse   = cgroup_parse,
-   .x6_options = cgroup_opts,
+static struct xtables_match cgroup_match[] = {
+   {
+   .family = NFPROTO_UNSPEC,
+   .revision   = 0,
+   .name   = "cgroup",
+   .version= XTABLES_VERSION,
+   .size   = XT_ALIGN(sizeof(struct xt_cgroup_info_v0)),
+   .userspacesize  = XT_ALIGN(sizeof(struct xt_cgroup_info_v0)),
+   .help   = cgroup_help_v0,
+   .print  = cgroup_print_v0,
+   .save   = cgroup_save_v0,
+   .x6_parse   = cgroup_parse_v0,
+   .x6_options = cgroup_opts_v0,
+   },
 };
 
 void _init(void)
 {
-   xtables_register_match(&cgroup_match);
+   xtables_register_matches(cgroup_match, ARRAY_SIZE(cgroup_match));
 }
--- a/include/linux/netfilter/xt_cgroup.h
+++ b/include/linux/netfilter/xt_cgroup.h
@@ -3,7 +3,7 @@
 
 #include <linux/types.h>
 
-struct xt_cgroup_info {
+struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
 };


[PATCH 1/9] cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it

2015-11-21 Thread Tejun Heo
cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question.  While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.

This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level.  cgroup_is_descendant() is accordingly reimplemented
and made inline.
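
The three steps can be sketched in userspace as follows.  This is an
illustration under assumptions rather than the patch's code: struct
cg_sketch, cg_init() and SKETCH_MAX_LEVEL are hypothetical, and a
fixed-size array stands in for the flexible ancestor_ids[] member.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_LEVEL 8	/* callers keep nesting below this depth */

/* Hypothetical stand-in for struct cgroup. */
struct cg_sketch {
	const void *root;	/* identifies the hierarchy */
	int id;
	int level;		/* the root is at level 0 */
	int ancestor_ids[SKETCH_MAX_LEVEL];	/* ids per level, incl. self */
};

/* Record ancestry at creation time; pass @root only when @parent is NULL. */
static void cg_init(struct cg_sketch *cg, const struct cg_sketch *parent,
		    int id, const void *root)
{
	cg->id = id;
	if (parent) {
		cg->root = parent->root;
		cg->level = parent->level + 1;
		for (int i = 0; i < cg->level; i++)
			cg->ancestor_ids[i] = parent->ancestor_ids[i];
	} else {
		cg->root = root;
		cg->level = 0;
	}
	cg->ancestor_ids[cg->level] = id;
}

/* Constant-time ancestry test; also true when @cg == @ancestor. */
static bool cg_is_descendant(const struct cg_sketch *cg,
			     const struct cg_sketch *ancestor)
{
	/* steps 1 and 2: same hierarchy, and not shallower than @ancestor */
	if (cg->root != ancestor->root || cg->level < ancestor->level)
		return false;
	/* step 3: compare the id recorded at the ancestor's level */
	return cg->ancestor_ids[ancestor->level] == ancestor->id;
}
```

The cost is one array slot per nesting level in each cgroup, paid once
at creation, in exchange for a test that never walks the hierarchy.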

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup-defs.h | 14 ++
 include/linux/cgroup.h  | 18 +-
 kernel/cgroup.c | 32 ++--
 3 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 60d44b2..504d859 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -235,6 +235,14 @@ struct cgroup {
int id;
 
/*
+* The depth this cgroup is at.  The root is at depth zero and each
+* step down the hierarchy increments the level.  This along with
+* ancestor_ids[] can determine whether a given cgroup is a
+* descendant of another without traversing the hierarchy.
+*/
+   int level;
+
+   /*
 * Each non-empty css_set associated with this cgroup contributes
 * one to populated_cnt.  All children with non-zero popuplated_cnt
 * of their own contribute one.  The count is zero iff there's no
@@ -289,6 +297,9 @@ struct cgroup {
 
/* used to schedule release agent */
struct work_struct release_agent_work;
+
+   /* ids of the ancestors at each level including self */
+   int ancestor_ids[];
 };
 
 /*
@@ -308,6 +319,9 @@ struct cgroup_root {
/* The root cgroup.  Root is destroyed on its release. */
struct cgroup cgrp;
 
+   /* for cgrp->ancestor_ids[0] */
+   int cgrp_ancestor_id_storage;
+
/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
atomic_t nr_cgrps;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 22e3754..b5ee2c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,7 +81,6 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
 struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
    struct cgroup_subsys *ss);
 
-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
@@ -459,6 +458,23 @@ static inline struct cgroup *task_cgroup(struct task_struct *task,
return task_css(task, subsys_id)->cgroup;
 }
 
+/**
+ * cgroup_is_descendant - test ancestry
+ * @cgrp: the cgroup to be tested
+ * @ancestor: possible ancestor of @cgrp
+ *
+ * Test whether @cgrp is a descendant of @ancestor.  It also returns %true
+ * if @cgrp == @ancestor.  This function is safe to call as long as @cgrp
+ * and @ancestor are accessible.
+ */
+static inline bool cgroup_is_descendant(struct cgroup *cgrp,
+   struct cgroup *ancestor)
+{
+   if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
+   return false;
+   return cgrp->ancestor_ids[ancestor->level] == ancestor->id;
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_is_populated(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f1603c1..3190040 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -459,25 +459,6 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 }
 EXPORT_SYMBOL_GPL(of_css);
 
-/**
- * cgroup_is_descendant - test ancestry
- * @cgrp: the cgroup to be tested
- * @ancestor: possible ancestor of @cgrp
- *
- * Test whether @cgrp is a descendant of @ancestor.  It also returns %true
- * if @cgrp == @ancestor.  This function is safe to call as long as @cgrp
- * and @ancestor are accessible.
- */
-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
-{
-   while (cgrp) {
-   if (cgrp == ancestor)
-   return true;
-   cgrp = cgroup_parent(cgrp);
-   }
-   return false;
-}
-
 static int notify_on_release(const struct cgroup *cgrp)
 {
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
@@ -1903,6 +1884,7 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
if (ret < 0)
goto out;
root_cgrp->id = ret;
+   root_cgrp->ancestor_ids[0] = ret;
 
ret = percpu

[PATCH 3/9] cgroup: implement cgroup_get_from_path() and expose cgroup_put()

2015-11-21 Thread Tejun Heo
Implement cgroup_get_from_path() using kernfs_walk_and_get() which
obtains a default hierarchy cgroup from its path.  This will be used
to allow cgroup path based matching from outside cgroup proper -
e.g. networking and perf.
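
A caller is expected to pair the lookup with cgroup_put() and to check
for an encoded error rather than NULL.  The sketch below mimics that
calling convention in userspace; ERR_PTR(), PTR_ERR() and IS_ERR() are
simplified re-implementations of the kernel helpers, and struct obj,
table[], obj_get_from_path() and obj_put() are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
    /* Error codes live in the topmost MAX_ERRNO addresses. */
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical refcounted object standing in for struct cgroup. */
struct obj {
    const char *path;
    int is_dir;
    int refcnt;
};

static struct obj table[] = {
    { "/a",       1, 1 },
    { "/a/procs", 0, 1 },
};

/* Look up @path; return a referenced object or an encoded errno. */
static struct obj *obj_get_from_path(const char *path)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(table[i].path, path) != 0)
            continue;
        if (!table[i].is_dir)
            return ERR_PTR(-ENOTDIR);
        table[i].refcnt++;
        return &table[i];
    }
    return ERR_PTR(-ENOENT);
}

/* Drop the reference taken by obj_get_from_path(). */
static void obj_put(struct obj *o)
{
    assert(o->refcnt > 0);
    o->refcnt--;
}
```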

v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  7 +++
 kernel/cgroup.c| 39 ++-
 2 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b5ee2c4..4c3ffab 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,6 +81,8 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
 struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
    struct cgroup_subsys *ss);
 
+struct cgroup *cgroup_get_from_path(const char *path);
+
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
@@ -351,6 +353,11 @@ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
percpu_ref_put_many(&css->refcnt, n);
 }
 
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+   css_put(&cgrp->self);
+}
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3190040..3db5e8f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -434,11 +434,6 @@ static bool cgroup_tryget(struct cgroup *cgrp)
return css_tryget(&cgrp->self);
 }
 
-static void cgroup_put(struct cgroup *cgrp)
-{
-   css_put(&cgrp->self);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
@@ -5753,6 +5748,40 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
return id > 0 ? idr_find(&ss->css_idr, id) : NULL;
 }
 
+/**
+ * cgroup_get_from_path - lookup and get a cgroup from its default hierarchy path
+ * @path: path on the default hierarchy
+ *
+ * Find the cgroup at @path on the default hierarchy, increment its
+ * reference count and return it.  Returns pointer to the found cgroup on
+ * success, ERR_PTR(-ENOENT) if @path doesn't exist and ERR_PTR(-ENOTDIR)
+ * if @path points to a non-directory.
+ */
+struct cgroup *cgroup_get_from_path(const char *path)
+{
+   struct kernfs_node *kn;
+   struct cgroup *cgrp;
+
+   mutex_lock(&cgroup_mutex);
+
+   kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
+   if (kn) {
+   if (kernfs_type(kn) == KERNFS_DIR) {
+   cgrp = kn->priv;
+   cgroup_get(cgrp);
+   } else {
+   cgrp = ERR_PTR(-ENOTDIR);
+   }
+   kernfs_put(kn);
+   } else {
+   cgrp = ERR_PTR(-ENOENT);
+   }
+
+   mutex_unlock(&cgroup_mutex);
+   return cgrp;
+}
+EXPORT_SYMBOL_GPL(cgroup_get_from_path);
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)
-- 
2.5.0



[PATCH 5/9] netprio_cgroup: limit the maximum css->id to USHRT_MAX

2015-11-21 Thread Tejun Heo
netprio builds a per-netdev contiguous priomap array which is indexed by
css->id.  The array is allocated using kzalloc(), effectively limiting
the maximum supported ID to a few thousand.  This patch caps the
maximum supported css->id to USHRT_MAX, which should be way above what
is actually usable.

This allows reducing sock->sk_cgrp_prioidx to u16 from u32.  The freed
up part will be used to overload the cgroup related fields.
sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
two cgroup related fields are adjacent.
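
A minimal userspace sketch of the idea, assuming nothing beyond what the
description says: once IDs above USHRT_MAX are refused when the css
comes online, a 16-bit field can hold every ID that reaches a socket.
struct mini_sock, prio_online() and sock_set_prioidx() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

#define NETPRIO_ID_MAX USHRT_MAX

/* Hypothetical reduced socket: only the field whose width the patch shrinks. */
struct mini_sock {
    uint16_t cgrp_prioidx;   /* 32 bits wide before the patch */
};

/* Mirrors the new check in cgrp_css_online(): refuse IDs that cannot fit. */
static int prio_online(int css_id)
{
    if (css_id > NETPRIO_ID_MAX)
        return -ENOSPC;
    return 0;
}

/* Store an already validated ID; the cast cannot truncate once the cap holds. */
static void sock_set_prioidx(struct mini_sock *sk, int css_id)
{
    assert(css_id >= 0 && css_id <= NETPRIO_ID_MAX);
    sk->cgrp_prioidx = (uint16_t)css_id;
}
```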

Signed-off-by: Tejun Heo 
Acked-by: Daniel Wagner 
Cc: Daniel Borkmann 
CC: Neil Horman 
---
 include/net/sock.h| 10 +-
 net/core/netprio_cgroup.c |  9 +
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index bbf7c2c..b517351 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -288,7 +288,6 @@ struct cg_proto;
   *@sk_ack_backlog: current listen backlog
   *@sk_max_ack_backlog: listen backlog set in listen()
   *@sk_priority: %SO_PRIORITY setting
-  *@sk_cgrp_prioidx: socket group's priority map index
   *@sk_type: socket type (%SOCK_STREAM, etc)
   *@sk_protocol: which protocol this socket belongs in this network family
   *@sk_peer_pid: &struct pid for this socket's peer
@@ -309,6 +308,7 @@ struct cg_proto;
   *@sk_send_head: front of stuff to transmit
   *@sk_security: used by security modules
   *@sk_mark: generic packet mark
+  *@sk_cgrp_prioidx: socket group's priority map index
   *@sk_classid: this socket's cgroup classid
   *@sk_cgrp: this socket's cgroup-specific proto data
   *@sk_write_pending: a write to stream socket waits to start
@@ -423,9 +423,7 @@ struct sock {
u32 sk_ack_backlog;
u32 sk_max_ack_backlog;
__u32   sk_priority;
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
-   __u32   sk_cgrp_prioidx;
-#endif
+   __u32   sk_mark;
struct pid  *sk_peer_pid;
const struct cred   *sk_peer_cred;
longsk_rcvtimeo;
@@ -443,7 +441,9 @@ struct sock {
 #ifdef CONFIG_SECURITY
void*sk_security;
 #endif
-   __u32   sk_mark;
+#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
+   u16 sk_cgrp_prioidx;
+#endif
 #ifdef CONFIG_CGROUP_NET_CLASSID
u32 sk_classid;
 #endif
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index cbd0a19..2b9159b 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -27,6 +27,12 @@
 
 #include 
 
+/*
+ * netprio allocates per-net_device priomap array which is indexed by
+ * css->id.  Limiting css ID to 16bits doesn't lose anything.
+ */
+#define NETPRIO_ID_MAX USHRT_MAX
+
 #define PRIOMAP_MIN_SZ 128
 
 /*
@@ -144,6 +150,9 @@ static int cgrp_css_online(struct cgroup_subsys_state *css)
struct net_device *dev;
int ret = 0;
 
+   if (css->id > NETPRIO_ID_MAX)
+   return -ENOSPC;
+
if (!parent_css)
return 0;
 
-- 
2.5.0



Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

2015-11-21 Thread Tejun Heo
Oops, made a copy & paste error on Neil Horman's address.  Sorry,
Neil.  The thread can be found at

  http://lkml.kernel.org/g/1448122441-9335-1-git-send-email...@kernel.org

Thanks.

-- 
tejun


[PATCH 2/2 iptables] libxt_cgroup: add support for cgroup2 path matching

2015-11-21 Thread Tejun Heo
This patch updates xt_cgroup so that it supports the revision 1
interface, which includes cgroup2 path based matching.

v3: Folded into xt_cgroup as a new revision interface as suggested by
Pablo.

v2: cgroup2_match->userspacesize and ->save and man page updated as
per Jan.

Signed-off-by: Tejun Heo 
Cc: Daniel Borkmann 
Cc: Jan Engelhardt 
Cc: Pablo Neira Ayuso 
---
 extensions/libxt_cgroup.c   |   86 
 extensions/libxt_cgroup.man |   33 -
 include/linux/netfilter/xt_cgroup.h |   13 +
 3 files changed, 119 insertions(+), 13 deletions(-)

--- a/extensions/libxt_cgroup.c
+++ b/extensions/libxt_cgroup.c
@@ -4,6 +4,7 @@
 
 enum {
O_CLASSID = 0,
+   O_PATH = 1,
 };
 
 static void cgroup_help_v0(void)
@@ -13,6 +14,14 @@ static void cgroup_help_v0(void)
 "[!] --cgroup classidMatch cgroup classid\n");
 }
 
+static void cgroup_help_v1(void)
+{
+   printf(
+"cgroup match options:\n"
+"[!] --path path  Recursively match path relative to cgroup2 root\n"
+"[!] --cgroup classid  Match cgroup classid, can't be used with --path\n");
+}
+
 static const struct xt_option_entry cgroup_opts_v0[] = {
{
.name = "cgroup",
@@ -24,6 +33,24 @@ static const struct xt_option_entry cgro
XTOPT_TABLEEND,
 };
 
+static const struct xt_option_entry cgroup_opts_v1[] = {
+   {
+   .name = "path",
+   .id = O_PATH,
+   .type = XTTYPE_STRING,
+   .flags = XTOPT_INVERT | XTOPT_PUT,
+   XTOPT_POINTER(struct xt_cgroup_info_v1, path)
+   },
+   {
+   .name = "cgroup",
+   .id = O_CLASSID,
+   .type = XTTYPE_UINT32,
+   .flags = XTOPT_INVERT | XTOPT_PUT,
+   XTOPT_POINTER(struct xt_cgroup_info_v1, classid)
+   },
+   XTOPT_TABLEEND,
+};
+
 static void cgroup_parse_v0(struct xt_option_call *cb)
 {
struct xt_cgroup_info_v0 *cgroupinfo = cb->data;
@@ -33,6 +60,26 @@ static void cgroup_parse_v0(struct xt_op
cgroupinfo->invert = true;
 }
 
+static void cgroup_parse_v1(struct xt_option_call *cb)
+{
+   struct xt_cgroup_info_v1 *info = cb->data;
+
+   xtables_option_parse(cb);
+
+   switch (cb->entry->id) {
+   case O_PATH:
+   info->has_path = true;
+   if (cb->invert)
+   info->invert_path = true;
+   break;
+   case O_CLASSID:
+   info->has_classid = true;
+   if (cb->invert)
+   info->invert_classid = true;
+   break;
+   }
+}
+
 static void
 cgroup_print_v0(const void *ip, const struct xt_entry_match *match, int numeric)
 {
@@ -48,6 +95,32 @@ static void cgroup_save_v0(const void *i
printf("%s --cgroup %u", info->invert ? " !" : "", info->id);
 }
 
+static void
+cgroup_print_v1(const void *ip, const struct xt_entry_match *match, int numeric)
+{
+   const struct xt_cgroup_info_v1 *info = (void *)match->data;
+
+   printf(" cgroup");
+   if (info->has_path)
+   printf(" %s%s", info->invert_path ? "! ":"", info->path);
+   if (info->has_classid)
+   printf(" %s%u", info->invert_classid ? "! ":"", info->classid);
+}
+
+static void cgroup_save_v1(const void *ip, const struct xt_entry_match *match)
+{
+   const struct xt_cgroup_info_v1 *info = (void *)match->data;
+
+   if (info->has_path) {
+   printf("%s --path", info->invert_path ? " !" : "");
+   xtables_save_string(info->path);
+   }
+
+   if (info->has_classid)
+   printf("%s --cgroup %u", info->invert_classid ? " !" : "",
+  info->classid);
+}
+
 static struct xtables_match cgroup_match[] = {
{
.family = NFPROTO_UNSPEC,
@@ -62,6 +135,19 @@ static struct xtables_match cgroup_match
.x6_parse   = cgroup_parse_v0,
.x6_options = cgroup_opts_v0,
},
+   {
+   .family = NFPROTO_UNSPEC,
+   .revision   = 1,
+   .name   = "cgroup",
+   .version= XTABLES_VERSION,
+   .size   = XT_ALIGN(sizeof(struct xt_cgroup_info_v1)),
+   .userspacesize  = offsetof(struct xt_cgroup_info_v1, priv),
+   .help   = cgroup_help_v1,
+   .print  = cgroup_print_v1,
+   .save   = cgroup_save_v1,
+   .x6_parse   = cgroup_parse_v1,
+   .x6_options = cgroup_opts_v1,
+   },
 };
 
 void _init(void)
--- a/extensions/libxt_cgroup.man
+++ b/extensions/libxt_cgroup.man
@@ -1,23 +1,30 @@
 .TP
-[\fB!\fP] \fB\-\-cgroup\fP \fIfwid\fP
-Match corresponding cgroup for this packet.
+[\fB!\fP] \fB\-\-path\fP \fIpath\fP
+Match cgroup2 membership.
 
-Can be used in the OUTPUT chain to assign particula

Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

2015-11-21 Thread Florian Westphal
Tejun Heo  wrote:
> This patch implements xt_cgroup path match which matches cgroup2
> membership of the associated socket.  The match is recursive and
> invertible.
> 
> For rationales on introducing another cgroup based match, please refer
> to a preceding commit "sock, cgroup: add sock->sk_cgroup".
> 
> v3: Folded into xt_cgroup as a new revision interface as suggested by
> Pablo.
> 
> v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX.  Added
> explicit alignment to the priv field.  Both suggested by Jan.
> 
> Signed-off-by: Tejun Heo 
> Cc: Daniel Borkmann 
> Cc: Daniel Wagner 
> CC: Neil Horman 
> Cc: Jan Engelhardt 
> Cc: Pablo Neira Ayuso 
> ---
>  include/uapi/linux/netfilter/xt_cgroup.h | 13 ++
>  net/netfilter/xt_cgroup.c| 69 
> 
>  2 files changed, 82 insertions(+)
> 
> diff --git a/include/uapi/linux/netfilter/xt_cgroup.h 
> b/include/uapi/linux/netfilter/xt_cgroup.h
> index 577c9e0..1e4b37b 100644
> --- a/include/uapi/linux/netfilter/xt_cgroup.h
> +++ b/include/uapi/linux/netfilter/xt_cgroup.h
> @@ -2,10 +2,23 @@
>  #define _UAPI_XT_CGROUP_H
>  
>  #include 
> +#include 
>  
>  struct xt_cgroup_info_v0 {
>   __u32 id;
>   __u32 invert;
>  };
>  
> +struct xt_cgroup_info_v1 {
> + __u8has_path;
> + __u8has_classid;
> + __u8invert_path;
> + __u8invert_classid;
> + charpath[PATH_MAX];
> + __u32   classid;
> +
> + /* kernel internal data */
> + void*priv __attribute__((aligned(8)));
> +};

Ahem.  Am I reading this right? This struct is > 4k in size?
If so -- Ugh.  Does sizeof(path) really have to be PATH_MAX?
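
For reference, the size can be checked from userspace with a copy of the
layout.  This is only a sketch: uint8_t/uint32_t stand in for
__u8/__u32, the PATH_MAX fallback of 4096 is an assumption matching
Linux, and info_user_size()/info_total_size() are hypothetical helpers.

```c
#include <assert.h>
#include <limits.h>     /* may provide PATH_MAX, depending on feature macros */
#include <stddef.h>
#include <stdint.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* fallback, matches the Linux value */
#endif

/* Userspace copy of the proposed layout with fixed-width stand-in types. */
struct xt_cgroup_info_v1 {
    uint8_t  has_path;
    uint8_t  has_classid;
    uint8_t  invert_path;
    uint8_t  invert_classid;
    char     path[PATH_MAX];
    uint32_t classid;

    /* kernel internal data */
    void    *priv __attribute__((aligned(8)));
};

/* Bytes shared with userspace: everything before priv. */
static size_t info_user_size(void)
{
    return offsetof(struct xt_cgroup_info_v1, priv);
}

/* Full size of the match data, including the kernel-only pointer. */
static size_t info_total_size(void)
{
    return sizeof(struct xt_cgroup_info_v1);
}
```

With PATH_MAX at 4096 this should come to 4104 bytes up to priv and 4112
bytes in total, so each rule using the match carries a bit over 4k.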

Thanks!


Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

2015-11-21 Thread Tejun Heo
Hello,

On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote:
> > +struct xt_cgroup_info_v1 {
> > +   __u8has_path;
> > +   __u8has_classid;
> > +   __u8invert_path;
> > +   __u8invert_classid;
> > +   charpath[PATH_MAX];
> > +   __u32   classid;
> > +
> > +   /* kernel internal data */
> > +   void*priv __attribute__((aligned(8)));
> > +};
> 
> Ahem.  Am I reading this right? This struct is > 4k in size?
> If so -- Ugh.  Does sizeof(path) really have to be PATH_MAX?

Hmmm... yeap but would this be an actual problem?  We can try to make
it shorter but idk it ultimately is a path.  Another solution would be
trying to pass inode around but that is problematic with showing and
printing rules as the only way to reverse-map inode to path is walking
the tree and the cgroup may already be gone at that point.  While >4k
struct isn't pretty, this looks like the path of least resistance.

Thanks.

-- 
tejun


[PATCH] net: atm: constify in_cache_ops and eg_cache_ops structures

2015-11-21 Thread Julia Lawall
The in_cache_ops and eg_cache_ops structures are never modified, so declare
them as const.

Done with the help of Coccinelle.
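
The pattern being applied looks roughly like the sketch below.  It is a
reduced illustration, not the MPOA code: struct cache_ops, demo_add(),
demo_get(), struct client and client_roundtrip() are hypothetical, and
designated initializers are used here for readability even though the
tables in the patch are initialized positionally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced ops table; the real in_cache_ops has more members. */
struct cache_ops {
    int (*add_entry)(int key);
    int (*get)(int key);
};

static int last_key = -1;
static int demo_add(int key) { last_key = key; return 0; }
static int demo_get(int key) { return key == last_key ? key : -1; }

/* const: the table is never written, so assignments to it fail to compile. */
static const struct cache_ops demo_ops = {
    .add_entry = demo_add,
    .get       = demo_get,
};

struct client {
    const struct cache_ops *ops;   /* pointer-to-const, as in the patch */
};

/* Call through the table exactly as before; only the qualifier changed. */
static int client_roundtrip(struct client *c, int key)
{
    assert(c->ops != NULL);
    c->ops->add_entry(key);
    return c->ops->get(key);
}
```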

Signed-off-by: Julia Lawall 

---
 net/atm/mpc.h |4 ++--
 net/atm/mpoa_caches.c |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/atm/mpc.h b/net/atm/mpc.h
index 0919a88..cfc7b74 100644
--- a/net/atm/mpc.h
+++ b/net/atm/mpc.h
@@ -21,11 +21,11 @@ struct mpoa_client {
uint8_t our_ctrl_addr[ATM_ESA_LEN];  /* MPC's control ATM address   */
 
rwlock_t ingress_lock;
-   struct in_cache_ops *in_ops; /* ingress cache operations*/
+   const struct in_cache_ops *in_ops; /* ingress cache operations  */
in_cache_entry *in_cache;/* the ingress cache of this MPC   */
 
rwlock_t egress_lock;
-   struct eg_cache_ops *eg_ops; /* egress cache operations */
+   const struct eg_cache_ops *eg_ops; /* egress cache operations   */
eg_cache_entry *eg_cache;/* the egress  cache of this MPC   */
 
uint8_t *mps_macs;   /* array of MPS MAC addresses, >=1 */
diff --git a/net/atm/mpoa_caches.c b/net/atm/mpoa_caches.c
index d1b2d9a..9e60e74 100644
--- a/net/atm/mpoa_caches.c
+++ b/net/atm/mpoa_caches.c
@@ -534,7 +534,7 @@ static void eg_destroy_cache(struct mpoa_client *mpc)
 }
 
 
-static struct in_cache_ops ingress_ops = {
+static const struct in_cache_ops ingress_ops = {
in_cache_add_entry,   /* add_entry   */
in_cache_get, /* get */
in_cache_get_with_mask,   /* get_with_mask   */
@@ -548,7 +548,7 @@ static struct in_cache_ops ingress_ops = {
in_destroy_cache  /* destroy_cache   */
 };
 
-static struct eg_cache_ops egress_ops = {
+static const struct eg_cache_ops egress_ops = {
eg_cache_add_entry,   /* add_entry*/
eg_cache_get_by_cache_id, /* get_by_cache_id  */
eg_cache_get_by_tag,  /* get_by_tag   */



[PATCH] VSOCK: constify vmci_transport_notify_ops structures

2015-11-21 Thread Julia Lawall
The vmci_transport_notify_ops structures are never modified, so declare
them as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall 

---
 net/vmw_vsock/vmci_transport.h   |2 +-
 net/vmw_vsock/vmci_transport_notify.c|2 +-
 net/vmw_vsock/vmci_transport_notify.h|5 +++--
 net/vmw_vsock/vmci_transport_notify_qstate.c |2 +-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/vmw_vsock/vmci_transport.h b/net/vmw_vsock/vmci_transport.h
index 2ad46f3..1820e74 100644
--- a/net/vmw_vsock/vmci_transport.h
+++ b/net/vmw_vsock/vmci_transport.h
@@ -121,7 +121,7 @@ struct vmci_transport {
u64 queue_pair_max_size;
u32 detach_sub_id;
union vmci_transport_notify notify;
-   struct vmci_transport_notify_ops *notify_ops;
+   const struct vmci_transport_notify_ops *notify_ops;
struct list_head elem;
struct sock *sk;
spinlock_t lock; /* protects sk. */
diff --git a/net/vmw_vsock/vmci_transport_notify.c b/net/vmw_vsock/vmci_transport_notify.c
index 9b7f207..fd8cf02 100644
--- a/net/vmw_vsock/vmci_transport_notify.c
+++ b/net/vmw_vsock/vmci_transport_notify.c
@@ -661,7 +661,7 @@ static void vmci_transport_notify_pkt_process_negotiate(struct sock *sk)
 }
 
 /* Socket control packet based operations. */
-struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops = {
+const struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops = {
vmci_transport_notify_pkt_socket_init,
vmci_transport_notify_pkt_socket_destruct,
vmci_transport_notify_pkt_poll_in,
diff --git a/net/vmw_vsock/vmci_transport_notify.h b/net/vmw_vsock/vmci_transport_notify.h
index 7df7932..3c464d3 100644
--- a/net/vmw_vsock/vmci_transport_notify.h
+++ b/net/vmw_vsock/vmci_transport_notify.h
@@ -77,7 +77,8 @@ struct vmci_transport_notify_ops {
void (*process_negotiate) (struct sock *sk);
 };
 
-extern struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops;
-extern struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops;
+extern const struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops;
+extern const
+struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops;
 
 #endif /* __VMCI_TRANSPORT_NOTIFY_H__ */
diff --git a/net/vmw_vsock/vmci_transport_notify_qstate.c b/net/vmw_vsock/vmci_transport_notify_qstate.c
index dc9c792..21e591d 100644
--- a/net/vmw_vsock/vmci_transport_notify_qstate.c
+++ b/net/vmw_vsock/vmci_transport_notify_qstate.c
@@ -419,7 +419,7 @@ vmci_transport_notify_pkt_send_pre_enqueue(
 }
 
 /* Socket always on control packet based operations. */
-struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops = {
+const struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops = {
vmci_transport_notify_pkt_socket_init,
vmci_transport_notify_pkt_socket_destruct,
vmci_transport_notify_pkt_poll_in,



[PATCH net] vrf: fix double free and memory corruption on register_netdevice failure

2015-11-21 Thread Nikolay Aleksandrov
From: Nikolay Aleksandrov 

When vrf's ->newlink is called, if register_netdevice() fails then it
does free_netdev(), but that's also done by rtnl_newlink(), so a second
free happens and memory gets corrupted.  To reproduce, execute the
following line a couple of times (1 - 5 usually is enough):
$ for i in `seq 1 5`; do ip link add vrf: type vrf table 1; done;
This works because register_netdevice() fails on the invalid name
"vrf:".
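
The ownership problem can be modelled in userspace without actually
corrupting memory.  In this sketch a counter replaces the real free, and
struct dev, dev_free(), dev_register(), newlink_buggy(), newlink_fixed()
and create_link() are all hypothetical stand-ins for the driver and
rtnl_newlink() paths.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device; a counter replaces free() so the double release
 * can be observed safely. */
struct dev {
    int valid_name;
    int free_count;
};

static void dev_free(struct dev *d) { d->free_count++; }
static int dev_register(struct dev *d) { return d->valid_name ? 0 : -EINVAL; }

/* Old behaviour: newlink frees on failure even though the caller will too. */
static int newlink_buggy(struct dev *d)
{
    int err = dev_register(d);

    if (err < 0)
        dev_free(d);
    return err;
}

/* Fixed behaviour: propagate the error and leave cleanup to the caller. */
static int newlink_fixed(struct dev *d)
{
    return dev_register(d);
}

/* Caller modelled on rtnl_newlink(): it owns the device and frees it on error. */
static int create_link(struct dev *d, int (*newlink)(struct dev *))
{
    int err = newlink(d);

    if (err < 0)
        dev_free(d);
    return err;
}
```

With an invalid name the buggy variant ends with free_count at 2, the
fixed one at 1.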

And here's a trace of one crash:
[   28.792157] [ cut here ]
[   28.792407] kernel BUG at fs/namei.c:246!
[   28.792608] invalid opcode:  [#1] SMP
[   28.793240] Modules linked in: vrf nfsd auth_rpcgss oid_registry
nfs_acl nfs lockd grace sunrpc crct10dif_pclmul crc32_pclmul
crc32c_intel qxl drm_kms_helper ttm drm aesni_intel aes_x86_64 psmouse
glue_helper lrw evdev gf128mul i2c_piix4 ablk_helper cryptd ppdev
parport_pc parport serio_raw pcspkr virtio_balloon virtio_console
i2c_core acpi_cpufreq button 9pnet_virtio 9p 9pnet fscache ipv6 autofs4
ext4 crc16 mbcache jbd2 virtio_blk virtio_net sg sr_mod cdrom
ata_generic ehci_pci uhci_hcd ehci_hcd e1000 usbcore usb_common ata_piix
libata virtio_pci virtio_ring virtio scsi_mod floppy
[   28.796016] CPU: 0 PID: 1148 Comm: ld-linux-x86-64 Not tainted
4.4.0-rc1+ #24
[   28.796016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[   28.796016] task: 8800352561c0 ti: 88003592c000 task.ti:
88003592c000
[   28.796016] RIP: 0010:[]  []
putname+0x43/0x60
[   28.796016] RSP: 0018:88003592fe88  EFLAGS: 00010246
[   28.796016] RAX:  RBX: 8800352561c0 RCX:
0001
[   28.796016] RDX:  RSI:  RDI:
88003784f000
[   28.796016] RBP: 88003592ff08 R08: 0001 R09:

[   28.796016] R10:  R11: 0001 R12:

[   28.796016] R13: 047c R14: 88003784f000 R15:
8800358c4a00
[   28.796016] FS:  () GS:88003fc0()
knlGS:
[   28.796016] CS:  0010 DS:  ES:  CR0: 80050033
[   28.796016] CR2: 7ffd583bc2d9 CR3: 35a99000 CR4:
000406f0
[   28.796016] Stack:
[   28.796016]  8121045d 812102d3 8800352561c0
880035a91660
[   28.796016]  888a9880  81a49940
00ff81218684
[   28.796016]  8800352561c0 047c 
880035b36d80
[   28.796016] Call Trace:
[   28.796016]  [] ?
do_execveat_common.isra.34+0x74d/0x930
[   28.796016]  [] ?
do_execveat_common.isra.34+0x5c3/0x930
[   28.796016]  [] do_execve+0x2c/0x30
[   28.796016]  []
call_usermodehelper_exec_async+0xf0/0x140
[   28.796016]  [] ? umh_complete+0x40/0x40
[   28.796016]  [] ret_from_fork+0x3f/0x70
[   28.796016] Code: 48 8d 47 1c 48 89 e5 53 48 8b 37 48 89 fb 48 39 c6
74 1a 48 8b 3d 7e e9 8f 00 e8 49 fa fc ff 48 89 df e8 f1 01 fd ff 5b 5d
f3 c3 <0f> 0b 48 89 fe 48 8b 3d 61 e9 8f 00 e8 2c fa fc ff 5b 5d eb e9
[   28.796016] RIP  [] putname+0x43/0x60
[   28.796016]  RSP 

Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
Signed-off-by: Nikolay Aleksandrov 
---
 drivers/net/vrf.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 92fa3e1ea65c..4f9748457f5a 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -907,7 +907,6 @@ static int vrf_newlink(struct net *src_net, struct net_device *dev,
   struct nlattr *tb[], struct nlattr *data[])
 {
struct net_vrf *vrf = netdev_priv(dev);
-   int err;
 
if (!data || !data[IFLA_VRF_TABLE])
return -EINVAL;
@@ -916,15 +915,7 @@ static int vrf_newlink(struct net *src_net, struct net_device *dev,
 
dev->priv_flags |= IFF_L3MDEV_MASTER;
 
-   err = register_netdevice(dev);
-   if (err < 0)
-   goto out_fail;
-
-   return 0;
-
-out_fail:
-   free_netdev(dev);
-   return err;
+   return register_netdevice(dev);
 }
 
 static size_t vrf_nl_getsize(const struct net_device *dev)
-- 
2.4.3



Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

2015-11-21 Thread Florian Westphal
Tejun Heo  wrote:
> On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote:
> > > +struct xt_cgroup_info_v1 {
> > > + __u8has_path;
> > > + __u8has_classid;
> > > + __u8invert_path;
> > > + __u8invert_classid;
> > > + charpath[PATH_MAX];
> > > + __u32   classid;
> > > +
> > > + /* kernel internal data */
> > > + void*priv __attribute__((aligned(8)));
> > > +};
> > 
> > Ahem.  Am I reading this right? This struct is > 4k in size?
> > If so -- Ugh.  Does sizeof(path) really have to be PATH_MAX?
> 
> Hmmm... yeap but would this be an acutual problem?

Since the rule blob can be allocated via vmalloc I guess "no", it's not
really a problem unless someone needs a really insane amount of such rules.

I don't have any better suggestion, so I guess it's a necessary evil.

The only other question I have is whether PATH_MAX might be a possible
ABI breaker in the future.  It would have to be guaranteed that this is the
same size forever, else you'd get strange errors on rule insertion if
the sizes of the kernel and userspace versions differ.
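
To make the concern concrete, here is a sketch of what a size mismatch
would look like.  The 8192 value is purely hypothetical, DEFINE_INFO()
is an illustration macro, and check_match_size() is an assumed stand-in
for the size comparison done when a rule is loaded, not the actual
x_tables code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Layout parameterised on the path length so two "builds" can be compared. */
#define DEFINE_INFO(name, pathlen)                  \
    struct name {                                   \
        uint8_t  has_path, has_classid;             \
        uint8_t  invert_path, invert_classid;       \
        char     path[pathlen];                     \
        uint32_t classid;                           \
        void    *priv __attribute__((aligned(8)));  \
    }

DEFINE_INFO(info_4k, 4096);   /* today's PATH_MAX on Linux */
DEFINE_INFO(info_8k, 8192);   /* hypothetical future value */

/* Assumed stand-in for the kernel-side check of the userspace-supplied size. */
static int check_match_size(size_t from_user, size_t expected)
{
    return from_user == expected ? 0 : -EINVAL;
}
```

A binary built against one value and a kernel built against the other
would disagree on the struct size, and every rule insertion would fail.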
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

2015-11-21 Thread Jan Engelhardt

On Saturday 2015-11-21 19:54, Florian Westphal wrote:
>
>The only other question I have is wheter PATH_MAX might be a possible
>ABI breaker in future.  It would have to be guaranteed that this is the
>same size forever, else you'd get strange errors on rule insertion if
>the sizes of the kernel and userspace version differs.
>
The same goes for IFNAMSIZ. But, so far, nobody changed it in the kernel,
even though there are voices that 15 characters + '\0' were a tight choice.


Re: [PATCH] packet: Allow packets with only a header (but no payload)

2015-11-21 Thread Sergei Shtylyov

Hello.

On 11/21/2015 3:50 AM, Martin Blumenstingl wrote:


9c70776 added validation for the packet size in packet_snd. This change


   Please run your patch thru scripts/checkpatch.pl -- it now enforces 
certain commit citing style.



enforced that every packet needs a header with at least hard_header_len
bytes  and at least one byte payload.

This fixes PPPoE connections which do not have a "Service" or
"Host-Uniq" configured (which is violating the spec, but is still
widely used in real-world setups). Those are currently failing with the
following message: "pppd: packet size is too short (24 <= 24)"

Signed-off-by: Martin Blumenstingl 


[...]

MBR, Sergei



Re: [B.A.T.M.A.N.] [PATCH 2/3] batman-adv: Split a condition check

2015-11-21 Thread Marek Lindner
On Tuesday, November 03, 2015 21:54:35 SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Tue, 3 Nov 2015 20:41:02 +0100
> 
> Let us split a check for a condition at the beginning of the
> batadv_is_ap_isolated() function so that a direct return can be performed
> in this function if the variable "vlan" contained a null pointer.
> 
> Signed-off-by: Markus Elfring 
> ---
>  net/batman-adv/translation-table.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)

Applied in revision b1199c6.

Thanks,
Marek




Re: [B.A.T.M.A.N.] [PATCH 2/2] batman-adv: Less checks in batadv_tvlv_unicast_send()

2015-11-21 Thread Marek Lindner
On Sunday, November 15, 2015 09:45:51 SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Sun, 15 Nov 2015 09:00:42 +0100
> 
> * Let us return directly if a call of the batadv_orig_hash_find() function
>   returned a null pointer.
> 
> * Omit the initialisation for the variable "skb" at the beginning.
> 
> * Replace an assignment by a call of the kfree_skb() function
>   and delete the affected variable "ret" then.
> 
> Signed-off-by: Markus Elfring 
> ---
>  net/batman-adv/main.c | 15 +--
>  1 file changed, 5 insertions(+), 10 deletions(-)

Applied in revision 5a878b8.

Thanks,
Marek




Re: [B.A.T.M.A.N.] [PATCH 1/2] batman-adv: Delete unnecessary checks before the function call "kfree_skb"

2015-11-21 Thread Marek Lindner
On Sunday, November 15, 2015 09:43:26 SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Sun, 15 Nov 2015 08:04:43 +0100
> 
> The kfree_skb() function tests whether its argument is NULL and then
> returns immediately. Thus the test around the calls is not needed.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---
>  net/batman-adv/main.c   | 2 +-
>  net/batman-adv/network-coding.c | 4 +---
>  net/batman-adv/send.c   | 3 +--
>  3 files changed, 3 insertions(+), 6 deletions(-)

Applied in revision 77d84a6.

Thanks,
Marek




Re: [B.A.T.M.A.N.] [PATCH 1/3] batman-adv: Delete an unnecessary check before the function call "batadv_softif_vlan_free_ref"

2015-11-21 Thread Marek Lindner
On Tuesday, November 03, 2015 21:52:58 SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Tue, 3 Nov 2015 19:20:34 +0100
> 
> The batadv_softif_vlan_free_ref() function tests whether its argument is
> NULL and then returns immediately. Thus the test around the call is not
> needed.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---
>  net/batman-adv/translation-table.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Applied in revision bbcbe0f.

Thanks,
Marek




Re: r8169 regression: UDP packets dropped intermittantly

2015-11-21 Thread Francois Romieu
Francois Romieu  :
[...]

If you can crash your system at will, you may apply the patch below to
da78dbff2e05630921c551dbbc70a4b7981a8fff ("r8169: remove work from irq
handler.") parent (aka 1e874e041fc7c222cbd85b20c4406070be1f687a) and
build it in a current tree (say 4.2).

If it works it will be easier for me to compare behaviors in current
tree (especially as none of my current test boxes wants to boot with
my own ga311).

How much memory and CPU may I rely on in your test computer?

--- r8169.c 2015-11-21 23:02:10.435275753 +0100
+++ r8169.c 2015-11-21 23:21:49.429554012 +0100
@@ -29,7 +29,6 @@
 #include 
 #include 
 
-#include 
 #include 
 #include 
 
@@ -1616,7 +1615,7 @@ static int rtl8169_set_features(struct n
else
tp->cp_cmd &= ~RxChkSum;
 
-   if (dev->features & NETIF_F_HW_VLAN_RX)
+   if (dev->features & NETIF_F_HW_VLAN_CTAG_RX)
tp->cp_cmd |= RxVlan;
else
tp->cp_cmd &= ~RxVlan;
@@ -1632,8 +1631,8 @@ static int rtl8169_set_features(struct n
 static inline u32 rtl8169_tx_vlan_tag(struct rtl8169_private *tp,
  struct sk_buff *skb)
 {
-   return (vlan_tx_tag_present(skb)) ?
-   TxVlanTag | swab16(vlan_tx_tag_get(skb)) : 0x00;
+   return (skb_vlan_tag_present(skb)) ?
+   TxVlanTag | swab16(skb_vlan_tag_get(skb)) : 0x00;
 }
 
 static void rtl8169_rx_vlan_tag(struct RxDesc *desc, struct sk_buff *skb)
@@ -1641,7 +1640,7 @@ static void rtl8169_rx_vlan_tag(struct R
u32 opts2 = le32_to_cpu(desc->opts2);
 
if (opts2 & RxVlanTag)
-   __vlan_hwaccel_put_tag(skb, swab16(opts2 & 0x));
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), swab16(opts2 & 
0x));
 
desc->opts2 = 0;
 }
@@ -3508,7 +3507,7 @@ static const struct net_device_ops rtl81
 
 };
 
-static void __devinit rtl_init_mdio_ops(struct rtl8169_private *tp)
+static void rtl_init_mdio_ops(struct rtl8169_private *tp)
 {
struct mdio_ops *ops = &tp->mdio_ops;
 
@@ -3725,7 +3724,7 @@ static void rtl_pll_power_up(struct rtl8
rtl_generic_op(tp, tp->pll_power_ops.up);
 }
 
-static void __devinit rtl_init_pll_power_ops(struct rtl8169_private *tp)
+static void rtl_init_pll_power_ops(struct rtl8169_private *tp)
 {
struct pll_power_ops *ops = &tp->pll_power_ops;
 
@@ -3905,7 +3904,7 @@ static void r8168b_1_hw_jumbo_disable(st
RTL_W8(Config4, RTL_R8(Config4) & ~(1 << 0));
 }
 
-static void __devinit rtl_init_jumbo_ops(struct rtl8169_private *tp)
+static void rtl_init_jumbo_ops(struct rtl8169_private *tp)
 {
struct jumbo_ops *ops = &tp->jumbo_ops;
 
@@ -3971,7 +3970,7 @@ static void rtl_hw_reset(struct rtl8169_
}
 }
 
-static int __devinit
+static int
 rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
const struct rtl_cfg_info *cfg = rtl_cfg_infos + ent->driver_data;
@@ -4137,7 +4136,7 @@ rtl8169_init_one(struct pci_dev *pdev, c
dev->dev_addr[i] = RTL_R8(MAC0 + i);
memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len);
 
-   SET_ETHTOOL_OPS(dev, &rtl8169_ethtool_ops);
+   dev->ethtool_ops = &rtl8169_ethtool_ops;
dev->watchdog_timeo = RTL8169_TX_TIMEOUT;
dev->irq = pdev->irq;
dev->base_addr = (unsigned long) ioaddr;
@@ -4147,16 +4146,16 @@ rtl8169_init_one(struct pci_dev *pdev, c
/* don't enable SG, IP_CSUM and TSO by default - it might not work
 * properly for all devices */
dev->features |= NETIF_F_RXCSUM |
-   NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
+   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX;
 
dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO |
-   NETIF_F_RXCSUM | NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
+   NETIF_F_RXCSUM | NETIF_F_HW_VLAN_CTAG_TX | 
NETIF_F_HW_VLAN_CTAG_RX;
dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO |
NETIF_F_HIGHDMA;
 
if (tp->mac_version == RTL_GIGA_MAC_VER_05)
/* 8110SCd requires hardware Rx VLAN - disallow toggling */
-   dev->hw_features &= ~NETIF_F_HW_VLAN_RX;
+   dev->hw_features &= ~NETIF_F_HW_VLAN_CTAG_RX;
 
tp->intr_mask = 0x;
tp->hw_start = cfg->hw_start;
@@ -4217,7 +4216,7 @@ err_out_free_dev_1:
goto out;
 }
 
-static void __devexit rtl8169_remove_one(struct pci_dev *pdev)
+static void rtl8169_remove_one(struct pci_dev *pdev)
 {
struct net_device *dev = pci_get_drvdata(pdev);
struct rtl8169_private *tp = netdev_priv(dev);
@@ -6218,20 +6217,9 @@ static struct pci_driver rtl8169_pci_dri
.name   = MODULENAME,
.id_table   = rtl8169_pci_tbl,
.probe  = rtl8169_init_one,
-   .remove = __devexit_p(rtl8169_remove_one),
+   .remove = rtl8169_remove_one,
.shutdown   = rtl_shutdown,
.d

Re: [PATCH net] vrf: fix double free and memory corruption on register_netdevice failure

2015-11-21 Thread David Ahern

On 11/21/15 11:46 AM, Nikolay Aleksandrov wrote:

> From: Nikolay Aleksandrov
>
> When vrf's ->newlink is called, if register_netdevice() fails then it
> does free_netdev(), but that's also done by rtnl_newlink(), so a second
> free happens and memory gets corrupted. To reproduce, execute the
> following line a couple of times (1 - 5 is usually enough):
> $ for i in `seq 1 5`; do ip link add vrf: type vrf table 1; done;
> This works because we fail in register_netdevice() because of the wrong
> name "vrf:".


Acked-by: David Ahern 

Thanks, Nik.
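
The ownership problem described in the quoted commit message can be modelled in userspace roughly as follows. This is only a sketch: struct dev, model_free(), model_register() and model_rtnl_newlink() are hypothetical stand-ins, and the free is counted rather than performed so that the double free is observable instead of undefined behaviour.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a net device; only tracks how often it was freed. */
struct dev {
	int free_calls;
};

/* Model of free_netdev(): counts calls instead of releasing memory. */
static void model_free(struct dev *d)
{
	d->free_calls++;
}

/* Model of register_netdevice() failing on an invalid name such as "vrf:". */
static int model_register(struct dev *d)
{
	(void)d;
	return -1;
}

/* Buggy ->newlink: frees the device on failure although the caller will too. */
static int newlink_buggy(struct dev *d)
{
	int err = model_register(d);

	if (err)
		model_free(d);
	return err;
}

/* Fixed ->newlink: only reports the error, the caller owns the device. */
static int newlink_fixed(struct dev *d)
{
	return model_register(d);
}

/* Model of the rtnl_newlink() error path: frees the device on failure. */
static int model_rtnl_newlink(struct dev *d, int (*newlink)(struct dev *))
{
	int err = newlink(d);

	if (err)
		model_free(d);
	return err;
}
```

With the buggy callback the device is released twice on the failure path; with the fixed one it is released exactly once, by the caller.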
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] broadcom: fix PHY_ID_BCM5481 entry in the id table

2015-11-21 Thread Aaro Koskinen
Commit fcb26ec5b18d ("broadcom: move all PHY_ID's to header")
updated broadcom_tbl to use PHY_IDs, but incorrectly replaced 0x0143bca0
with PHY_ID_BCM5482 (making a duplicate entry, and completely omitting
the original). Fix that.

Fixes: fcb26ec5b18d ("broadcom: move all PHY_ID's to header")
Signed-off-by: Aaro Koskinen 
---
 drivers/net/phy/broadcom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 07a6119..3ce5d95 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -614,7 +614,7 @@ static struct mdio_device_id __maybe_unused broadcom_tbl[] 
= {
{ PHY_ID_BCM5461, 0xfff0 },
{ PHY_ID_BCM54616S, 0xfff0 },
{ PHY_ID_BCM5464, 0xfff0 },
-   { PHY_ID_BCM5482, 0xfff0 },
+   { PHY_ID_BCM5481, 0xfff0 },
{ PHY_ID_BCM5482, 0xfff0 },
{ PHY_ID_BCM50610, 0xfff0 },
{ PHY_ID_BCM50610M, 0xfff0 },
-- 
2.4.0
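
A rough userspace sketch of why the duplicate entry matters. The value 0x0143bca0 is the one quoted in the commit message; the BCM5482 value, the mask and the masked-compare helper table_matches() are assumptions made for illustration, not copied from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 0x0143bca0 is the value quoted in the commit message for BCM5481.
 * The BCM5482 value and the mask below are assumptions for illustration. */
#define ID_BCM5481 0x0143bca0u
#define ID_BCM5482 0x0143bcb0u
#define ID_MASK    0xfffffff0u

struct id_entry {
	uint32_t id;
	uint32_t mask;
};

/* Return 1 if some table entry matches phy_id under its mask, else 0. */
static int table_matches(const struct id_entry *tbl, size_t n, uint32_t phy_id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if ((phy_id & tbl[i].mask) == (tbl[i].id & tbl[i].mask))
			return 1;
	return 0;
}

/* Table as it looked with the bug: BCM5482 listed twice, BCM5481 missing. */
static const struct id_entry tbl_buggy[] = {
	{ ID_BCM5482, ID_MASK },
	{ ID_BCM5482, ID_MASK },
};

/* Table after the fix: one entry per PHY. */
static const struct id_entry tbl_fixed[] = {
	{ ID_BCM5481, ID_MASK },
	{ ID_BCM5482, ID_MASK },
};
```

Under these assumptions, a BCM5481 with any revision nibble matches nothing in the buggy table and matches the first entry in the fixed one.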



Re: [PATCH net-next] bpf: add show_fdinfo handler for maps

2015-11-21 Thread Alexei Starovoitov
On Fri, Nov 20, 2015 at 11:30:13AM +0100, Hannes Frederic Sowa wrote:
> Hi Alexei,
> 
> > If user space can see both 'count' and 'max_entries', it can be very
> > tempting to start assuming 'full' and 'empty' state of the map which will
> > lead to race conditions and bad design.
> > bpf programs and maps are inherently multi-threaded and concurrent.
> > If userapp wants to do the counting of elements it needs to do so on its
> > own and shoot itself in the foot eventually.
> > For the same reason I don't want to see BPF_MAP_GET_COUNT command.
> 
> Hmmm... I don't understand your argument. This is the same with memory
> management in general and we still report memory statistics to user
> space. I really would find it helpful to have a feeling if a map is
> nearly full or nearly empty.

memory is not the same, since it's a shared resource and knowledge
about consumption by the process gives no insight whether next malloc()
will succeed or not.

> We can also count collisions or the load in the buckets, but some
> evidence what is going on would be nice, wouldn't it?

Reporting collisions may be OK, since such stats are probably hard to
exploit, but security may become a concern in some use cases,
so it may not be such a good idea in the end.

In general, when user space has passed the kernel some numbers (like type,
element size, max_entries), it's OK to report them back via fdinfo.
Anything else I'd rather keep private.
Debugging in general should be done by debugging tools.
Like I often use kprobe+bpf to debug networking+bpf :)
Unfortunately kprobe+bpf doesn't work to debug itself, but then
regular tracing and kprobes come to the rescue.



Re: Kernel 4.1.12 crash

2015-11-21 Thread Alexander Duyck

On 11/21/2015 12:16 AM, Andrew wrote:
> Memory corruption, if it happens, IMHO shouldn't be hardware-related:
> almost all of these boxes, except the H61M-based box from the 1st log,
> have worked for a long time with uptimes of more than a year, and only
> the software was changed on them; the H61M-based box ran memtest86 for
> tens of hours without any error. If it were caused by hardware, they
> should have crashed even earlier.


I wasn't saying it was hardware related.  My thought is that it could be 
some sort of use after free or double free type issue. Basically what 
you end up with is the memory getting corrupted by software that is 
accessing regions it shouldn't be.


> Rarely, on different servers, I saw 'zram decompression error' messages
> (in this case I got such a message on the H61M-based box).
>
> Also, other people who use accel-ppp as BRAS software have
> different kernel panics/bugs/oopses on fresh kernels.
>
> I'll try to apply these patches, and I'll try to switch back to
> kernels that were stable on some boxes.


If you could bisect this it would be useful.  Basically we just need to 
determine where in the git history these issues started popping up so 
that we can then narrow down on the root cause.


- Alex


[PATCH] tipc: fix error handling of expanding buffer headroom

2015-11-21 Thread Ying Xue
Coverity says:

*** CID 1338065:  Error handling issues  (CHECKED_RETURN)
/net/tipc/udp_media.c: 162 in tipc_udp_send_msg()
156 struct udp_media_addr *dst = (struct udp_media_addr 
*)&dest->value;
157 struct udp_media_addr *src = (struct udp_media_addr 
*)&b->addr.value;
158 struct sk_buff *clone;
159 struct rtable *rt;
160
161 if (skb_headroom(skb) < UDP_MIN_HEADROOM)
>>> CID 1338065:  Error handling issues  (CHECKED_RETURN)
>>> Calling "pskb_expand_head" without checking return value (as is done 
>>> elsewhere 51 out of 56 times).
162 pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
163
164 clone = skb_clone(skb, GFP_ATOMIC);
165 skb_set_inner_protocol(clone, htons(ETH_P_TIPC));
166 ub = rcu_dereference_rtnl(b->media_ptr);
167 if (!ub) {

When expanding buffer headroom over udp tunnel with pskb_expand_head(),
it's unfortunate that we don't check its return value. As a result, if
the function returns an error code due to the lack of memory, it may
cause unpredictable consequences, as we unconditionally assume that
the call always succeeds.

Fixes: e53567948f82 ("tipc: conditionally expand buffer headroom over udp 
tunnel")
Reported-by: 
Cc: Stephen Hemminger 
Signed-off-by: Ying Xue 
---
 net/tipc/udp_media.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index ad2719a..f01ff16 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -158,8 +158,11 @@ static int tipc_udp_send_msg(struct net *net, struct 
sk_buff *skb,
struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value;
struct rtable *rt;
 
-   if (skb_headroom(skb) < UDP_MIN_HEADROOM)
-   pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
+   if (skb_headroom(skb) < UDP_MIN_HEADROOM) {
+   err = pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
+   if (err)
+   goto tx_error;
+   }
 
skb_set_inner_protocol(skb, htons(ETH_P_TIPC));
ub = rcu_dereference_rtnl(b->media_ptr);
-- 
1.7.9.5
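
The checked-return pattern that Coverity asks for can be sketched in userspace as below. pskb_expand_head() returns 0 on success and a negative errno on failure, so the error branch has to be taken on a non-zero value. struct pkt, expand_head(), send_pkt() and the MIN_HEADROOM value are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <errno.h>

#define MIN_HEADROOM 48 /* hypothetical value, not the real UDP_MIN_HEADROOM */

/* Hypothetical buffer model: tracks headroom only. */
struct pkt {
	int headroom;
};

/* Model of pskb_expand_head(): 0 on success, negative errno on failure.
 * 'oom' simulates an allocation failure. */
static int expand_head(struct pkt *p, int extra, int oom)
{
	if (oom)
		return -ENOMEM;
	p->headroom += extra;
	return 0;
}

/* Send path: bail out when the expansion fails, continue when it returns 0. */
static int send_pkt(struct pkt *p, int oom)
{
	int err;

	if (p->headroom < MIN_HEADROOM) {
		err = expand_head(p, MIN_HEADROOM, oom);
		if (err)
			return err; /* corresponds to "goto tx_error" */
	}
	/* From here on the headroom is known to be sufficient. */
	return 0;
}
```

On simulated allocation failure the packet is left untouched and the error is propagated; on success the send path continues with enough headroom.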



[PATCH] tipc: avoid packets leaking on socket receive queue

2015-11-21 Thread Ying Xue
Even if we drain the receive queue thoroughly in tipc_release() after the
tipc socket is removed from the rhashtable, it is possible that some packets
are in flight because some CPU runs the receiver and did its rhashtable
lookup before we removed the socket. They will reach the receive queue, but
nobody will delete them. To avoid this leak, we register a private socket
destructor to purge the receive queue, meaning that releasing packets pending
on the receive queue is delayed until the last reference to the tipc
socket is released.

Signed-off-by: Ying Xue 
---
 net/tipc/socket.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 552dbab..b53246f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -105,6 +105,7 @@ struct tipc_sock {
 static int tipc_backlog_rcv(struct sock *sk, struct sk_buff *skb);
 static void tipc_data_ready(struct sock *sk);
 static void tipc_write_space(struct sock *sk);
+static void tipc_sock_destruct(struct sock *sk);
 static int tipc_release(struct socket *sock);
 static int tipc_accept(struct socket *sock, struct socket *new_sock, int 
flags);
 static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p);
@@ -381,6 +382,7 @@ static int tipc_sk_create(struct net *net, struct socket 
*sock,
sk->sk_rcvbuf = sysctl_tipc_rmem[1];
sk->sk_data_ready = tipc_data_ready;
sk->sk_write_space = tipc_write_space;
+   sk->sk_destruct = tipc_sock_destruct;
tsk->conn_timeout = CONN_TIMEOUT_DEFAULT;
tsk->sent_unacked = 0;
atomic_set(&tsk->dupl_rcvcnt, 0);
@@ -470,9 +472,6 @@ static int tipc_release(struct socket *sock)
tipc_node_remove_conn(net, dnode, tsk->portid);
}
 
-   /* Discard any remaining (connection-based) messages in receive queue */
-   __skb_queue_purge(&sk->sk_receive_queue);
-
/* Reject any messages that accumulated in backlog queue */
sock->state = SS_DISCONNECTING;
release_sock(sk);
@@ -1515,6 +1514,11 @@ static void tipc_data_ready(struct sock *sk)
rcu_read_unlock();
 }
 
+static void tipc_sock_destruct(struct sock *sk)
+{
+   __skb_queue_purge(&sk->sk_receive_queue);
+}
+
 /**
  * filter_connect - Handle all incoming messages for a connection-based socket
  * @tsk: TIPC socket
-- 
1.7.9.5
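
The idea of purging the queue from a destructor, rather than from the release path, can be modelled with a single-threaded sketch. struct sock_model and its helpers are hypothetical; the real code relies on the socket reference count and sk->sk_destruct, and real concurrency is not modelled here.

```c
#include <assert.h>

/* Hypothetical socket model: a reference count and a receive-queue length. */
struct sock_model {
	int refcnt;
	int rcvq_len;
	int destructed;
};

/* Destructor: purge whatever is still queued. */
static void sock_destruct(struct sock_model *sk)
{
	sk->rcvq_len = 0;
	sk->destructed = 1;
}

static void sock_hold_ref(struct sock_model *sk)
{
	sk->refcnt++;
}

/* Drop one reference; run the destructor when the last one goes away. */
static void sock_put_ref(struct sock_model *sk)
{
	if (--sk->refcnt == 0)
		sock_destruct(sk);
}

/* A receiver that looked the socket up earlier enqueues one packet. */
static void late_enqueue(struct sock_model *sk)
{
	sk->rcvq_len++;
}
```

A packet enqueued after the release path has dropped its reference is still purged, because the purge only runs once the receiver drops the last reference.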



[PATCH 00/13] mvneta Buffer Management and enhancements

2015-11-21 Thread Marcin Wojtas
Hi,

Hereby I submit a patchset that introduces various fixes and support
for new features and enhancements to the mvneta driver:

1. First three patches are minimal fixes, stable-CC'ed.

2. Suspend to ram ('s2ram') support. Due to some stability problems
Thomas Petazzoni's patches did not get merged yet, but I used them for
verification. Contrary to wfi mode ('standby' - linux does not
differentiate between them, so same routines are used) all registers'
contents are lost due to power down, so the configuration has to be
fully reconstructed during resume.

3. Optimisations - concatenating TX descriptors' flush, basing on
xmit_more support and combined approach for finalizing egress processing.
Thanks to HR timer buffers can be released with small latency, which is
good for low transfer and small queues. Along with the timer, coalescing
irqs are used, whose threshold could be increased back to 15.

4. Buffer manager (BM) support with two preparatory commits. As it is a
separate block, common for all network ports, a new driver is introduced,
which configures it and exposes API to the main network driver. It is
thoroughly described in binding documentation and commit log. Please note,
that enabling per-port BM usage is done using phandle and the data passed
in mvneta_bm_probe. It is designed for usage of on-demand device probe
and dev_set/get_drvdata, however it's awaiting merge to linux-next.
Therefore, deferring probe is not used - if something goes wrong (same
in case of errors during changing MTU or suspend/resume cycle) mvneta
driver falls back to software buffer management and works in a regular way.

Known issues:
- problems with obtaining all mapped buffers from internal SRAM, when
destroying the buffer pointer pool
- problems with unmapping chunk of SRAM during driver removal
Above do not have an impact on the operation, as they are called during
driver removal or in error path.

5. Enable BM on Armada XP and 38X development boards - those ones and
A370 I could check on my own. In all cases they survived night-long
linerate iperf. Also tests were performed with A388 SoC working as a
network bridge between two packet generators. They showed an increase of
maximum processed 64B packets by ~20k (~555k packets with BM enabled
vs ~535k packets without BM). Also, when pushing 1500B packets at
line rate, CPU load decreased from around 25% without BM to
18-20% with BM.

I'm looking forward to any remarks and comments.

Best regards,
Marcin Wojtas

Marcin Wojtas (12):
  net: mvneta: add configuration for MBUS windows access protection
  net: mvneta: enable IP checksum with jumbo frames for Armada 38x on
Port0
  net: mvneta: fix bit assignment in MVNETA_RXQ_CONFIG_REG
  net: mvneta: enable suspend/resume support
  net: mvneta: enable mixed egress processing using HR timer
  bus: mvebu-mbus: provide api for obtaining IO and DRAM window
information
  ARM: mvebu: enable SRAM support in mvebu_v7_defconfig
  net: mvneta: bm: add support for hardware buffer management
  ARM: mvebu: add buffer manager nodes to armada-38x.dtsi
  ARM: mvebu: enable buffer manager support on Armada 38x boards
  ARM: mvebu: add buffer manager nodes to armada-xp.dtsi
  ARM: mvebu: enable buffer manager support on Armada XP boards

Simon Guinot (1):
  net: mvneta: add xmit_more support

 .../bindings/net/marvell-armada-370-neta.txt   |  19 +-
 .../devicetree/bindings/net/marvell-neta-bm.txt|  49 ++
 arch/arm/boot/dts/armada-385-db-ap.dts |  20 +-
 arch/arm/boot/dts/armada-388-db.dts|  17 +-
 arch/arm/boot/dts/armada-388-gp.dts|  17 +-
 arch/arm/boot/dts/armada-38x.dtsi  |  20 +-
 arch/arm/boot/dts/armada-xp-db.dts |  19 +-
 arch/arm/boot/dts/armada-xp-gp.dts |  19 +-
 arch/arm/boot/dts/armada-xp.dtsi   |  18 +
 arch/arm/configs/mvebu_v7_defconfig|   1 +
 drivers/bus/mvebu-mbus.c   |  51 ++
 drivers/net/ethernet/marvell/Kconfig   |  14 +
 drivers/net/ethernet/marvell/Makefile  |   1 +
 drivers/net/ethernet/marvell/mvneta.c  | 660 +++--
 drivers/net/ethernet/marvell/mvneta_bm.c   | 642 
 drivers/net/ethernet/marvell/mvneta_bm.h   | 171 ++
 include/linux/mbus.h   |   3 +
 17 files changed, 1677 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/marvell-neta-bm.txt
 create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.c
 create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.h

-- 
1.8.3.1



[PATCH 04/13] net: mvneta: enable suspend/resume support

2015-11-21 Thread Marcin Wojtas
This commit introduces suspend/resume routines used in both 'standby'
and 'mem' modes. For the latter, in which registers' contents are lost,
the following steps are performed:
* in suspend - update port statistics and, if interface is running,
  detach netif, clean the queues, disable cpu notifier, shutdown
  interface and reset port's link status;
* in resume, for all interfaces, set default configuration of the port and
  MBUS windows;
* in resume, in case the interface is running, enable accepting packets in
  legacy parser, power up the port, register cpu notifier and attach netif.

Signed-off-by: Marcin Wojtas 
---
 drivers/net/ethernet/marvell/mvneta.c | 70 +++
 1 file changed, 70 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index d12b8c6..f079b13 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3442,6 +3442,72 @@ static int mvneta_remove(struct platform_device *pdev)
return 0;
 }
 
+#ifdef CONFIG_PM_SLEEP
+static int mvneta_suspend(struct platform_device *pdev, pm_message_t state)
+{
+   struct net_device *dev = platform_get_drvdata(pdev);
+   struct mvneta_port *pp = netdev_priv(dev);
+
+   mvneta_ethtool_update_stats(pp);
+
+   if (!netif_running(dev))
+   return 0;
+
+   netif_device_detach(dev);
+
+   mvneta_stop_dev(pp);
+   unregister_cpu_notifier(&pp->cpu_notifier);
+   mvneta_cleanup_rxqs(pp);
+   mvneta_cleanup_txqs(pp);
+
+   /* Reset link status */
+   pp->link = 0;
+   pp->duplex = -1;
+   pp->speed = 0;
+
+   return 0;
+}
+
+static int mvneta_resume(struct platform_device *pdev)
+{
+   const struct mbus_dram_target_info *dram_target_info;
+   struct net_device *dev = platform_get_drvdata(pdev);
+   struct mvneta_port *pp = netdev_priv(dev);
+   int ret;
+
+   mvneta_defaults_set(pp);
+   mvneta_port_power_up(pp, pp->phy_interface);
+
+   dram_target_info = mv_mbus_dram_info();
+   if (dram_target_info)
+   mvneta_conf_mbus_windows(pp, dram_target_info);
+
+   if (!netif_running(dev))
+   return 0;
+
+   ret = mvneta_setup_rxqs(pp);
+   if (ret) {
+   netdev_err(dev, "unable to setup rxqs after resume\n");
+   return ret;
+   }
+
+   ret = mvneta_setup_txqs(pp);
+   if (ret) {
+   netdev_err(dev, "unable to setup txqs after resume\n");
+   return ret;
+   }
+
+   mvneta_set_rx_mode(dev);
+   mvneta_percpu_elect(pp);
+   register_cpu_notifier(&pp->cpu_notifier);
+   mvneta_start_dev(pp);
+
+   netif_device_attach(dev);
+
+   return 0;
+}
+#endif /* CONFIG_PM_SLEEP */
+
 static const struct of_device_id mvneta_match[] = {
{ .compatible = "marvell,armada-370-neta" },
{ .compatible = "marvell,armada-xp-neta" },
@@ -3452,6 +3518,10 @@ MODULE_DEVICE_TABLE(of, mvneta_match);
 static struct platform_driver mvneta_driver = {
.probe = mvneta_probe,
.remove = mvneta_remove,
+#ifdef CONFIG_PM_SLEEP
+   .suspend = mvneta_suspend,
+   .resume = mvneta_resume,
+#endif
.driver = {
.name = MVNETA_DRIVER_NAME,
.of_match_table = mvneta_match,
-- 
1.8.3.1
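
The ordering described in the commit message can be summarised with a small userspace model. struct port_model and its fields are hypothetical and only capture which steps depend on the interface being up; none of this is driver code.

```c
#include <assert.h>

/* Hypothetical port model; field names are illustrative only. */
struct port_model {
	int running;       /* interface was up (netif_running) */
	int attached;      /* netif attached to the stack */
	int hw_configured; /* register contents valid */
	int link;          /* cached link status */
};

/* Power loss in 'mem' mode wipes the registers. */
static void power_down(struct port_model *p)
{
	p->hw_configured = 0;
}

/* Suspend: nothing more to do for an interface that is down. */
static int model_suspend(struct port_model *p)
{
	if (!p->running)
		return 0;
	p->attached = 0; /* detach netif, stop port, clean queues */
	p->link = 0;     /* reset cached link status */
	return 0;
}

/* Resume: defaults and MBUS windows for every interface,
 * queues and netif attach only for running ones. */
static int model_resume(struct port_model *p)
{
	p->hw_configured = 1;
	if (!p->running)
		return 0;
	p->attached = 1;
	return 0;
}
```

The point of the split is that a port which was down still gets its default configuration back after power loss, while only running ports are detached and re-attached.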



[PATCH 12/13] ARM: mvebu: add buffer manager nodes to armada-xp.dtsi

2015-11-21 Thread Marcin Wojtas
Armada XP network controller supports hardware buffer management (BM).
Since it is now enabled in mvneta driver, appropriate nodes can be added
to armada-xp.dtsi - for the actual common BM unit (bm@c) and its
internal SRAM (bm-bppi), which is used for indirect access to buffer
pointer ring residing in DRAM.

Pools - ports mapping, bm-bppi entry in 'soc' node's ranges and optional
parameters are supposed to be set in board files.

Signed-off-by: Marcin Wojtas 
---
 arch/arm/boot/dts/armada-xp.dtsi | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm/boot/dts/armada-xp.dtsi b/arch/arm/boot/dts/armada-xp.dtsi
index be23196..bd45936 100644
--- a/arch/arm/boot/dts/armada-xp.dtsi
+++ b/arch/arm/boot/dts/armada-xp.dtsi
@@ -253,6 +253,14 @@
marvell,crypto-sram-size = <0x800>;
};
 
+   bm: bm@c {
+   compatible = "marvell,armada-380-neta-bm";
+   reg = <0xc 0xac>;
+   clocks = <&gateclk 13>;
+   internal-mem = <&bm_bppi>;
+   status = "disabled";
+   };
+
xor@f0900 {
compatible = "marvell,orion-xor";
reg = <0xF0900 0x100
@@ -291,6 +299,16 @@
#size-cells = <1>;
ranges = <0 MBUS_ID(0x09, 0x05) 0 0x800>;
};
+
+   bm_bppi: bm-bppi {
+   compatible = "mmio-sram";
+   reg = ;
+   ranges = <0 MBUS_ID(0x0c, 0x04) 0 0x10>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   clocks = <&gateclk 13>;
+   status = "disabled";
+   };
};
 
clocks {
-- 
1.8.3.1



[PATCH 11/13] ARM: mvebu: enable buffer manager support on Armada 38x boards

2015-11-21 Thread Marcin Wojtas
Since mvneta driver supports using hardware buffer management (BM), in
order to use it, board files have to be adjusted accordingly. This commit
enables BM on:
* A385-DB-AP - each port has its own pool for long and common pool for
short packets,
* A388-DB - to each port unique 'short' and 'long' pools are mapped,
* A388-GP - same as above.

Moreover appropriate entry is added to 'soc' node ranges, as well as "okay"
status for 'bm' and 'bm-bppi' (internal SRAM) nodes.

Signed-off-by: Marcin Wojtas 
---
 arch/arm/boot/dts/armada-385-db-ap.dts | 20 +++-
 arch/arm/boot/dts/armada-388-db.dts| 17 -
 arch/arm/boot/dts/armada-388-gp.dts| 17 -
 3 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/arch/arm/boot/dts/armada-385-db-ap.dts 
b/arch/arm/boot/dts/armada-385-db-ap.dts
index acd5b15..5f9451b 100644
--- a/arch/arm/boot/dts/armada-385-db-ap.dts
+++ b/arch/arm/boot/dts/armada-385-db-ap.dts
@@ -61,7 +61,8 @@
ranges = ;
+ MBUS_ID(0x09, 0x15) 0 0xf111 0x1
+ MBUS_ID(0x0c, 0x04) 0 0xf120 0x10>;
 
internal-regs {
spi1: spi@10680 {
@@ -138,12 +139,18 @@
status = "okay";
phy = <&phy2>;
phy-mode = "sgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <1>;
+   bm,pool-short = <3>;
};
 
ethernet@34000 {
status = "okay";
phy = <&phy1>;
phy-mode = "sgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <2>;
+   bm,pool-short = <3>;
};
 
ethernet@7 {
@@ -157,6 +164,13 @@
status = "okay";
phy = <&phy0>;
phy-mode = "rgmii-id";
+   buffer-manager = <&bm>;
+   bm,pool-long = <0>;
+   bm,pool-short = <3>;
+   };
+
+   bm@c8000 {
+   status = "okay";
};
 
nfc: flash@d {
@@ -178,6 +192,10 @@
};
};
 
+   bm-bppi {
+   status = "okay";
+   };
+
pcie-controller {
status = "okay";
 
diff --git a/arch/arm/boot/dts/armada-388-db.dts 
b/arch/arm/boot/dts/armada-388-db.dts
index ff47af5..ea93ed7 100644
--- a/arch/arm/boot/dts/armada-388-db.dts
+++ b/arch/arm/boot/dts/armada-388-db.dts
@@ -66,7 +66,8 @@
ranges = ;
+ MBUS_ID(0x09, 0x15) 0 0xf111 0x1
+ MBUS_ID(0x0c, 0x04) 0 0xf120 0x10>;
 
internal-regs {
spi@10600 {
@@ -99,6 +100,9 @@
status = "okay";
phy = <&phy1>;
phy-mode = "rgmii-id";
+   buffer-manager = <&bm>;
+   bm,pool-long = <2>;
+   bm,pool-short = <3>;
};
 
usb@58000 {
@@ -109,6 +113,9 @@
status = "okay";
phy = <&phy0>;
phy-mode = "rgmii-id";
+   buffer-manager = <&bm>;
+   bm,pool-long = <0>;
+   bm,pool-short = <1>;
};
 
mdio@72004 {
@@ -129,6 +136,10 @@
status = "okay";
};
 
+   bm@c8000 {
+   status = "okay";
+   };
+
flash@d {
status = "okay";
num-cs = <1>;
@@ -169,6 +180,10 @@
};
};
 
+   bm-bppi {
+   status = "okay";
+   };
+
pcie-controller {
status = "okay";
/*
diff --git a/arch/arm/boot/dts/armada-388-gp.dts 
b/arch/arm/boot/dts/armada-388-gp.dts
index a633be3..0a3bd7f 100644
--- a/arch/arm/boot/dts/armada-388-gp.dts
+++ b/arch/arm/boot/dts/armada-388-gp.dts
@@ -60,7 +60,8 @@
ranges = ;
+ MBUS_ID(0x09, 0x15) 0 0xf111 0x1
+ MBUS_ID(0x0c, 0x04) 0 0xf120 0x10>;
 
  

[PATCH 09/13] net: mvneta: bm: add support for hardware buffer management

2015-11-21 Thread Marcin Wojtas
Buffer manager (BM) is a dedicated hardware unit that can be used by all
ethernet ports of Armada XP and 38x SoCs. It offloads the CPU on the RX
path by sparing DRAM accesses when refilling the buffer pool, by filling
descriptor ring data in hardware, and by improving memory utilization
through HW arbitration that uses 'short' pools for small packets.

Tests performed with A388 SoC working as a network bridge between two
packet generators showed an increase of maximum processed 64B packets by
~20k (~555k packets with BM enabled vs ~535k packets without BM). Also,
when pushing 1500B packets at line rate, CPU load decreased
from around 25% without BM to 20% with BM.

BM comprises up to 4 buffer pointer (BP) rings kept in DRAM, which
are called external BP pools - BPPE. Allocating and releasing buffer
pointers (BP) to/from BPPE is performed indirectly by write/read access
to a dedicated internal SRAM, where internal BP pools (BPPI) are placed.
BM hardware controls status of BPPE automatically, as well as assigning
proper buffers to RX descriptors. For more details please refer to
Functional Specification of Armada XP or 38x SoC.

In order to enable support for a separate hardware block, common for all
ports, a new driver has to be implemented ('mvneta_bm'). It provides
initialization sequence of address space, clocks, registers, SRAM,
empty pools' structures and also obtaining optional configuration
from DT (please refer to device tree binding documentation). mvneta_bm
exposes also a necessary API to mvneta driver, as well as a dedicated
structure with BM information (bm_priv), whose presence is used as a
flag notifying of BM usage by port. It has to be ensured that mvneta_bm
probe is executed prior to the ones in ports' driver. In case BM is not
used or its probe fails, mvneta falls back to use software buffer
management.

A sequence executed in mvneta_probe function is modified in order to have
an access to needed resources before possible port's BM initialization is
done. According to port-pools mapping provided by DT appropriate registers
are configured and the buffer pools are filled. RX path is modified
accordingly. Because the hardware allows a wide variety of configuration
options, following assumptions are made:
* using BM mechanisms can be selectively disabled/enabled basing
  on DT configuration among the ports
* 'long' pool's single buffer size is tied to port's MTU
* using 'long' pool by port is obligatory and it cannot be shared
* using 'short' pool for smaller packets is optional
* one 'short' pool can be shared among all ports

This commit enables hardware buffer management operation cooperating with
existing mvneta driver. New device tree binding documentation is added and
the one of mvneta is updated accordingly.
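
As an illustration of the pool assumptions listed above, pool selection per packet could look roughly like this. struct port_pools, POOL_NONE and pick_pool() are hypothetical names; in particular, whether a packet exactly at the threshold goes to the 'short' or the 'long' pool is an assumption of this sketch, and in the real setup the choice is made by hardware.

```c
#include <assert.h>

#define POOL_NONE (-1) /* hypothetical marker: no 'short' pool configured */

/* Hypothetical per-port pool mapping, mirroring the listed assumptions. */
struct port_pools {
	int pool_long;       /* obligatory */
	int pool_short;      /* optional, POOL_NONE when unset */
	int short_threshold; /* packets up to this size use the 'short' pool */
};

/* Pick the pool ID for a packet of the given size. */
static int pick_pool(const struct port_pools *pp, int pkt_size)
{
	if (pp->pool_short != POOL_NONE && pkt_size <= pp->short_threshold)
		return pp->pool_short;
	return pp->pool_long;
}
```

A port with only a 'long' pool sends every packet there, while a port with both pools diverts small packets to the (possibly shared) 'short' pool.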

Signed-off-by: Marcin Wojtas 
---
 .../bindings/net/marvell-armada-370-neta.txt   |  19 +-
 .../devicetree/bindings/net/marvell-neta-bm.txt|  49 ++
 drivers/net/ethernet/marvell/Kconfig   |  14 +
 drivers/net/ethernet/marvell/Makefile  |   1 +
 drivers/net/ethernet/marvell/mvneta.c  | 493 ++--
 drivers/net/ethernet/marvell/mvneta_bm.c   | 642 +
 drivers/net/ethernet/marvell/mvneta_bm.h   | 171 ++
 7 files changed, 1335 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/marvell-neta-bm.txt
 create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.c
 create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.h

diff --git a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt 
b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
index f5a8ca2..ea1aae2 100644
--- a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
+++ b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
@@ -8,14 +8,29 @@ Required properties:
 - phy-mode: See ethernet.txt file in the same directory
 - clocks: a pointer to the reference clock for this device.
 
+Optional properties (valid only for Armada XP/38x):
+
+- buffer-manager: a phandle to a buffer manager node. Please refer to
+  Documentation/devicetree/bindings/net/marvell-neta-bm.txt
+- bm,pool-long: ID of a pool, that will accept all packets of a size
+  higher than 'short' pool's threshold (if set) and up to MTU value.
+  Obligatory, when the port is supposed to use hardware
+  buffer management.
+- bm,pool-short: ID of a pool, that will be used for accepting
+  packets of a size lower than given threshold. If not set, the port
+  will use a single 'long' pool for all packets, as defined above.
+
 Example:
 
-ethernet@d007 {
+ethernet@7 {
compatible = "marvell,armada-370-neta";
-   reg = <0xd007 0x2500>;
+   reg = <0x7 0x2500>;
interrupts = <8>;
clocks = <&gate_clk 4>;
status = "okay";
phy = <&phy0>;
phy-mode = "rgmii-id";
+   buffer-manager = <&bm>;
+   bm,pool-long = <0>;
+   bm,pool-s

[PATCH 13/13] ARM: mvebu: enable buffer manager support on Armada XP boards

2015-11-21 Thread Marcin Wojtas
Since the mvneta driver supports hardware buffer management (BM), board
files have to be adjusted accordingly in order to use it. This commit
enables BM on AXP-DB and AXP-GP in the same manner: because the number of
ports on those boards equals the number of available pools, each port is
supposed to use a single pool for all kinds of packets.

Moreover, an appropriate entry is added to the 'soc' node ranges, and the
'bm' and 'bm-bppi' (internal SRAM) nodes get "okay" status.
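
As a rough illustration of the one-pool-per-port layout, the sketch below
checks that every port names a valid 'long' pool and that no pool is used
twice. The function name and BM_MODEL_POOLS are hypothetical; the pool count
of 4 is an assumption inferred from the four ethernet nodes in this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed pool count, inferred from the four ethernet nodes in the diff */
#define BM_MODEL_POOLS 4

/* Return true if every port names a valid 'long' pool and no two ports
 * share one. pool_long[i] is the bm,pool-long value of port i.
 */
bool bm_model_long_pools_valid(const int *pool_long, size_t num_ports)
{
	bool used[BM_MODEL_POOLS] = { false };

	assert(pool_long != NULL);

	for (size_t i = 0; i < num_ports; i++) {
		int id = pool_long[i];

		if (id < 0 || id >= BM_MODEL_POOLS || used[id])
			return false;
		used[id] = true;
	}
	return true;
}
```

For the mapping in this patch (ports 0..3 using pools 0..3) the check
passes; giving two ports the same ID makes it fail.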

Signed-off-by: Marcin Wojtas 
---
 arch/arm/boot/dts/armada-xp-db.dts | 19 ++-
 arch/arm/boot/dts/armada-xp-gp.dts | 19 ++-
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/armada-xp-db.dts 
b/arch/arm/boot/dts/armada-xp-db.dts
index f774101..3065730 100644
--- a/arch/arm/boot/dts/armada-xp-db.dts
+++ b/arch/arm/boot/dts/armada-xp-db.dts
@@ -77,7 +77,8 @@
  MBUS_ID(0x01, 0x1d) 0 0 0xfff0 0x10
  MBUS_ID(0x01, 0x2f) 0 0 0xf000 0x100
  MBUS_ID(0x09, 0x09) 0 0 0xf810 0x1
- MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1>;
+ MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1
+ MBUS_ID(0x0c, 0x04) 0 0 0xf120 0x10>;
 
devbus-bootcs {
status = "okay";
@@ -181,21 +182,33 @@
status = "okay";
phy = <&phy0>;
phy-mode = "rgmii-id";
+   buffer-manager = <&bm>;
+   bm,pool-long = <0>;
};
ethernet@74000 {
status = "okay";
phy = <&phy1>;
phy-mode = "rgmii-id";
+   buffer-manager = <&bm>;
+   bm,pool-long = <1>;
};
ethernet@3 {
status = "okay";
phy = <&phy2>;
phy-mode = "sgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <2>;
};
ethernet@34000 {
status = "okay";
phy = <&phy3>;
phy-mode = "sgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <3>;
+   };
+
+   bm@c {
+   status = "okay";
};
 
mvsdio@d4000 {
@@ -230,5 +243,9 @@
};
};
};
+
+   bm-bppi {
+   status = "okay";
+   };
};
 };
diff --git a/arch/arm/boot/dts/armada-xp-gp.dts 
b/arch/arm/boot/dts/armada-xp-gp.dts
index 4878d73..a1ded01 100644
--- a/arch/arm/boot/dts/armada-xp-gp.dts
+++ b/arch/arm/boot/dts/armada-xp-gp.dts
@@ -96,7 +96,8 @@
  MBUS_ID(0x01, 0x1d) 0 0 0xfff0 0x10
  MBUS_ID(0x01, 0x2f) 0 0 0xf000 0x100
  MBUS_ID(0x09, 0x09) 0 0 0xf810 0x1
- MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1>;
+ MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1
+ MBUS_ID(0x0c, 0x04) 0 0 0xf120 0x10>;
 
devbus-bootcs {
status = "okay";
@@ -196,21 +197,29 @@
status = "okay";
phy = <&phy0>;
phy-mode = "qsgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <0>;
};
ethernet@74000 {
status = "okay";
phy = <&phy1>;
phy-mode = "qsgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <1>;
};
ethernet@3 {
status = "okay";
phy = <&phy2>;
phy-mode = "qsgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <2>;
};
ethernet@34000 {
status = "okay";
phy = <&phy3>;
phy-mode = "qsgmii";
+   buffer-manager = <&bm>;
+   bm,pool-long = <3>

[PATCH 05/13] net: mvneta: add xmit_more support

2015-11-21 Thread Marcin Wojtas
From: Simon Guinot 

Based on the xmit_more flag of the skb, TX descriptors can be batched
before flushing. This commit delays the TX descriptor flush if the queue is
running and there are more skbs to send.
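
The deferral logic can be modelled in userspace as follows. This is only a
sketch: struct txq_model and txq_model_xmit() are hypothetical, and the
hw_doorbell counter merely stands in for the MVNETA_TXQ_UPDATE_REG write
done by the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a TX queue; hw_doorbell stands in for the
 * register write and is purely illustrative.
 */
struct txq_model {
	int pending;     /* descriptors queued but not yet announced */
	int hw_doorbell; /* total descriptors announced to "hardware" */
	int flushes;     /* number of doorbell writes */
};

/* Queue 'frags' descriptors. Flush unless more skbs are coming and the
 * queue is still running. Returns true when a flush happened.
 */
bool txq_model_xmit(struct txq_model *q, int frags, bool xmit_more,
		    bool queue_stopped)
{
	assert(q != NULL && frags > 0);

	if (!xmit_more || queue_stopped) {
		/* Announce the new descriptors plus everything deferred */
		q->hw_doorbell += frags + q->pending;
		q->pending = 0;
		q->flushes++;
		return true;
	}
	q->pending += frags;
	return false;
}
```

Three calls with 2, 3 and 1 descriptors, where only the last one has
xmit_more cleared, result in a single doorbell write covering all six
descriptors. The model ignores the limit of 255 descriptors per write that
the comment in mvneta_txq_pend_desc_add mentions.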

Signed-off-by: Simon Guinot 
---
 drivers/net/ethernet/marvell/mvneta.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index f079b13..9c9e858 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -467,6 +467,7 @@ struct mvneta_tx_queue {
 * descriptor ring
 */
int count;
+   int pending;
int tx_stop_threshold;
int tx_wake_threshold;
 
@@ -751,8 +752,9 @@ static void mvneta_txq_pend_desc_add(struct mvneta_port *pp,
/* Only 255 descriptors can be added at once ; Assume caller
 * process TX desriptors in quanta less than 256
 */
-   val = pend_desc;
+   val = pend_desc + txq->pending;
mvreg_write(pp, MVNETA_TXQ_UPDATE_REG(txq->id), val);
+   txq->pending = 0;
 }
 
 /* Get pointer to next TX descriptor to be processed (send) by HW */
@@ -1857,11 +1859,14 @@ out:
struct netdev_queue *nq = netdev_get_tx_queue(dev, txq_id);
 
txq->count += frags;
-   mvneta_txq_pend_desc_add(pp, txq, frags);
-
if (txq->count >= txq->tx_stop_threshold)
netif_tx_stop_queue(nq);
 
+   if (!skb->xmit_more || netif_xmit_stopped(nq))
+   mvneta_txq_pend_desc_add(pp, txq, frags);
+   else
+   txq->pending += frags;
+
u64_stats_update_begin(&stats->syncp);
stats->tx_packets++;
stats->tx_bytes  += len;
-- 
1.8.3.1


[PATCH 07/13] bus: mvebu-mbus: provide api for obtaining IO and DRAM window information

2015-11-21 Thread Marcin Wojtas
This commit enables finding the appropriate mbus window for a given
physical address and obtaining its target ID and attribute. Two separate
routines are provided, one for IO and one for DRAM windows. This
functionality is needed for the Armada XP/38x Network Controller's Buffer
Manager and PnC configuration.
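
A minimal userspace sketch of such a lookup is shown below. It is not the
kernel implementation: struct win_model and win_model_lookup() are
hypothetical, and the sketch assumes each window covers the half-open range
[base, base + size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative window descriptor; not the kernel's
 * struct mbus_dram_window.
 */
struct win_model {
	uint64_t base;
	uint64_t size; /* in bytes, non-zero */
	uint8_t target;
	uint8_t attr;
};

/* Find the window containing addr. On success store target/attr and
 * return the window index; return -1 if no window matches.
 */
int win_model_lookup(const struct win_model *wins, size_t n, uint64_t addr,
		     uint8_t *target, uint8_t *attr)
{
	assert(wins != NULL && target != NULL && attr != NULL);

	for (size_t i = 0; i < n; i++) {
		/* addr - base < size avoids overflow in base + size */
		if (addr >= wins[i].base &&
		    addr - wins[i].base < wins[i].size) {
			*target = wins[i].target;
			*attr = wins[i].attr;
			return (int)i;
		}
	}
	return -1;
}
```

With two adjacent 4 KiB windows starting at 0x0 and 0x1000, address 0x1000
resolves to the second window only, and 0x2000 matches none.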

Signed-off-by: Marcin Wojtas 

[DRAM window information reference in LKv3.10]
Signed-off-by: Evan Wang 
---
 drivers/bus/mvebu-mbus.c | 51 
 include/linux/mbus.h |  3 +++
 2 files changed, 54 insertions(+)

diff --git a/drivers/bus/mvebu-mbus.c b/drivers/bus/mvebu-mbus.c
index c43c3d2..3d1c0c3 100644
--- a/drivers/bus/mvebu-mbus.c
+++ b/drivers/bus/mvebu-mbus.c
@@ -948,6 +948,57 @@ void mvebu_mbus_get_pcie_io_aperture(struct resource *res)
*res = mbus_state.pcie_io_aperture;
 }
 
+int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 *target, u8 *attr)
+{
+   const struct mbus_dram_target_info *dram;
+   int i;
+
+   /* Get dram info */
+   dram = mv_mbus_dram_info();
+   if (!dram) {
+   pr_err("missing DRAM information\n");
+   return -ENODEV;
+   }
+
+   /* Try to find matching DRAM window for phyaddr */
+   for (i = 0; i < dram->num_cs; i++) {
+   const struct mbus_dram_window *cs = dram->cs + i;
+
+   if (cs->base <= phyaddr && phyaddr <= (cs->base + cs->size - 1)) {
+   *target = dram->mbus_dram_target_id;
+   *attr = cs->mbus_attr;
+   return 0;
+   }
+   }
+
+   pr_err("invalid dram address %pa\n", &phyaddr);
+   return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(mvebu_mbus_get_dram_win_info);
+
+int mvebu_mbus_get_io_win_info(phys_addr_t phyaddr, u32 *size, u8 *target,
+  u8 *attr)
+{
+   int win;
+
+   for (win = 0; win < mbus_state.soc->num_wins; win++) {
+   u64 wbase;
+   int enabled;
+
+   mvebu_mbus_read_window(&mbus_state, win, &enabled, &wbase,
+  size, target, attr, NULL);
+
+   if (!enabled)
+   continue;
+
+   if (wbase <= phyaddr && phyaddr < wbase + *size)
+   return win;
+   }
+
+   return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(mvebu_mbus_get_io_win_info);
+
 static __init int mvebu_mbus_debugfs_init(void)
 {
struct mvebu_mbus_state *s = &mbus_state;
diff --git a/include/linux/mbus.h b/include/linux/mbus.h
index 1f7bc63..ea34a86 100644
--- a/include/linux/mbus.h
+++ b/include/linux/mbus.h
@@ -69,6 +69,9 @@ static inline const struct mbus_dram_target_info 
*mv_mbus_dram_info_nooverlap(vo
 int mvebu_mbus_save_cpu_target(u32 *store_addr);
 void mvebu_mbus_get_pcie_mem_aperture(struct resource *res);
 void mvebu_mbus_get_pcie_io_aperture(struct resource *res);
+int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 *target, u8 *attr);
+int mvebu_mbus_get_io_win_info(phys_addr_t phyaddr, u32 *size, u8 *target,
+  u8 *attr);
 int mvebu_mbus_add_window_remap_by_id(unsigned int target,
  unsigned int attribute,
  phys_addr_t base, size_t size,
-- 
1.8.3.1



[PATCH 01/13] net: mvneta: add configuration for MBUS windows access protection

2015-11-21 Thread Marcin Wojtas
This commit adds the missing configuration of MBUS window access protection
in the mvneta_conf_mbus_windows function. A dedicated variable for that
purpose has been there unused since the initial mvneta support in v3.8.
Because of that, the register contents were inherited from the bootloader.
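
How the two masks are built per window can be sketched in userspace as
follows. The struct and function names are hypothetical. The bit operations
mirror the context lines of the diff; reading a cleared enable bit as
"window on" and the value 3 as full access is an assumption, and the
starting value of the enable mask is left as a parameter because it is not
shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

struct win_masks {
	uint32_t win_enable;  /* assumed: a cleared bit enables window i */
	uint32_t win_protect; /* two bits per window; 3 assumed full access */
};

/* all_disabled is the starting value of the enable mask (every window
 * off); num_cs is the number of DRAM windows to open.
 */
struct win_masks win_masks_compute(uint32_t all_disabled, unsigned int num_cs)
{
	struct win_masks m = { all_disabled, 0 };

	assert(num_cs <= 16); /* 2 bits per window must fit in 32 bits */

	for (unsigned int i = 0; i < num_cs; i++) {
		m.win_enable &= ~(UINT32_C(1) << i);
		m.win_protect |= UINT32_C(3) << (2 * i);
	}
	return m;
}
```

For two windows and a starting mask of 0x3f this gives win_enable = 0x3c
and win_protect = 0xf. Without the register write added by this patch, the
second value is computed but never reaches the hardware.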

Signed-off-by: Marcin Wojtas 
Cc:  # v3.8+
---
 drivers/net/ethernet/marvell/mvneta.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index e84c7f2..0f30aaa 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -62,6 +62,7 @@
 #define MVNETA_WIN_SIZE(w)  (0x2204 + ((w) << 3))
 #define MVNETA_WIN_REMAP(w) (0x2280 + ((w) << 2))
 #define MVNETA_BASE_ADDR_ENABLE 0x2290
+#define MVNETA_ACCESS_PROTECT_ENABLE   0x2294
 #define MVNETA_PORT_CONFIG  0x2400
 #define  MVNETA_UNI_PROMISC_MODEBIT(0)
 #define  MVNETA_DEF_RXQ(q)  ((q) << 1)
@@ -3188,6 +3189,8 @@ static void mvneta_conf_mbus_windows(struct mvneta_port 
*pp,
 
win_enable &= ~(1 << i);
win_protect |= 3 << (2 * i);
+
+   mvreg_write(pp, MVNETA_ACCESS_PROTECT_ENABLE, win_protect);
}
 
mvreg_write(pp, MVNETA_BASE_ADDR_ENABLE, win_enable);
-- 
1.8.3.1



[PATCH 06/13] net: mvneta: enable mixed egress processing using HR timer

2015-11-21 Thread Marcin Wojtas
The mixed approach allows using a higher interrupt threshold (increased back
to 15 packets), which is useful at high throughput. With a small amount of
data or very short TX queues, the HR timer ensures that buffers are released
with low latency.

Along with the existing tx_done processing by coalescing interrupts, this
commit enables triggering the HR timer each time packets are sent.
The time threshold can also be configured using ethtool.
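
The arming rules of the timer can be modelled in userspace roughly as
follows. This is a sketch with hypothetical names (struct txdone_model and
both functions); it only tracks the timer_scheduled flag and the interval
and does not use the kernel hrtimer or tasklet APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct txdone_model {
	uint32_t done_time_coal_usec; /* 0 disables the timer path */
	bool timer_scheduled;
	int timer_arms;               /* how many times the timer started */
};

/* Arm the one-shot timer unless it is disabled or already pending.
 * Returns the interval in nanoseconds, or 0 if nothing was armed.
 */
uint64_t txdone_model_arm(struct txdone_model *m)
{
	assert(m != NULL);

	if (!m->done_time_coal_usec || m->timer_scheduled)
		return 0;
	m->timer_scheduled = true;
	m->timer_arms++;
	return (uint64_t)m->done_time_coal_usec * 1000;
}

/* Timer expiry: clear the flag, then re-arm only if descriptors remain */
uint64_t txdone_model_expire(struct txdone_model *m, unsigned int tx_todo)
{
	assert(m != NULL);

	m->timer_scheduled = false;
	return tx_todo ? txdone_model_arm(m) : 0;
}
```

With a threshold of 100 us, the first arm call returns 100000 ns and a
second call while the timer is pending returns 0, so back-to-back
transmissions do not restart the timer.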

Signed-off-by: Marcin Wojtas 
Signed-off-by: Simon Guinot 
---
 drivers/net/ethernet/marvell/mvneta.c | 89 +--
 1 file changed, 85 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index 9c9e858..f5acaf6 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -226,7 +228,8 @@
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS1
+#define MVNETA_TXDONE_COAL_PKTS15
+#define MVNETA_TXDONE_COAL_USEC100
 #define MVNETA_RX_COAL_PKTS32
 #define MVNETA_RX_COAL_USEC100
 
@@ -356,6 +359,11 @@ struct mvneta_port {
struct net_device *dev;
struct notifier_block cpu_notifier;
 
+   /* Egress finalization */
+   struct tasklet_struct tx_done_tasklet;
+   struct hrtimer tx_done_timer;
+   bool timer_scheduled;
+
/* Core clock */
struct clk *clk;
u8 mcast_count[256];
@@ -481,6 +489,7 @@ struct mvneta_tx_queue {
int txq_get_index;
 
u32 done_pkts_coal;
+   u32 done_time_coal;
 
/* Virtual address of the TX DMA descriptors array */
struct mvneta_tx_desc *descs;
@@ -1791,6 +1800,30 @@ error:
return -ENOMEM;
 }
 
+/* Trigger HR timer for TX processing */
+static void mvneta_timer_set(struct mvneta_port *pp)
+{
+   ktime_t interval;
+
+   if (!pp->timer_scheduled) {
+   pp->timer_scheduled = true;
+   interval = ktime_set(0, pp->txqs[0].done_time_coal * 1000);
+   hrtimer_start(&pp->tx_done_timer, interval,
+ HRTIMER_MODE_REL_PINNED);
+   }
+}
+
+/* TX processing HR timer callback */
+static enum hrtimer_restart mvneta_hr_timer_cb(struct hrtimer *timer)
+{
+   struct mvneta_port *pp = container_of(timer, struct mvneta_port,
+ tx_done_timer);
+
+   tasklet_schedule(&pp->tx_done_tasklet);
+
+   return HRTIMER_NORESTART;
+}
+
 /* Main tx processing */
 static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 {
@@ -1862,10 +1895,13 @@ out:
if (txq->count >= txq->tx_stop_threshold)
netif_tx_stop_queue(nq);
 
-   if (!skb->xmit_more || netif_xmit_stopped(nq))
+   if (!skb->xmit_more || netif_xmit_stopped(nq)) {
mvneta_txq_pend_desc_add(pp, txq, frags);
-   else
+   if (txq->done_time_coal && !pp->timer_scheduled)
+   mvneta_timer_set(pp);
+   } else {
txq->pending += frags;
+   }
 
u64_stats_update_begin(&stats->syncp);
stats->tx_packets++;
@@ -1902,6 +1938,7 @@ static void mvneta_tx_done_gbe(struct mvneta_port *pp, 
u32 cause_tx_done)
 {
struct mvneta_tx_queue *txq;
struct netdev_queue *nq;
+   unsigned int tx_todo = 0;
 
while (cause_tx_done) {
txq = mvneta_tx_done_policy(pp, cause_tx_done);
@@ -1909,12 +1946,40 @@ static void mvneta_tx_done_gbe(struct mvneta_port *pp, 
u32 cause_tx_done)
nq = netdev_get_tx_queue(pp->dev, txq->id);
__netif_tx_lock(nq, smp_processor_id());
 
-   if (txq->count)
+   if (txq->count) {
mvneta_txq_done(pp, txq);
+   tx_todo += txq->count;
+   }
 
__netif_tx_unlock(nq);
cause_tx_done &= ~((1 << txq->id));
}
+
+   if (!pp->txqs[0].done_time_coal)
+   return;
+
+   /* Set the timer in case not all the packets were
+* processed. Otherwise attempt to cancel timer.
+*/
+   if (tx_todo)
+   mvneta_timer_set(pp);
+   else if (pp->timer_scheduled)
+   hrtimer_cancel(&pp->tx_done_timer);
+}
+
+/* TX done processing tasklet */
+static void mvneta_tx_done_proc(unsigned long data)
+{
+   struct net_device *dev = (struct net_device *)data;
+   struct mvneta_port *pp = netdev_priv(dev);
+
+   if (!netif_running(dev))
+   return;
+
+   pp->timer_scheduled = false;
+
+   /* Process all Tx queues */
+   mvneta_tx_done_gbe(pp, (1 << txq_number) - 1);
 }
 
 /* Compute crc8 of the specified address, using a unique al

[PATCH 10/13] ARM: mvebu: add buffer manager nodes to armada-38x.dtsi

2015-11-21 Thread Marcin Wojtas
The Armada 38x network controller supports hardware buffer management (BM).
Since it is now enabled in the mvneta driver, appropriate nodes can be added
to armada-38x.dtsi for the actual common BM unit (bm@c8000) and its
internal SRAM (bm-bppi), which is used for indirect access to the buffer
pointer ring residing in DRAM.

The pool-to-port mapping, the bm-bppi entry in the 'soc' node's ranges and
optional parameters are supposed to be set in board files.

Signed-off-by: Marcin Wojtas 
---
 arch/arm/boot/dts/armada-38x.dtsi | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm/boot/dts/armada-38x.dtsi 
b/arch/arm/boot/dts/armada-38x.dtsi
index b7868b2..b9f4ce2 100644
--- a/arch/arm/boot/dts/armada-38x.dtsi
+++ b/arch/arm/boot/dts/armada-38x.dtsi
@@ -539,6 +539,14 @@
status = "disabled";
};
 
+   bm: bm@c8000 {
+   compatible = "marvell,armada-380-neta-bm";
+   reg = <0xc8000 0xac>;
+   clocks = <&gateclk 13>;
+   internal-mem = <&bm_bppi>;
+   status = "disabled";
+   };
+
sata@e {
compatible = "marvell,armada-380-ahci";
reg = <0xe 0x2000>;
@@ -617,6 +625,16 @@
#size-cells = <1>;
ranges = <0 MBUS_ID(0x09, 0x15) 0 0x800>;
};
+
+   bm_bppi: bm-bppi {
+   compatible = "mmio-sram";
+   reg = ;
+   ranges = <0 MBUS_ID(0x0c, 0x04) 0 0x10>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   clocks = <&gateclk 13>;
+   status = "disabled";
+   };
};
 
clocks {
-- 
1.8.3.1



[PATCH 08/13] ARM: mvebu: enable SRAM support in mvebu_v7_defconfig

2015-11-21 Thread Marcin Wojtas
Signed-off-by: Marcin Wojtas 
---
 arch/arm/configs/mvebu_v7_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/configs/mvebu_v7_defconfig 
b/arch/arm/configs/mvebu_v7_defconfig
index c6729bf..fe57e20 100644
--- a/arch/arm/configs/mvebu_v7_defconfig
+++ b/arch/arm/configs/mvebu_v7_defconfig
@@ -58,6 +58,7 @@ CONFIG_MTD_M25P80=y
 CONFIG_MTD_NAND=y
 CONFIG_MTD_NAND_PXA3xx=y
 CONFIG_MTD_SPI_NOR=y
+CONFIG_SRAM=y
 CONFIG_EEPROM_AT24=y
 CONFIG_BLK_DEV_SD=y
 CONFIG_ATA=y
-- 
1.8.3.1
