Re: Centralizing support for TCAM?

2016-09-05 Thread Alexei Starovoitov
On Sat, Sep 03, 2016 at 09:09:50AM +0200, Jiri Pirko wrote:
> Fri, Sep 02, 2016 at 08:49:34PM CEST, john.fastab...@gmail.com wrote:
> >On 16-09-02 10:18 AM, Florian Fainelli wrote:
> >> Hi all,
> >> 
> >
> >Hi Florian,
> >
> >> (apologies for the long CC list and the fact that I can't type email
> >> addresses correctly)
> >> 
> >
> >My favorite topic ;)
> >
> >> While working on adding support for the Broadcom Ethernet switches
> >> Compact Field Processor (which is essentially a TCAM,
> >> action/policer/rate meter RAMs, 256 entries), I started working with the
> >> ethtool::rxnfc API which is actually kind of nice in that it fits nicely
> >> with my simple use case of being able to insert rules at a given or
> >> driver selected location and has a pretty good flow representation for
> >> common things you may match: TCP/UDP v4/v6 (not so much for non-IP, or
> >> L2 stuff though you can use the extension flow representation). It lacks
> >> support for more complex actions other than redirect to a particular
> >> port/queue.
> >
> >When I was doing this for one of the products I work on I decided that
> >extending ethtool was likely not a good approach and building a netlink
> >interface would be a better choice. My reasons were mainly that extending
> >ethtool is a bit painful for keeping structure compatibility across versions,
> >and I also had use cases that wanted to get notifications, both of which are
> >made easier by using netlink. However my netlink port+extensions were not
> >accepted and were called a "kernel bypass" and the general opinion was
> >that it was not going to be accepted upstream. Hence the 'tc' effort.
> 
> Ethtool should die peacefully. Don't poke in it in the process...
> 
> 
> >
> >> 
> >> Now ethtool::rxnfc is one possible user, but tc and netfilter also are,
> >> more powerful and extensible, but since this is a resource constrained
> >> piece of hardware, it would suck for people to have to implement
> >> these 3 APIs if we could come up with a central one that satisfies the
> >> superset offered by tc + netfilter. We can surely imagine a use case we
> >
> >My opinion is that tc and netfilter are sufficiently different that
> >building a common layer is challenging and is actually more complex vs
> >just implementing two interfaces. Always happy to review code though.
> 
> In February, Pablo did some work on finding the common intermediate
> layer for classifier-action subsystem. It was rejected with the argument
> of unnecessary overhead. Makes sense to me. After that, you introduced
> u32 tc offload. Since then, a couple more tc classifiers and actions
> have been offloaded.
> 
> I believe that for Florian's usecase, TC is a great fit. You can just use
> cls_flower with a couple of actions.
> 
> My colleagues are working hard on enabling cls_flower offload. You can
> easily benefit from that. In mlxsw we also plan to use it for our TCAM
> ACL offloading.
> 
> 
> >
> >There is also an already established packet flow through tc, netfilter,
> >fdb, l3 in linux that folks want to maintain. At the moment I just don't
> >see the need for a common layer IMO.
> >
> >Also, adding another layer of abstraction means we end up doing multiple
> >translations into and out of these layers, which adds overhead. Eventually
> >I need to get reasonable operations per second on the TCAM tables.
> >Reasonable for me being somewhere in the 50k to 100k add/del/update
> >commands per second. I'm hesitant to create more abstractions than
> >are actually needed.
> >
> >> centralize the whole matching + action into a Domain Specific Language
> >> that we compiled into eBPF and then translate into whatever the HW
> >> understands, although that raises the question of where do we put the
> >> translation tool in user space or kernel space.
> >
> >The eBPF to HW translation I started to look at but gave up. The issue
> >was that the program space of eBPF is much larger than any traditional
> >parser/table hardware implementation can support, so most programs get
> >rejected (obvious observation right?). I'm more inclined to build
> >hardware that can support eBPF vs restricting eBPF to fit into a
> >parser/table model.
> 
> +1
> I have been thinking a lot about this and I believe that parsing a bpf
> program in drivers into some pre-defined tables is quite complex. I
> think that bpf is just very unsuitable for offload if you don't have
> hw which can directly interpret it.
> I know that Alexei disagrees :)

lol :)
compiling bpf into fixed pipeline asic is definitely not easy.
The problem with adding new cls classifiers and actions to match
what configurable hw does isn't pretty either. The fixed pipeline
isn't interesting beyond l2/l3 and flow-based hw features are mostly
useless in the tor. I'm not against adding new classifiers, since it's
better than sdk, but we won't be using such tc features either.
Since this thread is about tcam... my 0.02 here is it's pretty bad in
the nic (host) due to power consumption and in the tor it's only good as

[RFC PATCH v2 6/6] net: Suppress the "Comparison to NULL could be written" warning

2016-09-05 Thread Jia He
This is to suppress the checkpatch.pl warning "Comparison to NULL
could be written". No functional changes here.

Signed-off-by: Jia He 
---
 net/ipv4/proc.c  | 44 ++--
 net/sctp/proc.c  |  4 ++--
 net/xfrm/xfrm_proc.c |  4 ++--
 3 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index f413fdc..bf0bb22 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -358,22 +358,22 @@ static void icmp_put(struct seq_file *seq)
atomic_long_t *ptr = net->mib.icmpmsg_statistics->mibs;
 
seq_puts(seq, "\nIcmp: InMsgs InErrors InCsumErrors");
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " In%s", icmpmibmap[i].name);
seq_puts(seq, " OutMsgs OutErrors");
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " Out%s", icmpmibmap[i].name);
seq_printf(seq, "\nIcmp: %lu %lu %lu",
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INMSGS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INERRORS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_CSUMERRORS));
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " %lu",
   atomic_long_read(ptr + icmpmibmap[i].index));
seq_printf(seq, " %lu %lu",
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTMSGS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTERRORS));
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " %lu",
   atomic_long_read(ptr + (icmpmibmap[i].index | 0x100)));
 }
@@ -390,7 +390,7 @@ static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
seq_puts(seq, "Ip: Forwarding DefaultTTL");
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
seq_printf(seq, "\nIp: %d %d",
@@ -400,13 +400,13 @@ static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
 
for_each_possible_cpu(c) {
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
buff64[i] += snmp_get_cpu_field64(
net->mib.ip_statistics,
c, snmp4_ipstats_list[i].entry,
offsetof(struct ipstats_mib, syncp));
}
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
seq_printf(seq, " %llu", buff64[i]);
 
return 0;
@@ -421,17 +421,17 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
seq_puts(seq, "\nTcp:");
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_tcp_list[i].name; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
seq_puts(seq, "\nTcp:");
for_each_possible_cpu(c) {
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_tcp_list[i].name; i++)
buff[i] += snmp_get_cpu_field(net->mib.tcp_statistics,
c, snmp4_tcp_list[i].entry);
}
 
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
+   for (i = 0; snmp4_tcp_list[i].name; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
seq_printf(seq, " %ld", buff[i]);
@@ -442,15 +442,15 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
for_each_possible_cpu(c) {
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
buff[i] += snmp_get_cpu_field(net->mib.udp_statistics,
c, snmp4_udp_list[i].entry);
}
seq_puts(seq, "\nUdp:");
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
seq_puts(seq, "\nUdp:");
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)

[RFC PATCH v2 3/6] proc: Reduce cache miss in sctp_snmp_seq_show

2016-09-05 Thread Jia He
This patch exchanges the two loops for collecting the percpu
statistics data. This can reduce cache misses by going through
all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/sctp/proc.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index ef8ba77..085fb95 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -74,12 +74,19 @@ static const struct snmp_mib sctp_snmp_list[] = {
 static int sctp_snmp_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
-   int i;
+   int i, c;
+   unsigned long buff[SCTP_MIB_MAX];
 
+   memset(buff, 0, sizeof(unsigned long) * SCTP_MIB_MAX);
+
+   for_each_possible_cpu(c)
+   for (i = 0; sctp_snmp_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(
+   net->sctp.sctp_statistics,
+   c, sctp_snmp_list[i].entry);
for (i = 0; sctp_snmp_list[i].name != NULL; i++)
seq_printf(seq, "%-32s\t%ld\n", sctp_snmp_list[i].name,
-  snmp_fold_field(net->sctp.sctp_statistics,
- sctp_snmp_list[i].entry));
+   buff[i]);
 
return 0;
 }
-- 
1.8.3.1



[RFC PATCH v2 1/6] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-05 Thread Jia He
This patch exchanges the two loops for collecting the percpu statistics
data. This can aggregate the data by going through all the items of each
cpu sequentially. Then snmp_seq_show is split into 2 parts to avoid the
build warning about the frame size being larger than 1024.

Signed-off-by: Jia He 
---
 net/ipv4/proc.c | 112 
 1 file changed, 81 insertions(+), 31 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..f413fdc 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,6 +46,9 @@
 #include 
 #include 
 
+#define MAX(a, b) ((u32)(a) >= (u32)(b) ? (a) : (b))
+#define TCPUDP_MIB_MAX MAX(UDP_MIB_MAX, TCP_MIB_MAX)
+
 /*
  * Report socket allocation statistics [m...@utu.fi]
  */
@@ -378,13 +381,15 @@ static void icmp_put(struct seq_file *seq)
 /*
  * Called from the PROCfs module. This outputs /proc/net/snmp.
  */
-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
 {
-   int i;
+   int i, c;
+   u64 buff64[IPSTATS_MIB_MAX];
struct net *net = seq->private;
 
-   seq_puts(seq, "Ip: Forwarding DefaultTTL");
+   memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
+   seq_puts(seq, "Ip: Forwarding DefaultTTL");
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
@@ -393,57 +398,92 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
   net->ipv4.sysctl_ip_default_ttl);
 
BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+
+   for_each_possible_cpu(c) {
+   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   buff64[i] += snmp_get_cpu_field64(
+   net->mib.ip_statistics,
+   c, snmp4_ipstats_list[i].entry,
+   offsetof(struct ipstats_mib, syncp));
+   }
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-   seq_printf(seq, " %llu",
-  snmp_fold_field64(net->mib.ip_statistics,
-snmp4_ipstats_list[i].entry,
-offsetof(struct ipstats_mib, syncp)));
+   seq_printf(seq, " %llu", buff64[i]);
 
-   icmp_put(seq);  /* RFC 2011 compatibility */
-   icmpmsg_put(seq);
+   return 0;
+}
+
+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+   int i, c;
+   unsigned long buff[TCPUDP_MIB_MAX];
+   struct net *net = seq->private;
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
seq_puts(seq, "\nTcp:");
for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
seq_puts(seq, "\nTcp:");
+   for_each_possible_cpu(c) {
+   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(net->mib.tcp_statistics,
+   c, snmp4_tcp_list[i].entry);
+   }
+
for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-   seq_printf(seq, " %ld",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %ld", buff[i]);
else
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
}
 
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
+
+   for_each_possible_cpu(c) {
+   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(net->mib.udp_statistics,
+   c, snmp4_udp_list[i].entry);
+   }
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.udp_statistics,
-  snmp4_udp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
/* the UDP and UDP-Lite MIBs are the same */
seq_puts(seq, "\nUdpLite:");
+   for_each_possible_cpu(c) {
+   for (i = 0; 

[RFC PATCH v2 5/6] ipv6: Remove useless parameter in __snmp6_fill_statsdev

2016-09-05 Thread Jia He
The parameter items (always ICMP6_MIB_MAX) is useless for __snmp6_fill_statsdev.

Signed-off-by: Jia He 
---
 net/ipv6/addrconf.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f418d2e..e170554 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4952,18 +4952,18 @@ static inline size_t inet6_if_nlmsg_size(void)
 }
 
 static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
- int items, int bytes)
+   int bytes)
 {
int i;
-   int pad = bytes - sizeof(u64) * items;
+   int pad = bytes - sizeof(u64) * ICMP6_MIB_MAX;
BUG_ON(pad < 0);
 
/* Use put_unaligned() because stats may not be aligned for u64. */
-   put_unaligned(items, &stats[0]);
-   for (i = 1; i < items; i++)
+   put_unaligned(ICMP6_MIB_MAX, &stats[0]);
+   for (i = 1; i < ICMP6_MIB_MAX; i++)
	put_unaligned(atomic_long_read(&mib[i]), &stats[i]);
 
-   memset(&stats[items], 0, pad);
+   memset(&stats[ICMP6_MIB_MAX], 0, pad);
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
@@ -4996,7 +4996,7 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 offsetof(struct ipstats_mib, syncp));
break;
case IFLA_INET6_ICMP6STATS:
-   __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes);
+   __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, bytes);
break;
}
 }
-- 
1.8.3.1



[RFC PATCH v2 4/6] proc: Reduce cache miss in xfrm_statistics_seq_show

2016-09-05 Thread Jia He
This patch exchanges the two loops for collecting the percpu
statistics data. This can reduce cache misses by going through
all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/xfrm/xfrm_proc.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_proc.c b/net/xfrm/xfrm_proc.c
index 9c4fbd8..c9df546 100644
--- a/net/xfrm/xfrm_proc.c
+++ b/net/xfrm/xfrm_proc.c
@@ -51,11 +51,20 @@ static const struct snmp_mib xfrm_mib_list[] = {
 static int xfrm_statistics_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
-   int i;
-   for (i = 0; xfrm_mib_list[i].name; i++)
+   int i, c;
+   unsigned long buff[LINUX_MIB_XFRMMAX];
+
+   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_XFRMMAX);
+
+   for_each_possible_cpu(c)
+   for (i = 0; xfrm_mib_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(
+   net->mib.xfrm_statistics,
+   c, xfrm_mib_list[i].entry);
+   for (i = 0; xfrm_mib_list[i].name != NULL; i++)
seq_printf(seq, "%-24s\t%lu\n", xfrm_mib_list[i].name,
-  snmp_fold_field(net->mib.xfrm_statistics,
-  xfrm_mib_list[i].entry));
+   buff[i]);
+
return 0;
 }
 
-- 
1.8.3.1



[RFC PATCH v2 2/6] proc: Reduce cache miss in snmp6_seq_show

2016-09-05 Thread Jia He
This patch exchanges the two loops for collecting the percpu
statistics data. This can reduce cache misses by going through
all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/ipv6/proc.c | 47 ---
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 679253d0..c834646 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -30,6 +30,13 @@
 #include 
 #include 
 
+#define MAX(a, b) ((u32)(a) >= (u32)(b) ? (a) : (b))
+
+#define MAX4(a, b, c, d) \
+   MAX(MAX(a, b), MAX(c, d))
+#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
+   IPSTATS_MIB_MAX, ICMP_MIB_MAX)
+
 static int sockstat6_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
@@ -191,25 +198,43 @@ static void snmp6_seq_show_item(struct seq_file *seq, void __percpu *pcpumib,
atomic_long_t *smib,
const struct snmp_mib *itemlist)
 {
-   int i;
-   unsigned long val;
-
-   for (i = 0; itemlist[i].name; i++) {
-   val = pcpumib ?
-   snmp_fold_field(pcpumib, itemlist[i].entry) :
-   atomic_long_read(smib + itemlist[i].entry);
-   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
+   int i, c;
+   unsigned long buff[SNMP_MIB_MAX];
+
+   memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
+
+   if (pcpumib) {
+   for_each_possible_cpu(c)
+   for (i = 0; itemlist[i].name; i++)
+   buff[i] += snmp_get_cpu_field(pcpumib, c,
+   itemlist[i].entry);
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n",
+  itemlist[i].name, buff[i]);
+   } else {
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
+  atomic_long_read(smib + itemlist[i].entry));
}
 }
 
 static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu *mib,
  const struct snmp_mib *itemlist, size_t syncpoff)
 {
-   int i;
+   int i, c;
+   u64 buff[SNMP_MIB_MAX];
+
+   memset(buff, 0, sizeof(u64) * SNMP_MIB_MAX);
 
+   for_each_possible_cpu(c) {
+   for (i = 0; itemlist[i].name; i++) {
+   buff[i] += snmp_get_cpu_field64(mib, c,
+   itemlist[i].entry,
+   syncpoff);
+   }
+   }
for (i = 0; itemlist[i].name; i++)
-   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
-  snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
+   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff[i]);
 }
 
 static int snmp6_seq_show(struct seq_file *seq, void *v)
-- 
1.8.3.1



[RFC PATCH v2 0/6] Reduce cache miss for snmp_fold_field

2016-09-05 Thread Jia He
On a PowerPC server with a large number of cpus (160), besides commit
a3a773726c9f ("net: Optimize snmp stat aggregation by walking all
the percpu data at once"), I noticed several other snmp_fold_field
callsites which cause a high cache miss rate.
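
The root cause is loop order: snmp_fold_field() folds one item by walking
every possible cpu, so printing a table of N items sweeps all the percpu
areas N times. The series below flips the nesting so each cpu's area is
read once, sequentially. A rough sketch of the two access patterns (mib,
items[] and buff[] stand in for the real statistics tables):

	/* before: one full walk of all cpus per item */
	for (i = 0; items[i].name; i++)
		seq_printf(seq, " %lu",
			   snmp_fold_field(mib, items[i].entry));

	/* after: walk each cpu once, summing all items sequentially */
	memset(buff, 0, sizeof(buff));
	for_each_possible_cpu(c)
		for (i = 0; items[i].name; i++)
			buff[i] += snmp_get_cpu_field(mib, c, items[i].entry);
	for (i = 0; items[i].name; i++)
		seq_printf(seq, " %lu", buff[i]);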

My simple test case, which read from the procfs items endlessly:
/***/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#define LINELEN  2560
int main(int argc, char **argv)
{
int i;
int fd = -1 ;
int rdsize = 0;
char buf[LINELEN+1];

buf[LINELEN] = 0;
memset(buf,0,LINELEN);

if(1 >= argc) {
printf("file name empty\n");
return -1;
}

fd = open(argv[1], O_RDWR, 0644);
if(0 > fd){
printf("open error\n");
return -2;
}

for(i=0;i<0xffffffff;i++) {
while(0 < (rdsize = read(fd,buf,LINELEN))){
//nothing here
}

lseek(fd, 0, SEEK_SET);
}

close(fd);
return 0;
}
/**/

compile and run:
gcc test.c -o test

perf stat -d -e cache-misses ./test /proc/net/snmp
perf stat -d -e cache-misses ./test /proc/net/snmp6
perf stat -d -e cache-misses ./test /proc/net/netstat
perf stat -d -e cache-misses ./test /proc/net/sctp/snmp
perf stat -d -e cache-misses ./test /proc/net/xfrm_stat

before the patch set:

 Performance counter stats for 'system wide':

  355911097  cache-misses                                           [40.08%]
 2356829300  L1-dcache-loads                                        [60.04%]
  355642645  L1-dcache-load-misses  #  15.09% of all L1-dcache hits [60.02%]
  346544541  LLC-loads                                              [59.97%]
     389763  LLC-load-misses        #   0.11% of all LL-cache hits  [40.02%]

   6.245162638 seconds time elapsed

After the patch set:
 Performance counter stats for 'system wide':

  194992476  cache-misses                                           [40.03%]
 6718051877  L1-dcache-loads                                        [60.07%]
  194871921  L1-dcache-load-misses  #   2.90% of all L1-dcache hits [60.11%]
  187632232  LLC-loads                                              [60.04%]
     464466  LLC-load-misses        #   0.25% of all LL-cache hits  [39.89%]

   6.868422769 seconds time elapsed

The cache-miss rate can be reduced from 15% to 2.9%.

v2:
- 1/6 fix bug in udplite statistics. 
- 1/6 snmp_seq_show is split into 2 parts

Jia He (6):
  proc: Reduce cache miss in {snmp,netstat}_seq_show
  proc: Reduce cache miss in snmp6_seq_show
  proc: Reduce cache miss in sctp_snmp_seq_show
  proc: Reduce cache miss in xfrm_statistics_seq_show
  ipv6: Remove useless parameter in __snmp6_fill_statsdev
  net: Suppress the "Comparison to NULL could be written" warning

 net/ipv4/proc.c  | 144 ++-
 net/ipv6/addrconf.c  |  12 ++---
 net/ipv6/proc.c  |  47 +
 net/sctp/proc.c  |  15 --
 net/xfrm/xfrm_proc.c |  15 --
 5 files changed, 162 insertions(+), 71 deletions(-)

-- 
1.8.3.1



Re: [PATCH net-next 4/9] rxrpc: Randomise epoch and starting client conn ID values

2016-09-05 Thread Jeffrey Altman
Reply inline 

On 9/5/2016 12:24 PM, David Howells wrote:
> [cc'ing Jeff Altman for comment]
> 
> David Laight  wrote:
> 
>>> Create a random epoch value rather than a time-based one on startup and set
>>> the top bit to indicate that this is the case.
>>
>> Why set the top bit?
>> There is nothing to stop the time (in seconds) from having the top bit set.
>> Nothing else can care - otherwise this wouldn't work.
> 
> This is what I'm told I should do by purveyors of other RxRPC solutions.

The protocol specification requires that the top bit be 1 for a random
epoch and 0 for a time-derived epoch.
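
A minimal sketch of honouring that convention at startup, assuming the
kernel's get_random_bytes() (the mask constant below is just the stated
top bit spelled out):

	u32 epoch;

	/* random rather than time-derived epoch; the top bit flags it */
	get_random_bytes(&epoch, sizeof(epoch));
	epoch |= 0x80000000;
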
> 
>>> Also create a random starting client connection ID value.  This will be
>>> incremented from here as new client connections are created.
>>
>> I'm guessing this is to make duplicates less likely after a restart?

It's to reduce the possibility of duplicates on multiple machines that
might at some point exchange an endpoint address either due to mobility
or NAT/PAT.
> 
> Again, it's been suggested that I do this, but I would guess so.
> 
>> You may want to worry about duplicate allocations (after 2^32 connects).
> 
> It's actually a quarter of that, but connection != call, so a connection may
> be used for up to ~16 billion RPC operations before it *has* to be flushed.
> 
>> There are id allocation algorithms that guarantee not to generate duplicates
>> and not to reuse values quickly while still being fixed cost.
>> Look at the code NetBSD uses to allocate process ids for an example.
> 
> I'm using idr_alloc_cyclic()[*] with a fixed size "window" on the active conn
> ID values.  Client connections with IDs outside of that window are discarded
> as soon as possible to keep the memory consumption of the tree down (and to
> force security renegotiation occasionally).  However, given that there are a
> billion IDs to cycle through, it will take quite a while for reuse to become
> an issue.
> 
> I like the idea of incrementing the epoch every time we cycle through the ID
> space, but I'm told that a change in the epoch value is an indication that the
> client rebooted - with what consequences I cannot say.

State information might be recorded about an rx peer with the assumption
that state will be reset when the epoch changes.  The most frequent use
of this technique is for rx rpc statistics monitoring.


> 
> [*] which is what Linux uses to allocate process IDs.
> 
> David
> 



Re: [PATCH nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack

2016-09-05 Thread Feng Gao
On Tue, Sep 6, 2016 at 10:06 AM,   wrote:
> From: Gao Feng 
>
> It is valid for a TCP RST packet to not set the ack flag, in which case the
> bytes of the ack number are zero. For these RST packets, seqadj must not
> adjust the ack number.
>
> Signed-off-by: Gao Feng 
> ---
>  v2: Regenerate because the first patch is removed
>  v1: Initial patch
>
>  net/netfilter/nf_conntrack_seqadj.c | 34 +++---
>  1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/net/netfilter/nf_conntrack_seqadj.c 
> b/net/netfilter/nf_conntrack_seqadj.c
> index dff0f0c..3bd9c7e 100644
> --- a/net/netfilter/nf_conntrack_seqadj.c
> +++ b/net/netfilter/nf_conntrack_seqadj.c
> @@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
>
> tcph = (void *)skb->data + protoff;
> spin_lock_bh(&ct->lock);
> +
> if (after(ntohl(tcph->seq), this_way->correction_pos))
> seqoff = this_way->offset_after;
> else
> seqoff = this_way->offset_before;
>
> -   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> - other_way->correction_pos))
> -   ackoff = other_way->offset_after;
> -   else
> -   ackoff = other_way->offset_before;
> -
> newseq = htonl(ntohl(tcph->seq) + seqoff);
> -   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> -
> inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
> -   inet_proto_csum_replace4(>check, skb, tcph->ack_seq, newack,
> -false);
> -
> -   pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
> -ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
> -ntohl(newack));
>
> +   pr_debug("Adjusting sequence number from %u->%u\n",
> +ntohl(tcph->seq), ntohl(newseq));
> tcph->seq = newseq;
> -   tcph->ack_seq = newack;
> +
> +   if (likely(tcph->ack)) {
> +   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> + other_way->correction_pos))
> +   ackoff = other_way->offset_after;
> +   else
> +   ackoff = other_way->offset_before;
> +
> +   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> +   inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
> +newack, false);
> +
> +   pr_debug("Adjusting ack number from %u->%u\n",
> +ntohl(tcph->ack_seq), ntohl(newack));
> +   tcph->ack_seq = newack;
> +   }
>
> res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo);
> spin_unlock_bh(&ct->lock);
> --
> 1.9.1
>
>

Sorry, I forgot to add the v2 in the subject.

Best Regards
Feng


[PATCH nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack

2016-09-05 Thread fgao
From: Gao Feng 

It is valid for a TCP RST packet to not set the ack flag, in which case the
bytes of the ack number are zero. For these RST packets, seqadj must not
adjust the ack number.

Signed-off-by: Gao Feng 
---
 v2: Regenerate because the first patch is removed
 v1: Initial patch

 net/netfilter/nf_conntrack_seqadj.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/nf_conntrack_seqadj.c 
b/net/netfilter/nf_conntrack_seqadj.c
index dff0f0c..3bd9c7e 100644
--- a/net/netfilter/nf_conntrack_seqadj.c
+++ b/net/netfilter/nf_conntrack_seqadj.c
@@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
 
tcph = (void *)skb->data + protoff;
	spin_lock_bh(&ct->lock);
+
if (after(ntohl(tcph->seq), this_way->correction_pos))
seqoff = this_way->offset_after;
else
seqoff = this_way->offset_before;
 
-   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
- other_way->correction_pos))
-   ackoff = other_way->offset_after;
-   else
-   ackoff = other_way->offset_before;
-
newseq = htonl(ntohl(tcph->seq) + seqoff);
-   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
-
	inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
-   inet_proto_csum_replace4(>check, skb, tcph->ack_seq, newack,
-false);
-
-   pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
-ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
-ntohl(newack));
 
+   pr_debug("Adjusting sequence number from %u->%u\n",
+ntohl(tcph->seq), ntohl(newseq));
tcph->seq = newseq;
-   tcph->ack_seq = newack;
+
+   if (likely(tcph->ack)) {
+   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
+ other_way->correction_pos))
+   ackoff = other_way->offset_after;
+   else
+   ackoff = other_way->offset_before;
+
+   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
+   inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
+newack, false);
+
+   pr_debug("Adjusting ack number from %u->%u\n",
+ntohl(tcph->ack_seq), ntohl(newack));
+   tcph->ack_seq = newack;
+   }
 
res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo);
	spin_unlock_bh(&ct->lock);
-- 
1.9.1




[PATCH v4 nf] netfilter: seqadj: Drop the packet directly when fail to add seqadj extension to avoid dereference NULL pointer later

2016-09-05 Thread fgao
From: Gao Feng 

When memory is exhausted, nfct_seqadj_ext_add may fail to add the seqadj
extension. But the function nf_ct_seqadj_init doesn't check whether
nfct_seqadj() returns a valid seqadj pointer.

Now drop the packet directly when adding the seqadj extension fails, to
avoid dereferencing a NULL pointer in nf_ct_seqadj_init.

Signed-off-by: Gao Feng 
---
 v4: Drop the packet directly when adding the seqadj extension fails;
 v3: Remove the warning log when seqadj is null;
 v2: Remove the unnecessary seqadj check in nf_ct_seq_adjust
 v1: Initial patch

 net/netfilter/nf_conntrack_core.c | 6 +-
 net/netfilter/nf_nat_core.c   | 3 ++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index dd2c43a..dfa76ce 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1036,7 +1036,11 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
return (struct nf_conntrack_tuple_hash *)ct;
 
if (tmpl && nfct_synproxy(tmpl)) {
-   nfct_seqadj_ext_add(ct);
+   if (!nfct_seqadj_ext_add(ct)) {
+   nf_conntrack_free(ct);
+   pr_debug("Can't add seqadj extension\n");
+   return NULL;
+   }
nfct_synproxy_ext_add(ct);
}
 
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index de31818..b82282a 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -441,7 +441,8 @@ nf_nat_setup_info(struct nf_conn *ct,
ct->status |= IPS_DST_NAT;
 
if (nfct_help(ct))
-   nfct_seqadj_ext_add(ct);
+   if (!nfct_seqadj_ext_add(ct))
+   return NF_DROP;
}
 
if (maniptype == NF_NAT_MANIP_SRC) {
-- 
1.9.1




Re: [PATCH] RDS: Simplify code

2016-09-05 Thread santosh.shilim...@oracle.com

On 9/4/16 11:23 AM, Leon Romanovsky wrote:

On Sun, Sep 04, 2016 at 05:57:20PM +0200, Christophe JAILLET wrote:

Le 04/09/2016 à 14:20, Leon Romanovsky a écrit :

On Sat, Sep 03, 2016 at 07:33:29AM +0200, Christophe JAILLET wrote:

Calling 'list_splice' followed by 'INIT_LIST_HEAD' is equivalent to
'list_splice_init'.

It is not 100% accurate

list_splice(y, z)
INIT_LIST_HEAD(y)

==>

if (!list_empty(y))
 __list_splice(y, z, z->next);
INIT_LIST_HEAD(y)

and not

if (!list_empty(y)) {
 __list_splice(y, z, z->next);
 INIT_LIST_HEAD(y)
}

as list_splice_init will do.


You are right but if you dig further you will see that calling
INIT_LIST_HEAD on an empty list is a no-op (AFAIK).
And if this list was not already correctly initialized, then you would have
some other troubles.
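
For reference, a rough sketch of the relevant helpers from
include/linux/list.h shows why: INIT_LIST_HEAD on an already-empty list
just re-points the head at itself.

	static inline void INIT_LIST_HEAD(struct list_head *list)
	{
		list->next = list;	/* an empty list already points here */
		list->prev = list;
	}

	static inline void list_splice_init(struct list_head *list,
					    struct list_head *head)
	{
		if (!list_empty(list)) {
			__list_splice(list, head, head->next);
			INIT_LIST_HEAD(list);
		}
	}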


Thank you for the suggestion.
It looks like the code after that can be skipped when the loop_conns
list is empty, as the tmp_list will be empty too.

174 list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) {
175 WARN_ON(lc->conn->c_passive);
176 rds_conn_destroy(lc->conn);
177 }



Thanks for trying. As already pointed out, your change doesn't simplify
much but rather changes the behavior. The loop cursor already takes care
of the list-empty case. I don't see any reason to change that code.

Regards,
Santosh



Great Offer

2016-09-05 Thread Mrs Julie Leach
You are a recipient to Mrs Julie Leach Donation of $2 million USD. Contact
(julieleach...@hotmail.com) for claims.




Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-09-05 Thread Alexei Starovoitov

On 9/5/16 2:40 PM, Sargun Dhillon wrote:

On Mon, Sep 05, 2016 at 04:49:26PM +0200, Daniel Mack wrote:

Hi,

On 08/30/2016 01:04 AM, Sargun Dhillon wrote:

On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:

This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

   A - B - C
 \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.


How does this work when running an orchestrator within an orchestrator? The
Docker in Docker / Mesos in Mesos use case, where the top level orchestrator is
observing the traffic, and there is an orchestrator within that which also
needs to run it.

In this case, I'd like to run E's filter, then if it returns 0, D's, and B's,
and so on.


Running multiple programs was an idea I had in one of my earlier drafts,
but after some discussion, I refrained from it again because potentially
walking the cgroup hierarchy on every packet is just too expensive.


I think you're correct here. Maybe this is something I do with the LSM-attached
filters, and not for skb filters. Do you think there might be a way to opt in to
this option?


Is it possible to allow this, either by flattening out the
datastructure (copy a ref to the bpf programs to C and E) or
something similar?


That would mean we carry a list of eBPF program pointers of dynamic
size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
so it can store a reference to all programs of all of its ancestor.

While I think that would be possible, even at some later point, I'd
really like to avoid it for the sake of simplicity.

Is there any reason why this can't be done in userspace? Compile a
program X for A, and overload it with Y, with Y doing the same than X
but add some extra checks? Note that all users of the bpf(2) syscall API
will need CAP_NET_ADMIN anyway, so there is no delegation to
unprivileged sub-orchestators or anything alike really.


One of the use-cases that's becoming more and more common is
containers-in-containers. In this, you have a privileged container that's
running something like build orchestration, and you want to do macro-isolation
(say limit access to only that tenant's infrastructure). Then, when the build
orchestrator runs a build, it may want to monitor, and further isolate the tasks
that run in the build job. This is a side-effect of composing different
container technologies. Typically you use one system for images, then another
for orchestration, and the actual program running inside of it can also leverage
containerization.

Example:
K8s->Docker->Jenkins Agent->Jenkins Build Job


frankly I don't buy this argument, since above
and other 'examples' of container-in-container look
fake to me. There is a ton of work to be done for such
scheme to be even remotely feasible. The cgroup+bpf
stuff would be the last on my list to 'fix' for such
deployments. I don't think we should worry about it
at present.



Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-09-05 Thread Sargun Dhillon
On Mon, Sep 05, 2016 at 04:49:26PM +0200, Daniel Mack wrote:
> Hi,
> 
> On 08/30/2016 01:04 AM, Sargun Dhillon wrote:
> > On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
> >> This patch adds two sets of eBPF program pointers to struct cgroup.
> >> One for such that are directly pinned to a cgroup, and one for such
> >> that are effective for it.
> >>
> >> To illustrate the logic behind that, assume the following example
> >> cgroup hierarchy.
> >>
> >>   A - B - C
> >> \ D - E
> >>
> >> If only B has a program attached, it will be effective for B, C, D
> >> and E. If D then attaches a program itself, that will be effective for
> >> both D and E, and the program in B will only affect B and C. Only one
> >> program of a given type is effective for a cgroup.
> >>
> > How does this work when running an orchestrator within an orchestrator?
> > The Docker in Docker / Mesos in Mesos use case, where the top level
> > orchestrator is observing the traffic, and there is an orchestrator
> > within that also needs to run it.
> > 
> > In this case, I'd like to run E's filter, then if it returns 0, D's, and
> > B's, and so on.
> 
> Running multiple programs was an idea I had in one of my earlier drafts,
> but after some discussion, I refrained from it again because potentially
> walking the cgroup hierarchy on every packet is just too expensive.
>
I think you're correct here. Maybe this is something I do with the LSM-attached 
filters, and not for skb filters. Do you think there might be a way to opt in
to this option?

> > Is it possible to allow this, either by flattening out the
> > datastructure (copy a ref to the bpf programs to C and E) or
> > something similar?
> 
> That would mean we carry a list of eBPF program pointers of dynamic
> size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
> so it can store a reference to all programs of all of its ancestor.
> 
> While I think that would be possible, even at some later point, I'd
> really like to avoid it for the sake of simplicity.
> 
> Is there any reason why this can't be done in userspace? Compile a
> program X for A, and overload it with Y, with Y doing the same than X
> but add some extra checks? Note that all users of the bpf(2) syscall API
> will need CAP_NET_ADMIN anyway, so there is no delegation to
> unprivileged sub-orchestators or anything alike really.

One of the use-cases that's becoming more and more common is
containers-in-containers. In this, you have a privileged container that's 
running something like build orchestration, and you want to do macro-isolation 
(say limit access to only that tenant's infrastructure). Then, when the build
orchestrator runs a build, it may want to monitor, and further isolate the
tasks that run in the build job. This is a side-effect of composing different
container technologies. Typically you use one system for images, then another 
for orchestration, and the actual program running inside of it can also
leverage containerization.

Example:
K8s->Docker->Jenkins Agent->Jenkins Build Job

There's also a differentiation of ownership in each of these systems. I would 
really not require a middleware system that all my software has to talk to, 
because sometimes I'm taking off-the-shelf software (Jenkins) and porting it
to containers. I think one of the pieces that's led to the success of cgroups is
the straightforward API, and ease of use (and it's getting even easier in v2).

It's perfectly fine to give the lower level tasks CAP_NET_ADMIN, because we use 
something like seccomp-bpf plus some of the work I've been doing with the LSM
to prevent the sub-orchestrators from accidentally blowing away the system.
Usually, we trust these orchestrators (internal users), so it's more of a 
precautionary measure as opposed to a true security measure.

Also, rewriting BPF programs, although pretty straightforward, sounds like a
pain to do in userspace, even with a helper. If we were to take people's
programs and chain them together via tail call, or similar, I can imagine
where rewriting a program might push you over the instruction limit.
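
To make the tail-call idea concrete, here is a rough sketch in the style of
samples/bpf of an outer filter handing off to an inner program through a
BPF_MAP_TYPE_PROG_ARRAY; the map name, slot index and the elided checks are
illustrative, not from any posted patch:

#include <uapi/linux/bpf.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/* helper binding, as in samples/bpf/bpf_helpers.h */
static void (*bpf_tail_call)(void *ctx, void *map, int index) =
	(void *) BPF_FUNC_tail_call;

struct bpf_map_def {		/* simplified loader map definition */
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

/* slot 0 optionally holds the sub-orchestrator's program */
struct bpf_map_def SEC("maps") next_prog = {
	.type		= BPF_MAP_TYPE_PROG_ARRAY,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= 1,
};

SEC("socket")
int outer_filter(struct __sk_buff *skb)
{
	/* the outer orchestrator's own checks would run here */

	bpf_tail_call(skb, &next_prog, 0);	/* no return on success */

	/* reached only when slot 0 is empty: default policy */
	return 1;
}
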
> 
> 
> Thanks,
> Daniel
> 


Re: [PATCH -next v2] net: hns: fix return value check in hns_dsaf_get_cfg()

2016-09-05 Thread David Miller
From: Salil Mehta 
Date: Mon, 5 Sep 2016 14:20:33 +

> This patch will conflict with Doug Ledford's hns-roce's HNS driver.
> This might lead to problems later during this merge window of 4.9.

You don't need to say this three times.

These changes will not be reverted, instead the conflicts will need
to simply be resolved during the merges just like any other conflict
that ever happens in our trees.


Re: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to the Hisilicon RoCE Driver

2016-09-05 Thread David Miller
From: Salil Mehta 
Date: Mon, 5 Sep 2016 12:53:07 +

> There is a patch in net-next for HNS Ethernet driver which has been accepted.
> "b3dc935 net: hns: remove redundant dev_err call in hns_dsaf_get_cfg()"
> 
> This patch is creating a conflict with Doug Ledford's hns-roce branch.
> Internally, we have decided to let all the HNS patches pass through the
> hns-roce tree for the 4.9 merge window to facilitate a non-conflicting
> merge of the pending RoCE driver by Doug Ledford.
> 
> Though we are trying to take a grip of the process for this merge window,
> somehow this one bypassed the internal process. This will create problems for
> hns-roce during the merge later this window. Can I request you to drop this
> patch for now? We shall re-submit this patch later when things have settled
> at the RoCE/RDMA end, or it would come again to you through the hns-roce
> branch.
> 
> Please let me know if this is possible this time. Thanks in anticipation!

This cannot work like this.

If I see an obvious fix to something in networking, I will apply the
patch.

Merge conflicts happen and they have to be resolved.

We don't just revert legitimate changes to eliminate the conflict.


Re: [PATCH v4 4/5] net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC

2016-09-05 Thread Martin Blumenstingl
On Mon, Sep 5, 2016 at 12:53 PM, Arnd Bergmann  wrote:
> On Monday, September 5, 2016 9:37:29 AM CEST kbuild test robot wrote:
>> All error/warnings (new ones prefixed by >>):
>>
>> >> drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c:63:18: error: field 
>> >> 'm250_mux' has incomplete type
>>  struct clk_mux  m250_mux;
>>  ^
>> >> drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c:67:21: error: field 
>> >> 'm250_div' has incomplete type
>>  struct clk_divider m250_div;
>> ^
>>
>
> I think this needs a compile-time dependency on COMMON_CLK
Indeed, since we are also a clock provider, we have to depend on
CONFIG_COMMON_CLK.

That brings up a question though:
so far the new driver uses the same Kconfig symbol as the "old" driver
(CONFIG_DWMAC_MESON).
The "old" driver does not need CONFIG_COMMON_CLK while the new one does.
I see a few options here:
1. simply adding the dependency (as most configurations will have
CONFIG_COMMON_CLK enabled anyway)
2. add some "depends on COMMON_CLK || MACH_MESON6 || MACH_MESON8" foo
3. use a new Kconfig symbol for the new driver (CONFIG_DWMAC_MESON8B?)

And finally regarding your other mail: I have already changed
WARN_ON(PTR_ERR_OR_ZERO(...)) to WARN_ON(IS_ERR(...)) in v4.


Regards,
Martin


Re: [PATCH] rxrpc: initialize sched to false to ensure it is not a garbage value

2016-09-05 Thread David Howells
Colin King  wrote:

> From: Colin Ian King 
> 
> sched will be uninitialized (and contain a garbage value) in the case
> where call->state >= RXRPC_CALL_DEAD;  fix this by initializing sched
> to false to avoid an inadvertent call to rxrpc_queue_call.
> 
> Signed-off-by: Colin Ian King 

I already have a patch queued for this, thanks.

David


Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 09/05/2016 08:32 PM, Alexei Starovoitov wrote:
> On 9/5/16 10:09 AM, Daniel Borkmann wrote:
>> On 09/05/2016 04:09 PM, Daniel Mack wrote:

>>> I really don't think it's worth sparing 8 bytes here and then do the
>>> binary compat dance after flags are added, for no real gain.
>>
>> Sure, but there's not much of a dance needed, see for example how map_flags
>> were added some time ago. So, iff there's really no foreseeable use-case in
> sight and since we have this flexibility in place already, then I don't quite
> follow why it's needed, if there's zero pain to add it later on. I would
>> understand it of course, if it cannot be handled later on anymore.
> 
> I agree with Daniel B. Since flags are completely unused right now,
> there is no plan to use it for anything in the coming months and
> even worse they make an annoying hole in the struct, let's not
> add them. We can safely do that later. CHECK_ATTR() allows us to
> do it easily. It's not like syscall where flags are a must-have,
> since we cannot add it later. Here it's done trivially.

Okay then. If you both agree, I won't interfere :)


Daniel



Re: [PATCH v3 nf] netfilter: seqadj: Fix one possible panic in seqadj when mem is exhausted

2016-09-05 Thread Pablo Neira Ayuso
On Sat, Sep 03, 2016 at 07:51:50PM +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> When memory is exhausted, nfct_seqadj_ext_add may fail to add the seqadj
> extension. But the function nf_ct_seqadj_init doesn't check whether
> nfct_seqadj() returns a valid seqadj pointer, while other functions
> perform the sanity check.
> 
> So the system would panic when nfct_seqadj_ext_add failed.
> 
> Signed-off-by: Gao Feng 
> ---
>  v3: Remove the warning log when seqadj is null;
>  v2: Remove the unnecessary seqadj check in nf_ct_seq_adjust
>  v1: Initial patch
> 
>  net/netfilter/nf_conntrack_seqadj.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/nf_conntrack_seqadj.c 
> b/net/netfilter/nf_conntrack_seqadj.c
> index dff0f0c..7f8d814 100644
> --- a/net/netfilter/nf_conntrack_seqadj.c
> +++ b/net/netfilter/nf_conntrack_seqadj.c
> @@ -16,9 +16,12 @@ int nf_ct_seqadj_init(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
>   if (off == 0)
>   return 0;
>  
> + seqadj = nfct_seqadj(ct);
> + if (unlikely(!seqadj))
> + return 0;

I think we should handle this from init_conntrack() by simply dropping
the packet, as we do under similar circumstances (too much stress to deal
with).


Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Alexei Starovoitov

On 9/5/16 10:09 AM, Daniel Borkmann wrote:

On 09/05/2016 04:09 PM, Daniel Mack wrote:

On 09/05/2016 03:56 PM, Daniel Borkmann wrote:

On 09/05/2016 02:54 PM, Daniel Mack wrote:

On 08/30/2016 01:00 AM, Daniel Borkmann wrote:

On 08/26/2016 09:58 PM, Daniel Mack wrote:



 enum bpf_map_type {
@@ -147,6 +149,13 @@ union bpf_attr {
	__aligned_u64	pathname;
	__u32		bpf_fd;
};
+
+struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+	__u32	target_fd;	/* container object to attach to */
+	__u32	attach_bpf_fd;	/* eBPF program to attach */
+	__u32	attach_type;	/* BPF_ATTACH_TYPE_* */
+	__u64	attach_flags;


Could we just do ...

__u32 dst_fd;
__u32 src_fd;
__u32 attach_type;

... and leave flags out, since unused anyway? Also see below.


I'd really like to keep the flags, even if they're unused right now.
This only adds 8 bytes during the syscall operation, so it doesn't
harm.
However, we cannot change the userspace API after the fact, and who
knows what this (rather generic) interface will be used for later on.


With the below suggestion added, then flags doesn't need to be
added currently as it can be done safely at a later point in time
with respecting old binaries. See also the syscall handling code
in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
underlying idea of this was taken from perf_event_open() syscall
back then, see [1] for a summary.

[1] https://lkml.org/lkml/2014/8/26/116


Yes, I know that's possible, and I like the idea, but I don't think any
new interface should come without flags really, as flags are something
that will most certainly be needed at some point anyway. I didn't have
them in my first shot, but Alexei pointed out that they should be added,
and I agree.

Also, this optimization wouldn't make the transported struct payload any
smaller anyway, because the member of that union used by BPF_PROG_LOAD
is still by far the biggest.

I really don't think it's worth sparing 8 bytes here and then do the
binary compat dance after flags are added, for no real gain.


Sure, but there's not much of a dance needed, see for example how map_flags
were added some time ago. So, iff there's really no foreseeable use-case in
sight and since we have this flexibility in place already, then I don't quite
follow why it's needed, if there's zero pain to add it later on. I would
understand it of course, if it cannot be handled later on anymore.


I agree with Daniel B. Since flags are completely unused right now,
there is no plan to use it for anything in the coming months and
even worse they make an annoying hole in the struct, let's not
add them. We can safely do that later. CHECK_ATTR() allows us to
do it easily. It's not like syscall where flags are a must-have,
since we cannot add it later. Here it's done trivially.
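
For readers following the thread, the check in kernel/bpf/syscall.c that
makes this safe is roughly the following: a command only verifies that
every byte past the last field it knows about is zero, so a new trailing
field can be added later while old binaries (which leave it zeroed) keep
working.

	#define CHECK_ATTR(CMD) \
		(memchr_inv((void *) &attr->CMD##_LAST_FIELD + \
			    sizeof(attr->CMD##_LAST_FIELD), 0, \
			    sizeof(*attr) - \
			    offsetof(union bpf_attr, CMD##_LAST_FIELD) - \
			    sizeof(attr->CMD##_LAST_FIELD)) != NULL)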



Re: 6pack: stack-out-of-bounds in sixpack_receive_buf

2016-09-05 Thread Dmitry Vyukov
On Mon, Sep 5, 2016 at 7:49 PM, One Thousand Gnomes
 wrote:
>> different runs). Looking at code, the following looks suspicious -- we
>> limit copy by 512 bytes, but use the original count which can be
>> larger than 512:
>>
>> static void sixpack_receive_buf(struct tty_struct *tty,
>> const unsigned char *cp, char *fp, int count)
>> {
>> unsigned char buf[512];
>> 
>> memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
>> 
>> sixpack_decode(sp, buf, count1);
>>
>>
>> On commit 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next.
>
> With the sane tty locking we now have I believe the following is safe as
> we consume the bytes and move them into the decoded buffer before
> returning.
>
> diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
> index 5a1e985..470b3dc 100644
> --- a/drivers/net/hamradio/6pack.c
> +++ b/drivers/net/hamradio/6pack.c
> @@ -127,7 +127,7 @@ struct sixpack {
>
>  #define AX25_6PACK_HEADER_LEN 0
>
> -static void sixpack_decode(struct sixpack *, unsigned char[], int);
> +static void sixpack_decode(struct sixpack *, const unsigned char[], int);
>  static int encode_sixpack(unsigned char *, unsigned char *, int, unsigned char);
>
>  /*
> @@ -428,7 +428,7 @@ out:
>
>  /*
>   * Handle the 'receiver data ready' interrupt.
> - * This function is called by the 'tty_io' module in the kernel when
> + * This function is called by the tty module in the kernel when
>   * a block of 6pack data has been received, which can now be decapsulated
>   * and sent on to some IP layer for further processing.
>   */
> @@ -436,7 +436,6 @@ static void sixpack_receive_buf(struct tty_struct *tty,
> const unsigned char *cp, char *fp, int count)
>  {
> struct sixpack *sp;
> -   unsigned char buf[512];
> int count1;
>
> if (!count)
> @@ -446,10 +445,7 @@ static void sixpack_receive_buf(struct tty_struct *tty,
> if (!sp)
> return;
>
> -   memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
> -
> /* Read the characters out of the buffer */
> -
> count1 = count;
> while (count) {
> count--;
> @@ -459,7 +455,7 @@ static void sixpack_receive_buf(struct tty_struct *tty,
> continue;
> }
> }
> -   sixpack_decode(sp, buf, count1);
> +   sixpack_decode(sp, cp, count1);
>
> sp_put(sp);
> tty_unthrottle(tty);
> @@ -992,7 +988,7 @@ static void decode_std_command(struct sixpack *sp, unsigned char cmd)
>  /* decode a 6pack packet */
>
>  static void
> -sixpack_decode(struct sixpack *sp, unsigned char *pre_rbuff, int count)
> +sixpack_decode(struct sixpack *sp, const unsigned char *pre_rbuff, int count)
>  {
> unsigned char inbyte;
> int count1;


Applied locally for testing. I will notify if I see this bug again.
Thanks!
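
For the record, the overflow came from copying at most 512 bytes but then
decoding the unclamped count. Had the local copy been kept, a minimal
alternative fix would have been to clamp the decoded length as well; a
sketch only:

	unsigned char buf[512];
	int n = min_t(int, count, (int)sizeof(buf));

	memcpy(buf, cp, n);
	/* ... */
	sixpack_decode(sp, buf, n);	/* never decode past what was copied */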


Re: 6pack: stack-out-of-bounds in sixpack_receive_buf

2016-09-05 Thread One Thousand Gnomes
> different runs). Looking at code, the following looks suspicious -- we
> limit copy by 512 bytes, but use the original count which can be
> larger than 512:
> 
> static void sixpack_receive_buf(struct tty_struct *tty,
> const unsigned char *cp, char *fp, int count)
> {
> unsigned char buf[512];
> 
> memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
> 
> sixpack_decode(sp, buf, count1);
> 
> 
> On commit 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next.

With the sane tty locking we now have I believe the following is safe as
we consume the bytes and move them into the decoded buffer before
returning.

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 5a1e985..470b3dc 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -127,7 +127,7 @@ struct sixpack {
 
 #define AX25_6PACK_HEADER_LEN 0
 
-static void sixpack_decode(struct sixpack *, unsigned char[], int);
+static void sixpack_decode(struct sixpack *, const unsigned char[], int);
 static int encode_sixpack(unsigned char *, unsigned char *, int, unsigned char);
 
 /*
@@ -428,7 +428,7 @@ out:
 
 /*
  * Handle the 'receiver data ready' interrupt.
- * This function is called by the 'tty_io' module in the kernel when
+ * This function is called by the tty module in the kernel when
  * a block of 6pack data has been received, which can now be decapsulated
  * and sent on to some IP layer for further processing.
  */
@@ -436,7 +436,6 @@ static void sixpack_receive_buf(struct tty_struct *tty,
const unsigned char *cp, char *fp, int count)
 {
struct sixpack *sp;
-   unsigned char buf[512];
int count1;
 
if (!count)
@@ -446,10 +445,7 @@ static void sixpack_receive_buf(struct tty_struct *tty,
if (!sp)
return;
 
-   memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
-
/* Read the characters out of the buffer */
-
count1 = count;
while (count) {
count--;
@@ -459,7 +455,7 @@ static void sixpack_receive_buf(struct tty_struct *tty,
continue;
}
}
-   sixpack_decode(sp, buf, count1);
+   sixpack_decode(sp, cp, count1);
 
sp_put(sp);
tty_unthrottle(tty);
@@ -992,7 +988,7 @@ static void decode_std_command(struct sixpack *sp, unsigned char cmd)
 /* decode a 6pack packet */
 
 static void
-sixpack_decode(struct sixpack *sp, unsigned char *pre_rbuff, int count)
+sixpack_decode(struct sixpack *sp, const unsigned char *pre_rbuff, int count)
 {
unsigned char inbyte;
int count1;


Re: [PATCH 02/29] netfilter: physdev: add missed blank

2016-09-05 Thread Joe Perches
On Mon, 2016-09-05 at 12:58 +0200, Pablo Neira Ayuso wrote:
[]
> diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
[]
> @@ -107,8 +107,8 @@ static int physdev_mt_check(const struct xt_mtchk_param *par)
>    info->invert & XT_PHYSDEV_OP_BRIDGED) &&
>   par->hook_mask & ((1 << NF_INET_LOCAL_OUT) |
>   (1 << NF_INET_FORWARD) | (1 << NF_INET_POST_ROUTING))) {
> - pr_info("using --physdev-out and --physdev-is-out are only"
> - "supported in the FORWARD and POSTROUTING chains with"
> + pr_info("using --physdev-out and --physdev-is-out are only "
> + "supported in the FORWARD and POSTROUTING chains with "
>   "bridged traffic.\n");
>   if (par->hook_mask & (1 << NF_INET_LOCAL_OUT))
>   return -EINVAL;

Perhaps it would be reasonable at some point to coalesce
all the string fragments.

Maybe using this could help:

$ git ls-files -- "net/netfilter/*.[ch]" | \
  xargs ./scripts/checkpatch.pl  -f --types=split_string --fix-inplace


Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Joe Perches
On Mon, 2016-09-05 at 14:56 +0200, Daniel Mack wrote:
> On 08/27/2016 02:08 AM, Alexei Starovoitov wrote:
[]
> > +   switch (attr->attach_type) {
> > +   case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
> > +   case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
> > +   struct cgroup *cgrp;
> > +
> > +   prog = bpf_prog_get_type(attr->attach_bpf_fd,
> > +    BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
> > +   if (IS_ERR(prog))
> > +   return PTR_ERR(prog);
> > +
> > +   cgrp = cgroup_get_from_fd(attr->target_fd);
> > +   if (IS_ERR(cgrp)) {
> > +   bpf_prog_put(prog);
> > +   return PTR_ERR(cgrp);
> > +   }
> > +
> > +   cgroup_bpf_update(cgrp, prog, attr->attach_type);
> > +   cgroup_put(cgrp);
> > +
> > +   break;
> > +   }
> this } formatting style is confusing. The above } looks
> like it matches 'switch () {'.
> Maybe move 'struct cgroup *cgrp' to the top to avoid that?

This style of case statements that declare local variables
with an open brace after the case statement

switch (bar) {
[cases...]
case foo: {
local declarations;

code...
}
[cases...]
}

is used quite frequently in the kernel.
I think it's fine as is.



Re: [PATCH net-next v3] gso: Support partial splitting at the frag_list pointer

2016-09-05 Thread Alexander Duyck
On Mon, Sep 5, 2016 at 3:47 AM, Steffen Klassert
 wrote:
> Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
> gro may build buffers with a frag_list. This can hurt forwarding
> because most NICs can't offload such packets, they need to be
> segmented in software. This patch splits buffers with a frag_list
> at the frag_list pointer into buffers that can be TSO offloaded.
>
> Signed-off-by: Steffen Klassert 
> ---
>
> Changes since v1:
>
> - Use the assumption that all buffers in the chain, excluding the last,
>   contain the same amount of data.
>
> - Simplify some checks against gso partial.
>
> - Fix the generation of IP IDs.
>
> Changes since v2:
>
> - Merge common code of gso partial and frag_list pointer splitting.
>
>  net/core/skbuff.c  | 50 +++---
>  net/ipv4/af_inet.c | 14 ++
>  net/ipv4/gre_offload.c |  6 --
>  net/ipv4/tcp_offload.c | 13 +++--
>  net/ipv4/udp_offload.c |  6 --
>  net/ipv6/ip6_offload.c |  5 -
>  6 files changed, 68 insertions(+), 26 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3864b4b6..f754d47 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3078,11 +3078,30 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> sg = !!(features & NETIF_F_SG);
> csum = !!can_checksum_protocol(features, proto);
>
> -   /* GSO partial only requires that we trim off any excess that
> -* doesn't fit into an MSS sized block, so take care of that
> -* now.
> -*/
> -   if (sg && csum && (features & NETIF_F_GSO_PARTIAL)) {
> +   if (sg && csum && (mss != GSO_BY_FRAGS))  {
> +   if (!(features & NETIF_F_GSO_PARTIAL)) {
> +   if (list_skb &&
> +   net_gso_ok(features, skb_shinfo(head_skb)->gso_type)) {

The testing logic here is a bit off.

You need to prevent us from doing the partial_segs bit below if
NETIF_F_GSO_PARTIAL is not set and if your list_skb or net_gso_ok
tests fail.  Since as you pointed out we shouldn't ever be trying to
perform GSO_PARTIAL on a frame that has a frag_list, what you might do
is something like:
if (!list_skb || !net_gso_ok(...))
goto normal;

That way we don't setup partial_segs unless we are actually using it.
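
Putting that together, the flow I have in mind would look something like
this (just a sketch, not the final patch):

	if (sg && csum && (mss != GSO_BY_FRAGS)) {
		if (!(features & NETIF_F_GSO_PARTIAL)) {
			struct sk_buff *iter;

			if (!list_skb ||
			    !net_gso_ok(features, skb_shinfo(head_skb)->gso_type))
				goto normal;

			/* Split the buffer at the frag_list pointer. */
			skb_walk_frags(head_skb, iter) {
				if (skb_headlen(iter))
					goto normal;

				len -= iter->len;
			}
		}

		/* Only now set up partial_segs, since we know it will be used. */
		partial_segs = len / mss;
		if (partial_segs > 1)
			mss *= partial_segs;
		else
			partial_segs = 0;
	}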

> +   struct sk_buff *iter;
> +
> +   /* Split the buffer at the frag_list pointer.
> +* This is based on the assumption that all
> +* buffers in the chain, excluding the last,
> +* contain the same amount of data.
> +*/
> +   skb_walk_frags(head_skb, iter) {
> +   if (skb_headlen(iter))
> +   goto normal;
> +
> +   len -= iter->len;
> +   }
> +   }
> +   }
> +
> +   /* GSO partial only requires that we trim off any excess that
> +* doesn't fit into an MSS sized block, so take care of that
> +* now.
> +*/
> partial_segs = len / mss;
> if (partial_segs > 1)
> mss *= partial_segs;
> @@ -3090,6 +3109,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> partial_segs = 0;
> }
>
> +normal:
> headroom = skb_headroom(head_skb);
> pos = skb_headlen(head_skb);
>
> @@ -3281,21 +3301,29 @@ perform_csum_check:
>  */
> segs->prev = tail;
>
> -   /* Update GSO info on first skb in partial sequence. */
> if (partial_segs) {
> +   struct sk_buff *iter;
> int type = skb_shinfo(head_skb)->gso_type;
> +   unsigned short gso_size = skb_shinfo(head_skb)->gso_size;
>
> /* Update type to add partial and then remove dodgy if set */
> -   type |= SKB_GSO_PARTIAL;
> +   type |= (features & NETIF_F_GSO_PARTIAL) / NETIF_F_GSO_PARTIAL * SKB_GSO_PARTIAL;
> type &= ~SKB_GSO_DODGY;
>
> /* Update GSO info and prepare to start updating headers on
>  * our way back down the stack of protocols.
>  */
> -   skb_shinfo(segs)->gso_size = skb_shinfo(head_skb)->gso_size;
> -   skb_shinfo(segs)->gso_segs = partial_segs;
> -   skb_shinfo(segs)->gso_type = type;
> -   SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset;
> +   for (iter = segs; iter; iter = iter->next) {
> +   skb_shinfo(iter)->gso_size = gso_size;
> +   skb_shinfo(iter)->gso_segs = 

[patch net-next v7 2/3] net: core: Add offload stats to if_stats_msg

2016-09-05 Thread Jiri Pirko
From: Nogah Frankel 

Add a nested attribute of offload stats to if_stats_msg
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats counted only for packets that went
via the slowpath to the CPU, named IFLA_OFFLOAD_XSTATS_CPU_HIT.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 include/uapi/linux/if_link.h | 10 +
 net/core/rtnetlink.c | 88 ++--
 2 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9bf3aec..4aaa2a1 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -826,6 +826,7 @@ enum {
IFLA_STATS_LINK_64,
IFLA_STATS_LINK_XSTATS,
IFLA_STATS_LINK_XSTATS_SLAVE,
+   IFLA_STATS_LINK_OFFLOAD_XSTATS,
__IFLA_STATS_MAX,
 };
 
@@ -845,6 +846,15 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* These are stats embedded into IFLA_STATS_LINK_OFFLOAD_XSTATS */
+enum {
+   IFLA_OFFLOAD_XSTATS_UNSPEC,
+   IFLA_OFFLOAD_XSTATS_CPU_HIT, /* struct rtnl_link_stats64 */
+   __IFLA_OFFLOAD_XSTATS_MAX
+};
+#define IFLA_OFFLOAD_XSTATS_MAX (__IFLA_OFFLOAD_XSTATS_MAX - 1)
+#define IFLA_OFFLOAD_XSTATS_FIRST (IFLA_OFFLOAD_XSTATS_UNSPEC + 1)
+
 /* XDP section */
 
 enum {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1dfca1c..95eb131 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3577,6 +3577,74 @@ static bool stats_attr_valid(unsigned int mask, int attrid, int idxattr)
   (!idxattr || idxattr == attrid);
 }
 
+static int rtnl_get_offload_stats_attr_size(int attr_id)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return sizeof(struct rtnl_link_stats64);
+   }
+
+   return 0;
+}
+
+static int rtnl_get_offload_stats(struct sk_buff *skb, struct net_device *dev)
+{
+   struct nlattr *attr;
+   int attr_id, size;
+   void *attr_data;
+   int err;
+
+   if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats &&
+ dev->netdev_ops->ndo_get_offload_stats))
+   return 0;
+
+   for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST;
+attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) {
+   size = rtnl_get_offload_stats_attr_size(attr_id);
+   if (!size)
+   continue;
+
+   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   continue;
+
+   attr = nla_reserve_64bit(skb, attr_id, size,
+IFLA_OFFLOAD_XSTATS_UNSPEC);
+   if (!attr)
+   return -EMSGSIZE;
+
+   attr_data = nla_data(attr);
+   memset(attr_data, 0, size);
+   err = dev->netdev_ops->ndo_get_offload_stats(attr_id, dev,
+attr_data);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
+static int rtnl_get_offload_stats_size(const struct net_device *dev)
+{
+   int size = 0;
+   int attr_id;
+
+   if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats &&
+ dev->netdev_ops->ndo_get_offload_stats))
+   return nla_total_size(0);
+
+   for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST;
+attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) {
+   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   continue;
+
+   size += rtnl_get_offload_stats_attr_size(attr_id);
+   }
+
+   size += nla_total_size(0);
+
+   return size;
+}
+
 static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
   int type, u32 pid, u32 seq, u32 change,
   unsigned int flags, unsigned int filter_mask,
@@ -3586,6 +3654,7 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
struct nlmsghdr *nlh;
struct nlattr *attr;
int s_prividx = *prividx;
+   int err;
 
ASSERT_RTNL();
 
@@ -3614,8 +3683,6 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
 
if (ops && ops->fill_linkxstats) {
-   int err;
-
*idxattr = IFLA_STATS_LINK_XSTATS;
attr = nla_nest_start(skb,
  IFLA_STATS_LINK_XSTATS);
@@ -3639,8 +3706,6 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
if (master)
ops = master->rtnl_link_ops;
if (ops && ops->fill_linkxstats) {
-   int err;
-
*idxattr = IFLA_STATS_LINK_XSTATS_SLAVE;

[patch net-next v7 1/3] netdevice: Add offload statistics ndo

2016-09-05 Thread Jiri Pirko
From: Nogah Frankel 

Add a new ndo to return statistics for offloaded operations.
Since there can be many different offloaded operations with many
stats types, the ndo gets an attribute id by which it knows which
stats are wanted. The ndo also gets a void pointer to be cast according
to the attribute id.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 include/linux/netdevice.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 67bb978..2d2c09b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -924,6 +924,14 @@ struct netdev_xdp {
  * 3. Update dev->stats asynchronously and atomically, and define
  *neither operation.
  *
+ * bool (*ndo_has_offload_stats)(int attr_id)
+ * Return true if this device supports offload stats of this attr_id.
+ *
+ * int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
+ * void *attr_data)
+ * Get statistics for offload operations by attr_id. Write it into the
+ * attr_data pointer.
+ *
  * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid);
  * If device supports VLAN filtering this function is called when a
  * VLAN id is registered.
@@ -1155,6 +1163,10 @@ struct net_device_ops {
 
struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev,
 struct rtnl_link_stats64 
*storage);
+   bool(*ndo_has_offload_stats)(int attr_id);
+   int (*ndo_get_offload_stats)(int attr_id,
+const struct 
net_device *dev,
+void *attr_data);
struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
 
int (*ndo_vlan_rx_add_vid)(struct net_device *dev,
-- 
2.5.5



[patch net-next v7 0/3] return offloaded stats as default and expose original sw stats

2016-09-05 Thread Jiri Pirko
From: Jiri Pirko 

The problem we are trying to handle is offloaded, forwarded packets
which are never seen by the kernel. Let me try to draw it:

port1                      port2 (HW stats are counted here)
    \                      /
     \                    /
      \                  /
       --(A)-- ASIC --(B)--
               |
              (C)
               |
              CPU (SW stats are counted here)


Now we have a couple of flows for TX and RX (direction does not matter here):

1) port1->A->ASIC->C->CPU

   For this flow, HW and SW stats are equal.

2) port1->A->ASIC->C->CPU->C->ASIC->B->port2

   For this flow, HW and SW stats are equal.

3) port1->A->ASIC->B->port2

   For this flow, SW stats are 0.

The purpose of this patchset is to provide facility for user to
find out the difference between flows 1+2 and 3. In other words, user
will be able to see the statistics for the slow-path (through kernel).

Also note that the HW stats are what some would call "accumulated" stats:
every packet counted by SW is also counted by HW, but not the other way around.
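
(A worked example: if the HW counters show rx_packets = 1000 while the
CPU_HIT stats show rx_packets = 200, then 200 packets took the slow path
through the kernel (flows 1+2) and the remaining 800 were forwarded
purely in hardware (flow 3).)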

As a default the accumulated stats (HW) will be exposed to user
so the userspace apps can react properly.

This patchset adds the SW stats (flows 1+2) under offload-related stats, so
in the future we can expose other offload-related stats in a similar way.
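
To make the consumer side concrete, here is a minimal userspace sketch
of requesting the new nested attribute via RTM_GETSTATS (illustrative
only, not part of the patchset; reply parsing is elided):

	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <linux/if_link.h>
	#include <linux/rtnetlink.h>

	static int request_offload_stats(int ifindex)
	{
		struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
		struct {
			struct nlmsghdr nlh;
			struct if_stats_msg ifsm;
		} req;
		int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

		if (fd < 0)
			return -1;

		memset(&req, 0, sizeof(req));
		req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.ifsm));
		req.nlh.nlmsg_type = RTM_GETSTATS;
		req.nlh.nlmsg_flags = NLM_F_REQUEST;
		req.ifsm.family = AF_UNSPEC;
		req.ifsm.ifindex = ifindex;
		req.ifsm.filter_mask =
			IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_OFFLOAD_XSTATS);

		if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
			   (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
			close(fd);
			return -1;
		}

		/* The reply carries IFLA_STATS_LINK_OFFLOAD_XSTATS with a
		 * nested IFLA_OFFLOAD_XSTATS_CPU_HIT payload, which is a
		 * struct rtnl_link_stats64 of the slow-path counters. */
		close(fd);
		return 0;
	}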

---
v6->v7:
- patch 1/3:
 - ndo interface changed to get the wanted stats type as an input.
 - change commit message.
- patch 2/3:
 - create a nested attribute for offloaded stats and put the SW stats under it.
 - change the ndo call to indicate which offload stats we want.
 - change commit message.
- patch 3/3:
 - change ndo implementation to match the changes in the previous patches.
 - change commit message.
v5->v6:
- patch 2/4 was dropped as requested by Roopa
- patch 1/3:
 - comment changed to indicate that default stats are combined stats
 - commit message changed
- patch 2/3: (previously 3/4)
 - SW stats return nothing if there is no SW stats ndo
v4->v5:
- updated cover letter
- patch3/4:
  - using memcpy directly to copy stats as requested by DaveM
v3->v4:
- patch1/4:
  - fixed "return ()" pointed out by EricD
- patch2/4:
  - fixed if_nlmsg_size as pointed out by EricD
v2->v3:
- patch1/4:
  - added dev_have_sw_stats helper
- patch2/4:
  - avoided memcpy as requested by DaveM
- patch3/4:
  - use new dev_have_sw_stats helper
v1->v2:
- patch3/4:
  - fixed NULL initialization

Nogah Frankel (3):
  netdevice: Add offload statistics ndo
  net: core: Add offload stats to if_stats_msg
  mlxsw: spectrum: Implement offload stats ndo and expose HW stats by
default

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   5 +
 include/linux/netdevice.h  |  12 +++
 include/uapi/linux/if_link.h   |  10 ++
 net/core/rtnetlink.c   |  88 -
 5 files changed, 233 insertions(+), 11 deletions(-)

-- 
2.5.5



[patch net-next v7 3/3] mlxsw: spectrum: Implement offload stats ndo and expose HW stats by default

2016-09-05 Thread Jiri Pirko
From: Nogah Frankel 

Change the default statistics ndo to return HW statistics
(like the ones returned by ethtool_ops).
The HW stats are collected into a cache by a delayed work every 1 sec.
Implement the offload stats ndo.
Add a function to get the SW statistics, to be called from that ndo.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   5 +
 2 files changed, 127 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 6c6b726..2a7f93f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -818,9 +818,9 @@ err_span_port_mtu_update:
return err;
 }
 
-static struct rtnl_link_stats64 *
-mlxsw_sp_port_get_stats64(struct net_device *dev,
- struct rtnl_link_stats64 *stats)
+int
+mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
+struct rtnl_link_stats64 *stats)
 {
struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
struct mlxsw_sp_port_pcpu_stats *p;
@@ -847,6 +847,107 @@ mlxsw_sp_port_get_stats64(struct net_device *dev,
tx_dropped  += p->tx_dropped;
}
stats->tx_dropped   = tx_dropped;
+   return 0;
+}
+
+bool mlxsw_sp_port_has_offload_stats(int attr_id)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return true;
+   }
+
+   return false;
+}
+
+int mlxsw_sp_port_get_offload_stats(int attr_id, const struct net_device *dev,
+   void *sp)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return mlxsw_sp_port_get_sw_stats64(dev, sp);
+   }
+
+   return -EINVAL;
+}
+
+static int mlxsw_sp_port_get_stats_raw(struct net_device *dev, int grp,
+  int prio, char *ppcnt_pl)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+
+   mlxsw_reg_ppcnt_pack(ppcnt_pl, mlxsw_sp_port->local_port, grp, prio);
+   return mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ppcnt), ppcnt_pl);
+}
+
+static int mlxsw_sp_port_get_hw_stats(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+   char ppcnt_pl[MLXSW_REG_PPCNT_LEN];
+   int err;
+
+   err = mlxsw_sp_port_get_stats_raw(dev, MLXSW_REG_PPCNT_IEEE_8023_CNT,
+ 0, ppcnt_pl);
+   if (err)
+   goto out;
+
+   stats->tx_packets =
+   mlxsw_reg_ppcnt_a_frames_transmitted_ok_get(ppcnt_pl);
+   stats->rx_packets =
+   mlxsw_reg_ppcnt_a_frames_received_ok_get(ppcnt_pl);
+   stats->tx_bytes =
+   mlxsw_reg_ppcnt_a_octets_transmitted_ok_get(ppcnt_pl);
+   stats->rx_bytes =
+   mlxsw_reg_ppcnt_a_octets_received_ok_get(ppcnt_pl);
+   stats->multicast =
+   mlxsw_reg_ppcnt_a_multicast_frames_received_ok_get(ppcnt_pl);
+
+   stats->rx_crc_errors =
+   mlxsw_reg_ppcnt_a_frame_check_sequence_errors_get(ppcnt_pl);
+   stats->rx_frame_errors =
+   mlxsw_reg_ppcnt_a_alignment_errors_get(ppcnt_pl);
+
+   stats->rx_length_errors = (
+   mlxsw_reg_ppcnt_a_in_range_length_errors_get(ppcnt_pl) +
+   mlxsw_reg_ppcnt_a_out_of_range_length_field_get(ppcnt_pl) +
+   mlxsw_reg_ppcnt_a_frame_too_long_errors_get(ppcnt_pl));
+
+   stats->rx_errors = (stats->rx_crc_errors +
+   stats->rx_frame_errors + stats->rx_length_errors);
+
+out:
+   return err;
+}
+
+static void update_stats_cache(struct work_struct *work)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port =
+   container_of(work, struct mlxsw_sp_port,
+hw_stats.update_dw.work);
+
+   if (!netif_carrier_ok(mlxsw_sp_port->dev))
+   goto out;
+
+   mlxsw_sp_port_get_hw_stats(mlxsw_sp_port->dev,
+  mlxsw_sp_port->hw_stats.cache);
+
+out:
+   mlxsw_core_schedule_dw(&mlxsw_sp_port->hw_stats.update_dw,
+  MLXSW_HW_STATS_UPDATE_TIME);
+}
+
+/* Return the stats from a cache that is updated periodically,
+ * as this function might get called in an atomic context.
+ */
+static struct rtnl_link_stats64 *
+mlxsw_sp_port_get_stats64(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+
+   memcpy(stats, mlxsw_sp_port->hw_stats.cache, sizeof(*stats));
+
return stats;
 }
 
@@ -1208,6 +1309,8 @@ static const struct net_device_ops mlxsw_sp_port_netdev_ops = {
  

Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Borkmann

On 09/05/2016 04:09 PM, Daniel Mack wrote:

On 09/05/2016 03:56 PM, Daniel Borkmann wrote:

On 09/05/2016 02:54 PM, Daniel Mack wrote:

On 08/30/2016 01:00 AM, Daniel Borkmann wrote:

On 08/26/2016 09:58 PM, Daniel Mack wrote:



enum bpf_map_type {
@@ -147,6 +149,13 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;/* BPF_ATTACH_TYPE_* */
+   __u64   attach_flags;


Could we just do ...

__u32 dst_fd;
__u32 src_fd;
__u32 attach_type;

... and leave flags out, since unused anyway? Also see below.


I'd really like to keep the flags, even if they're unused right now.
This only adds 8 bytes during the syscall operation, so it doesn't harm.
However, we cannot change the userspace API after the fact, and who
knows what this (rather generic) interface will be used for later on.


With the below suggestion added, then flags doesn't need to be
added currently as it can be done safely at a later point in time
with respecting old binaries. See also the syscall handling code
in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
underlying idea of this was taken from perf_event_open() syscall
back then, see [1] for a summary.

[1] https://lkml.org/lkml/2014/8/26/116


Yes, I know that's possible, and I like the idea, but I don't think any
new interface should come without flags really, as flags are something
that will most certainly be needed at some point anyway. I didn't have
them in my first shot, but Alexei pointed out that they should be added,
and I agree.

Also, this optimization wouldn't make the transported struct payload any
smaller anyway, because the member of that union used by BPF_PROG_LOAD
is still by far the biggest.

I really don't think it's worth sparing 8 bytes here and then do the
binary compat dance after flags are added, for no real gain.


Sure, but there's not much of a dance needed, see for example how map_flags
were added some time ago. So, iff there's really no foreseeable use-case in
sight and since we have this flexibility in place already, then I don't quite
follow why it's needed, if there's zero pain to add it later on. I would
understand it of course, if it cannot be handled later on anymore.


[PATCH] rxrpc: initialize sched to false to ensure it is not a garbage value

2016-09-05 Thread Colin King
From: Colin Ian King 

sched will be uninitialized (and contain a garbage value) in the case
where call->state >= RXRPC_CALL_DEAD;  fix this by initializing sched
to false to avoid an inadvertent call to rxrpc_queue_call.

Signed-off-by: Colin Ian King 
---
 net/rxrpc/call_object.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 516d8ea..57e00fc 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -586,7 +586,7 @@ static void rxrpc_dead_call_expired(unsigned long _call)
  */
 static void rxrpc_mark_call_released(struct rxrpc_call *call)
 {
-   bool sched;
+   bool sched = false;
 
rxrpc_see_call(call);
write_lock(&call->state_lock);
-- 
2.9.3



Re: 6pack: stack-out-of-bounds in sixpack_receive_buf

2016-09-05 Thread One Thousand Gnomes
On Sat, 3 Sep 2016 15:38:08 +0200
Dmitry Vyukov  wrote:

> Hello,
> 
> While running syzkaller fuzzer I've got the following report:
> 
> BUG: KASAN: stack-out-of-bounds in sixpack_receive_buf+0xf8a/0x1450 at
> addr 880037fbf850
> Read of size 1 by task syz-executor/6759
> page:eadfefc0 count:0 mapcount:0 mapping:  (null) index:0x0
> flags: 0x1fffc00()
> page dumped because: kasan: bad access detected
> CPU: 3 PID: 6759 Comm: syz-executor Not tainted 4.8.0-rc3-next-20160825+ #8
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>  886b6fe0 880037fbf520 82db38d9 37fbf5b0
>  fbfff10d6dfc 880037fbf5b0 880037fbf850 880037fbf850
>  880037d3f180 dc00 880037fbf5a0 8180a383
> Call Trace:
>  [] __asan_report_load1_noabort+0x3e/0x40
> mm/kasan/report.c:319
>  [< inline >] sixpack_decode drivers/net/hamradio/6pack.c:1001
>  [] sixpack_receive_buf+0xf8a/0x1450
> drivers/net/hamradio/6pack.c:462
>  [] tty_ldisc_receive_buf+0x168/0x1b0
> drivers/tty/tty_buffer.c:433
>  [] paste_selection+0x27e/0x3e0 
> drivers/tty/vt/selection.c:363
>  [] tioclinux+0x126/0x410 drivers/tty/vt/vt.c:2683
>  [] vt_ioctl+0x13ef/0x2910 drivers/tty/vt/vt_ioctl.c:365
>  [] tty_ioctl+0x69d/0x21e0 drivers/tty/tty_io.c:2983
>  [< inline >] vfs_ioctl fs/ioctl.c:43
>  [] do_vfs_ioctl+0x18c/0x1080 fs/ioctl.c:675
>  [< inline >] SYSC_ioctl fs/ioctl.c:690
>  [] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:681
>  [] entry_SYSCALL_64_fastpath+0x23/0xc1
> Memory state around the buggy address:
>  880037fbf700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  880037fbf780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >880037fbf800: 00 00 00 00 00 00 00 00 00 00 f3 f3 f3 f3 00 00  
>  ^
>  880037fbf880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  880037fbf900: 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4 f2 f2 f2 f2
> ==================================================================
> 
> 
> It is then followed by similar reports that access subsequent stack bytes.
> Unfortunately I can't reproduce it (though, I got 6 similar crashes in
> different runs). Looking at code, the following looks suspicious -- we
> limit copy by 512 bytes, but use the original count which can be
> larger than 512:
> 
> static void sixpack_receive_buf(struct tty_struct *tty,
> const unsigned char *cp, char *fp, int count)
> {
> unsigned char buf[512];
> 
> memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
> 
> sixpack_decode(sp, buf, count1);
> 

Your suspicion is correct.
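
For reference, the narrowest fix would have been to clamp the count to
the bounce buffer (a sketch only; the actual fix drops the stack copy
entirely and passes 'cp' through as const, as in the 6pack patch above):

	count = min_t(int, count, (int)sizeof(buf));
	memcpy(buf, cp, count);
	count1 = count;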

Alan


Re: [PATCH net-next 4/9] rxrpc: Randomise epoch and starting client conn ID values

2016-09-05 Thread David Howells
[cc'ing Jeff Altman for comment]

David Laight  wrote:

> > Create a random epoch value rather than a time-based one on startup and set
> > the top bit to indicate that this is the case.
> 
> Why set the top bit?
> There is nothing to stop the time (in seconds) from having the top bit set.
> Nothing else can care - otherwise this wouldn't work.

This is what I'm told I should do by purveyors of other RxRPC solutions.

> > Also create a random starting client connection ID value.  This will be
> > incremented from here as new client connections are created.
> 
> I'm guessing this is to make duplicates less likely after a restart?

Again, it's been suggested that I do this, but I would guess so.

> You may want to worry about duplicate allocations (after 2^32 connects).

It's actually a quarter of that, but connection != call, so a connection may
be used for up to ~16 billion RPC operations before it *has* to be flushed.

> There are id allocation algorithms that guarantee not to generate duplicates
> and not to reuse values quickly while still being fixed cost.
> Look at the code NetBSD uses to allocate process ids for an example.

I'm using idr_alloc_cyclic()[*] with a fixed size "window" on the active conn
ID values.  Client connections with IDs outside of that window are discarded
as soon as possible to keep the memory consumption of the tree down (and to
force security renegotiation occasionally).  However, given that there are a
billion IDs to cycle through, it will take quite a while for reuse to become
an issue.
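
A rough illustration of that allocation pattern (hypothetical names, not
the actual rxrpc code):

	static DEFINE_IDR(conn_ids);

	/* Allocate the next client conn ID, cycling within a quarter of
	 * the 32-bit space so values are not reused quickly. */
	static int alloc_conn_id(void *conn)
	{
		int id;

		idr_preload(GFP_KERNEL);
		id = idr_alloc_cyclic(&conn_ids, conn, 1, 0x40000000, GFP_NOWAIT);
		idr_preload_end();

		return id;	/* negative errno on failure */
	}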

I like the idea of incrementing the epoch every time we cycle through the ID
space, but I'm told that a change in the epoch value is an indication that the
client rebooted - with what consequences I cannot say.

[*] which is what Linux uses to allocate process IDs.

David


Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 09/05/2016 05:30 PM, David Laight wrote:
> From: Daniel Mack
>>>> +
>>>> +  struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>>> +  __u32   target_fd;  /* container object to attach to */
>>>> +  __u32   attach_bpf_fd;  /* eBPF program to attach */
>>>> +  __u32   attach_type;/* BPF_ATTACH_TYPE_* */
>>>> +  __u64   attach_flags;
>>>> +  };
>>>
>>> there is a 4 byte hole in this struct. Can we pack it differently?
>>
>> Okay - I swapped "type" and "flags" to repair this.
> 
> That just moves the pad to the end of the structure.
> Still likely to cause a problem for 32bit apps on a 64bit kernel.

What kind of problem do you have in mind? Again, this is embedded in a
union of much bigger total size, and the API is not used in any kind of
hot-path.

> If you can't think of any flags, why 64 of them?

I can't think of them right now, but this is about defining an API that
can be used in other contexts as well. Also, it doesn't matter at all;
they do no harm. IMO, it's just better to have them right away than to
do a binary compat dance once someone needs them.


Thanks,
Daniel



RE: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread David Laight
From: Daniel Mack
> >> +
> >> +  struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> >> +  __u32   target_fd;  /* container object to attach to */
> >> +  __u32   attach_bpf_fd;  /* eBPF program to attach */
> >> +  __u32   attach_type;/* BPF_ATTACH_TYPE_* */
> >> +  __u64   attach_flags;
> >> +  };
> >
> > there is a 4 byte hole in this struct. Can we pack it differently?
> 
> Okay - I swapped "type" and "flags" to repair this.

That just moves the pad to the end of the structure.
Still likely to cause a problem for 32bit apps on a 64bit kernel.
If you can't think of any flags, why 64 of them?

David




Re: [PATCH 2/2 nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack

2016-09-05 Thread Feng Gao
Hi Pablo,

On Mon, Sep 5, 2016 at 11:02 PM,   wrote:
> From: Gao Feng 
>
> It is valid for a TCP RST packet to not set the ack flag and to carry
> an all-zero ack number. For such RST packets, seqadj must not adjust
> the ack number.
>
> Signed-off-by: Gao Feng 
> ---
>  net/netfilter/nf_conntrack_seqadj.c | 34 +++---
>  1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/net/netfilter/nf_conntrack_seqadj.c b/net/netfilter/nf_conntrack_seqadj.c
> index 7f8d814..65bb4a6 100644
> --- a/net/netfilter/nf_conntrack_seqadj.c
> +++ b/net/netfilter/nf_conntrack_seqadj.c
> @@ -182,30 +182,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
>
> tcph = (void *)skb->data + protoff;
> spin_lock_bh(&ct->lock);
> +
> if (after(ntohl(tcph->seq), this_way->correction_pos))
> seqoff = this_way->offset_after;
> else
> seqoff = this_way->offset_before;
>
> -   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> - other_way->correction_pos))
> -   ackoff = other_way->offset_after;
> -   else
> -   ackoff = other_way->offset_before;
> -
> newseq = htonl(ntohl(tcph->seq) + seqoff);
> -   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> -
> inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
> -   inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack,
> -false);
> -
> -   pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
> -ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
> -ntohl(newack));
>
> +   pr_debug("Adjusting sequence number from %u->%u\n",
> +ntohl(tcph->seq), ntohl(newseq));
> tcph->seq = newseq;
> -   tcph->ack_seq = newack;
> +
> +   if (likely(tcph->ack)) {
> +   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> + other_way->correction_pos))
> +   ackoff = other_way->offset_after;
> +   else
> +   ackoff = other_way->offset_before;
> +
> +   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> +   inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
> +newack, false);
> +
> +   pr_debug("Adjusting ack number from %u->%u\n",
> +ntohl(tcph->ack_seq), ntohl(newack));
> +   tcph->ack_seq = newack;
> +   }
>
> res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo);
> spin_unlock_bh(&ct->lock);
> --
> 1.9.1
>
>

This patch is generated based on the patch "netfilter: seqadj:
Fix one possible panic in seqadj when mem is exhausted", whose link is
http://patchwork.ozlabs.org/patch/665116/.

So its subject contains "2/2".

Best Regards
Feng


RE: [PATCH net-next 4/9] rxrpc: Randomise epoch and starting client conn ID values

2016-09-05 Thread David Laight
From: David Howells
> Sent: 04 September 2016 22:03
> Create a random epoch value rather than a time-based one on startup and set
> the top bit to indicate that this is the case.

Why set the top bit?
There is nothing to stop the time (in seconds) from having the top bit set.
Nothing else can care - otherwise this wouldn't work.

> Also create a random starting client connection ID value.  This will be
> incremented from here as new client connections are created.

I'm guessing this is to make duplicates less likely after a restart?
You may want to worry about duplicate allocations (after 2^32 connects).
There are id allocation algorithms that guarantee not to generate duplicates
and not to reuse values quickly while still being fixed cost.
Look at the code NetBSD uses to allocate process ids for an example.

David



[PATCH 2/2 nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack

2016-09-05 Thread fgao
From: Gao Feng 

It is valid for a TCP RST packet to not set the ack flag and to carry
an all-zero ack number. For such RST packets, seqadj must not adjust
the ack number.

Signed-off-by: Gao Feng 
---
 net/netfilter/nf_conntrack_seqadj.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/nf_conntrack_seqadj.c b/net/netfilter/nf_conntrack_seqadj.c
index 7f8d814..65bb4a6 100644
--- a/net/netfilter/nf_conntrack_seqadj.c
+++ b/net/netfilter/nf_conntrack_seqadj.c
@@ -182,30 +182,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
 
tcph = (void *)skb->data + protoff;
spin_lock_bh(&ct->lock);
+
if (after(ntohl(tcph->seq), this_way->correction_pos))
seqoff = this_way->offset_after;
else
seqoff = this_way->offset_before;
 
-   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
- other_way->correction_pos))
-   ackoff = other_way->offset_after;
-   else
-   ackoff = other_way->offset_before;
-
newseq = htonl(ntohl(tcph->seq) + seqoff);
-   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
-
inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
-   inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack,
-false);
-
-   pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
-ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
-ntohl(newack));
 
+   pr_debug("Adjusting sequence number from %u->%u\n",
+ntohl(tcph->seq), ntohl(newseq));
tcph->seq = newseq;
-   tcph->ack_seq = newack;
+
+   if (likely(tcph->ack)) {
+   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
+ other_way->correction_pos))
+   ackoff = other_way->offset_after;
+   else
+   ackoff = other_way->offset_before;
+
+   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
+   inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
+newack, false);
+
+   pr_debug("Adjusting ack number from %u->%u\n",
+ntohl(tcph->ack_seq), ntohl(newack));
+   tcph->ack_seq = newack;
+   }
 
res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo);
spin_unlock_bh(&ct->lock);
-- 
1.9.1




Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-09-05 Thread Daniel Mack
Hi,

On 08/30/2016 01:04 AM, Sargun Dhillon wrote:
> On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One for such that are directly pinned to a cgroup, and one for such
>> that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>   A - B - C
>> \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
> How does this work when running an orchestrator within an orchestrator?
> The Docker in Docker / Mesos in Mesos use case, where the top level
> orchestrator is observing the traffic, and there is an orchestrator
> within that also needs to run it.
> 
> In this case, I'd like to run E's filter, then if it returns 0, D's, and B's, 
> and so on.

Running multiple programs was an idea I had in one of my earlier drafts,
but after some discussion, I refrained from it again because potentially
walking the cgroup hierarchy on every packet is just too expensive.

> Is it possible to allow this, either by flattening out the
> datastructure (copy a ref to the bpf programs to C and E) or
> something similar?

That would mean we carry a list of eBPF program pointers of dynamic
size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
so it can store references to the programs of all of its ancestors.

While I think that would be possible, even at some later point, I'd
really like to avoid it for the sake of simplicity.

Is there any reason why this can't be done in userspace? Compile a
program X for A, and overload it with Y, with Y doing the same as X
but adding some extra checks? Note that all users of the bpf(2) syscall
API will need CAP_NET_ADMIN anyway, so there is no delegation to
unprivileged sub-orchestrators or anything like that really.
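
A hypothetical sketch of that composition (names are made up for
illustration, this is not a real API):

	/* Program Y, attached at the deeper cgroup, first re-implements
	 * X's check and then applies its own additional policy. */
	static int prog_y(struct __sk_buff *skb)
	{
		int verdict = check_from_prog_x(skb);	/* logic copied from X */

		if (!verdict)
			return verdict;		/* keep X's drop verdict */

		return my_extra_checks(skb);	/* Y's additional policy */
	}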


Thanks,
Daniel



Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs

2016-09-05 Thread Daniel Mack
On 08/30/2016 12:03 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:

>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index a75df86..17484e6 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -141,6 +141,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>
>>   #include "net-sysfs.h"
>>
>> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void 
>> *accel_priv)
>>  if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>>  __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>>
>> +rc = cgroup_bpf_run_filter(skb->sk, skb,
>> +   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
>> +if (rc)
>> +return rc;
> 
> This would leak the whole skb by the way.

Ah, right.

> Apart from that, could this be modeled w/o affecting the forwarding path (at 
> some
> local output point where we know to have a valid socket)? Then you could also 
> drop
> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of 
> what
> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to 
> be
> just zeroes since not available at that point?

Hmm, I wonder where this hook could be put instead then. When placed in
ip_output() and ip6_output(), the mac headers cannot be pushed before
running the program, resulting in bogus skb data from the eBPF program.

Also, if I read the code correctly, ip[6]_output is not called for
multicast packets.

Any other ideas?


Thanks,
Daniel



RE: [PATCH -next v2] net: hns: fix return value check in hns_dsaf_get_cfg()

2016-09-05 Thread Salil Mehta
Hello,
This patch will conflict with Doug Ledford's hns-roce HNS driver.
This might lead to problems later during the 4.9 merge window.

Therefore, please re-submit it later. The files it touches directly
conflict with the RoCE patches:

[PATCH for-next 1/2] net: hns: Add support of ACPI to HNS driver RoCE Reset 
function 
drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h
drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c
drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.h
drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h

We are in the process of streamlining our internal workflow for better
synchronization between the HNS and RoCE/RDMA teams, and will maintain
an internal repo for such conflicting patches, which David Miller and
Doug Ledford can git-pull from.

Best regards
Salil
> -Original Message-
> From: weiyj...@163.com [mailto:weiyj...@163.com]
> Sent: Tuesday, July 05, 2016 8:57 AM
> To: Zhuangyuzeng (Yisen); Salil Mehta; Yankejian (Hackim Yim)
> Cc: Wei Yongjun; netdev@vger.kernel.org
> Subject: [PATCH -next v2] net: hns: fix return value check in
> hns_dsaf_get_cfg()
> 
> From: Wei Yongjun 
> 
> In case of error, function devm_ioremap_resource() returns ERR_PTR()
> and never returns NULL. The NULL test in the return value check should
> be replaced with IS_ERR().
> 
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
> index 86ce28a..2ef4277 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
> @@ -114,9 +114,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev)
> 
>   dsaf_dev->sc_base = devm_ioremap_resource(&pdev->dev,
> res);
> - if (!dsaf_dev->sc_base) {
> + if (IS_ERR(dsaf_dev->sc_base)) {
>   dev_err(dsaf_dev->dev, "subctrl can not
> map!\n");
> - return -ENOMEM;
> + return PTR_ERR(dsaf_dev->sc_base);
>   }
> 
>   res = platform_get_resource(pdev, IORESOURCE_MEM,
> @@ -128,9 +128,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev)
> 
>   dsaf_dev->sds_base = devm_ioremap_resource(
> &pdev->dev,
>  res);
> - if (!dsaf_dev->sds_base) {
> + if (IS_ERR(dsaf_dev->sds_base)) {
>   dev_err(dsaf_dev->dev, "serdes-ctrl can not
> map!\n");
> - return -ENOMEM;
> + return PTR_ERR(dsaf_dev->sds_base);
>   }
>   } else {
>   dsaf_dev->sub_ctrl = syscon;
> @@ -146,9 +146,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev)
>   }
>   }
>   dsaf_dev->ppe_base = devm_ioremap_resource(&pdev->dev, res);
> - if (!dsaf_dev->ppe_base) {
> + if (IS_ERR(dsaf_dev->ppe_base)) {
>   dev_err(dsaf_dev->dev, "ppe-base resource can not map!\n");
> - return -ENOMEM;
> + return PTR_ERR(dsaf_dev->ppe_base);
>   }
>   dsaf_dev->ppe_paddr = res->start;
> 
> @@ -165,9 +165,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev)
>   }
>   }
>   dsaf_dev->io_base = devm_ioremap_resource(&pdev->dev, res);
> - if (!dsaf_dev->io_base) {
> + if (IS_ERR(dsaf_dev->io_base)) {
>   dev_err(dsaf_dev->dev, "dsaf-base resource can not
> map!\n");
> - return -ENOMEM;
> + return PTR_ERR(dsaf_dev->io_base);
>   }
>   }
> 



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 09/05/2016 03:56 PM, Daniel Borkmann wrote:
> On 09/05/2016 02:54 PM, Daniel Mack wrote:
>> On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
>>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>>
>>>> enum bpf_map_type {
>>>> @@ -147,6 +149,13 @@ union bpf_attr {
>>>> __aligned_u64   pathname;
>>>> __u32   bpf_fd;
>>>> };
>>>> +
>>>> +  struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>>> +  __u32   target_fd;  /* container object to attach to */
>>>> +  __u32   attach_bpf_fd;  /* eBPF program to attach */
>>>> +  __u32   attach_type;/* BPF_ATTACH_TYPE_* */
>>>> +  __u64   attach_flags;
>>>
>>> Could we just do ...
>>>
>>> __u32 dst_fd;
>>> __u32 src_fd;
>>> __u32 attach_type;
>>>
>>> ... and leave flags out, since unused anyway? Also see below.
>>
>> I'd really like to keep the flags, even if they're unused right now.
>> This only adds 8 bytes during the syscall operation, so it doesn't harm.
>> However, we cannot change the userspace API after the fact, and who
>> knows what this (rather generic) interface will be used for later on.
> 
> With the below suggestion added, then flags doesn't need to be
> added currently as it can be done safely at a later point in time
> with respecting old binaries. See also the syscall handling code
> in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
> underlying idea of this was taken from perf_event_open() syscall
> back then, see [1] for a summary.
> 
>[1] https://lkml.org/lkml/2014/8/26/116

Yes, I know that's possible, and I like the idea, but I don't think any
new interface should come without flags really, as flags are something
that will most certainly be needed at some point anyway. I didn't have
them in my first shot, but Alexei pointed out that they should be added,
and I agree.

Also, this optimization wouldn't make the transported struct payload any
smaller anyway, because the member of that union used by BPF_PROG_LOAD
is still by far the biggest.

I really don't think it's worth sparing 8 bytes here and then do the
binary compat dance after flags are added, for no real gain.



Thanks,
Daniel



RE: [PATCH -next] net: hns: remove redundant dev_err call in hns_dsaf_get_cfg()

2016-09-05 Thread Salil Mehta
> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Wednesday, August 24, 2016 1:19 AM
> To: weiyj...@gmail.com
> Cc: Zhuangyuzeng (Yisen); Salil Mehta; huangdaode; Yankejian (Hackim
> Yim); xieqianqian; weiyongjun (A); netdev@vger.kernel.org
> Subject: Re: [PATCH -next] net: hns: remove redundant dev_err call in
> hns_dsaf_get_cfg()
> 
> From: Wei Yongjun 
> Date: Tue, 23 Aug 2016 15:11:03 +
> 
> > From: Wei Yongjun 
> >
> > There is a error message within devm_ioremap_resource
> > already, so remove the dev_err call to avoid redundant
> > error message.
> >
> > Signed-off-by: Wei Yongjun 
> 
> Applied.
Hi David,
Forgive me for acting late on this, but I just realized that this patch will
conflict with Doug Ledford's hns-roce HNS driver. This might lead to
problems later during the 4.9 merge window. The RoCE base driver has already
suffered from such a problem during an earlier merge window.

Can I request you to drop this patch for now? We shall submit it again
later, either through hns-roce or directly to net-next.

Thanks in anticipation
Salil



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Borkmann

On 09/05/2016 02:54 PM, Daniel Mack wrote:

On 08/30/2016 01:00 AM, Daniel Borkmann wrote:

On 08/26/2016 09:58 PM, Daniel Mack wrote:



   enum bpf_map_type {
@@ -147,6 +149,13 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;/* BPF_ATTACH_TYPE_* */
+   __u64   attach_flags;


Could we just do ...

__u32 dst_fd;
__u32 src_fd;
__u32 attach_type;

... and leave flags out, since unused anyway? Also see below.


I'd really like to keep the flags, even if they're unused right now.
This only adds 8 bytes during the syscall operation, so it doesn't harm.
However, we cannot change the userspace API after the fact, and who
knows what this (rather generic) interface will be used for later on.


With the below suggestion added, then flags doesn't need to be
added currently as it can be done safely at a later point in time
with respecting old binaries. See also the syscall handling code
in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
underlying idea of this was taken from perf_event_open() syscall
back then, see [1] for a summary.

  [1] https://lkml.org/lkml/2014/8/26/116


#define BPF_PROG_ATTACH_LAST_FIELD attach_type
#define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD
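
(For reference, these pair with the CHECK_ATTR() macro in
kernel/bpf/syscall.c, which rejects a command if any byte of the union
past the command's last known field is non-zero; from memory it looks
roughly like this, so treat it as a sketch:

	#define CHECK_ATTR(CMD) \
		memchr_inv((void *)&attr->CMD##_LAST_FIELD + \
			   sizeof(attr->CMD##_LAST_FIELD), 0, \
			   sizeof(*attr) - \
			   offsetof(union bpf_attr, CMD##_LAST_FIELD) - \
			   sizeof(attr->CMD##_LAST_FIELD)) != NULL

That zero-check is what makes it safe to append new fields such as flags
at a later point in time.)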


...



Correct way would be to:

if (CHECK_ATTR(BPF_PROG_ATTACH))
return -EINVAL;


Will add - thanks!


Daniel





[PATCH] ath10k: remove unused variable ar_pci

2016-09-05 Thread Chaehyun Lim
Trivial fix to remove unused variable ar_pci in ath10k_pci_tx_pipe_cleanup
when building with W=1:
drivers/net/wireless/ath/ath10k/pci.c:1696:21: warning: variable
'ar_pci' set but not used [-Wunused-but-set-variable]

Signed-off-by: Chaehyun Lim 
---
 drivers/net/wireless/ath/ath10k/pci.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1b841ad..afd5ef7 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1693,14 +1693,12 @@ static void ath10k_pci_rx_pipe_cleanup(struct ath10k_pci_pipe *pci_pipe)
 static void ath10k_pci_tx_pipe_cleanup(struct ath10k_pci_pipe *pci_pipe)
 {
struct ath10k *ar;
-   struct ath10k_pci *ar_pci;
struct ath10k_ce_pipe *ce_pipe;
struct ath10k_ce_ring *ce_ring;
struct sk_buff *skb;
int i;
 
ar = pci_pipe->hif_ce_state;
-   ar_pci = ath10k_pci_priv(ar);
ce_pipe = pci_pipe->ce_hdl;
ce_ring = ce_pipe->src_ring;
 
-- 
2.9.2



Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-09-05 Thread Dmitry Vyukov
On Mon, Sep 5, 2016 at 3:08 PM, Tejun Heo  wrote:
> Hello,
>
> On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote:
>> > I've seen it only several times in several months, so I don't think
>> > it will be helpful.
>>
>>
>> Bad news: I hit it again.
>> On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have
>> bf389cabb3b8079c23f9762e62b05f291e2d5e99.
>
> Hmmm... we're not getting anywhere.  I've applied the following patch
> to wq/for-4.8-fixes so that when this happens the next time we can
> actually tell what the hell is going wrong.
>
> Thanks.
>
> -- 8< --
> From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Mon, 5 Sep 2016 08:54:06 -0400
> Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in
>  destroy_workqueue()
>
> destroy_workqueue() performs a number of sanity checks to ensure that
> the workqueue is empty before proceeding with destruction.  However,
> it's not always easy to tell what's going on just from the warning
> message.  Let's dump workqueue state after sanity check failures to
> help debugging.
>
> Signed-off-by: Tejun Heo 
> Link: 
> http://lkml.kernel.org/r/cact4y+zs6vkjho9qhb4treiz3s4+quvvvq9vwvj2mx6petg...@mail.gmail.com
> Cc: Dmitry Vyukov 
> ---
>  kernel/workqueue.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index ef071ca..4eaec8b8 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
> for (i = 0; i < WORK_NR_COLORS; i++) {
> if (WARN_ON(pwq->nr_in_flight[i])) {
> mutex_unlock(&wq->mutex);
> +   show_workqueue_state();
> return;
> }
> }
> @@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
> WARN_ON(pwq->nr_active) ||
> WARN_ON(!list_empty(&pwq->delayed_works))) {
> mutex_unlock(&wq->mutex);
> +   show_workqueue_state();
> return;
> }
> }


Applied this to my test tree.


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-09-05 Thread Tejun Heo
Hello,

On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote:
> > I've seen it only several times in several months, so I don't think
> > it will be helpful.
> 
> 
> Bad news: I hit it again.
> On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have
> bf389cabb3b8079c23f9762e62b05f291e2d5e99.

Hmmm... we're not getting anywhere.  I've applied the following patch
to wq/for-4.8-fixes so that when this happens the next time we can
actually tell what the hell is going wrong.

Thanks.

-- 8< --
From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Mon, 5 Sep 2016 08:54:06 -0400
Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in
 destroy_workqueue()

destroy_workqueue() performs a number of sanity checks to ensure that
the workqueue is empty before proceeding with destruction.  However,
it's not always easy to tell what's going on just from the warning
message.  Let's dump workqueue state after sanity check failures to
help debugging.

Signed-off-by: Tejun Heo 
Link: 
http://lkml.kernel.org/r/cact4y+zs6vkjho9qhb4treiz3s4+quvvvq9vwvj2mx6petg...@mail.gmail.com
Cc: Dmitry Vyukov 
---
 kernel/workqueue.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ef071ca..4eaec8b8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
for (i = 0; i < WORK_NR_COLORS; i++) {
if (WARN_ON(pwq->nr_in_flight[i])) {
mutex_unlock(&wq->mutex);
+   show_workqueue_state();
return;
}
}
@@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
mutex_unlock(&wq->mutex);
+   show_workqueue_state();
return;
}
}
-- 
2.7.4



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 08/27/2016 02:08 AM, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 09:58:49PM +0200, Daniel Mack wrote:

>> +
>> +struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>> +__u32   target_fd;  /* container object to attach to */
>> +__u32   attach_bpf_fd;  /* eBPF program to attach */
>> +__u32   attach_type;/* BPF_ATTACH_TYPE_* */
>> +__u64   attach_flags;
>> +};
> 
> there is a 4 byte hole in this struct. Can we pack it differently?

Okay - I swapped "type" and "flags" to repair this.

>> +switch (attr->attach_type) {
>> +case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
>> +case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
>> +struct cgroup *cgrp;
>> +
>> +prog = bpf_prog_get_type(attr->attach_bpf_fd,
>> + BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
>> +if (IS_ERR(prog))
>> +return PTR_ERR(prog);
>> +
>> +cgrp = cgroup_get_from_fd(attr->target_fd);
>> +if (IS_ERR(cgrp)) {
>> +bpf_prog_put(prog);
>> +return PTR_ERR(cgrp);
>> +}
>> +
>> +cgroup_bpf_update(cgrp, prog, attr->attach_type);
>> +cgroup_put(cgrp);
>> +
>> +break;
>> +}
> 
> this } formatting style is confusing. The above } looks
> like it matches 'switch () {'.
> Maybe move 'struct cgroup *cgrp' to the top to avoid that?

I kept it local to its users, but you're right, it's not worth it. Will
change.


Thanks,
Daniel




Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:

>>   enum bpf_map_type {
>> @@ -147,6 +149,13 @@ union bpf_attr {
>>  __aligned_u64   pathname;
>>  __u32   bpf_fd;
>>  };
>> +
>> +struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>> +__u32   target_fd;  /* container object to attach to */
>> +__u32   attach_bpf_fd;  /* eBPF program to attach */
>> +__u32   attach_type;/* BPF_ATTACH_TYPE_* */
>> +__u64   attach_flags;
> 
> Could we just do ...
> 
> __u32 dst_fd;
> __u32 src_fd;
> __u32 attach_type;
> 
> ... and leave flags out, since unused anyway? Also see below.

I'd really like to keep the flags, even if they're unused right now.
This only adds 8 bytes during the syscall operation, so it doesn't harm.
However, we cannot change the userspace API after the fact, and who
knows what this (rather generic) interface will be used for later on.

> #define BPF_PROG_ATTACH_LAST_FIELD attach_type
> #define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD

...

> 
> Correct way would be to:
> 
>   if (CHECK_ATTR(BPF_PROG_ATTACH))
>   return -EINVAL;

Will add - thanks!


Daniel



RE: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to the Hisilicon RoCE Driver

2016-09-05 Thread Salil Mehta
> -Original Message-
> From: Doug Ledford [mailto:dledf...@redhat.com]
> Sent: Thursday, August 25, 2016 12:57 PM
> To: David Miller; Salil Mehta
> Cc: Huwei (Xavier); oulijun; Zhuangyuzeng (Yisen);
> mehta.salil@gmail.com; linux-r...@vger.kernel.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Linuxarm
> Subject: Re: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to
> the Hisilicon RoCE Driver
> 
> On 8/25/2016 12:53 AM, David Miller wrote:
> > From: Salil Mehta 
> > Date: Wed, 24 Aug 2016 04:44:48 +0800
> >
> >> This patch is meant to add support of ACPI to the Hisilicon RoCE
> driver.
> >> Following changes have been made in the driver(s):
> >>
> >> Patch 1/2: HNS Ethernet Driver: changes to support ACPI have been
> done in
> >>the RoCE reset function part of the HNS ethernet driver. Earlier
> it only
> >>supported DT/syscon.
> >>
> >> Patch 2/2. HNS RoCE driver: changes done in RoCE driver are meant to
> detect
> >>the type and then either use DT specific or ACPI specific
> functions. Where
> >>ever possible, this patch tries to make use of "Unified Device
> Property
> >>Interface" APIs to support both DT and ACPI through single
> interface.
> >>
> >> NOTE 1: ACPI changes done in both of the drivers depend upon the
> ACPI Table
> >>  (DSDT and IORT tables) changes part of UEFI/BIOS. These changes
> are NOT
> >>  part of this patch-set.
> >> NOTE 2: Reset function in Patch 1/2 depends upon the reset function
> added in
> >>  ACPI tables(basically DSDT table) part of the UEFI/BIOS. Again,
> this
> >>  change is NOT reflected in this patch-set.
> >
> > I can't apply this series to my tree because the hns infiniband
> driver
> > doesn't exist in it.
> >
> 
> No.  This probably needs to go through my tree.  Although with all of
> the requirements, I'm a bit concerned about those being present
> elsewhere.
> 
Hi David,
There is a patch for the HNS Ethernet driver which has already been accepted into net-next:
"b3dc935 net: hns: remove redundant dev_err call in hns_dsaf_get_cfg()"

This patch creates a conflict with Doug Ledford's hns-roce branch.
Internally, we have decided to let all HNS patches pass through hns-roce
for the 4.9 merge window, to facilitate a conflict-free merge of the
pending RoCE driver by Doug Ledford.

Though we are trying to get a grip on the process for this merge window,
this one somehow bypassed it. It will create problems for hns-roce during
the merge later this window. Can I request you to drop this patch for now?
We shall re-submit it later, once things have settled at the RoCE/RDMA end,
or it will come to you again through the hns-roce branch.

Please let me know if this is possible this time. Thanks in anticipation!

Best regards
Salil

> --
> Doug Ledford 
> GPG Key ID: 0E572FDD



Re: [PATCH] tcp: cwnd does not increase in TCP YeAH

2016-09-05 Thread Neal Cardwell
On Mon, Sep 5, 2016 at 12:03 AM, Artem Germanov
 wrote:
>
> Commit 76174004a0f19785a328f40388e87e982bbf69b9
> (tcp: do not slow start when cwnd equals ssthresh )
> introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual
> ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP
> connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth
> ~100KB/s.
>  That is caused by stalled cwnd when cwnd equals ssthresh. This patch
> fixes it by proper increasing cwnd in this case.
>
> Signed-off-by: Artem Germanov 

Acked-by: Neal Cardwell 

neal


Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-09-05 Thread Daniel Mack
On 08/30/2016 12:42 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One for such that are directly pinned to a cgroup, and one for such
>> that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>A - B - C
>>  \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
>> Attaching and detaching programs will be done through the bpf(2)
>> syscall. For now, ingress and egress inet socket filtering are the
>> only supported use-cases.
>>
>> Signed-off-by: Daniel Mack 
> [...]
>> +void __cgroup_bpf_update(struct cgroup *cgrp,
>> + struct cgroup *parent,
>> + struct bpf_prog *prog,
>> + enum bpf_attach_type type)
>> +{
>> +struct bpf_prog *old_prog, *effective;
>> +struct cgroup_subsys_state *pos;
>> +
>> +old_prog = xchg(cgrp->bpf.prog + type, prog);
>> +
>> +if (prog)
>> +static_branch_inc(&cgroup_bpf_enabled_key);
>> +
>> +if (old_prog) {
>> +bpf_prog_put(old_prog);
>> +static_branch_dec(&cgroup_bpf_enabled_key);
>> +}
>> +
>> +effective = (!prog && parent) ?
>> +rcu_dereference_protected(parent->bpf.effective[type],
>> +  lockdep_is_held(&cgroup_mutex)) :
>> +prog;
>> +
>> +css_for_each_descendant_pre(pos, &cgrp->self) {
>> +struct cgroup *desc = container_of(pos, struct cgroup, self);
>> +
>> +/* skip the subtree if the descendant has its own program */
>> +if (desc->bpf.prog[type] && desc != cgrp)
>> +pos = css_rightmost_descendant(pos);
>> +else
>> +rcu_assign_pointer(desc->bpf.effective[type],
>> +   effective);
>> +}
> 
> Shouldn't the old_prog reference only be released right here at the end
> instead of above (otherwise this could race)?

Yes, that's right. Will change as well. Thanks for spotting!


Daniel
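
The agreed ordering, sketched (abridged from the hunk above; only the
position of the bpf_prog_put() pair changes, error handling omitted):

	old_prog = xchg(cgrp->bpf.prog + type, prog);
	if (prog)
		static_branch_inc(&cgroup_bpf_enabled_key);

	effective = ...;

	css_for_each_descendant_pre(pos, &cgrp->self) {
		/* propagate the new effective program first */
		...
	}

	/* only now is it safe to drop the old program: no descendant
	 * can still observe it as its effective program */
	if (old_prog) {
		bpf_prog_put(old_prog);
		static_branch_dec(&cgroup_bpf_enabled_key);
	}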



Re: [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering

2016-09-05 Thread Daniel Mack
On 08/30/2016 12:14 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>> For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
>> terms of checks during the verification process. It may access the skb as
>> well.
>>
>> Programs of this type will be attached to cgroups for network filtering
>> and accounting.
>>
>> Signed-off-by: Daniel Mack 
>> ---
>>   include/uapi/linux/bpf.h | 7 +++
>>   kernel/bpf/verifier.c| 1 +
>>   net/core/filter.c| 6 ++
>>   3 files changed, 14 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index e4c5a1b..1d5db42 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -95,6 +95,13 @@ enum bpf_prog_type {
>>  BPF_PROG_TYPE_SCHED_ACT,
>>  BPF_PROG_TYPE_TRACEPOINT,
>>  BPF_PROG_TYPE_XDP,
>> +BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
>> +};
> 
> Nit: can we drop the _FILTER suffix? So just leaving it
> at BPF_PROG_TYPE_CGROUP_SOCKET. Some of these use cases
> might not always strictly be related to filtering, so
> seems cleaner to just leave it out everywhere.
> 
>> +
>> +enum bpf_attach_type {
>> +BPF_ATTACH_TYPE_CGROUP_INET_INGRESS,
>> +BPF_ATTACH_TYPE_CGROUP_INET_EGRESS,
>> +__MAX_BPF_ATTACH_TYPE
>>   };
> 
> #define MAX_BPF_ATTACH_TYPE   __MAX_BPF_ATTACH_TYPE
> 
> And then use that in your follow-up patches for declaring
> arrays, etc?

Agreed, will change.


Thanks,
Daniel



Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-09-05 Thread Daniel Mack
Hi Alexei,

On 08/27/2016 02:03 AM, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One for such that are directly pinned to a cgroup, and one for such
>> that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>   A - B - C
>> \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
>> Attaching and detaching programs will be done through the bpf(2)
>> syscall. For now, ingress and egress inet socket filtering are the
>> only supported use-cases.
>>
>> Signed-off-by: Daniel Mack 
> ...
>> +css_for_each_descendant_pre(pos, &cgrp->self) {
>> +struct cgroup *desc = container_of(pos, struct cgroup, self);
>> +
>> +/* skip the subtree if the descendant has its own program */
>> +if (desc->bpf.prog[type] && desc != cgrp)
> 
> is desc != cgrp really needed?
> I thought css_for_each_descendant_pre() shouldn't walk itself
> or I'm missing how it works.

Hmm, no - that check is necessary in my tests, and also according to the
documentation:

/**
 * css_for_each_descendant_pre - pre-order walk of a css's descendants
 * @pos: the css * to use as the loop cursor
 * @root: css whose descendants to walk
 *
 * Walk @root's descendants.  @root is included in the iteration and the
 * first node to be visited.  Must be called under rcu_read_lock().
 *


Daniel



[Patch v6] net: ethernet: xilinx: Enable emaclite for MIPS

2016-09-05 Thread Zubair Lutfullah Kakakhel
The MIPS-based xilfpga platform uses this driver.
Enable it for MIPS.

Signed-off-by: Zubair Lutfullah Kakakhel 

---
V1 -> V6 are from a series that has gotten too big.
So I have split this patch and am sending it separately.
---
 drivers/net/ethernet/xilinx/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/xilinx/Kconfig 
b/drivers/net/ethernet/xilinx/Kconfig
index 4f5c024..6d68c8a 100644
--- a/drivers/net/ethernet/xilinx/Kconfig
+++ b/drivers/net/ethernet/xilinx/Kconfig
@@ -5,7 +5,7 @@
 config NET_VENDOR_XILINX
bool "Xilinx devices"
default y
-   depends on PPC || PPC32 || MICROBLAZE || ARCH_ZYNQ
+   depends on PPC || PPC32 || MICROBLAZE || ARCH_ZYNQ || MIPS
---help---
  If you have a network (Ethernet) card belonging to this class, say Y.
 
@@ -18,7 +18,7 @@ if NET_VENDOR_XILINX
 
 config XILINX_EMACLITE
tristate "Xilinx 10/100 Ethernet Lite support"
-   depends on (PPC32 || MICROBLAZE || ARCH_ZYNQ)
+   depends on PPC32 || MICROBLAZE || ARCH_ZYNQ || MIPS
select PHYLIB
---help---
  This driver supports the 10/100 Ethernet Lite from Xilinx.
-- 
1.9.1



Re: [Patch v5 0/2] net: ethernet: xilinx: mac addr and mips

2016-09-05 Thread Zubair Lutfullah Kakakhel



On 09/04/2016 07:45 PM, David Miller wrote:

From: Zubair Lutfullah Kakakhel 
Date: Fri, 2 Sep 2016 12:39:24 +0100


A couple of simple patches to generate a random MAC address
if none is found, and to enable the driver for MIPS.

Based on v4.8-rc4.

These were part of a larger series but that series is growing
wildly. Splitting and submitting the net subsystem patches separately.
Hence the v5.


This doesn't apply cleanly to any of my trees.



:)

Looks like there is an identical patch by someone else already in net-next.

https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/xilinx/xilinx_emaclite.c?id=5575cf133cf7f564da991595c6bc9344afa7d89a

Regards,
ZubairLK


Re: [PATCH] ipv6: addrconf: fix dev refcount leak when DAD failed

2016-09-05 Thread Hannes Frederic Sowa
On 05.09.2016 13:54, weiyongjun (A) wrote:
> On 05.09.2016 10:06, Wei Yongjun wrote:
>>>> In general, when DAD detects an IPv6 duplicate address, ifp->state will
>>>> be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed
>>>> work; the call tree should be like this:
>>>>
>>>> ndisc_recv_ns
>>>>   -> addrconf_dad_failure        <- missing ifp put
>>>>      -> addrconf_mod_dad_work
>>>>        -> schedule addrconf_dad_work()
>>>>          -> addrconf_dad_stop()  <- missing ifp hold before calling it
>>>>
>>>> addrconf_dad_failure() is called with an ifp refcount held but not put.
>>>> addrconf_dad_work() calls addrconf_dad_stop() without holding an extra
>>>> refcount. This will not cause any issue normally.
>>>>
>>>> But the race between addrconf_dad_failure() and addrconf_dad_work()
>>>> may leak the ifp refcount so the netdevice cannot be unregistered;
>>>> dmesg shows the following messages:
>>>>
>>>> IPv6: eth0: IPv6 duplicate address fe80::XX:::XX detected!
>>>> ...
>>>> unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>>>>
>>>> Cc: sta...@vger.kernel.org
>>>> Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing
>>>> to workqueue")
>>>> Signed-off-by: Wei Yongjun 
>>>>
>>>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index
>>>> bdf368e..2f1f5d4 100644
>>>> --- a/net/ipv6/addrconf.c
>>>> +++ b/net/ipv6/addrconf.c
>>>> @@ -1948,6 +1948,7 @@ errdad:
>>>> 	spin_unlock_bh(&ifp->lock);
>>>>
>>>> 	addrconf_mod_dad_work(ifp, 0);
>>>> +	in6_ifa_put(ifp);
>>>>  }
> 
>>> This in6_ifa_put makes sense.
> 
>>>>
>>>>  /* Join to solicited addr multicast group.
>>>> @@ -3857,6 +3858,7 @@ static void addrconf_dad_work(struct work_struct *w)
>>>> 		addrconf_dad_begin(ifp);
>>>> 		goto out;
>>>> 	} else if (action == DAD_ABORT) {
>>>> +		in6_ifa_hold(ifp);
>>>> 		addrconf_dad_stop(ifp, 1);
>>>> 		if (disable_ipv6)
>>>> 			addrconf_ifdown(idev->dev, 0);

> 
>>> But why you add an in6_ifa_hold here isn't clear to me. Could you 
>>> explain why this is necessary? I don't see any async stuff being done 
>>> in addrconf_dad_stop, thus the reference we already have should be 
>>> sufficient for the lifetime of addrconf_dad_stop.
> 
>> I think it is that the link-local address is added with flag
>> IFA_F_PERMANENT; what we really need is to remove the in6_ifa_put()
>> in addrconf_dad_stop.
> 
>> static void addrconf_dad_stop(...)
>> {
>>  if (ifp->flags & IFA_F_PERMANENT) {
>>  ...
>>  in6_ifa_put(ifp);   <== remove this line since caller holds refcount
>>  } else if (ifp->flags & IFA_F_TEMPORARY) {
>>  ...
>>  ipv6_del_addr(ifp);
>>  } else {
>>  ipv6_del_addr(ifp);
>>  }
>> }
> 
> I see the comment of ipv6_del_addr:
> 
> /* This function wants to get referenced ifp and releases it before return */
> static void ipv6_del_addr(struct inet6_ifaddr *ifp)
> {
>...
> }
> 
> Both in6_ifa_put() and ipv6_del_addr() need a refcount held, so we should
> use an extra in6_ifa_hold() call before calling addrconf_dad_stop(), as in
> the original patch.

You are correct!

We always call in6_ifa_put() at the end of addrconf_dad_work. In this
case it is not correct. Either we add an in6_ifa_hold or we suppress or
jump over the last in6_ifa_put.

Good catch!

Hannes
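
For reference, how the references balance out with both hunks applied (a
sketch of the DAD_ABORT path in addrconf_dad_work(); the annotations
assume the IFA_F_PERMANENT case discussed above):

	} else if (action == DAD_ABORT) {
		in6_ifa_hold(ifp);	/* consumed inside addrconf_dad_stop(),
					 * by in6_ifa_put() for PERMANENT
					 * addresses or by ipv6_del_addr() */
		addrconf_dad_stop(ifp, 1);
		if (disable_ipv6)
			addrconf_ifdown(idev->dev, 0);
		goto out;
	}
	...
out:
	in6_ifa_put(ifp);		/* pairs with the hold taken when the
					 * work item was scheduled */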



Re: [PATCH] ipv6: addrconf: fix dev refcount leak when DAD failed

2016-09-05 Thread weiyongjun (A)
On 05.09.2016 10:06, Wei Yongjun wrote:
>>> In general, when DAD detects an IPv6 duplicate address, ifp->state will
>>> be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed
>>> work; the call tree should be like this:
>>> 
>>> ndisc_recv_ns
>>>   -> addrconf_dad_failure        <- missing ifp put
>>>      -> addrconf_mod_dad_work
>>>        -> schedule addrconf_dad_work()
>>>          -> addrconf_dad_stop()  <- missing ifp hold before calling it
>>> 
>>> addrconf_dad_failure() is called with an ifp refcount held but not put.
>>> addrconf_dad_work() calls addrconf_dad_stop() without holding an extra
>>> refcount. This will not cause any issue normally.
>>> 
>>> But the race between addrconf_dad_failure() and addrconf_dad_work()
>>> may leak the ifp refcount so the netdevice cannot be unregistered;
>>> dmesg shows the following messages:
>>> 
>>> IPv6: eth0: IPv6 duplicate address fe80::XX:::XX detected!
>>> ...
>>> unregister_netdevice: waiting for eth0 to become free. Usage count = 
>>> 1
>>> 
>>> Cc: sta...@vger.kernel.org
>>> Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing 
>>> to
>>> workqueue")
>>> Signed-off-by: Wei Yongjun 
>>> 
>>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index
>>> bdf368e..2f1f5d4 100644
>>> --- a/net/ipv6/addrconf.c
>>> +++ b/net/ipv6/addrconf.c
>>> @@ -1948,6 +1948,7 @@ errdad:
>>> spin_unlock_bh(&ifp->lock);
>>>  
>>> addrconf_mod_dad_work(ifp, 0);
>>> +   in6_ifa_put(ifp);
>>>  }

>>This in6_ifa_put makes sense.

>>>  
>>>  /* Join to solicited addr multicast group.
>>> @@ -3857,6 +3858,7 @@ static void addrconf_dad_work(struct work_struct *w)
>>> addrconf_dad_begin(ifp);
>>> goto out;
>>> } else if (action == DAD_ABORT) {
>>> +   in6_ifa_hold(ifp);
>>> addrconf_dad_stop(ifp, 1);
>>> if (disable_ipv6)
>>> addrconf_ifdown(idev->dev, 0);
>>> 

>>But why you add an in6_ifa_hold here isn't clear to me. Could you 
>>explain why this is necessary? I don't see any async stuff being done 
>>in addrconf_dad_stop, thus the reference we already have should be sufficient 
>>for the lifetime of addrconf_dad_stop.

> I think it is that the link-local address is added with flag
> IFA_F_PERMANENT; what we really need is to remove the in6_ifa_put() in
> addrconf_dad_stop.

> static void addrconf_dad_stop(...)
> {
>   if (ifp->flags & IFA_F_PERMANENT) {
>   ...
>   in6_ifa_put(ifp);   <== remove this line since caller holds refcount
>   } else if (ifp->flags & IFA_F_TEMPORARY) {
>   ...
>   ipv6_del_addr(ifp);
>   } else {
>   ipv6_del_addr(ifp);
>   }
>}

I see the comment of ipv6_del_addr:

/* This function wants to get referenced ifp and releases it before return */
static void ipv6_del_addr(struct inet6_ifaddr *ifp)
{
   ...
}

Both in6_ifa_put() and ipv6_del_addr() need a refcount held, so we should
use an extra in6_ifa_hold() call before calling addrconf_dad_stop(), as in
the original patch.

Regards,
Wei Yongjun



[PATCH net-next 0/3] qed*: Debug data collection

2016-09-05 Thread Tomer Tayar
This patch series adds the support of debug data collection in the qed driver,
and the means to extract it in the qede driver via the get_regs operation.

Hi Dave,

Please consider applying this to 'net-next'.

Thanks,
Tomer

Tomer Tayar (3):
  qed: Add infrastructure for debug data collection
  qed: Add support for debug data collection
  qed*: Add support for the ethtool get_regs operation

 drivers/net/ethernet/qlogic/qed/Makefile|2 +-
 drivers/net/ethernet/qlogic/qed/qed.h   |   20 +
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 6898 +++
 drivers/net/ethernet/qlogic/qed/qed_debug.h |   54 +
 drivers/net/ethernet/qlogic/qed/qed_hsi.h   | 1059 +++-
 drivers/net/ethernet/qlogic/qed/qed_main.c  |6 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   |   76 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |   46 +
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h  |  900 +++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c |   24 +
 include/linux/qed/common_hsi.h  |3 +
 include/linux/qed/qed_if.h  |4 +
 12 files changed, 9080 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_debug.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_debug.h

-- 
1.8.3.1



Re: A potential bug in drivers/net/ethernet/synopsys/dwc_eth_qos.ko

2016-09-05 Thread Lars Persson

Hi Pavel,

Thanks for the notification. I agree that we should register the device 
after all initialization has completed. A patch will be sent shortly.


BR,
 Lars


On 09/05/2016 10:26 AM, Pavel Andrianov wrote:

Hi!

There is a potential bug in
drivers/net/ethernet/synopsys/dwc_eth_qos.ko. In dwceqos_probe the net
device is registered (line 2879) before initialization of the common
resources has finished (lines 2918, 2924, 2941), but by that point the
net device's handlers may already be executing.

Should the registration of net device be at the end of dwceqos_probe?
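
The fix Lars describes amounts to making register_netdev() the last step
of the probe path; a sketch of the intended ordering (a hypothetical
simplification, not the actual dwceqos_probe() code; the error label is
illustrative):

	static int dwceqos_probe(struct platform_device *pdev)
	{
		...
		/* bring up clocks, DMA descriptors, MAC and PHY first */
		...
		/* publish the device last: once this returns, ndo_open
		 * and the other handlers may run at any time */
		ret = register_netdev(ndev);
		if (ret)
			goto err_out;
		...
		return 0;
	}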



[PATCH net-next 1/3] qed: Add infrastructure for debug data collection

2016-09-05 Thread Tomer Tayar
Adds support for several infrastructure operations that are done as part of
debug data collection.

Signed-off-by: Tomer Tayar 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_hsi.h  |  3 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  | 76 ++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  | 46 
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |  6 ++
 4 files changed, 131 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h 
b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index df94a8b7..3a4cb4a 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -7640,6 +7640,7 @@ struct public_drv_mb {
 #define DRV_MSG_CODE_CFG_VF_MSIX   0xc001
 #define DRV_MSG_CODE_MCP_RESET 0x0009
 #define DRV_MSG_CODE_SET_VERSION   0x000f
+#define DRV_MSG_CODE_MCP_HALT   0x0010
 
 #define DRV_MSG_CODE_GET_STATS  0x0013
 #define DRV_MSG_CODE_STATS_TYPE_LAN 1
@@ -7647,6 +7648,8 @@ struct public_drv_mb {
 #define DRV_MSG_CODE_STATS_TYPE_ISCSI   3
 #define DRV_MSG_CODE_STATS_TYPE_RDMA4
 
+#define DRV_MSG_CODE_MASK_PARITIES  0x001a
+
 #define DRV_MSG_CODE_BIST_TEST 0x001e
 #define DRV_MSG_CODE_SET_LED_MODE  0x0020
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 4c21266..ea673e5 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -390,6 +390,34 @@ int qed_mcp_cmd(struct qed_hwfn *p_hwfn,
return 0;
 }
 
+int qed_mcp_nvm_rd_cmd(struct qed_hwfn *p_hwfn,
+  struct qed_ptt *p_ptt,
+  u32 cmd,
+  u32 param,
+  u32 *o_mcp_resp,
+  u32 *o_mcp_param, u32 *o_txn_size, u32 *o_buf)
+{
+   struct qed_mcp_mb_params mb_params;
+   union drv_union_data union_data;
+   int rc;
+
+   memset(&mb_params, 0, sizeof(mb_params));
+   mb_params.cmd = cmd;
+   mb_params.param = param;
+   mb_params.p_data_dst = &union_data;
+   rc = qed_mcp_cmd_and_union(p_hwfn, p_ptt, &mb_params);
+   if (rc)
+   return rc;
+
+   *o_mcp_resp = mb_params.mcp_resp;
+   *o_mcp_param = mb_params.mcp_param;
+
+   *o_txn_size = *o_mcp_param;
+   memcpy(o_buf, &union_data.raw_data, *o_txn_size);
+
+   return 0;
+}
+
 int qed_mcp_load_req(struct qed_hwfn *p_hwfn,
 struct qed_ptt *p_ptt, u32 *p_load_code)
 {
@@ -1169,6 +1197,33 @@ qed_mcp_send_drv_version(struct qed_hwfn *p_hwfn,
return rc;
 }
 
+int qed_mcp_halt(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
+{
+   u32 resp = 0, param = 0;
+   int rc;
+
+   rc = qed_mcp_cmd(p_hwfn, p_ptt, DRV_MSG_CODE_MCP_HALT, 0, &resp,
+&param);
+   if (rc)
+   DP_ERR(p_hwfn, "MCP response failure, aborting\n");
+
+   return rc;
+}
+
+int qed_mcp_resume(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
+{
+   u32 value, cpu_mode;
+
+   qed_wr(p_hwfn, p_ptt, MCP_REG_CPU_STATE, 0x);
+
+   value = qed_rd(p_hwfn, p_ptt, MCP_REG_CPU_MODE);
+   value &= ~MCP_REG_CPU_MODE_SOFT_HALT;
+   qed_wr(p_hwfn, p_ptt, MCP_REG_CPU_MODE, value);
+   cpu_mode = qed_rd(p_hwfn, p_ptt, MCP_REG_CPU_MODE);
+
+   return (cpu_mode & MCP_REG_CPU_MODE_SOFT_HALT) ? -EAGAIN : 0;
+}
+
 int qed_mcp_set_led(struct qed_hwfn *p_hwfn,
struct qed_ptt *p_ptt, enum qed_led_mode mode)
 {
@@ -1196,6 +1251,27 @@ int qed_mcp_set_led(struct qed_hwfn *p_hwfn,
return rc;
 }
 
+int qed_mcp_mask_parities(struct qed_hwfn *p_hwfn,
+ struct qed_ptt *p_ptt, u32 mask_parities)
+{
+   u32 resp = 0, param = 0;
+   int rc;
+
+   rc = qed_mcp_cmd(p_hwfn, p_ptt, DRV_MSG_CODE_MASK_PARITIES,
+mask_parities, &resp, &param);
+
+   if (rc) {
+   DP_ERR(p_hwfn,
+  "MCP response failure for mask parities, aborting\n");
+   } else if (resp != FW_MSG_CODE_OK) {
+   DP_ERR(p_hwfn,
+  "MCP did not acknowledge mask parity request. Old 
MFW?\n");
+   rc = -EINVAL;
+   }
+
+   return rc;
+}
+
 int qed_mcp_bist_register_test(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 {
u32 drv_mb_param = 0, rsp, param;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index c6372fa..dff520e 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -468,6 +468,29 @@ int qed_mcp_reset(struct qed_hwfn *p_hwfn,
  struct qed_ptt *p_ptt);
 
 /**
+ * @brief - Sends an NVM read command request to the MFW to get
+ *a buffer.
+ 

[PATCH net-next 3/3] qed*: Add support for the ethtool get_regs operation

2016-09-05 Thread Tomer Tayar
Signed-off-by: Tomer Tayar 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_main.c  |  2 ++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 24 
 include/linux/qed/qed_if.h  |  4 
 3 files changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 78cbf99..1716979 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1421,6 +1421,8 @@ const struct qed_common_ops qed_common_ops_pass = {
.get_link = &qed_get_current_link,
.drain = &qed_drain,
.update_msglvl = &qed_init_dp,
+   .dbg_all_data = &qed_dbg_all_data,
+   .dbg_all_data_size = &qed_dbg_all_data_size,
.chain_alloc = &qed_chain_alloc,
.chain_free = &qed_chain_free,
.get_coalesce = &qed_get_coalesce,
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index 6e17ee1..eba0128 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -689,6 +689,28 @@ static int qede_set_pauseparam(struct net_device *dev,
return 0;
 }
 
+static void qede_get_regs(struct net_device *ndev,
+ struct ethtool_regs *regs, void *buffer)
+{
+   struct qede_dev *edev = netdev_priv(ndev);
+
+   regs->version = 0;
+   memset(buffer, 0, regs->len);
+
+   if (edev->ops && edev->ops->common)
+   edev->ops->common->dbg_all_data(edev->cdev, buffer);
+}
+
+static int qede_get_regs_len(struct net_device *ndev)
+{
+   struct qede_dev *edev = netdev_priv(ndev);
+
+   if (edev->ops && edev->ops->common)
+   return edev->ops->common->dbg_all_data_size(edev->cdev);
+   else
+   return -EINVAL;
+}
+
 static void qede_update_mtu(struct qede_dev *edev, union qede_reload_args 
*args)
 {
edev->ndev->mtu = args->mtu;
@@ -1340,6 +1362,8 @@ static const struct ethtool_ops qede_ethtool_ops = {
.get_link_ksettings = qede_get_link_ksettings,
.set_link_ksettings = qede_set_link_ksettings,
.get_drvinfo = qede_get_drvinfo,
+   .get_regs_len = qede_get_regs_len,
+   .get_regs = qede_get_regs,
.get_msglevel = qede_get_msglevel,
.set_msglevel = qede_set_msglevel,
.nway_reset = qede_nway_reset,
diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index 864265f..5a39a2c 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -449,6 +449,10 @@ struct qed_common_ops {
void(*simd_handler_clean)(struct qed_dev *cdev,
  int index);
 
+   int (*dbg_all_data) (struct qed_dev *cdev, void *buffer);
+
+   int (*dbg_all_data_size) (struct qed_dev *cdev);
+
 /**
  * @brief can_link_change - can the instance change the link or not
  *
-- 
1.8.3.1
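
For context, once these two ops are wired into qede's ethtool_ops the
dump becomes reachable from userspace through the standard register-dump
path, e.g. (interface name assumed):

  # ethtool -d eth0 raw on > qed_debug_dump.bin

ethtool first calls get_regs_len() to size the buffer, then get_regs()
to fill it.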



Re: linux-next: manual merge of the char-misc tree with the net-next tree

2016-09-05 Thread Greg KH
On Mon, Sep 05, 2016 at 04:56:50PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the char-misc tree got a conflict in:
> 
>   include/linux/hyperv.h
> 
> between commit:
> 
>   30d1de08c87d ("hv_netvsc: make inline functions static")
> 
> from the net-next tree and commit:
> 
>   bb08d431a914 ("Drivers: hv: ring_buffer: count on wrap around mappings in 
> get_next_pkt_raw()")
> 
> from the char-misc tree.
> 
> I fixed it up (the former moved the code modified by the latter, so the
> below patch applies to the new location of the code) and can carry the
> fix as necessary. This is now fixed as far as linux-next is concerned,
> but any non trivial conflicts should be mentioned to your upstream
> maintainer when your tree is submitted for merging.  You may also want
> to consider cooperating with the maintainer of the conflicting tree to
> minimise any particularly complex conflicts.
> 
> From: Stephen Rothwell 
> Date: Mon, 5 Sep 2016 16:53:06 +1000
> Subject: [PATCH] Drivers: hv: ring_buffer: merge fix up for "hv_netvsc: make
>  inline functions static"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  drivers/net/hyperv/netvsc.c | 32 +++-
>  1 file changed, 11 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 2a9ccc4d9e3c..026df6556068 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -42,31 +42,23 @@ static struct vmpacket_descriptor *
>  get_next_pkt_raw(struct vmbus_channel *channel)
>  {
>   struct hv_ring_buffer_info *ring_info = &channel->inbound;
> - u32 read_loc = ring_info->priv_read_index;
> + u32 priv_read_loc = ring_info->priv_read_index;
>   void *ring_buffer = hv_get_ring_buffer(ring_info);
> - struct vmpacket_descriptor *cur_desc;
> - u32 packetlen;
>   u32 dsize = ring_info->ring_datasize;
> - u32 delta = read_loc - ring_info->ring_buffer->read_index;
> + /*
> +  * delta is the difference between what is available to read and
> +  * what was already consumed in place. We commit read index after
> +  * the whole batch is processed.
> +  */
> + u32 delta = priv_read_loc >= ring_info->ring_buffer->read_index ?
> + priv_read_loc - ring_info->ring_buffer->read_index :
> + (dsize - ring_info->ring_buffer->read_index) + priv_read_loc;
>   u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta);
>  
>   if (bytes_avail_toread < sizeof(struct vmpacket_descriptor))
>   return NULL;
>  
> - if ((read_loc + sizeof(*cur_desc)) > dsize)
> - return NULL;
> -
> - cur_desc = ring_buffer + read_loc;
> - packetlen = cur_desc->len8 << 3;
> -
> - /*
> -  * If the packet under consideration is wrapping around,
> -  * return failure.
> -  */
> - if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > (dsize - 1))
> - return NULL;
> -
> - return cur_desc;
> + return ring_buffer + priv_read_loc;
>  }
>  
>  /*
> @@ -78,16 +70,14 @@ static void put_pkt_raw(struct vmbus_channel *channel,
>   struct vmpacket_descriptor *desc)
>  {
>   struct hv_ring_buffer_info *ring_info = &channel->inbound;
> - u32 read_loc = ring_info->priv_read_index;
>   u32 packetlen = desc->len8 << 3;
>   u32 dsize = ring_info->ring_datasize;
>  
> - BUG_ON((read_loc + packetlen + VMBUS_PKT_TRAILER) > dsize);
> -
>   /*
>* Include the packet trailer.
>*/
>   ring_info->priv_read_index += packetlen + VMBUS_PKT_TRAILER;
> + ring_info->priv_read_index %= dsize;
>  }
>  
>  /*

Ugh, messy.  Thanks for this.

KY, how did this happen?

greg k-h
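
The heart of the conflicting change is the wrap-aware distance between
the committed read index and the private read cursor; isolated as a
sketch (function and parameter names are illustrative):

	/* distance already consumed in place, in a ring of dsize bytes */
	static u32 ring_delta(u32 priv_read, u32 committed, u32 dsize)
	{
		return priv_read >= committed ?
		       priv_read - committed :
		       (dsize - committed) + priv_read;
	}

The old code simply returned NULL for any packet that wrapped; the new
code tolerates wrapping and advances priv_read_index modulo dsize.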


Re: [PATCH] ipv6: addrconf: fix dev refcount leak when DAD failed

2016-09-05 Thread weiyongjun (A)
On 05.09.2016 10:06, Wei Yongjun wrote:
>> In general, when DAD detects an IPv6 duplicate address, ifp->state will
>> be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed
>> work; the call tree should be like this:
>> 
>> ndisc_recv_ns
>>   -> addrconf_dad_failure        <- missing ifp put
>>      -> addrconf_mod_dad_work
>>        -> schedule addrconf_dad_work()
>>          -> addrconf_dad_stop()  <- missing ifp hold before calling it
>> 
>> addrconf_dad_failure() is called with an ifp refcount held but not put.
>> addrconf_dad_work() calls addrconf_dad_stop() without holding an extra
>> refcount. This will not cause any issue normally.
>> 
>> But the race between addrconf_dad_failure() and addrconf_dad_work()
>> may leak the ifp refcount so the netdevice cannot be unregistered;
>> dmesg shows the following messages:
>> 
>> IPv6: eth0: IPv6 duplicate address fe80::XX:::XX detected!
>> ...
>> unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>> 
>> Cc: sta...@vger.kernel.org
>> Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing to 
>> workqueue")
>> Signed-off-by: Wei Yongjun 
>> 
>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 
>> bdf368e..2f1f5d4 100644
>> --- a/net/ipv6/addrconf.c
>> +++ b/net/ipv6/addrconf.c
>> @@ -1948,6 +1948,7 @@ errdad:
>>  spin_unlock_bh(&ifp->lock);
>>  
>>  addrconf_mod_dad_work(ifp, 0);
>> +in6_ifa_put(ifp);
>>  }

>This in6_ifa_put makes sense.

>>  
>>  /* Join to solicited addr multicast group.
>> @@ -3857,6 +3858,7 @@ static void addrconf_dad_work(struct work_struct *w)
>>  addrconf_dad_begin(ifp);
>>  goto out;
>>  } else if (action == DAD_ABORT) {
>> +in6_ifa_hold(ifp);
>>  addrconf_dad_stop(ifp, 1);
>>  if (disable_ipv6)
>>  addrconf_ifdown(idev->dev, 0);
>> 

>But why you add an in6_ifa_hold here isn't clear to me. Could you explain
>why this is necessary? I don't see any async stuff being done in
>addrconf_dad_stop, thus the reference we already have should be sufficient
>for the lifetime of addrconf_dad_stop.

I think it is that the link-local address is added with flag
IFA_F_PERMANENT; what we really need is to remove the in6_ifa_put() in
addrconf_dad_stop.

static void addrconf_dad_stop(...)
{
if (ifp->flags & IFA_F_PERMANENT) {
...
in6_ifa_put(ifp);   <== remove this line since caller holds refcount
} else if (ifp->flags & IFA_F_TEMPORARY) {
...
ipv6_del_addr(ifp);
} else {
ipv6_del_addr(ifp);
}
}

If so, addrconf_dad_begin() also needs to be fixed, because it holds a ref
before addrconf_dad_stop():

static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
{
 ...
in6_ifa_hold(ifp);  <-- remove this line 
addrconf_dad_stop(ifp, 0);
 ...
}

Also inet6_addr_del(), which calls ipv6_del_addr() with a refcount held:
inet6_addr_del(...)
{
  ...
  list_for_each_entry(...) {
...
in6_ifa_hold(ifp);   <-- remove this line
...
ipv6_del_addr(ifp);
...
  }
  ...
}

Regards,
Yongjun Wei


[PATCH 12/29] netfilter: nf_tables: add number generator expression

2016-09-05 Thread Pablo Neira Ayuso
From: Laura Garcia Liebana 

This patch adds the numgen expression that allows us to generate
incremental and random numbers; the generator is bound to an upper limit
that is specified by userspace.

This expression is useful to distribute packets in a round-robin fashion
as well as randomly.

Signed-off-by: Laura Garcia Liebana 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |  24 
 net/netfilter/Kconfig|   6 +
 net/netfilter/Makefile   |   1 +
 net/netfilter/nft_numgen.c   | 192 +++
 4 files changed, 223 insertions(+)
 create mode 100644 net/netfilter/nft_numgen.c

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 784fbf1..8c9d6ff 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -1121,4 +1121,28 @@ enum nft_trace_types {
__NFT_TRACETYPE_MAX
 };
 #define NFT_TRACETYPE_MAX (__NFT_TRACETYPE_MAX - 1)
+
+/**
+ * enum nft_ng_attributes - nf_tables number generator expression netlink 
attributes
+ *
+ * @NFTA_NG_DREG: destination register (NLA_U32)
+ * @NFTA_NG_UNTIL: source value to increment the counter until reset (NLA_U32)
+ * @NFTA_NG_TYPE: operation type (NLA_U32)
+ */
+enum nft_ng_attributes {
+   NFTA_NG_UNSPEC,
+   NFTA_NG_DREG,
+   NFTA_NG_UNTIL,
+   NFTA_NG_TYPE,
+   __NFTA_NG_MAX
+};
+#define NFTA_NG_MAX (__NFTA_NG_MAX - 1)
+
+enum nft_ng_types {
+   NFT_NG_INCREMENTAL,
+   NFT_NG_RANDOM,
+   __NFT_NG_MAX
+};
+#define NFT_NG_MAX (__NFT_NG_MAX - 1)
+
 #endif /* _LINUX_NF_TABLES_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 29a8078..e8d56d9 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -474,6 +474,12 @@ config NFT_META
  This option adds the "meta" expression that you can use to match and
  to set packet metainformation such as the packet mark.
 
+config NFT_NUMGEN
+   tristate "Netfilter nf_tables number generator module"
+   help
+ This option adds the number generator expression used to perform
+ incremental counting and random numbers bound to an upper limit.
+
 config NFT_CT
depends on NF_CONNTRACK
tristate "Netfilter nf_tables conntrack module"
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 0fc42df..0c858110 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -80,6 +80,7 @@ obj-$(CONFIG_NF_TABLES_NETDEV)+= nf_tables_netdev.o
 obj-$(CONFIG_NFT_COMPAT)   += nft_compat.o
 obj-$(CONFIG_NFT_EXTHDR)   += nft_exthdr.o
 obj-$(CONFIG_NFT_META) += nft_meta.o
+obj-$(CONFIG_NFT_NUMGEN)   += nft_numgen.o
 obj-$(CONFIG_NFT_CT)   += nft_ct.o
 obj-$(CONFIG_NFT_LIMIT)+= nft_limit.o
 obj-$(CONFIG_NFT_NAT)  += nft_nat.o
diff --git a/net/netfilter/nft_numgen.c b/net/netfilter/nft_numgen.c
new file mode 100644
index 000..176e26d
--- /dev/null
+++ b/net/netfilter/nft_numgen.c
@@ -0,0 +1,192 @@
+/*
+ * Copyright (c) 2016 Laura Garcia 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+#include <linux/static_key.h>
+
+static DEFINE_PER_CPU(struct rnd_state, nft_numgen_prandom_state);
+
+struct nft_ng_inc {
+   enum nft_registers  dreg:8;
+   u32 until;
+   atomic_tcounter;
+};
+
+static void nft_ng_inc_eval(const struct nft_expr *expr,
+   struct nft_regs *regs,
+   const struct nft_pktinfo *pkt)
+{
+   struct nft_ng_inc *priv = nft_expr_priv(expr);
+   u32 nval, oval;
+
+   do {
+   oval = atomic_read(&priv->counter);
+   nval = (oval + 1 < priv->until) ? oval + 1 : 0;
+   } while (atomic_cmpxchg(&priv->counter, oval, nval) != oval);
+
+   memcpy(&regs->data[priv->dreg], &priv->counter, sizeof(u32));
+}
+
+static const struct nla_policy nft_ng_policy[NFTA_NG_MAX + 1] = {
+   [NFTA_NG_DREG]  = { .type = NLA_U32 },
+   [NFTA_NG_UNTIL] = { .type = NLA_U32 },
+   [NFTA_NG_TYPE]  = { .type = NLA_U32 },
+};
+
+static int nft_ng_inc_init(const struct nft_ctx *ctx,
+  const struct nft_expr *expr,
+  const struct nlattr * const tb[])
+{
+   struct nft_ng_inc *priv = nft_expr_priv(expr);
+
+   priv->until = ntohl(nla_get_be32(tb[NFTA_NG_UNTIL]));
+   if (priv->until == 0)
+   return -ERANGE;
+
+   priv->dreg = nft_parse_register(tb[NFTA_NG_DREG]);
+   atomic_set(&priv->counter, 0);
+
+   return nft_validate_register_store(ctx, priv->dreg, NULL,
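
The lockless round-robin counter from nft_ng_inc_eval() above, isolated
as a sketch (same logic, outside the expression plumbing):

	/* advance an atomic counter, wrapping back to 0 at 'until' */
	static u32 numgen_inc_next(atomic_t *counter, u32 until)
	{
		u32 oval, nval;

		do {
			oval = atomic_read(counter);
			nval = (oval + 1 < until) ? oval + 1 : 0;
		} while (atomic_cmpxchg(counter, oval, nval) != oval);

		return nval;
	}

The cmpxchg loop keeps the increment correct under concurrent packets on
other cpus without taking a lock.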

[PATCH 27/29] netfilter: remove __nf_ct_kill_acct helper

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

After timer removal this just calls nf_ct_delete so remove the __ prefix
version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_conntrack.h | 13 +++--
 net/netfilter/nf_conntrack_core.c| 12 +---
 2 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index 7277751..5041805 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -219,21 +219,14 @@ static inline void nf_ct_refresh(struct nf_conn *ct,
__nf_ct_refresh_acct(ct, 0, skb, extra_jiffies, 0);
 }
 
-bool __nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
-  const struct sk_buff *skb, int do_acct);
-
 /* kill conntrack and do accounting */
-static inline bool nf_ct_kill_acct(struct nf_conn *ct,
-  enum ip_conntrack_info ctinfo,
-  const struct sk_buff *skb)
-{
-   return __nf_ct_kill_acct(ct, ctinfo, skb, 1);
-}
+bool nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
+const struct sk_buff *skb);
 
 /* kill conntrack without accounting */
 static inline bool nf_ct_kill(struct nf_conn *ct)
 {
-   return __nf_ct_kill_acct(ct, 0, NULL, 0);
+   return nf_ct_delete(ct, 0, 0);
 }
 
 /* These are for NAT.  Icky. */
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 7c66ce40..ac1db40 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1430,17 +1430,15 @@ acct:
 }
 EXPORT_SYMBOL_GPL(__nf_ct_refresh_acct);
 
-bool __nf_ct_kill_acct(struct nf_conn *ct,
-  enum ip_conntrack_info ctinfo,
-  const struct sk_buff *skb,
-  int do_acct)
+bool nf_ct_kill_acct(struct nf_conn *ct,
+enum ip_conntrack_info ctinfo,
+const struct sk_buff *skb)
 {
-   if (do_acct)
-   nf_ct_acct_update(ct, ctinfo, skb->len);
+   nf_ct_acct_update(ct, ctinfo, skb->len);
 
return nf_ct_delete(ct, 0, 0);
 }
-EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
+EXPORT_SYMBOL_GPL(nf_ct_kill_acct);
 
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
 
-- 
2.1.4



[PATCH 28/29] netfilter: log_arp: Use ARPHRD_ETHER instead of literal '1'

2016-09-05 Thread Pablo Neira Ayuso
From: Gao Feng 

There is one macro ARPHRD_ETHER which defines the ethernet proto for ARP,
so we could use it instead of the literal number '1'.

Signed-off-by: Gao Feng 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/nf_log_arp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c
index e7ad950..cf8f2d4 100644
--- a/net/ipv4/netfilter/nf_log_arp.c
+++ b/net/ipv4/netfilter/nf_log_arp.c
@@ -62,7 +62,7 @@ static void dump_arp_packet(struct nf_log_buf *m,
/* If it's for Ethernet and the lengths are OK, then log the ARP
 * payload.
 */
-   if (ah->ar_hrd != htons(1) ||
+   if (ah->ar_hrd != htons(ARPHRD_ETHER) ||
ah->ar_hln != ETH_ALEN ||
ah->ar_pln != sizeof(__be32))
return;
-- 
2.1.4



[PATCH 23/29] netfilter: conntrack: get rid of conntrack timer

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de657f8, 'timers: Switch to
a non-cascading wheel').

Remove the timer and use a 32bit jiffies value containing timestamp until
entry is valid.

During conntrack lookup, even before doing the tuple comparison, check
the timeout value and evict the entry in case it is too old.

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict an already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage collection via a work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps); this is needed to keep expired conntracks from hanging
around for too long when the lookup rate is low after a busy period.

Signed-off-by: Florian Westphal 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_conntrack.h | 23 +++--
 net/netfilter/nf_conntrack_core.c| 91 
 net/netfilter/nf_conntrack_netlink.c | 14 ++
 net/netfilter/nf_conntrack_pptp.c|  3 +-
 net/netfilter/nf_nat_core.c  |  6 ---
 5 files changed, 74 insertions(+), 63 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index 2a12748..7277751 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -42,7 +42,6 @@ union nf_conntrack_expect_proto {
 
 #include 
 #include 
-#include 
 
 #ifdef CONFIG_NETFILTER_DEBUG
 #define NF_CT_ASSERT(x)WARN_ON(!(x))
@@ -73,7 +72,7 @@ struct nf_conn_help {
 #include 
 
 struct nf_conn {
-   /* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
+   /* Usage count in here is 1 for hash table, 1 per skb,
 * plus 1 for any connection(s) we are `master' for
 *
 * Hint, SKB address this struct and refcnt via skb->nfct and
@@ -96,8 +95,8 @@ struct nf_conn {
/* Have we seen traffic both ways yet? (bitset) */
unsigned long status;
 
-   /* Timer function; drops refcnt when it goes off. */
-   struct timer_list timeout;
+   /* jiffies32 when this ct is considered dead */
+   u32 timeout;
 
possible_net_t ct_net;
 
@@ -291,14 +290,28 @@ static inline bool nf_is_loopback_packet(const struct 
sk_buff *skb)
return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK;
 }
 
+#define nfct_time_stamp ((u32)(jiffies))
+
 /* jiffies until ct expires, 0 if already expired */
 static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
 {
-   long timeout = (long)ct->timeout.expires - (long)jiffies;
+   s32 timeout = ct->timeout - nfct_time_stamp;
 
return timeout > 0 ? timeout : 0;
 }
 
+static inline bool nf_ct_is_expired(const struct nf_conn *ct)
+{
+   return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
+}
+
+/* use after obtaining a reference count */
+static inline bool nf_ct_should_gc(const struct nf_conn *ct)
+{
+   return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
+  !nf_ct_is_dying(ct);
+}
+
 struct kernel_param;
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 887926a..87ee6da 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -371,7 +371,6 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
pr_debug("destroy_conntrack(%p)\n", ct);
NF_CT_ASSERT(atomic_read(&ct->use) == 0);
-   NF_CT_ASSERT(!timer_pending(&ct->timeout));
 
if (unlikely(nf_ct_is_template(ct))) {
nf_ct_tmpl_free(ct);
@@ -434,35 +433,30 @@ bool nf_ct_delete(struct nf_conn *ct, u32 portid, int 
report)
 {
struct nf_conn_tstamp *tstamp;
 
+   if (test_and_set_bit(IPS_DYING_BIT, &ct->status))
+   return false;
+
tstamp = nf_conn_tstamp_find(ct);
if (tstamp && tstamp->stop == 0)
tstamp->stop = ktime_get_real_ns();
 
-   if (nf_ct_is_dying(ct))
-   goto delete;
-
if (nf_conntrack_event_report(IPCT_DESTROY, ct,
portid, report) < 0) {
-   /* destroy event was not delivered */
+   /* destroy event was not delivered. nf_ct_put will
+* be done by event cache worker on redelivery.
+*/
nf_ct_delete_from_lists(ct);
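
The timestamp comparison that replaces the timer relies on the usual
signed-difference idiom, so it stays correct across a jiffies wrap; a
sketch of what the new helpers above compute:

	#define nfct_time_stamp ((u32)(jiffies))

	/* true once 'timeout' lies in the past; valid as long as the
	 * distance stays below 2^31 ticks */
	static bool ct_expired(u32 timeout)
	{
		return (s32)(timeout - nfct_time_stamp) <= 0;
	}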

[PATCH 19/29] netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion

2016-09-05 Thread Pablo Neira Ayuso
If the NLM_F_EXCL flag is set, then new elements that clash with an
existing one return EEXIST. In case you try to add an element whose
data area differs from what we have, then this returns EBUSY. If no
flag is specified at all, then this returns success to userspace.

This patch also update the set insert operation so we can fetch the
existing element that clashes with the one you want to add, we need
this to make sure the element data doesn't differ.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |  3 ++-
 net/netfilter/nf_tables_api.c | 20 +++-
 net/netfilter/nft_set_hash.c  | 17 +
 net/netfilter/nft_set_rbtree.c| 12 
 4 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h 
b/include/net/netfilter/nf_tables.h
index f2f1339..8972468 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -251,7 +251,8 @@ struct nft_set_ops {
 
int (*insert)(const struct net *net,
  const struct nft_set *set,
- const struct nft_set_elem 
*elem);
+ const struct nft_set_elem 
*elem,
+ struct nft_set_ext **ext);
void(*activate)(const struct net *net,
const struct nft_set *set,
const struct nft_set_elem 
*elem);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 221d27f..bd9715e 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3483,12 +3483,12 @@ static int nft_setelem_parse_flags(const struct nft_set 
*set,
 }
 
 static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
-   const struct nlattr *attr)
+   const struct nlattr *attr, u32 nlmsg_flags)
 {
struct nlattr *nla[NFTA_SET_ELEM_MAX + 1];
struct nft_data_desc d1, d2;
struct nft_set_ext_tmpl tmpl;
-   struct nft_set_ext *ext;
+   struct nft_set_ext *ext, *ext2;
struct nft_set_elem elem;
struct nft_set_binding *binding;
struct nft_userdata *udata;
@@ -3615,9 +3615,19 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct 
nft_set *set,
goto err4;
 
ext->genmask = nft_genmask_cur(ctx->net) | NFT_SET_ELEM_BUSY_MASK;
-   err = set->ops->insert(ctx->net, set, &elem);
-   if (err < 0)
+   err = set->ops->insert(ctx->net, set, &elem, &ext2);
+   if (err) {
+   if (err == -EEXIST) {
+   if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA) &&
+   nft_set_ext_exists(ext2, NFT_SET_EXT_DATA) &&
+   memcmp(nft_set_ext_data(ext),
+  nft_set_ext_data(ext2), set->dlen) != 0)
+   err = -EBUSY;
+   else if (!(nlmsg_flags & NLM_F_EXCL))
+   err = 0;
+   }
goto err5;
+   }
 
nft_trans_elem(trans) = elem;
list_add_tail(>list, >net->nft.commit_list);
@@ -3673,7 +3683,7 @@ static int nf_tables_newsetelem(struct net *net, struct 
sock *nlsk,
!atomic_add_unless(&set->nelems, 1, set->size + 
set->ndeact))
return -ENFILE;
 
-   err = nft_add_set_elem(&ctx, set, attr);
+   err = nft_add_set_elem(&ctx, set, attr, nlh->nlmsg_flags);
if (err < 0) {
atomic_dec(&set->nelems);
break;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 564fa79..3794cb2 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -126,7 +126,8 @@ err1:
 }
 
 static int nft_hash_insert(const struct net *net, const struct nft_set *set,
-  const struct nft_set_elem *elem)
+  const struct nft_set_elem *elem,
+  struct nft_set_ext **ext)
 {
struct nft_hash *priv = nft_set_priv(set);
struct nft_hash_elem *he = elem->priv;
@@ -135,9 +136,17 @@ static int nft_hash_insert(const struct net *net, const 
struct nft_set *set,
.set = set,
.key = elem->key.val.data,
};
-
-   return rhashtable_lookup_insert_key(&priv->ht, &arg, &he->node,
-   nft_hash_params);
+   struct nft_hash_elem *prev;
+
+   prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
+  nft_hash_params);
+   if (IS_ERR(prev))
+   return PTR_ERR(prev);
+   if (prev) {
+   *ext = &prev->ext;
+   
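
Condensed, the caller-side semantics from the nft_add_set_elem() hunk
above (a sketch; data_differs() is shorthand for the
nft_set_ext_exists()/memcmp() checks, not a real helper):

	err = set->ops->insert(ctx->net, set, &elem, &ext2);
	if (err == -EEXIST) {
		if (data_differs(ext, ext2))
			err = -EBUSY;	/* same key, different data */
		else if (!(nlmsg_flags & NLM_F_EXCL))
			err = 0;	/* silent success on re-add;
					 * with NLM_F_EXCL: -EEXIST */
	}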

[PATCH 08/29] netfilter: remove ip_conntrack* sysctl compat code

2016-09-05 Thread Pablo Neira Ayuso
This backward compatibility has been around for more than ten years,
since Yasuyuki Kozakai introduced IPv6 in conntrack. These days we have
the alternate /proc/net/nf_conntrack* entries and the ctnetlink
interface, and the conntrack utility has been adopted by many people in
the user community, judging by what I observed on the netfilter user
mailing list.

So let's get rid of this.

Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
not need to be exported as symbol anymore.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_conntrack_l4proto.h   |   8 -
 include/net/netns/conntrack.h  |   8 -
 net/ipv4/netfilter/Kconfig |  11 -
 net/ipv4/netfilter/Makefile|   5 -
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  70 ---
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   | 491 -
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |  39 +-
 net/netfilter/nf_conntrack_core.c  |   3 -
 net/netfilter/nf_conntrack_proto.c |  81 +---
 net/netfilter/nf_conntrack_proto_generic.c |  39 +-
 net/netfilter/nf_conntrack_proto_sctp.c|  85 +---
 net/netfilter/nf_conntrack_proto_tcp.c | 127 +-
 net/netfilter/nf_conntrack_proto_udp.c |  49 +-
 13 files changed, 7 insertions(+), 1009 deletions(-)
 delete mode 100644 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c

diff --git a/include/net/netfilter/nf_conntrack_l4proto.h 
b/include/net/netfilter/nf_conntrack_l4proto.h
index 1a5fb36..de629f1 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -134,14 +134,6 @@ void nf_ct_l4proto_pernet_unregister(struct net *net,
 int nf_ct_l4proto_register(struct nf_conntrack_l4proto *proto);
 void nf_ct_l4proto_unregister(struct nf_conntrack_l4proto *proto);
 
-static inline void nf_ct_kfree_compat_sysctl_table(struct nf_proto_net *pn)
-{
-#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
-   kfree(pn->ctl_compat_table);
-   pn->ctl_compat_table = NULL;
-#endif
-}
-
 /* Generic netlink helpers */
 int nf_ct_port_tuple_to_nlattr(struct sk_buff *skb,
   const struct nf_conntrack_tuple *tuple);
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index 38b1a80..e469e85 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -15,10 +15,6 @@ struct nf_proto_net {
 #ifdef CONFIG_SYSCTL
struct ctl_table_header *ctl_table_header;
struct ctl_table*ctl_table;
-#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
-   struct ctl_table_header *ctl_compat_header;
-   struct ctl_table*ctl_compat_table;
-#endif
 #endif
unsigned intusers;
 };
@@ -58,10 +54,6 @@ struct nf_ip_net {
struct nf_udp_net   udp;
struct nf_icmp_net  icmp;
struct nf_icmp_net  icmpv6;
-#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
-   struct ctl_table_header *ctl_table_header;
-   struct ctl_table*ctl_table;
-#endif
 };
 
 struct ct_pcpu {
diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index c187c60..d613309 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -25,17 +25,6 @@ config NF_CONNTRACK_IPV4
 
  To compile it as a module, choose M here.  If unsure, say N.
 
-config NF_CONNTRACK_PROC_COMPAT
-   bool "proc/sysctl compatibility with old connection tracking"
-   depends on NF_CONNTRACK_PROCFS && NF_CONNTRACK_IPV4
-   default y
-   help
- This option enables /proc and sysctl compatibility with the old
- layer 3 dependent connection tracking. This is needed to keep
- old programs that have not been adapted to the new names working.
-
- If unsure, say Y.
-
 if NF_TABLES
 
 config NF_TABLES_IPV4
diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile
index 87b073d..853328f 100644
--- a/net/ipv4/netfilter/Makefile
+++ b/net/ipv4/netfilter/Makefile
@@ -4,11 +4,6 @@
 
 # objects for l3 independent conntrack
 nf_conntrack_ipv4-y:=  nf_conntrack_l3proto_ipv4.o 
nf_conntrack_proto_icmp.o
-ifeq ($(CONFIG_NF_CONNTRACK_PROC_COMPAT),y)
-ifeq ($(CONFIG_PROC_FS),y)
-nf_conntrack_ipv4-objs += nf_conntrack_l3proto_ipv4_compat.o
-endif
-endif
 
 # connection tracking
 obj-$(CONFIG_NF_CONNTRACK_IPV4) += nf_conntrack_ipv4.o
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c 
b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index ae1a71a..870aebd 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -202,47 +202,6 @@ static struct nf_hook_ops ipv4_conntrack_ops[] 
__read_mostly = {
},
 };
 
-#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
-static int log_invalid_proto_min = 

[PATCH 25/29] netfilter: conntrack: add gc worker to remove timed-out entries

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

Conntrack gc worker to evict stale entries.

GC happens once every 5 seconds, but we only scan at most 1/64th of the
table (and no more than 8k buckets) per run to avoid hogging the cpu.

This means that a complete scan of the table will take several minutes
of wall-clock time.

Considering that the gc run will never have to evict any entries
during normal operation, because those evictions happen from the packet
path, this should be fine.

We only need gc to make sure userspace (conntrack event listeners)
eventually learn of the timeout, and for resource reclaim in case the
system becomes idle.

We do not disable BH, and we cond_resched for every bucket, so this
should not introduce noticeable latencies either.

A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.

v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
nulls value change in gc worker, suggested by Eric Dumazet.

v3: don't call cancel_delayed_work_sync twice (again, Eric).

Signed-off-by: Florian Westphal 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_core.c | 76 +++
 1 file changed, 76 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 87ee6da..f95a9e9 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -72,11 +72,24 @@ EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
 struct hlist_nulls_head *nf_conntrack_hash __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_hash);
 
+struct conntrack_gc_work {
+   struct delayed_work dwork;
+   u32 last_bucket;
+   boolexiting;
+};
+
 static __read_mostly struct kmem_cache *nf_conntrack_cachep;
 static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
+#define GC_MAX_BUCKETS_DIV 64u
+#define GC_MAX_BUCKETS 8192u
+#define GC_INTERVAL (5 * HZ)
+#define GC_MAX_EVICTS  256u
+
+static struct conntrack_gc_work conntrack_gc_work;
+
 void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
 {
spin_lock(lock);
@@ -928,6 +941,63 @@ static noinline int early_drop(struct net *net, unsigned 
int _hash)
return false;
 }
 
+static void gc_worker(struct work_struct *work)
+{
+   unsigned int i, goal, buckets = 0, expired_count = 0;
+   unsigned long next_run = GC_INTERVAL;
+   struct conntrack_gc_work *gc_work;
+
+   gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
+
+   goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, 
GC_MAX_BUCKETS);
+   i = gc_work->last_bucket;
+
+   do {
+   struct nf_conntrack_tuple_hash *h;
+   struct hlist_nulls_head *ct_hash;
+   struct hlist_nulls_node *n;
+   unsigned int hashsz;
+   struct nf_conn *tmp;
+
+   i++;
+   rcu_read_lock();
+
+   nf_conntrack_get_ht(&ct_hash, &hashsz);
+   if (i >= hashsz)
+   i = 0;
+
+   hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
+   tmp = nf_ct_tuplehash_to_ctrack(h);
+
+   if (nf_ct_is_expired(tmp)) {
+   nf_ct_gc_expired(tmp);
+   expired_count++;
+   continue;
+   }
+   }
+
+   /* could check get_nulls_value() here and restart if ct
+* was moved to another chain.  But given gc is best-effort
+* we will just continue with next hash slot.
+*/
+   rcu_read_unlock();
+   cond_resched_rcu_qs();
+   } while (++buckets < goal &&
+expired_count < GC_MAX_EVICTS);
+
+   if (gc_work->exiting)
+   return;
+
+   gc_work->last_bucket = i;
+   schedule_delayed_work(&gc_work->dwork, next_run);
+}
+
+static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
+{
+   INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
+   gc_work->exiting = false;
+}
+
 static struct nf_conn *
 __nf_conntrack_alloc(struct net *net,
 const struct nf_conntrack_zone *zone,
@@ -1534,6 +1604,7 @@ static int untrack_refs(void)
 
 void nf_conntrack_cleanup_start(void)
 {
+   conntrack_gc_work.exiting = true;
RCU_INIT_POINTER(ip_ct_attach, NULL);
 }
 
@@ -1543,6 +1614,7 @@ void nf_conntrack_cleanup_end(void)
while (untrack_refs() > 0)
schedule();
 
+   cancel_delayed_work_sync(&conntrack_gc_work.dwork);
nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
 
nf_conntrack_proto_fini();
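
For scale, a worked example of the scan budget (table size assumed, not
from the patch): with nf_conntrack_htable_size = 65536, each run scans
goal = min(65536 / 64, 8192) = 1024 buckets; at one run per GC_INTERVAL
(5 seconds), a complete pass over the table takes (65536 / 1024) * 5s =
320s, i.e. the "several minutes of wall-clock time" mentioned above.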

[PATCH 01/29] netfilter: conntrack: Only need first 4 bytes to get l4proto ports

2016-09-05 Thread Pablo Neira Ayuso
From: Gao Feng 

We only need first 4 bytes instead of 8 bytes to get the ports of
tcp/udp/dccp/sctp/udplite in their pkt_to_tuple function.

Signed-off-by: Gao Feng 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_proto_dccp.c| 3 ++-
 net/netfilter/nf_conntrack_proto_sctp.c| 4 ++--
 net/netfilter/nf_conntrack_proto_tcp.c | 4 ++--
 net/netfilter/nf_conntrack_proto_udp.c | 4 ++--
 net/netfilter/nf_conntrack_proto_udplite.c | 3 ++-
 5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_dccp.c 
b/net/netfilter/nf_conntrack_proto_dccp.c
index 399a38f..a45bee5 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -402,7 +402,8 @@ static bool dccp_pkt_to_tuple(const struct sk_buff *skb, 
unsigned int dataoff,
 {
struct dccp_hdr _hdr, *dh;
 
-   dh = skb_header_pointer(skb, dataoff, sizeof(_hdr), &_hdr);
+   /* Actually only need first 4 bytes to get ports. */
+   dh = skb_header_pointer(skb, dataoff, 4, &_hdr);
if (dh == NULL)
return false;
 
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c 
b/net/netfilter/nf_conntrack_proto_sctp.c
index 1d7ab96..e769f05 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -161,8 +161,8 @@ static bool sctp_pkt_to_tuple(const struct sk_buff *skb, 
unsigned int dataoff,
const struct sctphdr *hp;
struct sctphdr _hdr;
 
-   /* Actually only need first 8 bytes. */
-   hp = skb_header_pointer(skb, dataoff, 8, &_hdr);
+   /* Actually only need first 4 bytes to get ports. */
+   hp = skb_header_pointer(skb, dataoff, 4, &_hdr);
if (hp == NULL)
return false;
 
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c 
b/net/netfilter/nf_conntrack_proto_tcp.c
index 70c8381..4abe9e1 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -282,8 +282,8 @@ static bool tcp_pkt_to_tuple(const struct sk_buff *skb, 
unsigned int dataoff,
const struct tcphdr *hp;
struct tcphdr _hdr;
 
-   /* Actually only need first 8 bytes. */
-   hp = skb_header_pointer(skb, dataoff, 8, &_hdr);
+   /* Actually only need first 4 bytes to get ports. */
+   hp = skb_header_pointer(skb, dataoff, 4, &_hdr);
if (hp == NULL)
return false;
 
diff --git a/net/netfilter/nf_conntrack_proto_udp.c 
b/net/netfilter/nf_conntrack_proto_udp.c
index 4fd0405..8a057e1 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -44,8 +44,8 @@ static bool udp_pkt_to_tuple(const struct sk_buff *skb,
const struct udphdr *hp;
struct udphdr _hdr;
 
-   /* Actually only need first 8 bytes. */
-   hp = skb_header_pointer(skb, dataoff, sizeof(_hdr), &_hdr);
+   /* Actually only need first 4 bytes to get ports. */
+   hp = skb_header_pointer(skb, dataoff, 4, &_hdr);
if (hp == NULL)
return false;
 
diff --git a/net/netfilter/nf_conntrack_proto_udplite.c 
b/net/netfilter/nf_conntrack_proto_udplite.c
index 9d692f5..029206e 100644
--- a/net/netfilter/nf_conntrack_proto_udplite.c
+++ b/net/netfilter/nf_conntrack_proto_udplite.c
@@ -54,7 +54,8 @@ static bool udplite_pkt_to_tuple(const struct sk_buff *skb,
const struct udphdr *hp;
struct udphdr _hdr;
 
-   hp = skb_header_pointer(skb, dataoff, sizeof(_hdr), &_hdr);
+   /* Actually only need first 4 bytes to get ports. */
+   hp = skb_header_pointer(skb, dataoff, 4, &_hdr);
if (hp == NULL)
return false;
 
-- 
2.1.4



[PATCH 13/29] netfilter: fix spelling mistake: "delimitter" -> "delimiter"

2016-09-05 Thread Pablo Neira Ayuso
From: Colin Ian King 

trivial fix to spelling mistake in pr_debug message

Signed-off-by: Colin Ian King 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_ftp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_ftp.c b/net/netfilter/nf_conntrack_ftp.c
index 4314700..b6934b5 100644
--- a/net/netfilter/nf_conntrack_ftp.c
+++ b/net/netfilter/nf_conntrack_ftp.c
@@ -237,7 +237,7 @@ static int try_eprt(const char *data, size_t dlen, struct 
nf_conntrack_man *cmd,
}
delim = data[0];
if (isdigit(delim) || delim < 33 || delim > 126 || data[2] != delim) {
-   pr_debug("try_eprt: invalid delimitter.\n");
+   pr_debug("try_eprt: invalid delimiter.\n");
return 0;
}
 
-- 
2.1.4



[PATCH 05/29] ipvs: use nf_ct_kill helper

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

Once the timer is removed from the nf_conn struct we cannot open-code
the removal sequence anymore.

Signed-off-by: Florian Westphal 
Acked-by: Julian Anastasov 
Signed-off-by: Pablo Neira Ayuso 
---
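For context, nf_ct_kill() at this point in the series is a thin wrapper that
performs exactly the sequence being removed below (sketch of the then-current
helper; a later patch in this series drops the __nf_ct_kill_acct() variant):

  /* steal the timer reference and run the timeout handler */
  static inline bool nf_ct_kill(struct nf_conn *ct)
  {
          return __nf_ct_kill_acct(ct, 0, NULL, 0);
  }
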
 net/netfilter/ipvs/ip_vs_nfct.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_nfct.c b/net/netfilter/ipvs/ip_vs_nfct.c
index f04fd8d..fc230d9 100644
--- a/net/netfilter/ipvs/ip_vs_nfct.c
+++ b/net/netfilter/ipvs/ip_vs_nfct.c
@@ -281,13 +281,10 @@ void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp)
h = nf_conntrack_find_get(cp->ipvs->net, &nf_ct_zone_dflt, &tuple);
if (h) {
ct = nf_ct_tuplehash_to_ctrack(h);
-   /* Show what happens instead of calling nf_ct_kill() */
-   if (del_timer(&ct->timeout)) {
-   IP_VS_DBG(7, "%s: ct=%p, deleted conntrack timer for tuple="
+   if (nf_ct_kill(ct)) {
+   IP_VS_DBG(7, "%s: ct=%p, deleted conntrack for tuple="
FMT_TUPLE "\n",
__func__, ct, ARG_TUPLE(&tuple));
-   if (ct->timeout.function)
-   ct->timeout.function(ct->timeout.data);
} else {
IP_VS_DBG(7, "%s: ct=%p, no conntrack timer for tuple="
FMT_TUPLE "\n",
-- 
2.1.4



[PATCH 17/29] netfilter: nf_tables: reject hook configuration updates on existing chains

2016-09-05 Thread Pablo Neira Ayuso
Currently, if you add a base chain whose name clashes with an existing
non-base chain, nf_tables doesn't complain about this. The same is true
if you update the chain type, the hook number or the priority.

With this patch, nf_tables bails out with EBUSY in case any of these
unsupported operations occurs.

 # nft add table x
 # nft add chain x y
 # nft add chain x y { type nat hook input priority 0\; }
 <cmdline>:1:1-49: Error: Could not process rule: Device or resource busy
 add chain x y { type nat hook input priority 0; }
 ^

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 463fcad..221d27f 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1348,6 +1348,37 @@ static int nf_tables_newchain(struct net *net, struct 
sock *nlsk,
if (nlh->nlmsg_flags & NLM_F_REPLACE)
return -EOPNOTSUPP;
 
+   if (nla[NFTA_CHAIN_HOOK]) {
+   struct nft_base_chain *basechain;
+   struct nft_chain_hook hook;
+   struct nf_hook_ops *ops;
+
+   if (!(chain->flags & NFT_BASE_CHAIN))
+   return -EBUSY;
+
+   err = nft_chain_parse_hook(net, nla, afi, &hook,
+  create);
+   if (err < 0)
+   return err;
+
+   basechain = nft_base_chain(chain);
+   if (basechain->type != hook.type) {
+   nft_chain_release_hook(&hook);
+   return -EBUSY;
+   }
+
+   for (i = 0; i < afi->nops; i++) {
+   ops = &basechain->ops[i];
+   if (ops->hooknum != hook.num ||
+   ops->priority != hook.priority ||
+   ops->dev != hook.dev) {
+   nft_chain_release_hook(&hook);
+   return -EBUSY;
+   }
+   }
+   nft_chain_release_hook(&hook);
+   }
+
if (nla[NFTA_CHAIN_HANDLE] && name) {
struct nft_chain *chain2;
 
-- 
2.1.4



[PATCH 21/29] netfilter: restart search if moved to other chain

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

In case nf_conntrack_tuple_taken did not find a conflicting entry
check that all entries in this hash slot were tested and restart
in case an entry was moved to another chain.

Reported-by: Eric Dumazet 
Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Signed-off-by: Florian Westphal 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
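The check added below works because nulls lists terminate in a marker value
that encodes the slot they belong to. A generic sketch of the lookup pattern
this relies on (table, slot and match are placeholders):

  rcu_read_lock();
  begin:
  hlist_nulls_for_each_entry_rcu(obj, n, &table[slot], node) {
          if (match(obj, key))
                  ...;
  }
  /* If an entry was freed and reused in another chain while we walked
   * (SLAB_DESTROY_BY_RCU), the walk ends on some other bucket's nulls
   * marker; rescan in that case.
   */
  if (get_nulls_value(n) != slot)
          goto begin;
  rcu_read_unlock();
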
 net/netfilter/nf_conntrack_core.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 7d90a5d..887926a 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -809,6 +809,7 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple 
*tuple,
zone = nf_ct_zone(ignored_conntrack);
 
rcu_read_lock();
+ begin:
nf_conntrack_get_ht(&ct_hash, &hsize);
hash = __hash_conntrack(net, tuple, hsize);
 
@@ -822,6 +823,12 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple 
*tuple,
}
NF_CT_STAT_INC_ATOMIC(net, searched);
}
+
+   if (get_nulls_value(n) != hash) {
+   NF_CT_STAT_INC_ATOMIC(net, search_restart);
+   goto begin;
+   }
+
rcu_read_unlock();
 
return 0;
-- 
2.1.4



[PATCH 11/29] netfilter: nf_tables: add quota expression

2016-09-05 Thread Pablo Neira Ayuso
This patch adds the quota expression. This new stateful expression
integrates easily into the dynset expression to build 'hashquota' flow
tables.

Arguably, we could use "counter bytes > 1000" instead, but this
approach has several problems:

1) We only support one single stateful expression in dynamic set
   definitions, and the expression above is a composite of two
   expressions: get counter + comparison.

2) We would need to restore the packed counter representation (that we
   used to have) based on seqlock to synchronize this, since per-cpu is
   not suitable for this.

So instead of bloating the counter expression back with the seqlock
representation and extending the existing set infrastructure to make it
more complex for the composite described above, let's follow the more
simple approach of adding a quota expression that we can plug into our
existing infrastructure.

Signed-off-by: Pablo Neira Ayuso 
---
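For the plain (non-dynset) case the expression can be used directly in a
rule. Illustrative syntax, assuming the matching nft userspace support from
the same cycle; 'over' corresponds to the NFT_QUOTA_F_INV flag:

 # nft add rule filter input quota over 25 mbytes drop

Without the inversion the expression matches until the byte budget is
consumed; with it, the rule only starts matching once the quota has been
exceeded.
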
 include/uapi/linux/netfilter/nf_tables.h |  19 +
 net/netfilter/Kconfig|   6 ++
 net/netfilter/Makefile   |   1 +
 net/netfilter/nft_quota.c| 121 +++
 4 files changed, 147 insertions(+)
 create mode 100644 net/netfilter/nft_quota.c

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 6ce0a6d..784fbf1 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -900,6 +900,25 @@ enum nft_queue_attributes {
 #define NFT_QUEUE_FLAG_CPU_FANOUT  0x02 /* use current CPU (no hashing) */
 #define NFT_QUEUE_FLAG_MASK0x03
 
+enum nft_quota_flags {
+   NFT_QUOTA_F_INV = (1 << 0),
+};
+
+/**
+ * enum nft_quota_attributes - nf_tables quota expression netlink attributes
+ *
+ * @NFTA_QUOTA_BYTES: quota in bytes (NLA_U64)
+ * @NFTA_QUOTA_FLAGS: flags (NLA_U32)
+ */
+enum nft_quota_attributes {
+   NFTA_QUOTA_UNSPEC,
+   NFTA_QUOTA_BYTES,
+   NFTA_QUOTA_FLAGS,
+   NFTA_QUOTA_PAD,
+   __NFTA_QUOTA_MAX
+};
+#define NFTA_QUOTA_MAX (__NFTA_QUOTA_MAX - 1)
+
 /**
  * enum nft_reject_types - nf_tables reject expression reject types
  *
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 9cfaa00..29a8078 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -542,6 +542,12 @@ config NFT_QUEUE
  This is required if you intend to use the userspace queueing
  infrastructure (also known as NFQUEUE) from nftables.
 
+config NFT_QUOTA
+   tristate "Netfilter nf_tables quota module"
+   help
+ This option adds the "quota" expression that you can use to match
+ enforce bytes quotas.
+
 config NFT_REJECT
default m if NETFILTER_ADVANCED=n
tristate "Netfilter nf_tables reject support"
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 1106ccd..0fc42df 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_NFT_CT)  += nft_ct.o
 obj-$(CONFIG_NFT_LIMIT)+= nft_limit.o
 obj-$(CONFIG_NFT_NAT)  += nft_nat.o
 obj-$(CONFIG_NFT_QUEUE)+= nft_queue.o
+obj-$(CONFIG_NFT_QUOTA)+= nft_quota.o
 obj-$(CONFIG_NFT_REJECT)   += nft_reject.o
 obj-$(CONFIG_NFT_REJECT_INET)  += nft_reject_inet.o
 obj-$(CONFIG_NFT_SET_RBTREE)   += nft_set_rbtree.o
diff --git a/net/netfilter/nft_quota.c b/net/netfilter/nft_quota.c
new file mode 100644
index 000..6eafbf9
--- /dev/null
+++ b/net/netfilter/nft_quota.c
@@ -0,0 +1,121 @@
+/*
+ * Copyright (c) 2016 Pablo Neira Ayuso 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct nft_quota {
+   u64 quota;
+   boolinvert;
+   atomic64_t  remain;
+};
+
+static inline long nft_quota(struct nft_quota *priv,
+const struct nft_pktinfo *pkt)
+{
+   return atomic64_sub_return(pkt->skb->len, &priv->remain);
+}
+
+static void nft_quota_eval(const struct nft_expr *expr,
+  struct nft_regs *regs,
+  const struct nft_pktinfo *pkt)
+{
+   struct nft_quota *priv = nft_expr_priv(expr);
+
+   if (nft_quota(priv, pkt) < 0 && !priv->invert)
+   regs->verdict.code = NFT_BREAK;
+}
+
+static const struct nla_policy nft_quota_policy[NFTA_QUOTA_MAX + 1] = {
+   [NFTA_QUOTA_BYTES]  = { .type = NLA_U64 },
+   [NFTA_QUOTA_FLAGS]  = { .type = NLA_U32 },
+};
+
+static int nft_quota_init(const struct nft_ctx *ctx,
+ const struct nft_expr *expr,
+ const struct nlattr * const tb[])
+{
+   struct nft_quota *priv = 

[PATCH 24/29] netfilter: evict stale entries on netlink dumps

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

When dumping we already have to look at the entire table, so we might
as well toss those entries whose timeout value is in the past.

We also look at every entry during resize operations.
However, eviction there is not as simple because we hold the
global resize lock, so we can't evict without adding an 'expired' list
to drop from later.  Considering that resizes are very rare it doesn't
seem worth doing it.

Signed-off-by: Florian Westphal 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
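The evictions are deferred because killing a conntrack unhashes it, which
takes the same bucket locks the dump is already holding. The shape of the
pattern used below (sketch; lock and bucket variables are placeholders):

  struct nf_conn *evict[8];
  int i = 0;

  spin_lock(lockp);
  hlist_nulls_for_each_entry(h, n, bucket, hnnode) {
          ct = nf_ct_tuplehash_to_ctrack(h);
          /* under the lock: only take a reference */
          if (nf_ct_is_expired(ct) && i < ARRAY_SIZE(evict) &&
              atomic_inc_not_zero(&ct->ct_general.use))
                  evict[i++] = ct;
  }
  spin_unlock(lockp);

  /* outside the lock: now safe to unhash and drop the reference */
  while (i) {
          i--;
          if (nf_ct_should_gc(evict[i]))
                  nf_ct_kill(evict[i]);
          nf_ct_put(evict[i]);
  }
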
 net/netfilter/nf_conntrack_netlink.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c 
b/net/netfilter/nf_conntrack_netlink.c
index 81fd34c..dedbe4b 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -815,14 +815,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct 
netlink_callback *cb)
struct hlist_nulls_node *n;
struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
u_int8_t l3proto = nfmsg->nfgen_family;
-   int res;
+   struct nf_conn *nf_ct_evict[8];
+   int res, i;
spinlock_t *lockp;
 
last = (struct nf_conn *)cb->args[1];
+   i = 0;
 
local_bh_disable();
for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) {
 restart:
+   while (i) {
+   i--;
+   if (nf_ct_should_gc(nf_ct_evict[i]))
+   nf_ct_kill(nf_ct_evict[i]);
+   nf_ct_put(nf_ct_evict[i]);
+   }
+
lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS];
nf_conntrack_lock(lockp);
if (cb->args[0] >= nf_conntrack_htable_size) {
@@ -834,6 +843,13 @@ restart:
if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
continue;
ct = nf_ct_tuplehash_to_ctrack(h);
+   if (nf_ct_is_expired(ct)) {
+   if (i < ARRAY_SIZE(nf_ct_evict) &&
+   atomic_inc_not_zero(&ct->ct_general.use))
+   nf_ct_evict[i++] = ct;
+   continue;
+   }
+
if (!net_eq(net, nf_ct_net(ct)))
continue;
 
@@ -875,6 +891,13 @@ out:
if (last)
nf_ct_put(last);
 
+   while (i) {
+   i--;
+   if (nf_ct_should_gc(nf_ct_evict[i]))
+   nf_ct_kill(nf_ct_evict[i]);
+   nf_ct_put(nf_ct_evict[i]);
+   }
+
return skb->len;
 }
 
-- 
2.1.4



[PATCH 20/29] netfilter: nf_tables: Use nla_put_be32() to dump immediate parameters

2016-09-05 Thread Pablo Neira Ayuso
nft_dump_register() should only be used with registers, not with
immediates.

Fixes: cb1b69b0b15b ("netfilter: nf_tables: add hash expression")
Fixes: 91dbc6be0a62 ("netfilter: nf_tables: add number generator expression")
Signed-off-by: Pablo Neira Ayuso 
---
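nft_dump_register() is not a thin wrapper around nla_put_be32(): it
translates the kernel's internal 32-bit register numbering back to the UAPI
encoding before emitting the attribute. A sketch of why that corrupts
immediates (the helper body is paraphrased from the nf_tables core of this
era; treat the exact arithmetic as an assumption):

  int nft_dump_register(struct sk_buff *skb, unsigned int attr,
                        unsigned int reg)
  {
          /* remapping is fine for register numbers, garbage for plain
           * values such as priv->len or priv->modulus
           */
          if (reg % (NFT_REG_SIZE / NFT_REG32_SIZE) == 0)
                  reg /= NFT_REG_SIZE / NFT_REG32_SIZE;
          else
                  reg = reg - NFT_REG_SIZE / NFT_REG32_SIZE + NFT_REG32_00;

          return nla_put_be32(skb, attr, htonl(reg));
  }
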
 net/netfilter/nft_hash.c   | 6 +++---
 net/netfilter/nft_numgen.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index e090aee..764251d 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -88,11 +88,11 @@ static int nft_hash_dump(struct sk_buff *skb,
goto nla_put_failure;
if (nft_dump_register(skb, NFTA_HASH_DREG, priv->dreg))
goto nla_put_failure;
-   if (nft_dump_register(skb, NFTA_HASH_LEN, priv->len))
+   if (nla_put_be32(skb, NFTA_HASH_LEN, htonl(priv->len)))
goto nla_put_failure;
-   if (nft_dump_register(skb, NFTA_HASH_MODULUS, priv->modulus))
+   if (nla_put_be32(skb, NFTA_HASH_MODULUS, htonl(priv->modulus)))
goto nla_put_failure;
-   if (nft_dump_register(skb, NFTA_HASH_SEED, priv->seed))
+   if (nla_put_be32(skb, NFTA_HASH_SEED, htonl(priv->seed)))
goto nla_put_failure;
 
return 0;
diff --git a/net/netfilter/nft_numgen.c b/net/netfilter/nft_numgen.c
index 176e26d..294745e 100644
--- a/net/netfilter/nft_numgen.c
+++ b/net/netfilter/nft_numgen.c
@@ -68,9 +68,9 @@ static int nft_ng_dump(struct sk_buff *skb, enum 
nft_registers dreg,
 {
if (nft_dump_register(skb, NFTA_NG_DREG, dreg))
goto nla_put_failure;
-   if (nft_dump_register(skb, NFTA_NG_UNTIL, until))
+   if (nla_put_be32(skb, NFTA_NG_UNTIL, htonl(until)))
goto nla_put_failure;
-   if (nft_dump_register(skb, NFTA_NG_TYPE, type))
+   if (nla_put_be32(skb, NFTA_NG_TYPE, htonl(type)))
goto nla_put_failure;
 
return 0;
-- 
2.1.4



[PATCH 16/29] netfilter: nf_tables: introduce nft_chain_parse_hook()

2016-09-05 Thread Pablo Neira Ayuso
Introduce a new function to wrap the code that parses the chain hook
configuration so we can reuse this code to validate chain updates.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 152 +-
 1 file changed, 89 insertions(+), 63 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 7e1c876..463fcad 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1196,6 +1196,83 @@ static void nf_tables_chain_destroy(struct nft_chain 
*chain)
}
 }
 
+struct nft_chain_hook {
+   u32 num;
+   u32 priority;
+   const struct nf_chain_type  *type;
+   struct net_device   *dev;
+};
+
+static int nft_chain_parse_hook(struct net *net,
+   const struct nlattr * const nla[],
+   struct nft_af_info *afi,
+   struct nft_chain_hook *hook, bool create)
+{
+   struct nlattr *ha[NFTA_HOOK_MAX + 1];
+   const struct nf_chain_type *type;
+   struct net_device *dev;
+   int err;
+
+   err = nla_parse_nested(ha, NFTA_HOOK_MAX, nla[NFTA_CHAIN_HOOK],
+  nft_hook_policy);
+   if (err < 0)
+   return err;
+
+   if (ha[NFTA_HOOK_HOOKNUM] == NULL ||
+   ha[NFTA_HOOK_PRIORITY] == NULL)
+   return -EINVAL;
+
+   hook->num = ntohl(nla_get_be32(ha[NFTA_HOOK_HOOKNUM]));
+   if (hook->num >= afi->nhooks)
+   return -EINVAL;
+
+   hook->priority = ntohl(nla_get_be32(ha[NFTA_HOOK_PRIORITY]));
+
+   type = chain_type[afi->family][NFT_CHAIN_T_DEFAULT];
+   if (nla[NFTA_CHAIN_TYPE]) {
+   type = nf_tables_chain_type_lookup(afi, nla[NFTA_CHAIN_TYPE],
+  create);
+   if (IS_ERR(type))
+   return PTR_ERR(type);
+   }
+   if (!(type->hook_mask & (1 << hook->num)))
+   return -EOPNOTSUPP;
+   if (!try_module_get(type->owner))
+   return -ENOENT;
+
+   hook->type = type;
+
+   hook->dev = NULL;
+   if (afi->flags & NFT_AF_NEEDS_DEV) {
+   char ifname[IFNAMSIZ];
+
+   if (!ha[NFTA_HOOK_DEV]) {
+   module_put(type->owner);
+   return -EOPNOTSUPP;
+   }
+
+   nla_strlcpy(ifname, ha[NFTA_HOOK_DEV], IFNAMSIZ);
+   dev = dev_get_by_name(net, ifname);
+   if (!dev) {
+   module_put(type->owner);
+   return -ENOENT;
+   }
+   hook->dev = dev;
+   } else if (ha[NFTA_HOOK_DEV]) {
+   module_put(type->owner);
+   return -EOPNOTSUPP;
+   }
+
+   return 0;
+}
+
+static void nft_chain_release_hook(struct nft_chain_hook *hook)
+{
+   module_put(hook->type->owner);
+   if (hook->dev != NULL)
+   dev_put(hook->dev);
+}
+
 static int nf_tables_newchain(struct net *net, struct sock *nlsk,
  struct sk_buff *skb, const struct nlmsghdr *nlh,
  const struct nlattr * const nla[])
@@ -1206,10 +1283,8 @@ static int nf_tables_newchain(struct net *net, struct 
sock *nlsk,
struct nft_table *table;
struct nft_chain *chain;
struct nft_base_chain *basechain = NULL;
-   struct nlattr *ha[NFTA_HOOK_MAX + 1];
u8 genmask = nft_genmask_next(net);
int family = nfmsg->nfgen_family;
-   struct net_device *dev = NULL;
u8 policy = NF_ACCEPT;
u64 handle = 0;
unsigned int i;
@@ -1320,102 +1395,53 @@ static int nf_tables_newchain(struct net *net, struct 
sock *nlsk,
return -EOVERFLOW;
 
if (nla[NFTA_CHAIN_HOOK]) {
-   const struct nf_chain_type *type;
+   struct nft_chain_hook hook;
struct nf_hook_ops *ops;
nf_hookfn *hookfn;
-   u32 hooknum, priority;
-
-   type = chain_type[family][NFT_CHAIN_T_DEFAULT];
-   if (nla[NFTA_CHAIN_TYPE]) {
-   type = nf_tables_chain_type_lookup(afi,
-  nla[NFTA_CHAIN_TYPE],
-  create);
-   if (IS_ERR(type))
-   return PTR_ERR(type);
-   }
 
-   err = nla_parse_nested(ha, NFTA_HOOK_MAX, nla[NFTA_CHAIN_HOOK],
-  nft_hook_policy);
+   err = nft_chain_parse_hook(net, nla, afi, , create);
if (err < 0)
return err;
-   if (ha[NFTA_HOOK_HOOKNUM] == NULL ||
-   ha[NFTA_HOOK_PRIORITY] == NULL)
-   

[PATCH 26/29] netfilter: conntrack: resched gc again if eviction rate is high

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

If we evicted a large fraction of the scanned conntrack entries, re-schedule
the next gc cycle for immediate execution.

This triggers during tests where load is high, then drops to zero and
many connections will be in TW/CLOSE state with < 30 second timeouts.

Without this change it will take several minutes until conntrack count
comes back to normal.

Signed-off-by: Florian Westphal 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_core.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index f95a9e9..7c66ce40 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -945,6 +945,7 @@ static void gc_worker(struct work_struct *work)
 {
unsigned int i, goal, buckets = 0, expired_count = 0;
unsigned long next_run = GC_INTERVAL;
+   unsigned int ratio, scanned = 0;
struct conntrack_gc_work *gc_work;
 
gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
@@ -969,6 +970,7 @@ static void gc_worker(struct work_struct *work)
hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
tmp = nf_ct_tuplehash_to_ctrack(h);
 
+   scanned++;
if (nf_ct_is_expired(tmp)) {
nf_ct_gc_expired(tmp);
expired_count++;
@@ -988,6 +990,10 @@ static void gc_worker(struct work_struct *work)
if (gc_work->exiting)
return;
 
+   ratio = scanned ? expired_count * 100 / scanned : 0;
+   if (ratio >= 90)
+   next_run = 0;
+
gc_work->last_bucket = i;
schedule_delayed_work(&gc_work->dwork, next_run);
 }
-- 
2.1.4



[PATCH 06/29] netfilter: nf_tables: rename set implementations

2016-09-05 Thread Pablo Neira Ayuso
Use nft_set_* prefix for backend set implementations, thus we can use
nft_hash for the new hash expression.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/Kconfig| 4 ++--
 net/netfilter/Makefile   | 4 ++--
 net/netfilter/{nft_hash.c => nft_set_hash.c} | 0
 net/netfilter/{nft_rbtree.c => nft_set_rbtree.c} | 0
 4 files changed, 4 insertions(+), 4 deletions(-)
 rename net/netfilter/{nft_hash.c => nft_set_hash.c} (100%)
 rename net/netfilter/{nft_rbtree.c => nft_set_rbtree.c} (100%)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 9266cee..e5740e1 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -481,13 +481,13 @@ config NFT_CT
  This option adds the "meta" expression that you can use to match
  connection tracking information such as the flow state.
 
-config NFT_RBTREE
+config NFT_SET_RBTREE
tristate "Netfilter nf_tables rbtree set module"
help
  This option adds the "rbtree" set type (Red Black tree) that is used
  to build interval-based sets.
 
-config NFT_HASH
+config NFT_SET_HASH
tristate "Netfilter nf_tables hash set module"
help
  This option adds the "hash" set type that is used to build one-way
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 6913454..101fb85 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -86,8 +86,8 @@ obj-$(CONFIG_NFT_NAT) += nft_nat.o
 obj-$(CONFIG_NFT_QUEUE)+= nft_queue.o
 obj-$(CONFIG_NFT_REJECT)   += nft_reject.o
 obj-$(CONFIG_NFT_REJECT_INET)  += nft_reject_inet.o
-obj-$(CONFIG_NFT_RBTREE)   += nft_rbtree.o
-obj-$(CONFIG_NFT_HASH) += nft_hash.o
+obj-$(CONFIG_NFT_SET_RBTREE)   += nft_set_rbtree.o
+obj-$(CONFIG_NFT_SET_HASH) += nft_set_hash.o
 obj-$(CONFIG_NFT_COUNTER)  += nft_counter.o
 obj-$(CONFIG_NFT_LOG)  += nft_log.o
 obj-$(CONFIG_NFT_MASQ) += nft_masq.o
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_set_hash.c
similarity index 100%
rename from net/netfilter/nft_hash.c
rename to net/netfilter/nft_set_hash.c
diff --git a/net/netfilter/nft_rbtree.c b/net/netfilter/nft_set_rbtree.c
similarity index 100%
rename from net/netfilter/nft_rbtree.c
rename to net/netfilter/nft_set_rbtree.c
-- 
2.1.4



[PATCH 03/29] netfilter: nf_dup4: remove redundant checksum recalculation

2016-09-05 Thread Pablo Neira Ayuso
From: Liping Zhang 

The IP header checksum will be recalculated at ip_local_out, so
there's no need to calculate it here; remove it. Also update the
code comments to reflect this, and delete the misleading comment
about checksum recalculation.

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
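The recalculation is redundant because the duplicated packet always leaves
through ip_local_out(), and __ip_local_out() refreshes the checksum after
setting tot_len. For reference, the helper it ends up calling (as in
net/ipv4/ip_output.c around this series):

  void ip_send_check(struct iphdr *iph)
  {
          iph->check = 0;
          iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
  }
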
 net/ipv4/netfilter/nf_dup_ipv4.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index ceb1873..cf986e1 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -74,21 +74,19 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, 
unsigned int hooknum,
nf_conntrack_get(skb->nfct);
 #endif
/*
-* If we are in PREROUTING/INPUT, the checksum must be recalculated
-* since the length could have changed as a result of defragmentation.
-*
-* We also decrease the TTL to mitigate potential loops between two
-* hosts.
+* If we are in PREROUTING/INPUT, decrease the TTL to mitigate potential
+* loops between two hosts.
 *
 * Set %IP_DF so that the original source is notified of a potentially
 * decreased MTU on the clone route. IPv6 does this too.
+*
+* IP header checksum will be recalculated at ip_local_out.
 */
iph = ip_hdr(skb);
iph->frag_off |= htons(IP_DF);
if (hooknum == NF_INET_PRE_ROUTING ||
hooknum == NF_INET_LOCAL_IN)
--iph->ttl;
-   ip_send_check(iph);
 
if (nf_dup_ipv4_route(net, skb, gw, oif)) {
__this_cpu_write(nf_skb_duplicated, true);
-- 
2.1.4



[PATCH 14/29] netfilter: nft_hash: fix non static symbol warning

2016-09-05 Thread Pablo Neira Ayuso
From: Wei Yongjun 

Fixes the following sparse warning:

net/netfilter/nft_hash.c:40:25: warning:
 symbol 'nft_hash_policy' was not declared. Should it be static?

Signed-off-by: Wei Yongjun 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_hash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index b82ff29..e090aee 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -37,7 +37,7 @@ static void nft_hash_eval(const struct nft_expr *expr,
 priv->modulus);
 }
 
-const struct nla_policy nft_hash_policy[NFTA_HASH_MAX + 1] = {
+static const struct nla_policy nft_hash_policy[NFTA_HASH_MAX + 1] = {
[NFTA_HASH_SREG]= { .type = NLA_U32 },
[NFTA_HASH_DREG]= { .type = NLA_U32 },
[NFTA_HASH_LEN] = { .type = NLA_U32 },
-- 
2.1.4



[PATCH 15/29] netfilter: nf_tables: typo in trace attribute definition

2016-09-05 Thread Pablo Neira Ayuso
From: Pablo Neira 

Should be attributes, instead of attibutes, for consistency with other
definitions.

Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 8c9d6ff..8a63f22 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -1090,7 +1090,7 @@ enum nft_gen_attributes {
  * @NFTA_TRACE_NFPROTO: nf protocol processed (NLA_U32)
  * @NFTA_TRACE_POLICY: policy that decided fate of packet (NLA_U32)
  */
-enum nft_trace_attibutes {
+enum nft_trace_attributes {
NFTA_TRACE_UNSPEC,
NFTA_TRACE_TABLE,
NFTA_TRACE_CHAIN,
-- 
2.1.4



[PATCH 29/29] netfilter: log: Check param to avoid overflow in nf_log_set

2016-09-05 Thread Pablo Neira Ayuso
From: Gao Feng 

nf_log_set() is an interface function, so it should strictly sanity-check
its parameters. Convert the return value of nf_log_set() from void to int,
and return -EOPNOTSUPP when the pf is invalid.

Signed-off-by: Gao Feng 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_log.h   | 3 +--
 net/bridge/netfilter/nf_log_bridge.c | 3 +--
 net/ipv4/netfilter/nf_log_arp.c  | 3 +--
 net/ipv4/netfilter/nf_log_ipv4.c | 3 +--
 net/ipv6/netfilter/nf_log_ipv6.c | 3 +--
 net/netfilter/nf_log.c   | 8 +---
 6 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_log.h b/include/net/netfilter/nf_log.h
index 83d855b..ee07dc8 100644
--- a/include/net/netfilter/nf_log.h
+++ b/include/net/netfilter/nf_log.h
@@ -60,8 +60,7 @@ struct nf_logger {
 int nf_log_register(u_int8_t pf, struct nf_logger *logger);
 void nf_log_unregister(struct nf_logger *logger);
 
-void nf_log_set(struct net *net, u_int8_t pf,
-   const struct nf_logger *logger);
+int nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger);
 void nf_log_unset(struct net *net, const struct nf_logger *logger);
 
 int nf_log_bind_pf(struct net *net, u_int8_t pf,
diff --git a/net/bridge/netfilter/nf_log_bridge.c 
b/net/bridge/netfilter/nf_log_bridge.c
index 5d9953a..1663df5 100644
--- a/net/bridge/netfilter/nf_log_bridge.c
+++ b/net/bridge/netfilter/nf_log_bridge.c
@@ -50,8 +50,7 @@ static struct nf_logger nf_bridge_logger __read_mostly = {
 
 static int __net_init nf_log_bridge_net_init(struct net *net)
 {
-   nf_log_set(net, NFPROTO_BRIDGE, &nf_bridge_logger);
-   return 0;
+   return nf_log_set(net, NFPROTO_BRIDGE, &nf_bridge_logger);
 }
 
 static void __net_exit nf_log_bridge_net_exit(struct net *net)
diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c
index cf8f2d4..8945c26 100644
--- a/net/ipv4/netfilter/nf_log_arp.c
+++ b/net/ipv4/netfilter/nf_log_arp.c
@@ -111,8 +111,7 @@ static struct nf_logger nf_arp_logger __read_mostly = {
 
 static int __net_init nf_log_arp_net_init(struct net *net)
 {
-   nf_log_set(net, NFPROTO_ARP, &nf_arp_logger);
-   return 0;
+   return nf_log_set(net, NFPROTO_ARP, &nf_arp_logger);
 }
 
 static void __net_exit nf_log_arp_net_exit(struct net *net)
diff --git a/net/ipv4/netfilter/nf_log_ipv4.c b/net/ipv4/netfilter/nf_log_ipv4.c
index 076aadd..20f2255 100644
--- a/net/ipv4/netfilter/nf_log_ipv4.c
+++ b/net/ipv4/netfilter/nf_log_ipv4.c
@@ -347,8 +347,7 @@ static struct nf_logger nf_ip_logger __read_mostly = {
 
 static int __net_init nf_log_ipv4_net_init(struct net *net)
 {
-   nf_log_set(net, NFPROTO_IPV4, &nf_ip_logger);
-   return 0;
+   return nf_log_set(net, NFPROTO_IPV4, &nf_ip_logger);
 }
 
 static void __net_exit nf_log_ipv4_net_exit(struct net *net)
diff --git a/net/ipv6/netfilter/nf_log_ipv6.c b/net/ipv6/netfilter/nf_log_ipv6.c
index 8dd8696..c1bcf69 100644
--- a/net/ipv6/netfilter/nf_log_ipv6.c
+++ b/net/ipv6/netfilter/nf_log_ipv6.c
@@ -379,8 +379,7 @@ static struct nf_logger nf_ip6_logger __read_mostly = {
 
 static int __net_init nf_log_ipv6_net_init(struct net *net)
 {
-   nf_log_set(net, NFPROTO_IPV6, &nf_ip6_logger);
-   return 0;
+   return nf_log_set(net, NFPROTO_IPV6, &nf_ip6_logger);
 }
 
 static void __net_exit nf_log_ipv6_net_exit(struct net *net)
diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index aa5847a..30a17d6 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -39,12 +39,12 @@ static struct nf_logger *__find_logger(int pf, const char 
*str_logger)
return NULL;
 }
 
-void nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger)
+int nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger)
 {
const struct nf_logger *log;
 
-   if (pf == NFPROTO_UNSPEC)
-   return;
+   if (pf == NFPROTO_UNSPEC || pf >= ARRAY_SIZE(net->nf.nf_loggers))
+   return -EOPNOTSUPP;
 
mutex_lock(&nf_log_mutex);
log = nft_log_dereference(net->nf.nf_loggers[pf]);
@@ -52,6 +52,8 @@ void nf_log_set(struct net *net, u_int8_t pf, const struct 
nf_logger *logger)
rcu_assign_pointer(net->nf.nf_loggers[pf], logger);
 
mutex_unlock(&nf_log_mutex);
+
+   return 0;
 }
 EXPORT_SYMBOL(nf_log_set);
 
-- 
2.1.4



[PATCH 18/29] rhashtable: add rhashtable_lookup_get_insert_key()

2016-09-05 Thread Pablo Neira Ayuso
This patch modifies __rhashtable_insert_fast() so it returns the
existing object that clashes with the one that you want to insert.
In case the object is successfully inserted, NULL is returned.
Otherwise, you get an error via ERR_PTR().

This patch adapts the existing callers of __rhashtable_insert_fast()
so they handle this new logic, and it adds a new
rhashtable_lookup_get_insert_key() interface to fetch this existing
object.

nf_tables needs this change to improve handling of EEXIST cases via
honoring the NLM_F_EXCL flag and by checking if the data part of the
mapping matches what we have.

Cc: Herbert Xu 
Cc: Thomas Graf 
Signed-off-by: Pablo Neira Ayuso 
Acked-by: Herbert Xu 
---
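A caller distinguishes the three possible outcomes like this (usage sketch;
struct myobj and the surrounding rhashtable setup are assumed):

  struct myobj *old;

  old = rhashtable_lookup_get_insert_key(ht, key, &new->node, params);
  if (IS_ERR(old))
          return PTR_ERR(old);    /* insertion failed, e.g. -ENOMEM */
  if (old == NULL)
          return 0;               /* inserted, no clash */

  /* clash: 'old' is the element already bound to this key, so
   * nft_set_hash can compare its data part and honor NLM_F_EXCL
   * instead of blindly returning -EEXIST
   */
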
 include/linux/rhashtable.h | 70 +-
 lib/rhashtable.c   | 10 +--
 2 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 3eef080..26b7a05 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -343,7 +343,8 @@ int rhashtable_init(struct rhashtable *ht,
 struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
const void *key,
struct rhash_head *obj,
-   struct bucket_table *old_tbl);
+   struct bucket_table *old_tbl,
+   void **data);
 int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl);
 
 int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter,
@@ -563,8 +564,11 @@ restart:
return NULL;
 }
 
-/* Internal function, please use rhashtable_insert_fast() instead */
-static inline int __rhashtable_insert_fast(
+/* Internal function, please use rhashtable_insert_fast() instead. This
+ * function returns the existing element already in the hash if there is a clash,
+ * otherwise it returns an error via ERR_PTR().
+ */
+static inline void *__rhashtable_insert_fast(
struct rhashtable *ht, const void *key, struct rhash_head *obj,
const struct rhashtable_params params)
 {
@@ -577,6 +581,7 @@ static inline int __rhashtable_insert_fast(
spinlock_t *lock;
unsigned int elasticity;
unsigned int hash;
+   void *data = NULL;
int err;
 
 restart:
@@ -601,11 +606,14 @@ restart:
 
new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
if (unlikely(new_tbl)) {
-   tbl = rhashtable_insert_slow(ht, key, obj, new_tbl);
+   tbl = rhashtable_insert_slow(ht, key, obj, new_tbl, &data);
if (!IS_ERR_OR_NULL(tbl))
goto slow_path;
 
err = PTR_ERR(tbl);
+   if (err == -EEXIST)
+   err = 0;
+
goto out;
}
 
@@ -619,25 +627,25 @@ slow_path:
err = rhashtable_insert_rehash(ht, tbl);
rcu_read_unlock();
if (err)
-   return err;
+   return ERR_PTR(err);
 
goto restart;
}
 
-   err = -EEXIST;
+   err = 0;
elasticity = ht->elasticity;
rht_for_each(head, tbl, hash) {
if (key &&
unlikely(!(params.obj_cmpfn ?
   params.obj_cmpfn(&arg, rht_obj(ht, head)) :
-  rhashtable_compare(&arg, rht_obj(ht, head)))))
+  rhashtable_compare(&arg, rht_obj(ht, head))))) {
+   data = rht_obj(ht, head);
goto out;
+   }
if (!--elasticity)
goto slow_path;
}
 
-   err = 0;
-
head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
 
RCU_INIT_POINTER(obj->next, head);
@@ -652,7 +660,7 @@ out:
spin_unlock_bh(lock);
rcu_read_unlock();
 
-   return err;
+   return err ? ERR_PTR(err) : data;
 }
 
 /**
@@ -675,7 +683,13 @@ static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
const struct rhashtable_params params)
 {
-   return __rhashtable_insert_fast(ht, NULL, obj, params);
+   void *ret;
+
+   ret = __rhashtable_insert_fast(ht, NULL, obj, params);
+   if (IS_ERR(ret))
+   return PTR_ERR(ret);
+
+   return ret == NULL ? 0 : -EEXIST;
 }
 
 /**
@@ -704,11 +718,15 @@ static inline int rhashtable_lookup_insert_fast(
const struct rhashtable_params params)
 {
const char *key = rht_obj(ht, obj);
+   void *ret;
 
BUG_ON(ht->p.obj_hashfn);
 
-   return __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj,
-   params);
+   ret = __rhashtable_insert_fast(ht, key + 

[PATCH 02/29] netfilter: physdev: add missed blank

2016-09-05 Thread Pablo Neira Ayuso
From: Hangbin Liu 

Signed-off-by: Hangbin Liu 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/xt_physdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index e5f1898..bb33598 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -107,8 +107,8 @@ static int physdev_mt_check(const struct xt_mtchk_param 
*par)
 info->invert & XT_PHYSDEV_OP_BRIDGED) &&
par->hook_mask & ((1 << NF_INET_LOCAL_OUT) |
(1 << NF_INET_FORWARD) | (1 << NF_INET_POST_ROUTING))) {
-   pr_info("using --physdev-out and --physdev-is-out are only"
-   "supported in the FORWARD and POSTROUTING chains with"
+   pr_info("using --physdev-out and --physdev-is-out are only "
+   "supported in the FORWARD and POSTROUTING chains with "
"bridged traffic.\n");
if (par->hook_mask & (1 << NF_INET_LOCAL_OUT))
return -EINVAL;
-- 
2.1.4



[PATCH 22/29] netfilter: don't rely on DYING bit to detect when destroy event was sent

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the ecache worker.

Currently when we delete the conntrack from main table we only set this
bit if we could also deliver the netlink destroy event to userspace.

If we fail we move it to the dying list, the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.

Once the timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with another cpu.

Pablo suggested to add a marker in the ecache extension that skips
entries that have been unhashed from main table but are still waiting
for the last reference count to be dropped (e.g. because one skb waiting
on nfqueue verdict still holds a reference).

We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
ecache extension.  The worker can then skip all entries that are in
a different state.  Either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery took
place already.

Once the conntrack timer is removed we will now be able to replace
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with other cpu that tries to evict the same conntrack.

Because DYING will then be set right before we report the destroy event
we can no longer skip event reporting when dying bit is set.

Suggested-by: Pablo Neira Ayuso 
Signed-off-by: Florian Westphal 
Acked-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
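The hunk that records a failed delivery is not part of this excerpt; on the
removal side the intent is something along these lines (sketch, with names
taken from the surrounding code):

  e = nf_ct_ecache_find(ct);
  if (e && nf_conntrack_event_report(IPCT_DESTROY, ct,
                                     portid, report) < 0)
          /* delivery failed: flag the extension so the ecache
           * worker re-attempts this entry from the dying list
           */
          e->state = NFCT_ECACHE_DESTROY_FAIL;
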
 include/net/netfilter/nf_conntrack_ecache.h | 17 -
 net/netfilter/nf_conntrack_ecache.c | 22 ++
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h 
b/include/net/netfilter/nf_conntrack_ecache.h
index fa36447..12d967b 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -12,12 +12,19 @@
 #include 
 #include 
 
+enum nf_ct_ecache_state {
+   NFCT_ECACHE_UNKNOWN,/* destroy event not sent */
+   NFCT_ECACHE_DESTROY_FAIL,   /* tried but failed to send destroy event */
+   NFCT_ECACHE_DESTROY_SENT,   /* sent destroy event after failure */
+};
+
 struct nf_conntrack_ecache {
-   unsigned long cache;/* bitops want long */
-   unsigned long missed;   /* missed events */
-   u16 ctmask; /* bitmask of ct events to be delivered */
-   u16 expmask;/* bitmask of expect events to be delivered */
-   u32 portid; /* netlink portid of destroyer */
+   unsigned long cache;/* bitops want long */
+   unsigned long missed;   /* missed events */
+   u16 ctmask; /* bitmask of ct events to be delivered */
+   u16 expmask;/* bitmask of expect events to be delivered */
+   u32 portid; /* netlink portid of destroyer */
+   enum nf_ct_ecache_state state;  /* ecache state */
 };
 
 static inline struct nf_conntrack_ecache *
diff --git a/net/netfilter/nf_conntrack_ecache.c 
b/net/netfilter/nf_conntrack_ecache.c
index d28011b..da9df2d 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -49,8 +49,13 @@ static enum retry_state ecache_work_evict_list(struct 
ct_pcpu *pcpu)
 
hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) {
struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+   struct nf_conntrack_ecache *e;
 
-   if (nf_ct_is_dying(ct))
+   if (!nf_ct_is_confirmed(ct))
+   continue;
+
+   e = nf_ct_ecache_find(ct);
+   if (!e || e->state != NFCT_ECACHE_DESTROY_FAIL)
continue;
 
if (nf_conntrack_event(IPCT_DESTROY, ct)) {
@@ -58,8 +63,7 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu 
*pcpu)
break;
}
 
-   /* we've got the event delivered, now it's dying */
-   set_bit(IPS_DYING_BIT, >status);
+   e->state = NFCT_ECACHE_DESTROY_SENT;
refs[evicted] = ct;
 
if (++evicted >= ARRAY_SIZE(refs)) {
@@ -130,7 +134,7 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, 
struct nf_conn *ct,
if (!e)
goto out_unlock;
 
-   if (nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) {
+   if (nf_ct_is_confirmed(ct)) {
struct nf_ct_event item = {
.ct = ct,
.portid = e->portid 

[PATCH 10/29] netfilter: nf_conntrack: restore nf_conntrack_htable_size as exported symbol

2016-09-05 Thread Pablo Neira Ayuso
This is required to iterate over the hash table in cttimeout, ctnetlink
and nf_conntrack_ipv4.

>> ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] 
>> undefined!
   ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] 
undefined!
   ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] 
undefined!

Fixes: adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code")
Reported-by: kbuild test robot 
Reported-by: Stephen Rothwell 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index aeba28c..7d90a5d 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -160,6 +160,8 @@ static void nf_conntrack_all_unlock(void)
 }
 
 unsigned int nf_conntrack_htable_size __read_mostly;
+EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
+
 unsigned int nf_conntrack_max __read_mostly;
 seqcount_t nf_conntrack_generation __read_mostly;
 
-- 
2.1.4



[PATCH 00/29] Netfilter updates for net-next

2016-09-05 Thread Pablo Neira Ayuso
Hi David,

The following patchset contains Netfilter updates for your net-next
tree.  Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.

More specifically, they are:

1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
   transport area in dccp, sctp, tcp, udp and udplite protocol
   conntrackers, from Gao Feng.

2) Missing whitespace on error message in physdev match, from Hangbin Liu.

3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.

4) Add nf_ct_expires() helper function and use it, from Florian Westphal.

5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
   from Florian.

6) Rename nf_tables set implementation to nft_set_{name}.c

7) Introduce the hash expression to allow arbitrary hashing of selector
   concatenations, from Laura Garcia Liebana.

8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for a long time already, and we have two interfaces to do
   this already: nf_conntrack sysctl and ctnetlink.

9) Use nf_conntrack_get_ht() helper function whenever possible, instead
   of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.

10) Add quota expression for nf_tables.

11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.

12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.

13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration; this is used by a follow-up patch to perform better chain
update validation.

14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.

15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.

16) Don't use the DYING bit to know if the conntrack event has already
been delivered; instead, use a state variable to track event re-delivery
states, also from Florian.

17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.

18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.

19) Add a garbage collector to get rid of stale conntracks, from
Florian.

20) Reschedule garbage collector if eviction rate is high.

21) Get rid of the __nf_ct_kill_acct() helper.

22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.

23) Make nf_log_set() interface assertive on unsupported families.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks!



The following changes since commit f08aff444ae0004c9ae6df3241fc313a5024d375:

  net: ethernet: renesas: sh_eth: use new api ethtool_{get|set}_link_ksettings 
(2016-08-10 23:14:53 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 779994fa3636d46848edb402fe7517968e036e6f:

  netfilter: log: Check param to avoid overflow in nf_log_set (2016-08-30 
11:52:32 +0200)


Colin Ian King (1):
  netfilter: fix spelling mistake: "delimitter" -> "delimiter"

Florian Westphal (9):
  netfilter: use_nf_conn_expires helper in more places
  ipvs: use nf_ct_kill helper
  netfilter: restart search if moved to other chain
  netfilter: don't rely on DYING bit to detect when destroy event was sent
  netfilter: conntrack: get rid of conntrack timer
  netfilter: evict stale entries on netlink dumps
  netfilter: conntrack: add gc worker to remove timed-out entries
  netfilter: conntrack: resched gc again if eviction rate is high
  netfilter: remove __nf_ct_kill_acct helper

Gao Feng (3):
  netfilter: conntrack: Only need first 4 bytes to get l4proto ports
  netfilter: log_arp: Use ARPHRD_ETHER instead of literal '1'
  netfilter: log: Check param to avoid overflow in nf_log_set

Hangbin Liu (1):
  netfilter: physdev: add missed blank

Laura Garcia Liebana (2):
  netfilter: nf_tables: add hash expression
  netfilter: nf_tables: add number generator expression

Liping Zhang (2):
  netfilter: nf_dup4: remove redundant checksum recalculation
  netfilter: conntrack: simplify the code by using nf_conntrack_get_ht

Pablo Neira (1):
  netfilter: nf_tables: typo in trace attribute definition

Pablo Neira Ayuso (9):
  netfilter: 

[PATCH 04/29] netfilter: use_nf_conn_expires helper in more places

2016-09-05 Thread Pablo Neira Ayuso
From: Florian Westphal 

... so we don't need to touch all of these places when we get rid of the
timer in nf_conn.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
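The shared helper these call sites now use (sketch, matching the timer-based
nf_conn this series starts from):

  static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
  {
          long timeout = (long)ct->timeout.expires - (long)jiffies;

          return timeout > 0 ? timeout : 0;
  }
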
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 3 +--
 net/netfilter/nf_conntrack_netlink.c  | 5 +
 net/netfilter/nf_conntrack_standalone.c   | 3 +--
 net/netfilter/xt_conntrack.c  | 4 +---
 4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c 
b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 6392371..67bfc69 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -163,8 +163,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
ret = -ENOSPC;
seq_printf(s, "%-8s %u %ld ",
   l4proto->name, nf_ct_protonum(ct),
-  timer_pending(&ct->timeout)
-  ? (long)(ct->timeout.expires - jiffies)/HZ : 0);
+  nf_ct_expires(ct) / HZ);
 
if (l4proto->print_conntrack)
l4proto->print_conntrack(s, ct);
diff --git a/net/netfilter/nf_conntrack_netlink.c 
b/net/netfilter/nf_conntrack_netlink.c
index 050bb34..68800c1 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -149,10 +149,7 @@ nla_put_failure:
 
 static int ctnetlink_dump_timeout(struct sk_buff *skb, const struct nf_conn 
*ct)
 {
-   long timeout = ((long)ct->timeout.expires - (long)jiffies) / HZ;
-
-   if (timeout < 0)
-   timeout = 0;
+   long timeout = nf_ct_expires(ct) / HZ;
 
if (nla_put_be32(skb, CTA_TIMEOUT, htonl(timeout)))
goto nla_put_failure;
diff --git a/net/netfilter/nf_conntrack_standalone.c 
b/net/netfilter/nf_conntrack_standalone.c
index 958a145..4e7becd 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -224,8 +224,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
seq_printf(s, "%-8s %u %-8s %u %ld ",
   l3proto->name, nf_ct_l3num(ct),
   l4proto->name, nf_ct_protonum(ct),
-  timer_pending(&ct->timeout)
-  ? (long)(ct->timeout.expires - jiffies)/HZ : 0);
+  nf_ct_expires(ct) / HZ);
 
if (l4proto->print_conntrack)
l4proto->print_conntrack(s, ct);
diff --git a/net/netfilter/xt_conntrack.c b/net/netfilter/xt_conntrack.c
index 188404b9..a3b8f69 100644
--- a/net/netfilter/xt_conntrack.c
+++ b/net/netfilter/xt_conntrack.c
@@ -233,10 +233,8 @@ conntrack_mt(const struct sk_buff *skb, struct 
xt_action_param *par,
return false;
 
if (info->match_flags & XT_CONNTRACK_EXPIRES) {
-   unsigned long expires = 0;
+   unsigned long expires = nf_ct_expires(ct) / HZ;
 
-   if (timer_pending(&ct->timeout))
-   expires = (ct->timeout.expires - jiffies) / HZ;
if ((expires >= info->expires_min &&
expires <= info->expires_max) ^
!(info->invert_flags & XT_CONNTRACK_EXPIRES))
-- 
2.1.4



[PATCH 07/29] netfilter: nf_tables: add hash expression

2016-09-05 Thread Pablo Neira Ayuso
From: Laura Garcia Liebana 

This patch adds a new hash expression. It provides jhash support for
now, but it can be extended to support other hash functions. The
modulus and seed come embedded into this new expression.

Use case example:

... meta mark set hash ip saddr mod 10

Signed-off-by: Laura Garcia Liebana 
Signed-off-by: Pablo Neira Ayuso 
---
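Note that the modulus is applied with reciprocal_scale() rather than a '%'
division. For reference, from linux/kernel.h:

  static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
  {
          return (u32)(((u64) val * ep_ro) >> 32);
  }

so 'mod 10' in the example above maps the 32-bit jhash result uniformly onto
0..9 with a multiply and a shift instead of dividing in the packet path.
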
 include/uapi/linux/netfilter/nf_tables.h |  20 +
 net/netfilter/Kconfig|   6 ++
 net/netfilter/Makefile   |   1 +
 net/netfilter/nft_hash.c | 136 +++
 4 files changed, 163 insertions(+)
 create mode 100644 net/netfilter/nft_hash.c

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 01751fa..6ce0a6d 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -724,6 +724,26 @@ enum nft_meta_keys {
 };
 
 /**
+ * enum nft_hash_attributes - nf_tables hash expression netlink attributes
+ *
+ * @NFTA_HASH_SREG: source register (NLA_U32)
+ * @NFTA_HASH_DREG: destination register (NLA_U32)
+ * @NFTA_HASH_LEN: source data length (NLA_U32)
+ * @NFTA_HASH_MODULUS: modulus value (NLA_U32)
+ * @NFTA_HASH_SEED: seed value (NLA_U32)
+ */
+enum nft_hash_attributes {
+   NFTA_HASH_UNSPEC,
+   NFTA_HASH_SREG,
+   NFTA_HASH_DREG,
+   NFTA_HASH_LEN,
+   NFTA_HASH_MODULUS,
+   NFTA_HASH_SEED,
+   __NFTA_HASH_MAX,
+};
+#define NFTA_HASH_MAX  (__NFTA_HASH_MAX - 1)
+
+/**
  * enum nft_meta_attributes - nf_tables meta expression netlink attributes
  *
  * @NFTA_META_DREG: destination register (NLA_U32)
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e5740e1..9cfaa00 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -563,6 +563,12 @@ config NFT_COMPAT
  x_tables match/target extensions over the nf_tables
  framework.
 
+config NFT_HASH
+   tristate "Netfilter nf_tables hash module"
+   help
+ This option adds the "hash" expression that you can use to perform
+ a hash operation on registers.
+
 if NF_TABLES_NETDEV
 
 config NF_DUP_NETDEV
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 101fb85..1106ccd 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_NFT_COUNTER) += nft_counter.o
 obj-$(CONFIG_NFT_LOG)  += nft_log.o
 obj-$(CONFIG_NFT_MASQ) += nft_masq.o
 obj-$(CONFIG_NFT_REDIR)+= nft_redir.o
+obj-$(CONFIG_NFT_HASH) += nft_hash.o
 
 # nf_tables netdev
 obj-$(CONFIG_NFT_DUP_NETDEV)   += nft_dup_netdev.o
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
new file mode 100644
index 000..b82ff29
--- /dev/null
+++ b/net/netfilter/nft_hash.c
@@ -0,0 +1,136 @@
+/*
+ * Copyright (c) 2016 Laura Garcia 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct nft_hash {
+   enum nft_registers  sreg:8;
+   enum nft_registers  dreg:8;
+   u8  len;
+   u32 modulus;
+   u32 seed;
+};
+
+static void nft_hash_eval(const struct nft_expr *expr,
+ struct nft_regs *regs,
+ const struct nft_pktinfo *pkt)
+{
+   struct nft_hash *priv = nft_expr_priv(expr);
+   const void *data = &regs->data[priv->sreg];
+
+   regs->data[priv->dreg] =
+   reciprocal_scale(jhash(data, priv->len, priv->seed),
+priv->modulus);
+}
+
+const struct nla_policy nft_hash_policy[NFTA_HASH_MAX + 1] = {
+   [NFTA_HASH_SREG]= { .type = NLA_U32 },
+   [NFTA_HASH_DREG]= { .type = NLA_U32 },
+   [NFTA_HASH_LEN] = { .type = NLA_U32 },
+   [NFTA_HASH_MODULUS] = { .type = NLA_U32 },
+   [NFTA_HASH_SEED]= { .type = NLA_U32 },
+};
+
+static int nft_hash_init(const struct nft_ctx *ctx,
+const struct nft_expr *expr,
+const struct nlattr * const tb[])
+{
+   struct nft_hash *priv = nft_expr_priv(expr);
+   u32 len;
+
+   if (!tb[NFTA_HASH_SREG] ||
+   !tb[NFTA_HASH_DREG] ||
+   !tb[NFTA_HASH_LEN]  ||
+   !tb[NFTA_HASH_SEED] ||
+   !tb[NFTA_HASH_MODULUS])
+   return -EINVAL;
+
+   priv->sreg = nft_parse_register(tb[NFTA_HASH_SREG]);
+   priv->dreg = nft_parse_register(tb[NFTA_HASH_DREG]);
+
+   len = ntohl(nla_get_be32(tb[NFTA_HASH_LEN]));
+   if (len == 0 || len > U8_MAX)
+   return -ERANGE;
+
+   priv->len = len;
+
+   

[PATCH 09/29] netfilter: conntrack: simplify the code by using nf_conntrack_get_ht

2016-09-05 Thread Pablo Neira Ayuso
From: Liping Zhang 

Since commit 64b87639c9cb ("netfilter: conntrack: fix race between
nf_conntrack proc read and hash resize") introduced nf_conntrack_get_ht,
there's no need to check nf_conntrack_generation again and again to get
the hash table and hash size. Also convert nf_conntrack_get_ht to an
inline function here.

Suggested-by: Pablo Neira Ayuso 
Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
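Callers follow this pattern (usage sketch; the real lookup paths scale the
hash with a helper rather than a plain modulo, and 'hash' is computed from
the tuple beforehand):

  struct nf_conntrack_tuple_hash *h;
  struct hlist_nulls_head *ct_hash;
  struct hlist_nulls_node *n;
  unsigned int hsize, bucket;

  rcu_read_lock();
  nf_conntrack_get_ht(&ct_hash, &hsize);
  /* pointer and size come from the same table generation, so the
   * bucket index stays valid even if a resize runs concurrently
   */
  bucket = hash % hsize;
  hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode)
          ;       /* inspect entries */
  rcu_read_unlock();
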
 include/net/netfilter/nf_conntrack.h  | 20 ++
 include/net/netfilter/nf_conntrack_core.h |  3 --
 net/netfilter/nf_conntrack_core.c | 46 +++
 3 files changed, 30 insertions(+), 39 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index 445b019..2a12748 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -303,9 +303,29 @@ struct kernel_param;
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
 int nf_conntrack_hash_resize(unsigned int hashsize);
+
+extern struct hlist_nulls_head *nf_conntrack_hash;
 extern unsigned int nf_conntrack_htable_size;
+extern seqcount_t nf_conntrack_generation;
 extern unsigned int nf_conntrack_max;
 
+/* must be called with rcu read lock held */
+static inline void
+nf_conntrack_get_ht(struct hlist_nulls_head **hash, unsigned int *hsize)
+{
+   struct hlist_nulls_head *hptr;
+   unsigned int sequence, hsz;
+
+   do {
+   sequence = read_seqcount_begin(&nf_conntrack_generation);
+   hsz = nf_conntrack_htable_size;
+   hptr = nf_conntrack_hash;
+   } while (read_seqcount_retry(&nf_conntrack_generation, sequence));
+
+   *hash = hptr;
+   *hsize = hsz;
+}
+
 struct nf_conn *nf_ct_tmpl_alloc(struct net *net,
 const struct nf_conntrack_zone *zone,
 gfp_t flags);
diff --git a/include/net/netfilter/nf_conntrack_core.h 
b/include/net/netfilter/nf_conntrack_core.h
index 79d7ac5..62e17d1 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -51,8 +51,6 @@ bool nf_ct_invert_tuple(struct nf_conntrack_tuple *inverse,
const struct nf_conntrack_l3proto *l3proto,
const struct nf_conntrack_l4proto *l4proto);
 
-void nf_conntrack_get_ht(struct hlist_nulls_head **hash, unsigned int *hsize);
-
 /* Find a connection corresponding to a tuple. */
 struct nf_conntrack_tuple_hash *
 nf_conntrack_find_get(struct net *net,
@@ -83,7 +81,6 @@ print_tuple(struct seq_file *s, const struct 
nf_conntrack_tuple *tuple,
 
 #define CONNTRACK_LOCKS 1024
 
-extern struct hlist_nulls_head *nf_conntrack_hash;
 extern spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS];
 void nf_conntrack_lock(spinlock_t *lock);
 
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 22558b7..aeba28c 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -74,7 +74,6 @@ EXPORT_SYMBOL_GPL(nf_conntrack_hash);
 
 static __read_mostly struct kmem_cache *nf_conntrack_cachep;
 static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
-static __read_mostly seqcount_t nf_conntrack_generation;
 static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
@@ -162,6 +161,7 @@ static void nf_conntrack_all_unlock(void)
 
 unsigned int nf_conntrack_htable_size __read_mostly;
 unsigned int nf_conntrack_max __read_mostly;
+seqcount_t nf_conntrack_generation __read_mostly;
 
 DEFINE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
 EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
@@ -478,23 +478,6 @@ nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
   net_eq(net, nf_ct_net(ct));
 }
 
-/* must be called with rcu read lock held */
-void nf_conntrack_get_ht(struct hlist_nulls_head **hash, unsigned int *hsize)
-{
-   struct hlist_nulls_head *hptr;
-   unsigned int sequence, hsz;
-
-   do {
-   sequence = read_seqcount_begin(&nf_conntrack_generation);
-   hsz = nf_conntrack_htable_size;
-   hptr = nf_conntrack_hash;
-   } while (read_seqcount_retry(&nf_conntrack_generation, sequence));
-
-   *hash = hptr;
-   *hsize = hsz;
-}
-EXPORT_SYMBOL_GPL(nf_conntrack_get_ht);
-
 /*
  * Warning :
  * - Caller must take a reference on returned object
@@ -507,14 +490,11 @@ nf_conntrack_find(struct net *net, const struct 
nf_conntrack_zone *zone,
struct nf_conntrack_tuple_hash *h;
struct hlist_nulls_head *ct_hash;
struct hlist_nulls_node *n;
-   unsigned int bucket, sequence;
+   unsigned int bucket, hsize;
 
 begin:
-   do {
-   sequence = read_seqcount_begin(&nf_conntrack_generation);
-   bucket = scale_hash(hash);
- 
