[PATCH] samples/bpf: Fix cross compiler error with bpf sample

2017-08-03 Thread Joel Fernandes
When cross-compiling the bpf sample map_perf_test for aarch64, I find that
__NR_getpgrp is undefined. This causes build errors. Fix it by allowing the
deprecated syscall in the sample.

Signed-off-by: Joel Fernandes 
---
 samples/bpf/map_perf_test_user.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index 1a8894b5ac51..6e6fc7121640 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -8,7 +8,9 @@
 #include 
 #include 
 #include 
+#define __ARCH_WANT_SYSCALL_DEPRECATED
 #include 
+#undef __ARCH_WANT_SYSCALL_DEPRECATED
 #include 
 #include 
 #include 
-- 
2.14.0.rc1.383.gd1ce394fe2-goog



Re: [PATCH net-next] net: dsa: Use per-cpu 64-bit statistics

2017-08-03 Thread Eric Dumazet
On Thu, 2017-08-03 at 21:33 -0700, Florian Fainelli wrote:
> During testing with a background iperf pushing 1Gbit/sec worth of
> traffic and having both ifconfig and ethtool collect statistics, we
> could see quite frequent deadlocks. Convert the often accessed DSA slave
> network devices statistics to per-cpu 64-bit statistics to remove these
> deadlocks and provide fast efficient statistics updates.
> 

This seems to be a bug fix; it would be nice to get a proper tag, like:

Fixes: f613ed665bb3 ("net: dsa: Add support for 64-bit statistics")

The problem here is that if multiple CPUs can call dsa_switch_rcv() at the
same time, then the u64_stats_update_begin() contract is not respected.

include/linux/u64_stats_sync.h states :

 * Usage :
 *
 * Stats producer (writer) should use following template granted it already got
 * an exclusive access to counters (a lock is already taken, or per cpu
 * data is used [in a non preemptable context])
 *
 *   spin_lock_bh(...) or other synchronization to get exclusive access
 *   ...
 *   u64_stats_update_begin(&stats->syncp);





Re: [PATCH net-next v2 00/13] Change DSA's FDB API and perform switchdev cleanup

2017-08-03 Thread Jiri Pirko
Fri, Aug 04, 2017 at 12:39:02AM CEST, arka...@mellanox.com wrote:
>
>[...]
>
>>> Now we have the "offload" read only flag, which is good to inform about
>>> a successfully programmed hardware, but adds another level of complexity
>>> to understand the interaction with the hardware.
>>>
>>> I think iproute2 is getting more and more confusing. From what I
>>> understood, respecting the "self" flag as described is no longer
>>> possible due to some backward-compatibility reasons.
>>>
>>> Also Linux must use the hardware as an accelerator (so "self" or
>>> "offload" must be the default), and always fall back to software
>>> otherwise, hence "master" do not make sense here.
>>>
>>> What do you think about this synopsis for bridge fdb add?
>>>
>>> # bridge fdb add LLADDR dev DEV [ offload { on | off } ]
>>>
>>> Where offload defaults to "on". This option should also be ported to
>>> other offloaded features like MDB and VLAN. Even though this is a bit
>>> out of scope of this patchset, do you think this is feasible?
>>>
>> 
>> I agree completely that it is currently confusing. The documentation
>> should be updated for sure. I think that 'self' was primarily introduced
>> (commit 77162022a) for NIC embedded switches used for SR-IOV; in
>> that case the self is related to the internal eswitch, which completely
>> diverges from the software one (clearly not switchdev).
>> 
>> IMHO For switchdev devices 'self' should not be an option at all, or any
>> other arg regarding hardware. Furthermore, the 'offload' flag should be
>> only relevant during the dump as an indication to the user.
>> 
>> Unfortunately, the lack of ability to sync the sw with the hw in DSA's
>> case introduces a problem for indicating that the entries are only
>> in hw; I mean, marking them only as offloaded is not enough.
>
>Hi,
>
>It seems impossible currently to move the self to be the default, and
>this would introduce a regression which you don't approve of, so it seems
>few options are left:
>
>a) Leave two ways to add fdb, through the bridge (by using the master
>   flag) which is introduced in this patchset, and by using the self
>   which is the legacy way. In this way no regression will be introduced,
>   yet, it feels a bit confusing. The benefit is that we (DSA/mlxsw)
>   will be synced.
>b) Leave only the self (which means removing patch no 4,5).

I believe that option a) is the correct way to go. The inclusion of
self was a mistake from the very beginning. I think that we should
just move on and correct this mistake.

Vivien, any arguments against a)?

Thanks!


>
>In both cases the switchdev implementation of .ndo_fdb_add() will be
>moved inside DSA in a similar way to the dump, because it's only used by
>you.
>
>Option b) actually turns this patchset into a cosmetic one which only
>does cleanup.
>
>Thanks,
>Arkadi
>
>
>
>


[net-next PATCH] net: comment fixes against BPF devmap helper calls

2017-08-03 Thread John Fastabend
Update BPF comments to accurately reflect XDP usage.

Fixes: 97f91a7cf04ff ("bpf: add bpf_redirect_map helper routine")
Reported-by: Alexei Starovoitov 
Signed-off-by: John Fastabend 
---
 include/uapi/linux/bpf.h |   16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1106a8c..1ae061e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -345,14 +345,20 @@ enum bpf_attach_type {
  * int bpf_redirect(ifindex, flags)
  * redirect to another netdev
  * @ifindex: ifindex of the net device
- * @flags: bit 0 - if set, redirect to ingress instead of egress
- * other bits - reserved
- * Return: TC_ACT_REDIRECT
- * int bpf_redirect_map(key, map, flags)
+ * @flags:
+ *   cls_bpf:
+ *  bit 0 - if set, redirect to ingress instead of egress
+ *  other bits - reserved
+ *   xdp_bpf:
+ * all bits - reserved
+ * Return: cls_bpf: TC_ACT_REDIRECT
+ *xdp_bpf: XDP_REDIRECT
+ * int bpf_redirect_map(map, key, flags)
  * redirect to endpoint in map
+ * @map: pointer to dev map
  * @key: index in map to lookup
- * @map: fd of map to do lookup in
  * @flags: --
+ * Return: XDP_REDIRECT on success or XDP_ABORT on error
  *
  * u32 bpf_get_route_realm(skb)
  * retrieve a dst's tclassid



Re: [RFC PATCH 4/6] net: sockmap with sk redirect support

2017-08-03 Thread John Fastabend
On 08/03/2017 09:22 PM, Tom Herbert wrote:
> On Thu, Aug 3, 2017 at 4:37 PM, John Fastabend  
> wrote:
>> Recently we added a new map type called dev map used to forward XDP
>> packets between ports (6093ec2dc313). This patch introduces a
>> similar notion for sockets.
>>
>> A sockmap allows users to add participating sockets to a map. When
>> sockets are added to the map enough context is stored with the
>> map entry to use the entry with a new helper
>>
>>   bpf_sk_redirect_map(map, key, flags)
>>
>> This helper (analogous to bpf_redirect_map in XDP) is given the map
>> and an entry in the map. When called from a sockmap program, discussed
>> below, the skb will be sent on the socket using skb_send_sock().
>>
>> With the above we need a bpf program to call the helper from that will
>> then implement the send logic. The initial site implemented in this
>> series is the recv_sock hook. For this to work we implemented a map
>> attach command to add attributes to a map. In sockmap we add two
>> programs a parse program and a verdict program. The parse program
>> uses strparser to build messages and pass them to the verdict program.
>> The parse program uses normal strparser semantics. The verdict
>> program is of type SOCKET_FILTER.
>>
>> The verdict program returns a verdict BPF_OK, BPF_DROP, BPF_REDIRECT.
>> When BPF_REDIRECT is returned, expected when bpf program uses
>> bpf_sk_redirect_map(), the sockmap logic will consult per cpu variables
>> set by the helper routine and pull the sock entry out of the sock map.
>> This pattern follows the existing redirect logic in cls and xdp
>> programs.
>>
> Hi John,
> 
> I'm a bit confused. Is the verdict program bpf_mux, then? I don't see
> any use of BPF_OK, BPF_DROP, or BPF_REDIRECT. I assume I'm missing something.
> 
> Tom

Ah so what I coded and what I wrote don't align perfectly here. The
verdict program _is_ bpf_mux as you guessed. I should rename the code
to use bpf_verdict (or come up with a better name). Calling it bpf_mux
was a holdover from a very specific example program I wrote up.

Then BPF_OK_DROP and BPF_REDIRECT still need to be added, as below.

[...]

>> +
>> +static struct smap_psock *smap_peers_get(struct smap_psock *psock,
>> +struct sk_buff *skb)
>> +{
>> +   struct sock *sock;
>> +   int rc;
>> +
>> +   rc = smap_mux_func(psock, skb);
>> +   if (unlikely(rc < 0))
>> +   return NULL;
>> +

Replacing the above three lines with the following should align the
commit message and code:

   rc = smap_mux_func(psock, skb);
   if (rc != BPF_REDIRECT)
return NULL;

>> +   sock = do_sk_redirect_map();
>> +   if (unlikely(!sock))
>> +   return NULL;
>> +
>> +   return smap_psock_sk(sock);
>> +}
>> +

And then in include/uapi/linux/bpf.h:

enum {
BPF_OK_DROP,
BPF_REDIRECT,
};

Thanks,
John




Re: [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY

2017-08-03 Thread David Miller
From: Willem de Bruijn 
Date: Thu,  3 Aug 2017 16:29:36 -0400

> Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
> shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
> Implement the feature for TCP initially, as large writes benefit
> most.

Looks great, series applied, thanks!


[PATCH net-next] net: dsa: Use per-cpu 64-bit statistics

2017-08-03 Thread Florian Fainelli
During testing with a background iperf pushing 1Gbit/sec worth of
traffic and having both ifconfig and ethtool collect statistics, we
could see quite frequent deadlocks. Convert the often accessed DSA slave
network devices statistics to per-cpu 64-bit statistics to remove these
deadlocks and provide fast efficient statistics updates.

Signed-off-by: Florian Fainelli 
---
 net/dsa/dsa.c  | 10 +---
 net/dsa/dsa_priv.h |  2 +-
 net/dsa/slave.c| 72 +++---
 3 files changed, 59 insertions(+), 25 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 0ba842c08dd3..a91e520e735f 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -190,6 +190,7 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
 {
struct dsa_switch_tree *dst = dev->dsa_ptr;
struct sk_buff *nskb = NULL;
+   struct pcpu_sw_netstats *s;
struct dsa_slave_priv *p;
 
if (unlikely(dst == NULL)) {
@@ -213,10 +214,11 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
skb->pkt_type = PACKET_HOST;
skb->protocol = eth_type_trans(skb, skb->dev);
 
-   u64_stats_update_begin(&p->stats64.syncp);
-   p->stats64.rx_packets++;
-   p->stats64.rx_bytes += skb->len;
-   u64_stats_update_end(&p->stats64.syncp);
+   s = this_cpu_ptr(p->stats64);
+   u64_stats_update_begin(&s->syncp);
+   s->rx_packets++;
+   s->rx_bytes += skb->len;
+   u64_stats_update_end(&s->syncp);
 
netif_receive_skb(skb);
 
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 7aa0656296c2..306cff229def 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -77,7 +77,7 @@ struct dsa_slave_priv {
struct sk_buff *(*xmit)(struct sk_buff *skb,
struct net_device *dev);
 
-   struct pcpu_sw_netstats stats64;
+   struct pcpu_sw_netstats *stats64;
 
/* DSA port data, such as switch, port index, etc. */
struct dsa_port *dp;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index e196562035b1..605444ced06c 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -352,12 +352,14 @@ static inline netdev_tx_t dsa_netpoll_send_skb(struct dsa_slave_priv *p,
 static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
+   struct pcpu_sw_netstats *s;
struct sk_buff *nskb;
 
-   u64_stats_update_begin(&p->stats64.syncp);
-   p->stats64.tx_packets++;
-   p->stats64.tx_bytes += skb->len;
-   u64_stats_update_end(&p->stats64.syncp);
+   s = this_cpu_ptr(p->stats64);
+   u64_stats_update_begin(&s->syncp);
+   s->tx_packets++;
+   s->tx_bytes += skb->len;
+   u64_stats_update_end(&s->syncp);
 
/* Transmit function may have to reallocate the original SKB,
 * in which case it must have freed it. Only free it here on error.
@@ -596,15 +598,26 @@ static void dsa_slave_get_ethtool_stats(struct net_device *dev,
 {
struct dsa_slave_priv *p = netdev_priv(dev);
struct dsa_switch *ds = p->dp->ds;
+   struct pcpu_sw_netstats *s;
unsigned int start;
-
-   do {
-   start = u64_stats_fetch_begin_irq(&p->stats64.syncp);
-   data[0] = p->stats64.tx_packets;
-   data[1] = p->stats64.tx_bytes;
-   data[2] = p->stats64.rx_packets;
-   data[3] = p->stats64.rx_bytes;
-   } while (u64_stats_fetch_retry_irq(&p->stats64.syncp, start));
+   int i;
+
+   for_each_possible_cpu(i) {
+   u64 tx_packets, tx_bytes, rx_packets, rx_bytes;
+
+   s = per_cpu_ptr(p->stats64, i);
+   do {
+   start = u64_stats_fetch_begin_irq(&s->syncp);
+   tx_packets = s->tx_packets;
+   tx_bytes = s->tx_bytes;
+   rx_packets = s->rx_packets;
+   rx_bytes = s->rx_bytes;
+   } while (u64_stats_fetch_retry_irq(&s->syncp, start));
+   data[0] += tx_packets;
+   data[1] += tx_bytes;
+   data[2] += rx_packets;
+   data[3] += rx_bytes;
+   }
if (ds->ops->get_ethtool_stats)
ds->ops->get_ethtool_stats(ds, p->dp->index, data + 4);
 }
@@ -879,16 +892,28 @@ static void dsa_slave_get_stats64(struct net_device *dev,
  struct rtnl_link_stats64 *stats)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
+   struct pcpu_sw_netstats *s;
unsigned int start;
+   int i;
 
netdev_stats_to_stats64(stats, &dev->stats);
-   do {
-   start = u64_stats_fetch_begin_irq(&p->stats64.syncp);
-   stats->tx_packets = p->stats64.tx_packets;
-   stats->tx_bytes = p->stats64.tx_bytes;
-   stats->rx_packets = p->stats64.rx_packets;
- 

Re: [RFC PATCH 4/6] net: sockmap with sk redirect support

2017-08-03 Thread Tom Herbert
On Thu, Aug 3, 2017 at 4:37 PM, John Fastabend  wrote:
> Recently we added a new map type called dev map used to forward XDP
> packets between ports (6093ec2dc313). This patch introduces a
> similar notion for sockets.
>
> A sockmap allows users to add participating sockets to a map. When
> sockets are added to the map enough context is stored with the
> map entry to use the entry with a new helper
>
>   bpf_sk_redirect_map(map, key, flags)
>
> This helper (analogous to bpf_redirect_map in XDP) is given the map
> and an entry in the map. When called from a sockmap program, discussed
> below, the skb will be sent on the socket using skb_send_sock().
>
> With the above we need a bpf program to call the helper from that will
> then implement the send logic. The initial site implemented in this
> series is the recv_sock hook. For this to work we implemented a map
> attach command to add attributes to a map. In sockmap we add two
> programs a parse program and a verdict program. The parse program
> uses strparser to build messages and pass them to the verdict program.
> The parse program uses normal strparser semantics. The verdict
> program is of type SOCKET_FILTER.
>
> The verdict program returns a verdict BPF_OK, BPF_DROP, BPF_REDIRECT.
> When BPF_REDIRECT is returned, expected when bpf program uses
> bpf_sk_redirect_map(), the sockmap logic will consult per cpu variables
> set by the helper routine and pull the sock entry out of the sock map.
> This pattern follows the existing redirect logic in cls and xdp
> programs.
>
Hi John,

I'm a bit confused. Is the verdict program bpf_mux, then? I don't see
any use of BPF_OK, BPF_DROP, or BPF_REDIRECT. I assume I'm missing something.

Tom

> This gives the flow,
>
>  recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock
>
> As an example use case a message based load balancer may use specific
> logic in the verdict program to select the sock to send on.
>
> Example and sample programs are provided in future patches that
> hopefully illustrate the user interfaces.
>
> TBD: bpf program refcnt'ing needs to be cleaned up, some additional
> cleanup in a few error paths, publish performance numbers and some
> self tests.
>
> Signed-off-by: John Fastabend 
> ---
>  include/linux/bpf.h   |   11 +
>  include/linux/bpf_types.h |1
>  include/uapi/linux/bpf.h  |   13 +
>  kernel/bpf/Makefile   |2
>  kernel/bpf/helpers.c  |   20 +
>  kernel/bpf/sockmap.c  |  623 +
>  kernel/bpf/syscall.c  |   41 +++
>  net/core/filter.c |   51 
>  8 files changed, 759 insertions(+), 3 deletions(-)
>  create mode 100644 kernel/bpf/sockmap.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6353c74..9ce6aa0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -15,6 +15,8 @@
>  #include 
>  #include 
>
> +#include 
> +
>  struct perf_event;
>  struct bpf_map;
>
> @@ -29,6 +31,9 @@ struct bpf_map_ops {
> /* funcs callable from userspace and from eBPF programs */
> void *(*map_lookup_elem)(struct bpf_map *map, void *key);
> int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
> +   int (*map_ctx_update_elem)(struct bpf_sock_ops_kern *skops,
> +  struct bpf_map *map,
> +  void *key, u64 flags, u64 map_flags);
> int (*map_delete_elem)(struct bpf_map *map, void *key);
>
> /* funcs called by prog_array and perf_event_array map */
> @@ -37,6 +42,7 @@ struct bpf_map_ops {
> void (*map_fd_put_ptr)(void *ptr);
> u32 (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf);
> u32 (*map_fd_sys_lookup_elem)(void *ptr);
> +   int (*map_attach)(struct bpf_map *map, struct bpf_prog *p1, struct bpf_prog *p2);
>  };
>
>  struct bpf_map {
> @@ -321,6 +327,7 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>
>  /* Map specifics */
>  struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
> +struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
>  void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
>  void __dev_map_flush(struct bpf_map *map);
>
> @@ -378,9 +385,13 @@ static inline void __dev_map_flush(struct bpf_map *map)
>  }
>  #endif /* CONFIG_BPF_SYSCALL */
>
> +inline struct sock *do_sk_redirect_map(void);
> +inline u64 get_sk_redirect_flags(void);
> +
>  /* verifier prototypes for helper functions called from eBPF programs */
>  extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
>  extern const struct bpf_func_proto bpf_map_update_elem_proto;
> +extern const struct bpf_func_proto bpf_map_ctx_update_elem_proto;
>  extern const struct bpf_func_proto bpf_map_delete_elem_proto;
>
>  extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index b1e1035..930be52 

RE: [PATCH V6 net-next 0/8] Hisilicon Network Subsystem 3 Ethernet Driver

2017-08-03 Thread Salil Mehta
Thanks a ton, Dave!

> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, August 03, 2017 11:10 PM
> To: Salil Mehta
> Cc: Zhuangyuzeng (Yisen); huangdaode; lipeng (Y);
> mehta.salil@gmail.com; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; linux-r...@vger.kernel.org; Linuxarm
> Subject: Re: [PATCH V6 net-next 0/8] Hisilicon Network Subsystem 3
> Ethernet Driver
> 
> From: Salil Mehta 
> Date: Wed, 2 Aug 2017 16:59:44 +0100
> 
> > This patch-set contains the support of the HNS3 (Hisilicon Network
> Subsystem 3)
> > Ethernet driver for hip08 family of SoCs and future upcoming SoCs.
>  ...
> 
> Series applied, thanks.


[PATCH 2/2 v2 net-next] tcp: consolidate congestion control undo functions

2017-08-03 Thread Yuchung Cheng
Most TCP congestion controls use identical logic to undo cwnd,
except BBR. This patch consolidates these similar functions into
the one currently used by Reno and others.

Suggested-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_bic.c   | 14 +-
 net/ipv4/tcp_cdg.c   | 12 +---
 net/ipv4/tcp_cubic.c | 13 +
 net/ipv4/tcp_highspeed.c | 11 +--
 net/ipv4/tcp_illinois.c  | 11 +--
 net/ipv4/tcp_nv.c| 13 +
 net/ipv4/tcp_scalable.c  | 16 +---
 net/ipv4/tcp_veno.c  | 11 +--
 net/ipv4/tcp_yeah.c  | 11 +--
 9 files changed, 9 insertions(+), 103 deletions(-)

diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c
index 609965f0e298..fc3614377413 100644
--- a/net/ipv4/tcp_bic.c
+++ b/net/ipv4/tcp_bic.c
@@ -49,7 +49,6 @@ MODULE_PARM_DESC(smooth_part, "log(B/(B*Smin))/log(B/(B-1))+B, # of RTT from Wma
 struct bictcp {
u32 cnt;/* increase cwnd by 1 after ACKs */
u32 last_max_cwnd;  /* last maximum snd_cwnd */
-   u32 loss_cwnd;  /* congestion window at last loss */
u32 last_cwnd;  /* the last snd_cwnd */
u32 last_time;  /* time when updated last_cwnd */
u32 epoch_start;/* beginning of an epoch */
@@ -72,7 +71,6 @@ static void bictcp_init(struct sock *sk)
struct bictcp *ca = inet_csk_ca(sk);
 
bictcp_reset(ca);
-   ca->loss_cwnd = 0;
 
if (initial_ssthresh)
tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
@@ -172,22 +170,12 @@ static u32 bictcp_recalc_ssthresh(struct sock *sk)
else
ca->last_max_cwnd = tp->snd_cwnd;
 
-   ca->loss_cwnd = tp->snd_cwnd;
-
if (tp->snd_cwnd <= low_window)
return max(tp->snd_cwnd >> 1U, 2U);
else
return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
 }
 
-static u32 bictcp_undo_cwnd(struct sock *sk)
-{
-   const struct tcp_sock *tp = tcp_sk(sk);
-   const struct bictcp *ca = inet_csk_ca(sk);
-
-   return max(tp->snd_cwnd, ca->loss_cwnd);
-}
-
 static void bictcp_state(struct sock *sk, u8 new_state)
 {
if (new_state == TCP_CA_Loss)
@@ -214,7 +202,7 @@ static struct tcp_congestion_ops bictcp __read_mostly = {
.ssthresh   = bictcp_recalc_ssthresh,
.cong_avoid = bictcp_cong_avoid,
.set_state  = bictcp_state,
-   .undo_cwnd  = bictcp_undo_cwnd,
+   .undo_cwnd  = tcp_reno_undo_cwnd,
.pkts_acked = bictcp_acked,
.owner  = THIS_MODULE,
.name   = "bic",
diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 50a0f3e51d5b..66ac69f7bd19 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -85,7 +85,6 @@ struct cdg {
u8  state;
u8  delack;
u32 rtt_seq;
-   u32 undo_cwnd;
u32 shadow_wnd;
u16 backoff_cnt;
u16 sample_cnt;
@@ -330,8 +329,6 @@ static u32 tcp_cdg_ssthresh(struct sock *sk)
struct cdg *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
-   ca->undo_cwnd = tp->snd_cwnd;
-
if (ca->state == CDG_BACKOFF)
return max(2U, (tp->snd_cwnd * min(1024U, backoff_beta)) >> 10);
 
@@ -344,13 +341,6 @@ static u32 tcp_cdg_ssthresh(struct sock *sk)
return max(2U, tp->snd_cwnd >> 1);
 }
 
-static u32 tcp_cdg_undo_cwnd(struct sock *sk)
-{
-   struct cdg *ca = inet_csk_ca(sk);
-
-   return max(tcp_sk(sk)->snd_cwnd, ca->undo_cwnd);
-}
-
 static void tcp_cdg_cwnd_event(struct sock *sk, const enum tcp_ca_event ev)
 {
struct cdg *ca = inet_csk_ca(sk);
@@ -403,7 +393,7 @@ struct tcp_congestion_ops tcp_cdg __read_mostly = {
.cong_avoid = tcp_cdg_cong_avoid,
.cwnd_event = tcp_cdg_cwnd_event,
.pkts_acked = tcp_cdg_acked,
-   .undo_cwnd = tcp_cdg_undo_cwnd,
+   .undo_cwnd = tcp_reno_undo_cwnd,
.ssthresh = tcp_cdg_ssthresh,
.release = tcp_cdg_release,
.init = tcp_cdg_init,
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 57ae5b5ae643..78bfadfcf342 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -83,7 +83,6 @@ MODULE_PARM_DESC(hystart_ack_delta, "spacing between ack's indicating train (mse
 struct bictcp {
u32 cnt;/* increase cwnd by 1 after ACKs */
u32 last_max_cwnd;  /* last maximum snd_cwnd */
-   u32 loss_cwnd;  /* congestion window at last loss */
u32 last_cwnd;  /* the last snd_cwnd */
u32 last_time;  /* time when updated last_cwnd */
u32 bic_origin_point;/* origin point of bic function */
@@ -142,7 +141,6 @@ static void bictcp_init(struct sock *sk)
struct bictcp *ca = inet_csk_ca(sk);
 
bictcp_reset(ca);
-   ca->loss_cwnd = 0;
 
if (hystart)
bictcp_hystart_reset(sk);
@@ -366,18 +36

[PATCH 1/2 v2 net-next] tcp: fix cwnd undo in Reno and HTCP congestion controls

2017-08-03 Thread Yuchung Cheng
Using ssthresh to revert cwnd is less reliable when ssthresh is
bounded to 2 packets. This patch instead uses an existing TCP variable,
"prior_cwnd", which snapshots the cwnd right before entering fast
recovery and RTO recovery in Reno. This fixes the issue discussed in
the netdev thread "A buggy behavior for Linux TCP Reno and HTCP":
https://www.spinics.net/lists/netdev/msg444955.html

Suggested-by: Neal Cardwell 
Reported-by: Wei Sun 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
---
 include/linux/tcp.h  | 2 +-
 net/ipv4/tcp_cong.c  | 2 +-
 net/ipv4/tcp_htcp.c  | 3 +--
 net/ipv4/tcp_input.c | 1 +
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d7389ea36e10..267164a1d559 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -258,7 +258,7 @@ struct tcp_sock {
u32 snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
u32 snd_cwnd_used;
u32 snd_cwnd_stamp;
-   u32 prior_cwnd; /* Congestion window at start of Recovery. */
+   u32 prior_cwnd; /* cwnd right before starting loss recovery */
u32 prr_delivered;  /* Number of newly delivered packets to
 * receiver in Recovery. */
u32 prr_out;/* Total number of pkts sent during Recovery. */
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index fde983f6376b..c2b174469645 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -456,7 +456,7 @@ u32 tcp_reno_undo_cwnd(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
 
-   return max(tp->snd_cwnd, tp->snd_ssthresh << 1);
+   return max(tp->snd_cwnd, tp->prior_cwnd);
 }
 EXPORT_SYMBOL_GPL(tcp_reno_undo_cwnd);
 
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 3eb78cde6ff0..082d479462fa 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -66,7 +66,6 @@ static inline void htcp_reset(struct htcp *ca)
 
 static u32 htcp_cwnd_undo(struct sock *sk)
 {
-   const struct tcp_sock *tp = tcp_sk(sk);
struct htcp *ca = inet_csk_ca(sk);
 
if (ca->undo_last_cong) {
@@ -76,7 +75,7 @@ static u32 htcp_cwnd_undo(struct sock *sk)
ca->undo_last_cong = 0;
}
 
-   return max(tp->snd_cwnd, (tp->snd_ssthresh << 7) / ca->beta);
+   return tcp_reno_undo_cwnd(sk);
 }
 
 static inline void measure_rtt(struct sock *sk, u32 srtt)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99cdf4ccabb8..842ed75ccb25 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1950,6 +1950,7 @@ void tcp_enter_loss(struct sock *sk)
!after(tp->high_seq, tp->snd_una) ||
(icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
tp->prior_ssthresh = tcp_current_ssthresh(sk);
+   tp->prior_cwnd = tp->snd_cwnd;
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
tcp_ca_event(sk, CA_EVENT_LOSS);
tcp_init_undo(tp);
-- 
2.14.0.rc1.383.gd1ce394fe2-goog



[PATCH 0/2 v2 net-next] tcp cwnd undo refactor

2017-08-03 Thread Yuchung Cheng
This patch series consolidates the similar cwnd undo functions
implemented by various congestion controls by using an existing
tcp socket state variable. The first patch fixes a corner
case of cwnd undo in Reno and HTCP. Since the bug has
existed for many years and is very minor, we consider this
patch set more suitable for net-next, as the major change
is the refactor itself.

- v1->v2
  Fix trivial compile errors

Yuchung Cheng (2):
  tcp: fix cwnd undo in Reno and HTCP congestion controls
  tcp: consolidate congestion control undo functions

 include/linux/tcp.h  |  2 +-
 net/ipv4/tcp_bic.c   | 14 +-
 net/ipv4/tcp_cdg.c   | 12 +---
 net/ipv4/tcp_cong.c  |  2 +-
 net/ipv4/tcp_cubic.c | 13 +
 net/ipv4/tcp_highspeed.c | 11 +--
 net/ipv4/tcp_htcp.c  |  3 +--
 net/ipv4/tcp_illinois.c  | 11 +--
 net/ipv4/tcp_input.c |  1 +
 net/ipv4/tcp_nv.c| 13 +
 net/ipv4/tcp_scalable.c  | 16 +---
 net/ipv4/tcp_veno.c  | 11 +--
 net/ipv4/tcp_yeah.c  | 11 +--
 13 files changed, 13 insertions(+), 107 deletions(-)

-- 
2.14.0.rc1.383.gd1ce394fe2-goog



Re: [PATCH net-next v3 1/2] bpf: add support for sys_enter_* and sys_exit_* tracepoints

2017-08-03 Thread Y Song
On Thu, Aug 3, 2017 at 7:08 PM, Alexei Starovoitov  wrote:
> On 8/3/17 6:29 AM, Yonghong Song wrote:
>>
>> @@ -578,8 +596,9 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>> if (!sys_data)
>> return;
>>
>> +   prog = READ_ONCE(sys_data->enter_event->prog);
>> head = this_cpu_ptr(sys_data->enter_event->perf_events);
>> -   if (hlist_empty(head))
>> +   if (!prog && hlist_empty(head))
>> return;
>>
>> /* get the size after alignment with the u32 buffer size field */
>> @@ -594,6 +613,13 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>> rec->nr = syscall_nr;
>> syscall_get_arguments(current, regs, 0, sys_data->nb_args,
>>(unsigned long *)&rec->args);
>> +
>> +   if ((prog && !perf_call_bpf_enter(prog, regs, sys_data, rec)) ||
>> +   hlist_empty(head)) {
>> +   perf_swevent_put_recursion_context(rctx);
>> +   return;
>> +   }
>
>
> hmm. if I read the patch correctly that makes it different from
> kprobe/uprobe/tracepoints+bpf behavior. Why make it different and
> force user space to perf_event_open() on every cpu?
> In other cases it's the job of the bpf program to filter by cpu
> if necessary and that is well understood by bcc scripts.

The patch actually does allow the bpf program to track all cpus.
The test:
>> +   if (!prog && hlist_empty(head))
>> return;
ensures that if prog is not empty, it will not return even if the
event in the current cpu is empty. Later on, perf_call_bpf_enter will
be called if prog is not empty. This ensures that
the bpf program will execute regardless of the current cpu.

Maybe I missed something here?


Re: Gift-

2017-08-03 Thread Mayrhofer Family
Good Day,

My wife and I have awarded you with a donation of $ 1,000,000.00 Dollars from 
part of our Jackpot Lottery of 50 Million Dollars, respond with your details 
for claims.

We await your earliest response and God Bless you.

Friedrich And Annand Mayrhofer.

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus



Re: [PATCH net-next v3 1/2] bpf: add support for sys_enter_* and sys_exit_* tracepoints

2017-08-03 Thread Alexei Starovoitov

On 8/3/17 6:29 AM, Yonghong Song wrote:

@@ -578,8 +596,9 @@ static void perf_syscall_enter(void *ignore, struct pt_regs 
*regs, long id)
if (!sys_data)
return;

+   prog = READ_ONCE(sys_data->enter_event->prog);
head = this_cpu_ptr(sys_data->enter_event->perf_events);
-   if (hlist_empty(head))
+   if (!prog && hlist_empty(head))
return;

/* get the size after alignment with the u32 buffer size field */
@@ -594,6 +613,13 @@ static void perf_syscall_enter(void *ignore, struct 
pt_regs *regs, long id)
rec->nr = syscall_nr;
syscall_get_arguments(current, regs, 0, sys_data->nb_args,
   (unsigned long *)&rec->args);
+
+   if ((prog && !perf_call_bpf_enter(prog, regs, sys_data, rec)) ||
+   hlist_empty(head)) {
+   perf_swevent_put_recursion_context(rctx);
+   return;
+   }


hmm. if I read the patch correctly that makes it different from
kprobe/uprobe/tracepoints+bpf behavior. Why make it different and
force user space to perf_event_open() on every cpu?
In other cases it's the job of the bpf program to filter by cpu
if necessary and that is well understood by bcc scripts.


Re: [PATCH 2/2 net-next] tcp: consolidate congestion control undo functions

2017-08-03 Thread kbuild test robot
Hi Yuchung,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Yuchung-Cheng/tcp-fix-cwnd-undo-in-Reno-and-HTCP-congestion-controls/20170804-085255
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   net/ipv4/tcp_cdg.c: In function 'tcp_cdg_backoff':
>> net/ipv4/tcp_cdg.c:256:2: error: too many arguments to function 
>> 'tcp_enter_cwr'
 tcp_enter_cwr(sk, true);
 ^
   In file included from net/ipv4/tcp_cdg.c:32:0:
   include/net/tcp.h:1149:6: note: declared here
void tcp_enter_cwr(struct sock *sk);
 ^
--
   net/ipv4/tcp_nv.c: In function 'tcpnv_recalc_ssthresh':
   net/ipv4/tcp_nv.c:178:16: warning: unused variable 'ca' [-Wunused-variable]
 struct tcpnv *ca = inet_csk_ca(sk);
   ^
   net/ipv4/tcp_nv.c: At top level:
>> net/ipv4/tcp_nv.c:439:15: error: 'tcpnv_reno_undo_cwnd' undeclared here (not 
>> in a function)
 .undo_cwnd = tcpnv_reno_undo_cwnd,
  ^

vim +/tcp_enter_cwr +256 net/ipv4/tcp_cdg.c

   239  
   240  static bool tcp_cdg_backoff(struct sock *sk, u32 grad)
   241  {
   242  struct cdg *ca = inet_csk_ca(sk);
   243  struct tcp_sock *tp = tcp_sk(sk);
   244  
   245  if (prandom_u32() <= nexp_u32(grad * backoff_factor))
   246  return false;
   247  
   248  if (use_ineff) {
   249  ca->backoff_cnt++;
   250  if (ca->backoff_cnt > use_ineff)
   251  return false;
   252  }
   253  
   254  ca->shadow_wnd = max(ca->shadow_wnd, tp->snd_cwnd);
   255  ca->state = CDG_BACKOFF;
 > 256  tcp_enter_cwr(sk, true);
   257  return true;
   258  }
   259  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH 1/2 net-next] tcp: fix cwnd undo in Reno and HTCP congestion controls

2017-08-03 Thread Yuchung Cheng
Using ssthresh to revert cwnd is less reliable when ssthresh is
bounded to 2 packets. This patch uses an existing variable in TCP
"prior_cwnd" that snapshots the cwnd right before entering fast
recovery and RTO recovery in Reno.  This fixes the issue discussed
in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP"
https://www.spinics.net/lists/netdev/msg444955.html

Suggested-by: Neal Cardwell 
Reported-by: Wei Sun 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
---
 include/linux/tcp.h  | 2 +-
 net/ipv4/tcp_cong.c  | 2 +-
 net/ipv4/tcp_htcp.c  | 3 +--
 net/ipv4/tcp_input.c | 1 +
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d7389ea36e10..267164a1d559 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -258,7 +258,7 @@ struct tcp_sock {
u32 snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
u32 snd_cwnd_used;
u32 snd_cwnd_stamp;
-   u32 prior_cwnd; /* Congestion window at start of Recovery. */
+   u32 prior_cwnd; /* cwnd right before starting loss recovery */
u32 prr_delivered;  /* Number of newly delivered packets to
 * receiver in Recovery. */
u32 prr_out;/* Total number of pkts sent during Recovery. */
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index fde983f6376b..c2b174469645 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -456,7 +456,7 @@ u32 tcp_reno_undo_cwnd(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
 
-   return max(tp->snd_cwnd, tp->snd_ssthresh << 1);
+   return max(tp->snd_cwnd, tp->prior_cwnd);
 }
 EXPORT_SYMBOL_GPL(tcp_reno_undo_cwnd);
 
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 3eb78cde6ff0..082d479462fa 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -66,7 +66,6 @@ static inline void htcp_reset(struct htcp *ca)
 
 static u32 htcp_cwnd_undo(struct sock *sk)
 {
-   const struct tcp_sock *tp = tcp_sk(sk);
struct htcp *ca = inet_csk_ca(sk);
 
if (ca->undo_last_cong) {
@@ -76,7 +75,7 @@ static u32 htcp_cwnd_undo(struct sock *sk)
ca->undo_last_cong = 0;
}
 
-   return max(tp->snd_cwnd, (tp->snd_ssthresh << 7) / ca->beta);
+   return tcp_reno_undo_cwnd(sk);
 }
 
 static inline void measure_rtt(struct sock *sk, u32 srtt)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99cdf4ccabb8..842ed75ccb25 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1950,6 +1950,7 @@ void tcp_enter_loss(struct sock *sk)
!after(tp->high_seq, tp->snd_una) ||
(icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
tp->prior_ssthresh = tcp_current_ssthresh(sk);
+   tp->prior_cwnd = tp->snd_cwnd;
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
tcp_ca_event(sk, CA_EVENT_LOSS);
tcp_init_undo(tp);
-- 
2.14.0.rc1.383.gd1ce394fe2-goog



[PATCH 2/2 net-next] tcp: consolidate congestion control undo functions

2017-08-03 Thread Yuchung Cheng
Most TCP congestion controls use identical logic to undo
cwnd, except BBR. This patch consolidates these similar functions
into the one currently used by Reno and others.

Suggested-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_bic.c   | 14 +-
 net/ipv4/tcp_cdg.c   | 14 ++
 net/ipv4/tcp_cubic.c | 13 +
 net/ipv4/tcp_highspeed.c | 11 +--
 net/ipv4/tcp_illinois.c  | 11 +--
 net/ipv4/tcp_nv.c| 12 +---
 net/ipv4/tcp_scalable.c  | 16 +---
 net/ipv4/tcp_veno.c  | 11 +--
 net/ipv4/tcp_yeah.c  | 11 +--
 9 files changed, 10 insertions(+), 103 deletions(-)

diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c
index 609965f0e298..fc3614377413 100644
--- a/net/ipv4/tcp_bic.c
+++ b/net/ipv4/tcp_bic.c
@@ -49,7 +49,6 @@ MODULE_PARM_DESC(smooth_part, 
"log(B/(B*Smin))/log(B/(B-1))+B, # of RTT from Wma
 struct bictcp {
u32 cnt;/* increase cwnd by 1 after ACKs */
u32 last_max_cwnd;  /* last maximum snd_cwnd */
-   u32 loss_cwnd;  /* congestion window at last loss */
u32 last_cwnd;  /* the last snd_cwnd */
u32 last_time;  /* time when updated last_cwnd */
u32 epoch_start;/* beginning of an epoch */
@@ -72,7 +71,6 @@ static void bictcp_init(struct sock *sk)
struct bictcp *ca = inet_csk_ca(sk);
 
bictcp_reset(ca);
-   ca->loss_cwnd = 0;
 
if (initial_ssthresh)
tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
@@ -172,22 +170,12 @@ static u32 bictcp_recalc_ssthresh(struct sock *sk)
else
ca->last_max_cwnd = tp->snd_cwnd;
 
-   ca->loss_cwnd = tp->snd_cwnd;
-
if (tp->snd_cwnd <= low_window)
return max(tp->snd_cwnd >> 1U, 2U);
else
return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
 }
 
-static u32 bictcp_undo_cwnd(struct sock *sk)
-{
-   const struct tcp_sock *tp = tcp_sk(sk);
-   const struct bictcp *ca = inet_csk_ca(sk);
-
-   return max(tp->snd_cwnd, ca->loss_cwnd);
-}
-
 static void bictcp_state(struct sock *sk, u8 new_state)
 {
if (new_state == TCP_CA_Loss)
@@ -214,7 +202,7 @@ static struct tcp_congestion_ops bictcp __read_mostly = {
.ssthresh   = bictcp_recalc_ssthresh,
.cong_avoid = bictcp_cong_avoid,
.set_state  = bictcp_state,
-   .undo_cwnd  = bictcp_undo_cwnd,
+   .undo_cwnd  = tcp_reno_undo_cwnd,
.pkts_acked = bictcp_acked,
.owner  = THIS_MODULE,
.name   = "bic",
diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 50a0f3e51d5b..7c2b78b62d54 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -85,7 +85,6 @@ struct cdg {
u8  state;
u8  delack;
u32 rtt_seq;
-   u32 undo_cwnd;
u32 shadow_wnd;
u16 backoff_cnt;
u16 sample_cnt;
@@ -254,7 +253,7 @@ static bool tcp_cdg_backoff(struct sock *sk, u32 grad)
 
ca->shadow_wnd = max(ca->shadow_wnd, tp->snd_cwnd);
ca->state = CDG_BACKOFF;
-   tcp_enter_cwr(sk);
+   tcp_enter_cwr(sk, true);
return true;
 }
 
@@ -330,8 +329,6 @@ static u32 tcp_cdg_ssthresh(struct sock *sk)
struct cdg *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
-   ca->undo_cwnd = tp->snd_cwnd;
-
if (ca->state == CDG_BACKOFF)
return max(2U, (tp->snd_cwnd * min(1024U, backoff_beta)) >> 10);
 
@@ -344,13 +341,6 @@ static u32 tcp_cdg_ssthresh(struct sock *sk)
return max(2U, tp->snd_cwnd >> 1);
 }
 
-static u32 tcp_cdg_undo_cwnd(struct sock *sk)
-{
-   struct cdg *ca = inet_csk_ca(sk);
-
-   return max(tcp_sk(sk)->snd_cwnd, ca->undo_cwnd);
-}
-
 static void tcp_cdg_cwnd_event(struct sock *sk, const enum tcp_ca_event ev)
 {
struct cdg *ca = inet_csk_ca(sk);
@@ -403,7 +393,7 @@ struct tcp_congestion_ops tcp_cdg __read_mostly = {
.cong_avoid = tcp_cdg_cong_avoid,
.cwnd_event = tcp_cdg_cwnd_event,
.pkts_acked = tcp_cdg_acked,
-   .undo_cwnd = tcp_cdg_undo_cwnd,
+   .undo_cwnd = tcp_reno_undo_cwnd,
.ssthresh = tcp_cdg_ssthresh,
.release = tcp_cdg_release,
.init = tcp_cdg_init,
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 57ae5b5ae643..78bfadfcf342 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -83,7 +83,6 @@ MODULE_PARM_DESC(hystart_ack_delta, "spacing between ack's 
indicating train (mse
 struct bictcp {
u32 cnt;/* increase cwnd by 1 after ACKs */
u32 last_max_cwnd;  /* last maximum snd_cwnd */
-   u32 loss_cwnd;  /* congestion window at last loss */
u32 last_cwnd;  /* the last snd_cwnd */
u32 last_time;  /* time when updated last_cwnd */
u32 bic_origin_point;/* origin point o

[PATCH net 0/1] netvsc: race on sub channel open

2017-08-03 Thread Stephen Hemminger
Found this while testing mtu, queue and buffer size changes
in 4.13, but the problem goes back much further. The addition
of NAPI turns the race into a crash. Before that there was just
a risk of sending on an uninitialized channel.

Stephen Hemminger (1):
  netvsc: fix race on sub channel creation

 drivers/net/hyperv/hyperv_net.h   |  3 ++-
 drivers/net/hyperv/netvsc.c   |  1 +
 drivers/net/hyperv/rndis_filter.c | 14 --
 3 files changed, 11 insertions(+), 7 deletions(-)

-- 
2.11.0



[PATCH net 1/1] netvsc: fix race on sub channel creation

2017-08-03 Thread Stephen Hemminger
The existing sub channel code did not wait for all the sub-channels
to completely initialize. This could lead to a race causing a crash
in netif_napi_del() from a bad list. The existing code would send
an init message, then wait only for the initial response that
the init message was received. It thought it was waiting for
sub channels, but really the init response did the wakeup.

The new code keeps track of the number of open channels and
waits until that many are open.

Other issues here were:
  * the host might return fewer sub-channels than were requested.
  * the init status is not valid until init has completed.

Fixes: b3e6b82a0099 ("hv_netvsc: Wait for sub-channels to be processed during 
probe")
Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h   |  3 ++-
 drivers/net/hyperv/netvsc.c   |  1 +
 drivers/net/hyperv/rndis_filter.c | 14 --
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index d6c25580f8dd..12cc64bfcff8 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -765,7 +765,8 @@ struct netvsc_device {
u32 max_chn;
u32 num_chn;
 
-   refcount_t sc_offered;
+   atomic_t open_chn;
+   wait_queue_head_t subchan_open;
 
struct rndis_device *extension;
 
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 96f90c75d1b7..d18c3326a1f7 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -78,6 +78,7 @@ static struct netvsc_device *alloc_net_device(void)
net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
init_completion(&net_device->channel_init_wait);
+   init_waitqueue_head(&net_device->subchan_open);
 
return net_device;
 }
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 85c00e1c52b6..d6308ffda53e 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1048,8 +1048,8 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
else
netif_napi_del(&nvchan->napi);
 
-   if (refcount_dec_and_test(&nvscdev->sc_offered))
-   complete(&nvscdev->channel_init_wait);
+   atomic_inc(&nvscdev->open_chn);
+   wake_up(&nvscdev->subchan_open);
 }
 
 int rndis_filter_device_add(struct hv_device *dev,
@@ -1090,8 +1090,6 @@ int rndis_filter_device_add(struct hv_device *dev,
net_device->max_chn = 1;
net_device->num_chn = 1;
 
-   refcount_set(&net_device->sc_offered, 0);
-
net_device->extension = rndis_device;
rndis_device->ndev = net;
 
@@ -1221,11 +1219,11 @@ int rndis_filter_device_add(struct hv_device *dev,
rndis_device->ind_table[i] = ethtool_rxfh_indir_default(i,
net_device->num_chn);
 
+   atomic_set(&net_device->open_chn, 1);
num_rss_qs = net_device->num_chn - 1;
if (num_rss_qs == 0)
return 0;
 
-   refcount_set(&net_device->sc_offered, num_rss_qs);
vmbus_set_sc_create_callback(dev->channel, netvsc_sc_open);
 
init_packet = &net_device->channel_init_pkt;
@@ -1242,15 +1240,19 @@ int rndis_filter_device_add(struct hv_device *dev,
if (ret)
goto out;
 
+   wait_for_completion(&net_device->channel_init_wait);
if (init_packet->msg.v5_msg.subchn_comp.status != NVSP_STAT_SUCCESS) {
ret = -ENODEV;
goto out;
}
-   wait_for_completion(&net_device->channel_init_wait);
 
net_device->num_chn = 1 +
init_packet->msg.v5_msg.subchn_comp.num_subchannels;
 
+   /* wait for all sub channels to open */
+   wait_event(net_device->subchan_open,
+  atomic_read(&net_device->open_chn) == net_device->num_chn);
+
/* ignore failues from setting rss parameters, still have channels */
rndis_filter_set_rss_param(rndis_device, netvsc_hash_key,
   net_device->num_chn);
-- 
2.11.0



[PATCH] MIPS: Add missing file for eBPF JIT.

2017-08-03 Thread David Daney
Inexplicably, commit f381bf6d82f0 ("MIPS: Add support for eBPF JIT.")
lost a file somewhere on its path to Linus' tree.  Add back the
missing ebpf_jit.c so that we can build with CONFIG_BPF_JIT selected.

This version of ebpf_jit.c is identical to the original except for two
minor changes needed to resolve conflicts with changes merged from the
BPF branch:

A) Set prog->jited_len = image_size;
B) Use BPF_TAIL_CALL instead of BPF_CALL | BPF_X

Fixes: f381bf6d82f0 ("MIPS: Add support for eBPF JIT.")
Signed-off-by: David Daney 
---

It might be best to merge this along the path of BPF fixes rather than
MIPS, as the MIPS maintainer (Ralf) seems to be inactive recently.

 arch/mips/net/ebpf_jit.c | 1950 ++
 1 file changed, 1950 insertions(+)
 create mode 100644 arch/mips/net/ebpf_jit.c

diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
new file mode 100644
index 000..3f87b96
--- /dev/null
+++ b/arch/mips/net/ebpf_jit.c
@@ -0,0 +1,1950 @@
+/*
+ * Just-In-Time compiler for eBPF filters on MIPS
+ *
+ * Copyright (c) 2017 Cavium, Inc.
+ *
+ * Based on code from:
+ *
+ * Copyright (c) 2014 Imagination Technologies Ltd.
+ * Author: Markos Chandras 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; version 2 of the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Registers used by JIT */
+#define MIPS_R_ZERO0
+#define MIPS_R_AT  1
+#define MIPS_R_V0  2   /* BPF_R0 */
+#define MIPS_R_V1  3
+#define MIPS_R_A0  4   /* BPF_R1 */
+#define MIPS_R_A1  5   /* BPF_R2 */
+#define MIPS_R_A2  6   /* BPF_R3 */
+#define MIPS_R_A3  7   /* BPF_R4 */
+#define MIPS_R_A4  8   /* BPF_R5 */
+#define MIPS_R_T4  12  /* BPF_AX */
+#define MIPS_R_T5  13
+#define MIPS_R_T6  14
+#define MIPS_R_T7  15
+#define MIPS_R_S0  16  /* BPF_R6 */
+#define MIPS_R_S1  17  /* BPF_R7 */
+#define MIPS_R_S2  18  /* BPF_R8 */
+#define MIPS_R_S3  19  /* BPF_R9 */
+#define MIPS_R_S4  20  /* BPF_TCC */
+#define MIPS_R_S5  21
+#define MIPS_R_S6  22
+#define MIPS_R_S7  23
+#define MIPS_R_T8  24
+#define MIPS_R_T9  25
+#define MIPS_R_SP  29
+#define MIPS_R_RA  31
+
+/* eBPF flags */
+#define EBPF_SAVE_S0   BIT(0)
+#define EBPF_SAVE_S1   BIT(1)
+#define EBPF_SAVE_S2   BIT(2)
+#define EBPF_SAVE_S3   BIT(3)
+#define EBPF_SAVE_S4   BIT(4)
+#define EBPF_SAVE_RA   BIT(5)
+#define EBPF_SEEN_FP   BIT(6)
+#define EBPF_SEEN_TC   BIT(7)
+#define EBPF_TCC_IN_V1 BIT(8)
+
+/*
+ * For the mips64 ISA, we need to track the value range or type for
+ * each JIT register.  The BPF machine requires zero extended 32-bit
+ * values, but the mips64 ISA requires sign extended 32-bit values.
+ * At each point in the BPF program we track the state of every
+ * register so that we can zero extend or sign extend as the BPF
+ * semantics require.
+ */
+enum reg_val_type {
+   /* uninitialized */
+   REG_UNKNOWN,
+   /* not known to be 32-bit compatible. */
+   REG_64BIT,
+   /* 32-bit compatible, no truncation needed for 64-bit ops. */
+   REG_64BIT_32BIT,
+   /* 32-bit compatible, need truncation for 64-bit ops. */
+   REG_32BIT,
+   /* 32-bit zero extended. */
+   REG_32BIT_ZERO_EX,
+   /* 32-bit no sign/zero extension needed. */
+   REG_32BIT_POS
+};
+
+/*
+ * high bit of offsets indicates if long branch conversion done at
+ * this insn.
+ */
+#define OFFSETS_B_CONV BIT(31)
+
+/**
+ * struct jit_ctx - JIT context
+ * @skf:   The sk_filter
+ * @stack_size:eBPF stack size
+ * @tmp_offset:eBPF $sp offset to 8-byte temporary memory
+ * @idx:   Instruction index
+ * @flags: JIT flags
+ * @offsets:   Instruction offsets
+ * @target:Memory location for the compiled filter
+ * @reg_val_types  Packed enum reg_val_type for each register.
+ */
+struct jit_ctx {
+   const struct bpf_prog *skf;
+   int stack_size;
+   int tmp_offset;
+   u32 idx;
+   u32 flags;
+   u32 *offsets;
+   u32 *target;
+   u64 *reg_val_types;
+   unsigned int long_b_conversion:1;
+   unsigned int gen_b_offsets:1;
+};
+
+static void set_reg_val_type(u64 *rvt, int reg, enum reg_val_type type)
+{
+   *rvt &= ~(7ull << (reg * 3));
+   *rvt |= ((u64)type << (reg * 3));
+}
+
+static enum reg_val_type get_reg_val_type(const struct jit_ctx *ctx,
+ int index, int reg)
+{
+   return (ctx->reg_val_types[index] >> (reg * 3)) & 7;
+}
+
+/* Simply emit the instruction if the JIT memory space has been allocated */
+#define emit_instr(ctx, func, ...) \
+do {

Re: [PATCH v7 net-next] net: systemport: Support 64bit statistics

2017-08-03 Thread David Miller
From: Florian Fainelli 
Date: Thu, 3 Aug 2017 16:20:04 -0700

> On 08/03/2017 04:16 PM, Stephen Hemminger wrote:
>> On Fri,  4 Aug 2017 00:07:45 +0100
>> "Jianming.qiao"  wrote:
>> 
>>>  static const struct bcm_sysport_stats bcm_sysport_gstrings_stats[] = {
>>> /* general stats */
>>> -   STAT_NETDEV(rx_packets),
>>> -   STAT_NETDEV(tx_packets),
>>> -   STAT_NETDEV(rx_bytes),
>>> -   STAT_NETDEV(tx_bytes),
>>> +   STAT_NETDEV64(rx_packets),
>>> +   STAT_NETDEV64(tx_packets),
>>> +   STAT_NETDEV64(rx_bytes),
>>> +   STAT_NETDEV64(tx_bytes),
>>> STAT_NETDEV(rx_errors),
>> 
>> Please don't duplicate regular statistics (ie netdev) into ethtool.
>> It is a needless duplication.
> 
> Agreed, but these are there already and this driver's ethtool::get_stats
> is an user ABI of some sort, is not it?

Agreed, they have to stay at this point.


Re: [PATCH v3 net-next 4/4] ulp: Documention for ULP infrastructure

2017-08-03 Thread Mat Martineau

On Thu, 3 Aug 2017, Tom Herbert wrote:


Add a doc in Documentation/networking

Signed-off-by: Tom Herbert 
---
Documentation/networking/ulp.txt | 82 
1 file changed, 82 insertions(+)
create mode 100644 Documentation/networking/ulp.txt

diff --git a/Documentation/networking/ulp.txt b/Documentation/networking/ulp.txt
new file mode 100644
index ..4d830314b0ff
--- /dev/null
+++ b/Documentation/networking/ulp.txt
@@ -0,0 +1,82 @@
+Upper Layer Protocol (ULP) Infrastructure
+=
+
+The ULP kernel infrastructure provides a means to hook upper layer
+protocol support on a socket. A module may register a ULP hook
+in the kernel. ULP processing is enabled by a setsockopt on a socket
+that specifies the name of the registered ULP to invoked. An
+initialization function is defined for each ULP that can change the
+function entry points of the socket (sendmsg, rcvmsg, etc.) or change
+the socket in other fundamental ways.
+
+Note, no synchronization is enforced between the setsockopt to enable
+a ULP and ongoing asynchronous operations on the socket (such as a
+blocked read). If synchronization is required this must be handled by
+the ULP and caller.
+
+User interface
+==
+
+The structure for the socket SOL_ULP options is defined in socket.h.
+
+Example to enable "my_ulp" ULP on a socket:
+
+struct ulp_config ulpc = {
+.ulp_name = "my_ulp",
+};
+
+setsockopt(sock, SOL_SOCKET, SO_ULP, &ulpc, sizeof(ulpc))
+
+The ulp_config includes a "__u8 ulp_params[0]" filled that may be used

  ^^
Did you mean "field"? Might also phrase it "The ulp_config structure 
includes..."



+to refer ULP specific parameters being set.



Thanks,

--
Mat Martineau
Intel OTC



[RFC PATCH 6/6] net: sockmap sample program

2017-08-03 Thread John Fastabend
This sample binds a program to a cgroup, then matches hard-coded
IP addresses and adds the matching sockets to a sockmap.

This will receive messages from the backend and send them to
the client.

 client:X <---> frontend:1 client:X <---> backend:80

To keep things simple this is only designed for 1:1 connections
using hard-coded values. A more complete example would allow
many backends and clients.

Signed-off-by: John Fastabend 
---
 samples/sockmap/Makefile  |   78 
 samples/sockmap/sockmap_kern.c|  143 +
 samples/sockmap/sockmap_user.c|   84 +
 tools/include/uapi/linux/bpf.h|1 
 tools/lib/bpf/bpf.c   |   11 ++
 tools/lib/bpf/bpf.h   |4 +
 tools/testing/selftests/bpf/bpf_helpers.h |   12 ++
 7 files changed, 331 insertions(+), 2 deletions(-)
 create mode 100644 samples/sockmap/Makefile
 create mode 100644 samples/sockmap/sockmap_kern.c
 create mode 100644 samples/sockmap/sockmap_user.c

diff --git a/samples/sockmap/Makefile b/samples/sockmap/Makefile
new file mode 100644
index 000..9291ab8
--- /dev/null
+++ b/samples/sockmap/Makefile
@@ -0,0 +1,78 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := sockmap
+
+# Libbpf dependencies
+LIBBPF := ../../tools/lib/bpf/bpf.o
+
+HOSTCFLAGS += -I$(objtree)/usr/include
+HOSTCFLAGS += -I$(srctree)/tools/lib/
+HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
+HOSTCFLAGS += -I$(srctree)/tools/lib/ -I$(srctree)/tools/include
+HOSTCFLAGS += -I$(srctree)/tools/perf
+
+sockmap-objs := ../bpf/bpf_load.o $(LIBBPF) sockmap_user.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+always += sockmap_kern.o
+
+HOSTLOADLIBES_sockmap += -lelf -lpthread
+
+# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
+#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
+LLC ?= llc
+CLANG ?= clang
+
+# Trick to allow make to be run from this directory
+all:
+   $(MAKE) -C ../../ $(CURDIR)/
+
+clean:
+   $(MAKE) -C ../../ M=$(CURDIR) clean
+   @rm -f *~
+
+$(obj)/syscall_nrs.s:  $(src)/syscall_nrs.c
+   $(call if_changed_dep,cc_s_c)
+
+$(obj)/syscall_nrs.h:  $(obj)/syscall_nrs.s FORCE
+   $(call filechk,offsets,__SYSCALL_NRS_H__)
+
+clean-files += syscall_nrs.h
+
+FORCE:
+
+
+# Verify LLVM compiler tools are available and bpf target is supported by llc
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
+
+verify_cmds: $(CLANG) $(LLC)
+   @for TOOL in $^ ; do \
+   if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
+   echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+   exit 1; \
+   else true; fi; \
+   done
+
+verify_target_bpf: verify_cmds
+   @if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+   echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+   echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+   exit 2; \
+   else true; fi
+
+$(src)/*.c: verify_target_bpf
+
+# asm/sysreg.h - inline assembly used by it is incompatible with llvm.
+# But, there is no easy way to fix it, so just exclude it since it is
+# useless for BPF samples.
+$(obj)/%.o: $(src)/%.c
+   $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
+   -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value 
-Wno-pointer-sign \
+   -Wno-compare-distinct-pointer-types \
+   -Wno-gnu-variable-sized-type-not-at-end \
+   -Wno-address-of-packed-member -Wno-tautological-compare \
+   -Wno-unknown-warning-option \
+   -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
diff --git a/samples/sockmap/sockmap_kern.c b/samples/sockmap/sockmap_kern.c
new file mode 100644
index 000..07dea99
--- /dev/null
+++ b/samples/sockmap/sockmap_kern.c
@@ -0,0 +1,143 @@
+#include 
+#include 
+#include 
+#include 
+#include "../../tools/testing/selftests/bpf/bpf_helpers.h"
+#include "../../tools/testing/selftests/bpf/bpf_endian.h"
+
+#define bpf_printk(fmt, ...)   \
+({ \
+  char ____fmt[] = fmt;\
+  bpf_trace_printk(____fmt, sizeof(____fmt),   \
+   ##__VA_ARGS__); \
+})
+
+struct bpf_map_def SEC("maps") sock_map = {
+   .type = BPF_MAP_TYPE_SOCKMAP,
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .max_entries = 20,
+};
+
+struct bpf_map_def SEC("maps") reply_port = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .max_entries = 1,
+};
+
+SEC("socket1"

[RFC PATCH 3/6] net: fixes for skb_send_sock

2017-08-03 Thread John Fastabend
A couple of fixes to the new skb_send_sock infrastructure. However, no
users of this code currently exist (users are added in the next handful
of patches), so it should not be possible to trigger a panic with
in-kernel code.

Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and 
sendpage")
Signed-off-by: John Fastabend 
---
 net/core/skbuff.c |2 +-
 net/socket.c  |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0f0933b..a0504a5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2011,7 +2011,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff 
*skb, int offset,
 
slen = min_t(int, len, skb_headlen(skb) - offset);
kv.iov_base = skb->data + offset;
-   kv.iov_len = len;
+   kv.iov_len = slen;
memset(&msg, 0, sizeof(msg));
 
ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
diff --git a/net/socket.c b/net/socket.c
index b332d1e..c729625 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -658,7 +658,7 @@ int kernel_sendmsg_locked(struct sock *sk, struct msghdr 
*msg,
struct socket *sock = sk->sk_socket;
 
if (!sock->ops->sendmsg_locked)
-   sock_no_sendmsg_locked(sk, msg, size);
+   return sock_no_sendmsg_locked(sk, msg, size);
 
iov_iter_kvec(&msg->msg_iter, WRITE | ITER_KVEC, vec, num, size);
 



[RFC PATCH 5/6] net: bpf, add skb to sk lookup routines

2017-08-03 Thread John Fastabend
Add some useful skb-to-sk routines to find ports on a connected
socket.

This is for testing; we may prefer to put sk in the bpf sk_buff
representation and access these fields directly, similar to
the sock ops ctx access.

Signed-off-by: John Fastabend 
---
 include/uapi/linux/bpf.h |2 ++
 net/core/filter.c|   36 
 2 files changed, 38 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a89e831..c626c8f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -604,6 +604,8 @@ enum bpf_attach_type {
FN(redirect_map),   \
FN(sk_redirect_map),\
FN(map_ctx_update_elem),\
+   FN(skb_get_local_port), \
+   FN(skb_get_remote_port),\
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 2644f2d..3234200 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2993,6 +2993,38 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const 
void *src_buff,
.arg1_type  = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_get_remote_port, struct sk_buff *, skb)
+{
+   struct sock *sk = skb->sk;//sk_to_full_sk(skb->sk);
+
+   if (!sk)// || !sk_fullsock(sk))
+   return overflowuid;
+   return sk->sk_dport;
+}
+
+static const struct bpf_func_proto bpf_skb_get_remote_port_proto = {
+   .func   = bpf_get_remote_port,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
+BPF_CALL_1(bpf_get_local_port, struct sk_buff *, skb)
+{
+   struct sock *sk = skb->sk;
+
+   if (!sk)// || !sk_fullsock(sk))
+   return overflowuid;
+   return sk->sk_num;
+}
+
+static const struct bpf_func_proto bpf_skb_get_local_port_proto = {
+   .func   = bpf_get_local_port,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
   int, level, int, optname, char *, optval, int, optlen)
 {
@@ -3135,6 +3167,10 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const 
void *src_buff,
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_uid:
return &bpf_get_socket_uid_proto;
+   case BPF_FUNC_skb_get_remote_port:
+   return &bpf_skb_get_remote_port_proto;
+   case BPF_FUNC_skb_get_local_port:
+   return &bpf_skb_get_local_port_proto;
case BPF_FUNC_sk_redirect_map:
return &bpf_sk_redirect_map_proto;
case BPF_FUNC_map_ctx_update_elem:



[RFC PATCH 4/6] net: sockmap with sk redirect support

2017-08-03 Thread John Fastabend
Recently we added a new map type called dev map used to forward XDP
packets between ports (6093ec2dc313). This patch introduces a
similar notion for sockets.

A sockmap allows users to add participating sockets to a map. When
sockets are added to the map, enough context is stored with the
map entry to use the entry with a new helper

  bpf_sk_redirect_map(map, key, flags)

This helper (analogous to bpf_redirect_map in XDP) is given the map
and an entry in the map. When called from a sockmap program, discussed
below, the skb will be sent on the socket using skb_send_sock().

With the above we need a bpf program that calls the helper and
implements the send logic. The initial call site implemented in this
series is the recv_sock hook. For this to work we implemented a map
attach command to add attributes to a map. In sockmap we add two
programs, a parse program and a verdict program. The parse program
uses strparser to build messages and pass them to the verdict program,
following normal strparser semantics. The verdict program is of type
SOCKET_FILTER.

The verdict program returns one of BPF_OK, BPF_DROP, or BPF_REDIRECT.
When BPF_REDIRECT is returned (expected when the bpf program uses
bpf_sk_redirect_map()), the sockmap logic consults per-cpu variables
set by the helper routine and pulls the sock entry out of the sock
map. This pattern follows the existing redirect logic in cls and xdp
programs.

This gives the flow,

 recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock

As an example use case a message based load balancer may use specific
logic in the verdict program to select the sock to send on.

Example and sample programs are provided in future patches that
hopefully illustrate the user interfaces.

TBD: bpf program refcnt'ing needs to be cleaned up, some additional
cleanup in a few error paths, publish performance numbers and some
self tests.

Signed-off-by: John Fastabend 
---
 include/linux/bpf.h   |   11 +
 include/linux/bpf_types.h |1 
 include/uapi/linux/bpf.h  |   13 +
 kernel/bpf/Makefile   |2 
 kernel/bpf/helpers.c  |   20 +
 kernel/bpf/sockmap.c  |  623 +
 kernel/bpf/syscall.c  |   41 +++
 net/core/filter.c |   51 
 8 files changed, 759 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/sockmap.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6353c74..9ce6aa0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -15,6 +15,8 @@
 #include 
 #include 
 
+#include 
+
 struct perf_event;
 struct bpf_map;
 
@@ -29,6 +31,9 @@ struct bpf_map_ops {
/* funcs callable from userspace and from eBPF programs */
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 
flags);
+   int (*map_ctx_update_elem)(struct bpf_sock_ops_kern *skops,
+  struct bpf_map *map,
+  void *key, u64 flags, u64 map_flags);
int (*map_delete_elem)(struct bpf_map *map, void *key);
 
/* funcs called by prog_array and perf_event_array map */
@@ -37,6 +42,7 @@ struct bpf_map_ops {
void (*map_fd_put_ptr)(void *ptr);
u32 (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf);
u32 (*map_fd_sys_lookup_elem)(void *ptr);
+   int (*map_attach)(struct bpf_map *map, struct bpf_prog *p1, struct 
bpf_prog *p2);
 };
 
 struct bpf_map {
@@ -321,6 +327,7 @@ static inline void bpf_long_memcpy(void *dst, const void 
*src, u32 size)
 
 /* Map specifics */
 struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
 void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
 void __dev_map_flush(struct bpf_map *map);
 
@@ -378,9 +385,13 @@ static inline void __dev_map_flush(struct bpf_map *map)
 }
 #endif /* CONFIG_BPF_SYSCALL */
 
+inline struct sock *do_sk_redirect_map(void);
+inline u64 get_sk_redirect_flags(void);
+
 /* verifier prototypes for helper functions called from eBPF programs */
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
+extern const struct bpf_func_proto bpf_map_ctx_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
 
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index b1e1035..930be52 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -37,4 +37,5 @@
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
 #ifdef CONFIG_NET
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1106a8c..a89e831 100644
--- a/include/uapi/linux/bpf.h
+++ b/inc

[RFC PATCH 2/6] net: add sendmsg_locked and sendpage_locked to af_inet6

2017-08-03 Thread John Fastabend
To complete the sendmsg_locked and sendpage_locked implementation add
the hooks for af_inet6 as well.

Signed-off-by: John Fastabend 
---
 net/ipv6/af_inet6.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 0a7c740..3b58ee7 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -554,6 +554,8 @@ int inet6_ioctl(struct socket *sock, unsigned int cmd, 
unsigned long arg)
.recvmsg   = inet_recvmsg,  /* ok   */
.mmap  = sock_no_mmap,
.sendpage  = inet_sendpage,
+   .sendmsg_locked= tcp_sendmsg_locked,
+   .sendpage_locked   = tcp_sendpage_locked,
.splice_read   = tcp_splice_read,
.read_sock = tcp_read_sock,
.peek_len  = tcp_peek_len,



[RFC PATCH 1/6] net: early init support for strparser

2017-08-03 Thread John Fastabend
It is useful to allow strparser to init sockets before the read_sock
callback has been established.

Signed-off-by: John Fastabend 
---
 net/strparser/strparser.c |   10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 0d18fbc..749f1cb 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -373,6 +373,9 @@ static int strp_read_sock(struct strparser *strp)
struct socket *sock = strp->sk->sk_socket;
read_descriptor_t desc;
 
+   if (unlikely(!sock->ops || !sock->ops->read_sock))
+   return -EBUSY;
+
desc.arg.data = strp;
desc.error = 0;
desc.count = 1; /* give more than one skb per call */
@@ -486,12 +489,7 @@ int strp_init(struct strparser *strp, struct sock *sk,
 * The upper layer calls strp_process for each skb to be parsed.
 */
 
-   if (sk) {
-   struct socket *sock = sk->sk_socket;
-
-   if (!sock->ops->read_sock || !sock->ops->peek_len)
-   return -EAFNOSUPPORT;
-   } else {
+   if (!sk) {
if (!cb->lock || !cb->unlock)
return -EINVAL;
}



[RFC PATCH 0/6] BPF socket redirect

2017-08-03 Thread John Fastabend
This series implements socket redirect for BPF using XDP redirect
as a model. The user flow and internals are similar in many ways.
First we add a new map type called, sockmap. A sockmap holds
references to sock structs. Next a bpf helper call is added to
support redirect between sockets,

  bpf_sk_redirect_map(map, key, flags)

This allows BPF programs to redirect packets between sockets.

Finally, we need a call site, as a first call site to implement
we added hooks to recv_sock using the existing strparser blocks.
The call site is added via a new BPF attach map call.

For details see patches. The final patch provides a sample program
that shows a real example that uses cgroups.

I probably need a few more iterations of fixes/cleanup etc. to
get these ready for non-RFC submission, but because it's working
with "real" traffic now and running without issues, getting
some feedback would be great. I tried to add comments in the code
with "TBD" around areas I know need some work or where I see a bug
could happen in the error case, etc.

For people who prefer git over pulling patches out of their mail
editor I've posted the code here,

https://github.com/jrfastab/linux-kernel-xdp/tree/kproxy_sockmap7

TBD:
  - bpf program refcnting cleanup
  - publish performance numbers
  - probably a couple more iterations of cleanup
  - build a better cover letter ;)

Thanks to Daniel Borkmann for reviewing and providing feedback even
though some of it just made it into the TBD column so far.

Parts of this code started with initial kproxy RFC patches (Tom
Herbert) here,

 https://patchwork.ozlabs.org/patch/782406/

although its been heavily modified/changed/etc by now.

Some of the original ideas/discussions around this started at netconf;
here is a link with notes. Search for "In-kernel layer-7 proxying" and
the presentation from Thomas Graf,

https://lwn.net/Articles/719985/

Sorry if I forgot citing anyone :) its just an RFC after all.

Thanks,
John

---

John Fastabend (6):
  net: early init support for strparser
  net: add sendmsg_locked and sendpage_locked to af_inet6
  net: fixes for skb_send_sock
  net: sockmap with sk redirect support
  net: bpf, add skb to sk lookup routines
  net: sockmap sample program


 include/linux/bpf.h   |   11 +
 include/linux/bpf_types.h |1 
 include/uapi/linux/bpf.h  |   15 +
 kernel/bpf/Makefile   |2 
 kernel/bpf/helpers.c  |   20 +
 kernel/bpf/sockmap.c  |  623 +
 kernel/bpf/syscall.c  |   41 ++
 net/core/filter.c |   87 
 net/core/skbuff.c |2 
 net/ipv6/af_inet6.c   |2 
 net/socket.c  |2 
 net/strparser/strparser.c |   10 
 samples/sockmap/Makefile  |   78 
 samples/sockmap/sockmap_kern.c|  143 +++
 samples/sockmap/sockmap_user.c|   84 
 tools/include/uapi/linux/bpf.h|1 
 tools/lib/bpf/bpf.c   |   11 -
 tools/lib/bpf/bpf.h   |4 
 tools/testing/selftests/bpf/bpf_helpers.h |   12 +
 19 files changed, 1136 insertions(+), 13 deletions(-)
 create mode 100644 kernel/bpf/sockmap.c
 create mode 100644 samples/sockmap/Makefile
 create mode 100644 samples/sockmap/sockmap_kern.c
 create mode 100644 samples/sockmap/sockmap_user.c

--
Signature


Re: [PATCH 3/5] net: stmmac: Add Adaptrum Anarion GMAC glue layer

2017-08-03 Thread Rob Herring
On Fri, Jul 28, 2017 at 03:07:03PM -0700, Alexandru Gagniuc wrote:
> Before the GMAC on the Anarion chip can be used, the PHY interface
> selection must be configured with the DWMAC block in reset.
> 
> This layer covers a block containing only two registers. Although it
> is possible to model this as a reset controller and use the "resets"
> property of stmmac, it's much more intuitive to include this in the
> glue layer instead.
> 
> At this time only RGMII is supported, because it is the only mode
> which has been validated hardware-wise.
> 
> Signed-off-by: Alexandru Gagniuc 
> ---
>  .../devicetree/bindings/net/anarion-gmac.txt   |  25 

The binding looks fine, but please split to separate patch.

>  drivers/net/ethernet/stmicro/stmmac/Kconfig|   9 ++
>  drivers/net/ethernet/stmicro/stmmac/Makefile   |   1 +
>  .../net/ethernet/stmicro/stmmac/dwmac-anarion.c| 151 
> +
>  4 files changed, 186 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/anarion-gmac.txt
>  create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-anarion.c


Re: [PATCH v3 net-next 2/4] sock: ULP infrastructure

2017-08-03 Thread Mat Martineau


On Thu, 3 Aug 2017, Tom Herbert wrote:


Generalize the TCP ULP infrastructure recently introduced to support
kTLS. This adds a SO_ULP socket option and creates new fields in
sock structure for ULP ops and ULP data. Also, the interface allows
additional per ULP parameters to be set so that a ULP can be pushed
and operations started in one shot.

Signed-off-by: Tom Herbert 
---
arch/alpha/include/uapi/asm/socket.h   |   2 +
arch/frv/include/uapi/asm/socket.h |   2 +
arch/ia64/include/uapi/asm/socket.h|   2 +
arch/m32r/include/uapi/asm/socket.h|   2 +
arch/mips/include/uapi/asm/socket.h|   2 +
arch/mn10300/include/uapi/asm/socket.h |   2 +
arch/parisc/include/uapi/asm/socket.h  |   2 +
arch/s390/include/uapi/asm/socket.h|   2 +
arch/sparc/include/uapi/asm/socket.h   |   2 +
arch/xtensa/include/uapi/asm/socket.h  |   2 +
include/linux/socket.h |   9 ++
include/net/sock.h |   5 +
include/net/ulp_sock.h |  75 +
include/uapi/asm-generic/socket.h  |   2 +
net/Kconfig|   4 +
net/core/Makefile  |   1 +
net/core/sock.c|  14 +++
net/core/sysctl_net_core.c |  25 +
net/core/ulp_sock.c| 194 +
19 files changed, 349 insertions(+)
create mode 100644 include/net/ulp_sock.h
create mode 100644 net/core/ulp_sock.c



...


diff --git a/include/net/ulp_sock.h b/include/net/ulp_sock.h
new file mode 100644
index ..37bf4d2e16b9
--- /dev/null
+++ b/include/net/ulp_sock.h
@@ -0,0 +1,75 @@
+/*
+ * Pluggable upper layer protocol support in sockets.
+ *
+ * Copyright (c) 2016-2017, Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016-2017, Dave Watson . All rights 
reserved.
+ * Copyright (c) 2017, Tom Herbert . All rights reserved.
+ *
+ */
+
+#ifndef __NET_ULP_SOCK_H
+#define __NET_ULP_SOCK_H
+
+#include 
+
+#define ULP_MAX 128
+#define ULP_BUF_MAX (ULP_NAME_MAX * ULP_MAX)
+
+struct ulp_ops {
+   struct list_head list;
+
+   /* initialize ulp */
+   int (*init)(struct sock *sk, char __user *optval, int len);
+
+   /* cleanup ulp */
+   void (*release)(struct sock *sk);
+
+   /* Get ULP specific parameters in getsockopt */
+   int (*get_params)(struct sock *sk, char __user *optval, int *optlen);
+
+   char name[ULP_NAME_MAX];
+   struct module *owner;
+};
+
+#ifdef CONFIG_ULP_SOCK
+
+int ulp_register(struct ulp_ops *type);
+void ulp_unregister(struct ulp_ops *type);
+int ulp_set(struct sock *sk, char __user *optval, int len);
+int ulp_get_config(struct sock *sk, char __user *optval, int *optlen);
+void ulp_get_available(char *buf, size_t len);
+void ulp_cleanup(struct sock *sk);
+
+#else
+
+static inline int ulp_register(struct ulp_ops *type)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline void ulp_unregister(struct ulp_ops *type)
+{
+}
+
+static inline int ulp_set(struct sock *sk, char __user *optval, int len)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline int ulp_get_config(struct sock *sk, char __user *optval,
+int *optlen)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline void ulp_get_available(char *buf, size_t len)
+{


proc_ulp_available() doesn't initialize *buf, so the string needs to be 
NUL-terminated here.



+}
+
+static inline void ulp_cleanup(struct sock *sk)
+{
+}
+
+#endif
+
+#endif /* __NET_ULP_SOCK_H */


...


diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b7cd9aafe99e..9e14f91b57eb 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -21,6 +21,7 @@
#include 
#include 
#include 
+#include 

static int zero = 0;
static int one = 1;
@@ -249,6 +250,24 @@ static int proc_do_rss_key(struct ctl_table *table, int 
write,
return proc_dostring(&fake_table, write, buffer, lenp, ppos);
}

+static int proc_ulp_available(struct ctl_table *ctl,
+ int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+   struct ctl_table tbl = { .maxlen = ULP_BUF_MAX, };
+   int ret;
+
+   tbl.data = kmalloc(tbl.maxlen, GFP_USER);


(Just flagging this to provide context for the uninitialized data comment 
above)



+   if (!tbl.data)
+   return -ENOMEM;
+   ulp_get_available(tbl.data, ULP_BUF_MAX);
+   ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
+   kfree(tbl.data);
+
+   return ret;
+}
+
static struct ctl_table net_core_table[] = {
#ifdef CONFIG_NET
{
@@ -460,6 +479,12 @@ static struct ctl_table net_core_table[] = {
.proc_handler   = proc_dointvec_minmax,
.extra1 = &zero,
},
+   {
+   .procname   = "ulp_available",
+   .maxlen = ULP_BUF_MAX,
+   .mode   = 0444,
+   .p

Re: [PATCH 3/5] net: stmmac: Add Adaptrum Anarion GMAC glue layer

2017-08-03 Thread Rob Herring
On Mon, Jul 31, 2017 at 08:11:00AM -0700, Alex wrote:
> Hi David,
> 
> On 07/28/2017 07:01 PM, David Miller wrote:
> > From: Alexandru Gagniuc 
> > Date: Fri, 28 Jul 2017 15:07:03 -0700
> > 
> > > Before the GMAC on the Anarion chip can be used, the PHY interface
> > > selection must be configured with the DWMAC block in reset.
> > > 
> > > This layer covers a block containing only two registers. Although it
> > > is possible to model this as a reset controller and use the "resets"
> > > property of stmmac, it's much more intuitive to include this in the
> > > glue layer instead.
> > > 
> > > At this time only RGMII is supported, because it is the only mode
> > > which has been validated hardware-wise.
> > > 
> > > Signed-off-by: Alexandru Gagniuc 
> > 
> > I don't see how this fits into any patch series at all.  If this is
> > part of a series you posted elsewhere, you should keep netdev@ on
> > all such postings so people there can review the change in-context.
> 
> I used the --cc-cmd option to send-email. I'll be sure to CC netdev@ on
> [PATCH v2].

The problem is that your series spans several subsystems and it's not 
clear whom you intend to have apply these. There aren't really any hard 
dependencies between the patches, so they could all go through different 
trees. But you need to state that at least implicitly by sending the 
patches TO whoever should apply them and CCing the rest (and 
get_maintainers.pl doesn't really help with that aspect). Or just don't 
send them in a series if there's no inter-dependency between the 
patches. Normally bindings and a driver do go together, and I'll ack 
the binding.

Rob


Re: [PATCH v7 net-next] net: systemport: Support 64bit statistics

2017-08-03 Thread Florian Fainelli
On 08/03/2017 04:16 PM, Stephen Hemminger wrote:
> On Fri,  4 Aug 2017 00:07:45 +0100
> "Jianming.qiao"  wrote:
> 
>>  static const struct bcm_sysport_stats bcm_sysport_gstrings_stats[] = {
>>  /* general stats */
>> -STAT_NETDEV(rx_packets),
>> -STAT_NETDEV(tx_packets),
>> -STAT_NETDEV(rx_bytes),
>> -STAT_NETDEV(tx_bytes),
>> +STAT_NETDEV64(rx_packets),
>> +STAT_NETDEV64(tx_packets),
>> +STAT_NETDEV64(rx_bytes),
>> +STAT_NETDEV64(tx_bytes),
>>  STAT_NETDEV(rx_errors),
> 
> Please don't duplicate regular statistics (ie netdev) into ethtool.
> It is a needless duplication.

Agreed, but these are there already and this driver's ethtool::get_stats
is a user ABI of some sort, isn't it?
-- 
Florian


Re: [PATCH v7 net-next] net: systemport: Support 64bit statistics

2017-08-03 Thread Stephen Hemminger
On Fri,  4 Aug 2017 00:07:45 +0100
"Jianming.qiao"  wrote:

>  static const struct bcm_sysport_stats bcm_sysport_gstrings_stats[] = {
>   /* general stats */
> - STAT_NETDEV(rx_packets),
> - STAT_NETDEV(tx_packets),
> - STAT_NETDEV(rx_bytes),
> - STAT_NETDEV(tx_bytes),
> + STAT_NETDEV64(rx_packets),
> + STAT_NETDEV64(tx_packets),
> + STAT_NETDEV64(rx_bytes),
> + STAT_NETDEV64(tx_bytes),
>   STAT_NETDEV(rx_errors),

Please don't duplicate regular statistics (ie netdev) into ethtool.
It is a needless duplication.


[PATCH v7 net-next] net: systemport: Support 64bit statistics

2017-08-03 Thread Jianming.qiao
When using the Broadcom Systemport device on a 32-bit platform,
ifconfig can only report up to 4G of tx/rx statistics, which wrap to 0
when the number of incoming or outgoing packets (or bytes) exceeds 4G,
taking only around 2 hours in a busy network environment (such as
streaming). This makes it hard for network diagnostic tools to get
reliable statistical results, so this patch adds 64-bit support for
the Broadcom Systemport device on 32-bit platforms.

This patch provides 64bit statistics capability on both ethtool and ifconfig.

Signed-off-by: Jianming.qiao 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 78 +++---
 drivers/net/ethernet/broadcom/bcmsysport.h | 21 
 2 files changed, 82 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 5333601..bf9ca3c 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -201,10 +201,10 @@ static int bcm_sysport_set_features(struct net_device 
*dev,
  */
 static const struct bcm_sysport_stats bcm_sysport_gstrings_stats[] = {
/* general stats */
-   STAT_NETDEV(rx_packets),
-   STAT_NETDEV(tx_packets),
-   STAT_NETDEV(rx_bytes),
-   STAT_NETDEV(tx_bytes),
+   STAT_NETDEV64(rx_packets),
+   STAT_NETDEV64(tx_packets),
+   STAT_NETDEV64(rx_bytes),
+   STAT_NETDEV64(tx_bytes),
STAT_NETDEV(rx_errors),
STAT_NETDEV(tx_errors),
STAT_NETDEV(rx_dropped),
@@ -316,6 +316,7 @@ static inline bool bcm_sysport_lite_stat_valid(enum 
bcm_sysport_stat_type type)
 {
switch (type) {
case BCM_SYSPORT_STAT_NETDEV:
+   case BCM_SYSPORT_STAT_NETDEV64:
case BCM_SYSPORT_STAT_RXCHK:
case BCM_SYSPORT_STAT_RBUF:
case BCM_SYSPORT_STAT_SOFT:
@@ -398,6 +399,7 @@ static void bcm_sysport_update_mib_counters(struct 
bcm_sysport_priv *priv)
s = &bcm_sysport_gstrings_stats[i];
switch (s->type) {
case BCM_SYSPORT_STAT_NETDEV:
+   case BCM_SYSPORT_STAT_NETDEV64:
case BCM_SYSPORT_STAT_SOFT:
continue;
case BCM_SYSPORT_STAT_MIB_RX:
@@ -434,7 +436,10 @@ static void bcm_sysport_get_stats(struct net_device *dev,
  struct ethtool_stats *stats, u64 *data)
 {
struct bcm_sysport_priv *priv = netdev_priv(dev);
+   struct bcm_sysport_stats64 *stats64 = &priv->stats64;
+   struct u64_stats_sync *syncp = &priv->syncp;
struct bcm_sysport_tx_ring *ring;
+   unsigned int start;
int i, j;
 
if (netif_running(dev))
@@ -447,10 +452,20 @@ static void bcm_sysport_get_stats(struct net_device *dev,
s = &bcm_sysport_gstrings_stats[i];
if (s->type == BCM_SYSPORT_STAT_NETDEV)
p = (char *)&dev->stats;
+   else if (s->type == BCM_SYSPORT_STAT_NETDEV64)
+   p = (char *)stats64;
else
p = (char *)priv;
+
p += s->stat_offset;
-   data[j] = *(unsigned long *)p;
+
+   if (s->stat_sizeof == sizeof(u64))
+   do {
+   start = u64_stats_fetch_begin_irq(syncp);
+   data[i] = *(u64 *)p;
+   } while (u64_stats_fetch_retry_irq(syncp, start));
+   else
+   data[i] = *(u32 *)p;
j++;
}
 
@@ -662,6 +677,7 @@ static int bcm_sysport_alloc_rx_bufs(struct 
bcm_sysport_priv *priv)
 static unsigned int bcm_sysport_desc_rx(struct bcm_sysport_priv *priv,
unsigned int budget)
 {
+   struct bcm_sysport_stats64 *stats64 = &priv->stats64;
struct net_device *ndev = priv->netdev;
unsigned int processed = 0, to_process;
struct bcm_sysport_cb *cb;
@@ -765,6 +781,10 @@ static unsigned int bcm_sysport_desc_rx(struct 
bcm_sysport_priv *priv,
skb->protocol = eth_type_trans(skb, ndev);
ndev->stats.rx_packets++;
ndev->stats.rx_bytes += len;
+   u64_stats_update_begin(&priv->syncp);
+   stats64->rx_packets++;
+   stats64->rx_bytes += len;
+   u64_stats_update_end(&priv->syncp);
 
napi_gro_receive(&priv->napi, skb);
 next:
@@ -787,17 +807,15 @@ static void bcm_sysport_tx_reclaim_one(struct 
bcm_sysport_tx_ring *ring,
struct device *kdev = &priv->pdev->dev;
 
if (cb->skb) {
-   ring->bytes += cb->skb->len;
*bytes_compl += cb->skb->len;
dma_unmap_single(kdev, dma_unmap_addr(cb, dma_addr),
 dma_unmap_len(cb, dma_len),
 DMA_TO_DEVICE);
-   ring->packets++;
(*pkts_compl)++;

Re: [PATCH iproute2] netns: make /var/run/netns bind-mount recursive

2017-08-03 Thread Stephen Hemminger
On Tue,  1 Aug 2017 17:46:09 +0200
Casey Callendrello  wrote:

> When ip netns {add|delete} is first run, it bind-mounts /var/run/netns
> on top of itself, then marks it as shared. However, if there are already
> bind-mounts in the directory from other tools, these would not be
> propagated. Fix this by recursively bind-mounting.
> 
> Signed-off-by: Casey Callendrello 
> ---
>  ip/ipnetns.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Looks good, but I want a review by Eric to make sure this doesn't break other 
things.


Re: [iproute PATCH] tc-simple: Fix documentation

2017-08-03 Thread Stephen Hemminger
On Thu,  3 Aug 2017 17:00:51 +0200
Phil Sutter  wrote:

> - CONTROL has to come last, otherwise 'index' applies to gact and not
>   simple itself.
> - Man page wasn't updated to reflect syntax changes.
> 
> Signed-off-by: Phil Sutter 

Applied, thanks Phil


Re: [iproute PATCH] Really fix get_addr() and get_prefix() error messages

2017-08-03 Thread Stephen Hemminger
On Tue,  1 Aug 2017 18:36:11 +0200
Phil Sutter  wrote:

> Both functions take the desired address family as a parameter. So using
> that to notify the user what address family was expected is correct,
> unlike using dst->family which will tell the user only what address
> family was specified.
> 
> The situation which commit 334af76143368 tried to fix was when 'ip'
> would accept addresses from multiple families. In that case, the family
> parameter is set to AF_UNSPEC so that get_addr_1() may accept any valid
> address.
> 
> This patch introduces a wrapper around family_name() which returns the
> string "any valid" for AF_UNSPEC instead of the three question marks
> unsuitable for use in error messages.
> 
> Tests for AF_UNSPEC:
> 
> | # ip a a 256.10.166.1/24 dev d0
> | Error: any valid prefix is expected rather than "256.10.166.1/24".
> 
> | # ip neighbor add proxy 2001:db8::g dev d0
> | Error: any valid address is expected rather than "2001:db8::g".
> 
> Tests for explicit address family:
> 
> | # ip -6 addrlabel add prefix 1.1.1.1/24 label 123
> | Error: inet6 prefix is expected rather than "1.1.1.1/24".
> 
> | # ip -4 addrlabel add prefix dead:beef::1/24 label 123
> | Error: inet prefix is expected rather than "dead:beef::1/24".
> 
> Reported-by: Jaroslav Aster 
> Fixes: 334af76143368 ("fix get_addr() and get_prefix() error messages")
> Signed-off-by: Phil Sutter 

Moderately more helpful, so sure.


Re: [iproute PATCH] bpf: Make bytecode-file reading a little more robust

2017-08-03 Thread Stephen Hemminger
On Wed,  2 Aug 2017 14:57:56 +0200
Phil Sutter  wrote:

> bpf_parse_string() will now correctly handle:
> 
> - Extraneous whitespace,
> - OPs on multiple lines and
> - overlong file names.
> 
> The added feature of allowing OPs on multiple lines (like e.g.
> tcpdump prints them) is rather a side effect of fixing detection of
> malformed bytecode files having random content on a second line, like
> e.g.:
> 
> | 4,40 0 0 12,21 0 1 2048,6 0 0 262144,6 0 0 0
> | foobar
> 
> Cc: Daniel Borkmann 
> Signed-off-by: Phil Sutter 

Looks good, applied.


Re: [patch net-next 00/21] mlxsw: Support for IPv6 UC router

2017-08-03 Thread David Ahern
On 8/3/17 4:41 PM, David Miller wrote:
> But unlike the percpu flag, don't we want to somehow propagate offload
> state to the user?

It's a per nexthop flag. For IPv4 it is tracked in fib_nh.nh_flags.
Perhaps it is time for rt6_info to have nh_flags as well.

> 
> I'm sure whatever we decide Jiri will send an appropriate follow-up
> so don't panic :)
> 



Re: [PATCH v4] ss: Enclose IPv6 address in brackets

2017-08-03 Thread Stephen Hemminger
On Tue, 1 Aug 2017 18:54:33 +0200
Florian Lehner  wrote:

> - if (a->family == AF_INET) {
> - if (a->data[0] == 0) {
> + if (a->data[0] == 0) {
>   buf[0] = '*';
>   buf[1] = 0;

This won't work right with IPv6: you need to check that the whole
address is 0, not just a->data[0].


Re: [iproute PATCH] iplink: Notify user if EEXIST error might be spurious

2017-08-03 Thread Stephen Hemminger
On Tue,  1 Aug 2017 19:27:47 +0200
Phil Sutter  wrote:

> Back in the days when RTM_NEWLINK wasn't yet implemented, people had to
> rely upon kernel modules to create (virtual) interfaces for them. The
> number of those was usually defined via module parameter, and a sane
> default value was chosen. Now that iproute2 allows users to instantiate
> new interfaces at will, this is no longer required - though for
> backwards compatibility reasons, we're stuck with both methods which
> collide at the point when one tries to create an interface with a
> standard name for a type which exists in a kernel module: The kernel
> will load the module, which instantiates the interface and the following
> RTM_NEWLINK request will fail since the interface exists already. For
> instance:
> 
> | # lsmod | grep dummy
> | # ip link show | grep dummy0
> | # ip link add dummy0 type dummy
> | RTNETLINK answers: File exists
> | # ip link show | grep -c dummy0
> | 1
> 
> There is no race-free solution in userspace for this dilemma as far as I
> can tell, so try to detect whether a user might have run into this and
> notify that the given error message might be irrelevant.
> 
> Signed-off-by: Phil Sutter 

There is already a workable solution: there are module parameters to
block autocreation in the bonding and dummy network devices. The
others should have them as well.

This patch just seems like creating more clutter.


Re: [patch net-next 00/21] mlxsw: Support for IPv6 UC router

2017-08-03 Thread David Miller
From: David Ahern 
Date: Thu, 3 Aug 2017 16:39:54 -0600

> On 8/3/17 4:36 PM, David Miller wrote:
>> From: Jiri Pirko 
>> Date: Thu,  3 Aug 2017 13:28:10 +0200
>> 
>>> This set adds support for IPv6 unicast routes offload.
>> 
>> Series applied, thanks.
>> 
> 
> I take it you disagree with my comment on patch 10 about the RTF_OFFLOAD
> flag? that is a nexthop flag and has no business being part of the UAPI
> for IPv6.

Oh crap, I missed that.

But unlike the percpu flag, don't we want to somehow propagate offload
state to the user?

I'm sure whatever we decide Jiri will send and appropriate follow-up
so don't panic :)


Re: [patch net-next 00/21] mlxsw: Support for IPv6 UC router

2017-08-03 Thread David Ahern
On 8/3/17 4:36 PM, David Miller wrote:
> From: Jiri Pirko 
> Date: Thu,  3 Aug 2017 13:28:10 +0200
> 
>> This set adds support for IPv6 unicast routes offload.
> 
> Series applied, thanks.
> 

I take it you disagree with my comment on patch 10 about the RTF_OFFLOAD
flag? that is a nexthop flag and has no business being part of the UAPI
for IPv6.


Re: [PATCH net-next v2 00/13] Change DSA's FDB API and perform switchdev cleanup

2017-08-03 Thread Arkadi Sharshevsky

[...]

>> Now we have the "offload" read only flag, which is good to inform about
>> a successfully programmed hardware, but adds another level of complexity
>> to understand the interaction with the hardware.
>>
>> I think iproute2 is getting more and more confusing. From what I
>> understood, respecting the "self" flag as described is not possible
>> anymore due to some retro-compatibility reasons.
>>
>> Also Linux must use the hardware as an accelerator (so "self" or
>> "offload" must be the default), and always fall back to software
>> otherwise, hence "master" does not make sense here.
>>
>> What do you think about this synopsis for bridge fdb add?
>>
>> # bridge fdb add LLADDR dev DEV [ offload { on | off } ]
>>
>> Where offload defaults to "on". This option should also be ported to
>> other offloaded features like MDB and VLAN. Even though this is a bit
>> out of scope of this patchset, do you think this is feasible?
>>
> 
> I agree completely that currently it's confusing. The documentation
> should be updated for sure. I think that 'self' was primarily introduced
> (commit 77162022a) for NIC embedded switches which are used for sriov; in
> that case self relates to the internal eswitch, which completely
> diverges from the software one (clearly not switchdev).
> 
> IMHO For switchdev devices 'self' should not be an option at all, or any
> other arg regarding hardware. Furthermore, the 'offload' flag should be
> only relevant during the dump as an indication to the user.
> 
> Unfortunately, the lack of ability to sync the sw with hw in DSA's
> case introduces a problem for indicating that the entries are only
> in hw; marking them only as offloaded is not enough.

Hi,

It seems impossible currently to make self the default, and that
introduces a regression which you don't approve, so few options seem
left:

a) Leave two ways to add an fdb: through the bridge (by using the
   master flag), which is introduced in this patchset, and by using
   self, which is the legacy way. This introduces no regression,
   yet it feels a bit confusing. The benefit is that we (DSA/mlxsw)
   will be synced.
b) Leave only the self (which means removing patch no 4,5).

In both cases the switchdev implementation of .ndo_fdb_add() will be
moved inside DSA in a similar way to the dump, because it's only used
by you.

Option b) actually turns this patchset into a cosmetic one which does
only cleanup.

Thanks,
Arkadi






Re: [PATCH v2 net 0/3] tcp: fix xmit timer rearming to avoid stalls

2017-08-03 Thread David Miller
From: Neal Cardwell 
Date: Thu,  3 Aug 2017 09:19:51 -0400

> This patch series is a bug fix for a TCP loss recovery performance bug
> reported independently in recent netdev threads:
> 
>  (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
>  (ii) July 26, 2017: netdev thread:
>"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
>outstanding TLP retransmission"
> 
> Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports,
> traces, and packetdrill test cases, which enabled us to root-cause
> this issue and verify the fix.
> 
> - v1 -> v2:
>  - In patch 2/3, changed an unclear comment in the pre-existing code
>in tcp_schedule_loss_probe() to be more clear (thanks to Eric Dumazet
>for suggesting we improve this).

Series applied, thanks Neal.


Re: [patch net-next 00/21] mlxsw: Support for IPv6 UC router

2017-08-03 Thread David Miller
From: Jiri Pirko 
Date: Thu,  3 Aug 2017 13:28:10 +0200

> This set adds support for IPv6 unicast routes offload.

Series applied, thanks.


Re: [PATCH V3 net-next 03/21] net-next/hinic: Initialize api cmd resources

2017-08-03 Thread David Miller
From: Aviad Krawczyk 
Date: Thu, 3 Aug 2017 17:54:09 +0800

> +static int alloc_cmd_buf(struct hinic_api_cmd_chain *chain,
> +  struct hinic_api_cmd_cell *cell, int cell_idx)
> +{
> + struct hinic_hwif *hwif = chain->hwif;
> + struct pci_dev *pdev = hwif->pdev;
> + struct hinic_api_cmd_cell_ctxt *cell_ctxt;
> + dma_addr_t cmd_paddr;
> + u8 *cmd_vaddr;
> + int err = 0;

Order local variables from longest to shortest line.

> +static int api_cmd_create_cell(struct hinic_api_cmd_chain *chain,
> +int cell_idx,
> +struct hinic_api_cmd_cell *pre_node,
> +struct hinic_api_cmd_cell **node_vaddr)
> +{
> + struct hinic_hwif *hwif = chain->hwif;
> + struct pci_dev *pdev = hwif->pdev;
> + struct hinic_api_cmd_cell_ctxt *cell_ctxt;
> + struct hinic_api_cmd_cell *node;
> + dma_addr_t node_paddr;
> + int err;

Likewise.
> +static void api_cmd_destroy_cell(struct hinic_api_cmd_chain *chain,
> +  int cell_idx)
> +{
> + struct hinic_hwif *hwif = chain->hwif;
> + struct pci_dev *pdev = hwif->pdev;
> + struct hinic_api_cmd_cell_ctxt *cell_ctxt;
> + struct hinic_api_cmd_cell *node;
> + dma_addr_t node_paddr;
> + size_t node_size;

Likewise.

etc. etc. etc.

Please audit your entire submission for this problem.

Thanks.


Re: [PATCH V3 net-next 02/21] net-next/hinic: Initialize hw device components

2017-08-03 Thread David Miller
From: Aviad Krawczyk 
Date: Thu, 3 Aug 2017 17:54:08 +0800

> +static int get_capability(struct hinic_hwdev *hwdev,
> +   struct hinic_dev_cap *dev_cap)
> +{
> + struct hinic_hwif *hwif = hwdev->hwif;
> + struct hinic_cap *nic_cap = &hwdev->nic_cap;
> + int num_aeqs, num_ceqs, num_irqs, num_qps;

Please order local variable declarations from longest to shortest
line (aka: reverse christmas tree order).

Move the initialization down into the code if that is necessary
to achieve this.

> +static int get_dev_cap(struct hinic_hwdev *hwdev)
> +{
> + struct hinic_pfhwdev *pfhwdev;
> + struct hinic_hwif *hwif = hwdev->hwif;
> + struct pci_dev *pdev = hwif->pdev;
> + int err;

Likewise.


Re: [PATCH net-next v2 0/7] net: mvpp2: add TX interrupts support

2017-08-03 Thread David Miller
From: Thomas Petazzoni 
Date: Thu,  3 Aug 2017 10:41:54 +0200

> So far, the mvpp2 driver was using an hrtimer to handle TX
> completion. This patch series adds support for using TX interrupts
> (for each CPU) on PPv2.2, the variant of the IP used on Marvell Armada
> 7K/8K.
> 
> Dave: this version can be applied right away, it no longer depends on
> Antoine's patch series. Antoine series had some comments, so he will
> have to respin later on. Therefore, let's merge this smaller patch
> series first.
> 
> Changes since v1:
> 
>  - Rebased on top of net-next, instead of on top of Antoine's series.
> 
>  - Removed the Device Tree patch, as it shouldn't go through the net
>tree.

Series applied.


Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-08-03 Thread Alex Williamson
On Thu, 3 Aug 2017 16:49:11 -0500
Bjorn Helgaas  wrote:

> On Thu, Jul 20, 2017 at 02:41:01PM -0700, Roland Dreier wrote:
> > From: Roland Dreier 
> > 
> > Add one more variant of the 82599 plus the device IDs for X540 and X550
> > variants.  Intel has confirmed that none of these devices does peer-to-peer
> > between functions.  The X540 and X550 have added ACS capabilities in their
> > PCI config space, but the ACS control register is hard-wired to 0 for both
> > devices, so we still need the quirk for IOMMU grouping to allow assignment
> > of individual SR-IOV functions.
> > 
> > Signed-off-by: Roland Dreier   
> 
> I haven't seen a real conclusion to the discussion yet, so I'm waiting on
> that and hopefully an ack from Alex.  Can you please repost with that,
> since I'm dropping it from patchwork for now?

I think the conclusion is that a hard-wired ACS capability is a
positive indication of isolation for a multifunction device, the code
is intended to support this and appears to do so, and Roland was going
to investigate the sightings that inspired this patch in more detail.
Dropping for now is appropriate.  Thanks,

Alex

> > ---
> >  drivers/pci/quirks.c | 21 +
> >  1 file changed, 21 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 6967c6b4cf6b..b939db671326 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4335,12 +4335,33 @@ static const struct pci_dev_acs_enabled {
> > { PCI_VENDOR_ID_INTEL, 0x1507, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x1514, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x151C, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x1528, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x1529, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x154A, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x152A, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x154D, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x154F, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x1551, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x1558, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x1560, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x1563, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15AA, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15AC, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15AD, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15AE, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15B0, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15C2, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15C3, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15C4, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15C6, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15C7, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15C8, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15CE, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15E4, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15E5, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x15D1, pci_quirk_mf_endpoint_acs },
> > /* 82580 */
> > { PCI_VENDOR_ID_INTEL, 0x1509, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x150E, pci_quirk_mf_endpoint_acs },
> > -- 
> > 2.11.0
> >   



Re: [PATCH] net: arc_emac: Add support for ndo_do_ioctl net_device_ops operation

2017-08-03 Thread David Miller
From: Romain Perier 
Date: Thu,  3 Aug 2017 09:49:03 +0200

> This operation is required for handling ioctl commands like SIOCGMIIREG,
> when debugging MDIO registers from userspace.
> 
> This commit adds support for this operation.
> 
> Signed-off-by: Romain Perier 

Applied, thanks.


Re: [PATCH net] ipv6: set rt6i_protocol properly in the route when it is installed

2017-08-03 Thread David Miller
From: Xin Long 
Date: Thu,  3 Aug 2017 14:13:46 +0800

> After commit c2ed1880fd61 ("net: ipv6: check route protocol when
> deleting routes"), ipv6 route checks rt protocol when trying to
> remove a rt entry.
> 
> It introduced a side effect causing 'ip -6 route flush cache' not
> to work well. When flushing caches with iproute, all route caches
> get dumped from kernel then removed one by one by sending DELROUTE
> requests to kernel for each cache.
> 
> The thing is iproute sends the request with the cache whose proto
> is set with RTPROT_REDIRECT by rt6_fill_node() when kernel dumps
> it. But in kernel the rt_cache protocol is still 0, which causes
> the cache not to be matched and removed.
> 
> So the real reason is rt6i_protocol in the route is not set when
> it is allocated. As David Ahern's suggestion, this patch is to
> set rt6i_protocol properly in the route when it is installed and
> remove the codes setting rtm_protocol according to rt6i_flags in
> rt6_fill_node.
> 
> This is also an improvement to keep rt6i_protocol consistent with
> rtm_protocol.
> 
> Fixes: c2ed1880fd61 ("net: ipv6: check route protocol when deleting routes")
> Reported-by: Jianlin Shi 
> Suggested-by: David Ahern 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks.


Re: [PATCH V6 net-next 0/8] Hisilicon Network Subsystem 3 Ethernet Driver

2017-08-03 Thread David Miller
From: Salil Mehta 
Date: Wed, 2 Aug 2017 16:59:44 +0100

> This patch-set contains the support of the HNS3 (Hisilicon Network
> Subsystem 3) Ethernet driver for hip08 family of SoCs and future
> upcoming SoCs.
 ...

Series applied, thanks.


[PATCH net-next] liquidio: moved console_bitmask module param to lio_main.c

2017-08-03 Thread Felix Manlunas
From: Intiyaz Basha 

Moving PF module param console_bitmask to lio_main.c for consistency.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c   | 15 +++
 drivers/net/ethernet/cavium/liquidio/octeon_console.c | 14 --
 drivers/net/ethernet/cavium/liquidio/octeon_device.h  |  2 ++
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 39a8dca..8c2cd80 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -59,6 +59,21 @@ static char fw_type[LIO_MAX_FW_TYPE_LEN];
 module_param_string(fw_type, fw_type, sizeof(fw_type), );
 MODULE_PARM_DESC(fw_type, "Type of firmware to be loaded. Default \"nic\"");
 
+static u32 console_bitmask;
+module_param(console_bitmask, int, 0644);
+MODULE_PARM_DESC(console_bitmask,
+"Bitmask indicating which consoles have debug output 
redirected to syslog.");
+
+/**
+ * \brief determines if a given console has debug enabled.
+ * @param console console to check
+ * @returns  1 = enabled. 0 otherwise
+ */
+int octeon_console_debug_enabled(u32 console)
+{
+   return (console_bitmask >> (console)) & 0x1;
+}
+
 static int ptp_enable = 1;
 
 /* Polling interval for determining when NIC application is alive */
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_console.c 
b/drivers/net/ethernet/cavium/liquidio/octeon_console.c
index 15ad1ab..dd0efc9 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_console.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_console.c
@@ -37,10 +37,6 @@ static u64 cvmx_bootmem_phy_named_block_find(struct 
octeon_device *oct,
 u32 flags);
 static int octeon_console_read(struct octeon_device *oct, u32 console_num,
   char *buffer, u32 buf_size);
-static u32 console_bitmask;
-module_param(console_bitmask, int, 0644);
-MODULE_PARM_DESC(console_bitmask,
-"Bitmask indicating which consoles have debug output 
redirected to syslog.");
 
 #define BOOTLOADER_PCI_READ_BUFFER_DATA_ADDR0x0006c008
 #define BOOTLOADER_PCI_READ_BUFFER_LEN_ADDR 0x0006c004
@@ -136,16 +132,6 @@ struct octeon_pci_console_desc {
 };
 
 /**
- * \brief determines if a given console has debug enabled.
- * @param console console to check
- * @returns  1 = enabled. 0 otherwise
- */
-static int octeon_console_debug_enabled(u32 console)
-{
-   return (console_bitmask >> (console)) & 0x1;
-}
-
-/**
  * This function is the implementation of the get macros defined
  * for individual structure members. The argument are generated
  * by the macros inorder to read only the needed memory.
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.h 
b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
index ad46478..31efdef 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
@@ -739,6 +739,8 @@ int octeon_wait_for_bootloader(struct octeon_device *oct,
  */
 int octeon_init_consoles(struct octeon_device *oct);
 
+int octeon_console_debug_enabled(u32 console);
+
 /**
  * Adds access to a console to the device.
  *


Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-08-03 Thread Bjorn Helgaas
On Thu, Jul 20, 2017 at 02:41:01PM -0700, Roland Dreier wrote:
> From: Roland Dreier 
> 
> Add one more variant of the 82599 plus the device IDs for X540 and X550
> variants.  Intel has confirmed that none of these devices does peer-to-peer
> between functions.  The X540 and X550 have added ACS capabilities in their
> PCI config space, but the ACS control register is hard-wired to 0 for both
> devices, so we still need the quirk for IOMMU grouping to allow assignment
> of individual SR-IOV functions.
> 
> Signed-off-by: Roland Dreier 

I haven't seen a real conclusion to the discussion yet, so I'm waiting on
that and hopefully an ack from Alex.  Can you please repost with that,
since I'm dropping it from patchwork for now?

> ---
>  drivers/pci/quirks.c | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6967c6b4cf6b..b939db671326 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4335,12 +4335,33 @@ static const struct pci_dev_acs_enabled {
>   { PCI_VENDOR_ID_INTEL, 0x1507, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1514, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x151C, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x1528, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1529, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x154A, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x152A, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x154D, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x154F, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1551, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1558, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x1560, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x1563, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AA, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AC, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AD, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AE, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15B0, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C2, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C3, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C4, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C6, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C7, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C8, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15CE, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15E4, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15E5, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15D1, pci_quirk_mf_endpoint_acs },
>   /* 82580 */
>   { PCI_VENDOR_ID_INTEL, 0x1509, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x150E, pci_quirk_mf_endpoint_acs },
> -- 
> 2.11.0
> 


Re: [PATCH net-next 1/3] proto_ops: Add locked held versions of sendmsg and sendpage

2017-08-03 Thread Tom Herbert
On Thu, Aug 3, 2017 at 1:21 PM, John Fastabend  wrote:
> On 07/28/2017 04:22 PM, Tom Herbert wrote:
>> Add new proto_ops sendmsg_locked and sendpage_locked that can be
>> called when the socket lock is already held. Correspondingly, add
>> kernel_sendmsg_locked and kernel_sendpage_locked as front end
>> functions.
>>
>> These functions will be used in zero proxy so that we can take
>> the socket lock in a ULP sendmsg/sendpage and then directly call the
>> backend transport proto_ops functions.
>>
>
> [...]
>
>>
>> +int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
>> +size_t size, int flags)
>> +{
>> + struct socket *sock = sk->sk_socket;
>> +
>> + if (sock->ops->sendpage_locked)
>> + return sock->ops->sendpage_locked(sk, page, offset, size,
>> +   flags);
>> +
>> + return sock_no_sendpage_locked(sk, page, offset, size, flags);
>> +}
>
> How about just returning EOPNOTSUPP here and force implementations to do both
> sendmsg and sendpage. The only implementation of these callbacks already does
> this. And if its any other socket it will just wind its way through a few
> layers of calls before returning EOPNOTSUPP.
>
Seems reasonable, but we should probably make the same change to
kernel_sendpage to be consistent.

Tom

> .John


[PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets

2017-08-03 Thread Davide Caratti
if the NIC fails to validate the checksum on TCP/UDP, and validation of IP
checksum is successful, the driver subtracts the pseudo-header checksum
from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't
do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails.

V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set

Reported-by: Shuang Li 
Fixes: f8c6455bb04b ("net/mlx4_en: Extend checksum offloading by CHECKSUM 
COMPLETE")
Signed-off-by: Davide Caratti 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 436f768..bf16380 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -574,16 +574,21 @@ static inline __wsum get_fixed_vlan_csum(__wsum 
hw_checksum,
  * header, the HW adds it. To address that, we are subtracting the pseudo
  * header checksum from the checksum value provided by the HW.
  */
-static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
-   struct iphdr *iph)
+static int get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
+  struct iphdr *iph)
 {
__u16 length_for_csum = 0;
__wsum csum_pseudo_header = 0;
+   __u8 ipproto = iph->protocol;
+
+   if (unlikely(ipproto == IPPROTO_SCTP))
+   return -1;
 
length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
-   length_for_csum, iph->protocol, 
0);
+   length_for_csum, ipproto, 0);
skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
+   return 0;
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
@@ -594,17 +599,20 @@ static void get_fixed_ipv4_csum(__wsum hw_checksum, 
struct sk_buff *skb,
 static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
   struct ipv6hdr *ipv6h)
 {
+   __u8 nexthdr = ipv6h->nexthdr;
__wsum csum_pseudo_hdr = 0;
 
-   if (unlikely(ipv6h->nexthdr == IPPROTO_FRAGMENT ||
-ipv6h->nexthdr == IPPROTO_HOPOPTS))
+   if (unlikely(nexthdr == IPPROTO_FRAGMENT ||
+nexthdr == IPPROTO_HOPOPTS ||
+nexthdr == IPPROTO_SCTP))
return -1;
-   hw_checksum = csum_add(hw_checksum, (__force 
__wsum)htons(ipv6h->nexthdr));
+   hw_checksum = csum_add(hw_checksum, (__force __wsum)htons(nexthdr));
 
csum_pseudo_hdr = csum_partial(&ipv6h->saddr,
   sizeof(ipv6h->saddr) + 
sizeof(ipv6h->daddr), 0);
csum_pseudo_hdr = csum_add(csum_pseudo_hdr, (__force 
__wsum)ipv6h->payload_len);
-   csum_pseudo_hdr = csum_add(csum_pseudo_hdr, (__force 
__wsum)ntohs(ipv6h->nexthdr));
+   csum_pseudo_hdr = csum_add(csum_pseudo_hdr,
+  (__force __wsum)htons(nexthdr));
 
skb->csum = csum_sub(hw_checksum, csum_pseudo_hdr);
skb->csum = csum_add(skb->csum, csum_partial(ipv6h, sizeof(struct 
ipv6hdr), 0));
@@ -627,11 +635,10 @@ static int check_csum(struct mlx4_cqe *cqe, struct 
sk_buff *skb, void *va,
}
 
if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4))
-   get_fixed_ipv4_csum(hw_checksum, skb, hdr);
+   return get_fixed_ipv4_csum(hw_checksum, skb, hdr);
 #if IS_ENABLED(CONFIG_IPV6)
-   else if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV6))
-   if (unlikely(get_fixed_ipv6_csum(hw_checksum, skb, hdr)))
-   return -1;
+   if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV6))
+   return get_fixed_ipv6_csum(hw_checksum, skb, hdr);
 #endif
return 0;
 }
-- 
2.9.4



[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
Implement the feature for TCP initially, as large writes benefit
most.

On a send call with MSG_ZEROCOPY, the kernel pins user pages and
links these directly into the skbuff frags[] array.

Each send call with MSG_ZEROCOPY that transmits data will eventually
queue a completion notification on the error queue: a per-socket u32
incremented on each such call. A request may have to revert to copy
to succeed, for instance when a device cannot support scatter-gather
IO. In that case a flag is passed along to notify that the operation
succeeded without zerocopy optimization.

The implementation extends the existing zerocopy infra for tuntap,
vhost and xen with features needed for TCP, notably reference
counting to handle cloning on retransmit and GSO.

For more details, see also the netdev 2.1 paper and presentation at
https://netdevconf.org/2.1/session.html?debruijn

Changelog:

  v3 -> v4:
- dropped UDP, RAW and PF_PACKET for now
Without loopback support, datagrams are usually smaller than
the ~8KB size threshold needed to benefit from zerocopy.
- style: a few reverse christmas tree
- minor: SO_ZEROCOPY returns ENOTSUPP on unsupported protocols
- minor: squashed SO_EE_CODE_ZEROCOPY_COPIED patch
- minor: rebased on top of net-next with kmap_atomic fix

  v2 -> v3:
- fix rebase conflict: SO_ZEROCOPY 59 -> 60

  v1 -> v2:
- fix (kbuild-bot): do not remove uarg until patch 5
- fix (kbuild-bot): move zerocopy_sg_from_iter doc with function
- fix: remove unused extern in header file

  RFCv2 -> v1:
- patch 2
- review comment: in skb_copy_ubufs, always allocate order-0
page, also when replacing compound source pages.
- patch 3
- fix: always queue completion notification on MSG_ZEROCOPY,
also if revert to copy.
- fix: on syscall abort, correctly revert notification state
- minor: skip queue notification on SOCK_DEAD
- minor: replace BUG_ON with WARN_ON in recoverable error
- patch 4
- new: add socket option SOCK_ZEROCOPY.
only honor MSG_ZEROCOPY if set, ignore for legacy apps.
- patch 5
- fix: clear zerocopy state on skb_linearize
- patch 6
- fix: only coalesce if prev errqueue elem is zerocopy
- minor: try coalescing with list tail instead of head
- minor: merge bytelen limit patch
- patch 7
- new: signal when data had to be copied
- patch 8 (tcp)
- optimize: avoid setting PSH bit when exceeding max frags.
that limits GRO on the client. do not goto new_segment.
- fix: fail on MSG_ZEROCOPY | MSG_FASTOPEN
- minor: do not wait for memory: does not work for optmem
- minor: simplify alloc
- patch 9 (udp)
- new: add PF_INET6
- fix: attach zerocopy notification even if revert to copy
- minor: simplify alloc size arithmetic
- patch 10 (raw hdrinc)
- new: add PF_INET6
- patch 11 (pf_packet)
- minor: simplify slightly
- patch 12
- new msg_zerocopy regression test: use veth pair to test
all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork
all relevant ethtool settings: rx off, sg off
all relevant packet lengths: 0, ...

  RFCv2:
- review comment: do not loop skb with zerocopy frags onto rx:
  add skb_orphan_frags_rx to orphan even refcounted frags
  call this in __netif_receive_skb_core, deliver_skb and tun:
  same as commit 1080e512d44d ("net: orphan frags on receive")
- fix: hold an explicit sk reference on each notification skb.
  previously relied on the reference (or wmem) held by the
  data skb that would trigger notification, but this breaks
  on skb_orphan.
- fix: when aborting a send, do not inc the zerocopy counter
  this caused gaps in the notification chain
- fix: in packet with SOCK_DGRAM, pull ll headers before calling
  zerocopy_sg_from_iter
- fix: if sock_zerocopy_realloc does not allow coalescing,
  do not fail, just allocate a new ubuf
- fix: in tcp, check return value of second allocation attempt
- chg: allocate notification skbs from optmem
  to avoid affecting tcp write queue accounting (TSQ)
- chg: limit #locked pages (ulimit) per user instead of per process
- chg: grow notification ids from 16 to 32 bit
  - pass range [lo, hi] through 32 bit fields ee_info and ee_data
- chg: rebased to davem-net-next on top of v4.10-rc7
- add: limit notification coalescing
  sharing ubufs limits overhead, but delays notification until
  the last packet is released, possibly unbounded. Add a cap. 
- tests: add snd_zerocopy_lo pf_packet test
- tests: two bugfixes (add do_flush_tcp

[PATCH net-next v4 3/9] sock: add MSG_ZEROCOPY

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h|  60 +++
 include/linux/socket.h|   1 +
 include/net/sock.h|   2 +
 include/uapi/linux/errqueue.h |   3 +
 net/core/datagram.c   |  55 ++---
 net/core/skbuff.c | 133 ++
 net/core/sock.c   |   2 +
 7 files changed, 235 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2f64e2bbb592..59cff7aa494e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -429,6 +429,7 @@ enum {
SKBTX_SCHED_TSTAMP = 1 << 6,
 };
 
+#define SKBTX_ZEROCOPY_FRAG(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP(SKBTX_SW_TSTAMP| \
 SKBTX_SCHED_TSTAMP)
 #define SKBTX_ANY_TSTAMP   (SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP)
@@ -445,8 +446,28 @@ struct ubuf_info {
void (*callback)(struct ubuf_info *, bool zerocopy_success);
void *ctx;
unsigned long desc;
+   u16 zerocopy:1;
+   atomic_t refcnt;
 };
 
+#define skb_uarg(SKB)  ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
+
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+
+static inline void sock_zerocopy_get(struct ubuf_info *uarg)
+{
+   atomic_inc(&uarg->refcnt);
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
+
+int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
+struct msghdr *msg, int len,
+struct ubuf_info *uarg);
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -1214,6 +1235,45 @@ static inline struct skb_shared_hwtstamps 
*skb_hwtstamps(struct sk_buff *skb)
return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
+{
+   bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
+
+   return is_zcopy ? skb_uarg(skb) : NULL;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+   if (skb && uarg && !skb_zcopy(skb)) {
+   sock_zerocopy_get(uarg);
+   skb_shinfo(skb)->destructor_arg = uarg;
+   skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+   }
+}
+
+/* Release a reference on a zerocopy structure */
+static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy)
+{
+   struct ubuf_info *uarg = skb_zcopy(skb);
+
+   if (uarg) {
+   uarg->zerocopy = uarg->zerocopy && zerocopy;
+   sock_zerocopy_put(uarg);
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
+   }
+}
+
+/* Abort a zerocopy operation and revert zckey on error in send syscall */
+static inline void skb_zcopy_abort(struct sk_buff *skb)
+{
+   struct ubuf_info *uarg = skb_zcopy(skb);
+
+   if (uarg) {
+   sock_zerocopy_put_abort(uarg);
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
+   }
+}
+
 /**
  * skb_queue_empty - check if a queue is empty
  * @list: queue head
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8b13db5163cc..8ad963cdc88c 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -287,6 +287,7 @@ struct ucred {
 #define MSG_BATCH  0x4 /* sendmmsg(): more messages coming */
 #define MSG_EOF MSG_FIN
 
+#define MSG_ZEROCOPY   0x400   /* Use user data in kernel path */
 #define MSG_FASTOPEN   0x2000  /* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x4000/* Set close_on_exec for file
   descriptor received through
diff --git a/include/net/sock.h b/include/net/sock.h
index 0f778d3c4300..fe1a0bc25cd3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -294,6 +294,7 @@ struct sock_common {
   *@sk_stamp: time stamp of last packet received
   *@sk_tsflags: SO_TIMESTAMPING socket options
   *@sk_tskey: counter to disambiguate concurrent tstamp requests
+  *@sk_zckey: counter to order MSG_ZEROCOPY notifications
   *@sk_socket: Identd and reporting IO signals
   *@sk_user_data: RPC layer private data
   *@sk_frag: cached page frag
@@ -46

[PATCH net-next v4 4/9] sock: add SOCK_ZEROCOPY sockopt

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

The send call ignores unknown flags. Legacy applications may already
unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
socket opts in to zerocopy.

Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
Processes can also query this socket option to detect kernel support
for the feature. Older kernels will return ENOPROTOOPT.

Signed-off-by: Willem de Bruijn 
---
 arch/alpha/include/uapi/asm/socket.h   |  2 ++
 arch/frv/include/uapi/asm/socket.h |  2 ++
 arch/ia64/include/uapi/asm/socket.h|  2 ++
 arch/m32r/include/uapi/asm/socket.h|  2 ++
 arch/mips/include/uapi/asm/socket.h|  2 ++
 arch/mn10300/include/uapi/asm/socket.h |  2 ++
 arch/parisc/include/uapi/asm/socket.h  |  2 ++
 arch/s390/include/uapi/asm/socket.h|  2 ++
 arch/sparc/include/uapi/asm/socket.h   |  2 ++
 arch/xtensa/include/uapi/asm/socket.h  |  2 ++
 include/uapi/asm-generic/socket.h  |  2 ++
 net/core/skbuff.c  |  3 +++
 net/core/sock.c| 18 ++
 13 files changed, 43 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 7b285dd4fe05..c6133a045352 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -109,4 +109,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY60
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index f1e3b20dce9f..9abf02d6855a 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -102,5 +102,7 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index 5dd5c5d0d642..002eb85a6941 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -111,4 +111,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index f8f7b47e247f..e268e51a38d1 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 882823bec153..6c755bc07975 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -120,4 +120,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index c710db354ff2..ac82a3f26dbf 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index a0d4dc9f4eb2..3b2bf7ae703b 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SO_PEERGROUPS  0x4034
 
+#define SO_ZEROCOPY	0x4035
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 52a63f4175cb..a56916c83565 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -108,4 +108,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index 186fd8199f54..b2f5c50d0947 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -98,6 +98,8 @@
 
 #define SO_PEERGROUPS  0x003d
 
+#define SO_ZEROCOPY	0x003e
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h 
b/arch/xtensa/include/uapi/asm/socket.h
index 3eed2761c149..22005e74 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -113,4 +113,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif /* _XTENSA_SOCKET_H */
diff --git a/include/uapi/asm-generic/socket.h 
b/include/uapi/asm-generic/socket.h
index 9861be8da65e..e47c9e436221 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -104,4 +104,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ZEROCOPY	60
+
 #endif

[PATCH net-next v4 1/9] sock: allocate skbs from optmem

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Add sock_omalloc and sock_ofree to be able to allocate control skbs,
for instance for looping errors onto sk_error_queue.

The transmit budget (sk_wmem_alloc) is involved in transmit skb
shaping, most notably in TCP Small Queues. Using this budget for
control packets would impact transmission.

Signed-off-by: Willem de Bruijn 
---
 include/net/sock.h |  2 ++
 net/core/sock.c| 27 +++
 2 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 393c38e9f6aa..0f778d3c4300 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1531,6 +1531,8 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned 
long size, int force,
 gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 742f68c9c84a..1261880bdcc8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1923,6 +1923,33 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned 
long size, int force,
 }
 EXPORT_SYMBOL(sock_wmalloc);
 
+static void sock_ofree(struct sk_buff *skb)
+{
+   struct sock *sk = skb->sk;
+
+   atomic_sub(skb->truesize, &sk->sk_omem_alloc);
+}
+
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+gfp_t priority)
+{
+   struct sk_buff *skb;
+
+   /* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
+   if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
+   sysctl_optmem_max)
+   return NULL;
+
+   skb = alloc_skb(size, priority);
+   if (!skb)
+   return NULL;
+
+   atomic_add(skb->truesize, &sk->sk_omem_alloc);
+   skb->sk = sk;
+   skb->destructor = sock_ofree;
+   return skb;
+}
+
 /*
  * Allocate a memory block from the socket's option memory buffer.
  */
-- 
2.14.0.rc1.383.gd1ce394fe2-goog



[PATCH net-next v4 8/9] tcp: enable MSG_ZEROCOPY

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are
both supported. Only data sent to remote destinations is sent without
copying. Packets looped onto a local destination have their payload
copied to avoid unbounded latency.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  msg_zerocopy.sh 4 tcp:

  without zerocopy
tx=121792 (7600 MB) txc=0 zc=n
rx=60458 (7600 MB)

  with zerocopy
tx=286257 (17863 MB) txc=286257 zc=y
rx=140022 (17863 MB)

  This test opens a pair of sockets over veth, on one calls send with
  64KB and optionally MSG_ZEROCOPY and on the other reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound on
  what is achievable. It is more representative of sending data out of
  a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/tcp.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9dd6f4dba9b1..71b25567e787 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1165,6 +1165,7 @@ static int tcp_sendmsg_fastopen(struct sock *sk, struct 
msghdr *msg,
 int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   struct ubuf_info *uarg = NULL;
struct sk_buff *skb;
struct sockcm_cookie sockc;
int flags, err, copied = 0;
@@ -1174,6 +1175,26 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
long timeo;
 
flags = msg->msg_flags;
+
+   if (flags & MSG_ZEROCOPY && size) {
+   if (sk->sk_state != TCP_ESTABLISHED) {
+   err = -EINVAL;
+   goto out_err;
+   }
+
+   skb = tcp_send_head(sk) ? tcp_write_queue_tail(sk) : NULL;
+   uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+   if (!uarg) {
+   err = -ENOBUFS;
+   goto out_err;
+   }
+
+   /* skb may be freed in main loop, keep extra ref on uarg */
+   sock_zerocopy_get(uarg);
+   if (!(sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG))
+   uarg->zerocopy = 0;
+   }
+
if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect)) {
err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size);
if (err == -EINPROGRESS && copied_syn > 0)
@@ -1297,7 +1318,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
err = skb_add_data_nocache(sk, skb, &msg->msg_iter, 
copy);
if (err)
goto do_fault;
-   } else {
+   } else if (!uarg || !uarg->zerocopy) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
@@ -1335,6 +1356,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
page_ref_inc(pfrag->page);
}
pfrag->offset += copy;
+   } else {
+   err = skb_zerocopy_iter_stream(sk, skb, msg, copy, 
uarg);
+   if (err == -EMSGSIZE || err == -EEXIST)
+   goto new_segment;
+   if (err < 0)
+   goto do_error;
+   copy = err;
}
 
if (!copied)
@@ -1381,6 +1409,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
}
 out_nopush:
+   sock_zerocopy_put(uarg);
return copied + copied_syn;
 
 do_fault:
@@ -1397,6 +1426,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
if (copied + copied_syn)
goto out;
 out_err:
+   sock_zerocopy_put_abort(uarg);
err = sk_stream_error(sk, flags, err);
/* make sure we wake any epoll edge trigger waiter */
if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
-- 
2.14.0.rc1.383.gd1ce394fe2-goog



[PATCH net-next v4 9/9] test: add msg_zerocopy test

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Introduce regression test for msg_zerocopy feature. Send traffic from
one process to another with and without zerocopy.

Evaluate tcp, udp, raw and packet sockets, including variants
- udp: corking and corking with mixed copy/zerocopy calls
- raw: with and without hdrincl
- packet: at both raw and dgram level

Test on both ipv4 and ipv6, optionally with ethtool changes to
disable scatter-gather, tx checksum or tso offload. All of these
can affect zerocopy behavior.

The regression test can be run on a single machine if over a veth
pair. Then skb_orphan_frags_rx must be modified to be identical to
skb_orphan_frags to allow forwarding zerocopy locally.

The msg_zerocopy.sh script will setup the veth pair in network
namespaces and run all tests.

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/.gitignore  |   1 +
 tools/testing/selftests/net/Makefile|   2 +-
 tools/testing/selftests/net/msg_zerocopy.c  | 697 
 tools/testing/selftests/net/msg_zerocopy.sh | 113 +
 4 files changed, 812 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/msg_zerocopy.c
 create mode 100755 tools/testing/selftests/net/msg_zerocopy.sh

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index afe109e5508a..9801253e4802 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -1,3 +1,4 @@
+msg_zerocopy
 socket
 psock_fanout
 psock_tpacket
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index f6c9dbf478f8..6135a8448900 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,7 +7,7 @@ TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh 
netdevice.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket
 TEST_GEN_FILES += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
-TEST_GEN_FILES += reuseport_dualstack
+TEST_GEN_FILES += reuseport_dualstack msg_zerocopy
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
new file mode 100644
index ..448c69a8af74
--- /dev/null
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -0,0 +1,697 @@
+/* Evaluate MSG_ZEROCOPY
+ *
+ * Send traffic between two processes over one of the supported
+ * protocols and modes:
+ *
+ * PF_INET/PF_INET6
+ * - SOCK_STREAM
+ * - SOCK_DGRAM
+ * - SOCK_DGRAM with UDP_CORK
+ * - SOCK_RAW
+ * - SOCK_RAW with IP_HDRINCL
+ *
+ * PF_PACKET
+ * - SOCK_DGRAM
+ * - SOCK_RAW
+ *
+ * Start this program on two connected hosts, one in send mode and
+ * the other with option '-r' to put it in receiver mode.
+ *
+ * If zerocopy mode ('-z') is enabled, the sender will verify that
+ * the kernel queues completions on the error queue for all zerocopy
+ * transfers.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef SO_EE_ORIGIN_ZEROCOPY
+#define SO_EE_ORIGIN_ZEROCOPY  SO_EE_ORIGIN_UPAGE
+#endif
+
+#ifndef SO_ZEROCOPY
+#define SO_ZEROCOPY	59
+#endif
+
+#ifndef SO_EE_CODE_ZEROCOPY_COPIED
+#define SO_EE_CODE_ZEROCOPY_COPIED 1
+#endif
+
+#ifndef MSG_ZEROCOPY
+#define MSG_ZEROCOPY	0x4000000
+#endif
+
+static int  cfg_cork;
+static bool cfg_cork_mixed;
+static int  cfg_cpu = -1;  /* default: pin to last cpu */
+static int  cfg_family = PF_UNSPEC;
+static int  cfg_ifindex = 1;
+static int  cfg_payload_len;
+static int  cfg_port   = 8000;
+static bool cfg_rx;
+static int  cfg_runtime_ms = 4200;
+static int  cfg_verbose;
+static int  cfg_waittime_ms = 500;
+static bool cfg_zerocopy;
+
+static socklen_t cfg_alen;
+static struct sockaddr_storage cfg_dst_addr;
+static struct sockaddr_storage cfg_src_addr;
+
+static char payload[IP_MAXPACKET];
+static long packets, bytes, completions, expected_completions;
+static int  zerocopied = -1;
+static uint32_t next_completion;
+
+static unsigned long gettimeofday_ms(void)
+{
+   struct timeval tv;
+
+   gettimeofday(&tv, NULL);
+   return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static uint16_t get_ip_csum(const uint16_t *start, int num_words)
+{
+   unsigned long sum = 0;
+   int i;
+
+   for (i = 0; i < num_words; i++)
+   sum += start[i];
+
+   while (sum >> 16)
+   sum = (sum & 0xffff) + (sum >> 16);
+
+   return ~sum;
+}
+
+static int do_setcpu(int cpu)
+{
+   cpu_set_t mask;
+
+   CPU_ZERO(&mask);
+   CPU_SET(cpu, &mask);
+   if (sched_setaffinity(0, sizeof(mask), &mask))
+   error(1, 0, "setaffini

[PATCH net-next v4 2/9] sock: skb_copy_ubufs support for compound pages

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Refine skb_copy_ubufs to support compound pages. With upcoming TCP
zerocopy sendmsg, such fragments may appear.

The existing code replaces each page one for one. Splitting each
compound page into an independent number of regular pages can result
in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.

Instead, fill all destination pages but the last to PAGE_SIZE.
Split the existing alloc + copy loop into separate stages:
1. compute bytelength and minimum number of pages to store this.
2. allocate
3. copy, filling each page except the last to PAGE_SIZE bytes
4. update skb frag array

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h |  9 +++--
 net/core/skbuff.c  | 53 --
 2 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index be76082f48aa..2f64e2bbb592 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1796,13 +1796,18 @@ static inline unsigned int skb_headlen(const struct 
sk_buff *skb)
return skb->len - skb->data_len;
 }
 
-static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+static inline unsigned int __skb_pagelen(const struct sk_buff *skb)
 {
unsigned int i, len = 0;
 
for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
-   return len + skb_headlen(skb);
+   return len;
+}
+
+static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+{
+   return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0f0933b338d7..a95877a8ac8b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -932,17 +932,20 @@ EXPORT_SYMBOL_GPL(skb_morph);
  */
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
-   int i;
+   struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
int num_frags = skb_shinfo(skb)->nr_frags;
struct page *page, *head = NULL;
-   struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+   int i, new_frags;
+   u32 d_off;
 
-   for (i = 0; i < num_frags; i++) {
-   skb_frag_t *f = &skb_shinfo(skb)->frags[i];
-   u32 p_off, p_len, copied;
-   struct page *p;
-   u8 *vaddr;
+   if (!num_frags)
+   return 0;
+
+   if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
+   return -EINVAL;
 
+   new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   for (i = 0; i < new_frags; i++) {
page = alloc_page(gfp_mask);
if (!page) {
while (head) {
@@ -952,17 +955,36 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
}
return -ENOMEM;
}
+   set_page_private(page, (unsigned long)head);
+   head = page;
+   }
+
+   page = head;
+   d_off = 0;
+   for (i = 0; i < num_frags; i++) {
+   skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+   u32 p_off, p_len, copied;
+   struct page *p;
+   u8 *vaddr;
 
skb_frag_foreach_page(f, f->page_offset, skb_frag_size(f),
  p, p_off, p_len, copied) {
+   u32 copy, done = 0;
vaddr = kmap_atomic(p);
-   memcpy(page_address(page) + copied, vaddr + p_off,
-  p_len);
+
+   while (done < p_len) {
+   if (d_off == PAGE_SIZE) {
+   d_off = 0;
+   page = (struct page 
*)page_private(page);
+   }
+   copy = min_t(u32, PAGE_SIZE - d_off, p_len - 
done);
+   memcpy(page_address(page) + d_off,
+  vaddr + p_off + done, copy);
+   done += copy;
+   d_off += copy;
+   }
kunmap_atomic(vaddr);
}
-
-   set_page_private(page, (unsigned long)head);
-   head = page;
}
 
/* skb frags release userspace buffers */
@@ -972,11 +994,12 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
uarg->callback(uarg, false);
 
/* skb frags point to kernel buffers */
-   for (i = num_frags - 1; i >= 0; i--) {
-   __skb_fill_page_desc(skb, i, head, 0,
-skb_shinfo(skb)->frags[i].size);
+   for (i = 0; i < new_frags - 1; i++) {
+   __skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
head = (struct page *)page_private(head);
}
+   __skb_fill_page_desc(skb, new_fr

[PATCH net-next v4 7/9] sock: ulimit on MSG_ZEROCOPY pages

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6b0 ("perf_counter: per user mlock gift")

Signed-off-by: Willem de Bruijn 
---
 include/linux/sched/user.h |  3 ++-
 include/linux/skbuff.h |  5 +
 net/core/skbuff.c  | 48 ++
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 5d5415e129d4..3c07e4135127 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -36,7 +36,8 @@ struct user_struct {
struct hlist_node uidhash_node;
kuid_t uid;
 
-#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL)
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
+defined(CONFIG_NET)
atomic_long_t locked_vm;
 #endif
 };
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f5bdd93a87da..8c0708d2e5e6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -457,6 +457,11 @@ struct ubuf_info {
};
};
atomic_t refcnt;
+
+   struct mmpin {
+   struct user_struct *user;
+   unsigned int num_pg;
+   } mmp;
 };
 
 #define skb_uarg(SKB)  ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index dcee0f64f1fa..42b62c716a33 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -897,6 +897,44 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct 
sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+{
+   unsigned long max_pg, num_pg, new_pg, old_pg;
+   struct user_struct *user;
+
+   if (capable(CAP_IPC_LOCK) || !size)
+   return 0;
+
+   num_pg = (size >> PAGE_SHIFT) + 2;  /* worst case */
+   max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   user = mmp->user ? : current_user();
+
+   do {
+   old_pg = atomic_long_read(&user->locked_vm);
+   new_pg = old_pg + num_pg;
+   if (new_pg > max_pg)
+   return -ENOBUFS;
+   } while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
+old_pg);
+
+   if (!mmp->user) {
+   mmp->user = get_uid(user);
+   mmp->num_pg = num_pg;
+   } else {
+   mmp->num_pg += num_pg;
+   }
+
+   return 0;
+}
+
+static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+{
+   if (mmp->user) {
+   atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+   free_uid(mmp->user);
+   }
+}
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
struct ubuf_info *uarg;
@@ -913,6 +951,12 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, 
size_t size)
 
BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
uarg = (void *)skb->cb;
+   uarg->mmp.user = NULL;
+
+   if (mm_account_pinned_pages(&uarg->mmp, size)) {
+   kfree_skb(skb);
+   return NULL;
+   }
 
uarg->callback = sock_zerocopy_callback;
uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
@@ -956,6 +1000,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, 
size_t size,
 
next = (u32)atomic_read(&sk->sk_zckey);
if ((u32)(uarg->id + uarg->len) == next) {
+   if (mm_account_pinned_pages(&uarg->mmp, size))
+   return NULL;
uarg->len++;
uarg->bytelen = bytelen;
atomic_set(&sk->sk_zckey, ++next);
@@ -1038,6 +1084,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+   mm_unaccount_pinned_pages(&uarg->mmp);
+
if (uarg->callback)
uarg->callback(uarg, uarg->zerocopy);
else
-- 
2.14.0.rc1.383.gd1ce394fe2-goog



[PATCH net-next v4 5/9] sock: enable MSG_ZEROCOPY

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
skb_split and skb_try_coalesce can be refined to avoid copying.
These are not in the hot path and this patch is hairy enough as
is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/tun.c  |  2 +-
 drivers/vhost/net.c|  1 +
 include/linux/skbuff.h | 14 +-
 net/core/dev.c |  4 ++--
 net/core/skbuff.c  | 48 +++-
 5 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 68add55f8460..d21510d47aa2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -892,7 +892,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct 
net_device *dev)
sk_filter(tfile->socket.sk, skb))
goto drop;
 
-   if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+   if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
goto drop;
 
skb_tx_timestamp(skb);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 06d044862e58..ba08b78ed630 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -533,6 +533,7 @@ static void handle_tx(struct vhost_net *net)
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
+   atomic_set(&ubuf->refcnt, 1);
msg.msg_control = ubuf;
msg.msg_controllen = sizeof(ubuf);
ubufs = nvq->ubufs;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 59cff7aa494e..e5387932c266 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2512,7 +2512,17 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-   if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+   if (likely(!skb_zcopy(skb)))
+   return 0;
+   if (skb_uarg(skb)->callback == sock_zerocopy_callback)
+   return 0;
+   return skb_copy_ubufs(skb, gfp_mask);
+}
+
+/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
+static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
+{
+   if (likely(!skb_zcopy(skb)))
return 0;
return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -2944,6 +2954,8 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
const struct page *page, int off)
 {
+   if (skb_zcopy(skb))
+   return false;
if (i) {
const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i 
- 1];
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 8ea6b4b42611..1d75499add72 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1853,7 +1853,7 @@ static inline int deliver_skb(struct sk_buff *skb,
  struct packet_type *pt_prev,
  struct net_device *orig_dev)
 {
-   if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+   if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
return -ENOMEM;
refcount_inc(&skb->users);
return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
@@ -4412,7 +4412,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, 
bool pfmemalloc)
}
 
if (pt_prev) {
-   if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+   if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
goto drop;
else
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 29e34bc6a17c..74d3c36f8419 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -567,21 +567,10 @@ static void skb_release_data(struct sk_buff *skb)
fo

[PATCH net-next v4 6/9] sock: MSG_ZEROCOPY notification coalescing

2017-08-03 Thread Willem de Bruijn
From: Willem de Bruijn 

In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause send() calls to append new data to an
existing sk_buff and, thus, ubuf_info. In that case the notification
must hold a range. Modify ubuf_info to store an inclusive range [N..N+m]
and add skb_zerocopy_realloc() to optionally extend an existing range.

Also coalesce notifications in this common case: if a notification
[1, 1] is about to be queued while [0, 0] is the queue tail, just modify
the head of the queue to read [0, 1].

Coalescing is limited to a few TSO frames worth of data to bound
notification latency.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h | 17 +++--
 net/core/skbuff.c  | 99 ++
 2 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e5387932c266..f5bdd93a87da 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -444,15 +444,26 @@ enum {
  */
 struct ubuf_info {
void (*callback)(struct ubuf_info *, bool zerocopy_success);
-   void *ctx;
-   unsigned long desc;
-   u16 zerocopy:1;
+   union {
+   struct {
+   unsigned long desc;
+   void *ctx;
+   };
+   struct {
+   u32 id;
+   u16 len;
+   u16 zerocopy:1;
+   u32 bytelen;
+   };
+   };
atomic_t refcnt;
 };
 
 #define skb_uarg(SKB)  ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+   struct ubuf_info *uarg);
 
 static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 74d3c36f8419..dcee0f64f1fa 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -915,7 +915,9 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, 
size_t size)
uarg = (void *)skb->cb;
 
uarg->callback = sock_zerocopy_callback;
-   uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+   uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
+   uarg->len = 1;
+   uarg->bytelen = size;
uarg->zerocopy = 1;
atomic_set(&uarg->refcnt, 0);
sock_hold(sk);
@@ -929,26 +931,101 @@ static inline struct sk_buff *skb_from_uarg(struct 
ubuf_info *uarg)
return container_of((void *)uarg, struct sk_buff, cb);
 }
 
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+   struct ubuf_info *uarg)
+{
+   if (uarg) {
+   const u32 byte_limit = 1 << 19; /* limit to a few TSO */
+   u32 bytelen, next;
+
+   /* realloc only when socket is locked (TCP, UDP cork),
+* so uarg->len and sk_zckey access is serialized
+*/
+   if (!sock_owned_by_user(sk)) {
+   WARN_ON_ONCE(1);
+   return NULL;
+   }
+
+   bytelen = uarg->bytelen + size;
+   if (uarg->len == USHRT_MAX - 1 || bytelen > byte_limit) {
+   /* TCP can create new skb to attach new uarg */
+   if (sk->sk_type == SOCK_STREAM)
+   goto new_alloc;
+   return NULL;
+   }
+
+   next = (u32)atomic_read(&sk->sk_zckey);
+   if ((u32)(uarg->id + uarg->len) == next) {
+   uarg->len++;
+   uarg->bytelen = bytelen;
+   atomic_set(&sk->sk_zckey, ++next);
+   return uarg;
+   }
+   }
+
+new_alloc:
+   return sock_zerocopy_alloc(sk, size);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
+
+static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len)
+{
+   struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+   u32 old_lo, old_hi;
+   u64 sum_len;
+
+   old_lo = serr->ee.ee_info;
+   old_hi = serr->ee.ee_data;
+   sum_len = old_hi - old_lo + 1ULL + len;
+
+   if (sum_len >= (1ULL << 32))
+   return false;
+
+   if (lo != old_hi + 1)
+   return false;
+
+   serr->ee.ee_data += len;
+   return true;
+}
+
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
 {
-   struct sk_buff *skb = skb_from_uarg(uarg);
+   struct sk_buff *tail, *skb = skb_from_uarg(uarg);
struct sock_exterr_skb *serr;
struct sock *sk = skb->sk;
-   u16 id = uarg->desc;
+   struct sk_buff_head *q;
+   unsigned 

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-08-03 Thread David Ahern
On 5/18/17 10:24 PM, David Ahern wrote:
> On 5/18/17 3:02 AM, Daniel Borkmann wrote:
>> So effectively this means libmnl has to be used for new stuff, noone
>> has time to do the work to convert the existing tooling over (which
>> by itself might be a challenge in testing everything to make sure
>> there are no regressions) given there's not much activity around
>> lib/libnetlink.c anyway, and existing users not using libmnl today
>> won't see/notice new improvements on netlink side when they do an
>> upgrade. So we'll be stuck with that dual library mess pretty much
>> for a very long time. :(
> 
> lib/libnetlink.c with all of its duplicate functions weighs in at just
> 947 LOC -- a mere 12% of the code in lib/. From a total SLOC of iproute2
> it is a negligible part of the code base.
> 
> Given that, there is very little gain -- but a lot of risk in
> regressions -- in converting such a small, low level code base to libmnl
> just for the sake of using a library - something Phil noted in his
> cursory attempt at converting ip to libmnl. ie., The level effort
> required vs the benefit is just not worth it.
> 
> There are so many other parts of the ip code base that need work with a
> much higher return on the time investment.
> 

Stephen: It has been 3 months since the first extack patches were posted
and still nothing in iproute2, all of it hung up on your decision to
require libmnl. Do you plan to finish the libmnl support any time soon
and send out patches?


Re: [PATCH v3 net-next 5/5] net: dsa: lan9303: refactor lan9303_get_ethtool_stats

2017-08-03 Thread Egil Hjelmeland

Den 03. aug. 2017 20:04, skrev Florian Fainelli:

On 08/03/2017 02:45 AM, Egil Hjelmeland wrote:

In lan9303_get_ethtool_stats: Get rid of 0x400 constant magic
by using new lan9303_read_switch_reg() inside loop.
Reduced scope of two variables.

Signed-off-by: Egil Hjelmeland 
---
  drivers/net/dsa/lan9303-core.c | 26 --
  1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index 6f409755ba1a..5aaa46146c27 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -435,6 +435,13 @@ static int lan9303_write_switch_port(
chip, LAN9303_SWITCH_PORT_REG(port, regnum), val);
  }
  
+static int lan9303_read_switch_port(

+   struct lan9303 *chip, int port, u16 regnum, u32 *val)
+{


This indentation is really funny, why not just break it up that way:

static int lan9303_read_switch_port(struct lan9303 *chip, int port
u16 regnum, u32 *val)
{
}

This applies to patch 5 as well, other than that:

Reviewed-by: Florian Fainelli 



Because it is the form that passes scripts/checkpatch.pl, which I find
easiest to type and maintain.

- No need to fine tune spaces.
- No need to change indentation if later renaming function to name of
  different length.

Do you have any references backing up your claim this is wrong?

Cheers
Egil


Re: [PATCH net-next 1/3] proto_ops: Add locked held versions of sendmsg and sendpage

2017-08-03 Thread John Fastabend
On 07/28/2017 04:22 PM, Tom Herbert wrote:
> Add new proto_ops sendmsg_locked and sendpage_locked that can be
> called when the socket lock is already held. Correspondingly, add
> kernel_sendmsg_locked and kernel_sendpage_locked as front end
> functions.
> 
> These functions will be used in zero proxy so that we can take
> the socket lock in a ULP sendmsg/sendpage and then directly call the
> backend transport proto_ops functions.
> 

[...]

>  
> +int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
> +size_t size, int flags)
> +{
> + struct socket *sock = sk->sk_socket;
> +
> + if (sock->ops->sendpage_locked)
> + return sock->ops->sendpage_locked(sk, page, offset, size,
> +   flags);
> +
> + return sock_no_sendpage_locked(sk, page, offset, size, flags);
> +}

How about just returning EOPNOTSUPP here and forcing implementations to do both
sendmsg and sendpage. The only implementation of these callbacks already does
this. And if it's any other socket, it will just wind its way through a few
layers of calls before returning EOPNOTSUPP.

.John


Re: [PATCH v3 net-next 3/5] net: dsa: lan9303: Simplify lan9303_xxx_packet_processing() usage

2017-08-03 Thread Egil Hjelmeland

Den 03. aug. 2017 20:06, skrev Florian Fainelli:

On 08/03/2017 02:45 AM, Egil Hjelmeland wrote:

Simplify usage of lan9303_enable_packet_processing,
lan9303_disable_packet_processing()

Signed-off-by: Egil Hjelmeland 


Reviewed-by: Florian Fainelli 

It took a little while to figure out that we are utilizing fall-through of
the switch/case statement, and that's why it's okay.



  static int lan9303_check_device(struct lan9303 *chip)
@@ -765,7 +766,6 @@ static int lan9303_port_enable(struct dsa_switch *ds, int 
port,
/* enable internal packet processing */
switch (port) {
case 1:
-   return lan9303_enable_packet_processing(chip, port);
case 2:
return lan9303_enable_packet_processing(chip, port);
default:


I suppose if we later change to dsa_switch_alloc(...,3), then it could
be further simplified to

if (port != 0)
return lan9303_enable_packet_processing(chip, port);

Or perhaps no test is needed at all. The driver assumes port 0 is the cpu
port, which is the sensible way to use the chip (because port 0 has
no phy, while the others do). Declaring a different port as cpu port in
DTS will not work, but it will not crash the kernel.

Egil


[PATCH net-next] liquidio: add missing strings in oct_dev_state_str array

2017-08-03 Thread Felix Manlunas
From: Intiyaz Basha 

There's supposed to be a one-to-one correspondence between the 18 macros
that #define the OCT_DEV states (in octeon_device.h) and the strings in the
oct_dev_state_str array, but there are only 14 strings in the array.

Add the missing strings (so they become 18 in total), and also revise some
incorrect/outdated text of existing strings.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/octeon_device.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.c 
b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
index f10014f7..495cc88 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
@@ -528,9 +528,10 @@ static struct octeon_config_ptr {
 };
 
 static char oct_dev_state_str[OCT_DEV_STATES + 1][32] = {
-   "BEGIN", "PCI-MAP-DONE", "DISPATCH-INIT-DONE",
+   "BEGIN", "PCI-ENABLE-DONE", "PCI-MAP-DONE", "DISPATCH-INIT-DONE",
"IQ-INIT-DONE", "SCBUFF-POOL-INIT-DONE", "RESPLIST-INIT-DONE",
-   "DROQ-INIT-DONE", "IO-QUEUES-INIT-DONE", "CONSOLE-INIT-DONE",
+   "DROQ-INIT-DONE", "MBOX-SETUP-DONE", "MSIX-ALLOC-VECTOR-DONE",
+   "INTR-SET-DONE", "IO-QUEUES-INIT-DONE", "CONSOLE-INIT-DONE",
"HOST-READY", "CORE-READY", "RUNNING", "IN-RESET",
"INVALID"
 };


Re: [PATCH net-next 2/3] skbuff: Function to send an skbuf on a socket

2017-08-03 Thread John Fastabend
On 07/28/2017 04:22 PM, Tom Herbert wrote:
> Add skb_send_sock to send an skbuff on a socket within the kernel.
> Arguments include an offset so that an skbuff might be sent in multiple
> calls (e.g. send buffer limit is hit).
> 
> Signed-off-by: Tom Herbert 
> ---

[...]

> +/* Send skb data on a socket. Socket must be locked. */
> +int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
> +  int len)
> +{
> + unsigned int orig_len = len;
> + struct sk_buff *head = skb;
> + unsigned short fragidx;
> + int slen, ret;
> +
> +do_frag_list:
> +
> + /* Deal with head data */
> + while (offset < skb_headlen(skb) && len) {
> + struct kvec kv;
> + struct msghdr msg;
> +
> + slen = min_t(int, len, skb_headlen(skb) - offset);
> + kv.iov_base = skb->data + offset;
> + kv.iov_len = len;
^^

This should be slen right?

> + memset(&msg, 0, sizeof(msg));
> +
> + ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
> + if (ret <= 0)
> + goto error;
> +
> + offset += ret;
> + len -= ret;
> + }
> +

Thanks,
John



Re: [PATCH net-next 1/3] proto_ops: Add locked held versions of sendmsg and sendpage

2017-08-03 Thread John Fastabend
On 07/28/2017 04:22 PM, Tom Herbert wrote:
> Add new proto_ops sendmsg_locked and sendpage_locked that can be
> called when the socket lock is already held. Correspondingly, add
> kernel_sendmsg_locked and kernel_sendpage_locked as front end
> functions.
> 
> These functions will be used in zero proxy so that we can take
> the socket lock in a ULP sendmsg/sendpage and then directly call the
> backend transport proto_ops functions.
> 
> Signed-off-by: Tom Herbert 
> ---

[...]

> diff --git a/net/socket.c b/net/socket.c
> index 79d9bb964cd8..c0a12ad39610 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -652,6 +652,20 @@ int kernel_sendmsg(struct socket *sock, struct msghdr 
> *msg,
>  }
>  EXPORT_SYMBOL(kernel_sendmsg);
>  
> +int kernel_sendmsg_locked(struct sock *sk, struct msghdr *msg,
> +   struct kvec *vec, size_t num, size_t size)
> +{
> + struct socket *sock = sk->sk_socket;
> +
> + if (!sock->ops->sendmsg_locked)
> + sock_no_sendmsg_locked(sk, msg, size);
> +

Should be

return sock_no_sendmsg_locked(sk, msg, size);


> + iov_iter_kvec(&msg->msg_iter, WRITE | ITER_KVEC, vec, num, size);
> +
> + return sock->ops->sendmsg_locked(sk, msg, msg_data_left(msg));

Otherwise this is a null ptr deref.

> +}
> +EXPORT_SYMBOL(kernel_sendmsg_locked);
> +


Thanks,
John



Re: [PATCH net-next] net: dsa: Add support for 64-bit statistics

2017-08-03 Thread Florian Fainelli
On 08/03/2017 11:11 AM, Andrew Lunn wrote:
> On Thu, Aug 03, 2017 at 10:30:56AM -0700, Florian Fainelli wrote:
>> On 08/02/2017 04:49 PM, David Miller wrote:
>>> From: Florian Fainelli 
>>> Date: Tue,  1 Aug 2017 15:00:36 -0700
>>>
 DSA slave network devices maintain a pair of bytes and packets counters
for each direction, but these are not 64-bit capable. Re-use
 pcpu_sw_netstats which contains exactly what we need for that purpose
 and update the code path to report 64-bit capable statistics.

 Signed-off-by: Florian Fainelli 
>>>
>>> Applied, thanks.
>>>
>>> I would run ethtool -S and ifconfig under perf to see where it is
>>> spending so much time.
>>>
>>
>> This appears to be way worse than I thought, will keep digging, but for
>> now, I may have to send a revert. Andrew, Vivien can you see if you have
>> the same problems on your boards? Thanks!
>>
>> # killall iperf
>> # [ ID] Interval   Transfer Bandwidth
>> [  3]  0.0-19.1 sec   500 MBytes   220 Mbits/sec
>> # while true; do ethtool -S gphy; ifconfig gphy; done
>> ^C^C
>>
>>
>> [   64.566226] INFO: rcu_sched self-detected stall on CPU
>> [   64.571487]  0-...: (25999 ticks this GP) idle=006/141/0
> 
> Hi Florian

Hi Andrew,

> 
> I don't get anything so bad, but i think that is because of hardware
> restrictions. I see the ethtool; ifconfig loop goes a lot slower when
> there is iperf traffic, but i don't get an RCU stall. However, the
> board i tested on only has a 100Mbps CPU interface, and it can handle
> all that traffic without pushing the CPU to 100%. What is the CPU load
> when you run your test? Even if you are going to 100% CPU load, we
> still don't want RCU stalls.

This is a quad core 1.5 Ghz board pushing 1Gbit/sec worth of traffic,
this is about 25% loaded. What is needed to reproduce this is basically:

iperf -c 192.168.1.1 -t 30 &
while true; do ifconfig gphy; ethtool -S gphy; done

when iperf terminates, the lock-up reliably occurs. I just converted
net/dsa/ to use per-cpu statistics and of course, I can no longer
reproduce this problem now...
-- 
Florian


[PATCH v3 net-next 2/4] sock: ULP infrastructure

2017-08-03 Thread Tom Herbert
Generalize the TCP ULP infrastructure recently introduced to support
kTLS. This adds a SO_ULP socket option and creates new fields in
sock structure for ULP ops and ULP data. Also, the interface allows
additional per ULP parameters to be set so that a ULP can be pushed
and operations started in one shot.

Signed-off-by: Tom Herbert 
---
 arch/alpha/include/uapi/asm/socket.h   |   2 +
 arch/frv/include/uapi/asm/socket.h |   2 +
 arch/ia64/include/uapi/asm/socket.h|   2 +
 arch/m32r/include/uapi/asm/socket.h|   2 +
 arch/mips/include/uapi/asm/socket.h|   2 +
 arch/mn10300/include/uapi/asm/socket.h |   2 +
 arch/parisc/include/uapi/asm/socket.h  |   2 +
 arch/s390/include/uapi/asm/socket.h|   2 +
 arch/sparc/include/uapi/asm/socket.h   |   2 +
 arch/xtensa/include/uapi/asm/socket.h  |   2 +
 include/linux/socket.h |   9 ++
 include/net/sock.h |   5 +
 include/net/ulp_sock.h |  75 +
 include/uapi/asm-generic/socket.h  |   2 +
 net/Kconfig|   4 +
 net/core/Makefile  |   1 +
 net/core/sock.c|  14 +++
 net/core/sysctl_net_core.c |  25 +
 net/core/ulp_sock.c| 194 +
 19 files changed, 349 insertions(+)
 create mode 100644 include/net/ulp_sock.h
 create mode 100644 net/core/ulp_sock.c

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 7b285dd4fe05..885e8fca79b0 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -109,4 +109,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index f1e3b20dce9f..8ba71f2a3bf3 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -102,5 +102,7 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index 5dd5c5d0d642..2de1c53f88b5 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -111,4 +111,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index f8f7b47e247f..b2d394381787 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 882823bec153..d0bdf8c78220 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -120,4 +120,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index c710db354ff2..686fbf497a13 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index a0d4dc9f4eb2..d6e99deca976 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SO_PEERGROUPS  0x4034
 
+#define SO_ULP 0x4035
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 52a63f4175cb..6b52f162369a 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -108,4 +108,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index 186fd8199f54..e765bf781107 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -98,6 +98,8 @@
 
 #define SO_PEERGROUPS  0x003d
 
+#define SO_ULP 0x003e
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h 
b/arch/xtensa/include/uapi/asm/socket.h
index 3eed2761c149..8eaa2e9e27b6 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -113,4 +113,6 @@
 
 #define SO_PEERGROUPS  59
 
+#define SO_ULP 60
+
 #endif /* _XTENSA_

[PATCH v3 net-next 4/4] ulp: Documentation for ULP infrastructure

2017-08-03 Thread Tom Herbert
Add a doc in Documentation/networking

Signed-off-by: Tom Herbert 
---
 Documentation/networking/ulp.txt | 82 
 1 file changed, 82 insertions(+)
 create mode 100644 Documentation/networking/ulp.txt

diff --git a/Documentation/networking/ulp.txt b/Documentation/networking/ulp.txt
new file mode 100644
index ..4d830314b0ff
--- /dev/null
+++ b/Documentation/networking/ulp.txt
@@ -0,0 +1,82 @@
+Upper Layer Protocol (ULP) Infrastructure
+=========================================
+
+The ULP kernel infrastructure provides a means to hook upper layer
+protocol support on a socket. A module may register a ULP hook
+in the kernel. ULP processing is enabled by a setsockopt on a socket
+that specifies the name of the registered ULP to be invoked. An
+initialization function is defined for each ULP that can change the
+function entry points of the socket (sendmsg, rcvmsg, etc.) or change
+the socket in other fundamental ways.
+
+Note, no synchronization is enforced between the setsockopt to enable
+a ULP and ongoing asynchronous operations on the socket (such as a
+blocked read). If synchronization is required this must be handled by
+the ULP and caller.
+
+User interface
+==============
+
+The structure for the socket SOL_ULP options is defined in socket.h.
+
+Example to enable "my_ulp" ULP on a socket:
+
+struct ulp_config ulpc = {
+.ulp_name = "my_ulp",
+};
+
+setsockopt(sock, SOL_SOCKET, SO_ULP, &ulpc, sizeof(ulpc))
+
+The ulp_config includes a "__u8 ulp_params[0]" field that may be used
+to refer to ULP-specific parameters being set.
+
+Kernel interface
+================
+
+The interface for ULP infrastructure is defined in net/ulp_sock.h.
+
+ULP registration functions
+--------------------------
+
+int ulp_register(struct ulp_ops *type)
+
+ Called to register a ULP. The ulp_ops structure is described below.
+
+void ulp_unregister(struct ulp_ops *type);
+
+ Called to unregister a ULP.
+
+ulp_ops structure
+-----------------
+
+int (*init)(struct sock *sk, char __user *optval, int len)
+
+ Initialization function for the ULP. This is called from setsockopt
+ when the ULP name in the ulp_config argument matches the registered
+ ULP. optval is a userspace pointer to the ULP specific parameters.
+ len is the length of the ULP specific parameters.
+
+void (*release)(struct sock *sk)
+
+ Called when socket is being destroyed. The ULP implementation
+ should cancel any asynchronous operations (such as timers) and
+ release any acquired resources.
+
+int (*get_params)(struct sock *sk, char __user *optval, int *optlen)
+
+ Get the ULP-specific parameters previously set in the init function
+ for the ULP. Note that optlen is a pointer to kernel memory.
+
+char name[ULP_NAME_MAX]
+
+ Name of the ULP. Must be NULL terminated.
+
+struct module *owner
+
+ Corresponding owner for ref count.
+
+Author
+==
+
+Tom Herbert (t...@quantonium.net)
+
-- 
2.11.0



[PATCH v3 net-next 1/4] inet: include net/sock.h in inet_common.h

2017-08-03 Thread Tom Herbert
inet_common.h has a dependency on sock.h so it should include that.

Signed-off-by: Tom Herbert 
---
 include/net/inet_common.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index f39ae697347f..df0119a317aa 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -1,6 +1,8 @@
 #ifndef _INET_COMMON_H
 #define _INET_COMMON_H
 
+#include 
+
 extern const struct proto_ops inet_stream_ops;
 extern const struct proto_ops inet_dgram_ops;
 
-- 
2.11.0



[PATCH v3 net-next 3/4] tcp: Adjust TCP ULP to defer to sockets ULP

2017-08-03 Thread Tom Herbert
Fix TCP and TLS to use the newer ULP infrastructure in sockets.

Tested-by: Dave Watson 
Signed-off-by: Tom Herbert 
---
 Documentation/networking/tls.txt   |   6 +-
 include/net/inet_connection_sock.h |   4 --
 include/net/tcp.h  |  25 ---
 include/net/tls.h  |   4 +-
 net/ipv4/Makefile  |   2 +-
 net/ipv4/sysctl_net_ipv4.c |   9 ++-
 net/ipv4/tcp.c |  40 ++-
 net/ipv4/tcp_ipv4.c|   2 -
 net/ipv4/tcp_ulp.c | 135 -
 net/tls/Kconfig|   1 +
 net/tls/tls_main.c |  24 ---
 11 files changed, 51 insertions(+), 201 deletions(-)
 delete mode 100644 net/ipv4/tcp_ulp.c

diff --git a/Documentation/networking/tls.txt b/Documentation/networking/tls.txt
index 77ed00631c12..b70309df4709 100644
--- a/Documentation/networking/tls.txt
+++ b/Documentation/networking/tls.txt
@@ -12,8 +12,12 @@ Creating a TLS connection
 
 First create a new TCP socket and set the TLS ULP.
 
+struct ulp_config ulpc = {
+   .ulp_name = "tls",
+};
+
   sock = socket(AF_INET, SOCK_STREAM, 0);
-  setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
+  setsockopt(sock, SOL_SOCKET, SO_ULP, &ulpc, sizeof(ulpc))
 
 Setting the TLS ULP allows us to set/get TLS socket options. Currently
 only the symmetric encryption is handled in the kernel.  After the TLS
diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 13e4c89a8231..c7a577976bec 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -75,8 +75,6 @@ struct inet_connection_sock_af_ops {
  * @icsk_pmtu_cookie  Last pmtu seen by socket
  * @icsk_ca_ops   Pluggable congestion control hook
  * @icsk_af_ops   Operations which are AF_INET{4,6} specific
- * @icsk_ulp_ops  Pluggable ULP control hook
- * @icsk_ulp_data ULP private data
  * @icsk_ca_state:Congestion control state
  * @icsk_retransmits: Number of unrecovered [RTO] timeouts
  * @icsk_pending: Scheduled timer event
@@ -99,8 +97,6 @@ struct inet_connection_sock {
__u32 icsk_pmtu_cookie;
const struct tcp_congestion_ops *icsk_ca_ops;
const struct inet_connection_sock_af_ops *icsk_af_ops;
-   const struct tcp_ulp_ops  *icsk_ulp_ops;
-   void  *icsk_ulp_data;
unsigned int  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
__u8  icsk_ca_state:6,
  icsk_ca_setsockopt:1,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index bb1881b4ce48..65c462da3740 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1968,31 +1968,6 @@ static inline void tcp_listendrop(const struct sock *sk)
 
 enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
 
-/*
- * Interface for adding Upper Level Protocols over TCP
- */
-
-#define TCP_ULP_NAME_MAX   16
-#define TCP_ULP_MAX128
-#define TCP_ULP_BUF_MAX(TCP_ULP_NAME_MAX*TCP_ULP_MAX)
-
-struct tcp_ulp_ops {
-   struct list_headlist;
-
-   /* initialize ulp */
-   int (*init)(struct sock *sk);
-   /* cleanup ulp */
-   void (*release)(struct sock *sk);
-
-   charname[TCP_ULP_NAME_MAX];
-   struct module   *owner;
-};
-int tcp_register_ulp(struct tcp_ulp_ops *type);
-void tcp_unregister_ulp(struct tcp_ulp_ops *type);
-int tcp_set_ulp(struct sock *sk, const char *name);
-void tcp_get_available_ulp(char *buf, size_t len);
-void tcp_cleanup_ulp(struct sock *sk);
-
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
diff --git a/include/net/tls.h b/include/net/tls.h
index b89d397dd62f..7d88a6e2f5a7 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -214,9 +214,7 @@ static inline void tls_fill_prepend(struct tls_context *ctx,
 
 static inline struct tls_context *tls_get_ctx(const struct sock *sk)
 {
-   struct inet_connection_sock *icsk = inet_csk(sk);
-
-   return icsk->icsk_ulp_data;
+   return sk->sk_ulp_data;
 }
 
 static inline struct tls_sw_context *tls_sw_ctx(
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index afcb435adfbe..f83de23a30e7 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -8,7 +8,7 @@ obj-y := route.o inetpeer.o protocol.o \
 inet_timewait_sock.o inet_connection_sock.o \
 tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
 tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
-tcp_rate.o tcp_recovery.o tcp_ulp.o \
+tcp_rate.o tcp_recovery.o \
 tcp_offload.o datagram.o raw.o udp.o udplite.o \
 udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o 

[PATCH v3 net-next 0/4] ulp: Generalize ULP infrastructure

2017-08-03 Thread Tom Herbert
Generalize the ULP infrastructure that was recently introduced to
support kTLS. This adds a SO_ULP socket option and creates new fields in
sock structure for ULP ops and ULP data. Also, the interface allows
additional per ULP parameters to be set so that a ULP can be pushed
and operations started in one shot.

This patch set:
  - Minor dependency fix in inet_common.h
  - Implement ULP infrastructure as a socket mechanism
  - Fixes TCP and TLS to use the new method (maintaining backwards
API compatibility)
  - Adds a ulp.txt document

Tested: Ran simple ULP. Dave Watson verified kTLS works.

-v2: Fix compilation errors when CONFIG_ULP_SOCK not set.
-v3: Fix one more build issue, check that sk_protocol is IPPROTO_TCP
 in tls_init

Tom Herbert (4):
  inet: include net/sock.h in inet_common.h
  sock: ULP infrastructure
  tcp: Adjust TCP ULP to defer to sockets ULP
  ulp: Documentation for ULP infrastructure

 Documentation/networking/tls.txt   |   6 +-
 Documentation/networking/ulp.txt   |  82 ++
 arch/alpha/include/uapi/asm/socket.h   |   2 +
 arch/frv/include/uapi/asm/socket.h |   2 +
 arch/ia64/include/uapi/asm/socket.h|   2 +
 arch/m32r/include/uapi/asm/socket.h|   2 +
 arch/mips/include/uapi/asm/socket.h|   2 +
 arch/mn10300/include/uapi/asm/socket.h |   2 +
 arch/parisc/include/uapi/asm/socket.h  |   2 +
 arch/s390/include/uapi/asm/socket.h|   2 +
 arch/sparc/include/uapi/asm/socket.h   |   2 +
 arch/xtensa/include/uapi/asm/socket.h  |   2 +
 include/linux/socket.h |   9 ++
 include/net/inet_common.h  |   2 +
 include/net/inet_connection_sock.h |   4 -
 include/net/sock.h |   5 +
 include/net/tcp.h  |  25 -
 include/net/tls.h  |   4 +-
 include/net/ulp_sock.h |  75 +
 include/uapi/asm-generic/socket.h  |   2 +
 net/Kconfig|   4 +
 net/core/Makefile  |   1 +
 net/core/sock.c|  14 +++
 net/core/sysctl_net_core.c |  25 +
 net/core/ulp_sock.c| 194 +
 net/ipv4/Makefile  |   2 +-
 net/ipv4/sysctl_net_ipv4.c |   9 +-
 net/ipv4/tcp.c |  40 ---
 net/ipv4/tcp_ipv4.c|   2 -
 net/ipv4/tcp_ulp.c | 135 ---
 net/tls/Kconfig|   1 +
 net/tls/tls_main.c |  24 ++--
 32 files changed, 484 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/networking/ulp.txt
 create mode 100644 include/net/ulp_sock.h
 create mode 100644 net/core/ulp_sock.c
 delete mode 100644 net/ipv4/tcp_ulp.c

-- 
2.11.0



Re: [PATCH 00/27] ip: add -json support to 'ip link show'

2017-08-03 Thread Julien Fortin
On Thu, Aug 3, 2017 at 7:08 PM, Oliver Hartkopp  wrote:
> Hi Julien,
>
> On 08/03/2017 05:54 PM, Julien Fortin wrote:
>>
>> From: Julien Fortin 
>
>
> what about
>
> link_veth.c
> iplink_vcan.c
> iplink_vxcan.c
>
> ??

Hello Oliver,

None of these files print any link info data, see struct link_util (.print_opt)

Regards,
Julien.

>
> Regards,
> Oliver
>
>
>>
>> This patch series adds json support to 'ip [-details] link show [dev DEV]'
>> Each patch describes the json schema it adds and provides some examples.
>>
>> Julien Fortin (27):
>>color: add new COLOR_NONE and disable_color function
>>ip: add new command line argument -json (mutually exclusive with
>>  -color)
>>json_writer: add new json handlers (null, float with format, lluint,
>>  hu)
>>ip: ip_print: add new API to print JSON or regular format output
>>ip: ipaddress.c: add support for json output
>>ip: iplink.c: open/close json object for ip -brief -json link show dev
>>  DEV
>>ip: iplink_bond.c: add json output support
>>ip: iplink_bond_slave.c: add json output support (info_slave_data)
>>ip: iplink_hsr.c: add json output support
>>ip: iplink_bridge.c: add json output support
>>ip: iplink_bridge_slave.c: add json output support
>>ip: iplink_can.c: add json output support
>>ip: iplink_geneve.c: add json output support
>>ip: iplink_ipoib.c: add json output support
>>ip: iplink_ipvlan.c: add json output support
>>ip: iplink_vrf.c: add json output support
>>ip: iplink_vxlan.c: add json output support
>>ip: iplink_xdp.c: add json output support
>>ip: ipmacsec.c: add json output support
>>ip: link_gre.c: add json output support
>>ip: link_gre6.c: add json output support
>>ip: link_ip6tnl.c: add json output support
>>ip: link_iptnl.c: add json output support
>>ip: link_vti.c: add json output support
>>ip: link_vti6.c: add json output support
>>ip: link_macvlan.c: add json output support
>>ip: iplink_vlan.c: add json output support
>>
>>   include/color.h  |2 +
>>   include/json_writer.h|9 +
>>   include/utils.h  |1 +
>>   ip/Makefile  |2 +-
>>   ip/ip.c  |6 +
>>   ip/ip_common.h   |   56 +++
>>   ip/ip_print.c|  233 ++
>>   ip/ipaddress.c   | 1089
>> --
>>   ip/iplink.c  |2 +
>>   ip/iplink_bond.c |  231 +++---
>>   ip/iplink_bond_slave.c   |   57 ++-
>>   ip/iplink_bridge.c   |  291 -
>>   ip/iplink_bridge_slave.c |  185 +---
>>   ip/iplink_can.c  |  276 +---
>>   ip/iplink_geneve.c   |   86 +++-
>>   ip/iplink_hsr.c  |   36 +-
>>   ip/iplink_ipoib.c|   30 +-
>>   ip/iplink_ipvlan.c   |8 +-
>>   ip/iplink_macvlan.c  |   37 +-
>>   ip/iplink_vlan.c |   62 ++-
>>   ip/iplink_vrf.c  |   13 +-
>>   ip/iplink_vxlan.c|  161 ---
>>   ip/iplink_xdp.c  |   31 +-
>>   ip/ipmacsec.c|   84 +++-
>>   ip/link_gre.c|  147 ---
>>   ip/link_gre6.c   |  142 --
>>   ip/link_ip6tnl.c |  172 +---
>>   ip/link_iptnl.c  |  155 ---
>>   ip/link_vti.c|   23 +-
>>   ip/link_vti6.c   |   22 +-
>>   lib/color.c  |9 +-
>>   lib/json_writer.c|   44 +-
>>   32 files changed, 2668 insertions(+), 1034 deletions(-)
>>   create mode 100644 ip/ip_print.c
>>
>


Re: [PATCH net-next] net: dsa: Add support for 64-bit statistics

2017-08-03 Thread Andrew Lunn
On Thu, Aug 03, 2017 at 10:30:56AM -0700, Florian Fainelli wrote:
> On 08/02/2017 04:49 PM, David Miller wrote:
> > From: Florian Fainelli 
> > Date: Tue,  1 Aug 2017 15:00:36 -0700
> > 
> >> DSA slave network devices maintain a pair of bytes and packets counters
> >> for each direction, but these are not 64-bit capable. Re-use
> >> pcpu_sw_netstats which contains exactly what we need for that purpose
> >> and update the code path to report 64-bit capable statistics.
> >>
> >> Signed-off-by: Florian Fainelli 
> > 
> > Applied, thanks.
> > 
> > I would run ethtool -S and ifconfig under perf to see where it is
> > spending so much time.
> > 
> 
> This appears to be way worse than I thought, will keep digging, but for
> now, I may have to send a revert. Andrew, Vivien can you see if you have
> the same problems on your boards? Thanks!
> 
> # killall iperf
> # [ ID] Interval   Transfer Bandwidth
> [  3]  0.0-19.1 sec   500 MBytes   220 Mbits/sec
> # while true; do ethtool -S gphy; ifconfig gphy; done
> ^C^C
> 
> 
> [   64.566226] INFO: rcu_sched self-detected stall on CPU
> [   64.571487]  0-...: (25999 ticks this GP) idle=006/141/0

Hi Florian

I don't get anything so bad, but i think that is because of hardware
restrictions. I see the ethtool; ifconfig loop goes a lot slower when
there is iperf traffic, but i don't get an RCU stall. However, the
board i tested on only has a 100Mbps CPU interface, and it can handle
all that traffic without pushing the CPU to 100%. What is the CPU load
when you run your test? Even if you are going to 100% CPU load, we
still don't want RCU stalls.

  Andrew


Re: [PATCH RFC 00/13] phylink and sfp support

2017-08-03 Thread Florian Fainelli
On 08/01/2017 07:39 AM, Andrew Lunn wrote:
> On Tue, Jul 25, 2017 at 03:01:39PM +0100, Russell King - ARM Linux wrote:
>> Hi,
>>
>> This patch series introduces generic support for SFP sockets found on
>> various Marvell based platforms.  The idea here is to provide common
>> SFP socket support which can be re-used by network drivers as
>> appropriate, rather than each network driver having to re-implement
>> SFP socket support.
> 
> There is a lot of code here, and i'm not really going to understand it
> until i use it. I have a couple of boards with SFFs connected to
> switches, so i will spend some time over the next month or so to make
> DSA use phylink. As is usual, we can sort out any issues as we go
> along.
> 
> David, if it still applies cleanly, can you add it to net-next?

Agreed, this is in a good shape and the more people we can have testing
this, the better. It may be nice to include a consumer of the PHYLINK
API for other people to start copying from.
-- 
Florian


Re: [PATCH v3 net-next 1/5] net: dsa: lan9303: Change lan9303_xxx_packet_processing() port param.

2017-08-03 Thread Florian Fainelli
On 08/03/2017 02:45 AM, Egil Hjelmeland wrote:
> lan9303_enable_packet_processing, lan9303_disable_packet_processing()
> Pass port number (0,1,2) as parameter instead of port offset.
> Because other functions in the module pass port numbers.
> And to enable simplifications in following patch.
> 
> Introduce lan9303_write_switch_port().
> 
> Signed-off-by: Egil Hjelmeland 
> ---
>  drivers/net/dsa/lan9303-core.c | 60 
> ++
>  1 file changed, 32 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
> index 8e430d1ee297..fa19e320c5a8 100644
> --- a/drivers/net/dsa/lan9303-core.c
> +++ b/drivers/net/dsa/lan9303-core.c
> @@ -159,9 +159,7 @@
>  # define LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT1 (BIT(9) | BIT(8))
>  # define LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT0 (BIT(1) | BIT(0))
>  
> -#define LAN9303_PORT_0_OFFSET 0x400
> -#define LAN9303_PORT_1_OFFSET 0x800
> -#define LAN9303_PORT_2_OFFSET 0xc00
> +#define LAN9303_SWITCH_PORT_REG(port, reg0) (0x400 * (port) + (reg0))
>  
>  /* the built-in PHYs are of type LAN911X */
>  #define MII_LAN911X_SPECIAL_MODES 0x12
> @@ -428,6 +426,13 @@ static int lan9303_read_switch_reg(struct lan9303 *chip, u16 regnum, u32 *val)
>   return ret;
>  }
>  
> +static int lan9303_write_switch_port(
> + struct lan9303 *chip, int port, u16 regnum, u32 val)
> +{
> + return lan9303_write_switch_reg(
> + chip, LAN9303_SWITCH_PORT_REG(port, regnum), val);
> +}

This argument alignment is not looking too good; can you do this instead:

static int lan9303_write_switch_port(struct lan9303 *chip, int port,
                                     u16 regnum, u32 val)
{
}

This applies to patch 5 as well (where the comment should have said it
applies to patch 1).

With that:

Reviewed-by: Florian Fainelli 

> +
>  static int lan9303_detect_phy_setup(struct lan9303 *chip)
>  {
>   int reg;
> @@ -458,24 +463,23 @@ static int lan9303_detect_phy_setup(struct lan9303 *chip)
>   return 0;
>  }
>  
> -#define LAN9303_MAC_RX_CFG_OFFS (LAN9303_MAC_RX_CFG_0 - LAN9303_PORT_0_OFFSET)
> -#define LAN9303_MAC_TX_CFG_OFFS (LAN9303_MAC_TX_CFG_0 - LAN9303_PORT_0_OFFSET)
> -
>  static int lan9303_disable_packet_processing(struct lan9303 *chip,
>unsigned int port)
>  {
>   int ret;
>  
>   /* disable RX, but keep register reset default values else */
> - ret = lan9303_write_switch_reg(chip, LAN9303_MAC_RX_CFG_OFFS + port,
> -LAN9303_MAC_RX_CFG_X_REJECT_MAC_TYPES);
> + ret = lan9303_write_switch_port(
> + chip, port, LAN9303_MAC_RX_CFG_0,
> + LAN9303_MAC_RX_CFG_X_REJECT_MAC_TYPES);
>   if (ret)
>   return ret;
>  
>   /* disable TX, but keep register reset default values else */
> - return lan9303_write_switch_reg(chip, LAN9303_MAC_TX_CFG_OFFS + port,
> - LAN9303_MAC_TX_CFG_X_TX_IFG_CONFIG_DEFAULT |
> - LAN9303_MAC_TX_CFG_X_TX_PAD_ENABLE);
> + return lan9303_write_switch_port(
> + chip, port, LAN9303_MAC_TX_CFG_0,
> + LAN9303_MAC_TX_CFG_X_TX_IFG_CONFIG_DEFAULT |
> + LAN9303_MAC_TX_CFG_X_TX_PAD_ENABLE);

Same here, please don't re-align the arguments, they were fine already.

>  }
>  
>  static int lan9303_enable_packet_processing(struct lan9303 *chip,
> @@ -484,17 +488,19 @@ static int lan9303_enable_packet_processing(struct lan9303 *chip,
>   int ret;
>  
>   /* enable RX and keep register reset default values else */
> - ret = lan9303_write_switch_reg(chip, LAN9303_MAC_RX_CFG_OFFS + port,
> -LAN9303_MAC_RX_CFG_X_REJECT_MAC_TYPES |
> -LAN9303_MAC_RX_CFG_X_RX_ENABLE);
> + ret = lan9303_write_switch_port(
> + chip, port, LAN9303_MAC_RX_CFG_0,
> + LAN9303_MAC_RX_CFG_X_REJECT_MAC_TYPES |
> + LAN9303_MAC_RX_CFG_X_RX_ENABLE);
>   if (ret)
>   return ret;
>  
>   /* enable TX and keep register reset default values else */
> - return lan9303_write_switch_reg(chip, LAN9303_MAC_TX_CFG_OFFS + port,
> - LAN9303_MAC_TX_CFG_X_TX_IFG_CONFIG_DEFAULT |
> - LAN9303_MAC_TX_CFG_X_TX_PAD_ENABLE |
> - LAN9303_MAC_TX_CFG_X_TX_ENABLE);
> + return lan9303_write_switch_port(
> + chip, port, LAN9303_MAC_TX_CFG_0,
> + LAN9303_MAC_TX_CFG_X_TX_IFG_CONFIG_DEFAULT |
> + LAN9303_MAC_TX_CFG_X_TX_PAD_ENABLE |
> + LAN9303_MAC_TX_CFG_X_TX_ENABLE);
>  }
>  
>  /* We want a special working switch:
> @@ -558,13 +564,13 @@ static int lan9303_disable_processing(struct lan9303 *chip)
>  {
>   i

Re: [PATCH v3 net-next 4/5] net: dsa: lan9303: Rename lan9303_xxx_packet_processing()

2017-08-03 Thread Florian Fainelli
On 08/03/2017 02:45 AM, Egil Hjelmeland wrote:
> The lan9303_enable_packet_processing, lan9303_disable_packet_processing
> functions operate on port, so the names should reflect that.
> And to align with lan9303_disable_processing(), rename:
> 
> lan9303_enable_packet_processing -> lan9303_enable_processing_port
> lan9303_disable_packet_processing -> lan9303_disable_processing_port
> 
> Signed-off-by: Egil Hjelmeland 

Reviewed-by: Florian Fainelli 
-- 
Florian


STABLE: net: cdc_mbim: apply "NDP to end" quirk to HP lt4132

2017-08-03 Thread Aurimas Fišeras

Please backport the upstream patch to the stable trees 4.4.y and 4.9.y:

a68491f895a937778bb25b0795830797239de31f net: cdc_mbim: apply "NDP to end" quirk to HP lt4132

Without this patch, the HP lt4132 LTE/HSPA+ 4G Module (03f0:a31d) is useless.
Ubuntu bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1707643

The patch was tested with a 4.4.x kernel.

Sincerely
Aurimas Fišeras


Re: [PATCH v3 net-next 3/5] net: dsa: lan9303: Simplify lan9303_xxx_packet_processing() usage

2017-08-03 Thread Florian Fainelli
On 08/03/2017 02:45 AM, Egil Hjelmeland wrote:
> Simplify usage of lan9303_enable_packet_processing,
> lan9303_disable_packet_processing()
> 
> Signed-off-by: Egil Hjelmeland 

Reviewed-by: Florian Fainelli 

It took me a little while to figure out that we are utilizing the fall-through
of the switch/case statement, and that's why it's okay.

> ---
>  drivers/net/dsa/lan9303-core.c | 24 ++--
>  1 file changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
> index 126e8b84bdf0..31fe66fbe39a 100644
> --- a/drivers/net/dsa/lan9303-core.c
> +++ b/drivers/net/dsa/lan9303-core.c
> @@ -564,15 +564,16 @@ static int lan9303_handle_reset(struct lan9303 *chip)
>  /* stop processing packets for all ports */
>  static int lan9303_disable_processing(struct lan9303 *chip)
>  {
> - int ret;
> + int p;
>  
> - ret = lan9303_disable_packet_processing(chip, 0);
> - if (ret)
> - return ret;
> - ret = lan9303_disable_packet_processing(chip, 1);
> - if (ret)
> - return ret;
> - return lan9303_disable_packet_processing(chip, 2);
> + for (p = 0; p < LAN9303_NUM_PORTS; p++) {
> + int ret = lan9303_disable_packet_processing(chip, p);
> +
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
>  }
>  
>  static int lan9303_check_device(struct lan9303 *chip)
> @@ -765,7 +766,6 @@ static int lan9303_port_enable(struct dsa_switch *ds, int port,
>   /* enable internal packet processing */
>   switch (port) {
>   case 1:
> - return lan9303_enable_packet_processing(chip, port);
>   case 2:
>   return lan9303_enable_packet_processing(chip, port);
>   default:
> @@ -784,13 +784,9 @@ static void lan9303_port_disable(struct dsa_switch *ds, int port,
>   /* disable internal packet processing */
>   switch (port) {
>   case 1:
> - lan9303_disable_packet_processing(chip, port);
> - lan9303_phy_write(ds, chip->phy_addr_sel_strap + 1,
> -   MII_BMCR, BMCR_PDOWN);
> - break;
>   case 2:
>   lan9303_disable_packet_processing(chip, port);
> - lan9303_phy_write(ds, chip->phy_addr_sel_strap + 2,
> + lan9303_phy_write(ds, chip->phy_addr_sel_strap + port,
> MII_BMCR, BMCR_PDOWN);
>   break;
>   default:
> 


-- 
Florian


Re: [PATCH v3 net-next 5/5] net: dsa: lan9303: refactor lan9303_get_ethtool_stats

2017-08-03 Thread Florian Fainelli
On 08/03/2017 02:45 AM, Egil Hjelmeland wrote:
> In lan9303_get_ethtool_stats: Get rid of 0x400 constant magic
> by using new lan9303_read_switch_reg() inside loop.
> Reduced scope of two variables.
> 
> Signed-off-by: Egil Hjelmeland 
> ---
>  drivers/net/dsa/lan9303-core.c | 26 --
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
> index 6f409755ba1a..5aaa46146c27 100644
> --- a/drivers/net/dsa/lan9303-core.c
> +++ b/drivers/net/dsa/lan9303-core.c
> @@ -435,6 +435,13 @@ static int lan9303_write_switch_port(
>   chip, LAN9303_SWITCH_PORT_REG(port, regnum), val);
>  }
>  
> +static int lan9303_read_switch_port(
> + struct lan9303 *chip, int port, u16 regnum, u32 *val)
> +{

This indentation is really funny; why not just break it up this way:

static int lan9303_read_switch_port(struct lan9303 *chip, int port,
                                    u16 regnum, u32 *val)
{
}

This applies to patch 1 as well; other than that:

Reviewed-by: Florian Fainelli 

> + return lan9303_read_switch_reg(
> + chip, LAN9303_SWITCH_PORT_REG(port, regnum), val);
> +}
> +
>  static int lan9303_detect_phy_setup(struct lan9303 *chip)
>  {
>   int reg;
> @@ -709,19 +716,18 @@ static void lan9303_get_ethtool_stats(struct dsa_switch *ds, int port,
> uint64_t *data)
>  {
>   struct lan9303 *chip = ds->priv;
> - u32 reg;
> - unsigned int u, poff;
> - int ret;
> -
> - poff = port * 0x400;
> + unsigned int u;
>  
>   for (u = 0; u < ARRAY_SIZE(lan9303_mib); u++) {
> - ret = lan9303_read_switch_reg(chip,
> -   lan9303_mib[u].offset + poff,
> -   ®);
> + u32 reg;
> + int ret;
> +
> + ret = lan9303_read_switch_port(
> + chip, port, lan9303_mib[u].offset, ®);
> +
>   if (ret)
> - dev_warn(chip->dev, "Reading status reg %u failed\n",
> -  lan9303_mib[u].offset + poff);
> + dev_warn(chip->dev, "Reading status port %d reg %u failed\n",
> +  port, lan9303_mib[u].offset);
>   data[u] = reg;
>   }
>  }
> 


-- 
Florian


Re: [PATCH v3 net-next 2/5] net: dsa: lan9303: define LAN9303_NUM_PORTS 3

2017-08-03 Thread Florian Fainelli
On 08/03/2017 02:45 AM, Egil Hjelmeland wrote:
> Will be used instead of '3' in upcoming patches.
> 
> Signed-off-by: Egil Hjelmeland 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next] net: dsa: Add support for 64-bit statistics

2017-08-03 Thread Florian Fainelli
On 08/02/2017 04:49 PM, David Miller wrote:
> From: Florian Fainelli 
> Date: Tue,  1 Aug 2017 15:00:36 -0700
> 
>> DSA slave network devices maintain a pair of bytes and packets counters
>> for each directions, but these are not 64-bit capable. Re-use
>> pcpu_sw_netstats which contains exactly what we need for that purpose
>> and update the code path to report 64-bit capable statistics.
>>
>> Signed-off-by: Florian Fainelli 
> 
> Applied, thanks.
> 
> I would run ethtool -S and ifconfig under perf to see where it is
> spending so much time.
> 

This appears to be way worse than I thought, will keep digging, but for
now, I may have to send a revert. Andrew, Vivien can you see if you have
the same problems on your boards? Thanks!

# killall iperf
# [ ID] Interval   Transfer Bandwidth
[  3]  0.0-19.1 sec   500 MBytes   220 Mbits/sec
# while true; do ethtool -S gphy; ifconfig gphy; done
^C^C


[   64.566226] INFO: rcu_sched self-detected stall on CPU
[   64.571487]  0-...: (25999 ticks this GP) idle=006/141/0
softirq=965/965 fqs=6495
[   64.580214]   (t=26000 jiffies g=205 c=204 q=51)
[   64.584958] NMI backtrace for cpu 0
[   64.588571] CPU: 0 PID: 1515 Comm: ethtool Not tainted
4.13.0-rc3-00534-g5a4d148f0d78 #328
[   64.596951] Hardware name: Broadcom STB (Flattened Device Tree)
[   64.602973] [] (unwind_backtrace) from []
(show_stack+0x10/0x14)
[   64.610836] [] (show_stack) from []
(dump_stack+0xb0/0xdc)
[   64.618172] [] (dump_stack) from []
(nmi_cpu_backtrace+0x11c/0x120)
[   64.626295] [] (nmi_cpu_backtrace) from []
(nmi_trigger_cpumask_backtrace+0x118/0x158)
[   64.636087] [] (nmi_trigger_cpumask_backtrace) from
[] (rcu_dump_cpu_stacks+0xa0/0xd4)
[   64.645875] [] (rcu_dump_cpu_stacks) from []
(rcu_check_callbacks+0xb8c/0xbfc)
[   64.654965] [] (rcu_check_callbacks) from []
(update_process_times+0x30/0x5c)
[   64.663966] [] (update_process_times) from []
(tick_sched_timer+0x40/0x90)
[   64.672702] [] (tick_sched_timer) from []
(__hrtimer_run_queues+0x198/0x6f0)
[   64.681615] [] (__hrtimer_run_queues) from []
(hrtimer_interrupt+0x98/0x1f4)
[   64.690530] [] (hrtimer_interrupt) from []
(arch_timer_handler_virt+0x28/0x30)
[   64.699628] [] (arch_timer_handler_virt) from []
(handle_percpu_devid_irq+0xc8/0x484)
[   64.709329] [] (handle_percpu_devid_irq) from []
(generic_handle_irq+0x24/0x34)
[   64.718502] [] (generic_handle_irq) from []
(__handle_domain_irq+0x5c/0xb0)
[   64.727325] [] (__handle_domain_irq) from []
(gic_handle_irq+0x48/0x8c)
[   64.735794] [] (gic_handle_irq) from []
(__irq_svc+0x5c/0x7c)
[   64.743384] Exception stack(0xc283fd40 to 0xc283fd88)
[   64.748510] fd40: 0001 0011a9ad  edf37000 ecddb800
f0d95000 ecddbe6c c08e87f4
[   64.756801] fd60: ed6b8010 0660 0658 600f0013 0001
c283fd90 c027a39c c09a8c24
[   64.765091] fd80: 200f0013 
[   64.768647] [] (__irq_svc) from []
(dsa_slave_get_ethtool_stats+0x100/0x104)
[   64.777562] [] (dsa_slave_get_ethtool_stats) from
[] (dev_ethtool+0x768/0x2840)
[   64.786742] [] (dev_ethtool) from []
(dev_ioctl+0x5f8/0xa50)
[   64.794251] [] (dev_ioctl) from []
(do_vfs_ioctl+0xac/0x8d0)
[   64.801755] [] (do_vfs_ioctl) from []
(SyS_ioctl+0x34/0x5c)
[   64.809175] [] (SyS_ioctl) from []
(ret_fast_syscall+0x0/0x1c)
[   64.816901] INFO: rcu_sched detected stalls on CPUs/tasks:
[   64.822480]  0-...: (26006 ticks this GP) idle=006/140/0
softirq=965/965 fqs=6495
[   64.831206]  (detected by 2, t=26264 jiffies, g=205, c=204, q=51)
[   64.837390] Sending NMI from CPU 2 to CPUs 0:
[   64.841811] NMI backtrace for cpu 0
[   64.841818] CPU: 0 PID: 1515 Comm: ethtool Not tainted
4.13.0-rc3-00534-g5a4d148f0d78 #328
[   64.841821] Hardware name: Broadcom STB (Flattened Device Tree)
[   64.841824] task: edf37000 task.stack: c283e000
[   64.841832] PC is at dsa_slave_get_ethtool_stats+0x100/0x104
[   64.841842] LR is at mark_held_locks+0x68/0x90
[   64.841846] pc : []lr : []psr: 200f0013
[   64.841850] sp : c283fd90  ip : 0001  fp : 600f0013
[   64.841853] r10: 0658  r9 : 0660  r8 : ed6b8010
[   64.841856] r7 : c08e87f4  r6 : ecddbe6c  r5 : f0d95000  r4 : ecddb800
[   64.841860] r3 : edf37000  r2 :   r1 : 0011a9ad  r0 : 0001
[   64.841865] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
Segment user
[   64.841869] Control: 30c5387d  Table: 2dddf500  DAC: fffd
[   64.841874] CPU: 0 PID: 1515 Comm: ethtool Not tainted
4.13.0-rc3-00534-g5a4d148f0d78 #328
[   64.841876] Hardware name: Broadcom STB (Flattened Device Tree)
[   64.841885] [] (unwind_backtrace) from []
(show_stack+0x10/0x14)
[   64.841892] [] (show_stack) from []
(dump_stack+0xb0/0xdc)
[   64.841900] [] (dump_stack) from []
(nmi_cpu_backtrace+0xc0/0x120)
[   64.841907] [] (nmi_cpu_backtrace) from []
(handle_IPI+0x108/0x424)
[   64.841914] [] (handle_IPI) from []
(gic_handle_irq+0x88/0x8c)
[   64.841919] [] (gic_handle_irq) from []
(__irq_svc+0x5c/0x7c)
[   64.841922] Exception stack(0xc283fd40 to 0xc283fd88)

Re: [PATCH net-next v3 2/2] bpf: add a test case for syscalls/sys_{enter|exit}_* tracepoints

2017-08-03 Thread Daniel Borkmann

On 08/03/2017 06:29 PM, Yonghong Song wrote:

Signed-off-by: Yonghong Song 


Acked-by: Daniel Borkmann 


Re: [PATCH 00/27] ip: add -json support to 'ip link show'

2017-08-03 Thread Oliver Hartkopp

Hi Julien,

On 08/03/2017 05:54 PM, Julien Fortin wrote:

From: Julien Fortin 


what about

link_veth.c
iplink_vcan.c
iplink_vxcan.c

??

Regards,
Oliver



This patch series adds json support to 'ip [-details] link show [dev DEV]'
Each patch describes the json schema it adds and provides some examples.

Julien Fortin (27):
   color: add new COLOR_NONE and disable_color function
   ip: add new command line argument -json (mutually exclusive with
 -color)
   json_writer: add new json handlers (null, float with format, lluint,
 hu)
   ip: ip_print: add new API to print JSON or regular format output
   ip: ipaddress.c: add support for json output
   ip: iplink.c: open/close json object for ip -brief -json link show dev
 DEV
   ip: iplink_bond.c: add json output support
   ip: iplink_bond_slave.c: add json output support (info_slave_data)
   ip: iplink_hsr.c: add json output support
   ip: iplink_bridge.c: add json output support
   ip: iplink_bridge_slave.c: add json output support
   ip: iplink_can.c: add json output support
   ip: iplink_geneve.c: add json output support
   ip: iplink_ipoib.c: add json output support
   ip: iplink_ipvlan.c: add json output support
   ip: iplink_vrf.c: add json output support
   ip: iplink_vxlan.c: add json output support
   ip: iplink_xdp.c: add json output support
   ip: ipmacsec.c: add json output support
   ip: link_gre.c: add json output support
   ip: link_gre6.c: add json output support
   ip: link_ip6tnl.c: add json output support
   ip: link_iptnl.c: add json output support
   ip: link_vti.c: add json output support
   ip: link_vti6.c: add json output support
   ip: link_macvlan.c: add json output support
   ip: iplink_vlan.c: add json output support

  include/color.h  |2 +
  include/json_writer.h|9 +
  include/utils.h  |1 +
  ip/Makefile  |2 +-
  ip/ip.c  |6 +
  ip/ip_common.h   |   56 +++
  ip/ip_print.c|  233 ++
  ip/ipaddress.c   | 1089 --
  ip/iplink.c  |2 +
  ip/iplink_bond.c |  231 +++---
  ip/iplink_bond_slave.c   |   57 ++-
  ip/iplink_bridge.c   |  291 -
  ip/iplink_bridge_slave.c |  185 +---
  ip/iplink_can.c  |  276 +---
  ip/iplink_geneve.c   |   86 +++-
  ip/iplink_hsr.c  |   36 +-
  ip/iplink_ipoib.c|   30 +-
  ip/iplink_ipvlan.c   |8 +-
  ip/iplink_macvlan.c  |   37 +-
  ip/iplink_vlan.c |   62 ++-
  ip/iplink_vrf.c  |   13 +-
  ip/iplink_vxlan.c|  161 ---
  ip/iplink_xdp.c  |   31 +-
  ip/ipmacsec.c|   84 +++-
  ip/link_gre.c|  147 ---
  ip/link_gre6.c   |  142 --
  ip/link_ip6tnl.c |  172 +---
  ip/link_iptnl.c  |  155 ---
  ip/link_vti.c|   23 +-
  ip/link_vti6.c   |   22 +-
  lib/color.c  |9 +-
  lib/json_writer.c|   44 +-
  32 files changed, 2668 insertions(+), 1034 deletions(-)
  create mode 100644 ip/ip_print.c



Re: [PATCH v2 3/4] can: m_can: Update documentation to mention new fixed transceiver binding

2017-08-03 Thread Rob Herring
On Mon, Jul 24, 2017 at 06:05:20PM -0500, Franklin S Cooper Jr wrote:
> Add information regarding fixed transceiver binding. This is especially
> important for MCAN since the IP allows CAN FD mode to run significantly
> faster than what most transceivers are capable of.
> 
> Signed-off-by: Franklin S Cooper Jr 
> ---
> Version 2 changes:
> Drop unit address
> 
>  Documentation/devicetree/bindings/net/can/m_can.txt | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/can/m_can.txt b/Documentation/devicetree/bindings/net/can/m_can.txt
> index 9e33177..e4abd2c 100644
> --- a/Documentation/devicetree/bindings/net/can/m_can.txt
> +++ b/Documentation/devicetree/bindings/net/can/m_can.txt
> @@ -43,6 +43,11 @@ Required properties:
> Please refer to 2.4.1 Message RAM Configuration in
> Bosch M_CAN user manual for details.
>  
> +Optional properties:
> +- fixed-transceiver  : Fixed-transceiver subnode describing maximum speed

This is a node, not a property. Sub nodes should have their own section.

> +   that can be used for CAN and/or CAN-FD modes.  See
> +   Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
> +   for details.
>  Example:
>  SoC dtsi:
>  m_can1: can@020e8000 {
> @@ -64,4 +69,9 @@ Board dts:
>   pinctrl-names = "default";
>   pinctrl-0 = <&pinctrl_m_can1>;
>   status = "enabled";
> +
> + fixed-transceiver {
> + max-arbitration-speed = <100>;
> + max-data-speed = <500>;
> + };
>  };
> -- 
> 2.10.0
> 


Re: [PATCH net-next 00/14] sctp: remove typedefs from structures part 4

2017-08-03 Thread David Miller
From: Xin Long 
Date: Thu,  3 Aug 2017 15:42:08 +0800

> As we know, typedef is suggested not to use in kernel, even checkpatch.pl
> also gives warnings about it. Now sctp is using it for many structures.
> 
> All this kind of typedef's using should be removed. This patchset is the
> part 4 to remove it for another 14 basic structures from linux/sctp.h.
> After this patchset, all typedefs are cleaned in linux/sctp.h.
> 
> Just as the part 1-3, No any code's logic would be changed in these patches,
> only cleaning up.

Series applied, thanks.


Re: [patch net-next 10/21] ipv6: fib: Add offload indication to routes

2017-08-03 Thread David Ahern
On 8/3/17 5:28 AM, Jiri Pirko wrote:
> diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
> index d496c02..33e2a57 100644
> --- a/include/uapi/linux/ipv6_route.h
> +++ b/include/uapi/linux/ipv6_route.h
> @@ -35,6 +35,7 @@
>  #define RTF_PREF(pref)   ((pref) << 27)
>  #define RTF_PREF_MASK0x18000000
>  
> +#define RTF_OFFLOAD  0x20000000  /* offloaded route  */
>  #define RTF_PCPU 0x40000000  /* read-only: can not be set by user */
>  #define RTF_LOCAL0x80000000

PCPU as a UAPI flag was a mistake; it is a flag internal to IPv6 stack
and really makes no sense to the user. RTF_OFFLOAD should not follow
suit, especially given the limited uapi bits left.


Re: [PATCH v3 2/4] dt-bindings: can: fixed-transceiver: Add new CAN fixed transceiver bindings

2017-08-03 Thread Franklin S Cooper Jr


On 08/03/2017 07:22 AM, Sergei Shtylyov wrote:
> On 08/03/2017 12:48 PM, Franklin S Cooper Jr wrote:
> 
 Add documentation to describe usage of the new fixed transceiver
 binding.
 This new binding is applicable for any CAN device therefore it
 exists as
 its own document.

 Signed-off-by: Franklin S Cooper Jr 
 ---
.../bindings/net/can/fixed-transceiver.txt | 24
 ++
1 file changed, 24 insertions(+)
create mode 100644
 Documentation/devicetree/bindings/net/can/fixed-transceiver.txt

 diff --git
 a/Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
 b/Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
 new file mode 100644
 index 000..2f58838b
 --- /dev/null
 +++ b/Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
 @@ -0,0 +1,24 @@
 +Fixed transceiver Device Tree binding
 +-------------------------------------
 +
 +CAN transceiver typically limits the max speed in standard CAN and
 CAN FD
 +modes. Typically these limitations are static and the transceivers
 themselves
 +provide no way to detect this limitation at runtime. For this
 situation,
 +the "fixed-transceiver" node can be used.
 +
 +Required Properties:
 + max-bitrate:	a positive non-zero value that determines the max
 +		speed that CAN/CAN-FD can run. Any other value
 +		will be ignored.
 +
 +Examples:
 +
 +Based on Texas Instrument's TCAN1042HGV CAN Transceiver
 +
 +m_can0 {
 +
 +fixed-transceiver@0 {
>>>
>>> The  (after @) must only be specified if there's "reg"
>>
>> Sorry. Fixed this in my v2 and some how it came back. Will fix.
>>
>>> prop in the device node. Also, please name the node "can-transceiver@"
>>> to be more in line with the DT spec. which requires generic node names.
>>
>> Its possible for future can transceivers drivers to be created. So I
> 
>So what? Ah, you are using the node name to match in the CAN drivers...
> 
>> thought including fixed was important to indicate that this is a "dumb"
>> transceiver similar to "fixed-link".
> 
>I'm not sure the "fixed-link" MAC subnode assumed any transceiver at
> all...

You're right. I wasn't trying to imply that it does. What I meant was that
having a node named "can-transceiver" may be a bit confusing in the
future if CAN transceiver drivers are created. The prefix "fixed" at least
makes it clear, to me, that this is something unique or a generic
transceiver with limitations. Similar to "fixed-link", which is for MACs
not connected to an MDIO-managed PHY. Calling this subnode
"can-transceiver" to me would be like renaming "fixed-link" to "phy".

> 
>> So would "fixed-can-transceiver" be
>> ok or do you want to go with can-transceiver?
> 
>I'm somewhat perplexed at this point...

If my reasoning still hasn't changed your view, then I'll make the switch.
> 
> MBR, Sergei

