date:20170912

Re: [PATCH] w90p910_ether: include linux/interrupt.h

2017-09-12 Thread David Miller

From: Arnd Bergmann 
Date: Tue, 12 Sep 2017 14:31:48 +0200

> A randconfig build caused a compile failure:
> 
> drivers/net/ethernet/nuvoton/w90p910_ether.c: In function 
> 'w90p910_ether_close':
> drivers/net/ethernet/nuvoton/w90p910_ether.c:580:2: error: implicit 
> declaration of function 'free_irq'; did you mean 'free_uid'? 
> [-Werror=implicit-function-declaration]
> 
> Adding the correct include fixes the problem.
> 
> Signed-off-by: Arnd Bergmann 

Applied.

Re: [PATCH net] net: bonding: fix tlb_dynamic_lb default value

2017-09-12 Thread David Miller

From: Nikolay Aleksandrov 
Date: Tue, 12 Sep 2017 15:10:05 +0300

> Commit 8b426dc54cf4 ("bonding: remove hardcoded value") changed the
> default value for tlb_dynamic_lb which led to either broken ALB mode
> (since tlb_dynamic_lb can be changed only in TLB) or setting TLB mode
> with tlb_dynamic_lb equal to 0.
> The first issue was recently fixed by setting tlb_dynamic_lb to 1 always
> when switching to ALB mode, but the default value is still wrong and
> we'll enter TLB mode with tlb_dynamic_lb equal to 0 if the mode is
> changed via netlink or sysfs. In order to restore the previous behaviour
> and default value simply remove the mode check around the default param
> initialization for tlb_dynamic_lb which will always set it to 1 as
> before.
> 
> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
> Signed-off-by: Nikolay Aleksandrov 

Applied and queued up for -stable, thanks.

Re: [PATCH] ipv4: Namespaceify tcp_fastopen knob

2017-09-12 Thread David Miller

From: Haishuang Yan 
Date: Tue, 12 Sep 2017 18:30:57 +0800

> Applications in different namespaces might need to enable the TCP Fast Open
> feature independently of the host.
> 
> Reported-by: Luca BRUNO 
> Signed-off-by: Haishuang Yan 
 ...
> diff --git a/samples/bpf/test_ipip.sh b/samples/bpf/test_ipip.sh
> index 1969254..7bbc521 100755
> --- a/samples/bpf/test_ipip.sh
> +++ b/samples/bpf/test_ipip.sh
> @@ -173,6 +173,8 @@ function cleanup {
>  cleanup
>  echo "Testing IP tunnels..."
>  test_ipip
> +sleep 1
>  test_ipip6
> +sleep 1
>  test_ip6ip6
>  echo "*** PASS ***"

This seems like a completely unrelated change.

RE: [PATCH v3] iproute2: add support for GRE ignore-df knob

2017-09-12 Thread Michele Lucini

Guys, thanks heaps for this, much appreciated!

Cheers.

Mike
-Original Message-
From: Philip Prindeville [mailto:phil...@redfish-solutions.com] 
Sent: Friday, 21 July 2017 10:35 AM
To: Stephen Hemminger 
Cc: netdev@vger.kernel.org; Michele Lucini 
Subject: Re: [PATCH v3] iproute2: add support for GRE ignore-df knob


> On Jul 20, 2017, at 6:26 PM, Stephen Hemminger  
> wrote:
> 
> On Thu, 20 Jul 2017 13:06:10 -0600
> "Philip Prindeville"  wrote:
> 
>> From: Philip Prindeville 
>> 
>> In the presence of firewalls which improperly block ICMP Unreachable 
>> (including Fragmentation Required) messages, Path MTU Discovery is 
>> prevented from working.
>> 
>> The workaround is to handle IPv4 payloads opaquely, ignoring the DF 
>> bit.
>> 
>> Kernel commit 22a59be8b7693eb2d0897a9638f5991f2f8e4ddd ("net: ipv4:
>> Add ability to have GRE ignore DF bit in IPv4 payloads") is 
>> complemented by this user-space changeset which exposes control of 
>> this setting.
>> 
>> Reviewed-by: Stephen Hemminger 
>> Signed-off-by: Philip Prindeville 
> 
> Applied, thanks Philip


Thanks!  Sorry I didn’t realize that the first submission a year ago hadn’t 
been applied and it took me this long to redux and resubmit it.

Michele: hopefully this comes out in your distro-of-choice fairly soon.  Like I 
said, I thought this had already been rolled in.

-Philip

Re: [PATCH v4 2/2] ip6_tunnel: fix ip6 tunnel lookup in collect_md mode

2017-09-12 Thread David Miller

From: Haishuang Yan 
Date: Tue, 12 Sep 2017 17:47:57 +0800

> In collect_md mode, if the tun dev is down, it still can call
> __ip6_tnl_rcv to receive on packets, and the rx statistics increase
> improperly.
> 
> When the md tunnel is down, it's not necessary to increase RX drops
> for the tunnel device, packets would be received on fallback tunnel,
> and the RX drops on fallback device will be increased as expected.
> 
> Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> Cc: Alexei Starovoitov 
> Signed-off-by: Haishuang Yan 

Applied.

Re: [PATCH v4 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode

2017-09-12 Thread David Miller

From: Haishuang Yan 
Date: Tue, 12 Sep 2017 17:47:56 +0800

> In collect_md mode, if the tun dev is down, it still can call
> ip_tunnel_rcv to receive on packets, and the rx statistics increase
> improperly.
> 
> When the md tunnel is down, it's not necessary to increase RX drops
> for the tunnel device, packets would be received on fallback tunnel,
> and the RX drops on fallback device will be increased as expected.
> 
> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
> Cc: Pravin B Shelar 
> Signed-off-by: Haishuang Yan 

Applied.

Re: [patch net] mlxsw: spectrum: Prevent mirred-related crash on removal

2017-09-12 Thread David Miller

From: Jiri Pirko 
Date: Tue, 12 Sep 2017 08:50:53 +0200

> From: Yuval Mintz 
> 
> When removing the offloading of mirred actions under
> matchall classifiers, mlxsw would find the destination port
> associated with the offloaded action and utilize it for undoing
> the configuration.
> 
> Depending on the order by which ports are removed, it's possible that
> the destination port would get removed before the source port.
> In such a scenario, when actions would be flushed for the source port
> mlxsw would perform an illegal dereference as the destination port is
> no longer listed.
> 
> Since the only item necessary for undoing the configuration on the
> destination side is the port-id and that in turn is already maintained
> by mlxsw on the source-port, simply stop trying to access the
> destination port and use the port-id directly instead.
> 
> Fixes: 763b4b70af ("mlxsw: spectrum: Add support in matchall mirror TC 
> offloading")
> Signed-off-by: Yuval Mintz 
> Signed-off-by: Jiri Pirko 

Applied and queued up for -stable, thanks.

Re: [Patch net v3 0/3] net_sched: fix filter chain reference counting

2017-09-12 Thread David Miller

From: Cong Wang 
Date: Mon, 11 Sep 2017 16:33:29 -0700

> This patchset fixes tc filter chain reference counting and nasty race
> conditions with RCU callbacks. Please see each patch for details.

Series applied, thanks Cong.

Re: [PATCH v2] openvswitch: Fix an error handling path in 'ovs_nla_init_match_and_action()'

2017-09-12 Thread David Miller

From: Christophe JAILLET 
Date: Mon, 11 Sep 2017 21:56:20 +0200

> All other error handling paths in this function go through the 'error'
> label. This one should do the same.
> 
> Fixes: 9cc9a5cb176c ("datapath: Avoid using stack larger than 1024.")
> Signed-off-by: Christophe JAILLET 

Applied.

Re: [PATCH net] tcp/dccp: remove reqsk_put() from inet_child_forget()

2017-09-12 Thread David Miller

From: Eric Dumazet 
Date: Mon, 11 Sep 2017 15:58:38 -0700

> From: Eric Dumazet 
> 
> Back in linux-4.4, I inadvertently put a call to reqsk_put() in
> inet_child_forget(), forgetting it could be called from two different
> points.
> 
> In the case it is called from inet_csk_reqsk_queue_add(), we want to
> keep the reference on the request socket, since it is released later by
> the caller (tcp_v{4|6}_rcv())
> 
> This bug never showed up because atomic_dec_and_test() was not signaling
> the underflow, and SLAB_DESTROY_BY_RCU semantic for request sockets
> prevented the request to be put in quarantine.
> 
> Recent conversion of socket refcount from atomic_t to refcount_t finally
> exposed the bug.
> 
> So move the reqsk_put() to inet_csk_listen_stop() to fix this.
> 
> Thanks to Shankara Pailoor for using syzkaller and providing
> a nice set of .config and C repro.
 ...
> Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
> Signed-off-by: Eric Dumazet 
> Reported-by: Shankara Pailoor 
> Tested-by: Shankara Pailoor 

Applied and queued up for -stable.

Thanks.
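
The failure mode described in the quoted commit message, an extra put that a plain atomic counter never reports, can be illustrated in user space. The following is only a minimal sketch using C11 atomics; it is not the kernel's atomic_t or refcount_t implementation, and `toy_ref`, `toy_put_plain`, and `toy_put_checked` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space counter; NOT the kernel's atomic_t or refcount_t. */
struct toy_ref {
    atomic_int count;
};

static void toy_ref_init(struct toy_ref *r, int n)
{
    assert(n > 0);
    atomic_init(&r->count, n);
}

/* Plain decrement: returns true when the count reaches zero.
 * An extra put silently drives the counter negative. */
static bool toy_put_plain(struct toy_ref *r)
{
    return atomic_fetch_sub(&r->count, 1) == 1;
}

/* Checked decrement: refuses to go below zero and reports the underflow. */
static bool toy_put_checked(struct toy_ref *r, bool *underflow)
{
    int old = atomic_load(&r->count);

    *underflow = false;
    do {
        if (old <= 0) {
            *underflow = true;
            return false;
        }
        /* On CAS failure 'old' is refreshed and the check is repeated. */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));

    return old == 1;
}
```

With the plain variant a second put on a counter that started at 1 simply leaves it at -1, whereas the checked variant keeps it at 0 and sets the underflow flag, which is roughly the kind of signal that made the extra reqsk_put() visible.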

Re: [PATCH v2 net] smsc95xx: Configure pause time to 0xffff when tx flow control enabled

2017-09-12 Thread David Miller

From: 
Date: Mon, 11 Sep 2017 17:43:11 +

> From: Nisar Sayed 
> 
> Configure pause time to 0xffff when tx flow control enabled
> 
> Set pause time to 0xffff in the pause frame to indicate the
> partner to stop sending the packets. When RX buffer frees up,
> the device sends pause frame with pause time zero for partner to
> resume transmission.
> 
> Fixes: 2f7ca802bdae ("Add SMSC LAN9500 USB2.0 10/100 ethernet adapter driver")
> Signed-off-by: Nisar Sayed 

Applied.
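
For readers unfamiliar with the frame the commit message refers to, here is a minimal sketch of what an IEEE 802.3x PAUSE frame looks like on the wire. It only illustrates the pause-time field; it is not the smsc95xx driver code, which programs the value into the device rather than building frames by hand. `build_pause_frame` and `PAUSE_FRAME_LEN` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAUSE_FRAME_LEN 60 /* minimum Ethernet frame size without FCS */

/* Fill buf with an IEEE 802.3x PAUSE frame. src_mac is the sender's address
 * (caller-supplied); quanta is the pause time in units of 512 bit times.
 * Returns the number of bytes written. */
static size_t build_pause_frame(uint8_t buf[PAUSE_FRAME_LEN],
                                const uint8_t src_mac[6], uint16_t quanta)
{
    static const uint8_t pause_dst[6] = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x01 };

    assert(buf != NULL && src_mac != NULL);

    memset(buf, 0, PAUSE_FRAME_LEN);     /* zero padding */
    memcpy(buf, pause_dst, 6);           /* reserved multicast destination */
    memcpy(buf + 6, src_mac, 6);         /* source address */
    buf[12] = 0x88;                      /* EtherType 0x8808: MAC Control */
    buf[13] = 0x08;
    buf[14] = 0x00;                      /* opcode 0x0001: PAUSE */
    buf[15] = 0x01;
    buf[16] = (uint8_t)(quanta >> 8);    /* pause time, big-endian */
    buf[17] = (uint8_t)(quanta & 0xff);
    return PAUSE_FRAME_LEN;
}
```

Calling it with quanta 0xffff asks the partner to stop for the maximum time; calling it again with quanta 0 corresponds to the "resume transmission" frame mentioned above.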

Re: Regression in throughput between kvm guests over virtual bridge

2017-09-12 Thread Jason Wang




On 2017-09-13 01:56, Matthew Rosato wrote:

> We are seeing a regression for a subset of workloads across KVM guests
> over a virtual bridge between host kernel 4.12 and 4.13.  Bisecting
> points to c67df11f "vhost_net: try batch dequing from skb array"
>
> In the regressed environment, we are running 4 kvm guests, 2 running as
> uperf servers and 2 running as uperf clients, all on a single host.
> They are connected via a virtual bridge.  The uperf client profile looks
> like:
>
> [uperf XML profile not preserved in the archive]
>
> So, 1 tcp streaming instance per client.  When upgrading the host kernel
> from 4.12->4.13, we see about a 30% drop in throughput for this
> scenario.  After the bisect, I further verified that reverting c67df11f
> on 4.13 "fixes" the throughput for this scenario.
>
> On the other hand, if we increase the load by upping the number of
> streaming instances to 50 (nprocs="50") or even 10, we see instead a
> ~10% increase in throughput when upgrading host from 4.12->4.13.
>
> So it may be the issue is specific to "light load" scenarios.  I would
> expect some overhead for the batching, but 30% seems significant...  Any
> thoughts on what might be happening here?



Hi, thanks for bisecting. I will try to see if I can reproduce it.
Various factors could have an impact on stream performance. If possible,
could you collect the #pkts and average packet size during the test? And
if your guest version is above 4.12, could you please retry with
napi_tx=true?


Thanks

Re: [PATCH] datapath: Fix an error handling path in 'ovs_nla_init_match_and_action()'

2017-09-12 Thread Tonghao Zhang

On Tue, Sep 12, 2017 at 3:20 AM, Christophe JAILLET
 wrote:
> All other error handling paths in this function go through the 'error'
> label. This one should do the same.
>
> Fixes: 9cc9a5cb176c ("datapath: Avoid using stack larger than 1024.")
> Signed-off-by: Christophe JAILLET 
> ---
> I think that the comment above the function could be improved. It looks
> like the commit log which has introduced this function.
>
> I'm also not sure that commit 9cc9a5cb176c is of any help. It is
> supposed to remove a warning, and I guess it does. But 
> 'ovs_nla_init_match_and_action()'
> is called unconditionally from 'ovs_flow_cmd_set()'. So even if the stack
> used by each function is reduced, the overall stack should be the same, if
> not larger.
>
> So this commit sounds like adding a bug where the code was fine and states
> to fix an issue but, at the best, only hides it.
>
> Instead of fixing the code with the proposed patch, reverting the initial
> commit could also be considered.
> ---
>  net/openvswitch/datapath.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 76cf273a56c7..c3aec6227c91 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -1112,7 +1112,8 @@ static int ovs_nla_init_match_and_action(struct net 
> *net,
> if (!a[OVS_FLOW_ATTR_KEY]) {
> OVS_NLERR(log,
>   "Flow key attribute not present in set 
> flow.");
> -   return -EINVAL;
> +   error = -EINVAL;
> +   goto error;

Thanks for your report, but I don't really understand it.
In 'ovs_nla_init_match_and_action', we only init 'match' when
OVS_FLOW_ATTR_KEY is set.
If 'OVS_FLOW_ATTR_ACTIONS' is set but 'OVS_FLOW_ATTR_KEY' is not, we
can return directly because the match is not inited yet, and it is
unnecessary to set its mask to NULL. ovs_flow_cmd_set can then act on
the returned value.


> }
>
> *acts = get_flow_actions(net, a[OVS_FLOW_ATTR_ACTIONS], key,
> --
> 2.11.0
>
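
The disagreement above is about whether an early failure should return directly or leave through the shared 'error' label. A minimal user-space sketch of that pattern follows; it is not the Open vSwitch code, and `toy_ctx` and `toy_init` are hypothetical names. It assumes, for illustration only, that the label's job is to release whatever was set up earlier.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for state a function may set up before it
 * finishes validating its input. */
struct toy_ctx {
    int *mask; /* owned by the context once allocated */
};

/* Returns 0 on success or a negative errno value. Every failure leaves
 * through the 'error' label so cleanup happens in exactly one place. */
static int toy_init(struct toy_ctx *ctx, int have_key, int have_actions)
{
    int error;

    assert(ctx != NULL);
    ctx->mask = NULL;

    if (have_key) {
        ctx->mask = calloc(1, sizeof(*ctx->mask));
        if (!ctx->mask) {
            error = -ENOMEM;
            goto error;
        }
    }

    if (have_actions && !have_key) {
        /* Nothing was allocated on this path, so a direct return would
         * behave the same; using the label keeps the exits uniform. */
        error = -EINVAL;
        goto error;
    }

    return 0;

error:
    free(ctx->mask); /* free(NULL) is a no-op */
    ctx->mask = NULL;
    return error;
}
```

In this sketch both styles give the same result on the "actions without key" path, which mirrors the point made in the reply, while the single exit keeps the function safe if setup code is later added before that check.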

Re: [PATCH v2] geneve: Fix setting ttl value in collect metadata mode

2017-09-12 Thread Pravin Shelar

On Tue, Sep 12, 2017 at 12:05 AM, Haishuang Yan
 wrote:
> Similar to vxlan/ipip tunnel, if key->tos is zero in collect metadata
> mode, tos should also fallback to ip{4,6}_dst_hoplimit.
>
> Signed-off-by: Haishuang Yan 
>
> ---
> Changes since v2:
>   * Make the commit message clearer.
> ---
>  drivers/net/geneve.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index f640407..d52a65f 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
> net_device *dev,
> sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
> if (geneve->collect_md) {
> tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
> -   ttl = key->ttl;
> } else {
> tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
> -   ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst);
> }
> +   ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst);
> df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
>
This changes the user API of Geneve collect-metadata mode. I do not see a
good reason for this. Why can't the user set the right TTL for the flow?
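
To make the behavioural change under discussion concrete: the patch applies the "use the key's TTL, else fall back to the route's hop limit" rule in collect-metadata mode too, where previously a zero TTL from the key was used as-is. A minimal sketch of that selection rule, with the hypothetical helper name `pick_ttl` and plain integers standing in for the kernel structures:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: prefer the TTL supplied with the tunnel key and fall
 * back to a route-derived default when the key leaves it at zero. */
static uint8_t pick_ttl(uint8_t key_ttl, uint8_t route_hoplimit)
{
    assert(route_hoplimit != 0);
    /* Equivalent to the GNU C shorthand: key_ttl ? : route_hoplimit */
    return key_ttl ? key_ttl : route_hoplimit;
}
```

For example, `pick_ttl(0, 64)` yields 64 and `pick_ttl(5, 64)` yields 5. The objection above is that in collect-metadata mode a zero TTL would no longer be passed through unchanged.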

Re: [PATCH v4 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode

2017-09-12 Thread Pravin Shelar

On Tue, Sep 12, 2017 at 2:47 AM, Haishuang Yan
 wrote:
> In collect_md mode, if the tun dev is down, it still can call
> ip_tunnel_rcv to receive on packets, and the rx statistics increase
> improperly.
>
> When the md tunnel is down, it's not necessary to increase RX drops
> for the tunnel device, packets would be received on fallback tunnel,
> and the RX drops on fallback device will be increased as expected.
>
> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
> Cc: Pravin B Shelar 
> Signed-off-by: Haishuang Yan 
Acked-by: Pravin B Shelar

Memory leaks in conntrack

2017-09-12 Thread Cong Wang

Hello,

While testing my TC filter patches (so not related to conntrack), the
following memory leaks showed up:


unreferenced object 0x9b19ba551228 (size 128):
  comm "chronyd", pid 338, jiffies 4294910829 (age 53.188s)
  hex dump (first 32 bytes):
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  
00 00 00 00 18 00 00 30 00 00 00 00 00 00 00 00  ...0
  backtrace:
[] create_object+0x169/0x2aa
[] kmemleak_alloc+0x25/0x41
[] slab_post_alloc_hook+0x44/0x65
[] __kmalloc_track_caller+0x113/0x146
[] __krealloc+0x4a/0x69
[] nf_ct_ext_add+0xe1/0x145
[] init_conntrack+0x1f7/0x36e
[] nf_conntrack_in+0x1d3/0x326
[] ipv4_conntrack_local+0x4d/0x50
[] nf_hook_slow+0x3c/0x9b
[] nf_hook.constprop.40+0xbe/0xd8
[] __ip_local_out+0xb3/0xbf
[] ip_local_out+0x1c/0x36
[] ip_send_skb+0x19/0x3d
[] udp_send_skb+0x17e/0x1df
[] udp_sendmsg+0x5a2/0x77c
unreferenced object 0x9b19a69b3340 (size 336):
  comm "chronyd", pid 338, jiffies 4294910868 (age 53.032s)
  hex dump (first 32 bytes):
01 00 00 00 5a 5a 5a 5a 00 00 00 00 ad 4e ad de  .N..
ff ff ff ff 5a 5a 5a 5a ff ff ff ff ff ff ff ff  
  backtrace:
[] create_object+0x169/0x2aa
[] kmemleak_alloc+0x25/0x41
[] slab_post_alloc_hook+0x44/0x65
[] kmem_cache_alloc+0xd7/0x1f1
[] __nf_conntrack_alloc+0xa2/0x146
[] init_conntrack+0xb2/0x36e
[] nf_conntrack_in+0x1d3/0x326
[] ipv4_conntrack_local+0x4d/0x50
[] nf_hook_slow+0x3c/0x9b
[] nf_hook.constprop.40+0xbe/0xd8
[] __ip_local_out+0xb3/0xbf
[] ip_local_out+0x1c/0x36
[] ip_send_skb+0x19/0x3d
[] udp_send_skb+0x17e/0x1df
[] udp_sendmsg+0x5a2/0x77c
[] inet_sendmsg+0x37/0x5e

This seems new; I have never seen it before.

I don't touch chronyd in my VM, so I have no idea why it sends out UDP
packets; my guess is that it is some periodic packet.

I don't think I use conntrack either, since /proc/net/ip_conntrack
does not exist.

Here are some related config of my kernel:

$ grep CONNTRACK .config
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_ZONES=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_EVENTS=y
# CONFIG_NF_CONNTRACK_TIMEOUT is not set
CONFIG_NF_CONNTRACK_TIMESTAMP=y
CONFIG_NF_CONNTRACK_AMANDA=y
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CONNTRACK_H323=y
CONFIG_NF_CONNTRACK_IRC=y
CONFIG_NF_CONNTRACK_BROADCAST=y
CONFIG_NF_CONNTRACK_NETBIOS_NS=y
CONFIG_NF_CONNTRACK_SNMP=y
CONFIG_NF_CONNTRACK_PPTP=y
CONFIG_NF_CONNTRACK_SANE=y
CONFIG_NF_CONNTRACK_SIP=y
CONFIG_NF_CONNTRACK_TFTP=y
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NF_CONNTRACK_IPV4=y
CONFIG_NF_CONNTRACK_IPV6=y

Please let me know if you need any other information.

Thanks.

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-12 Thread Josef Bacik

First I’m super sorry for the top post, I’m at plumbers and I forgot to upload 
my muttrc to my new cloud instance, so I’m screwed using outlook.

I have a completely untested, uncompiled patch that I think will fix the 
problem, would you mind giving it a go?  Thanks,

Josef

On 9/12/17, 3:36 PM, "Laura Abbott"  wrote:

Hi,

Fedora got a bug report https://bugzilla.redhat.com/show_bug.cgi?id=1432684
of a regression with automatic spice port assignment. The libvirt team
reduced this to the attached test case run as follows:

In a separate terminal, qemu-kvm -vnc 127.0.0.1:0 to grab port 5900. 
Then do this:

$ gcc bind-collision.c && ./a.out
bind: Address already in use
AF_INET check failed.
$ gcc -D CHECK_IPV6 bind-collision.c && ./a.out
AF_INET6 success
AF_INET success
$ gcc bind-collision.c && ./a.out
AF_INET success

Bisection showed this behavior to be caused by

commit 319554f284dda9f2737d09df82ba3610bd8ddea3
Author: Josef Bacik 
Date:   Thu Jan 19 17:47:46 2017 -0500

    inet: don't use sk_v6_rcv_saddr directly

    When comparing two sockets we need to use inet6_rcv_saddr so we get a NULL
    sk_v6_rcv_saddr if the socket isn't AF_INET6, otherwise our comparison
    function can be wrong.

    Fixes: 637bc8b ("inet: reset tb->fastreuseport when adding a reuseport sk")
    Signed-off-by: Josef Bacik 
    Signed-off-by: David S. Miller 


And reverting fixed both the standalone test case and the spice issue.

Any ideas?

Thanks,
Laura




0001-net-set-tb-fast_sk_family.patch
Description: 0001-net-set-tb-fast_sk_family.patch

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-12 Thread Tom Herbert

On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
 wrote:
>
>
> On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>>
>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>>>
>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:

>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>>>>
>>>>> Two ints in sock_common for this purpose is quite expensive and the
>>>>> use case for this is limited-- even if a RX->TX queue mapping were
>>>>> introduced to eliminate the queue pair assumption this still won't
>>>>> help if the receive and transmit interfaces are different for the
>>>>> connection. I think we really need to see some very compelling results
>>>>> to be able to justify this.
>>>
>>> Will try to collect and post some perf data with symmetric queue
>>> configuration.
>>>
>>>> Yes, this is unreasonable cost.
>>>>
>>>> XPS should really cover the case already.

>>>
>>> Eric,
>>>
>>> Can you clarify how XPS covers the RX-> TX queue mapping case?
>>> Is it possible to configure XPS to select TX queue based on the RX queue
>>> of a flow?
>>> IIUC, it is based on the CPU of the thread doing the transmit OR based
>>> on skb->priority to TC mapping?
>>> It may be possible to get this effect if the the threads are pinned to a
>>> core, but if the app threads are
>>> freely moving, i am not sure how XPS can be configured to select the TX
>>> queue based on the RX queue of a flow.
>>
>> If application is freely moving, how NIC can properly select the RX
>> queue so that packets are coming to the appropriate queue ?
>
> The RX queue is selected via RSS and we don't want to move the flow based on
> where the thread is running.

Unless flow director is enabled on the Intel device... This was, I
believe, one of the first attempts to introduce a queue pair notion to
general purpose NICs. The idea was that the device records the TX
queue for a flow and then uses that to determine receive queue in a
symmetric fashion. aRFS is similar, but was under SW control how the
mapping is done. As Eric mentioned there are scalability issues with
these mechanisms, but we also found that flow director can easily
reorder packets whenever the thread moves.

>>
>>
>> This is called aRFS, and it does not scale to millions of flows.
>> We tried in the past, and this went nowhere really, since the setup cost
>> is prohibitive and DDOS vulnerable.
>>
>> XPS will follow the thread, since selection is done on current cpu.
>>
>> The problem is RX side. If application is free to migrate, then special
>> support (aRFS) is needed from the hardware.
>
> This may be true if most of the rx processing is happening in the interrupt
> context.
> But with busy polling,  i think we don't need aRFS as a thread should be
> able to poll
> any queue irrespective of where it is running.

It's not just a problem with interrupt processing, in general we like
to have all receive processing and subsequent transmit of a reply to be
done on one CPU. Silo'ing is good for performance and parallelism.
This can sometimes be relaxed in situations where CPUs share a cache
so crossing CPUs is not costly.

>>
>>
>> At least for passive connections, we already have all the support in the
>> kernel so that you can have one thread per NIC queue, dealing with
>> sockets that have incoming packets all received on one NIC RX queue.
>> (And of course all TX packets will use the symmetric TX queue)
>>
>> SO_REUSEPORT plus appropriate BPF filter can achieve that.
>>
>> Say you have 32 queues, 32 cpus.
>>
>> Simply use 32 listeners, 32 threads (or 32 pools of threads)
>
> Yes. This will work if each thread is pinned to a core associated with the
> RX interrupt.
> It may not be possible to pin the threads to a core.
> Instead we want to associate a thread to a queue and do all the RX and TX
> completion
> of a queue in the same thread context via busy polling.
>
When that happens it's possible for RX to be done on the completely
wrong CPU which we know is suboptimal. However, this shouldn't
negatively affect TX side since XPS will just use the queue
appropriate for running CPU. Like Eric said, this is really a receive
problem more than a transmit problem. Keeping them as independent
paths seems to be a good approach.

Tom

Re: [PATCH] datapath: Fix an error handling path in 'ovs_nla_init_match_and_action()'

2017-09-12 Thread Greg Rose


On 09/11/2017 12:20 PM, Christophe JAILLET wrote:

> All other error handling paths in this function go through the 'error'
> label. This one should do the same.
>
> Fixes: 9cc9a5cb176c ("datapath: Avoid using stack larger than 1024.")
> Signed-off-by: Christophe JAILLET 
> ---
> I think that the comment above the function could be improved. It looks
> like the commit log which has introduced this function.
>
> I'm also not sure that commit 9cc9a5cb176c is of any help. It is
> supposed to remove a warning, and I guess it does. But
> 'ovs_nla_init_match_and_action()'
> is called unconditionally from 'ovs_flow_cmd_set()'. So even if the stack
> used by each function is reduced, the overall stack should be the same, if
> not larger.
>
> So this commit sounds like adding a bug where the code was fine and states
> to fix an issue but, at the best, only hides it.


Having a large stack frame isn't really a bug per se.  But the Linux kernel
warns about stack frames that are too large so reordering the code to
get the warning to go away seems fine to me.



> Instead of fixing the code with the proposed patch, reverting the initial
> commit could also be considered.


Then the warning will come back.

- Greg


> ---
>  net/openvswitch/datapath.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 76cf273a56c7..c3aec6227c91 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -1112,7 +1112,8 @@ static int ovs_nla_init_match_and_action(struct net *net,
>  		if (!a[OVS_FLOW_ATTR_KEY]) {
>  			OVS_NLERR(log,
>  				  "Flow key attribute not present in set flow.");
> -			return -EINVAL;
> +			error = -EINVAL;
> +			goto error;
>  		}
>  
>  		*acts = get_flow_actions(net, a[OVS_FLOW_ATTR_ACTIONS], key,

319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-12 Thread Laura Abbott


Hi,

Fedora got a bug report https://bugzilla.redhat.com/show_bug.cgi?id=1432684
of a regression with automatic spice port assignment. The libvirt team
reduced this to the attached test case run as follows:

In a separate terminal, qemu-kvm -vnc 127.0.0.1:0 to grab port 5900. 
Then do this:


$ gcc bind-collision.c && ./a.out
bind: Address already in use
AF_INET check failed.
$ gcc -D CHECK_IPV6 bind-collision.c && ./a.out
AF_INET6 success
AF_INET success
$ gcc bind-collision.c && ./a.out
AF_INET success

Bisection showed this behavior to be caused by

commit 319554f284dda9f2737d09df82ba3610bd8ddea3
Author: Josef Bacik 
Date:   Thu Jan 19 17:47:46 2017 -0500

    inet: don't use sk_v6_rcv_saddr directly

    When comparing two sockets we need to use inet6_rcv_saddr so we get a NULL
    sk_v6_rcv_saddr if the socket isn't AF_INET6, otherwise our comparison
    function can be wrong.

    Fixes: 637bc8b ("inet: reset tb->fastreuseport when adding a reuseport sk")
    Signed-off-by: Josef Bacik 
    Signed-off-by: David S. Miller 


And reverting fixed both the standalone test case and the spice issue.

Any ideas?

Thanks,
Laura
/* Header names were lost in the archive; these are the ones the code needs. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Reproducer for https://bugzilla.redhat.com/show_bug.cgi?id=1432684
   Simply do something like: qemu-kvm -vnc 127.0.0.1:0
 */

#define PORT 5900

/* Try to bind a wildcard socket of the given family to PORT.
   Returns 0 on success, -1 on any failure. */
int check_port(int family) {
    int fd = -1;
    int reuseaddr = 1;
    int v6only = 1;
    int addrlen;
    int ret = -1;
    bool ipv6 = false;
    struct sockaddr *addr;

    struct sockaddr_in6 addr6 = {
        .sin6_family = AF_INET6,
        .sin6_port = htons(PORT),
        .sin6_addr = in6addr_any
    };
    struct sockaddr_in addr4 = {
        .sin_family = AF_INET,
        .sin_port = htons(PORT),
        .sin_addr.s_addr = htonl(INADDR_ANY)
    };

    if (family == AF_INET6) {
        addr = (struct sockaddr*)&addr6;
        addrlen = sizeof(addr6);
        ipv6 = true;
    } else if (family == AF_INET) {
        addr = (struct sockaddr*)&addr4;
        addrlen = sizeof(addr4);
    } else {
        printf("Unknown family\n");
        goto out;
    }

    if ((fd = socket(family, SOCK_STREAM, 0)) < 0) {
        perror("socket");
        goto out;
    }

    /* Keep the IPv6 socket from also claiming the IPv4 port. */
    if (ipv6 && setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, (void*)&v6only,
                           sizeof(v6only)) < 0) {
        perror("setsockopt IPV6_V6ONLY");
        goto out;
    }

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
                   &reuseaddr, sizeof(reuseaddr)) < 0) {
        perror("setsockopt SO_REUSEADDR");
        goto out;
    }

    if (bind(fd, addr, addrlen) < 0) {
        perror("bind");
        goto out;
    }

    ret = 0;
out:
    /* fd is still -1 if we failed before socket() succeeded. */
    if (fd >= 0)
        close(fd);
    return ret;
}

int main(void) {
#ifdef CHECK_IPV6
    if (check_port(AF_INET6) < 0) {
        printf("AF_INET6 check failed.\n");
        return -1;
    }
    printf("AF_INET6 success\n");
#endif

    if (check_port(AF_INET) < 0) {
        printf("AF_INET check failed.\n");
        return -1;
    }
    printf("AF_INET success\n");

    return 0;
}

Re: [PATCH] ieee802154: fix gcc-4.9 warnings

2017-09-12 Thread Joe Perches

On Tue, 2017-09-12 at 12:16 +0200, Arnd Bergmann wrote:
> All older compiler versions up to gcc-4.9 produce these
> harmless warnings:
> 
> drivers/net/ieee802154/ca8210.c: In function 'ca8210_skb_tx':
> drivers/net/ieee802154/ca8210.c:1947:9: warning: missing braces around 
> initializer [-Wmissing-braces]
> 
> This changes the syntax to something that works on all versions
> without warnings.
> 
> Fixes: ded845a781a5 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
[]
> diff --git a/drivers/net/ieee802154/ca8210.c b/drivers/net/ieee802154/ca8210.c
[]
> @@ -1944,7 +1944,7 @@ static int ca8210_skb_tx(
>  )
>  {
>   int status;
> - struct ieee802154_hdr header = { 0 };
> + struct ieee802154_hdr header = { };
>   struct secspec secspec;
>   unsigned int mac_len;

Presumably gcc does this because the first member
of struct ieee802154_hdr is another struct.

I wonder if "struct foo bar = { 0 };" should be
discouraged by checkpatch.

Right now it's about 4:3 in favor of
struct foo bar = {};
over
struct foo bar = { 0 };

$ git grep -E "struct\s+\w+\s+\w+\s*=\s*\{\s*0\s*\}\s*[,;]" | wc -l
826
$ git grep -E "struct\s+\w+\s+\w+\s*=\s*\{\s*\}\s*[,;]" | wc -l
990

There are many instances on multiple lines too.
The git grep above doesn't span multiple lines.

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-12 Thread Samudrala, Sridhar




On 9/12/2017 8:47 AM, Eric Dumazet wrote:

> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>>> Two ints in sock_common for this purpose is quite expensive and the
>>>> use case for this is limited-- even if a RX->TX queue mapping were
>>>> introduced to eliminate the queue pair assumption this still won't
>>>> help if the receive and transmit interfaces are different for the
>>>> connection. I think we really need to see some very compelling results
>>>> to be able to justify this.
>>
>> Will try to collect and post some perf data with symmetric queue
>> configuration.
>>
>>> Yes, this is unreasonable cost.
>>>
>>> XPS should really cover the case already.
>>
>> Eric,
>>
>> Can you clarify how XPS covers the RX-> TX queue mapping case?
>> Is it possible to configure XPS to select TX queue based on the RX queue
>> of a flow?
>> IIUC, it is based on the CPU of the thread doing the transmit OR based
>> on skb->priority to TC mapping?
>> It may be possible to get this effect if the the threads are pinned to a
>> core, but if the app threads are
>> freely moving, i am not sure how XPS can be configured to select the TX
>> queue based on the RX queue of a flow.
>
> If application is freely moving, how NIC can properly select the RX
> queue so that packets are coming to the appropriate queue ?

The RX queue is selected via RSS and we don't want to move the flow based on
where the thread is running.

> This is called aRFS, and it does not scale to millions of flows.
> We tried in the past, and this went nowhere really, since the setup cost
> is prohibitive and DDOS vulnerable.
>
> XPS will follow the thread, since selection is done on current cpu.
>
> The problem is RX side. If application is free to migrate, then special
> support (aRFS) is needed from the hardware.

This may be true if most of the rx processing is happening in the interrupt
context. But with busy polling,  i think we don't need aRFS as a thread should
be able to poll any queue irrespective of where it is running.

> At least for passive connections, we already have all the support in the
> kernel so that you can have one thread per NIC queue, dealing with
> sockets that have incoming packets all received on one NIC RX queue.
> (And of course all TX packets will use the symmetric TX queue)
>
> SO_REUSEPORT plus appropriate BPF filter can achieve that.
>
> Say you have 32 queues, 32 cpus.
>
> Simply use 32 listeners, 32 threads (or 32 pools of threads)

Yes. This will work if each thread is pinned to a core associated with the
RX interrupt. It may not be possible to pin the threads to a core.
Instead we want to associate a thread to a queue and do all the RX and TX
completion of a queue in the same thread context via busy polling.

Thanks
Sridhar

Re: [PATCH 2/2] net: qcom/emac: add software control for pause frame mode

2017-09-12 Thread Timur Tabi


On 08/01/2017 04:37 PM, Timur Tabi wrote:
> The EMAC has the option of sending only a single pause frame when
> flow control is enabled and the RX queue is full.  Although sending
> only one pause frame has little value, this would allow admins to
> enable automatic flow control without having to worry about the EMAC
> flooding nearby switches with pause frames if the kernel hangs.
>
> The option is enabled by using the single-pause-mode private flag.
>
> Signed-off-by: Timur Tabi

Dave,

I don't see this patch in net-next.  Can you pick it up for 4.14?

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

Re: [Patch net v3 1/3] net_sched: get rid of tcfa_rcu

2017-09-12 Thread Cong Wang

On Tue, Sep 12, 2017 at 2:36 PM, Jiri Pirko  wrote:
> Tue, Sep 12, 2017 at 11:10:22PM CEST, xiyou.wangc...@gmail.com wrote:
>>On Tue, Sep 12, 2017 at 3:40 AM, Jiri Pirko  wrote:
>>> This patch helps:
>>
>>Looks good to me. Please feel free to submit a formal patch.
>
> Okay, I will send the patch to you formally so you can add it as a first
> patch of your patchset.

I can carry it by myself if it fits to this patchset. However, I believe it
should be independent since it has to be backported much further
than this patchset. I don't know why no one triggered the crash
before call_rcu() was introduced there.

Anyway, I believe you should submit your patch alone, either before
or after this patchset, there should be no conflict.

Re: [PATCH net] net: systemport: Fix 64-bit stats deadlock

2017-09-12 Thread Florian Fainelli

On 09/12/2017 02:38 PM, Eric Dumazet wrote:
> On Tue, 2017-09-12 at 13:14 -0700, Florian Fainelli wrote:
>> We can enter a deadlock situation because there is no sufficient protection
>> when ndo_get_stats64() runs in process context to guard against RX or TX NAPI
>> contexts running in softirq, this can lead to the following lockdep splat and
>> actual deadlock was experienced as well with an iperf session in the 
>> background
>> and a while loop doing ifconfig + ethtool.
> 
>> So just remove the u64_stats_update_begin()/end() pair in ndo_get_stats64()
>> since it does not appear to be useful for anything. No inconsistency was
>> observed with either ifconfig or ethtool, global TX counts equal the sum of
>> per-queue TX counts on a 32-bit architecture.
>>
>> Fixes: 10377ba7673d ("net: systemport: Support 64bit statistics")
>> Signed-off-by: Florian Fainelli 
>> ---
>>  drivers/net/ethernet/broadcom/bcmsysport.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
>> b/drivers/net/ethernet/broadcom/bcmsysport.c
>> index a6572b51435a..c3c53f6cd9e6 100644
>> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
>> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
>> @@ -1735,11 +1735,8 @@ static void bcm_sysport_get_stats64(struct net_device 
>> *dev,
>>  stats->tx_packets += tx_packets;
>>  }
>>  
>> -/* lockless update tx_bytes and tx_packets */
>> -u64_stats_update_begin(>syncp);
> 
> Yes, this u64_stats_update_begin()/u64_stats_update_end() is bogus
> 
> But why do we even write on tx_bytes/tx_packets here ??? 

That's for the ethtool -S netdev stats copy (that's on me, I added that
in the driver initial version), so yes, not very robust...

> 
> Seems very wrong anyway.
> 
> (ethtool -S does not call bcm_sysport_get_stats64() to refresh them )

Yes that might actually be the simplest way to get this fixed.

> 
>>  stats64->tx_bytes = stats->tx_bytes;
>>  stats64->tx_packets = stats->tx_packets;
>> -u64_stats_update_end(>syncp);
>>  
>>  do {
>>  start = u64_stats_fetch_begin_irq(>syncp);
> 
> 


-- 
Florian

Re: [PATCH net] net: systemport: Fix 64-bit stats deadlock

2017-09-12 Thread Eric Dumazet

On Tue, 2017-09-12 at 13:14 -0700, Florian Fainelli wrote:
> We can enter a deadlock situation because there is no sufficient protection
> when ndo_get_stats64() runs in process context to guard against RX or TX NAPI
> contexts running in softirq, this can lead to the following lockdep splat and
> actual deadlock was experienced as well with an iperf session in the 
> background
> and a while loop doing ifconfig + ethtool.

> So just remove the u64_stats_update_begin()/end() pair in ndo_get_stats64()
> since it does not appear to be useful for anything. No inconsistency was
> observed with either ifconfig or ethtool, global TX counts equal the sum of
> per-queue TX counts on a 32-bit architecture.
> 
> Fixes: 10377ba7673d ("net: systemport: Support 64bit statistics")
> Signed-off-by: Florian Fainelli 
> ---
>  drivers/net/ethernet/broadcom/bcmsysport.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
> b/drivers/net/ethernet/broadcom/bcmsysport.c
> index a6572b51435a..c3c53f6cd9e6 100644
> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
> @@ -1735,11 +1735,8 @@ static void bcm_sysport_get_stats64(struct net_device 
> *dev,
>   stats->tx_packets += tx_packets;
>   }
>  
> - /* lockless update tx_bytes and tx_packets */
> - u64_stats_update_begin(>syncp);

Yes, this u64_stats_update_begin()/u64_stats_update_end() is bogus

But why do we even write on tx_bytes/tx_packets here ??? 

Seems very wrong anyway.

(ethtool -S does not call bcm_sysport_get_stats64() to refresh them )

>   stats64->tx_bytes = stats->tx_bytes;
>   stats64->tx_packets = stats->tx_packets;
> - u64_stats_update_end(>syncp);
>  
>   do {
>   start = u64_stats_fetch_begin_irq(>syncp);

Re: [Patch net v3 1/3] net_sched: get rid of tcfa_rcu

2017-09-12 Thread Jiri Pirko

Tue, Sep 12, 2017 at 11:10:22PM CEST, xiyou.wangc...@gmail.com wrote:
>On Tue, Sep 12, 2017 at 3:40 AM, Jiri Pirko  wrote:
>> Tue, Sep 12, 2017 at 11:42:15AM CEST, j...@resnulli.us wrote:
>>>Tue, Sep 12, 2017 at 01:33:30AM CEST, xiyou.wangc...@gmail.com wrote:
>>>>gen estimator has been rewritten in commit 1c0d32fde5bd
>>>>("net_sched: gen_estimator: complete rewrite of rate estimators"),
>>>>the caller is no longer needed to wait for a grace period.
>>>>So this patch gets rid of it.
>>>>
>>>>This also completely closes a race condition between action free
>>>>path and filter chain add/remove path for the following patch.
>>>>Because otherwise the nested RCU callback can't be caught by
>>>>rcu_barrier().
>>>>
>>>>Please see also the comments in code.
>>>
>>>Looks like this is causing a null pointer dereference bug for me, 100%
>>>of the time. Just add and remove any rule with action and you get:
>>>
>>
>> [...]
>>
>>>
>>>Looks like you need to save owner of the module before you call
>>>__tcf_idr_release so you can later on use it for module_put
>
>Why do you believe it is this patch that introduces the bug?
>
>That code has been there since the beginning of git history:
>
>+   for (a = act; a; a = act) {
>+   if (a->ops && a->ops->cleanup) {
>+   DPRINTK("tcf_action_destroy destroying %p next %p\n",
>+   a, a->next);
>+   if (a->ops->cleanup(a, bind) == ACT_P_DELETED)
>+   module_put(a->ops->owner);
>+   act = act->next;
>
>Seems to be a very old one. The reason why it is exposed now, I guess,
>is that call_rcu() somehow delays the free until after module_put().

Yeah, looks like the race was just hard to hit. However with your patch,
it is very easy to hit.


>
>
>>
>> This patch helps:
>
>Looks good to me. Please feel free to submit a formal patch.

Okay, I will send the patch to you formally so you can add it as a first
patch of your patchset.

Re: [Patch net v3 1/3] net_sched: get rid of tcfa_rcu

2017-09-12 Thread Cong Wang

On Tue, Sep 12, 2017 at 3:40 AM, Jiri Pirko  wrote:
> Tue, Sep 12, 2017 at 11:42:15AM CEST, j...@resnulli.us wrote:
>>Tue, Sep 12, 2017 at 01:33:30AM CEST, xiyou.wangc...@gmail.com wrote:
>>>gen estimator has been rewritten in commit 1c0d32fde5bd
>>>("net_sched: gen_estimator: complete rewrite of rate estimators"),
>>>the caller is no longer needed to wait for a grace period.
>>>So this patch gets rid of it.
>>>
>>>This also completely closes a race condition between action free
>>>path and filter chain add/remove path for the following patch.
>>>Because otherwise the nested RCU callback can't be caught by
>>>rcu_barrier().
>>>
>>>Please see also the comments in code.
>>
>>Looks like this is causing a null pointer dereference bug for me, 100%
>>of the time. Just add and remove any rule with action and you get:
>>
>
> [...]
>
>>
>>Looks like you need to save owner of the module before you call
>>__tcf_idr_release so you can later on use it for module_put

Why do you believe it is this patch that introduces the bug?

That code has been there since the beginning of git history:

+   for (a = act; a; a = act) {
+   if (a->ops && a->ops->cleanup) {
+   DPRINTK("tcf_action_destroy destroying %p next %p\n",
+   a, a->next);
+   if (a->ops->cleanup(a, bind) == ACT_P_DELETED)
+   module_put(a->ops->owner);
+   act = act->next;

Seems to be a very old one. The reason why it is exposed now, I guess,
is that call_rcu() somehow delays the free until after module_put().


>
> This patch helps:

Looks good to me. Please feel free to submit a formal patch.
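
The ordering issue raised above (read the module owner before the release call that may free the action) can be shown with a small userspace model. This is only a sketch: struct module_model, struct action_model, action_release() and action_destroy_one() are hypothetical stand-ins, not the kernel's tcf_* code.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for struct module / tc_action_ops / tc_action. */
struct module_model { int refcnt; };
struct action_ops_model { struct module_model *owner; };
struct action_model {
	const struct action_ops_model *ops;
	int refcnt;
};

/* Drop one reference; free the action and return 1 when it was the last. */
static int action_release(struct action_model *a)
{
	if (--a->refcnt > 0)
		return 0;
	free(a);
	return 1;
}

static void module_put_model(struct module_model *m)
{
	m->refcnt--;
}

/* Read the owner first: after action_release() 'a' may already be freed. */
static void action_destroy_one(struct action_model *a)
{
	struct module_model *owner = a->ops->owner;

	if (action_release(a))
		module_put_model(owner);
}
```

Writing module_put_model(a->ops->owner) after the release would dereference 'a' once it may have been freed, which is the pattern the thread describes.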

Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Michal Kubecek

On Tue, Sep 12, 2017 at 11:54:43AM -0700, Ben Greear wrote:
> It does not appear to work on Fedora-26, and I'm curious if someone
> knows what needs doing to get this support working?

It's rather complicated. The "vlan" and "vlan " filters didn't
handle the case when vlan information is passed in metadata until commit
04660eb1e561 ("Use BPF extensions in compiled filters"), i.e. libpcap
1.7.0. Unfortunately that commit made libpcap always check only metadata
for the first outermost vlan tag so that it broke the case when vlan
information is passed in packet itself (which is less frequent today).

To handle both cases correctly, you would need libpcap with commits
d739b068ac29 ("Make VLAN filter handle both metadata and inline tags")
and 7c7a19fbd9af ("Fix logic of combined VLAN test") and also the
optimizer fix from

  https://github.com/the-tcpdump-group/libpcap/pull/582/commits/075015a3d17a

(without it the filters generate incorrect BPF in some cases unless the
optimizer is disabled). As far as I can see, these commits are not in
any release yet.

   Michal Kubecek

Re: [PATCH/RFC net-next 2/2] net/sched: allow flower to match tunnel options

2017-09-12 Thread Or Gerlitz

On Tue, Sep 12, 2017 at 5:20 PM, Simon Horman
 wrote:
> Allow matching on options in tunnel headers.
> This makes use of existing tunnel metadata support.

Simon,

This patch is about matching on tunnel options, right? but

> Options are a bytestring of up to 256 bytes.
> Tunnel implementations may support less or more options,
> or no options at all.
>
>  # ip link add name geneve0 type geneve dstport 0 external
>  # tc qdisc add dev eth0 ingress
>  # tc qdisc del dev eth0 ingress; tc qdisc add dev eth0 ingress
>  # tc filter add dev eth0 protocol ip parent : \
>  flower indev eth0 \
> ip_proto udp \
> action tunnel_key \
> set src_ip 10.0.99.192 \
> dst_ip 10.0.99.193 \
> dst_port 4789 \
> id 11 \
> opts 0102800100800022 \
> action mirred egress redirect dev geneve0

the example here is on how to use tunnel options in the tunnel set key actions..

And the other way around in the other patch... the patch is about the
tunnel key set action and the example shows how to match that in
flower... I guess you want to swap the relevant parts of the change logs.

Anyway, is there any human readable/understandable representation of
these options? e.g. what does 0102800100800022 mean for geneve?
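
One possible reading of that bytestring, assuming it is a raw Geneve option TLV as laid out in the Geneve specification (16-bit option class, 8-bit type, 3 reserved bits plus a 5-bit length counted in 4-byte words). Whether the patch interprets the bytes exactly this way is an assumption, and struct geneve_opt_view and geneve_opt_decode() are illustrative names only.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative view of one Geneve option (names are hypothetical). */
struct geneve_opt_view {
	uint16_t opt_class;   /* option class, converted to host order */
	uint8_t type;         /* option type; top bit marks a critical option */
	uint8_t data_len;     /* option data length in bytes */
	const uint8_t *data;  /* points into the caller's buffer */
};

/*
 * Decode one option at the start of buf.
 * Returns the number of bytes consumed, or 0 if the buffer is too short.
 */
static size_t geneve_opt_decode(const uint8_t *buf, size_t len,
				struct geneve_opt_view *out)
{
	size_t data_len;

	if (len < 4)
		return 0;
	data_len = (size_t)(buf[3] & 0x1f) * 4;   /* length is in 4-byte words */
	if (len < 4 + data_len)
		return 0;
	out->opt_class = (uint16_t)((buf[0] << 8) | buf[1]);
	out->type = buf[2];
	out->data_len = (uint8_t)data_len;
	out->data = buf + 4;
	return 4 + data_len;
}
```

Under that assumption, 0102800100800022 reads as option class 0x0102, type 0x80 (critical bit set) and one 4-byte word of data, 00 80 00 22.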

[PATCH net] net: systemport: Fix 64-bit stats deadlock

2017-09-12 Thread Florian Fainelli

We can enter a deadlock situation because there is not sufficient protection
when ndo_get_stats64() runs in process context to guard against RX or TX NAPI
contexts running in softirq. This can lead to the following lockdep splat, and
an actual deadlock was experienced as well with an iperf session in the
background and a while loop doing ifconfig + ethtool.

[5.780350] 
[5.784679] WARNING: inconsistent lock state
[5.789011] 4.13.0-rc7-02179-g32fae27c725d #70 Not tainted
[5.794561] 
[5.798890] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[5.804971] swapper/0/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
[5.810175]  (>seq#2){+.?...}, at: [] 
bcm_sysport_tx_reclaim+0x30/0x54
[5.818327] {SOFTIRQ-ON-W} state was registered at:
[5.823278]   bcm_sysport_get_stats64+0x17c/0x258
[5.828053]   dev_get_stats+0x38/0xac
[5.831776]   rtnl_fill_stats+0x30/0x118
[5.835761]   rtnl_fill_ifinfo+0x538/0xe24
[5.839921]   rtmsg_ifinfo_build_skb+0x6c/0xd8
[5.844430]   rtmsg_ifinfo_event.part.5+0x14/0x44
[5.849201]   rtmsg_ifinfo+0x20/0x28
[5.852837]   register_netdevice+0x628/0x6b8
[5.857171]   register_netdev+0x14/0x24
[5.861051]   bcm_sysport_probe+0x30c/0x438
[5.865280]   platform_drv_probe+0x50/0xb0
[5.869418]   driver_probe_device+0x2e8/0x450
[5.873817]   __driver_attach+0x104/0x120
[5.877871]   bus_for_each_dev+0x7c/0xc0
[5.881834]   bus_add_driver+0x1b0/0x270
[5.885797]   driver_register+0x78/0xf4
[5.889675]   do_one_initcall+0x54/0x190
[5.893646]   kernel_init_freeable+0x144/0x1d0
[5.898135]   kernel_init+0x8/0x110
[5.901665]   ret_from_fork+0x14/0x2c
[5.905363] irq event stamp: 24263
[5.908804] hardirqs last  enabled at (24262): [] 
net_rx_action+0xc4/0x4e4
[5.916624] hardirqs last disabled at (24263): [] 
_raw_spin_lock_irqsave+0x1c/0x98
[5.925143] softirqs last  enabled at (24258): [] 
irq_enter+0x84/0x98
[5.932524] softirqs last disabled at (24259): [] 
irq_exit+0x108/0x16c
[5.939985]
[5.939985] other info that might help us debug this:
[5.946576]  Possible unsafe locking scenario:
[5.946576]
[5.952556]CPU0
[5.955031]
[5.957506]   lock(>seq#2);
[5.960955]   
[5.963604] lock(>seq#2);
[5.967227]
[5.967227]  *** DEADLOCK ***
[5.967227]
[5.973222] 1 lock held by swapper/0/0:
[5.977092]  #0:  (&(>lock)->rlock){..-...}, at: [] 
bcm_sysport_tx_reclaim+0x20/0x54

So just remove the u64_stats_update_begin()/end() pair in ndo_get_stats64()
since it does not appear to be useful for anything. No inconsistency was
observed with either ifconfig or ethtool, global TX counts equal the sum of
per-queue TX counts on a 32-bit architecture.

Fixes: 10377ba7673d ("net: systemport: Support 64bit statistics")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index a6572b51435a..c3c53f6cd9e6 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1735,11 +1735,8 @@ static void bcm_sysport_get_stats64(struct net_device 
*dev,
stats->tx_packets += tx_packets;
}
 
-   /* lockless update tx_bytes and tx_packets */
-   u64_stats_update_begin(>syncp);
stats64->tx_bytes = stats->tx_bytes;
stats64->tx_packets = stats->tx_packets;
-   u64_stats_update_end(>syncp);
 
do {
start = u64_stats_fetch_begin_irq(>syncp);
-- 
1.9.1
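
Why an unprotected write section is hazardous can be seen in a simplified, single-threaded userspace model of a sequence counter. This is a sketch only, not the kernel's u64_stats or seqcount implementation; struct seq_model and the seq_* helpers are hypothetical names. A reader only makes progress while the sequence is even, so anything that leaves it odd from the reader's point of view (for example a write section interrupted on the same CPU) makes the reader spin.

```c
#include <stdbool.h>

/* Simplified model of a sequence counter (hypothetical names). */
struct seq_model { unsigned int sequence; };

/* Entering a write section makes the sequence odd. */
static void seq_write_begin(struct seq_model *s) { s->sequence++; }

/* Leaving a write section makes the sequence even again. */
static void seq_write_end(struct seq_model *s) { s->sequence++; }

/*
 * One attempt at entering a read section. A real reader loops until this
 * succeeds; here a failed attempt is reported so the stall can be observed.
 */
static bool seq_read_try_begin(const struct seq_model *s, unsigned int *start)
{
	if (s->sequence & 1)
		return false;   /* a writer is in progress: reader would spin */
	*start = s->sequence;
	return true;
}

/* True if a writer ran since 'start' and the read must be repeated. */
static bool seq_read_retry(const struct seq_model *s, unsigned int start)
{
	return s->sequence != start;
}
```

In this model, a reader that runs between seq_write_begin() and seq_write_end() on the same CPU can never succeed, because the interrupted writer does not get to finish.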

[PATCH] net: vrf: avoid gcc-4.6 warning

2017-09-12 Thread Arnd Bergmann

When building an allmodconfig kernel with gcc-4.6, we get a rather
odd warning:

drivers/net/vrf.c: In function ‘vrf_ip6_input_dst’:
drivers/net/vrf.c:964:3: error: initialized field with side-effects overwritten 
[-Werror]
drivers/net/vrf.c:964:3: error: (near initialization for ‘fl6’) [-Werror]

I have no idea what this warning is even trying to say, but it does
seem like a false positive. Reordering the initialization to match
the structure definition gets rid of the warning, and might also avoid
whatever gcc thinks is wrong here.

Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/vrf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 7e19051f3230..9b243e6f3008 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -957,12 +957,12 @@ static void vrf_ip6_input_dst(struct sk_buff *skb, struct 
net_device *vrf_dev,
 {
const struct ipv6hdr *iph = ipv6_hdr(skb);
struct flowi6 fl6 = {
+   .flowi6_iif = ifindex,
+   .flowi6_mark= skb->mark,
+   .flowi6_proto   = iph->nexthdr,
.daddr  = iph->daddr,
.saddr  = iph->saddr,
.flowlabel  = ip6_flowinfo(iph),
-   .flowi6_mark= skb->mark,
-   .flowi6_proto   = iph->nexthdr,
-   .flowi6_iif = ifindex,
};
struct net *net = dev_net(vrf_dev);
struct rt6_info *rt6;
-- 
2.9.0
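
A rough illustration of the kind of initializer involved, assuming (as in the kernel's struct flowi6) that some field names are macros expanding to members of a nested leading struct. struct flow_common_model, struct flow6_model and the f6_* macros are made-up names; the sketch only shows the reordered form, with the nested members initialized first in declaration order, and does not try to reproduce the gcc-4.6 diagnostic.

```c
#include <stdint.h>

/* Made-up miniature of a flow key with a nested common part. */
struct flow_common_model {
	int iif;
	uint32_t mark;
	uint8_t proto;
};

struct flow6_model {
	struct flow_common_model common;
#define f6_iif   common.iif
#define f6_mark  common.mark
#define f6_proto common.proto
	uint32_t daddr;
	uint32_t saddr;
	uint32_t flowlabel;
};

/* Initializers follow the declaration order: nested members first. */
static struct flow6_model flow6_model_init(int ifindex, uint32_t mark,
					   uint8_t proto, uint32_t daddr,
					   uint32_t saddr, uint32_t label)
{
	struct flow6_model fl6 = {
		.f6_iif    = ifindex,
		.f6_mark   = mark,
		.f6_proto  = proto,
		.daddr     = daddr,
		.saddr     = saddr,
		.flowlabel = label,
	};

	return fl6;
}
```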

[PATCH] ravb: document R8A77970 bindings

2017-09-12 Thread Sergei Shtylyov

R-Car V3M (R8A77970) SoC also has the R-Car gen3 compatible EtherAVB
device, so document the SoC specific bindings.

Signed-off-by: Sergei Shtylyov 

---
The patch is against DaveM's 'net-next.git' repo but I wouldn't mind if it's
applied to 'net.git' instead. :-)

 Documentation/devicetree/bindings/net/renesas,ravb.txt |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: net-next/Documentation/devicetree/bindings/net/renesas,ravb.txt
===
--- net-next.orig/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ net-next/Documentation/devicetree/bindings/net/renesas,ravb.txt
@@ -17,6 +17,7 @@ Required properties:
 
   - "renesas,etheravb-r8a7795" for the R8A7795 SoC.
   - "renesas,etheravb-r8a7796" for the R8A7796 SoC.
+  - "renesas,etheravb-r8a77970" for the R8A77970 SoC.
   - "renesas,etheravb-rcar-gen3" as a fallback for the above
R-Car Gen3 devices.
 
@@ -40,7 +41,7 @@ Optional properties:
 - interrupt-parent: the phandle for the interrupt controller that services
interrupts for this device.
 - interrupt-names: A list of interrupt names.
-  For the R8A779[56] SoCs this property is mandatory;
+  For the R-Car Gen 3 SoCs this property is mandatory;
   it should include one entry per channel, named "ch%u",
   where %u is the channel number ranging from 0 to 24.
   For other SoCs this property is optional; if present

Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Ben Greear


On 09/12/2017 11:54 AM, Ben Greear wrote:
> It does not appear to work on Fedora-26, and I'm curious if someone knows
> what needs doing to get this support working?
>
> Thanks,
> Ben

Gah, I spoke too soon.  system-test guy says it works on cmd-line, but
not when we try to make it work in another way...could be local bug,
I'll poke at this more.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Ben Greear


It does not appear to work on Fedora-26, and I'm curious if someone knows
what needs doing to get this support working?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Regression in throughput between kvm guests over virtual bridge

2017-09-12 Thread Matthew Rosato

We are seeing a regression for a subset of workloads across KVM guests
over a virtual bridge between host kernel 4.12 and 4.13.  Bisecting
points to c67df11f "vhost_net: try batch dequing from skb array"

In the regressed environment, we are running 4 kvm guests, 2 running as
uperf servers and 2 running as uperf clients, all on a single host.
They are connected via a virtual bridge.  The uperf client profile looks
like:

[uperf profile XML not preserved in the archive]

So, 1 tcp streaming instance per client.  When upgrading the host kernel
from 4.12->4.13, we see about a 30% drop in throughput for this
scenario.  After the bisect, I further verified that reverting c67df11f
on 4.13 "fixes" the throughput for this scenario.

On the other hand, if we increase the load by upping the number of
streaming instances to 50 (nprocs="50") or even 10, we see instead a
~10% increase in throughput when upgrading host from 4.12->4.13.

So it may be the issue is specific to "light load" scenarios.  I would
expect some overhead for the batching, but 30% seems significant...  Any
thoughts on what might be happening here?

Re: [PATCH net] net: bonding: fix tlb_dynamic_lb default value

2017-09-12 Thread महेश बंडेवार

On Tue, Sep 12, 2017 at 5:10 AM, Nikolay Aleksandrov
 wrote:
> Commit 8b426dc54cf4 ("bonding: remove hardcoded value") changed the
> default value for tlb_dynamic_lb which lead to either broken ALB mode
> (since tlb_dynamic_lb can be changed only in TLB) or setting TLB mode
> with tlb_dynamic_lb equal to 0.
> The first issue was recently fixed by setting tlb_dynamic_lb to 1 always
> when switching to ALB mode, but the default value is still wrong and
> we'll enter TLB mode with tlb_dynamic_lb equal to 0 if the mode is
> changed via netlink or sysfs. In order to restore the previous behaviour
> and default value simply remove the mode check around the default param
> initialization for tlb_dynamic_lb which will always set it to 1 as
> before.
>
> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
> Signed-off-by: Nikolay Aleksandrov 
Acked-by: Mahesh Bandewar 
> ---
>  drivers/net/bonding/bond_main.c | 17 +++--
>  1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index fc63992ab0e0..c99dc59d729b 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -4289,7 +4289,7 @@ static int bond_check_params(struct bond_params *params)
> int bond_mode   = BOND_MODE_ROUNDROBIN;
> int xmit_hashtype = BOND_XMIT_POLICY_LAYER2;
> int lacp_fast = 0;
> -   int tlb_dynamic_lb = 0;
> +   int tlb_dynamic_lb;
>
> /* Convert string parameters. */
> if (mode) {
> @@ -4601,16 +4601,13 @@ static int bond_check_params(struct bond_params 
> *params)
> }
> ad_user_port_key = valptr->value;
>
> -   if ((bond_mode == BOND_MODE_TLB) || (bond_mode == BOND_MODE_ALB)) {
> -   bond_opt_initstr(, "default");
> -   valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB),
> -   );
> -   if (!valptr) {
> -   pr_err("Error: No tlb_dynamic_lb default value");
> -   return -EINVAL;
> -   }
> -   tlb_dynamic_lb = valptr->value;
> +   bond_opt_initstr(, "default");
> +   valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB), 
> );
> +   if (!valptr) {
> +   pr_err("Error: No tlb_dynamic_lb default value");
> +   return -EINVAL;
> }
> +   tlb_dynamic_lb = valptr->value;
>
> if (lp_interval == 0) {
> pr_warn("Warning: ip_interval must be between 1 and %d, so it 
> was reset to %d\n",
> --
> 2.1.4
>

[PATCH] VSOCK: fix uapi/linux/vm_sockets.h incomplete types

2017-09-12 Thread Stefan Hajnoczi

This patch fixes the following compiler errors when userspace
applications use the vm_sockets.h header:

  include/uapi/linux/vm_sockets.h:148:32: error: invalid application of 
‘sizeof’ to incomplete type ‘struct sockaddr’
unsigned char svm_zero[sizeof(struct sockaddr) -
  ^~
  include/uapi/linux/vm_sockets.h:149:18: error: ‘sa_family_t’ undeclared here 
(not in a function)
 sizeof(sa_family_t) -
^~~

Two issues:
1. In the kernel struct sockaddr comes in via <linux/socket.h> but in
   userspace <sys/socket.h> is required.
2. struct sockaddr_vm has a __kernel_sa_family_t field so let's be
   consistent and use the same type for the sizeof(sa_family_t)
   calculation.

Currently userspace applications work around this broken header by first
including <sys/socket.h>.  In the kernel there is no compiler error
because <linux/socket.h> provides everything.  It's worth fixing the
header file though.

Cc: Jorgen Hansen 
Signed-off-by: Stefan Hajnoczi 
---
 include/uapi/linux/vm_sockets.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index b4ed5d895699..4ae5c625ac56 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -18,6 +18,10 @@
 
 #include <linux/socket.h>
 
+#ifndef __KERNEL__
+#include <sys/socket.h> /* struct sockaddr */
+#endif
+
 /* Option name for STREAM socket buffer size.  Use as the option name in
  * setsockopt(3) or getsockopt(3) to set or get an unsigned long long that
  * specifies the size of the buffer underlying a vSockets STREAM socket.
@@ -146,7 +150,7 @@ struct sockaddr_vm {
unsigned int svm_port;
unsigned int svm_cid;
unsigned char svm_zero[sizeof(struct sockaddr) -
-  sizeof(sa_family_t) -
+  sizeof(__kernel_sa_family_t) -
   sizeof(unsigned short) -
   sizeof(unsigned int) - sizeof(unsigned int)];
 };
-- 
2.13.5
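
The svm_zero sizing in the patch can be checked from userspace with a local mirror of the layout. This is a sketch under the assumption that the struct has the fields shown in the diff plus a leading family field and a reserved short; struct sockaddr_vm_model is an illustrative name, not the uapi struct itself.

```c
#include <sys/socket.h>   /* struct sockaddr, sa_family_t in userspace */

/* Illustrative mirror of the layout in the patch; not the uapi struct. */
struct sockaddr_vm_model {
	sa_family_t svm_family;
	unsigned short svm_reserved1;
	unsigned int svm_port;
	unsigned int svm_cid;
	/* Pad the address out to the size of a generic struct sockaddr. */
	unsigned char svm_zero[sizeof(struct sockaddr) -
			       sizeof(sa_family_t) -
			       sizeof(unsigned short) -
			       sizeof(unsigned int) - sizeof(unsigned int)];
};
```

With the usual Linux sizes (16-byte struct sockaddr, 2-byte family, 4-byte int) the padding comes out at 4 bytes, so the whole address is exactly as large as struct sockaddr.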

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-12 Thread Eric Dumazet

On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
> 
> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
> > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> >
> >> Two ints in sock_common for this purpose is quite expensive and the
> >> use case for this is limited-- even if a RX->TX queue mapping were
> >> introduced to eliminate the queue pair assumption this still won't
> >> help if the receive and transmit interfaces are different for the
> >> connection. I think we really need to see some very compelling results
> >> to be able to justify this.
> Will try to collect and post some perf data with symmetric queue 
> configuration.
> 
> > Yes, this is unreasonable cost.
> >
> > XPS should really cover the case already.
> >   
> Eric,
> 
> Can you clarify how XPS covers the RX-> TX queue mapping case?
> Is it possible to configure XPS to select TX queue based on the RX queue 
> of a flow?
> IIUC, it is based on the CPU of the thread doing the transmit OR based 
> on skb->priority to TC mapping?
> It may be possible to get this effect if the threads are pinned to a 
> core, but if the app threads are
> freely moving, i am not sure how XPS can be configured to select the TX 
> queue based on the RX queue of a flow.

If application is freely moving, how NIC can properly select the RX
queue so that packets are coming to the appropriate queue ?

This is called aRFS, and it does not scale to millions of flows.
We tried in the past, and this went nowhere really, since the setup cost
is prohibitive and DDOS vulnerable.

XPS will follow the thread, since selection is done on current cpu.

The problem is RX side. If application is free to migrate, then special
support (aRFS) is needed from the hardware.

At least for passive connections, we already have all the support in the
kernel so that you can have one thread per NIC queue, dealing with
sockets that have incoming packets all received on one NIC RX queue.
(And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus.

Simply use 32 listeners, 32 threads (or 32 pools of threads)
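
A minimal sketch of the SO_REUSEPORT plus BPF idea, assuming each RX queue's interrupt is bound to its own CPU so that the CPU handling the packet identifies the queue. The helper names build_cpu_steer_prog() and attach_cpu_steer_prog() are made up for illustration; the socket option and the SKF_AD_CPU ancillary load are existing Linux interfaces.

```c
#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog, SKF_AD_* */
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51   /* value from asm-generic/socket.h */
#endif

#define STEER_PROG_LEN 3

/*
 * Fill code[] with a classic BPF program that returns
 * (CPU handling the packet) % nr_listeners, which the kernel uses as the
 * index of the socket to pick within the SO_REUSEPORT group.
 */
static void build_cpu_steer_prog(struct sock_filter code[STEER_PROG_LEN],
				 unsigned int nr_listeners)
{
	/* A = id of the CPU processing the packet */
	code[0] = (struct sock_filter){ BPF_LD | BPF_W | BPF_ABS, 0, 0,
					(__u32)(SKF_AD_OFF + SKF_AD_CPU) };
	/* A %= nr_listeners */
	code[1] = (struct sock_filter){ BPF_ALU | BPF_MOD | BPF_K, 0, 0,
					nr_listeners };
	/* return A */
	code[2] = (struct sock_filter){ BPF_RET | BPF_A, 0, 0, 0 };
}

/* Attach the program to one socket of the reuseport group. */
static int attach_cpu_steer_prog(int fd, unsigned int nr_listeners)
{
	struct sock_filter code[STEER_PROG_LEN];
	struct sock_fprog prog = { .len = STEER_PROG_LEN, .filter = code };

	build_cpu_steer_prog(code, nr_listeners);
	return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
			  &prog, sizeof(prog));
}
```

With 32 queues and 32 CPUs, one would open 32 listeners with SO_REUSEPORT on the same port, in CPU order, and call attach_cpu_steer_prog(fd, 32) on one of them.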

[PATCH net] ipv6: fix net.ipv6.conf.all interface DAD handlers

2017-09-12 Thread Matteo Croce

Currently, writing into
net.ipv6.conf.all.{accept_dad,use_optimistic,optimistic_dad} has no effect.
Fix handling of these flags by:

- using the maximum of global and per-interface values for the
  accept_dad flag. That is, if at least one of the two values is
  non-zero, enable DAD on the interface. If at least one value is
  set to 2, enable DAD and disable IPv6 operation on the interface if
  MAC-based link-local address was found

- using the logical OR of global and per-interface values for the
  optimistic_dad flag. If at least one of them is set to one, optimistic
  duplicate address detection (RFC 4429) is enabled on the interface

- using the logical OR of global and per-interface values for the
  use_optimistic flag. If at least one of them is set to one,
  optimistic addresses won't be marked as deprecated during source address
  selection on the interface.

While at it, as we're modifying the prototype for ipv6_use_optimistic_addr(),
drop inline, and let the compiler decide.

Fixes: 7fd2561e4ebd ("net: ipv6: Add a sysctl to make optimistic addresses 
useful candidates")
Signed-off-by: Matteo Croce 
---
 Documentation/networking/ip-sysctl.txt | 18 ++
 net/ipv6/addrconf.c| 27 ---
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index b3345d0fe0a6..77f4de59dc9c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1680,6 +1680,9 @@ accept_dad - INTEGER
2: Enable DAD, and disable IPv6 operation if MAC-based duplicate
   link-local address has been found.
 
+   DAD operation and mode on a given interface will be selected according
+   to the maximum value of conf/{all,interface}/accept_dad.
+
 force_tllao - BOOLEAN
Enable sending the target link-layer address option even when
responding to a unicast neighbor solicitation.
@@ -1727,16 +1730,23 @@ suppress_frag_ndisc - INTEGER
 
 optimistic_dad - BOOLEAN
Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
-   0: disabled (default)
-   1: enabled
+   0: disabled (default)
+   1: enabled
+
+   Optimistic Duplicate Address Detection for the interface will be enabled
+   if at least one of conf/{all,interface}/optimistic_dad is set to 1,
+   it will be disabled otherwise.
 
 use_optimistic - BOOLEAN
If enabled, do not classify optimistic addresses as deprecated during
source address selection.  Preferred addresses will still be chosen
before optimistic addresses, subject to other ranking in the source
address selection algorithm.
-   0: disabled (default)
-   1: enabled
+   0: disabled (default)
+   1: enabled
+
+   This will be enabled if at least one of
+   conf/{all,interface}/use_optimistic is set to 1, disabled otherwise.
 
 stable_secret - IPv6 address
This IPv6 address will be used as a secret to generate IPv6
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c2e2a78787ec..774d8794248a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1399,10 +1399,18 @@ static inline int ipv6_saddr_preferred(int type)
return 0;
 }
 
-static inline bool ipv6_use_optimistic_addr(struct inet6_dev *idev)
+static bool ipv6_use_optimistic_addr(struct net *net,
+struct inet6_dev *idev)
 {
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
-   return idev && idev->cnf.optimistic_dad && idev->cnf.use_optimistic;
+   if (!idev)
+   return false;
+   if (!net->ipv6.devconf_all->optimistic_dad && !idev->cnf.optimistic_dad)
+   return false;
+   if (!net->ipv6.devconf_all->use_optimistic && !idev->cnf.use_optimistic)
+   return false;
+
+   return true;
 #else
return false;
 #endif
@@ -1472,7 +1480,7 @@ static int ipv6_get_saddr_eval(struct net *net,
/* Rule 3: Avoid deprecated and optimistic addresses */
u8 avoid = IFA_F_DEPRECATED;
 
-   if (!ipv6_use_optimistic_addr(score->ifa->idev))
+   if (!ipv6_use_optimistic_addr(net, score->ifa->idev))
avoid |= IFA_F_OPTIMISTIC;
ret = ipv6_saddr_preferred(score->addr_type) ||
  !(score->ifa->flags & avoid);
@@ -2460,7 +2468,8 @@ int addrconf_prefix_rcv_add_addr(struct net *net, struct 
net_device *dev,
int max_addresses = in6_dev->cnf.max_addresses;
 
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
-   if (in6_dev->cnf.optimistic_dad &&
+   if ((net->ipv6.devconf_all->optimistic_dad ||
+in6_dev->cnf.optimistic_dad) &&
!net->ipv6.devconf_all->forwarding && sllao)
addr_flags |= IFA_F_OPTIMISTIC;
 #endif
@@
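
For readers following the use_optimistic change above, here is a minimal
userspace sketch of the combined all/interface check. The struct
optimistic_conf type and the use_optimistic_addr() name are made up for
illustration and only mirror the logic of the patched
ipv6_use_optimistic_addr(); the CONFIG_IPV6_OPTIMISTIC_DAD guard is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one scope's sysctl values; not the kernel struct. */
struct optimistic_conf {
	int optimistic_dad;
	int use_optimistic;
};

/*
 * Each knob counts as enabled when either the "all" value or the
 * per-interface value is non-zero. Both knobs must be enabled before
 * optimistic addresses are kept as source address candidates.
 */
static bool use_optimistic_addr(const struct optimistic_conf *all,
				const struct optimistic_conf *iface)
{
	assert(all != NULL);

	if (!iface)
		return false;
	if (!all->optimistic_dad && !iface->optimistic_dad)
		return false;
	if (!all->use_optimistic && !iface->use_optimistic)
		return false;

	return true;
}
```

With this model, setting both knobs only under conf/all is enough to enable
the behaviour on an interface whose own values are 0.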

Re: [PATCH] tcp: TCP_USER_TIMEOUT can not work in tcp_probe_timer()

2017-09-12 Thread Eric Dumazet

On Tue, 2017-09-12 at 08:05 -0700, Eric Dumazet wrote:
> On Tue, 2017-09-12 at 14:08 +0800, liujian wrote:
> > Hi,
> > 
> > In this scenario, the TCP server-side IP changed, and at that moment the
> > userspace application was still sending data continuously, so
> > tcp_send_head(sk)'s timestamp was always being refreshed.
> > 
> > Here is the packetdrill script:
> > 
> >0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> >+0 bind(3, ..., ...) = 0
> >+0 listen(3, 1) = 0
> > 
> >+0 < S 0:0(0) win 0 
> >+0 > S. 0:0(0) ack 1 
> > 
> >   +.1 < . 1:1(0) ack 1 win 65530
> >+0 accept(3, ..., ...) = 4
> > 
> >+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
> >+0 write(4, ..., 24) = 24
> >+0 > P. 1:25(24) ack 1 win 229
> >+.1 < . 1:1(0) ack 25 win 65530
> > 
> > //change the ipaddress
> >+1 `ifconfig tun0 192.168.0.10/16`
> > 
> >+1 write(4, ..., 24) = 24
> >+1 write(4, ..., 24) = 24
> >+1 write(4, ..., 24) = 24
> >+1 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> >+3 write(4, ..., 24) = 24
> > 
> >+0 `ifconfig tun0 192.168.0.1/16`
> >+0 < . 1:1(0) ack 1 win 1000
> >+0 write(4, ..., 24) = -1
> > 
> > 
> 
> This has nothing to do with the code patch you have changed.
> 
> How have you tested your patch exactly ?
> 


lpaa23:~# ss -toenmi src :8080
State  Recv-Q Send-Q Local Address:Port     Peer Address:Port
ESTAB  0      144    192.168.134.161:8080   192.0.2.1:51165
  timer:(persist,8.262ms,5) ino:182083 sk:3 <->
  skmem:(r0,rb359040,t0,tb46080,f1792,w2304,o0,bl0,d0) sack cubic
  wscale:7,8 rto:301 backoff:5 rtt:100.127/37.576
  mss:1460 rcvmss:536 advmss:1460 cwnd:10 bytes_acked:24 segs_out:12
  segs_in:3 data_segs_out:12 send 1.2Mbps lastsnd:1370
  lastrcv:13348 lastack:13248 pacing_rate 2.3Mbps delivery_rate 116.7Kbps
  app_limited busy:11346ms rcv_space:29200 notsent:144
  minrtt:100.043

This is the typical RTO timer, not a zero window probe.

Re: [PATCH] tcp: TCP_USER_TIMEOUT can not work in tcp_probe_timer()

2017-09-12 Thread Eric Dumazet

On Tue, 2017-09-12 at 14:08 +0800, liujian wrote:
> Hi,
> 
> In this scenario, the TCP server-side IP changed, and at that moment the
> userspace application was still sending data continuously, so
> tcp_send_head(sk)'s timestamp was always being refreshed.
> 
> Here is the packetdrill script:
> 
>0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>+0 bind(3, ..., ...) = 0
>+0 listen(3, 1) = 0
> 
>+0 < S 0:0(0) win 0 
>+0 > S. 0:0(0) ack 1 
> 
>   +.1 < . 1:1(0) ack 1 win 65530
>+0 accept(3, ..., ...) = 4
> 
>+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
>+0 write(4, ..., 24) = 24
>+0 > P. 1:25(24) ack 1 win 229
>+.1 < . 1:1(0) ack 25 win 65530
> 
> //change the ipaddress
>+1 `ifconfig tun0 192.168.0.10/16`
> 
>+1 write(4, ..., 24) = 24
>+1 write(4, ..., 24) = 24
>+1 write(4, ..., 24) = 24
>+1 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
>+3 write(4, ..., 24) = 24
> 
>+0 `ifconfig tun0 192.168.0.1/16`
>+0 < . 1:1(0) ack 1 win 1000
>+0 write(4, ..., 24) = -1
> 
> 

This has nothing to do with the code patch you have changed.

How have you tested your patch exactly ?


> [root@localhost ~]# time ./gtests/net/packetdrill/packetdrill test.pkt
> test.pkt:50: runtime error in write call: Expected result -1 but got 24 with 
> errno 2 (No such file or directory)
> 
> real  1m11.364s
> user  0m0.028s
> sys   0m0.106s
> 
> [root@localhost ~]# netstat -toen
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address     Foreign Address   State        User  Inode  Timer
> tcp   0      504    192.168.0.1:8080  192.0.2.1:33993   ESTABLISHED  0     45453  probe (22.38/0/7)
> 
> Since the script did not wait long enough, only 7 probes were sent here.
> 
> On 2017/9/11 23:22, Eric Dumazet wrote:
> > On Mon, 2017-09-11 at 08:13 -0700, Eric Dumazet wrote:
> > 
> >> You can see we got only 3 probes, not 4.
> > 
> > Here is complete packetdrill test showing that code behaves as expected.
> > 
> > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> >+0 bind(3, ..., ...) = 0
> >+0 listen(3, 1) = 0
> > 
> >+0 < S 0:0(0) win 0 
> >+0 > S. 0:0(0) ack 1 
> > 
> > // Client advertises a zero receive window, so we can't send.
> >   +.1 < . 1:1(0) ack 1 win 0
> >+0 accept(3, ..., ...) = 4
> > 
> >+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
> >+0 write(4, ..., 2920) = 2920
> > 
> > // Window probes are scheduled just like RTOs.
> >   +.3~+.31 > . 0:0(0) ack 1
> >   +.6~+.62 > . 0:0(0) ack 1
> >  +1.2~+1.24 > . 0:0(0) ack 1
> > 
> > // Peer opens its window too late !
> >+3 < . 1:1(0) ack 1 win 1000
> >+0 > R 1:1(0)
> > 
> > 
> > 
> > .
> > 
>

[iproute PATCH] ipaddress: Fix segfault in 'addr showdump'

2017-09-12 Thread Phil Sutter

Obviously, 'addr showdump' feature wasn't adjusted to json output
support. As a consequence, calls to print_string() in print_addrinfo()
tried to dereference a NULL FILE pointer.

Fixes: d0e720111aad2 ("ip: ipaddress.c: add support for json output")
Signed-off-by: Phil Sutter 
---
 ip/ipaddress.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 9797145023966..ee6c9f588e7ba 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1801,17 +1801,31 @@ static int show_handler(const struct sockaddr_nl *nl,
 {
struct ifaddrmsg *ifa = NLMSG_DATA(n);
 
-   printf("if%d:\n", ifa->ifa_index);
+   open_json_object(NULL);
+   print_int(PRINT_ANY, "index", "if%d:\n", ifa->ifa_index);
print_addrinfo(NULL, n, stdout);
+   close_json_object();
return 0;
 }
 
 static int ipaddr_showdump(void)
 {
+   int err;
+
if (ipadd_dump_check_magic())
exit(-1);
 
-   exit(rtnl_from_file(stdin, _handler, NULL));
+   new_json_obj(json, stdout);
+   open_json_object(NULL);
+   open_json_array(PRINT_JSON, "addr_info");
+
+   err = rtnl_from_file(stdin, _handler, NULL);
+
+   close_json_array(PRINT_JSON, NULL);
+   close_json_object();
+   delete_json_obj();
+
+   exit(err);
 }
 
 static int restore_handler(const struct sockaddr_nl *nl,
-- 
2.13.1

Re: [PATCH] ieee802154: fix gcc-4.9 warnings

2017-09-12 Thread Marcel Holtmann

Hi Arnd,

> All older compiler versions up to gcc-4.9 produce these
> harmless warnings:
> 
> drivers/net/ieee802154/ca8210.c: In function 'ca8210_skb_tx':
> drivers/net/ieee802154/ca8210.c:1947:9: warning: missing braces around 
> initializer [-Wmissing-braces]
> 
> This changes the syntax to something that works on all versions
> without warnings.
> 
> Fixes: ded845a781a5 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
> Signed-off-by: Arnd Bergmann 
> ---
> drivers/net/ieee802154/ca8210.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel

[PATCH/RFC net-next 2/2] net/sched: allow flower to match tunnel options

2017-09-12 Thread Simon Horman

Allow matching on options in tunnel headers.
This makes use of existing tunnel metadata support.

Options are a bytestring of up to 256 bytes.
Tunnel implementations may support fewer or more options,
or no options at all.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc qdisc del dev eth0 ingress; tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent : \
 flower indev eth0 \
ip_proto udp \
action tunnel_key \
set src_ip 10.0.99.192 \
dst_ip 10.0.99.193 \
dst_port 4789 \
id 11 \
opts 0102800100800022 \
action mirred egress redirect dev geneve0

Signed-off-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
 include/net/flow_dissector.h | 13 +
 include/uapi/linux/pkt_cls.h |  3 +++
 net/sched/cls_flower.c   | 35 ++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index fc3dce730a6b..43f98bf0b349 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -183,6 +183,18 @@ struct flow_dissector_key_ip {
__u8ttl;
 };
 
+/**
+ * struct flow_dissector_key_enc_opts:
+ * @data: data
+ * @len: len
+ */
+struct flow_dissector_key_enc_opts {
+   u8 data[256];   /* Using IP_TUNNEL_OPTS_MAX is desired here
+* but seems difficult to #include
+*/
+   u8 len;
+};
+
 enum flow_dissector_key_id {
FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
@@ -205,6 +217,7 @@ enum flow_dissector_key_id {
FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
+   FLOW_DISSECTOR_KEY_ENC_OPTS, /* struct flow_dissector_key_enc_opts */
 
FLOW_DISSECTOR_KEY_MAX,
 };
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index d5e2bf68d0d4..7a09a28f21e0 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -467,6 +467,9 @@ enum {
TCA_FLOWER_KEY_IP_TTL,  /* u8 */
TCA_FLOWER_KEY_IP_TTL_MASK, /* u8 */
 
+   TCA_FLOWER_KEY_ENC_OPTS,
+   TCA_FLOWER_KEY_ENC_OPTS_MASK,
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1a267e77c6de..2a8364ef4fd5 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -51,6 +51,7 @@ struct fl_flow_key {
struct flow_dissector_key_mpls mpls;
struct flow_dissector_key_tcp tcp;
struct flow_dissector_key_ip ip;
+   struct flow_dissector_key_enc_opts enc_opts;
 } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. 
*/
 
 struct fl_flow_mask_range {
@@ -181,6 +182,11 @@ static int fl_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
skb_key.enc_key_id.keyid = tunnel_id_to_key32(key->tun_id);
skb_key.enc_tp.src = key->tp_src;
skb_key.enc_tp.dst = key->tp_dst;
+
+   if (info->options_len) {
+   skb_key.enc_opts.len = info->options_len;
+   ip_tunnel_info_opts_get(skb_key.enc_opts.data, info);
+   }
}
 
skb_key.indev_ifindex = skb->skb_iif;
@@ -421,6 +427,8 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 
1] = {
[TCA_FLOWER_KEY_IP_TOS_MASK]= { .type = NLA_U8 },
[TCA_FLOWER_KEY_IP_TTL] = { .type = NLA_U8 },
[TCA_FLOWER_KEY_IP_TTL_MASK]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ENC_OPTS]   = { .type = NLA_BINARY },
+   [TCA_FLOWER_KEY_ENC_OPTS_MASK]  = { .type = NLA_BINARY },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -712,6 +720,26 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
   >enc_tp.dst, TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,
   sizeof(key->enc_tp.dst));
 
+   if (tb[TCA_FLOWER_KEY_ENC_OPTS]) {
+   key->enc_opts.len = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS]);
+
+   if (key->enc_opts.len > sizeof(key->enc_opts.data))
+   return -EINVAL;
+
+   /* enc_opts is variable length.
+* If present ensure the value and mask are the same length.
+*/
+   if (tb[TCA_FLOWER_KEY_ENC_OPTS_MASK] &&
+   nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]) != 
key->enc_opts.len)
+   return -EINVAL;
+
+   mask->enc_opts.len = key->enc_opts.len;
+   fl_set_key_val(tb, key->enc_opts.data, TCA_FLOWER_KEY_ENC_OPTS,
+  mask->enc_opts.data,
+

[PATCH/RFC net-next 1/2] net/sched: add tunnel option support to act_tunnel_key

2017-09-12 Thread Simon Horman

Allow setting tunnel options using the act_tunnel_key action.

Options are a bitwise maskable bytestring of up to 256 bytes.
Tunnel implementations may support fewer or more options,
or no options at all.

e.g.
 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc del dev geneve0 ingress
 # tc filter add dev geneve0 protocol ip parent : \
 flower \
   enc_src_ip 10.0.99.192 \
   enc_dst_ip 10.0.99.193 \
   enc_key_id 11 \
   enc_opts 0102800100800020/fff0 \
   ip_proto udp \
   action mirred egress redirect dev eth1

Signed-off-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
 include/uapi/linux/tc_act/tc_tunnel_key.h |  1 +
 net/sched/act_tunnel_key.c| 26 +-
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h 
b/include/uapi/linux/tc_act/tc_tunnel_key.h
index afcd4be953e2..e0cb1121d132 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -35,6 +35,7 @@ enum {
TCA_TUNNEL_KEY_PAD,
TCA_TUNNEL_KEY_ENC_DST_PORT,/* be16 */
TCA_TUNNEL_KEY_NO_CSUM, /* u8 */
+   TCA_TUNNEL_KEY_ENC_OPTS,
__TCA_TUNNEL_KEY_MAX,
 };
 
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 30c96274c638..77b5890a48b9 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -66,6 +66,7 @@ static const struct nla_policy 
tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
[TCA_TUNNEL_KEY_ENC_KEY_ID]   = { .type = NLA_U32 },
[TCA_TUNNEL_KEY_ENC_DST_PORT] = {.type = NLA_U16},
[TCA_TUNNEL_KEY_NO_CSUM]  = { .type = NLA_U8 },
+   [TCA_TUNNEL_KEY_ENC_OPTS] = { .type = NLA_BINARY },
 };
 
 static int tunnel_key_init(struct net *net, struct nlattr *nla,
@@ -81,9 +82,11 @@ static int tunnel_key_init(struct net *net, struct nlattr 
*nla,
struct tcf_tunnel_key *t;
bool exists = false;
__be16 dst_port = 0;
+   int opts_len = 0;
__be64 key_id;
__be16 flags;
int ret = 0;
+   u8 *opts;
int err;
 
if (!nla)
@@ -121,6 +124,11 @@ static int tunnel_key_init(struct net *net, struct nlattr 
*nla,
if (tb[TCA_TUNNEL_KEY_ENC_DST_PORT])
dst_port = 
nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_DST_PORT]);
 
+   if (tb[TCA_TUNNEL_KEY_ENC_OPTS]) {
+   opts = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPTS]);
+   opts_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPTS]);
+   }
+
if (tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC] &&
tb[TCA_TUNNEL_KEY_ENC_IPV4_DST]) {
__be32 saddr;
@@ -131,7 +139,7 @@ static int tunnel_key_init(struct net *net, struct nlattr 
*nla,
 
metadata = __ip_tun_set_dst(saddr, daddr, 0, 0,
dst_port, flags,
-   key_id, 0);
+   key_id, opts_len);
} else if (tb[TCA_TUNNEL_KEY_ENC_IPV6_SRC] &&
   tb[TCA_TUNNEL_KEY_ENC_IPV6_DST]) {
struct in6_addr saddr;
@@ -142,9 +150,13 @@ static int tunnel_key_init(struct net *net, struct nlattr 
*nla,
 
metadata = __ipv6_tun_set_dst(, , 0, 0, 
dst_port,
  0, flags,
- key_id, 0);
+ key_id, opts_len);
}
 
+   if (opts_len)
+   ip_tunnel_info_opts_set(>u.tun_info,
+   opts, opts_len);
+
if (!metadata) {
ret = -EINVAL;
goto err_out;
@@ -264,8 +276,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct 
tc_action *a,
goto nla_put_failure;
 
if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET) {
-   struct ip_tunnel_key *key =
-   >tcft_enc_metadata->u.tun_info.key;
+   struct ip_tunnel_info *info =
+   >tcft_enc_metadata->u.tun_info;
+   struct ip_tunnel_key *key = >key;
__be32 key_id = tunnel_id_to_key32(key->tun_id);
 
if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
@@ -273,7 +286,10 @@ static int tunnel_key_dump(struct sk_buff *skb, struct 
tc_action *a,
  
>tcft_enc_metadata->u.tun_info) ||
nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) 
||
nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
-  !(key->tun_flags & TUNNEL_CSUM)))
+

[PATCH/RFC net-next 0/2] net/sched: support tunnel options in cls_flower and act_tunnel_key

2017-09-12 Thread Simon Horman

Allow the flower classifier to match on tunnel options and the
tunnel key action to set them.

Tunnel options are a bytestring of up to 256 bytes.
The flower classifier matches them with an optional bitwise mask.
Tunnel implementations may support more or fewer options,
or none at all.
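
To make "matching with an optional bitwise mask" concrete, here is a small
userspace sketch of comparing two option bytestrings under a mask. The
opts_match() helper and the OPTS_MAX name are assumptions made for this
illustration; this is not how cls_flower performs its lookup internally, only
the semantics a value/mask pair is meant to express.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OPTS_MAX 256 /* assumed cap, mirroring the 256-byte limit above */

/*
 * Return true when pkt and key agree on every bit selected by mask.
 * All buffers are len bytes long; a NULL mask means "compare every bit".
 */
static bool opts_match(const uint8_t *pkt, const uint8_t *key,
		       const uint8_t *mask, size_t len)
{
	size_t i;

	assert(len <= OPTS_MAX);

	for (i = 0; i < len; i++) {
		uint8_t m = mask ? mask[i] : 0xff;

		/* XOR exposes differing bits; the mask keeps the relevant ones. */
		if ((pkt[i] ^ key[i]) & m)
			return false;
	}
	return true;
}
```

The value and mask are required to have the same length, which is why a
single len covers all three buffers in this sketch.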

Simon Horman (2):
  net/sched: add tunnel option support to act_tunnel_key
  net/sched: allow flower to match tunnel options

 include/net/flow_dissector.h  | 13 
 include/uapi/linux/pkt_cls.h  |  3 +++
 include/uapi/linux/tc_act/tc_tunnel_key.h |  1 +
 net/sched/act_tunnel_key.c| 26 ++-
 net/sched/cls_flower.c| 35 ++-
 5 files changed, 72 insertions(+), 6 deletions(-)

-- 
2.1.4

[PATCH iproute2/net-next] tc: flower: support for matching MPLS labels

2017-09-12 Thread Simon Horman

From: Benjamin LaHaise 

This patch adds support to the iproute2 tc filter command for matching MPLS
labels in the flower classifier.  The ability to match the Time To Live,
Bottom Of Stack, Traffic Class and Label fields is added as options to
the flower filter.

e.g.:
  tc filter add dev eth0 protocol 0x8847 parent : \
flower mpls_label 1 mpls_tc 2 mpls_ttl 3 mpls_bos 0 \
action drop
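
For reference, the four fields share one 32-bit label stack entry. The sketch
below packs them using the layout from RFC 3032 (label 20 bits, TC 3 bits,
bottom of stack 1 bit, TTL 8 bits). The LSE_* macros and mpls_lse_pack() are
names invented for this example, not identifiers from iproute2 or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout of one 32-bit MPLS label stack entry (RFC 3032). */
#define LSE_LABEL_SHIFT 12
#define LSE_LABEL_MAX   0xfffffu /* 20 bits */
#define LSE_TC_SHIFT    9
#define LSE_TC_MAX      0x7u     /* 3 bits */
#define LSE_BOS_SHIFT   8
#define LSE_BOS_MAX     0x1u     /* 1 bit */
#define LSE_TTL_MAX     0xffu    /* 8 bits */

/*
 * Store the host-order entry in *lse and return 0, or return -1 when a
 * field does not fit its bit width.
 */
static int mpls_lse_pack(uint32_t label, uint32_t tc, uint32_t bos,
			 uint32_t ttl, uint32_t *lse)
{
	assert(lse != NULL);

	if (label > LSE_LABEL_MAX || tc > LSE_TC_MAX ||
	    bos > LSE_BOS_MAX || ttl > LSE_TTL_MAX)
		return -1;

	*lse = (label << LSE_LABEL_SHIFT) | (tc << LSE_TC_SHIFT) |
	       (bos << LSE_BOS_SHIFT) | ttl;
	return 0;
}
```

The values from the example filter (label 1, tc 2, bos 0, ttl 3) pack to
0x00001403; a label of 0x100000 is rejected because it needs 21 bits.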

Signed-off-by: Benjamin LaHaise 
Signed-off-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
v1 [Simon Horman]
- added flower_print_opt portion to code
- added example to changelog
- revised manpage changes

v0 [Benjamin LaHaise]
---
 man/man8/tc-flower.8 | 37 +++--
 tc/f_flower.c| 92 
 2 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index be46f0278b4f..88a23f544133 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -29,6 +29,14 @@ flower \- flow based traffic control filter
 .IR PRIORITY " | "
 .BR vlan_ethtype " { " ipv4 " | " ipv6 " | "
 .IR ETH_TYPE " } | "
+.B mpls_label
+.IR LABEL " | "
+.B mpls_tc
+.IR TC " | "
+.B mpls_bos
+.IR BOS " | "
+.B mpls_ttl
+.IR TTL " | "
 .BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | "
 .IR IP_PROTO " } | "
 .B ip_tos
@@ -119,6 +127,29 @@ may be either
 .BR ipv4 ", " ipv6
 or an unsigned 16bit value in hexadecimal format.
 .TP
+.BI mpls_label " LABEL"
+Match the label id in the outermost MPLS label stack entry.
+.I LABEL
+is an unsigned 20 bit value in decimal format.
+.TP
+.BI mpls_tc " TC"
+Match on the MPLS TC field, which is typically used for packet priority,
+in the outermost MPLS label stack entry.
+.I TC
+is an unsigned 3 bit value in decimal format.
+.TP
+.BI mpls_bos " BOS"
+Match on the MPLS Bottom Of Stack field in the outermost MPLS label stack
+entry.
+.I BOS
+is a 1 bit value in decimal format.
+.TP
+.BI mpls_ttl " TTL"
+Match on the MPLS Time To Live field in the outermost MPLS label stack
+entry.
+.I TTL
+is an unsigned 8 bit value in decimal format.
+.TP
 .BI ip_proto " IP_PROTO"
 Match on layer four protocol.
 .I IP_PROTO
@@ -226,8 +257,10 @@ to match on fragmented packets or not respectively.
 As stated above where applicable, matches of a certain layer implicitly depend
 on the matches of the next lower layer. Precisely, layer one and two matches
 (\fBindev\fR,  \fBdst_mac\fR and \fBsrc_mac\fR)
-have no dependency, layer three matches
-(\fBip_proto\fR, \fBdst_ip\fR, \fBsrc_ip\fR, \fBarp_tip\fR, \fBarp_sip\fR,
+have no dependency,
+MPLS and layer three matches
+(\fBmpls_label\fR, \fBmpls_tc\fR, \fBmpls_bos\fR, \fBmpls_ttl\fR,
+\fBip_proto\fR, \fBdst_ip\fR, \fBsrc_ip\fR, \fBarp_tip\fR, \fBarp_sip\fR,
 \fBarp_op\fR, \fBarp_tha\fR, \fBarp_sha\fR and \fBip_flags\fR)
 depend on the
 .B protocol
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 934832e2bbe9..8c4bfb0d339e 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "utils.h"
 #include "tc_util.h"
@@ -55,6 +56,10 @@ static void explain(void)
"   ip_proto [tcp | udp | sctp | icmp | 
icmpv6 | IP-PROTO ] |\n"
"   ip_tos MASKED-IP_TOS |\n"
"   ip_ttl MASKED-IP_TTL |\n"
+   "   mpls_label LABEL |\n"
+   "   mpls_tc TC |\n"
+   "   mpls_bos BOS |\n"
+   "   mpls_ttl TTL |\n"
"   dst_ip PREFIX |\n"
"   src_ip PREFIX |\n"
"   dst_port PORT-NUMBER |\n"
@@ -672,6 +677,70 @@ static int flower_parse_opt(struct filter_util *qu, char 
*handle,
 _ethtype, n);
if (ret < 0)
return -1;
+   } else if (matches(*argv, "mpls_label") == 0) {
+   __u32 label;
+
+   NEXT_ARG();
+   if (eth_type != htons(ETH_P_MPLS_UC) &&
+   eth_type != htons(ETH_P_MPLS_MC)) {
+   fprintf(stderr,
+   "Can't set \"mpls_label\" if ethertype 
isn't MPLS\n");
+   return -1;
+   }
+   ret = get_u32(, *argv, 10);
+   if (ret < 0 || label & ~(MPLS_LS_LABEL_MASK >> 
MPLS_LS_LABEL_SHIFT)) {
+   fprintf(stderr, "Illegal \"mpls_label\"\n");
+   return -1;
+   }
+   addattr32(n, MAX_MSG, TCA_FLOWER_KEY_MPLS_LABEL,

Re: [patch net] mlxsw: spectrum: Prevent mirred-related crash on removal

2017-09-12 Thread Andrew Lunn

On Tue, Sep 12, 2017 at 03:15:50PM +0200, Jiri Pirko wrote:
> Tue, Sep 12, 2017 at 03:05:06PM CEST, and...@lunn.ch wrote:
> >On Tue, Sep 12, 2017 at 08:50:53AM +0200, Jiri Pirko wrote:
> >> From: Yuval Mintz 
> >
> >Hi Jiri, Yuval
> >
> >s/mirred/mirrored/g
> 
> Actually, the name of the tc action is indeed "mirred".

:-(

Andrew

Re: [patch net] mlxsw: spectrum: Prevent mirred-related crash on removal

2017-09-12 Thread Jiri Pirko

Tue, Sep 12, 2017 at 03:05:06PM CEST, and...@lunn.ch wrote:
>On Tue, Sep 12, 2017 at 08:50:53AM +0200, Jiri Pirko wrote:
>> From: Yuval Mintz 
>
>Hi Jiri, Yuval
>
>s/mirred/mirrored/g

Actually, the name of the tc action is indeed "mirred".
See net/sched/act_mirred.c

[PATCH/RFC net-next] ravb: RX checksum offload

2017-09-12 Thread Simon Horman

Add support for RX checksum offload. This is enabled by default and
may be disabled and re-enabled using ethtool:

 # ethtool -K eth0 rx off
 # ethtool -K eth0 rx on

The RAVB provides a simple checksumming scheme which appears to be
completely compatible with CHECKSUM_COMPLETE: a 1's complement sum of
all packet data after the L2 header is appended to packet data; this may
be trivially read by the driver and used to update the skb accordingly.

In terms of performance, throughput is close to gigabit line rate both with
and without RX checksum offload enabled. Perf output, however, appears to
indicate that significantly less time is spent in do_csum(). This is as
expected.
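
As background on the "1's complement sum" mentioned above, here is a
userspace sketch of a 16-bit one's complement sum with carry folding. It is
not the driver code and says nothing about the byte order the RAVB hardware
uses; the ones_complement_sum() name and the big-endian word interpretation
are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's complement sum of len bytes, reading the data as big-endian
 * 16-bit words and padding an odd trailing byte with zero. The 32-bit
 * accumulator is only safe for packet-sized buffers (below 128 KiB).
 */
static uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

For the bytes 00 01 ff ff the raw sum is 0x10000, which folds to 0x0001.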

Test results with RX checksum offload enabled:
 # /usr/bin/perf_3.16 record -o /run/perf.data -a netperf -t TCP_MAERTS -H 10.4.3.162
 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.4.3.162 () port 0 AF_INET : demo
 enable_enobufs failed: getprotobyname
 Recv   Send    Send
 Socket Socket  Message  Elapsed
 Size   Size    Size     Time     Throughput
 bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    10.00     938.78
 [ perf record: Woken up 14 times to write data ]
 [ perf record: Captured and wrote 3.524 MB /run/perf.data (~153957 samples) ]

 Summary of output of perf report:
19.49%  ksoftirqd/0  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 9.88%  ksoftirqd/0  [kernel.kallsyms]  [k] __pi_memcpy
 7.33%  ksoftirqd/0  [kernel.kallsyms]  [k] skb_put
 7.00%  ksoftirqd/0  [kernel.kallsyms]  [k] ravb_poll
 3.89%  ksoftirqd/0  [kernel.kallsyms]  [k] dev_gro_receive
 3.65%  netperf  [kernel.kallsyms]  [k] __arch_copy_to_user
 3.43%  swapper  [kernel.kallsyms]  [k] arch_cpu_idle
 2.77%  swapper  [kernel.kallsyms]  [k] tick_nohz_idle_enter
 1.85%  ksoftirqd/0  [kernel.kallsyms]  [k] __netdev_alloc_skb
 1.80%  swapper  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
 1.64%  ksoftirqd/0  [kernel.kallsyms]  [k] __slab_alloc.isra.79
 1.62%  ksoftirqd/0  [kernel.kallsyms]  [k] __pi___inval_cache_range

Test results without RX checksum offload enabled:
 # /usr/bin/perf_3.16 record -o /run/perf.data -a netperf -t TCP_MAERTS -H 10.4.3.162
 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.4.3.162 () port 0 AF_INET : demo
 enable_enobufs failed: getprotobyname
 Recv   Send    Send
 Socket Socket  Message  Elapsed
 Size   Size    Size     Time     Throughput
 bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    10.00     941.09
 [ perf record: Woken up 14 times to write data ]
 [ perf record: Captured and wrote 3.411 MB /run/perf.data (~149040 samples) ]

 Summary of output of perf report:
   17.50%ksoftirqd/0  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
10.60%ksoftirqd/0  [kernel.kallsyms]  [k] __pi_memcpy
 7.91%ksoftirqd/0  [kernel.kallsyms]  [k] skb_put
 6.95%ksoftirqd/0  [kernel.kallsyms]  [k] do_csum
 6.22%ksoftirqd/0  [kernel.kallsyms]  [k] ravb_poll
 3.84%ksoftirqd/0  [kernel.kallsyms]  [k] dev_gro_receive
 2.53%netperf  [kernel.kallsyms]  [k] __arch_copy_to_user
 2.53%swapper  [kernel.kallsyms]  [k] arch_cpu_idle
 2.27%swapper  [kernel.kallsyms]  [k] tick_nohz_idle_enter
 1.90%ksoftirqd/0  [kernel.kallsyms]  [k] __pi___inval_cache_range
 1.90%ksoftirqd/0  [kernel.kallsyms]  [k] __netdev_alloc_skb
 1.52%ksoftirqd/0  [kernel.kallsyms]  [k] __slab_alloc.isra.79

Above results collected on an R-Car Gen 3 Salvator-X/r8a7796 ES1.0.
Also tested on an R-Car Gen 3 Salvator-X/r8a7795 ES1.0.

By inspection this also appears to be compatible with the ravb found
on R-Car Gen 2 SoCs, however, this patch is currently untested on such
hardware.

Signed-off-by: Simon Horman 
---
 drivers/net/ethernet/renesas/ravb_main.c | 58 +++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c 
b/drivers/net/ethernet/renesas/ravb_main.c
index fdf30bfa403b..7c6438cd7de7 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -403,8 +403,9 @@ static void ravb_emac_init(struct net_device *ndev)
/* Receive frame limit set register */
ravb_write(ndev, ndev->mtu + ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN, RFLR);
 
-   /* PAUSE prohibition */
+   /* EMAC Mode: PAUSE prohibition; Duplex; RX Checksum; TX; RX */
ravb_write(ndev, ECMR_ZPF | (priv->duplex ? ECMR_DM : 0) |
+  (ndev->features & NETIF_F_RXCSUM ? ECMR_RCSC : 0) |
   ECMR_TE | ECMR_RE, ECMR);
 
ravb_set_rate(ndev);
@@ -520,6 +521,19 @@ static void ravb_get_tx_tstamp(struct net_device *ndev)
}
 }
 
+static void ravb_rx_csum(struct sk_buff *skb)
+{
+   u8 *hw_csum;
+
+   /* The hardware

Re: [patch net] mlxsw: spectrum: Prevent mirred-related crash on removal

2017-09-12 Thread Andrew Lunn

On Tue, Sep 12, 2017 at 08:50:53AM +0200, Jiri Pirko wrote:
> From: Yuval Mintz 

Hi Jiri, Yuval

s/mirred/mirrored/g

Andrew

RE: [PATCH v2 net 1/3] lan78xx: Fix for eeprom read/write when device auto suspend

2017-09-12 Thread Nisar.Sayed

> > From: Nisar Sayed 
> >
> > Fix for eeprom read/write when device auto suspend
> >
> > Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to
> > 10/100/1000 Ethernet device driver")
> > Signed-off-by: Nisar Sayed 
> > ---
> >  drivers/net/usb/lan78xx.c | 22 ++
> >  1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> > index b99a7fb..baf91c7 100644
> > --- a/drivers/net/usb/lan78xx.c
> > +++ b/drivers/net/usb/lan78xx.c
> > @@ -1265,30 +1265,44 @@ static int lan78xx_ethtool_get_eeprom(struct
> net_device *netdev,
> >   struct ethtool_eeprom *ee, u8 *data)  {
> > struct lan78xx_net *dev = netdev_priv(netdev);
> > +   int ret = -EINVAL;
> > +
> > +   if (usb_autopm_get_interface(dev->intf) < 0)
> > +   return ret;
> 
> Hi Nisar
> 
> It is better to do
> 
>ret = usb_autopm_get_interface(dev->intf);
>if (ret)
> return ret;
> 
> i.e. use the error code usb_autopm_get_interface() gives you.
> 
> > ee->magic = LAN78XX_EEPROM_MAGIC;
> >
> > -   return lan78xx_read_raw_eeprom(dev, ee->offset, ee->len, data);
> > +   ret = lan78xx_read_raw_eeprom(dev, ee->offset, ee->len, data);
> > +
> > +   usb_autopm_put_interface(dev->intf);
> > +
> > +   return ret;
> >  }
> >
> >  static int lan78xx_ethtool_set_eeprom(struct net_device *netdev,
> >   struct ethtool_eeprom *ee, u8 *data)  {
> > struct lan78xx_net *dev = netdev_priv(netdev);
> > +   int ret = -EINVAL;
> > +
> > +   if (usb_autopm_get_interface(dev->intf) < 0)
> > +   return ret;
> 
> Same here.
> 
>  Andrew

Thanks Andrew, will update it.

- Nisar
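
The pattern Andrew suggests (return the error code from the get call, and
always drop the reference after the work is done) can be sketched in
userspace as follows. struct fake_dev and the fake_*() helpers are
hypothetical stand-ins for the USB autopm and EEPROM calls; they are not
kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device model: canned results plus a reference counter. */
struct fake_dev {
	int pm_get_result; /* what the "get" call should return */
	int read_result;   /* what the EEPROM read should return */
	int pm_refs;       /* outstanding power-management references */
};

static int fake_pm_get(struct fake_dev *d)
{
	if (d->pm_get_result < 0)
		return d->pm_get_result;
	d->pm_refs++;
	return 0;
}

static void fake_pm_put(struct fake_dev *d)
{
	d->pm_refs--;
}

static int fake_read_eeprom(struct fake_dev *d)
{
	return d->read_result;
}

/* Propagate the get error as-is; otherwise do the read, then put. */
static int get_eeprom(struct fake_dev *d)
{
	int ret;

	assert(d != NULL);

	ret = fake_pm_get(d);
	if (ret < 0)
		return ret;

	ret = fake_read_eeprom(d);
	fake_pm_put(d);

	return ret;
}
```

The caller then sees the real failure reason (for example -ENODEV) instead of
a fixed -EINVAL, and the reference count is balanced on every path.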

Re: [PATCH net-next v2 1/2] net: phy: realtek: rename RTL8211F_PAGE_SELECT to RTL821x_PAGE_SELECT

2017-09-12 Thread Andrew Lunn

On Tue, Sep 12, 2017 at 06:54:35PM +0900, Kunihiko Hayashi wrote:
> This renames the definition of page select register from
> RTL8211F_PAGE_SELECT to RTL821x_PAGE_SELECT to use it across models.
> 
> Signed-off-by: Kunihiko Hayashi 

Reviewed-by: Andrew Lunn 

Andrew

[PATCH] w90p910_ether: include linux/interrupt.h

2017-09-12 Thread Arnd Bergmann

A randconfig build caused a compile failure:

drivers/net/ethernet/nuvoton/w90p910_ether.c: In function 'w90p910_ether_close':
drivers/net/ethernet/nuvoton/w90p910_ether.c:580:2: error: implicit declaration 
of function 'free_irq'; did you mean 'free_uid'? 
[-Werror=implicit-function-declaration]

Adding the correct include fixes the problem.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/nuvoton/w90p910_ether.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/nuvoton/w90p910_ether.c 
b/drivers/net/ethernet/nuvoton/w90p910_ether.c
index 89ab786da25f..4a67c55aa9f1 100644
--- a/drivers/net/ethernet/nuvoton/w90p910_ether.c
+++ b/drivers/net/ethernet/nuvoton/w90p910_ether.c
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.9.0

[PATCH net] net: bonding: fix tlb_dynamic_lb default value

2017-09-12 Thread Nikolay Aleksandrov

Commit 8b426dc54cf4 ("bonding: remove hardcoded value") changed the
default value for tlb_dynamic_lb which led to either broken ALB mode
(since tlb_dynamic_lb can be changed only in TLB) or setting TLB mode
with tlb_dynamic_lb equal to 0.
The first issue was recently fixed by setting tlb_dynamic_lb to 1 always
when switching to ALB mode, but the default value is still wrong and
we'll enter TLB mode with tlb_dynamic_lb equal to 0 if the mode is
changed via netlink or sysfs. In order to restore the previous behaviour
and default value simply remove the mode check around the default param
initialization for tlb_dynamic_lb which will always set it to 1 as
before.

Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
Signed-off-by: Nikolay Aleksandrov 
---
 drivers/net/bonding/bond_main.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index fc63992ab0e0..c99dc59d729b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4289,7 +4289,7 @@ static int bond_check_params(struct bond_params *params)
int bond_mode   = BOND_MODE_ROUNDROBIN;
int xmit_hashtype = BOND_XMIT_POLICY_LAYER2;
int lacp_fast = 0;
-   int tlb_dynamic_lb = 0;
+   int tlb_dynamic_lb;
 
/* Convert string parameters. */
if (mode) {
@@ -4601,16 +4601,13 @@ static int bond_check_params(struct bond_params *params)
}
ad_user_port_key = valptr->value;
 
-   if ((bond_mode == BOND_MODE_TLB) || (bond_mode == BOND_MODE_ALB)) {
-   bond_opt_initstr(, "default");
-   valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB),
-   );
-   if (!valptr) {
-   pr_err("Error: No tlb_dynamic_lb default value");
-   return -EINVAL;
-   }
-   tlb_dynamic_lb = valptr->value;
+   bond_opt_initstr(, "default");
+   valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB), );
+   if (!valptr) {
+   pr_err("Error: No tlb_dynamic_lb default value");
+   return -EINVAL;
}
+   tlb_dynamic_lb = valptr->value;
 
if (lp_interval == 0) {
pr_warn("Warning: ip_interval must be between 1 and %d, so it was reset to %d\n",
-- 
2.1.4

Re: ipset losing entries on its own

2017-09-12 Thread Akshat Kakkar

Can somebody shed more light on this? How is it possible (without a
bug) that for exactly the same set of IPs, the IPSET HASHSIZE sometimes
remains at 1024 and at other times increases to 2048?

As a workaround I am setting HASHSIZE to 16384 when the IPSET is
created, and so far (it's been more than 4 days) the issue has not
recurred.

But this needs to be addressed, right?

Re: Subject: [PATCH] vxlan: only reduce known arp boardcast request to support, virtual IP

2017-09-12 Thread Jiri Benc

On Tue, 12 Sep 2017 11:26:49 +0800, oc wrote:
> The purpose of vxlan arp reduce feature is to reply to the broadcast
> arp request in vtep instead of sending it out to save traffic.
> The current implementation drops the arp packet if the ip cannot be
> found in neigh table. In the case of virtual IP address, user
> defines IP address without management from SDN controller. The IP
> address does not exist in neigh table, so the arp broadcast request
> from a client can not be sent to the server who owns the virtual IP
> address.
> 
> This patch allows the arp request to be sent out if:
> 1. not arp broadcast request
> 2. cannot be found in neigh table
> 3. arp record status is not NUD_CONNECTED
> 
> The user-defined virtual IP address works while arp reduce still
> suppresses the arp broadcast for IP address managed by SDN controller
> with this patch.

Your patch is whitespace damaged, does not conform to the kernel coding
style and the email does not have your full name in the From header.

As for the patch itself, you're changing existing functionality that
people may depend on and thus a new config option is needed to enable
the behavior.

 Jiri

[PATCH] vti: fix NULL dereference in xfrm_input()

2017-09-12 Thread Alexey Kodanev

Can be reproduced with LTP tests:
  # icmp-uni-vti.sh -p ah -a sha256 -m tunnel -S fffe -k 1 -s 10

IPv4:
  RIP: 0010:xfrm_input+0x7f9/0x870
  ...
  Call Trace:
  
  vti_input+0xaa/0x110 [ip_vti]
  ? skb_free_head+0x21/0x40
  vti_rcv+0x33/0x40 [ip_vti]
  xfrm4_ah_rcv+0x33/0x60
  ip_local_deliver_finish+0x94/0x1e0
  ip_local_deliver+0x6f/0xe0
  ? ip_route_input_noref+0x28/0x50
  ...

  # icmp-uni-vti.sh -6 -p ah -a sha256 -m tunnel -S fffe -k 1 -s 10
IPv6:
  RIP: 0010:xfrm_input+0x7f9/0x870
  ...
  Call Trace:
  
  xfrm6_rcv_tnl+0x3c/0x40
  vti6_rcv+0xd5/0xe0 [ip6_vti]
  xfrm6_ah_rcv+0x33/0x60
  ip6_input_finish+0xee/0x460
  ip6_input+0x3f/0xb0
  ip6_rcv_finish+0x45/0xa0
  ipv6_rcv+0x34b/0x540

xfrm_input() invokes xfrm_rcv_cb() -> vti_rcv_cb(), the last callback
might call skb_scrub_packet(), which in turn can reset secpath.

Fix it by adding a check that skb->sp is not NULL.

Fixes: 7e9e9202bccc ("xfrm: Clear RX SKB secpath xfrm_offload")
Signed-off-by: Alexey Kodanev 
---
 net/xfrm/xfrm_input.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 2515cd2..8ac9d32 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -429,7 +429,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
nf_reset(skb);
 
if (decaps) {
-   skb->sp->olen = 0;
+   if (skb->sp)
+   skb->sp->olen = 0;
skb_dst_drop(skb);
gro_cells_receive(&gro_cells, skb);
return 0;
@@ -440,7 +441,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
err = x->inner_mode->afinfo->transport_finish(skb, xfrm_gro || async);
if (xfrm_gro) {
-   skb->sp->olen = 0;
+   if (skb->sp)
+   skb->sp->olen = 0;
skb_dst_drop(skb);
gro_cells_receive(&gro_cells, skb);
return err;
-- 
1.7.1
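
For readers following along, the pattern in the fix above (re-check a
pointer after a callback that may have cleared it) can be sketched in
user space. This is only an illustration under stated assumptions:
struct sec_path_sketch, struct pkt_sketch, scrub_cb() and reset_olen()
are hypothetical stand-ins, not the kernel's sk_buff or xfrm code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the secpath and the packet that owns it. */
struct sec_path_sketch {
	int olen;
};

struct pkt_sketch {
	struct sec_path_sketch *sp;
};

/* Models a receive callback that scrubs the packet and drops sp. */
static void scrub_cb(struct pkt_sketch *pkt)
{
	free(pkt->sp);
	pkt->sp = NULL;
}

/* Reset olen only if sp survived the callback; returns 1 if it did. */
static int reset_olen(struct pkt_sketch *pkt)
{
	if (!pkt->sp)
		return 0;
	pkt->sp->olen = 0;
	return 1;
}
```

Calling reset_olen() after scrub_cb() returns 0 instead of
dereferencing NULL, which is the behaviour the two added checks aim for.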

[PATCH] qed: remove unnecessary call to memset

2017-09-12 Thread Himanshu Jha

Calling memset() to zero memory immediately after allocating it
with kzalloc() is unnecessary, as kzalloc() already returns
zero-filled memory.

Semantic patch used to resolve this issue:

@@
expression e,e2; constant c;
statement S;
@@

  e = kzalloc(e2, c);
  if(e == NULL) S
- memset(e, 0, e2);

Signed-off-by: Himanshu Jha 
---
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index eaca457..8f6ccc0 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -1244,7 +1244,6 @@ int qed_dcbx_get_config_params(struct qed_hwfn *p_hwfn,
if (!dcbx_info)
return -ENOMEM;
 
-   memset(dcbx_info, 0, sizeof(*dcbx_info));
rc = qed_dcbx_query_params(p_hwfn, dcbx_info, QED_DCBX_OPERATIONAL_MIB);
if (rc) {
kfree(dcbx_info);
-- 
2.7.4
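
A user-space analogue may help illustrate why the memset() is
redundant. The sketch below uses calloc() as a stand-in for kzalloc()
(an assumption made for illustration; both return zero-filled memory),
and struct dcbx_info_sketch, alloc_info() and is_all_zero() are
hypothetical names, not taken from the qed driver.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical structure standing in for the driver's dcbx_info object. */
struct dcbx_info_sketch {
	int operational;
	unsigned char pad[32];
};

/* calloc() hands back zero-filled memory, so no memset() is needed. */
static struct dcbx_info_sketch *alloc_info(void)
{
	return calloc(1, sizeof(struct dcbx_info_sketch));
}

/* Returns 1 when every byte of the object is zero. */
static int is_all_zero(const void *p, size_t len)
{
	const unsigned char *b = p;
	size_t i;

	for (i = 0; i < len; i++)
		if (b[i] != 0)
			return 0;
	return 1;
}
```

A freshly allocated object already passes is_all_zero(), so a
following memset(info, 0, sizeof(*info)) would only repeat work.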

Re: broken vlan support on Realtek RTL8111/8168/8411 rev 9

2017-09-12 Thread Benoit Panizzon

Hi Francois

> ethtool -K eth0 rxvlan off

Thank you, that did the trick, vlan tags are now correctly passed on
and not set to vlan 0 with rxvlan turned off.
 
> For my reward, please send a complete dmesg where the messages from
> the vanilla r8169 module appear for the rev 09 card (r81..f ?). I
> won't dig it right now.

Yes, the 'f' variant:

[1.035203] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[1.035262] r8169 :03:00.0: setting latency timer to 64
[1.035348] r8169 :03:00.0: irq 41 for MSI/MSI-X
[1.035634] r8169 :03:00.0: eth0: RTL8168f/8111f at 0xc9c28000, c8:60:00:dd:f8:6c, XID 08000800 IRQ 41
[1.035637] r8169 :03:00.0: eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[   10.403921] r8169 :03:00.0: firmware: agent loaded rtl_nic/rtl8168f-1.fw into memory

Linux pulsar 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux

624831688e25aa47fa84c30c045fcae3  /lib/firmware/rtl_nic/rtl8168f-1.fw

Firmware Bug or Hardware Problem?

-Benoît Panizzon-
-- 
I m p r o W a r e   A G-Leiter Commerce Kunden
__

Zurlindenstrasse 29 Tel  +41 61 826 93 00
CH-4133 PrattelnFax  +41 61 826 93 01
Schweiz Web  http://www.imp.ch
__

Re: [Patch net v3 3/3] net_sched: carefully handle tcf_block_put()

2017-09-12 Thread Jiri Pirko

Tue, Sep 12, 2017 at 01:33:32AM CEST, xiyou.wangc...@gmail.com wrote:
>As pointed out by Jiri, there is still a race condition between
>tcf_block_put() and tcf_chain_destroy() in a RCU callback. There
>is no way to make it correct without proper locking or synchronization,
>because both operate on a shared list.
>
>Locking is hard, because the only lock we can pick here is a spinlock,
>however, in tc_dump_tfilter() we iterate this list with a sleeping
>function called (tcf_chain_dump()), which makes using a lock to protect
>chain_list almost impossible.
>
>Jiri suggested the idea of holding a refcnt before flushing, this works
>because it guarantees us there would be no parallel tcf_chain_destroy()
>during the loop, therefore the race condition is gone. But we have to
>be very careful with proper synchronization with RCU callbacks.
>
>Suggested-by: Jiri Pirko 
>Cc: Jamal Hadi Salim 
>Signed-off-by: Cong Wang 

Acked-by: Jiri Pirko 

Thanks!

Re: [Patch net v3 2/3] net_sched: fix reference counting of tc filter chain

2017-09-12 Thread Jiri Pirko

Tue, Sep 12, 2017 at 01:33:31AM CEST, xiyou.wangc...@gmail.com wrote:
>This patch fixes the following ugliness of tc filter chain refcnt:
>
>a) tp proto should hold a refcnt to the chain too. This significantly
>   simplifies the logic.
>
>b) Chain 0 is no longer special, it is created with refcnt=1 like any
>   other chains. All the ugliness in tcf_chain_put() can be gone!
>
>c) No need to handle the flushing oddly, because block still holds
>   chain 0, it can not be released, this guarantees block is the last
>   user.
>
>d) The race condition with RCU callbacks is easier to handle with just
>   a rcu_barrier(). Much easier to understand, nothing to hide. Thanks
>   to the previous patch. Please see also the comments in code.
>
>e) Make the code understandable by humans, much less error-prone.
>
>Fixes: 744a4cf63e52 ("net: sched: fix use after free when tcf_chain_destroy is 
>called multiple times")
>Fixes: 5bc1701881e3 ("net: sched: introduce multichain support for filters")
>Cc: Jiri Pirko 
>Cc: Jamal Hadi Salim 
>Signed-off-by: Cong Wang 

Looking good to me. Thanks!

Acked-by: Jiri Pirko

Re: [Patch net v3 1/3] net_sched: get rid of tcfa_rcu

2017-09-12 Thread Jiri Pirko

Tue, Sep 12, 2017 at 11:42:15AM CEST, j...@resnulli.us wrote:
>Tue, Sep 12, 2017 at 01:33:30AM CEST, xiyou.wangc...@gmail.com wrote:
>>gen estimator has been rewritten in commit 1c0d32fde5bd
>>("net_sched: gen_estimator: complete rewrite of rate estimators"),
>>the caller is no longer needed to wait for a grace period.
>>So this patch gets rid of it.
>>
>>This also completely closes a race condition between action free
>>path and filter chain add/remove path for the following patch.
>>Because otherwise the nested RCU callback can't be caught by
>>rcu_barrier().
>>
>>Please see also the comments in code.
>
>Looks like this is causing a null pointer dereference bug for me, 100%
>of the time. Just add and remove any rule with action and you get:
>

[...]

>
>Looks like you need to save owner of the module before you call
>__tcf_idr_release so you can later on use it for module_put

This patch helps:

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index fcd7dc7..de73e71 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -514,13 +514,15 @@ EXPORT_SYMBOL(tcf_action_exec);
 
 int tcf_action_destroy(struct list_head *actions, int bind)
 {
+   const struct tc_action_ops *ops;
struct tc_action *a, *tmp;
int ret = 0;
 
list_for_each_entry_safe(a, tmp, actions, list) {
+   ops = a->ops;
ret = __tcf_idr_release(a, bind, true);
if (ret == ACT_P_DELETED)
-   module_put(a->ops->owner);
+   module_put(ops->owner);
else if (ret < 0)
return ret;
}
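
The idea behind this fix (read what you need from the object before
the call that may free it) can be sketched in user space. This is only
an illustration: struct ops_sketch, struct action_sketch,
action_release() and action_destroy() are hypothetical, and a plain
counter stands in for module_put().

```c
#include <stdlib.h>

/* Hypothetical stand-ins for tc_action_ops and tc_action. */
struct ops_sketch {
	int owner_refs;		/* models the module reference count */
};

struct action_sketch {
	struct ops_sketch *ops;
	int refcnt;
};

/* Drops one reference; frees the action on the last one. Returns 1 if freed. */
static int action_release(struct action_sketch *a)
{
	if (--a->refcnt > 0)
		return 0;
	free(a);
	return 1;
}

/* Saves ops before releasing, so "a" is never touched after a free. */
static int action_destroy(struct action_sketch *a)
{
	struct ops_sketch *ops = a->ops;
	int deleted = action_release(a);

	if (deleted)
		ops->owner_refs--;	/* models module_put(ops->owner) */
	return deleted;
}
```

Reading a->ops after action_release() returned 1 would be the
use-after-free that KASAN reported.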

[PATCH] ipv4: Namespaceify tcp_fastopen knob

2017-09-12 Thread Haishuang Yan

Applications in different namespaces might need to enable the TCP Fast
Open feature independently of the host.

Reported-by: Luca BRUNO 
Signed-off-by: Haishuang Yan 
---
 include/net/netns/ipv4.h   |  2 ++
 include/net/tcp.h  |  1 -
 net/ipv4/af_inet.c |  7 ---
 net/ipv4/sysctl_net_ipv4.c | 42 +-
 net/ipv4/tcp.c |  4 ++--
 net/ipv4/tcp_fastopen.c| 13 ++---
 net/ipv4/tcp_ipv4.c|  2 ++
 samples/bpf/test_ipip.sh   |  2 ++
 8 files changed, 39 insertions(+), 34 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 305e031..ea0953b 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -128,6 +128,8 @@ struct netns_ipv4 {
struct inet_timewait_death_row tcp_death_row;
int sysctl_max_syn_backlog;
int sysctl_tcp_max_orphans;
+   int sysctl_tcp_fastopen;
+   unsigned int sysctl_tcp_fastopen_blackhole_timeout;
 
 #ifdef CONFIG_NET_L3_MASTER_DEV
int sysctl_udp_l3mdev_accept;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ac2d998..e4cc0dd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -240,7 +240,6 @@
 
 
 /* sysctl variables for tcp */
-extern int sysctl_tcp_fastopen;
 extern int sysctl_tcp_retrans_collapse;
 extern int sysctl_tcp_stdurg;
 extern int sysctl_tcp_rfc1337;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e..309b849 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -195,7 +195,7 @@ int inet_listen(struct socket *sock, int backlog)
 {
struct sock *sk = sock->sk;
unsigned char old_state;
-   int err;
+   int err, tcp_fastopen;
 
lock_sock(sk);
 
@@ -217,8 +217,9 @@ int inet_listen(struct socket *sock, int backlog)
 * because the socket was in TCP_LISTEN state previously but
 * was shutdown() rather than close().
 */
-   if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) &&
-   (sysctl_tcp_fastopen & TFO_SERVER_ENABLE) &&
+   tcp_fastopen =  sock_net(sk)->ipv4.sysctl_tcp_fastopen;
+   if ((tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) &&
+   (tcp_fastopen & TFO_SERVER_ENABLE) &&
!inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) {
fastopen_queue_tune(sk, backlog);
tcp_fastopen_init_key_once(true);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4f26c8d3..30ebeb9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -394,27 +394,6 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
.proc_handler   = proc_dointvec
},
{
-   .procname   = "tcp_fastopen",
-   .data   = &sysctl_tcp_fastopen,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   },
-   {
-   .procname   = "tcp_fastopen_key",
-   .mode   = 0600,
-   .maxlen = ((TCP_FASTOPEN_KEY_LENGTH * 2) + 10),
-   .proc_handler   = proc_tcp_fastopen_key,
-   },
-   {
-   .procname   = "tcp_fastopen_blackhole_timeout_sec",
-   .data   = &sysctl_tcp_fastopen_blackhole_timeout,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_tfo_blackhole_detect_timeout,
-   .extra1 = &zero,
-   },
-   {
.procname   = "tcp_abort_on_overflow",
.data   = &sysctl_tcp_abort_on_overflow,
.maxlen = sizeof(int),
@@ -1085,6 +1064,27 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "tcp_fastopen",
+   .data   = &init_net.ipv4.sysctl_tcp_fastopen,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "tcp_fastopen_key",
+   .mode   = 0600,
+   .maxlen = ((TCP_FASTOPEN_KEY_LENGTH * 2) + 10),
+   .proc_handler   = proc_tcp_fastopen_key,
+   },
+   {
+   .procname   = "tcp_fastopen_blackhole_timeout_sec",
+   .data   = &init_net.ipv4.sysctl_tcp_fastopen_blackhole_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_tfo_blackhole_detect_timeout,
+   .extra1 = &zero,
+   },
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
{
.procname   =

Re: [PATCH] ieee802154: fix gcc-4.9 warnings

2017-09-12 Thread Stefan Schmidt


Hello.

On 09/12/2017 12:16 PM, Arnd Bergmann wrote:

All older compiler versions up to gcc-4.9 produce these
harmless warnings:

drivers/net/ieee802154/ca8210.c: In function 'ca8210_skb_tx':
drivers/net/ieee802154/ca8210.c:1947:9: warning: missing braces around initializer [-Wmissing-braces]

This changes the syntax to something that works on all versions
without warnings.

Fixes: ded845a781a5 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
Signed-off-by: Arnd Bergmann 
---
  drivers/net/ieee802154/ca8210.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ieee802154/ca8210.c b/drivers/net/ieee802154/ca8210.c
index 24a1eabbbc9d..e6b8ce81a6c3 100644
--- a/drivers/net/ieee802154/ca8210.c
+++ b/drivers/net/ieee802154/ca8210.c
@@ -1944,7 +1944,7 @@ static int ca8210_skb_tx(
  )
  {
int status;
-   struct ieee802154_hdr header = { 0 };
+   struct ieee802154_hdr header = { };
struct secspec secspec;
unsigned int mac_len;
  



Acked-by: Stefan Schmidt 

regards
Stefan Schmidt

[PATCH] ieee802154: fix gcc-4.9 warnings

2017-09-12 Thread Arnd Bergmann

All older compiler versions up to gcc-4.9 produce these
harmless warnings:

drivers/net/ieee802154/ca8210.c: In function 'ca8210_skb_tx':
drivers/net/ieee802154/ca8210.c:1947:9: warning: missing braces around initializer [-Wmissing-braces]

This changes the syntax to something that works on all versions
without warnings.

Fixes: ded845a781a5 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/ieee802154/ca8210.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ieee802154/ca8210.c b/drivers/net/ieee802154/ca8210.c
index 24a1eabbbc9d..e6b8ce81a6c3 100644
--- a/drivers/net/ieee802154/ca8210.c
+++ b/drivers/net/ieee802154/ca8210.c
@@ -1944,7 +1944,7 @@ static int ca8210_skb_tx(
 )
 {
int status;
-   struct ieee802154_hdr header = { 0 };
+   struct ieee802154_hdr header = { };
struct secspec secspec;
unsigned int mac_len;
 
-- 
2.9.0
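
As an illustration of the initializer change, here is a small
user-space sketch. The struct layout is hypothetical (it is not the
real struct ieee802154_hdr); it only assumes a first member that is
itself a struct, the kind of layout where "= { 0 }" can draw
-Wmissing-braces on older gcc. Note that "= { }" is a GNU C extension
(standardised in C23), so the sketch also shows a memset() alternative.

```c
#include <string.h>

/* Hypothetical nested layout: the first member is itself a struct. */
struct addr_sketch {
	int mode;
	unsigned char bytes[8];
};

struct hdr_sketch {
	struct addr_sketch source;
	struct addr_sketch dest;
	unsigned char seq;
};

/* Empty braces: zero-initializes every member (GNU C / C23). */
static struct hdr_sketch hdr_empty_braces(void)
{
	struct hdr_sketch h = { };

	return h;
}

/* Portable alternative: explicit memset(). */
static struct hdr_sketch hdr_memset(void)
{
	struct hdr_sketch h;

	memset(&h, 0, sizeof(h));
	return h;
}

/* Returns 1 when every named member of the header is zero. */
static int hdr_is_zero(const struct hdr_sketch *h)
{
	static const unsigned char zeros[8];

	return h->source.mode == 0 && h->dest.mode == 0 && h->seq == 0 &&
	       memcmp(h->source.bytes, zeros, sizeof(zeros)) == 0 &&
	       memcmp(h->dest.bytes, zeros, sizeof(zeros)) == 0;
}
```

Both helpers should produce a header for which hdr_is_zero() returns 1.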

[PATCH net-next v2 1/2] net: phy: realtek: rename RTL8211F_PAGE_SELECT to RTL821x_PAGE_SELECT

2017-09-12 Thread Kunihiko Hayashi

This renames the definition of page select register from
RTL8211F_PAGE_SELECT to RTL821x_PAGE_SELECT to use it across models.

Signed-off-by: Kunihiko Hayashi 
---
Changes since v1:
 - new patch in this series
---
 drivers/net/phy/realtek.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index 9cbe645..99c3297 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -22,11 +22,11 @@
 #define RTL821x_INER   0x12
 #define RTL821x_INER_INIT  0x6400
 #define RTL821x_INSR   0x13
+#define RTL821x_PAGE_SELECT0x1f
 #define RTL8211E_INER_LINK_STATUS 0x400
 
 #define RTL8211F_INER_LINK_STATUS 0x0010
 #define RTL8211F_INSR  0x1d
-#define RTL8211F_PAGE_SELECT   0x1f
 #define RTL8211F_TX_DELAY  0x100
 
 MODULE_DESCRIPTION("Realtek PHY driver");
@@ -46,10 +46,10 @@ static int rtl8211f_ack_interrupt(struct phy_device *phydev)
 {
int err;
 
-   phy_write(phydev, RTL8211F_PAGE_SELECT, 0xa43);
+   phy_write(phydev, RTL821x_PAGE_SELECT, 0xa43);
err = phy_read(phydev, RTL8211F_INSR);
/* restore to default page 0 */
-   phy_write(phydev, RTL8211F_PAGE_SELECT, 0x0);
+   phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
 
return (err < 0) ? err : 0;
 }
@@ -102,7 +102,7 @@ static int rtl8211f_config_init(struct phy_device *phydev)
if (ret < 0)
return ret;
 
-   phy_write(phydev, RTL8211F_PAGE_SELECT, 0xd08);
+   phy_write(phydev, RTL821x_PAGE_SELECT, 0xd08);
reg = phy_read(phydev, 0x11);
 
/* enable TX-delay for rgmii-id and rgmii-txid, otherwise disable it */
@@ -114,7 +114,7 @@ static int rtl8211f_config_init(struct phy_device *phydev)
 
phy_write(phydev, 0x11, reg);
/* restore to default page 0 */
-   phy_write(phydev, RTL8211F_PAGE_SELECT, 0x0);
+   phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
 
return 0;
 }
-- 
2.7.4

[PATCH net-next v2 2/2] net: phy: realtek: add RTL8201F phy-id and functions

2017-09-12 Thread Kunihiko Hayashi

From: Jassi Brar 

Add RTL8201F phy-id and the related functions to the driver.

The original patch is as follows:
https://patchwork.kernel.org/patch/2538341/

Signed-off-by: Jongsung Kim 
Signed-off-by: Jassi Brar 
Signed-off-by: Kunihiko Hayashi 
Reviewed-by: Andrew Lunn 
Reviewed-by: Florian Fainelli 
---
Changes since v1:
 - use RTL821x_PAGE_SELECT instead of defining RTL8201F_PAGE_SELECT
---
 drivers/net/phy/realtek.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index 99c3297..d4670ec 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -29,10 +29,22 @@
 #define RTL8211F_INSR  0x1d
 #define RTL8211F_TX_DELAY  0x100
 
+#define RTL8201F_ISR   0x1e
+#define RTL8201F_IER   0x13
+
 MODULE_DESCRIPTION("Realtek PHY driver");
 MODULE_AUTHOR("Johnson Leung");
 MODULE_LICENSE("GPL");
 
+static int rtl8201_ack_interrupt(struct phy_device *phydev)
+{
+   int err;
+
+   err = phy_read(phydev, RTL8201F_ISR);
+
+   return (err < 0) ? err : 0;
+}
+
 static int rtl821x_ack_interrupt(struct phy_device *phydev)
 {
int err;
@@ -54,6 +66,25 @@ static int rtl8211f_ack_interrupt(struct phy_device *phydev)
return (err < 0) ? err : 0;
 }
 
+static int rtl8201_config_intr(struct phy_device *phydev)
+{
+   int err;
+
+   /* switch to page 7 */
+   phy_write(phydev, RTL821x_PAGE_SELECT, 0x7);
+
+   if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
+   err = phy_write(phydev, RTL8201F_IER,
+   BIT(13) | BIT(12) | BIT(11));
+   else
+   err = phy_write(phydev, RTL8201F_IER, 0);
+
+   /* restore to default page 0 */
+   phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
+
+   return err;
+}
+
 static int rtl8211b_config_intr(struct phy_device *phydev)
 {
int err;
@@ -129,6 +160,18 @@ static struct phy_driver realtek_drvs[] = {
.config_aneg= &genphy_config_aneg,
.read_status= &genphy_read_status,
}, {
+   .phy_id = 0x001cc816,
+   .name   = "RTL8201F 10/100Mbps Ethernet",
+   .phy_id_mask= 0x001fffff,
+   .features   = PHY_BASIC_FEATURES,
+   .flags  = PHY_HAS_INTERRUPT,
+   .config_aneg= &genphy_config_aneg,
+   .read_status= &genphy_read_status,
+   .ack_interrupt  = &rtl8201_ack_interrupt,
+   .config_intr= &rtl8201_config_intr,
+   .suspend= genphy_suspend,
+   .resume = genphy_resume,
+   }, {
.phy_id = 0x001cc912,
.name   = "RTL8211B Gigabit Ethernet",
.phy_id_mask= 0x001fffff,
@@ -181,6 +224,7 @@ static struct phy_driver realtek_drvs[] = {
 module_phy_driver(realtek_drvs);
 
 static struct mdio_device_id __maybe_unused realtek_tbl[] = {
+   { 0x001cc816, 0x001fffff },
{ 0x001cc912, 0x001fffff },
{ 0x001cc914, 0x001fffff },
{ 0x001cc915, 0x001fffff },
-- 
2.7.4
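
The select-page, access, restore-page-0 sequence used by
rtl8201_config_intr() can be sketched against a mock register file.
Everything here is an assumption for illustration: struct phy_sketch,
phy_write_sketch() and paged_write() are hypothetical and do not model
real MDIO timing or the kernel's phy_write().

```c
#include <stdint.h>

#define PAGE_SELECT_REG	0x1f	/* same register number as RTL821x_PAGE_SELECT */
#define NUM_PAGES	8
#define NUM_REGS	32

/* Mock PHY: a current page plus one register bank per page. */
struct phy_sketch {
	uint16_t page;
	uint16_t regs[NUM_PAGES][NUM_REGS];
};

/* Writes go to the current page; the page-select register switches pages. */
static int phy_write_sketch(struct phy_sketch *phy, unsigned int reg,
			    uint16_t val)
{
	if (reg >= NUM_REGS)
		return -1;
	if (reg == PAGE_SELECT_REG) {
		if (val >= NUM_PAGES)
			return -1;
		phy->page = val;
		return 0;
	}
	phy->regs[phy->page][reg] = val;
	return 0;
}

/* Select a page, write one register, then always restore page 0. */
static int paged_write(struct phy_sketch *phy, uint16_t page,
		       unsigned int reg, uint16_t val)
{
	int err, restore;

	err = phy_write_sketch(phy, PAGE_SELECT_REG, page);
	if (err)
		return err;
	err = phy_write_sketch(phy, reg, val);
	restore = phy_write_sketch(phy, PAGE_SELECT_REG, 0);

	return err ? err : restore;
}
```

For example, paged_write(&phy, 7, 0x13, 0x3800) mirrors enabling bits
13, 12 and 11 of the interrupt enable register on page 7 while leaving
the mock PHY back on page 0.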

[PATCH v4 2/2] ip6_tunnel: fix ip6 tunnel lookup in collect_md mode

2017-09-12 Thread Haishuang Yan

In collect_md mode, if the tun dev is down, it can still call
__ip6_tnl_rcv to receive packets, and the rx statistics increase
improperly.

When the md tunnel is down, it's not necessary to increase RX drops
for the tunnel device; packets would be received on the fallback tunnel,
and the RX drops on the fallback device will be increased as expected.

Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Cc: Alexei Starovoitov 
Signed-off-by: Haishuang Yan 

---
Change since v4:
  * Make the commit message clearer
  * Fix wrong recipient address
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 10a693a..ae73164 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -171,7 +171,7 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
}
 
t = rcu_dereference(ip6n->collect_md_tun);
-   if (t)
+   if (t && t->dev->flags & IFF_UP)
return t;
 
t = rcu_dereference(ip6n->tnls_wc[0]);
-- 
1.8.3.1

[PATCH v4 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode

2017-09-12 Thread Haishuang Yan

In collect_md mode, if the tun dev is down, it can still call
ip_tunnel_rcv to receive packets, and the rx statistics increase
improperly.

When the md tunnel is down, it's not necessary to increase RX drops
for the tunnel device; packets would be received on the fallback tunnel,
and the RX drops on the fallback device will be increased as expected.

Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
Cc: Pravin B Shelar 
Signed-off-by: Haishuang Yan 

---
Change since v4:
  * Make the commit message clearer.
  * Fix wrong recipient address
---
 net/ipv4/ip_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index e1856bf..e9805ad 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -176,7 +176,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct ip_tunnel_net *itn,
return cand;
 
t = rcu_dereference(itn->collect_md_tun);
-   if (t)
+   if (t && t->dev->flags & IFF_UP)
return t;
 
if (itn->fb_tunnel_dev && itn->fb_tunnel_dev->flags & IFF_UP)
-- 
1.8.3.1
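
The lookup order after this change (use the collect_md tunnel only if
its device is up, otherwise fall through to the fallback device) can be
sketched as below. The types and names are hypothetical stand-ins, not
the kernel's struct ip_tunnel or struct net_device, and
IFF_UP_SKETCH is an illustrative flag value.

```c
#include <stddef.h>

#define IFF_UP_SKETCH	0x1	/* illustrative "interface is up" flag */

/* Hypothetical stand-ins for net_device and ip_tunnel. */
struct dev_sketch {
	unsigned int flags;
};

struct tunnel_sketch {
	struct dev_sketch *dev;
};

/* Prefer the metadata tunnel when it is up, else the fallback, else NULL. */
static struct tunnel_sketch *lookup_sketch(struct tunnel_sketch *md,
					   struct tunnel_sketch *fb)
{
	if (md && (md->dev->flags & IFF_UP_SKETCH))
		return md;
	if (fb && (fb->dev->flags & IFF_UP_SKETCH))
		return fb;
	return NULL;
}
```

With the metadata tunnel's device down, lookup_sketch() returns the
fallback tunnel, so drops would be accounted there rather than on the
downed device.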

Re: [Patch net v3 1/3] net_sched: get rid of tcfa_rcu

2017-09-12 Thread Jiri Pirko

Tue, Sep 12, 2017 at 01:33:30AM CEST, xiyou.wangc...@gmail.com wrote:
>gen estimator has been rewritten in commit 1c0d32fde5bd
>("net_sched: gen_estimator: complete rewrite of rate estimators"),
>the caller is no longer needed to wait for a grace period.
>So this patch gets rid of it.
>
>This also completely closes a race condition between action free
>path and filter chain add/remove path for the following patch.
>Because otherwise the nested RCU callback can't be caught by
>rcu_barrier().
>
>Please see also the comments in code.

Looks like this is causing a null pointer dereference bug for me, 100%
of the time. Just add and remove any rule with action and you get:


[  598.599825] BUG: unable to handle kernel NULL pointer dereference at 
0030
[  598.607782] IP: tcf_action_destroy+0xc0/0x140
[  598.612231] PGD 0 P4D 0 
[  598.614797] Oops:  [#1] SMP KASAN
[  598.618525] Modules linked in: act_gact cls_flower sch_ingress 
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache 
intel_rapl x86_pkg_temp_thermal coretemp mlxsw_spectrum kvm_intel mlxfw kvm 
parman bridge sunrpc irqbypass iTCO_wdt iTCO_vendor_support stp 
crct10dif_pclmul llc crc32_pclmul crc32c_intel mlxsw_pci ghash_clmulni_intel 
mlxsw_core i2c_i801 e1000e pcspkr ptp tpm_tis mei_me pps_core mei tpm_tis_core 
lpc_ich tpm shpchp video
[  598.659010] CPU: 1 PID: 758 Comm: bash Tainted: GB   4.13.0jiri+ 
#70
[  598.666509] Hardware name: Mellanox Technologies Ltd. Mellanox 
switch/Mellanox x86 mezzanine board, BIOS 4.6.5 08/02/2016
[  598.677630] task: 880371624bc0 task.stack: 880387808000
[  598.683648] RIP: 0010:tcf_action_destroy+0xc0/0x140
[  598.688617] RSP: 0018:88038d107cb8 EFLAGS: 00010282
[  598.693922] RAX:  RBX: 88038d107d28 RCX: 820b80e0
[  598.701184] RDX:  RSI: 0008 RDI: 0030
[  598.708405] RBP: 88038d107ce8 R08: 0001 R09: 0001
[  598.715607] R10: 88038d107b27 R11: fbfff0bcf36c R12: 
[  598.722816] R13: 88038d107d38 R14: 88036bf75650 R15: 0001
[  598.730047] FS:  7f398050b700() GS:88038d10() 
knlGS:
[  598.738253] CS:  0010 DS:  ES:  CR0: 80050033
[  598.744086] CR2: 0030 CR3: 000371ac4001 CR4: 001606e0
[  598.751328] Call Trace:
[  598.753809]  
[  598.755871]  tcf_exts_destroy+0x17f/0x260
[  598.775969]  fl_destroy_filter+0x1d/0x30 [cls_flower]
[  598.781069]  rcu_process_callbacks+0x6b2/0xe00


Kasan says:

[  597.503005] BUG: KASAN: use-after-free in tcf_action_destroy+0xad/0x140
[  597.509751] Read of size 8 at addr 88036bf75640 by task bash/758
[  597.516222] 
[  597.517761] CPU: 1 PID: 758 Comm: bash Not tainted 4.13.0jiri+ #70
[  597.524075] Hardware name: Mellanox Technologies Ltd. Mellanox 
switch/Mellanox x86 mezzanine board, BIOS 4.6.5 08/02/2016
[  597.535132] Call Trace:
[  597.537630]  
[  597.539718]  dump_stack+0xd5/0x150
[  597.554853]  print_address_description+0x86/0x410
[  597.559667]  kasan_report+0x181/0x4c0
[  597.583360]  tcf_action_destroy+0xad/0x140
[  597.587551]  tcf_exts_destroy+0x17f/0x260


Ubsan says:

[  598.184033] UBSAN: Undefined behaviour in net/sched/act_api.c:523:4
[  598.190409] member access within null pointer of type 'const struct 
tc_action_ops'
[  598.198076] CPU: 1 PID: 758 Comm: bash Tainted: GB   4.13.0jiri+ 
#70
[  598.205570] Hardware name: Mellanox Technologies Ltd. Mellanox 
switch/Mellanox x86 mezzanine board, BIOS 4.6.5 08/02/2016
[  598.216669] Call Trace:
[  598.219157]  
[  598.221245]  dump_stack+0xd5/0x150
[  598.228703]  ubsan_epilogue+0xd/0x4e
[  598.232333]  __ubsan_handle_type_mismatch+0xf2/0x293
[  598.252880]  tcf_action_destroy+0x115/0x140
[  598.257151]  tcf_exts_destroy+0x17f/0x260
[  598.277336]  fl_destroy_filter+0x1d/0x30 [cls_flower]
[  598.282472]  rcu_process_callbacks+0x6b2/0xe00

Looks like you need to save owner of the module before you call
__tcf_idr_release so you can later on use it for module_put

RE: [PATCH] tipc: Use bsearch library function

2017-09-12 Thread David Laight

From: David Miller
> Sent: 11 September 2017 22:30
> From: Thomas Meyer 
> Date: Sat,  9 Sep 2017 05:18:19 +0200
> 
> > @@ -168,6 +169,18 @@ static struct name_seq *tipc_nameseq_create(u32 type, struct hlist_head *seq_hea
> > return nseq;
> >  }
> >
> > +static int nameseq_find_subseq_cmp(const void *key, const void *elt)
> > +{
> > +   u32 instance = *(u32 *)key;
> > +   struct sub_seq *sseq = (struct sub_seq *)elt;
> 
> Please order local variables from longest to shortest (ie. reverse
> christmas tree).

You probably just need to remove the unnecessary cast of 'void *'.
Although adding the 'const' qualifier will make it wrong again.

You probably ought to make the 'key' a structure - even if it only
contains a single u32.
Casting pointers to numeric types is often wrong.

David
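
A minimal user-space sketch of that suggestion, using the C library's
bsearch() with a key structure and const-correct pointers that need no
casts. The names struct sub_seq_sketch, struct subseq_key,
subseq_cmp() and find_subseq() are hypothetical, and the array is
assumed to hold sorted, non-overlapping [lower, upper] ranges.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical range element, sorted by "lower" and non-overlapping. */
struct sub_seq_sketch {
	uint32_t lower;
	uint32_t upper;
};

/* The key is a structure, even though it holds a single u32. */
struct subseq_key {
	uint32_t instance;
};

/* bsearch() comparator: void pointers convert implicitly, no casts. */
static int subseq_cmp(const void *key, const void *elt)
{
	const struct subseq_key *k = key;
	const struct sub_seq_sketch *sseq = elt;

	if (k->instance < sseq->lower)
		return -1;
	if (k->instance > sseq->upper)
		return 1;
	return 0;
}

/* Returns the range containing "instance", or NULL if there is none. */
static const struct sub_seq_sketch *
find_subseq(const struct sub_seq_sketch *seqs, size_t n, uint32_t instance)
{
	struct subseq_key key = { .instance = instance };

	return bsearch(&key, seqs, n, sizeof(*seqs), subseq_cmp);
}
```

Passing a typed key avoids dereferencing a bare u32 pointer inside the
comparator and leaves room to add fields to the key later.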

Re: [PATCH net-next 2/3] net: ethernet: socionext: add AVE ethernet driver

2017-09-12 Thread Kunihiko Hayashi

Hi Andrew,

On Mon, 11 Sep 2017 14:00:09 +0200 Andrew Lunn  wrote:

> > > > +static irqreturn_t ave_interrupt(int irq, void *netdev)
> > > > +{
> > > > +   struct net_device *ndev = (struct net_device *)netdev;
> > > > +   struct ave_private *priv = netdev_priv(ndev);
> > > > +   u32 gimr_val, gisr_val;
> > > > +
> > > > +   gimr_val = ave_irq_disable_all(ndev);
> > > > +
> > > > +   /* get interrupt status */
> > > > +   gisr_val = ave_r32(ndev, AVE_GISR);
> > > > +
> > > > +   /* PHY */
> > > > +   if (gisr_val & AVE_GI_PHY) {
> > > > +   ave_w32(ndev, AVE_GISR, AVE_GI_PHY);
> > > > +   if (priv->internal_phy_interrupt)
> > > > +   phy_mac_interrupt(ndev->phydev, ndev->phydev->link);
> > > 
> > > Humm. I don't think this is correct. You are supposed to give it the
> > > new link state, not the old.
> > > 
> > > What does a PHY interrupt mean here? 
> > 
> > In the general case, I think PHY events like changing link state are
> > transmitted to CPU as interrupt via interrupt controller, then PHY
> > driver itself can handle the interrupt.
> > 
> > And in this case, PHY events are transmitted to MAC as one of its
> > interrupt factors, then I thought that MAC driver had to tell the
> > events to PHY.
> 
> Could this be in-band SGMII signalling from the PHY to the MAC? Does
> the documentation give examples of when this interrupt will happen?
> 
> Andrew

Unfortunately this MAC doesn't support SGMII.
And there aren't any examples of when this interrupt will happen.
This interrupt happens after ave_open() is called and link is established.

However, I found that auto negotiation didn't start when this interrupt
wasn't handled.

Although ave_init() calls phy_start_aneg(), it doesn't make sense
because phy driver doesn't start yet.

And according to Florian's comment in ave_init(),

> + phy_start_interrupts(phydev);
> +
> + phy_start_aneg(phydev);
>
> No, no, phy_start() would take care of all of that correctly for you,
> you don't have to do it just there because ave_open() eventually
> calls phy_start() for you, so just drop these two calls.

When moving phy_start_aneg() to ave_open(), the handler doesn't need to call
phy_mac_interrupt() and I can remove it from the handler.

---
Best Regards,
Kunihiko Hayashi

scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels

2017-09-12 Thread Michal Hocko

Hi,
we are seeing the following splat with Debian 3.16 stable kernel

BUG: scheduling while atomic: MATLAB/26771/0x0100
Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver rpcsec_gss_krb5 
nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc vmw_vso$
CPU: 0 PID: 26771 Comm: MATLAB Tainted: G   O  3.16.0-4-amd64 #1 Debian 
3.16.7-ckt20-1+deb8u3
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
Platform, BIOS 6.00 09/21/2015
 88315c1e4c20 8150db3f 88193f803dc8 8150acdf
 815103a2 00012f00 8819423dbfd8 00012f00
 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0
Call Trace:
   [] ? dump_stack+0x41/0x51
 [] ? __schedule_bug+0x48/0x55
 [] ? __schedule+0x5d2/0x700
 [] ? schedule_timeout+0x229/0x2a0
 [] ? select_task_rq_fair+0x390/0x700
 [] ? check_preempt_wakeup+0x120/0x1d0
 [] ? wait_for_completion+0xa8/0x120
 [] ? wake_up_state+0x10/0x10
 [] ? call_rcu_bh+0x20/0x20
 [] ? wait_rcu_gp+0x4b/0x60
 [] ? ftrace_raw_output_rcu_utilization+0x40/0x40
 [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci]
 [] ? vmci_transport_destruct+0x1d/0xe0 [vmw_vsock_vmci_transport]
 [] ? vsock_sk_destruct+0x13/0x60 [vsock]
 [] ? __sk_free+0x1a/0x130
 [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 [vmw_vsock_vmci_transport]
 [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 [vmw_vmci]
 [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci]
 [] ? tasklet_action+0xf4/0x100
 [] ? __do_softirq+0xf1/0x290
 [] ? irq_exit+0x95/0xa0
 [] ? do_IRQ+0x52/0xe0
 [] ? common_interrupt+0x6d/0x6d

AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe
to call in interrupt context") but this patch hasn't been backported to
stable trees. It applies cleanly on top of 3.16 stable tree but I am not
familiar with the code to send the backport to the stable maintainer
directly.

Could you double check that the patch below (just a blind cherry-pick)
is correct and that it doesn't need additional patches on top?

Ben, could you take this to your stable 3.16 branch if the patch is correct?

I am CCing Sasha for the 4.1 stable tree as well. I haven't checked whether
pre-3.16 kernels are affected as well.
---
commit fac774c40b5c512113b6373cad498f35bee7a409
Author: Jorgen Hansen 
Date:   Wed Oct 21 04:53:56 2015 -0700

VSOCK: sock_put wasn't safe to call in interrupt context

commit 4ef7ea9195ea73262cd9730fb54e1eb726da157b upstream.

In the vsock vmci_transport driver, sock_put wasn't safe to call
in interrupt context, since that may call the vsock destructor
which in turn calls several functions that should only be called
from process context. This change defers the calling of these
functions to a worker thread. All these functions were
deallocation of resources related to the transport itself.

Furthermore, an unused callback was removed to simplify the
cleanup.

Multiple customers have been hitting this issue when using
VMware tools on vSphere 2015.

Also added a version to the vmci transport module (starting from
1.0.2.0-k since up until now it appears that this module was
sharing version with vsock that is currently at 1.0.1.0-k).

Reviewed-by: Aditya Asarwade 
Reviewed-by: Thomas Hellstrom 
Signed-off-by: Jorgen Hansen 
Signed-off-by: David S. Miller 

diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 9bb63ffec4f2..aed136d27b01 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -40,13 +40,11 @@
 
 static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg);
 static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg);
-static void vmci_transport_peer_attach_cb(u32 sub_id,
- const struct vmci_event_data *ed,
- void *client_data);
 static void vmci_transport_peer_detach_cb(u32 sub_id,
  const struct vmci_event_data *ed,
  void *client_data);
 static void vmci_transport_recv_pkt_work(struct work_struct *work);
+static void vmci_transport_cleanup(struct work_struct *work);
 static int vmci_transport_recv_listen(struct sock *sk,
  struct vmci_transport_packet *pkt);
 static int vmci_transport_recv_connecting_server(
@@ -75,6 +73,10 @@ struct vmci_transport_recv_pkt_info {
struct vmci_transport_packet pkt;
 };
 
+static LIST_HEAD(vmci_transport_cleanup_list);
+static DEFINE_SPINLOCK(vmci_transport_cleanup_lock);
+static DECLARE_WORK(vmci_transport_cleanup_work, vmci_transport_cleanup);
+
 static struct vmci_handle vmci_transport_stream_handle = { VMCI_INVALID_ID,
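
The deferral described in the commit message above can be pictured with a
small userspace analogue. This is only a sketch under assumptions:
struct transport, defer_cleanup() and run_cleanup() are hypothetical names,
and a plain singly linked list stands in for the list, spinlock and work item
that the diff declares. It is single-threaded, so no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transport object; 'released' marks that teardown has run. */
struct transport {
    int id;
    int released;
    struct transport *next;    /* link in the pending-cleanup list */
};

static struct transport *cleanup_list;   /* pending objects, newest first */

/* Safe on the "atomic" path: only links the object, never blocks. */
static void defer_cleanup(struct transport *t)
{
    assert(t != NULL);
    t->next = cleanup_list;
    cleanup_list = t;
}

/* Runs later in "process" context: detach the list, then tear down each item. */
static int run_cleanup(void)
{
    struct transport *pending = cleanup_list;
    int n = 0;

    cleanup_list = NULL;
    while (pending) {
        struct transport *next = pending->next;

        pending->released = 1;   /* blocking teardown would happen here */
        pending->next = NULL;
        pending = next;
        n++;
    }
    return n;
}
```

The point of the split is that the path which may run in interrupt context
does nothing but queue the object, while everything that may sleep is left to
the later pass.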

Re: [PATCH iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-12 Thread Michal Kubecek

On Tue, Sep 12, 2017 at 10:47:48AM +0200, Michal Kubecek wrote:
> On Mon, Sep 11, 2017 at 03:19:55PM +0800, Hangbin Liu wrote:
> > On Fri, Sep 08, 2017 at 04:51:13PM +0200, Phil Sutter wrote:
> > > Regarding Michal's concern about reentrancy, maybe we should go into a
> > > different direction and make rtnl_recvmsg() return a newly allocated
> > > buffer which the caller has to free.
> > 
> > Hmm... But we could not free the buf in __rtnl_talk(). Because in
> > > __rtnl_talk() we assign the answer with the buf address and return to caller.
> > 
> > for (h = (struct nlmsghdr *)buf; status >= sizeof(*h); ) {
> > [...]
> > if (answer) {
> > *answer= h;
> > return 0;
> > }
> > }
> > 
> > > And the caller will keep using it in later code. Since there are plenty of
> > > functions that call rtnl_talk(), I think it would be much more complex to free
> > > the buffer every time.
> > 
> > 
> > Hi Michal,
> > 
> > > Would you like to tell me more about your concern with reentrancy? It looks
> > > like arpd doesn't call rtnl_talk() or rtnl_dump_filter_l().
> 
> I checked again and arpd indeed isn't a problem. It doesn't seem to call
> any of the two functions (directly or indirectly) and while it's linked
> with "-lpthread", it's not really multithreaded.
> 
> But my concern was rather about other potential users of libnetlink
> (i.e. those which are not part of iproute2). I must admit, though, that
> I'm not sure if libnetlink code is reentrant as of now. (And people are
> discouraged from using it in its own manual page.)
> 
> That being said, I still like Phil's idea for a different reason. While
> investigating the issue with "ip link show dev eth ..." which led me to
> commit 6599162b958e ("iplink: check for message truncation in
> iplink_get()"), I quickly peeked at some other callers of rtnl_talk()
> and I'm afraid there may be others which wouldn't handle truncated
> message correctly. I assume the maxlen argument was always chosen to be
> sufficient for any expected messages but as the example of iplink_get()
> shows, messages returned by kernel may grow over time.
> 
> That's why I like the idea of __rtnl_talk() returning a pointer to newly
> allocated buffer (of sufficient size) rather than copying the response
> into a buffer provided by caller and potentially truncating it.

I'm sorry, I managed to forget that your patch 2 does already address
this problem. But the fact that any caller must keep in mind that he
must not call the same function again until the previous response is no
longer needed still feels like a trap. It's something you need to keep
in mind (where "you" in fact means any future contributor) and it's
easy to forget. That's why I prefer the reentrant functions like
strerror_r() or localtime_r() even in code which is not intended to be
multithreaded. Getting an object which is "mine" to do with whatever
I want until I no longer need it feels like a cleaner interface to me.

Michal Kubecek
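
The interface difference discussed above can be illustrated with a minimal
sketch. It is not libnetlink code: reply_static() and reply_owned() are
hypothetical helpers that only contrast a shared static buffer with a buffer
the caller owns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Non-reentrant style: every call overwrites the same static storage. */
static const char *reply_static(const char *msg)
{
    static char buf[64];

    assert(msg != NULL);
    strncpy(buf, msg, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    return buf;
}

/* Caller-owned style: each call returns a fresh copy the caller must free(). */
static char *reply_owned(const char *msg)
{
    size_t len;
    char *copy;

    assert(msg != NULL);
    len = strlen(msg) + 1;
    copy = malloc(len);
    if (copy)
        memcpy(copy, msg, len);
    return copy;
}
```

With reply_static(), a pointer kept from an earlier call silently changes
content on the next call; with reply_owned(), earlier results stay valid until
the caller frees them, at the cost of one free() per call.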

Re: [PATCH iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-12 Thread Michal Kubecek

On Mon, Sep 11, 2017 at 03:19:55PM +0800, Hangbin Liu wrote:
> On Fri, Sep 08, 2017 at 04:51:13PM +0200, Phil Sutter wrote:
> > Regarding Michal's concern about reentrancy, maybe we should go into a
> > different direction and make rtnl_recvmsg() return a newly allocated
> > buffer which the caller has to free.
> 
> Hmm... But we could not free the buf in __rtnl_talk(). Because in
> __rtnl_talk() we assign the answer with the buf address and return to caller.
> 
>   for (h = (struct nlmsghdr *)buf; status >= sizeof(*h); ) {
>   [...]
>   if (answer) {
>   *answer= h;
>   return 0;
>   }
>   }
> 
> And the caller will keep using it in later code. Since there are plenty of
> functions that call rtnl_talk(), I think it would be much more complex to free
> the buffer every time.
> 
> 
> Hi Michal,
> 
> Would you like to tell me more about your concern with reentrancy? It looks
> like arpd doesn't call rtnl_talk() or rtnl_dump_filter_l().

I checked again and arpd indeed isn't a problem. It doesn't seem to call
any of the two functions (directly or indirectly) and while it's linked
with "-lpthread", it's not really multithreaded.

But my concern was rather about other potential users of libnetlink
(i.e. those which are not part of iproute2). I must admit, though, that
I'm not sure if libnetlink code is reentrant as of now. (And people are
discouraged from using it in its own manual page.)

That being said, I still like Phil's idea for a different reason. While
investigating the issue with "ip link show dev eth ..." which led me to
commit 6599162b958e ("iplink: check for message truncation in
iplink_get()"), I quickly peeked at some other callers of rtnl_talk()
and I'm afraid there may be others which wouldn't handle truncated
message correctly. I assume the maxlen argument was always chosen to be
sufficient for any expected messages but as the example of iplink_get()
shows, messages returned by kernel may grow over time.

That's why I like the idea of __rtnl_talk() returning a pointer to newly
allocated buffer (of sufficient size) rather than copying the response
into a buffer provided by caller and potentially truncating it.

Michal Kubecek
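
One way to return a buffer "of sufficient size" is to ask the socket for the
datagram length before allocating. The sketch below is an assumption about
how that could look, not the iproute2 implementation: recv_alloc() is a
hypothetical helper, and it relies on the Linux behaviour that recv() with
MSG_TRUNC reports the full datagram length.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Receive one datagram into a buffer sized to fit it.
 * Returns the message length and stores a malloc()ed buffer in *out
 * (the caller frees it); returns -1 on error.
 */
static ssize_t recv_alloc(int fd, char **out)
{
    char probe;
    ssize_t len, got;
    char *buf;

    assert(out != NULL);

    /* MSG_PEEK leaves the datagram queued; MSG_TRUNC reports its full length. */
    len = recv(fd, &probe, sizeof(probe), MSG_PEEK | MSG_TRUNC);
    if (len < 0)
        return -1;

    buf = malloc(len ? (size_t)len : 1);
    if (!buf)
        return -1;

    /* Second call actually dequeues the message into the right-sized buffer. */
    got = recv(fd, buf, (size_t)len, 0);
    if (got < 0) {
        free(buf);
        return -1;
    }
    *out = buf;
    return got;
}
```

Because the buffer is sized from the message itself, a caller never has to
guess a maxlen, and a reply that grows in a later kernel is not truncated.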

Re: [PATCH v5 10/10] net: stmmac: dwmac-sun8i: Handle integrated/external MDIOs

2017-09-12 Thread Corentin Labbe

On Mon, Sep 11, 2017 at 10:19:20PM +0200, Andrew Lunn wrote:
> > Even with CLK_BUS_EPHY/RST_BUS_EPHY enabled, the MAC reset times out.
> > So no, the CLK/RST are really for the PHY.
> 
> Thanks for trying that.
> 
> You said it was probably during scanning of the bus it times out. What
> address is causing the timeout? 0 or 1? If the internal bus can only
> have one PHY on it, maybe we need to set bus->phy_mask to 0x1?
> 

I have added a trace at the beginning and end of stmmac_mdio_read():

[   18.145451] libphy: stmmac: probed
[   18.148398] libphy: mdio_mux: probed
[   18.148650] dwmac-sun8i 1c3.ethernet: Switch mux to internal PHY
[   18.248751] dwmac-sun8i 1c3.ethernet: EMAC reset timeout
[   18.249297] libphy: mdio_mux: probed
[   18.249362] dwmac-sun8i 1c3.ethernet: Switch mux to external PHY
[   18.249391] stmmac_mdio_read 0 2
[   18.249598] stmmac_mdio_read 0 2 1c
[   18.249623] stmmac_mdio_read 0 3
[   18.249811] stmmac_mdio_read 0 3 c915
[   20.737271] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
[   31.294868] stmmac_mdio_read 0 0
[   31.295311] stmmac_mdio_read 0 0 1140

It seems that the timeout is unrelated to the MDIO bus.

Regards

[PATCH v2] geneve: Fix setting ttl value in collect metadata mode

2017-09-12 Thread Haishuang Yan

Similar to vxlan/ipip tunnel, if key->ttl is zero in collect metadata
mode, ttl should also fall back to ip{4,6}_dst_hoplimit.

Signed-off-by: Haishuang Yan 

---
Changes since v1:
  * Make the commit message clearer.
---
 drivers/net/geneve.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index f640407..d52a65f 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
if (geneve->collect_md) {
tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
-   ttl = key->ttl;
} else {
tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
-   ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst);
}
+   ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst);
df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 
err = geneve_build_skb(&rt->dst, skb, info, xnet, sizeof(struct iphdr));
@@ -873,12 +872,11 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
if (geneve->collect_md) {
prio = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
-   ttl = key->ttl;
} else {
prio = ip_tunnel_ecn_encap(ip6_tclass(fl6.flowlabel),
   ip_hdr(skb), skb);
-   ttl = key->ttl ? : ip6_dst_hoplimit(dst);
}
+   ttl = key->ttl ? : ip6_dst_hoplimit(dst);
err = geneve_build_skb(dst, skb, info, xnet, sizeof(struct ipv6hdr));
if (unlikely(err))
return err;
-- 
1.8.3.1
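
The fallback that the patch above applies to both modes can be sketched in
isolation. This is a standalone illustration, not the driver code:
struct tunnel_key and pick_ttl() are hypothetical, and the portable
conditional is used where the patch uses the GNU "a ? : b" shorthand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced tunnel key: only the fields the example needs. */
struct tunnel_key {
    uint8_t tos;
    uint8_t ttl;
};

/* Use the key's TTL when set; zero means "unset", so use the route's hop limit. */
static uint8_t pick_ttl(const struct tunnel_key *key, uint8_t route_hoplimit)
{
    assert(key != NULL);
    return key->ttl ? key->ttl : route_hoplimit;
}
```

With a key whose ttl is 0 and a route hop limit of 64, pick_ttl() returns 64;
with ttl set to 32 it returns 32.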

[patch net] mlxsw: spectrum: Prevent mirred-related crash on removal

2017-09-12 Thread Jiri Pirko

From: Yuval Mintz 

When removing the offloading of mirred actions under
matchall classifiers, mlxsw would find the destination port
associated with the offloaded action and utilize it for undoing
the configuration.

Depending on the order by which ports are removed, it's possible that
the destination port would get removed before the source port.
In such a scenario, when actions would be flushed for the source port
mlxsw would perform an illegal dereference as the destination port is
no longer listed.

Since the only item necessary for undoing the configuration on the
destination side is the port-id and that in turn is already maintained
by mlxsw on the source-port, simply stop trying to access the
destination port and use the port-id directly instead.

Fixes: 763b4b70af ("mlxsw: spectrum: Add support in matchall mirror TC offloading")
Signed-off-by: Yuval Mintz 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index e080459..696b99e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -575,15 +575,14 @@ static void mlxsw_sp_span_entry_destroy(struct mlxsw_sp *mlxsw_sp,
 }
 
 static struct mlxsw_sp_span_entry *
-mlxsw_sp_span_entry_find(struct mlxsw_sp_port *port)
+mlxsw_sp_span_entry_find(struct mlxsw_sp *mlxsw_sp, u8 local_port)
 {
-   struct mlxsw_sp *mlxsw_sp = port->mlxsw_sp;
int i;
 
for (i = 0; i < mlxsw_sp->span.entries_count; i++) {
struct mlxsw_sp_span_entry *curr = &mlxsw_sp->span.entries[i];
 
-   if (curr->used && curr->local_port == port->local_port)
+   if (curr->used && curr->local_port == local_port)
return curr;
}
return NULL;
@@ -594,7 +593,8 @@ static struct mlxsw_sp_span_entry
 {
struct mlxsw_sp_span_entry *span_entry;
 
-   span_entry = mlxsw_sp_span_entry_find(port);
+   span_entry = mlxsw_sp_span_entry_find(port->mlxsw_sp,
+ port->local_port);
if (span_entry) {
/* Already exists, just take a reference */
span_entry->ref_count++;
@@ -783,12 +783,13 @@ static int mlxsw_sp_span_mirror_add(struct mlxsw_sp_port *from,
 }
 
 static void mlxsw_sp_span_mirror_remove(struct mlxsw_sp_port *from,
-   struct mlxsw_sp_port *to,
+   u8 destination_port,
enum mlxsw_sp_span_type type)
 {
struct mlxsw_sp_span_entry *span_entry;
 
-   span_entry = mlxsw_sp_span_entry_find(to);
+   span_entry = mlxsw_sp_span_entry_find(from->mlxsw_sp,
+ destination_port);
if (!span_entry) {
netdev_err(from->dev, "no span entry found\n");
return;
@@ -1563,14 +1564,12 @@ static void
 mlxsw_sp_port_del_cls_matchall_mirror(struct mlxsw_sp_port *mlxsw_sp_port,
  struct mlxsw_sp_port_mall_mirror_tc_entry *mirror)
 {
-   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
enum mlxsw_sp_span_type span_type;
-   struct mlxsw_sp_port *to_port;
 
-   to_port = mlxsw_sp->ports[mirror->to_local_port];
span_type = mirror->ingress ?
MLXSW_SP_SPAN_INGRESS : MLXSW_SP_SPAN_EGRESS;
-   mlxsw_sp_span_mirror_remove(mlxsw_sp_port, to_port, span_type);
+   mlxsw_sp_span_mirror_remove(mlxsw_sp_port, mirror->to_local_port,
+   span_type);
 }
 
 static int
-- 
2.9.3
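
The idea of looking up the span entry by port number, rather than through a
port object that may already be gone, can be shown with a reduced model. The
types below are hypothetical stand-ins, not the mlxsw structures, and
SPAN_ENTRIES is an arbitrary table size chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPAN_ENTRIES 4   /* hypothetical table size */

struct span_entry {
    bool used;
    uint8_t local_port;
    int ref_count;
};

struct span_table {
    struct span_entry entries[SPAN_ENTRIES];
};

/* Find the in-use entry for a port number; no port object is dereferenced. */
static struct span_entry *span_entry_find(struct span_table *tbl,
                                          uint8_t local_port)
{
    size_t i;

    assert(tbl != NULL);
    for (i = 0; i < SPAN_ENTRIES; i++) {
        struct span_entry *curr = &tbl->entries[i];

        if (curr->used && curr->local_port == local_port)
            return curr;
    }
    return NULL;
}
```

Since the lookup key is a plain integer stored on the source side, the search
works the same whether or not the destination port still exists.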

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-12 Thread Samudrala, Sridhar




On 9/11/2017 8:53 PM, Eric Dumazet wrote:
> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> > Two ints in sock_common for this purpose is quite expensive and the
> > use case for this is limited-- even if a RX->TX queue mapping were
> > introduced to eliminate the queue pair assumption this still won't
> > help if the receive and transmit interfaces are different for the
> > connection. I think we really need to see some very compelling results
> > to be able to justify this.

Will try to collect and post some perf data with symmetric queue
configuration.

> Yes, this is unreasonable cost.
>
> XPS should really cover the case already.

Eric,

Can you clarify how XPS covers the RX->TX queue mapping case?
Is it possible to configure XPS to select the TX queue based on the RX queue
of a flow?
IIUC, it is based on the CPU of the thread doing the transmit OR based
on skb->priority to TC mapping?
It may be possible to get this effect if the threads are pinned to a
core, but if the app threads are freely moving, I am not sure how XPS can
be configured to select the TX queue based on the RX queue of a flow.


Thanks
Sridhar
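
The distinction raised in the question above can be put into a toy model.
This is not kernel code and makes two assumptions: a fixed CPU-to-queue table
stands in for an XPS-style mapping, and RX queue n is paired with TX queue n.
All names are hypothetical.

```c
#include <assert.h>

#define NR_CPUS_TOY 4   /* hypothetical CPU count for the model */

/* Toy CPU -> TX queue table, standing in for a per-CPU transmit mapping. */
static const int cpu_to_txq[NR_CPUS_TOY] = { 0, 1, 2, 3 };

/* CPU-based choice: follows whichever CPU the sending thread runs on now. */
static int txq_by_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_TOY);
    return cpu_to_txq[cpu];
}

/* RX-based choice: reuse the queue the flow was last received on. */
static int txq_by_rxq(int flow_rx_queue)
{
    assert(flow_rx_queue >= 0);
    return flow_rx_queue;
}
```

In this model a flow received on queue 1 keeps TX queue 1 under
txq_by_rxq(), while txq_by_cpu() gives queue 1 or queue 3 depending on
whether the thread is currently on CPU 1 or CPU 3.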

Re: [PATCH] tcp: TCP_USER_TIMEOUT can not work in tcp_probe_timer()

2017-09-12 Thread liujian

Hi,

In this scenario, the TCP server-side IP address changed, and at that moment
the userspace application was still sending data continuously, so the
timestamp of tcp_send_head(sk) is always refreshed.

Here is the packetdrill script:

   0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 0 
   +0 > S. 0:0(0) ack 1 

  +.1 < . 1:1(0) ack 1 win 65530
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
   +0 write(4, ..., 24) = 24
   +0 > P. 1:25(24) ack 1 win 229
   +.1 < . 1:1(0) ack 25 win 65530

// change the IP address
   +1 `ifconfig tun0 192.168.0.10/16`

   +1 write(4, ..., 24) = 24
   +1 write(4, ..., 24) = 24
   +1 write(4, ..., 24) = 24
   +1 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24
   +3 write(4, ..., 24) = 24

   +0 `ifconfig tun0 192.168.0.1/16`
   +0 < . 1:1(0) ack 1 win 1000
   +0 write(4, ..., 24) = -1


[root@localhost ~]# time ./gtests/net/packetdrill/packetdrill test.pkt
test.pkt:50: runtime error in write call: Expected result -1 but got 24 with 
errno 2 (No such file or directory)

real    1m11.364s
user    0m0.028s
sys     0m0.106s

[root@localhost ~]# netstat -toen
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State        User  Inode  Timer
tcp        0    504 192.168.0.1:8080   192.0.2.1:33993    ESTABLISHED  0     45453  probe (22.38/0/7)

Since the script didn't wait long enough, only 7 probes were sent here.

On 2017/9/11 23:22, Eric Dumazet wrote:
> On Mon, 2017-09-11 at 08:13 -0700, Eric Dumazet wrote:
> 
>> You can see we got only 3 probes, not 4.
> 
> Here is complete packetdrill test showing that code behaves as expected.
> 
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>+0 bind(3, ..., ...) = 0
>+0 listen(3, 1) = 0
> 
>+0 < S 0:0(0) win 0 
>+0 > S. 0:0(0) ack 1 
> 
> // Client advertises a zero receive window, so we can't send.
>   +.1 < . 1:1(0) ack 1 win 0
>+0 accept(3, ..., ...) = 4
> 
>+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
>+0 write(4, ..., 2920) = 2920
> 
> // Window probes are scheduled just like RTOs.
>   +.3~+.31 > . 0:0(0) ack 1
>   +.6~+.62 > . 0:0(0) ack 1
>  +1.2~+1.24 > . 0:0(0) ack 1
> 
> // Peer opens its window too late !
>+3 < . 1:1(0) ack 1 win 1000
>+0 > R 1:1(0)
> 
