RE: [next-queue PATCH] igb: Fix the transmission mode of queue 0 for Qav mode

2018-04-13 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Friday, March 30, 2018 5:07 PM
> To: intel-wired-...@lists.osuosl.org
> Cc: Gomes, Vinicius ; Kirsher, Jeffrey T
> ; netdev@vger.kernel.org; Sanchez-Palencia,
> Jesus ; Guedes, Andre
> 
> Subject: [next-queue PATCH] igb: Fix the transmission mode of queue 0 for
> Qav mode
> 
> When Qav mode is enabled, queue 0 should be kept on Stream Reservation
> mode. From the i210 datasheet, section 8.12.19:
> 
> "Note: Queue0 QueueMode must be set to 1b when TransmitMode is set to
> Qav." ("QueueMode 1b" represents the Stream Reservation mode)
> 
> The solution is to give queue 0 the all the credits it might need, so
> it has priority over queue 1.
> 
> A situation where this can happen is when cbs is "installed" only on
> queue 1, leaving queue 0 alone. For example:
> 
> $ tc qdisc replace dev enp2s0 handle 100: parent root mqprio num_tc 3 \
>  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc replace dev enp2s0 parent 100:2 cbs locredit -1470 \
>  hicredit 30 sendslope -98 idleslope 2 offload 1
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 

Tested by: Aaron Brown 


RE: [next-queue PATCH v7 10/10] igb: Add support for adding offloaded clsflower filters

2018-04-13 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: Gomes, Vinicius ; Kirsher, Jeffrey T
> ; netdev@vger.kernel.org; Sanchez-Palencia,
> Jesus 
> Subject: [next-queue PATCH v7 10/10] igb: Add support for adding offloaded
> clsflower filters
> 
> This allows filters added by tc-flower and specifying MAC addresses,
> Ethernet types, and the VLAN priority field, to be offloaded to the
> controller.
> 
> This reuses most of the infrastructure used by ethtool, but clsflower
> filters are kept in a separated list, so they are invisible to
> ethtool.
> 
> To setup clsflower offloading:
> 
> $ tc qdisc replace dev eth0 handle 100: parent root mqprio \
>  num_tc 3 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>  queues 1@0 1@1 2@2 hw 0
> (clsflower offloading depends on the netword driver to be configured
> with multiple traffic classes, we use mqprio's 'num_tc' parameter to
> set it to 3)
> 
> $ tc qdisc add dev eth0 ingress
> 
> Examples of filters:
> 
> $ tc filter add dev eth0 parent : flower \
>   dst_mac aa:aa:aa:aa:aa:aa \
>   hw_tc 2 skip_sw
> (just a simple filter filtering for the destination MAC address and
> steering that traffic to queue 2)
> 
> $ tc filter add dev enp2s0 parent : proto 0x22f0 flower \
>   src_mac cc:cc:cc:cc:cc:cc \
>   hw_tc 1 skip_sw
> (as the i210 doesn't support steering traffic based on the source
> address alone, we need to use another steering traffic, in this case
> we are using the ethernet type (0x22f0) to steer traffic to queue 1)
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb.h  |   2 +
>  drivers/net/ethernet/intel/igb/igb_main.c | 188
> +-
>  2 files changed, 188 insertions(+), 2 deletions(-)

Tested by: Aaron Brown 


RE: [Intel-wired-lan] [next-queue PATCH v7 09/10] igb: Add the skeletons for tc-flower offloading

2018-04-13 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  palen...@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v7 09/10] igb: Add the
> skeletons for tc-flower offloading
> 
> This adds basic functions needed to implement offloading for filters
> created by tc-flower.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 66
> +++
>  1 file changed, 66 insertions(+)
> 

Tested by: Aaron Brown 


RE: [next-queue PATCH v7 08/10] igb: Add MAC address support for ethtool nftuple filters

2018-04-13 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: Gomes, Vinicius ; Kirsher, Jeffrey T
> ; netdev@vger.kernel.org; Sanchez-Palencia,
> Jesus 
> Subject: [next-queue PATCH v7 08/10] igb: Add MAC address support for
> ethtool nftuple filters
> 
> This adds the capability of configuring the queue steering of arriving
> packets based on their source and destination MAC addresses.
> 
> Source address steering (i.e. driving traffic to a specific queue),
> for the i210, does not work, but filtering does (i.e. accepting
> traffic based on the source address). So, trying to add a filter
> specifying only a source address will be an error.
> 
> In practical terms this adds support for the following use cases,
> characterized by these examples:
> 
> $ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
> (this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
> to the RX queue 0)
> 
> $ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 \
> proto 0x22f0 action 3
> (this will direct packets with source address "44:44:44:44:44:44" and
> ethertype 0x22f0 to the RX queue 3)
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 43 ++--
>  1 file changed, 39 insertions(+), 4 deletions(-)
> 

Tested by: Aaron Brown 


RE: [next-queue PATCH v7 07/10] igb: Enable nfc filters to specify MAC addresses

2018-04-13 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: Gomes, Vinicius ; Kirsher, Jeffrey T
> ; netdev@vger.kernel.org; Sanchez-Palencia,
> Jesus 
> Subject: [next-queue PATCH v7 07/10] igb: Enable nfc filters to specify MAC
> addresses
> 
> This allows igb_add_filter()/igb_erase_filter() to work on filters
> that include MAC addresses (both source and destination).
> 
> For now, this only exposes the functionality, the next commit glues
> ethtool into this. Later in this series, these APIs are used to allow
> offloading of cls_flower filters.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb.h |  4 +++
>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 28 
>  2 files changed, 32 insertions(+)

Tested by: Aaron Brown 


RE: [Intel-wired-lan] [next-queue PATCH v7 06/10] igb: Allow filters to be added for the local MAC address

2018-04-13 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  palen...@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v7 06/10] igb: Allow filters to
> be added for the local MAC address
> 
> Users expect that when adding a steering filter for the local MAC
> address, that all the traffic directed to that address will go to some
> queue.
> 
> Currently, it's not possible to configure entries in the "in use"
> state, which is the normal state of the local MAC address entry (it is
> the default), this patch allows to override the steering configuration
> of "in use" entries, if the filter to be added match the address and
> address type (source or destination) of an existing entry.
> 
> There is a bit of a special handling for entries referring to the
> local MAC address, when they are removed, only the steering
> configuration is reset.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 40 ---
>  1 file changed, 36 insertions(+), 4 deletions(-)

Tested by: Aaron Brown 


RE: [Intel-wired-lan] [next-queue PATCH v7 05/10] igb: Add support for enabling queue steering in filters

2018-04-13 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  palen...@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v7 05/10] igb: Add support for
> enabling queue steering in filters
> 
> On some igb models (82575 and i210) the MAC address filters can
> control to which queue the packet will be assigned.
> 
> This extends the 'state' with one more state to signify that queue
> selection should be enabled for that filter.
> 
> As 82575 parts are no longer easily obtained (and this was developed
> against i210), only support for the i210 model is enabled.
> 
> These functions are exported and will be used in the next patch.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  .../net/ethernet/intel/igb/e1000_defines.h|  1 +
>  drivers/net/ethernet/intel/igb/igb.h  |  6 +
>  drivers/net/ethernet/intel/igb/igb_main.c | 26 +++
>  3 files changed, 33 insertions(+)

Tested by: Aaron Brown 


RE: [next-queue PATCH v7 04/10] igb: Add support for MAC address filters specifying source addresses

2018-04-13 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: Gomes, Vinicius ; Kirsher, Jeffrey T
> ; netdev@vger.kernel.org; Sanchez-Palencia,
> Jesus 
> Subject: [next-queue PATCH v7 04/10] igb: Add support for MAC address
> filters specifying source addresses
> 
> Makes it possible to direct packets to queues based on their source
> address. Documents the expected usage of the 'flags' parameter.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  .../net/ethernet/intel/igb/e1000_defines.h|  1 +
>  drivers/net/ethernet/intel/igb/igb.h  |  1 +
>  drivers/net/ethernet/intel/igb/igb_main.c | 40 ---
>  3 files changed, 37 insertions(+), 5 deletions(-)

Tested by: Aaron Brown 


RE: [Intel-wired-lan] [next-queue PATCH v7 03/10] igb: Enable the hardware traffic class feature bit for igb models

2018-04-13 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  palen...@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v7 03/10] igb: Enable the
> hardware traffic class feature bit for igb models
> 
> This will allow functionality depending on the hardware being traffic
> class aware to work. In particular the tc-flower offloading checks
> verifies that this bit is set.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 3 +++
>  1 file changed, 3 insertions(+)

Tested by: Aaron Brown 


RE: [Intel-wired-lan] [next-queue PATCH v7 02/10] igb: Fix queue selection on MAC filters on i210

2018-04-13 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  palen...@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v7 02/10] igb: Fix queue
> selection on MAC filters on i210
> 
> On the RAH registers there are semantic differences on the meaning of
> the "queue" parameter for traffic steering depending on the controller
> model: there is the 82575 meaning, which "queue" means a RX Hardware
> Queue, and the i350 meaning, where it is a reception pool.
> 
> The previous behaviour was having no effect for i210 based controllers
> because the QSEL bit of the RAH register wasn't being set.
> 
> This patch separates the condition in discrete cases, so the different
> handling is clearer.
> 
> Fixes: 83c21335c876 ("igb: improve MAC filter handling")
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 

Tested by: Aaron Brown 


RE: [next-queue PATCH v7 01/10] igb: Fix not adding filter elements to the list

2018-04-13 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Tuesday, April 10, 2018 10:50 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: Gomes, Vinicius ; Kirsher, Jeffrey T
> ; netdev@vger.kernel.org; Sanchez-Palencia,
> Jesus 
> Subject: [next-queue PATCH v7 01/10] igb: Fix not adding filter elements to
> the list
> 
> Because the order of the parameters passes to 'hlist_add_behind()' was
> inverted, the 'parent' node was added "behind" the 'input', as input
> is not in the list, this causes the 'input' node to be lost.
> 
> Fixes: 0e71def25281 ("igb: add support of RX network flow classification")
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Tested by: Aaron Brown 


Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet
Eric Dumazet wrote on Fri, Apr 13, 2018:
> Ah sorry, your trace was truncated, we need more packets _before_ the excerpt.

Ah, sorry as well. I tried to go far back enough to include the replayed
packets, but I see it didn't include the original ack for that packet.
I'm resending both full traces from connect to fin this time.

> That might be caused by some TS val/ecr breakage :
> 
> Many acks were received by the server tcpdump,
> but none of them was accepted by TCP stack, for some reason.
> 
> Try to disable TCP timestamps, it will give some hint if bug does not 
> reproduce.

I unfortunately cannot reproduce on a local network, but will give this
a try on Monday and report. (sysctl net.ipv4.tcp_timestamps=0, correct?)


Here's the traces again, server-capture first:
16:49:22.374452 IP .13317 > .31872: Flags 
[S], seq 2026966826, win 29200, options [mss 1460,sackOK,TS val 1313933281 ecr 
0,nop,wscale 7], length 0
16:49:22.700553 IP .31872 > .13317: Flags 
[S], seq 882075258, win 29200, options [mss 1386,sackOK,TS val 1617125446 ecr 
0,nop,wscale 7], length 0
16:49:22.700577 IP .13317 > .31872: Flags 
[S.], seq 2026966826, ack 882075259, win 29200, options [mss 1460,sackOK,TS val 
1313933607 ecr 1617125446,nop,wscale 7], length 0
16:49:22.727131 IP .31872 > .13317: Flags 
[.], ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 1313933607], length 0
16:49:22.729193 IP .31872 > .13317: Flags 
[P.], seq 1:22, ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 
1313933607], length 21
16:49:22.729230 IP .13317 > .31872: Flags 
[.], ack 22, win 229, options [nop,nop,TS val 1313933635 ecr 1617125472], 
length 0
16:49:22.732726 IP .13317 > .31872: Flags 
[P.], seq 1:22, ack 22, win 229, options [nop,nop,TS val 1313933639 ecr 
1617125472], length 21
16:49:22.759809 IP .31872 > .13317: Flags 
[.], ack 22, win 229, options [nop,nop,TS val 1617125503 ecr 1313933639], 
length 0
16:49:22.759835 IP .13317 > .31872: Flags 
[P.], seq 22:734, ack 22, win 229, options [nop,nop,TS val 1313933666 ecr 
1617125503], length 712
16:49:22.761643 IP .31872 > .13317: Flags 
[P.], seq 22:1270, ack 22, win 229, options [nop,nop,TS val 1617125504 ecr 
1313933639], length 1248
16:49:22.801860 IP .13317 > .31872: Flags 
[.], ack 1270, win 248, options [nop,nop,TS val 1313933708 ecr 1617125504], 
length 0
16:49:22.826902 IP .31872 > .13317: Flags 
[.], ack 734, win 240, options [nop,nop,TS val 1617125572 ecr 1313933666], 
length 0
16:49:22.827576 IP .31872 > .13317: Flags 
[P.], seq 1270:1318, ack 734, win 240, options [nop,nop,TS val 1617125573 ecr 
1313933708], length 48
16:49:22.827600 IP .13317 > .31872: Flags 
[.], ack 1318, win 248, options [nop,nop,TS val 1313933734 ecr 1617125573], 
length 0
16:49:22.833028 IP .13317 > .31872: Flags 
[P.], seq 734:1122, ack 1318, win 248, options [nop,nop,TS val 1313933739 ecr 
1617125573], length 388
16:49:22.858455 IP .31872 > .13317: Flags 
[.], ack 1122, win 251, options [nop,nop,TS val 1617125604 ecr 1313933739], 
length 0
16:49:22.865866 IP .31872 > .13317: Flags 
[P.], seq 1318:1334, ack 1122, win 251, options [nop,nop,TS val 1617125612 ecr 
1313933739], length 16
16:49:22.906865 IP .13317 > .31872: Flags 
[.], ack 1334, win 248, options [nop,nop,TS val 1313933813 ecr 1617125612], 
length 0
16:49:22.944474 IP .31872 > .13317: Flags 
[P.], seq 1334:1386, ack 1122, win 251, options [nop,nop,TS val 1617125678 ecr 
1313933813], length 52
16:49:22.944497 IP .13317 > .31872: Flags 
[.], ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 1617125678], 
length 0
16:49:22.944747 IP .13317 > .31872: Flags 
[P.], seq 1122:1174, ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 
1617125678], length 52
16:49:22.971083 IP .31872 > .13317: Flags 
[P.], seq 1386:1454, ack 1174, win 251, options [nop,nop,TS val 1617125716 ecr 
1313933851], length 68
16:49:22.971607 IP .13317 > .31872: Flags 
[P.], seq 1174:1258, ack 1454, win 248, options [nop,nop,TS val 1313933878 ecr 
1617125716], length 84
16:49:22.998201 IP .31872 > .13317: Flags 
[P.], seq 1454:2082, ack 1258, win 251, options [nop,nop,TS val 1617125742 ecr 
1313933878], length 628
16:49:22.998987 IP .13317 > .31872: Flags 
[P.], seq 1258:1838, ack 2082, win 268, options [nop,nop,TS val 1313933905 ecr 
1617125742], length 580
16:49:23.065102 IP .31872 > .13317: Flags 
[.], ack 1838, win 262, options [nop,nop,TS val 1617125810 ecr 1313933905], 
length 0
16:49:24.811297 IP .31872 > .13317: Flags 
[P.], seq 2082:3254, ack 1838, win 262, options [nop,nop,TS val 1617127527 ecr 
1313933905], length 1172
16:49:24.812562 IP .13317 > .31872: Flags 
[P.], seq 1838:1874, ack 3254, win 287, options [nop,nop,TS val 1313935719 ecr 
1617127527], length 36
16:49:24.838210 IP .31872 > .13317: Flags 
[.], ack 1874, win 262, options [nop,nop,TS val 1617127583 ecr 1313935719], 
length 0
16:49:24.838236 IP .13317 > .31872: Flags 
[P.], seq 1874:2534, ack 3254, win 287, options [nop,nop,TS val 1313935744 ecr 
1617127583], length 660
16:49:24.839175 

Re: tcp hang when socket fills up ?

2018-04-13 Thread Eric Dumazet


On 04/13/2018 06:09 PM, Dominique Martinet wrote:
> Thank you for the replies,
> 
> Eric Dumazet wrote on Fri, Apr 13, 2018:
>> There is no way a regular TCP stack (including linux) could send the 
>> following frame.
>>
>>> 16:49:27.048760 IP .13317 > .31872: 
>>> Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 
>>> 1313937955 ecr 1617129473], length 1374
>>
>> So something is mangling the packet, maybe NAT or something.
> 
> 
> The pcap was produced on the server which emitted the frame, so it
> should be exactly as the server intended it to be without any mangling
> involved?
> If you could point at what strikes you as odd I can check if the same
> happened on other hang traces I might still have. Is it just that it
> replays a very old seq?
> (if it's odd wrt client packets, the same trace as captured on client is
> at the end of this mail)
> 

Ah sorry, your trace was truncated, we need more packets _before_ the excerpt.

That might be caused by some TS val/ecr breakage :

Many acks were received by the server tcpdump,
but none of them was accepted by TCP stack, for some reason.

Try to disable TCP timestamps, it will give some hint if bug does not reproduce.







Re: Crash due to NULL dereference in tcp_rearm_rto

2018-04-13 Thread Eric Dumazet


On 04/13/2018 05:39 PM, Subash Abhinov Kasiviswanathan wrote:
> We are seeing a warning followed by a crash on an ARM64 device with
> Android 4.14 based kernel.
> 
> It looks like both sk->sk_write_queue and sk->sk_send_head are NULL.
> Since the sk->sk_write_queue is NULL and is dereferenced in tcp_rto_delta_us()
> to get the skb->skb_mstamp, there is crash observed.
> 
> Since this is 4.14.32, it already has commit ("tcp: reset sk_send_head in 
> tcp_write_queue_purge")
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.14.34=dbbf2d1e4077bab0c65ece2765d3fc69cf7d610f
> 
> 12876.013077:   <6> WARNING: CPU: 5 PID: 14828 at net/ipv4/tcp_output.c:2469 
> tcp_send_loss_probe+0x198/0x1b8
> 12876.038939:   <6> task: ffe73f7e5a80 task.stack: ff801b068000
> 12876.038941:   <2> PC is at tcp_send_loss_probe+0x198/0x1b8
> 12876.038942:   <2> LR is at tcp_send_loss_probe+0x28/0x1b8
> 12876.038944:   <2> pc : [] lr : [] 
> pstate: 60400145
> 12876.038944:   <2> sp : ff800802bd30
> 12876.038945:   <2> x29: ff800802bd50 x28: ff8dc9d83eb8
> 12876.038948:   <2> x27: ff800802be08 x26: ff8dc9737000
> 12876.038950:   <2> x25: 0001 x24: ffe744ea1728
> 12876.038952:   <2> x23: ffe73f7e5a80 x22: 0558
> 12876.038954:   <2> x21:  x20: 01080020
> 12876.038956:   <2> x19: ffe73d06e440 x18: 0020
> 12876.038958:   <2> x17: 0014 x16: 0030
> 12876.038960:   <2> x15:  x14: 
> 12876.038962:   <2> x13: 13af314c x12: 002773f8a550
> 12876.038965:   <2> x11: 0538 x10: 
> 12876.038967:   <2> x9 : 0020 x8 : ffe73d06e5f0
> 12876.038969:   <2> x7 : 00924278 x6 : ffe76fe9ed80
> 12876.038971:   <2> x5 : ffe76fe9ed80 x4 : 0f500458
> 12876.038973:   <2> x3 : ff800802bce0 x2 : ff800802bce8
> 12876.038975:   <2> x1 :  x0 : 0558
> 12876.039082:   <2> [] tcp_send_loss_probe+0x198/0x1b8
> 12876.039084:   <2> [] tcp_write_timer_handler+0xf8/0x1c4
> 12876.039086:   <2> [] tcp_write_timer+0x5c/0x98
> 12876.039089:   <2> [] call_timer_fn+0xc0/0x1b4
> 12876.039091:   <2> [] run_timer_softirq+0x230/0x850
> 12876.039094:   <2> [] __do_softirq+0x1dc/0x3a4
> 12876.039096:   <2> [] irq_exit+0xc8/0xd4
> 12876.039098:   <2> [] __handle_domain_irq+0x8c/0xc4
> 12876.039099:   <2> [] gic_handle_irq+0x164/0x1bc
> 
> [net/ipv4/tcp_output.c]
> void tcp_send_loss_probe(struct sock *sk)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> struct sk_buff *skb;
> int pcount;
> int mss = tcp_current_mss(sk);
> 
> ...
> 
> /* Retransmit last segment. */
> if (WARN_ON(!skb))
>     goto rearm_timer;
> 
> 12876.043967:   <6> Unable to handle kernel NULL pointer dereference at 
> virtual address 0010
> 12876.091600:   <6> Internal error: Oops: 9605 [#1] PREEMPT SMP
> 12876.152597:   <2> PC is at tcp_rearm_rto+0x48/0x90
> 12876.156979:   <2> LR is at tcp_send_loss_probe+0x178/0x1b8
> 12876.162077:   <2> pc : [] lr : [] 
> pstate: 60400145
> 12876.169657:   <2> sp : ff800802bd10
> 12876.173056:   <2> x29: ff800802bd20 x28: ff8dc9d83eb8
> 12876.178511:   <2> x27: ff800802be08 x26: ff8dc9737000
> 12876.183967:   <2> x25: 0001 x24: ffe744ea1728
> 12876.189418:   <2> x23: ffe73f7e5a80 x22: 0558
> 12876.194863:   <2> x21:  x20: 01080020
> 12876.200312:   <2> x19: ffe73d06e440 x18: 0020
> 12876.205758:   <2> x17: 0014 x16: 0030
> 12876.211212:   <2> x15:  x14: 
> 12876.216660:   <2> x13: 13af314c x12: 002773f8a550
> 12876.222108:   <2> x11: 0538 x10: 
> 12876.227561:   <2> x9 :  x8 : ffe73d06e5f0
> 12876.233008:   <2> x7 : 00924278 x6 : ffe76fe9ed80
> 12876.238455:   <2> x5 : ffe76fe9ed80 x4 : 0f500458
> 12876.243907:   <2> x3 : ff800802bce0 x2 : ff800802bce8
> 12876.249360:   <2> x1 :  x0 : 0867
> 12876.473522:   <2> [] tcp_rearm_rto+0x48/0x90
> 12876.478971:   <2> [] tcp_send_loss_probe+0x178/0x1b8
> 12876.485131:   <2> [] tcp_write_timer_handler+0xf8/0x1c4
> 12876.491557:   <2> [] tcp_write_timer+0x5c/0x98
> 12876.497189:   <2> [] call_timer_fn+0xc0/0x1b4
> 12876.502731:   <2> [] run_timer_softirq+0x230/0x850
> 12876.508716:   <2> [] __do_softirq+0x1dc/0x3a4
> 12876.514260:   <2> [] irq_exit+0xc8/0xd4
> 12876.519261:   <2> [] __handle_domain_irq+0x8c/0xc4
> 12876.525245:   <2> [] gic_handle_irq+0x164/0x1bc
> 
> [net/ipv4/tcp_input.c]
> void tcp_rearm_rto(struct sock *sk)
> {
> ...
>     inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
> } else {
>     u32 rto = inet_csk(sk)->icsk_rto;
>     /* Offset the time elapsed after installing regular RTO */
>     if 

Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet
Thank you for the replies,

Eric Dumazet wrote on Fri, Apr 13, 2018:
> There is no way a regular TCP stack (including linux) could send the 
> following frame.
> 
> > 16:49:27.048760 IP .13317 > .31872: 
> > Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 
> > 1313937955 ecr 1617129473], length 1374
> 
> So something is mangling the packet, maybe NAT or something.


The pcap was produced on the server which emitted the frame, so it
should be exactly as the server intended it to be without any mangling
involved?
If you could point at what strikes you as odd I can check if the same
happened on other hang traces I might still have. Is it just that it
replays a very old seq?
(if it's odd wrt client packets, the same trace as captured on client is
at the end of this mail)



Michal Kubecek wrote on Fri, Apr 13, 2018:
> The way I read this, server doesn't see anything sent by client since
> some point shortly before the dump shown here starts (about 5ms). It
> keeps sending data until 16:49:26.807754 (seq 77346) and then keeps
> resending first (from its point of view) unacknowledged segment
> (32004:33378) in exponentially growing intervals and ignores replies
> from the client. Client apparently receives these retransmits and
> replies with dupack (with D-SACK for 32004:33378) and retransmits of its
> own first unacknowledged segment (4190:4226).

I have the same understanding.

> As we can see the client packets in the dump (which was taken on
> server), it would mean they are dropped after the point where packet
> socket would pass them to libpcap. That might be e.g. netfilter
> (conntrack?) or the IP/TCP code detecting them to be invalid for some
> reason (which is not obvious to me from the dump above).

In my second mail, I got as far as `tcp_ack`, where the replays didn't
get considered for ack because it was seen as having the same ack number
as the lastest seen. I'm not sure why it didn't get considered the first
time, I need to add more instrumentation.

> There are two strange points:
> 
> 1. While client apparently responds to all server retransmits, it does
> so with TSecr=1313937714 (matching server packet from 16:49:26.807754)
> rather than TSval of the packets it dupacks (1313937955 through
> 1313953050). This doesn't seem to follow the rules of RFC 7323
> Section 4.3.

I have a pcap of the same sequence on the client as well, and it does
get the packets with higher val= unchanged, so that might be something
to look at.
I'll post a similarily formated tcpdump output at the end of this mail,
the timestamps and verious numbers in the traces match with my original
mail.

> 2. Window size values in acks from client grow with each acked packet by
> 22-23 (which might be ~1400 with scaling factor of 64). I would rather
> expect advertised receive window to go down by 1374 with each received
> segment and to grow by bigger steps with each read()/recv() call from
> application.

I have a very poor understanding of this part of the protocol/code, but
I think the window grows when a socket starts receiving a bulk of data.
Given this hangs very fast, there were only individual small packets up
till this trace where data comes in quickly, and I think this might just
be window scaling?
I might very well be wrong though..


> We might get more insight if we saw the same connection on both sides.
> From what was presented here, my guess is that
> 
>   (1) received packets are dropped somewhere on server side (after they
>   are cloned for the packet socket)

I'd agree for received replays, but I don't get why the first time that
ack was received didn't work.

>   (2) there is something wrong either on client side or between the two
>   hosts (there is at least a NAT, IIUC)

My minimal reproducer involves the internet, so there is no telling what
is in the middle, but there is at least two NAT, and the connection is
established through two connect() calls with carefully crafted
source/destination ports if that makes a difference.
I've done a bit more fiddling with netem/tbf and a single nat but I
still can't reproduce it on a local network; I need to spend more time
trying as this is very frustrating because I don't control any of the
two NATs in the real case.

I don't think the NATs are culprits, though - looking at the traces I
don't see any reordering within a direction, missing packets or fiddling
with values so this seems fairly straightforward.

Here's the client-side traces. pcap files still available on request, if
you prefer to look at it in wireshark or another tool.

16:49:26.726793 IP .13317 > .31872: Flags 
[.], seq 40248:41622, ack 4190, win 307, options [nop,nop,TS val 1313937609 ecr 
1617129440], length 1374
16:49:26.726798 IP .31872 > .13317: Flags 
[.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
length 0
16:49:26.726802 IP .13317 > .31872: Flags 
[.], seq 41622:42996, ack 4190, win 307, options [nop,nop,TS val 1313937609 ecr 

Re: Creating FOU tunnels to the same destination IP but different port

2018-04-13 Thread Tom Herbert
On Fri, Apr 13, 2018 at 9:57 AM, Kostas Peletidis  wrote:
> Hello,
>
> I am having trouble with a particular case of setting up a fou tunnel
> and I would really appreciate your help.
>
> I have a remote multihomed host behind a NAT box and I want to create
> a fou tunnel for each of its IP addresses, from my machine.
>
> A typical case would be something like that (output from the local machine):
>
> # ip tun
> ipudp09602: ip/ip remote 135.196.22.100 local 172.31.0.140 ttl 225
> ipudp00101: ip/ip remote 148.252.129.30 local 172.31.0.140 ttl 225
> ipudp09604: ip/ip remote 77.247.11.249 local 172.31.0.140 ttl 225
> tunl0: any/ip remote any local any ttl inherit nopmtudisc
> ipudp00102: ip/ip remote 213.205.194.18 local 172.31.0.140 ttl 225
>
> However, if the remote end has the same IP address with the remote end
> of an existing tunnel (but a different remote port)
> tunnel creation fails. In this example there is already a tunnel to
> 135.196.22.100:32270 and I wanted to create a new tunnel
> to 135.196.22.100:24822 as below:
>
> # ip link add name ipudp09603 mtu 1356 type ipip \
>   remote 135.196.22.100 \
>   local 172.31.0.140 \
>   ttl 225 \
>   encap fou \
>  encap-sport 4500 \
>  encap-dport 24822
>
> RTNETLINK answers: File exists
>
> The remote IP addresses in this case are identical because there is a
> NAT box in the way, but the port numbers are different. The source
> address and port are the same in all cases.
>
> I noticed that ip_tunnel_find() does not check port numbers - being IP
> and all - so I am thinking that a not-so-elegant way to do it is to
> get the port numbers from the netlink request and have
> ip_tunnel_find() compare them against encap.{sport, dport} of existing
> tunnels.
>
> Is there a better way to create a second fou tunnel to the same IP
> address but a different port? Use of keys as unique tunnel IDs maybe?
> Any feedback is appreciated. Thank you.
>
Hi Kostas,

This is an interesting problem, thanks for reporting it! FOU in this
case is being used as modified ipip tunnel so the check of uniqueness
is only based on local and remote addresses for an IP tunnel. As you
point out, the port information does provide more specific information
that could be be used to distinguish between the tunnels (especially
on receive). Using the information is tricky since the FOU and ipip
layers are pretty much independent. The keys approach might be
possible. I'll try to take a closer look.

Tom

>
> Regards,
> Kostas


Crash due to NULL dereference in tcp_rearm_rto

2018-04-13 Thread Subash Abhinov Kasiviswanathan

We are seeing a warning followed by a crash on an ARM64 device with
Android 4.14 based kernel.

It looks like both sk->sk_write_queue and sk->sk_send_head are NULL.
Since the sk->sk_write_queue is NULL and is dereferenced in 
tcp_rto_delta_us()

to get the skb->skb_mstamp, there is crash observed.

Since this is 4.14.32, it already has commit ("tcp: reset sk_send_head 
in tcp_write_queue_purge")

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.14.34=dbbf2d1e4077bab0c65ece2765d3fc69cf7d610f

12876.013077:   <6> WARNING: CPU: 5 PID: 14828 at 
net/ipv4/tcp_output.c:2469 tcp_send_loss_probe+0x198/0x1b8

12876.038939:   <6> task: ffe73f7e5a80 task.stack: ff801b068000
12876.038941:   <2> PC is at tcp_send_loss_probe+0x198/0x1b8
12876.038942:   <2> LR is at tcp_send_loss_probe+0x28/0x1b8
12876.038944:   <2> pc : [] lr : [] 
pstate: 60400145

12876.038944:   <2> sp : ff800802bd30
12876.038945:   <2> x29: ff800802bd50 x28: ff8dc9d83eb8
12876.038948:   <2> x27: ff800802be08 x26: ff8dc9737000
12876.038950:   <2> x25: 0001 x24: ffe744ea1728
12876.038952:   <2> x23: ffe73f7e5a80 x22: 0558
12876.038954:   <2> x21:  x20: 01080020
12876.038956:   <2> x19: ffe73d06e440 x18: 0020
12876.038958:   <2> x17: 0014 x16: 0030
12876.038960:   <2> x15:  x14: 
12876.038962:   <2> x13: 13af314c x12: 002773f8a550
12876.038965:   <2> x11: 0538 x10: 
12876.038967:   <2> x9 : 0020 x8 : ffe73d06e5f0
12876.038969:   <2> x7 : 00924278 x6 : ffe76fe9ed80
12876.038971:   <2> x5 : ffe76fe9ed80 x4 : 0f500458
12876.038973:   <2> x3 : ff800802bce0 x2 : ff800802bce8
12876.038975:   <2> x1 :  x0 : 0558
12876.039082:   <2> [] tcp_send_loss_probe+0x198/0x1b8
12876.039084:   <2> [] 
tcp_write_timer_handler+0xf8/0x1c4

12876.039086:   <2> [] tcp_write_timer+0x5c/0x98
12876.039089:   <2> [] call_timer_fn+0xc0/0x1b4
12876.039091:   <2> [] run_timer_softirq+0x230/0x850
12876.039094:   <2> [] __do_softirq+0x1dc/0x3a4
12876.039096:   <2> [] irq_exit+0xc8/0xd4
12876.039098:   <2> [] __handle_domain_irq+0x8c/0xc4
12876.039099:   <2> [] gic_handle_irq+0x164/0x1bc

[net/ipv4/tcp_output.c]
void tcp_send_loss_probe(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
int pcount;
int mss = tcp_current_mss(sk);

...

/* Retransmit last segment. */
if (WARN_ON(!skb))
goto rearm_timer;

12876.043967:   <6> Unable to handle kernel NULL pointer dereference at 
virtual address 0010

12876.091600:   <6> Internal error: Oops: 9605 [#1] PREEMPT SMP
12876.152597:   <2> PC is at tcp_rearm_rto+0x48/0x90
12876.156979:   <2> LR is at tcp_send_loss_probe+0x178/0x1b8
12876.162077:   <2> pc : [] lr : [] 
pstate: 60400145

12876.169657:   <2> sp : ff800802bd10
12876.173056:   <2> x29: ff800802bd20 x28: ff8dc9d83eb8
12876.178511:   <2> x27: ff800802be08 x26: ff8dc9737000
12876.183967:   <2> x25: 0001 x24: ffe744ea1728
12876.189418:   <2> x23: ffe73f7e5a80 x22: 0558
12876.194863:   <2> x21:  x20: 01080020
12876.200312:   <2> x19: ffe73d06e440 x18: 0020
12876.205758:   <2> x17: 0014 x16: 0030
12876.211212:   <2> x15:  x14: 
12876.216660:   <2> x13: 13af314c x12: 002773f8a550
12876.222108:   <2> x11: 0538 x10: 
12876.227561:   <2> x9 :  x8 : ffe73d06e5f0
12876.233008:   <2> x7 : 00924278 x6 : ffe76fe9ed80
12876.238455:   <2> x5 : ffe76fe9ed80 x4 : 0f500458
12876.243907:   <2> x3 : ff800802bce0 x2 : ff800802bce8
12876.249360:   <2> x1 :  x0 : 0867
12876.473522:   <2> [] tcp_rearm_rto+0x48/0x90
12876.478971:   <2> [] tcp_send_loss_probe+0x178/0x1b8
12876.485131:   <2> [] 
tcp_write_timer_handler+0xf8/0x1c4

12876.491557:   <2> [] tcp_write_timer+0x5c/0x98
12876.497189:   <2> [] call_timer_fn+0xc0/0x1b4
12876.502731:   <2> [] run_timer_softirq+0x230/0x850
12876.508716:   <2> [] __do_softirq+0x1dc/0x3a4
12876.514260:   <2> [] irq_exit+0xc8/0xd4
12876.519261:   <2> [] __handle_domain_irq+0x8c/0xc4
12876.525245:   <2> [] gic_handle_irq+0x164/0x1bc

[net/ipv4/tcp_input.c]
void tcp_rearm_rto(struct sock *sk)
{
...
inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
} else {
u32 rto = inet_csk(sk)->icsk_rto;
/* Offset the time elapsed after installing regular RTO */
if (icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
s64 delta_us = tcp_rto_delta_us(sk);
/* 

RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-13 Thread Long Li
> Subject: RE: [Resend Patch 3/3] Storvsc: Select channel based on available
> percentage of ring buffer to write
> 
> > -Original Message-
> > From: linux-kernel-ow...@vger.kernel.org
> >  On Behalf Of Long Li
> > Sent: Tuesday, March 27, 2018 5:49 PM
> > To: KY Srinivasan ; Haiyang Zhang
> > ; Stephen Hemminger
> ;
> > James E . J . Bottomley ; Martin K . Petersen
> > ; de...@linuxdriverproject.org; linux-
> > s...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > netdev@vger.kernel.org
> > Cc: Long Li 
> > Subject: [Resend Patch 3/3] Storvsc: Select channel based on available
> > percentage of ring buffer to write
> >
> > From: Long Li 
> >
> > This is a best effort for estimating on how busy the ring buffer is
> > for that channel, based on available buffer to write in percentage. It
> > is still possible that at the time of actual ring buffer write, the
> > space may not be available due to other processes may be writing at the
> time.
> >
> > Selecting a channel based on how full it is can reduce the possibility
> > that a ring buffer write will fail, and avoid the situation a channel
> > is over busy.
> >
> > Now it's possible that storvsc can use a smaller ring buffer size
> > (e.g. 40k bytes) to take advantage of cache locality.
> >
> > Signed-off-by: Long Li 
> > ---
> >  drivers/scsi/storvsc_drv.c | 62
> > +-
> >  1 file changed, 50 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index a2ec0bc9e9fa..b1a87072b3ab 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size,
> "Ring
> > buffer size (bytes)");
> >
> >  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
> > MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs
> to
> > subchannels");
> > +
> > +static int ring_avail_percent_lowater = 10;
> 
> Reserving 10% of each ring buffer by default seems like more than is needed
> in the storvsc driver.  That would be about 4Kbytes for the 40K ring buffer
> you suggest, and even more for a ring buffer of 128K.  Each outgoing record is
> only about 344 bytes (I'd have to check exactly).  With the new channel
> selection algorithm below, the only time we use a channel that is already
> below the low water mark is when no channel could be found that is above
> the low water mark.   There could be a case of two or more threads deciding
> that a channel is above the low water mark at the same time and both
> choosing it, but that's likely to be rare.  So it seems like we could set the

It's not rare for two processes checking on the same channel at the same time, 
when running multiple processes I/O workload. The CPU to channel is not 1:1 
mapping.

> default low water mark to 5 percent or even 3 percent, which will let more of
> the ring buffer be used, and let a channel be assigned according to the
> algorithm, rather than falling through to the default because all channels
> appear to be "full".

It seems it's not about how big ring buffer is, e.g. even you have a ring 
buffer of infinite size, it won't help with performance if it's getting queued 
all the time, while other ring buffers are near empty. It's more about how 
multiple ring buffers are getting utilized in a reasonable and balanced way. 
Testing shows 10 is a good choice, while 3 is prone to return BUSY and trigger 
block layer retry.

> 
> > +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> > +MODULE_PARM_DESC(ring_avail_percent_lowater,
> > +   "Select a channel if available ring size > this in percent");
> > +
> >  /*
> >   * Timeout in seconds for all devices managed by this driver.
> >   */
> > @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device
> > *device,  {
> > struct storvsc_device *stor_device;
> > struct vstor_packet *vstor_packet;
> > -   struct vmbus_channel *outgoing_channel;
> > +   struct vmbus_channel *outgoing_channel, *channel;
> > int ret = 0;
> > -   struct cpumask alloced_mask;
> > +   struct cpumask alloced_mask, other_numa_mask;
> > int tgt_cpu;
> >
> > vstor_packet = >vstor_packet; @@ -1301,22 +1307,53 @@
> > static int storvsc_do_io(struct hv_device *device,
> > /*
> >  * Select an an appropriate channel to send the request out.
> >  */
> > -
> > if (stor_device->stor_chns[q_num] != NULL) {
> > outgoing_channel = stor_device->stor_chns[q_num];
> > -   if (outgoing_channel->target_cpu == smp_processor_id()) {
> > +   if (outgoing_channel->target_cpu == q_num) {
> > /*
> >  * Ideally, we want to pick a different channel if
> >  

RE: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring buffer percentage

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-ker...@vger.kernel.org; netdev@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring 
> buffer
> percentage
> 
> From: Long Li 
> 
> In Vmbus, we have defined a function to calculate available ring buffer
> percentage to write.
> 
> Use that function and remove netvsc's private version.
> 
> Signed-off-by: Long Li 

Reviewed-by:  Michael Kelley 

> ---
>  drivers/net/hyperv/hyperv_net.h |  1 -
>  drivers/net/hyperv/netvsc.c | 17 +++--
>  drivers/net/hyperv/netvsc_drv.c |  3 ---
>  3 files changed, 3 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index cd538d5a7986..a0199ab13d67 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -189,7 +189,6 @@ struct netvsc_device;
>  struct net_device_context;
> 
>  extern u32 netvsc_ring_bytes;
> -extern struct reciprocal_value netvsc_ring_reciprocal;
> 
>  struct netvsc_device *netvsc_device_add(struct hv_device *device,
>   const struct netvsc_device_info *info);
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 0265d703eb03..8af0069e4d8c 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -31,7 +31,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
> 
> @@ -590,17 +589,6 @@ void netvsc_device_remove(struct hv_device *device)
>  #define RING_AVAIL_PERCENT_HIWATER 20
>  #define RING_AVAIL_PERCENT_LOWATER 10
> 
> -/*
> - * Get the percentage of available bytes to write in the ring.
> - * The return value is in range from 0 to 100.
> - */
> -static u32 hv_ringbuf_avail_percent(const struct hv_ring_buffer_info 
> *ring_info)
> -{
> - u32 avail_write = hv_get_bytes_to_write(ring_info);
> -
> - return reciprocal_divide(avail_write  * 100, netvsc_ring_reciprocal);
> -}
> -
>  static inline void netvsc_free_send_slot(struct netvsc_device *net_device,
>u32 index)
>  {
> @@ -649,7 +637,8 @@ static void netvsc_send_tx_complete(struct netvsc_device
> *net_device,
>   wake_up(_device->wait_drain);
> 
>   if (netif_tx_queue_stopped(netdev_get_tx_queue(ndev, q_idx)) &&
> - (hv_ringbuf_avail_percent(>outbound) >
> RING_AVAIL_PERCENT_HIWATER ||
> + (hv_get_avail_to_write_percent(>outbound) >
> +  RING_AVAIL_PERCENT_HIWATER ||
>queue_sends < 1)) {
>   netif_tx_wake_queue(netdev_get_tx_queue(ndev, q_idx));
>   ndev_ctx->eth_stats.wake_queue++;
> @@ -757,7 +746,7 @@ static inline int netvsc_send_pkt(
>   struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
>   u64 req_id;
>   int ret;
> - u32 ring_avail = hv_ringbuf_avail_percent(_channel->outbound);
> + u32 ring_avail = hv_get_avail_to_write_percent(_channel->outbound);
> 
>   nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
>   if (skb)
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index faea0be18924..b0b1c2fd2b7b 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -35,7 +35,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
>  #include 
> @@ -55,7 +54,6 @@ static unsigned int ring_size __ro_after_init = 128;
>  module_param(ring_size, uint, S_IRUGO);
>  MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)");
>  unsigned int netvsc_ring_bytes __ro_after_init;
> -struct reciprocal_value netvsc_ring_reciprocal __ro_after_init;
> 
>  static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE |
>   NETIF_MSG_LINK | NETIF_MSG_IFUP |
> @@ -2186,7 +2184,6 @@ static int __init netvsc_drv_init(void)
>   ring_size);
>   }
>   netvsc_ring_bytes = ring_size * PAGE_SIZE;
> - netvsc_ring_reciprocal = reciprocal_value(netvsc_ring_bytes);
> 
>   ret = vmbus_driver_register(_drv);
>   if (ret)
> --
> 2.14.1



RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-ker...@vger.kernel.org; netdev@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 3/3] Storvsc: Select channel based on available 
> percentage of ring
> buffer to write
> 
> From: Long Li 
> 
> This is a best effort for estimating on how busy the ring buffer is for
> that channel, based on available buffer to write in percentage. It is still
> possible that at the time of actual ring buffer write, the space may not be
> available due to other processes may be writing at the time.
> 
> Selecting a channel based on how full it is can reduce the possibility that
> a ring buffer write will fail, and avoid the situation a channel is over
> busy.
> 
> Now it's possible that storvsc can use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
> 
> Signed-off-by: Long Li 
> ---
>  drivers/scsi/storvsc_drv.c | 62 
> +-
>  1 file changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..b1a87072b3ab 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer 
> size
> (bytes)");
> 
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to 
> subchannels");
> +
> +static int ring_avail_percent_lowater = 10;

Reserving 10% of each ring buffer by default seems like more than is needed
in the storvsc driver.  That would be about 4Kbytes for the 40K ring buffer
you suggest, and even more for a ring buffer of 128K.  Each outgoing record
is only about 344 bytes (I'd have to check exactly).  With the new channel
selection algorithm below, the only time we use a channel that is already
below the low water mark is when no channel could be found that is above
the low water mark.   There could be a case of two or more threads deciding
that a channel is above the low water mark at the same time and both
choosing it, but that's likely to be rare.  So it seems like we could set the
default low water mark to 5 percent or even 3 percent, which will let more
of the ring buffer be used, and let a channel be assigned according to the
algorithm, rather than falling through to the default because all channels
appear to be "full".

> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> + "Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device *device,
>  {
>   struct storvsc_device *stor_device;
>   struct vstor_packet *vstor_packet;
> - struct vmbus_channel *outgoing_channel;
> + struct vmbus_channel *outgoing_channel, *channel;
>   int ret = 0;
> - struct cpumask alloced_mask;
> + struct cpumask alloced_mask, other_numa_mask;
>   int tgt_cpu;
> 
>   vstor_packet = >vstor_packet;
> @@ -1301,22 +1307,53 @@ static int storvsc_do_io(struct hv_device *device,
>   /*
>* Select an an appropriate channel to send the request out.
>*/
> -
>   if (stor_device->stor_chns[q_num] != NULL) {
>   outgoing_channel = stor_device->stor_chns[q_num];
> - if (outgoing_channel->target_cpu == smp_processor_id()) {
> + if (outgoing_channel->target_cpu == q_num) {
>   /*
>* Ideally, we want to pick a different channel if
>* available on the same NUMA node.
>*/
>   cpumask_and(_mask, _device->alloced_cpus,
>   cpumask_of_node(cpu_to_node(q_num)));
> - for_each_cpu_wrap(tgt_cpu, _mask,
> - outgoing_channel->target_cpu + 1) {
> - if (tgt_cpu != outgoing_channel->target_cpu) {
> - outgoing_channel =
> - stor_device->stor_chns[tgt_cpu];
> - break;
> +
> + for_each_cpu_wrap(tgt_cpu, _mask, q_num + 1) {
> + if (tgt_cpu == q_num)
> + continue;
> + channel = 

Re: SRIOV switchdev mode BoF minutes

2018-04-13 Thread Samudrala, Sridhar

On 4/13/2018 1:16 PM, Or Gerlitz wrote:

On Fri, Apr 13, 2018 at 7:49 PM, Samudrala, Sridhar
 wrote:

On 4/13/2018 1:57 AM, Or Gerlitz wrote:



in  overlay networks scheme, the uplink rep has the VTEP ip and is not connected
to the bridge, e.g you use ovs you have vf reps and vxlan ports connected
to ovs and the ip stack routes through the uplink rep

This changes the legacy mode behavior of configuring  vtep ip on the pf
netdev. How does host to host traffic expected to work when vtep ip is moved to 
uplink rep?

What do you mean host to host traffic, is that two VFs on the same host?
control plane SWs (such as OVS) don't apply encapsulation within the same host


I meant between PFs on 2 compute nodes.





What about pf-rep?

Are you planning to create a pf-rep too? Is pf also treated similar to vf in
switchdev mode?
All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default
without any rules programmed?

@ the sriov switchdev ARCH level, pf/pf-rep would work indeed as you described.

We will have pf rep for smartnic schemes where the the pf on the host
is not the manager of the eswitch but rather the smartnic driver instance.

on non smart env, there are some challenges to address for the pf
nic to be fully functional for the slow path (what you described), we
will get there down the road if there is a real need.


So on non-smart env, are you planning to only expose uplink rep and vf reps as 
netdevs.
By smartnic env, i guess you are referring to OVS control plane also running on 
the NIC.

I will look forward to your patches.

Thanks
Sridhar




Re: [PATCH iproute2-next 3/3] treewide: Use addattr_nest()/addattr_nest_end() to handle nested attributes

2018-04-13 Thread Vinicius Costa Gomes
Hi,

Serhey Popovych  writes:

[...]

> diff --git a/tc/q_mqprio.c b/tc/q_mqprio.c
> index 89b4600..207d644 100644
> --- a/tc/q_mqprio.c
> +++ b/tc/q_mqprio.c
> @@ -173,8 +173,7 @@ static int mqprio_parse_opt(struct qdisc_util *qu, int 
> argc,
>   argc--; argv++;
>   }
>  
> - tail = NLMSG_TAIL(n);
> - addattr_l(n, 1024, TCA_OPTIONS, , sizeof(opt));
> + tail = addattr_nest_compat(n, 1024, TCA_OPTIONS, , sizeof(opt));
>  
>   if (flags & TC_MQPRIO_F_MODE)
>   addattr_l(n, 1024, TCA_MQPRIO_MODE,
> @@ -209,7 +208,7 @@ static int mqprio_parse_opt(struct qdisc_util *qu, int 
> argc,
>   addattr_nest_end(n, start);
>   }
>  
> - tail->rta_len = (void *)NLMSG_TAIL(n) - (void *)tail;
> + addattr_nest_compat_end(n, tail);
>  
>   return 0;
>  }

Sorry if I am too late, but this breaks mqprio, i.e. something like
this:

$ tc qdisc replace dev enp2s0 handle 100: parent root mqprio \
   num_tc 3 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
   queues 1@0 1@1 2@2 hw 0

that used to work, now doesn't.

This patch looks right, so I thought that it could be possible that mqprio
(in the kernel side) was making some wrong assumptions about the format
of the messages.

And after some investigation, what seems to be happening is something
like this (not too familiar with netlink protocol internals, I may be
missing something).

In the "wire", after this patch, the mqprio part of message may be
represented as:

/* The message format is [ len | type | payload ] */

[ S | 2 |  ]
[ 0 | 2 | ]

Some notes:
 - S is the aligned value of sizeof(opt);
 - The value of TCA_OPTIONS is 2;

Before this patch, I think it was something like:

[ S | 2 |  ]

The problem is that mqprio defines an internal type with the same value
as TCA_OPTIONS (2), and that finalizing (empty) is interpreted as the
"internal" field instead of indicating the end of TCA_OPTIONS, which
causes a size mismatch with 'mqprio_policy', causing the command to
create a mqprio qdisc to fail.

In short, I think that replacing the "open coded" version with
addattr_nest_compat() is not a functionally equivalent change.


Cheers,
--
Vinicius


Cavium Octeon III network driver.

2018-04-13 Thread Steven J. Hill
Patches for Cavium's Octeon III network driver were submitted by
David Daney back on 20180222. David has since left the company and
I am now responsible for the upstreaming effort. When looking at
 they are marked as "Not Applicable". What
steps do I take next? Thanks.

Steve


[PATCH iproute2-next 1/1] tc: jsonify ife action

2018-04-13 Thread Roman Mashak
Signed-off-by: Roman Mashak 
---
 tc/m_ife.c | 54 --
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/tc/m_ife.c b/tc/m_ife.c
index d7e61703f666..15d09a167450 100644
--- a/tc/m_ife.c
+++ b/tc/m_ife.c
@@ -240,22 +240,24 @@ static int print_ife(struct action_util *au, FILE *f, 
struct rtattr *arg)
parse_rtattr_nested(tb, TCA_IFE_MAX, arg);
 
if (tb[TCA_IFE_PARMS] == NULL) {
-   fprintf(f, "[NULL ife parameters]");
+   print_string(PRINT_FP, NULL, "%s", "[NULL ife parameters]");
return -1;
}
p = RTA_DATA(tb[TCA_IFE_PARMS]);
 
-   fprintf(f, "ife %s ", p->flags & IFE_ENCODE ? "encode" : "decode");
+   print_string(PRINT_ANY, "kind", "%s ", "ife");
+   print_string(PRINT_ANY, "mode", "%s",
+p->flags & IFE_ENCODE ? "encode" : "decode");
print_action_control(f, "action ", p->action, " ");
 
if (tb[TCA_IFE_TYPE]) {
ife_type = rta_getattr_u16(tb[TCA_IFE_TYPE]);
has_optional = 1;
-   fprintf(f, "type 0x%X ", ife_type);
+   print_0xhex(PRINT_ANY, "type", "type 0x%X ", ife_type);
}
 
if (has_optional)
-   fprintf(f, "\n\t ");
+   print_string(PRINT_FP, NULL, "%s\t", _SL_);
 
if (tb[TCA_IFE_METALST]) {
struct rtattr *metalist[IFE_META_MAX + 1];
@@ -268,9 +270,11 @@ static int print_ife(struct action_util *au, FILE *f, 
struct rtattr *arg)
len = RTA_PAYLOAD(metalist[IFE_META_SKBMARK]);
if (len) {
mmark = 
rta_getattr_u32(metalist[IFE_META_SKBMARK]);
-   fprintf(f, "use mark %u ", mmark);
+   print_uint(PRINT_ANY, "mark", "use mark %u ",
+  mmark);
} else
-   fprintf(f, "allow mark ");
+   print_string(PRINT_ANY, "mark", "%s mark ",
+"allow");
}
 
if (metalist[IFE_META_TCINDEX]) {
@@ -278,41 +282,47 @@ static int print_ife(struct action_util *au, FILE *f, 
struct rtattr *arg)
if (len) {
mtcindex =

rta_getattr_u16(metalist[IFE_META_TCINDEX]);
-   fprintf(f, "use tcindex %d ", mtcindex);
+   print_uint(PRINT_ANY, "tcindex",
+  "use tcindex %u ", mtcindex);
} else
-   fprintf(f, "allow tcindex ");
+   print_string(PRINT_ANY, "tcindex",
+"%s tcindex ", "allow");
}
 
if (metalist[IFE_META_PRIO]) {
len = RTA_PAYLOAD(metalist[IFE_META_PRIO]);
if (len) {
mprio = 
rta_getattr_u32(metalist[IFE_META_PRIO]);
-   fprintf(f, "use prio %u ", mprio);
+   print_uint(PRINT_ANY, "prio", "use prio %u ",
+  mprio);
} else
-   fprintf(f, "allow prio ");
+   print_string(PRINT_ANY, "prio", "%s prio ",
+"allow");
}
 
}
 
if (tb[TCA_IFE_DMAC]) {
has_optional = 1;
-   fprintf(f, "dst %s ",
-   ll_addr_n2a(RTA_DATA(tb[TCA_IFE_DMAC]),
-   RTA_PAYLOAD(tb[TCA_IFE_DMAC]), 0, b2,
-   sizeof(b2)));
-
+   print_string(PRINT_ANY, "dst", "dst %s ",
+ll_addr_n2a(RTA_DATA(tb[TCA_IFE_DMAC]),
+RTA_PAYLOAD(tb[TCA_IFE_DMAC]), 0, b2,
+sizeof(b2)));
}
 
if (tb[TCA_IFE_SMAC]) {
has_optional = 1;
-   fprintf(f, "src %s ",
-   ll_addr_n2a(RTA_DATA(tb[TCA_IFE_SMAC]),
-   RTA_PAYLOAD(tb[TCA_IFE_SMAC]), 0, b2,
-   sizeof(b2)));
+   print_string(PRINT_ANY, "src", "src %s ",
+ll_addr_n2a(RTA_DATA(tb[TCA_IFE_SMAC]),
+RTA_PAYLOAD(tb[TCA_IFE_SMAC]), 0, b2,
+sizeof(b2)));
}
 
-   fprintf(f, "\n\t index %u ref %d bind %d", p->index, p->refcnt,
-   p->bindcnt);
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+   print_uint(PRINT_ANY, "index", "\t index %u", 

Panic since 4.15.15 in tcp_retransmit_timer when doing ss -K

2018-04-13 Thread Sami Farin
I started getting this since 4.15.15.  It's easy to trigger,
for example I get new IP address via dhcp (NetworkManager),
then ss -K the_old_ip_address .

Happens on Ryzen and SandyBridge systems.
My guess of the cause: commit 960058fe196397aecb16bb14e64980e265d2bc5e
(didn't try reverting)

BUG: unable to handle kernel NULL pointer dereference at 30
IP: tcp_retransmit_skb+0x57/0xc0
PGD 0 P4D 0
Oops:  [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW   4.15.17+ #42
Hardware name: To Be Filled By O.E.M./To Be Filled By O.E.M./X370 Taichi, BIOS 
P4.60 03/02/2018
RIP: 0010:tcp_retransmit_skb+0x57/0xc0
RSP: 0018:95b1dea03e00 EFLAGS: 00010206
RAX: fff5 RBX: 95b15876d000 RCX: 5
RDX: 5 RSI: 0 RDI: 95b115876d000

... 
Call Trace:

tcp_retransmit_timer
tcp_write_timer_handler
tcp_write_timer
? tcp_write_timer_handler
expire_timers
run_timer_softirq
sched_clock
sched_clock
sched_clock_cpu
irqtime_account_irq
__do_softirq
sched_clock
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt


-- 
Do what you love because life is too short for anything else.



Re: [PATCH] x86/cpufeature: guard asm_volatile_goto usage with CC_HAVE_ASM_GOTO

2018-04-13 Thread Alexei Starovoitov

On 4/13/18 11:19 AM, Peter Zijlstra wrote:

On Tue, Apr 10, 2018 at 02:28:04PM -0700, Alexei Starovoitov wrote:

Instead of
#ifdef CC_HAVE_ASM_GOTO
we can replace it with
#ifndef __BPF__
or some other name,


I would prefer the BPF specific hack; otherwise we might be encouraging
people to build the kernel proper without asm-goto.



I don't understand this concern.

1.
arch/x86/Makefile does
ifndef CC_HAVE_ASM_GOTO
  $(error Compiler lacks asm-goto support.)
endif

which is pretty strong statement of the kernel direction.

2.
Even with this patch that adds #ifdef CC_HAVE_ASM_GOTO back
the x86 arch still needs asm-goto in the compiler to be built.
As far as I can see there are other places where asm-goto
is open coded.
So there is no 'encouraging'.

Whereas if we do bpf specific marco we'd need to explain that
to all bpf users and they would need to fix their user space scripts.
Amount of user space breakage is unknown at this point.



Re: v6/sit tunnels and VRFs

2018-04-13 Thread David Ahern
On 4/13/18 2:23 PM, Jeff Barnhill wrote:
> It seems that the ENETUNREACH response is still desirable in the VRF
> case since the only difference (when using VRF vs. not) is that the
> lookup should be restrained to a specific VRF.

VRF is just policy routing to a table. If the table wants the lookup to
stop, then it needs a default route. What you are referring to is the
lookup goes through all tables and does not find an answer so it fails
with -ENETUNREACH. I do not know of any way to make that happen with the
existing default route options and in the past 2+ years we have not hit
any s/w that discriminates -ENETUNREACH from -EHOSTUNREACH.

I take it this is code from your internal code base. Why does it care
between those two failures?


Re: v6/sit tunnels and VRFs

2018-04-13 Thread Jeff Barnhill
Thanks for the response, David. I'm not questioning the need to stop
the fib lookup once the end of the VRF table is reached - I agree that
is needed.  I'm concerned with the difference in the response/error
returned from the failed lookup.

For instance, with vrf "unreachable default" route, I get this:
# ip route get 1.1.1.1 vrf vrf_258
RTNETLINK answers: No route to host

Without it (and assuming no match for 1.1.1.1 in local/main/default
tables), I get this:
# ip route get 1.1.1.1 vrf vrf_258
RTNETLINK answers: Network is unreachable

Which is also what happens when not using VRFs at all.

It seems that the ENETUNREACH response is still desirable in the VRF
case since the only difference (when using VRF vs. not) is that the
lookup should be restrained to a specific VRF.

Jeff



On Thu, Apr 12, 2018 at 10:25 PM, David Ahern  wrote:
> On 4/12/18 10:54 AM, Jeff Barnhill wrote:
>> Hi David,
>>
>> In the slides referenced, you recommend adding an "unreachable
>> default" route to the end of each VRF route table.  In my testing (for
>> v4) this results in a change to fib lookup failures such that instead
>> of ENETUNREACH being returned, EHOSTUNREACH is returned since the fib
>> finds the unreachable route, versus failing to find a route
>> altogether.
>>
>> Have the implications of this been considered?  I don't see a
>> clean/easy way to achieve the old behavior without affecting non-VRF
>> routing (eg. remove the unreachable route and delete the non-VRF
>> rules).  I'm guessing that programmatically, it may not make much
>> difference, ie. lookup fails, but for debugging or to a user looking
>> at it, the difference matters.  Do you (or anyone else) have any
>> thoughts on this?
>
> We have recommended moving the local table down in the FIB rules:
>
> # ip ru ls
> 1000:   from all lookup [l3mdev-table]
> 32765:  from all lookup local
> 32766:  from all lookup main
> 32767:  from all lookup default
>
> and adding a default route to VRF tables:
>
> # ip ro ls vrf red
> unreachable default  metric 4278198272
> 172.16.2.0/24  proto bgp  metric 20
> nexthop via 169.254.0.1  dev swp3 weight 1 onlink
> nexthop via 169.254.0.1  dev swp4 weight 1 onlink
>
> # ip -6 ro ls vrf red
> 2001:db8:2::/64  proto bgp  metric 20
> nexthop via fe80::202:ff:fe00:e  dev swp3 weight 1
> nexthop via fe80::202:ff:fe00:f  dev swp4 weight 1
> anycast fe80:: dev lo  proto kernel  metric 0  pref medium
> anycast fe80:: dev lo  proto kernel  metric 0  pref medium
> fe80::/64 dev swp3  proto kernel  metric 256  pref medium
> fe80::/64 dev swp4  proto kernel  metric 256  pref medium
> ff00::/8 dev swp3  metric 256  pref medium
> ff00::/8 dev swp4  metric 256  pref medium
> unreachable default dev lo  metric 4278198272  error -101 pref medium
>
> Over the last 2 years we have not seen any negative side effects to this
> and is what you want to have proper VRF separation.
>
> Without a default route lookups will proceed to the next fib rule which
> means a lookup in the next table and barring other PBR rules will be the
> main table. It will lead to wrong lookups.
>
> Here is an example:
>   ip netns add foo
>   ip netns exec foo bash
>   ip li set lo up
>   ip li add red type vrf table 123
>   ip li set red up
>   ip li add dummy1 type dummy
>   ip addr add 10.100.1.1/24 dev dummy1
>   ip li set dummy1 master red
>   ip li set dummy1 up
>   ip li add dummy2 type dummy
>   ip addr add 10.100.1.1/24 dev dummy2
>   ip li set dummy2 up
>   ip ro get 10.100.2.2
>   ip ro get 10.100.2.2 vrf red
>
> # ip ru ls
> 0:  from all lookup local
> 1000:   from all lookup [l3mdev-table]
> 32766:  from all lookup main
> 32767:  from all lookup default
>
> # ip ro ls
> 10.100.1.0/24 dev dummy2 proto kernel scope link src 10.100.1.1
> 10.100.2.0/24 via 10.100.1.2 dev dummy2
>
> # ip ro ls vrf red
> 10.100.1.0/24 dev dummy1 proto kernel scope link src 10.100.1.1
>
> That's the setup. What happens on route lookups?
> # ip ro get vrf red 10.100.2.1
> 10.100.2.1 via 10.100.1.2 dev dummy2 src 10.100.1.1 uid 0
> cache
>
> which is clearly wrong. Let's look at the lookup sequence
>
> # perf record -e fib:* ip ro get vrf red 10.100.2.1
> 10.100.2.1 via 10.100.1.2 dev dummy2 src 10.100.1.1 uid 0
> cache
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.003 MB perf.data (4 samples) ]
>
> #  perf script --fields trace:trace
> table 255 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
> table 123 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
> table 254 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
> nexthop dev dummy2 oif 4 src 10.100.1.1
>
> The first one is because I did not move the local table down.
> The second one is the correct vrf lookup
> The third one is the continuation to the next table - the main table.
>
> Adding a default route:
> # ip ro add vrf red unreachable default
>
> And the lookup is proper:
> # ip ro 

Re: SRIOV switchdev mode BoF minutes

2018-04-13 Thread Or Gerlitz
On Fri, Apr 13, 2018 at 7:49 PM, Samudrala, Sridhar
 wrote:
> On 4/13/2018 1:57 AM, Or Gerlitz wrote:


>>> in  overlay networks scheme, the uplink rep has the VTEP ip and is not 
>>> connected
>>> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected
>>> to ovs and the ip stack routes through the uplink rep

> This changes the legacy mode behavior of configuring  vtep ip on the pf
> netdev. How does host to host traffic expected to work when vtep ip is moved 
> to uplink rep?

What do you mean host to host traffic, is that two VFs on the same host?
control plane SWs (such as OVS) don't apply encapsulation within the same host

 What about pf-rep?

> Are you planning to create a pf-rep too? Is pf also treated similar to vf in
> switchdev mode?
> All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default
> without any rules programmed?

@ the sriov switchdev ARCH level, pf/pf-rep would work indeed as you described.

We will have pf rep for smartnic schemes where the the pf on the host
is not the manager of the eswitch but rather the smartnic driver instance.

on non smart env, there are some challenges to address for the pf
nic to be fully functional for the slow path (what you described), we
will get there down the road if there is a real need.


[next-queue PATCH 1/1] ixgbe: cleanup sparse warnings

2018-04-13 Thread cathy . zhou
From: Cathy Zhou 

Sparse complains valid conversions between restricted types, force
attribute is used to avoid those warnings.

Signed-off-by: Cathy Zhou 
Reviewed-by: Shannon Nelson 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c  | 13 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.c |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c   |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c  | 25 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   | 29 +++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_model.h  | 16 +++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c   |  9 
 7 files changed, 55 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
index 66a74f4651e8..898b47b1a854 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
@@ -1462,7 +1462,8 @@ void ixgbe_atr_compute_perfect_hash_82599(union 
ixgbe_atr_input *input,
 {
 
u32 hi_hash_dword, lo_hash_dword, flow_vm_vlan;
-   u32 bucket_hash = 0, hi_dword = 0;
+   u32 bucket_hash = 0;
+   __be32 hi_dword = 0;
int i;
 
/* Apply masks to input data */
@@ -1501,7 +1502,7 @@ void ixgbe_atr_compute_perfect_hash_82599(union 
ixgbe_atr_input *input,
 * Limit hash to 13 bits since max bucket count is 8K.
 * Store result at the end of the input stream.
 */
-   input->formatted.bkt_hash = bucket_hash & 0x1FFF;
+   input->formatted.bkt_hash = (__force __be16)(bucket_hash & 0x1FFF);
 }
 
 /**
@@ -1610,7 +1611,7 @@ s32 ixgbe_fdir_set_input_mask_82599(struct ixgbe_hw *hw,
return IXGBE_ERR_CONFIG;
}
 
-   switch (input_mask->formatted.flex_bytes & 0x) {
+   switch ((__force u16)input_mask->formatted.flex_bytes & 0x) {
case 0x:
/* Mask Flex Bytes */
fdirm |= IXGBE_FDIRM_FLEX;
@@ -1680,13 +1681,13 @@ s32 ixgbe_fdir_write_perfect_filter_82599(struct 
ixgbe_hw *hw,
IXGBE_WRITE_REG(hw, IXGBE_FDIRPORT, fdirport);
 
/* record vlan (little-endian) and flex_bytes(big-endian) */
-   fdirvlan = IXGBE_STORE_AS_BE16(input->formatted.flex_bytes);
+   fdirvlan = IXGBE_STORE_AS_BE16((__force 
u16)input->formatted.flex_bytes);
fdirvlan <<= IXGBE_FDIRVLAN_FLEX_SHIFT;
fdirvlan |= ntohs(input->formatted.vlan_id);
IXGBE_WRITE_REG(hw, IXGBE_FDIRVLAN, fdirvlan);
 
/* configure FDIRHASH register */
-   fdirhash = input->formatted.bkt_hash;
+   fdirhash = (__force u32)input->formatted.bkt_hash;
fdirhash |= soft_id << IXGBE_FDIRHASH_SIG_SW_INDEX_SHIFT;
IXGBE_WRITE_REG(hw, IXGBE_FDIRHASH, fdirhash);
 
@@ -1724,7 +1725,7 @@ s32 ixgbe_fdir_erase_perfect_filter_82599(struct ixgbe_hw 
*hw,
s32 err;
 
/* configure FDIRHASH register */
-   fdirhash = input->formatted.bkt_hash;
+   fdirhash = (__force u32)input->formatted.bkt_hash;
fdirhash |= soft_id << IXGBE_FDIRHASH_SIG_SW_INDEX_SHIFT;
IXGBE_WRITE_REG(hw, IXGBE_FDIRHASH, fdirhash);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
index 633be93f3dbb..7db2722366c2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
@@ -3652,7 +3652,7 @@ s32 ixgbe_hic_unlocked(struct ixgbe_hw *hw, u32 *buffer, 
u32 length,
 */
for (i = 0; i < dword_len; i++)
IXGBE_WRITE_REG_ARRAY(hw, IXGBE_FLEX_MNG,
- i, cpu_to_le32(buffer[i]));
+ i, (__force u32)cpu_to_le32(buffer[i]));
 
/* Setting this bit tells the ARC that a new command is pending. */
IXGBE_WRITE_REG(hw, IXGBE_HICR, hicr | IXGBE_HICR_C);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
index 7a09a40e4472..4e4c5eeda50d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
@@ -465,7 +465,7 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
case cpu_to_le32(IXGBE_RXDADV_STAT_FCSTAT_FCPRSP):
dma_unmap_sg(>pdev->dev, ddp->sgl,
 ddp->sgc, DMA_FROM_DEVICE);
-   ddp->err = ddp_err;
+   ddp->err = (__force u32)ddp_err;
ddp->sgl = NULL;
ddp->sgc = 0;
/* fall through */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 68af127987bc..33e8c588ff51 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -43,8 +43,9 @@ static void ixgbe_ipsec_set_tx_sa(struct 

Re: [RFC bpf-next v2 4/8] bpf: add documentation for eBPF helpers (23-32)

2018-04-13 Thread Quentin Monnet
2018-04-12 17:28 UTC-0700 ~ Alexei Starovoitov

> On Tue, Apr 10, 2018 at 03:41:53PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> written by Daniel:
>>
>> - bpf_get_prandom_u32()
>> - bpf_get_smp_processor_id()
>> - bpf_get_cgroup_classid()
>> - bpf_get_route_realm()
>> - bpf_skb_load_bytes()
>> - bpf_csum_diff()
>> - bpf_skb_get_tunnel_opt()
>> - bpf_skb_set_tunnel_opt()
>> - bpf_skb_change_proto()
>> - bpf_skb_change_type()
>>
>> Cc: Daniel Borkmann 
>> Signed-off-by: Quentin Monnet 
>> ---
>>  include/uapi/linux/bpf.h | 125 
>> +++
>>  1 file changed, 125 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index f3ea8824efbc..d147d9dd6a83 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h

[...]

>> @@ -604,6 +612,13 @@ union bpf_attr {
>>   *  Return
>>   *  0 on success, or a negative error in case of failure.
>>   *
>> + * u32 bpf_get_cgroup_classid(struct sk_buff *skb)
>> + *  Description
>> + *  Retrieve the classid for the current task, i.e. for the
>> + *  net_cls (network classifier) cgroup to which *skb* belongs.
> 
> please add that kernel should be configured with CONFIG_NET_CLS_CGROUP=y|m
> and mention Documentation/cgroup-v1/net_cls.txt

Ok.

> Otherwise 'network classifier' is way too generic.

I am not so familiar with cgroups. What would you suggest instead?

> I'd also mention that placing a task into net_cls controller
> disables all of cgroup-bpf.

Could you please explain a bit more? Placing a task into the controller
is using:

echo   >  /sys/fs/cgroup//tasks

correct? Then if I do this, it disables all of cgroup-bpf. Does this
mean that I loose the possibility to use or add BPF programs to all
cgroup-related attach points for this cgroup? I think I missed something
here.

>> + *  Return
>> + *  The classid, or 0 for the default unconfigured classid.
>> + *
>>   * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 
>> vlan_tci)
>>   *  Description
>>   *  Push a *vlan_tci* (VLAN tag control information) of protocol

I have no particular comments on the other items you reported on this
patch, I will fix them. Thanks!

Quentin


[PATCH v2 net 3/3] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread Edward Cree
A misconfigured system (e.g. with all interrupts affinitised to all CPUs)
 may produce a storm of ARFS steering events.  With the existing sfc ARFS
 implementation, that could create a backlog of workitems that grinds the
 system to a halt.  To prevent this, limit the number of workitems that
 may be in flight for a given SFC device to 8 (EFX_RPS_MAX_IN_FLIGHT), and
 return EBUSY from our ndo_rx_flow_steer method if the limit is reached.
Given this limit, also store the workitems in an array of slots within the
 struct efx_nic, rather than dynamically allocating for each request.
The limit should not negatively impact performance, because it is only
 likely to be hit in cases where ARFS will be ineffective anyway.

Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/net_driver.h | 25 +++
 drivers/net/ethernet/sfc/rx.c | 58 ++-
 2 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/sfc/net_driver.h 
b/drivers/net/ethernet/sfc/net_driver.h
index 5e379a83c729..eea3808b3f25 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -733,6 +733,27 @@ struct efx_rss_context {
u32 rx_indir_table[128];
 };
 
+#ifdef CONFIG_RFS_ACCEL
+/**
+ * struct efx_async_filter_insertion - Request to asynchronously insert a 
filter
+ * @net_dev: Reference to the netdevice
+ * @spec: The filter to insert
+ * @work: Workitem for this request
+ * @rxq_index: Identifies the channel for which this request was made
+ * @flow_id: Identifies the kernel-side flow for which this request was made
+ */
+struct efx_async_filter_insertion {
+   struct net_device *net_dev;
+   struct efx_filter_spec spec;
+   struct work_struct work;
+   u16 rxq_index;
+   u32 flow_id;
+};
+
+/* Maximum number of ARFS workitems that may be in flight on an efx_nic */
+#define EFX_RPS_MAX_IN_FLIGHT  8
+#endif /* CONFIG_RFS_ACCEL */
+
 /**
  * struct efx_nic - an Efx NIC
  * @name: Device name (net device name or bus id before net device registered)
@@ -850,6 +871,8 @@ struct efx_rss_context {
  * @rps_expire_channel: Next channel to check for expiry
  * @rps_expire_index: Next index to check for expiry in
  * @rps_expire_channel's @rps_flow_id
+ * @rps_slot_map: bitmap of in-flight entries in @rps_slot
+ * @rps_slot: array of ARFS insertion requests for efx_filter_rfs_work()
  * @active_queues: Count of RX and TX queues that haven't been flushed and 
drained.
  * @rxq_flush_pending: Count of number of receive queues that need to be 
flushed.
  * Decremented when the efx_flush_rx_queue() is called.
@@ -1004,6 +1027,8 @@ struct efx_nic {
struct mutex rps_mutex;
unsigned int rps_expire_channel;
unsigned int rps_expire_index;
+   unsigned long rps_slot_map;
+   struct efx_async_filter_insertion rps_slot[EFX_RPS_MAX_IN_FLIGHT];
 #endif
 
atomic_t active_queues;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 13b0eb71dbf3..9c593c661cbf 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -827,28 +827,13 @@ MODULE_PARM_DESC(rx_refill_threshold,
 
 #ifdef CONFIG_RFS_ACCEL
 
-/**
- * struct efx_async_filter_insertion - Request to asynchronously insert a 
filter
- * @net_dev: Reference to the netdevice
- * @spec: The filter to insert
- * @work: Workitem for this request
- * @rxq_index: Identifies the channel for which this request was made
- * @flow_id: Identifies the kernel-side flow for which this request was made
- */
-struct efx_async_filter_insertion {
-   struct net_device *net_dev;
-   struct efx_filter_spec spec;
-   struct work_struct work;
-   u16 rxq_index;
-   u32 flow_id;
-};
-
 static void efx_filter_rfs_work(struct work_struct *data)
 {
struct efx_async_filter_insertion *req = container_of(data, struct 
efx_async_filter_insertion,
  work);
struct efx_nic *efx = netdev_priv(req->net_dev);
struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
+   int slot_idx = req - efx->rps_slot;
int rc;
 
rc = efx->type->filter_insert(efx, >spec, true);
@@ -878,8 +863,8 @@ static void efx_filter_rfs_work(struct work_struct *data)
}
 
/* Release references */
+   clear_bit(slot_idx, >rps_slot_map);
dev_put(req->net_dev);
-   kfree(req);
 }
 
 int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
@@ -888,22 +873,36 @@ int efx_filter_rfs(struct net_device *net_dev, const 
struct sk_buff *skb,
struct efx_nic *efx = netdev_priv(net_dev);
struct efx_async_filter_insertion *req;
struct flow_keys fk;
+   int slot_idx;
+   int rc;
 
-   if (flow_id == RPS_FLOW_ID_INVALID)
-   return -EINVAL;
+   /* find a free slot */
+   for (slot_idx = 0; slot_idx < 

[PATCH v2 net 2/3] sfc: pass the correctly bogus filter_id to rps_may_expire_flow()

2018-04-13 Thread Edward Cree
When we inserted an ARFS filter for ndo_rx_flow_steer(), we didn't know
 what the filter ID would be, so we just returned 0.  Thus, we must also
 pass 0 as the filter ID when calling rps_may_expire_flow() for it, and
 rely on the flow_id to identify what we're talking about.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/ef10.c  | 3 +--
 drivers/net/ethernet/sfc/farch.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 50daad0a1482..36f24c7e553a 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -4776,8 +4776,7 @@ static bool efx_ef10_filter_rfs_expire_one(struct efx_nic 
*efx, u32 flow_id,
goto out_unlock;
}
 
-   if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id,
-flow_id, filter_idx)) {
+   if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, 0)) {
ret = false;
goto out_unlock;
}
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 4a19c7efdf8d..7174ef5e5c5e 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -2912,7 +2912,7 @@ bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, 
u32 flow_id,
if (test_bit(index, table->used_bitmap) &&
table->spec[index].priority == EFX_FILTER_PRI_HINT &&
rps_may_expire_flow(efx->net_dev, table->spec[index].dmaq_id,
-   flow_id, index)) {
+   flow_id, 0)) {
efx_farch_filter_table_clear_entry(efx, table, index);
ret = true;
}



[PATCH v2 net 1/3] sfc: insert ARFS filters with replace_equal=true

2018-04-13 Thread Edward Cree
Necessary to allow redirecting a flow when the application moves.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 95682831484e..13b0eb71dbf3 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -851,7 +851,7 @@ static void efx_filter_rfs_work(struct work_struct *data)
struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
int rc;
 
-   rc = efx->type->filter_insert(efx, >spec, false);
+   rc = efx->type->filter_insert(efx, >spec, true);
if (rc >= 0) {
/* Remember this so we can check whether to expire the filter
 * later.



[PATCH] PCI: Add PCIe to pcie_print_link_status() messages

2018-04-13 Thread Jakub Kicinski
Currently the pcie_print_link_status() will print PCIe bandwidth
and link width information but does not mention it is pertaining
to the PCIe.  Since this and related functions are used exclusively
by networking drivers today users may get confused into thinking
that it's the NIC bandwidth that is being talked about.  Insert a
"PCIe" into the messages.

Signed-off-by: Jakub Kicinski 
---
 drivers/pci/pci.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index aa86e904f93c..73a0a4993f6a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5273,11 +5273,11 @@ void pcie_print_link_status(struct pci_dev *dev)
bw_avail = pcie_bandwidth_available(dev, _dev, , );
 
if (bw_avail >= bw_cap)
-   pci_info(dev, "%u.%03u Gb/s available bandwidth (%s x%d 
link)\n",
+   pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d 
link)\n",
 bw_cap / 1000, bw_cap % 1000,
 PCIE_SPEED2STR(speed_cap), width_cap);
else
-   pci_info(dev, "%u.%03u Gb/s available bandwidth, limited by %s 
x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
+   pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited 
by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
 bw_avail / 1000, bw_avail % 1000,
 PCIE_SPEED2STR(speed), width,
 limiting_dev ? pci_name(limiting_dev) : "",
-- 
2.16.2



[PATCH v2 net 0/3] sfc: ARFS fixes

2018-04-13 Thread Edward Cree
Three issues introduced by my recent asynchronous filter handling changes:
1. The old filter_rfs_insert would replace a matching filter of equal
   priority; we need to pass the appropriate argument to filter_insert to
   make it do the same.
2. We're lying to the kernel with our return value from ndo_rx_flow_steer,
   so we need to lie consistently when calling rps_may_expire_flow.  This
   is only a partial fix, as the lie still prevents us from steering
   multiple flows with the same ID to different queues; a proper fix that
   stops us lying at all will hopefully follow later.
3. It's possible to cause the kernel to hammer ndo_rx_flow_steer very
   hard, so make sure we don't build up too huge a backlog of workitems.

Possibly it would be better to fix #3 on the kernel side; I have a patch
 which I think does that but it's not a regression in 4.17 so isn't 'net'
 material.
There's also the issue that we come up in the bad configuration that
 triggers #3 by default, but that too is a problem for another time.

Edward Cree (3):
  sfc: insert ARFS filters with replace_equal=true
  sfc: pass the correctly bogus filter_id to rps_may_expire_flow()
  sfc: limit ARFS workitems in flight per channel

 drivers/net/ethernet/sfc/ef10.c   |  3 +-
 drivers/net/ethernet/sfc/farch.c  |  2 +-
 drivers/net/ethernet/sfc/net_driver.h | 25 +++
 drivers/net/ethernet/sfc/rx.c | 60 ++-
 4 files changed, 58 insertions(+), 32 deletions(-)



ethtool 4.16 released

2018-04-13 Thread John W. Linville
ethtool version 4.16 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-4.16.tar.xz

Release notes:

* Feature: add support for extra RSS contexts and RSS steering filters
* Feature: Document RSS context control and RSS filters
* Fix: don't fall back to grxfhindir when context was specified
* Fix: correct display of VF when showing vf/queue filters
* Fix: show VF and queue in the help for -N
* Fix: correct VF index values for the ring_cookie parameter
* Feature: Add SFF 8636 date code parsing support

John
-- 
John W. LinvilleSomeday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.





Re: [PATCH net] team: avoid adding twice the same option to the event list

2018-04-13 Thread David Miller
From: Paolo Abeni 
Date: Fri, 13 Apr 2018 13:59:25 +0200

> When parsing the options provided by the user space,
> team_nl_cmd_options_set() insert them in a temporary list to send
> multiple events with a single message.
> While each option's attribute is correctly validated, the code does
> not check for duplicate entries before inserting into the event
> list.
> 
> Exploiting the above, the syzbot was able to trigger the following
> splat:
 ...
> This changeset addresses the avoiding list_add() if the current
> option is already present in the event list.
> 
> Reported-and-tested-by: syzbot+4d4af685432dc0e56...@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni 
> Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")

Looks good to me.

It's too bad that the tmp list entries don't get marked as they are
added, or get unlinked by the list processor.  Either scheme would
make the "already added" test a lot simpler.

Jiri, please review before I apply this.

Thanks.


Re: [PATCH net-next 2/3] dt-bindings: net: ave: add syscon-phy-mode property to configure phy-mode setting

2018-04-13 Thread Rob Herring
On Mon, Apr 09, 2018 at 03:38:44PM +0900, Kunihiko Hayashi wrote:
> Add "socionext,syscon-phy-mode" property to specify system controller that
> configures the settings about phy-mode.
> 
> Signed-off-by: Kunihiko Hayashi 
> ---
>  Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)

Reviewed-by: Rob Herring 



Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> I guess there is nothing we need to do!
>
> On x86, in case of no intel iommu or iommu is disabled, you end up in
> swiotlb for DMA API calls when system has 4G memory.
> However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
> use bounce buffer until and unless you have swiotlb=force specified in
> kernel commandline.

Sure.  But that means very sync_*_to_device and sync_*_to_cpu now
involves an indirect call to do exactly nothing, which in the workload
Jesper is looking at is causing a huge performance degradation due to
retpolines.


Re: [PATCH net-next 1/3] net: ethernet: ave: add multiple clocks and resets support as required property

2018-04-13 Thread Rob Herring
On Mon, Apr 09, 2018 at 03:38:43PM +0900, Kunihiko Hayashi wrote:
> When the link is becoming up for Pro4 SoC, the kernel is stalled
> due to some missing clocks and resets.
> 
> The AVE block for Pro4 is connected to the GIO bus in the SoC.
> Without its clock/reset, the access to the AVE register makes the
> system stall.
> 
> In the same way, another MAC clock for Giga-bit Connection and
> the PHY clock are also required for Pro4 to activate the Giga-bit feature
> and to recognize the PHY.
> 
> To satisfy these requirements, this patch adds support for multiple clocks
> and resets, and adds the clock-names and reset-names to the binding because
> we need to distinguish clock/reset for the AVE main block and the others.
> 
> Also, make the resets a required property. Currently, "reset is
> optional" relies on that the bootloader or firmware has deasserted
> the reset before booting the kernel.  Drivers should work without
> such expectation.
> 
> Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver")
> Suggested-by: Masahiro Yamada 
> Signed-off-by: Kunihiko Hayashi 
> ---
>  .../bindings/net/socionext,uniphier-ave4.txt   |  13 ++-

Reviewed-by: Rob Herring 

>  drivers/net/ethernet/socionext/sni_ave.c   | 108 
> -
>  2 files changed, 96 insertions(+), 25 deletions(-)


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Tushar Dave



On 04/12/2018 07:56 AM, Christoph Hellwig wrote:

On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:

On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:

---
Implement support for keeping the DMA mapping through the XDP return
call, to remove RX map/unmap calls.  Implement bulking for XDP
ndo_xdp_xmit and XDP return frame API.  Bulking allows to perform DMA
bulking via scatter-gatter DMA calls, XDP TX need it for DMA
map+unmap. The driver RX DMA-sync (to CPU) per packet calls are harder
to mitigate (via bulk technique). Ask DMA maintainer for a common
case direct call for swiotlb DMA sync call ;-)


Why do you even end up in swiotlb code?  Once you bounce buffer your
performance is toast anyway..


I guess that is because x86 selects it as the default as soon as
we have more than 4G memory. That should be solveable fairly easily
with the per-device dma ops, though.\


I guess there is nothing we need to do!

On x86, in case of no intel iommu or iommu is disabled, you end up in
swiotlb for DMA API calls when system has 4G memory.
However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
use bounce buffer until and unless you have swiotlb=force specified in
kernel commandline.

e.g. here is the snip:
dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction dir,
unsigned long attrs)
{
phys_addr_t map, phys = page_to_phys(page) + offset;
dma_addr_t dev_addr = phys_to_dma(dev, phys);

BUG_ON(dir == DMA_NONE);
/*
 * If the address happens to be in the device's DMA window,
 * we can safely return the device addr and not worry about bounce
 * buffering it.
 */
if (dma_capable(dev, dev_addr, size) && swiotlb_force != 
SWIOTLB_FORCE)

return dev_addr;


-Tushar




Creating FOU tunnels to the same destination IP but different port

2018-04-13 Thread Kostas Peletidis
Hello,

I am having trouble with a particular case of setting up a fou tunnel
and I would really appreciate your help.

I have a remote multihomed host behind a NAT box and I want to create
a fou tunnel for each of its IP addresses, from my machine.

A typical case would be something like that (output from the local machine):

# ip tun
ipudp09602: ip/ip remote 135.196.22.100 local 172.31.0.140 ttl 225
ipudp00101: ip/ip remote 148.252.129.30 local 172.31.0.140 ttl 225
ipudp09604: ip/ip remote 77.247.11.249 local 172.31.0.140 ttl 225
tunl0: any/ip remote any local any ttl inherit nopmtudisc
ipudp00102: ip/ip remote 213.205.194.18 local 172.31.0.140 ttl 225

However, if the remote end has the same IP address with the remote end
of an existing tunnel (but a different remote port)
tunnel creation fails. In this example there is already a tunnel to
135.196.22.100:32270 and I wanted to create a new tunnel
to 135.196.22.100:24822 as below:

# ip link add name ipudp09603 mtu 1356 type ipip \
  remote 135.196.22.100 \
  local 172.31.0.140 \
  ttl 225 \
  encap fou \
 encap-sport 4500 \
 encap-dport 24822

RTNETLINK answers: File exists

The remote IP addresses in this case are identical because there is a
NAT box in the way, but the port numbers are different. The source
address and port are the same in all cases.

I noticed that ip_tunnel_find() does not check port numbers - being IP
and all - so I am thinking that a not-so-elegant way to do it is to
get the port numbers from the netlink request and have
ip_tunnel_find() compare them against encap.{sport, dport} of existing
tunnels.

Is there a better way to create a second fou tunnel to the same IP
address but a different port? Use of keys as unique tunnel IDs maybe?
Any feedback is appreciated. Thank you.


Regards,
Kostas


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 05:31:31PM +0200, Jesper Dangaard Brouer wrote:
> > I guess that is because x86 selects it as the default as soon as
> > we have more than 4G memory. 
> 
> I were also confused why I ended up using SWIOTLB (SoftWare IO-TLB),
> that might explain it. And I'm not hitting the bounce-buffer case.
> 
> How do I control which DMA engine I use? (So, I can play a little)

At the lowest level you control it by:

 (1) setting the dma_ops pointer in struct device
 (2) if that is NULL by choosing what is returned from
 get_arch_dma_ops()

> 
> 
> > That should be solveable fairly easily with the per-device dma ops,
> > though.
> 
> I didn't understand this part.

What I mean with that is that we can start out setting dma_ops
to dma_direct_ops for everyone on x86 when we start out (that is assuming
we don't have an iommu), and only switching to swiotlb_dma_ops when
actually required by either a dma_mask that can't address all memory,
or some other special cases like SEV or broken bridges.

> I wanted to ask your opinion, on a hackish idea I have...
> Which is howto detect, if I can reuse the RX-DMA map address, for TX-DMA
> operation on another device (still/only calling sync_single_for_device).
> 
> With XDP_REDIRECT we are redirecting between net_device's. Usually
> we keep the RX-DMA mapping as we recycle the page. On the redirect to
> TX-device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX.  The
> question is how to avoid this mapping(?).  In some cases, with some DMA
> engines (or lack of) I guess the DMA address is actually the same as
> the RX-DMA mapping dma_addr_t already known, right?  For those cases,
> would it be possible to just (re)use that address for TX?

You can't in any sensible way without breaking a lot of abstractions.
For dma direct ops that mapping will be the same unless the devices
have different dma_offsets in their struct device, or the architecture
overrides phys_to_dma entirely, in which case all bets are off.
If you have an iommu it depends on which devices are behind the same
iommu.


Re: SRIOV switchdev mode BoF minutes

2018-04-13 Thread Samudrala, Sridhar

On 4/13/2018 1:57 AM, Or Gerlitz wrote:

On Fri, Apr 13, 2018 at 11:56 AM, Or Gerlitz  wrote:

On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar
 wrote:

On 4/12/2018 1:20 PM, Or Gerlitz wrote:

On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
 wrote:

On 11/12/2017 11:49 AM, Or Gerlitz wrote:

Hi Dave and all,

During and after the BoF on SRIOV switchdev mode, we came into a
consensus among the developers from four different HW vendors (CC
audience) that a correct thing to do would be to disallow any new
extensions to the legacy mode.

The idea is to put focus on the new mode and not add new UAPIs and
kernel code which was turned to be a wrong design which does not allow
for properly offloading a kernel switching SW model to e-switch HW.

We also had a good session the day after regarding alignment for the
representation model of the uplink (physical port) and PF/s.

The VF representor netdevs  exist for all drivers that support the new
mode but the representation for the uplink and PF wasn't the same for
all. The decision was to represent the uplink and PFs vports in the
same manner done for VFs, using rep netdevs. This alignment would
provide a more strict and clear view of the kernel model for e-switch
to users and upper layer control plane SW.


I don't see any changes in the Mellanox/other drivers to move to this new
model to enable the uplink and PF port representors, any updates?

Yeah, I am worked on that but didn't get to finalize the upstreaming
so far.  I have resumed
the work and plan uplink rep in mlx5 to replace the PF being uplink rep
for 4.18


It would be really nice to highlight the pros and cons of the old versus
the
new model.

We are looking into adding switchdev support for our new 100Gb ice driver
and could use some feedback on the direction we should be taking.

good news.

The uplink rep is clear cut that needs to be a rep device representing
the uplink just like vf
rep represents the vport toward the vf - please just do it correct
from the begining


Having an uplink rep will definitely help implement the slow path with
flat/vlan network
scenarios by not having to add PF to the bridge.

But how do they help with a vxlan overlay scenario? In case of overlays, the
slow path has to go via vxlan -> ip stack -> pf?

in  overlay networks scheme, the uplink has the VTEP ip and is not connected

the uplink rep has the vtep ip


to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs
and the ip stack routes through the uplink rep


This changes the legacy mode behavior of configuring  vtep ip on the pf netdev.
How does host to host traffic expected to work when vtep ip is moved to uplink 
rep?





What about pf-rep?


Are you planning to create a pf-rep too? Is pf also treated similar to vf in 
switchdev mode?
All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default without 
any rules
programmed?



[PATCH iproute2] utils: Do not reset family for default, any, all addresses

2018-04-13 Thread David Ahern
Thomas reported a change in behavior with respect to autodectecting
address families. Specifically, 'ip ro add default via fe80::1'
syntax was failing to treat fe80::1 as an IPv6 address as it did in
prior releases. The root causes appears to be a change in family when
the default keyword is parsed.

'default', 'any' and 'all' are relevant outside of AF_INET. Leave the
family arg as is for these when setting addr.

Fixes: 93fa12418dc6 ("utils: Always specify family and ->bytelen in 
get_prefix_1()")
Reported-by: Thomas Deutschmann 
Signed-off-by: David Ahern 
Cc: Serhey Popovych 
---
 lib/utils.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/utils.c b/lib/utils.c
index 60d7eb14b438..8a0bff0babeb 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -568,7 +568,7 @@ static int __get_addr_1(inet_prefix *addr, const char 
*name, int family)
if (strcmp(name, "default") == 0) {
if ((family == AF_DECnet) || (family == AF_MPLS))
return -1;
-   addr->family = (family != AF_UNSPEC) ? family : AF_INET;
+   addr->family = family;
addr->bytelen = af_byte_len(addr->family);
addr->bitlen = -2;
addr->flags |= PREFIXLEN_SPECIFIED;
@@ -579,7 +579,7 @@ static int __get_addr_1(inet_prefix *addr, const char 
*name, int family)
strcmp(name, "any") == 0) {
if ((family == AF_DECnet) || (family == AF_MPLS))
return -1;
-   addr->family = AF_UNSPEC;
+   addr->family = family;
addr->bytelen = 0;
addr->bitlen = -2;
return 0;
-- 
2.11.0



Re: tcp hang when socket fills up ?

2018-04-13 Thread Michal Kubecek
On Fri, Apr 06, 2018 at 11:07:20AM +0200, Dominique Martinet wrote:
> 16:49:26.735042 IP .13317 > .31872: Flags 
> [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735046 IP .13317 > .31872: Flags 
> [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735334 IP .31872 > .13317: Flags 
> [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736005 IP .31872 > .13317: Flags 
> [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736402 IP .13317 > .31872: Flags 
> [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.736408 IP .13317 > .31872: Flags 
> [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.738561 IP .31872 > .13317: Flags 
> [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739539 IP .31872 > .13317: Flags 
> [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739882 IP .31872 > .13317: Flags 
> [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.740255 IP .31872 > .13317: Flags 
> [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.746756 IP .31872 > .13317: Flags 
> [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], 
> length 0
> 16:49:26.747923 IP .31872 > .13317: Flags 
> [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], 
> length 0
> 16:49:26.749083 IP .31872 > .13317: Flags 
> [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], 
> length 0
> 16:49:26.750171 IP .31872 > .13317: Flags 
> [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], 
> length 0
> 16:49:26.750808 IP .31872 > .13317: Flags 
> [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], 
> length 0
> 16:49:26.754648 IP .31872 > .13317: Flags 
> [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], 
> length 0
> 16:49:26.755985 IP .31872 > .13317: Flags 
> [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], 
> length 0
> 16:49:26.758513 IP .31872 > .13317: Flags 
> [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], 
> length 0
> 16:49:26.759096 IP .31872 > .13317: Flags 
> [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759421 IP .31872 > .13317: Flags 
> [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759755 IP .31872 > .13317: Flags 
> [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], 
> length 0
> 16:49:26.760653 IP .31872 > .13317: Flags 
> [.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], 
> length 0
> 16:49:26.761453 IP .31872 > .13317: Flags 
> [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], 
> length 0
> 16:49:26.762199 IP .31872 > .13317: Flags 
> [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], 
> length 0
> 16:49:26.763547 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 
> ecr 1313937638], length 36
> 16:49:26.763553 IP .31872 > .13317: Flags 
> [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], 
> length 0
> 16:49:26.764298 IP .31872 > .13317: Flags 
> [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], 
> length 0
> 16:49:26.764676 IP .31872 > .13317: Flags 
> [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], 
> length 0
> 16:49:26.807754 IP .13317 > .31872: Flags 
> [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 
> ecr 1617129473], length 1374
> 16:49:26.876467 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], 
> length 0
> 16:49:27.048760 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 
> ecr 1617129473], length 1374
> 16:49:27.051791 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 
> ecr 1313937714], length 36
> 16:49:27.076444 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.371182 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 
> ecr 1313937714], length 36
> 16:49:27.519862 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, 

Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread Edward Cree
On 13/04/18 17:14, David Miller wrote:
> Is the issue that you learn about the hardware reset asynchronously,
> and therefore cannot determine if filter insertion programming
> happened afterwards and thus is still in the chip?
Yes, pretty much.

> You must have a table of all the entries, so that you can reprogram
> the hardware should it reset.
Yes, we do have such a table; 'reprogram the hardware' happens in
 efx_ef10_filter_table_restore().

> Understood, thanks for explaining.
>
> Please respin your series with the updates you talked about and I'll
> apply it.
Will do, thanks.

-Ed


Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps

2018-04-13 Thread David Miller
From: Guillaume Nault 
Date: Fri, 13 Apr 2018 18:09:12 +0200

> On Fri, Apr 13, 2018 at 10:57:03AM -0400, David Miller wrote:
>> From: Guillaume Nault 
>> Date: Thu, 12 Apr 2018 20:50:33 +0200
>> 
>> > l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
>> > tunnel, therefore it can be freed whenever the caller uses it.
>> > This patch defines l2tp_tunnel_get_nth() which works similarly, but
>> > also takes a reference on the returned tunnel. The caller then has to
>> > drop it after it stops using the tunnel.
>> > 
>> > Convert netlink dumps to make them safe against concurrent tunnel
>> > deletion.
>> > 
>> > Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
>> > Signed-off-by: Guillaume Nault 
>> 
>> During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
>> mutex is held.
>> 
>> Therefore no tunnel configuration changes may occur and the tunnel
>> object will persist and is safe to access.
>> 
> Yes, but only for updates done with the genl API. For L2TPv2, the
> tunnel can be created by connecting a PPPOL2TP and a UDP socket.
> Closing these sockets destroys the tunnel without any RTNL
> synchronisation.

Right, that's the part I missed.  Thanks for explaining.



Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread David Miller
From: Edward Cree 
Date: Fri, 13 Apr 2018 16:59:07 +0100

> On 13/04/18 16:03, David Miller wrote:
>> Whilst you may not be able to program the filter into the hardware
>> synchronously, you should be able to allocate the ID and get all of
>> the software state setup.
> That's what we were doing before commit 3af0f34290f6 ("sfc: replace
>  asynchronous filter operations"), and (as mentioned in that commit)
>  that leads to (or at least the scheme we used had) race conditions
>  which I could not see a way to fix.  If the hardware resets (and
>  thus forgets all its filters) after we've done the software state
>  setup but before we reach the point of finalising the software state
>  after the hardware operation, we don't know what operations we need
>  to do to re-apply the software state to the hardware, because we
>  don't know whether the reset happened before or after the hardware
>  operation.

When an entry is successfully programmed into the chip, you update
the software state.

When the chip resets, you clear all of those state booleans to false.

Indeed, you would have to synchronize these things somehow.

Is the issue that you learn about the hardware reset asynchronously,
and therefore cannot determine if filter insertion programming
happened afterwards and thus is still in the chip?

You must have a table of all the entries, so that you can reprogram
the hardware should it reset.  Or do you not handle things that way
and it's a lossy system?

> Well, the alternative, even if the software state setup part _could_
>  be made synchronous, is to allow a potentially unbounded queue for
>  the hardware update part (I think there are even still cases in
>  which the exponential growth pathology is possible), causing the
>  filter insertions to be delayed an arbitrarily long time.  Either
>  the flow is still going by that time (in which case the backlog
>  limit approach will get a new ndo_rx_flow_steer request and insert
>  the filter too) or it isn't, in which case getting round to it
>  eventually is no better than dropping it immediately.  In fact it's
>  worse because now you waste time inserting a useless filter which
>  delays new requests even more.
> Besides, I'm fairly confident that the only cases in which you'll
>  even come close to hitting the limit are ones where ARFS wouldn't
>  do you much good anyway, such as:
> * Misconfigured interrupt affinities where ARFS is entirely pointless
> * Many short-lived flows (which greatly diminish the utility of ARFS)
> 
> So for multiple reasons, hitting the limit won't actually make
>  performance worse, although it will often be a sign that performance
>  will be bad for other reasons.

Understood, thanks for explaining.

Please respin your series with the updates you talked about and I'll
apply it.

But generally we do have this issue with various kinds of
configuration programming and async vs. sync.


Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps

2018-04-13 Thread Guillaume Nault
On Fri, Apr 13, 2018 at 10:57:03AM -0400, David Miller wrote:
> From: Guillaume Nault 
> Date: Thu, 12 Apr 2018 20:50:33 +0200
> 
> > l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
> > tunnel, therefore it can be freed whenever the caller uses it.
> > This patch defines l2tp_tunnel_get_nth() which works similarly, but
> > also takes a reference on the returned tunnel. The caller then has to
> > drop it after it stops using the tunnel.
> > 
> > Convert netlink dumps to make them safe against concurrent tunnel
> > deletion.
> > 
> > Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
> > Signed-off-by: Guillaume Nault 
> 
> During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
> mutex is held.
> 
> Therefore no tunnel configuration changes may occur and the tunnel
> object will persist and is safe to access.
> 
Yes, but only for updates done with the genl API. For L2TPv2, the
tunnel can be created by connecting a PPPOL2TP and a UDP socket.
Closing these sockets destroys the tunnel without any RTNL
synchronisation.

> The netlink dump should be safe as-is.
> 
> Were you actually able to trigger a crash or KASAN warning or is
> this purely from code inspection?
> 
Yes, I have a KASAN use-after-free for this case. I remember I saw a
few complains about stack traces in commit messages, so I've stopped
putting them there. I can paste (stripped) traces again. Just let me
know if you have any preference.

Guillaume



Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread Edward Cree
On 13/04/18 16:03, David Miller wrote:
> Whilst you may not be able to program the filter into the hardware
> synchronously, you should be able to allocate the ID and get all of
> the software state setup.
That's what we were doing before commit 3af0f34290f6 ("sfc: replace
 asynchronous filter operations"), and (as mentioned in that commit)
 that leads to (or at least the scheme we used had) race conditions
 which I could not see a way to fix.  If the hardware resets (and
 thus forgets all its filters) after we've done the software state
 setup but before we reach the point of finalising the software state
 after the hardware operation, we don't know what operations we need
 to do to re-apply the software state to the hardware, because we
 don't know whether the reset happened before or after the hardware
 operation.
Actually, it's not the insertion itself that this can happen to - if
 the reset happens first then the insert will fail because other
 hardware state we reference isn't set up yet.  However, inserting
 one filter can necessitate removing some others (lower-priority
 multicast filters for the same flow); the code before 3af0f34290f6
 was actually capable of getting so confused that it could double-
 free a pointer.
This all gets sufficiently complicated that even if I can find a
 locking scheme that in theory gets it right, there's pretty much a
 100% chance that the actual implementation will contain bugs,
 probably very subtle ones that can't reliably be reproduced etc.
 All my instincts say to get away from that if at all possible.

> People really aren't going to be happy if their performance goes into
> the tank because their filter update rate, for whatever reason, hits
> this magic backlog limit.
Well, the alternative, even if the software state setup part _could_
 be made synchronous, is to allow a potentially unbounded queue for
 the hardware update part (I think there are even still cases in
 which the exponential growth pathology is possible), causing the
 filter insertions to be delayed an arbitrarily long time.  Either
 the flow is still going by that time (in which case the backlog
 limit approach will get a new ndo_rx_flow_steer request and insert
 the filter too) or it isn't, in which case getting round to it
 eventually is no better than dropping it immediately.  In fact it's
 worse because now you waste time inserting a useless filter which
 delays new requests even more.
Besides, I'm fairly confident that the only cases in which you'll
 even come close to hitting the limit are ones where ARFS wouldn't
 do you much good anyway, such as:
* Misconfigured interrupt affinities where ARFS is entirely pointless
* Many short-lived flows (which greatly diminish the utility of ARFS)

So for multiple reasons, hitting the limit won't actually make
 performance worse, although it will often be a sign that performance
 will be bad for other reasons.

-Ed


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-13 Thread Dmitry Vyukov
On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman  wrote:
> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov  wrote:
>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>>  wrote:
>>> On 20.02.2018 18:26, Neil Horman wrote:

 On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>
> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>  wrote:
>>
>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>
>>> Is this meant to be fixed already? I am still seeing this on the
>>> latest upstream tree.
>>>
>>
>> These two commits are in v4.16-rc1:
>>
>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>> Author: Tommi Rantala 
>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>
>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>> ...
>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using
>> secondary
>> addresses")
>>
>>
>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>> Author: Alexey Kodanev 
>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>
>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>> ...
>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>> secondary
>> addresses for ipv6")
>>
>>
>> I guess we missed something if it's still reproducible.
>>
>> I can check it later this week, unless someone else beat me to it.
>
>
> Hi Tommi,
>
> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> another one then. But I am still seeing these:
>
> [   58.799130] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
> [   60.847138] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
> [   62.895093] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
> [   64.943103] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
>
> on upstream tree pulled ~12 hours ago.
>
 Can you write a systemtap script to probe dev_hold, and dev_put, printing
 out a
 backtrace if the device name matches "lo".  That should tell us
 definitively if
 the problem is in the same location or not
>>>
>>>
>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>> sent in the first email in this thread:
>>>
>>> With 4.16-rc2 unable to reproduce.
>>>
>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>> lo to become free. Usage count = 3"
>>>
>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>> cherry-picked on top, unable to reproduce.
>>>
>>>
>>> Is syzkaller doing something else now to trigger the bug...?
>>> Can you still trigger the bug with the same reproducer?
>>
>> Hi Neil, Tommi,
>>
>> Reviving this old thread about "unregister_netdevice: waiting for lo
>> to become free. Usage count = 3" hangs.
>> I still did not have time to deep dive into what happens there (too
>> many bugs coming from syzbot). But this still actively happens and I
>> suspect accounts to a significant portion of various hang reports,
>> which are quite unpleasant.
>>
>> One idea that could make it all simpler:
>>
>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>> prolonged periods of time under any non-buggy conditions? E.g. more
>> than 1-2 minutes?
>> If it only supposed to wait briefly for things that already supposed
>> to be shutting down, and we add a WARNING there after some timeout,
>> then syzbot will report all info how/when it happens, hopefully
>> extracting reproducers, and all the nice things.
>> But this WARNING should not have any false positives under any
>> realistic conditions (e.g. waiting for arrival of remote packets with
>> large timeouts).
>>
>> Looking at some task hung reports, it seems that this code holds some
>> mutexes, takes workqueue thread and prevents any progress with
>> destruction of other devices (and net namespace creation/destruction),
>> so I guess it should not wait for any indefinite periods of time?
>
> I'm working on this currently:
> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>
> I added a summary of what I've found to be the cause (or at least, one
> possible cause) of this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>
> I'm working on a patch to work around the main side-effect of this,
> which is hanging while holding the global net mutex.  Hangs will still
> happen (e.g. if a dst leaks) but should not affect anything else,
> other than a leak of the dst and its net namespace.
>
> Fixing the dst leaks is important too, of course, but a dst leak (or
> other cause) shouldn't break the entire system.


RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-13 Thread Keller, Jacob E


> -Original Message-
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, April 13, 2018 7:07 AM
> To: Jakub Kicinski 
> Cc: Tal Gilboa ; Tariq Toukan ;
> Keller, Jacob E ; Ariel Elior 
> ;
> Ganesh Goudar ; Kirsher, Jeffrey T
> ; everest-linux...@cavium.com; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link 
> speed
> and whether it's limited
> 
> On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > > + if (bw_avail >= bw_cap)
> > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > > +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > + else
> > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> link at %s (capable of %d Mb/s with %s x%d link)\n",
> > > +  bw_avail, PCIE_SPEED2STR(speed), width,
> > > +  limiting_dev ? pci_name(limiting_dev) : "",
> > > +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> >
> > I was just looking at using this new function to print PCIe BW for a
> > NIC, but I'm slightly worried that there is nothing in the message that
> > says PCIe...  For a NIC some people may interpret the bandwidth as NIC
> > bandwidth:
> >
> > [   39.839989] nfp :04:00.0: Netronome Flow Processor NFP4000/NFP6000
> PCIe Card Probe
> > [   39.848943] nfp :04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 
> > link)
> > [   39.857146] nfp :04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 
> > 0.1:
> PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
> >
> > It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
> > didn't find it.  Would it make sense to add the "PCIe: " prefix to the
> > message like bnx2x used to do?  Like:
> >
> > nfp :04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> 
> I agree, that does look potentially confusing.  How about this:
> 
>   nfp :04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
> 
> I did have to look twice at this before I remembered that we're
> printing Gb/s (not GB/s).  Most of the references I found on the web
> use GB/s when talking about total PCIe bandwidth.
> 
> But either way I think it's definitely worth mentioning PCIe
> explicitly.

I also agree printing PCIe explicitly is good.

Thanks,
Jake


Re: [PATCH net] virtio-net: add missing virtqueue kick when flushing packets

2018-04-13 Thread David Miller
From: Jason Wang 
Date: Fri, 13 Apr 2018 14:58:25 +0800

> We tends to batch submitting packets during XDP_TX. This requires to
> kick virtqueue after a batch, we tried to do it through
> xdp_do_flush_map() which only makes sense for devmap not XDP_TX. So
> explicitly kick the virtqueue in this case.
> 
> Reported-by: Kimitoshi Takahashi 
> Tested-by: Kimitoshi Takahashi 
> Cc: Daniel Borkmann 
> Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
> Signed-off-by: Jason Wang 

Applied and queued up for -stable, thanks Jason.


Re: [RFC v2] virtio: support packed ring

2018-04-13 Thread Michael S. Tsirkin
On Sun, Apr 01, 2018 at 10:12:16PM +0800, Tiwei Bie wrote:
> +static inline bool more_used(const struct vring_virtqueue *vq)
> +{
> + return vq->packed ? more_used_packed(vq) : more_used_split(vq);
> +}
> +
> +void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, unsigned int *len,
> +   void **ctx)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + void *ret;
> + unsigned int i;
> + u16 last_used;
> +
> + START_USE(vq);
> +
> + if (unlikely(vq->broken)) {
> + END_USE(vq);
> + return NULL;
> + }
> +
> + if (!more_used(vq)) {
> + pr_debug("No more buffers in queue\n");
> + END_USE(vq);
> + return NULL;
> + }

So virtqueue_get_buf_ctx_split should only call more_used_split.

to avoid such issues I think we should lay out the code like this:

XXX_split

XXX_packed

XXX wrappers

> +/* The standard layout

I'd drop standard here.

> for the packed ring is a continuous chunk of memory
> + * which looks like this.
> + *
> + * struct vring_packed
> + * {

Can the opening bracket go on the prev line pls?

> + *   // The actual descriptors (16 bytes each)
> + *   struct vring_packed_desc desc[num];
> + *
> + *   // Padding to the next align boundary.
> + *   char pad[];
> + *
> + *   // Driver Event Suppression
> + *   struct vring_packed_desc_event driver;
> + *
> + *   // Device Event Suppression
> + *   struct vring_packed_desc_event device;

Maybe that's how our driver does it but it's not based on spec
so I don't think this belongs in the header.

> + * };
> + */
> +
> +static inline unsigned vring_packed_size(unsigned int num, unsigned long 
> align)
> +{
> + return ((sizeof(struct vring_packed_desc) * num + align - 1)
> + & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> +}
> +

Cant say this API makes sense for me.


>  #endif /* _UAPI_LINUX_VIRTIO_RING_H */
> -- 
> 2.11.0


Re: Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")

2018-04-13 Thread Alexei Starovoitov
On Fri, Apr 13, 2018 at 03:22:37PM +0200, Jesper Dangaard Brouer wrote:
> Hi Peter,
> 
> Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build
> for several samples/bpf programs. I'm unsure what the best way forward
> is to unbreak these...
> 
> The issue is that these samples are build with LLVM/clang (which
> doesn't like 'asm goto' constructs).  And they end up including
> arch/x86/include/asm/cpufeature.h via a long include path, see build
> examples below (through different path to include/linux/thread_info.h).
> 
> Maybe Alexei or Daniel have an idea how to work around this?
> As tools/testing/selftests/bpf/ does not seem to fail!?

Right. All of bpf tracing and samples/bpf/ broke.
Here is the proposed fix that we're asking Peter to apply and send to Linus 
asap.
https://lkml.org/lkml/2018/4/10/825

> Build error#1:
> --
> clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include 
> -I./arch/x86/include -I./arch/x86/include/generated  -I./include 
> -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi 
> -I./include/uapi -I./include/generated/uapi -include 
> ./include/linux/kconfig.h  -Isamples/bpf \
> -I./tools/testing/selftests/bpf/ \
> -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
> -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
> -Wno-gnu-variable-sized-type-not-at-end \
> -Wno-address-of-packed-member -Wno-tautological-compare \
> -Wno-unknown-warning-option  \
> -O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf 
> -filetype=obj -o samples/bpf/sockex2_kern.o
> In file included from samples/bpf/sockex2_kern.c:3:
> In file included from ./include/uapi/linux/in.h:24:
> In file included from ./include/linux/socket.h:8:
> In file included from ./include/linux/uio.h:13:
> In file included from ./include/linux/thread_info.h:38:
> In file included from ./arch/x86/include/asm/thread_info.h:53:
> ./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are 
> not supported yet
> asm_volatile_goto("1: jmp 6f\n"
> ^
> ./include/linux/compiler-gcc.h:290:42: note: expanded from macro 
> 'asm_volatile_goto'
> #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)


Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread David Miller
From: Edward Cree 
Date: Fri, 13 Apr 2018 15:52:25 +0100

> On 13/04/18 15:45, David Miller wrote:
>> I understand the constraints you are working under, but do realize
>> that the real root of the problems is that you are implementing what
>> is defined clearly as a synchronous operation as asynchronous.
> Yes, it is unfortunate that we are unable to perform synchronous filter
>  insertions, but you go to war with the hardware you have :(

Whilst you may not be able to program the filter into the hardware
synchronously, you should be able to allocate the ID and get all of
the software state setup.

At least then you can return the proper return value from the netdev
op.

People really aren't going to be happy if their performance goes into
the tank because their filter update rate, for whatever reason, hits
this magic backlog limit.


Re: tcp hang when socket fills up ?

2018-04-13 Thread Eric Dumazet


On 04/06/2018 02:07 AM, Dominique Martinet wrote:
> (current kernel: vanilla 4.14.29)
> 
> I've been running into troubles recently where a sockets "fills up" (as
> in, select() will no longer return it in its outfd / attempting to write
> to it after setting it to NONBLOCK will return -EWOULDBLOCK) and it
> doesn't seem to ever "unfill" until the tcp connexion timeout.
> 
> The previous time I pushed it down to the application for not trying to
> read the socket either as I assume the buffers could be shared and
> just waiting won't take data out, but this time I'm a bit more
> skeptical as socat waits for the fd in both read and write...
> 
> Let me take a minute to describe my setup, I don't think that how the
> socket was created matters but it might be interesting:
>  - I have two computers behind NATs, no port forwarding on either side
>  - One (call it C for client) runs ssh with a proxycommand ncat/socat to
> control the source port, e.g.
> $ ssh -o ProxyCommand="socat stdio tcp: ip>:,sourceport=" server
>  - The server runs another socat to connect to that and forwards to ssh
> locally, e.g.
> $ socat tcp::,sourceport= tcp:127.0.0.1:22
> 
> (yes, both are connect() calls, and that just works™ thanks to initial
> syn replay and conntrack on routers)
> 
> When things stall, the first socat is in a select with both fd in
> reading, so it's waiting data.
> The second socat has data coming from ssh and is in a select with the
> client-facing socket in both read and write - that select never returns
> so the socket is not available for reading or writing and does not free
> up until the connection eventually times out a few minutes later.
> 
> At this point, I only see tcp replays in tcpdump/wireshark. I've
> compared dumps from both sides and there are no lost packets, only
> reordering - there always is a batch of acks that were sent regularily
> coming in shortly before the hang. Here's the trace on the server:
> 
> 16:49:26.735042 IP .13317 > .31872: Flags 
> [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735046 IP .13317 > .31872: Flags 
> [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735334 IP .31872 > .13317: Flags 
> [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736005 IP .31872 > .13317: Flags 
> [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736402 IP .13317 > .31872: Flags 
> [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.736408 IP .13317 > .31872: Flags 
> [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.738561 IP .31872 > .13317: Flags 
> [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739539 IP .31872 > .13317: Flags 
> [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739882 IP .31872 > .13317: Flags 
> [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.740255 IP .31872 > .13317: Flags 
> [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.746756 IP .31872 > .13317: Flags 
> [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], 
> length 0
> 16:49:26.747923 IP .31872 > .13317: Flags 
> [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], 
> length 0
> 16:49:26.749083 IP .31872 > .13317: Flags 
> [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], 
> length 0
> 16:49:26.750171 IP .31872 > .13317: Flags 
> [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], 
> length 0
> 16:49:26.750808 IP .31872 > .13317: Flags 
> [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], 
> length 0
> 16:49:26.754648 IP .31872 > .13317: Flags 
> [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], 
> length 0
> 16:49:26.755985 IP .31872 > .13317: Flags 
> [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], 
> length 0
> 16:49:26.758513 IP .31872 > .13317: Flags 
> [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], 
> length 0
> 16:49:26.759096 IP .31872 > .13317: Flags 
> [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759421 IP .31872 > .13317: Flags 
> [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759755 IP .31872 > .13317: Flags 
> [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], 
> length 0
> 16:49:26.760653 IP .31872 > .13317: Flags 
> [.], ack 64980, win 1302, options [nop,nop,TS val 

Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps

2018-04-13 Thread David Miller
From: Guillaume Nault 
Date: Thu, 12 Apr 2018 20:50:33 +0200

> l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
> tunnel, therefore it can be freed whenever the caller uses it.
> This patch defines l2tp_tunnel_get_nth() which works similarly, but
> also takes a reference on the returned tunnel. The caller then has to
> drop it after it stops using the tunnel.
> 
> Convert netlink dumps to make them safe against concurrent tunnel
> deletion.
> 
> Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
> Signed-off-by: Guillaume Nault 

During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
mutex is held.

Therefore no tunnel configuration changes may occur and the tunnel
object will persist and is safe to access.

The netlink dump should be safe as-is.

Were you actually able to trigger a crash or KASAN warning or is
this purely from code inspection?


Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error

2018-04-13 Thread Arnd Bergmann
On Fri, Apr 13, 2018 at 3:15 PM, Pablo Neira Ayuso  wrote:
> On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
>> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso  
>> wrote:
>> > Hi Arnd,
>> >
>> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
>> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and 
>> >> CONFIG_NF_REJECT_IPV6=m
>> >
>> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
>> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
>> > and CONFIG_NF_REJECT_IPV6=m.
>> >
>> > I mean, just like we do with NFT_FIB_INET.
>>
>> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
>> again, so that code gets built as a loadable module if
>> CONFIG_NF_REJECT_IPV6=m.
>>
>> > BTW, I think this problem has been is not related to the recent patch,
>> > but something older that kbuild robot has triggered more easily for
>> > some reason?
>>
>> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
>> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
>> restricted to a loadable module with IPV6=m, but can now be
>> built-in, which causes that link error.
>
> Still one more spin on this, I would like to see if we have a way to
> fix this by simplifing things a bit.
>
> Would this one I'm attaching would work?

One disadvantage is that it makes the vmlinux bigger since
NF_REJECT_IPV{4,6} can no longer be a module at all now.

I suspect you also stil get a link error with IPV6=m, this time because
the nf_reject_ipv6.o file fails to link against the ipv6 code, e.g.
ipv6_skip_exthdr() and icmpv6_send() appear to be unreachable here.
I haven't tried that though, so I might be missing something.

Arnd


Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread Edward Cree
On 13/04/18 13:36, Edward Cree wrote:
> It turns out this may all be moot anyway: I figured out why I was seeing
>  ARFS storms and it wasn't the configuration issue I originally blamed.
Hmm, correction, while the fix I mentioned in my previous email is needed,
 it doesn't prevent the ARFS storms (although seems to lessen their
 severity, given that the machine didn't actually fall over this time),
 so we do also need some kind of limiting.

On 12/04/18 16:33, David Miller wrote:
> Then simply make the work process a queue, and add entries to the queue
> here if the work is already scheduled.
>
> Is there a reason why that wouldn't work?
That has the same problem as the existing code, that the length of the queue
 can grow without bound, potentially causing a very long lag between the
 request and its execution.  This then can quickly become exponential as,
 while waiting for the filter to be inserted, further packets from the same
 flow are received (still unsteered) and trigger duplicate ARFS requests
 (though I suppose it would be possible to scan the queue for matching flow
 IDs; but the concurrency / locking problems with that are 'interesting').

I'm not sure why you object to the dropping of requests - it seems reasonable
 to me to treat ARFS as a 'best-effort' thing.  The packets will still be
 received (just not necessarily on the core nearest the application), and if
 the flow continues it will generate more steering requests after the ones
 currently in flight have been processed.
And in practice we only get into this situation in the first place when we
 have interrupt affinities configured in such a way as to make ARFS
 practically useless anyway, so our failure to insert the filters is not of
 great significance.

On 13/04/18 15:45, David Miller wrote:
> I understand the constraints you are working under, but do realize
> that the real root of the problems is that you are implementing what
> is defined clearly as a synchronous operation as asynchronous.
Yes, it is unfortunate that we are unable to perform synchronous filter
 insertions, but you go to war with the hardware you have :(

-Ed


Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread David Miller
From: Edward Cree 
Date: Fri, 13 Apr 2018 13:36:28 +0100

> It turns out this may all be moot anyway: I figured out why I was seeing
>  ARFS storms and it wasn't the configuration issue I originally blamed.
> My current ndo_rx_flow_steer() implementation, efx_filter_rfs(), returns
>  0 for success, but the caller expects a filter ID to be returned (which
>  we can't give it because we don't know what the filter ID will be until
>  we start mucking around in the software state that's now protected by a
>  sleepable lock).
> As a result, when we call rps_may_expire_flow(), and pass it the _actual_
>  filter ID, this doesn't match the one set_rps_cpu() recorded, so the
>  function returns true and we immediately expire the filter.  Then the
>  next packet to come along isn't steered, so ARFS asks us to insert a
>  steering filter again.
> As a quick fix I've simply tried making the rps_may_expire_flow() calls
>  also pass a filter ID of 0, which prevents the ARFS storms.  This is
>  safe; it may cause us to delay expiring a filter when flow_ids collide,
>  but that can happen anyway with other drivers' implementations (e.g.
>  mlx4 and mlx5 can potentially reuse filter IDs) so I presume it is OK.
> I'll post a v2 with that fix in place of this Patch #2 shortly, then try
>  to follow up with a counter-generated ID (similar to what mlx have).

I understand the constraints you are working under, but do realize
that the real root of the problems is that you are implementing what
is defined clearly as a synchronous operation as asynchronous.


Re: Atlantic driver 4.16 stable request

2018-04-13 Thread David Miller
From: Igor Russkikh 
Date: Fri, 13 Apr 2018 15:03:29 +0300

> Could you please consider queuing to v4.16:
> 
> 9a11aff net: aquantia: oops when shutdown on already stopped device
> cce96d1 net: aquantia: Regression on reset with 1.x firmware
> 
> These are both critical and well tested by our team.

Queued up, thank you.


Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-13 Thread Bjorn Helgaas
On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > +   if (bw_avail >= bw_cap)
> > +   pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > +   else
> > +   pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d 
> > link at %s (capable of %d Mb/s with %s x%d link)\n",
> > +bw_avail, PCIE_SPEED2STR(speed), width,
> > +limiting_dev ? pci_name(limiting_dev) : "",
> > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> 
> I was just looking at using this new function to print PCIe BW for a
> NIC, but I'm slightly worried that there is nothing in the message that
> says PCIe...  For a NIC some people may interpret the bandwidth as NIC
> bandwidth:
> 
> [   39.839989] nfp :04:00.0: Netronome Flow Processor NFP4000/NFP6000 
> PCIe Card Probe
> [   39.848943] nfp :04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 
> link)
> [   39.857146] nfp :04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1: 
> PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
> 
> It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
> didn't find it.  Would it make sense to add the "PCIe: " prefix to the
> message like bnx2x used to do?  Like:
> 
> nfp :04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)

I agree, that does look potentially confusing.  How about this:

  nfp :04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)

I did have to look twice at this before I remembered that we're
printing Gb/s (not GB/s).  Most of the references I found on the web
use GB/s when talking about total PCIe bandwidth.

But either way I think it's definitely worth mentioning PCIe
explicitly.


Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")

2018-04-13 Thread Jesper Dangaard Brouer
Hi Peter,

Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build
for several samples/bpf programs. I'm unsure what the best way forward
is to unbreak these...

The issue is that these samples are build with LLVM/clang (which
doesn't like 'asm goto' constructs).  And they end up including
arch/x86/include/asm/cpufeature.h via a long include path, see build
examples below (through different path to include/linux/thread_info.h).

Maybe Alexei or Daniel have an idea how to work around this?
As tools/testing/selftests/bpf/ does not seem to fail!?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Build error#1:
--
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include 
-I./arch/x86/include -I./arch/x86/include/generated  -I./include 
-I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi 
-I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
-I./tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-Wno-unknown-warning-option  \
-O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf 
-filetype=obj -o samples/bpf/sockex2_kern.o
In file included from samples/bpf/sockex2_kern.c:3:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not 
supported yet
asm_volatile_goto("1: jmp 6f\n"
^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 
'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
 ^


Build error#2:
--
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include 
-I./arch/x86/include -I./arch/x86/include/generated  -I./include 
-I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi 
-I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
-I./tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-Wno-unknown-warning-option  \
-O2 -emit-llvm -c samples/bpf/tracex1_kern.c -o -| llc -march=bpf 
-filetype=obj -o samples/bpf/tracex1_kern.o
In file included from samples/bpf/tracex1_kern.c:7:
In file included from ./include/linux/skbuff.h:19:
In file included from ./include/linux/time.h:6:
In file included from ./include/linux/seqlock.h:36:
In file included from ./include/linux/spinlock.h:51:
In file included from ./include/linux/preempt.h:81:
In file included from ./arch/x86/include/asm/preempt.h:7:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not 
supported yet
asm_volatile_goto("1: jmp 6f\n"
^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 
'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
 ^


Build error#3:
--
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include 
-I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86
/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi 
-I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
-I./tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-Wno-unknown-warning-option  \
-O2 -emit-llvm -c samples/bpf/xdp1_kern.c -o -| llc -march=bpf 
-filetype=obj -o samples/bpf/xdp1_kern.o
In file included from samples/bpf/xdp1_kern.c:9:
In file included from ./include/linux/in.h:23:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not 
supported yet
asm_volatile_goto("1: jmp 

Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error

2018-04-13 Thread Pablo Neira Ayuso
On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso  wrote:
> > Hi Arnd,
> >
> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and 
> >> CONFIG_NF_REJECT_IPV6=m
> >
> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
> > and CONFIG_NF_REJECT_IPV6=m.
> >
> > I mean, just like we do with NFT_FIB_INET.
> 
> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
> again, so that code gets built as a loadable module if
> CONFIG_NF_REJECT_IPV6=m.
> 
> > BTW, I think this problem has been is not related to the recent patch,
> > but something older that kbuild robot has triggered more easily for
> > some reason?
> 
> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
> restricted to a loadable module with IPV6=m, but can now be
> built-in, which causes that link error.

Still one more spin on this, I would like to see if we have a way to
fix this by simplifing things a bit.

Would this one I'm attaching would work?

Thanks for you patience.
>From af07bc7ff5d34ce54e7913233912c058e6699e3c Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso 
Date: Fri, 13 Apr 2018 10:48:40 +0200
Subject: [PATCH] netfilter: CONFIG_NF_REJECT_IPV{4,6} becomes bool toggle

Arnd reports that we get a new link error with CONFIG_NFT_REJECT_INET=y
and CONFIG_NF_REJECT_IPV6=m after larger parts of the nftables modules
are linked together:

net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval':
nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6'
nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6'

The problem is that with NF_TABLES_INET set, we implicitly try to use
the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to
a loadable module, it's impossible to reach that.

This patch fixes this problem by building-in nf_reject_ipv{4,6}.c, IPv6
symbol dependencies for the IPv6 reject infrastructure are located in
exthdrs_core.c, ip6_checksum.c and ip6_icmp.c which are also built-in,
so let's do the same to simplify this.

Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type")
Reported-by: Arnd Bergmann 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/Kconfig | 3 +--
 net/ipv6/netfilter/Kconfig | 3 +--
 net/netfilter/Kconfig  | 2 ++
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index 280048e1e395..3e4e0ae2a9a1 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -104,8 +104,7 @@ config NF_LOG_IPV4
 	select NF_LOG_COMMON
 
 config NF_REJECT_IPV4
-	tristate "IPv4 packet rejection"
-	default m if NETFILTER_ADVANCED=n
+	bool "IPv4 packet rejection"
 
 config NF_NAT_IPV4
 	tristate "IPv4 NAT"
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index ccbfa83e4bb0..1e5d040a60b8 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -87,8 +87,7 @@ config NF_DUP_IPV6
 	  packet to be rerouted to another destination.
 
 config NF_REJECT_IPV6
-	tristate "IPv6 packet rejection"
-	default m if NETFILTER_ADVANCED=n
+	bool "IPv6 packet rejection"
 
 config NF_LOG_IPV6
 	tristate "IPv6 packet logging"
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 4189f574f5ec..d7b3272fe821 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -609,6 +609,8 @@ config NFT_REJECT
 
 config NFT_REJECT_INET
 	depends on NF_TABLES_INET
+	select NF_REJECT_IPV4
+	select NF_REJECT_IPV6
 	default NFT_REJECT
 	tristate
 
-- 
2.11.0



Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-13 Thread Dan Streetman
On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov  wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>  wrote:
>> On 20.02.2018 18:26, Neil Horman wrote:
>>>
>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:

 On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
  wrote:
>
> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>
>> Is this meant to be fixed already? I am still seeing this on the
>> latest upstream tree.
>>
>
> These two commits are in v4.16-rc1:
>
> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> Author: Tommi Rantala 
> Date:   Mon Feb 5 21:48:14 2018 +0200
>
>  sctp: fix dst refcnt leak in sctp_v4_get_dst
> ...
>  Fixes: 410f03831 ("sctp: add routing output fallback")
>  Fixes: 0ca50d12f ("sctp: fix src address selection if using
> secondary
> addresses")
>
>
> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> Author: Alexey Kodanev 
> Date:   Mon Feb 5 15:10:35 2018 +0300
>
>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
> ...
>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
> secondary
> addresses for ipv6")
>
>
> I guess we missed something if it's still reproducible.
>
> I can check it later this week, unless someone else beat me to it.


 Hi Tommi,

 Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
 another one then. But I am still seeing these:

 [   58.799130] unregister_netdevice: waiting for lo to become free.
 Usage count = 4
 [   60.847138] unregister_netdevice: waiting for lo to become free.
 Usage count = 4
 [   62.895093] unregister_netdevice: waiting for lo to become free.
 Usage count = 4
 [   64.943103] unregister_netdevice: waiting for lo to become free.
 Usage count = 4

 on upstream tree pulled ~12 hours ago.

>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>> out a
>>> backtrace if the device name matches "lo".  That should tell us
>>> definitively if
>>> the problem is in the same location or not
>>
>>
>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>> sent in the first email in this thread:
>>
>> With 4.16-rc2 unable to reproduce.
>>
>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>> lo to become free. Usage count = 3"
>>
>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>> cherry-picked on top, unable to reproduce.
>>
>>
>> Is syzkaller doing something else now to trigger the bug...?
>> Can you still trigger the bug with the same reproducer?
>
> Hi Neil, Tommi,
>
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
>
> One idea that could make it all simpler:
>
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
>
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?

I'm working on this currently:
https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407

I added a summary of what I've found to be the cause (or at least, one
possible cause) of this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72

I'm working on a patch to work around the main side-effect of this,
which is hanging while holding the global net mutex.  Hangs will still
happen (e.g. if a dst leaks) but should not affect anything else,
other than a leak of the dst and its net namespace.

Fixing the dst leaks is important too, of course, but a dst leak (or
other cause) shouldn't break the entire system.


Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-13 Thread Edward Cree
It turns out this may all be moot anyway: I figured out why I was seeing
 ARFS storms and it wasn't the configuration issue I originally blamed.
My current ndo_rx_flow_steer() implementation, efx_filter_rfs(), returns
 0 for success, but the caller expects a filter ID to be returned (which
 we can't give it because we don't know what the filter ID will be until
 we start mucking around in the software state that's now protected by a
 sleepable lock).
As a result, when we call rps_may_expire_flow(), and pass it the _actual_
 filter ID, this doesn't match the one set_rps_cpu() recorded, so the
 function returns true and we immediately expire the filter.  Then the
 next packet to come along isn't steered, so ARFS asks us to insert a
 steering filter again.
As a quick fix I've simply tried making the rps_may_expire_flow() calls
 also pass a filter ID of 0, which prevents the ARFS storms.  This is
 safe; it may cause us to delay expiring a filter when flow_ids collide,
 but that can happen anyway with other drivers' implementations (e.g.
 mlx4 and mlx5 can potentially reuse filter IDs) so I presume it is OK.
I'll post a v2 with that fix in place of this Patch #2 shortly, then try
 to follow up with a counter-generated ID (similar to what mlx have).

-Ed


Atlantic driver 4.16 stable request

2018-04-13 Thread Igor Russkikh
Hi David,

Could you please consider queuing to v4.16:

9a11aff net: aquantia: oops when shutdown on already stopped device
cce96d1 net: aquantia: Regression on reset with 1.x firmware

These are both critical and well tested by our team.

Thanks in advance!


[PATCH net] team: avoid adding twice the same option to the event list

2018-04-13 Thread Paolo Abeni
When parsing the options provided by the user space,
team_nl_cmd_options_set() insert them in a temporary list to send
multiple events with a single message.
While each option's attribute is correctly validated, the code does
not check for duplicate entries before inserting into the event
list.

Exploiting the above, the syzbot was able to trigger the following
splat:

kernel BUG at lib/list_debug.c:31!
invalid opcode:  [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4466 Comm: syzkaller556835 Not tainted 4.16.0+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__list_add_valid+0xaa/0xb0 lib/list_debug.c:29
RSP: 0018:8801b04bf248 EFLAGS: 00010286
RAX: 0058 RBX: 8801c8fc7a90 RCX: 
RDX: 0058 RSI: 815fbf41 RDI: ed0036097e3f
RBP: 8801b04bf260 R08: 8801b0b2a700 R09: ed003b604f90
R10: ed003b604f90 R11: 8801db027c87 R12: 8801c8fc7a90
R13: 8801c8fc7a90 R14: dc00 R15: 
FS:  00b98880() GS:8801db00() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0043fc30 CR3: 0001afe8e000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
  __list_add include/linux/list.h:60 [inline]
  list_add include/linux/list.h:79 [inline]
  team_nl_cmd_options_set+0x9ff/0x12b0 drivers/net/team/team.c:2571
  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
  __sys_sendmsg+0x115/0x270 net/socket.c:2155
  SYSC_sendmsg net/socket.c:2164 [inline]
  SyS_sendmsg+0x29/0x30 net/socket.c:2162
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4458b9
RSP: 002b:7ffd1d4a7278 EFLAGS: 0213 ORIG_RAX: 002e
RAX: ffda RBX: 001b RCX: 004458b9
RDX: 0010 RSI: 2d00 RDI: 0004
RBP: 004a74ed R08:  R09: 
R10:  R11: 0213 R12: 7ffd1d4a7348
R13: 00402a60 R14:  R15: 
Code: 75 e8 eb a9 48 89 f7 48 89 75 e8 e8 d1 85 7b fe 48 8b 75 e8 eb bb 48
89 f2 48 89 d9 4c 89 e6 48 c7 c7 a0 84 d8 87 e8 ea 67 28 fe <0f> 0b 0f 1f
40 00 48 b8 00 00 00 00 00 fc ff df 55 48 89 e5 41
RIP: __list_add_valid+0xaa/0xb0 lib/list_debug.c:29 RSP: 8801b04bf248

This changeset addresses the avoiding list_add() if the current
option is already present in the event list.

Reported-and-tested-by: syzbot+4d4af685432dc0e56...@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni 
Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")
---
 drivers/net/team/team.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index a6c6ce19..acbe84967834 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -261,6 +261,17 @@ static void __team_option_inst_mark_removed_port(struct 
team *team,
}
 }
 
+static bool __team_option_inst_tmp_find(const struct list_head *opts,
+   const struct team_option_inst *needle)
+{
+   struct team_option_inst *opt_inst;
+
+   list_for_each_entry(opt_inst, opts, tmp_list)
+   if (opt_inst == needle)
+   return true;
+   return false;
+}
+
 static int __team_options_register(struct team *team,
   const struct team_option *option,
   size_t option_count)
@@ -2568,6 +2579,14 @@ static int team_nl_cmd_options_set(struct sk_buff *skb, 
struct genl_info *info)
if (err)
goto team_put;
opt_inst->changed = true;
+
+   /* dumb/evil user-space can send us duplicate opt,
+* keep only the last one
+*/
+   if (__team_option_inst_tmp_find(_inst_list,
+   opt_inst))
+   continue;
+
list_add(_inst->tmp_list, _inst_list);
}
if (!opt_found) {
-- 
2.14.3



Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-13 Thread Neil Horman
On Thu, Apr 12, 2018 at 02:15:30PM +0200, Dmitry Vyukov wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>  wrote:
> > On 20.02.2018 18:26, Neil Horman wrote:
> >>
> >> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
> >>>
> >>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
> >>>  wrote:
> 
>  On 19.02.2018 20:59, Dmitry Vyukov wrote:
> >
> > Is this meant to be fixed already? I am still seeing this on the
> > latest upstream tree.
> >
> 
>  These two commits are in v4.16-rc1:
> 
>  commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>  Author: Tommi Rantala 
>  Date:   Mon Feb 5 21:48:14 2018 +0200
> 
>   sctp: fix dst refcnt leak in sctp_v4_get_dst
>  ...
>   Fixes: 410f03831 ("sctp: add routing output fallback")
>   Fixes: 0ca50d12f ("sctp: fix src address selection if using
>  secondary
>  addresses")
> 
> 
>  commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>  Author: Alexey Kodanev 
>  Date:   Mon Feb 5 15:10:35 2018 +0300
> 
>   sctp: fix dst refcnt leak in sctp_v6_get_dst()
>  ...
>   Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>  secondary
>  addresses for ipv6")
> 
> 
>  I guess we missed something if it's still reproducible.
> 
>  I can check it later this week, unless someone else beat me to it.
> >>>
> >>>
> >>> Hi Tommi,
> >>>
> >>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> >>> another one then. But I am still seeing these:
> >>>
> >>> [   58.799130] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   60.847138] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   62.895093] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   64.943103] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>>
> >>> on upstream tree pulled ~12 hours ago.
> >>>
> >> Can you write a systemtap script to probe dev_hold, and dev_put, printing
> >> out a
> >> backtrace if the device name matches "lo".  That should tell us
> >> definitively if
> >> the problem is in the same location or not
> >
> >
> > Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> > sent in the first email in this thread:
> >
> > With 4.16-rc2 unable to reproduce.
> >
> > With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> > lo to become free. Usage count = 3"
> >
> > With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> > cherry-picked on top, unable to reproduce.
> >
> >
> > Is syzkaller doing something else now to trigger the bug...?
> > Can you still trigger the bug with the same reproducer?
> 
> Hi Neil, Tommi,
> 
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
> 
> One idea that could make it all simpler:
> 
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
As the name implies, its supposed to wait for the reference count to be zero
indefinately, but yes, under normal operation, its intended to not have to wait
very long at all.  The issuance of the NETDEV_UNREGISTER_FINAL notification is
meant to be a subscribable signal to any code path holding a reference that it
needs to be dropped so that the progress can be made.

Note that the "waiting for %s to become free" message is triggered after 10
seconds of waiting, and is likely the trigger you want, Its just an emergency
level log message rather a WARN.  I don't think we want to change that
permanently, but you could certainly alter it in the code to cause syzbot to
catch it (i.e. WARN_ON(time_after(jiffies, warning_time + 10 * HZ)) )


> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
> 
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?
Well, it drops everything and sleeps 

Re: [net] xfrm: cover crypto status in xfrm_input

2018-04-13 Thread Steffen Klassert
On Thu, Apr 12, 2018 at 12:03:15PM -0700, Jacek Kalwas wrote:
> Status checking in xfrm_input doesn't cover CRYPTO_GENERIC_ERROR and
> CRYPTO_INVALID_PACKET_SYNTAX.
> 
> Given patch adds additional check for CRYPTO_INVALID_PACKET_SYNTAX and
> treats CRYPTO_GENERIC_ERROR as status matching LINUX_MIB_XFRMINERROR.
> 
> Signed-off-by: Jacek Kalwas 
> ---
>  net/xfrm/xfrm_input.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
> index 352abca2605f..08d70ea774f9 100644
> --- a/net/xfrm/xfrm_input.c
> +++ b/net/xfrm/xfrm_input.c
> @@ -285,7 +285,12 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
> spi, int encap_type)
>   goto drop;
>   }
>  
> - XFRM_INC_STATS(net, 
> LINUX_MIB_XFRMINBUFFERERROR);
> + if (xo->status & CRYPTO_INVALID_PACKET_SYNTAX) {
> + XFRM_INC_STATS(net, 
> LINUX_MIB_XFRMINBUFFERERROR);
> + goto drop;
> + }

Please consider adding separate statistic counters for offloading.
Reusing some other counter does not make it more usfull as it is now.
Some time ago, each statistic counter was bumped at a unique place,
so it was easy to identify where the packet was dropped. Unfortunately 
this changed over the years. This was one of the concerns the userspace
IPsec developers had during the IPsec workshop we held recently. So I
think it is better to add new counters insted of reusing old ones here.



Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet

Note this is mostly me talking to myself, but in case anyone else hits
the same issues I might as well post my meagre progress.

Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since this one has DEBUG_INFO=y and I was lazy (but haven't seen
any patch about that kind of stuff recently so probably still valid)

Other main difference is the qdisc, VM is using fq_codel, host is
directly on wireless so mq with 4 pfifo_fast queues - it is harder to
reproduce on the VM (even on another VM with the same kernel) so I'd
put the difference down to the qdisc, but I was able to reproduce with
both ultimately.
(update: it actually was still fairly easy to reproduce until it got
later (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends on net quality a lot. I couldn't
reproduce by fiddling with netem on a local network though...)

> [please find previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that are blocked, since
this is a VM I even get gdb in bonus..

With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks in

With gdb, I set a breakpoint to tcp_ack (net/ipv4/tcp_input.c) as I
think that's the function that should handle my ack, and that gets the
replay.

First, abusing next, the flow seems to be something like
(I folded if/else not taken)

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{   
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_sacktag_state sack_state;
struct rate_sample rs = { .prior_delivered = 0 };
u32 prior_snd_una = tp->snd_una;
bool is_sack_reneg = tp->is_sack_reneg;
u32 ack_seq = TCP_SKB_CB(skb)->seq;
u32 ack = TCP_SKB_CB(skb)->ack_seq;
bool is_dupack = false;
int prior_packets = tp->packets_out;
u32 delivered = tp->delivered;
u32 lost = tp->lost;
int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover
losses */
u32 prior_fack;

sack_state.first_sackt = 0;
sack_state.rate = 

/* We very likely will need to access rtx queue. */
prefetch(sk->tcp_rtx_queue.rb_node);

/* If the ack is older than previous acks
 * then we can probably ignore it.
 */
if (before(ack, prior_snd_una)) {
}

/* If the ack includes data we haven't sent yet, discard
 * this segment (RFC793 Section 3.9).
 */
if (after(ack, tp->snd_nxt))

if (after(ack, prior_snd_una)) {
}

prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) :
tp->snd_una;
rs.prior_in_flight = tcp_packets_in_flight(tp);

/* ts_recent update must be made after we are sure that the
packet   
 * is in window.
 */
if (flag & FLAG_UPDATE_TS_RECENT)

if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
} else {
u32 ack_ev_flags = CA_ACK_SLOWPATH;

if (ack_seq != TCP_SKB_CB(skb)->end_seq)
else
NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPPUREACKS);

flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

if (TCP_SKB_CB(skb)->sacked)

if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
}

if (flag & FLAG_WIN_UPDATE)

tcp_in_ack_event(sk, ack_ev_flags);
}

/* We passed data and got it acked, remove any soft error
 * log. Something worked...
 */
sk->sk_err_soft = 0;
icsk->icsk_probes_out = 0;
tp->rcv_tstamp = tcp_jiffies32;
if (!prior_packets)
goto no_queue;

no_queue:
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK)
tcp_fastretrans_alert(sk, prior_snd_una, is_dupack,
,
  );
/* If this ack opens up a zero window, clear backoff.  It was
 * being used to time the probes, and is probably far higher
 than
 * it needs to be for normal retransmission.
 */
tcp_ack_probe(sk);

if (tp->tlp_high_seq)
tcp_process_tlp_ack(sk, ack, flag);
return 1;


And here is 'info local' towards the end of the function:
icsk = 0x88001740b800
tp = 0x88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 
0x88003fc83ae0, flag = 0, mss_now 

[PATCH v2] selftests: add headers_install to lib.mk

2018-04-13 Thread Anders Roxell
If the kernel headers aren't installed we can't build all the tests.
Add a new make target rule 'khdr' in the file lib.mk to generate the
kernel headers and that gets include for every test-dir Makefile that
includes lib.mk If the testdir in turn have its own sub-dirs the
top_srcdir needs to be set to the linux-rootdir to be able to generate
the kernel headers.

Signed-off-by: Anders Roxell 
Reviewed-by: Fathi Boudra 
---
 Makefile  | 14 +-
 scripts/subarch.include   | 13 +
 tools/testing/selftests/android/Makefile  |  2 +-
 tools/testing/selftests/android/ion/Makefile  |  1 +
 tools/testing/selftests/bpf/Makefile  |  5 ++---
 tools/testing/selftests/futex/functional/Makefile |  1 +
 tools/testing/selftests/gpio/Makefile |  4 
 tools/testing/selftests/kvm/Makefile  |  6 +-
 tools/testing/selftests/lib.mk| 10 ++
 tools/testing/selftests/vm/Makefile   |  3 ---
 10 files changed, 30 insertions(+), 29 deletions(-)
 create mode 100644 scripts/subarch.include

diff --git a/Makefile b/Makefile
index 74567b0ec2f0..0c47935d48a2 100644
--- a/Makefile
+++ b/Makefile
@@ -286,19 +286,7 @@ KERNELRELEASE = $(shell cat include/config/kernel.release 
2> /dev/null)
 KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if 
$(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION)
 export VERSION PATCHLEVEL SUBLEVEL KERNELRELEASE KERNELVERSION
 
-# SUBARCH tells the usermode build what the underlying arch is.  That is set
-# first, and if a usermode build is happening, the "ARCH=um" on the command
-# line overrides the setting of ARCH below.  If a native build is happening,
-# then ARCH is assigned, getting whatever value it gets normally, and
-# SUBARCH is subsequently ignored.
-
-SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
- -e s/sun4u/sparc64/ \
- -e s/arm.*/arm/ -e s/sa110/arm/ \
- -e s/s390x/s390/ -e s/parisc64/parisc/ \
- -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
- -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ \
- -e s/riscv.*/riscv/)
+include scripts/subarch.include
 
 # Cross compiling and selecting different set of gcc/bin-utils
 # ---
diff --git a/scripts/subarch.include b/scripts/subarch.include
new file mode 100644
index ..650682821126
--- /dev/null
+++ b/scripts/subarch.include
@@ -0,0 +1,13 @@
+# SUBARCH tells the usermode build what the underlying arch is.  That is set
+# first, and if a usermode build is happening, the "ARCH=um" on the command
+# line overrides the setting of ARCH below.  If a native build is happening,
+# then ARCH is assigned, getting whatever value it gets normally, and
+# SUBARCH is subsequently ignored.
+
+SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
+ -e s/sun4u/sparc64/ \
+ -e s/arm.*/arm/ -e s/sa110/arm/ \
+ -e s/s390x/s390/ -e s/parisc64/parisc/ \
+ -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
+ -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ \
+ -e s/riscv.*/riscv/)
diff --git a/tools/testing/selftests/android/Makefile 
b/tools/testing/selftests/android/Makefile
index f6304d2be90c..087390bbad68 100644
--- a/tools/testing/selftests/android/Makefile
+++ b/tools/testing/selftests/android/Makefile
@@ -6,7 +6,7 @@ TEST_PROGS := run.sh
 
 include ../lib.mk
 
-all:
+all: khdr
@for DIR in $(SUBDIRS); do  \
BUILD_TARGET=$(OUTPUT)/$$DIR;   \
mkdir $$BUILD_TARGET  -p;   \
diff --git a/tools/testing/selftests/android/ion/Makefile 
b/tools/testing/selftests/android/ion/Makefile
index e03695287f76..14ecd9805748 100644
--- a/tools/testing/selftests/android/ion/Makefile
+++ b/tools/testing/selftests/android/ion/Makefile
@@ -11,6 +11,7 @@ $(TEST_GEN_FILES): ipcsocket.c ionutils.c
 TEST_PROGS := ion_test.sh
 
 include ../../lib.mk
+top_srcdir = ../../../../../
 
 $(OUTPUT)/ionapp_export: ionapp_export.c ipcsocket.c ionutils.c
 $(OUTPUT)/ionapp_import: ionapp_import.c ipcsocket.c ionutils.c
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 0a315ddabbf4..cc611a284087 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -16,9 +16,8 @@ LDLIBS += -lcap -lelf -lrt -lpthread
 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
 all: $(TEST_CUSTOM_PROGS)
 
-$(TEST_CUSTOM_PROGS): urandom_read
-
-urandom_read: urandom_read.c
+$(TEST_CUSTOM_PROGS):| khdr

Re: SRIOV switchdev mode BoF minutes

2018-04-13 Thread Or Gerlitz
On Fri, Apr 13, 2018 at 11:56 AM, Or Gerlitz  wrote:
> On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar
>  wrote:
>> On 4/12/2018 1:20 PM, Or Gerlitz wrote:
>>>
>>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
>>>  wrote:

 On 11/12/2017 11:49 AM, Or Gerlitz wrote:
>
> Hi Dave and all,
>
> During and after the BoF on SRIOV switchdev mode, we came into a
> consensus among the developers from four different HW vendors (CC
> audience) that a correct thing to do would be to disallow any new
> extensions to the legacy mode.
>
> The idea is to put focus on the new mode and not add new UAPIs and
> kernel code which was turned to be a wrong design which does not allow
> for properly offloading a kernel switching SW model to e-switch HW.
>
> We also had a good session the day after regarding alignment for the
> representation model of the uplink (physical port) and PF/s.
>
> The VF representor netdevs  exist for all drivers that support the new
> mode but the representation for the uplink and PF wasn't the same for
> all. The decision was to represent the uplink and PFs vports in the
> same manner done for VFs, using rep netdevs. This alignment would
> provide a more strict and clear view of the kernel model for e-switch
> to users and upper layer control plane SW.
>
 I don't see any changes in the Mellanox/other drivers to move to this new
 model to enable the uplink and PF port representors, any updates?
>>>
>>> Yeah, I am worked on that but didn't get to finalize the upstreaming
>>> so far.  I have resumed
>>> the work and plan uplink rep in mlx5 to replace the PF being uplink rep
>>> for 4.18
>>>
 It would be really nice to highlight the pros and cons of the old versus
 the
 new model.

 We are looking into adding switchdev support for our new 100Gb ice driver
 and could use some feedback on the direction we should be taking.
>>>
>>> good news.
>>>
>>> The uplink rep is clear cut that needs to be a rep device representing
>>> the uplink just like vf
>>> rep represents the vport toward the vf - please just do it correct
>>> from the begining
>>>
>> Having an uplink rep will definitely help implement the slow path with
>> flat/vlan network
>> scenarios by not having to add PF to the bridge.
>>
>> But how do they help with a vxlan overlay scenario? In case of overlays, the
>> slow path has to go via vxlan -> ip stack -> pf?
>
> in  overlay networks scheme, the uplink has the VTEP ip and is not connected

the uplink rep has the vtep ip

> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to 
> ovs
> and the ip stack routes through the uplink rep
>
>>
>> What about pf-rep?
>>


Re: SRIOV switchdev mode BoF minutes

2018-04-13 Thread Or Gerlitz
On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar
 wrote:
> On 4/12/2018 1:20 PM, Or Gerlitz wrote:
>>
>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
>>  wrote:
>>>
>>> On 11/12/2017 11:49 AM, Or Gerlitz wrote:

 Hi Dave and all,

 During and after the BoF on SRIOV switchdev mode, we came into a
 consensus among the developers from four different HW vendors (CC
 audience) that a correct thing to do would be to disallow any new
 extensions to the legacy mode.

 The idea is to put focus on the new mode and not add new UAPIs and
 kernel code which was turned to be a wrong design which does not allow
 for properly offloading a kernel switching SW model to e-switch HW.

 We also had a good session the day after regarding alignment for the
 representation model of the uplink (physical port) and PF/s.

 The VF representor netdevs  exist for all drivers that support the new
 mode but the representation for the uplink and PF wasn't the same for
 all. The decision was to represent the uplink and PFs vports in the
 same manner done for VFs, using rep netdevs. This alignment would
 provide a more strict and clear view of the kernel model for e-switch
 to users and upper layer control plane SW.

>>> I don't see any changes in the Mellanox/other drivers to move to this new
>>> model to enable the uplink and PF port representors, any updates?
>>
>> Yeah, I am worked on that but didn't get to finalize the upstreaming
>> so far.  I have resumed
>> the work and plan uplink rep in mlx5 to replace the PF being uplink rep
>> for 4.18
>>
>>> It would be really nice to highlight the pros and cons of the old versus
>>> the
>>> new model.
>>>
>>> We are looking into adding switchdev support for our new 100Gb ice driver
>>> and could use some feedback on the direction we should be taking.
>>
>> good news.
>>
>> The uplink rep is clear cut that needs to be a rep device representing
>> the uplink just like vf
>> rep represents the vport toward the vf - please just do it correct
>> from the begining
>>
> Having an uplink rep will definitely help implement the slow path with
> flat/vlan network
> scenarios by not having to add PF to the bridge.
>
> But how do they help with a vxlan overlay scenario? In case of overlays, the
> slow path has to go via vxlan -> ip stack -> pf?

in  overlay networks scheme, the uplink has the VTEP ip and is not connected
to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs
and the ip stack routes through the uplink rep

>
> What about pf-rep?
>


Re: KMSAN: uninit-value in __netif_receive_skb_core

2018-04-13 Thread Dmitry Vyukov
On Fri, Apr 13, 2018 at 10:20 AM, Toshiaki Makita
 wrote:
> On 2018/04/12 17:03, Dmitry Vyukov wrote:
>> On Thu, Apr 12, 2018 at 10:01 AM, syzbot
>>  wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on https://github.com/google/kmsan.git/master
>>> commit
>>> e2ab7e8abba47a2f2698216258e5d8727ae58717 (Fri Apr 6 16:24:31 2018 +)
>>> kmsan: temporarily disable visitAsmInstruction() to help syzbot
>>> syzbot dashboard link:
>>> https://syzkaller.appspot.com/bug?extid=b202b7208664142954fa
>>>
>>> Unfortunately, I don't have any reproducer for this crash yet.
>>> Raw console output:
>>> https://syzkaller.appspot.com/x/log.txt?id=535651643762
>>> Kernel config:
>>> https://syzkaller.appspot.com/x/.config?id=6627248707860932248
>>> compiler: clang version 7.0.0 (trunk 329391)
>>
>> +Toshiaki as this seems to be related to the recent vlan tagging changes.
>
> seems not...
> "Uninit was stored to memory at:" shows uninitialized memory was stored
> before where I modified the code (skb_reorder_vlan_header).
>
> I'm not sure what this uninit memory means.
> To me it looks like the memory is initialized by user provided data.
>
> (iov in packet sock -> skb->data -> skb->protocol)
>
> The reproducer provides 4 bytes after ethernet header, so it should be
> sufficient for a vlan tag. This will set skb->len to 4 and fill the
> 4-byte contents in packet_snd(). skb_vlan_untag() is reading the user
> provided 4-byte skb->data. It is ensured that skb_vlan_untag() does not
> read beyond skb->len since it calls pskb_may_pull(). At this point I am
> failing to find what I am missing.

Eric,

You mentioned something about assumption that the
__vlan_insert_inner_tag() helper would only be called from
__netif_receive_skb_core(). Can you elaborate?



>> This also seems to be related to
>> https://groups.google.com/d/msg/syzkaller-bugs/VRH9NnUi2k0/90GYsAeRBgAJ
>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+b202b720866414295...@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for
>>> details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>> ==
>>> BUG: KMSAN: uninit-value in __read_once_size include/linux/compiler.h:197
>>> [inline]
>>> BUG: KMSAN: uninit-value in deliver_ptype_list_skb net/core/dev.c:1908
>>> [inline]
>>> BUG: KMSAN: uninit-value in __netif_receive_skb_core+0x4630/0x4a80
>>> net/core/dev.c:4545
>>> CPU: 0 PID: 5999 Comm: syz-executor3 Not tainted 4.16.0+ #82
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>> Google 01/01/2011
>>> Call Trace:
>>>  
>>>  __dump_stack lib/dump_stack.c:17 [inline]
>>>  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>>>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>>>  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>>>  __read_once_size include/linux/compiler.h:197 [inline]
>>>  deliver_ptype_list_skb net/core/dev.c:1908 [inline]
>>>  __netif_receive_skb_core+0x4630/0x4a80 net/core/dev.c:4545
>>>  __netif_receive_skb net/core/dev.c:4627 [inline]
>>>  process_backlog+0x62d/0xe20 net/core/dev.c:5307
>>>  napi_poll net/core/dev.c:5705 [inline]
>>>  net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>>>  __do_softirq+0x56d/0x93d kernel/softirq.c:285
>>>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1040
>>>  
>>>  do_softirq kernel/softirq.c:329 [inline]
>>>  __local_bh_enable_ip+0x114/0x140 kernel/softirq.c:182
>>>  local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
>>>  rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
>>>  __dev_queue_xmit+0x2a31/0x2b60 net/core/dev.c:3584
>>>  dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
>>>  packet_snd net/packet/af_packet.c:2944 [inline]
>>>  packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
>>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>>  sock_sendmsg net/socket.c:640 [inline]
>>>  sock_write_iter+0x3b9/0x470 net/socket.c:909
>>>  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>>>  do_iter_write+0x30d/0xd40 fs/read_write.c:932
>>>  vfs_writev fs/read_write.c:977 [inline]
>>>  do_writev+0x3c9/0x830 fs/read_write.c:1012
>>>  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>>>  SyS_writev+0x56/0x80 fs/read_write.c:1082
>>>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>>>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> RIP: 0033:0x455259
>>> RSP: 002b:7fb53ede8c68 EFLAGS: 0246 ORIG_RAX: 0014
>>> RAX: ffda RBX: 7fb53ede96d4 RCX: 00455259
>>> RDX: 0001 RSI: 200010c0 RDI: 0013
>>> RBP: 0072bea0 R08:  R09: 
>>> R10:  R11: 0246 R12: 
>>> R13: 06cd R14: 006fd3d8 R15: 
>>>
>>> Uninit was stored 

Re: KMSAN: uninit-value in netif_skb_features

2018-04-13 Thread Toshiaki Makita
On 2018/04/12 17:03, Dmitry Vyukov wrote:
> On Thu, Apr 12, 2018 at 10:01 AM, syzbot
>  wrote:
>> Hello,
>>
>> syzbot hit the following crash on https://github.com/google/kmsan.git/master
>> commit
>> e2ab7e8abba47a2f2698216258e5d8727ae58717 (Fri Apr 6 16:24:31 2018 +)
>> kmsan: temporarily disable visitAsmInstruction() to help syzbot
>> syzbot dashboard link:
>> https://syzkaller.appspot.com/bug?extid=0bbe42c764feafa82c5a
>>
>> So far this crash happened 30 times on
>> https://github.com/google/kmsan.git/master.
>> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4850744041668608
>> syzkaller reproducer:
>> https://syzkaller.appspot.com/x/repro.syz?id=6289386287136768
>> Raw console output:
>> https://syzkaller.appspot.com/x/log.txt?id=4577411249209344
>> Kernel config:
>> https://syzkaller.appspot.com/x/.config?id=6627248707860932248
>> compiler: clang version 7.0.0 (trunk 329391)
> 
> +Toshiaki as this seems to be related to the recent vlan tagging changes.

Seems not.
Probably skb_vlan_tagged_multi() needs to call pskb_may_pull() before
accessing skb->data? I'll confirm it later.

> This also seems to be related to
> https://groups.google.com/d/msg/syzkaller-bugs/FNEavkB4QaM/efXl2AeRBgAJ
> 
> 
> 
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+0bbe42c764feafa82...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> ==
>> BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283
>> [inline]
>> BUG: KMSAN: uninit-value in skb_vlan_tagged_multi
>> include/linux/if_vlan.h:656 [inline]
>> BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672
>> [inline]
>> BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline]
>> BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0
>> net/core/dev.c:3009
>> CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>>  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>>  eth_type_vlan include/linux/if_vlan.h:283 [inline]
>>  skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
>>  vlan_features_check include/linux/if_vlan.h:672 [inline]
>>  dflt_features_check net/core/dev.c:2949 [inline]
>>  netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
>>  validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084
>>  __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549
>>  dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
>>  packet_snd net/packet/af_packet.c:2944 [inline]
>>  packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>  sock_sendmsg net/socket.c:640 [inline]
>>  sock_write_iter+0x3b9/0x470 net/socket.c:909
>>  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>>  do_iter_write+0x30d/0xd40 fs/read_write.c:932
>>  vfs_writev fs/read_write.c:977 [inline]
>>  do_writev+0x3c9/0x830 fs/read_write.c:1012
>>  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>>  SyS_writev+0x56/0x80 fs/read_write.c:1082
>>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> RIP: 0033:0x43ffa9
>> RSP: 002b:7fff2cff3948 EFLAGS: 0217 ORIG_RAX: 0014
>> RAX: ffda RBX: 004002c8 RCX: 0043ffa9
>> RDX: 0001 RSI: 2080 RDI: 0003
>> RBP: 006cb018 R08:  R09: 
>> R10:  R11: 0217 R12: 004018d0
>> R13: 00401960 R14:  R15: 
>>
>> Uninit was created at:
>>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>>  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
>>  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
>>  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
>>  slab_post_alloc_hook mm/slab.h:445 [inline]
>>  slab_alloc_node mm/slub.c:2737 [inline]
>>  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
>>  __kmalloc_reserve net/core/skbuff.c:138 [inline]
>>  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>>  alloc_skb include/linux/skbuff.h:984 [inline]
>>  alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
>>  sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
>>  packet_alloc_skb net/packet/af_packet.c:2803 [inline]
>>  packet_snd net/packet/af_packet.c:2894 [inline]
>>  packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>  sock_sendmsg net/socket.c:640 [inline]
>>  sock_write_iter+0x3b9/0x470 

Re: KMSAN: uninit-value in __netif_receive_skb_core

2018-04-13 Thread Toshiaki Makita
On 2018/04/12 17:03, Dmitry Vyukov wrote:
> On Thu, Apr 12, 2018 at 10:01 AM, syzbot
>  wrote:
>> Hello,
>>
>> syzbot hit the following crash on https://github.com/google/kmsan.git/master
>> commit
>> e2ab7e8abba47a2f2698216258e5d8727ae58717 (Fri Apr 6 16:24:31 2018 +)
>> kmsan: temporarily disable visitAsmInstruction() to help syzbot
>> syzbot dashboard link:
>> https://syzkaller.appspot.com/bug?extid=b202b7208664142954fa
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output:
>> https://syzkaller.appspot.com/x/log.txt?id=535651643762
>> Kernel config:
>> https://syzkaller.appspot.com/x/.config?id=6627248707860932248
>> compiler: clang version 7.0.0 (trunk 329391)
> 
> +Toshiaki as this seems to be related to the recent vlan tagging changes.

seems not...
"Uninit was stored to memory at:" shows uninitialized memory was stored
before where I modified the code (skb_reorder_vlan_header).

I'm not sure what this uninit memory means.
To me it looks like the memory is initialized by user provided data.

(iov in packet sock -> skb->data -> skb->protocol)

The reproducer provides 4 bytes after ethernet header, so it should be
sufficient for a vlan tag. This will set skb->len to 4 and fill the
4-byte contents in packet_snd(). skb_vlan_untag() is reading the user
provided 4-byte skb->data. It is ensured that skb_vlan_untag() does not
read beyond skb->len since it calls pskb_may_pull(). At this point I am
failing to find what I am missing.

> This also seems to be related to
> https://groups.google.com/d/msg/syzkaller-bugs/VRH9NnUi2k0/90GYsAeRBgAJ
> 
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+b202b720866414295...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> ==
>> BUG: KMSAN: uninit-value in __read_once_size include/linux/compiler.h:197
>> [inline]
>> BUG: KMSAN: uninit-value in deliver_ptype_list_skb net/core/dev.c:1908
>> [inline]
>> BUG: KMSAN: uninit-value in __netif_receive_skb_core+0x4630/0x4a80
>> net/core/dev.c:4545
>> CPU: 0 PID: 5999 Comm: syz-executor3 Not tainted 4.16.0+ #82
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>>  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>>  __read_once_size include/linux/compiler.h:197 [inline]
>>  deliver_ptype_list_skb net/core/dev.c:1908 [inline]
>>  __netif_receive_skb_core+0x4630/0x4a80 net/core/dev.c:4545
>>  __netif_receive_skb net/core/dev.c:4627 [inline]
>>  process_backlog+0x62d/0xe20 net/core/dev.c:5307
>>  napi_poll net/core/dev.c:5705 [inline]
>>  net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>>  __do_softirq+0x56d/0x93d kernel/softirq.c:285
>>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1040
>>  
>>  do_softirq kernel/softirq.c:329 [inline]
>>  __local_bh_enable_ip+0x114/0x140 kernel/softirq.c:182
>>  local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
>>  rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
>>  __dev_queue_xmit+0x2a31/0x2b60 net/core/dev.c:3584
>>  dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
>>  packet_snd net/packet/af_packet.c:2944 [inline]
>>  packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>  sock_sendmsg net/socket.c:640 [inline]
>>  sock_write_iter+0x3b9/0x470 net/socket.c:909
>>  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>>  do_iter_write+0x30d/0xd40 fs/read_write.c:932
>>  vfs_writev fs/read_write.c:977 [inline]
>>  do_writev+0x3c9/0x830 fs/read_write.c:1012
>>  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>>  SyS_writev+0x56/0x80 fs/read_write.c:1082
>>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> RIP: 0033:0x455259
>> RSP: 002b:7fb53ede8c68 EFLAGS: 0246 ORIG_RAX: 0014
>> RAX: ffda RBX: 7fb53ede96d4 RCX: 00455259
>> RDX: 0001 RSI: 200010c0 RDI: 0013
>> RBP: 0072bea0 R08:  R09: 
>> R10:  R11: 0246 R12: 
>> R13: 06cd R14: 006fd3d8 R15: 
>>
>> Uninit was stored to memory at:
>>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>>  kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
>>  kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
>>  __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
>>  skb_vlan_untag+0x950/0xee0 include/linux/if_vlan.h:597
>>  __netif_receive_skb_core+0x70a/0x4a80 

Re: [RFC v2] virtio: support packed ring

2018-04-13 Thread Tiwei Bie
On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> On 2018年04月01日 22:12, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support for virtio driver.
> > 
> > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > 
> > TODO:
> > - Refinements and bug fixes;
> > - Split into small patches;
> > - Test indirect descriptor support;
> > - Test/fix event suppression support;
> > - Test devices other than net;
> > 
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression supprt - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> > 
> > Thanks!
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> >   drivers/virtio/virtio_ring.c   | 1094 
> > +---
> >   include/linux/virtio_ring.h|8 +-
> >   include/uapi/linux/virtio_config.h |   12 +-
> >   include/uapi/linux/virtio_ring.h   |   61 ++
> >   4 files changed, 980 insertions(+), 195 deletions(-)
[...]
> > +static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue 
> > *_vq,
> > +  unsigned int total_sg,
> > +  gfp_t gfp)
> > +{
> > +   struct vring_packed_desc *desc;
> > +
> > +   /*
> > +* We require lowmem mappings for the descriptors because
> > +* otherwise virt_to_phys will give us bogus addresses in the
> > +* virtqueue.
> > +*/
> > +   gfp &= ~__GFP_HIGHMEM;
> > +
> > +   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
> 
> Can we simply check vq->packed here to avoid duplicating helpers?

Then it would be something like this:

static void *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
void *data;

/*
 * We require lowmem mappings for the descriptors because
 * otherwise virt_to_phys will give us bogus addresses in the
 * virtqueue.
 */
gfp &= ~__GFP_HIGHMEM;

if (vq->packed) {
data = kmalloc(total_sg * sizeof(struct vring_packed_desc),
gfp);
if (!data)
return NULL;
} else {
struct vring_desc *desc;
unsigned int i;

desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
if (!desc)
return NULL;

for (i = 0; i < total_sg; i++)
desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);

data = desc;
}

return data;
}

I would prefer to have two simpler helpers (and to the callers,
it's already very clear about which one they should call), i.e.
the current implementation:

static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
   unsigned int total_sg,
   gfp_t gfp)
{
struct vring_packed_desc *desc;

/*
 * We require lowmem mappings for the descriptors because
 * otherwise virt_to_phys will give us bogus addresses in the
 * virtqueue.
 */
gfp &= ~__GFP_HIGHMEM;

desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);

return desc;
}

static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
   unsigned int total_sg,
   gfp_t gfp)
{
struct vring_desc *desc;
unsigned int i;

/*
 * We require lowmem mappings for the descriptors because
 * otherwise virt_to_phys will give us bogus addresses in the
 * virtqueue.
 */
gfp &= ~__GFP_HIGHMEM;

desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
if (!desc)
return NULL;

for (i = 0; i < total_sg; i++)
desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
return desc;
}

> 
> > +
> > +   return desc;
> > +}
[...]
> > +static inline int virtqueue_add_packed(struct virtqueue *_vq,
> > +  struct scatterlist *sgs[],
> > +  unsigned int total_sg,
> > +  unsigned int out_sgs,
> > 

[PATCH] net: stmmac: fix missing support for 802.1AD tag on reception

2018-04-13 Thread Elad Nachman
stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before
calling napi_gro_receive().

The function assumes VLAN tagged frames are always tagged with 802.1Q
protocol,
and assigns ETH_P_8021Q to the skb by hard-coding the parameter on call
to __vlan_hwaccel_put_tag() .

This causes packets not to be passed to the VLAN slave if it was created
with 802.1AD protocol
(ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).

This fix passes the protocol from the VLAN header into
__vlan_hwaccel_put_tag()
instead of using the hard-coded value of ETH_P_8021Q.

Signed-off-by: Elad Nachman 

---

--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 2018-04-11
17:04:00.586057300 +0300
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 2018-04-11
17:05:33.601992400 +0300
@@ -3293,17 +3293,19 @@ dma_map_err:

 static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
 {
-   struct ethhdr *ehdr;
+   struct vlan_ethhdr *veth;
u16 vlanid;
+   __be16 vlan_proto;

if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
NETIF_F_HW_VLAN_CTAG_RX &&
!__vlan_get_tag(skb, )) {
/* pop the vlan tag */
-   ehdr = (struct ethhdr *)skb->data;
-   memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
+   veth = (struct vlan_ethhdr *)skb->data;
+   vlan_proto = veth->h_vlan_proto;
+   memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
skb_pull(skb, VLAN_HLEN);
-   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
+   __vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
}
 }


[PATCH net] virtio-net: add missing virtqueue kick when flushing packets

2018-04-13 Thread Jason Wang
We tends to batch submitting packets during XDP_TX. This requires to
kick virtqueue after a batch, we tried to do it through
xdp_do_flush_map() which only makes sense for devmap not XDP_TX. So
explicitly kick the virtqueue in this case.

Reported-by: Kimitoshi Takahashi 
Tested-by: Kimitoshi Takahashi 
Cc: Daniel Borkmann 
Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2337460..d8e1aea 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1269,7 +1269,9 @@ static int virtnet_poll(struct napi_struct *napi, int 
budget)
 {
struct receive_queue *rq =
container_of(napi, struct receive_queue, napi);
-   unsigned int received;
+   struct virtnet_info *vi = rq->vq->vdev->priv;
+   struct send_queue *sq;
+   unsigned int received, qp;
bool xdp_xmit = false;
 
virtnet_poll_cleantx(rq);
@@ -1280,8 +1282,13 @@ static int virtnet_poll(struct napi_struct *napi, int 
budget)
if (received < budget)
virtqueue_napi_complete(napi, rq->vq, received);
 
-   if (xdp_xmit)
+   if (xdp_xmit) {
+   qp = vi->curr_queue_pairs - vi->xdp_queue_pairs +
+smp_processor_id();
+   sq = >sq[qp];
+   virtqueue_kick(sq->vq);
xdp_do_flush_map();
+   }
 
return received;
 }
-- 
2.7.4



[PATCH] net: stmmac: fix missing support for 802.1AD tag on reception

2018-04-13 Thread Elad Nachman
stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before 
calling napi_gro_receive().


The function assumes VLAN tagged frames are always tagged with 802.1Q 
protocol,
and assigns ETH_P_8021Q to the skb by hard-coding the parameter on call 
to __vlan_hwaccel_put_tag() .


This causes packets not to be passed to the VLAN slave if it was created 
with 802.1AD protocol

(ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).

This fix passes the protocol from the VLAN header into 
__vlan_hwaccel_put_tag()

instead of using the hard-coded value of ETH_P_8021Q.

Signed-off-by: Elad Nachman 

---

--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c	2018-04-11 
17:04:00.586057300 +0300


+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c	2018-04-11 
17:05:33.601992400 +0300


@@ -3293,17 +3293,19 @@ dma_map_err:



 static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)

 {

-   struct ethhdr *ehdr;

+   struct vlan_ethhdr *veth;

u16 vlanid;

+   __be16 vlan_proto;



if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==

NETIF_F_HW_VLAN_CTAG_RX &&

!__vlan_get_tag(skb, )) {

/* pop the vlan tag */

-   ehdr = (struct ethhdr *)skb->data;

-   memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);

+   veth = (struct vlan_ethhdr *)skb->data;

+   vlan_proto = veth->h_vlan_proto;

+   memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);

skb_pull(skb, VLAN_HLEN);

-   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);

+   __vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);

}

 }


Re: INFO: task hung in do_ip_vs_set_ctl (2)

2018-04-13 Thread syzbot

syzbot has found reproducer for the following crash on net-next commit
17dec0a949153d9ac00760ba2f5b78cb583e995f (Wed Apr 4 02:15:32 2018 +)
Merge branch 'userns-linus' of  
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=7810ed2e0cb359580c17


So far this crash happened 2 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5922062967242752
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5359824032235520
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6352399027404800
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-2735707888269579554

compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+7810ed2e0cb359580...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

INFO: task syzkaller402106:4498 blocked for more than 120 seconds.
  Not tainted 4.16.0+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syzkaller402106 D22184  4498   4494 0x
Call Trace:
 context_switch kernel/sched/core.c:2848 [inline]
 __schedule+0x807/0x1e40 kernel/sched/core.c:3490
 schedule+0xef/0x430 kernel/sched/core.c:3549
 schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607
 __mutex_lock_common kernel/locking/mutex.c:833 [inline]
 __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
 do_ip_vs_set_ctl+0x339/0x1d30 net/netfilter/ipvs/ip_vs_ctl.c:2393
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x7d/0xd0 net/netfilter/nf_sockopt.c:115
 ip_setsockopt+0xd8/0xf0 net/ipv4/ip_sockglue.c:1253
 tcp_setsockopt+0x93/0xe0 net/ipv4/tcp.c:2888
 sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3039
 smc_setsockopt+0xc7/0x120 net/smc/af_smc.c:1289
 __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
 SYSC_setsockopt net/socket.c:1914 [inline]
 SyS_setsockopt+0x34/0x50 net/socket.c:1911
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x445959
RSP: 002b:7f2770618db8 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 006dac24 RCX: 00445959
RDX: 048c RSI:  RDI: 0003
RBP: 006dac20 R08: 0018 R09: 
R10: 2140 R11: 0246 R12: 
R13: 7ffd81ae8f6f R14: 7f27706199c0 R15: 0001

Showing all locks held in the system:
3 locks held by kworker/0:0/4:
 #0: 7346131c ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
 #0: 7346131c ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: 7346131c ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
 #0: 7346131c ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
 #0: 7346131c ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_data kernel/workqueue.c:617 [inline]
 #0: 7346131c ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
 #0: 7346131c ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
 #1: 894403a3 ((addr_chk_work).work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
 #2: ddc85278 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74

2 locks held by khungtaskd/877:
 #0: 706bfe1c (rcu_read_lock){}, at:  
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
 #0: 706bfe1c (rcu_read_lock){}, at: watchdog+0x1ff/0xf60  
kernel/hung_task.c:249
 #1: 761e40d2 (tasklist_lock){.+.+}, at:  
debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470

2 locks held by getty/4464:
 #0: f90a9320 (>ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: 5dd151b8 (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4465:
 #0: 737b5b26 (>ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: 17bb1ae5 (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4466:
 #0: badd071e (>ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: a46de9fa (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4467:
 #0: 112731eb (>ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 

Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device

2018-04-13 Thread kbuild test robot
Hi Chen-Yu,

I love your patch! Perhaps something to improve:

[auto build test WARNING on robh/for-next]
[also build test WARNING on v4.16 next-20180412]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Icenowy-Zheng/Add-support-in-dwmac-sun8i-for-accessing-EMAC-clock/20180413-004816
base:   https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c:82:24: sparse: symbol 
'old_syscon_reg_field' was not declared. Should it be static?
>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c:89:24: sparse: symbol 
>> 'single_reg_field' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[RFC PATCH] net: stmmac: dwmac-sun8i: single_reg_field can be static

2018-04-13 Thread kbuild test robot

Fixes: 529123418105 ("net: stmmac: dwmac-sun8i: Allow getting syscon regmap 
from device")
Signed-off-by: Fengguang Wu 
---
 dwmac-sun8i.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
index b61210c..842315a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
@@ -86,7 +86,7 @@ const struct reg_field old_syscon_reg_field = {
 };
 
 /* Specially exported regmap which contains only EMAC register */
-const struct reg_field single_reg_field = {
+static const struct reg_field single_reg_field = {
.reg = 0,
.lsb = 0,
.msb = 31,