Re: [PATCH net-next 0/2] qed*: Add support for phy module query.

2018-07-19 Thread David Miller
From: Sudarsana Reddy Kalluru 
Date: Wed, 18 Jul 2018 06:27:21 -0700

> The patch series adds driver support for querying the PHY module's
> eeprom data.
> 
> Please consider applying it to 'net-next'.

Series applied.


Re: [PATCH net-next 0/3] set/match the tos/ttl fields of TC based IP tunnels

2018-07-19 Thread David Miller
From: Or Gerlitz 
Date: Tue, 17 Jul 2018 19:27:15 +0300

> This series adds support for setting (encap) and matching (decap)
> the tos and ttl fields of TC-based IP tunnels.
> 
> Example encap (1st) and decap (2nd) rules that use the new fields:
> 
> tc filter add dev eth0_0 protocol ip parent : prio 10 flower \
>   src_mac e4:11:22:33:44:50 dst_mac e4:11:22:33:44:70  \
>   action tunnel_key set src_ip 192.168.10.1 dst_ip 192.168.10.2 id 100 dst_port 4789 tos 0x30 \
>   action mirred egress redirect dev vxlan_sys_4789
> 
> tc filter add dev vxlan_sys_4789 protocol ip parent : prio 10 flower \
>   enc_src_ip 192.168.10.2 enc_dst_ip 192.168.10.1 enc_key_id 100 enc_dst_port 4789 enc_tos 0x30 \
>   src_mac e4:11:22:33:44:70 dst_mac e4:11:22:33:44:50 \
>   action tunnel_key unset \
>   action mirred egress redirect dev eth0_0

Series applied, thanks Or.


Re: [PATCH net] net/page_pool: Fix inconsistent lock state warning

2018-07-19 Thread David Miller
From: Tariq Toukan 
Date: Tue, 17 Jul 2018 18:10:37 +0300

> Fix the warning below by calling ptr_ring_consume_bh(),
> which uses spin_[un]lock_bh.
 ...
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Signed-off-by: Tariq Toukan 
> Cc: Jesper Dangaard Brouer 

Applied, thanks.


[PATCH net] multicast: remove useless parameter for group add

2018-07-19 Thread Hangbin Liu
Remove the mode parameter from igmp/igmp6_group_added() as we can get it
from the first parameter.

Fixes: 6e2059b53f988 (ipv4/igmp: init group mode as INCLUDE when join source group)
Fixes: c7ea20c9da5b9 (ipv6/mcast: init as INCLUDE when join SSM INCLUDE group)
Signed-off-by: Hangbin Liu 
---
 net/ipv4/igmp.c  | 10 +-
 net/ipv6/mcast.c |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 28fef7d..bae9096 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1288,7 +1288,7 @@ static void igmp_group_dropped(struct ip_mc_list *im)
 #endif
 }
 
-static void igmp_group_added(struct ip_mc_list *im, unsigned int mode)
+static void igmp_group_added(struct ip_mc_list *im)
 {
struct in_device *in_dev = im->interface;
 #ifdef CONFIG_IP_MULTICAST
@@ -1320,7 +1320,7 @@ static void igmp_group_added(struct ip_mc_list *im, unsigned int mode)
 * not send filter-mode change record as the mode should be from
 * IN() to IN(A).
 */
-   if (mode == MCAST_EXCLUDE)
+   if (im->sfmode == MCAST_EXCLUDE)
im->crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv;
 
igmp_ifc_event(in_dev);
@@ -1431,7 +1431,7 @@ void __ip_mc_inc_group(struct in_device *in_dev, __be32 addr, unsigned int mode)
 #ifdef CONFIG_IP_MULTICAST
igmpv3_del_delrec(in_dev, im);
 #endif
-   igmp_group_added(im, mode);
+   igmp_group_added(im);
if (!in_dev->dead)
ip_rt_multicast_event(in_dev);
 out:
@@ -1698,7 +1698,7 @@ void ip_mc_remap(struct in_device *in_dev)
 #ifdef CONFIG_IP_MULTICAST
igmpv3_del_delrec(in_dev, pmc);
 #endif
-   igmp_group_added(pmc, pmc->sfmode);
+   igmp_group_added(pmc);
}
 }
 
@@ -1761,7 +1761,7 @@ void ip_mc_up(struct in_device *in_dev)
 #ifdef CONFIG_IP_MULTICAST
igmpv3_del_delrec(in_dev, pmc);
 #endif
-   igmp_group_added(pmc, pmc->sfmode);
+   igmp_group_added(pmc);
}
 }
 
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index f60f310..4ae54aa 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -660,7 +660,7 @@ bool inet6_mc_check(struct sock *sk, const struct in6_addr *mc_addr,
return rv;
 }
 
-static void igmp6_group_added(struct ifmcaddr6 *mc, unsigned int mode)
+static void igmp6_group_added(struct ifmcaddr6 *mc)
 {
struct net_device *dev = mc->idev->dev;
char buf[MAX_ADDR_LEN];
@@ -690,7 +690,7 @@ static void igmp6_group_added(struct ifmcaddr6 *mc, unsigned int mode)
 * should not send filter-mode change record as the mode
 * should be from IN() to IN(A).
 */
-   if (mode == MCAST_EXCLUDE)
+   if (mc->mca_sfmode == MCAST_EXCLUDE)
mc->mca_crcount = mc->idev->mc_qrv;
 
mld_ifc_event(mc->idev);
@@ -931,7 +931,7 @@ static int __ipv6_dev_mc_inc(struct net_device *dev,
write_unlock_bh(&idev->lock);
 
mld_del_delrec(idev, mc);
-   igmp6_group_added(mc, mode);
+   igmp6_group_added(mc);
ma_put(mc);
return 0;
 }
@@ -2571,7 +2571,7 @@ void ipv6_mc_up(struct inet6_dev *idev)
ipv6_mc_reset(idev);
for (i = idev->mc_list; i; i = i->next) {
mld_del_delrec(idev, i);
-   igmp6_group_added(i, i->mca_sfmode);
+   igmp6_group_added(i);
}
read_unlock_bh(&idev->lock);
 }
-- 
2.5.5



[PATCH net] multicast: do not restore deleted record source filter mode to new one

2018-07-19 Thread Hangbin Liu
There are two scenarios in which we restore deleted records. The first is
when the device goes down and up (or unmap/remap). In this scenario the new
filter mode is the same as the previous one, because we get it from
in_dev->mc_list and we do not touch it while the device is down.

The other scenario is when a new socket joins a group that was just deleted
and has not finished sending status reports. In this scenario, we should use
the current filter mode instead of restoring the old one. There are 4 cases
in total.

old_socketnew_socket   before_fix   after_fix
  IN(A) IN(A)   ALLOW(A) ALLOW(A)
  IN(A) EX( )   TO_IN( ) TO_EX( )
  EX( ) IN(A)   TO_EX( ) ALLOW(A)
  EX( ) EX( )   TO_EX( ) TO_EX( )

Fixes: 24803f38a5c0b (igmp: do not remove igmp souce list info when set link down)
Fixes: 1666d49e1d416 (mld: do not remove mld souce list info when set link down)
Signed-off-by: Hangbin Liu 
---
 net/ipv4/igmp.c  | 3 +--
 net/ipv6/mcast.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index b3c899a..28fef7d 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1200,8 +1200,7 @@ static void igmpv3_del_delrec(struct in_device *in_dev, struct ip_mc_list *im)
spin_lock_bh(&im->lock);
if (pmc) {
im->interface = pmc->interface;
-   im->sfmode = pmc->sfmode;
-   if (pmc->sfmode == MCAST_INCLUDE) {
+   if (im->sfmode == MCAST_INCLUDE) {
im->tomb = pmc->tomb;
im->sources = pmc->sources;
for (psf = im->sources; psf; psf = psf->sf_next)
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 2699be7..f60f310 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -790,8 +790,7 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im)
spin_lock_bh(&im->mca_lock);
if (pmc) {
im->idev = pmc->idev;
-   im->mca_sfmode = pmc->mca_sfmode;
-   if (pmc->mca_sfmode == MCAST_INCLUDE) {
+   if (im->mca_sfmode == MCAST_INCLUDE) {
im->mca_tomb = pmc->mca_tomb;
im->mca_sources = pmc->mca_sources;
for (psf = im->mca_sources; psf; psf = psf->sf_next)
-- 
2.5.5



[PATCH bpf] bpf: Use option "help" in the llvm-objcopy test

2018-07-19 Thread Martin KaFai Lau
I noticed the "--version" option of the llvm-objcopy command has recently
disappeared from the master llvm branch.  It is currently used as a BTF
support test in tools/testing/selftests/bpf/Makefile.

This patch replaces it with "--help", which should be
less error-prone in the future.

Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 7a6214e9ae58..a362e3d7abc6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -105,7 +105,7 @@ $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
 
 BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
 BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
-BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --version 2>&1 | grep LLVM)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
 
 ifneq ($(BTF_LLC_PROBE),)
 ifneq ($(BTF_PAHOLE_PROBE),)
-- 
2.17.1



[PATCH bpf] bpf: btf: Clean up BTF_INT_BITS() in uapi btf.h

2018-07-19 Thread Martin KaFai Lau
This patch shrinks the BTF_INT_BITS() mask.  The current
btf_int_check_meta() ensures the nr_bits of an integer
cannot exceed 64.  Hence, it is mostly an uapi cleanup.

The actual BTF usage path (i.e. seq_show()) is also modified
to use u8 instead of u16. The verification path (e.g. btf_int_check_meta())
stays as-is to deal with invalid BTF situations.

Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau 
---
 include/uapi/linux/btf.h |  2 +-
 kernel/bpf/btf.c | 16 ++--
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/btf.h b/include/uapi/linux/btf.h
index 0b5ddbe135a4..972265f32871 100644
--- a/include/uapi/linux/btf.h
+++ b/include/uapi/linux/btf.h
@@ -76,7 +76,7 @@ struct btf_type {
  */
 #define BTF_INT_ENCODING(VAL)  (((VAL) & 0x0f000000) >> 24)
 #define BTF_INT_OFFSET(VAL)(((VAL  & 0x00ff0000)) >> 16)
-#define BTF_INT_BITS(VAL)  ((VAL)  & 0x0000ffff)
+#define BTF_INT_BITS(VAL)  ((VAL)  & 0x000000ff)
 
 /* Attributes stored in the BTF_INT_ENCODING */
 #define BTF_INT_SIGNED (1 << 0)
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index e016ac3afa24..9704934252b3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -450,7 +450,7 @@ static const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
  */
 static bool btf_type_int_is_regular(const struct btf_type *t)
 {
-   u16 nr_bits, nr_bytes;
+   u8 nr_bits, nr_bytes;
u32 int_data;
 
int_data = btf_type_int(t);
@@ -993,12 +993,16 @@ static void btf_int_bits_seq_show(const struct btf *btf,
 {
u16 left_shift_bits, right_shift_bits;
u32 int_data = btf_type_int(t);
-   u16 nr_bits = BTF_INT_BITS(int_data);
-   u16 total_bits_offset;
-   u16 nr_copy_bytes;
-   u16 nr_copy_bits;
+   u8 nr_bits = BTF_INT_BITS(int_data);
+   u8 total_bits_offset;
+   u8 nr_copy_bytes;
+   u8 nr_copy_bits;
u64 print_num;
 
+   /*
+* bits_offset is at most 7.
+* BTF_INT_OFFSET() cannot exceed 64 bits.
+*/
total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data);
data += BITS_ROUNDDOWN_BYTES(total_bits_offset);
bits_offset = BITS_PER_BYTE_MASKED(total_bits_offset);
@@ -1028,7 +1032,7 @@ static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t,
u32 int_data = btf_type_int(t);
u8 encoding = BTF_INT_ENCODING(int_data);
bool sign = encoding & BTF_INT_SIGNED;
-   u32 nr_bits = BTF_INT_BITS(int_data);
+   u8 nr_bits = BTF_INT_BITS(int_data);
 
if (bits_offset || BTF_INT_OFFSET(int_data) ||
BITS_PER_BYTE_MASKED(nr_bits)) {
-- 
2.17.1



Re: DNAT with VRF support in Linux Kernel

2018-07-19 Thread David Ahern
On 7/19/18 7:52 PM, D'Souza, Nelson wrote:
> Hi,
> 
>  
> 
> I'm seeing a VRF/Netfilter related issue on a system running a 4.14.52
> Linux kernel.
> 
>  
> 
> I have an eth interface enslaved to l3mdev mgmtvrf device.
> 
>  
> 
> After reviewing
> https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf, I was
> expecting that the Netfilter NF_INET_PRE_ROUTING rules would be applied
> to packets at the ingress eth interface and VRF device level. I
> confirmed that this works for pre-routing rules added to the raw and
> mangle tables at the ingress interface and VRF device level. I'm having
> issues though with pre-routing rules that are applied to the NAT table.
> NAT pre-routing rules only match on the ingress eth interface, not on
> the mgmtVRF device. As a result, I'm not able to apply DNAT at the
> mgmtvrf device level for IPv4 packets sourced from an external host and
> destined to the eth interface ip address.
> 
>  
> 
> Also observed that a tcpdump on the mgmtvrf device captures packets
> ingressing on the mgmtvrf.
> 
>  
> 
> Please let me know if my understanding is correct, and if so, if this is
> a resolved/outstanding issue.
> 

I am puzzled by this one. My main dev server uses mgmt vrf with DNAT
rules to access VMs running on it, so I know it works to some degree. e.g.,

$ sudo iptables -nvL -t nat
Chain PREROUTING (policy ACCEPT 409 packets, 68587 bytes)
 pkts bytes target  prot opt in   out  source     destination
 8761  583K ACCEPT  all  --  br0  *    0.0.0.0/0  0.0.0.0/0
    5   320 DNAT    tcp  --  *    *    0.0.0.0/0  0.0.0.0/0   tcp dpt:2201 to:10.1.1.1:22
...

But, adding LOG rule does not show a hit with dev == mgmt.


Re: VRF with enslaved L3 enabled bridge

2018-07-19 Thread David Ahern
On 7/19/18 8:19 PM, D'Souza, Nelson wrote:
> Hi,
> 
>  
> 
> I'm seeing the following issue on a system running a 4.14.52 Linux kernel.
> 
>  
> 
> With an eth interface enslaved to a VRF device, pings sent out on the
> VRF to an neighboring host are successful. But, with an eth interface
> enslaved to a L3 enabled bridge (mgmtbr0), and the bridge enslaved to a
> l3mdev VRF (mgmtvrf), the pings sent out on the VRF are not received
> back at the application level.

you mean this setup:
eth1 (ingress port) -> br0 (bridge) -> red (vrf)

IP address on br0:

9: br0:  mtu 1500 qdisc noqueue master red state UP group default qlen 1000
link/ether 02:e0:f9:1c:00:37 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.4/24 scope global br0
   valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe1c:37/64 scope link
   valid_lft forever preferred_lft forever

And then ping a neighbor:

# ping -I red -c1 -w1 10.100.1.254
ping: Warning: source address might be selected on device other than red.
PING 10.100.1.254 (10.100.1.254) from 10.100.1.4 red: 56(84) bytes of data.
64 bytes from 10.100.1.254: icmp_seq=1 ttl=64 time=0.810 ms

--- 10.100.1.254 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.810/0.810/0.810/0.000 ms

> 
>  
> 
> ICMP Echo requests are successfully sent out on the mgmtvrf device to a
> neighboring host. However, ICMP echo replies that are received back from
> the neighboring host via the eth and mgmtbr0 interfaces are not seen at
> the vrf device level and therefore fail to be delivered locally to the
> ping application.

Does tcpdump on each level show the response? tcpdump on eth, tcpdump on
bridge and tcpdump on the vrf device?

> 
>  
> 
> The following LOG rules were added to the raw table, prerouting chain
> and the filter table, OUTPUT chains:
> 
>  
> 
> root@x10sdv-4c-tln4f:~# iptables -t raw -S PREROUTING
> 
> -P PREROUTING ACCEPT
> 
> -A PREROUTING -s 10.32.8.135/32 -i mgmtbr0 -j LOG
> 
> -A PREROUTING -s 10.32.8.135/32 -i mgmtvrf -j LOG
> 
>  
> 
> root@x10sdv-4c-tln4f:~# iptables -S OUTPUT
> 
> -P OUTPUT ACCEPT
> 
> -A OUTPUT -o mgmtvrf -j LOG
> 
> -A OUTPUT -o mgmtbr0 -j LOG
> 
>  
> 
> Pings are sent on the management VRF to a neighboring host (10.32.8.135)
> and the netfilter logs included below:
> 
> Note, that in the logs, ICMP echo requests are sent out on the mgmtvrf
> and match the output rules for mgmvrf and mgmtbr0, but the ICMP echo
> replies are only seen on mgmtbr0, not on mgmtvrf
> 
>  
> 
> root@x10sdv-4c-tln4f:~# ping 10.32.8.135 -I mgmtvrf -c 1
> 
> PING 10.32.8.135 (10.32.8.135): 56 data bytes
> 
> [ 2679.683027] IN= OUT=mgmtvrf SRC=10.33.96.131 DST=10.32.8.135 LEN=84
> TOS=0x00 PREC=0x00 TTL=64 ID=23921 DF PROTO=ICMP TYPE=8 CODE=0 ID=32610
> SEQ=0   <<< ICMP echo sent on mgmtvrf
> 
> [ 2679.697560] IN= OUT=mgmtbr0 SRC=10.33.96.131 DST=10.32.8.135 LEN=84
> TOS=0x00 PREC=0x00 TTL=64 ID=23921 DF PROTO=ICMP TYPE=8 CODE=0 ID=32610
> SEQ=0   <<< ICMP echo sent on mgmtbr0
> 
> [ 2679.713312] IN=mgmtbr0 OUT= PHYSIN=ethUSB
> MAC=c0:56:27:90:4f:75:c4:7d:4f:bb:02:e7:08:00 SRC=10.32.8.135
> DST=10.33.96.131 LEN=84 TOS=0x00 PREC=0x00 TTL=62 ID=64949 PROTO=ICMP
> TYPE=0 CODE=0 ID=32610 SEQ=0 <<< ICMP echo reply rcvd on mgmtbr0,
> but not on mgmtvrf
> 
>  
> 
> --- 10.32.8.135 ping statistics ---
> 
> 1 packets transmitted, 0 packets received, 100% packet loss 
> ping failed
> 
>  
> 
> I’d like to know if this is an outstanding/resolved issue.
> 

This one works (see above), so I suspect it is something with your setup.


[PATCH ipsec-next] xfrm: Allow xfrmi if_id to be updated by UPDSA

2018-07-19 Thread Nathan Harold
Allow attaching an SA to an xfrm interface id after
the creation of the SA, so that tasks such as keying,
which must be done as the SA is created, can remain
separate from the decision on how to route traffic
from an SA. This permits SA creation to be decomposed
into three separate steps:
1) allocation of an SPI
2) algorithm and key negotiation
3) insertion into the data path
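For context, the three steps could map onto iproute2 commands roughly as follows. This is an illustrative sketch only: addresses, SPI, key and if_id values are made up, and the exact if_id syntax depends on having an iproute2 new enough to know about xfrm interfaces.

```shell
# Steps 1+2: allocate the SPI and install the negotiated algorithm/keys.
ip xfrm state add src 192.0.2.1 dst 192.0.2.2 proto esp spi 0x1000 \
    mode tunnel enc 'cbc(aes)' 0x0123456789abcdef0123456789abcdef

# Step 3 (enabled by this patch): bind the existing SA to an xfrm
# interface via UPDSA, inserting it into the data path of xfrm-if 42.
ip xfrm state update src 192.0.2.1 dst 192.0.2.2 proto esp spi 0x1000 \
    mode tunnel enc 'cbc(aes)' 0x0123456789abcdef0123456789abcdef if_id 42
```

These are configuration commands requiring root and kernel support, so they are shown for orientation rather than as a runnable test.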

Signed-off-by: Nathan Harold 
---
 net/xfrm/xfrm_state.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 27c84e63c7ff..c4c563d9be47 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1562,10 +1562,14 @@ int xfrm_state_update(struct xfrm_state *x)
if (x1->curlft.use_time)
xfrm_state_check_expire(x1);
 
-   if (x->props.smark.m || x->props.smark.v) {
+   if (x->props.smark.m || x->props.smark.v || x->if_id) {
spin_lock_bh(&net->xfrm.xfrm_state_lock);
 
-   x1->props.smark = x->props.smark;
+   if (x->props.smark.m || x->props.smark.v)
+   x1->props.smark = x->props.smark;
+
+   if (x->if_id)
+   x1->if_id = x->if_id;
 
__xfrm_state_bump_genids(x1);
spin_unlock_bh(&net->xfrm.xfrm_state_lock);
-- 
2.18.0.233.g985f88cf7e-goog



Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-07-19 Thread Max Gurtovoy



[ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18


queue 9 is not mapped (overlap).
please try the below:



This seems to work.  Here are three mapping cases:  each vector on its
own cpu, each vector on 1 cpu within the local numa node, and each
vector having all cpus in its numa node.  The 2nd mapping looks kinda
funny, but I think it achieved what you wanted?  And all the cases
resulted in successful connections.



Thanks for testing this.
I slightly improved the setting of the leftover CPUs and actually used
Sagi's initial proposal.


Sagi,
please review the attached patch and let me know if I should add your 
signature on it.
I'll run some perf test early next week on it (meanwhile I run 
login/logout with different num_queues successfully and irq settings).


Steve,
It would be great if you could apply the attached patch on your system and
send your findings.


Regards,
Max
From 6f7b98f1c43252f459772390c178fc3ad043fc82 Mon Sep 17 00:00:00 2001
From: Max Gurtovoy 
Date: Thu, 19 Jul 2018 12:42:00 +
Subject: [PATCH 1/1] blk-mq: fix RDMA queue/cpu mappings assignments for mq

In order to fulfil the block layer cpu <-> queue mapping, all the
allocated queues and all the possible CPUs should be mapped. First,
try to map the queues according to the affinity hint from the underlying
RDMA device. Second, map all the remaining unmapped queues in a naive way
to unmapped CPUs. In case we still have unmapped CPUs, use the default
blk-mq mappings to map the rest. This way we guarantee that, no matter
what the underlying affinity is, all the possible CPUs and all the
allocated block queues will be mapped.

Signed-off-by: Max Gurtovoy 
---
 block/blk-mq-cpumap.c  | 41 -
 block/blk-mq-rdma.c| 44 ++--
 include/linux/blk-mq.h |  1 +
 3 files changed, 67 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f..02b888f 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,36 @@ static int get_first_sibling(unsigned int cpu)
return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
unsigned int *map = set->mq_map;
unsigned int nr_queues = set->nr_hw_queues;
-   unsigned int cpu, first_sibling;
+   unsigned int first_sibling;
 
-   for_each_possible_cpu(cpu) {
-   /*
-* First do sequential mapping between CPUs and queues.
-* In case we still have CPUs to map, and we have some number of
-* threads per cores then map sibling threads to the same queue for
-* performace optimizations.
-*/
-   if (cpu < nr_queues) {
+   /*
+* First do sequential mapping between CPUs and queues.
+* In case we still have CPUs to map, and we have some number of
+* threads per cores then map sibling threads to the same queue for
+* performace optimizations.
+*/
+   if (cpu < nr_queues) {
+   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+   } else {
+   first_sibling = get_first_sibling(cpu);
+   if (first_sibling == cpu)
map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   } else {
-   first_sibling = get_first_sibling(cpu);
-   if (first_sibling == cpu)
-   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   else
-   map[cpu] = map[first_sibling];
-   }
+   else
+   map[cpu] = map[first_sibling];
}
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue_to_cpu);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+   unsigned int cpu;
+
+   for_each_possible_cpu(cpu)
+   blk_mq_map_queue_to_cpu(set, cpu);
 
return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f..10e4f8a 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -34,14 +34,54 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 {
const struct cpumask *mask;
unsigned int queue, cpu;
+   bool mapped;
 
+   /* reset all CPUs mapping */
+   for_each_possible_cpu(cpu)
+   set->mq_map[cpu] = UINT_MAX;
+
+   /* Try to map the queues according to affinity */
for (queue = 0; queue < set->nr_hw_queues; queue++) {
mask = ib_get_vector_affinity(dev, first_vec + queue);
if (!mask)
goto fallback;
 
-   for_each_cpu(cpu, mask)
-   set->mq_map[cpu] = queue;
+   for_each_cpu(cpu, mask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   

Re: [PATCH net] net: skb_segment() should not return NULL

2018-07-19 Thread Alexander Duyck
On Thu, Jul 19, 2018 at 4:04 PM, Eric Dumazet  wrote:
> syzbot caught a NULL deref [1], caused by skb_segment()
>
> skb_segment() has many "goto err;" that assume the @err variable
> contains -ENOMEM.
>
> A successful call to __skb_linearize() should not clear @err,
> otherwise a subsequent memory allocation error could return NULL.
>
> While we are at it, we might use -EINVAL instead of -ENOMEM when
> MAX_SKB_FRAGS limit is reached.
>
> [1] (kasan report snipped; the full trace appears in Eric's original posting)

Re: [PATCH bpf 0/2] BPF fix and test case

2018-07-19 Thread Alexei Starovoitov
On Thu, Jul 19, 2018 at 06:18:34PM +0200, Daniel Borkmann wrote:
> This set adds a ppc64 JIT fix for xadd as well as a missing test
> case for verifying whether xadd messes with src/dst reg. Thanks!

Applied, Thanks



[PATCH net] net: skb_segment() should not return NULL

2018-07-19 Thread Eric Dumazet
syzbot caught a NULL deref [1], caused by skb_segment()

skb_segment() has many "goto err;" that assume the @err variable
contains -ENOMEM.

A successful call to __skb_linearize() should not clear @err,
otherwise a subsequent memory allocation error could return NULL.

While we are at it, we might use -EINVAL instead of -ENOMEM when
MAX_SKB_FRAGS limit is reached.

[1]
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
CPU: 0 PID: 13285 Comm: syz-executor3 Not tainted 4.18.0-rc4+ #146
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:tcp_gso_segment+0x3dc/0x1780 net/ipv4/tcp_offload.c:106
Code: f0 ff ff 0f 87 1c fd ff ff e8 00 88 0b fb 48 8b 75 d0 48 b9 00 00 00 00 
00 fc ff df 48 8d be 90 00 00 00 48 89 f8 48 c1 e8 03 <0f> b6 14 08 48 8d 86 94 
00 00 00 48 89 c6 83 e0 07 48 c1 ee 03 0f
RSP: 0018:88019b7fd060 EFLAGS: 00010206
RAX: 0012 RBX: 0020 RCX: dc00
RDX: 0004 RSI:  RDI: 0090
RBP: 88019b7fd0f0 R08: 88019510e0c0 R09: ed003b5c46d6
R10: ed003b5c46d6 R11: 8801dae236b3 R12: 0001
R13: 8801d6c581f4 R14:  R15: 8801d6c58128
FS:  7fcae64d6700() GS:8801dae0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 004e8664 CR3: 0001b669b000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 tcp4_gso_segment+0x1c3/0x440 net/ipv4/tcp_offload.c:54
 inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
 inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
 skb_mac_gso_segment+0x3b5/0x740 net/core/dev.c:2792
 __skb_gso_segment+0x3c3/0x880 net/core/dev.c:2865
 skb_gso_segment include/linux/netdevice.h:4099 [inline]
 validate_xmit_skb+0x640/0xf30 net/core/dev.c:3104
 __dev_queue_xmit+0xc14/0x3910 net/core/dev.c:3561
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
 neigh_hh_output include/net/neighbour.h:473 [inline]
 neigh_output include/net/neighbour.h:481 [inline]
 ip_finish_output2+0x1063/0x1860 net/ipv4/ip_output.c:229
 ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
 NF_HOOK_COND include/linux/netfilter.h:276 [inline]
 ip_output+0x223/0x880 net/ipv4/ip_output.c:405
 dst_output include/net/dst.h:444 [inline]
 ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
 iptunnel_xmit+0x567/0x850 net/ipv4/ip_tunnel_core.c:91
 ip_tunnel_xmit+0x1598/0x3af1 net/ipv4/ip_tunnel.c:778
 ipip_tunnel_xmit+0x264/0x2c0 net/ipv4/ipip.c:308
 __netdev_start_xmit include/linux/netdevice.h:4148 [inline]
 netdev_start_xmit include/linux/netdevice.h:4157 [inline]
 xmit_one net/core/dev.c:3034 [inline]
 dev_hard_start_xmit+0x26c/0xc30 net/core/dev.c:3050
 __dev_queue_xmit+0x29ef/0x3910 net/core/dev.c:3569
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
 neigh_direct_output+0x15/0x20 net/core/neighbour.c:1403
 neigh_output include/net/neighbour.h:483 [inline]
 ip_finish_output2+0xa67/0x1860 net/ipv4/ip_output.c:229
 ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
 NF_HOOK_COND include/linux/netfilter.h:276 [inline]
 ip_output+0x223/0x880 net/ipv4/ip_output.c:405
 dst_output include/net/dst.h:444 [inline]
 ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
 ip_queue_xmit+0x9df/0x1f80 net/ipv4/ip_output.c:504
 tcp_transmit_skb+0x1bf9/0x3f10 net/ipv4/tcp_output.c:1168
 tcp_write_xmit+0x1641/0x5c20 net/ipv4/tcp_output.c:2363
 __tcp_push_pending_frames+0xb2/0x290 net/ipv4/tcp_output.c:2536
 tcp_push+0x638/0x8c0 net/ipv4/tcp.c:735
 tcp_sendmsg_locked+0x2ec5/0x3f00 net/ipv4/tcp.c:1410
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:641 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:651
 __sys_sendto+0x3d7/0x670 net/socket.c:1797
 __do_sys_sendto net/socket.c:1809 [inline]
 __se_sys_sendto net/socket.c:1805 [inline]
 __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455ab9
Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7fcae64d5c68 EFLAGS: 0246 ORIG_RAX: 002c
RAX: ffda RBX: 7fcae64d66d4 RCX: 00455ab9
RDX: 0001 RSI: 2200 RDI: 0013
RBP: 0072bea0 R08:  R09: 
R10:  R11: 0246 R12: 0014
R13: 004c1145 R14: 004d1818 R15: 0006
Modules linked in:
Dumping ftrace buffer:
   (ftrace buffer empty)

Fixes: ddff00d42043 ("net: Move skb_has_shared_frag check out of GRE code and into segmentation")

Re: [PATCH net] net: phy: consider PHY_IGNORE_INTERRUPT in phy_start_aneg_priv

2018-07-19 Thread Andrew Lunn
On Thu, Jul 19, 2018 at 08:15:16AM +0200, Heiner Kallweit wrote:
> The situation described in the comment can occur also with
> PHY_IGNORE_INTERRUPT, therefore change the condition to include it.
> 
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 6/7] net: systemport: Add support for WAKE_FILTER

2018-07-19 Thread Andrew Lunn
> In both of your examples, only one bit is set, what will change is the
> value being programmed to RXHCK_BRCM_TAG(i), which will be either 0, or
> 1, but the value programmed to RXCHK_CONTROL as far as which filter is
> enabled will be the same because we can use filter position 0.
> 
> What the code basically does is look at how many bits are set in the
> filters bitmap, and then it starts populating the filters from filter 0
> up to filter 7 with the value of the bit.

O.K. Now I get it. Sorry for being so slow.

 Andrew


Re: [PATCH net-next] net: phy: add GBit master / slave error detection

2018-07-19 Thread Andrew Lunn
On Thu, Jul 19, 2018 at 11:11:53PM +0200, Heiner Kallweit wrote:
> On 19.07.2018 16:46, Andrew Lunn wrote:
> >>> AFAIR there was a patch a while ago from Mellanox guys that was possibly
> >>> extending the link notification with an error cause, this sounds like
> >>> something that could be useful to report to user space somehow to help
> >>> troubleshoot link down events.
> >>>
> >> Do you by chance have a reference to this patch? There's heavy development
> >> on the Mellanox drivers with a lot of patches.
> > 
> > Hi Heiner, Florian
> > 
> > A general mechanism has been added to allow error messages to be
> > reported via netlink sockets. I think wifi was the first to actually
> > make use of it, since i think Johannes Berg did the core work, but
> > other parts of the stack have also started using it.
> > 
> I think you mean the devlink interface.

devlink uses netlink, so it can use these extended error messages. But
all forms of netlink sockets can use this.

What Florian might be referring to is that when netif_carrier_off()
or netif_carrier_on() is called, a netlink message is sent to
userspace. Tools like ip monitor can be used to display these events.
I think Florian is suggesting a more detailed message could be displayed
about master/slave issues. However, how you get that information to
rtnl_fill_ifinfo() I don't know.

   Andrew


Re: [PATCH net] net/xdp: Fix suspicious RCU usage warning

2018-07-19 Thread Alexei Starovoitov
On Wed, Jul 18, 2018 at 05:13:54PM +0300, Tariq Toukan wrote:
> 
> 
> On 17/07/2018 10:27 PM, Daniel Borkmann wrote:
> > On 07/17/2018 06:47 PM, Alexei Starovoitov wrote:
> > > On Tue, Jul 17, 2018 at 06:10:38PM +0300, Tariq Toukan wrote:
> > > > Fix the warning below by calling rhashtable_lookup under
> > > > RCU read lock.
> > > > 
> 
> ...
> 
> > > > mutex_lock(&mem_id_lock);
> > > > +   rcu_read_lock();
> > > > xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params);
> > > > +   rcu_read_unlock();
> > > > if (!xa) {
> > > 
> > > if it's an actual bug rcu_read_unlock seems to be misplaced.
> > > It silences the warn, but rcu section looks wrong.
> > 
> > I think that whole piece in __xdp_rxq_info_unreg_mem_model() should be:
> > 
> >mutex_lock(&mem_id_lock);
> >xa = rhashtable_lookup_fast(mem_id_ht, &id, mem_id_rht_params);
> >if (xa && rhashtable_remove_fast(mem_id_ht, &xa->node, 
> > mem_id_rht_params) == 0)
> >call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
> >mutex_unlock(&mem_id_lock);
> > 
> > Technically the RCU read side plus rhashtable_lookup() is the same, but lets
> > use proper api. From the doc (https://lwn.net/Articles/751374/) object 
> > removal
> > is wrapped around the RCU read side additionally, but in our case we're 
> > behind
> > mem_id_lock for insertion/removal serialization.
> > 
> > Cheers,
> > Daniel
> > 
> 
> Just as Daniel stated, I think there's no actual bug here, but we still want
> to silence the RCU warning.
> 
> Alexei, did you mean getting the if statement into the RCU lock critical
> section?

If what Daniel proposes silences the warn, I'd rather do that.
Pattern like:
  rcu_lock;
  val = lookup();
  rcu_unlock;
  if (val)
will cause people to question the quality of the code and whether
authors of the code understand rcu.
There should be a way to silence the warn without adding
"wrong on the first glance" code.



Re: [PATCH net-next] net: phy: add GBit master / slave error detection

2018-07-19 Thread Heiner Kallweit
On 19.07.2018 16:46, Andrew Lunn wrote:
>>> AFAIR there was a patch a while ago from Mellanox guys that was possibly
>>> extending the link notification with an error cause, this sounds like
>>> something that could be useful to report to user space somehow to help
>>> troubleshoot link down events.
>>>
>> Do you by chance have a reference to this patch? There's heavy development
>> on the Mellanox drivers with a lot of patches.
> 
> Hi Heiner, Florian
> 
> A general mechanism has been added to allow error messages to be
> reported via netlink sockets. I think wifi was the first to actually
> make use of it, since i think Johannes Berg did the core work, but
> other parts of the stack have also started using it.
> 
I think you mean the devlink interface. With regard to nl80211 I'm aware
that it's used for communication between Wifi core and userspace tools /
daemons like hostapd and wpa_supplicant.
The devlink use case reminds me of swconfig for switch port configuration
under OpenWRT.

For our use case I think this is overkill, and I wonder what the
userspace tool would be that consumes our link down error info.

Heiner

> Just picking a commit at random, maybe not the best of examples:
> 
> Fixes: 768075ebc238 ("nl80211: add a few extended error strings to key 
> parsing")
> 
>Andrew
> 



Re: [PATCH net-next,v2] net: rename ndo_setup_tc to ndo_setup_offload

2018-07-19 Thread Alexander Duyck
On Thu, Jul 19, 2018 at 1:52 PM, Pablo Neira Ayuso  wrote:
> On Thu, Jul 19, 2018 at 08:18:20AM -0700, Alexander Duyck wrote:
>> On Wed, Jul 18, 2018 at 5:11 PM, Pablo Neira Ayuso  
>> wrote:
>> > One of the recurring complaints is that we do not have, as a driver
>> > writer, a central location from which we would be fed offloading rules
>> > into a NIC. This was brought up again during Netconf'18 in Boston.
>> >
>> > This patch just renames ndo_setup_tc to ndo_setup_offload as a very
>> > early initial work to prepare for follow up patch that discuss unified
>> > flow representation for the existing offload programming APIs.
>> >
>> > Signed-off-by: Pablo Neira Ayuso 
>> > Acked-by: Jiri Pirko 
>> > Acked-by: Jakub Kicinski 
>>
>> One request I would have here is to not bother updating the individual
>> driver function names. For now I would say we could leave the
>> "_setup_tc" in the naming of the driver functions itself and just
>> update the name of the net device operation. Renaming the driver
>> functions just adds unnecessary overhead and complexity to the patch
>> and will make it more difficult to maintain. When we get around to
>> adding additional functionality that relates to the rename we could
>> address renaming the function on a per driver basis in the future.
>
> The plan was for a follow-up patch to rename enum tc_setup_type too:
>
> https://marc.info/?l=linux-netdev&m=153193158512556&w=2
>
> that will result in more renames in the driver side.
>
> I would expect this will happen sooner or later, and out of tree
> patches will end up needing a rebase sooner or later, if that is the
> concern.

I was just thinking that renaming the functions themselves adds noise
and makes it harder to debug functions later when they get renamed. As
far as the out-of-tree driver I agree we will still have to deal with
it due to the enum and NDO function rename. I just figured that using
things like LXR is a bit easier when the function name stays the same
and you have to move between versions.

- Alex


Re: [PATCH net-next,v2] net: rename ndo_setup_tc to ndo_setup_offload

2018-07-19 Thread Pablo Neira Ayuso
On Thu, Jul 19, 2018 at 08:18:20AM -0700, Alexander Duyck wrote:
> On Wed, Jul 18, 2018 at 5:11 PM, Pablo Neira Ayuso  
> wrote:
> > One of the recurring complaints is that we do not have, as a driver
> > writer, a central location from which we would be fed offloading rules
> > into a NIC. This was brought up again during Netconf'18 in Boston.
> >
> > This patch just renames ndo_setup_tc to ndo_setup_offload as a very
> > early initial work to prepare for follow up patch that discuss unified
> > flow representation for the existing offload programming APIs.
> >
> > Signed-off-by: Pablo Neira Ayuso 
> > Acked-by: Jiri Pirko 
> > Acked-by: Jakub Kicinski 
> 
> One request I would have here is to not bother updating the individual
> driver function names. For now I would say we could leave the
> "_setup_tc" in the naming of the driver functions itself and just
> update the name of the net device operation. Renaming the driver
> functions just adds unnecessary overhead and complexity to the patch
> and will make it more difficult to maintain. When we get around to
> adding additional functionality that relates to the rename we could
> address renaming the function on a per driver basis in the future.

The plan was for a follow-up patch to rename enum tc_setup_type too:

https://marc.info/?l=linux-netdev&m=153193158512556&w=2

that will result in more renames in the driver side.

I would expect this will happen sooner or later, and out of tree
patches will end up needing a rebase sooner or later, if that is the
concern.


[PATCH net] net/ipv6: Fix linklocal to global address with VRF

2018-07-19 Thread dsahern
From: David Ahern 

Example setup:
host: ip -6 addr add dev eth1 2001:db8:104::4
   where eth1 is enslaved to a VRF

switch: ip -6 ro add 2001:db8:104::4/128 dev br1
where br1 only has an LLA

   ping6 2001:db8:104::4
   ssh   2001:db8:104::4

(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and ifindex is set to the index of eth1 with a destination an
LLA).

For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
L3 master. If it is then return the ifindex from rt6i_idev similar
to what is done for loopback.

For TCP, restore the original tcp_v6_iif definition which is needed in
most places and add a new tcp_v6_iif_l3_slave that considers the
l3_slave variability. This latter check is only needed for socket
lookups.

Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: David Ahern 
---
Dave: I can look at the backports to stable if needed.

 include/net/tcp.h   | 5 +
 net/ipv6/icmp.c | 5 +++--
 net/ipv6/tcp_ipv6.c | 6 --
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3482d13d655b..b073dca8f3ac 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -839,6 +839,11 @@ static inline void bpf_compute_data_end_sk_skb(struct 
sk_buff *skb)
  */
 static inline int tcp_v6_iif(const struct sk_buff *skb)
 {
+   return TCP_SKB_CB(skb)->header.h6.iif;
+}
+
+static inline int tcp_v6_iif_l3_slave(const struct sk_buff *skb)
+{
bool l3_slave = ipv6_l3mdev_skb(TCP_SKB_CB(skb)->header.h6.flags);
 
return l3_slave ? skb->skb_iif : TCP_SKB_CB(skb)->header.h6.iif;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index be491bf6ab6e..ef2505aefc15 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -402,9 +402,10 @@ static int icmp6_iif(const struct sk_buff *skb)
 
/* for local traffic to local address, skb dev is the loopback
 * device. Check if there is a dst attached to the skb and if so
-* get the real device index.
+* get the real device index. Same is needed for replies to a link
+* local address on a device enslaved to an L3 master device
 */
-   if (unlikely(iif == LOOPBACK_IFINDEX)) {
+   if (unlikely(iif == LOOPBACK_IFINDEX || netif_is_l3_master(skb->dev))) {
const struct rt6_info *rt6 = skb_rt6_info(skb);
 
if (rt6)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 7efa9fd7e109..03e6b7a2bc53 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -938,7 +938,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct 
sk_buff *skb)
   &tcp_hashinfo, NULL, 0,
   &ipv6h->saddr,
   th->source, &ipv6h->daddr,
-  ntohs(th->source), tcp_v6_iif(skb),
+  ntohs(th->source),
+  tcp_v6_iif_l3_slave(skb),
   tcp_v6_sdif(skb));
if (!sk1)
goto out;
@@ -1609,7 +1610,8 @@ static int tcp_v6_rcv(struct sk_buff *skb)
skb, __tcp_hdrlen(th),
&ipv6_hdr(skb)->saddr, th->source,
&ipv6_hdr(skb)->daddr,
-   ntohs(th->dest), tcp_v6_iif(skb),
+   ntohs(th->dest),
+   tcp_v6_iif_l3_slave(skb),
sdif);
if (sk2) {
struct inet_timewait_sock *tw = inet_twsk(sk);
-- 
2.11.0



Re: [PATCH net-next 3/4] net/tc: introduce TC_ACT_MIRRED.

2018-07-19 Thread Jiri Pirko
Thu, Jul 19, 2018 at 03:02:28PM CEST, pab...@redhat.com wrote:
>This is similar to TC_ACT_REDIRECT, but with a slightly different
>semantic:
>- on ingress the mirred skbs are passed to the target device
>network stack without any additional check nor scrubbing.
>- the rcu-protected stats provided via the tcf_result struct
>  are updated on error conditions.
>
>v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce
> a new action type instead
>
>Signed-off-by: Paolo Abeni 
>---
> include/net/sch_generic.h| 19 +++
> include/uapi/linux/pkt_cls.h |  3 ++-
> net/core/dev.c   |  4 
> net/sched/act_api.c  |  6 --
> 4 files changed, 29 insertions(+), 3 deletions(-)
>
>diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>index 056dc1083aa3..667d7b66fee2 100644
>--- a/include/net/sch_generic.h
>+++ b/include/net/sch_generic.h
>@@ -235,6 +235,12 @@ struct tcf_result {
>   u32 classid;
>   };
>   const struct tcf_proto *goto_tp;
>+
>+  /* used by the TC_ACT_MIRRED action */
>+  struct {
>+  boolingress;
>+  struct gnet_stats_queue *qstats;
>+  };
>   };
> };
> 
>@@ -1091,4 +1097,17 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair 
>*miniqp,
> void mini_qdisc_pair_init(struct mini_Qdisc_pair *miniqp, struct Qdisc *qdisc,
> struct mini_Qdisc __rcu **p_miniq);
> 
>+static inline void skb_tc_redirect(struct sk_buff *skb, struct tcf_result 
>*res)
>+{
>+  struct gnet_stats_queue *stats = res->qstats;
>+  int ret;
>+
>+  if (res->ingress)
>+  ret = netif_receive_skb(skb);
>+  else
>+  ret = dev_queue_xmit(skb);
>+  if (ret && stats)
>+  qstats_overlimit_inc(res->qstats);
>+}
>+
> #endif
>diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>index 7cdd62b51106..1a5e8a3217f3 100644
>--- a/include/uapi/linux/pkt_cls.h
>+++ b/include/uapi/linux/pkt_cls.h
>@@ -45,7 +45,8 @@ enum {
>  * the skb and act like everything
>  * is alright.
>  */
>-#define TC_ACT_LAST   TC_ACT_TRAP
>+#define TC_ACT_MIRRED 9
>+#define TC_ACT_LAST   TC_ACT_MIRRED
> 
> /* There is a special kind of actions called "extended actions",
>  * which need a value parameter. These have a local opcode located in
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 14a748ee8cc9..3822f29d730f 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -4602,6 +4602,10 @@ sch_handle_ingress(struct sk_buff *skb, struct 
>packet_type **pt_prev, int *ret,
>   __skb_push(skb, skb->mac_len);
>   skb_do_redirect(skb);
>   return NULL;
>+  case TC_ACT_MIRRED:
>+  /* this does not scrub the packet, and updates stats on error */
>+  skb_tc_redirect(skb, &cl_res);

REDIRECT does skb_do_redirect and MIRRED does skb_tc_redirect.
Confusing. I agree with Cong that the name is not correct.


>+  return NULL;
>   default:
>   break;
>   }
>diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>index f6438f246dab..029302e2813e 100644
>--- a/net/sched/act_api.c
>+++ b/net/sched/act_api.c
>@@ -895,8 +895,10 @@ struct tc_action *tcf_action_init_1(struct net *net, 
>struct tcf_proto *tp,
>   }
>   }
> 
>-  if (a->tcfa_action == TC_ACT_REDIRECT) {
>-  net_warn_ratelimited("TC_ACT_REDIRECT can't be used directly");
>+  if (a->tcfa_action == TC_ACT_REDIRECT ||
>+  a->tcfa_action == TC_ACT_MIRRED) {
>+  net_warn_ratelimited("action %d can't be used directly",
>+   a->tcfa_action);
>   a->tcfa_action = TC_ACT_LAST + 1;
>   }
> 
>-- 
>2.17.1
>


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-07-19 Thread Steve Wise



On 7/19/2018 9:50 AM, Max Gurtovoy wrote:
>
>
> On 7/18/2018 10:29 PM, Steve Wise wrote:
>>
>>>
>>> On 7/18/2018 2:38 PM, Sagi Grimberg wrote:

>> IMO we must fulfil the user wish to connect to N queues and not
>> reduce
>> it because of affinity overlaps. So in order to push Leon's patch we
>> must also fix the blk_mq_rdma_map_queues to do a best effort
>>> mapping
>> according the affinity and map the rest in naive way (in that way we
>> will *always* map all the queues).
>
> That is what I would expect also.   For example, in my node, where
> there are
> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS
>>> performance by
> setting up my 16 driver completion event queues such that each is
> bound to a
> node-local cpu.  So I end up with each nodel-local cpu having 2
> queues
> bound
> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(),
> this
> works fine.   I assumed adding ib_get_vector_affinity() would allow
> this to
> all "just work" by default, but I'm running into this connection
> failure
> issue.
>
> I don't understand exactly what the blk_mq layer is trying to do,
> but I
> assume it has ingress event queues and processing that it is trying
> to align
> with the driver's ingress CQ event handling, so everybody stays on the
> same
> cpu (or at least node).   But something else is going on.  Is there
> documentation on how this works somewhere?

 Does this (untested) patch help?
>>>
>>> I'm not sure (I'll test it tomorrow) because the issue is the unmapped
>>> queues and not the cpus.
>>> for example, if the affinity of q=6 and q=12 returned the same cpumask
>>> then q=6 will not be mapped and will fail to connect.
>>>
>>
>> Attached is a patch that applies cleanly for me.  It has problems if
>> vectors have affinity to more than 1 cpu:
>>
>> [ 2031.91] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
>> [ 2031.994706] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
>> [ 2032.000348] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
>> [ 2032.005992] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
>> [ 2032.011629] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
>> [ 2032.017271] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
>> [ 2032.022901] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
>> [ 2032.028514] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
>> [ 2032.034110] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
>> [ 2032.039677] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
>> [ 2032.045244] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
>> [ 2032.050889] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
>> [ 2032.056531] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
>> [ 2032.062174] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
>> [ 2032.067817] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
>> [ 2032.073457] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
>> [ 2032.079102] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
>> [ 2032.085621] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
>> [ 2032.092139] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
>> [ 2032.098658] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
>> [ 2032.105177] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
>> [ 2032.111689] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
>> [ 2032.118208] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
>> [ 2032.124728] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
>> [ 2032.131246] blk_mq_rdma_map_queues: set->mq_map[8] queue 15 vector 15
>> [ 2032.137938] blk_mq_rdma_map_queues: set->mq_map[9] queue 15 vector 15
>> [ 2032.144629] blk_mq_rdma_map_queues: set->mq_map[10] queue 15
>> vector 15
>> [ 2032.151401] blk_mq_rdma_map_queues: set->mq_map[11] queue 15
>> vector 15
>> [ 2032.158172] blk_mq_rdma_map_queues: set->mq_map[12] queue 15
>> vector 15
>> [ 2032.164940] blk_mq_rdma_map_queues: set->mq_map[13] queue 15
>> vector 15
>> [ 2032.171709] blk_mq_rdma_map_queues: set->mq_map[14] queue 15
>> vector 15
>> [ 2032.178477] blk_mq_rdma_map_queues: set->mq_map[15] queue 15
>> vector 15
>> [ 2032.187409] nvme nvme0: Connect command failed, error wo/DNR bit:
>> -16402
>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>
> queue 9 is not mapped (overlap).
> please try the below:
>

This seems to work.  Here are three mapping cases:  each vector on its
own cpu, each vector on 1 cpu within the local numa node, and each
vector having all cpus in its numa node.  The 2nd mapping looks kinda
funny, but I think it achieved what you wanted?  And all the cases
resulted in successful connections.

 each vector on its own cpu:

[ 3844.756229] iw_cxgb4: comp_vector 0, irq 203 mask 0x100
[ 3844.762104] iw_cxgb4: comp_vector 1, irq 204 mask 0x200
[ 3844.767896] iw_cxgb4: comp_vector 2, irq 205 mask 0x400
[ 3844.773663] iw_cxgb4: comp_vector 3, irq 206 mask 0x800
[ 3844.779405] iw_cxgb4:

Re: [PATCH v4 net-next] net/sched: add skbprio scheduler

2018-07-19 Thread Cong Wang
On Thu, Jul 19, 2018 at 11:23 AM Nishanth Devarajan  wrote:
> +static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
> +   struct netlink_ext_ack *extack)
> +{
> +   struct skbprio_sched_data *q = qdisc_priv(sch);
> +   struct tc_skbprio_qopt *ctl = nla_data(opt);
> +   const unsigned int min_limit = 1;
> +
> +   if (ctl->limit == (typeof(ctl->limit))-1)
> +   sch->limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);

I am sorry, I still don't like this use of tx_queue_len: it should
either be removed or aligned with other existing use cases.

Apologies for responding to your previous email so late.


Re: [PATCH v3 net-next] net/sched: add skbprio scheduler

2018-07-19 Thread Cong Wang
(Sorry for missing this email, it is lost in other discussions.)

On Wed, Jul 11, 2018 at 8:25 AM Michel Machado  wrote:
>
> On 07/10/2018 10:57 PM, Cong Wang wrote:
> > The dev->tx_queue_len is fundamentally non-sense since now
> > almost every real NIC is multi-queue and qdisc has a completely
> > different sch->limit. This is why I suggested you to simply
> > avoid it in your code.
>
> Would you be okay with a constant there? If so, we could just put 64
> there. The optimal number is hardware dependent, but we don't know how
> to calculate it.

Yes, sure, fq_codel uses 10240 already. :)


>
> > There is no standard way to use dev->tx_queue_len in kernel,
> > so I can't claim your use is correct or not, but it still looks odd,
> > other qdisc seems just uses as a default, rather than picking
> > the smaller or bigger value as a cap.
>
> The reason for the `max(qdisc_dev(sch)->tx_queue_len, min_limit)` is
> to make sure that sch->limit is at least 1. We couldn't come up with a
> meaningful behavior for sch->limit being zero, so we defined the base
> case of skbprio_enqueue() as sch->limit being one. If there's a guarantee that
> qdisc_dev(sch)->tx_queue_len is always greater than zero, we don't need
> the max().

I think tx_queue_len could be 0. But again, why do you need to care
about tx_queue_len being 0 or not here?

sch->limit could be 0 too, it means this qdisc should not queue any
packets.

Thanks


[PATCH v4 net-next] net/sched: add skbprio scheduler

2018-07-19 Thread Nishanth Devarajan
net/sched: add skbprio scheduler

Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
according to their skb->priority field. Under congestion, already-enqueued lower
priority packets will be dropped to make space available for higher priority
packets. Skbprio was conceived as a solution for denial-of-service defenses that
need to route packets with different priorities as a means to overcome DoS
attacks.

v4
*Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man
page for Skbprio, in iproute2.

v3
*Drop max_limit parameter in struct skbprio_sched_data and instead use
sch->limit.

*Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
qdisc (previously being referenced every time qdisc changes).

*Move qdisc's detailed description from in-code to Documentation/networking.

*When qdisc is saturated, enqueue incoming packet first before dequeueing
lowest priority packet in queue - improves usage of call stack registers.

*Introduce and use overlimit stat to keep track of number of dropped packets.

v2
*Use skb->priority field rather than DS field. Rename queueing discipline as
SKB Priority Queue (previously Gatekeeper Priority Queue).

*Queueing discipline is made classful to expose Skbprio's internal priority
queues.

Signed-off-by: Nishanth Devarajan 
Reviewed-by: Sachin Paryani 
Reviewed-by: Cody Doucette 
Reviewed-by: Michel Machado 
---
 include/uapi/linux/pkt_sched.h |  15 ++
 net/sched/Kconfig  |  13 ++
 net/sched/Makefile |   1 +
 net/sched/sch_skbprio.c| 330 +
 4 files changed, 359 insertions(+)
 create mode 100644 net/sched/sch_skbprio.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index d9cc9dc..8975fd1 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -124,6 +124,21 @@ struct tc_fifo_qopt {
__u32   limit;  /* Queue length: bytes for bfifo, packets for pfifo */
 };
 
+/* SKBPRIO section */
+
+/*
+ * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1).
+ * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be able
+ * to map one to one the DS field of IPV4 and IPV6 headers.
+ * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY.
+ */
+
+#define SKBPRIO_MAX_PRIORITY 64
+
+struct tc_skbprio_qopt {
+   __u32   limit;  /* Queue length in packets. */
+};
+
 /* PRIO section */
 
 #define TCQ_PRIO_BANDS 16
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 7af2467..7699344 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -251,6 +251,19 @@ config NET_SCH_MQPRIO
 
  If unsure, say N.
 
+config NET_SCH_SKBPRIO
+   tristate "SKB priority queue scheduler (SKBPRIO)"
+   help
+ Say Y here if you want to use the SKB priority queue
+ scheduler. This schedules packets according to skb->priority,
+ which is useful for request packets in DoS mitigation systems such
+ as Gatekeeper.
+
+ To compile this driver as a module, choose M here: the module will
+ be called sch_skbprio.
+
+ If unsure, say N.
+
 config NET_SCH_CHOKE
tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 673ee7d..112ef70 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM)   += sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)  += sch_drr.o
 obj-$(CONFIG_NET_SCH_PLUG) += sch_plug.o
 obj-$(CONFIG_NET_SCH_MQPRIO)   += sch_mqprio.o
+obj-$(CONFIG_NET_SCH_SKBPRIO)  += sch_skbprio.o
 obj-$(CONFIG_NET_SCH_CHOKE)+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)  += sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)+= sch_codel.o
diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
new file mode 100644
index 000..6b94f54
--- /dev/null
+++ b/net/sched/sch_skbprio.c
@@ -0,0 +1,330 @@
+/*
+ * net/sched/sch_skbprio.c  SKB Priority Queue.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Nishanth Devarajan, 
+ * Cody Doucette, 
+ * original idea by Michel Machado, Cody Doucette, and Qiaobin Fu
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* SKB Priority Queue
+ * =
+ *
+ * Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes
+ * packets according to their skb->priority field. Under congestion,
+ * Skbprio drops already-enqueued lower priority packets to make space
+ * available for higher priority packets; it was conceived as a solution
+ * for denial-of-

Re: [PATCH net-next 3/4] net/tc: introduce TC_ACT_MIRRED.

2018-07-19 Thread Cong Wang
On Thu, Jul 19, 2018 at 6:03 AM Paolo Abeni  wrote:
>
> This is similar to TC_ACT_REDIRECT, but with a slightly different
> semantic:
> - on ingress the mirred skbs are passed to the target device
> network stack without any additional check nor scrubbing.
> - the rcu-protected stats provided via the tcf_result struct
>   are updated on error conditions.

At least its name sucks, it means to skip the skb_clone(),
that is avoid a copy, but you still call it MIRRED...

MIRRED means MIRror and REDirect.

Also, I don't understand why this new TC_ACT code needs
to be visible to user-space, whether to clone or not is purely
internal.


Re: [PATCH net-next 4/4] act_mirred: use ACT_REDIRECT when possible

2018-07-19 Thread Cong Wang
On Wed, Jul 18, 2018 at 3:05 AM Paolo Abeni  wrote:
>
> Hi,
>
> On Tue, 2018-07-17 at 10:24 -0700, Cong Wang wrote:
> > If you goal is to get rid of skb_clone(), why not just do the following?
> >
> > if (tcf_mirred_is_act_redirect(m_eaction)) {
> > skb2 = skb;
> > } else {
> > skb2 = skb_clone(skb, GFP_ATOMIC);
> > if (!skb2)
> > goto out;
> > }
> >
> > For redirect, we return TC_ACT_SHOT, so upper layer should not
> > touch the skb after that.
> >
> > What am I missing here?
>
> With ACT_SHOT caller/upper layer will free the skb, too. We will have
> an use after free (from either the upper layer and the xmit device).
> Similar issues with STOLEN, TRAP, etc.
>
> In the past, Changli Gao attempted to avoid the clone incrementing the
> skb usage count:
>
> commit 210d6de78c5d7c785fc532556cea340e517955e1
> Author: Changli Gao 
> Date:   Thu Jun 24 16:25:12 2010 +
>
> act_mirred: don't clone skb when skb isn't shared
>
> but some/many device drivers expect an skb usage count of 1, and that
> caused oopses and was reverted.

Interesting, I wasn't aware of the above commit and its revert.

First, I didn't use skb_get() above.

Second, I think the caller of dev_queue_xmit() should not
touch the skb after it, the skb is either freed by dev_queue_xmit()
or successfully transmitted, in either case, the ownership belongs
to dev_queue_xmit(). So, I think we should skip the qdisc_drop()
for this case.

Not sure about the netif_receive_skb() case; given veth calls it in its
xmit too, I speculate the rule is probably the same.

Not sure about other ACT_SHOT case than act_mirred...

>
> I think the only other option (beyond re-using ACT_MIRROR) is adding
> another action value, and let the upper layer re-inject the packet
> while handling such action (similar to what ACT_MIRROR currently does,
> but preserving the current mirred semantic).

Maybe if you mean to avoid breaking ACT_SHOT.

Thanks.


[PATCH ipsec-next] xfrm: Remove xfrmi interface ID from flowi

2018-07-19 Thread Benedict Wong
In order to remove performance impact of having the extra u32 in every
single flowi, this change removes the flowi_xfrm struct, preferring to
take the if_id as a method parameter where needed.

In the inbound direction, if_id is only needed during the
__xfrm_check_policy() function, and the if_id can be determined at that
point based on the skb. As such, xfrmi_decode_session() is only called
with the skb in __xfrm_check_policy().

In the outbound direction, the only place where if_id is needed is the
xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is
directly passed into the xfrm_lookup_with_ifid() call. All existing
callers can still call xfrm_lookup(), which uses a default if_id of 0.

This change does not change any behavior of XFRMIs except for improving
overall system performance via flowi size reduction.

This change has been tested against the Android Kernel Networking Tests:

https://android.googlesource.com/kernel/tests/+/master/net/test

Signed-off-by: Benedict Wong 
---
 include/net/dst.h | 14 ++
 include/net/flow.h|  9 
 include/net/xfrm.h|  2 +-
 net/xfrm/xfrm_interface.c |  4 +-
 net/xfrm/xfrm_policy.c| 98 ++-
 net/xfrm/xfrm_state.c |  3 +-
 6 files changed, 83 insertions(+), 47 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index b3219cd8a5a1..7f735e76ca73 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -475,6 +475,14 @@ static inline struct dst_entry *xfrm_lookup(struct net 
*net,
return dst_orig;
 }
 
+static inline struct dst_entry *
+xfrm_lookup_with_ifid(struct net *net, struct dst_entry *dst_orig,
+ const struct flowi *fl, const struct sock *sk,
+ int flags, u32 if_id)
+{
+   return dst_orig;
+}
+
 static inline struct dst_entry *xfrm_lookup_route(struct net *net,
  struct dst_entry *dst_orig,
  const struct flowi *fl,
@@ -494,6 +502,12 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
  const struct flowi *fl, const struct sock *sk,
  int flags);
 
+struct dst_entry *xfrm_lookup_with_ifid(struct net *net,
+   struct dst_entry *dst_orig,
+   const struct flowi *fl,
+   const struct sock *sk, int flags,
+   u32 if_id);
+
struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig,
				    const struct flowi *fl, const struct sock *sk,
				    int flags);
diff --git a/include/net/flow.h b/include/net/flow.h
index 187c9bef672f..8ce21793094e 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -26,10 +26,6 @@ struct flowi_tunnel {
__be64  tun_id;
 };
 
-struct flowi_xfrm {
-   __u32   if_id;
-};
-
 struct flowi_common {
int flowic_oif;
int flowic_iif;
@@ -43,7 +39,6 @@ struct flowi_common {
 #define FLOWI_FLAG_SKIP_NH_OIF 0x04
__u32   flowic_secid;
struct flowi_tunnel flowic_tun_key;
-   struct flowi_xfrm xfrm;
kuid_t  flowic_uid;
 };
 
@@ -83,7 +78,6 @@ struct flowi4 {
 #define flowi4_secid   __fl_common.flowic_secid
 #define flowi4_tun_key __fl_common.flowic_tun_key
 #define flowi4_uid __fl_common.flowic_uid
-#define flowi4_xfrm__fl_common.xfrm
 
/* (saddr,daddr) must be grouped, same order as in IP header */
__be32  saddr;
@@ -115,7 +109,6 @@ static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
fl4->flowi4_flags = flags;
fl4->flowi4_secid = 0;
fl4->flowi4_tun_key.tun_id = 0;
-   fl4->flowi4_xfrm.if_id = 0;
fl4->flowi4_uid = uid;
fl4->daddr = daddr;
fl4->saddr = saddr;
@@ -145,7 +138,6 @@ struct flowi6 {
 #define flowi6_secid   __fl_common.flowic_secid
 #define flowi6_tun_key __fl_common.flowic_tun_key
 #define flowi6_uid __fl_common.flowic_uid
-#define flowi6_xfrm__fl_common.xfrm
struct in6_addr daddr;
struct in6_addr saddr;
/* Note: flowi6_tos is encoded in flowlabel, too. */
@@ -193,7 +185,6 @@ struct flowi {
 #define flowi_secidu.__fl_common.flowic_secid
 #define flowi_tun_key  u.__fl_common.flowic_tun_key
 #define flowi_uid  u.__fl_common.flowic_uid
-#define flowi_xfrm u.__fl_common.xfrm
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 1350e2cf0749..ca820945f30c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1557,7 +1557,7 @@ struct xfrm_state 

Re: [PATCH net] net: rollback orig value on failure of dev_qdisc_change_tx_queue_len

2018-07-19 Thread Cong Wang
On Thu, Jul 19, 2018 at 7:34 AM Tariq Toukan  wrote:
>
> Fix dev_change_tx_queue_len so it rolls back original value
> upon a failure in dev_qdisc_change_tx_queue_len.
> This is already done for notifiers' failures, share the code.
>
> The revert of changes in dev_qdisc_change_tx_queue_len
> in such case is needed but missing (marked as TODO), yet it is
> still better to not apply the new requested value.

You misunderstand the TODO, that is for reverting tx queue len
change for previous queues in the loop. I still don't have any
nice solution for this yet.

Yeah, your change itself looks good.

Please update the changelog.

Thanks.


Re: [PATCH net-next] net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsioc

2018-07-19 Thread Cong Wang
On Thu, Jul 19, 2018 at 7:50 AM Tariq Toukan  wrote:
> --- a/net/core/dev_ioctl.c
> +++ b/net/core/dev_ioctl.c
> @@ -282,14 +282,7 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
> return dev_mc_del_global(dev, ifr->ifr_hwaddr.sa_data);
>
> case SIOCSIFTXQLEN:
> -   if (ifr->ifr_qlen < 0)
> -   return -EINVAL;

Are you sure we can remove this if check too?

The other one is safe to remove.


Re: [PATCH net-next 4/4] act_mirred: use ACT_REDIRECT when possible

2018-07-19 Thread kbuild test robot
Hi Paolo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Paolo-Abeni/TC-refactor-TC_ACT_REDIRECT-action/20180716-011055
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/sched/act_mirred.c:195:17: sparse: dubious: x | !y
   net/sched/act_mirred.c:260:23: sparse: expression using sizeof(void)
   net/sched/act_mirred.c:260:23: sparse: expression using sizeof(void)

vim +195 net/sched/act_mirred.c

   169  
   170  static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a,
   171struct tcf_result *res)
   172  {
   173  struct tcf_mirred *m = to_mirred(a);
   174  bool m_mac_header_xmit;
   175  struct net_device *dev;
   176  struct sk_buff *skb2;
   177  int retval, err = 0;
   178  bool want_ingress;
   179  int m_eaction;
   180  int mac_len;
   181  
   182  tcf_lastuse_update(&m->tcf_tm);
   183  bstats_cpu_update(this_cpu_ptr(m->common.cpu_bstats), skb);
   184  
   185  m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
   186  m_eaction = READ_ONCE(m->tcfm_eaction);
   187  retval = READ_ONCE(m->tcf_action);
   188  dev = rcu_dereference_bh(m->tcfm_dev);
   189  want_ingress = tcf_mirred_act_wants_ingress(m_eaction);
   190  if (skb_at_tc_ingress(skb) && tcf_mirred_is_act_redirect(m_eaction)) {
   191  skb->tc_redirected = 1;
   192  skb->tc_from_ingress = 1;
   193  
   194  /* the core redirect code will check dev and its status */
 > 195  TCF_RESULT_SET_REDIRECT(res, dev, want_ingress);
   196  res->qstats = this_cpu_ptr(m->common.cpu_qstats);
   197  return TC_ACT_REDIRECT;
   198  }
   199  
   200  if (unlikely(!dev)) {
   201  pr_notice_once("tc mirred: target device is gone\n");
   202  goto out;
   203  }
   204  
   205  if (unlikely(!(dev->flags & IFF_UP))) {
   206  net_notice_ratelimited("tc mirred to Houston: device %s is down\n",
   207 dev->name);
   208  goto out;
   209  }
   210  
   211  skb2 = skb_clone(skb, GFP_ATOMIC);
   212  if (!skb2)
   213  goto out;
   214  
   215  /* If action's target direction differs than filter's direction,
   216   * and devices expect a mac header on xmit, then mac push/pull is
   217   * needed.
   218   */
   219  if (skb_at_tc_ingress(skb) != want_ingress && m_mac_header_xmit) {
   220  if (!skb_at_tc_ingress(skb)) {
   221  /* caught at egress, act ingress: pull mac */
   222  mac_len = skb_network_header(skb) - skb_mac_header(skb);
   223  skb_pull_rcsum(skb2, mac_len);
   224  } else {
   225  /* caught at ingress, act egress: push mac */
   226  skb_push_rcsum(skb2, skb->mac_len);
   227  }
   228  }
   229  
   230  /* mirror is always swallowed */
   231  if (tcf_mirred_is_act_redirect(m_eaction)) {
   232  skb2->tc_redirected = 1;
   233  skb2->tc_from_ingress = skb2->tc_at_ingress;
   234  }
   235  
   236  skb2->skb_iif = skb->dev->ifindex;
   237  skb2->dev = dev;
   238  if (!tcf_mirred_act_wants_ingress(m_eaction))
   239  err = dev_queue_xmit(skb2);
   240  else
   241  err = netif_receive_skb(skb2);
   242  
   243  if (err) {
   244  out:
   245  qstats_overlimit_inc(this_cpu_ptr(m->common.cpu_qstats));
   246  if (tcf_mirred_is_act_redirect(m_eaction))
   247  retval = TC_ACT_SHOT;
   248  }
   249  
   250  return retval;
   251  }
   252  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH iproute2/next 1/2] tc/act_tunnel_key: Enable setup of tos and ttl

2018-07-19 Thread Or Gerlitz
On Thu, Jul 19, 2018 at 2:48 PM, Roman Mashak  wrote:
> Or Gerlitz  writes:
>
>> Allow to set tos and ttl for the tunnel.
>>
>> For example, here's encap rule that sets tos to the tunnel:
>>
>> tc filter add dev eth0_0 protocol ip parent : prio 10 flower \
>>src_mac e4:11:22:33:44:50 dst_mac e4:11:22:33:44:70 \
>>    action tunnel_key set src_ip 192.168.10.1 dst_ip 192.168.10.2 id 100 dst_port 4789 tos 0x30 \
>>action mirred egress redirect dev vxlan_sys_4789
>>
>> Signed-off-by: Or Gerlitz 
>> Reviewed-by: Roi Dayan 
>> Acked-by: Jiri Pirko 
>
> [...]
>
> Or, could you also update tunnel_key actions for the new options in
> $(kernel)/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> once the patches are accepted ?

yes, I will do that


Re: [PATCH RFC bpf-next] bpf: per-register parent pointers

2018-07-19 Thread Edward Cree
On 18/07/18 04:54, Alexei Starovoitov wrote:
> I'd like to apply it, but I see the difference in insn_processed.
> Several cilium tests show favorable difference towards new liveness approach.
> selftests/bpf/test_xdp_noinline.o also shows the difference.
> I'm struggling to see why this patch would make such difference.
> Could you please help me analyze why such difference exists?
I'm also confused by it at present, but I am looking into it.
-Ed


Re: [PATCH iproute2 5/5] bpf: implement btf handling and map annotation

2018-07-19 Thread Martin KaFai Lau
On Thu, Jul 19, 2018 at 05:43:11PM +0200, Daniel Borkmann wrote:
> On 07/19/2018 02:11 AM, Martin KaFai Lau wrote:
> > On Wed, Jul 18, 2018 at 11:13:37AM -0700, Jakub Kicinski wrote:
> >> On Wed, 18 Jul 2018 11:33:22 +0200, Daniel Borkmann wrote:
> >>> On 07/18/2018 10:42 AM, Daniel Borkmann wrote:
>  On 07/18/2018 02:27 AM, Jakub Kicinski wrote:  
> > On Wed, 18 Jul 2018 01:31:22 +0200, Daniel Borkmann wrote:  
> >>   # bpftool map dump id 386
> >>[{
> >> "key": 0,
> >> "value": {
> >> "": {
> >> "value": 0,
> >> "ifindex": 0,
> >> "mac": []
> >> }
> >> }
> >> },{
> >> "key": 1,
> >> "value": {
> >> "": {
> >> "value": 0,
> >> "ifindex": 0,
> >> "mac": []
> >> }
> >> }
> >> },{
> >>   [...]  
> >
> > Ugh, the empty keys ("") look worrying, we should probably improve
> > handling of anonymous structs in bpftool :S  
> 
>  Yeah agree, I think it would be nice to see a more pahole style dump
>  where we have types and member names along with the value as otherwise
>  it might be a bit confusing.  
> >>>
> >>> Another feature that would be super useful imho would be in the /single/
> >>> map view e.g. 'bpftool map show id 123' to have a detailed BTF key+value
> >>> type dump, so in addition to the basic map info we show pahole like info
> >>> of the structs with length/offsets.
> >>
> >> That sounds good!  We could also consider adding a btf object and
> >> commands to interrogate BTF types in the kernel in general..  Perhaps
> >> then we could add something like bpftool btf describe map id 123.
> > +1 on the btf subcommand.
> 
> That would also work, I think both might be useful to have. Former would
> all sit under a single command to show map details.
Agree that both would be useful.  btf command could address the whole BTF
object which could include many maps/types while the map command is
focusing on its own map info.

> With 'bpftool btf' you
> would also allow for a full BTF dump when a specific BTF obj id is provided?
Right, I think the BTF obj id (or file) is needed for the btf command.  and then
it should allow to do full dump or only show a particular map/type id.

A little forward thinking, map here is a C type.  Hence, I think using the
name "type" like "bpftool btf id 1 show _type_ id 123" may be better when
we later expand BTF usage beyond BPF program.

> 
> >> Having the single map view show more information seems interesting, but
> >> I wonder if it could be surprising.  Is there precedent for such
> >> behaviour?
> > Having everything in one page (map show id 123) could be interesting.
> > One thing is the pahole-like output may be quite long?
> > e.g. the member of a struct could itself be another struct.
> 
> Right, though probably fine when you want to see all information specific
> to one map. Of course the 'bpftool map' list view would need to hide this
> information.
> 
> > Not sure how the pahole-like output may look like in json though.
> 
> Would the existing map k/v dump have more or less the same 'issue'?
True, the existing map k/v dump of the map data is reusing the
json {}/[]/""/number convention.  I think that is ok and actually a
pretty condensed way since people are used to this convention when
reading "data" output.

For printing out C type, I think it is more natural to have it as
close to C syntax as possible in order to have it parsable by human
eyes.  However, yes, we could reuse a similar fashion to print type
in json as we do in printing data.  Just curious, the json type output
is more for script or mostly for people that can read everything from one
json output.

For plaintext, we can just print like pahole.


Re: [PATCH iproute2-next v4] net:sched: add action inheritdsfield to skbedit

2018-07-19 Thread David Ahern
On 7/19/18 10:07 AM, Qiaobin Fu wrote:
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
> 
> v4:
> * Make tc use netlink helper functions
> 
> v3:
> * Make flag represented in JSON output as a null value
> 
> v2:
> * Align the output syntax with the input syntax
> 
> * Fix the style issues
> 
> Original idea by Jamal Hadi Salim 
> 
> Signed-off-by: Qiaobin Fu 
> Reviewed-by: Michel Machado 
> Reviewed-by: Cong Wang 
> Reviewed-by: Marcelo Ricardo Leitner 
> Reviewed-by: Stephen Hemminger 
> Reviewed-by: David Ahern 
> ---
> 
> Note that the motivation for this patch is found in the following discussion:
> https://www.spinics.net/lists/netdev/msg501061.html
> ---
>  tc/m_skbedit.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 

applied to iproute2-next. Thanks




Re: [PATCH iproute2 net-next] devlink: Add support for devlink-region access

2018-07-19 Thread David Ahern
On 7/17/18 2:34 AM, Alex Vesker wrote:
> Devlink region allows access to driver defined address regions.
> Each device can create its supported address regions and register
> them. A device which exposes a region will allow access to it
> using devlink.
> 
> This support allows reading and dumping regions snapshots as well
> as presenting information such as region size and current available
> snapshots.
> 
> A snapshot represents a memory image of a region taken by the driver.
> If a device collects a snapshot of an address region it can be later
> exposed using devlink region read or dump commands.
> This functionality allows for future analyses on the snapshots.
> 
> The dump command is designed to read the full address space of a
> region or of a snapshot unlike the read command which allows
> reading only a specific section in a region/snapshot indicated by
> an address and a length, current support is for reading and dumping
> for a previously taken snapshot ID.
> 
> New commands added:
>  devlink region show [ DEV/REGION ]
>  devlink region delete DEV/REGION snapshot SNAPSHOT_ID
>  devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
>  devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ]
> address ADDRESS length LENGTH
> 
> Signed-off-by: Alex Vesker 
> Signed-off-by: Jiri Pirko 
> ---
>  devlink/devlink.c | 485 +-
>  man/man8/devlink-region.8 | 131 +
>  man/man8/devlink.8|   1 +
>  3 files changed, 616 insertions(+), 1 deletion(-)
>  create mode 100644 man/man8/devlink-region.8
> 

applied to iproute2-next. Thanks




Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.

2018-07-19 Thread Michal Soltys

On 07/19/2018 01:41 AM, Mahesh Bandewar wrote:

From: Mahesh Bandewar 

Commit b89f04c61efe ("bonding: deliver link-local packets with
skb->dev set to link that packets arrived on") changed the behavior
of how link-local-multicast packets are processed. The change in
the behavior broke some legacy use cases where these packets are
expected to arrive on bonding master device also.

This patch passes the packet to the stack with the link it arrived
on as well as passes to the bonding-master device to preserve the
legacy use case.

Fixes: b89f04c61efe ("bonding: deliver link-local packets with skb->dev set to link 
that packets arrived on")
Reported-by: Michal Soltys 
Signed-off-by: Mahesh Bandewar 
---
v2: Added Fixes tag.
v1: Initial patch.
  drivers/net/bonding/bond_main.c | 17 +++--
  1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9a2ea3c1f949..1d3b7d8448f2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1177,9 +1177,22 @@ static rx_handler_result_t bond_handle_frame(struct 
sk_buff **pskb)
}
}
  
-	/* don't change skb->dev for link-local packets */
-   if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
+   /* Link-local multicast packets should be passed to the
+* stack on the link they arrive as well as pass them to the
+* bond-master device. These packets are mostly usable when
+* stack receives it with the link on which they arrive
+* (e.g. LLDP) but there may be some legacy behavior that
+* expects these packets to appear on bonding master too.


I'd really change the comment from:

"These packets are mostly usable when stack receives it with the link on which 
they arrive (e.g. LLDP) but there may be some legacy behavior that expects 
these packets to appear on bonding master too."


to something like:

"These packets are mostly usable when stack receives it with the link on which 
they arrive, but they also must be available on aggregations. Some of the use 
cases include (but are not limited to): LLDP agents that must be able to 
operate both on enslaved interfaces as well as on bonds themselves; linux 
bridges that must be able to process/pass BPDUs from attached bonds when any 
kind of stp version is enabled on the network."


It's a bit longer, but clarifies the reasons more precisely (without going too 
deep into features like group_fwd_mask).


[PATCH bpf 0/2] BPF fix and test case

2018-07-19 Thread Daniel Borkmann
This set adds a ppc64 JIT fix for xadd as well as a missing test
case for verifying whether xadd messes with src/dst reg. Thanks!

Daniel Borkmann (2):
  bpf, ppc64: fix unexpected r0=0 exit path inside bpf_xadd
  bpf: test case to check whether src/dst regs got mangled by xadd

 arch/powerpc/net/bpf_jit_comp64.c   | 29 -
 tools/testing/selftests/bpf/test_verifier.c | 40 +
 2 files changed, 45 insertions(+), 24 deletions(-)

-- 
2.9.5



[PATCH bpf 1/2] bpf, ppc64: fix unexpected r0=0 exit path inside bpf_xadd

2018-07-19 Thread Daniel Borkmann
None of the JITs is allowed to implement exit paths from the BPF
insn mappings other than BPF_JMP | BPF_EXIT. In the BPF core code
we have a couple of rewrites in eBPF (e.g. LD_ABS / LD_IND) and
in eBPF to cBPF translation to retain old existing behavior where
exceptions may occur; they are also tightly controlled by the
verifier where it disallows some of the features such as BPF to
BPF calls when legacy LD_ABS / LD_IND ops are present in the BPF
program. During recent review of all BPF_XADD JIT implementations
I noticed that the ppc64 one is buggy in that it contains two
jumps to exit paths. This is problematic as this can bypass verifier
expectations e.g. pointed out in commit f6b1b3bf0d5f ("bpf: fix
subprog verifier bypass by div/mod by 0 exception"). The first
exit path is obsoleted by the fix in ca36960211eb ("bpf: allow xadd
only on aligned memory") anyway, and for the second one we need to
do a fetch, add and store loop if the reservation from lwarx/ldarx
was lost in the meantime.

Fixes: 156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
Reviewed-by: Naveen N. Rao 
Reviewed-by: Sandipan Das 
Tested-by: Sandipan Das 
Signed-off-by: Daniel Borkmann 
---
 arch/powerpc/net/bpf_jit_comp64.c | 29 +
 1 file changed, 5 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 380cbf9..c0a9bcd 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -286,6 +286,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
u64 imm64;
u8 *func;
u32 true_cond;
+   u32 tmp_idx;
 
/*
 * addrs[] maps a BPF bytecode address into a real offset from
@@ -637,11 +638,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
case BPF_STX | BPF_XADD | BPF_W:
/* Get EA into TMP_REG_1 */
PPC_ADDI(b2p[TMP_REG_1], dst_reg, off);
-   /* error if EA is not word-aligned */
-   PPC_ANDI(b2p[TMP_REG_2], b2p[TMP_REG_1], 0x03);
-   PPC_BCC_SHORT(COND_EQ, (ctx->idx * 4) + 12);
-   PPC_LI(b2p[BPF_REG_0], 0);
-   PPC_JMP(exit_addr);
+   tmp_idx = ctx->idx * 4;
/* load value from memory into TMP_REG_2 */
PPC_BPF_LWARX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1], 0);
/* add value from src_reg into this */
@@ -649,32 +646,16 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
/* store result back */
PPC_BPF_STWCX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1]);
/* we're done if this succeeded */
-   PPC_BCC_SHORT(COND_EQ, (ctx->idx * 4) + (7*4));
-   /* otherwise, let's try once more */
-   PPC_BPF_LWARX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1], 0);
-   PPC_ADD(b2p[TMP_REG_2], b2p[TMP_REG_2], src_reg);
-   PPC_BPF_STWCX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1]);
-   /* exit if the store was not successful */
-   PPC_LI(b2p[BPF_REG_0], 0);
-   PPC_BCC(COND_NE, exit_addr);
+   PPC_BCC_SHORT(COND_NE, tmp_idx);
break;
/* *(u64 *)(dst + off) += src */
case BPF_STX | BPF_XADD | BPF_DW:
PPC_ADDI(b2p[TMP_REG_1], dst_reg, off);
-   /* error if EA is not doubleword-aligned */
-   PPC_ANDI(b2p[TMP_REG_2], b2p[TMP_REG_1], 0x07);
-   PPC_BCC_SHORT(COND_EQ, (ctx->idx * 4) + (3*4));
-   PPC_LI(b2p[BPF_REG_0], 0);
-   PPC_JMP(exit_addr);
-   PPC_BPF_LDARX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1], 0);
-   PPC_ADD(b2p[TMP_REG_2], b2p[TMP_REG_2], src_reg);
-   PPC_BPF_STDCX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1]);
-   PPC_BCC_SHORT(COND_EQ, (ctx->idx * 4) + (7*4));
+   tmp_idx = ctx->idx * 4;
PPC_BPF_LDARX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1], 0);
PPC_ADD(b2p[TMP_REG_2], b2p[TMP_REG_2], src_reg);
PPC_BPF_STDCX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1]);
-   PPC_LI(b2p[BPF_REG_0], 0);
-   PPC_BCC(COND_NE, exit_addr);
+   PPC_BCC_SHORT(COND_NE, tmp_idx);
break;
 
/*
-- 
2.9.5



[PATCH bpf 2/2] bpf: test case to check whether src/dst regs got mangled by xadd

2018-07-19 Thread Daniel Borkmann
We currently do not have such a test case in test_verifier selftests
but it's important to test under bpf_jit_enable=1 to make sure JIT
implementations do not mistakenly mess with src/dst reg for xadd/{w,dw}.

Signed-off-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_verifier.c | 40 +
 1 file changed, 40 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index f5f7bcc..41106d9 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -12005,6 +12005,46 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_XDP,
},
{
+   "xadd/w check whether src/dst got mangled, 1",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+   BPF_STX_XADD(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+   BPF_STX_XADD(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+   BPF_JMP_REG(BPF_JNE, BPF_REG_6, BPF_REG_0, 3),
+   BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_10, 2),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_IMM(BPF_REG_0, 42),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 3,
+   },
+   {
+   "xadd/w check whether src/dst got mangled, 2",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
+   BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -8),
+   BPF_STX_XADD(BPF_W, BPF_REG_10, BPF_REG_0, -8),
+   BPF_STX_XADD(BPF_W, BPF_REG_10, BPF_REG_0, -8),
+   BPF_JMP_REG(BPF_JNE, BPF_REG_6, BPF_REG_0, 3),
+   BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_10, 2),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_10, -8),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_IMM(BPF_REG_0, 42),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 3,
+   },
+   {
"bpf_get_stack return R0 within range",
.insns = {
BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
-- 
2.9.5



[PATCH iproute2-next v4] net:sched: add action inheritdsfield to skbedit

2018-07-19 Thread Qiaobin Fu
The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

v4:
* Make tc use netlink helper functions

v3:
* Make flag represented in JSON output as a null value

v2:
* Align the output syntax with the input syntax

* Fix the style issues

Original idea by Jamal Hadi Salim 

Signed-off-by: Qiaobin Fu 
Reviewed-by: Michel Machado 
Reviewed-by: Cong Wang 
Reviewed-by: Marcelo Ricardo Leitner 
Reviewed-by: Stephen Hemminger 
Reviewed-by: David Ahern 
---

Note that the motivation for this patch is found in the following discussion:
https://www.spinics.net/lists/netdev/msg501061.html
---
 tc/m_skbedit.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/tc/m_skbedit.c b/tc/m_skbedit.c
index 7391fc7f..b6b839f8 100644
--- a/tc/m_skbedit.c
+++ b/tc/m_skbedit.c
@@ -30,16 +30,18 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... skbedit <[QM] [PM] [MM] [PT]>\n"
+   fprintf(stderr, "Usage: ... skbedit <[QM] [PM] [MM] [PT] [IF]>\n"
"QM = queue_mapping QUEUE_MAPPING\n"
"PM = priority PRIORITY\n"
"MM = mark MARK\n"
"PT = ptype PACKETYPE\n"
+   "IF = inheritdsfield\n"
"PACKETYPE = is one of:\n"
"  host, otherhost, broadcast, multicast\n"
"QUEUE_MAPPING = device transmit queue to use\n"
"PRIORITY = classID to assign to priority field\n"
-   "MARK = firewall mark to set\n");
+   "MARK = firewall mark to set\n"
+   "note: inheritdsfield maps DS field to skb->priority\n");
 }
 
 static void
@@ -60,6 +62,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
unsigned int tmp;
__u16 queue_mapping, ptype;
__u32 flags = 0, priority, mark;
+   __u64 pure_flags = 0;
struct tc_skbedit sel = { 0 };
 
if (matches(*argv, "skbedit") != 0)
@@ -111,6 +114,9 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
}
flags |= SKBEDIT_F_PTYPE;
ok++;
+   } else if (matches(*argv, "inheritdsfield") == 0) {
+   pure_flags |= SKBEDIT_F_INHERITDSFIELD;
+   ok++;
} else if (matches(*argv, "help") == 0) {
usage();
} else {
@@ -156,6 +162,8 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
if (flags & SKBEDIT_F_PTYPE)
addattr_l(n, MAX_MSG, TCA_SKBEDIT_PTYPE,
  &ptype, sizeof(ptype));
+   if (pure_flags != 0)
+   addattr64(n, MAX_MSG, TCA_SKBEDIT_FLAGS, pure_flags);
addattr_nest_end(n, tail);
 
*argc_p = argc;
@@ -214,6 +222,13 @@ static int print_skbedit(struct action_util *au, FILE *f, struct rtattr *arg)
else
print_uint(PRINT_ANY, "ptype", " ptype %u", ptype);
}
+   if (tb[TCA_SKBEDIT_FLAGS] != NULL) {
+   __u64 flags = rta_getattr_u64(tb[TCA_SKBEDIT_FLAGS]);
+
+   if (flags & SKBEDIT_F_INHERITDSFIELD)
+   print_null(PRINT_ANY, "inheritdsfield", " %s",
+"inheritdsfield");
+   }
 
print_action_control(f, " ", p->action, "");
 
-- 
2.17.1
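
For reference, a hypothetical usage example of the new flag (device name
and filter are made up; the syntax follows the explain() text in the patch,
and a kernel with SKBEDIT_F_INHERITDSFIELD support is assumed):

```shell
# Let skbedit copy the DS field of ingress IP packets into skb->priority.
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress protocol ip matchall \
        action skbedit inheritdsfield
# Per the v3 changelog, the flag shows up as a null value in JSON output:
tc -j actions list action skbedit
```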



Re: [PATCH bpf-next] bpf: show in bpftool map overview whether btf is available

2018-07-19 Thread Daniel Borkmann
On 07/18/2018 08:08 PM, Jakub Kicinski wrote:
> On Wed, 18 Jul 2018 11:19:42 +0200, Daniel Borkmann wrote:
>> For a quick overview in 'bpftool map' display 'btf' if it's
>> available for the dump for a specific map:
>>
>>   # bpftool map list
>>   11: array  flags 0x0  btf
>>   key 4B  value 20B  max_entries 40  memlock 4096B
>>
>>   # bpftool --json --pretty map list
>>   [{
>>   "id": 11,
>>   "type": "array",
>>   "flags": 0,
>>   "btf_available": true,
>>   "bytes_key": 4,
>>   "bytes_value": 20,
>>   "max_entries": 40,
>>   "bytes_memlock": 4096
>>   }
>>   ]
>>
>> Signed-off-by: Daniel Borkmann 
> 
> Hmm.. would it make sense to provide the actual BTF IDs instead of just
> yes/no?  At least in JSON?

Yeah that sounds reasonable to me. I think in that case it needs to be
BTF id as well as key and value id inside that BTF object to make some
sense out of it (BTF id alone would not be enough).

Probably makes sense to show the same information in plain text output
then.


Re: [PATCH iproute2 5/5] bpf: implement btf handling and map annotation

2018-07-19 Thread Daniel Borkmann
On 07/19/2018 02:11 AM, Martin KaFai Lau wrote:
> On Wed, Jul 18, 2018 at 11:13:37AM -0700, Jakub Kicinski wrote:
>> On Wed, 18 Jul 2018 11:33:22 +0200, Daniel Borkmann wrote:
>>> On 07/18/2018 10:42 AM, Daniel Borkmann wrote:
 On 07/18/2018 02:27 AM, Jakub Kicinski wrote:  
> On Wed, 18 Jul 2018 01:31:22 +0200, Daniel Borkmann wrote:  
>>   # bpftool map dump id 386
>>[{
>> "key": 0,
>> "value": {
>> "": {
>> "value": 0,
>> "ifindex": 0,
>> "mac": []
>> }
>> }
>> },{
>> "key": 1,
>> "value": {
>> "": {
>> "value": 0,
>> "ifindex": 0,
>> "mac": []
>> }
>> }
>> },{
>>   [...]  
>
> Ugh, the empty keys ("") look worrying, we should probably improve
> handling of anonymous structs in bpftool :S  

 Yeah agree, I think it would be nice to see a more pahole style dump
 where we have types and member names along with the value as otherwise
 it might be a bit confusing.  
>>>
>>> Another feature that would be super useful imho would be in the /single/
>>> map view e.g. 'bpftool map show id 123' to have a detailed BTF key+value
>>> type dump, so in addition to the basic map info we show pahole like info
>>> of the structs with length/offsets.
>>
>> That sounds good!  We could also consider adding a btf object and
>> commands to interrogate BTF types in the kernel in general..  Perhaps
>> then we could add something like bpftool btf describe map id 123.
> +1 on the btf subcommand.

That would also work, I think both might be useful to have. Former would
all sit under a single command to show map details. With 'bpftool btf' you
would also allow for a full BTF dump when a specific BTF obj id is provided?

>> Having the single map view show more information seems interesting, but
>> I wonder if it could be surprising.  Is there precedent for such
>> behaviour?
> Having everything in one page (map show id 123) could be interesting.
> One thing is the pahole-like output may be quite long?
> e.g. the member of a struct could itself be another struct.

Right, though probably fine when you want to see all information specific
to one map. Of course the 'bpftool map' list view would need to hide this
information.

> Not sure how the pahole-like output may look like in json though.

Would the existing map k/v dump have more or less the same 'issue'?


Re: [PATCH net-next,v2] net: rename ndo_setup_tc to ndo_setup_offload

2018-07-19 Thread Alexander Duyck
On Wed, Jul 18, 2018 at 5:11 PM, Pablo Neira Ayuso  wrote:
> One of the recurring complaints is that we do not have, as a driver
> writer, a central location from which we would be fed offloading rules
> into a NIC. This was brought up again during Netconf'18 in Boston.
>
> This patch just renames ndo_setup_tc to ndo_setup_offload as a very
> early initial work to prepare for a follow-up patch that discusses unified
> flow representation for the existing offload programming APIs.
>
> Signed-off-by: Pablo Neira Ayuso 
> Acked-by: Jiri Pirko 
> Acked-by: Jakub Kicinski 

One request I would have here is to not bother updating the individual
driver function names. For now I would say we could leave the
"_setup_tc" in the naming of the driver functions itself and just
update the name of the net device operation. Renaming the driver
functions just adds unnecessary overhead and complexity to the patch
and will make it more difficult to maintain. When we get around to
adding additional functionality that relates to the rename we could
address renaming the function on a per driver basis in the future.


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-07-19 Thread Max Gurtovoy




On 7/18/2018 10:29 PM, Steve Wise wrote:




On 7/18/2018 2:38 PM, Sagi Grimberg wrote:



IMO we must fulfil the user's wish to connect to N queues and not reduce
it because of affinity overlaps. So in order to push Leon's patch we
must also fix blk_mq_rdma_map_queues to do a best-effort mapping
according to the affinity and map the rest in a naive way (in that way we
will *always* map all the queues).


That is what I would expect also.  For example, in my node, where
there are 16 cpus and 2 numa nodes, I observe much better nvmf IOPS
performance by setting up my 16 driver completion event queues such
that each is bound to a node-local cpu.  So I end up with each
node-local cpu having 2 queues bound to it.  W/O adding support in
iw_cxgb4 for ib_get_vector_affinity(), this works fine.  I assumed
adding ib_get_vector_affinity() would allow this to all "just work"
by default, but I'm running into this connection failure issue.

I don't understand exactly what the blk_mq layer is trying to do, but I
assume it has ingress event queues and processing that it is trying to
align with the driver's ingress cq event handling, so everybody stays on
the same cpu (or at least node).  But something else is going on.  Is
there documentation on how this works somewhere?


Does this (untested) patch help?


I'm not sure (I'll test it tomorrow) because the issue is the unmapped
queues and not the cpus.
For example, if the affinity of q=6 and q=12 returned the same cpumask,
then q=6 will not be mapped and will fail to connect.



Attached is a patch that applies cleanly for me.  It has problems if vectors 
have affinity to more than 1 cpu:

[ 2031.91] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
[ 2031.994706] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
[ 2032.000348] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
[ 2032.005992] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
[ 2032.011629] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
[ 2032.017271] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
[ 2032.022901] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
[ 2032.028514] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
[ 2032.034110] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
[ 2032.039677] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
[ 2032.045244] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
[ 2032.050889] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
[ 2032.056531] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
[ 2032.062174] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
[ 2032.067817] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
[ 2032.073457] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
[ 2032.079102] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
[ 2032.085621] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
[ 2032.092139] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
[ 2032.098658] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
[ 2032.105177] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
[ 2032.111689] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
[ 2032.118208] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
[ 2032.124728] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
[ 2032.131246] blk_mq_rdma_map_queues: set->mq_map[8] queue 15 vector 15
[ 2032.137938] blk_mq_rdma_map_queues: set->mq_map[9] queue 15 vector 15
[ 2032.144629] blk_mq_rdma_map_queues: set->mq_map[10] queue 15 vector 15
[ 2032.151401] blk_mq_rdma_map_queues: set->mq_map[11] queue 15 vector 15
[ 2032.158172] blk_mq_rdma_map_queues: set->mq_map[12] queue 15 vector 15
[ 2032.164940] blk_mq_rdma_map_queues: set->mq_map[13] queue 15 vector 15
[ 2032.171709] blk_mq_rdma_map_queues: set->mq_map[14] queue 15 vector 15
[ 2032.178477] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
[ 2032.187409] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18


queue 9 is not mapped (overlap).
Please try the patch below:

diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f..a91d611 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -34,14 +34,55 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 {
const struct cpumask *mask;
unsigned int queue, cpu;
+   bool mapped;

+   /* reset all CPUs mapping */
+   for_each_possible_cpu(cpu)
+   set->mq_map[cpu] = UINT_MAX;
+
+   /* Try to map the queues according to affinity */
for (queue = 0; queue < set->nr_hw_queues; queue++) {
mask = ib_get_vector_affinity(dev, first_vec + queue);
if (!mask)
goto fallback;

-   for_each_cpu(cpu, mask)
-   set->mq_map[cpu] = queue;
+   for_each_cpu(cpu, mask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   /* Initially each queue mapped to 1 cpu */
+ 

[PATCH net-next] net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsioc

2018-07-19 Thread Tariq Toukan
The cited patch added a call to dev_change_tx_queue_len in the
SIOCSIFTXQLEN case.
This obsoletes the checks done before the function call.
Remove them here.

Fixes: 3f76df198288 ("net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN")
Signed-off-by: Tariq Toukan 
Reviewed-by: Eran Ben Elisha 
Cc: Cong Wang 
---
 net/core/dev_ioctl.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 50537ff961a7..7c1ad40322a3 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -282,14 +282,7 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, 
unsigned int cmd)
return dev_mc_del_global(dev, ifr->ifr_hwaddr.sa_data);
 
case SIOCSIFTXQLEN:
-   if (ifr->ifr_qlen < 0)
-   return -EINVAL;
-   if (dev->tx_queue_len ^ ifr->ifr_qlen) {
-   err = dev_change_tx_queue_len(dev, ifr->ifr_qlen);
-   if (err)
-   return err;
-   }
-   return 0;
+   return dev_change_tx_queue_len(dev, ifr->ifr_qlen);
 
case SIOCSIFNAME:
ifr->ifr_newname[IFNAMSIZ-1] = '\0';
-- 
1.8.3.1



Re: [PATCH net-next] net: phy: add GBit master / slave error detection

2018-07-19 Thread Andrew Lunn
> > AFAIR there was a patch a while ago from Mellanox guys that was possibly
> > extending the link notification with an error cause, this sounds like
> > something that could be useful to report to user space somehow to help
> > troubleshoot link down events.
> > 
> Do you by chance have a reference to this patch? There's heavy development
> on the Mellanox drivers with a lot of patches.

Hi Heiner, Florian

A general mechanism has been added to allow error messages to be
reported via netlink sockets. I think wifi was the first to actually
make use of it, since I think Johannes Berg did the core work, but
other parts of the stack have also started using it.

Just picking a commit at random, maybe not the best of examples:

Fixes: 768075ebc238 ("nl80211: add a few extended error strings to key parsing")

   Andrew


[PATCH net] net: rollback orig value on failure of dev_qdisc_change_tx_queue_len

2018-07-19 Thread Tariq Toukan
Fix dev_change_tx_queue_len so it rolls back the original value
upon a failure in dev_qdisc_change_tx_queue_len.
This is already done for notifiers' failures; share the code.

The revert of the changes in dev_qdisc_change_tx_queue_len
in such a case is needed but missing (marked as TODO), yet it is
still better to not apply the new requested value.

Fixes: 48bfd55e7e41 ("net_sched: plug in qdisc ops change_tx_queue_len")
Signed-off-by: Tariq Toukan 
Reviewed-by: Eran Ben Elisha 
Reported-by: Ran Rozenstein 
Cc: Cong Wang 
---
 net/core/dev.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7444e6..559a91271f82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7149,16 +7149,19 @@ int dev_change_tx_queue_len(struct net_device *dev, 
unsigned long new_len)
dev->tx_queue_len = new_len;
res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
res = notifier_to_errno(res);
-   if (res) {
-   netdev_err(dev,
-  "refused to change device tx_queue_len\n");
-   dev->tx_queue_len = orig_len;
-   return res;
-   }
-   return dev_qdisc_change_tx_queue_len(dev);
+   if (res)
+   goto err_rollback;
+   res = dev_qdisc_change_tx_queue_len(dev);
+   if (res)
+   goto err_rollback;
}
 
return 0;
+
+err_rollback:
+   netdev_err(dev, "refused to change device tx_queue_len\n");
+   dev->tx_queue_len = orig_len;
+   return res;
 }
 
 /**
-- 
1.8.3.1



[PATCH net-next 3/4] net/tc: introduce TC_ACT_MIRRED.

2018-07-19 Thread Paolo Abeni
This is similar to TC_ACT_REDIRECT, but with a slightly different
semantic:
- on ingress the mirred skbs are passed to the target device's
  network stack without any additional check nor scrubbing.
- the rcu-protected stats provided via the tcf_result struct
  are updated on error conditions.

v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce
 a new action type instead

Signed-off-by: Paolo Abeni 
---
 include/net/sch_generic.h| 19 +++
 include/uapi/linux/pkt_cls.h |  3 ++-
 net/core/dev.c   |  4 
 net/sched/act_api.c  |  6 --
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 056dc1083aa3..667d7b66fee2 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -235,6 +235,12 @@ struct tcf_result {
u32 classid;
};
const struct tcf_proto *goto_tp;
+
+   /* used by the TC_ACT_MIRRED action */
+   struct {
+   boolingress;
+   struct gnet_stats_queue *qstats;
+   };
};
 };
 
@@ -1091,4 +1097,17 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
 void mini_qdisc_pair_init(struct mini_Qdisc_pair *miniqp, struct Qdisc *qdisc,
  struct mini_Qdisc __rcu **p_miniq);
 
+static inline void skb_tc_redirect(struct sk_buff *skb, struct tcf_result *res)
+{
+   struct gnet_stats_queue *stats = res->qstats;
+   int ret;
+
+   if (res->ingress)
+   ret = netif_receive_skb(skb);
+   else
+   ret = dev_queue_xmit(skb);
+   if (ret && stats)
+   qstats_overlimit_inc(res->qstats);
+}
+
 #endif
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 7cdd62b51106..1a5e8a3217f3 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -45,7 +45,8 @@ enum {
   * the skb and act like everything
   * is alright.
   */
-#define TC_ACT_LASTTC_ACT_TRAP
+#define TC_ACT_MIRRED  9
+#define TC_ACT_LASTTC_ACT_MIRRED
 
 /* There is a special kind of actions called "extended actions",
  * which need a value parameter. These have a local opcode located in
diff --git a/net/core/dev.c b/net/core/dev.c
index 14a748ee8cc9..3822f29d730f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4602,6 +4602,10 @@ sch_handle_ingress(struct sk_buff *skb, struct 
packet_type **pt_prev, int *ret,
__skb_push(skb, skb->mac_len);
skb_do_redirect(skb);
return NULL;
+   case TC_ACT_MIRRED:
+   /* this does not scrub the packet, and updates stats on error */
+   skb_tc_redirect(skb, &cl_res);
+   return NULL;
default:
break;
}
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f6438f246dab..029302e2813e 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -895,8 +895,10 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
}
}
 
-   if (a->tcfa_action == TC_ACT_REDIRECT) {
-   net_warn_ratelimited("TC_ACT_REDIRECT can't be used directly");
+   if (a->tcfa_action == TC_ACT_REDIRECT ||
+   a->tcfa_action == TC_ACT_MIRRED) {
+   net_warn_ratelimited("action %d can't be used directly",
+a->tcfa_action);
a->tcfa_action = TC_ACT_LAST + 1;
}
 
-- 
2.17.1



[PATCH net-next 4/4] act_mirred: use ACT_MIRRED when possible

2018-07-19 Thread Paolo Abeni
When mirred is invoked from the ingress path and it wants to redirect
the processed packet, it can now use the ACT_MIRRED action,
filling the tcf_result accordingly and avoiding a per-packet
skb_clone().

Overall this gives a ~10% improvement in forwarding performance for the
TC S/W data path; TC S/W performance is now comparable to the
kernel openvswitch datapath.

v1 -> v2: use ACT_MIRRED instead of ACT_REDIRECT

Signed-off-by: Paolo Abeni 
---
 net/sched/act_mirred.c | 31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index eeb335f03102..fa06c161f139 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -171,10 +171,12 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
  struct tcf_result *res)
 {
struct tcf_mirred *m = to_mirred(a);
+   struct sk_buff *skb2 = skb;
bool m_mac_header_xmit;
struct net_device *dev;
-   struct sk_buff *skb2;
int retval, err = 0;
+   bool want_ingress;
+   bool is_redirect;
int m_eaction;
int mac_len;
 
@@ -196,16 +198,19 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
goto out;
}
 
-   skb2 = skb_clone(skb, GFP_ATOMIC);
-   if (!skb2)
-   goto out;
+   is_redirect = tcf_mirred_is_act_redirect(m_eaction);
+   if (!skb_at_tc_ingress(skb) || !is_redirect) {
+   skb2 = skb_clone(skb, GFP_ATOMIC);
+   if (!skb2)
+   goto out;
+   }
 
/* If action's target direction differs than filter's direction,
 * and devices expect a mac header on xmit, then mac push/pull is
 * needed.
 */
-   if (skb_at_tc_ingress(skb) != tcf_mirred_act_wants_ingress(m_eaction) &&
-   m_mac_header_xmit) {
+   want_ingress = tcf_mirred_act_wants_ingress(m_eaction);
+   if (skb_at_tc_ingress(skb) != want_ingress && m_mac_header_xmit) {
if (!skb_at_tc_ingress(skb)) {
/* caught at egress, act ingress: pull mac */
mac_len = skb_network_header(skb) - skb_mac_header(skb);
@@ -216,14 +221,22 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
}
}
 
+   skb2->skb_iif = skb->dev->ifindex;
+   skb2->dev = dev;
+
/* mirror is always swallowed */
-   if (tcf_mirred_is_act_redirect(m_eaction)) {
+   if (is_redirect) {
skb2->tc_redirected = 1;
skb2->tc_from_ingress = skb2->tc_at_ingress;
+
+   /* let the caller reinject the packet, if possible */
+   if (skb_at_tc_ingress(skb)) {
+   res->ingress = want_ingress;
+   res->qstats = this_cpu_ptr(m->common.cpu_qstats);
+   return TC_ACT_MIRRED;
+   }
}
 
-   skb2->skb_iif = skb->dev->ifindex;
-   skb2->dev = dev;
if (!tcf_mirred_act_wants_ingress(m_eaction))
err = dev_queue_xmit(skb2);
else
-- 
2.17.1



[PATCH net-next 1/4] tc/act: user space can't use TC_ACT_REDIRECT directly

2018-07-19 Thread Paolo Abeni
Only cls_bpf and act_bpf can safely use such a value. If a generic
action is configured by user space to return TC_ACT_REDIRECT,
the usually visible behavior is passing the skb up the stack, as
for an unknown action; but, with complex configurations, more random
results can be obtained.

This patch forcefully converts TC_ACT_REDIRECT to TC_ACT_LAST + 1
at action init time, making the kernel behavior more consistent.

Signed-off-by: Paolo Abeni 
---
 include/uapi/linux/pkt_cls.h | 1 +
 net/sched/act_api.c  | 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index c4262d911596..7cdd62b51106 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -45,6 +45,7 @@ enum {
   * the skb and act like everything
   * is alright.
   */
+#define TC_ACT_LASTTC_ACT_TRAP
 
 /* There is a special kind of actions called "extended actions",
  * which need a value parameter. These have a local opcode located in
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 148a89ab789b..f6438f246dab 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -895,6 +895,11 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
}
}
 
+   if (a->tcfa_action == TC_ACT_REDIRECT) {
+   net_warn_ratelimited("TC_ACT_REDIRECT can't be used directly");
+   a->tcfa_action = TC_ACT_LAST + 1;
+   }
+
return a;
 
 err_mod:
-- 
2.17.1



[PATCH net-next 2/4] tc/act: remove unneeded RCU lock in action callback

2018-07-19 Thread Paolo Abeni
Each lockless action currently does its own RCU locking in ->act().
This allows using plain RCU accessors, even if the context
is really RCU BH.

This change drops the per-action RCU lock, replaces the accessors
with the _bh variants, cleans up the surrounding code a bit and documents
the RCU status in the relevant header.
No functional nor performance change is intended.

The goal of this patch is clarifying that the RCU critical section
used by the tc actions extends up to the classifier's caller.

v1 -> v2:
 - preserve rcu lock in act_bpf: it's needed by eBPF helpers,
   as pointed out by Daniel

Signed-off-by: Paolo Abeni 
---
 include/net/act_api.h  |  2 +-
 include/net/sch_generic.h  |  2 ++
 net/sched/act_csum.c   | 12 +++-
 net/sched/act_ife.c|  5 +
 net/sched/act_mirred.c |  4 +---
 net/sched/act_sample.c |  4 +---
 net/sched/act_skbedit.c| 10 +++---
 net/sched/act_skbmod.c | 21 +
 net/sched/act_tunnel_key.c |  6 +-
 net/sched/act_vlan.c   | 19 +++
 10 files changed, 29 insertions(+), 56 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 683ce41053d9..8c9bc02d05e1 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -85,7 +85,7 @@ struct tc_action_ops {
size_t  size;
struct module   *owner;
int (*act)(struct sk_buff *, const struct tc_action *,
-  struct tcf_result *);
+  struct tcf_result *); /* called under RCU BH lock*/
int (*dump)(struct sk_buff *, struct tc_action *, int, int);
void(*cleanup)(struct tc_action *);
int (*lookup)(struct net *net, struct tc_action **a, u32 index,
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 7432100027b7..056dc1083aa3 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -275,6 +275,8 @@ struct tcf_proto {
/* Fast access part */
struct tcf_proto __rcu  *next;
void __rcu  *root;
+
+   /* called under RCU BH lock*/
int (*classify)(struct sk_buff *,
const struct tcf_proto *,
struct tcf_result *);
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index bd232d3bd022..4f092fbd1e20 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -561,15 +561,14 @@ static int tcf_csum(struct sk_buff *skb, const struct 
tc_action *a,
u32 update_flags;
int action;
 
-   rcu_read_lock();
-   params = rcu_dereference(p->params);
+   params = rcu_dereference_bh(p->params);
 
tcf_lastuse_update(&p->tcf_tm);
bstats_cpu_update(this_cpu_ptr(p->common.cpu_bstats), skb);
 
action = params->action;
if (unlikely(action == TC_ACT_SHOT))
-   goto drop_stats;
+   goto drop;
 
update_flags = params->update_flags;
switch (tc_skb_protocol(skb)) {
@@ -583,16 +582,11 @@ static int tcf_csum(struct sk_buff *skb, const struct 
tc_action *a,
break;
}
 
-unlock:
-   rcu_read_unlock();
return action;
 
 drop:
-   action = TC_ACT_SHOT;
-
-drop_stats:
qstats_drop_inc(this_cpu_ptr(p->common.cpu_qstats));
-   goto unlock;
+   return TC_ACT_SHOT;
 }
 
 static int tcf_csum_dump(struct sk_buff *skb, struct tc_action *a, int bind,
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 3d6e265758c0..df4060e32d43 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -820,14 +820,11 @@ static int tcf_ife_act(struct sk_buff *skb, const struct 
tc_action *a,
struct tcf_ife_params *p;
int ret;
 
-   rcu_read_lock();
-   p = rcu_dereference(ife->params);
+   p = rcu_dereference_bh(ife->params);
if (p->flags & IFE_ENCODE) {
ret = tcf_ife_encode(skb, a, res, p);
-   rcu_read_unlock();
return ret;
}
-   rcu_read_unlock();
 
return tcf_ife_decode(skb, a, res);
 }
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 6afd89a36c69..eeb335f03102 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -181,11 +181,10 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
tcf_lastuse_update(&m->tcf_tm);
bstats_cpu_update(this_cpu_ptr(m->common.cpu_bstats), skb);
 
-   rcu_read_lock();
m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
m_eaction = READ_ONCE(m->tcfm_eaction);
retval = READ_ONCE(m->tcf_action);
-   dev = rcu_dereference(m->tcfm_dev);
+   dev = rcu_dereference_bh(m->tcfm_dev);
if (unlikely(!dev)) {
pr_notice_once("tc mirred: target device is gone\n");
goto out;
@@ -236,7 +235,6 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
  

[PATCH net-next 0/4] TC: refactor act_mirred packets re-injection

2018-07-19 Thread Paolo Abeni
This series is aimed at improving act_mirred redirect performance.
Such action is used by OVS to represent TC S/W flows, and its current
largest bottleneck is the need for an skb_clone() for each packet.

The first 2 patches introduce some cleanup and safeguards to allow extending
tca_result: we will use it to store RCU-protected redirect information.
Then a new tca_action value is introduced: TC_ACT_MIRRED, similar to
TC_ACT_REDIRECT, but preserving the mirred semantic. The last patch exploits
the introduced infrastructure in the act_mirred action, to avoid an
skb_clone when possible.

Overall, the above gives a ~10% performance improvement in forwarding
throughput when using the TC S/W datapath.

v1 -> v2:
 - preserve the rcu lock in act_bpf
 - add and use a new action value to reinject the packets, preserving the mirred
   semantic

Paolo Abeni (4):
  tc/act: user space can't use TC_ACT_REDIRECT directly
  tc/act: remove unneeded RCU lock in action callback
  net/tc: introduce TC_ACT_MIRRED.
  act_mirred: use ACT_MIRRED when possible

 include/net/act_api.h|  2 +-
 include/net/sch_generic.h| 21 +
 include/uapi/linux/pkt_cls.h |  2 ++
 net/core/dev.c   |  4 
 net/sched/act_api.c  |  7 +++
 net/sched/act_csum.c | 12 +++-
 net/sched/act_ife.c  |  5 +
 net/sched/act_mirred.c   | 35 +++
 net/sched/act_sample.c   |  4 +---
 net/sched/act_skbedit.c  | 10 +++---
 net/sched/act_skbmod.c   | 21 +
 net/sched/act_tunnel_key.c   |  6 +-
 net/sched/act_vlan.c | 19 +++
 13 files changed, 83 insertions(+), 65 deletions(-)

-- 
2.17.1



Re: [PATCH iproute2/next 1/2] tc/act_tunnel_key: Enable setup of tos and ttl

2018-07-19 Thread Roman Mashak
Or Gerlitz  writes:

> Allow setting the tos and ttl for the tunnel.
>
> For example, here's an encap rule that sets the tos on the tunnel:
>
> tc filter add dev eth0_0 protocol ip parent : prio 10 flower \
>src_mac e4:11:22:33:44:50 dst_mac e4:11:22:33:44:70 \
>action tunnel_key set src_ip 192.168.10.1 dst_ip 192.168.10.2 id 100 
> dst_port 4789 tos 0x30 \
>action mirred egress redirect dev vxlan_sys_4789
>
> Signed-off-by: Or Gerlitz 
> Reviewed-by: Roi Dayan 
> Acked-by: Jiri Pirko 

[...]

Or, could you also update the tunnel_key actions for the new options in
$(kernel)/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
once the patches are accepted?


[net-next v5 3/3] net/tls: Remove redundant array allocation.

2018-07-19 Thread Vakul Garg
In function decrypt_skb(), the array allocation in the case when sgout is
NULL is unnecessary. Instead, the local variable sgin_arr[] can be used.

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e15ace0ebd79..1aa2d46713d7 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -704,7 +704,6 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
if (!sgout) {
nsg = skb_cow_data(skb, 0, &unused) + 1;
-   sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);
sgout = sgin;
}
 
@@ -725,9 +724,6 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
rxm->full_len - tls_ctx->rx.overhead_size,
skb, sk->sk_allocation);
 
-   if (sgin != &sgin_arr[0])
-   kfree(sgin);
-
return ret;
 }
 
-- 
2.13.6



[net-next v5 0/3] net/tls: Minor code cleanup patches

2018-07-19 Thread Vakul Garg
This patch series improves tls_sw.c code by:

1) Using correct socket callback for flagging data availability.
2) Removing redundant variable assignments and wakeup callbacks.
3) Removing redundant dynamic array allocation.

The patches do not fix any functional bug. Hence the "Fixes:" tag has not
been used. Compared to patch series v3, this series contains two fewer
patches. They will be submitted separately.

Vakul Garg (3):
  net/tls: Use socket data_ready callback on record availability
  net/tls: Remove redundant variable assignments and wakeup
  net/tls: Remove redundant array allocation.

 net/tls/tls_sw.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

-- 
2.13.6



[net-next v5 1/3] net/tls: Use socket data_ready callback on record availability

2018-07-19 Thread Vakul Garg
On receipt of a complete tls record, use the socket's saved data_ready
callback instead of the state_change callback.
Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 7d194c0cd6cf..a58661c624ec 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1023,7 +1023,7 @@ static void tls_queue(struct strparser *strp, struct 
sk_buff *skb)
ctx->recv_pkt = skb;
strp_pause(strp);
 
-   strp->sk->sk_state_change(strp->sk);
+   ctx->saved_data_ready(strp->sk);
 }
 
 static void tls_data_ready(struct sock *sk)
-- 
2.13.6



[net-next v5 2/3] net/tls: Remove redundant variable assignments and wakeup

2018-07-19 Thread Vakul Garg
In function decrypt_skb_update(), the assignment to the tls receive context
variable 'decrypted' is redundant, as the same is done in function
tls_sw_recvmsg() after calling decrypt_skb_update(). Also, calling the
callback function to wake up processes sleeping on socket data availability
is useless, as decrypt_skb_update() is invoked from user processes only.
This patch cleans these up.

Signed-off-by: Vakul Garg 
---

Changes from v4->v5: Fixed compilation issue.

 net/tls/tls_sw.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index a58661c624ec..e15ace0ebd79 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -679,8 +679,6 @@ static int decrypt_skb_update(struct sock *sk, struct 
sk_buff *skb,
rxm->offset += tls_ctx->rx.prepend_size;
rxm->full_len -= tls_ctx->rx.overhead_size;
tls_advance_record_sn(sk, &tls_ctx->rx);
-   ctx->decrypted = true;
-   ctx->saved_data_ready(sk);
 
return err;
 }
-- 
2.13.6



[net-next v4 0/3] net/tls: Minor code cleanup patches

2018-07-19 Thread Vakul Garg
This patch series improves tls_sw.c code by:

1) Using correct socket callback for flagging data availability.
2) Removing redundant variable assignments and wakeup callbacks.
3) Removing redundant dynamic array allocation.

The patches do not fix any functional bug. Hence the "Fixes:" tag has not
been used. Compared to patch series v3, this series v4 contains two fewer
patches. They will be submitted separately.

Vakul Garg (3):
  net/tls: Use socket data_ready callback on record availability
  net/tls: Remove redundant variable assignments and wakeup
  net/tls: Remove redundant array allocation.

 net/tls/tls_sw.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

-- 
2.13.6



[PATCH iproute2/next 1/2] tc/act_tunnel_key: Enable setup of tos and ttl

2018-07-19 Thread Or Gerlitz
Allow setting the tos and ttl for the tunnel.

For example, here's an encap rule that sets the tos on the tunnel:

tc filter add dev eth0_0 protocol ip parent : prio 10 flower \
   src_mac e4:11:22:33:44:50 dst_mac e4:11:22:33:44:70 \
   action tunnel_key set src_ip 192.168.10.1 dst_ip 192.168.10.2 id 100 
dst_port 4789 tos 0x30 \
   action mirred egress redirect dev vxlan_sys_4789

Signed-off-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Acked-by: Jiri Pirko 
---
 include/uapi/linux/tc_act/tc_tunnel_key.h |2 +
 man/man8/tc-tunnel_key.8  |8 
 tc/m_tunnel_key.c |   53 +
 3 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h 
b/include/uapi/linux/tc_act/tc_tunnel_key.h
index e284fec..be384d6 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -39,6 +39,8 @@ enum {
TCA_TUNNEL_KEY_ENC_OPTS,/* Nested TCA_TUNNEL_KEY_ENC_OPTS_
 * attributes
 */
+   TCA_TUNNEL_KEY_ENC_TOS, /* u8 */
+   TCA_TUNNEL_KEY_ENC_TTL, /* u8 */
__TCA_TUNNEL_KEY_MAX,
 };
 
diff --git a/man/man8/tc-tunnel_key.8 b/man/man8/tc-tunnel_key.8
index 7d4b30e..1e09362 100644
--- a/man/man8/tc-tunnel_key.8
+++ b/man/man8/tc-tunnel_key.8
@@ -16,6 +16,8 @@ tunnel_key - Tunnel metadata manipulation
 .IR ADDRESS
 .BI id " KEY_ID"
 .BI dst_port " UDP_PORT"
+.BI tos " TOS"
+.BI ttl " TTL"
 .RB "[ " csum " | " nocsum " ]"
 
 .SH DESCRIPTION
@@ -89,6 +91,12 @@ is specified in the form CLASS:TYPE:DATA, where CLASS is 
represented as a
 variable length hexadecimal value. Additionally multiple options may be
 listed using a comma delimiter.
 .TP
+.B tos
+Outer header TOS
+.TP
+.B ttl
+Outer header TTL
+.TP
 .RB [ no ] csum
Controls outer UDP checksum. When set to
 .B csum
diff --git a/tc/m_tunnel_key.c b/tc/m_tunnel_key.c
index 5a0e3fc..e9e71e4 100644
--- a/tc/m_tunnel_key.c
+++ b/tc/m_tunnel_key.c
@@ -190,6 +190,22 @@ static int tunnel_key_parse_geneve_opts(char *str, struct 
nlmsghdr *n)
return 0;
 }
 
+static int tunnel_key_parse_tos_ttl(char *str, int type, struct nlmsghdr *n)
+{
+   int ret;
+   __u8 val;
+
+   ret = get_u8(&val, str, 10);
+   if (ret)
+   ret = get_u8(&val, str, 16);
+   if (ret)
+   return -1;
+
+   addattr8(n, MAX_MSG, type, val);
+
+   return 0;
+}
+
 static int parse_tunnel_key(struct action_util *a, int *argc_p, char ***argv_p,
int tca_id, struct nlmsghdr *n)
 {
@@ -273,6 +289,22 @@ static int parse_tunnel_key(struct action_util *a, int *argc_p, char ***argv_p,
fprintf(stderr, "Illegal \"geneve_opts\"\n");
return -1;
}
+   } else if (matches(*argv, "tos") == 0) {
+   NEXT_ARG();
+   ret = tunnel_key_parse_tos_ttl(*argv,
+   TCA_TUNNEL_KEY_ENC_TOS, n);
+   if (ret < 0) {
+   fprintf(stderr, "Illegal \"tos\"\n");
+   return -1;
+   }
+   } else if (matches(*argv, "ttl") == 0) {
+   NEXT_ARG();
+   ret = tunnel_key_parse_tos_ttl(*argv,
+   TCA_TUNNEL_KEY_ENC_TTL, n);
+   if (ret < 0) {
+   fprintf(stderr, "Illegal \"ttl\"\n");
+   return -1;
+   }
} else if (matches(*argv, "csum") == 0) {
csum = 1;
} else if (matches(*argv, "nocsum") == 0) {
@@ -435,6 +467,23 @@ static void tunnel_key_print_key_opt(const char *name, struct rtattr *attr)
tb[TCA_TUNNEL_KEY_ENC_OPTS_GENEVE]);
 }
 
+static void tunnel_key_print_tos_ttl(FILE *f, char *name,
+struct rtattr *attr)
+{
+   if (!attr)
+   return;
+
+   if (matches(name, "tos") == 0 && rta_getattr_u8(attr) != 0) {
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+   print_uint(PRINT_ANY, "tos", "\ttos 0x%x",
+  rta_getattr_u8(attr));
+   } else if (matches(name, "ttl") == 0 && rta_getattr_u8(attr) != 0) {
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+   print_uint(PRINT_ANY, "ttl", "\tttl %u",
+  rta_getattr_u8(attr));
+   }
+}
+
 static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg)
 {
struct rtattr *tb[TCA_TUNNEL_KEY_MAX + 1];
@@ -476,6 +525,10 @@ static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg)
   

[PATCH iproute2/next 0/2] set/match the tos/ttl fields of TC based IP tunnels

2018-07-19 Thread Or Gerlitz
Hi Dave,

This series comes to address the case to set (encap) and match (decap)
also the tos and ttl fields of TC based IP tunnels.

Example command lines in the change log of each patch.

The kernel bits are under review [1], sending this out in parallel.

Or.

[1] https://patchwork.ozlabs.org/cover/945216/

Or Gerlitz (2):
  tc/act_tunnel_key: Enable setup of tos and ttl
  tc/flower: Add match on encapsulating tos/ttl

 include/uapi/linux/pkt_cls.h  |5 +++
 include/uapi/linux/tc_act/tc_tunnel_key.h |2 +
 man/man8/tc-flower.8  |   14 +++-
 man/man8/tc-tunnel_key.8  |8 
 tc/f_flower.c |   27 +++
 tc/m_tunnel_key.c |   53 +
 6 files changed, 108 insertions(+), 1 deletions(-)



[PATCH iproute2/next 2/2] tc/flower: Add match on encapsulating tos/ttl

2018-07-19 Thread Or Gerlitz
Add matching on tos/ttl of the IP tunnel headers.

For example, here's a decap rule that matches on the tunnel tos:

tc filter add dev vxlan_sys_4789 protocol ip parent : prio 10 flower \
   enc_src_ip 192.168.10.2 enc_dst_ip 192.168.10.1 enc_key_id 100 enc_dst_port 4789 enc_tos 0x30 \
   src_mac e4:11:22:33:44:70 dst_mac e4:11:22:33:44:50  \
   action tunnel_key unset \
   action mirred egress redirect dev eth0_0

Signed-off-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Acked-by: Jiri Pirko 
---
 include/uapi/linux/pkt_cls.h |5 +
 man/man8/tc-flower.8 |   14 +-
 tc/f_flower.c|   27 +++
 3 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index c4262d9..b451225 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -473,6 +473,11 @@ enum {
TCA_FLOWER_KEY_CVLAN_PRIO,  /* u8   */
TCA_FLOWER_KEY_CVLAN_ETH_TYPE,  /* be16 */
 
+   TCA_FLOWER_KEY_ENC_IP_TOS,  /* u8 */
+   TCA_FLOWER_KEY_ENC_IP_TOS_MASK, /* u8 */
+   TCA_FLOWER_KEY_ENC_IP_TTL,  /* u8 */
+   TCA_FLOWER_KEY_ENC_IP_TTL_MASK, /* u8 */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index bfa66d8..305d7ef 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -76,6 +76,10 @@ flower \- flow based traffic control filter
 .IR ipv4_address " | " ipv6_address " } | "
 .B enc_dst_port
 .IR port_number " | "
+.B enc_tos
+.IR TOS " | "
+.B enc_ttl
+.IR TTL " | "
 .BR ip_flags
 .IR IP_FLAGS
 .SH DESCRIPTION
@@ -275,6 +279,10 @@ bits is assumed.
 .BI enc_src_ip " PREFIX"
 .TQ
 .BI enc_dst_port " NUMBER"
+.TQ
+.BI enc_tos " NUMBER"
+.TQ
+.BI enc_ttl " NUMBER"
 Match on IP tunnel metadata. Key id
 .I NUMBER
 is a 32 bit tunnel key id (e.g. VNI for VXLAN tunnel).
@@ -283,7 +291,11 @@ must be a valid IPv4 or IPv6 address optionally followed by a slash and the
 prefix length. If the prefix is missing, \fBtc\fR assumes a full-length
 host match.  Dst port
 .I NUMBER
-is a 16 bit UDP dst port.
+is a 16 bit UDP dst port. Tos
+.I NUMBER
+is an 8 bit tos (dscp+ecn) value, ttl
+.I NUMBER
+is an 8 bit time-to-live value.
 .TP
 .BI ip_flags " IP_FLAGS"
 .I IP_FLAGS
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 40b4026..a4cf06a 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -77,6 +77,8 @@ static void explain(void)
"   enc_dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
"   enc_src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
"   enc_key_id [ KEY-ID ] |\n"
+   "   enc_tos MASKED-IP_TOS |\n"
+   "   enc_ttl MASKED-IP_TTL |\n"
"   ip_flags IP-FLAGS | \n"
"   enc_dst_port [ port_number ] }\n"
"   FILTERID := X:Y:Z\n"
@@ -1019,6 +1021,26 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
fprintf(stderr, "Illegal \"enc_dst_port\"\n");
return -1;
}
+   } else if (matches(*argv, "enc_tos") == 0) {
+   NEXT_ARG();
+   ret = flower_parse_ip_tos_ttl(*argv,
+ TCA_FLOWER_KEY_ENC_IP_TOS,
+ TCA_FLOWER_KEY_ENC_IP_TOS_MASK,
+ n);
+   if (ret < 0) {
+   fprintf(stderr, "Illegal \"enc_tos\"\n");
+   return -1;
+   }
+   } else if (matches(*argv, "enc_ttl") == 0) {
+   NEXT_ARG();
+   ret = flower_parse_ip_tos_ttl(*argv,
+ TCA_FLOWER_KEY_ENC_IP_TTL,
+ TCA_FLOWER_KEY_ENC_IP_TTL_MASK,
+ n);
+   if (ret < 0) {
+   fprintf(stderr, "Illegal \"enc_ttl\"\n");
+   return -1;
+   }
} else if (matches(*argv, "action") == 0) {
NEXT_ARG();
ret = parse_action(&argc, &argv, TCA_FLOWER_ACT, n);
@@ -1542,6 +1564,11 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
 
flower_print_port("enc_dst_port", tb[TCA_FLOWER_KEY_ENC_UDP_DST_PORT]);
 
+   flower_print_ip_attr("enc_tos", tb[TCA_FLOWER_KEY_ENC_IP_TOS],
+   tb[TCA_FLOWER_KEY_ENC_IP_TOS_MASK]);
+   flower_print_ip_attr("enc_ttl", tb[TCA_FLOWER_KEY_ENC_IP_TTL],
+   tb[TCA_FLOWER_KEY_ENC_IP_TTL_MASK]);
+
flower

[PATCH net-next 03/11] s390/qeth: remove redundant netif_carrier_ok() checks

2018-07-19 Thread Julian Wiedmann
netif_carrier_off() does its own checking.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core_main.c | 2 +-
 drivers/s390/net/qeth_l2_main.c   | 2 +-
 drivers/s390/net/qeth_l3_main.c   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index f7ddd6455638..7c3b643550f6 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -653,7 +653,7 @@ static struct qeth_ipa_cmd *qeth_check_ipa_data(struct qeth_card *card,
cmd->hdr.return_code, card);
}
card->lan_online = 0;
-   if (card->dev && netif_carrier_ok(card->dev))
+   if (card->dev)
netif_carrier_off(card->dev);
return NULL;
case IPA_CMD_STARTLAN:
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 8ac243de7a9e..089bde458fd5 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -1163,7 +1163,7 @@ static int __qeth_l2_set_offline(struct ccwgroup_device *cgdev,
QETH_DBF_TEXT(SETUP, 3, "setoffl");
QETH_DBF_HEX(SETUP, 3, &card, sizeof(void *));
 
-   if (card->dev && netif_carrier_ok(card->dev))
+   if (card->dev)
netif_carrier_off(card->dev);
recover_flag = card->state;
if ((!recovery_mode && card->info.hwtrap) || card->info.hwtrap == 2) {
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index 062f62b49294..ee99af08b2c4 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -2773,7 +2773,7 @@ static int __qeth_l3_set_offline(struct ccwgroup_device *cgdev,
QETH_DBF_TEXT(SETUP, 3, "setoffl");
QETH_DBF_HEX(SETUP, 3, &card, sizeof(void *));
 
-   if (card->dev && netif_carrier_ok(card->dev))
+   if (card->dev)
netif_carrier_off(card->dev);
recover_flag = card->state;
if ((!recovery_mode && card->info.hwtrap) || card->info.hwtrap == 2) {
-- 
2.16.4



[PATCH net-next 01/11] s390/qeth: fix race in used-buffer accounting

2018-07-19 Thread Julian Wiedmann
By updating q->used_buffers only _after_ do_QDIO() has completed, there
is a potential race against the buffer's TX completion. In the unlikely
case that the TX completion path wins, qeth_qdio_output_handler() would
decrement the counter before qeth_flush_buffers() even incremented it.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index d80972b9bfc7..f7ddd6455638 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -3506,13 +3506,14 @@ static void qeth_flush_buffers(struct qeth_qdio_out_q *queue, int index,
qdio_flags = QDIO_FLAG_SYNC_OUTPUT;
if (atomic_read(&queue->set_pci_flags_count))
qdio_flags |= QDIO_FLAG_PCI_OUT;
+   atomic_add(count, &queue->used_buffers);
+
rc = do_QDIO(CARD_DDEV(queue->card), qdio_flags,
 queue->queue_no, index, count);
if (queue->card->options.performance_stats)
queue->card->perf_stats.outbound_do_qdio_time +=
qeth_get_micros() -
queue->card->perf_stats.outbound_do_qdio_start_time;
-   atomic_add(count, &queue->used_buffers);
if (rc) {
queue->card->stats.tx_errors += count;
/* ignore temporary SIGA errors without busy condition */
-- 
2.16.4



[PATCH net-next 04/11] s390/qeth: allocate netdevice early

2018-07-19 Thread Julian Wiedmann
Allocation of the netdevice is currently delayed until a qeth card first
goes online. This complicates matters in several places, where we need
to cache values instead of applying them straight to the netdevice.

Improve on this by moving the allocation up to where the qeth card
itself is created. This is also one step in direction of eventually
placing the qeth card into netdev_priv().

In all subsequent code, remove the now redundant checks whether
card->dev is valid.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  1 +
 drivers/s390/net/qeth_core_main.c | 61 ---
 drivers/s390/net/qeth_core_sys.c  | 14 +++--
 drivers/s390/net/qeth_l2_main.c   | 52 +
 drivers/s390/net/qeth_l3_main.c   | 48 ++
 drivers/s390/net/qeth_l3_sys.c|  6 ++--
 6 files changed, 94 insertions(+), 88 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index a932aac62d0e..4d6827c8aba4 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -966,6 +966,7 @@ extern struct qeth_card_list_struct qeth_core_card_list;
 extern struct kmem_cache *qeth_core_header_cache;
 extern struct qeth_dbf_info qeth_dbf[QETH_DBF_INFOS];
 
+struct net_device *qeth_clone_netdev(struct net_device *orig);
 void qeth_set_recovery_task(struct qeth_card *);
 void qeth_clear_recovery_task(struct qeth_card *);
 void qeth_set_allowed_threads(struct qeth_card *, unsigned long , int);
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 7c3b643550f6..6df2226417f1 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -653,8 +653,7 @@ static struct qeth_ipa_cmd *qeth_check_ipa_data(struct qeth_card *card,
cmd->hdr.return_code, card);
}
card->lan_online = 0;
-   if (card->dev)
-   netif_carrier_off(card->dev);
+   netif_carrier_off(card->dev);
return NULL;
case IPA_CMD_STARTLAN:
dev_info(&card->gdev->dev,
@@ -2350,9 +2349,8 @@ static int qeth_ulp_enable_cb(struct qeth_card *card, struct qeth_reply *reply,
}
if (card->info.initial_mtu && (card->info.initial_mtu != mtu)) {
/* frame size has changed */
-   if (card->dev &&
-   ((card->dev->mtu == card->info.initial_mtu) ||
-(card->dev->mtu > mtu)))
+   if ((card->dev->mtu == card->info.initial_mtu) ||
+   (card->dev->mtu > mtu))
card->dev->mtu = mtu;
qeth_free_qdio_buffers(card);
}
@@ -3578,7 +3576,7 @@ static void qeth_qdio_start_poll(struct ccw_device *ccwdev, int queue,
 {
struct qeth_card *card = (struct qeth_card *)card_ptr;
 
-   if (card->dev && (card->dev->flags & IFF_UP))
+   if (card->dev->flags & IFF_UP)
napi_schedule(&card->napi);
 }
 
@@ -4794,9 +4792,6 @@ int qeth_vm_request_mac(struct qeth_card *card)
 
QETH_DBF_TEXT(SETUP, 2, "vmreqmac");
 
-   if (!card->dev)
-   return -ENODEV;
-
request = kzalloc(sizeof(*request), GFP_KERNEL | GFP_DMA);
response = kzalloc(sizeof(*response), GFP_KERNEL | GFP_DMA);
if (!request || !response) {
@@ -5676,6 +5671,44 @@ static void qeth_clear_dbf_list(void)
mutex_unlock(&qeth_dbf_list_mutex);
 }
 
+static struct net_device *qeth_alloc_netdev(struct qeth_card *card)
+{
+   struct net_device *dev;
+
+   switch (card->info.type) {
+   case QETH_CARD_TYPE_IQD:
+   dev = alloc_netdev(0, "hsi%d", NET_NAME_UNKNOWN, ether_setup);
+   break;
+   case QETH_CARD_TYPE_OSN:
+   dev = alloc_netdev(0, "osn%d", NET_NAME_UNKNOWN, ether_setup);
+   break;
+   default:
+   dev = alloc_etherdev(0);
+   }
+
+   if (!dev)
+   return NULL;
+
+   dev->ml_priv = card;
+   dev->watchdog_timeo = QETH_TX_TIMEOUT;
+   dev->min_mtu = 64;
+   dev->max_mtu = ETH_MAX_MTU;
+   SET_NETDEV_DEV(dev, &card->gdev->dev);
+   netif_carrier_off(dev);
+   return dev;
+}
+
+struct net_device *qeth_clone_netdev(struct net_device *orig)
+{
+   struct net_device *clone = qeth_alloc_netdev(orig->ml_priv);
+
+   if (!clone)
+   return NULL;
+
+   clone->dev_port = orig->dev_port;
+   return clone;
+}
+
 static int qeth_core_probe_device(struct ccwgroup_device *gdev)
 {
struct qeth_card *card;
@@ -5725,6 +5758,10 @@ static int qeth_core_probe_device(struct ccwgroup_device *gdev)
  

[PATCH net-next 10/11] s390/qeth: add support for constrained HW headers

2018-07-19 Thread Julian Wiedmann
Some transmit modes require that the HW header is located in the same
page as the initial protocol headers in skb->data. Let callers specify
the size of this contiguous header range, and enforce it when building
the HW header.

While at it, apply some gentle renaming to the relevant L2 code so that
it matches the L3 code.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  4 ++--
 drivers/s390/net/qeth_core_main.c | 33 +++--
 drivers/s390/net/qeth_l2_main.c   | 12 +++-
 drivers/s390/net/qeth_l3_main.c   |  3 ++-
 4 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 2a5ec99643df..605ec4706773 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -1048,8 +1048,8 @@ netdev_features_t qeth_features_check(struct sk_buff *skb,
  netdev_features_t features);
 int qeth_vm_request_mac(struct qeth_card *card);
 int qeth_add_hw_header(struct qeth_card *card, struct sk_buff *skb,
-  struct qeth_hdr **hdr, unsigned int len,
-  unsigned int *elements);
+  struct qeth_hdr **hdr, unsigned int hdr_len,
+  unsigned int proto_len, unsigned int *elements);
 
 /* exports for OSN */
 int qeth_osn_assist(struct net_device *, void *, int);
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index e7b34624df1e..732b517369c7 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -3895,7 +3895,9 @@ EXPORT_SYMBOL_GPL(qeth_hdr_chk_and_bounce);
  * @skb: skb that the HW header should be added to.
  * @hdr: double pointer to a qeth_hdr. When returning with >= 0,
  *  it contains a valid pointer to a qeth_hdr.
- * @len: length of the HW header.
+ * @hdr_len: length of the HW header.
+ * @proto_len: length of protocol headers that need to be in same page as the
+ *HW header.
  *
  * Returns the pushed length. If the header can't be pushed on
  * (eg. because it would cross a page boundary), it is allocated from
@@ -3904,31 +3906,32 @@ EXPORT_SYMBOL_GPL(qeth_hdr_chk_and_bounce);
  * Error to create the hdr is indicated by returning with < 0.
  */
 int qeth_add_hw_header(struct qeth_card *card, struct sk_buff *skb,
-  struct qeth_hdr **hdr, unsigned int len,
-  unsigned int *elements)
+  struct qeth_hdr **hdr, unsigned int hdr_len,
+  unsigned int proto_len, unsigned int *elements)
 {
const unsigned int max_elements = QETH_MAX_BUFFER_ELEMENTS(card);
+   const unsigned int contiguous = proto_len ? proto_len : 1;
unsigned int __elements;
addr_t start, end;
bool push_ok;
int rc;
 
 check_layout:
-   start = (addr_t)skb->data - len;
+   start = (addr_t)skb->data - hdr_len;
end = (addr_t)skb->data;
 
-   if (qeth_get_elements_for_range(start, end + 1) == 1) {
+   if (qeth_get_elements_for_range(start, end + contiguous) == 1) {
/* Push HW header into same page as first protocol header. */
push_ok = true;
__elements = qeth_count_elements(skb, 0);
-   } else {
+   } else if (!proto_len && qeth_get_elements_for_range(start, end) == 1) {
+   /* Push HW header into a new page. */
+   push_ok = true;
__elements = 1 + qeth_count_elements(skb, 0);
-   if (qeth_get_elements_for_range(start, end) == 1)
-   /* Push HW header into a new page. */
-   push_ok = true;
-   else
-   /* Use header cache. */
-   push_ok = false;
+   } else {
+   /* Use header cache, copy protocol headers up. */
+   push_ok = false;
+   __elements = 1 + qeth_count_elements(skb, proto_len);
}
 
/* Compress skb to fit into one IO buffer: */
@@ -3957,13 +3960,15 @@ int qeth_add_hw_header(struct qeth_card *card, struct sk_buff *skb,
*elements = __elements;
/* Add the header: */
if (push_ok) {
-   *hdr = skb_push(skb, len);
-   return len;
+   *hdr = skb_push(skb, hdr_len);
+   return hdr_len;
}
/* fall back */
*hdr = kmem_cache_alloc(qeth_core_header_cache, GFP_ATOMIC);
if (!*hdr)
return -ENOMEM;
+   /* Copy protocol headers behind HW header: */
+   skb_copy_from_linear_data(skb, ((char *)*hdr) + hdr_len, proto_len);
return 0;
 }
 EXPORT_SYMBOL_GPL(qeth_add_hw_header);
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 905f3bb3a87c..c302002e422f 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -671,17 +671,19 @@ static int qeth_l2_

[PATCH net-next 00/11] s390/qeth: updates 2018-07-19

2018-07-19 Thread Julian Wiedmann
Hi Dave,

please apply one more round of qeth patches to net-next.
This brings additional performance improvements for the transmit code,
and some refactoring to pave the way for using netdev_priv.
Also, two minor fixes for rare corner cases.

Thanks,
Julian


Julian Wiedmann (11):
  s390/qeth: fix race in used-buffer accounting
  s390/qeth: reset layer2 attribute on layer switch
  s390/qeth: remove redundant netif_carrier_ok() checks
  s390/qeth: allocate netdevice early
  s390/qeth: don't cache HW port number
  s390/qeth: simplify max MTU handling
  s390/qeth: use core MTU range checking
  s390/qeth: add statistics for consumed buffer elements
  s390/qeth: merge linearize-check into HW header construction
  s390/qeth: add support for constrained HW headers
  s390/qeth: speed up L2 IQD xmit

 drivers/s390/net/qeth_core.h  |  11 +-
 drivers/s390/net/qeth_core_main.c | 283 --
 drivers/s390/net/qeth_core_mpc.h  |   1 +
 drivers/s390/net/qeth_core_sys.c  |  18 ++-
 drivers/s390/net/qeth_l2_main.c   | 176 +++-
 drivers/s390/net/qeth_l3_main.c   | 112 +--
 drivers/s390/net/qeth_l3_sys.c|   6 +-
 7 files changed, 287 insertions(+), 320 deletions(-)

-- 
2.16.4



[PATCH net-next 11/11] s390/qeth: speed up L2 IQD xmit

2018-07-19 Thread Julian Wiedmann
Modify the L2 OSA xmit path so that it also supports L2 IQD devices
(in particular, their HW header requirements). This allows IQD devices
to advertise NETIF_F_SG support, and eliminates the allocation overhead
for the HW header.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core_main.c |  7 
 drivers/s390/net/qeth_l2_main.c   | 76 ---
 drivers/s390/net/qeth_l3_main.c   |  3 --
 3 files changed, 30 insertions(+), 56 deletions(-)

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 732b517369c7..d09a7110b381 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -5731,6 +5731,13 @@ static struct net_device *qeth_alloc_netdev(struct qeth_card *card)
dev->mtu = 0;
SET_NETDEV_DEV(dev, &card->gdev->dev);
netif_carrier_off(dev);
+
+   if (!IS_OSN(card)) {
+   dev->priv_flags &= ~IFF_TX_SKB_SHARING;
+   dev->hw_features |= NETIF_F_SG;
+   dev->vlan_features |= NETIF_F_SG;
+   }
+
return dev;
 }
 
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index c302002e422f..c1829a4b955d 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -641,37 +641,13 @@ static void qeth_l2_set_rx_mode(struct net_device *dev)
qeth_promisc_to_bridge(card);
 }
 
-static int qeth_l2_xmit_iqd(struct qeth_card *card, struct sk_buff *skb,
-   struct qeth_qdio_out_q *queue, int cast_type)
-{
-   unsigned int data_offset = ETH_HLEN;
-   struct qeth_hdr *hdr;
-   int rc;
-
-   hdr = kmem_cache_alloc(qeth_core_header_cache, GFP_ATOMIC);
-   if (!hdr)
-   return -ENOMEM;
-   qeth_l2_fill_header(hdr, skb, cast_type, skb->len);
-   skb_copy_from_linear_data(skb, ((char *)hdr) + sizeof(*hdr),
- data_offset);
-
-   if (!qeth_get_elements_no(card, skb, 1, data_offset)) {
-   rc = -E2BIG;
-   goto out;
-   }
-   rc = qeth_do_send_packet_fast(queue, skb, hdr, data_offset,
- sizeof(*hdr) + data_offset);
-out:
-   if (rc)
-   kmem_cache_free(qeth_core_header_cache, hdr);
-   return rc;
-}
-
-static int qeth_l2_xmit_osa(struct qeth_card *card, struct sk_buff *skb,
-   struct qeth_qdio_out_q *queue, int cast_type,
-   int ipv)
+static int qeth_l2_xmit(struct qeth_card *card, struct sk_buff *skb,
+   struct qeth_qdio_out_q *queue, int cast_type, int ipv)
 {
+   const unsigned int proto_len = IS_IQD(card) ? ETH_HLEN : 0;
const unsigned int hw_hdr_len = sizeof(struct qeth_hdr);
+   unsigned int frame_len = skb->len;
+   unsigned int data_offset = 0;
struct qeth_hdr *hdr = NULL;
unsigned int hd_len = 0;
unsigned int elements;
@@ -682,15 +658,16 @@ static int qeth_l2_xmit_osa(struct qeth_card *card, struct sk_buff *skb,
if (rc)
return rc;
 
-   push_len = qeth_add_hw_header(card, skb, &hdr, hw_hdr_len, 0,
+   push_len = qeth_add_hw_header(card, skb, &hdr, hw_hdr_len, proto_len,
  &elements);
if (push_len < 0)
return push_len;
if (!push_len) {
-   /* hdr was allocated from cache */
-   hd_len = sizeof(*hdr);
+   /* HW header needs its own buffer element. */
+   hd_len = hw_hdr_len + proto_len;
+   data_offset = proto_len;
}
-   qeth_l2_fill_header(hdr, skb, cast_type, skb->len - push_len);
+   qeth_l2_fill_header(hdr, skb, cast_type, frame_len);
if (skb->ip_summed == CHECKSUM_PARTIAL) {
qeth_tx_csum(skb, &hdr->hdr.l2.flags[1], ipv);
if (card->options.performance_stats)
@@ -698,9 +675,15 @@ static int qeth_l2_xmit_osa(struct qeth_card *card, struct sk_buff *skb,
}
 
is_sg = skb_is_nonlinear(skb);
-   /* TODO: remove the skb_orphan() once TX completion is fast enough */
-   skb_orphan(skb);
-   rc = qeth_do_send_packet(card, queue, skb, hdr, 0, hd_len, elements);
+   if (IS_IQD(card)) {
+   rc = qeth_do_send_packet_fast(queue, skb, hdr, data_offset,
+ hd_len);
+   } else {
+   /* TODO: drop skb_orphan() once TX completion is fast enough */
+   skb_orphan(skb);
+   rc = qeth_do_send_packet(card, queue, skb, hdr, data_offset,
+hd_len, elements);
+   }
 
if (!rc) {
if (card->options.performance_stats) {
@@ -759,16 +742,10 @@ static netdev_tx_t qeth_l2_hard_start_xmit(struct sk_buff *skb,
}
netif_stop_queue(dev);
 
-   switch (card->info.type) {
-   case QETH_CARD_TYP

[PATCH net-next 07/11] s390/qeth: use core MTU range checking

2018-07-19 Thread Julian Wiedmann
qeth's ndo_change_mtu() only applies some trivial bounds checking. Set
up dev->min_mtu properly, so that dev_set_mtu() can do this for us.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  1 -
 drivers/s390/net/qeth_core_main.c | 34 +-
 drivers/s390/net/qeth_core_mpc.h  |  1 +
 drivers/s390/net/qeth_l2_main.c   |  1 -
 drivers/s390/net/qeth_l3_main.c   |  2 --
 5 files changed, 2 insertions(+), 37 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 6f02a6cbe59e..994ac7f434d5 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -993,7 +993,6 @@ void qeth_clear_cmd_buffers(struct qeth_channel *);
 void qeth_clear_qdio_buffers(struct qeth_card *);
 void qeth_setadp_promisc_mode(struct qeth_card *);
 struct net_device_stats *qeth_get_stats(struct net_device *);
-int qeth_change_mtu(struct net_device *, int);
 int qeth_setadpparms_change_macaddr(struct qeth_card *);
 void qeth_tx_timeout(struct net_device *);
 void qeth_prepare_control_data(struct qeth_card *, int,
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index ab3d63f98779..717511c167e7 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -2332,20 +2332,6 @@ static int qeth_get_mtu_outof_framesize(int framesize)
}
 }
 
-static int qeth_mtu_is_valid(struct qeth_card *card, int mtu)
-{
-   switch (card->info.type) {
-   case QETH_CARD_TYPE_OSD:
-   case QETH_CARD_TYPE_OSM:
-   case QETH_CARD_TYPE_OSX:
-   case QETH_CARD_TYPE_IQD:
-   return ((mtu >= 576) && (mtu <= card->dev->max_mtu));
-   case QETH_CARD_TYPE_OSN:
-   default:
-   return 1;
-   }
-}
-
 static int qeth_ulp_enable_cb(struct qeth_card *card, struct qeth_reply *reply,
unsigned long data)
 {
@@ -4204,24 +4190,6 @@ void qeth_setadp_promisc_mode(struct qeth_card *card)
 }
 EXPORT_SYMBOL_GPL(qeth_setadp_promisc_mode);
 
-int qeth_change_mtu(struct net_device *dev, int new_mtu)
-{
-   struct qeth_card *card;
-   char dbf_text[15];
-
-   card = dev->ml_priv;
-
-   QETH_CARD_TEXT(card, 4, "chgmtu");
-   sprintf(dbf_text, "%8x", new_mtu);
-   QETH_CARD_TEXT(card, 4, dbf_text);
-
-   if (!qeth_mtu_is_valid(card, new_mtu))
-   return -EINVAL;
-   dev->mtu = new_mtu;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(qeth_change_mtu);
-
 struct net_device_stats *qeth_get_stats(struct net_device *dev)
 {
struct qeth_card *card;
@@ -5696,7 +5664,7 @@ static struct net_device *qeth_alloc_netdev(struct qeth_card *card)
 
dev->ml_priv = card;
dev->watchdog_timeo = QETH_TX_TIMEOUT;
-   dev->min_mtu = 64;
+   dev->min_mtu = IS_OSN(card) ? 64 : 576;
 /* initialized when device first goes online: */
dev->max_mtu = 0;
dev->mtu = 0;
diff --git a/drivers/s390/net/qeth_core_mpc.h b/drivers/s390/net/qeth_core_mpc.h
index 54c35224262a..cf5ad94e960a 100644
--- a/drivers/s390/net/qeth_core_mpc.h
+++ b/drivers/s390/net/qeth_core_mpc.h
@@ -65,6 +65,7 @@ enum qeth_card_types {
 };
 
 #define IS_IQD(card)   ((card)->info.type == QETH_CARD_TYPE_IQD)
+#define IS_OSN(card)   ((card)->info.type == QETH_CARD_TYPE_OSN)
 
 #define QETH_MPC_DIFINFO_LEN_INDICATES_LINK_TYPE 0x18
 /* only the first two bytes are looked at in qeth_get_cardname_short */
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 00d8bb1d2a41..668f80680575 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -928,7 +928,6 @@ static const struct net_device_ops qeth_l2_netdev_ops = {
.ndo_set_rx_mode= qeth_l2_set_rx_mode,
.ndo_do_ioctl   = qeth_do_ioctl,
.ndo_set_mac_address= qeth_l2_set_mac_address,
-   .ndo_change_mtu = qeth_change_mtu,
.ndo_vlan_rx_add_vid= qeth_l2_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid   = qeth_l2_vlan_rx_kill_vid,
.ndo_tx_timeout = qeth_tx_timeout,
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index e5fa8e5ac1b3..078b891a6d24 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -2506,7 +2506,6 @@ static const struct net_device_ops qeth_l3_netdev_ops = {
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_rx_mode= qeth_l3_set_rx_mode,
.ndo_do_ioctl   = qeth_do_ioctl,
-   .ndo_change_mtu = qeth_change_mtu,
.ndo_fix_features   = qeth_fix_features,
.ndo_set_features   = qeth_set_features,
.ndo_vlan_rx_add_vid= qeth_l3_vlan_rx_add_vid,
@@ -2523,7 +2522,6 @@ static const struct net_device_ops qeth_l3_osa_netdev_ops 
= {
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_rx_mode= qeth_l3_set_rx_mode,
.ndo_do_ioctl   = qeth_do_ioctl,
-   

[PATCH net-next 06/11] s390/qeth: simplify max MTU handling

2018-07-19 Thread Julian Wiedmann
When the MPC initialization code discovers the HW-specific max MTU,
apply the resulting changes straight to the netdevice.

If this is the device's first initialization, also set its MTU
(HiperSockets: the max MTU; else: a layer-specific default value).
Then cap the current MTU by the new max MTU.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  2 -
 drivers/s390/net/qeth_core_main.c | 82 +--
 drivers/s390/net/qeth_l2_main.c   |  1 -
 drivers/s390/net/qeth_l3_main.c   |  1 -
 4 files changed, 45 insertions(+), 41 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 04b900c7060d..6f02a6cbe59e 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -660,8 +660,6 @@ struct qeth_card_info {
int mac_bits;
enum qeth_card_types type;
enum qeth_link_types link_type;
-   int initial_mtu;
-   int max_mtu;
int broadcast_capable;
int unique_id;
bool layer_enforced;
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 39c429787c4d..ab3d63f98779 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -2278,19 +2278,42 @@ static int qeth_cm_setup(struct qeth_card *card)
 
 }
 
-static int qeth_get_initial_mtu_for_card(struct qeth_card *card)
+static int qeth_update_max_mtu(struct qeth_card *card, unsigned int max_mtu)
 {
-   switch (card->info.type) {
-   case QETH_CARD_TYPE_IQD:
-   return card->info.max_mtu;
-   case QETH_CARD_TYPE_OSD:
-   case QETH_CARD_TYPE_OSX:
-   if (!card->options.layer2)
-   return ETH_DATA_LEN - 8; /* L3: allow for LLC + SNAP */
-   /* fall through */
-   default:
-   return ETH_DATA_LEN;
+   struct net_device *dev = card->dev;
+   unsigned int new_mtu;
+
+   if (!max_mtu) {
+   /* IQD needs accurate max MTU to set up its RX buffers: */
+   if (IS_IQD(card))
+   return -EINVAL;
+   /* tolerate quirky HW: */
+   max_mtu = ETH_MAX_MTU;
+   }
+
+   rtnl_lock();
+   if (IS_IQD(card)) {
+   /* move any device with default MTU to new max MTU: */
+   new_mtu = (dev->mtu == dev->max_mtu) ? max_mtu : dev->mtu;
+
+   /* adjust RX buffer size to new max MTU: */
+   card->qdio.in_buf_size = max_mtu + 2 * PAGE_SIZE;
+   if (dev->max_mtu && dev->max_mtu != max_mtu)
+   qeth_free_qdio_buffers(card);
+   } else {
+   if (dev->mtu)
+   new_mtu = dev->mtu;
+   /* default MTUs for first setup: */
+   else if (card->options.layer2)
+   new_mtu = ETH_DATA_LEN;
+   else
+   new_mtu = ETH_DATA_LEN - 8; /* allow for LLC + SNAP */
}
+
+   dev->max_mtu = max_mtu;
+   dev->mtu = min(new_mtu, max_mtu);
+   rtnl_unlock();
+   return 0;
 }
 
 static int qeth_get_mtu_outof_framesize(int framesize)
@@ -2316,8 +2339,7 @@ static int qeth_mtu_is_valid(struct qeth_card *card, int mtu)
case QETH_CARD_TYPE_OSM:
case QETH_CARD_TYPE_OSX:
case QETH_CARD_TYPE_IQD:
-   return ((mtu >= 576) &&
-   (mtu <= card->info.max_mtu));
+   return ((mtu >= 576) && (mtu <= card->dev->max_mtu));
case QETH_CARD_TYPE_OSN:
default:
return 1;
@@ -2342,28 +2364,10 @@ static int qeth_ulp_enable_cb(struct qeth_card *card, struct qeth_reply *reply,
if (card->info.type == QETH_CARD_TYPE_IQD) {
memcpy(&framesize, QETH_ULP_ENABLE_RESP_MAX_MTU(iob->data), 2);
mtu = qeth_get_mtu_outof_framesize(framesize);
-   if (!mtu) {
-   iob->rc = -EINVAL;
-   QETH_DBF_TEXT_(SETUP, 2, "  rc%d", iob->rc);
-   return 0;
-   }
-   if (card->info.initial_mtu && (card->info.initial_mtu != mtu)) {
-   /* frame size has changed */
-   if ((card->dev->mtu == card->info.initial_mtu) ||
-   (card->dev->mtu > mtu))
-   card->dev->mtu = mtu;
-   qeth_free_qdio_buffers(card);
-   }
-   card->info.initial_mtu = mtu;
-   card->info.max_mtu = mtu;
-   card->qdio.in_buf_size = mtu + 2 * PAGE_SIZE;
} else {
-   card->info.max_mtu = *(__u16 *)QETH_ULP_ENABLE_RESP_MAX_MTU(
-   iob->data);
-   card->info.initial_mtu = min(card->info.max_mtu,
-   qeth_get_initial_mtu_for_card(card));
-   card->qdio.in_buf_size = QETH_IN_BUF_SIZE_DEFAULT;
+   mtu = *(__u16 *)

[PATCH net-next 05/11] s390/qeth: don't cache HW port number

2018-07-19 Thread Julian Wiedmann
The netdevice is always available now, so get the portno from there.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  | 1 -
 drivers/s390/net/qeth_core_main.c | 7 +++
 drivers/s390/net/qeth_core_sys.c  | 3 +--
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 4d6827c8aba4..04b900c7060d 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -658,7 +658,6 @@ struct qeth_card_info {
char mcl_level[QETH_MCL_LENGTH + 1];
int guestlan;
int mac_bits;
-   int portno;
enum qeth_card_types type;
enum qeth_link_types link_type;
int initial_mtu;
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 6df2226417f1..39c429787c4d 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -1920,7 +1920,7 @@ static int qeth_idx_activate_channel(struct qeth_channel *channel,
memcpy(QETH_TRANSPORT_HEADER_SEQ_NO(iob->data),
   &card->seqno.trans_hdr, QETH_SEQ_NO_LENGTH);
}
-   tmp = ((__u8)card->info.portno) | 0x80;
+   tmp = ((u8)card->dev->dev_port) | 0x80;
memcpy(QETH_IDX_ACT_PNO(iob->data), &tmp, 1);
memcpy(QETH_IDX_ACT_ISSUER_RM_TOKEN(iob->data),
   &card->token.issuer_rm_w, QETH_MPC_TOKEN_LENGTH);
@@ -2389,8 +2389,7 @@ static int qeth_ulp_enable(struct qeth_card *card)
iob = qeth_wait_for_buffer(&card->write);
memcpy(iob->data, ULP_ENABLE, ULP_ENABLE_SIZE);
 
-   *(QETH_ULP_ENABLE_LINKNUM(iob->data)) =
-   (__u8) card->info.portno;
+   *(QETH_ULP_ENABLE_LINKNUM(iob->data)) = (u8) card->dev->dev_port;
if (card->options.layer2)
if (card->info.type == QETH_CARD_TYPE_OSN)
prot_type = QETH_PROT_OSN2;
@@ -2918,7 +2917,7 @@ static void qeth_fill_ipacmd_header(struct qeth_card *card,
cmd->hdr.initiator = IPA_CMD_INITIATOR_HOST;
/* cmd->hdr.seqno is set by qeth_send_control_data() */
cmd->hdr.adapter_type = qeth_get_ipa_adp_type(card->info.link_type);
-   cmd->hdr.rel_adapter_no = (__u8) card->info.portno;
+   cmd->hdr.rel_adapter_no = (u8) card->dev->dev_port;
if (card->options.layer2)
cmd->hdr.prim_version_no = 2;
else
diff --git a/drivers/s390/net/qeth_core_sys.c b/drivers/s390/net/qeth_core_sys.c
index 9bef19ed7e04..25d0be25bcb3 100644
--- a/drivers/s390/net/qeth_core_sys.c
+++ b/drivers/s390/net/qeth_core_sys.c
@@ -112,7 +112,7 @@ static ssize_t qeth_dev_portno_show(struct device *dev,
if (!card)
return -EINVAL;
 
-   return sprintf(buf, "%i\n", card->info.portno);
+   return sprintf(buf, "%i\n", card->dev->dev_port);
 }
 
 static ssize_t qeth_dev_portno_store(struct device *dev,
@@ -143,7 +143,6 @@ static ssize_t qeth_dev_portno_store(struct device *dev,
rc = -EINVAL;
goto out;
}
-   card->info.portno = portno;
card->dev->dev_port = portno;
 out:
mutex_unlock(&card->conf_mutex);
-- 
2.16.4



[PATCH net-next 02/11] s390/qeth: reset layer2 attribute on layer switch

2018-07-19 Thread Julian Wiedmann
After the subdriver's remove() routine has completed, the card's layer
mode is undetermined again. Reflect this in the layer2 field.

If qeth_dev_layer2_store() hits an error after remove() was called, the
card _always_ requires a setup(), even if the previous layer mode is
requested again.
But qeth_dev_layer2_store() bails out early if the requested layer mode
still matches the current one. So unless we reset the layer2 field,
re-probing the card back to its previous mode is currently not possible.
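
The interplay between the early bail-out and the reset can be modeled in a
few lines of user-space Python (an illustrative sketch; the `Card` class,
the `-1` sentinel and the helper names are assumptions, not driver code):

```python
UNDETERMINED = -1

class Card:
    """Toy model of the layer2 sysfs attribute handling (illustrative only)."""
    def __init__(self):
        self.layer2 = UNDETERMINED
        self.discipline = None

    def remove_discipline(self):
        self.discipline = None
        self.layer2 = UNDETERMINED   # the fix: mode is undetermined again

    def layer2_store(self, newdis):
        if self.layer2 == newdis:    # early bail-out in the sysfs handler
            return "unchanged"
        if self.discipline is not None:
            self.remove_discipline()
        self.discipline = "qeth_l2" if newdis else "qeth_l3"
        self.layer2 = newdis
        return "re-probed"

card = Card()
card.layer2_store(1)         # probe in layer2 mode
card.remove_discipline()     # e.g. an error path tore the discipline down
# Thanks to the reset, requesting the *same* mode again still re-probes:
print(card.layer2_store(1))  # -> re-probed
```

Without the reset in `remove_discipline()`, the final store would hit the
bail-out and leave the card without any discipline.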

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core_sys.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/s390/net/qeth_core_sys.c b/drivers/s390/net/qeth_core_sys.c
index c3f18afb368b..cfb659747693 100644
--- a/drivers/s390/net/qeth_core_sys.c
+++ b/drivers/s390/net/qeth_core_sys.c
@@ -426,6 +426,7 @@ static ssize_t qeth_dev_layer2_store(struct device *dev,
if (card->discipline) {
card->discipline->remove(card->gdev);
qeth_core_free_discipline(card);
+   card->options.layer2 = -1;
}
 
rc = qeth_core_load_discipline(card, newdis);
-- 
2.16.4



[PATCH net-next 08/11] s390/qeth: add statistics for consumed buffer elements

2018-07-19 Thread Julian Wiedmann
Nowadays an skb fragment typically spans over multiple pages. So replace
the obsolete, SG-only 'fragments' counter with one that tracks the
consumed buffer elements. This is what actually matters for performance.
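
For illustration, the element accounting can be modeled in user space
(a Python sketch; the 4 KiB element size and both helper names are
assumptions mirroring the driver's counting, not its actual API):

```python
PAGE_SIZE = 4096

def elements_for_range(start, end):
    """Number of page-sized buffer elements a [start, end) byte range spans."""
    if start == end:
        return 0
    return ((end - 1) // PAGE_SIZE) - (start // PAGE_SIZE) + 1

def count_elements(head_start, head_len, frags):
    """frags: list of (start_addr, length) tuples for the skb's fragments."""
    total = elements_for_range(head_start, head_start + head_len)
    for start, length in frags:
        total += elements_for_range(start, start + length)
    return total

# A 3-byte head crossing a page boundary needs 2 elements:
assert elements_for_range(4094, 4097) == 2
# One 8 KiB frag spanning two pages, plus a linear head:
print(count_elements(head_start=100, head_len=200,
                     frags=[(8192, 8192)]))  # -> 1 + 2 = 3
```

A single "fragment" that spans two pages consumes two buffer elements,
which is exactly what the old per-fragment counter failed to capture.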

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  2 +-
 drivers/s390/net/qeth_core_main.c |  4 ++--
 drivers/s390/net/qeth_l2_main.c   | 13 +++--
 drivers/s390/net/qeth_l3_main.c   | 28 ++--
 4 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 994ac7f434d5..6d8005af67f5 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -104,6 +104,7 @@ struct qeth_dbf_info {
 struct qeth_perf_stats {
unsigned int bufs_rec;
unsigned int bufs_sent;
+   unsigned int buf_elements_sent;
 
unsigned int skbs_sent_pack;
unsigned int bufs_sent_pack;
@@ -137,7 +138,6 @@ struct qeth_perf_stats {
unsigned int large_send_bytes;
unsigned int large_send_cnt;
unsigned int sg_skbs_sent;
-   unsigned int sg_frags_sent;
/* initial values when measuring starts */
unsigned long initial_rx_packets;
unsigned long initial_tx_packets;
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 717511c167e7..84f1e1e33f3f 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -5970,7 +5970,7 @@ static struct {
{"tx skbs packing"},
{"tx buffers packing"},
{"tx sg skbs"},
-   {"tx sg frags"},
+   {"tx buffer elements"},
 /* 10 */{"rx sg skbs"},
{"rx sg frags"},
{"rx sg page allocs"},
@@ -6029,7 +6029,7 @@ void qeth_core_get_ethtool_stats(struct net_device *dev,
data[6] = card->perf_stats.skbs_sent_pack;
data[7] = card->perf_stats.bufs_sent_pack;
data[8] = card->perf_stats.sg_skbs_sent;
-   data[9] = card->perf_stats.sg_frags_sent;
+   data[9] = card->perf_stats.buf_elements_sent;
data[10] = card->perf_stats.sg_skbs_rx;
data[11] = card->perf_stats.sg_frags_rx;
data[12] = card->perf_stats.sg_alloc_page_rx;
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 668f80680575..a785c5ff73cd 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -672,10 +672,11 @@ static int qeth_l2_xmit_osa(struct qeth_card *card, struct sk_buff *skb,
int ipv)
 {
int push_len = sizeof(struct qeth_hdr);
-   unsigned int elements, nr_frags;
unsigned int hdr_elements = 0;
struct qeth_hdr *hdr = NULL;
unsigned int hd_len = 0;
+   unsigned int elements;
+   bool is_sg;
int rc;
 
/* fix hardware limitation: as long as we do not have sbal
@@ -693,7 +694,6 @@ static int qeth_l2_xmit_osa(struct qeth_card *card, struct sk_buff *skb,
if (rc)
return rc;
}
-   nr_frags = skb_shinfo(skb)->nr_frags;
 
rc = skb_cow_head(skb, push_len);
if (rc)
@@ -720,15 +720,16 @@ static int qeth_l2_xmit_osa(struct qeth_card *card, struct sk_buff *skb,
}
elements += hdr_elements;
 
+   is_sg = skb_is_nonlinear(skb);
/* TODO: remove the skb_orphan() once TX completion is fast enough */
skb_orphan(skb);
rc = qeth_do_send_packet(card, queue, skb, hdr, 0, hd_len, elements);
 out:
if (!rc) {
-   if (card->options.performance_stats && nr_frags) {
-   card->perf_stats.sg_skbs_sent++;
-   /* nr_frags + skb->data */
-   card->perf_stats.sg_frags_sent += nr_frags + 1;
+   if (card->options.performance_stats) {
+   card->perf_stats.buf_elements_sent += elements;
+   if (is_sg)
+   card->perf_stats.sg_skbs_sent++;
}
} else {
if (hd_len)
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index 078b891a6d24..c12aeb7d8f26 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -2166,12 +2166,13 @@ static int qeth_l3_xmit_offload(struct qeth_card *card, struct sk_buff *skb,
int cast_type)
 {
const unsigned int hw_hdr_len = sizeof(struct qeth_hdr);
-   unsigned int frame_len, nr_frags;
unsigned char eth_hdr[ETH_HLEN];
unsigned int hdr_elements = 0;
struct qeth_hdr *hdr = NULL;
int elements, push_len, rc;
unsigned int hd_len = 0;
+   unsigned int frame_len;
+   bool is_sg;
 
/* compress skb to fit into one IO buffer: */
if (!qeth_get_elements_no(card, skb, 0, 0)) {
@@ -2194,7 +2195,6 @@ static int qeth_l3_xmit_offload(struct qeth_card *card, struct sk_buff *skb,
skb_copy_

[PATCH net-next 09/11] s390/qeth: merge linearize-check into HW header construction

2018-07-19 Thread Julian Wiedmann
When checking whether an skb needs to be linearized to fit into an IO
buffer, it's desirable to consider the skb's final size and layout
(ie. after the HW header was added). But a subsequent linearization can
then cause the re-positioned HW header to violate its alignment
restrictions.

Dealing with this situation in two different code paths is quite tricky.
This patch integrates a) linearize-check and b) HW header construction
into one 3-step sequence:
1. evaluate how the HW header needs to be added (to identify if it takes
   up an additional buffer element), then
2. check if the required buffer elements exceed the device's limit.
   Linearize when necessary and re-evaluate the HW header placement.
3. Add the HW header in the best-possible way:
   a) push, without taking up an additional buffer element
   b) push, but consume another buffer element
   c) allocate a header object from the cache.
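
The 3-step sequence above can be sketched as a small user-space model
(Python for illustration; the page-boundary contiguity rule, the
"linearize to one element" shortcut and all names are simplifying
assumptions, not the driver's exact logic):

```python
PAGE = 4096

def place_header(data_addr, headroom, hdr_len):
    """Step 1: decide how the HW header is added; returns (method, extra)."""
    if headroom >= hdr_len:
        start = data_addr - hdr_len
        if start // PAGE == (data_addr - 1) // PAGE:
            return ("push", 0)   # a) merges into the payload's first element
        return ("push", 1)       # b) pushed, but needs its own element
    return ("cache", 1)          # c) no headroom: header object from cache

def add_hw_header(elements, data_addr, headroom, hdr_len, max_elements):
    """Steps 2-3: check the element budget, 'linearizing' the payload down
    to one element if needed (the model assumes linearizing preserves
    headroom and alignment), then commit the placement."""
    method, extra = place_header(data_addr, headroom, hdr_len)
    if elements + extra > max_elements:
        elements = 1                                   # linearized payload
        method, extra = place_header(data_addr, headroom, hdr_len)
        if elements + extra > max_elements:
            raise ValueError("packet cannot fit into one IO buffer")
    return method, elements + extra

# Header pushed within the payload's page: no extra element needed.
print(add_hw_header(3, data_addr=8192 + 64, headroom=32, hdr_len=16,
                    max_elements=16))  # -> ('push', 3)
```

In the driver, linearizing really re-evaluates the whole skb layout; the
model only shrinks the element count to show why the header placement must
be re-evaluated afterwards.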

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  4 +-
 drivers/s390/net/qeth_core_main.c | 86 ---
 drivers/s390/net/qeth_l2_main.c   | 29 +
 drivers/s390/net/qeth_l3_main.c   | 31 ++
 4 files changed, 80 insertions(+), 70 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 6d8005af67f5..2a5ec99643df 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -1047,7 +1047,9 @@ netdev_features_t qeth_features_check(struct sk_buff *skb,
  struct net_device *dev,
  netdev_features_t features);
 int qeth_vm_request_mac(struct qeth_card *card);
-int qeth_push_hdr(struct sk_buff *skb, struct qeth_hdr **hdr, unsigned int 
len);
+int qeth_add_hw_header(struct qeth_card *card, struct sk_buff *skb,
+  struct qeth_hdr **hdr, unsigned int len,
+  unsigned int *elements);
 
 /* exports for OSN */
 int qeth_osn_assist(struct net_device *, void *, int);
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 84f1e1e33f3f..e7b34624df1e 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -3831,6 +3831,17 @@ int qeth_get_elements_for_frags(struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(qeth_get_elements_for_frags);
 
+static unsigned int qeth_count_elements(struct sk_buff *skb, int data_offset)
+{
+   unsigned int elements = qeth_get_elements_for_frags(skb);
+   addr_t end = (addr_t)skb->data + skb_headlen(skb);
+   addr_t start = (addr_t)skb->data + data_offset;
+
+   if (start != end)
+   elements += qeth_get_elements_for_range(start, end);
+   return elements;
+}
+
 /**
  * qeth_get_elements_no() -find number of SBALEs for skb data, inc. frags.
  * @card:  qeth card structure, to check max. elems.
@@ -3846,12 +3857,7 @@ EXPORT_SYMBOL_GPL(qeth_get_elements_for_frags);
 int qeth_get_elements_no(struct qeth_card *card,
 struct sk_buff *skb, int extra_elems, int data_offset)
 {
-   addr_t end = (addr_t)skb->data + skb_headlen(skb);
-   int elements = qeth_get_elements_for_frags(skb);
-   addr_t start = (addr_t)skb->data + data_offset;
-
-   if (start != end)
-   elements += qeth_get_elements_for_range(start, end);
+   int elements = qeth_count_elements(skb, data_offset);
 
if ((elements + extra_elems) > QETH_MAX_BUFFER_ELEMENTS(card)) {
QETH_DBF_MESSAGE(2, "Invalid size of IP packet "
@@ -3885,22 +3891,72 @@ int qeth_hdr_chk_and_bounce(struct sk_buff *skb, struct qeth_hdr **hdr, int len)
 EXPORT_SYMBOL_GPL(qeth_hdr_chk_and_bounce);
 
 /**
- * qeth_push_hdr() - push a qeth_hdr onto an skb.
- * @skb: skb that the qeth_hdr should be pushed onto.
+ * qeth_add_hw_header() - add a HW header to an skb.
+ * @skb: skb that the HW header should be added to.
  * @hdr: double pointer to a qeth_hdr. When returning with >= 0,
  *  it contains a valid pointer to a qeth_hdr.
- * @len: length of the hdr that needs to be pushed on.
+ * @len: length of the HW header.
  *
  * Returns the pushed length. If the header can't be pushed on
  * (eg. because it would cross a page boundary), it is allocated from
  * the cache instead and 0 is returned.
+ * The number of needed buffer elements is returned in @elements.
  * Error to create the hdr is indicated by returning with < 0.
  */
-int qeth_push_hdr(struct sk_buff *skb, struct qeth_hdr **hdr, unsigned int len)
-{
-   if (skb_headroom(skb) >= len &&
-   qeth_get_elements_for_range((addr_t)skb->data - len,
-   (addr_t)skb->data) == 1) {
+int qeth_add_hw_header(struct qeth_card *card, struct sk_buff *skb,
+  struct qeth_hdr **hdr, unsigned int len,
+  unsigned int *elements)
+{
+   const unsigned int max_elements = QETH_MAX_BUFFER_ELEMENTS(card);
+   unsigned

[net-next v4 2/3] net/tls: Remove redundant variable assignments and wakeup

2018-07-19 Thread Vakul Garg
In function decrypt_skb_update(), the assignment to tls receive context
variable 'decrypted' is redundant as the same is being done in function
tls_sw_recvmsg() after calling decrypt_skb_update(). Also calling callback
function to wakeup processes sleeping on socket data availability is
useless as decrypt_skb_update() is invoked from user processes only. This
patch cleans these up.

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index a58661c624ec..e62f288fda31 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -659,7 +659,6 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
  struct scatterlist *sgout, bool *zc)
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
-   struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
struct strp_msg *rxm = strp_msg(skb);
int err = 0;
 
@@ -679,8 +678,6 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
rxm->offset += tls_ctx->rx.prepend_size;
rxm->full_len -= tls_ctx->rx.overhead_size;
tls_advance_record_sn(sk, &tls_ctx->rx);
-   ctx->decrypted = true;
-   ctx->saved_data_ready(sk);
 
return err;
 }
-- 
2.13.6



[net-next v4 3/3] net/tls: Remove redundant array allocation.

2018-07-19 Thread Vakul Garg
In function decrypt_skb(), array allocation in case when sgout is NULL
is unnecessary. Instead, local variable sgin_arr[] can be used.

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e62f288fda31..c33ba3f1c408 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -703,7 +703,6 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
if (!sgout) {
nsg = skb_cow_data(skb, 0, &unused) + 1;
-   sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);
sgout = sgin;
}
 
@@ -724,9 +723,6 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
rxm->full_len - tls_ctx->rx.overhead_size,
skb, sk->sk_allocation);
 
-   if (sgin != &sgin_arr[0])
-   kfree(sgin);
-
return ret;
 }
 
-- 
2.13.6



[net-next v4 1/3] net/tls: Use socket data_ready callback on record availability

2018-07-19 Thread Vakul Garg
On receipt of a complete tls record, use socket's saved data_ready
callback instead of state_change callback.

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 7d194c0cd6cf..a58661c624ec 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1023,7 +1023,7 @@ static void tls_queue(struct strparser *strp, struct sk_buff *skb)
ctx->recv_pkt = skb;
strp_pause(strp);
 
-   strp->sk->sk_state_change(strp->sk);
+   ctx->saved_data_ready(strp->sk);
 }
 
 static void tls_data_ready(struct sock *sk)
-- 
2.13.6



[net-next v4 0/3] net/tls: Minor code cleanup patches

2018-07-19 Thread Vakul Garg
This patch series improves tls_sw.c code by:

1) Using correct socket callback for flagging data availability.
2) Removing redundant variable assignments and wakeup callbacks.
3) Removing redundant dynamic array allocation.

The patches do not fix any functional bug, hence no "Fixes:" tag has been
used. Compared to series v3, this v4 series contains two fewer patches;
they will be submitted separately.

Vakul Garg (3):
  net/tls: Use socket data_ready callback on record availability
  net/tls: Remove redundant variable assignments and wakeup
  net/tls: Remove redundant array allocation.

 net/tls/tls_sw.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

-- 
2.13.6



RE: [net-next v3 1/5] net/tls: Do not enable zero-copy prematurely

2018-07-19 Thread Vakul Garg
Thanks for the comment.
I will take this patch out of the series.

> -Original Message-
> From: Boris Pismenny [mailto:bor...@mellanox.com]
> Sent: Thursday, July 19, 2018 3:58 PM
> To: Vakul Garg ; netdev@vger.kernel.org
> Cc: avia...@mellanox.com; davejwat...@fb.com; da...@davemloft.net
> Subject: Re: [net-next v3 1/5] net/tls: Do not enable zero-copy prematurely
> 
> Hi Vakul,
> 
> On 7/19/2018 7:16 AM, Vakul Garg wrote:
> > Zero-copy mode was left enabled even when zerocopy_from_iter() failed.
> > Set the zero-copy mode only when zerocopy_from_iter() succeeds. This
> > leads to removal of argument 'zc' of function decrypt_skb_update().
> > Function decrypt_skb_update() does not need to check whether
> > ctx->decrypted is set since it is never called if ctx->decrypted is
> > true.
> >
> 
> This patch breaks our tls_device code for the following 2 reasons:
> 1. We need to disable zerocopy if the device decrypted the record, because
> decrypted data has to be copied to user buffers.
> 2. ctx->decrypted must be checked in decrypt_skb_update, because it might
> change after calling tls_device_decrypted.


Re: [net 4/8] net/mlx5e: Don't allow aRFS for encapsulated packets

2018-07-19 Thread Or Gerlitz
On Thu, Jul 19, 2018 at 12:02 PM, Eran Ben Elisha
 wrote:
> On Thu, Jul 19, 2018 at 10:50 AM, Or Gerlitz  wrote:
>> On Thu, Jul 19, 2018 at 9:55 AM, Eran Ben Elisha
>>  wrote:
>>> On Thu, Jul 19, 2018 at 9:23 AM, Or Gerlitz  wrote:
 On Thu, Jul 19, 2018 at 4:26 AM, Saeed Mahameed  
 wrote:
> From: Eran Ben Elisha 
>
> Driver is yet to support aRFS for encapsulated packets, return early
> error in such case.


 Eran,

 Isn't that something which is done wrong by the arfs stack code?

 If the kernel has an SKB which has encap set and an arfs steering
 rule is programed into the driver, the API should include a driver neutral
 description for the encap header for the HW to match, so maybe we can just 
 do

>>>
>>> Hi Or,
>>> This could break existing drivers support for tunneled aRFS, and hurts
>>> their RX performance dramatically..
>>
>>> IMHO, it is expected from the driver to figure out that the skb holds
>>> encap packet and act accordingly.
>>
>> I don't think this one bit indication on the skb is enough for
>> any HW driver (e.g mlx4, mlx5 and others) to properly set
>> the steering rules.
>
> why do you think it is not enough?
> mlx5e currently cannot offload tunneled packets, so this info is
> perfectly fit in order to reject.

Do we know that all the flows in stack that deals with reception of packets do
this marking for encapsulated packets? is this documented some/where? if
this is what we think, then the patch is ok.


Re: [net-next v3 1/5] net/tls: Do not enable zero-copy prematurely

2018-07-19 Thread Boris Pismenny

Hi Vakul,

On 7/19/2018 7:16 AM, Vakul Garg wrote:

Zero-copy mode was left enabled even when zerocopy_from_iter() failed.
Set the zero-copy mode only when zerocopy_from_iter() succeeds. This
leads to removal of argument 'zc' of function decrypt_skb_update().
Function decrypt_skb_update() does not need to check whether
ctx->decrypted is set since it is never called if ctx->decrypted is
true.



This patch breaks our tls_device code for the following 2 reasons:
1. We need to disable zerocopy if the device decrypted the record, 
because decrypted data has to be copied to user buffers.
2. ctx->decrypted must be checked in decrypt_skb_update, because it 
might change after calling tls_device_decrypted.


Re: [PATCH net-next,v2] net: rename ndo_setup_tc to ndo_setup_offload

2018-07-19 Thread Martin Habets
On 19/07/18 01:11, Pablo Neira Ayuso wrote:
> One of the recurring complaints is that we do not have, as a driver
> writer, a central location from which we would be fed offloading rules
> into a NIC. This was brought up again during Netconf'18 in Boston.
> 
> This patch just renames ndo_setup_tc to ndo_setup_offload as a very
> early initial work to prepare for follow up patch that discuss unified
> flow representation for the existing offload programming APIs.
> 
> Signed-off-by: Pablo Neira Ayuso 
> Acked-by: Jiri Pirko 
> Acked-by: Jakub Kicinski 

Acked-by: Martin Habets 

> ---
> v2: Missing function definition update in drivers/net/ethernet/sfc/falcon/tx.c
> apparently I forgot to turn on that driver when doing compile-testing,
> problem spotted by Martin Habets. Keeping Jakub and Jiri Acked-by tags,
> as this is the only change in the v1 patch.
> 



Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-07-19 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 11:22:36AM -0700, Andrei Vagin wrote:
> > > [1.085679]   lock(cpu_hotplug_lock.rw_sem);
> > > [1.085753]   lock(cpu_hotplug_lock.rw_sem);
> > > [1.085828] 
> > > [1.085828]  *** DEADLOCK ***

> Peter and Ingo, maybe you could explain why it isn't safe to take one
> reader lock twice?

Very simple, because rwsems are fair and !recursive.

So if another CPU were to issue a write-lock in between these, then the
second would block because there is a writer pending. But because we
then already have a reader we're deadlocked.
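
That interleaving can be sketched with a tiny FIFO model of a fair,
non-recursive rwsem (illustrative Python, not the kernel's implementation):

```python
def recursive_read_deadlocks(events):
    """events: sequence of ('r'|'w', owner) lock acquisitions.
    A fair rwsem queues waiters in FIFO order, so a new reader must wait
    behind any pending writer; if that reader already holds the read side,
    nothing can make progress any more."""
    holders = set()   # owners currently holding the read side
    queue = []        # pending acquisitions, in arrival order
    for kind, who in events:
        if kind == 'w':
            queue.append(('w', who))     # writer waits for readers to drain
        else:
            if not queue:
                holders.add(who)         # no pending writer: reader enters
            elif who in holders:
                return True              # reader blocked behind the writer,
                                         # writer blocked on that reader
            else:
                queue.append(('r', who))
    return False

# CPU A read-locks, CPU B queues a write-lock, then A read-locks again:
print(recursive_read_deadlocks([('r', 'A'), ('w', 'B'), ('r', 'A')]))  # True
```

Taking the same reader lock twice is only safe on a lock that is either
recursive or unfair; rwsems are neither, which is exactly the point above.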


Re: [net 4/8] net/mlx5e: Don't allow aRFS for encapsulated packets

2018-07-19 Thread Eran Ben Elisha
On Thu, Jul 19, 2018 at 10:50 AM, Or Gerlitz  wrote:
> On Thu, Jul 19, 2018 at 9:55 AM, Eran Ben Elisha
>  wrote:
>> On Thu, Jul 19, 2018 at 9:23 AM, Or Gerlitz  wrote:
>>> On Thu, Jul 19, 2018 at 4:26 AM, Saeed Mahameed  wrote:
 From: Eran Ben Elisha 

 Driver is yet to support aRFS for encapsulated packets, return early
 error in such case.
>>>
>>>
>>> Eran,
>>>
>>> Isn't that something which is done wrong by the arfs stack code?
>>>
>>> If the kernel has an SKB which has encap set and an arfs steering
>>> rule is programed into the driver, the API should include a driver neutral
>>> description for the encap header for the HW to match, so maybe we can just 
>>> do
>>>
>>
>> Hi Or,
>> This could break existing drivers support for tunneled aRFS, and hurts
>> their RX performance dramatically..
>
>> IMHO, it is expected from the driver to figure out that the skb holds
>> encap packet and act accordingly.
>
> I don't think this one bit indication on the skb is enough for
> any HW driver (e.g mlx4, mlx5 and others) to properly set
> the steering rules.

why do you think it is not enough?
mlx5e currently cannot offload tunneled packets, so this info is
perfectly fit in order to reject.

>
> The problem you indicate typically doesn't come into play in the presence
> of VMs, since the host TCP stack isn't active on such traffic.
>
> This is maybe why it wasn't pointed earlier.
>
> I believe that more drivers are broken (mlx4?)

We can do a fix for that later.

>
> Looking now on bnxt, I see they dissect the skb and then check the
> FLOW_DIS_ENCAPSULATION flag but not always err.
>
> If we want to make sure we don't break anyone else, we can indeed have
> the check done in our drivers.
>
> It seems that the check done by bnxt is more general, thoughts?

I see bnxt driver checking protocol using this mechanism as well,
in mlx5e it was selected via skb metadata, let's keep going with this approach.

>
> Or.


Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-19 Thread Eran Ben Elisha
>
> This should not be num. It should be a string. Same for "mode".

will fix for v2, thanks.
>
>


Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-19 Thread Jiri Pirko
Thu, Jul 19, 2018 at 03:01:01AM CEST, sae...@mellanox.com wrote:
>From: Eran Ben Elisha 
>
>Add support for two driver parameters via devlink params interface:
>- Congestion action
>   HW mechanism in the PCIe buffer which monitors the amount of
>   consumed PCIe buffer per host.  This mechanism supports the
>   following actions in case of threshold overflow:
>   - Disabled - NOP (Default)
>   - Drop
>   - Mark - Mark CE bit in the CQE of received packet
>- Congestion mode
>   - Aggressive - Aggressive static trigger threshold (Default)
>   - Dynamic - Dynamically change the trigger threshold
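
The cross-checks between the two parameters (an action can only be disabled
while the mode is aggressive, and a non-aggressive mode requires an enabled
action) can be modeled as follows (illustrative Python; the constants and
names are assumptions based on this patch, not the mlx5 API):

```python
DISABLED, DROP, MARK = 0, 1, 2          # congestion actions
AGGRESSIVE, DYNAMIC = 0, 1              # congestion modes

class CongestionParams:
    """Toy model of the two devlink param set handlers' validation."""
    def __init__(self):
        self.action = DISABLED
        self.mode = AGGRESSIVE

    def set_action(self, val, mark_supported=True):
        max_action = MARK if mark_supported else DROP
        if val > max_action:
            raise ValueError("ERANGE")
        if val == DISABLED and self.mode != AGGRESSIVE:
            raise ValueError("EINVAL: disable only while mode is aggressive")
        self.action = val

    def set_mode(self, val):
        if val > DYNAMIC:
            raise ValueError("ERANGE")
        if val != AGGRESSIVE and self.action == DISABLED:
            raise ValueError("EINVAL: dynamic mode needs an enabled action")
        self.mode = val

p = CongestionParams()
p.set_action(MARK)   # enable marking first ...
p.set_mode(DYNAMIC)  # ... only then is the dynamic threshold allowed
```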
>
>Signed-off-by: Eran Ben Elisha 
>Reviewed-by: Moshe Shemesh 
>Signed-off-by: Saeed Mahameed 
>---
> .../net/ethernet/mellanox/mlx5/core/devlink.c | 105 +-
> 1 file changed, 104 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
>index 9800c98b01d3..1f04decef043 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
>@@ -153,12 +153,115 @@ static int mlx5_devlink_query_tx_overflow_sense(struct mlx5_core_dev *mdev,
>   return 0;
> }
> 
>+static int mlx5_devlink_set_congestion_action(struct devlink *devlink, u32 id,
>+struct devlink_param_gset_ctx *ctx)
>+{
>+  struct mlx5_core_dev *dev = devlink_priv(devlink);
>+  u8 max = MLX5_DEVLINK_CONGESTION_ACTION_MAX;
>+  u8 sense;
>+  int err;
>+
>+  if (!MLX5_CAP_MCAM_FEATURE(dev, mark_tx_action_cqe) &&
>+  !MLX5_CAP_MCAM_FEATURE(dev, mark_tx_action_cnp))
>+  max = MLX5_DEVLINK_CONGESTION_ACTION_MARK - 1;
>+
>+  if (ctx->val.vu8 > max)

This should not be num. It should be a string. Same for "mode".


>+  return -ERANGE;
>+
>+  err = mlx5_devlink_query_tx_overflow_sense(dev, &sense);
>+  if (err)
>+  return err;
>+
>+  if (ctx->val.vu8 == MLX5_DEVLINK_CONGESTION_ACTION_DISABLED &&
>+  sense != MLX5_DEVLINK_CONGESTION_MODE_AGGRESSIVE)
>+  return -EINVAL;
>+
>+  return mlx5_devlink_set_tx_lossy_overflow(dev, ctx->val.vu8);
>+}
>+
>+static int mlx5_devlink_get_congestion_action(struct devlink *devlink, u32 id,
>+struct devlink_param_gset_ctx *ctx)
>+{
>+  struct mlx5_core_dev *dev = devlink_priv(devlink);
>+
>+  return mlx5_devlink_query_tx_lossy_overflow(dev, &ctx->val.vu8);
>+}
>+
>+static int mlx5_devlink_set_congestion_mode(struct devlink *devlink, u32 id,
>+  struct devlink_param_gset_ctx *ctx)
>+{
>+  struct mlx5_core_dev *dev = devlink_priv(devlink);
>+  u8 tx_lossy_overflow;
>+  int err;
>+
>+  if (ctx->val.vu8 > MLX5_DEVLINK_CONGESTION_MODE_MAX)
>+  return -ERANGE;
>+
>+  err = mlx5_devlink_query_tx_lossy_overflow(dev, &tx_lossy_overflow);
>+  if (err)
>+  return err;
>+
>+  if (ctx->val.vu8 != MLX5_DEVLINK_CONGESTION_MODE_AGGRESSIVE &&
>+  tx_lossy_overflow == MLX5_DEVLINK_CONGESTION_ACTION_DISABLED)
>+  return -EINVAL;
>+
>+  return mlx5_devlink_set_tx_overflow_sense(dev, ctx->val.vu8);
>+}
>+
>+static int mlx5_devlink_get_congestion_mode(struct devlink *devlink, u32 id,
>+  struct devlink_param_gset_ctx *ctx)
>+{
>+  struct mlx5_core_dev *dev = devlink_priv(devlink);
>+
>+  return mlx5_devlink_query_tx_overflow_sense(dev, &ctx->val.vu8);
>+}
>+
>+enum mlx5_devlink_param_id {
>+  MLX5_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
>+  MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
>+  MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
>+};
>+
>+static const struct devlink_param mlx5_devlink_params[] = {
>+  DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
>+   "congestion_action",
>+   DEVLINK_PARAM_TYPE_U8,
>+   BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>+   mlx5_devlink_get_congestion_action,
>+   mlx5_devlink_set_congestion_action, NULL),
>+  DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
>+   "congestion_mode",
>+   DEVLINK_PARAM_TYPE_U8,
>+   BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>+   mlx5_devlink_get_congestion_mode,
>+   mlx5_devlink_set_congestion_mode, NULL),
>+};
>+
> int mlx5_devlink_register(struct devlink *devlink, struct device *dev)
> {
>-  return devlink_register(devlink, dev);
>+  int err;
>+
>+  err = devlink_register(devlink, dev);
>+  if (err)
>+  return err;
>+
>+  err = devlink_params_register(devlink, mlx5_devlink_params,
>+ARRAY_SIZE(mlx5_devl

Re: [net 4/8] net/mlx5e: Don't allow aRFS for encapsulated packets

2018-07-19 Thread Or Gerlitz
On Thu, Jul 19, 2018 at 9:55 AM, Eran Ben Elisha
 wrote:
> On Thu, Jul 19, 2018 at 9:23 AM, Or Gerlitz  wrote:
>> On Thu, Jul 19, 2018 at 4:26 AM, Saeed Mahameed  wrote:
>>> From: Eran Ben Elisha 
>>>
>>> Driver is yet to support aRFS for encapsulated packets, return early
>>> error in such case.
>>
>>
>> Eran,
>>
>> Isn't that something which is done wrong by the arfs stack code?
>>
>> If the kernel has an SKB which has encap set and an arfs steering
>> rule is programed into the driver, the API should include a driver neutral
>> description for the encap header for the HW to match, so maybe we can just do
>>
>
> Hi Or,
> This could break existing drivers support for tunneled aRFS, and hurts
> their RX performance dramatically..

> IMHO, it is expected from the driver to figure out that the skb holds
> encap packet and act accordingly.

I don't think this one bit indication on the skb is enough for
any HW driver (e.g mlx4, mlx5 and others) to properly set
the steering rules.

The problem you indicate typically doesn't come into play in the presence
of VMs, since the host TCP stack isn't active on such traffic.

This is maybe why it wasn't pointed out earlier.

I believe that more drivers are broken (mlx4?)

Looking now on bnxt, I see they dissect the skb and then check the
FLOW_DIS_ENCAPSULATION flag but not always err.

If we want to make sure we don't break anyone else, we can indeed have
the check done in our drivers.

It seems that the check done by bnxt is more general, thoughts?

Or.


Re: [RFC ipsec-next] xfrm: Remove xfrmi interface ID from flowi

2018-07-19 Thread Steffen Klassert
On Tue, Jul 17, 2018 at 02:40:04PM -0700, Benedict Wong wrote:

> @@ -2301,6 +2322,13 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
>   int reverse;
>   struct flowi fl;
>   int xerr_idx = -1;
> + const struct xfrm_if_cb *ifcb;
> + struct xfrm_if *xi;
> + u32 if_id = 0;
> +
> + rcu_read_lock();
> + ifcb = xfrm_if_get_cb();
> + rcu_read_unlock();
>  
>   reverse = dir & ~XFRM_POLICY_MASK;
>   dir &= XFRM_POLICY_MASK;
> @@ -2325,10 +2353,16 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
>   }
>   }
>  
> + if (ifcb) {
> + xi = ifcb->decode_session(skb);
> + if (xi)
> + if_id = xi->p.if_id;
> + }

The usage of the ifcb pointer should go into the
rcu_read_lock section above.

Looks good otherwise, nice improvement.

Please respin and do an official submission of this
patch, I'd like to merge it before I send the pull
request for the ipsec-next tree.


Re: [PATCH net 2/2] openvswitch: check for null return for nla_nest_start in datapath

2018-07-19 Thread Pravin Shelar
On Wed, Jul 18, 2018 at 9:12 AM, Stephen Hemminger
 wrote:
> The call to nla_nest_start when forming packet messages can lead to a NULL
> return so it's possible for attr to become NULL and we can potentially
> get a NULL pointer dereference on attr.  Fix this by checking for
> a NULL return.
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200537
> Fixes: 8f0aad6f35f7 ("openvswitch: Extend packet attribute for egress tunnel info")
> Signed-off-by: Stephen Hemminger 
> ---
>  net/openvswitch/datapath.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 0f5ce77460d4..93c3eb635827 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -460,6 +460,10 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>
> if (upcall_info->egress_tun_info) {
> nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_EGRESS_TUN_KEY);
> +   if (!nla) {
> +   err = -EMSGSIZE;
> +   goto out;
> +   }
> err = ovs_nla_put_tunnel_info(user_skb,
>   upcall_info->egress_tun_info);
> BUG_ON(err);
> @@ -468,6 +472,10 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>
> if (upcall_info->actions_len) {
> nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_ACTIONS);
> +   if (!nla) {
> +   err = -EMSGSIZE;
> +   goto out;
> +   }
> err = ovs_nla_put_actions(upcall_info->actions,
>   upcall_info->actions_len,
>   user_skb);

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH net 1/2] openvswitch: check for null return for nla_nest_start

2018-07-19 Thread Pravin Shelar
On Wed, Jul 18, 2018 at 9:12 AM, Stephen Hemminger
 wrote:
> The call to nla_nest_start in conntrack can lead to a NULL
> return so it's possible for attr to become NULL and we can potentially
> get a NULL pointer dereference on attr.  Fix this by checking for
> a NULL return.
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200533
> Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
> Signed-off-by: Stephen Hemminger 
> ---
>  net/openvswitch/conntrack.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 284aca2a252d..2e316f641df8 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -2132,6 +2132,8 @@ static int ovs_ct_limit_cmd_get(struct sk_buff *skb, struct genl_info *info)
> return PTR_ERR(reply);
>
> nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_ZONE_LIMIT);
> +   if (!nla_reply)
> +   return -EMSGSIZE;
>
> if (a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
> err = ovs_ct_limit_get_zone_limit(
> --
Acked-by: Pravin B Shelar 

Thanks.


Re: [RFC PATCH 3/3] net: macb: add support for padding and fcs computation

2018-07-19 Thread Claudiu Beznea



On 18.07.2018 20:54, David Miller wrote:
> From: Claudiu Beznea 
> Date: Wed, 18 Jul 2018 15:58:09 +0300
> 
>>  
>> +static int macb_pad_and_fcs(struct sk_buff **skb, struct net_device *ndev)
>> +{
>> +struct sk_buff *nskb;
>> +int padlen = ETH_ZLEN - (*skb)->len;
>> +int headroom = skb_headroom(*skb);
>> +int tailroom = skb_tailroom(*skb);
>> +bool cloned = skb_cloned(*skb) || skb_header_cloned(*skb);
>> +u32 fcs;
> 
> Please keep local variable ordered from longest to shortest line
> (ie. reverse christmas tree format).

OK! Thank you!

> 
> Thank you.
>