[PATCH net] sctp: get pr_assoc and pr_stream all status with SCTP_PR_SCTP_ALL instead

2018-10-16 Thread Xin Long
According to rfc7496 section 4.3 or 4.4:

   sprstat_policy:  This parameter indicates for which PR-SCTP policy
  the user wants the information.  It is an error to use
  SCTP_PR_SCTP_NONE in sprstat_policy.  If SCTP_PR_SCTP_ALL is used,
  the counters provided are aggregated over all supported policies.

We change pr_assoc and pr_stream to dump the aggregated status of all
policies when SCTP_PR_SCTP_ALL is used, and to return an error for
SCTP_PR_SCTP_NONE, as the RFC also says "It is an error to use
SCTP_PR_SCTP_NONE in sprstat_policy."
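
As an illustrative sketch (not part of the patch), a userspace caller
would request the aggregated counters roughly like this, assuming a
connected SCTP socket 'fd' and a valid 'assoc_id':

	struct sctp_prstatus params;
	socklen_t len = sizeof(params);

	memset(&params, 0, sizeof(params));
	params.sprstat_assoc_id = assoc_id;
	params.sprstat_policy = SCTP_PR_SCTP_ALL; /* aggregate all policies */

	if (getsockopt(fd, IPPROTO_SCTP, SCTP_PR_ASSOC_STATUS,
		       &params, &len) == 0)
		printf("abandoned unsent %llu, sent %llu\n",
		       (unsigned long long)params.sprstat_abandoned_unsent,
		       (unsigned long long)params.sprstat_abandoned_sent);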

Fixes: 826d253d57b1 ("sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt")
Fixes: d229d48d183f ("sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctp")
Reported-by: Ying Xu 
Signed-off-by: Xin Long 
---
 include/uapi/linux/sctp.h | 1 +
 net/sctp/socket.c | 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index b479db5..34dd3d4 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -301,6 +301,7 @@ enum sctp_sinfo_flags {
SCTP_SACK_IMMEDIATELY   = (1 << 3), /* SACK should be sent without 
delay. */
/* 2 bits here have been used by SCTP_PR_SCTP_MASK */
SCTP_SENDALL= (1 << 6),
+   SCTP_PR_SCTP_ALL= (1 << 7),
SCTP_NOTIFICATION   = MSG_NOTIFICATION, /* Next message is not user 
msg but notification. */
SCTP_EOF= MSG_FIN,  /* Initiate graceful shutdown 
process. */
 };
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index f73e9d3..e25a20f 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -7100,14 +7100,14 @@ static int sctp_getsockopt_pr_assocstatus(struct sock 
*sk, int len,
}
 
policy = params.sprstat_policy;
-   if (policy & ~SCTP_PR_SCTP_MASK)
+   if (!policy || (policy & ~(SCTP_PR_SCTP_MASK | SCTP_PR_SCTP_ALL)))
goto out;
 
asoc = sctp_id2assoc(sk, params.sprstat_assoc_id);
if (!asoc)
goto out;
 
-   if (policy == SCTP_PR_SCTP_NONE) {
+   if (policy & SCTP_PR_SCTP_ALL) {
params.sprstat_abandoned_unsent = 0;
params.sprstat_abandoned_sent = 0;
for (policy = 0; policy <= SCTP_PR_INDEX(MAX); policy++) {
@@ -7159,7 +7159,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock 
*sk, int len,
}
 
policy = params.sprstat_policy;
-   if (policy & ~SCTP_PR_SCTP_MASK)
+   if (!policy || (policy & ~(SCTP_PR_SCTP_MASK | SCTP_PR_SCTP_ALL)))
goto out;
 
asoc = sctp_id2assoc(sk, params.sprstat_assoc_id);
@@ -7175,7 +7175,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock 
*sk, int len,
goto out;
}
 
-   if (policy == SCTP_PR_SCTP_NONE) {
+   if (policy == SCTP_PR_SCTP_ALL) {
params.sprstat_abandoned_unsent = 0;
params.sprstat_abandoned_sent = 0;
for (policy = 0; policy <= SCTP_PR_INDEX(MAX); policy++) {
-- 
2.1.0



[PATCH] net-xfrm: add build time cfg option to PF_KEY SHA256 to use RFC4868-compliant truncation

2018-10-16 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

When using the PF_KEY interface, SHA-256 hashes are hardcoded to
use 96-bit truncation.  This is a violation of RFC4868, which
specifies 128-bit truncation.

We cannot fix this without introducing backwards-compatibility
concerns unless we make it an optional build-time setting
(defaulting to no).  Android will default to yes instead
of carrying an Android-specific one-line patch.

While the PF_KEY interface is deprecated in favour of netlink XFRM
(which allows the app to specify an arbitrary truncation length),
changing the PF_KEY truncation length from 96 to 128 allows PF_KEY
apps such as racoon to work with standards-compliant VPN servers.
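
For contrast, a sketch of the netlink XFRM path, where userspace states
the truncation length explicitly through struct xfrm_algo_auth (carried
in an XFRMA_ALG_AUTH_TRUNC attribute of an XFRM_MSG_NEWSA request), so
no build-time knob is needed there; 'key' and 'key_len' are assumed:

	struct xfrm_algo_auth *auth = calloc(1, sizeof(*auth) + key_len);

	strcpy(auth->alg_name, "hmac(sha256)");
	auth->alg_key_len = key_len * 8;	/* key length in bits */
	auth->alg_trunc_len = 128;		/* RFC4868 ICV truncation */
	memcpy(auth->alg_key, key, key_len);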

Cc: Lorenzo Colitti 
Signed-off-by: Maciej Żenczykowski 
---
 net/xfrm/Kconfig | 10 ++
 net/xfrm/xfrm_algo.c |  4 
 2 files changed, 14 insertions(+)

diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
index 4a9ee2d83158..0ede7e81a5d3 100644
--- a/net/xfrm/Kconfig
+++ b/net/xfrm/Kconfig
@@ -15,6 +15,16 @@ config XFRM_ALGO
select XFRM
select CRYPTO
 
+config XFRM_HMAC_SHA256_RFC4868
+   bool "Strict RFC4868 hmac(sha256) 128-bit truncation"
+   depends on XFRM_ALGO
+   default n
+   ---help---
+ Support strict RFC4868 hmac(sha256) 128-bit truncation
+ (default on Android) instead of the default 96-bit Linux truncation.
+
+ If unsure, say N.
+
 config XFRM_USER
tristate "Transformation user configuration interface"
depends on INET
diff --git a/net/xfrm/xfrm_algo.c b/net/xfrm/xfrm_algo.c
index 44ac85fe2bc9..a70391fb2c1e 100644
--- a/net/xfrm/xfrm_algo.c
+++ b/net/xfrm/xfrm_algo.c
@@ -241,7 +241,11 @@ static struct xfrm_algo_desc aalg_list[] = {
 
.uinfo = {
.auth = {
+#if IS_ENABLED(CONFIG_XFRM_HMAC_SHA256_RFC4868)
+   .icv_truncbits = 128,
+#else
.icv_truncbits = 96,
+#endif
.icv_fullbits = 256,
}
},
-- 
2.19.1.331.ge82ca0e54c-goog



Re: [PATCH] net-xfrm: add build time cfg option to PF_KEY SHA256 to use RFC4868-compliant truncation

2018-10-16 Thread Maciej Żenczykowski
Yes, I realize there have been similar submissions in the past,
but we're trying to get rid of (or upstream) Android kernel
networking divergences...
maybe this approach will be more palatable?

Thanks,
Maciej


Re: crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()

2018-10-16 Thread Florian Westphal
Maciej Żenczykowski  wrote:

I am currently travelling and not able to investigate
until next week.

> commit ad8b1ffc3efae2f65080bdb11145c87d299b8f9a
> Author: Florian Westphal 
> netfilter: ipv6: nf_defrag: drop skb dst before queueing
> 
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -618,6 +618,8 @@ int nf_ct_frag6_gather(struct net *net, struct
> sk_buff *skb, u32 user)
> fq->q.meat == fq->q.len &&
> nf_ct_frag6_reasm(fq, skb, dev))
> ret = 0;
> +   else
> +   skb_dst_drop(skb);

This is only supposed to drop the dst of skbs that are enqueued,
i.e. when frag6_gather returns NF_STOLEN.

In case the skb completes the queue, that skb's dst_entry
is supposed to be kept, so skb_dst() does NOT return NULL.

It's not supposed to be any different from IPv4 defrag.

> const struct dst_entry *dst = skb_dst(skb); // returns NULL

That is not supposed to happen.



Re: [PATCH] net-xfrm: add build time cfg option to PF_KEY SHA256 to use RFC4868-compliant truncation

2018-10-16 Thread Lorenzo Colitti
On Tue, Oct 16, 2018 at 5:06 PM Maciej Żenczykowski
 wrote:
> +config XFRM_HMAC_SHA256_RFC4868
> +   bool "Strict RFC4868 hmac(sha256) 128-bit truncation"
> +   depends on XFRM_ALGO
> +   default n
> +   ---help---
> + Support strict RFC4868 hmac(sha256) 128-bit truncation
> + (default on Android) instead of the default 96-bit Linux truncation.

Not sure it's worth mentioning Android here, given that other
contributors from other organizations have attempted to change this as
well.

> .uinfo = {
> .auth = {
> +#if IS_ENABLED(CONFIG_XFRM_HMAC_SHA256_RFC4868)
> +   .icv_truncbits = 128,
> +#else
> .icv_truncbits = 96,
> +#endif

Also, consider adding a Tested: line saying that this allows
pf_key_test.py to pass on upstream kernels.

Other than that,

Acked-By: Lorenzo Colitti 


Re: crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()

2018-10-16 Thread Maciej Żenczykowski
> That is not supposed to happen.

# uname -a
Linux (none) 4.9.119 #3 Tue Oct 16 02:34:36 PDT 2018 x86_64 GNU/Linux
root@(none)# ip6tables -A OUTPUT -m policy --dir out --pol ipsec
root@(none)# python -c "import os, socket;
ip='0001';
x='6001234504d82c40'+ip+ip+'3a01a1224d20' + 'ff'*(1280-40-8);
y='6001234500092c40'+ip+ip+'3a0004d0a1224d20' + 'ff';
s=socket.socket(socket.AF_INET6,socket.SOCK_RAW,socket.IPPROTO_RAW);
s.sendto(x.decode('hex'),('::1',0,0,1));
s.sendto(y.decode('hex'),('::1',0,0,1));"

Modules linked in:
Pid: 297, comm: python Not tainted 4.9.119
RIP: 0033:[<60272eca>]
RSP: 802afa10  EFLAGS: 00010246
RAX: 60492fa8 RBX: 60272c6f RCX: 803a12a8
RDX: 803a1288 RSI: 802afa98 RDI: 80314d00
RBP: 802afa40 R08: 0001 R09: 0100
R10:  R11: 803a12a8 R12: 00010002
R13: 000a R14:  R15: 
Kernel panic - not syncing: Kernel mode fault at addr 0x48, ip 0x60272eca
CPU: 0 PID: 297 Comm: python Not tainted 4.9.119 #3
Stack:
 800d5000 803a11e0 80314d00 803a1000
   802afb00 6031afe1
  803a1288 803a100c 10003
Call Trace:
 [<6031afe1>] ip6t_do_table+0x2a3/0x3d4
 [<6026d440>] ? netfilter_net_init+0xbe/0x14f
 [<6026d4d1>] ? nf_iterate+0x0/0x5c
 [<6031cca5>] ip6table_filter_hook+0x21/0x23
 [<6026d509>] nf_iterate+0x38/0x5c
 [<6026d561>] nf_hook_slow+0x34/0xa2
 [<6003166c>] ? set_signals+0x0/0x3f
 [<6003165d>] ? get_signals+0x0/0xf
 [<603048b0>] rawv6_sendmsg+0x842/0xc4b
 [<60033d15>] ? wait_stub_done+0x40/0x10a
 [<60021176>] ? copy_chunk_from_user+0x23/0x2e
 [<60021153>] ? copy_chunk_from_user+0x0/0x2e
 [<6030307f>] ? dst_output+0x0/0x11
 [<602b0926>] inet_sendmsg+0x1e/0x5c
 [<600fe15f>] ? __fdget+0x15/0x17
 [<602264b9>] sock_sendmsg+0xf/0x62
 [<602279aa>] SyS_sendto+0x108/0x140
 [<600389c2>] ? arch_switch_to+0x2b/0x2e
 [<60367ff4>] ? __schedule+0x428/0x44f
 [<60367bcc>] ? __schedule+0x0/0x44f
 [<60021125>] handle_syscall+0x79/0xa7
 [<6003445c>] userspace+0x3bb/0x453
 [<6001dd92>] ? interrupt_end+0x0/0x94
 [<6001dc42>] fork_handler+0x85/0x87


Re: crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()

2018-10-16 Thread Maciej Żenczykowski
(and v4.9.133 latest 4.9 LTS fails the same way, but curiously 4.19-rc8 doesn't)


Re: crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()

2018-10-16 Thread Maciej Żenczykowski
4.19-rc8 - pass
4.14.76 - pass
4.9.133 - fail
4.9.133 + revert of ad8b1ffc3efae2f65080bdb11145c87d299b8f9a - pass

On Tue, Oct 16, 2018 at 2:41 AM Maciej Żenczykowski
 wrote:
>
> (and v4.9.133 latest 4.9 LTS fails the same way, but curiously 4.19-rc8 
> doesn't)


[PATCH 2/2] arm64: dts: clearfog-gt-8k: 1G eth PHY reset signal

2018-10-16 Thread Baruch Siach
This reset signal controls the Marvell 1512 1G PHY.

Note that the current implementation queries the PHY over the MDIO bus
(get_phy_device() call from of_mdiobus_register_phy()) before
deasserting the reset signal. If the PHY reset signal is asserted at
boot time, PHY registration fails, so the current code relies on the
bootloader to deassert the reset signal.

Signed-off-by: Baruch Siach 
---
 arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts 
b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
index af1310c53bc8..73df0ef5e0c4 100644
--- a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
+++ b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
@@ -337,6 +337,10 @@
 */
marvell,reg-init = <3 16 0 0x1017>;
reg = <0>;
+   pinctrl-names = "default";
+   pinctrl-0 = <&cp0_copper_eth_phy_reset>;
+   reset-gpios = <&cp1_gpio1 11 GPIO_ACTIVE_LOW>;
+   reset-assert-us = <1>;
};
 
switch0: switch0@4 {
-- 
2.19.1



[PATCH 1/2] arm64: dts: clearfog-gt-8k: fix USB regulator gpio polarity

2018-10-16 Thread Baruch Siach
The fixed regulator driver ignores the gpio flags, so this change has
no practical effect in the current implementation. Fix it anyway to
correct the hardware description.

Signed-off-by: Baruch Siach 
---
 arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts 
b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
index aea9c220ae6a..af1310c53bc8 100644
--- a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
+++ b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
@@ -42,7 +42,7 @@
 
v_5v0_usb3_hst_vbus: regulator-usb3-vbus0 {
compatible = "regulator-fixed";
-   gpio = <&cp0_gpio2 15 GPIO_ACTIVE_HIGH>;
+   gpio = <&cp0_gpio2 15 GPIO_ACTIVE_LOW>;
pinctrl-names = "default";
pinctrl-0 = <&cp0_xhci_vbus_pins>;
regulator-name = "v_5v0_usb3_hst_vbus";
-- 
2.19.1



[PATCH net-next 2/5] qed: Added supported transceiver modes, speed capability and board config to HSI.

2018-10-16 Thread Rahul Verma
From: Rahul Verma 

Added transceiver modes with different speeds and media types,
speed capabilities and supported board types to the HSI; these
will be utilized to display the correct link modes and speed
types.
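
As a usage sketch, a consumer would decode transceiver_data with the
masks/offsets added below (the 'transceiver_data' variable and the
chosen type constant are illustrative):

	u32 state = (transceiver_data & ETH_TRANSCEIVER_STATE_MASK) >>
		    ETH_TRANSCEIVER_STATE_OFFSET;
	u32 type = (transceiver_data & ETH_TRANSCEIVER_TYPE_MASK) >>
		   ETH_TRANSCEIVER_TYPE_OFFSET;

	if (state == ETH_TRANSCEIVER_STATE_PRESENT &&
	    type == ETH_TRANSCEIVER_TYPE_25G_SR)
		; /* module is plugged in and is a 25G SR optic */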

Signed-off-by: Rahul Verma 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed_hsi.h | 54 ++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h 
b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index f2dfc7a..5c221eb 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -12207,11 +12207,56 @@ struct public_port {
u32 transceiver_data;
#define ETH_TRANSCEIVER_STATE_MASK 0x000000FF
#define ETH_TRANSCEIVER_STATE_SHIFT    0x00000000
+#define ETH_TRANSCEIVER_STATE_OFFSET   0x00000000
#define ETH_TRANSCEIVER_STATE_UNPLUGGED    0x00000000
#define ETH_TRANSCEIVER_STATE_PRESENT  0x00000001
#define ETH_TRANSCEIVER_STATE_VALID    0x00000003
#define ETH_TRANSCEIVER_STATE_UPDATING 0x00000008
-
+#define ETH_TRANSCEIVER_TYPE_MASK   0x0000FF00
+#define ETH_TRANSCEIVER_TYPE_OFFSET 0x8
+#define ETH_TRANSCEIVER_TYPE_NONE   0x00
+#define ETH_TRANSCEIVER_TYPE_UNKNOWN0xFF
+#define ETH_TRANSCEIVER_TYPE_1G_PCC 0x01
+#define ETH_TRANSCEIVER_TYPE_1G_ACC 0x02
+#define ETH_TRANSCEIVER_TYPE_1G_LX  0x03
+#define ETH_TRANSCEIVER_TYPE_1G_SX  0x04
+#define ETH_TRANSCEIVER_TYPE_10G_SR 0x05
+#define ETH_TRANSCEIVER_TYPE_10G_LR 0x06
+#define ETH_TRANSCEIVER_TYPE_10G_LRM0x07
+#define ETH_TRANSCEIVER_TYPE_10G_ER 0x08
+#define ETH_TRANSCEIVER_TYPE_10G_PCC0x09
+#define ETH_TRANSCEIVER_TYPE_10G_ACC0x0a
+#define ETH_TRANSCEIVER_TYPE_XLPPI  0x0b
+#define ETH_TRANSCEIVER_TYPE_40G_LR40x0c
+#define ETH_TRANSCEIVER_TYPE_40G_SR40x0d
+#define ETH_TRANSCEIVER_TYPE_40G_CR40x0e
+#define ETH_TRANSCEIVER_TYPE_100G_AOC   0x0f
+#define ETH_TRANSCEIVER_TYPE_100G_SR4   0x10
+#define ETH_TRANSCEIVER_TYPE_100G_LR4   0x11
+#define ETH_TRANSCEIVER_TYPE_100G_ER4   0x12
+#define ETH_TRANSCEIVER_TYPE_100G_ACC   0x13
+#define ETH_TRANSCEIVER_TYPE_100G_CR4   0x14
+#define ETH_TRANSCEIVER_TYPE_4x10G_SR   0x15
+#define ETH_TRANSCEIVER_TYPE_25G_CA_N   0x16
+#define ETH_TRANSCEIVER_TYPE_25G_ACC_S  0x17
+#define ETH_TRANSCEIVER_TYPE_25G_CA_S   0x18
+#define ETH_TRANSCEIVER_TYPE_25G_ACC_M  0x19
+#define ETH_TRANSCEIVER_TYPE_25G_CA_L   0x1a
+#define ETH_TRANSCEIVER_TYPE_25G_ACC_L  0x1b
+#define ETH_TRANSCEIVER_TYPE_25G_SR 0x1c
+#define ETH_TRANSCEIVER_TYPE_25G_LR 0x1d
+#define ETH_TRANSCEIVER_TYPE_25G_AOC0x1e
+#define ETH_TRANSCEIVER_TYPE_4x10G  0x1f
+#define ETH_TRANSCEIVER_TYPE_4x25G_CR   0x20
+#define ETH_TRANSCEIVER_TYPE_1000BASET  0x21
+#define ETH_TRANSCEIVER_TYPE_10G_BASET  0x22
+#define ETH_TRANSCEIVER_TYPE_MULTI_RATE_10G_40G_SR  0x30
+#define ETH_TRANSCEIVER_TYPE_MULTI_RATE_10G_40G_CR  0x31
+#define ETH_TRANSCEIVER_TYPE_MULTI_RATE_10G_40G_LR  0x32
+#define ETH_TRANSCEIVER_TYPE_MULTI_RATE_40G_100G_SR 0x33
+#define ETH_TRANSCEIVER_TYPE_MULTI_RATE_40G_100G_CR 0x34
+#define ETH_TRANSCEIVER_TYPE_MULTI_RATE_40G_100G_LR 0x35
+#define ETH_TRANSCEIVER_TYPE_MULTI_RATE_40G_100G_AOC0x36
u32 wol_info;
u32 wol_pkt_len;
u32 wol_pkt_details;
@@ -13199,6 +13244,13 @@ struct nvm_cfg1_port {
u32 transceiver_00;
u32 device_ids;
u32 board_cfg;
+#define NVM_CFG1_PORT_PORT_TYPE_MASK0x00FF
+#define NVM_CFG1_PORT_PORT_TYPE_OFFSET  0
+#define NVM_CFG1_PORT_PORT_TYPE_UNDEFINED   0x0
+#define NVM_CFG1_PORT_PORT_TYPE_MODULE  0x1
+#define NVM_CFG1_PORT_PORT_TYPE_BACKPLANE   0x2
+#define NVM_CFG1_PORT_PORT_TYPE_EXT_PHY 0x3
+#define NVM_CFG1_PORT_PORT_TYPE_MODULE_SLAVE0x4
u32 mnm_10g_cap;
u32 mnm_10g_ctrl;
u32 mnm_10g_misc;
-- 
1.8.3.1



[PATCH net-next 0/5] Align PTT and add various link modes.

2018-10-16 Thread Rahul Verma
From: Rahul Verma 

This series aligns the PTT propagation through the APIs as either a
local or the global PTT. It adds new transceiver modes, speed
capabilities and board config, which are utilized to display the
enhanced link modes, media types and speeds, providing more detailed
link information.

Rahul Verma (5):
  qed: Align local and global PTT to propagate through the APIs.
  qed: Added supported transceiver modes, speed capability and board
config to HSI.
  qed: Add supported link and advertise link to display in ethtool.
  qede: Check available link modes before link set from ethtool.
  qed: Prevent link getting down in case of autoneg-off.

 drivers/net/ethernet/qlogic/qed/qed.h   |   2 +-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h   |  54 -
 drivers/net/ethernet/qlogic/qed/qed_main.c  | 259 ++--
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   | 201 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  51 -
 drivers/net/ethernet/qlogic/qed/qed_vf.c|   2 +-
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c |  95 ++---
 include/linux/qed/qed_if.h  |  26 ++-
 8 files changed, 587 insertions(+), 103 deletions(-)

-- 
1.8.3.1



[PATCH net-next 3/5] qed: Add supported link and advertise link to display in ethtool.

2018-10-16 Thread Rahul Verma
From: Rahul Verma 

The transceiver types, speed capabilities and board types
added to the HSI are utilized to display accurate link
information in ethtool.

Signed-off-by: Rahul Verma 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed_main.c  | 199 ++--
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   | 182 ++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  46 ++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c |  31 +++-
 include/linux/qed/qed_if.h  |  26 +++-
 5 files changed, 426 insertions(+), 58 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 8c7cbbd..e762881 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -58,6 +58,7 @@
 #include "qed_iscsi.h"
 
 #include "qed_mcp.h"
+#include "qed_reg_addr.h"
 #include "qed_hw.h"
 #include "qed_selftest.h"
 #include "qed_debug.h"
@@ -1330,8 +1331,7 @@ static int qed_set_link(struct qed_dev *cdev, struct 
qed_link_params *params)
link_params->speed.autoneg = params->autoneg;
if (params->override_flags & QED_LINK_OVERRIDE_SPEED_ADV_SPEEDS) {
link_params->speed.advertised_speeds = 0;
-   if ((params->adv_speeds & QED_LM_1000baseT_Half_BIT) ||
-   (params->adv_speeds & QED_LM_1000baseT_Full_BIT))
+   if (params->adv_speeds & QED_LM_1000baseT_Full_BIT)
link_params->speed.advertised_speeds |=
NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_1G;
	if (params->adv_speeds & QED_LM_10000baseKR_Full_BIT)
@@ -1462,13 +1462,149 @@ static int qed_get_link_data(struct qed_hwfn *hwfn,
return 0;
 }
 
+static void qed_fill_link_capability(struct qed_hwfn *hwfn,
+struct qed_ptt *ptt, u32 capability,
+u32 *if_capability)
+{
+   u32 media_type, tcvr_state, tcvr_type;
+   u32 speed_mask, board_cfg;
+
+   if (qed_mcp_get_media_type(hwfn, ptt, &media_type))
+   media_type = MEDIA_UNSPECIFIED;
+
+   if (qed_mcp_get_transceiver_data(hwfn, ptt, &tcvr_state, &tcvr_type))
+   tcvr_type = ETH_TRANSCEIVER_STATE_UNPLUGGED;
+
+   if (qed_mcp_trans_speed_mask(hwfn, ptt, &speed_mask))
	speed_mask = 0xffffffff;
+
+   if (qed_mcp_get_board_config(hwfn, ptt, &board_cfg))
+   board_cfg = NVM_CFG1_PORT_PORT_TYPE_UNDEFINED;
+
+   DP_VERBOSE(hwfn->cdev, NETIF_MSG_DRV,
+  "Media_type = 0x%x tcvr_state = 0x%x tcvr_type = 0x%x 
speed_mask = 0x%x board_cfg = 0x%x\n",
+  media_type, tcvr_state, tcvr_type, speed_mask, board_cfg);
+
+   switch (media_type) {
+   case MEDIA_DA_TWINAX:
+   if (capability & NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_20G)
+   *if_capability |= QED_LM_20000baseKR2_Full_BIT;
+   /* For DAC media multiple speed capabilities are supported */
+   capability = capability & speed_mask;
+   if (capability & NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_1G)
+   *if_capability |= QED_LM_1000baseKX_Full_BIT;
+   if (capability & NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_10G)
+   *if_capability |= QED_LM_10000baseCR_Full_BIT;
+   if (capability & NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_40G)
+   *if_capability |= QED_LM_40000baseCR4_Full_BIT;
+   if (capability & NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_25G)
+   *if_capability |= QED_LM_25000baseCR_Full_BIT;
+   if (capability & NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_50G)
+   *if_capability |= QED_LM_50000baseCR2_Full_BIT;
+   if (capability &
+   NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_BB_100G)
+   *if_capability |= QED_LM_100000baseCR4_Full_BIT;
+   break;
+   case MEDIA_BASE_T:
+   if (board_cfg & NVM_CFG1_PORT_PORT_TYPE_EXT_PHY) {
+   if (capability &
+   NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_1G) {
+   *if_capability |= QED_LM_1000baseT_Full_BIT;
+   }
+   if (capability &
+   NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_10G) {
+   *if_capability |= QED_LM_10000baseT_Full_BIT;
+   }
+   }
+   if (board_cfg & NVM_CFG1_PORT_PORT_TYPE_MODULE) {
+   if (tcvr_type == ETH_TRANSCEIVER_TYPE_1000BASET)
+   *if_capability |= QED_LM_1000baseT_Full_BIT;
+   if (tcvr_type == ETH_TRANSCEIVER_TYPE_10G_BASET)
+   *i

[PATCH net-next 1/5] qed: Align local and global PTT to propagate through the APIs.

2018-10-16 Thread Rahul Verma
From: Rahul Verma 

Align the use of a local PTT propagated through the qed_mcp* APIs.
The global PTT should not be used.

Register access should be done through layers. A register address is
mapped into a PTT (PF translation table). Several interface functions
require a PTT to read/write registers directly. A pool of PTTs is
maintained, and several PTTs are used simultaneously to access
device registers in different flows. The same PTT should not be used
in flows that can run concurrently.
To avoid running out of PTT resources, too many PTTs should not be
acquired without releasing them. Every PF has a global PTT, which is
used throughout the life of the PF in the most important flows for
register access. Generic functions acquire a PTT locally and release
it after use. This patch aligns the use of the global and local PTTs
accordingly.
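
The resulting pattern in generic flows looks like the sketch below
(modeled on the qed_get_current_link() hunk in this patch; error
handling is illustrative):

	struct qed_ptt *ptt = qed_ptt_acquire(hwfn);

	if (!ptt)
		return;	/* pool exhausted; never fall back to the global PTT */

	qed_mcp_get_media_type(hwfn, ptt, &media_type);
	qed_ptt_release(hwfn, ptt);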

Signed-off-by: Rahul Verma 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed.h  |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_main.c | 22 ++
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  | 27 ---
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  |  5 +++--
 drivers/net/ethernet/qlogic/qed/qed_vf.c   |  2 +-
 5 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index 5f0962d..d9a03ab 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -915,7 +915,7 @@ void qed_set_fw_mac_addr(__le16 *fw_msb,
 /* Prototypes */
 int qed_fill_dev_info(struct qed_dev *cdev,
  struct qed_dev_info *dev_info);
-void qed_link_update(struct qed_hwfn *hwfn);
+void qed_link_update(struct qed_hwfn *hwfn, struct qed_ptt *ptt);
 u32 qed_unzip_data(struct qed_hwfn *p_hwfn,
   u32 input_len, u8 *input_buf,
   u32 max_size, u8 *unzip_buf);
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 75d217a..8c7cbbd 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1463,6 +1463,7 @@ static int qed_get_link_data(struct qed_hwfn *hwfn,
 }
 
 static void qed_fill_link(struct qed_hwfn *hwfn,
+ struct qed_ptt *ptt,
  struct qed_link_output *if_link)
 {
struct qed_mcp_link_params params;
@@ -1549,7 +1550,7 @@ static void qed_fill_link(struct qed_hwfn *hwfn,
 
/* TODO - fill duplex properly */
if_link->duplex = DUPLEX_FULL;
-   qed_mcp_get_media_type(hwfn->cdev, &media_type);
+   qed_mcp_get_media_type(hwfn, ptt, &media_type);
if_link->port = qed_get_port_type(media_type);
 
if_link->autoneg = params.speed.autoneg;
@@ -1607,21 +1608,34 @@ static void qed_fill_link(struct qed_hwfn *hwfn,
 static void qed_get_current_link(struct qed_dev *cdev,
 struct qed_link_output *if_link)
 {
+   struct qed_hwfn *hwfn;
+   struct qed_ptt *ptt;
int i;
 
-   qed_fill_link(&cdev->hwfns[0], if_link);
+   hwfn = &cdev->hwfns[0];
+   if (IS_PF(cdev)) {
+   ptt = qed_ptt_acquire(hwfn);
+   if (ptt) {
+   qed_fill_link(hwfn, ptt, if_link);
+   qed_ptt_release(hwfn, ptt);
+   } else {
+   DP_NOTICE(hwfn, "Failed to fill link; No PTT\n");
+   }
+   } else {
+   qed_fill_link(hwfn, NULL, if_link);
+   }
 
for_each_hwfn(cdev, i)
qed_inform_vf_link_state(&cdev->hwfns[i]);
 }
 
-void qed_link_update(struct qed_hwfn *hwfn)
+void qed_link_update(struct qed_hwfn *hwfn, struct qed_ptt *ptt)
 {
void *cookie = hwfn->cdev->ops_cookie;
struct qed_common_cb_ops *op = hwfn->cdev->protocol_ops.common;
struct qed_link_output if_link;
 
-   qed_fill_link(hwfn, &if_link);
+   qed_fill_link(hwfn, ptt, &if_link);
qed_inform_vf_link_state(hwfn);
 
if (IS_LEAD_HWFN(hwfn) && cookie)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index b06e4cb..92c5950 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1447,7 +1447,7 @@ static void qed_mcp_handle_link_change(struct qed_hwfn 
*p_hwfn,
if (p_hwfn->mcp_info->capabilities & FW_MB_PARAM_FEATURE_SUPPORT_EEE)
qed_mcp_read_eee_config(p_hwfn, p_ptt, p_link);
 
-   qed_link_update(p_hwfn);
+   qed_link_update(p_hwfn, p_ptt);
 out:
spin_unlock_bh(&p_hwfn->mcp_info->link_lock);
 }
@@ -1867,12 +1867,10 @@ int qed_mcp_get_mbi_ver(struct qed_hwfn *p_hwfn,
return 0;
 }
 
-int qed_mcp_get_media_type(struct qed_dev *cdev, u32 *p_media_type)
+int qed_mcp_get_media_type(struct qed_hwfn *p_hwfn,
+  struct qed_ptt

[PATCH net-next 5/5] qed: Prevent link getting down in case of autoneg-off.

2018-10-16 Thread Rahul Verma
From: Rahul Verma 

The newly added link modes must be taken into account when
setting link modes. If a new link mode is not handled in
qed_set_link, an empty supported-capability mask may be
passed to the MFW after setting autoneg off/on with the
current/supported speed, causing the link to go down.

Signed-off-by: Rahul Verma 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed_main.c | 40 --
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index e762881..35fd0db 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1305,6 +1305,7 @@ static int qed_set_link(struct qed_dev *cdev, struct 
qed_link_params *params)
struct qed_hwfn *hwfn;
struct qed_mcp_link_params *link_params;
struct qed_ptt *ptt;
+   u32 sup_caps;
int rc;
 
if (!cdev)
@@ -1331,25 +1332,50 @@ static int qed_set_link(struct qed_dev *cdev, struct 
qed_link_params *params)
link_params->speed.autoneg = params->autoneg;
if (params->override_flags & QED_LINK_OVERRIDE_SPEED_ADV_SPEEDS) {
link_params->speed.advertised_speeds = 0;
-   if (params->adv_speeds & QED_LM_1000baseT_Full_BIT)
+   sup_caps = QED_LM_1000baseT_Full_BIT |
+  QED_LM_1000baseKX_Full_BIT |
+  QED_LM_1000baseX_Full_BIT;
+   if (params->adv_speeds & sup_caps)
link_params->speed.advertised_speeds |=
NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_1G;
-   if (params->adv_speeds & QED_LM_10000baseKR_Full_BIT)
+   sup_caps = QED_LM_10000baseT_Full_BIT |
+  QED_LM_10000baseKR_Full_BIT |
+  QED_LM_10000baseKX4_Full_BIT |
+  QED_LM_10000baseR_FEC_BIT |
+  QED_LM_10000baseCR_Full_BIT |
+  QED_LM_10000baseSR_Full_BIT |
+  QED_LM_10000baseLR_Full_BIT |
+  QED_LM_10000baseLRM_Full_BIT;
+   if (params->adv_speeds & sup_caps)
link_params->speed.advertised_speeds |=
NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_10G;
if (params->adv_speeds & QED_LM_20000baseKR2_Full_BIT)
link_params->speed.advertised_speeds |=
NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_20G;
-   if (params->adv_speeds & QED_LM_25000baseKR_Full_BIT)
+   sup_caps = QED_LM_25000baseKR_Full_BIT |
+  QED_LM_25000baseCR_Full_BIT |
+  QED_LM_25000baseSR_Full_BIT;
+   if (params->adv_speeds & sup_caps)
link_params->speed.advertised_speeds |=
NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_25G;
-   if (params->adv_speeds & QED_LM_40000baseLR4_Full_BIT)
+   sup_caps = QED_LM_40000baseLR4_Full_BIT |
+  QED_LM_40000baseKR4_Full_BIT |
+  QED_LM_40000baseCR4_Full_BIT |
+  QED_LM_40000baseSR4_Full_BIT;
+   if (params->adv_speeds & sup_caps)
link_params->speed.advertised_speeds |=
-   NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_40G;
-   if (params->adv_speeds & QED_LM_50000baseKR2_Full_BIT)
+   NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_40G;
+   sup_caps = QED_LM_50000baseKR2_Full_BIT |
+  QED_LM_50000baseCR2_Full_BIT |
+  QED_LM_50000baseSR2_Full_BIT;
+   if (params->adv_speeds & sup_caps)
link_params->speed.advertised_speeds |=
NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_50G;
-   if (params->adv_speeds & QED_LM_100000baseKR4_Full_BIT)
+   sup_caps = QED_LM_100000baseKR4_Full_BIT |
+  QED_LM_100000baseSR4_Full_BIT |
+  QED_LM_100000baseCR4_Full_BIT |
+  QED_LM_100000baseLR4_ER4_Full_BIT;
+   if (params->adv_speeds & sup_caps)
link_params->speed.advertised_speeds |=
NVM_CFG1_PORT_DRV_SPEED_CAPABILITY_MASK_BB_100G;
}
-- 
1.8.3.1



[PATCH net-next 4/5] qede: Check available link modes before link set from ethtool.

2018-10-16 Thread Rahul Verma
From: Rahul Verma 

Set the link mode only after checking the available "supported"
link caps of the port.

Signed-off-by: Rahul Verma 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 64 +
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index df3ad59..8cbbd62 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -518,6 +518,7 @@ static int qede_set_link_ksettings(struct net_device *dev,
struct qede_dev *edev = netdev_priv(dev);
struct qed_link_output current_link;
struct qed_link_params params;
+   u32 sup_caps;
 
if (!edev->ops || !edev->ops->common->can_link_change(edev->cdev)) {
DP_INFO(edev, "Link settings are not allowed to be changed\n");
@@ -544,60 +545,85 @@ static int qede_set_link_ksettings(struct net_device *dev,
params.forced_speed = base->speed;
switch (base->speed) {
case SPEED_1000:
-   if (!(current_link.supported_caps &
- QED_LM_1000baseT_Full_BIT)) {
+   sup_caps = QED_LM_1000baseT_Full_BIT |
+   QED_LM_1000baseKX_Full_BIT |
+   QED_LM_1000baseX_Full_BIT;
+   if (!(current_link.supported_caps & sup_caps)) {
DP_INFO(edev, "1G speed not supported\n");
return -EINVAL;
}
-   params.adv_speeds = QED_LM_1000baseT_Full_BIT;
+   params.adv_speeds = current_link.supported_caps &
+   sup_caps;
break;
case SPEED_10000:
-   if (!(current_link.supported_caps &
- QED_LM_10000baseKR_Full_BIT)) {
+   sup_caps = QED_LM_10000baseT_Full_BIT |
+   QED_LM_10000baseKR_Full_BIT |
+   QED_LM_10000baseKX4_Full_BIT |
+   QED_LM_10000baseR_FEC_BIT |
+   QED_LM_10000baseCR_Full_BIT |
+   QED_LM_10000baseSR_Full_BIT |
+   QED_LM_10000baseLR_Full_BIT |
+   QED_LM_10000baseLRM_Full_BIT;
+   if (!(current_link.supported_caps & sup_caps)) {
DP_INFO(edev, "10G speed not supported\n");
return -EINVAL;
}
-   params.adv_speeds = QED_LM_10000baseKR_Full_BIT;
+   params.adv_speeds = current_link.supported_caps &
+   sup_caps;
break;
case SPEED_20000:
if (!(current_link.supported_caps &
- QED_LM_20000baseKR2_Full_BIT)) {
+   QED_LM_20000baseKR2_Full_BIT)) {
DP_INFO(edev, "20G speed not supported\n");
return -EINVAL;
}
params.adv_speeds = QED_LM_20000baseKR2_Full_BIT;
break;
case SPEED_25000:
-   if (!(current_link.supported_caps &
- QED_LM_25000baseKR_Full_BIT)) {
+   sup_caps = QED_LM_25000baseKR_Full_BIT |
+   QED_LM_25000baseCR_Full_BIT |
+   QED_LM_25000baseSR_Full_BIT;
+   if (!(current_link.supported_caps & sup_caps)) {
DP_INFO(edev, "25G speed not supported\n");
return -EINVAL;
}
-   params.adv_speeds = QED_LM_25000baseKR_Full_BIT;
+   params.adv_speeds = current_link.supported_caps &
+   sup_caps;
break;
case SPEED_40000:
-   if (!(current_link.supported_caps &
- QED_LM_40000baseLR4_Full_BIT)) {
+   sup_caps = QED_LM_40000baseLR4_Full_BIT |
+   QED_LM_40000baseKR4_Full_BIT |
+   QED_LM_40000baseCR4_Full_BIT |
+   QED_LM_40000baseSR4_Full_BIT;
+   if (!(current_link.supported_caps & sup_caps)) {
DP_INFO(edev, "40G speed not supported\n");
 

Re: [PATCH net] sctp: get pr_assoc and pr_stream all status with SCTP_PR_SCTP_ALL instead

2018-10-16 Thread Neil Horman
On Tue, Oct 16, 2018 at 03:52:02PM +0800, Xin Long wrote:
> According to rfc7496 section 4.3 or 4.4:
> 
>sprstat_policy:  This parameter indicates for which PR-SCTP policy
>   the user wants the information.  It is an error to use
>   SCTP_PR_SCTP_NONE in sprstat_policy.  If SCTP_PR_SCTP_ALL is used,
>   the counters provided are aggregated over all supported policies.
> 
> We change pr_assoc and pr_stream to dump the aggregated status of all
> policies when SCTP_PR_SCTP_ALL is used, and to return an error for
> SCTP_PR_SCTP_NONE, as the RFC also says "It is an error to use
> SCTP_PR_SCTP_NONE in sprstat_policy."
> 
> Fixes: 826d253d57b1 ("sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt")
> Fixes: d229d48d183f ("sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctp")
> Reported-by: Ying Xu 
> Signed-off-by: Xin Long 
> ---
>  include/uapi/linux/sctp.h | 1 +
>  net/sctp/socket.c | 8 
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
> index b479db5..34dd3d4 100644
> --- a/include/uapi/linux/sctp.h
> +++ b/include/uapi/linux/sctp.h
> @@ -301,6 +301,7 @@ enum sctp_sinfo_flags {
>   SCTP_SACK_IMMEDIATELY   = (1 << 3), /* SACK should be sent without 
> delay. */
>   /* 2 bits here have been used by SCTP_PR_SCTP_MASK */
>   SCTP_SENDALL= (1 << 6),
> + SCTP_PR_SCTP_ALL= (1 << 7),
>   SCTP_NOTIFICATION   = MSG_NOTIFICATION, /* Next message is not user 
> msg but notification. */
>   SCTP_EOF= MSG_FIN,  /* Initiate graceful shutdown 
> process. */
>  };
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index f73e9d3..e25a20f 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -7100,14 +7100,14 @@ static int sctp_getsockopt_pr_assocstatus(struct sock 
> *sk, int len,
>   }
>  
>   policy = params.sprstat_policy;
> - if (policy & ~SCTP_PR_SCTP_MASK)
> + if (!policy || (policy & ~(SCTP_PR_SCTP_MASK | SCTP_PR_SCTP_ALL)))
>   goto out;
>  
>   asoc = sctp_id2assoc(sk, params.sprstat_assoc_id);
>   if (!asoc)
>   goto out;
>  
> - if (policy == SCTP_PR_SCTP_NONE) {
> + if (policy & SCTP_PR_SCTP_ALL) {
>   params.sprstat_abandoned_unsent = 0;
>   params.sprstat_abandoned_sent = 0;
>   for (policy = 0; policy <= SCTP_PR_INDEX(MAX); policy++) {
> @@ -7159,7 +7159,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock 
> *sk, int len,
>   }
>  
>   policy = params.sprstat_policy;
> - if (policy & ~SCTP_PR_SCTP_MASK)
> + if (!policy || (policy & ~(SCTP_PR_SCTP_MASK | SCTP_PR_SCTP_ALL)))
>   goto out;
>  
>   asoc = sctp_id2assoc(sk, params.sprstat_assoc_id);
> @@ -7175,7 +7175,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock 
> *sk, int len,
>   goto out;
>   }
>  
> - if (policy == SCTP_PR_SCTP_NONE) {
> + if (policy == SCTP_PR_SCTP_ALL) {
>   params.sprstat_abandoned_unsent = 0;
>   params.sprstat_abandoned_sent = 0;
>   for (policy = 0; policy <= SCTP_PR_INDEX(MAX); policy++) {
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 



[PATCH 02/16] octeontx2-af: CGX Rx/Tx enable/disable mbox handlers

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Added new mailbox msgs with which RVU PF/VFs can request the AF
to enable/disable their mapped CGX::LMAC Rx & Tx.
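
On the AF side, handling such a request boils down to resolving the
requesting PF's mapped CGX/LMAC pair and toggling the CMR config; a
sketch is below (rvu_get_pf() and rvu->pf2cgxlmac_map are assumptions,
as the actual rvu_cgx_config_rxtx() body is truncated in the diff):

	/* Sketch of the AF handler; helper names are assumptions. */
	int pf = rvu_get_pf(pcifunc);
	u8 cgx_id, lmac_id;

	if (!is_pf_cgxmapped(rvu, pf))
		return -ENODEV;
	rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_id, &lmac_id);
	return cgx_lmac_rx_tx_enable(cgx_get_pdata(cgx_id), lmac_id, start);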

Signed-off-by: Sunil Goutham 
Signed-off-by: Linu Cherian 
---
 drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 18 
 drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  5 
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  2 ++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h| 16 +++
 .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 32 ++
 5 files changed, 73 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index 5328ecc..03a91c6 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -119,6 +119,24 @@ void *cgx_get_pdata(int cgx_id)
 }
 EXPORT_SYMBOL(cgx_get_pdata);
 
+int cgx_lmac_rx_tx_enable(void *cgxd, int lmac_id, bool enable)
+{
+   struct cgx *cgx = cgxd;
+   u64 cfg;
+
+   if (!cgx || lmac_id >= cgx->lmac_count)
+   return -ENODEV;
+
+   cfg = cgx_read(cgx, lmac_id, CGXX_CMRX_CFG);
+   if (enable)
+   cfg |= CMR_EN | DATA_PKT_RX_EN | DATA_PKT_TX_EN;
+   else
+   cfg &= ~(CMR_EN | DATA_PKT_RX_EN | DATA_PKT_TX_EN);
+   cgx_write(cgx, lmac_id, CGXX_CMRX_CFG, cfg);
+   return 0;
+}
+EXPORT_SYMBOL(cgx_lmac_rx_tx_enable);
+
 /* CGX Firmware interface low level support */
 static int cgx_fwi_cmd_send(u64 req, u64 *resp, struct lmac *lmac)
 {
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
index a2a7a6d..9097935 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
@@ -24,6 +24,10 @@
 #define CGX_OFFSET(x)  ((x) * MAX_LMAC_PER_CGX)
 
 /* Registers */
+#define CGXX_CMRX_CFG  0x00
+#define  CMR_EN    BIT_ULL(55)
+#define  DATA_PKT_TX_EN    BIT_ULL(53)
+#define  DATA_PKT_RX_EN    BIT_ULL(54)
 #define CGXX_CMRX_INT  0x040
#define  FW_CGX_INT    BIT_ULL(1)
 #define CGXX_CMRX_INT_ENA_W1S  0x058
@@ -62,4 +66,5 @@ int cgx_get_cgx_cnt(void);
 int cgx_get_lmac_cnt(void *cgxd);
 void *cgx_get_pdata(int cgx_id);
 int cgx_lmac_evh_register(struct cgx_event_cb *cb, void *cgxd, int lmac_id);
+int cgx_lmac_rx_tx_enable(void *cgxd, int lmac_id, bool enable);
 #endif /* CGX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index bedf0ee..6b66cf0 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -124,6 +124,8 @@ M(ATTACH_RESOURCES, 0x002, rsrc_attach, msg_rsp)
\
 M(DETACH_RESOURCES,0x003, rsrc_detach, msg_rsp)\
 M(MSIX_OFFSET, 0x004, msg_req, msix_offset_rsp)\
 /* CGX mbox IDs (range 0x200 - 0x3FF) */   \
+M(CGX_START_RXTX,  0x200, msg_req, msg_rsp)\
+M(CGX_STOP_RXTX,   0x201, msg_req, msg_rsp)\
 /* NPA mbox IDs (range 0x400 - 0x5FF) */   \
 /* SSO/SSOW mbox IDs (range 0x600 - 0x7FF) */  \
 /* TIM mbox IDs (range 0x800 - 0x9FF) */   \
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index d169fa9..4cf2bcb 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -153,6 +153,22 @@ int rvu_get_blkaddr(struct rvu *rvu, int blktype, u16 
pcifunc);
 int rvu_poll_reg(struct rvu *rvu, u64 block, u64 offset, u64 mask, bool zero);
 
 /* CGX APIs */
+static inline bool is_pf_cgxmapped(struct rvu *rvu, u8 pf)
+{
+   return (pf >= PF_CGXMAP_BASE && pf <= rvu->cgx_mapped_pfs);
+}
+
+static inline void rvu_get_cgx_lmac_id(u8 map, u8 *cgx_id, u8 *lmac_id)
+{
+   *cgx_id = (map >> 4) & 0xF;
+   *lmac_id = (map & 0xF);
+}
+
 int rvu_cgx_probe(struct rvu *rvu);
 void rvu_cgx_wq_destroy(struct rvu *rvu);
+int rvu_cgx_config_rxtx(struct rvu *rvu, u16 pcifunc, bool start);
+int rvu_mbox_handler_CGX_START_RXTX(struct rvu *rvu, struct msg_req *req,
+   struct msg_rsp *rsp);
+int rvu_mbox_handler_CGX_STOP_RXTX(struct rvu *rvu, struct msg_req *req,
+  struct msg_rsp *rsp);
 #endif /* RVU_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
index 5ecc223..75a03a8 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
@@ -192,3 +192,35 @@ int rvu_cgx_probe(struct rvu *rvu)

[PATCH 03/16] octeontx2-af: Support to retrieve CGX LMAC stats

2018-10-16 Thread sunil . kovvuri
From: Christina Jacob 

This patch adds support for an RVU PF/VF driver to retrieve
its mapped CGX LMAC Rx and Tx stats from the AF via mbox.
A new mailbox msg is added for this.
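
The handler is essentially a loop over the per-LMAC stat registers, as
sketched below (the actual rvu_mbox_handler_CGX_STATS() body is
truncated in the diff; error codes are illustrative):

	int stat;
	u64 val;

	for (stat = 0; stat < CGX_RX_STATS_COUNT; stat++) {
		if (cgx_get_rx_stats(cgxd, lmac, stat, &val))
			return -ENODEV;
		rsp->rx_stats[stat] = val;
	}
	for (stat = 0; stat < CGX_TX_STATS_COUNT; stat++) {
		if (cgx_get_tx_stats(cgxd, lmac, stat, &val))
			return -ENODEV;
		rsp->tx_stats[stat] = val;
	}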

Signed-off-by: Christina Jacob 
Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 22 +
 drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  4 +++
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   | 11 +++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|  2 ++
 .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 37 ++
 5 files changed, 76 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index 03a91c6..a7dc6f2 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -119,6 +119,28 @@ void *cgx_get_pdata(int cgx_id)
 }
 EXPORT_SYMBOL(cgx_get_pdata);
 
+int cgx_get_rx_stats(void *cgxd, int lmac_id, int idx, u64 *rx_stat)
+{
+   struct cgx *cgx = cgxd;
+
+   if (!cgx || lmac_id >= cgx->lmac_count)
+   return -ENODEV;
+   *rx_stat =  cgx_read(cgx, lmac_id, CGXX_CMRX_RX_STAT0 + (idx * 8));
+   return 0;
+}
+EXPORT_SYMBOL(cgx_get_rx_stats);
+
+int cgx_get_tx_stats(void *cgxd, int lmac_id, int idx, u64 *tx_stat)
+{
+   struct cgx *cgx = cgxd;
+
+   if (!cgx || lmac_id >= cgx->lmac_count)
+   return -ENODEV;
+   *tx_stat = cgx_read(cgx, lmac_id, CGXX_CMRX_TX_STAT0 + (idx * 8));
+   return 0;
+}
+EXPORT_SYMBOL(cgx_get_tx_stats);
+
 int cgx_lmac_rx_tx_enable(void *cgxd, int lmac_id, bool enable)
 {
struct cgx *cgx = cgxd;
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
index 9097935..8f596dfb 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
@@ -32,7 +32,9 @@
#define  FW_CGX_INT    BIT_ULL(1)
 #define CGXX_CMRX_INT_ENA_W1S  0x058
 #define CGXX_CMRX_RX_ID_MAP0x060
+#define CGXX_CMRX_RX_STAT0 0x070
 #define CGXX_CMRX_RX_LMACS 0x128
+#define CGXX_CMRX_TX_STAT0 0x700
 #define CGXX_SCRATCH0_REG  0x1050
 #define CGXX_SCRATCH1_REG  0x1058
 #define CGX_CONST  0x2000
@@ -66,5 +68,7 @@ int cgx_get_cgx_cnt(void);
 int cgx_get_lmac_cnt(void *cgxd);
 void *cgx_get_pdata(int cgx_id);
 int cgx_lmac_evh_register(struct cgx_event_cb *cb, void *cgxd, int lmac_id);
+int cgx_get_tx_stats(void *cgxd, int lmac_id, int idx, u64 *tx_stat);
+int cgx_get_rx_stats(void *cgxd, int lmac_id, int idx, u64 *rx_stat);
 int cgx_lmac_rx_tx_enable(void *cgxd, int lmac_id, bool enable);
 #endif /* CGX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 6b66cf0..03dd04d 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -126,6 +126,7 @@ M(MSIX_OFFSET,  0x004, msg_req, 
msix_offset_rsp)\
 /* CGX mbox IDs (range 0x200 - 0x3FF) */   \
 M(CGX_START_RXTX,  0x200, msg_req, msg_rsp)\
 M(CGX_STOP_RXTX,   0x201, msg_req, msg_rsp)\
+M(CGX_STATS,   0x202, msg_req, cgx_stats_rsp)  \
 /* NPA mbox IDs (range 0x400 - 0x5FF) */   \
 /* SSO/SSOW mbox IDs (range 0x600 - 0x7FF) */  \
 /* TIM mbox IDs (range 0x800 - 0x9FF) */   \
@@ -210,4 +211,14 @@ struct msix_offset_rsp {
u16  cptlf_msixoff[MAX_RVU_BLKLF_CNT];
 };
 
+/* CGX mbox message formats */
+
+struct cgx_stats_rsp {
+   struct mbox_msghdr hdr;
+#define CGX_RX_STATS_COUNT 13
+#define CGX_TX_STATS_COUNT 18
+   u64 rx_stats[CGX_RX_STATS_COUNT];
+   u64 tx_stats[CGX_TX_STATS_COUNT];
+};
+
 #endif /* MBOX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 4cf2bcb..8ee6663 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -171,4 +171,6 @@ int rvu_mbox_handler_CGX_START_RXTX(struct rvu *rvu, struct 
msg_req *req,
struct msg_rsp *rsp);
 int rvu_mbox_handler_CGX_STOP_RXTX(struct rvu *rvu, struct msg_req *req,
   struct msg_rsp *rsp);
+int rvu_mbox_handler_CGX_STATS(struct rvu *rvu, struct msg_req *req,
+  struct cgx_stats_rsp *rsp);
 #endif /* RVU_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
index 75a03a8..a4aa1e0 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c

[PATCH 04/16] octeontx2-af: Support for MAC address filters in CGX

2018-10-16 Thread sunil . kovvuri
From: Vidhya Raman 

This patch adds support for setting MAC address filters in CGX
for PF interfaces. PF interfaces can also be put in promiscuous
mode. Dataplane PFs access this functionality using mailbox
messages to the AF driver.
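
A minimal usage sketch of the new CGX-level API (the MAC value is
illustrative):

	u8 mac[ETH_ALEN] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 };

	cgx_lmac_addr_set(cgx_id, lmac_id, mac);	/* install DMAC CAM entry */
	cgx_lmac_promisc_config(cgx_id, lmac_id, true);	/* bypass the filter */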

Signed-off-by: Vidhya Raman 
Signed-off-by: Stanislaw Kardach 
---
 drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 75 ++
 drivers/net/ethernet/marvell/octeontx2/af/cgx.h| 12 
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   | 13 
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h| 10 +++
 .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 73 +
 5 files changed, 183 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index a7dc6f2..e7ae9e0 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -119,6 +119,81 @@ void *cgx_get_pdata(int cgx_id)
 }
 EXPORT_SYMBOL(cgx_get_pdata);
 
+static u64 mac2u64 (u8 *mac_addr)
+{
+   u64 mac = 0;
+   int index;
+
+   for (index = ETH_ALEN - 1; index >= 0; index--)
+   mac |= ((u64)*mac_addr++) << (8 * index);
+   return mac;
+}
+
+int cgx_lmac_addr_set(u8 cgx_id, u8 lmac_id, u8 *mac_addr)
+{
+   struct cgx *cgx_dev = cgx_get_pdata(cgx_id);
+   u64 cfg;
+
+   /* copy 6bytes from macaddr */
+   /* memcpy(&cfg, mac_addr, 6); */
+
+   cfg = mac2u64 (mac_addr);
+
+   cgx_write(cgx_dev, 0, (CGXX_CMRX_RX_DMAC_CAM0 + (lmac_id * 0x8)),
+ cfg | CGX_DMAC_CAM_ADDR_ENABLE | ((u64)lmac_id << 49));
+
+   cfg = cgx_read(cgx_dev, lmac_id, CGXX_CMRX_RX_DMAC_CTL0);
+   cfg |= CGX_DMAC_CTL0_CAM_ENABLE;
+   cgx_write(cgx_dev, lmac_id, CGXX_CMRX_RX_DMAC_CTL0, cfg);
+
+   return 0;
+}
+EXPORT_SYMBOL(cgx_lmac_addr_set);
+
+u64 cgx_lmac_addr_get(u8 cgx_id, u8 lmac_id)
+{
+   struct cgx *cgx_dev = cgx_get_pdata(cgx_id);
+   u64 cfg;
+
+   cfg = cgx_read(cgx_dev, 0, CGXX_CMRX_RX_DMAC_CAM0 + lmac_id * 0x8);
+   return cfg & CGX_RX_DMAC_ADR_MASK;
+}
+EXPORT_SYMBOL(cgx_lmac_addr_get);
+
+void cgx_lmac_promisc_config(int cgx_id, int lmac_id, bool enable)
+{
+   struct cgx *cgx = cgx_get_pdata(cgx_id);
+   u64 cfg = 0;
+
+   if (!cgx)
+   return;
+
+   if (enable) {
+   /* Enable promiscuous mode on LMAC */
+   cfg = cgx_read(cgx, lmac_id, CGXX_CMRX_RX_DMAC_CTL0);
+   cfg &= ~(CGX_DMAC_CAM_ACCEPT | CGX_DMAC_MCAST_MODE);
+   cfg |= CGX_DMAC_BCAST_MODE;
+   cgx_write(cgx, lmac_id, CGXX_CMRX_RX_DMAC_CTL0, cfg);
+
+   cfg = cgx_read(cgx, 0,
+  (CGXX_CMRX_RX_DMAC_CAM0 + lmac_id * 0x8));
+   cfg &= ~CGX_DMAC_CAM_ADDR_ENABLE;
+   cgx_write(cgx, 0,
+ (CGXX_CMRX_RX_DMAC_CAM0 + lmac_id * 0x8), cfg);
+   } else {
+   /* Disable promiscuous mode */
+   cfg = cgx_read(cgx, lmac_id, CGXX_CMRX_RX_DMAC_CTL0);
+   cfg |= CGX_DMAC_CAM_ACCEPT | CGX_DMAC_MCAST_MODE;
+   cgx_write(cgx, lmac_id, CGXX_CMRX_RX_DMAC_CTL0, cfg);
+   cfg = cgx_read(cgx, 0,
+  (CGXX_CMRX_RX_DMAC_CAM0 + lmac_id * 0x8));
+   cfg |= CGX_DMAC_CAM_ADDR_ENABLE;
+   cgx_write(cgx, 0,
+ (CGXX_CMRX_RX_DMAC_CAM0 + lmac_id * 0x8), cfg);
+   }
+}
+EXPORT_SYMBOL(cgx_lmac_promisc_config);
+
 int cgx_get_rx_stats(void *cgxd, int lmac_id, int idx, u64 *rx_stat)
 {
struct cgx *cgx = cgxd;
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
index 8f596dfb..3ae426b 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
@@ -34,6 +34,15 @@
 #define CGXX_CMRX_RX_ID_MAP0x060
 #define CGXX_CMRX_RX_STAT0 0x070
 #define CGXX_CMRX_RX_LMACS 0x128
+#define CGXX_CMRX_RX_DMAC_CTL0 0x1F8
+#define  CGX_DMAC_CTL0_CAM_ENABLE  BIT_ULL(3)
+#define  CGX_DMAC_CAM_ACCEPT   BIT_ULL(3)
+#define  CGX_DMAC_MCAST_MODE   BIT_ULL(1)
+#define  CGX_DMAC_BCAST_MODE   BIT_ULL(0)
+#define CGXX_CMRX_RX_DMAC_CAM0 0x200
+#define  CGX_DMAC_CAM_ADDR_ENABLE  BIT_ULL(48)
+#define CGXX_CMRX_RX_DMAC_CAM1 0x400
+#define CGX_RX_DMAC_ADR_MASK   GENMASK_ULL(47, 0)
 #define CGXX_CMRX_TX_STAT0 0x700
 #define CGXX_SCRATCH0_REG  0x1050
 #define CGXX_SCRATCH1_REG  0x1058
@@ -71,4 +80,7 @@ int cgx_lmac_evh_register(struct cgx_event_cb *cb, void 
*cgxd, int lmac_id);
 int cgx_get_tx_stats(void *cgxd, int lmac_id, int idx, u64 *tx_stat);
 int cgx_get_rx_stats(void *cgxd, int lmac_id, int idx, u64 *rx_stat);
 int cgx_lmac_rx_tx_enable(void *cgxd, int lmac_id, bool enable)

[PATCH 09/16] octeontx2-af: NPA AQ instruction enqueue support

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Add support for an RVU PF/VF to submit instructions to the NPA AQ
via mbox. Instructions can init/write/read Aura/Pool/Qint
contexts. In the case of a read, the context is returned as part
of the response to the mbox msg received.
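
For example, a read of an aura context would be built roughly as below
(NPA_AQ_CTYPE_AURA and NPA_AQ_INSTOP_READ are enum names assumed from
rvu_struct.h, which is not fully shown here):

	/* Sketch: ask the AF to read back an aura context. */
	req->aura_id = aura_id;
	req->ctype = NPA_AQ_CTYPE_AURA;
	req->op = NPA_AQ_INSTOP_READ;
	/* on success, the AF response carries the context in rsp->aura */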

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/common.h |  13 ++
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  35 
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|   3 +
 .../net/ethernet/marvell/octeontx2/af/rvu_npa.c| 158 +++
 .../net/ethernet/marvell/octeontx2/af/rvu_struct.h | 218 +
 5 files changed, 427 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/common.h 
b/drivers/net/ethernet/marvell/octeontx2/af/common.h
index c64d241..24021cb 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/common.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/common.h
@@ -118,4 +118,17 @@ enum npa_aura_sz {
 
 #define NPA_AURA_COUNT(x)  (1ULL << ((x) + 6))
 
+/* NPA AQ result structure for init/read/write of aura HW contexts */
+struct npa_aq_aura_res {
+   struct  npa_aq_res_sres;
+   struct  npa_aura_s  aura_ctx;
+   struct  npa_aura_s  ctx_mask;
+};
+
+/* NPA AQ result structure for init/read/write of pool HW contexts */
+struct npa_aq_pool_res {
+   struct  npa_aq_res_sres;
+   struct  npa_pool_s  pool_ctx;
+   struct  npa_pool_s  ctx_mask;
+};
 #endif /* COMMON_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 8135339..bf11058 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -141,6 +141,7 @@ M(CGX_INTLBK_DISABLE,   0x20B, msg_req, msg_rsp)
\
 /* NPA mbox IDs (range 0x400 - 0x5FF) */   \
 M(NPA_LF_ALLOC,0x400, npa_lf_alloc_req, npa_lf_alloc_rsp)  
\
 M(NPA_LF_FREE, 0x401, msg_req, msg_rsp)\
+M(NPA_AQ_ENQ,  0x402, npa_aq_enq_req, npa_aq_enq_rsp)  \
 /* SSO/SSOW mbox IDs (range 0x600 - 0x7FF) */  \
 /* TIM mbox IDs (range 0x800 - 0x9FF) */   \
 /* CPT mbox IDs (range 0xA00 - 0xBFF) */   \
@@ -290,4 +291,38 @@ struct npa_lf_alloc_rsp {
u16 qints; /* NPA_AF_CONST::QINTS */
 };
 
+/* NPA AQ enqueue msg */
+struct npa_aq_enq_req {
+   struct mbox_msghdr hdr;
+   u32 aura_id;
+   u8 ctype;
+   u8 op;
+   union {
+   /* Valid when op == WRITE/INIT and ctype == AURA.
+* LF fills the pool_id in aura.pool_addr. AF will translate
+* the pool_id to pool context pointer.
+*/
+   struct npa_aura_s aura;
+   /* Valid when op == WRITE/INIT and ctype == POOL */
+   struct npa_pool_s pool;
+   };
+   /* Mask data when op == WRITE (1=write, 0=don't write) */
+   union {
+   /* Valid when op == WRITE and ctype == AURA */
+   struct npa_aura_s aura_mask;
+   /* Valid when op == WRITE and ctype == POOL */
+   struct npa_pool_s pool_mask;
+   };
+};
+
+struct npa_aq_enq_rsp {
+   struct mbox_msghdr hdr;
+   union {
+   /* Valid when op == READ and ctype == AURA */
+   struct npa_aura_s aura;
+   /* Valid when op == READ and ctype == POOL */
+   struct npa_pool_s pool;
+   };
+};
+
 #endif /* MBOX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index b32d1f1..a70c26b 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -213,6 +213,9 @@ int rvu_mbox_handler_CGX_INTLBK_DISABLE(struct rvu *rvu, 
struct msg_req *req,
 /* NPA APIs */
 int rvu_npa_init(struct rvu *rvu);
 void rvu_npa_freemem(struct rvu *rvu);
+int rvu_mbox_handler_NPA_AQ_ENQ(struct rvu *rvu,
+   struct npa_aq_enq_req *req,
+   struct npa_aq_enq_rsp *rsp);
 int rvu_mbox_handler_NPA_LF_ALLOC(struct rvu *rvu,
  struct npa_lf_alloc_req *req,
  struct npa_lf_alloc_rsp *rsp);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c
index ea0c5e0..4ff0e76 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c
@@ -15,6 +15,164 @@
 #include "rvu_reg.h"
 #include "rvu.h"
 
+static int npa_aq_enqueue_wait(struct rvu *rvu, struct rvu_block *block,
+  struct npa_aq_inst_s *inst)
+{
+   struct admin_queue *aq = block->aq;
+   struct npa_aq_res_s *result;
+   int timeout = 1000;
+   u64 

[PATCH 00/16] octeontx2-af: NPA and NIX blocks initialization

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

This patchset is a continuation of the earlier submitted patch series
adding a new driver for the Resource Virtualization Unit (RVU)
admin function of Marvell's OcteonTX2 SOCs:

octeontx2-af: Add RVU Admin Function driver
https://www.spinics.net/lists/netdev/msg528272.html

This patch series adds logic for the following.
- Modified register polling loop to use time_before(jiffies, timeout),
  as suggested by Arnd Bergmann.
- Support to forward interface link status notifications sent by
  firmware to registered PFs mapped to a CGX::LMAC.
- Support to set CGX LMAC in loopback mode, retrieve stats,
  configure DMAC filters at CGX level etc.
- Network pool allocator (NPA) functional block initialization,
  admin queue support, NPALF aura/pool contexts memory allocation, init
  and deinit.
- Network interface controller (NIX) functional block basic init,
  admin queue support, NIXLF RQ/CQ/SQ HW contexts memory allocation,
  init and deinit.

Christina Jacob (1):
  octeontx2-af: Support to retrieve CGX LMAC stats

Geetha sowjanya (3):
  octeontx2-af: Enable or disable CGX internal loopback
  octeontx2-af: Support for disabling NPA Aura/Pool contexts
  octeontx2-af: Support for disabling NIX RQ/SQ/CQ contexts

Linu Cherian (1):
  octeontx2-af: Forward CGX link notifications to PFs

Sunil Goutham (10):
  octeontx2-af: Improve register polling loop
  octeontx2-af: CGX Rx/Tx enable/disable mbox handlers
  octeontx2-af: NPA block admin queue init
  octeontx2-af: NPA block LF initialization
  octeontx2-af: NPA AQ instruction enqueue support
  octeontx2-af: NIX block admin queue init
  octeontx2-af: NIX block LF initialization
  octeontx2-af: NIX LSO config for TSOv4/v6 offload
  octeontx2-af: Alloc bitmaps for NIX Tx scheduler queues
  octeontx2-af: NIX AQ instruction enqueue support

Vidhya Raman (1):
  octeontx2-af: Support for MAC address filters in CGX

 drivers/net/ethernet/marvell/octeontx2/af/Makefile |   2 +-
 drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 244 +-
 drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  47 +-
 drivers/net/ethernet/marvell/octeontx2/af/common.h | 161 
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   | 206 +
 drivers/net/ethernet/marvell/octeontx2/af/rvu.c| 152 +++-
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h| 117 ++-
 .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 320 +++-
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 892 +
 .../net/ethernet/marvell/octeontx2/af/rvu_npa.c| 475 +++
 .../net/ethernet/marvell/octeontx2/af/rvu_struct.h | 808 +++
 11 files changed, 3407 insertions(+), 17 deletions(-)
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/common.h
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c

-- 
2.7.4



[PATCH 05/16] octeontx2-af: Forward CGX link notifications to PFs

2018-10-16 Thread sunil . kovvuri
From: Linu Cherian 

Upon receiving a notification from firmware, the CGX event handler
in the AF driver gets the current link info (status, speed,
duplex, etc.) from the CGX driver and sends it across to the PFs
that have registered to receive such notifications.

To support the above:
 - Mbox messaging support for sending msgs from AF to PF has been added.
 - Added mbox msgs so that PFs can register/unregister for link events.
 - Link notifications are sent to a PF under two scenarios:
  1. When an asynchronous link change notification is received from
     firmware with the notification flag turned on for that PF.
  2. Upon a notification turn-on request, the current link status is
     sent to the PF.

Also added a new mailbox msg with which an RVU PF/VF can retrieve
its mapped CGX LMAC's current link info. Link info includes
status, speed, duplex and lmac type.
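
A usage sketch of the retrieval API (the link_up/speed field names are
assumed from the status/speed/duplex description above):

	struct cgx_link_user_info linfo;

	if (!cgx_get_link_info(cgxd, lmac_id, &linfo))
		pr_info("lmac%d: link %s, %u Mbps\n", lmac_id,
			linfo.link_up ? "up" : "down", linfo.speed);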

Signed-off-by: Linu Cherian 
Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/cgx.c|  99 --
 drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  21 ++-
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  22 +++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.c|  82 
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|   9 ++
 .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 147 -
 6 files changed, 368 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index e7ae9e0..077f83f 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -29,6 +29,7 @@
  * @wq_cmd_cmplt:  waitq to keep the process blocked until cmd completion
  * @cmd_lock:  Lock to serialize the command interface
  * @resp:  command response
+ * @link_info: link related information
  * @event_cb:  callback for linkchange events
  * @cmd_pend:  flag set before new command is started
  * flag cleared after command response is received
@@ -40,6 +41,7 @@ struct lmac {
wait_queue_head_t wq_cmd_cmplt;
struct mutex cmd_lock;
u64 resp;
+   struct cgx_link_user_info link_info;
struct cgx_event_cb event_cb;
bool cmd_pend;
struct cgx *cgx;
@@ -58,6 +60,12 @@ struct cgx {
 
 static LIST_HEAD(cgx_list);
 
+/* Convert firmware speed encoding to user format(Mbps) */
+static u32 cgx_speed_mbps[CGX_LINK_SPEED_MAX];
+
+/* Convert firmware lmac type encoding to string */
+static char *cgx_lmactype_string[LMAC_MODE_MAX];
+
 /* Supported devices */
 static const struct pci_device_id cgx_id_table[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_CAVIUM, PCI_DEVID_OCTEONTX2_CGX) },
@@ -119,6 +127,24 @@ void *cgx_get_pdata(int cgx_id)
 }
 EXPORT_SYMBOL(cgx_get_pdata);
 
+/* Ensure the required lock for event queue(where asynchronous events are
+ * posted) is acquired before calling this API. Else an asynchronous event(with
+ * latest link status) can reach the destination before this function returns
+ * and could make the link status appear wrong.
+ */
+int cgx_get_link_info(void *cgxd, int lmac_id,
+ struct cgx_link_user_info *linfo)
+{
+   struct lmac *lmac = lmac_pdata(lmac_id, cgxd);
+
+   if (!lmac)
+   return -ENODEV;
+
+   *linfo = lmac->link_info;
+   return 0;
+}
+EXPORT_SYMBOL(cgx_get_link_info);
+
 static u64 mac2u64 (u8 *mac_addr)
 {
u64 mac = 0;
@@ -160,6 +186,14 @@ u64 cgx_lmac_addr_get(u8 cgx_id, u8 lmac_id)
 }
 EXPORT_SYMBOL(cgx_lmac_addr_get);
 
+static inline u8 cgx_get_lmac_type(struct cgx *cgx, int lmac_id)
+{
+   u64 cfg;
+
+   cfg = cgx_read(cgx, lmac_id, CGXX_CMRX_CFG);
+   return (cfg >> CGX_LMAC_TYPE_SHIFT) & CGX_LMAC_TYPE_MASK;
+}
+
 void cgx_lmac_promisc_config(int cgx_id, int lmac_id, bool enable)
 {
struct cgx *cgx = cgx_get_pdata(cgx_id);
@@ -306,36 +340,79 @@ static inline int cgx_fwi_cmd_generic(u64 req, u64 *resp,
return err;
 }
 
+static inline void cgx_link_usertable_init(void)
+{
+   cgx_speed_mbps[CGX_LINK_NONE] = 0;
+   cgx_speed_mbps[CGX_LINK_10M] = 10;
+   cgx_speed_mbps[CGX_LINK_100M] = 100;
+   cgx_speed_mbps[CGX_LINK_1G] = 1000;
+   cgx_speed_mbps[CGX_LINK_2HG] = 2500;
+   cgx_speed_mbps[CGX_LINK_5G] = 5000;
+   cgx_speed_mbps[CGX_LINK_10G] = 10000;
+   cgx_speed_mbps[CGX_LINK_20G] = 20000;
+   cgx_speed_mbps[CGX_LINK_25G] = 25000;
+   cgx_speed_mbps[CGX_LINK_40G] = 40000;
+   cgx_speed_mbps[CGX_LINK_50G] = 50000;
+   cgx_speed_mbps[CGX_LINK_100G] = 100000;
+
+   cgx_lmactype_string[LMAC_MODE_SGMII] = "SGMII";
+   cgx_lmactype_string[LMAC_MODE_XAUI] = "XAUI";
+   cgx_lmactype_string[LMAC_MODE_RXAUI] = "RXAUI";
+   cgx_lmactype_string[LMAC_MODE_10G_R] = "10G_R";
+   cgx_lmactype_string[LMAC_MODE_40G_R] = "40G_R";
+   cgx_lmactype_string[LMAC_MODE_QSGMII] = "QSGMII";
+   cgx_lmactyp

[PATCH 01/16] octeontx2-af: Improve register polling loop

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Instead of looping on an integer timeout, use time_before(jiffies, timeout)
so that the maximum poll time is capped.
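
The same pattern in a self-contained userspace sketch, with
CLOCK_MONOTONIC standing in for jiffies (the 100 us budget mirrors the
patch; everything else is illustrative):

#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Poll cond() until it holds or a ~100 us budget expires. Unlike a
 * fixed iteration count, the deadline caps total poll time even when
 * individual sleeps overshoot their requested range.
 */
static bool poll_with_deadline(bool (*cond)(void))
{
	struct timespec now, end;

	clock_gettime(CLOCK_MONOTONIC, &end);
	end.tv_nsec += 100 * 1000;			/* 100 us budget */
	if (end.tv_nsec >= 1000000000L) {
		end.tv_sec++;
		end.tv_nsec -= 1000000000L;
	}
	for (;;) {
		if (cond())
			return true;			/* condition met */
		clock_gettime(CLOCK_MONOTONIC, &now);
		if (now.tv_sec > end.tv_sec ||
		    (now.tv_sec == end.tv_sec && now.tv_nsec >= end.tv_nsec))
			return false;			/* budget exhausted */
		usleep(2);			/* like usleep_range(1, 5) */
	}
}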

Signed-off-by: Sunil Goutham 
Suggested-by: Arnd Bergmann 
---
 drivers/net/ethernet/marvell/octeontx2/af/rvu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
index 2033f42..7cf5865 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
@@ -47,18 +47,18 @@ MODULE_DEVICE_TABLE(pci, rvu_id_table);
  */
 int rvu_poll_reg(struct rvu *rvu, u64 block, u64 offset, u64 mask, bool zero)
 {
+   unsigned long timeout = jiffies + usecs_to_jiffies(100);
void __iomem *reg;
-   int timeout = 100;
u64 reg_val;
 
reg = rvu->afreg_base + ((block << 28) | offset);
-   while (timeout) {
+   while (time_before(jiffies, timeout)) {
reg_val = readq(reg);
if (zero && !(reg_val & mask))
return 0;
if (!zero && (reg_val & mask))
return 0;
-   usleep_range(1, 2);
+   usleep_range(1, 5);
timeout--;
}
return -EBUSY;
-- 
2.7.4



[PATCH 10/16] octeontx2-af: Support for disabling NPA Aura/Pool contexts

2018-10-16 Thread sunil . kovvuri
From: Geetha sowjanya 

This patch adds support for an RVU PF/VF to disable all Aura/Pool
contexts of an NPA LF via mbox. This will be used by PF/VF drivers
upon teardown or while freeing up HW resources.

A HW context which is not INIT'ed cannot be modified, and an
RVU PF/VF driver may or may not INIT all the Aura/Pool contexts.
So a bitmap is introduced to keep track of enabled NPA Aura/Pool
contexts, so that only enabled HW contexts are disabled upon LF
teardown.
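
The tracked state after a masked context WRITE follows ordinary
read-modify-write logic; a standalone sketch of the merge done below in
rvu_npa_aq_enq_inst() (names here are illustrative):

#include <stdbool.h>

/* New enabled state: take ena from the request where the write mask
 * selects it, otherwise keep the previously tracked state.
 */
static bool ctx_enabled_after_write(bool req_ena, bool mask_ena,
				    bool tracked_ena)
{
	return (req_ena && mask_ena) || (tracked_ena && !mask_ena);
}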

Signed-off-by: Geetha sowjanya 
Signed-off-by: Stanislaw Kardach 
Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  7 ++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|  5 ++
 .../net/ethernet/marvell/octeontx2/af/rvu_npa.c| 98 ++
 3 files changed, 110 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index bf11058..4e87314 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -142,6 +142,7 @@ M(CGX_INTLBK_DISABLE,   0x20B, msg_req, msg_rsp)
\
 M(NPA_LF_ALLOC,0x400, npa_lf_alloc_req, npa_lf_alloc_rsp)  
\
 M(NPA_LF_FREE, 0x401, msg_req, msg_rsp)\
 M(NPA_AQ_ENQ,  0x402, npa_aq_enq_req, npa_aq_enq_rsp)  \
+M(NPA_HWCTX_DISABLE,   0x403, hwctx_disable_req, msg_rsp)  \
 /* SSO/SSOW mbox IDs (range 0x600 - 0x7FF) */  \
 /* TIM mbox IDs (range 0x800 - 0x9FF) */   \
 /* CPT mbox IDs (range 0xA00 - 0xBFF) */   \
@@ -325,4 +326,10 @@ struct npa_aq_enq_rsp {
};
 };
 
+/* Disable all contexts of type 'ctype' */
+struct hwctx_disable_req {
+   struct mbox_msghdr hdr;
+   u8 ctype;
+};
+
 #endif /* MBOX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index a70c26b..bfc95c3 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -77,6 +77,8 @@ struct rvu_pfvf {
struct qmem *aura_ctx;
struct qmem *pool_ctx;
struct qmem *npa_qints_ctx;
+   unsigned long   *aura_bmap;
+   unsigned long   *pool_bmap;
 };
 
 struct rvu_hwinfo {
@@ -216,6 +218,9 @@ void rvu_npa_freemem(struct rvu *rvu);
 int rvu_mbox_handler_NPA_AQ_ENQ(struct rvu *rvu,
struct npa_aq_enq_req *req,
struct npa_aq_enq_rsp *rsp);
+int rvu_mbox_handler_NPA_HWCTX_DISABLE(struct rvu *rvu,
+  struct hwctx_disable_req *req,
+  struct msg_rsp *rsp);
 int rvu_mbox_handler_NPA_LF_ALLOC(struct rvu *rvu,
  struct npa_lf_alloc_req *req,
  struct npa_lf_alloc_rsp *rsp);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c
index 4ff0e76..0e43a69 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c
@@ -63,6 +63,7 @@ static int rvu_npa_aq_enq_inst(struct rvu *rvu, struct 
npa_aq_enq_req *req,
struct admin_queue *aq;
struct rvu_pfvf *pfvf;
void *ctx, *mask;
+   bool ena;
 
pfvf = rvu_get_pfvf(rvu, pcifunc);
if (!pfvf->aura_ctx || req->aura_id >= pfvf->aura_ctx->qsize)
@@ -149,6 +150,35 @@ static int rvu_npa_aq_enq_inst(struct rvu *rvu, struct 
npa_aq_enq_req *req,
return rc;
}
 
+   /* Set aura bitmap if aura hw context is enabled */
+   if (req->ctype == NPA_AQ_CTYPE_AURA) {
+   if (req->op == NPA_AQ_INSTOP_INIT && req->aura.ena)
+   __set_bit(req->aura_id, pfvf->aura_bmap);
+   if (req->op == NPA_AQ_INSTOP_WRITE) {
+   ena = (req->aura.ena & req->aura_mask.ena) |
+   (test_bit(req->aura_id, pfvf->aura_bmap) &
+   ~req->aura_mask.ena);
+   if (ena)
+   __set_bit(req->aura_id, pfvf->aura_bmap);
+   else
+   __clear_bit(req->aura_id, pfvf->aura_bmap);
+   }
+   }
+
+   /* Set pool bitmap if pool hw context is enabled */
+   if (req->ctype == NPA_AQ_CTYPE_POOL) {
+   if (req->op == NPA_AQ_INSTOP_INIT && req->pool.ena)
+   __set_bit(req->aura_id, pfvf->pool_bmap);
+   if (req->op == NPA_AQ_INSTOP_WRITE) {
+   ena = (req->pool.ena & req->pool_mask.ena) |
+   (test_bit(req->aura_id, pfvf->pool_bmap) &
+   ~req->pool_mask.ena);
+   if (ena)
+   

[PATCH 06/16] octeontx2-af: Enable or disable CGX internal loopback

2018-10-16 Thread sunil . kovvuri
From: Geetha sowjanya 

Add support to enable or disable internal loopback mode in CGX.
New mbox IDs CGX_INTLBK_ENABLE/DISABLE added for this.

Signed-off-by: Geetha sowjanya 
Signed-off-by: Linu Cherian 
Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 30 +
 drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  5 
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  2 ++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|  4 +++
 .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 31 ++
 5 files changed, 72 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index 077f83f..352501b 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -194,6 +194,36 @@ static inline u8 cgx_get_lmac_type(struct cgx *cgx, int 
lmac_id)
return (cfg >> CGX_LMAC_TYPE_SHIFT) & CGX_LMAC_TYPE_MASK;
 }
 
+/* Configure CGX LMAC in internal loopback mode */
+int cgx_lmac_internal_loopback(void *cgxd, int lmac_id, bool enable)
+{
+   struct cgx *cgx = cgxd;
+   u8 lmac_type;
+   u64 cfg;
+
+   if (!cgx || lmac_id >= cgx->lmac_count)
+   return -ENODEV;
+
+   lmac_type = cgx_get_lmac_type(cgx, lmac_id);
+   if (lmac_type == LMAC_MODE_SGMII || lmac_type == LMAC_MODE_QSGMII) {
+   cfg = cgx_read(cgx, lmac_id, CGXX_GMP_PCS_MRX_CTL);
+   if (enable)
+   cfg |= CGXX_GMP_PCS_MRX_CTL_LBK;
+   else
+   cfg &= ~CGXX_GMP_PCS_MRX_CTL_LBK;
+   cgx_write(cgx, lmac_id, CGXX_GMP_PCS_MRX_CTL, cfg);
+   } else {
+   cfg = cgx_read(cgx, lmac_id, CGXX_SPUX_CONTROL1);
+   if (enable)
+   cfg |= CGXX_SPUX_CONTROL1_LBK;
+   else
+   cfg &= ~CGXX_SPUX_CONTROL1_LBK;
+   cgx_write(cgx, lmac_id, CGXX_SPUX_CONTROL1, cfg);
+   }
+   return 0;
+}
+EXPORT_SYMBOL(cgx_lmac_internal_loopback);
+
 void cgx_lmac_promisc_config(int cgx_id, int lmac_id, bool enable)
 {
struct cgx *cgx = cgx_get_pdata(cgx_id);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h 
b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
index c89edfa..ada25ed 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
@@ -50,6 +50,10 @@
 #define CGXX_SCRATCH0_REG  0x1050
 #define CGXX_SCRATCH1_REG  0x1058
 #define CGX_CONST  0x2000
+#define CGXX_SPUX_CONTROL1 0x10000
+#define  CGXX_SPUX_CONTROL1_LBK    BIT_ULL(14)
+#define CGXX_GMP_PCS_MRX_CTL   0x30000
+#define  CGXX_GMP_PCS_MRX_CTL_LBK  BIT_ULL(14)
 
#define CGX_COMMAND_REG    CGXX_SCRATCH1_REG
 #define CGX_EVENT_REG  CGXX_SCRATCH0_REG
@@ -100,6 +104,7 @@ int cgx_lmac_rx_tx_enable(void *cgxd, int lmac_id, bool 
enable);
 int cgx_lmac_addr_set(u8 cgx_id, u8 lmac_id, u8 *mac_addr);
 u64 cgx_lmac_addr_get(u8 cgx_id, u8 lmac_id);
 void cgx_lmac_promisc_config(int cgx_id, int lmac_id, bool enable);
+int cgx_lmac_internal_loopback(void *cgxd, int lmac_id, bool enable);
 int cgx_get_link_info(void *cgxd, int lmac_id,
  struct cgx_link_user_info *linfo);
 #endif /* CGX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 9f3790b..be1cb16 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -136,6 +136,8 @@ M(CGX_PROMISC_DISABLE,  0x206, msg_req, msg_rsp)
\
 M(CGX_START_LINKEVENTS, 0x207, msg_req, msg_rsp)   \
 M(CGX_STOP_LINKEVENTS, 0x208, msg_req, msg_rsp)\
 M(CGX_GET_LINKINFO,0x209, msg_req, cgx_link_info_msg)  \
+M(CGX_INTLBK_ENABLE,   0x20A, msg_req, msg_rsp)\
+M(CGX_INTLBK_DISABLE,  0x20B, msg_req, msg_rsp)\
 /* NPA mbox IDs (range 0x400 - 0x5FF) */   \
 /* SSO/SSOW mbox IDs (range 0x600 - 0x7FF) */  \
 /* TIM mbox IDs (range 0x800 - 0x9FF) */   \
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 8347808..88454cb 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -192,4 +192,8 @@ int rvu_mbox_handler_CGX_STOP_LINKEVENTS(struct rvu *rvu, 
struct msg_req *req,
 struct msg_rsp *rsp);
 int rvu_mbox_handler_CGX_GET_LINKINFO(struct rvu *rvu, struct msg_req *req,
  struct cgx_link_info_msg *rsp);
+int rvu_mbox_handler_CGX_INTLBK_E

[PATCH 07/16] octeontx2-af: NPA block admin queue init

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Initialize the NPA admin queue (AQ), i.e. alloc memory for
AQ instructions and for the results. All NPA LFs will submit
instructions to the AQ to init/write/read Aura/Pool contexts
and, in case of a read, get the context from the result memory.

Added some common APIs for allocating memory for a queue and
getting the IOVA in return; these APIs will be used by the
NIX AQ and for other purposes.
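
A sketch of how a caller might use the new helpers for an admin queue
(struct admin_queue, Q_COUNT() and AQ_SIZE are from common.h below; the
wrapper itself is illustrative, not the patch's exact code):

/* Allocate instruction and result queues; inst_sz/res_sz are the
 * per-entry sizes and AQ_SIZE encodes a 16-entry queue.
 */
static int aq_qmem_init(struct device *dev, struct admin_queue *aq,
			int inst_sz, int res_sz)
{
	int err;

	err = qmem_alloc(dev, &aq->inst, Q_COUNT(AQ_SIZE), inst_sz);
	if (err)
		return err;

	err = qmem_alloc(dev, &aq->res, Q_COUNT(AQ_SIZE), res_sz);
	if (err) {
		qmem_free(dev, aq->inst);
		return err;
	}

	spin_lock_init(&aq->lock);
	return 0;
}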

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/Makefile |  2 +-
 drivers/net/ethernet/marvell/octeontx2/af/common.h | 99 ++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.c| 46 ++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h| 13 ++-
 .../net/ethernet/marvell/octeontx2/af/rvu_npa.c| 86 +++
 .../net/ethernet/marvell/octeontx2/af/rvu_struct.h | 65 ++
 6 files changed, 309 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/common.h
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_npa.c

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/Makefile 
b/drivers/net/ethernet/marvell/octeontx2/af/Makefile
index eaac264..bdb4f98 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/Makefile
+++ b/drivers/net/ethernet/marvell/octeontx2/af/Makefile
@@ -7,4 +7,4 @@ obj-$(CONFIG_OCTEONTX2_MBOX) += octeontx2_mbox.o
 obj-$(CONFIG_OCTEONTX2_AF) += octeontx2_af.o
 
 octeontx2_mbox-y := mbox.o
-octeontx2_af-y := cgx.o rvu.o rvu_cgx.o
+octeontx2_af-y := cgx.o rvu.o rvu_cgx.o rvu_npa.o
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/common.h 
b/drivers/net/ethernet/marvell/octeontx2/af/common.h
new file mode 100644
index 000..ec493ba
--- /dev/null
+++ b/drivers/net/ethernet/marvell/octeontx2/af/common.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Marvell OcteonTx2 RVU Admin Function driver
+ *
+ * Copyright (C) 2018 Marvell International Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef COMMON_H
+#define COMMON_H
+
+#include "rvu_struct.h"
+
+#define OTX2_ALIGN 128  /* Align to cacheline */
+
+#define Q_SIZE_16  0ULL /* 16 entries */
+#define Q_SIZE_64  1ULL /* 64 entries */
+#define Q_SIZE_256 2ULL
+#define Q_SIZE_1K  3ULL
+#define Q_SIZE_4K  4ULL
+#define Q_SIZE_16K 5ULL
+#define Q_SIZE_64K 6ULL
+#define Q_SIZE_256K    7ULL
+#define Q_SIZE_1M  8ULL /* Million entries */
+#define Q_SIZE_MIN Q_SIZE_16
+#define Q_SIZE_MAX Q_SIZE_1M
+
+#define Q_COUNT(x) (16ULL << (2 * x))
+#define Q_SIZE(x, n)   ((ilog2(x) - (n)) / 2)
+
+/* Admin queue info */
+
+/* Since we intend to add only one instruction at a time,
+ * keep queue size to it's minimum.
+ */
+#define AQ_SIZE        Q_SIZE_16
+/* HW head & tail pointer mask */
+#define AQ_PTR_MASK    0xFFFFF
+
+struct qmem {
+   void*base;
+   dma_addr_t  iova;
+   int alloc_sz;
+   u8  entry_sz;
+   u8  align;
+   u32 qsize;
+};
+
+static inline int qmem_alloc(struct device *dev, struct qmem **q,
+int qsize, int entry_sz)
+{
+   struct qmem *qmem;
+   int aligned_addr;
+
+   if (!qsize)
+   return -EINVAL;
+
+   *q = devm_kzalloc(dev, sizeof(*qmem), GFP_KERNEL);
+   if (!*q)
+   return -ENOMEM;
+   qmem = *q;
+
+   qmem->entry_sz = entry_sz;
+   qmem->alloc_sz = (qsize * entry_sz) + OTX2_ALIGN;
+   qmem->base = dma_zalloc_coherent(dev, qmem->alloc_sz,
+&qmem->iova, GFP_KERNEL);
+   if (!qmem->base)
+   return -ENOMEM;
+
+   qmem->qsize = qsize;
+
+   aligned_addr = ALIGN((u64)qmem->iova, OTX2_ALIGN);
+   qmem->align = (aligned_addr - qmem->iova);
+   qmem->base += qmem->align;
+   qmem->iova += qmem->align;
+   return 0;
+}
+
+static inline void qmem_free(struct device *dev, struct qmem *qmem)
+{
+   if (!qmem)
+   return;
+
+   if (qmem->base)
+   dma_free_coherent(dev, qmem->alloc_sz,
+ qmem->base - qmem->align,
+ qmem->iova - qmem->align);
+   devm_kfree(dev, qmem);
+}
+
+struct admin_queue {
+   struct qmem *inst;
+   struct qmem *res;
+   spinlock_t  lock; /* Serialize inst enqueue from PFs */
+};
+
+#endif /* COMMON_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
index 85994ab..14255f2 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
@@ -552,6 +552,8 @@ static void r

[PATCH 11/16] octeontx2-af: NIX block admin queue init

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Initialize the NIX admin queue (AQ), i.e. alloc memory for
AQ instructions and for the results. All NIX LFs will submit
instructions to the AQ to init/write/read RQ/SQ/CQ/RSS contexts
and, in case of a read, get the context from the result memory.

Also, before configuring/using the NIX block, calibrate the X2P bus
and check whether NIX interfaces like CGX and LBK are in an active
and working state.

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/Makefile |   2 +-
 drivers/net/ethernet/marvell/octeontx2/af/rvu.c|   5 +
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|   4 +
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 138 +
 .../net/ethernet/marvell/octeontx2/af/rvu_struct.h |  72 +++
 5 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/Makefile 
b/drivers/net/ethernet/marvell/octeontx2/af/Makefile
index bdb4f98..45b108f 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/Makefile
+++ b/drivers/net/ethernet/marvell/octeontx2/af/Makefile
@@ -7,4 +7,4 @@ obj-$(CONFIG_OCTEONTX2_MBOX) += octeontx2_mbox.o
 obj-$(CONFIG_OCTEONTX2_AF) += octeontx2_af.o
 
 octeontx2_mbox-y := mbox.o
-octeontx2_af-y := cgx.o rvu.o rvu_cgx.o rvu_npa.o
+octeontx2_af-y := cgx.o rvu.o rvu_cgx.o rvu_npa.o rvu_nix.o
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
index 5d4917c..c06cca9 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
@@ -566,6 +566,7 @@ static void rvu_free_hw_resources(struct rvu *rvu)
u64 cfg;
 
rvu_npa_freemem(rvu);
+   rvu_nix_freemem(rvu);
 
/* Free block LF bitmaps */
for (id = 0; id < BLK_COUNT; id++) {
@@ -774,6 +775,10 @@ static int rvu_setup_hw_resources(struct rvu *rvu)
if (err)
return err;
 
+   err = rvu_nix_init(rvu);
+   if (err)
+   return err;
+
return 0;
 }
 
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index bfc95c3..0d0fb1d 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -226,4 +226,8 @@ int rvu_mbox_handler_NPA_LF_ALLOC(struct rvu *rvu,
  struct npa_lf_alloc_rsp *rsp);
 int rvu_mbox_handler_NPA_LF_FREE(struct rvu *rvu, struct msg_req *req,
 struct msg_rsp *rsp);
+
+/* NIX APIs */
+int rvu_nix_init(struct rvu *rvu);
+void rvu_nix_freemem(struct rvu *rvu);
 #endif /* RVU_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
new file mode 100644
index 000..5ff9e6b
--- /dev/null
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Marvell OcteonTx2 RVU Admin Function driver
+ *
+ * Copyright (C) 2018 Marvell International Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+
+#include "rvu_struct.h"
+#include "rvu_reg.h"
+#include "rvu.h"
+#include "cgx.h"
+
+static int nix_calibrate_x2p(struct rvu *rvu, int blkaddr)
+{
+   int idx, err;
+   u64 status;
+
+   /* Start X2P bus calibration */
+   rvu_write64(rvu, blkaddr, NIX_AF_CFG,
+   rvu_read64(rvu, blkaddr, NIX_AF_CFG) | BIT_ULL(9));
+   /* Wait for calibration to complete */
+   err = rvu_poll_reg(rvu, blkaddr,
+  NIX_AF_STATUS, BIT_ULL(10), false);
+   if (err) {
+   dev_err(rvu->dev, "NIX X2P bus calibration failed\n");
+   return err;
+   }
+
+   status = rvu_read64(rvu, blkaddr, NIX_AF_STATUS);
+   /* Check if CGX devices are ready */
+   for (idx = 0; idx < cgx_get_cgx_cnt(); idx++) {
+   if (status & (BIT_ULL(16 + idx)))
+   continue;
+   dev_err(rvu->dev,
+   "CGX%d didn't respond to NIX X2P calibration\n", idx);
+   err = -EBUSY;
+   }
+
+   /* Check if LBK is ready */
+   if (!(status & BIT_ULL(19))) {
+   dev_err(rvu->dev,
+   "LBK didn't respond to NIX X2P calibration\n");
+   err = -EBUSY;
+   }
+
+   /* Clear 'calibrate_x2p' bit */
+   rvu_write64(rvu, blkaddr, NIX_AF_CFG,
+   rvu_read64(rvu, blkaddr, NIX_AF_CFG) & ~BIT_ULL(9));
+   if (err || (status & 0x3FFULL))
+   dev_err(rvu->dev,
+   "NIX X2P calibration failed, status 0x%llx\n", status);
+   if (err)
+   return err;
+   return 0;
+}
+
+static int nix_aq_init(s

[PATCH 12/16] octeontx2-af: NIX block LF initialization

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Upon receiving the NIX_LF_ALLOC mbox message, allocate memory for the
NIXLF's CQ, SQ, RQ, CINT, QINT and RSS HW contexts and configure the
respective base IOVAs in HW. Enable caching of contexts into NIX NDC.

Return the SQ buffer (SQB) size, this PF/VF's MAC address and other
such info to the mbox msg sender.
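
On the PF/VF side the request might be filled roughly as follows (field
names are from the nix_lf_alloc_req struct added below; the counts and
the alloc_msg() helper are purely illustrative):

struct nix_lf_alloc_req *req = alloc_msg(mbox);	/* hypothetical */

req->rq_cnt = 64;	/* No of receive queues */
req->sq_cnt = 64;	/* No of send queues */
req->cq_cnt = 64;	/* No of completion queues */
req->rss_sz = 256;	/* RSS indirection table size */
req->rss_grps = 1;
/* RVU_DEFAULT_PF_FUNC presumably means "use this PF/VF's own LF" */
req->npa_func = RVU_DEFAULT_PF_FUNC;
req->sso_func = RVU_DEFAULT_PF_FUNC;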

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/common.h |  10 +
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  45 
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|  15 ++
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 238 +
 4 files changed, 308 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/common.h 
b/drivers/net/ethernet/marvell/octeontx2/af/common.h
index 24021cb..d183ad8 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/common.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/common.h
@@ -131,4 +131,14 @@ struct npa_aq_pool_res {
struct  npa_pool_s  pool_ctx;
struct  npa_pool_s  ctx_mask;
 };
+
+/* RSS info */
+#define MAX_RSS_GROUPS 8
+/* Group 0 has to be used in default pkt forwarding MCAM entries
+ * reserved for NIXLFs. Groups 1-7 can be used for RSS for ntuple
+ * filters.
+ */
+#define DEFAULT_RSS_CONTEXT_GROUP  0
+#define MAX_RSS_INDIR_TBL_SIZE 256 /* 1 << Max adder bits */
+
 #endif /* COMMON_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 4e87314..5718b55 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -148,6 +148,8 @@ M(NPA_HWCTX_DISABLE,0x403, hwctx_disable_req, 
msg_rsp)  \
 /* CPT mbox IDs (range 0xA00 - 0xBFF) */   \
 /* NPC mbox IDs (range 0x6000 - 0x7FFF) */ \
 /* NIX mbox IDs (range 0x8000 - 0x) */ \
+M(NIX_LF_ALLOC,0x8000, nix_lf_alloc_req, nix_lf_alloc_rsp) 
\
+M(NIX_LF_FREE, 0x8001, msg_req, msg_rsp)
 
 /* Messages initiated by AF (range 0xC00 - 0xDFF) */
 #define MBOX_UP_CGX_MESSAGES   \
@@ -162,6 +164,8 @@ MBOX_UP_CGX_MESSAGES
 
 /* Mailbox message formats */
 
+#define RVU_DEFAULT_PF_FUNC 0xFFFF
+
 /* Generic request msg used for those mbox messages which
  * don't send any data in the request.
  */
@@ -332,4 +336,45 @@ struct hwctx_disable_req {
u8 ctype;
 };
 
+/* NIX mailbox error codes
+ * Range 401 - 500.
+ */
+enum nix_af_status {
+   NIX_AF_ERR_PARAM= -401,
+   NIX_AF_ERR_AQ_FULL  = -402,
+   NIX_AF_ERR_AQ_ENQUEUE   = -403,
+   NIX_AF_ERR_AF_LF_INVALID= -404,
+   NIX_AF_ERR_AF_LF_ALLOC  = -405,
+   NIX_AF_ERR_TLX_ALLOC_FAIL   = -406,
+   NIX_AF_ERR_TLX_INVALID  = -407,
+   NIX_AF_ERR_RSS_SIZE_INVALID = -408,
+   NIX_AF_ERR_RSS_GRPS_INVALID = -409,
+   NIX_AF_ERR_FRS_INVALID  = -410,
+   NIX_AF_ERR_RX_LINK_INVALID  = -411,
+   NIX_AF_INVAL_TXSCHQ_CFG = -412,
+   NIX_AF_SMQ_FLUSH_FAILED = -413,
+   NIX_AF_ERR_LF_RESET = -414,
+};
+
+/* For NIX LF context alloc and init */
+struct nix_lf_alloc_req {
+   struct mbox_msghdr hdr;
+   int node;
+   u32 rq_cnt;   /* No of receive queues */
+   u32 sq_cnt;   /* No of send queues */
+   u32 cq_cnt;   /* No of completion queues */
+   u8  xqe_sz;
+   u16 rss_sz;
+   u8  rss_grps;
+   u16 npa_func;
+   u16 sso_func;
+   u64 rx_cfg;   /* See NIX_AF_LF(0..127)_RX_CFG */
+};
+
+struct nix_lf_alloc_rsp {
+   struct mbox_msghdr hdr;
+   u16 sqb_size;
+   u8  mac_addr[ETH_ALEN];
+};
+
 #endif /* MBOX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 0d0fb1d..d6aca2e 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -79,6 +79,16 @@ struct rvu_pfvf {
struct qmem *npa_qints_ctx;
unsigned long   *aura_bmap;
unsigned long   *pool_bmap;
+
+   /* NIX contexts */
+   struct qmem *rq_ctx;
+   struct qmem *sq_ctx;
+   struct qmem *cq_ctx;
+   struct qmem *rss_ctx;
+   struct qmem *cq_ints_ctx;
+   struct qmem *nix_qints_ctx;
+
+   u8  mac_addr[ETH_ALEN]; /* MAC address of this PF/VF */
 };
 
 struct rvu_hwinfo {
@@ -230,4 +240,9 @@ int rvu_mbox_handler_NPA_LF_FREE(struct rvu *rvu, struct 
msg_req *req,
 /* NIX APIs */
 int rvu_nix_init(struct rvu *rvu);
 void rvu_nix_freemem(struct rvu *rvu);
+int rvu_mbox_handler_NIX_LF_ALLOC(struct rvu *rvu,
+ struct nix_lf_alloc_req *req,
+ struct nix_lf_alloc_rsp *rsp);
+int rvu_mbox_handler_NIX_LF_FREE(struct rvu *rvu, struct msg_req *req,
+   

[PATCH 13/16] octeontx2-af: NIX LSO config for TSOv4/v6 offload

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Config LSO formats for TSOv4 and TSOv6 offloads. These formats
tell HW which fields in the TCP packet's headers have to be
updated while performing segmentation offload.

Also report to PF/VF drivers the LSO format indices as part of
the response to the NIX_LF_ALLOC mbox msg. These indices are
used in SQE extension headers while framing an SQE for pkt
transmission with TSO offload.
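
The first/middle-segment flag masks programmed in nix_setup_lso() below
are worth a worked example: with the usual TCP flag bit positions,
ANDing with 0xFFF2 keeps SYN, ACK etc. but clears FIN, RST and PSH
(a runnable check, independent of the patch):

#include <stdio.h>

#define TCP_FIN 0x01
#define TCP_RST 0x04
#define TCP_PSH 0x08
#define TCP_ACK 0x10

int main(void)
{
	unsigned int flags = TCP_ACK | TCP_PSH | TCP_FIN;	/* 0x19 */

	/* Same mask the patch sets for first and middle segments */
	printf("0x%02x -> 0x%02x\n", flags, flags & 0xFFF2);	/* -> 0x10 */
	return 0;
}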

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/common.h |  6 ++
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  2 +
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 95 ++
 .../net/ethernet/marvell/octeontx2/af/rvu_struct.h | 35 
 4 files changed, 138 insertions(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/common.h 
b/drivers/net/ethernet/marvell/octeontx2/af/common.h
index d183ad8..dc55e34 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/common.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/common.h
@@ -132,6 +132,12 @@ struct npa_aq_pool_res {
struct  npa_pool_s  ctx_mask;
 };
 
+/* NIX LSO format indices.
+ * As of now TSO is the only one using, so statically assigning indices.
+ */
+#define NIX_LSO_FORMAT_IDX_TSOV4   0
+#define NIX_LSO_FORMAT_IDX_TSOV6   1
+
 /* RSS info */
 #define MAX_RSS_GROUPS 8
 /* Group 0 has to be used in default pkt forwarding MCAM entries
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 5718b55..2f24e5d 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -374,6 +374,8 @@ struct nix_lf_alloc_req {
 struct nix_lf_alloc_rsp {
struct mbox_msghdr hdr;
u16 sqb_size;
+   u8  lso_tsov4_idx;
+   u8  lso_tsov6_idx;
u8  mac_addr[ETH_ALEN];
 };
 
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index 1f41c7c..401f87f 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -16,6 +16,96 @@
 #include "rvu.h"
 #include "cgx.h"
 
+static void nix_setup_lso_tso_l3(struct rvu *rvu, int blkaddr,
+u64 format, bool v4, u64 *fidx)
+{
+   struct nix_lso_format field = {0};
+
+   /* IP's Length field */
+   field.layer = NIX_TXLAYER_OL3;
+   /* In ipv4, length field is at offset 2 bytes, for ipv6 it's 4 */
+   field.offset = v4 ? 2 : 4;
+   field.sizem1 = 1; /* i.e 2 bytes */
+   field.alg = NIX_LSOALG_ADD_PAYLEN;
+   rvu_write64(rvu, blkaddr,
+   NIX_AF_LSO_FORMATX_FIELDX(format, (*fidx)++),
+   *(u64 *)&field);
+
+   /* No ID field in IPv6 header */
+   if (!v4)
+   return;
+
+   /* IP's ID field */
+   field.layer = NIX_TXLAYER_OL3;
+   field.offset = 4;
+   field.sizem1 = 1; /* i.e 2 bytes */
+   field.alg = NIX_LSOALG_ADD_SEGNUM;
+   rvu_write64(rvu, blkaddr,
+   NIX_AF_LSO_FORMATX_FIELDX(format, (*fidx)++),
+   *(u64 *)&field);
+}
+
+static void nix_setup_lso_tso_l4(struct rvu *rvu, int blkaddr,
+u64 format, u64 *fidx)
+{
+   struct nix_lso_format field = {0};
+
+   /* TCP's sequence number field */
+   field.layer = NIX_TXLAYER_OL4;
+   field.offset = 4;
+   field.sizem1 = 3; /* i.e 4 bytes */
+   field.alg = NIX_LSOALG_ADD_OFFSET;
+   rvu_write64(rvu, blkaddr,
+   NIX_AF_LSO_FORMATX_FIELDX(format, (*fidx)++),
+   *(u64 *)&field);
+
+   /* TCP's flags field */
+   field.layer = NIX_TXLAYER_OL4;
+   field.offset = 12;
+   field.sizem1 = 0; /* not needed */
+   field.alg = NIX_LSOALG_TCP_FLAGS;
+   rvu_write64(rvu, blkaddr,
+   NIX_AF_LSO_FORMATX_FIELDX(format, (*fidx)++),
+   *(u64 *)&field);
+}
+
+static void nix_setup_lso(struct rvu *rvu, int blkaddr)
+{
+   u64 cfg, idx, fidx = 0;
+
+   /* Enable LSO */
+   cfg = rvu_read64(rvu, blkaddr, NIX_AF_LSO_CFG);
+   /* For TSO, set first and middle segment flags to
+* mask out PSH, RST & FIN flags in TCP packet
+*/
+   cfg &= ~((0xFFFFULL << 32) | (0xFFFFULL << 16));
+   cfg |= (0xFFF2ULL << 32) | (0xFFF2ULL << 16);
+   rvu_write64(rvu, blkaddr, NIX_AF_LSO_CFG, cfg | BIT_ULL(63));
+
+   /* Configure format fields for TCPv4 segmentation offload */
+   idx = NIX_LSO_FORMAT_IDX_TSOV4;
+   nix_setup_lso_tso_l3(rvu, blkaddr, idx, true, &fidx);
+   nix_setup_lso_tso_l4(rvu, blkaddr, idx, &fidx);
+
+   /* Set rest of the fields to NOP */
+   for (; fidx < 8; fidx++) {
+   rvu_write64(rvu, blkaddr,
+   NIX_AF_LSO_FORMATX_FIELDX(idx, fidx), 0x0ULL);
+   }
+
+   /* Configure format fields for TCPv6 segm

[PATCH 08/16] octeontx2-af: NPA block LF initialization

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Upon receiving the NPA_LF_ALLOC mbox message, allocate memory for the
NPALF's aura, pool and qint contexts and configure the same in HW.
Enable caching of contexts into NPA NDC.

Return pool related info like stack size, number of pointers per
stack page etc. to the mbox msg sender.
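
A sketch of the request and of sizing a pool's pointer stack from the
response (struct fields follow the mbox definitions below; alloc_msg()
and the counts are illustrative):

struct npa_lf_alloc_req *req = alloc_msg(mbox);	/* hypothetical */
u32 stack_pages, stack_bytes;

req->aura_sz = NPA_AURA_SZ_128;	/* NPA_AURA_COUNT(1) == 128 auras */
req->nr_pools = 32;

/* Once the response (rsp) arrives, size each pool's pointer stack: */
stack_pages = DIV_ROUND_UP(num_ptrs, rsp->stack_pg_ptrs);
stack_bytes = stack_pages * rsp->stack_pg_bytes;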

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/common.h |  22 
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  32 +
 drivers/net/ethernet/marvell/octeontx2/af/rvu.c|  13 ++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|  13 +-
 .../net/ethernet/marvell/octeontx2/af/rvu_npa.c| 137 -
 5 files changed, 214 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/common.h 
b/drivers/net/ethernet/marvell/octeontx2/af/common.h
index ec493ba..c64d241 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/common.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/common.h
@@ -96,4 +96,26 @@ struct admin_queue {
spinlock_t  lock; /* Serialize inst enqueue from PFs */
 };
 
+/* NPA aura count */
+enum npa_aura_sz {
+   NPA_AURA_SZ_0,
+   NPA_AURA_SZ_128,
+   NPA_AURA_SZ_256,
+   NPA_AURA_SZ_512,
+   NPA_AURA_SZ_1K,
+   NPA_AURA_SZ_2K,
+   NPA_AURA_SZ_4K,
+   NPA_AURA_SZ_8K,
+   NPA_AURA_SZ_16K,
+   NPA_AURA_SZ_32K,
+   NPA_AURA_SZ_64K,
+   NPA_AURA_SZ_128K,
+   NPA_AURA_SZ_256K,
+   NPA_AURA_SZ_512K,
+   NPA_AURA_SZ_1M,
+   NPA_AURA_SZ_MAX,
+};
+
+#define NPA_AURA_COUNT(x)  (1ULL << ((x) + 6))
+
 #endif /* COMMON_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index be1cb16..8135339 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -139,6 +139,8 @@ M(CGX_GET_LINKINFO, 0x209, msg_req, cgx_link_info_msg)  
\
 M(CGX_INTLBK_ENABLE,   0x20A, msg_req, msg_rsp)\
 M(CGX_INTLBK_DISABLE,  0x20B, msg_req, msg_rsp)\
 /* NPA mbox IDs (range 0x400 - 0x5FF) */   \
+M(NPA_LF_ALLOC,0x400, npa_lf_alloc_req, npa_lf_alloc_rsp)  
\
+M(NPA_LF_FREE, 0x401, msg_req, msg_rsp)\
 /* SSO/SSOW mbox IDs (range 0x600 - 0x7FF) */  \
 /* TIM mbox IDs (range 0x800 - 0x9FF) */   \
 /* CPT mbox IDs (range 0xA00 - 0xBFF) */   \
@@ -258,4 +260,34 @@ struct cgx_link_info_msg {
struct mbox_msghdr hdr;
struct cgx_link_user_info link_info;
 };
+
+/* NPA mbox message formats */
+
+/* NPA mailbox error codes
+ * Range 301 - 400.
+ */
+enum npa_af_status {
+   NPA_AF_ERR_PARAM= -301,
+   NPA_AF_ERR_AQ_FULL  = -302,
+   NPA_AF_ERR_AQ_ENQUEUE   = -303,
+   NPA_AF_ERR_AF_LF_INVALID= -304,
+   NPA_AF_ERR_AF_LF_ALLOC  = -305,
+   NPA_AF_ERR_LF_RESET = -306,
+};
+
+/* For NPA LF context alloc and init */
+struct npa_lf_alloc_req {
+   struct mbox_msghdr hdr;
+   int node;
+   int aura_sz;  /* No of auras */
+   u32 nr_pools; /* No of pools */
+};
+
+struct npa_lf_alloc_rsp {
+   struct mbox_msghdr hdr;
+   u32 stack_pg_ptrs;  /* No of ptrs per stack page */
+   u32 stack_pg_bytes; /* Size of stack page */
+   u16 qints; /* NPA_AF_CONST::QINTS */
+};
+
 #endif /* MBOX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
index 14255f2..5d4917c 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
@@ -361,6 +361,19 @@ static void rvu_check_block_implemented(struct rvu *rvu)
}
 }
 
+int rvu_lf_reset(struct rvu *rvu, struct rvu_block *block, int lf)
+{
+   int err;
+
+   if (!block->implemented)
+   return 0;
+
+   rvu_write64(rvu, block->addr, block->lfreset_reg, lf | BIT_ULL(12));
+   err = rvu_poll_reg(rvu, block->addr, block->lfreset_reg, BIT_ULL(12),
+  true);
+   return err;
+}
+
 static void rvu_block_reset(struct rvu *rvu, int blkaddr, u64 rst_reg)
 {
struct rvu_block *block = &rvu->hw->block[blkaddr];
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 999dc2c..b32d1f1 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -72,6 +72,11 @@ struct rvu_pfvf {
struct rsrc_bmap msix;  /* Bitmap for MSIX vector alloc */
 #define MSIX_BLKLF(blkaddr, lf) (((blkaddr) << 8) | ((lf) & 0xFF))
u16  *msix_lfmap; /* Vector to block LF mapping */
+
+   /* NPA contexts */
+   struct qmem *aura_ctx;
+   struct qmem *pool_ctx;
+   struct qmem *npa_qints_ctx;
 };
 
 st

[PATCH 14/16] octeontx2-af: Alloc bitmaps for NIX Tx scheduler queues

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Allocate bitmaps and memory for PF/VF mapping info needed for
maintaining the NIX transmit scheduler queues. PF/VF drivers
will request alloc, free etc. of Tx schedulers via mailbox.

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/common.h | 11 +++
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h| 16 
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 88 +-
 3 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/common.h 
b/drivers/net/ethernet/marvell/octeontx2/af/common.h
index dc55e34..28eb691 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/common.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/common.h
@@ -132,6 +132,17 @@ struct npa_aq_pool_res {
struct  npa_pool_s  ctx_mask;
 };
 
+/* NIX Transmit schedulers */
+enum nix_scheduler {
+   NIX_TXSCH_LVL_SMQ = 0x0,
+   NIX_TXSCH_LVL_MDQ = 0x0,
+   NIX_TXSCH_LVL_TL4 = 0x1,
+   NIX_TXSCH_LVL_TL3 = 0x2,
+   NIX_TXSCH_LVL_TL2 = 0x3,
+   NIX_TXSCH_LVL_TL1 = 0x4,
+   NIX_TXSCH_LVL_CNT = 0x5,
+};
+
 /* NIX LSO format indices.
  * As of now TSO is the only one using, so statically assigning indices.
  */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index d6aca2e..135f263 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -91,12 +91,28 @@ struct rvu_pfvf {
u8  mac_addr[ETH_ALEN]; /* MAC address of this PF/VF */
 };
 
+struct nix_txsch {
+   struct rsrc_bmap schq;
+   u8   lvl;
+   u16  *pfvf_map;
+};
+
+struct nix_hw {
+   struct nix_txsch txsch[NIX_TXSCH_LVL_CNT]; /* Tx schedulers */
+};
+
 struct rvu_hwinfo {
u8  total_pfs;   /* MAX RVU PFs HW supports */
u16 total_vfs;   /* Max RVU VFs HW supports */
u16 max_vfs_per_pf; /* Max VFs that can be attached to a PF */
+   u8  cgx;
+   u8  lmac_per_cgx;
+   u8  cgx_links;
+   u8  lbk_links;
+   u8  sdp_links;
 
struct rvu_block block[BLK_COUNT]; /* Block info */
+   struct nix_hw*nix0;
 };
 
 struct rvu {
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index 401f87f..4d4cf5a 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -346,6 +346,60 @@ int rvu_mbox_handler_NIX_LF_FREE(struct rvu *rvu, struct 
msg_req *req,
return 0;
 }
 
+static inline struct nix_hw *get_nix_hw(struct rvu_hwinfo *hw, int blkaddr)
+{
+   if (blkaddr == BLKADDR_NIX0 && hw->nix0)
+   return hw->nix0;
+
+   return NULL;
+}
+
+static int nix_setup_txschq(struct rvu *rvu, struct nix_hw *nix_hw, int 
blkaddr)
+{
+   struct nix_txsch *txsch;
+   u64 cfg, reg;
+   int err, lvl;
+
+   /* Get scheduler queue count of each type and alloc
+* bitmap for each for alloc/free/attach operations.
+*/
+   for (lvl = 0; lvl < NIX_TXSCH_LVL_CNT; lvl++) {
+   txsch = &nix_hw->txsch[lvl];
+   txsch->lvl = lvl;
+   switch (lvl) {
+   case NIX_TXSCH_LVL_SMQ:
+   reg = NIX_AF_MDQ_CONST;
+   break;
+   case NIX_TXSCH_LVL_TL4:
+   reg = NIX_AF_TL4_CONST;
+   break;
+   case NIX_TXSCH_LVL_TL3:
+   reg = NIX_AF_TL3_CONST;
+   break;
+   case NIX_TXSCH_LVL_TL2:
+   reg = NIX_AF_TL2_CONST;
+   break;
+   case NIX_TXSCH_LVL_TL1:
+   reg = NIX_AF_TL1_CONST;
+   break;
+   }
+   cfg = rvu_read64(rvu, blkaddr, reg);
+   txsch->schq.max = cfg & 0xFFFF;
+   err = rvu_alloc_bitmap(&txsch->schq);
+   if (err)
+   return err;
+
+   /* Allocate memory for scheduler queues to
+* PF/VF pcifunc mapping info.
+*/
+   txsch->pfvf_map = devm_kcalloc(rvu->dev, txsch->schq.max,
+  sizeof(u16), GFP_KERNEL);
+   if (!txsch->pfvf_map)
+   return -ENOMEM;
+   }
+   return 0;
+}
+
 static int nix_calibrate_x2p(struct rvu *rvu, int blkaddr)
 {
int idx, err;
@@ -431,6 +485,7 @@ int rvu_nix_init(struct rvu *rvu)
struct rvu_hwinfo *hw = rvu->hw;
struct rvu_block *block;
int blkaddr, err;
+   u64 cfg;
 
blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
if (blkaddr < 0)
@@ -442,6 +497,14 @@ int rvu_nix_init(struct rvu *rvu)
if (err)
return err;
 
+   /* Set num of links of 

[PATCH 16/16] octeontx2-af: Support for disabling NIX RQ/SQ/CQ contexts

2018-10-16 Thread sunil . kovvuri
From: Geetha sowjanya 

This patch adds support for an RVU PF/VF to disable all RQ/SQ/CQ
contexts of a NIX LF via mbox. This will be used by PF/VF drivers
upon teardown or while freeing up HW resources.

A HW context which is not INIT'ed cannot be modified, and an
RVU PF/VF driver may or may not INIT all the RQ/SQ/CQ contexts.
So a bitmap is introduced to keep track of enabled NIX RQ/SQ/CQ
contexts, so that only enabled HW contexts are disabled upon LF
teardown.
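
The teardown side then only has to walk the set bits; a minimal sketch
of that idea (disable_one_ctx() is a hypothetical stand-in for an AQ
WRITE with the ena bit cleared):

#include <linux/bitops.h>

static void disable_enabled_ctxs(unsigned long *bmap, int max_qidx)
{
	unsigned int qidx;

	/* Visit only contexts whose bit was set at INIT/WRITE time */
	for_each_set_bit(qidx, bmap, max_qidx)
		disable_one_ctx(qidx);
}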

Signed-off-by: Geetha sowjanya 
Signed-off-by: Stanislaw Kardach 
Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |   3 +-
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|   6 +
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 123 -
 3 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 3b44693..c339024 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -150,7 +150,8 @@ M(NPA_HWCTX_DISABLE,0x403, hwctx_disable_req, 
msg_rsp)  \
 /* NIX mbox IDs (range 0x8000 - 0x) */ \
 M(NIX_LF_ALLOC,0x8000, nix_lf_alloc_req, nix_lf_alloc_rsp) 
\
 M(NIX_LF_FREE, 0x8001, msg_req, msg_rsp)   \
-M(NIX_AQ_ENQ,  0x8002, nix_aq_enq_req, nix_aq_enq_rsp)
+M(NIX_AQ_ENQ,  0x8002, nix_aq_enq_req, nix_aq_enq_rsp) \
+M(NIX_HWCTX_DISABLE,   0x8003, hwctx_disable_req, msg_rsp)
 
 /* Messages initiated by AF (range 0xC00 - 0xDFF) */
 #define MBOX_UP_CGX_MESSAGES   \
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index caf652d..b48b5af 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -87,6 +87,9 @@ struct rvu_pfvf {
struct qmem *rss_ctx;
struct qmem *cq_ints_ctx;
struct qmem *nix_qints_ctx;
+   unsigned long   *sq_bmap;
+   unsigned long   *rq_bmap;
+   unsigned long   *cq_bmap;
 
u8  mac_addr[ETH_ALEN]; /* MAC address of this PF/VF */
 };
@@ -264,4 +267,7 @@ int rvu_mbox_handler_NIX_LF_FREE(struct rvu *rvu, struct 
msg_req *req,
 int rvu_mbox_handler_NIX_AQ_ENQ(struct rvu *rvu,
struct nix_aq_enq_req *req,
struct nix_aq_enq_rsp *rsp);
+int rvu_mbox_handler_NIX_HWCTX_DISABLE(struct rvu *rvu,
+  struct hwctx_disable_req *req,
+  struct msg_rsp *rsp);
 #endif /* RVU_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index fdc4d7b..214ca2c 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -140,6 +140,9 @@ static void nix_setup_lso(struct rvu *rvu, int blkaddr)
 
 static void nix_ctx_free(struct rvu *rvu, struct rvu_pfvf *pfvf)
 {
+   kfree(pfvf->rq_bmap);
+   kfree(pfvf->sq_bmap);
+   kfree(pfvf->cq_bmap);
if (pfvf->rq_ctx)
qmem_free(rvu->dev, pfvf->rq_ctx);
if (pfvf->sq_ctx)
@@ -153,6 +156,9 @@ static void nix_ctx_free(struct rvu *rvu, struct rvu_pfvf 
*pfvf)
if (pfvf->cq_ints_ctx)
qmem_free(rvu->dev, pfvf->cq_ints_ctx);
 
+   pfvf->rq_bmap = NULL;
+   pfvf->cq_bmap = NULL;
+   pfvf->sq_bmap = NULL;
pfvf->rq_ctx = NULL;
pfvf->sq_ctx = NULL;
pfvf->cq_ctx = NULL;
@@ -239,6 +245,7 @@ static int rvu_nix_aq_enq_inst(struct rvu *rvu, struct 
nix_aq_enq_req *req,
struct admin_queue *aq;
struct rvu_pfvf *pfvf;
void *ctx, *mask;
+   bool ena;
u64 cfg;
 
pfvf = rvu_get_pfvf(rvu, pcifunc);
@@ -354,9 +361,49 @@ static int rvu_nix_aq_enq_inst(struct rvu *rvu, struct 
nix_aq_enq_req *req,
return rc;
}
 
+   /* Set RQ/SQ/CQ bitmap if respective queue hw context is enabled */
+   if (req->op == NIX_AQ_INSTOP_INIT) {
+   if (req->ctype == NIX_AQ_CTYPE_RQ && req->rq.ena)
+   __set_bit(req->qidx, pfvf->rq_bmap);
+   if (req->ctype == NIX_AQ_CTYPE_SQ && req->sq.ena)
+   __set_bit(req->qidx, pfvf->sq_bmap);
+   if (req->ctype == NIX_AQ_CTYPE_CQ && req->cq.ena)
+   __set_bit(req->qidx, pfvf->cq_bmap);
+   }
+
+   if (req->op == NIX_AQ_INSTOP_WRITE) {
+   if (req->ctype == NIX_AQ_CTYPE_RQ) {
+   ena = (req->rq.ena & req->rq_mask.ena) |
+   (test_bit(req->qidx, pfvf->rq_bmap) &
+   ~req->rq_mask.ena);
+   if (ena)
+

[PATCH 15/16] octeontx2-af: NIX AQ instruction enqueue support

2018-10-16 Thread sunil . kovvuri
From: Sunil Goutham 

Add support for an RVU PF/VF to submit instructions to the NIX AQ
via mbox. Instructions can be to init/write/read RQ/SQ/CQ/RSS
contexts. In case of a read, the context will be returned as part
of the response to the mbox msg received.
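
For instance, reading back an RQ context could look roughly like this
(struct and ctype names are from the series; NIX_AQ_INSTOP_READ is
assumed from the init/write/read op set, and alloc_msg() is a
hypothetical mbox helper):

struct nix_aq_enq_req *req = alloc_msg(mbox);	/* hypothetical */

req->qidx = 0;			/* which RQ to read */
req->ctype = NIX_AQ_CTYPE_RQ;
req->op = NIX_AQ_INSTOP_READ;

/* On success, the rq member of struct nix_aq_enq_rsp holds the
 * current context, e.g. rsp->rq.ena.
 */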

Signed-off-by: Sunil Goutham 
---
 drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  36 +-
 drivers/net/ethernet/marvell/octeontx2/af/rvu.h|   3 +
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 232 +++-
 .../net/ethernet/marvell/octeontx2/af/rvu_struct.h | 418 +
 4 files changed, 680 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h 
b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 2f24e5d..3b44693 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -149,7 +149,8 @@ M(NPA_HWCTX_DISABLE,0x403, hwctx_disable_req, 
msg_rsp)  \
 /* NPC mbox IDs (range 0x6000 - 0x7FFF) */ \
 /* NIX mbox IDs (range 0x8000 - 0x) */ \
 M(NIX_LF_ALLOC,0x8000, nix_lf_alloc_req, nix_lf_alloc_rsp) 
\
-M(NIX_LF_FREE, 0x8001, msg_req, msg_rsp)
+M(NIX_LF_FREE, 0x8001, msg_req, msg_rsp)   \
+M(NIX_AQ_ENQ,  0x8002, nix_aq_enq_req, nix_aq_enq_rsp)
 
 /* Messages initiated by AF (range 0xC00 - 0xDFF) */
 #define MBOX_UP_CGX_MESSAGES   \
@@ -379,4 +380,37 @@ struct nix_lf_alloc_rsp {
u8  mac_addr[ETH_ALEN];
 };
 
+/* NIX AQ enqueue msg */
+struct nix_aq_enq_req {
+   struct mbox_msghdr hdr;
+   u32  qidx;
+   u8 ctype;
+   u8 op;
+   union {
+   struct nix_rq_ctx_s rq;
+   struct nix_sq_ctx_s sq;
+   struct nix_cq_ctx_s cq;
+   struct nix_rsse_s   rss;
+   struct nix_rx_mce_s mce;
+   };
+   union {
+   struct nix_rq_ctx_s rq_mask;
+   struct nix_sq_ctx_s sq_mask;
+   struct nix_cq_ctx_s cq_mask;
+   struct nix_rsse_s   rss_mask;
+   struct nix_rx_mce_s mce_mask;
+   };
+};
+
+struct nix_aq_enq_rsp {
+   struct mbox_msghdr hdr;
+   union {
+   struct nix_rq_ctx_s rq;
+   struct nix_sq_ctx_s sq;
+   struct nix_cq_ctx_s cq;
+   struct nix_rsse_s   rss;
+   struct nix_rx_mce_s mce;
+   };
+};
+
 #endif /* MBOX_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 135f263..caf652d 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -261,4 +261,7 @@ int rvu_mbox_handler_NIX_LF_ALLOC(struct rvu *rvu,
  struct nix_lf_alloc_rsp *rsp);
 int rvu_mbox_handler_NIX_LF_FREE(struct rvu *rvu, struct msg_req *req,
 struct msg_rsp *rsp);
+int rvu_mbox_handler_NIX_AQ_ENQ(struct rvu *rvu,
+   struct nix_aq_enq_req *req,
+   struct nix_aq_enq_rsp *rsp);
 #endif /* RVU_H */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c 
b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index 4d4cf5a..fdc4d7b 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -16,6 +16,38 @@
 #include "rvu.h"
 #include "cgx.h"
 
+static inline struct nix_hw *get_nix_hw(struct rvu_hwinfo *hw, int blkaddr)
+{
+   if (blkaddr == BLKADDR_NIX0 && hw->nix0)
+   return hw->nix0;
+
+   return NULL;
+}
+
+static bool is_valid_txschq(struct rvu *rvu, int blkaddr,
+   int lvl, u16 pcifunc, u16 schq)
+{
+   struct nix_txsch *txsch;
+   struct nix_hw *nix_hw;
+
+   nix_hw = get_nix_hw(rvu->hw, blkaddr);
+   if (!nix_hw)
+   return false;
+
+   txsch = &nix_hw->txsch[lvl];
+   /* Check out of bounds */
+   if (schq >= txsch->schq.max)
+   return false;
+
+   spin_lock(&rvu->rsrc_lock);
+   if (txsch->pfvf_map[schq] != pcifunc) {
+   spin_unlock(&rvu->rsrc_lock);
+   return false;
+   }
+   spin_unlock(&rvu->rsrc_lock);
+   return true;
+}
+
 static void nix_setup_lso_tso_l3(struct rvu *rvu, int blkaddr,
 u64 format, bool v4, u64 *fidx)
 {
@@ -159,6 +191,198 @@ static int nixlf_rss_ctx_init(struct rvu *rvu, int 
blkaddr,
return 0;
 }
 
+static int nix_aq_enqueue_wait(struct rvu *rvu, struct rvu_block *block,
+  struct nix_aq_inst_s *inst)
+{
+   struct admin_queue *aq = block->aq;
+   struct nix_aq_res_s *result;
+   int timeout = 1000;
+   u64 reg, head;
+
+   result = (struct nix_aq_res_s *)aq->res->base;
+
+   /* Get current h

Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-16 Thread Eric Dumazet
On Mon, Oct 15, 2018 at 11:30 PM Andre Tomt  wrote:
>
> On 15.10.2018 17:41, Eric Dumazet wrote:
> > On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
> >> Something is changed between 4.17.12 and 4.18, after bisecting the problem 
> >> I
> >> got the following first bad commit:
> >>
> >> commit 88078d98d1bb085d72af8437707279e203524fa5
> >> Author: Eric Dumazet 
> >> Date:   Wed Apr 18 11:43:15 2018 -0700
> >>
> >>  net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
> >>
> >>  After working on IP defragmentation lately, I found that some large
> >>  packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
> >>  zero paddings on the last (small) fragment.
> >>
> >>  While removing the padding with pskb_trim_rcsum(), we set 
> >> skb->ip_summed
> >>  to CHECKSUM_NONE, forcing a full csum validation, even if all prior
> >>  fragments had CHECKSUM_COMPLETE set.
> >>
> >>  We can instead compute the checksum of the part we are trimming,
> >>  usually smaller than the part we keep.
> >>
> >>  Signed-off-by: Eric Dumazet 
> >>  Signed-off-by: David S. Miller 
> >>
> >
> > Thanks for bisecting !
> >
> > This commit is known to expose some NIC/driver bugs.
> >
> > Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f
> > ("net: sungem: fix rx checksum support")  for one driver needing a fix.
> >
> > I assume SKY2_HW_NEW_LE is not set on your NIC ?
> >
>
> I've seen similar on several systems with mlx4 cards when using 4.18.x -
> that is hw csum failure followed by some backtrace.
>
> Only seems to happen on systems dealing with quite a bit of UDP.
>

Strange, because mlx4 on IPv6+UDP should not use CHECKSUM_COMPLETE,
but CHECKSUM_UNNECESSARY.

It would be nice to track this a bit further, maybe by providing the
full packet content.
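
For context, the bisected commit's trick is plain ones'-complement
arithmetic: the checksum of the kept part is the full checksum minus the
checksum of the trimmed tail. A minimal userspace sketch of the
subtraction step, assuming the tail starts on a 2-byte boundary (the
kernel's csum helpers also handle odd offsets):

#include <stdint.h>

/* Fold a 32-bit ones'-complement accumulator to 16 bits */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* csum(kept) == csum(full) "minus" csum(tail): subtraction in
 * ones'-complement is addition of the complement.
 */
static uint16_t csum_sub(uint16_t full, uint16_t tail)
{
	return csum_fold((uint32_t)full + (uint16_t)~tail);
}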

> Example from 4.18.10:
> > [635607.740574] p0xe0: hw csum failure
> > [635607.740598] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
> > [635607.740599] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 
> > 05/02/2017
> > [635607.740599] Call Trace:
> > [635607.740602]  
> > [635607.740611]  dump_stack+0x5c/0x7b
> > [635607.740617]  __skb_gro_checksum_complete+0x9a/0xa0
> > [635607.740621]  udp6_gro_receive+0x211/0x290
> > [635607.740624]  ipv6_gro_receive+0x1a8/0x390
> > [635607.740627]  dev_gro_receive+0x33e/0x550
> > [635607.740628]  napi_gro_frags+0xa2/0x210
> > [635607.740635]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
> > [635607.740648]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
> > [635607.740654]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
> > [635607.740657]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
> > [635607.740658]  net_rx_action+0xe0/0x2e0
> > [635607.740662]  __do_softirq+0xd8/0x2e5
> > [635607.740666]  irq_exit+0xb4/0xc0
> > [635607.740667]  do_IRQ+0x85/0xd0
> > [635607.740670]  common_interrupt+0xf/0xf
> > [635607.740671]  
> > [635607.740675] RIP: 0010:cpuidle_enter_state+0xb4/0x2a0
> > [635607.740675] Code: 31 ff e8 df a6 ba ff 45 84 f6 74 17 9c 58 0f 1f 44 00 
> > 00 f6 c4 02 0f 85 d8 01 00 00 31 ff e8 13 81 bf ff fb 66 0f 1f 44 00 00 
> > <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
> > [635607.740701] RSP: 0018:a5c206353ea8 EFLAGS: 0246 ORIG_RAX: 
> > ffd9
> > [635607.740703] RAX: 8d72ffd20f00 RBX: 00024214f597c5b0 RCX: 
> > 001f
> > [635607.740703] RDX: 00024214f597c5b0 RSI: 00020780 RDI: 
> > 
> > [635607.740704] RBP: 0004 R08: 002542bfbefa99fa R09: 
> > 
> > [635607.740705] R10: a5c206353e88 R11: 00c5 R12: 
> > af0aaf78
> > [635607.740706] R13: 8d72ffd297d8 R14:  R15: 
> > 00024214f58c2ed5
> > [635607.740709]  ? cpuidle_enter_state+0x91/0x2a0
> > [635607.740712]  do_idle+0x1d0/0x240
> > [635607.740715]  cpu_startup_entry+0x5f/0x70
> > [635607.740719]  start_secondary+0x185/0x1a0
> > [635607.740722]  secondary_startup_64+0xa5/0xb0
> > [635607.740731] p0xe0: hw csum failure
> > [635607.740745] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
> > [635607.740746] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 
> > 05/02/2017
> > [635607.740746] Call Trace:
> > [635607.740747]  
> > [635607.740750]  dump_stack+0x5c/0x7b
> > [635607.740755]  __skb_checksum_complete+0xb8/0xd0
> > [635607.740760]  __udp6_lib_rcv+0xa6b/0xa70
> > [635607.740767]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
> > [635607.740770]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
> > [635607.740774]  ip6_input_finish+0xc0/0x460
> > [635607.740776]  ip6_input+0x2b/0x90
> > [635607.740778]  ? ip6_rcv_finish+0x110/0x110
> > [635607.740780]  ipv6_rcv+0x2cd/0x4b0
> > [635607.740783]  ? udp6_lib_lookup_skb+0x59/0x80
> > [635607.740785]  __netif_receive_skb_core+0x455/0xb30
> > [635607.740788]  ? ipv6_gro_receive+0x1a8/0x390
> > [635607.740790]  ? netif_receive_skb_internal+0x24/0xb0
> > [635607.740792]  netif_receive_skb_internal+0x24/0xb0
> > [635607.740793]  napi

[PATCH bpf-next] bpf, tls: add tls header to tools infrastructure

2018-10-16 Thread Daniel Borkmann
Andrey reported a build error for the BPF kselftest suite when compiled on
a machine which does not have tls related header bits installed natively:

  test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory
   #include <linux/tls.h>
 ^
  compilation terminated.

Fix it by adding the header to the tools include infrastructure and
adding definitions such as SOL_TLS that could potentially be missing.
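
For reference, these are the bits a kTLS user needs; a minimal sketch of
enabling TLS TX on a connected TCP socket (error handling trimmed, and
the crypto material must really come from a TLS handshake):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#ifndef TCP_ULP
# define TCP_ULP 31
#endif
#ifndef SOL_TLS
# define SOL_TLS 282
#endif

static int enable_ktls_tx(int fd)
{
	struct tls12_crypto_info_aes_gcm_128 ci;

	if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")))
		return -1;

	memset(&ci, 0, sizeof(ci));	/* zeroed keys: sketch only! */
	ci.info.version = TLS_1_2_VERSION;
	ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;

	return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}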

Fixes: e9dd904708c4 ("bpf: add tls support for testing in test_sockmap")
Reported-by: Andrey Ignatov 
Signed-off-by: Daniel Borkmann 
---
 tools/include/uapi/linux/tls.h | 78 ++
 tools/testing/selftests/bpf/test_sockmap.c | 13 +++--
 2 files changed, 86 insertions(+), 5 deletions(-)
 create mode 100644 tools/include/uapi/linux/tls.h

diff --git a/tools/include/uapi/linux/tls.h b/tools/include/uapi/linux/tls.h
new file mode 100644
index 000..ff02287
--- /dev/null
+++ b/tools/include/uapi/linux/tls.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR 
Linux-OpenIB) */
+/*
+ * Copyright (c) 2016-2017, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _UAPI_LINUX_TLS_H
+#define _UAPI_LINUX_TLS_H
+
+#include <linux/types.h>
+
+/* TLS socket options */
+#define TLS_TX 1   /* Set transmit parameters */
+#define TLS_RX 2   /* Set receive parameters */
+
+/* Supported versions */
+#define TLS_VERSION_MINOR(ver) ((ver) & 0xFF)
+#define TLS_VERSION_MAJOR(ver) (((ver) >> 8) & 0xFF)
+
+#define TLS_VERSION_NUMBER(id) ((((id##_VERSION_MAJOR) & 0xFF) << 8) | \
+((id##_VERSION_MINOR) & 0xFF))
+
+#define TLS_1_2_VERSION_MAJOR  0x3
+#define TLS_1_2_VERSION_MINOR  0x3
+#define TLS_1_2_VERSIONTLS_VERSION_NUMBER(TLS_1_2)
+
+/* Supported ciphers */
+#define TLS_CIPHER_AES_GCM_128 51
+#define TLS_CIPHER_AES_GCM_128_IV_SIZE 8
+#define TLS_CIPHER_AES_GCM_128_KEY_SIZE16
+#define TLS_CIPHER_AES_GCM_128_SALT_SIZE   4
+#define TLS_CIPHER_AES_GCM_128_TAG_SIZE16
+#define TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE8
+
+#define TLS_SET_RECORD_TYPE1
+#define TLS_GET_RECORD_TYPE2
+
+struct tls_crypto_info {
+   __u16 version;
+   __u16 cipher_type;
+};
+
+struct tls12_crypto_info_aes_gcm_128 {
+   struct tls_crypto_info info;
+   unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
+   unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
+   unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
+   unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
+};
+
+#endif /* _UAPI_LINUX_TLS_H */
diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 10a5fa8..7cb69ce 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/tls.h>
 #include 
 #include 
 
@@ -43,6 +44,13 @@
 int running;
 static void running_handler(int a);
 
+#ifndef TCP_ULP
+# define TCP_ULP 31
+#endif
+#ifndef SOL_TLS
+# define SOL_TLS 282
+#endif
+
 /* randomly selected ports for testing on lo */
 #define S1_PORT 1
 #define S2_PORT 10001
@@ -114,11 +122,6 @@ static void usage(char *argv[])
printf("\n");
 }
 
-#define TCP_ULP 31
-#define TLS_TX 1
-#define TLS_RX 2
-#include <linux/tls.h>
-
 char *sock_to_string(int s)
 {
if (s == c1)
-- 
2.9.5
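
For readers who haven't used these definitions before, a minimal sketch of
how a test can enable kTLS with them (the crypto material below is dummy,
and enable_ktls_tx() is a made-up name for illustration):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>
/* plus the TCP_ULP/SOL_TLS fallback defines above if the headers lack them */

static int enable_ktls_tx(int fd)
{
        struct tls12_crypto_info_aes_gcm_128 ci;

        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        /* iv/key/salt/rec_seq stay all-zero; fine for loopback tests */

        /* attach the "tls" ULP first, then program the TX crypto state */
        if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")))
                return -1;
        return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}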



Re: [PATCH bpf-next v2 1/8] tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup

2018-10-16 Thread Eric Dumazet



On 10/12/2018 05:45 PM, Daniel Borkmann wrote:
> Whenever the ULP data on the socket is mangled, enforce that the
> caller has the socket lock held as otherwise things may race with
> initialization and cleanup callbacks from ulp ops as both would
> mangle internal socket state.
> 
> Joint work with John.
> 
> Signed-off-by: Daniel Borkmann 
> Signed-off-by: John Fastabend 
> ---
>  net/ipv4/tcp_ulp.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
> index a5995bb..34e9635 100644
> --- a/net/ipv4/tcp_ulp.c
> +++ b/net/ipv4/tcp_ulp.c
> @@ -123,6 +123,8 @@ void tcp_cleanup_ulp(struct sock *sk)
>  {
>   struct inet_connection_sock *icsk = inet_csk(sk);
>  
> + sock_owned_by_me(sk);
> +
>   if (!icsk->icsk_ulp_ops)
>   return;

Ahem... inet_csk_prepare_forced_close() releases the socket lock,
and tcp_done(newsk); is called after inet_csk_prepare_forced_close() 


syzkaller got the following trace



TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  
Check SNMP counters.
tmpfs: Bad mount option 
s\xA8\xFE\x9E\x92\xE9K\xD7:\x85\x87$z\x94\xFB3\xBF\xE4\x8E\x88\xE2\xF0%\x92\xF8\xE5\xC3lh
WARNING: CPU: 0 PID: 12625 at include/net/sock.h:1539 sock_owned_by_me 
include/net/sock.h:1539 [inline]
WARNING: CPU: 0 PID: 12625 at include/net/sock.h:1539 
tcp_cleanup_ulp+0x1ad/0x200 net/ipv4/tcp_ulp.c:102
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 12625 Comm: syz-executor3 Not tainted 4.19.0-rc8-next-20181016+ #95
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x244/0x39d lib/dump_stack.c:113
 panic+0x2ad/0x55c kernel/panic.c:188
 __warn.cold.8+0x20/0x45 kernel/panic.c:540
 report_bug+0x254/0x2d0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:178 [inline]
 do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
 do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:969
RIP: 0010:sock_owned_by_me include/net/sock.h:1539 [inline]
RIP: 0010:tcp_cleanup_ulp+0x1ad/0x200 net/ipv4/tcp_ulp.c:102
Code: 83 c0 03 38 d0 7c 04 84 d2 75 61 44 8b 25 cb 4e df 02 31 ff 44 89 e6 e8 
51 3d ed fa 45 85 e4 0f 84 91 fe ff ff e8 33 3c ed fa <0f> 0b e9 85 fe ff ff 4c 
89 ef e8 34 84 30 fb e9 9f fe ff ff 4c 89
RSP: 0018:8801dae06860 EFLAGS: 00010206
RAX: 8801bf202040 RBX: 8801918501c0 RCX: 8690e6ff
RDX: 0100 RSI: 8690e70d RDI: 0005
RBP: 8801dae06880 R08: 8801bf202040 R09: 0002
R10:  R11: 8801bf202040 R12: 0001
R13:  R14: 0003 R15: 8801dae069a0
 tcp_v4_destroy_sock+0x15c/0x980 net/ipv4/tcp_ipv4.c:1980
 tcp_v6_destroy_sock+0x15/0x20 net/ipv6/tcp_ipv6.c:1762
 inet_csk_destroy_sock+0x19f/0x440 net/ipv4/inet_connection_sock.c:838
 tcp_done+0x272/0x310 net/ipv4/tcp.c:3760
 tcp_v6_syn_recv_sock+0x1f21/0x25f0 net/ipv6/tcp_ipv6.c:1236
 tcp_get_cookie_sock+0x10e/0x580 net/ipv4/syncookies.c:213
 cookie_v6_check+0x1830/0x27d0 net/ipv6/syncookies.c:257
 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1028 [inline]
 tcp_v6_do_rcv+0x10ea/0x13c0 net/ipv6/tcp_ipv6.c:1336
 tcp_v6_rcv+0x34e0/0x3ab0 net/ipv6/tcp_ipv6.c:1545
 ip6_input_finish+0x3fc/0x1aa0 net/ipv6/ip6_input.c:384
 NF_HOOK include/linux/netfilter.h:289 [inline]
 ip6_input+0xe4/0x600 net/ipv6/ip6_input.c:427
 dst_input include/net/dst.h:450 [inline]
 ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
 NF_HOOK include/linux/netfilter.h:289 [inline]
 ipv6_rcv+0x110/0x630 net/ipv6/ip6_input.c:272
 __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4930
 __netif_receive_skb+0x27/0x1e0 net/core/dev.c:5040
 process_backlog+0x24e/0x7a0 net/core/dev.c:5844
 napi_poll net/core/dev.c:6264 [inline]
 net_rx_action+0x7fa/0x19b0 net/core/dev.c:6330
 __do_softirq+0x308/0xb7e kernel/softirq.c:292
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1023
 
 do_softirq.part.14+0x126/0x160 kernel/softirq.c:337
 do_softirq kernel/softirq.c:329 [inline]
 __local_bh_enable_ip+0x21d/0x260 kernel/softirq.c:189
 local_bh_enable include/linux/bottom_half.h:32 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:696 [inline]
 ip6_finish_output2+0xce4/0x27a0 net/ipv6/ip6_output.c:121
 ip6_finish_output+0x468/0xc60 net/ipv6/ip6_output.c:154
 NF_HOOK_COND include/linux/netfilter.h:278 [inline]
 ip6_output+0x232/0x9d0 net/ipv6/ip6_output.c:171
 dst_output include/net/dst.h:444 [inline]
 NF_HOOK include/linux/netfilter.h:289 [inline]
 ip6_xmit+0xf64/0x2410 net/ipv6/ip6_output.c:275
 inet6_csk_xmit+0x375/0x630 net/ipv6/inet6_connection_sock.c:139
 __tcp_transmit_skb+0x1bc5/0x3b00 net/ipv4/tcp_output.c:1162
 tcp_transmit_skb net/ipv4/tcp_output.c:1178 [inline]
 tcp_write_xmit+0x1676/0x5710 net/ipv4/tcp_output.c:2364
 tcp_push_one+0

Re: [PATCH bpf-next v2 1/8] tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup

2018-10-16 Thread Daniel Borkmann
On 10/16/2018 04:17 PM, Eric Dumazet wrote:
> On 10/12/2018 05:45 PM, Daniel Borkmann wrote:
[...]
>> diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
>> index a5995bb..34e9635 100644
>> --- a/net/ipv4/tcp_ulp.c
>> +++ b/net/ipv4/tcp_ulp.c
>> @@ -123,6 +123,8 @@ void tcp_cleanup_ulp(struct sock *sk)
>>  {
>>  struct inet_connection_sock *icsk = inet_csk(sk);
>>  
>> +sock_owned_by_me(sk);
>> +
>>  if (!icsk->icsk_ulp_ops)
>>  return;
> 
> Ahem... inet_csk_prepare_forced_close() releases the socket lock,
> and tcp_done(newsk); is called after inet_csk_prepare_forced_close() 

Right you are, will fix it up. Thanks!


[PATCH] net/ipv4: fix tcp_poll for SMC fallback

2018-10-16 Thread Karsten Graul
Commit dd979b4df817 ("net: simplify sock_poll_wait") breaks tcp_poll for 
SMC fallback: An AF_SMC socket establishes an internal TCP socket for the 
CLC handshake with the remote peer. Whenever the SMC connection cannot be
established, this CLC socket is used as a fallback. All socket operations on the
SMC socket are then forwarded to the CLC socket. In case of poll, the 
file->private_data pointer references the SMC socket because the CLC socket has 
no file assigned. This causes tcp_poll to wait on the wrong socket.

This patch fixes the issue by (re)introducing a sock_poll_wait variant that
takes a socket parameter, and letting tcp_poll use this variant.

Fixes: dd979b4df817 ("net: simplify sock_poll_wait")
Signed-off-by: Karsten Graul 
---
 include/net/sock.h | 20 +---
 net/ipv4/tcp.c |  2 +-
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 433f45fc2d68..eb2980d48aeb 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2057,14 +2057,14 @@ static inline bool skwq_has_sleeper(struct socket_wq *wq)
 /**
  * sock_poll_wait - place memory barrier behind the poll_wait call.
  * @filp:   file
+ * @sock:   socket to wait on
  * @p:  poll_table
  *
  * See the comments in the wq_has_sleeper function.
  */
-static inline void sock_poll_wait(struct file *filp, poll_table *p)
+static inline void _sock_poll_wait(struct file *filp, struct socket *sock,
+  poll_table *p)
 {
-   struct socket *sock = filp->private_data;
-
if (!poll_does_not_wait(p)) {
poll_wait(filp, &sock->wq->wait, p);
/* We need to be sure we are in sync with the
@@ -2076,6 +2076,20 @@ static inline void sock_poll_wait(struct file *filp, poll_table *p)
}
 }
 
+/**
+ * sock_poll_wait - place memory barrier behind the poll_wait call.
+ * @filp:   file
+ * @p:  poll_table
+ *
+ * See the comments in the wq_has_sleeper function.
+ */
+static inline void sock_poll_wait(struct file *filp, poll_table *p)
+{
+   struct socket *sock = filp->private_data;
+
+   _sock_poll_wait(filp, sock, p);
+}
+
 static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
 {
if (sk->sk_txhash) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 10c6246396cc..a8041729839d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -507,7 +507,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
const struct tcp_sock *tp = tcp_sk(sk);
int state;
 
-   sock_poll_wait(file, wait);
+   _sock_poll_wait(file, sock, wait);
 
state = inet_sk_state_load(sk);
if (state == TCP_LISTEN)
-- 
2.18.0



Re: [PATCH net-next v2 0/2] FDDI: DEC FDDIcontroller 700 TURBOchannel adapter support

2018-10-16 Thread Maciej W. Rozycki
On Mon, 15 Oct 2018, David Miller wrote:

> Series applied, thank you.

 Great, thanks!

  Maciej


Re: [PATCH bpf-next] bpf, tls: add tls header to tools infrastructure

2018-10-16 Thread Alexei Starovoitov
On Tue, Oct 16, 2018 at 03:59:36PM +0200, Daniel Borkmann wrote:
> Andrey reported a build error for the BPF kselftest suite when compiled on
> a machine which does not have tls related header bits installed natively:
> 
>   test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory
>    #include <linux/tls.h>
>              ^
>   compilation terminated.
> 
> Fix it by adding the header to the tools include infrastructure and add
> definitions such as SOL_TLS that could potentially be missing.
> 
> Fixes: e9dd904708c4 ("bpf: add tls support for testing in test_sockmap")
> Reported-by: Andrey Ignatov 
> Signed-off-by: Daniel Borkmann 

Applied, Thanks



[PATCH net] net: bpfilter: use get_pid_task instead of pid_task

2018-10-16 Thread Taehee Yoo
pid_task() dereferences the RCU-protected task array, but there is no
rcu_read_lock() in the shutdown_umh() routine, so rcu_read_lock() is
needed. get_pid_task() is a wrapper around pid_task(): it holds
rcu_read_lock(), then calls pid_task(), and if the task isn't NULL it
also takes a reference on the task.
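
For reference, get_pid_task() is roughly equivalent to the following
open-coded pattern (sketch):

        rcu_read_lock();
        tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
        if (tsk)
                get_task_struct(tsk);   /* hold a reference across force_sig() */
        rcu_read_unlock();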

test commands:
   %modprobe bpfilter
   %modprobe -rv bpfilter

splat looks like:
[15102.030932] =
[15102.030957] WARNING: suspicious RCU usage
[15102.030985] 4.19.0-rc7+ #21 Not tainted
[15102.031010] -
[15102.031038] kernel/pid.c:330 suspicious rcu_dereference_check() usage!
[15102.031063]
   other info that might help us debug this:

[15102.031332]
   rcu_scheduler_active = 2, debug_locks = 1
[15102.031363] 1 lock held by modprobe/1570:
[15102.031389]  #0: 580ef2b0 (bpfilter_lock){+.+.}, at: 
stop_umh+0x13/0x52 [bpfilter]
[15102.031552]
   stack backtrace:
[15102.031583] CPU: 1 PID: 1570 Comm: modprobe Not tainted 4.19.0-rc7+ #21
[15102.031607] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[15102.031628] Call Trace:
[15102.031676]  dump_stack+0xc9/0x16b
[15102.031723]  ? show_regs_print_info+0x5/0x5
[15102.031801]  ? lockdep_rcu_suspicious+0x117/0x160
[15102.031855]  pid_task+0x134/0x160
[15102.031900]  ? find_vpid+0xf0/0xf0
[15102.032017]  shutdown_umh.constprop.1+0x1e/0x53 [bpfilter]
[15102.032055]  stop_umh+0x46/0x52 [bpfilter]
[15102.032092]  __x64_sys_delete_module+0x47e/0x570
[ ... ]

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Taehee Yoo 
---
 net/bpfilter/bpfilter_kern.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index b64e1649993b..94e88f510c5b 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -23,9 +23,11 @@ static void shutdown_umh(struct umh_info *info)
 
if (!info->pid)
return;
-   tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
-   if (tsk)
+   tsk = get_pid_task(find_vpid(info->pid), PIDTYPE_PID);
+   if (tsk) {
force_sig(SIGKILL, tsk);
+   put_task_struct(tsk);
+   }
fput(info->pipe_to_umh);
fput(info->pipe_from_umh);
info->pid = 0;
-- 
2.17.1



Re: [PATCH net] net: bpfilter: use get_pid_task instead of pid_task

2018-10-16 Thread Alexei Starovoitov
On Wed, Oct 17, 2018 at 12:35:10AM +0900, Taehee Yoo wrote:
> pid_task() dereferences rcu protected tasks array.
> But there is no rcu_read_lock() in shutdown_umh() routine so that
> rcu_read_lock() is needed.
> get_pid_task() is wrapper function of pid_task. it holds rcu_read_lock()
> then calls pid_task(). if task isn't NULL, it increases reference count
> of task.
> 
> test commands:
>%modprobe bpfilter
>%modprobe -rv bpfilter
> 
> splat looks like:
> [15102.030932] =
> [15102.030957] WARNING: suspicious RCU usage
> [15102.030985] 4.19.0-rc7+ #21 Not tainted
> [15102.031010] -
> [15102.031038] kernel/pid.c:330 suspicious rcu_dereference_check() usage!
> [15102.031063]
>  other info that might help us debug this:
> 
> [15102.031332]
>  rcu_scheduler_active = 2, debug_locks = 1
> [15102.031363] 1 lock held by modprobe/1570:
> [15102.031389]  #0: 580ef2b0 (bpfilter_lock){+.+.}, at: 
> stop_umh+0x13/0x52 [bpfilter]
> [15102.031552]
>stack backtrace:
> [15102.031583] CPU: 1 PID: 1570 Comm: modprobe Not tainted 4.19.0-rc7+ #21
> [15102.031607] Hardware name: To be filled by O.E.M. To be filled by 
> O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
> [15102.031628] Call Trace:
> [15102.031676]  dump_stack+0xc9/0x16b
> [15102.031723]  ? show_regs_print_info+0x5/0x5
> [15102.031801]  ? lockdep_rcu_suspicious+0x117/0x160
> [15102.031855]  pid_task+0x134/0x160
> [15102.031900]  ? find_vpid+0xf0/0xf0
> [15102.032017]  shutdown_umh.constprop.1+0x1e/0x53 [bpfilter]
> [15102.032055]  stop_umh+0x46/0x52 [bpfilter]
> [15102.032092]  __x64_sys_delete_module+0x47e/0x570
> [ ... ]
> 
> Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
> Signed-off-by: Taehee Yoo 

thanks a lot for the fix
Acked-by: Alexei Starovoitov 



Re: [PATCH stable 4.9 v2 00/29] backport of IP fragmentation fixes

2018-10-16 Thread Greg Kroah-Hartman
On Mon, Oct 15, 2018 at 10:53:02AM -0700, Eric Dumazet wrote:
> On Mon, Oct 15, 2018 at 10:47 AM Florian Fainelli  
> wrote:
> >
> >
> >
> > On 10/10/2018 12:29 PM, Florian Fainelli wrote:
> > > This is based on Stephen's v4.14 patches, with the necessary merge
> > > conflicts, and the lack of timer_setup() on the 4.9 baseline.
> > >
> > > Perf results on a gigabit capable system, before and after are below.
> > >
> > > Series can also be found here:
> > >
> > > https://github.com/ffainelli/linux/commits/fragment-stack-v4.9-v2
> > >
> > > Changes in v2:
> > >
> > > - drop "net: sk_buff rbnode reorg"
> > > - added original "ip: use rb trees for IP frag queue." commit
> >
> > Eric, does this look reasonable to you?
> 
> Yes, thanks a lot Florian.

Wonderful, all now queued up, thanks!

greg k-h


Reclaiming memory for network interface

2018-10-16 Thread Sujeev Dias

Hi

Setup: sdm845 connected to external modem over pcie interface

During a data call, we found that we spend more than 25% of CPU time on
memory ops with I/O coherency.  That includes allocation, freeing, DMA
mapping, and unmapping.  As we push to higher data rates (beyond 7
Gbps), the time we spend in memory operations is significant. So we're
looking into ways we can reclaim this memory.


One idea we're considering is the following (a rough sketch in code
follows the list):

1. Allocate pages

2. Increment the reference count of each page

3. Allocate an skb, and assign the page to its paged data portion

4. Assign a callback function to skb->destructor

5. Once the destructor gets called, move the page to a new skb
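
A rough sketch of the idea in code (all names here are hypothetical, in
particular driver_recycle_page(); this is not an existing API):

static void rx_page_recycle(struct sk_buff *skb)
{
        struct page *page = skb_shinfo(skb)->destructor_arg;

        /* The extra reference taken at fill time keeps the page alive
         * here, so it can be attached to a fresh skb instead of freed.
         */
        driver_recycle_page(page);
}

/* fill path, per receive buffer: */
        page = dev_alloc_page();
        get_page(page);                         /* extra ref for recycling */
        skb_add_rx_frag(skb, 0, page, 0, frag_len, PAGE_SIZE);
        skb_shinfo(skb)->destructor_arg = page;
        skb->destructor = rx_page_recycle;

One caveat with this sketch: the stack itself overwrites skb->destructor
(e.g. skb_set_owner_r() installs sock_rfree()) once the skb is charged to
a receiving socket, which may be part of why this pattern isn't seen in
existing drivers.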


Sounds simple enough, but we couldn't find anyone actually doing it this
way.  Is there anything to be concerned about with the above proposal? We
have seen some examples of using the destructor to do a deferred unmap,
but we didn't see any example of re-using the buffer. We also couldn't
find any meaningful discussion about reclaiming memory for network data.
Any thoughts on how we should solve this issue?  Any comment is welcome,
thanks.



Sincerely

Sujeev


--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project



Re: bpfilter causes a leftover kernel process

2018-10-16 Thread Alexei Starovoitov
On Wed, Sep 5, 2018 at 5:05 PM Olivier Brunel  wrote:
>
> You'll see in the end that systemd complains that it can't
> unmount /oldroot (EBUSY), aka the root fs; and that's because of the
> bpfilter helper, which wasn't killed because it's seen as a kernel
> thread due to its empty command line and therefore not signaled.

thanks for tracking it down.
can somebody send a patch to give bpfilter non-empty cmdline?
I think that would be a better fix than tweaking all pid1s.


Re: [PATCH net] sctp: get pr_assoc and pr_stream all status with SCTP_PR_SCTP_ALL instead

2018-10-16 Thread David Miller
From: Xin Long 
Date: Tue, 16 Oct 2018 15:52:02 +0800

> According to rfc7496 section 4.3 or 4.4:
> 
>sprstat_policy:  This parameter indicates for which PR-SCTP policy
>   the user wants the information.  It is an error to use
>   SCTP_PR_SCTP_NONE in sprstat_policy.  If SCTP_PR_SCTP_ALL is used,
>   the counters provided are aggregated over all supported policies.
> 
> We change to dump pr_assoc and pr_stream all status by SCTP_PR_SCTP_ALL
> instead, and return error for SCTP_PR_SCTP_NONE, as it also said "It is
> an error to use SCTP_PR_SCTP_NONE in sprstat_policy. "
> 
> Fixes: 826d253d57b1 ("sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt")
> Fixes: d229d48d183f ("sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctp")
> Reported-by: Ying Xu 
> Signed-off-by: Xin Long 

Applied and queued up for -stable.


Re: [PATCH net-next 0/5] Align PTT and add various link modes.

2018-10-16 Thread David Miller
From: Rahul Verma 
Date: Tue, 16 Oct 2018 03:59:17 -0700

> From: Rahul Verma 
> 
> This series aligns the ptt propagation as local ptt or global ptt.
> Adds new transceiver modes, speed capabilities and board config,
> which is utilized to display the enhanced link modes, media types
> and speed. Enhances the link with detailed information.

Series applied.


[PATCH net] r8169: re-enable MSI-X on RTL8168g

2018-10-16 Thread Heiner Kallweit
Similar to d49c88d7677b ("r8169: Enable MSI-X on RTL8106e") after
e9d0ba506ea8 ("PCI: Reprogram bridge prefetch registers on resume")
we can safely assume that this also fixes the root cause of
the issue worked around by 7c53a722459c ("r8169: don't use MSI-X on
RTL8168g"). So let's revert it.

Fixes: 7c53a722459c ("r8169: don't use MSI-X on RTL8168g")
Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index f4df367fb..28184b984 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7098,11 +7098,6 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
RTL_W8(tp, Cfg9346, Cfg9346_Lock);
flags = PCI_IRQ_LEGACY;
-   } else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
-   /* This version was reported to have issues with resume
-* from suspend when using MSI-X
-*/
-   flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
} else {
flags = PCI_IRQ_ALL_TYPES;
}
-- 
2.19.1



[PATCH net-next v3 1/2] net: phy: mscc: fix signedness bug in vsc85xx_downshift_get

2018-10-16 Thread Gustavo A. R. Silva
Currently, the error handling for the call to phy_read_paged()
doesn't work because *reg_val* is of type u16 (16 bits, unsigned),
which makes it impossible for it to hold a value less than 0.

Fix this by changing the type of variable *reg_val* to int.
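
The bug pattern in short (a sketch, not the actual driver code):

        u16 reg_val = phy_read_paged(phydev, page, regnum); /* may return -errno */

        if (reg_val < 0)        /* always false: u16 can't be negative */
                return reg_val;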

Addresses-Coverity-ID: 1473970 ("Unsigned compared against 0")
Fixes: 6a0bfbbe20b0 ("net: phy: mscc: migrate to phy_select/restore_page functions")
Reviewed-by: Quentin Schulz 
Signed-off-by: Gustavo A. R. Silva 
---
Changes in v3:
 - Post patch to netdev.

Changes in v2:
 - Add Quentin's Reviewed-by to the commit log.

 drivers/net/phy/mscc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index bffe077..bff56c3 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -522,7 +522,7 @@ static int vsc85xx_mdix_set(struct phy_device *phydev, u8 mdix)
 
 static int vsc85xx_downshift_get(struct phy_device *phydev, u8 *count)
 {
-   u16 reg_val;
+   int reg_val;
 
reg_val = phy_read_paged(phydev, MSCC_PHY_PAGE_EXTENDED,
 MSCC_PHY_ACTIPHY_CNTL);
-- 
2.7.4



[bpf-next PATCH] bpf: sockmap, fix skmsg recvmsg handler to track size correctly

2018-10-16 Thread John Fastabend
When converting sockmap to the new generic skmsg data structures, we missed
that the recvmsg handler did not correctly use sg.size and instead was
using the individual element lengths. The result is that if a sock is closed
with outstanding data, we omit the call to sk_mem_uncharge() and can
get the warning below.

[   66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 
sk_stream_kill_queues+0x1fa/0x210

To fix this, correct the redirect handler to transfer the size along with
the scatterlist, and also decrement the size in the recvmsg handler.
Now when a sock is closed, the remaining 'size' will be uncharged
via sk_mem_uncharge().

Signed-off-by: John Fastabend 
---
 include/linux/skmsg.h |1 +
 net/ipv4/tcp_bpf.c|1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 0b919f0..31df0d9 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -176,6 +176,7 @@ static inline void sk_msg_xfer(struct sk_msg *dst, struct sk_msg *src,
 {
dst->sg.data[which] = src->sg.data[which];
dst->sg.data[which].length  = size;
+   dst->sg.size   += size;
src->sg.data[which].length -= size;
src->sg.data[which].offset += size;
 }
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 80debb0..f9d3cf1 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -73,6 +73,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
sge->offset += copy;
sge->length -= copy;
sk_mem_uncharge(sk, copy);
+   msg_rx->sg.size -= copy;
if (!sge->length) {
i++;
if (i == MAX_SKB_FRAGS)



Re: [PATCH net] net/sched: properly init chain in case of multiple control actions

2018-10-16 Thread Davide Caratti
On Mon, 2018-10-15 at 11:31 -0700, Cong Wang wrote:
> On Sat, Oct 13, 2018 at 8:23 AM Davide Caratti  wrote:
> > 
> > On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote:
> > > Why not just validate the fallback action in each action init()?
> > > For example, checking tcfg_paction in tcf_gact_init().
> > > 
> > > I don't see the need of making it generic.
...
> > A (legal?) trick is to let tcf_action store the fallback action when it
> > contains a 'goto chain' command, I just posted a proposal for gact. If you
> > think it's ok, I will test and post the same for act_police.
> 
> Do we really need to support TC_ACT_GOTO_CHAIN for
> gact->tcfg_paction etc.? I mean, is it useful in practice or is it just for
> completeness?
> 
> IF we don't need to support it, we can just make it invalid without needing
> to initialize it in ->init() at all.
> 
> If we do, however, we really need to move it into each ->init(), because
> we have to lock each action if we are modifying an existing one. With
> your patch, tcf_action_goto_chain_init() is still called without the 
> per-action
> lock.
> 
> What's more, if we support two different actions in gact, that is, 
> tcfg_paction
> and tcf_action, how could you still only have one a->goto_chain pointer?
> There should be two pointers for each of them. :)

whatever fixes the NULL dereference is OK for me.
I thought that the proposal made with

https://www.mail-archive.com/netdev@vger.kernel.org/msg251933.html

(i.e., letting init() copy tcfg_paction to tcf_action in case it contained
'goto chain x') was smart enough to preserve the current behavior, and
also let 'goto chain' work in case it was configured *only* for the
fallback action.
When the action is modified, the change to tcfg_paction is done with the
same spinlock as tcf_action, so I didn't notice anything worse than the
current locking layout. 

(well, after some more thinking I looked again at that patch and yes, it
lacked the most important thing:)

--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -88,6 +88,9 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
p_parm = nla_data(tb[TCA_GACT_PROB]);
if (p_parm->ptype >= MAX_RAND)
return -EINVAL;
+   if (TC_ACT_EXT_CMP(p_parm->paction, TC_ACT_GOTO_CHAIN) &&
+   TC_ACT_EXT_CMP(parm->action, TC_ACT_GOTO_CHAIN))
+   return -EINVAL;
}
 #endif

That said, 'goto chain' never worked for police and gact since the first
introduction of 'goto chain', so we are not breaking any userspace program.
And I don't necessarily need 'goto chain' in police and gact fallback
actions; nobody complained in 1 year, so we can just add these two lines
in tcf_gact_init() and something similar in tcf_police_init():


if (p_parm->ptype >= MAX_RAND)
return -EINVAL;
+   if (TC_ACT_EXT_CMP(p_parm->paction, TC_ACT_GOTO_CHAIN))
+   return -EINVAL;


(and maybe also help users with a proper extack). Just let me know which
approach you prefer, I will test and send patches.
thanks!

-- 
davide




Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id

2018-10-16 Thread Alexei Starovoitov
On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
> This patch added an interface to load a program with the following
> additional information:
>   . prog_btf_fd
>   . func_info and func_info_len
> where func_info will provide the function range and type_id
> corresponding to each function.
> 
> If the verifier agrees with the function ranges provided by the user,
> the bpf_prog ksym for each function will use the func name
> provided in the type_id, which is supposed to provide a better
> encoding as it is not limited by the 16-byte program name
> limitation, and this is better for bpf programs which contain
> multiple subprograms.
> 
> The bpf_prog_info interface is also extended to
> return btf_id and jited_func_types, so user space can
> print out the function prototype for each jited function.
> 
> Signed-off-by: Yonghong Song 
...
>   BUILD_BUG_ON(sizeof("bpf_prog_") +
>sizeof(prog->tag) * 2 +
> @@ -401,6 +403,13 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>  
>   sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>   sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
> +
> + if (prog->aux->btf) {
> +		func_name = btf_get_name_by_id(prog->aux->btf, prog->aux->type_id);
> + snprintf(sym, (size_t)(end - sym), "_%s", func_name);
> + return;

Would it make sense to add a comment here that prog->aux->name is ignored
when full btf name is available? (otherwise the same name will appear twice in 
ksym)

> + }
> +
>   if (prog->aux->name[0])
>   snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
...
> +static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env,
> +   union bpf_attr *attr)
> +{
> + struct bpf_func_info *data;
> + int i, nfuncs, ret = 0;
> +
> + if (!attr->func_info_len)
> + return 0;
> +
> + nfuncs = attr->func_info_len / sizeof(struct bpf_func_info);
> + if (env->subprog_cnt != nfuncs) {
> +		verbose(env, "number of funcs in func_info does not match verifier\n");

'does not match verifier' is hard to make sense of.
How about 'number of funcs in func_info doesn't match number of subprogs' ?

> + return -EINVAL;
> + }
> +
> + data = kvmalloc(attr->func_info_len, GFP_KERNEL | __GFP_NOWARN);
> + if (!data) {
> + verbose(env, "no memory to allocate attr func_info\n");

I don't think we ever print such warnings for memory allocations.
imo this can be removed, since enomem is enough.

> + return -ENOMEM;
> + }
> +
> + if (copy_from_user(data, u64_to_user_ptr(attr->func_info),
> +attr->func_info_len)) {
> + verbose(env, "memory copy error for attr func_info\n");

similar thing. kernel never warns about copy_from_user errors.

> + ret = -EFAULT;
> + goto cleanup;
> + }
> +
> + for (i = 0; i < nfuncs; i++) {
> + if (env->subprog_info[i].start != data[i].insn_offset) {
> + verbose(env, "func_info subprog start (%d) does not 
> match verifier (%d)\n",
> + env->subprog_info[i].start, 
> data[i].insn_offset);

I think printing exact insn offset isn't going to be much help
for regular user to debug it. If this happens, it's likely llvm issue.
How about 'func_info BTF section doesn't match subprog layout in BPF program' ?



[bpf-next PATCH 0/3] sockmap support for msg_peek flag

2018-10-16 Thread John Fastabend
This adds support for the MSG_PEEK flag when redirecting into an
ingress psock sk_msg queue.

The first patch adds some base support to the helpers, then the
feature, and finally we add an option for the test suite to do
a duplicate MSG_PEEK call on every recv to test the feature.

With the duplicate MSG_PEEK call, all tests continue to PASS.
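
For reference, the duplicated-peek pattern exercised by the tests looks
roughly like this (sketch):

        ssize_t peeked, copied;

        /* peek first: the data must stay queued... */
        peeked = recvmsg(fd, &msg_peek, MSG_PEEK);
        /* ...then a normal receive must return the same bytes */
        copied = recvmsg(fd, &msg, 0);
        /* the test then verifies both buffers hold identical data */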

---

John Fastabend (3):
  bpf: skmsg, improve sk_msg_used_element to work in cork context
  bpf: sockmap, support for msg_peek in sk_msg with redirect ingress
  bpf: sockmap, add msg_peek tests to test_sockmap


 include/linux/skmsg.h  |   13 +-
 include/net/tcp.h  |2 
 net/ipv4/tcp_bpf.c |   42 +--
 net/tls/tls_sw.c   |3 -
 tools/testing/selftests/bpf/test_sockmap.c |  167 +++-
 5 files changed, 153 insertions(+), 74 deletions(-)

--
Signature


[bpf-next PATCH 2/3] bpf: sockmap, support for msg_peek in sk_msg with redirect ingress

2018-10-16 Thread John Fastabend
This adds support for the MSG_PEEK flag when doing redirect to ingress
and receiving on the sk_msg psock queue. Previously the flag was
being ignored which could confuse applications if they expected the
flag to work as normal.

Signed-off-by: John Fastabend 
---
 include/net/tcp.h  |2 +-
 net/ipv4/tcp_bpf.c |   42 +++---
 net/tls/tls_sw.c   |3 ++-
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3600ae0..14fdd7c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2089,7 +2089,7 @@ int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, u32 bytes,
 int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int nonblock, int flags, int *addr_len);
 int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
- struct msghdr *msg, int len);
+ struct msghdr *msg, int len, int flags);
 
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index f9d3cf1..b7918d4 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -39,17 +39,19 @@ static int tcp_bpf_wait_data(struct sock *sk, struct sk_psock *psock,
 }
 
 int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
- struct msghdr *msg, int len)
+ struct msghdr *msg, int len, int flags)
 {
struct iov_iter *iter = &msg->msg_iter;
+   int peek = flags & MSG_PEEK;
int i, ret, copied = 0;
+   struct sk_msg *msg_rx;
+
+   msg_rx = list_first_entry_or_null(&psock->ingress_msg,
+ struct sk_msg, list);
 
while (copied != len) {
struct scatterlist *sge;
-   struct sk_msg *msg_rx;
 
-   msg_rx = list_first_entry_or_null(&psock->ingress_msg,
- struct sk_msg, list);
if (unlikely(!msg_rx))
break;
 
@@ -70,22 +72,30 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
}
 
copied += copy;
-   sge->offset += copy;
-   sge->length -= copy;
-   sk_mem_uncharge(sk, copy);
-   msg_rx->sg.size -= copy;
-   if (!sge->length) {
-   i++;
-   if (i == MAX_SKB_FRAGS)
-   i = 0;
-   if (!msg_rx->skb)
-   put_page(page);
+   if (likely(!peek)) {
+   sge->offset += copy;
+   sge->length -= copy;
+   sk_mem_uncharge(sk, copy);
+   msg_rx->sg.size -= copy;
+
+   if (!sge->length) {
+   sk_msg_iter_var_next(i);
+   if (!msg_rx->skb)
+   put_page(page);
+   }
+   } else {
+   sk_msg_iter_var_next(i);
}
 
if (copied == len)
break;
} while (i != msg_rx->sg.end);
 
+   if (unlikely(peek)) {
+   msg_rx = list_next_entry(msg_rx, list);
+   continue;
+   }
+
msg_rx->sg.start = i;
if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
list_del(&msg_rx->list);
@@ -93,6 +103,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
consume_skb(msg_rx->skb);
kfree(msg_rx);
}
+   msg_rx = list_first_entry_or_null(&psock->ingress_msg,
+ struct sk_msg, list);
}
 
return copied;
@@ -115,7 +127,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
lock_sock(sk);
 msg_bytes_ready:
-   copied = __tcp_bpf_recvmsg(sk, psock, msg, len);
+   copied = __tcp_bpf_recvmsg(sk, psock, msg, len, flags);
if (!copied) {
int data, err = 0;
long timeo;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index a525fc4..5cd88ba 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1478,7 +1478,8 @@ int tls_sw_recvmsg(struct sock *sk,
skb = tls_wait_data(sk, psock, flags, timeo, &err);
if (!skb) {
if (psock) {
-   

[bpf-next PATCH 1/3] bpf: skmsg, improve sk_msg_used_element to work in cork context

2018-10-16 Thread John Fastabend
Currently sk_msg_used_element is only called in the zerocopy context,
where cork is not possible, and if this case happens we fall back to copy
mode. However, the helper is more useful if it works in all contexts.

This patch resolves the case where end == head can indicate either a full
or an empty ring, and the helper always reported an empty ring. To fix this,
add a test for the full-ring case to avoid reporting a full ring as
having 0 elements. This additional functionality will be used in the
next patches from the recvmsg context, where end == head with a full ring
is a valid case.
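
A quick illustration of the ambiguity this resolves (sketch): with a ring
of MAX_MSG_FRAGS slots, start == end describes both extremes, so sg.size
is what disambiguates them:

        /* empty ring: sg.start == sg.end && sg.size == 0 -> 0 elements used */
        /* full ring:  sg.start == sg.end && sg.size != 0 -> MAX_MSG_FRAGS used */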

Signed-off-by: John Fastabend 
---
 include/linux/skmsg.h |   13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 31df0d9..22347b0 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -187,18 +187,21 @@ static inline void sk_msg_xfer_full(struct sk_msg *dst, struct sk_msg *src)
sk_msg_init(src);
 }
 
+static inline bool sk_msg_full(const struct sk_msg *msg)
+{
+   return (msg->sg.end == msg->sg.start) && msg->sg.size;
+}
+
 static inline u32 sk_msg_elem_used(const struct sk_msg *msg)
 {
+   if (sk_msg_full(msg))
+   return MAX_MSG_FRAGS;
+
return msg->sg.end >= msg->sg.start ?
msg->sg.end - msg->sg.start :
msg->sg.end + (MAX_MSG_FRAGS - msg->sg.start);
 }
 
-static inline bool sk_msg_full(const struct sk_msg *msg)
-{
-   return (msg->sg.end == msg->sg.start) && msg->sg.size;
-}
-
 static inline struct scatterlist *sk_msg_elem(struct sk_msg *msg, int which)
 {
return &msg->sg.data[which];



[bpf-next PATCH 3/3] bpf: sockmap, add msg_peek tests to test_sockmap

2018-10-16 Thread John Fastabend
Add tests that do a MSG_PEEK recv followed by a regular receive to
test flag support.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c |  167 +++-
 1 file changed, 115 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 7cb69ce..cbd1c0b 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -80,6 +80,7 @@
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
+int peek_flag;
 
 static const struct option long_options[] = {
{"help",no_argument,NULL, 'h' },
@@ -102,6 +103,7 @@
{"txmsg_ingress", no_argument,  &txmsg_ingress, 1 },
{"txmsg_skb", no_argument,  &txmsg_skb, 1 },
{"ktls", no_argument,   &ktls, 1 },
+   {"peek", no_argument,   &peek_flag, 1 },
{0, 0, NULL, 0 }
 };
 
@@ -352,33 +354,40 @@ static int msg_loop_sendpage(int fd, int iov_length, int cnt,
return 0;
 }
 
-static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
-   struct msg_stats *s, bool tx,
-   struct sockmap_options *opt)
+static void msg_free_iov(struct msghdr *msg)
 {
-   struct msghdr msg = {0};
-   int err, i, flags = MSG_NOSIGNAL;
+   int i;
+
+   for (i = 0; i < msg->msg_iovlen; i++)
+   free(msg->msg_iov[i].iov_base);
+   free(msg->msg_iov);
+   msg->msg_iov = NULL;
+   msg->msg_iovlen = 0;
+}
+
+static int msg_alloc_iov(struct msghdr *msg,
+int iov_count, int iov_length,
+bool data, bool xmit)
+{
+   unsigned char k = 0;
struct iovec *iov;
-   unsigned char k;
-   bool data_test = opt->data_test;
-   bool drop = opt->drop_expected;
+   int i;
 
iov = calloc(iov_count, sizeof(struct iovec));
if (!iov)
return errno;
 
-   k = 0;
for (i = 0; i < iov_count; i++) {
unsigned char *d = calloc(iov_length, sizeof(char));
 
if (!d) {
fprintf(stderr, "iov_count %i/%i OOM\n", i, iov_count);
-   goto out_errno;
+   goto unwind_iov;
}
iov[i].iov_base = d;
iov[i].iov_len = iov_length;
 
-   if (data_test && tx) {
+   if (data && xmit) {
int j;
 
for (j = 0; j < iov_length; j++)
@@ -386,9 +395,60 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
}
}
 
-   msg.msg_iov = iov;
-   msg.msg_iovlen = iov_count;
-   k = 0;
+   msg->msg_iov = iov;
+   msg->msg_iovlen = iov_count;
+
+   return 0;
+unwind_iov:
+   for (i--; i >= 0 ; i--)
+   free(msg->msg_iov[i].iov_base);
+   return -ENOMEM;
+}
+
+static int msg_verify_data(struct msghdr *msg, int size, int chunk_sz)
+{
+   int i, j, bytes_cnt = 0;
+   unsigned char k = 0;
+
+   for (i = 0; i < msg->msg_iovlen; i++) {
+   unsigned char *d = msg->msg_iov[i].iov_base;
+
+   for (j = 0;
+j < msg->msg_iov[i].iov_len && size; j++) {
+   if (d[j] != k++) {
+   fprintf(stderr,
+   "detected data corruption @iov[%i]:%i 
%02x != %02x, %02x ?= %02x\n",
+   i, j, d[j], k - 1, d[j+1], k);
+   return -EIO;
+   }
+   bytes_cnt++;
+   if (bytes_cnt == chunk_sz) {
+   k = 0;
+   bytes_cnt = 0;
+   }
+   size--;
+   }
+   }
+   return 0;
+}
+
+static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
+   struct msg_stats *s, bool tx,
+   struct sockmap_options *opt)
+{
+   struct msghdr msg = {0}, msg_peek = {0};
+   int err, i, flags = MSG_NOSIGNAL;
+   bool drop = opt->drop_expected;
+   bool data = opt->data_test;
+
+   err = msg_alloc_iov(&msg, iov_count, iov_length, data, tx);
+   if (err)
+   goto out_errno;
+   if (peek_flag) {
+   err = msg_alloc_iov(&msg_peek, iov_count, iov_length, data, tx);
+   if (err)
+   goto out_errno;
+   }
 
if (tx) {
clock_gettime(CLOCK_MONOTONIC, &s->start);
@@ -408,19 +468,12 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
}
clock_gettime(CLOCK_MONOTONIC, &s->end);
} else {
-   int slct, recv, max_fd = fd;
+   int slct, recvp = 0, recv, max_fd = fd;

[PATCH net] sctp: fix race on sctp_id2asoc

2018-10-16 Thread Marcelo Ricardo Leitner
syzbot reported a use-after-free involving sctp_id2asoc.  Dmitry Vyukov
helped to root-cause it: it happens because the asoc is read after it
was freed:

CPU 1                           CPU 2
(working on socket 1)           (working on socket 2)
                                sctp_association_destroy
sctp_id2asoc
  spin lock
    grab the asoc from idr
  spin unlock
                                  spin lock
                                    remove asoc from idr
                                  spin unlock
                                  free(asoc)
  if asoc->base.sk != sk ... [*]

This can only be hit if trying to fetch asocs from different sockets. As
we have a single IDR for all asocs, in all SCTP sockets, their id is
unique on the system. An application can try to send stuff on an id
that matches on another socket, and the if in [*] will protect from such
usage. But it didn't consider that, since the asoc may belong to another
socket, it may be freed in parallel (read: under another socket's lock).

We fix it by moving the checks in [*] into the protected region. This
fixes it because the asoc cannot be freed while the lock is held.

Reported-by: syzbot+c7dd55d7aec49d48e...@syzkaller.appspotmail.com
Acked-by: Dmitry Vyukov 
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/socket.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index f73e9d38d5ba734d7ee3347e4015fd30d355bbfa..a7722f43aa69801c31409d4914c99946ee5533f5 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -271,11 +271,10 @@ struct sctp_association *sctp_id2assoc(struct sock *sk, sctp_assoc_t id)
 
spin_lock_bh(&sctp_assocs_id_lock);
asoc = (struct sctp_association *)idr_find(&sctp_assocs_id, (int)id);
+   if (asoc && (asoc->base.sk != sk || asoc->base.dead))
+   asoc = NULL;
spin_unlock_bh(&sctp_assocs_id_lock);
 
-   if (!asoc || (asoc->base.sk != sk) || asoc->base.dead)
-   return NULL;
-
return asoc;
 }
 
-- 
2.17.1



Re: [PATCH bpf-next 00/13] bpf: add btf func info support

2018-10-16 Thread Alexei Starovoitov
On Fri, Oct 12, 2018 at 11:54:20AM -0700, Yonghong Song wrote:
> The BTF support was added to kernel by Commit 69b693f0aefa
> ("bpf: btf: Introduce BPF Type Format (BTF)"), which introduced
> .BTF section into ELF file and is primarily
> used for map pretty print.
> pahole is used to convert dwarf to BTF for ELF files.
> 
> The next step would be add func type info and debug line info
> into BTF. For debug line info, it is desirable to encode
> source code directly in the BTF to ease deployment and
> introspection.

it's kinda confusing that cover letter talks about line info next step,
but these kernel side patches are only for full prog name from btf.
It certainly makes sense for llvm to do both at the same time.
Please make the cover letter more clear.



Re: [PATCH rdma-next v1 0/4] Scatter to CQE

2018-10-16 Thread Doug Ledford
On Tue, 2018-10-09 at 12:05 +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Changelog v0->v1:
>  * Changed patch #3 to use check_mask function from rdma-core instead define.
> 
> --
> From Yonatan,
> 
> Scatter to CQE is a HW offload feature that saves PCI writes by
> scattering the payload to the CQE.
> 
> The feature depends on the CQE size: if the CQE size is 64B, it will
> work for payloads smaller than 32 bytes. If the CQE size is 128B, it will
> work for payloads smaller than 64 bytes.
> 
> The feature works for responder and requestor:
> 1. For the responder, if the payload is as small as required above, the data
> will be part of the CQE, and thus we save another PCI transaction to the recv
> buffers.
> 2. For requestor, this can be used to get the RDMA_READ response and
> RDMA_ATOMIC response in the CQE. This feature is already supported in 
> upstream.
> 
> As part of this series, we are adding support for DC transport type and
> ability to enable the feature (force enable) in the requestor when SQ
> is not configured to signal all WRs.
> 
> Thanks
> 
> Yonatan Cohen (4):
>   net/mlx5: Expose DC scatter to CQE capability bit
>   IB/mlx5: Support scatter to CQE for DC transport type
>   IB/mlx5: Verify that driver supports user flags
>   IB/mlx5: Allow scatter to CQE without global signaled WRs
> 
>  drivers/infiniband/hw/mlx5/cq.c  |  2 +-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 +-
>  drivers/infiniband/hw/mlx5/qp.c  | 93 
> 
>  include/linux/mlx5/mlx5_ifc.h|  3 +-
>  include/uapi/rdma/mlx5-abi.h |  1 +
>  5 files changed, 79 insertions(+), 22 deletions(-)
> 
> --
> 2.14.4
> 


Hi Leon,

This series looks fine.  Let me know when the net/mlx5 portion has been
committed.

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


signature.asc
Description: This is a digitally signed message part


Re: [bpf-next PATCH 0/3] sockmap support for msg_peek flag

2018-10-16 Thread Alexei Starovoitov
On Tue, Oct 16, 2018 at 11:07:54AM -0700, John Fastabend wrote:
> This adds support for the MSG_PEEK flag when redirecting into an
> ingress psock sk_msg queue.
> 
> The first patch adds some base support to the helpers, then the
> feature, and finally we add an option for the test suite to do
> a duplicate MSG_PEEK call on every recv to test the feature.
> 
> With duplicate MSG_PEEK call all tests continue to PASS.

for the set
Acked-by: Alexei Starovoitov 



Re: [PATCH rdma-next v1 0/4] Scatter to CQE

2018-10-16 Thread Leon Romanovsky
On Tue, Oct 16, 2018 at 02:39:01PM -0400, Doug Ledford wrote:
> On Tue, 2018-10-09 at 12:05 +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> >
> > Changelog v0->v1:
> >  * Changed patch #3 to use check_mask function from rdma-core instead 
> > define.
> >
> > --
> > From Yonatan,
> >
> > Scatter to CQE is a HW offload feature that saves PCI writes by
> > scattering the payload to the CQE.
> >
> > The feature depends on the CQE size: if the CQE size is 64B, it will
> > work for payloads smaller than 32 bytes. If the CQE size is 128B, it will
> > work for payloads smaller than 64 bytes.
> >
> > The feature works for responder and requestor:
> > 1. For the responder, if the payload is as small as required above, the data
> > will be part of the CQE, and thus we save another PCI transaction to the recv
> > buffers.
> > 2. For requestor, this can be used to get the RDMA_READ response and
> > RDMA_ATOMIC response in the CQE. This feature is already supported in 
> > upstream.
> >
> > As part of this series, we are adding support for DC transport type and
> > ability to enable the feature (force enable) in the requestor when SQ
> > is not configured to signal all WRs.
> >
> > Thanks
> >
> > Yonatan Cohen (4):
> >   net/mlx5: Expose DC scatter to CQE capability bit
> >   IB/mlx5: Support scatter to CQE for DC transport type
> >   IB/mlx5: Verify that driver supports user flags
> >   IB/mlx5: Allow scatter to CQE without global signaled WRs
> >
> >  drivers/infiniband/hw/mlx5/cq.c  |  2 +-
> >  drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 +-
> >  drivers/infiniband/hw/mlx5/qp.c  | 93 
> > 
> >  include/linux/mlx5/mlx5_ifc.h|  3 +-
> >  include/uapi/rdma/mlx5-abi.h |  1 +
> >  5 files changed, 79 insertions(+), 22 deletions(-)
> >
> > --
> > 2.14.4
> >
>
>
> Hi Leon,
>
> This series looks fine.  Let me know when the net/mlx5 portion has been
> committed.

Thanks Doug,
I pushed first patch to mlx5-next
94a04d1d3d36 ("net/mlx5: Expose DC scatter to CQE capability bit")

>
> --
> Doug Ledford 
> GPG KeyID: B826A3330E572FDD
> Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD




signature.asc
Description: PGP signature


[PATCH net] sctp: not free the new asoc when sctp_wait_for_connect returns err

2018-10-16 Thread Xin Long
When sctp_wait_for_connect is called to wait for the connect to become
ready for sp->strm_interleave in sctp_sendmsg_to_asoc, a panic can be
triggered if the cpu is scheduled out and the new asoc is freed
elsewhere, as sctp_wait_for_connect will return an error and later the
asoc gets freed again in sctp_sendmsg.

[  285.840764] list_del corruption, 9f0f7b284078->next is LIST_POISON1 
(dead0100)
[  285.843590] WARNING: CPU: 1 PID: 8861 at lib/list_debug.c:47 
__list_del_entry_valid+0x50/0xa0
[  285.846193] Kernel panic - not syncing: panic_on_warn set ...
[  285.846193]
[  285.848206] CPU: 1 PID: 8861 Comm: sctp_ndata Kdump: loaded Not tainted 
4.19.0-rc7.label #584
[  285.850559] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  285.852164] Call Trace:
...
[  285.872210]  ? __list_del_entry_valid+0x50/0xa0
[  285.872894]  sctp_association_free+0x42/0x2d0 [sctp]
[  285.873612]  sctp_sendmsg+0x5a4/0x6b0 [sctp]
[  285.874236]  sock_sendmsg+0x30/0x40
[  285.874741]  ___sys_sendmsg+0x27a/0x290
[  285.875304]  ? __switch_to_asm+0x34/0x70
[  285.875872]  ? __switch_to_asm+0x40/0x70
[  285.876438]  ? ptep_set_access_flags+0x2a/0x30
[  285.877083]  ? do_wp_page+0x151/0x540
[  285.877614]  __sys_sendmsg+0x58/0xa0
[  285.878138]  do_syscall_64+0x55/0x180
[  285.878669]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is an issue similar to the one fixed in commit ca3af4dd28cf
("sctp: do not free asoc when it is already dead in sctp_sendmsg").
But this one can't be fixed by returning -ESRCH for the dead asoc
in sctp_wait_for_connect, as that would break sctp_connect's return
value to users.

This patch simply sets err to -ESRCH before returning to
sctp_sendmsg when any error is returned by sctp_wait_for_connect
for sp->strm_interleave, so that no asoc will be freed due to
this.

When users see this error, they will know the packet hasn't been
sent. It also makes sense not to free the asoc just because waiting
for the connect failed, matching the second call to
sctp_wait_for_connect in sctp_sendmsg_to_asoc.

Fixes: 668c9beb9020 ("sctp: implement assign_number for sctp_stream_interleave")
Signed-off-by: Xin Long 
---
 net/sctp/socket.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index e25a20f..1baa9d9 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1946,8 +1946,10 @@ static int sctp_sendmsg_to_asoc(struct sctp_association *asoc,
if (sp->strm_interleave) {
timeo = sock_sndtimeo(sk, 0);
err = sctp_wait_for_connect(asoc, &timeo);
-   if (err)
+   if (err) {
+   err = -ESRCH;
goto err;
+   }
} else {
wait_connect = true;
}
-- 
2.1.0



[PATCH net-next 0/2] sctp: fix sk_wmem_queued and use it to check for writable space

2018-10-16 Thread Xin Long
sctp doesn't count and use asoc sndbuf_used, sk sk_wmem_alloc and
sk_wmem_queued properly, which also causes some problems.

This patchset is to improve it.

Xin Long (2):
  sctp: count both sk and asoc sndbuf with skb truesize and sctp_chunk
size
  sctp: use sk_wmem_queued to check for writable space

 include/net/sctp/constants.h |  5 
 net/sctp/outqueue.c  |  8 ++
 net/sctp/socket.c| 59 +++-
 3 files changed, 17 insertions(+), 55 deletions(-)

-- 
2.1.0



[PATCH net-next 1/2] sctp: count both sk and asoc sndbuf with skb truesize and sctp_chunk size

2018-10-16 Thread Xin Long
Now it's confusing that asoc sndbuf_used does its memory accounting with
SCTP_DATA_SNDSIZE(chunk) + sizeof(sk_buff) + sizeof(sctp_chunk) while sk
sk_wmem_alloc does it with skb->truesize + sizeof(sctp_chunk).

It also causes sctp_prsctp_prune to compute the freed memory incorrectly
when sndbuf_policy is not set.

To make this right and also keep consistent between asoc sndbuf_used, sk
sk_wmem_alloc and sk_wmem_queued, use skb->truesize + sizeof(sctp_chunk)
for them.

Signed-off-by: Xin Long 
---
 include/net/sctp/constants.h |  5 -
 net/sctp/outqueue.c  |  8 ++--
 net/sctp/socket.c| 21 ++---
 3 files changed, 8 insertions(+), 26 deletions(-)

diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index 86f034b..8dadc74 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -148,11 +148,6 @@ SCTP_SUBTYPE_CONSTRUCTOR(PRIMITIVE, enum sctp_event_primitive, primitive)
 #define sctp_chunk_is_data(a) (a->chunk_hdr->type == SCTP_CID_DATA || \
   a->chunk_hdr->type == SCTP_CID_I_DATA)
 
-/* Calculate the actual data size in a data chunk */
-#define SCTP_DATA_SNDSIZE(c) ((int)((unsigned long)(c->chunk_end) - \
-   (unsigned long)(c->chunk_hdr) - \
-   sctp_datachk_len(&c->asoc->stream)))
-
 /* Internal error codes */
 enum sctp_ierror {
SCTP_IERROR_NO_ERROR= 0,
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 42191ed..9cb854b 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -385,9 +385,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association *asoc,
asoc->outqueue.outstanding_bytes -= sctp_data_size(chk);
}
 
-   msg_len -= SCTP_DATA_SNDSIZE(chk) +
-  sizeof(struct sk_buff) +
-  sizeof(struct sctp_chunk);
+   msg_len -= chk->skb->truesize + sizeof(struct sctp_chunk);
if (msg_len <= 0)
break;
}
@@ -421,9 +419,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
streamout->ext->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
}
 
-   msg_len -= SCTP_DATA_SNDSIZE(chk) +
-  sizeof(struct sk_buff) +
-  sizeof(struct sctp_chunk);
+   msg_len -= chk->skb->truesize + sizeof(struct sctp_chunk);
sctp_chunk_free(chk);
if (msg_len <= 0)
break;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index f73e9d3..c6f2950 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -166,12 +166,9 @@ static inline void sctp_set_owner_w(struct sctp_chunk *chunk)
/* Save the chunk pointer in skb for sctp_wfree to use later.  */
skb_shinfo(chunk->skb)->destructor_arg = chunk;
 
-   asoc->sndbuf_used += SCTP_DATA_SNDSIZE(chunk) +
-   sizeof(struct sk_buff) +
-   sizeof(struct sctp_chunk);
-
refcount_add(sizeof(struct sctp_chunk), &sk->sk_wmem_alloc);
-   sk->sk_wmem_queued += chunk->skb->truesize;
+   asoc->sndbuf_used += chunk->skb->truesize + sizeof(struct sctp_chunk);
+   sk->sk_wmem_queued += chunk->skb->truesize + sizeof(struct sctp_chunk);
sk_mem_charge(sk, chunk->skb->truesize);
 }
 
@@ -8460,17 +8457,11 @@ static void sctp_wfree(struct sk_buff *skb)
struct sctp_association *asoc = chunk->asoc;
struct sock *sk = asoc->base.sk;
 
-   asoc->sndbuf_used -= SCTP_DATA_SNDSIZE(chunk) +
-   sizeof(struct sk_buff) +
-   sizeof(struct sctp_chunk);
-
-   WARN_ON(refcount_sub_and_test(sizeof(struct sctp_chunk), &sk->sk_wmem_alloc));
-
-   /*
-* This undoes what is done via sctp_set_owner_w and sk_mem_charge
-*/
-   sk->sk_wmem_queued   -= skb->truesize;
sk_mem_uncharge(sk, skb->truesize);
+   sk->sk_wmem_queued -= skb->truesize + sizeof(struct sctp_chunk);
+   asoc->sndbuf_used -= skb->truesize + sizeof(struct sctp_chunk);
+   WARN_ON(refcount_sub_and_test(sizeof(struct sctp_chunk),
+ &sk->sk_wmem_alloc));
 
if (chunk->shkey) {
struct sctp_shared_key *shkey = chunk->shkey;
-- 
2.1.0



[PATCH net-next 2/2] sctp: use sk_wmem_queued to check for writable space

2018-10-16 Thread Xin Long
sk->sk_wmem_queued is used to count the size of chunks in the out queue
while sk->sk_wmem_alloc is for counting the size of chunks that have been
sent. sctp increases both of them before enqueuing the chunks,
and uses sk->sk_wmem_alloc to check for writable space.

However, sk_wmem_alloc is also increased by 1 for the skb allocated
for sending in sctp_packet_transmit(), but the waiters will not be
woken up when sk_wmem_alloc is decreased in this skb's destructor.

If the msg size is equal to sk_sndbuf and sendmsg is waiting for sndbuf,
the check 'msg_len <= sctp_wspace(asoc)' in sctp_wait_for_sndbuf()
will keep waiting if there's an skb allocated in sctp_packet_transmit,
and later even if this skb gets freed, the waiting thread will never
get woken up.

This issue has been there since the very beginning, so we change to use
sk->sk_wmem_queued to check for writable space, as sk_wmem_queued is
not increased for the skb allocated for sending, just as TCP does.

SOCK_SNDBUF_LOCK check is also removed here as it's for tx buf auto
tuning which I will add in another patch.

Signed-off-by: Xin Long 
---
 net/sctp/socket.c | 38 +-
 1 file changed, 9 insertions(+), 29 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index c6f2950..111ebd8 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -83,7 +83,7 @@
 #include 
 
 /* Forward declarations for internal helper functions. */
-static int sctp_writeable(struct sock *sk);
+static bool sctp_writeable(struct sock *sk);
 static void sctp_wfree(struct sk_buff *skb);
 static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
size_t msg_len);
@@ -119,25 +119,10 @@ static void sctp_enter_memory_pressure(struct sock *sk)
 /* Get the sndbuf space available at the time on the association.  */
 static inline int sctp_wspace(struct sctp_association *asoc)
 {
-   int amt;
+   struct sock *sk = asoc->base.sk;
 
-   if (asoc->ep->sndbuf_policy)
-   amt = asoc->sndbuf_used;
-   else
-   amt = sk_wmem_alloc_get(asoc->base.sk);
-
-   if (amt >= asoc->base.sk->sk_sndbuf) {
-   if (asoc->base.sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-   amt = 0;
-   else {
-   amt = sk_stream_wspace(asoc->base.sk);
-   if (amt < 0)
-   amt = 0;
-   }
-   } else {
-   amt = asoc->base.sk->sk_sndbuf - amt;
-   }
-   return amt;
+   return asoc->ep->sndbuf_policy ? sk->sk_sndbuf - asoc->sndbuf_used
+  : sk_stream_wspace(sk);
 }
 
 /* Increment the used sndbuf space count of the corresponding association by
@@ -1925,10 +1910,10 @@ static int sctp_sendmsg_to_asoc(struct sctp_association *asoc,
asoc->pmtu_pending = 0;
}
 
-   if (sctp_wspace(asoc) < msg_len)
+   if (sctp_wspace(asoc) < (int)msg_len)
sctp_prsctp_prune(asoc, sinfo, msg_len - sctp_wspace(asoc));
 
-   if (!sctp_wspace(asoc)) {
+   if (sctp_wspace(asoc) <= 0) {
timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
err = sctp_wait_for_sndbuf(asoc, &timeo, msg_len);
if (err)
@@ -8535,7 +8520,7 @@ static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
goto do_error;
if (signal_pending(current))
goto do_interrupted;
-   if (msg_len <= sctp_wspace(asoc))
+   if ((int)msg_len <= sctp_wspace(asoc))
break;
 
/* Let another process have a go.  Since we are going
@@ -8610,14 +8595,9 @@ void sctp_write_space(struct sock *sk)
  * UDP-style sockets or TCP-style sockets, this code should work.
  *  - Daisy
  */
-static int sctp_writeable(struct sock *sk)
+static bool sctp_writeable(struct sock *sk)
 {
-   int amt = 0;
-
-   amt = sk->sk_sndbuf - sk_wmem_alloc_get(sk);
-   if (amt < 0)
-   amt = 0;
-   return amt;
+   return sk->sk_sndbuf > sk->sk_wmem_queued;
 }
 
 /* Wait for an association to go into ESTABLISHED state. If timeout is 0,
-- 
2.1.0



[PATCH net-next] tcp, ulp: remove socket lock assertion on ULP cleanup

2018-10-16 Thread Daniel Borkmann
Eric reported that syzkaller triggered a splat in tcp_cleanup_ulp()
where the assertion sock_owned_by_me() failed. This happened because
inet_csk_prepare_forced_close() first releases the socket lock, and
tcp_done(newsk) is then called without the socket lock held. The
sock_owned_by_me() assertion can generally be removed, as the only place
tcp_cleanup_ulp() is called from now is inet_csk_destroy_sock() ->
sk->sk_prot->destroy(), where the socket is in a dead state and
unreachable. Therefore, add a comment explaining why the check is not
needed instead.

Fixes: 8b9088f806e1 ("tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup")
Reported-by: Eric Dumazet 
Signed-off-by: Daniel Borkmann 
---
 net/ipv4/tcp_ulp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index a9162aa..95df7f7 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -99,8 +99,10 @@ void tcp_cleanup_ulp(struct sock *sk)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
 
-   sock_owned_by_me(sk);
-
+   /* No sock_owned_by_me() check here as at the time the
+* stack calls this function, the socket is dead and
+* about to be destroyed.
+*/
if (!icsk->icsk_ulp_ops)
return;
 
-- 
2.9.5



Re: [PATCH net-next] tcp, ulp: remove socket lock assertion on ULP cleanup

2018-10-16 Thread David Miller
From: Daniel Borkmann 
Date: Tue, 16 Oct 2018 21:31:35 +0200

> Eric reported that syzkaller triggered a splat in tcp_cleanup_ulp()
> where the assertion sock_owned_by_me() failed. This happened because
> inet_csk_prepare_forced_close() first releases the socket lock, and
> tcp_done(newsk) is then called without the socket lock held. The
> sock_owned_by_me() assertion can generally be removed, as the only place
> tcp_cleanup_ulp() is called from now is inet_csk_destroy_sock() ->
> sk->sk_prot->destroy(), where the socket is in a dead state and
> unreachable. Therefore, add a comment explaining why the check is not
> needed instead.
> 
> Fixes: 8b9088f806e1 ("tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup")
> Reported-by: Eric Dumazet 
> Signed-off-by: Daniel Borkmann 

Applied.


[PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Heiner Kallweit
rtl_rx() and rtl_tx() are called only if the respective bits are set
in the interrupt status register. Under high load NAPI may not be
able to process all data (work_done == budget) and it will schedule
subsequent calls to the poll callback.
rtl_ack_events() however resets the bits in the interrupt status
register, therefore subsequent calls to rtl8169_poll() won't call
rtl_rx() and rtl_tx() - chip interrupts are still disabled.

Fix this by calling rtl_rx() and rtl_tx() independent of the bits
set in the interrupt status register. Both functions will detect
if there's nothing to do for them.

This issue has been there more or less forever (at least it exists in
3.16 already), so I can't provide a "Fixes" tag. 

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 8c4f49adc..7caf3b7e9 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6528,17 +6528,15 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi);
struct net_device *dev = tp->dev;
u16 enable_mask = RTL_EVENT_NAPI | tp->event_slow;
-   int work_done= 0;
+   int work_done;
u16 status;
 
status = rtl_get_events(tp);
rtl_ack_events(tp, status & ~tp->event_slow);
 
-   if (status & RTL_EVENT_NAPI_RX)
-   work_done = rtl_rx(dev, tp, (u32) budget);
+   work_done = rtl_rx(dev, tp, (u32) budget);
 
-   if (status & RTL_EVENT_NAPI_TX)
-   rtl_tx(dev, tp);
+   rtl_tx(dev, tp);
 
if (status & tp->event_slow) {
enable_mask &= ~tp->event_slow;
-- 
2.19.1



Re: [PATCH bpf-next v2 3/7] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall

2018-10-16 Thread Mauricio Vasquez




On 10/11/2018 06:51 PM, Alexei Starovoitov wrote:
> On Wed, Oct 10, 2018 at 05:50:01PM -0500, Mauricio Vasquez wrote:
> > > > Does it make sense to you?
> > > I reread the other patch, and found it does NOT use the following logic for
> > > queue and stack:
> > > 
> > >  rcu_read_lock();
> > >  ptr = map->ops->map_lookup_and_delete_elem(map, key);
> > >  if (ptr)
> > >  memcpy(value, ptr, value_size);
> > > 
> > > I guess this part is not used at all? Can we just remove it?
> > > 
> > > Thanks,
> > > Song
> > This is the base code for map_lookup_and_delete support, it is not used in
> > queue/stack maps.
> > 
> > I think we can leave it there, so whoever implements lookup_and_delete
> > for other maps doesn't have to care about implementing this part too.
> The code looks useful to me, but I also agree with Song. And in the kernel
> we don't typically add 'almost dead code'.
> May be provide an implementation of the lookup_and_delete for hash map
> so it's actually used ?


I haven't written any code but I think there is a potential problem here.
The current lookup_and_delete returns a pointer to the element, hence
deletion of the element should be done using call_rcu to guarantee the
pointer is still valid after returning.
In the hashtab, the deletion only uses call_rcu when there is no prealloc;
otherwise the element is pushed onto the list of free elements immediately.
If we move the logic that pushes elements onto the free list under a
call_rcu invocation, it could happen that the free list is empty because
the call_rcu is still pending (a similar issue to the one we had with the
queue/stack maps when they used a pass-by-reference API).

There is another way to implement it without this issue, in syscall.c:
l = ops->lookup(key);
memcpy(some_buffer, l, value_size);
ops->delete(key);
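
Expanded, that generic composition in the syscall path could look roughly
like this (an untested sketch; no atomicity is guaranteed between the
lookup and the delete):

	rcu_read_lock();
	ptr = map->ops->map_lookup_elem(map, key);
	if (ptr) {
		memcpy(value, ptr, value_size);
		err = map->ops->map_delete_elem(map, key);
	} else {
		err = -ENOENT;
	}
	rcu_read_unlock();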

The point here is that the lookup_and_delete operation is not being used at
all. So, is lookup + delete = lookup_and_delete? Can it be generalized?
If this is true, then what is the sense of having a lookup_and_delete
syscall?




> imo it would be a useful feature to have for hash map and will clarify
> the intent of it for all other maps and for stack/queue in particular.






Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Holger Hoffstätte

On 10/16/18 22:37, Heiner Kallweit wrote:

rtl_rx() and rtl_tx() are called only if the respective bits are set
in the interrupt status register. Under high load NAPI may not be
able to process all data (work_done == budget) and it will schedule
subsequent calls to the poll callback.
rtl_ack_events() however resets the bits in the interrupt status
register, therefore subsequent calls to rtl8169_poll() won't call
rtl_rx() and rtl_tx() - chip interrupts are still disabled.


Very interesting! Could this be the reason for the mysterious
hangs & resets we experienced when enabling BQL for r8169?
They happened more often with TSO/GSO enabled, and several people
attempted to fix those hangs unsuccessfully; the BQL change was later
reverted and has stayed out since then (#87cda7cb43).
If this bug has been there "forever" it might be tempting to
re-apply BQL and see what happens. Any chance you could give that
a try? I'll gladly test patches, just like I'll run this one.

cheers
Holger


Re: gianfar: Implement MAC reset and reconfig procedure

2018-10-16 Thread Daniel Walker
Hi,

I would like to report an issue in the gianfar driver. The issue is as follows. 

We have a P2020 board that uses the gianfar driver, and we have an m88e1101
PHY connected. When the interface is initially brought up, traffic flows as
normal. If you take the interface down and then bring it back up, traffic stops
flowing. If you do this sequence over and over (up/down/up), we find that the
interface will allow traffic to flow only a low percentage of the time.

In v4.9 the interface allows traffic about 10% of the time.

In v4.19-rc8 it allows traffic 30% of the time.

After bisecting I found that in v3.14 the interface was rock solid and we
never saw this issue. However, in v3.15 we started to see it. After bisecting,
I found the following change is the first one which causes the issue:

a328ac9 gianfar: Implement MAC reset and reconfig procedure

I was able to revert this in v3.15; however, with later development a revert
no longer appears to be possible. We have no fix for this currently.

I can do testing if you have an idea what might cause the issue.

Daniel


Re: gianfar: Implement MAC reset and reconfig procedure

2018-10-16 Thread Florian Fainelli
On 10/16/2018 02:36 PM, Daniel Walker wrote:
> Hi,
> 
> I would like to report an issue in the gianfar driver. The issue is as 
> follows. 
> 
> We have a P2020 board that uses the gianfar driver, and we have an m88e1101
> PHY connected. When the interface is initially brought up, traffic flows as
> normal. If you take the interface down and then bring it back up, traffic stops
> flowing. If you do this sequence over and over (up/down/up), we find that the
> interface will allow traffic to flow only a low percentage of the time.
> 
> In v4.9 the interface allows traffic about 10% of the time.
> 
> In v4.19-rc8 it allows traffic 30% of the time.
> 
> After bisecting I found that in v3.14 the interface was rock solid and we
> never saw this issue. However, in v3.15 we started to see it. After bisecting,
> I found the following change is the first one which causes the issue:
> 
> a328ac9 gianfar: Implement MAC reset and reconfig procedure
> 
> I was able to revert this in v3.15; however, with later development a revert
> no longer appears to be possible. We have no fix for this currently.
> 
> I can do testing if you have an idea what might cause the issue.

What we have typically seen being the problem is that when you have a
PHY connection whereby the PHY provides the RX clock to the MAC (e.g.
RGMII), it is very easy to get into a situation where the PHY clock is
stopped while the MAC is asked to reset. The HW design does not like
that at all, since it e.g. stops on packet boundaries and needs some
clock cycles to do that, and that results in all sorts of issues (in our
case it was some FIFO corruption). We solved that in bcmgenet.c by
internally looping the TX clock back to the RX clock, to make sure the
Ethernet MAC (UniMAC in our designs) was successfully reset:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=28c2d1a7a0bfdf3617800d2beae1c67983c03d15

Could that somehow be the problem here?
-- 
Florian


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Stephen Hemminger
On Tue, 16 Oct 2018 22:37:31 +0200
Heiner Kallweit  wrote:

> rtl_rx() and rtl_tx() are called only if the respective bits are set
> in the interrupt status register. Under high load NAPI may not be
> able to process all data (work_done == budget) and it will schedule
> subsequent calls to the poll callback.
> rtl_ack_events() however resets the bits in the interrupt status
> register, therefore subsequent calls to rtl8169_poll() won't call
> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
> 
> Fix this by calling rtl_rx() and rtl_tx() independent of the bits
> set in the interrupt status register. Both functions will detect
> if there's nothing to do for them.
> 
> This issue has been there more or less forever (at least it exists in
> 3.16 already), so I can't provide a "Fixes" tag. 
> 
> Signed-off-by: Heiner Kallweit 

Another issue is this:

if (work_done < budget) {
napi_complete_done(napi, work_done);

rtl_irq_enable(tp, enable_mask);
mmiowb();
}

return work_done;
}


The code needs to check the return value of napi_complete_done():

	if (work_done < budget &&
	    napi_complete_done(napi, work_done)) {
		rtl_irq_enable(tp, enable_mask);
		mmiowb();
	}

	return work_done;
}

Try that, it might fix the problem and your logic would
be unnecessary.


[PATCH bpf-next 0/2] nfp: bpf: improve offload checks

2018-10-16 Thread Jakub Kicinski
Hi,

this set adds checks to make sure offload behaviour is correct.
First, when atomic counters are used, we must make sure the map
does not already contain data we did not prepare for holding
atomics.

The second patch double checks vNIC capabilities for program offload
in case a program is shared by multiple vNICs with different
constraints.

Jakub Kicinski (2):
  nfp: bpf: protect against mis-initializing atomic counters
  nfp: bpf: double check vNIC capabilities after object sharing

 drivers/net/ethernet/netronome/nfp/bpf/main.h | 10 ++-
 .../net/ethernet/netronome/nfp/bpf/offload.c  | 32 -
 .../net/ethernet/netronome/nfp/bpf/verifier.c | 69 ---
 3 files changed, 98 insertions(+), 13 deletions(-)

-- 
2.17.1



[PATCH bpf-next 2/2] nfp: bpf: double check vNIC capabilities after object sharing

2018-10-16 Thread Jakub Kicinski
The program translation stage checks that the program can be offloaded to
the netdev which was passed during the load (bpf_attr->prog_ifindex).
After program sharing was introduced, however, the netdev on which the
program is loaded can theoretically be different, and therefore we should
recheck the program size and max stack size at load time.

This was found by code inspection; AFAIK today all vNICs have
identical caps.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  3 +++
 drivers/net/ethernet/netronome/nfp/bpf/offload.c  | 14 +-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 11 ++-
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 12e98a0a58e5..7f591d71ab28 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -441,6 +441,7 @@ struct nfp_bpf_subprog_info {
  * @prog: machine code
  * @prog_len: number of valid instructions in @prog array
  * @__prog_alloc_len: alloc size of @prog array
+ * @stack_size: total amount of stack used
  * @verifier_meta: temporary storage for verifier's insn meta
  * @type: BPF program type
  * @last_bpf_off: address of the last instruction translated from BPF
@@ -465,6 +466,8 @@ struct nfp_prog {
unsigned int prog_len;
unsigned int __prog_alloc_len;
 
+   unsigned int stack_size;
+
struct nfp_insn_meta *verifier_meta;
 
enum bpf_prog_type type;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 658c7143d59c..ba8ceedcf6a2 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -489,7 +489,7 @@ nfp_net_bpf_load(struct nfp_net *nn, struct bpf_prog *prog,
 struct netlink_ext_ack *extack)
 {
struct nfp_prog *nfp_prog = prog->aux->offload->dev_priv;
-   unsigned int max_mtu;
+   unsigned int max_mtu, max_stack, max_prog_len;
dma_addr_t dma_addr;
void *img;
int err;
@@ -500,6 +500,18 @@ nfp_net_bpf_load(struct nfp_net *nn, struct bpf_prog *prog,
return -EOPNOTSUPP;
}
 
+   max_stack = nn_readb(nn, NFP_NET_CFG_BPF_STACK_SZ) * 64;
+   if (nfp_prog->stack_size > max_stack) {
+   NL_SET_ERR_MSG_MOD(extack, "stack too large");
+   return -EOPNOTSUPP;
+   }
+
+   max_prog_len = nn_readw(nn, NFP_NET_CFG_BPF_MAX_LEN);
+   if (nfp_prog->prog_len > max_prog_len) {
+   NL_SET_ERR_MSG_MOD(extack, "program too long");
+   return -EOPNOTSUPP;
+   }
+
img = nfp_bpf_relo_for_vnic(nfp_prog, nn->app_priv);
if (IS_ERR(img))
return PTR_ERR(img);
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c 
b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
index e04035c116a4..99f977bfd8cc 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
@@ -747,9 +747,9 @@ nfp_bpf_get_stack_usage(struct nfp_prog *nfp_prog, unsigned int cnt)
 
 static int nfp_bpf_finalize(struct bpf_verifier_env *env)
 {
-   unsigned int stack_size, stack_needed;
struct bpf_subprog_info *info;
struct nfp_prog *nfp_prog;
+   unsigned int max_stack;
struct nfp_net *nn;
int i;
 
@@ -777,11 +777,12 @@ static int nfp_bpf_finalize(struct bpf_verifier_env *env)
}
 
nn = netdev_priv(env->prog->aux->offload->netdev);
-   stack_size = nn_readb(nn, NFP_NET_CFG_BPF_STACK_SZ) * 64;
-   stack_needed = nfp_bpf_get_stack_usage(nfp_prog, env->prog->len);
-   if (stack_needed > stack_size) {
+   max_stack = nn_readb(nn, NFP_NET_CFG_BPF_STACK_SZ) * 64;
+   nfp_prog->stack_size = nfp_bpf_get_stack_usage(nfp_prog,
+  env->prog->len);
+   if (nfp_prog->stack_size > max_stack) {
pr_vlog(env, "stack too large: program %dB > FW stack %dB\n",
-   stack_needed, stack_size);
+   nfp_prog->stack_size, max_stack);
return -EOPNOTSUPP;
}
 
-- 
2.17.1



[PATCH bpf-next 1/2] nfp: bpf: protect against mis-initializing atomic counters

2018-10-16 Thread Jakub Kicinski
Atomic operations on the NFP are currently always in big endian.
The driver keeps track of regions of memory storing atomic values
and byte swaps them accordingly.  There are corner cases where
the map values may be initialized before the driver knows they
are used as atomic counters.  This can happen either when the
datapath is performing the update and the stack contents are
unknown, or when the map is updated before the program which will
use it for atomic values is loaded.

To avoid a situation where the user initializes the value to 0 1 2 3
and then, after loading a program which uses the word as an atomic
counter, starts reading 3 2 1 0, only allow atomic counters to be
initialized to endian-neutral values.

For updates from the datapath the stack information may not be
as precise, so just allow initializing such values to 0.

Example code which would break:
struct bpf_map_def SEC("maps") rxcnt = {
   .type = BPF_MAP_TYPE_HASH,
   .key_size = sizeof(__u32),
   .value_size = sizeof(__u64),
   .max_entries = 1,
};

int xdp_prog1()
{
__u64 nonzeroval = 3;
__u32 key = 0;
__u64 *value;

value = bpf_map_lookup_elem(&rxcnt, &key);
if (!value)
bpf_map_update_elem(&rxcnt, &key, &nonzeroval, BPF_ANY);
else
__sync_fetch_and_add(value, 1);

return XDP_PASS;
}

$ offload bpftool map dump
key: 00 00 00 00 value: 00 00 00 03 00 00 00 00

should be:

$ offload bpftool map dump
key: 00 00 00 00 value: 03 00 00 00 00 00 00 00

Reported-by: David Beckett 
Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  7 ++-
 .../net/ethernet/netronome/nfp/bpf/offload.c  | 18 +-
 .../net/ethernet/netronome/nfp/bpf/verifier.c | 58 +--
 3 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 52457ae3b259..12e98a0a58e5 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -189,6 +189,11 @@ enum nfp_bpf_map_use {
NFP_MAP_USE_ATOMIC_CNT,
 };
 
+struct nfp_bpf_map_word {
+   unsigned char type  :4;
+   unsigned char non_zero_update   :1;
+};
+
 /**
  * struct nfp_bpf_map - private per-map data attached to BPF maps for offload
  * @offmap:pointer to the offloaded BPF map
@@ -202,7 +207,7 @@ struct nfp_bpf_map {
struct nfp_app_bpf *bpf;
u32 tid;
struct list_head l;
-   enum nfp_bpf_map_use use_map[];
+   struct nfp_bpf_map_word use_map[];
 };
 
 struct nfp_bpf_neutral_map {
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 927e038d9f77..658c7143d59c 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -262,10 +262,25 @@ static void nfp_map_bpf_byte_swap(struct nfp_bpf_map *nfp_map, void *value)
unsigned int i;
 
for (i = 0; i < DIV_ROUND_UP(nfp_map->offmap->map.value_size, 4); i++)
-   if (nfp_map->use_map[i] == NFP_MAP_USE_ATOMIC_CNT)
+   if (nfp_map->use_map[i].type == NFP_MAP_USE_ATOMIC_CNT)
word[i] = (__force u32)cpu_to_be32(word[i]);
 }
 
+/* Mark value as unsafely initialized in case it becomes atomic later
+ * and we didn't byte swap something non-byte swap neutral.
+ */
+static void
+nfp_map_bpf_byte_swap_record(struct nfp_bpf_map *nfp_map, void *value)
+{
+   u32 *word = value;
+   unsigned int i;
+
+   for (i = 0; i < DIV_ROUND_UP(nfp_map->offmap->map.value_size, 4); i++)
+   if (nfp_map->use_map[i].type == NFP_MAP_UNUSED &&
+   word[i] != (__force u32)cpu_to_be32(word[i]))
+   nfp_map->use_map[i].non_zero_update = 1;
+}
+
 static int
 nfp_bpf_map_lookup_entry(struct bpf_offloaded_map *offmap,
 void *key, void *value)
@@ -285,6 +300,7 @@ nfp_bpf_map_update_entry(struct bpf_offloaded_map *offmap,
 void *key, void *value, u64 flags)
 {
nfp_map_bpf_byte_swap(offmap->dev_priv, value);
+   nfp_map_bpf_byte_swap_record(offmap->dev_priv, value);
return nfp_bpf_ctrl_update_entry(offmap, key, value, flags);
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c 
b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
index 193dd685b365..e04035c116a4 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
@@ -80,6 +80,46 @@ nfp_record_adjust_head(struct nfp_app_bpf *bpf, struct nfp_prog *nfp_prog,
nfp_prog->adjust_head_location = location;
 }
 
+static bool nfp_bpf_map_update_value_ok(struct bpf_verifier_env *env)
+{
+   const struct bpf_reg_state *reg1 = cur_regs(env) + BPF_REG_1;
+   const struct bpf_reg_state *reg3 = cur_regs(env) + BPF_REG_3;

Re: [PATCH bpf-next] libbpf: Per-symbol visibility for DSO

2018-10-16 Thread Alexei Starovoitov
On Mon, Oct 15, 2018 at 10:50:34PM -0700, Andrey Ignatov wrote:
> Make global symbols in the libbpf DSO hidden by default with
> -fvisibility=hidden and export symbols that are part of the ABI explicitly
> with __attribute__((visibility("default"))).
> 
> This is common practice that should prevent accidentally exporting a
> symbol that is not supposed to be part of the ABI, which, in turn,
> improves both the libbpf developer and user experience. See [1] for more
> details.
> 
> Export control becomes more important as more and more projects use
> libbpf.
> 
> The patch doesn't export a bunch of netlink related functions since as
> agreed in [2] they'll be reworked. That doesn't break bpftool since
> bpftool links libbpf statically.
> 
> [1] https://www.akkadia.org/drepper/dsohowto.pdf (2.2 Export Control)
> [2] https://www.mail-archive.com/netdev@vger.kernel.org/msg251434.html
> 
> Signed-off-by: Andrey Ignatov 

Applied, Thanks



Re: [PATCH bpf-next 0/2] nfp: bpf: improve offload checks

2018-10-16 Thread Alexei Starovoitov
On Tue, Oct 16, 2018 at 03:19:08PM -0700, Jakub Kicinski wrote:
> Hi,
> 
> this set adds checks to make sure offload behaviour is correct.
> First, when atomic counters are used, we must make sure the map
> does not already contain data we did not prepare for holding
> atomics.
> 
> The second patch double checks vNIC capabilities for program offload
> in case a program is shared by multiple vNICs with different
> constraints.

1st patch is quite a hack, but until we have proper BTF annotations
I don't see any other way to solve it.

Applied, Thanks



Re: [bpf-next PATCH] bpf: sockmap, fix skmsg recvmsg handler to track size correctly

2018-10-16 Thread Alexei Starovoitov
On Tue, Oct 16, 2018 at 10:36:01AM -0700, John Fastabend wrote:
> When converting sockmap to new skmsg generic data structures we missed
> that the recvmsg handler did not correctly use sg.size and instead was
> using individual element lengths. The result is that if a sock is closed
> with outstanding data, we omit the call to sk_mem_uncharge() and can
> get the warning below.
> 
> [   66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210
> 
> To fix this correct the redirect handler to xfer the size along with
> the scatterlist and also decrement the size from the recvmsg handler.
> Now when a sock is closed the remaining 'size' will be decremented
> with sk_mem_uncharge().
> 
> Signed-off-by: John Fastabend 

Acked-by: Alexei Starovoitov 



Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Stephen Hemminger
On Tue, 16 Oct 2018 23:17:31 +0200
Holger Hoffstätte  wrote:

> On 10/16/18 22:37, Heiner Kallweit wrote:
> > rtl_rx() and rtl_tx() are called only if the respective bits are set
> > in the interrupt status register. Under high load NAPI may not be
> > able to process all data (work_done == budget) and it will schedule
> > subsequent calls to the poll callback.
> > rtl_ack_events() however resets the bits in the interrupt status
> > register, therefore subsequent calls to rtl8169_poll() won't call
> > rtl_rx() and rtl_tx() - chip interrupts are still disabled.  
> 
> Very interesting! Could this be the reason for the mysterious
> hangs & resets we experienced when enabling BQL for r8169?
> They happened more often with TSO/GSO enabled, and several people
> attempted to fix those hangs unsuccessfully; the BQL change was later
> reverted and has stayed out since then (#87cda7cb43).
> If this bug has been there "forever" it might be tempting to
> re-apply BQL and see what happens. Any chance you could give that
> a try? I'll gladly test patches, just like I'll run this one.
> 
> cheers
> Holger

Many drivers have buggy usage of napi_complete_done.

Might even be worth forcing all network drivers to check the return
value. But fixing 150 broken drivers will be a nuisance.

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dc1d9ed33b31..c38bc66ffe74 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct *napi)
return false;
 }
 
-bool napi_complete_done(struct napi_struct *n, int work_done);
+bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
+
 /**
  * napi_complete - NAPI processing complete
  * @n: NAPI context


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Florian Fainelli
On 10/16/2018 04:03 PM, Stephen Hemminger wrote:
> On Tue, 16 Oct 2018 23:17:31 +0200
> Holger Hoffstätte  wrote:
> 
>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>>> in the interrupt status register. Under high load NAPI may not be
>>> able to process all data (work_done == budget) and it will schedule
>>> subsequent calls to the poll callback.
>>> rtl_ack_events() however resets the bits in the interrupt status
>>> register, therefore subsequent calls to rtl8169_poll() won't call
>>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.  
>>
>> Very interesting! Could this be the reason for the mysterious
>> hangs & resets we experienced when enabling BQL for r8169?
>> They happened more often with TSO/GSO enabled, and several people
>> attempted to fix those hangs unsuccessfully; the BQL change was later
>> reverted and has stayed out since then (#87cda7cb43).
>> If this bug has been there "forever" it might be tempting to
>> re-apply BQL and see what happens. Any chance you could give that
>> a try? I'll gladly test patches, just like I'll run this one.
>>
>> cheers
>> Holger
> 
> Many drivers have buggy usage of napi_complete_done.
> 
> Might even be worth forcing all network drivers to check the return
> value. But fixing 150 broken drivers will be a nuisance.

I had started doing that about a month ago in light of the ixgbe
ndo_poll_controller vs. napi problem, but have not had time to submit
that series yet:

https://github.com/ffainelli/linux/commits/napi-check

feel free to piggy back on top of that series if you would like to
address this.

> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dc1d9ed33b31..c38bc66ffe74 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct *napi)
> return false;
>  }
>  
> -bool napi_complete_done(struct napi_struct *n, int work_done);
> +bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
> +
>  /**
>   * napi_complete - NAPI processing complete
>   * @n: NAPI context
> 


-- 
Florian


Re: [PATCH bpf-next v2 3/7] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall

2018-10-16 Thread Alexei Starovoitov
On Tue, Oct 16, 2018 at 04:16:39PM -0500, Mauricio Vasquez wrote:
> 
> 
> On 10/11/2018 06:51 PM, Alexei Starovoitov wrote:
> > On Wed, Oct 10, 2018 at 05:50:01PM -0500, Mauricio Vasquez wrote:
> > > > > Does it make sense to you?
> > > > I reread the other patch, and found it does NOT use the following logic for
> > > > queue and stack:
> > > > 
> > > >  rcu_read_lock();
> > > >  ptr = map->ops->map_lookup_and_delete_elem(map, key);
> > > >  if (ptr)
> > > >  memcpy(value, ptr, value_size);
> > > > 
> > > > I guess this part is not used at all? Can we just remove it?
> > > > 
> > > > Thanks,
> > > > Song
> > > This is the base code for map_lookup_and_delete support, it is not used in
> > > queue/stack maps.
> > > 
> > > I think we can leave it there, so whoever implements lookup_and_delete
> > > for other maps doesn't have to care about implementing this part too.
> > The code looks useful to me, but I also agree with Song. And in the kernel
> > we don't typically add 'almost dead code'.
> > May be provide an implementation of the lookup_and_delete for hash map
> > so it's actually used ?
> 
> I haven't written any code but I think there is a potential problem here.
> The current lookup_and_delete returns a pointer to the element, hence
> deletion of the element should be done using call_rcu to guarantee the
> pointer is still valid after returning.
> In the hashtab, the deletion only uses call_rcu when there is no prealloc;
> otherwise the element is pushed onto the list of free elements immediately.
> If we move the logic that pushes elements onto the free list under a
> call_rcu invocation, it could happen that the free list is empty because
> the call_rcu is still pending (a similar issue to the one we had with the
> queue/stack maps when they used a pass-by-reference API).
> 
> There is another way to implement it without this issue, in syscall.c:
> l = ops->lookup(key);
> memcpy(some_buffer, l, value_size);
> ops->delete(key);
> 
> The point here is that the lookup_and_delete operation is not being used at
> all. So, is lookup + delete = lookup_and_delete? Can it be generalized?
> If this is true, then what is the sense of having a lookup_and_delete
> syscall?

I thought of the lookup_and_delete command as an atomic operation.
Only in such a case does it make sense.
Otherwise there is no point in having an additional cmd.
In case of the hash map the implementation would be similar to delete:
  raw_spin_lock_irqsave(&b->lock, flags);
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l) {
  hlist_nulls_del_rcu(&l->hash_node);
  bpf_long_memcpy(); // into temp kernel area
  free_htab_elem(htab, l);
  ret = 0;
  }
  raw_spin_unlock_irqrestore(&b->lock, flags);
  copy_to_user();

there is a chance that some other cpu is doing a lookup in parallel
and may be modifying the value, so bpf_long_memcpy() isn't fully atomic.
But the bpf side is written together with the user space side,
so the above almost-atomic lookup_and_delete is usable instead
of lookup and then delete, which races too much.
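
To make that concrete, the guts could look roughly like the following
(untested sketch; tmp is the temp kernel buffer, uvalue is the user
pointer, and it assumes htab internals where the value sits right after
the key, rounded up to 8 bytes, in struct htab_elem):

  int ret = -ENOENT;

  raw_spin_lock_irqsave(&b->lock, flags);
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l) {
  hlist_nulls_del_rcu(&l->hash_node);
  /* copy out under the bucket lock so the element can't be reused */
  bpf_long_memcpy(tmp, l->key + round_up(key_size, 8), value_size);
  free_htab_elem(htab, l);
  ret = 0;
  }
  raw_spin_unlock_irqrestore(&b->lock, flags);
  if (!ret && copy_to_user(uvalue, tmp, value_size))
  ret = -EFAULT;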

Having said that, I think it's fine to defer this new ndo for now
and leave the lookup_and_delete syscall cmd for the stack/queue maps only.



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-16 Thread David Ahern
On 10/15/18 4:33 PM, Song Liu wrote:
> I am working with Alexei on the idea of fetching BPF program information via
> BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> to perf_event_type, and dumped these events to the perf event ring buffer.
> 
> I found that perf will not process events until the end of perf-record:
> 
> root@virt-test:~# ~/perf record -ag -- sleep 10
> .. 10 seconds later
> [ perf record: Woken up 34 times to write data ]
> machine__process_bpf_event: prog_id 6 loaded
> machine__process_bpf_event: prog_id 6 unloaded
> [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> 
> In this example, the bpf program was loaded and then unloaded in
> another terminal. When machine__process_bpf_event() processes
> the load event, the bpf program is already unloaded. Therefore,
> machine__process_bpf_event() will not be able to get information
> about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> 
> To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> as soon as perf gets the event from the kernel. I looked around the perf
> code for a while. But I haven't found a good example where some
> events are processed before the end of perf-record. Could you
> please help me with this?

perf record does not process events as they are generated. Its sole job
is pushing data from the maps to a file as fast as possible, meaning in
bulk, based on current read and write locations.

Adding code to process events will add significant overhead to the
record command and will not really solve your race problem.


[PATCH net-next 2/2] tcp_bbr: centralize code to set gains

2018-10-16 Thread Neal Cardwell
Centralize the code that sets gains used for computing cwnd and pacing
rate. This simplifies the code and makes it easier to change the state
machine or (in the future) dynamically change the gain values and
ensure that the correct gain values are always used.

Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Priyaranjan Jha 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 40 ++--
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 4cc2223d2cd54..9277abdd822a0 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -521,8 +521,6 @@ static void bbr_advance_cycle_phase(struct sock *sk)
 
bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
bbr->cycle_mstamp = tp->delivered_mstamp;
-   bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT :
-   bbr_pacing_gain[bbr->cycle_idx];
 }
 
 /* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
@@ -540,8 +538,6 @@ static void bbr_reset_startup_mode(struct sock *sk)
struct bbr *bbr = inet_csk_ca(sk);
 
bbr->mode = BBR_STARTUP;
-   bbr->pacing_gain = bbr_high_gain;
-   bbr->cwnd_gain   = bbr_high_gain;
 }
 
 static void bbr_reset_probe_bw_mode(struct sock *sk)
@@ -549,8 +545,6 @@ static void bbr_reset_probe_bw_mode(struct sock *sk)
struct bbr *bbr = inet_csk_ca(sk);
 
bbr->mode = BBR_PROBE_BW;
-   bbr->pacing_gain = BBR_UNIT;
-   bbr->cwnd_gain = bbr_cwnd_gain;
bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand);
bbr_advance_cycle_phase(sk);/* flip to next phase of gain cycle */
 }
@@ -768,8 +762,6 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 
if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
bbr->mode = BBR_DRAIN;  /* drain queue we created */
-   bbr->pacing_gain = bbr_drain_gain;  /* pace slow to drain */
-   bbr->cwnd_gain = bbr_high_gain; /* maintain cwnd */
tcp_sk(sk)->snd_ssthresh =
bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
}   /* fall through to check if in-flight is already small: */
@@ -831,8 +823,6 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
!bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
-   bbr->pacing_gain = BBR_UNIT;
-   bbr->cwnd_gain = BBR_UNIT;
bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
bbr->probe_rtt_done_stamp = 0;
}
@@ -860,6 +850,35 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
bbr->idle_restart = 0;
 }
 
+static void bbr_update_gains(struct sock *sk)
+{
+   struct bbr *bbr = inet_csk_ca(sk);
+
+   switch (bbr->mode) {
+   case BBR_STARTUP:
+   bbr->pacing_gain = bbr_high_gain;
+   bbr->cwnd_gain   = bbr_high_gain;
+   break;
+   case BBR_DRAIN:
+   bbr->pacing_gain = bbr_drain_gain;  /* slow, to drain */
+   bbr->cwnd_gain   = bbr_high_gain;   /* keep cwnd */
+   break;
+   case BBR_PROBE_BW:
+   bbr->pacing_gain = (bbr->lt_use_bw ?
+   BBR_UNIT :
+   bbr_pacing_gain[bbr->cycle_idx]);
+   bbr->cwnd_gain   = bbr_cwnd_gain;
+   break;
+   case BBR_PROBE_RTT:
+   bbr->pacing_gain = BBR_UNIT;
+   bbr->cwnd_gain   = BBR_UNIT;
+   break;
+   default:
+   WARN_ONCE(1, "BBR bad mode: %u\n", bbr->mode);
+   break;
+   }
+}
+
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
 {
bbr_update_bw(sk, rs);
@@ -867,6 +886,7 @@ static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
bbr_check_full_bw_reached(sk, rs);
bbr_check_drain(sk, rs);
bbr_update_min_rtt(sk, rs);
+   bbr_update_gains(sk);
 }
 
 static void bbr_main(struct sock *sk, const struct rate_sample *rs)
-- 
2.19.1.331.ge82ca0e54c-goog



[PATCH net-next 0/2] tcp_bbr: TCP BBR changes for EDT pacing model

2018-10-16 Thread Neal Cardwell
Two small patches for TCP BBR to follow up with Eric's recent work to change
the TCP and fq pacing machinery to an "earliest departure time" (EDT) model:

- The first patch adjusts the TCP BBR logic to work with the new
  "earliest departure time" (EDT) pacing model.

- The second patch adjusts the TCP BBR logic to centralize the setting
  of gain values, to simplify the code and prepare for future changes.

Neal Cardwell (2):
  tcp_bbr: adjust TCP BBR for departure time pacing
  tcp_bbr: centralize code to set gains

 net/ipv4/tcp_bbr.c | 77 ++
 1 file changed, 65 insertions(+), 12 deletions(-)

-- 
2.19.1.331.ge82ca0e54c-goog



[PATCH net-next 1/2] tcp_bbr: adjust TCP BBR for departure time pacing

2018-10-16 Thread Neal Cardwell
Adjust TCP BBR for the new departure time pacing model in the recent
commit ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest
departure time model").

With TSQ and pacing at lower layers, there are often several skbs
queued in the pacing layer, and thus there is less data "in the
network" than "in flight".

With departure time pacing at lower layers (e.g. fq or potential
future NICs), the data in the pacing layer now has a pre-scheduled
("baked-in") departure time that cannot be changed, even if the
congestion control algorithm decides to use a new pacing rate.

This means that there can be a non-trivial lag between when BBR makes
a pacing rate change and when the inter-skb pacing delays
change. After a pacing rate change, the number of packets in the
network can gradually evolve to be higher or lower, depending on
whether the sending rate is higher or lower than the delivery
rate. Thus ignoring this lag can cause significant overshoot, with the
flow ending up with too many or too few packets in the network.

This commit changes BBR to adapt its pacing rate based on the amount
of data in the network that it estimates has already been "baked in"
by previous departure time decisions. We estimate the number of our
packets that will be in the network at the earliest departure time
(EDT) for the next skb scheduled as:

   in_network_at_edt = inflight_at_edt - (EDT - now) * bw

If we're increasing the amount of data in the network ("in_network"),
then we want to know if the transmit of the EDT skb will push
in_network above the target, so our answer includes
bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
in_network, then we want to know if in_network will sink too low just
before the EDT transmit, so our answer does not include the segments
from the skb departing at EDT.

Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case
differently? The in_network curve is a step function: in_network goes
up on transmits, and down on ACKs. To accurately predict when
in_network will go beyond our target value, this will happen on
different events, depending on whether we're concerned about
in_network potentially going too high or too low:

 o if pushing in_network up (pacing_gain > 1.0),
   then in_network goes above target upon a transmit event

 o if pushing in_network down (pacing_gain < 1.0),
   then in_network goes below target upon an ACK event

This commit changes the BBR state machine to use this estimated
"packets in network" value to make its decisions.

Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 37 +++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index b88081285fd17..4cc2223d2cd54 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -369,6 +369,39 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
return cwnd;
 }
 
+/* With pacing at lower layers, there's often less data "in the network" than
+ * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq),
+ * we often have several skbs queued in the pacing layer with a pre-scheduled
+ * earliest departure time (EDT). BBR adapts its pacing rate based on the
+ * inflight level that it estimates has already been "baked in" by previous
+ * departure time decisions. We calculate a rough estimate of the number of our
+ * packets that might be in the network at the earliest departure time for the
+ * next skb scheduled:
+ *   in_network_at_edt = inflight_at_edt - (EDT - now) * bw
+ * If we're increasing inflight, then we want to know if the transmit of the
+ * EDT skb will push inflight above the target, so inflight_at_edt includes
+ * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight,
+ * then estimate if inflight will sink too low just before the EDT transmit.
+ */
+static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   struct bbr *bbr = inet_csk_ca(sk);
+   u64 now_ns, edt_ns, interval_us;
+   u32 interval_delivered, inflight_at_edt;
+
+   now_ns = tp->tcp_clock_cache;
+   edt_ns = max(tp->tcp_wstamp_ns, now_ns);
+   interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC);
+   interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE;
+   inflight_at_edt = inflight_now;
+   if (bbr->pacing_gain > BBR_UNIT)  /* increasing inflight */
+   inflight_at_edt += bbr_tso_segs_goal(sk);  /* include EDT skb */
+   if (interval_delivered >= inflight_at_edt)
+   return 0;
+   return inflight_at_edt - interval_delivered;
+}
+
 /* An optimization in BBR to reduce losses: On the first round of recovery, we
  * follow the packet conservation principle: send P packets per P packets acked.
  * After that, we slow-start and send at most 2*P packets per P packets acked.

Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Eric Dumazet



On 10/16/2018 03:17 PM, Stephen Hemminger wrote:
> On Tue, 16 Oct 2018 22:37:31 +0200
> Heiner Kallweit  wrote:
> 
>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>> in the interrupt status register. Under high load NAPI may not be
>> able to process all data (work_done == budget) and it will schedule
>> subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status
>> register, therefore subsequent calls to rtl8169_poll() won't call
>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>
>> Fix this by calling rtl_rx() and rtl_tx() independent of the bits
>> set in the interrupt status register. Both functions will detect
>> if there's nothing to do for them.
>>
>> This issue has been there more or less forever (at least it exists in
>> 3.16 already), so I can't provide a "Fixes" tag. 
>>
>> Signed-off-by: Heiner Kallweit 
> 
> Another issue is this:
> 
>   if (work_done < budget) {
>   napi_complete_done(napi, work_done);
> 
>   rtl_irq_enable(tp, enable_mask);
>   mmiowb();
>   }
> 
>   return work_done;
> }
> 
> 
> The code needs to check the return value of napi_complete_done():
> 
> 	if (work_done < budget &&
> 	    napi_complete_done(napi, work_done)) {
> 		rtl_irq_enable(tp, enable_mask);
> 		mmiowb();
> 	}
> 
> 	return work_done;
> }
> 
> Try that, it might fix the problem and your logic would
> be unnecessary.
> 

Well, I do not believe this is related.

Testing napi_complete_done() is not mandatory, it is only an optimization [1]
and only for busy polling users.

In short, this should not matter, since busy-polling is not enabled by default.


[1] busy polling users are spinning anyway, so it is not even clear if this
   is really an optimization, unless maybe the cost of irq enabling is really 
really high...




Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Eric Dumazet



On 10/16/2018 04:03 PM, Stephen Hemminger wrote:

> Many drivers have buggy usage of napi_complete_done.
> 
> Might even be worth forcing all network drivers to check the return
> value. But fixing 150 broken drivers will be a nuisance.
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dc1d9ed33b31..c38bc66ffe74 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct *napi)
> return false;
>  }
>  
> -bool napi_complete_done(struct napi_struct *n, int work_done);
> +bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
> +
>  /**
>   * napi_complete - NAPI processing complete
>   * @n: NAPI context
> 

I disagree completely.
This never has been a requirement.


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Eric Dumazet



On 10/16/2018 04:08 PM, Florian Fainelli wrote:

> I had started doing that about a month ago in light of the ixgbe
> ndo_poll_controller vs. napi problem, but have not had time to submit
> that series yet:
> 
> https://github.com/ffainelli/linux/commits/napi-check
> 
> feel free to piggy back on top of that series if you would like to
> address this.


But the root cause was different, do you remember?

We fixed the real issue, which was netpoll's ability to stick all NIC IRQs
onto one victim CPU.




Re: [bpf-next PATCH 0/3] sockmap support for msg_peek flag

2018-10-16 Thread Daniel Borkmann
On 10/16/2018 08:41 PM, Alexei Starovoitov wrote:
> On Tue, Oct 16, 2018 at 11:07:54AM -0700, John Fastabend wrote:
>> This adds support for the MSG_PEEK flag when redirecting into an
>> ingress psock sk_msg queue.
>>
>> The first patch adds some base support to the helpers, then the
>> feature, and finally we add an option for the test suite to do
>> a duplicate MSG_PEEK call on every recv to test the feature.
>>
>> With duplicate MSG_PEEK call all tests continue to PASS.
> 
> for the set
> Acked-by: Alexei Starovoitov 

Applied to bpf-next, thanks!


Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id

2018-10-16 Thread Yonghong Song


On 10/15/18 4:12 PM, Martin Lau wrote:
> On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
>> This patch added an interface to load a program with the following
>> additional information:
>> . prog_btf_fd
>> . func_info and func_info_len
>> where func_info will provides function range and type_id
>> corresponding to each function.
>>
>> If the verifier agrees with the function ranges provided by the user,
>> the bpf_prog ksym for each function will use the func name
>> provided in the type_id, which is supposed to provide better
>> encoding as it is not limited by the 16-byte program name
>> limitation; this is better for bpf programs which contain
>> multiple subprograms.
>>
>> The bpf_prog_info interface is also extended to
>> return btf_id and jited_func_types, so user space can
>> print out the function prototype for each jited function.
> Some nits.
> 
>>
>> Signed-off-by: Yonghong Song 
>> ---
>>   include/linux/bpf.h  |  2 +
>>   include/linux/bpf_verifier.h |  1 +
>>   include/linux/btf.h  |  2 +
>>   include/uapi/linux/bpf.h | 11 +
>>   kernel/bpf/btf.c | 16 +++
>>   kernel/bpf/core.c|  9 
>>   kernel/bpf/syscall.c | 86 +++-
>>   kernel/bpf/verifier.c| 50 +
>>   8 files changed, 176 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 9b558713447f..e9c63ffa01af 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -308,6 +308,8 @@ struct bpf_prog_aux {
>>  void *security;
>>   #endif
>>  struct bpf_prog_offload *offload;
>> +struct btf *btf;
>> +u32 type_id; /* type id for this prog/func */
>>  union {
>>  struct work_struct work;
>>  struct rcu_head rcu;
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 9e8056ec20fa..e84782ec50ac 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -201,6 +201,7 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
>>   struct bpf_subprog_info {
>>  u32 start; /* insn idx of function entry point */
>>  u16 stack_depth; /* max. stack depth used by this function */
>> +u32 type_id; /* btf type_id for this subprog */
>>   };
>>   
>>   /* single container for all structs
>> diff --git a/include/linux/btf.h b/include/linux/btf.h
>> index e076c4697049..90e91b52aa90 100644
>> --- a/include/linux/btf.h
>> +++ b/include/linux/btf.h
>> @@ -46,5 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj,
>> struct seq_file *m);
>>   int btf_get_fd_by_id(u32 id);
>>   u32 btf_id(const struct btf *btf);
>> +bool is_btf_func_type(const struct btf *btf, u32 type_id);
>> +const char *btf_get_name_by_id(const struct btf *btf, u32 type_id);
>>   
>>   #endif
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index f9187b41dff6..7ebbf4f06a65 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -332,6 +332,9 @@ union bpf_attr {
>>   * (context accesses, allowed helpers, etc).
>>   */
>>  __u32   expected_attach_type;
>> +__u32   prog_btf_fd;/* fd pointing to BTF type data 
>> */
>> +__u32   func_info_len;  /* func_info length */
>> +__aligned_u64   func_info;  /* func type info */
>>  };
>>   
>>  struct { /* anonymous struct used by BPF_OBJ_* commands */
>> @@ -2585,6 +2588,9 @@ struct bpf_prog_info {
>>  __u32 nr_jited_func_lens;
>>  __aligned_u64 jited_ksyms;
>>  __aligned_u64 jited_func_lens;
>> +__u32 btf_id;
>> +__u32 nr_jited_func_types;
>> +__aligned_u64 jited_func_types;
>>   } __attribute__((aligned(8)));
>>   
>>   struct bpf_map_info {
>> @@ -2896,4 +2902,9 @@ struct bpf_flow_keys {
>>  };
>>   };
>>   
>> +struct bpf_func_info {
>> +__u32   insn_offset;
>> +__u32   type_id;
>> +};
>> +
>>   #endif /* _UAPI__LINUX_BPF_H__ */
>> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
>> index 794a185f11bf..85b8eeccddbd 100644
>> --- a/kernel/bpf/btf.c
>> +++ b/kernel/bpf/btf.c
>> @@ -486,6 +486,15 @@ static const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
>>  return btf->types[type_id];
>>   }
>>   
>> +bool is_btf_func_type(const struct btf *btf, u32 type_id)
>> +{
>> +const struct btf_type *type = btf_type_by_id(btf, type_id);
>> +
>> +if (!type || BTF_INFO_KIND(type->info) != BTF_KIND_FUNC)
>> +return false;
>> +return true;
>> +}
> Can btf_type_is_func() (from patch 2) be reused?
> The btf_type_by_id() can be done by the caller.
> I don't think it worths to add a similar helper
> for just one user for now.

Currently, btf_type_by_id() is not exposed.

bpf/btf.c:
static const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)

Re: [bpf-next PATCH] bpf: sockmap, fix skmsg recvmsg handler to track size correctly

2018-10-16 Thread Daniel Borkmann
On 10/16/2018 07:36 PM, John Fastabend wrote:
> When converting sockmap to new skmsg generic data structures we missed
> that the recvmsg handler did not correctly use sg.size and instead was
> using individual element lengths. The result is that if a sock is closed
> with outstanding data, we omit the call to sk_mem_uncharge() and can
> get the warning below.
> 
> [   66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210
> 
> To fix this correct the redirect handler to xfer the size along with
> the scatterlist and also decrement the size from the recvmsg handler.
> Now when a sock is closed the remaining 'size' will be decremented
> with sk_mem_uncharge().
> 
> Signed-off-by: John Fastabend 

Applied to bpf-next, thanks!


Re: [PATCH bpf-next v2 3/7] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall

2018-10-16 Thread Mauricio Vasquez




On 10/16/2018 06:20 PM, Alexei Starovoitov wrote:

On Tue, Oct 16, 2018 at 04:16:39PM -0500, Mauricio Vasquez wrote:


On 10/11/2018 06:51 PM, Alexei Starovoitov wrote:

On Wed, Oct 10, 2018 at 05:50:01PM -0500, Mauricio Vasquez wrote:

Does it make sense to you?

I reread the other patch, and found it does NOT use the following logic for
queue and stack:

  rcu_read_lock();
  ptr = map->ops->map_lookup_and_delete_elem(map, key);
  if (ptr)
  memcpy(value, ptr, value_size);

I guess this part is not used at all? Can we just remove it?

Thanks,
Song

This is the base code for map_lookup_and_delete support, it is not used in
queue/stack maps.

I think we can leave it there, so when somebody implements lookup_and_delete
for other maps doesn't have to care about implementing also this.

The code looks useful to me, but I also agree with Song. And in the kernel
we don't typically add 'almost dead code'.
May be provide an implementation of the lookup_and_delete for hash map
so it's actually used ?

I haven't written any code but I think there is a potential problem here.
Current lookup_and_delete returns a pointer to the element, hence deletion
of the element should be done using call_rcu to guarantee this is valid
after returning.
In the hashtab, the deletion only uses call_rcu when there is not prealloc,
otherwise the element is pushed on the list of free elements immediately.
If we move the logic to push elements into the free list under a call_rcu
invocation, it could happen that the free list is empty because the call_rcu
is still pending (a similar issue we had with the queue/stack maps when they
used a pass by reference API).

There is another way to implement it without this issue, in syscall.c:
l = ops->lookup(key);
memcpy(l, some_buffer)
ops->delete(key)

The point here is that the lookup_and_delete operation is not being used at
all, so, is lookup + delete = lookup_and_delete?, can it be generalized?
If this is true, then what is the sense of having lookup_and_delete
syscall?,

I thought of the lookup_and_delete command as an atomic operation.
Only in that case would it make sense; otherwise there is no point
in having an additional cmd.
In the case of the hash map, the implementation would be similar to delete:

	raw_spin_lock_irqsave(&b->lock, flags);
	l = lookup_elem_raw(head, hash, key, key_size);
	if (l) {
		hlist_nulls_del_rcu(&l->hash_node);
		/* copy the value into a temp kernel area before the
		 * element can be recycled */
		bpf_long_memcpy(tmp, l->key + round_up(key_size, 8),
				round_up(value_size, 8));
		free_htab_elem(htab, l);
		ret = 0;
	}
	raw_spin_unlock_irqrestore(&b->lock, flags);
	if (!ret)
		copy_to_user(uvalue, tmp, value_size);


Well, this is a new approach; currently the map operations don't have
enough info to perform the copy_to_user directly. By the way, is there
any technical reason why a double memory copy is done (from the map
value into a temp kernel buffer and then to userspace)?




there is a chance that some other cpu is doing a lookup in parallel
and may be modifying the value, so bpf_long_memcpy() isn't fully atomic.


I think we already have that case: if an eBPF program is updating the
map, a lookup from userspace could see a partially updated value.

But the bpf side is written together with the user-space side,
so the above almost-atomic lookup_and_delete is usable, whereas
a lookup followed by a delete races too much.

Having said that, I think it's fine to defer this new ndo for now
and leave the lookup_and_delete syscall cmd for the stack/queue map only.

I agree. Just one question: should we remove the "almost dead code" or
leave it there?


Thanks,
Mauricio.



Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Florian Fainelli



On 10/16/2018 5:23 PM, Eric Dumazet wrote:
> 
> 
> On 10/16/2018 04:08 PM, Florian Fainelli wrote:
> 
>> I had started doing that about a month ago in light of the ixgbe
>> ndo_poll_controller vs. napi problem, but have not had time to submit
>> that series yet:
>>
>> https://github.com/ffainelli/linux/commits/napi-check
>>
>> feel free to piggy back on top of that series if you would like to
>> address this.
> 
> 
> But the root cause was different, you remember?
> 
> We fixed the real issue with netpoll's ability to stick all NIC IRQs onto
> one victim CPU.

I do remember; after seeing the resolution I decided to leave this in a
branch and not submit it, because it was not particularly relevant
anymore.
-- 
Florian


Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO

2018-10-16 Thread Yonghong Song


On 10/15/18 3:30 PM, Daniel Borkmann wrote:
> On 10/12/2018 08:54 PM, Yonghong Song wrote:
>> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
>> support to the type section. BTF_KIND_FUNC_PROTO is used
>> to specify the type of a function pointer. With this,
>> BTF has a complete set of C types (except float).
>>
>> BTF_KIND_FUNC is used to specify the signature of a
>> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
>> by another type, e.g., a pointer type, and BTF_KIND_FUNC
>> type cannot be referenced by another type.
>>
>> For both BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO types,
>> the func return type is in t->type (where t is a
>> "struct btf_type" object). The func args are an array of
>> u32s immediately following object "t".
>>
>> As a concrete example, for the C program below,
>>$ cat test.c
>>int foo(int (*bar)(int)) { return bar(5); }
>> with latest llvm trunk built with Debug mode, we have
>>$ clang -target bpf -g -O2 -mllvm -debug-only=btf -c test.c
>>Type Table:
>>[1] FUNC name_off=1 info=0x0c01 size/type=2
>>param_type=3
>>[2] INT name_off=11 info=0x0100 size/type=4
>>desc=0x0120
>>[3] PTR name_off=0 info=0x0200 size/type=4
>>[4] FUNC_PROTO name_off=0 info=0x0d01 size/type=2
>>param_type=2
>>
>>String Table:
>>0 :
>>1 : foo
>>5 : .text
>>11 : int
>>15 : test.c
>>22 : int foo(int (*bar)(int)) { return bar(5); }
>>
>>FuncInfo Table:
>>sec_name_off=5
>>insn_offset= type_id=1
>>
>>...
>>
>> (Eventually we shall have bpftool to dump btf information
>>   like the above.)
>>
>> Function "foo" has a FUNC type (type_id = 1).
>> The parameter of "foo" has type_id 3 which is PTR->FUNC_PROTO,
>> where FUNC_PROTO refers to function pointer "bar".
> 
> Should "bar" also be part of the string table (at least at some point in
> the future)?

Yes, we can do it. The DWARF for the above example looks like:

0x0043: DW_TAG_formal_parameter
          DW_AT_location    ([0x0000, 0x0008): DW_OP_reg1 W1
                             [0x0008, 0x0018): DW_OP_reg2 W2)
   DW_AT_name("bar")
   DW_AT_decl_file   ("/home/yhs/tmp/t.c")
   DW_AT_decl_line   (1)
   DW_AT_type(0x005a "subroutine int*")

0x005a:   DW_TAG_pointer_type
 DW_AT_type  (0x005f "subroutine int")

0x005f:   DW_TAG_subroutine_type
 DW_AT_type  (0x0053 "int")
 DW_AT_prototyped(true)

0x0064: DW_TAG_formal_parameter
   DW_AT_type(0x0053 "int")

0x0069: NULL

0x006a:   NULL

The current llvm implementation does not record the func
parameter name, so "bar" got lost. We could associate
"bar" with the pointer type in a future implementation.

> Iow, if verifier hints to an issue in the program when it would for example 
> walk
> pointers and rewrite ctx access, then it could dump the var name along with 
> it.
> It might be useful as well in combination with 22 from str table, when 
> annotating
> the source. We might need support for variadic functions, though. How is LLVM
> handling the latter with the recent BTF support?

The LLVM implementation does support variadic functions.
The last type id 0 indicates a variadic function.
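
Given the layout described above (return type in t->type, param type ids
as an array of u32s right after the btf_type object), a consumer could
walk the params roughly like this (a sketch that assumes btf_type_by_id()
were exposed; the final uapi encoding may still change):

	const struct btf_type *t = btf_type_by_id(btf, func_type_id);
	const u32 *params = (const u32 *)(t + 1);  /* param ids follow t */
	u16 vlen = BTF_INFO_VLEN(t->info);
	u32 i;

	for (i = 0; i < vlen; i++) {
		/* a trailing type id of 0 marks a variadic function */
		if (i == vlen - 1 && params[i] == 0)
			break;
		/* params[i] is the type id of the i-th parameter */
	}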

> 
>> In FuncInfo Table, for section .text, the function,
>> with to-be-determined offset (marked as ),
>> has type_id=1 which refers to a FUNC type.
>> This way, the function signature is
>> available to both kernel and user space.
>> Here, the insn offset is not available during the dump time
>> as relocation is resolved pretty late in the compilation process.
>>
>> Signed-off-by: Martin KaFai Lau 
>> Signed-off-by: Yonghong Song 


Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO

2018-10-16 Thread Yonghong Song


On 10/15/18 3:36 PM, Daniel Borkmann wrote:
> On 10/12/2018 08:54 PM, Yonghong Song wrote:
> [...]
>> +static bool btf_name_valid_identifier(const struct btf *btf, u32 offset)
>> +{
>> +/* offset must be valid */
>> +const char *src = &btf->strings[offset];
>> +
>> +if (!isalpha(*src) && *src != '_')
>> +return false;
>> +
>> +src++;
>> +while (*src) {
>> +if (!isalnum(*src) && *src != '_')
>> +return false;
>> +src++;
>> +}
>> +
>> +return true;
>> +}
> 
> Should there be an upper name length limit like KSYM_NAME_LEN? (Is it implied
> by the kvmalloc() limit?)

KSYM_NAME_LEN is a good choice. Here, we check function names and
struct/union member names. In C, based on
https://stackoverflow.com/questions/2352209/max-identifier-length,
the identifier max length is 63, though compiler implementations may vary.
KSYM_NAME_LEN is 128.
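
So the check could be bounded along these lines (a sketch of one way to
cap the name length; the exact limit is still up for discussion):

	static bool btf_name_valid_identifier(const struct btf *btf, u32 offset)
	{
		/* offset must be valid */
		const char *src = &btf->strings[offset];
		const char *src_limit = src + KSYM_NAME_LEN;

		if (!isalpha(*src) && *src != '_')
			return false;

		src++;
		while (*src && src < src_limit) {
			if (!isalnum(*src) && *src != '_')
				return false;
			src++;
		}

		/* reject if we hit the cap before the NUL terminator */
		return !*src;
	}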

> 
>>   static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
>>   {
>>  if (!offset)
>> @@ -747,7 +782,9 @@ static bool env_type_is_resolve_sink(const struct 
>> btf_verifier_env *env,
>>  /* int, enum or void is a sink */
>>  return !btf_type_needs_resolve(next_type);
>>  case RESOLVE_PTR:
>> -/* int, enum, void, struct or array is a sink for ptr */
>> +/* int, enum, void, struct, array or func_proto is a sink
>> + * for ptr
>> + */
>>  return !btf_type_is_modifier(next_type) &&
>>  !btf_type_is_ptr(next_type);
>>  case RESOLVE_STRUCT_OR_ARRAY:


Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id

2018-10-16 Thread Yonghong Song


On 10/16/18 10:59 AM, Alexei Starovoitov wrote:
> On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
>> This patch added interface to load a program with the following
>> additional information:
>> . prog_btf_fd
>> . func_info and func_info_len
>> where func_info provides the function range and type_id
>> corresponding to each function.
>>
>> If the verifier agrees with the function ranges provided by the user,
>> the bpf_prog ksym for each function will use the func name from the
>> type_id, which gives a better encoding since it is not limited by the
>> 16-byte program name limitation; this helps especially for bpf
>> programs that contain multiple subprograms.
>>
>> The bpf_prog_info interface is also extended to
>> return btf_id and jited_func_types, so user space can
>> print out the function prototype for each jited function.
>>
>> Signed-off-by: Yonghong Song 
> ...
>>  BUILD_BUG_ON(sizeof("bpf_prog_") +
>>   sizeof(prog->tag) * 2 +
>> @@ -401,6 +403,13 @@ static void bpf_get_prog_name(const struct bpf_prog 
>> *prog, char *sym)
>>   
>>  sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>>  sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
>> +
>> +if (prog->aux->btf) {
>> +func_name = btf_get_name_by_id(prog->aux->btf, 
>> prog->aux->type_id);
>> +snprintf(sym, (size_t)(end - sym), "_%s", func_name);
>> +return;
> 
> Would it make sense to add a comment here that prog->aux->name is ignored
> when full btf name is available? (otherwise the same name will appear twice 
> in ksym)

Will add a comment.
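
Something like the following on top of the quoted hunk (wording to be
finalized):

	if (prog->aux->btf) {
		/* The BTF func name is complete on its own;
		 * prog->aux->name is intentionally ignored here so the
		 * same name does not appear twice in the ksym.
		 */
		func_name = btf_get_name_by_id(prog->aux->btf,
					       prog->aux->type_id);
		snprintf(sym, (size_t)(end - sym), "_%s", func_name);
		return;
	}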

> 
>> +}
>> +
>>  if (prog->aux->name[0])
>>  snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> ...
>> +static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env 
>> *env,
>> +  union bpf_attr *attr)
>> +{
>> +struct bpf_func_info *data;
>> +int i, nfuncs, ret = 0;
>> +
>> +if (!attr->func_info_len)
>> +return 0;
>> +
>> +nfuncs = attr->func_info_len / sizeof(struct bpf_func_info);
>> +if (env->subprog_cnt != nfuncs) {
>> +verbose(env, "number of funcs in func_info does not match 
>> verifier\n");
> 
> 'does not match verifier' is hard to make sense of.
> How about 'number of funcs in func_info doesn't match number of subprogs' ?

Sounds good to me.

> 
>> +return -EINVAL;
>> +}
>> +
>> +data = kvmalloc(attr->func_info_len, GFP_KERNEL | __GFP_NOWARN);
>> +if (!data) {
>> +verbose(env, "no memory to allocate attr func_info\n");
> 
> I don't think we ever print such warnings for memory allocations.
> imo this can be removed, since enomem is enough.

Okay.

> 
>> +return -ENOMEM;
>> +}
>> +
>> +if (copy_from_user(data, u64_to_user_ptr(attr->func_info),
>> +   attr->func_info_len)) {
>> +verbose(env, "memory copy error for attr func_info\n");
> 
> similar thing. kernel never warns about copy_from_user errors.

Okay.

> 
>> +ret = -EFAULT;
>> +goto cleanup;
>> +}
>> +
>> +for (i = 0; i < nfuncs; i++) {
>> +if (env->subprog_info[i].start != data[i].insn_offset) {
>> +verbose(env, "func_info subprog start (%d) does not 
>> match verifier (%d)\n",
>> +env->subprog_info[i].start, 
>> data[i].insn_offset);
> 
> I think printing exact insn offset isn't going to be much help
> for regular user to debug it. If this happens, it's likely llvm issue.
> How about 'func_info BTF section doesn't match subprog layout in BPF program' 
> ?

Okay.

> 


Re: [PATCH bpf-next 00/13] bpf: add btf func info support

2018-10-16 Thread Yonghong Song


On 10/16/18 11:27 AM, Alexei Starovoitov wrote:
> On Fri, Oct 12, 2018 at 11:54:20AM -0700, Yonghong Song wrote:
>> The BTF support was added to kernel by Commit 69b693f0aefa
>> ("bpf: btf: Introduce BPF Type Format (BTF)"), which introduced
>> .BTF section into ELF file and is primarily
>> used for map pretty print.
>> pahole is used to convert dwarf to BTF for ELF files.
>>
>> The next step would be to add func type info and debug line info
>> into BTF. For debug line info, it is desirable to encode
>> the source code directly in BTF to ease deployment and
>> introspection.
> 
> it's kind of confusing that the cover letter talks about line info as the
> next step, while these kernel-side patches only cover the full prog name
> from BTF. It certainly makes sense for llvm to do both at the same time.
> Please make the cover letter clearer.

Make sense. Will remove line_info stuff from the cover letter.

