Re: [PATCH net-next] net: stmmac: set total length of the packet to be transmitted in TDES3

2017-04-10 Thread Giuseppe CAVALLARO

Hi Niklas

patch looks ok for me, Alex any feedback?

peppe

On 4/10/2017 8:33 PM, Niklas Cassel wrote:

From: Niklas Cassel 

Field FL/TPL in register TDES3 is not correctly set on GMAC4.
TX appears to be functional on GMAC 4.10a even if this field is not set,
however, to avoid relying on undefined behavior, set the length in TDES3.

The field has a different meaning depending on if the TSE bit in TDES3
is set or not (TSO). However, regardless of the TSE bit, the field is
not optional. The field is already set correctly when the TSE bit is set.

Since there is no limit for the number of descriptors that can be
used for a single packet, the field should be set to the sum of
the buffers contained in:
[ ...  ...
], which should be equal to skb->len.

Signed-off-by: Niklas Cassel 
---
  drivers/net/ethernet/stmicro/stmmac/chain_mode.c   | 6 +++---
  drivers/net/ethernet/stmicro/stmmac/common.h   | 2 +-
  drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c | 3 ++-
  drivers/net/ethernet/stmicro/stmmac/enh_desc.c | 2 +-
  drivers/net/ethernet/stmicro/stmmac/norm_desc.c| 2 +-
  drivers/net/ethernet/stmicro/stmmac/ring_mode.c| 9 ++---
  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 5 +++--
  7 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c 
b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
index 37881f81319e..e93c40b4631e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
@@ -52,7 +52,7 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int 
csum)
tx_q->tx_skbuff_dma[entry].len = bmax;
/* do not close the descriptor and do not set own bit */
priv->hw->desc->prepare_tx_desc(desc, 1, bmax, csum, STMMAC_CHAIN_MODE,
-   0, false);
+   0, false, skb->len);
  
  	while (len != 0) {

tx_q->tx_skbuff[entry] = NULL;
@@ -70,7 +70,7 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int 
csum)
tx_q->tx_skbuff_dma[entry].len = bmax;
priv->hw->desc->prepare_tx_desc(desc, 0, bmax, csum,
STMMAC_CHAIN_MODE, 1,
-   false);
+   false, skb->len);
len -= bmax;
i++;
} else {
@@ -85,7 +85,7 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int 
csum)
/* last descriptor can be set now */
priv->hw->desc->prepare_tx_desc(desc, 0, len, csum,
STMMAC_CHAIN_MODE, 1,
-   true);
+   true, skb->len);
len = 0;
}
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 90d28bcad880..b7ce3fbb5375 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -373,7 +373,7 @@ struct stmmac_desc_ops {
/* Invoked by the xmit function to prepare the tx descriptor */
void (*prepare_tx_desc) (struct dma_desc *p, int is_fs, int len,
 bool csum_flag, int mode, bool tx_own,
-bool ls);
+bool ls, unsigned int tot_pkt_len);
void (*prepare_tso_tx_desc)(struct dma_desc *p, int is_fs, int len1,
int len2, bool tx_own, bool ls,
unsigned int tcphdrlen,
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
index 843ec69222ea..aa6476439aee 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
@@ -304,12 +304,13 @@ static void dwmac4_rd_init_tx_desc(struct dma_desc *p, 
int mode, int end)
  
  static void dwmac4_rd_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,

  bool csum_flag, int mode, bool tx_own,
- bool ls)
+ bool ls, unsigned int tot_pkt_len)
  {
unsigned int tdes3 = le32_to_cpu(p->des3);
  
  	p->des2 |= cpu_to_le32(len & TDES2_BUFFER1_SIZE_MASK);
  
+	tdes3 |= tot_pkt_len & TDES3_PACKET_SIZE_MASK;

if (is_fs)
tdes3 |= TDES3_FIRST_DESCRIPTOR;
else
diff --git a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c 
b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
index 323b59ec74a3..7546b3664113 100644
--- 

6f58284e66: BUG: kernel hang in boot stage

2017-04-10 Thread kernel test robot
Greetings,

0day kernel testing robot got the below dmesg and the first bad commit is

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master

commit 6f58284e666261162b2c95fdd8608f5e247e9a38
Merge: 7fd97bca bf74b20
Author: Stephen Rothwell <s...@canb.auug.org.au>
AuthorDate: Mon Apr 10 10:06:42 2017 +1000
Commit: Stephen Rothwell <s...@canb.auug.org.au>
CommitDate: Mon Apr 10 10:06:42 2017 +1000

Merge remote-tracking branch 'net-next/master'

7fd97bca68  Merge remote-tracking branch 'dlm/next'
bf74b20d00  Revert "rtnl: Add support for netdev event to link messages"
6f58284e66  Merge remote-tracking branch 'net-next/master'
f8c97bdb49  Add linux-next specific files for 20170410
+-++++---+
| | 7fd97bca68 | bf74b20d00 | 
6f58284e66 | next-20170410 |
+-++++---+
| boot_successes  | 42 | 38 | 0 
 | 0 |
| boot_failures   | 0  | 0  | 
17 | 17|
| BUG:kernel_hang_in_boot_stage   | 0  | 0  | 
17 | 11|
| BUG:kernel_reboot-without-warning_in_boot_stage | 0  | 0  | 0 
 | 6 |
+-++++---+

[0.00] ACPI: RSDP 0x000F6930 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x1FFE1936 30 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x1FFE180A 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)


  # HH:MM RESULT GOOD 
BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD
git bisect start f8c97bdb49832d2b0edaa0c05db873aa2f6101ff 
39da7c509acff13fc8cb12ec1bb20337c988ed36 --
git bisect  bad af6d4e29c13fd47ef3d1b4d96b7f781aa7534413  # 05:19  B  0 
3   14   0  Merge remote-tracking branch 'drm-panel/drm/panel/for-next'
git bisect good 8a41837405da3919179983b830eb648b65954797  # 05:40  G 13 
00   0  Merge remote-tracking branch 'xtensa/xtensa-for-next'
git bisect good 1635d3d77b290e99090a4e7f613009cc68531bb8  # 06:04  G 13 
00   0  Merge remote-tracking branch 'v4l-dvb/master'
git bisect  bad 89c15a058190c83af2d029fad4de33f542bcfb42  # 06:28  B  0 
2   13   0  Merge remote-tracking branch 'ipsec-next/master'
git bisect good 3d5657773c2ccdbeff13e2db374a1fd3b3e36722  # 07:06  G 12 
00   0  Merge remote-tracking branch 'thermal/next'
git bisect good 3700d2a55af503f43ae0e4595c92557bcb89046e  # 07:24  G 13 
00   0  Merge remote-tracking branch 'ieee1394/for-next'
git bisect good 7fd97bca680b4ceedebf194f8316ae6c2b60ce01  # 08:32  G 13 
00   0  Merge remote-tracking branch 'dlm/next'
git bisect  bad 6f58284e666261162b2c95fdd8608f5e247e9a38  # 08:55  B  0 
1   12   0  Merge remote-tracking branch 'net-next/master'
git bisect good 00ecfb3b34b69dd702dee1bd6de6fc100be384db  # 09:12  G 12 
00   0  netvsc: remove unnecessary lock on shutdown
git bisect good df1c631648c55bfb247339279f9bc573c7f283f4  # 09:54  G 12 
00   0  net: mpls: Limit memory allocation for mpls_route
git bisect good b404127879471c38ad13a246ce5dec156f60f828  # 10:14  G 12 
00   0  Merge branch '100GbE' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
git bisect good 15ed8a47ff0571dd268e37002511993b47e996bd  # 10:34  G 12 
00   0  qede: Add support for ingress headroom
git bisect good e8c5f7231cc03153fee1b5fcb173585354c08ee8  # 10:53  G 12 
00   0  i40e: Swap use of pf->flags and pf->hw_disabled_flags for ATR 
Eviction
git bisect good 0c264588b5de50353e4a1ce0c2521576426dd89d  # 11:15  G 13 
00   0  liquidio: fix VF incorrectly indicating that it successfully set 
its VLAN
git bisect good ca9ec0888d631c446040a7fab9985afdeb4f73f3  # 11:39  G 13 
00   0  i40e/i40evf: Add support for padding start of frames
git bisect good 417d978fa532b61b89f0c3ccbd9cdb51090ea032  # 11:59  G 13 
00   0  Merge branch 'dsa-receive-path-simplifications'
git bisect good 0492b71c42f76b6019ef5fe686a7cb253dded09c  # 12:29  G 13 
00   0  Merge branch '40GbE' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
git bisect good bf74b20d00b13919db7ae5d1015636e76f56f6ae  # 12:45  G 13 
00   0  Revert "rtnl: Add support for netdev event to link messages"
# first bad commit: [6f58284e666261162b2c95fdd8608f5e247e9a38] Merge 
remote-tracking branch 'net-next/master'
git bisect good 7fd97bca680b4ceedebf194f8316ae6c2b60ce01  # 12:50  G 38 
00   0  Merge

Re: [PATCH v2] cfg80211: Fix array-bounds warning in fragment copy

2017-04-10 Thread Johannes Berg
On Mon, 2017-04-10 at 14:36 -0700, Matthias Kaehlcke wrote:
> Ping, any feedback on this patch?

You didn't cc linux-wireless, so it fell through the cracks ... I'll
try to remember it when I merge later.

johannes


Re: [PATCH] ibmveth: Support to enable LSO/CSO for Trunk VEA.

2017-04-10 Thread Sivakumar Krishnasamy

Re-sending as my earlier response had some HTML subparts.

Let me give some background before I answer your queries.

In IBM PowerVM environment, ibmveth driver supports largesend and 
checksum offload today, but only for virtual ethernet adapters (VEA) 
which are not configured in "Trunk mode".  In trunk mode, one cannot 
enable checksum and largesend offload capabilities. Without these 
offloads enabled, the performance numbers are not good. This patch is to 
enable these offloads for "Trunk" VEAs.


The following shows a typical configuration for network packet flow, 
when VMs in the PowerVM server have their network virtualized and 
communicate to external world.


VM (ibmveth) <=> PowerVM Hypervisor <=>  PowerVM I/O Server VM 
( ibmveth in "Trunk mode" <=> OVS <=> Physical NIC ) <=>  External Network


As you can see the packets originating in VM will travel through local 
ibmveth driver and then to PowerVM Hypervisor, then it gets delivered to 
ibmveth driver configured in "Trunk" mode in I/O Server, which is then 
bridged by OVS to external network via Physical NIC.  To have largesend 
and checksum offload enabled end to end, from VM up to Physical NIC, 
ibmveth needs to support these offload capabilities when configured in 
"Trunk" mode too.


Before this patch, when a VM communicates with external network (in a 
configuration similar to above), throughput numbers were not so good 
(~1.5 Gbps) and with the patch, I see ~9.4 Gbps throughput for a 10G NIC 
(iperf used for measurements).


On 4/9/2017 12:15 AM, David Miller wrote:

From: Sivakumar Krishnasamy 
Date: Fri,  7 Apr 2017 05:57:59 -0400


Enable largesend and checksum offload for ibmveth configured in trunk mode.
Added support to SKB frag_list in TX path by skb_linearize'ing such SKBs.

Signed-off-by: Sivakumar Krishnasamy 


Why is linearization necessary?

It would seem that the gains you get from GRO are nullified by
linearizing the SKB and thus copying all the data around and
allocating buffers.

When Physical NIC has GRO enabled and when OVS bridges these packets, 
OVS vport send code will end up calling dev_queue_xmit, which in turn 
calls validate_xmit_skb.


validate_xmit_skb has the below code snippet,

if (netif_needs_gso(skb, features)) {
struct sk_buff *segs;

segs = skb_gso_segment(skb, features); <=== Segments the 
GSO packet into MTU sized segments.


When the OVS outbound vport is ibmveth, netif_needs_gso returns 
positively if the SKB has a frag_list and if the driver doesn't support 
the same (NETIF_F_FRAGLIST feature).  So all the packets received by 
ibmveth are of MSS size (or lesser) due to the above code.


On a 10G physical NIC, the maximum throughput achieved was 2.2 Gbps due 
to the above segmentation in validate_xmit_skb. With the patch to 
linearize the SKB, the throughput increased to 9 Gbps (and ibmveth 
received packets without being segmented). This is ~4X improvement even 
though we end up allocating buffers and copying data.



Finally, all of that new checksumming stuff looks extremely
suspicious.  You have to explain why that is happening and why it
isn't because this driver is doing something incorrectly.

Thanks.

We are now enabling support for OVS and improving bridging performance 
in IBM's PowerVM environment, which brings in these new offload 
requirements for ibmveth driver configured in Trunk mode.


Please let me know if you need more details.

Regards,
Siva K



Re: [PATCH iproute2 net-next 1/2] iproute: Add support for ttl-propagation attribute

2017-04-10 Thread David Ahern
On 4/10/17 8:36 AM, Robert Shearman wrote:
> @@ -1184,6 +1192,20 @@ static int iproute_modify(int cmd, unsigned int flags, 
> int argc, char **argv)
>  
>   if (rta->rta_len > RTA_LENGTH(0))
>   addraw_l(, 1024, RTA_DATA(rta), 
> RTA_PAYLOAD(rta));
> + } else if (strcmp(*argv, "ttl-propagate") == 0) {
> + __u8 ttl_prop;
> +
> + NEXT_ARG();
> + if (strcmp(*argv, "enabled") == 0)
> + ttl_prop = 1;
> + else if (strcmp(*argv, "disabled") == 0)
> + ttl_prop = 0;
> + else
> + invarg("\"ttl-propagate\" value is invalid\n",
> +*argv);
> +

matches() instead of strcmp() is more user friendly. 'enabled' is a lot
to type. ;-)


Re: Page allocator order-0 optimizations merged

2017-04-10 Thread zhong jiang
On 2017/4/10 23:10, Mel Gorman wrote:
> On Mon, Apr 10, 2017 at 10:31:48PM +0800, zhong jiang wrote:
>> Hi, Mel
>>
>>  The patch I had test on arm64. I find the great degradation. I test it 
>> by micro-bench.
>> The patrly data is as following.  and it is stable.  That stands for the 
>> allocate and free time. 
>> 
> What type of allocations is the benchmark doing? In particular, what context
> is the microbenchmark allocating from? Lastly, how did you isolate the
> patch, did you test two specific commits in mainline or are you comparing
> 4.10 with 4.11-rcX?
>
 Hi, Mel

   benchmark adopt  0 order allocation.  just insmod module  allocate  memory 
by alloc_pages.
   it is not interrupt context.  I test the patch in linux 4.1 stable. In x86 , 
it have 10% improve.
   but in arm64,  it have great degradation.  

  Thanks
  zhongjiang



[PATCH 02/10] ftgmac100: Use device "compatible" property, not machine.

2017-04-10 Thread Benjamin Herrenschmidt
We test for aspeed chips to handle a couple of special cases,
but we do that by checking the machine type which isn't right.

Instead check the actual device compatible property. This also
updates the dtsi files for the aspeed SoC to match.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  4 ++--
 arch/arm/boot/dts/aspeed-g5.dtsi |  4 ++--
 drivers/net/ethernet/faraday/ftgmac100.c | 16 +---
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index 0b4932c..6068e79 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -42,7 +42,7 @@
};
 
mac0: ethernet@1e66 {
-   compatible = "faraday,ftgmac100";
+   compatible = "aspeed,ast2400-mac", "faraday,ftgmac100";
reg = <0x1e66 0x180>;
interrupts = <2>;
no-hw-checksum;
@@ -50,7 +50,7 @@
};
 
mac1: ethernet@1e68 {
-   compatible = "faraday,ftgmac100";
+   compatible = "aspeed,ast2400-mac", "faraday,ftgmac100";
reg = <0x1e68 0x180>;
interrupts = <3>;
no-hw-checksum;
diff --git a/arch/arm/boot/dts/aspeed-g5.dtsi b/arch/arm/boot/dts/aspeed-g5.dtsi
index b664fe3..4dbe91a 100644
--- a/arch/arm/boot/dts/aspeed-g5.dtsi
+++ b/arch/arm/boot/dts/aspeed-g5.dtsi
@@ -33,7 +33,7 @@
};
 
mac0: ethernet@1e66 {
-   compatible = "faraday,ftgmac100";
+   compatible = "aspeed,ast2500-mac", "faraday,ftgmac100";
reg = <0x1e66 0x180>;
interrupts = <2>;
no-hw-checksum;
@@ -41,7 +41,7 @@
};
 
mac1: ethernet@1e68 {
-   compatible = "faraday,ftgmac100";
+   compatible = "aspeed,ast2500-mac", "faraday,ftgmac100";
reg = <0x1e68 0x180>;
interrupts = <3>;
no-hw-checksum;
diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 85b650a..5e04717 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -90,6 +90,7 @@ struct ftgmac100 {
 
/* Misc */
bool need_mac_restart;
+   bool is_aspeed;
 };
 
 static void ftgmac100_set_rx_ring_base(struct ftgmac100 *priv, dma_addr_t addr)
@@ -1311,8 +1312,7 @@ static int ftgmac100_setup_mdio(struct net_device *netdev)
if (!priv->mii_bus)
return -EIO;
 
-   if (of_machine_is_compatible("aspeed,ast2400") ||
-   of_machine_is_compatible("aspeed,ast2500")) {
+   if (priv->is_aspeed) {
/* This driver supports the old MDIO interface */
reg = ioread32(priv->base + FTGMAC100_OFFSET_REVR);
reg &= ~FTGMAC100_REVR_NEW_MDIO_INTERFACE;
@@ -1377,6 +1377,7 @@ static int ftgmac100_probe(struct platform_device *pdev)
int irq;
struct net_device *netdev;
struct ftgmac100 *priv;
+   struct device_node *np;
int err = 0;
 
if (!pdev)
@@ -1432,17 +1433,18 @@ static int ftgmac100_probe(struct platform_device *pdev)
/* MAC address from chip or random one */
ftgmac100_setup_mac(priv);
 
-   if (of_machine_is_compatible("aspeed,ast2400") ||
-   of_machine_is_compatible("aspeed,ast2500")) {
+   np = pdev->dev.of_node;
+   if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
+  of_device_is_compatible(np, "aspeed,ast2500-mac"))) {
priv->rxdes0_edorr_mask = BIT(30);
priv->txdes0_edotr_mask = BIT(30);
+   priv->is_aspeed = true;
} else {
priv->rxdes0_edorr_mask = BIT(15);
priv->txdes0_edotr_mask = BIT(15);
}
 
-   if (pdev->dev.of_node &&
-   of_get_property(pdev->dev.of_node, "use-ncsi", NULL)) {
+   if (np && of_get_property(np, "use-ncsi", NULL)) {
if (!IS_ENABLED(CONFIG_NET_NCSI)) {
dev_err(>dev, "NCSI stack not enabled\n");
goto err_ncsi_dev;
@@ -1467,7 +1469,7 @@ static int ftgmac100_probe(struct platform_device *pdev)
netdev->features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM |
NETIF_F_GRO | NETIF_F_SG;
if (priv->use_ncsi &&
-   of_get_property(pdev->dev.of_node, "no-hw-checksum", NULL))
+   of_get_property(np, "no-hw-checksum", NULL))
netdev->features &= ~NETIF_F_HW_CSUM;
 
/* register network device */
-- 
2.9.3



Re: [PATCH 00/10] ftgmac100: Rework batch 4 - Misc

2017-04-10 Thread Benjamin Herrenschmidt
On Tue, 2017-04-11 at 11:04 +1000, Benjamin Herrenschmidt wrote:
> This is the fourth batch of updates to the ftgmac100 driver.
> 
> This is a bunch of misc cleanups and fixes, such as properly
> disabling HW checksum generation on AST2400 where it's known
> to be broken and some chip init updates.
> 
> This also adds the ability to turn HW checksum on/off and
> configure the ring sizes via ethtool.
> 
> The next (and last) batch will add a few more "features" such
> as netpoll, multicast/promist, vlan offload...

FYI. This series plays a bit with the device-tree properties for
this driver. There is more in the last batch, which will also
include an patch adding 

Documentation/devicetree/bindings/net/ftgmac100.txt

documenting all this.

Cheers,
Ben.



[PATCH 10/10] ftgmac100: Set default ring sizes to 128 entries

2017-04-10 Thread Benjamin Herrenschmidt
I haven't seen any improvement above that size on the machines
I've tested with.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 2f0b73b..af11e2b 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -45,8 +45,8 @@
 #define MIN_TX_QUEUE_ENTRIES   32
 
 /* Defaults */
-#define DEF_RX_QUEUE_ENTRIES   256
-#define DEF_TX_QUEUE_ENTRIES   512
+#define DEF_RX_QUEUE_ENTRIES   128
+#define DEF_TX_QUEUE_ENTRIES   128
 
 #define MAX_PKT_SIZE   1536
 #define RX_BUF_SIZEMAX_PKT_SIZE/* must be smaller than 0x3fff 
*/
-- 
2.9.3



[PATCH 09/10] ftgmac100: Make ring sizes configurable via ethtool

2017-04-10 Thread Benjamin Herrenschmidt
We set an arbitrary max at 1024 since we pre-allocate the actual
descriptor arrays and skb arrays to the full size to keep the
code a bit simpler and avoid allocation failures in the reset
task.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 204 ++-
 1 file changed, 148 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index d4fe599..2f0b73b 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -38,8 +38,15 @@
 #define DRV_NAME   "ftgmac100"
 #define DRV_VERSION"0.7"
 
-#define RX_QUEUE_ENTRIES   256 /* must be power of 2 */
-#define TX_QUEUE_ENTRIES   512 /* must be power of 2 */
+/* Arbitrary values, I am not sure the HW has limits */
+#define MAX_RX_QUEUE_ENTRIES   1024
+#define MAX_TX_QUEUE_ENTRIES   1024
+#define MIN_RX_QUEUE_ENTRIES   32
+#define MIN_TX_QUEUE_ENTRIES   32
+
+/* Defaults */
+#define DEF_RX_QUEUE_ENTRIES   256
+#define DEF_TX_QUEUE_ENTRIES   512
 
 #define MAX_PKT_SIZE   1536
 #define RX_BUF_SIZEMAX_PKT_SIZE/* must be smaller than 0x3fff 
*/
@@ -47,30 +54,32 @@
 /* Min number of tx ring entries before stopping queue */
 #define TX_THRESHOLD   (MAX_SKB_FRAGS + 1)
 
-struct ftgmac100_descs {
-   struct ftgmac100_rxdes rxdes[RX_QUEUE_ENTRIES];
-   struct ftgmac100_txdes txdes[TX_QUEUE_ENTRIES];
-};
-
 struct ftgmac100 {
/* Registers */
struct resource *res;
void __iomem *base;
 
-   struct ftgmac100_descs *descs;
-   dma_addr_t descs_dma_addr;
-
/* Rx ring */
-   struct sk_buff *rx_skbs[RX_QUEUE_ENTRIES];
+   unsigned int rx_q_entries;
+   struct ftgmac100_rxdes *rxdes;
+   dma_addr_t rxdes_dma;
+   struct sk_buff **rx_skbs;
unsigned int rx_pointer;
u32 rxdes0_edorr_mask;
 
/* Tx ring */
-   struct sk_buff *tx_skbs[TX_QUEUE_ENTRIES];
+   unsigned int tx_q_entries;
+   struct ftgmac100_txdes *txdes;
+   dma_addr_t txdes_dma;
+   struct sk_buff **tx_skbs;
unsigned int tx_clean_pointer;
unsigned int tx_pointer;
u32 txdes0_edotr_mask;
 
+   /* Used to signal the reset task of ring change request */
+   unsigned int new_rx_q_entries;
+   unsigned int new_tx_q_entries;
+
/* Scratch page to use when rx skb alloc fails */
void *rx_scratch;
dma_addr_t rx_scratch_dma;
@@ -217,14 +226,10 @@ static void ftgmac100_init_hw(struct ftgmac100 *priv)
iowrite32(reg, priv->base + FTGMAC100_OFFSET_ISR);
 
/* Setup RX ring buffer base */
-   iowrite32(priv->descs_dma_addr +
- offsetof(struct ftgmac100_descs, rxdes),
- priv->base + FTGMAC100_OFFSET_RXR_BADR);
+   iowrite32(priv->rxdes_dma, priv->base + FTGMAC100_OFFSET_RXR_BADR);
 
/* Setup TX ring buffer base */
-   iowrite32(priv->descs_dma_addr +
- offsetof(struct ftgmac100_descs, txdes),
- priv->base + FTGMAC100_OFFSET_NPTXR_BADR);
+   iowrite32(priv->txdes_dma, priv->base + FTGMAC100_OFFSET_NPTXR_BADR);
 
/* Configure RX buffer size */
iowrite32(FTGMAC100_RBSR_SIZE(RX_BUF_SIZE),
@@ -337,7 +342,7 @@ static int ftgmac100_alloc_rx_buf(struct ftgmac100 *priv, 
unsigned int entry,
dma_wmb();
 
/* Clean status (which resets own bit) */
-   if (entry == (RX_QUEUE_ENTRIES - 1))
+   if (entry == (priv->rx_q_entries - 1))
rxdes->rxdes0 = cpu_to_le32(priv->rxdes0_edorr_mask);
else
rxdes->rxdes0 = 0;
@@ -345,9 +350,10 @@ static int ftgmac100_alloc_rx_buf(struct ftgmac100 *priv, 
unsigned int entry,
return 0;
 }
 
-static int ftgmac100_next_rx_pointer(int pointer)
+static unsigned int ftgmac100_next_rx_pointer(struct ftgmac100 *priv,
+ unsigned int pointer)
 {
-   return (pointer + 1) & (RX_QUEUE_ENTRIES - 1);
+   return (pointer + 1) & (priv->rx_q_entries - 1);
 }
 
 static void ftgmac100_rx_packet_error(struct ftgmac100 *priv, u32 status)
@@ -377,7 +383,7 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, int 
*processed)
 
/* Grab next RX descriptor */
pointer = priv->rx_pointer;
-   rxdes = >descs->rxdes[pointer];
+   rxdes = >rxdes[pointer];
 
/* Grab descriptor status */
status = le32_to_cpu(rxdes->rxdes0);
@@ -465,7 +471,7 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, int 
*processed)
 
/* Resplenish rx ring */
ftgmac100_alloc_rx_buf(priv, pointer, rxdes, GFP_ATOMIC);
-   priv->rx_pointer = ftgmac100_next_rx_pointer(pointer);
+   priv->rx_pointer = ftgmac100_next_rx_pointer(priv, pointer);
 
skb->protocol = eth_type_trans(skb, netdev);
 
@@ -484,7 +490,7 @@ static bool 

[PATCH 08/10] ftgmac100: Add more register inits in ftgmac100_init_hw()

2017-04-10 Thread Benjamin Herrenschmidt
Clear stale interrupts on entry, configure FIFO sizes, set FIFO
thresholds, configure interrupt mitigation.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 3c7a1e4..d4fe599 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -210,7 +210,11 @@ static int ftgmac100_set_mac_addr(struct net_device *dev, 
void *p)
 
 static void ftgmac100_init_hw(struct ftgmac100 *priv)
 {
+   u32 reg, rfifo_sz, tfifo_sz;
 
+   /* Clear stale interrupts */
+   reg = ioread32(priv->base + FTGMAC100_OFFSET_ISR);
+   iowrite32(reg, priv->base + FTGMAC100_OFFSET_ISR);
 
/* Setup RX ring buffer base */
iowrite32(priv->descs_dma_addr +
@@ -232,6 +236,38 @@ static void ftgmac100_init_hw(struct ftgmac100 *priv)
 
/* Write MAC address */
ftgmac100_write_mac_addr(priv, priv->netdev->dev_addr);
+
+   /* Configure descriptor sizes and increase burst sizes according
+* to values in Aspeed SDK. The FIFO arbitration is enabled and
+* the thresholds set based on the recommended values in the
+* AST2400 specification.
+*/
+   iowrite32(FTGMAC100_DBLAC_RXDES_SIZE(2) |   /* 2*8 bytes RX descs */
+ FTGMAC100_DBLAC_TXDES_SIZE(2) |   /* 2*8 bytes TX descs */
+ FTGMAC100_DBLAC_RXBURST_SIZE(3) | /* 512 bytes max RX bursts 
*/
+ FTGMAC100_DBLAC_TXBURST_SIZE(3) | /* 512 bytes max TX bursts 
*/
+ FTGMAC100_DBLAC_RX_THR_EN |   /* Enable fifo threshold 
arb */
+ FTGMAC100_DBLAC_RXFIFO_HTHR(6) |  /* 6/8 of FIFO high 
threshold */
+ FTGMAC100_DBLAC_RXFIFO_LTHR(2),   /* 2/8 of FIFO low 
threshold */
+ priv->base + FTGMAC100_OFFSET_DBLAC);
+
+   /* Interrupt mitigation configured for 1 interrupt/packet. HW interrupt
+* mitigation doesn't seem to provide any benefit with NAPI so leave
+* it at that.
+*/
+   iowrite32(FTGMAC100_ITC_RXINT_THR(1) |
+ FTGMAC100_ITC_TXINT_THR(1),
+ priv->base + FTGMAC100_OFFSET_ITC);
+
+   /* Configure FIFO sizes in the TPAFCR register */
+   reg = ioread32(priv->base + FTGMAC100_OFFSET_FEAR);
+   rfifo_sz = reg & 0x0007;
+   tfifo_sz = (reg >> 3) & 0x0007;
+   reg = ioread32(priv->base + FTGMAC100_OFFSET_TPAFCR);
+   reg &= ~0x3f00;
+   reg |= (tfifo_sz << 27);
+   reg |= (rfifo_sz << 24);
+   iowrite32(reg, priv->base + FTGMAC100_OFFSET_TPAFCR);
 }
 
 static void ftgmac100_start_hw(struct ftgmac100 *priv)
-- 
2.9.3



[PATCH 05/10] ftgmac100: Rename ftgmac100_set_mac to ftgmac100_write_mac_addr

2017-04-10 Thread Benjamin Herrenschmidt
To avoid confusion with the ndo callback and generally be
clearer about the purpose of that function

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 257243a..c2053e5 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -173,7 +173,7 @@ static int ftgmac100_reset_and_config_mac(struct ftgmac100 
*priv)
return ftgmac100_reset_mac(priv, maccr);
 }
 
-static void ftgmac100_set_mac(struct ftgmac100 *priv, const unsigned char *mac)
+static void ftgmac100_write_mac_addr(struct ftgmac100 *priv, const u8 *mac)
 {
unsigned int maddr = mac[0] << 8 | mac[1];
unsigned int laddr = mac[2] << 24 | mac[3] << 16 | mac[4] << 8 | mac[5];
@@ -226,7 +226,7 @@ static int ftgmac100_set_mac_addr(struct net_device *dev, 
void *p)
return ret;
 
eth_commit_mac_addr_change(dev, p);
-   ftgmac100_set_mac(netdev_priv(dev), dev->dev_addr);
+   ftgmac100_write_mac_addr(netdev_priv(dev), dev->dev_addr);
 
return 0;
 }
@@ -245,7 +245,7 @@ static void ftgmac100_init_hw(struct ftgmac100 *priv)
 
iowrite32(FTGMAC100_APTC_RXPOLL_CNT(1), priv->base + 
FTGMAC100_OFFSET_APTC);
 
-   ftgmac100_set_mac(priv, priv->netdev->dev_addr);
+   ftgmac100_write_mac_addr(priv, priv->netdev->dev_addr);
 }
 
 static void ftgmac100_start_hw(struct ftgmac100 *priv)
-- 
2.9.3



[PATCH 04/10] ftgmac100: Set netdev->hw_features

2017-04-10 Thread Benjamin Herrenschmidt
So features can be turned on/off via ethtool

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 3c68823..257243a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1463,14 +1463,15 @@ static int ftgmac100_probe(struct platform_device *pdev)
}
 
/* Base feature set */
-   netdev->features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM |
+   netdev->hw_features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM |
NETIF_F_GRO | NETIF_F_SG;
 
/* AST2400  doesn't have working HW checksum generation */
if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
-   netdev->features &= ~NETIF_F_HW_CSUM;
+   netdev->hw_features &= ~NETIF_F_HW_CSUM;
if (np && of_get_property(np, "no-hw-checksum", NULL))
-   netdev->features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
+   netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
+   netdev->features |= netdev->hw_features;
 
/* register network device */
err = register_netdev(netdev);
-- 
2.9.3



[PATCH 07/10] ftgmac100: Open code remaining register writes

2017-04-10 Thread Benjamin Herrenschmidt
The helpers just take space but don't provide much value. Simple
one line comments are more explanatory.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 53 
 1 file changed, 20 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index e92f382..3c7a1e4 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -93,29 +93,6 @@ struct ftgmac100 {
bool is_aspeed;
 };
 
-static void ftgmac100_set_rx_ring_base(struct ftgmac100 *priv, dma_addr_t addr)
-{
-   iowrite32(addr, priv->base + FTGMAC100_OFFSET_RXR_BADR);
-}
-
-static void ftgmac100_set_rx_buffer_size(struct ftgmac100 *priv,
-   unsigned int size)
-{
-   size = FTGMAC100_RBSR_SIZE(size);
-   iowrite32(size, priv->base + FTGMAC100_OFFSET_RBSR);
-}
-
-static void ftgmac100_set_normal_prio_tx_ring_base(struct ftgmac100 *priv,
-  dma_addr_t addr)
-{
-   iowrite32(addr, priv->base + FTGMAC100_OFFSET_NPTXR_BADR);
-}
-
-static void ftgmac100_txdma_normal_prio_start_polling(struct ftgmac100 *priv)
-{
-   iowrite32(1, priv->base + FTGMAC100_OFFSET_NPTXPD);
-}
-
 static int ftgmac100_reset_mac(struct ftgmac100 *priv, u32 maccr)
 {
struct net_device *netdev = priv->netdev;
@@ -233,18 +210,27 @@ static int ftgmac100_set_mac_addr(struct net_device *dev, 
void *p)
 
 static void ftgmac100_init_hw(struct ftgmac100 *priv)
 {
-   /* setup ring buffer base registers */
-   ftgmac100_set_rx_ring_base(priv,
-  priv->descs_dma_addr +
-  offsetof(struct ftgmac100_descs, rxdes));
-   ftgmac100_set_normal_prio_tx_ring_base(priv,
-  priv->descs_dma_addr +
-  offsetof(struct ftgmac100_descs, 
txdes));
 
-   ftgmac100_set_rx_buffer_size(priv, RX_BUF_SIZE);
 
-   iowrite32(FTGMAC100_APTC_RXPOLL_CNT(1), priv->base + 
FTGMAC100_OFFSET_APTC);
+   /* Setup RX ring buffer base */
+   iowrite32(priv->descs_dma_addr +
+ offsetof(struct ftgmac100_descs, rxdes),
+ priv->base + FTGMAC100_OFFSET_RXR_BADR);
 
+   /* Setup TX ring buffer base */
+   iowrite32(priv->descs_dma_addr +
+ offsetof(struct ftgmac100_descs, txdes),
+ priv->base + FTGMAC100_OFFSET_NPTXR_BADR);
+
+   /* Configure RX buffer size */
+   iowrite32(FTGMAC100_RBSR_SIZE(RX_BUF_SIZE),
+ priv->base + FTGMAC100_OFFSET_RBSR);
+
+   /* Set RX descriptor autopoll */
+   iowrite32(FTGMAC100_APTC_RXPOLL_CNT(1),
+ priv->base + FTGMAC100_OFFSET_APTC);
+
+   /* Write MAC address */
ftgmac100_write_mac_addr(priv, priv->netdev->dev_addr);
 }
 
@@ -695,7 +681,8 @@ static int ftgmac100_hard_start_xmit(struct sk_buff *skb,
netif_wake_queue(netdev);
}
 
-   ftgmac100_txdma_normal_prio_start_polling(priv);
+   /* Poke transmitter to read the updated TX descriptors */
+   iowrite32(1, priv->base + FTGMAC100_OFFSET_NPTXPD);
 
return NETDEV_TX_OK;
 
-- 
2.9.3



[PATCH 06/10] ftgmac100: Rename ftgmac100_setup_mac to ftgmac100_initial_mac

2017-04-10 Thread Benjamin Herrenschmidt
To remove more confusion. This function is about obtaining the
initial MAC address at driver probe time.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index c2053e5..e92f382 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -182,7 +182,7 @@ static void ftgmac100_write_mac_addr(struct ftgmac100 
*priv, const u8 *mac)
iowrite32(laddr, priv->base + FTGMAC100_OFFSET_MAC_LADR);
 }
 
-static void ftgmac100_setup_mac(struct ftgmac100 *priv)
+static void ftgmac100_initial_mac(struct ftgmac100 *priv)
 {
u8 mac[ETH_ALEN];
unsigned int m;
@@ -1431,7 +1431,7 @@ static int ftgmac100_probe(struct platform_device *pdev)
netdev->irq = irq;
 
/* MAC address from chip or random one */
-   ftgmac100_setup_mac(priv);
+   ftgmac100_initial_mac(priv);
 
np = pdev->dev.of_node;
if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
-- 
2.9.3



[PATCH 03/10] ftgmac100: Disable HW checksum generation on AST2400, enable on others

2017-04-10 Thread Benjamin Herrenschmidt
We found out that HW checksum generation only works from AST2500
onward. This disables it on AST2400 and removes the "no-hw-checksum"
properties in the device-trees. The problem we had wasn't related
to NC-SI.

Also rework the logic testing for that property so it can be used
to disable HW checksum generation and checking regardless of whether
NC-SI is used or not in case other variants out there need this.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  2 --
 arch/arm/boot/dts/aspeed-g5.dtsi |  2 --
 drivers/net/ethernet/faraday/ftgmac100.c | 12 ++--
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index 6068e79..c79c937 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -45,7 +45,6 @@
compatible = "aspeed,ast2400-mac", "faraday,ftgmac100";
reg = <0x1e66 0x180>;
interrupts = <2>;
-   no-hw-checksum;
status = "disabled";
};
 
@@ -53,7 +52,6 @@
compatible = "aspeed,ast2400-mac", "faraday,ftgmac100";
reg = <0x1e68 0x180>;
interrupts = <3>;
-   no-hw-checksum;
status = "disabled";
};
 
diff --git a/arch/arm/boot/dts/aspeed-g5.dtsi b/arch/arm/boot/dts/aspeed-g5.dtsi
index 4dbe91a..b659663 100644
--- a/arch/arm/boot/dts/aspeed-g5.dtsi
+++ b/arch/arm/boot/dts/aspeed-g5.dtsi
@@ -36,7 +36,6 @@
compatible = "aspeed,ast2500-mac", "faraday,ftgmac100";
reg = <0x1e66 0x180>;
interrupts = <2>;
-   no-hw-checksum;
status = "disabled";
};
 
@@ -44,7 +43,6 @@
compatible = "aspeed,ast2500-mac", "faraday,ftgmac100";
reg = <0x1e68 0x180>;
interrupts = <3>;
-   no-hw-checksum;
status = "disabled";
};
 
diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 5e04717..3c68823 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1462,15 +1462,15 @@ static int ftgmac100_probe(struct platform_device *pdev)
goto err_setup_mdio;
}
 
-   /* We have to disable on-chip IP checksum functionality
-* when NCSI is enabled on the interface. It doesn't work
-* in that case.
-*/
+   /* Base feature set */
netdev->features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM |
NETIF_F_GRO | NETIF_F_SG;
-   if (priv->use_ncsi &&
-   of_get_property(np, "no-hw-checksum", NULL))
+
+   /* AST2400  doesn't have working HW checksum generation */
+   if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
netdev->features &= ~NETIF_F_HW_CSUM;
+   if (np && of_get_property(np, "no-hw-checksum", NULL))
+   netdev->features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
 
/* register network device */
err = register_netdev(netdev);
-- 
2.9.3



[PATCH 01/10] ftgmac100: Upgrade to NETIF_F_HW_CSUM

2017-04-10 Thread Benjamin Herrenschmidt
The documentation describes NETIF_F_IP_CSUM as deprecated
so let's switch to NETIF_F_HW_CSUM and use the helper to
handle unhandled protocols.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 98b8956..85b650a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -637,7 +637,8 @@ static int ftgmac100_hard_start_xmit(struct sk_buff *skb,
csum_vlan |= FTGMAC100_TXDES1_TCP_CHKSUM;
else if (ip_proto == IPPROTO_UDP)
csum_vlan |= FTGMAC100_TXDES1_UDP_CHKSUM;
-   }
+   } else if (skb_checksum_help(skb))
+   goto drop;
}
txdes->txdes1 = cpu_to_le32(csum_vlan);
 
@@ -1463,11 +1464,11 @@ static int ftgmac100_probe(struct platform_device *pdev)
 * when NCSI is enabled on the interface. It doesn't work
 * in that case.
 */
-   netdev->features = NETIF_F_RXCSUM | NETIF_F_IP_CSUM |
+   netdev->features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM |
NETIF_F_GRO | NETIF_F_SG;
if (priv->use_ncsi &&
of_get_property(pdev->dev.of_node, "no-hw-checksum", NULL))
-   netdev->features &= ~NETIF_F_IP_CSUM;
+   netdev->features &= ~NETIF_F_HW_CSUM;
 
/* register network device */
err = register_netdev(netdev);
-- 
2.9.3



[PATCH 00/10] ftgmac100: Rework batch 4 - Misc

2017-04-10 Thread Benjamin Herrenschmidt
This is the fourth batch of updates to the ftgmac100 driver.

This is a bunch of misc cleanups and fixes, such as properly
disabling HW checksum generation on AST2400 where it's known
to be broken and some chip init updates.

This also adds the ability to turn HW checksum on/off and
configure the ring sizes via ethtool.

The next (and last) batch will add a few more "features" such
as netpoll, multicast/promist, vlan offload...



Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread David Miller
From: Michael Chan 
Date: Mon, 10 Apr 2017 14:47:04 -0700

> On Mon, Apr 10, 2017 at 2:30 PM, Andy Gospodarek  wrote:
>> On Mon, Apr 10, 2017 at 03:28:54PM -0400, David Miller wrote:
>>> From: Andy Gospodarek 
>>> Date: Mon, 10 Apr 2017 14:39:35 -0400
>>>
>>> > As promised, I did some testing today with bnxt_en's implementation
>>> > of XDP and this one.
>>>
>>> Thanks a lot Andy, obviously the patch needs some more work.
>>>
>>> I noticed GRO stuff in your profile, and Alexei mentioned this earlier
>>> today.  We probably should elide GRO if generic XDP is attached, since
>>> in particular this makes the skb_linearize() really expensive.
>>
>> Good catch -- I actually thought we were disabling GRO automatically and it
>> looks like we are not.  :-/  I'll send Michael a patch.
> 
> Andy,  I think we only need to disable GRO if we are doing generic
> XDP.  Optimized XDP can still use GRO for the XDP_PASS case.

Right.


Re: [PATCH] net: phy: micrel: KSZ9031: fix crash when statistic requested

2017-04-10 Thread Florian Fainelli
On 04/10/2017 04:33 PM, Grygorii Strashko wrote:
> Now the command:
>   ethtool --phy-statistics eth0
> will cause system crash with meassage "Unable to handle kernel NULL pointer
> dereference at virtual address 0010" from:
> 
>  (kszphy_get_stats) from [] (ethtool_get_phy_stats+0xd8/0x210)
>  (ethtool_get_phy_stats) from [] (dev_ethtool+0x5b8/0x228c)
>  (dev_ethtool) from [] (dev_ioctl+0x3fc/0x964)
>  (dev_ioctl) from [] (sock_ioctl+0x170/0x2c0)
>  (sock_ioctl) from [] (do_vfs_ioctl+0xa8/0x95c)
>  (do_vfs_ioctl) from [] (SyS_ioctl+0x3c/0x64)
>  (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x44)
> 
> The reason: phy_driver structure for KSZ9031 phy has no .probe() callback
> defined. As result, struct phy_device *phydev->priv pointer will not be
> initializes (null).

This is a strange way to fix the problem, presumably this PHY supports
fetching statistics, if that is the case it sounds like we would want to
sort of provide two probe function:

- one which is just allocating the PHY device's private structure so we
have enough room for statistics
- another one which is doing all the reference clock fetching and so on

By adding a NULL pointer check here, you'd be better off just removing
all the function pointers pertaining to ethtool statistics.

> 
> Fix it by adding additional checks for !phydev->priv in
> kszphy_get_stats(), kszphy_get_strings() and kszphy_get_sset_count()
> 
> Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
> Signed-off-by: Grygorii Strashko 
> ---
>  drivers/net/phy/micrel.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
> index 6742070..8dbc1be 100644
> --- a/drivers/net/phy/micrel.c
> +++ b/drivers/net/phy/micrel.c
> @@ -574,7 +574,6 @@ static int ksz9031_config_init(struct phy_device *phydev)
>   MII_KSZ9031RN_TX_DATA_PAD_SKEW, 4,
>   tx_data_skews, 4);
>   }
> -
>   return ksz9031_center_flp_timing(phydev);
>  }
>  
> @@ -654,6 +653,9 @@ ksz9021_wr_mmd_phyreg(struct phy_device *phydev, int 
> ptrad, int devnum,
>  
>  static int kszphy_get_sset_count(struct phy_device *phydev)
>  {
> + if (!phydev->priv)
> + return -EOPNOTSUPP;
> +
>   return ARRAY_SIZE(kszphy_hw_stats);
>  }
>  
> @@ -661,6 +663,9 @@ static void kszphy_get_strings(struct phy_device *phydev, 
> u8 *data)
>  {
>   int i;
>  
> + if (!phydev->priv)
> + return;
> +
>   for (i = 0; i < ARRAY_SIZE(kszphy_hw_stats); i++) {
>   memcpy(data + i * ETH_GSTRING_LEN,
>  kszphy_hw_stats[i].string, ETH_GSTRING_LEN);
> @@ -694,6 +699,9 @@ static void kszphy_get_stats(struct phy_device *phydev,
>  {
>   int i;
>  
> + if (!phydev->priv)
> + return;
> +
>   for (i = 0; i < ARRAY_SIZE(kszphy_hw_stats); i++)
>   data[i] = kszphy_get_stat(phydev, i);
>  }
> 


-- 
Florian


[PATCH] net: phy: micrel: KSZ9031: fix crash when statistic requested

2017-04-10 Thread Grygorii Strashko
Now the command:
ethtool --phy-statistics eth0
will cause system crash with meassage "Unable to handle kernel NULL pointer
dereference at virtual address 0010" from:

 (kszphy_get_stats) from [] (ethtool_get_phy_stats+0xd8/0x210)
 (ethtool_get_phy_stats) from [] (dev_ethtool+0x5b8/0x228c)
 (dev_ethtool) from [] (dev_ioctl+0x3fc/0x964)
 (dev_ioctl) from [] (sock_ioctl+0x170/0x2c0)
 (sock_ioctl) from [] (do_vfs_ioctl+0xa8/0x95c)
 (do_vfs_ioctl) from [] (SyS_ioctl+0x3c/0x64)
 (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x44)

The reason: phy_driver structure for KSZ9031 phy has no .probe() callback
defined. As result, struct phy_device *phydev->priv pointer will not be
initializes (null).

Fix it by adding additional checks for !phydev->priv in
kszphy_get_stats(), kszphy_get_strings() and kszphy_get_sset_count()

Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
Signed-off-by: Grygorii Strashko 
---
 drivers/net/phy/micrel.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index 6742070..8dbc1be 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -574,7 +574,6 @@ static int ksz9031_config_init(struct phy_device *phydev)
MII_KSZ9031RN_TX_DATA_PAD_SKEW, 4,
tx_data_skews, 4);
}
-
return ksz9031_center_flp_timing(phydev);
 }
 
@@ -654,6 +653,9 @@ ksz9021_wr_mmd_phyreg(struct phy_device *phydev, int ptrad, 
int devnum,
 
 static int kszphy_get_sset_count(struct phy_device *phydev)
 {
+   if (!phydev->priv)
+   return -EOPNOTSUPP;
+
return ARRAY_SIZE(kszphy_hw_stats);
 }
 
@@ -661,6 +663,9 @@ static void kszphy_get_strings(struct phy_device *phydev, 
u8 *data)
 {
int i;
 
+   if (!phydev->priv)
+   return;
+
for (i = 0; i < ARRAY_SIZE(kszphy_hw_stats); i++) {
memcpy(data + i * ETH_GSTRING_LEN,
   kszphy_hw_stats[i].string, ETH_GSTRING_LEN);
@@ -694,6 +699,9 @@ static void kszphy_get_stats(struct phy_device *phydev,
 {
int i;
 
+   if (!phydev->priv)
+   return;
+
for (i = 0; i < ARRAY_SIZE(kszphy_hw_stats); i++)
data[i] = kszphy_get_stat(phydev, i);
 }
-- 
2.10.1



Re: net/ipv6: use-after-free in ipv6_sock_ac_close

2017-04-10 Thread Cong Wang
On Mon, Apr 10, 2017 at 7:28 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 39da7c509acff13fc8cb12ec1bb20337c988ed36 (4.11-rc6).
>
> Unfortunately it's not reproducible.
>
> ==
> BUG: KASAN: use-after-free in ipv6_sock_ac_close+0x3d6/0x3f0
> net/ipv6/anycast.c:190 at addr 88005d6ce020
> Read of size 8 by task syz-executor6/18308
> CPU: 0 PID: 18308 Comm: syz-executor6 Not tainted 4.11.0-rc6+ #206
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x292/0x398 lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>  print_address_description mm/kasan/report.c:202 [inline]
>  kasan_report_error mm/kasan/report.c:291 [inline]
>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:368
>  ipv6_sock_ac_close+0x3d6/0x3f0 net/ipv6/anycast.c:190
>  inet6_release+0x48/0x70 net/ipv6/af_inet6.c:430
>  sock_release+0x8d/0x1e0 net/socket.c:597
>  sock_close+0x16/0x20 net/socket.c:1072
>  __fput+0x332/0x7f0 fs/file_table.c:209
>  fput+0x15/0x20 fs/file_table.c:245
>  task_work_run+0x19b/0x270 kernel/task_work.c:116
>  exit_task_work include/linux/task_work.h:21 [inline]
>  do_exit+0x18b6/0x2830 kernel/exit.c:878
>  do_group_exit+0x149/0x420 kernel/exit.c:982
>  get_signal+0x76d/0x17e0 kernel/signal.c:2318
>  do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808
>  exit_to_usermode_loop+0x170/0x200 arch/x86/entry/common.c:157
>  prepare_exit_to_usermode arch/x86/entry/common.c:191 [inline]
>  syscall_return_slowpath+0x3d3/0x420 arch/x86/entry/common.c:260
>  entry_SYSCALL_64_fastpath+0xc0/0xc2


I don't see how this is possible except when we have a loop in the
singly linked list np->ipv6_ac_list, it is protected by RTNL lock anyway.

The insertion and removal code looks correct too. But I may miss
something too obvious...

Thanks!

> RIP: 0033:0x4458d9
> RSP: 002b:7f59304e1cf8 EFLAGS: 0246 ORIG_RAX: 00ca
> RAX:  RBX: 00708218 RCX: 004458d9
> RDX:  RSI:  RDI: 00708218
> RBP: 007081f8 R08:  R09: 
> R10:  R11: 0246 R12: 
> R13:  R14: 7f59304e29c0 R15: 7f59304e2700
> Object at 88005d6ce008, in cache kmalloc-32 size: 32
> Allocated:
> PID = 18270
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>  __kmalloc+0xa0/0x2d0 mm/slub.c:3745
>  kmalloc include/linux/slab.h:495 [inline]
>  sock_kmalloc+0x11e/0x1b0 net/core/sock.c:1804
>  ipv6_sock_ac_join+0x21f/0x7b0 net/ipv6/anycast.c:72
>  do_ipv6_setsockopt.isra.7+0x2a77/0x3bc0 net/ipv6/ipv6_sockglue.c:655
>  ipv6_setsockopt+0x9b/0x140 net/ipv6/ipv6_sockglue.c:919
>  sctp_setsockopt+0x2b4/0x5f10 net/sctp/socket.c:3909
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2750
>  SYSC_setsockopt net/socket.c:1798 [inline]
>  SyS_setsockopt+0x270/0x3a0 net/socket.c:1777
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Freed:
> PID = 18308
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>  slab_free_hook mm/slub.c:1357 [inline]
>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>  slab_free mm/slub.c:2961 [inline]
>  kfree+0xe8/0x2b0 mm/slub.c:3882
>  __sock_kfree_s net/core/sock.c:1825 [inline]
>  sock_kfree_s+0x29/0x70 net/core/sock.c:1831
>  ipv6_sock_ac_close+0x2d0/0x3f0 net/ipv6/anycast.c:198
>  inet6_release+0x48/0x70 net/ipv6/af_inet6.c:430
>  sock_release+0x8d/0x1e0 net/socket.c:597
>  sock_close+0x16/0x20 net/socket.c:1072
>  __fput+0x332/0x7f0 fs/file_table.c:209
>  fput+0x15/0x20 fs/file_table.c:245
>  task_work_run+0x19b/0x270 kernel/task_work.c:116
>  exit_task_work include/linux/task_work.h:21 [inline]
>  do_exit+0x18b6/0x2830 kernel/exit.c:878
>  do_group_exit+0x149/0x420 kernel/exit.c:982
>  get_signal+0x76d/0x17e0 kernel/signal.c:2318
>  do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808
>  exit_to_usermode_loop+0x170/0x200 arch/x86/entry/common.c:157
>  prepare_exit_to_usermode arch/x86/entry/common.c:191 [inline]
>  syscall_return_slowpath+0x3d3/0x420 arch/x86/entry/common.c:260
>  entry_SYSCALL_64_fastpath+0xc0/0xc2
> Memory state around the buggy address:
>  88005d6cdf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  88005d6cdf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>88005d6ce000: fc fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
>^
>  88005d6ce080: fc fc fc fc fc fc fc 

Re: net/ipv4: use-after-free in ip_check_mc_rcu

2017-04-10 Thread Cong Wang
On Mon, Apr 10, 2017 at 7:33 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 39da7c509acff13fc8cb12ec1bb20337c988ed36 (4.11-rc6).
>
> Unfortunately it's not reproducible.
>
> BUG: KASAN: use-after-free in ip_check_mc_rcu+0x805/0x8b0
> net/ipv4/igmp.c:2645 at addr 880065b21cb8
> Read of size 8 by task syz-executor1/2843
> CPU: 3 PID: 2843 Comm: syz-executor1 Not tainted 4.11.0-rc6+ #206
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x292/0x398 lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>  print_address_description mm/kasan/report.c:202 [inline]
>  kasan_report_error mm/kasan/report.c:291 [inline]
>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:368
>  ip_check_mc_rcu+0x805/0x8b0 net/ipv4/igmp.c:2645
>  __mkroute_output net/ipv4/route.c:2103 [inline]
>  __ip_route_output_key_hash+0x241d/0x2ea0 net/ipv4/route.c:2375
>  __ip_route_output_key include/net/route.h:122 [inline]
>  ip_route_output_flow+0x29/0xa0 net/ipv4/route.c:2461
>  udp_sendmsg+0x1565/0x2cd0 net/ipv4/udp.c:1015
>  inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  SYSC_sendto+0x660/0x810 net/socket.c:1696
>  SyS_sendto+0x40/0x50 net/socket.c:1664
>  entry_SYSCALL_64_fastpath+0x1f/0xc2

The singly linked list pmc->sources is probably supposed to be
protected by RCU, but we don't use kfree_rcu() to free it. I think
this is probably why we have this bug. Perhaps it is time now
to convert it to a doubly linked list which is much easier for RCU.

I will work on a patch.

Thanks!

> RIP: 0033:0x4458d9
> RSP: 002b:7f030b49bb58 EFLAGS: 0282 ORIG_RAX: 002c
> RAX: ffda RBX: 0005 RCX: 004458d9
> RDX: 1000 RSI: 20004000 RDI: 0005
> RBP: 006e2fc0 R08: 2000a000 R09: 0010
> R10:  R11: 0282 R12: 00708000
> R13: 0005 R14: 2ff3 R15: 20006ffc
> Object at 880065b21cb8, in cache kmalloc-64 size: 64
> Allocated:
> PID = 2841
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>  kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745
>  kmalloc include/linux/slab.h:490 [inline]
>  kzalloc include/linux/slab.h:663 [inline]
>  ip_mc_add1_src net/ipv4/igmp.c:1909 [inline]
>  ip_mc_add_src+0xbea/0x1020 net/ipv4/igmp.c:2033
>  ip_mc_msfilter+0x5e5/0xcf0 net/ipv4/igmp.c:2403
>  do_ip_setsockopt.isra.12+0x2a9f/0x36c0 net/ipv4/ip_sockglue.c:967
>  ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
>  raw_setsockopt+0xb7/0xd0 net/ipv4/raw.c:839
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2750
>  SYSC_setsockopt net/socket.c:1798 [inline]
>  SyS_setsockopt+0x270/0x3a0 net/socket.c:1777
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Freed:
> PID = 2841
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>  slab_free_hook mm/slub.c:1357 [inline]
>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>  slab_free mm/slub.c:2961 [inline]
>  kfree+0xe8/0x2b0 mm/slub.c:3882
>  ip_mc_del1_src+0x842/0x9d0 net/ipv4/igmp.c:1822
>  ip_mc_del_src+0x59a/0xbe0 net/ipv4/igmp.c:1863
>  ip_mc_leave_src+0x15d/0x320 net/ipv4/igmp.c:2155
>  ip_mc_drop_socket+0x113/0x230 net/ipv4/igmp.c:2607
>  inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:411
>  sock_release+0x8d/0x1e0 net/socket.c:597
>  sock_close+0x16/0x20 net/socket.c:1072
>  __fput+0x332/0x7f0 fs/file_table.c:209
>  fput+0x15/0x20 fs/file_table.c:245
>  task_work_run+0x19b/0x270 kernel/task_work.c:116
>  exit_task_work include/linux/task_work.h:21 [inline]
>  do_exit+0x18b6/0x2830 kernel/exit.c:878
>  do_group_exit+0x149/0x420 kernel/exit.c:982
>  get_signal+0x76d/0x17e0 kernel/signal.c:2318
>  do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808
>  exit_to_usermode_loop+0x170/0x200 arch/x86/entry/common.c:157
>  prepare_exit_to_usermode arch/x86/entry/common.c:191 [inline]
>  syscall_return_slowpath+0x3d3/0x420 arch/x86/entry/common.c:260
>  entry_SYSCALL_64_fastpath+0xc0/0xc2
> Memory state around the buggy address:
>  880065b21b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  880065b21c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>880065b21c80: fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb fc
> ^
>  880065b21d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  880065b21d80: fc fc fc fc fc fc fc fc fc fc fc fc fc 

[PATCH] p54: Prevent from dereferencing null pointer when releasing SKB

2017-04-10 Thread Myungho Jung
Added NULL check to make __dev_kfree_skb_irq consistent with kfree
family of functions.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289

Signed-off-by: Myungho Jung 
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7869ae3..22be2a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2450,6 +2450,9 @@ void __dev_kfree_skb_irq(struct sk_buff *skb, enum 
skb_free_reason reason)
 {
unsigned long flags;
 
+   if (unlikely(!skb))
+   return;
+
if (likely(atomic_read(>users) == 1)) {
smp_rmb();
atomic_set(>users, 0);
-- 
2.7.4



Re: WARN_ON running XDP on virtio net device

2017-04-10 Thread David Ahern
On 4/10/17 3:37 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 10, 2017 at 03:21:53PM -0600, David Ahern wrote:
>> I'm hitting a WARN_ON running XDP with virtio net:
> I just sent a pull request. Would appreciate reports on whether
> it helps.
> 

it does.


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Michael Chan
On Mon, Apr 10, 2017 at 2:30 PM, Andy Gospodarek  wrote:
> On Mon, Apr 10, 2017 at 03:28:54PM -0400, David Miller wrote:
>> From: Andy Gospodarek 
>> Date: Mon, 10 Apr 2017 14:39:35 -0400
>>
>> > As promised, I did some testing today with bnxt_en's implementation
>> > of XDP and this one.
>>
>> Thanks a lot Andy, obviously the patch needs some more work.
>>
>> I noticed GRO stuff in your profile, and Alexei mentioned this earlier
>> today.  We probably should elide GRO if generic XDP is attached, since
>> in particular this makes the skb_linearize() really expensive.
>
> Good catch -- I actually thought we were disabling GRO automatically and it
> looks like we are not.  :-/  I'll send Michael a patch.

Andy,  I think we only need to disable GRO if we are doing generic
XDP.  Optimized XDP can still use GRO for the XDP_PASS case.

>
> Disabling GRO allows me to process an additional 1Mpps, so I'm up to 7.5Mpps
> with this patch.


[RFC net-next] of: mdio: Honor hints from MDIO bus drivers

2017-04-10 Thread Florian Fainelli
A MDIO bus driver can set phy_mask to indicate which PHYs should be
probed and which should not. Right now, of_mdiobus_register() always
sets mdio->phy_mask to ~0 which means: don't probe anything yourself,
and let the Device Tree scanning do it based on the availability of
child nodes.

When MDIO buses are stacked together (on purpose, as is done by DSA), we
run into possible double probing which is, at best unnecessary, and at
worse, can cause problems if that's not expected (e.g: during probe
deferral).

Fix this by remember the original mdio->phy_mask, and make sure that if
it was set to all 0xF, we set it to zero internally in order not to
influence how the child PHY/MDIO device registration is going to behave.
When the original mdio->phy_mask is set to something non-zero, we honor
this value and utilize it as a hint to register only the child nodes
that we have both found, and indicated to be necessary.

Signed-off-by: Florian Fainelli 
---
Sending this as RFC because a quick look at the current tree makes
me think we are fine, but I would appreciate some review/feedback
before this gets merged.

Thank you!

 drivers/of/of_mdio.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 0b2979816dbf..6bfbf00623cb 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -209,6 +209,7 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
device_node *np)
 {
struct device_node *child;
bool scanphys = false;
+   u32 orig_phy_mask;
int addr, rc;
 
/* Do not continue if the node is disabled */
@@ -217,8 +218,15 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
device_node *np)
 
/* Mask out all PHYs from auto probing.  Instead the PHYs listed in
 * the device tree are populated after the bus has been registered */
+   orig_phy_mask = mdio->phy_mask;
mdio->phy_mask = ~0;
 
+   /* If the original phy_mask was all 0xf, we make it zero here in order
+* to get child Device Tree nodes to be probed successfully
+*/
+   if (orig_phy_mask == mdio->phy_mask)
+   orig_phy_mask = 0;
+
mdio->dev.of_node = np;
 
/* Register the MDIO bus */
@@ -234,6 +242,10 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
device_node *np)
continue;
}
 
+   /* Honor hints from the mdio bus */
+   if (orig_phy_mask & BIT(addr))
+   continue;
+
if (of_mdiobus_child_is_phy(child))
of_mdiobus_register_phy(mdio, child, addr);
else
-- 
2.9.3



Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Andy Gospodarek
On Mon, Apr 10, 2017 at 10:12:42PM +0200, Daniel Borkmann wrote:
> On 04/10/2017 08:39 PM, Andy Gospodarek wrote:
> [...]
> > I ran this on a desktop-class system I have (i7-6700 CPU @ 3.40GHz)
> > and used pktgen_sample03_burst_single_flow.sh from another system to
> > throw ~6.5Mpps as a single UDP stream towards the system running XDP.
> > 
> > I just commented out the ndo_xdp op in bnxt.c to test this patch.  The
> > differences were pretty dramatic.
> > 
> > Using the ndo_xdp ndo in bnxt_en, my perf report output looked like
> > this (not sure why X is still running on this system!):
> > 
> >  38.08%  swapper  [sysimgblt]   [k] 
> > 0x5bd0
> >  11.80%  swapper  [kernel.vmlinux]  [k] intel_idle
> >  10.49%  swapper  [bnxt_en] [k] bnxt_rx_pkt
> >   6.31%  swapper  [bnxt_en] [k] bnxt_rx_xdp
> >   5.64%  swapper  [bnxt_en] [k] bnxt_poll
> >   4.22%  swapper  [kernel.vmlinux]  [k] poll_idle
> >   3.46%  swapper  [kernel.vmlinux]  [k] 
> > irq_entries_start
> >   2.95%  swapper  [kernel.vmlinux]  [k] 
> > napi_complete_done
> >   1.79%  swapper  [kernel.vmlinux]  [k] 
> > cpuidle_enter_state
> >   1.53%  swapper  [kernel.vmlinux]  [k] menu_select
> >   1.19%  swapper  [bnxt_en] [k] 
> > bnxt_reuse_rx_data
> >   1.00%  swapper  [sysimgblt]   [k] 
> > 0x5c6f
> >   0.92%  swapper  [kernel.vmlinux]  [k] 
> > __next_timer_interrupt
> >   0.71%  swapper  [kernel.vmlinux]  [k] 
> > _raw_spin_lock_irqsave
> >   0.71%  swapper  [kernel.vmlinux]  [k] 
> > bpf_map_lookup_elem
> > 
> > mpstat reports that the CPU receiving and dropping the traffic is
> > basically running idle.  Dropping this amount of traffic in the driver
> > has very little impact on the system.
> > 
> > With v2 of this patch I see the following from perf report:
> > 
> >  19.69%  ksoftirqd/3  [kernel.vmlinux]  [k] memcpy_erms
> >  16.30%  ksoftirqd/3  [kernel.vmlinux]  [k] 
> > __bpf_prog_run
> 
> Forgot echo 1 > /proc/sys/net/core/bpf_jit_enable? Was it disabled in both 
> cases?
> 

Good catch.  This must have been a run I did with it off.

Results with bpf_git_enable=1 and gro off, still only 7.5Mpps.

26.34%  ksoftirqd/5  [kernel.vmlinux]  [k] memcpy_erms
14.79%  ksoftirqd/5  [bnxt_en] [k] bnxt_rx_pkt
10.11%  ksoftirqd/5  [kernel.vmlinux]  [k] __build_skb
 5.01%  ksoftirqd/5  [kernel.vmlinux]  [k] page_frag_free
 4.66%  ksoftirqd/5  [kernel.vmlinux]  [k] kmem_cache_alloc
 4.19%  ksoftirqd/5  [kernel.vmlinux]  [k] kmem_cache_free
 3.67%  ksoftirqd/5  [bnxt_en] [k] bnxt_poll
 2.97%  ksoftirqd/5  [kernel.vmlinux]  [k] 
netif_receive_skb_internal
 2.24%  ksoftirqd/5  [kernel.vmlinux]  [k] __napi_alloc_skb
 1.92%  ksoftirqd/5  [kernel.vmlinux]  [k] eth_type_trans
 1.78%  ksoftirqd/5  [bnxt_en] [k] bnxt_rx_xdp
 1.62%  ksoftirqd/5  [kernel.vmlinux]  [k] net_rx_action
 1.61%  ksoftirqd/5  [kernel.vmlinux]  [k] kfree_skb
 1.58%  ksoftirqd/5  [kernel.vmlinux]  [k] dev_gro_receive
 1.51%  ksoftirqd/5  [bnxt_en] [k] 
bnxt_reuse_rx_data
 1.44%  ksoftirqd/5  [kernel.vmlinux]  [k] skb_release_data
 1.42%  ksoftirqd/5  [kernel.vmlinux]  [k] page_frag_alloc
 1.29%  ksoftirqd/5  [kernel.vmlinux]  [k] napi_gro_receive
 1.21%  ksoftirqd/5  [kernel.vmlinux]  [k] 
skb_release_head_state
 0.82%  ksoftirqd/5  [kernel.vmlinux]  [k] skb_free_head
 0.81%  ksoftirqd/5  [kernel.vmlinux]  [k] 
swiotlb_sync_single
 0.77%  ksoftirqd/5  [kernel.vmlinux]  [k] 
skb_gro_reset_offset
 0.76%  ksoftirqd/5  [kernel.vmlinux]  [k] skb_release_all
 0.73%  ksoftirqd/5  [kernel.vmlinux]  [k] kfree_skbmem




> >  10.11%  ksoftirqd/3  [bnxt_en] [k] bnxt_rx_pkt
> >   7.69%  ksoftirqd/3  [kernel.vmlinux]  [k] __build_skb
> >   4.25%  ksoftirqd/3  [kernel.vmlinux]  [k] 
> > inet_gro_receive
> >   3.74%  ksoftirqd/3  [kernel.vmlinux]  [k] 
> > kmem_cache_alloc
> >   3.53%  ksoftirqd/3  [kernel.vmlinux]  [k] 
> > dev_gro_receive
> >   3.43%  ksoftirqd/3  [kernel.vmlinux]  [k] 
> > page_frag_free
> >   3.12% 

Re: [PATCH] p54: add null pointer check before releasing socket buffer

2017-04-10 Thread Eric Dumazet
On Mon, Apr 10, 2017 at 2:22 PM, Christian Lamparter
 wrote:

> Well, the patch could be as simple as this:
> ---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7869ae3837ca..44f7d5a1c67c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2450,6 +2450,9 @@ void __dev_kfree_skb_irq(struct sk_buff *skb, enum 
> skb_free_reason reason)
>  {
> unsigned long flags;
>
> +   if (!skb)
> +   return;
> +
> if (likely(atomic_read(>users) == 1)) {
> smp_rmb();
> atomic_set(>users, 0);
> ---
>
> The question is: would David or Eric support the change. Any comments,
> what's the prefered solution? Just patch __dev_kfree_skb_irq to make
> it consistent with *kfree*, or patch the driver? I'm fine either way,
> but I would prefere patching __dev_kfree_skb_irq.

This is fine, same check happens in consume_skb()


Re: WARN_ON running XDP on virtio net device

2017-04-10 Thread Michael S. Tsirkin
On Mon, Apr 10, 2017 at 03:21:53PM -0600, David Ahern wrote:
> I'm hitting a WARN_ON running XDP with virtio net:

I just sent a pull request. Would appreciate reports on whether
it helps.

> [  177.185570] [ cut here ]
> [  177.187250] WARNING: CPU: 0 PID: 880 at
> /home/dsa/kernel.git/drivers/pci/msi.c:1251 pci_irq_vector+0x92/0x123
> [  177.190932] Modules linked in: 8021q garp mrp stp llc vrf
> [  177.193473] CPU: 0 PID: 880 Comm: xdp1 Not tainted 4.11.0-rc5+ #305
> [  177.195640] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [  177.199190] Call Trace:
> [  177.200072]  dump_stack+0x81/0xb6
> [  177.201018]  __warn+0x107/0x122
> [  177.201899]  warn_slowpath_null+0x1d/0x1f
> [  177.203014]  pci_irq_vector+0x92/0x123
> [  177.204082]  vp_synchronize_vectors+0x6f/0x81
> [  177.205293]  vp_reset+0x50/0x55
> [  177.206174]  virtnet_xdp+0x346/0x6dd
> [  177.207188]  dev_change_xdp_fd+0x12a/0x151
> [  177.208256]  do_setlink+0x1288/0x130a
> 
> Seems to be something introduced in the past month or so. 2 attempts to
> bisect it pointed to this commit:
> 
> commit 54d7989f476ca57fc3c5cc71524c480ccb74c481
> Merge: 0f221a3102bb c4baad50297d
> Author: Linus Torvalds 
> Date:   Thu Mar 2 13:53:13 2017 -0800
> 
> Merge tag 'for_linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
> 
> Pull vhost updates from Michael Tsirkin:
>  "virtio, vhost: optimizations, fixes


[PULL] vhost: cleanups and fixes

2017-04-10 Thread Michael S. Tsirkin
These changes were as in linux-next as e1c287efcd4cf688564ed1112d032b5dac29b159 
-
I tweaked the commit log slightly.

Too many people were complaining of warnings and errors in virtio
so I decided revert is the safest path forward right now.

The following changes since commit a71c9a1c779f2499fb2afc0553e543f18aff6edf:

  Linux 4.11-rc5 (2017-04-02 17:23:54 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 2f8dc3a01f1cf8ed17b1e295812ad12b688be5d3:

  virtio-pci: Remove affinity hint before freeing the interrupt (2017-04-11 
00:30:20 +0300)


virtio: oops fixes

virtio pci rework using shared interrupts caused a lot of issues. We
tried to fix them but run out of time. Revert for now, and revisit the
issue for the next kernel.

Luckily we are able to do this without loosing automatic
interrupt NUMA affinity which was the main motivator for the
rework.

Signed-off-by: Michael S. Tsirkin 


Cornelia Huck (1):
  MAINTAINERS: fix virtio file pattern

Marc Zyngier (1):
  virtio-pci: Remove affinity hint before freeing the interrupt

Michael S. Tsirkin (9):
  virtio_net: enable big packets for large MTU values
  virtio: allow drivers to validate features
  virtio_net: clear MTU when out of range
  virtio_console: fix uninitialized variable use
  Revert "virtio_pci: fix out of bound access for msix_names"
  Revert "virtio_pci: simplify MSI-X setup"
  Revert "virtio_pci: don't duplicate the msix_enable flag in struct 
pci_dev"
  Revert "virtio_pci: use shared interrupts for virtqueues"
  Revert "virtio_pci: remove struct virtio_pci_vq_info"

 MAINTAINERS|   2 +-
 drivers/char/virtio_console.c  |   6 +-
 drivers/net/virtio_net.c   |  45 +++--
 drivers/virtio/virtio.c|   6 +
 drivers/virtio/virtio_pci_common.c | 375 ++---
 drivers/virtio/virtio_pci_common.h |  43 -
 drivers/virtio/virtio_pci_legacy.c |   8 +-
 drivers/virtio/virtio_pci_modern.c |   8 +-
 include/linux/virtio.h |   1 +
 include/uapi/linux/virtio_pci.h|   2 +-
 10 files changed, 323 insertions(+), 173 deletions(-)


Re: [PATCH v2] cfg80211: Fix array-bounds warning in fragment copy

2017-04-10 Thread Matthias Kaehlcke
El Mon, Mar 27, 2017 at 12:58:22PM -0700 Matthias Kaehlcke ha dit:

> __ieee80211_amsdu_copy_frag intentionally initializes a pointer to
> array[-1] to increment it later to valid values. clang rightfully
> generates an array-bounds warning on the initialization statement.
> 
> Initialize the pointer to array[0] and change the algorithm from
> increment before to increment after consume.
> 
> Signed-off-by: Matthias Kaehlcke 
> ---
>  net/wireless/util.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/wireless/util.c b/net/wireless/util.c
> index 68e5f2ecee1a..52795ae5337f 100644
> --- a/net/wireless/util.c
> +++ b/net/wireless/util.c
> @@ -659,7 +659,7 @@ __ieee80211_amsdu_copy_frag(struct sk_buff *skb, struct 
> sk_buff *frame,
>   int offset, int len)
>  {
>   struct skb_shared_info *sh = skb_shinfo(skb);
> - const skb_frag_t *frag = >frags[-1];
> + const skb_frag_t *frag = >frags[0];
>   struct page *frag_page;
>   void *frag_ptr;
>   int frag_len, frag_size;
> @@ -672,10 +672,10 @@ __ieee80211_amsdu_copy_frag(struct sk_buff *skb, struct 
> sk_buff *frame,
>  
>   while (offset >= frag_size) {
>   offset -= frag_size;
> - frag++;
>   frag_page = skb_frag_page(frag);
>   frag_ptr = skb_frag_address(frag);
>   frag_size = skb_frag_size(frag);
> + frag++;
>   }
>  
>   frag_ptr += offset;
> @@ -687,12 +687,12 @@ __ieee80211_amsdu_copy_frag(struct sk_buff *skb, struct 
> sk_buff *frame,
>   len -= cur_len;
>  
>   while (len > 0) {
> - frag++;
>   frag_len = skb_frag_size(frag);
>   cur_len = min(len, frag_len);
>   __frame_add_frag(frame, skb_frag_page(frag),
>skb_frag_address(frag), cur_len, frag_len);
>   len -= cur_len;
> + frag++;
>   }
>  }
>  

Ping, any feedback on this patch?

Thanks

Matthias


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Andy Gospodarek
On Mon, Apr 10, 2017 at 03:34:56PM -0400, David Miller wrote:
> From: Andy Gospodarek 
> Date: Mon, 10 Apr 2017 14:39:35 -0400
> 
> > I also noted that...
> > 
> > xdp1 not parsing the protocol correctly.   All traffic was showing up as
> > protocol '0' instead of '17' in the output.  
> > 
> > xdp2 was not actually sending the frames back out when using this patch.
> > 
> > I'll take a look in a bit and see if I can track these two issues down.
> 
> Read Alexei's earlier posting, he spotted the bugs you are seeing
> here.
> 
> Basically, the MAC header is pulled already so we have to push it back
> before we run XDP over it.  We also have to be similarly careful with
> the MAC header for XDP_TX cases.

Yep, I saw those emails.  I just didn't want to be accused of not running all
the tests.  :-D 


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Andy Gospodarek
On Mon, Apr 10, 2017 at 03:28:54PM -0400, David Miller wrote:
> From: Andy Gospodarek 
> Date: Mon, 10 Apr 2017 14:39:35 -0400
> 
> > As promised, I did some testing today with bnxt_en's implementation
> > of XDP and this one.
> 
> Thanks a lot Andy, obviously the patch needs some more work.
> 
> I noticed GRO stuff in your profile, and Alexei mentioned this earlier
> today.  We probably should elide GRO if generic XDP is attached, since
> in particular this makes the skb_linearize() really expensive.

Good catch -- I actually thought we were disabling GRO automatically and it
looks like we are not.  :-/  I'll send Michael a patch.  

Disabling GRO allows me to process an additional 1Mpps, so I'm up to 7.5Mpps
with this patch.


Darlehen angebot 3 %

2017-04-10 Thread Frau SCHMIDT


Sehr geehrte Damen  und Herren,

Haben Sie Interesse über einer finanziellen Darlehen zu 3%?
kontaktieren Sie mich für mehr Details und Bedingungen. ich kann all 
jenen helfen, wer ein Darlehen benötigen.

Ich kann Ihnen biete ein darlehen in hohe von 10.000.000 EUR
Meine mail: info@rschmidt.online

Mit freundlichen Grüßen

Frau SCHMIDT


Re: [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

2017-04-10 Thread Andrew Morton
On Mon, 10 Apr 2017 16:08:21 +0100 Mel Gorman  
wrote:

> IRQ context were excluded from using the Per-Cpu-Pages (PCP) lists caching
> of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use per-cpu
> allocator for irq-safe requests").
> 
> This unfortunately also included excluded SoftIRQ.  This hurt the performance
> for the use-case of refilling DMA RX rings in softirq context.

Out of curiosity: by how much did it "hurt"?



Tariq found:

: I disabled the page-cache (recycle) mechanism to stress the page
: allocator, and see a drastic degradation in BW, from 47.5 G in v4.10 to
: 31.4 G in v4.11-rc1 (34% drop).

then with this patch he found

: It looks very good!  I get line-rate (94Gbits/sec) with 8 streams, in
: comparison to less than 55Gbits/sec before.

Can I take this to mean that the page allocator's per-cpu-pages feature
ended up doubling the performance of this driver?  Better than the
driver's private page recycling?  I'd like to believe that, but am
having trouble doing so ;)

> This patch re-allow softirq context, which should be safe by disabling
> BH/softirq, while accessing the list.  PCP-lists access from both hard-IRQ
> and NMI context must not be allowed.  Peter Zijlstra says in_nmi() code
> never access the page allocator, thus it should be sufficient to only test
> for !in_irq().
> 
> One concern with this change is adding a BH (enable) scheduling point at
> both PCP alloc and free. If further concerns are highlighted by this patch,
> the result wiill be to revert 374ad05ab64d and try again at a later date
> to offset the irq enable/disable overhead.


Re: WARN_ON running XDP on virtio net device

2017-04-10 Thread Michael S. Tsirkin
On Mon, Apr 10, 2017 at 03:21:53PM -0600, David Ahern wrote:
> I'm hitting a WARN_ON running XDP with virtio net:
> 
> [  177.185570] [ cut here ]
> [  177.187250] WARNING: CPU: 0 PID: 880 at
> /home/dsa/kernel.git/drivers/pci/msi.c:1251 pci_irq_vector+0x92/0x123
> [  177.190932] Modules linked in: 8021q garp mrp stp llc vrf
> [  177.193473] CPU: 0 PID: 880 Comm: xdp1 Not tainted 4.11.0-rc5+ #305
> [  177.195640] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [  177.199190] Call Trace:
> [  177.200072]  dump_stack+0x81/0xb6
> [  177.201018]  __warn+0x107/0x122
> [  177.201899]  warn_slowpath_null+0x1d/0x1f
> [  177.203014]  pci_irq_vector+0x92/0x123
> [  177.204082]  vp_synchronize_vectors+0x6f/0x81
> [  177.205293]  vp_reset+0x50/0x55
> [  177.206174]  virtnet_xdp+0x346/0x6dd
> [  177.207188]  dev_change_xdp_fd+0x12a/0x151
> [  177.208256]  do_setlink+0x1288/0x130a
> 
> Seems to be something introduced in the past month or so. 2 attempts to
> bisect it pointed to this commit:
> 
> commit 54d7989f476ca57fc3c5cc71524c480ccb74c481
> Merge: 0f221a3102bb c4baad50297d
> Author: Linus Torvalds 
> Date:   Thu Mar 2 13:53:13 2017 -0800
> 
> Merge tag 'for_linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
> 
> Pull vhost updates from Michael Tsirkin:
>  "virtio, vhost: optimizations, fixes

I'm preparing a pull request reverting some changes we made
there. I'd be thankful if you can check whether

git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost linux-next

help with this.


Thanks

-- 
MST



WARN_ON running XDP on virtio net device

2017-04-10 Thread David Ahern
I'm hitting a WARN_ON running XDP with virtio net:

[  177.185570] [ cut here ]
[  177.187250] WARNING: CPU: 0 PID: 880 at
/home/dsa/kernel.git/drivers/pci/msi.c:1251 pci_irq_vector+0x92/0x123
[  177.190932] Modules linked in: 8021q garp mrp stp llc vrf
[  177.193473] CPU: 0 PID: 880 Comm: xdp1 Not tainted 4.11.0-rc5+ #305
[  177.195640] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[  177.199190] Call Trace:
[  177.200072]  dump_stack+0x81/0xb6
[  177.201018]  __warn+0x107/0x122
[  177.201899]  warn_slowpath_null+0x1d/0x1f
[  177.203014]  pci_irq_vector+0x92/0x123
[  177.204082]  vp_synchronize_vectors+0x6f/0x81
[  177.205293]  vp_reset+0x50/0x55
[  177.206174]  virtnet_xdp+0x346/0x6dd
[  177.207188]  dev_change_xdp_fd+0x12a/0x151
[  177.208256]  do_setlink+0x1288/0x130a

Seems to be something introduced in the past month or so. 2 attempts to
bisect it pointed to this commit:

commit 54d7989f476ca57fc3c5cc71524c480ccb74c481
Merge: 0f221a3102bb c4baad50297d
Author: Linus Torvalds 
Date:   Thu Mar 2 13:53:13 2017 -0800

Merge tag 'for_linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull vhost updates from Michael Tsirkin:
 "virtio, vhost: optimizations, fixes


Re: [PATCH] p54: add null pointer check before releasing socket buffer

2017-04-10 Thread Christian Lamparter
On Monday, April 10, 2017 1:54:14 PM CEST Myungho Jung wrote:
> On Mon, Apr 10, 2017 at 02:12:54PM +0200, Christian Lamparter wrote:
> > On Sunday, April 9, 2017 10:23:20 PM CEST Myungho Jung wrote:
> > > Kernel panic is caused by trying to dereference null pointer. Check if
> > > the pointer is null before freeing space.
> >  [...]
> > As for adding if (!skb) checks. I think kfree, kfree_skb, dev_kfree_skb 
> > (aka consume_skb) all check for null pointers already. So the logical
> > thing to do would be to make dev_kfree_skb_irq (which would also fix
> > dev_kfree_skb_any) consistent with kfree, kfree_skb, dev_kfree_skb and
> > add the check there.
> > > ---
> > >  drivers/net/wireless/intersil/p54/txrx.c | 8 ++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/wireless/intersil/p54/txrx.c 
> > > b/drivers/net/wireless/intersil/p54/txrx.c
> > > index 1af7da0..8956061 100644
> > > --- a/drivers/net/wireless/intersil/p54/txrx.c
> > > +++ b/drivers/net/wireless/intersil/p54/txrx.c
> > > @@ -503,7 +503,9 @@ static void p54_rx_eeprom_readback(struct p54_common 
> > > *priv,
> > >  
> > >   priv->eeprom = NULL;
> > >   tmp = p54_find_and_unlink_skb(priv, hdr->req_id);
> > > - dev_kfree_skb_any(tmp);
> > > + if (unlikely(!tmp))
> > > + dev_kfree_skb_any(tmp);
> > > +
> > >   complete(>eeprom_comp);
> > >  }
> > >  
> > > @@ -597,7 +599,9 @@ static void p54_rx_stats(struct p54_common *priv, 
> > > struct sk_buff *skb)
> > >   }
> > >  
> > >   tmp = p54_find_and_unlink_skb(priv, hdr->req_id);
> > > - dev_kfree_skb_any(tmp);
> > > + if (unlikely(!tmp))
> > > + dev_kfree_skb_any(tmp);
> > > +
> > >   complete(>stat_comp);
> > >  }
> > >  
> > > 
> > 
> > 
> [...] I'm not sure it actually caused kernel panic but guessed from
> a bug report [https://bugzilla.kernel.org/show_bug.cgi?id=195289].
Thank you for bringing this to my attention. Reading the bugreport, it
does sound like there's a bigger issue with the USB Ports. I'll see if
this can be fixed. But it does sound like a hardware issue at this 
point.

> And correct fix will be like this:
>   if (likely(tmp))
>   dev_kfree_skb_any(tmp);
> 
> But, like you said, I think null pointer should be checked in
> dev_kfree_skb_irq although already checking before calling it in many
> other places. I'll try another patch. Thank you for your advice.

Well, the patch could be as simple as this:
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 7869ae3837ca..44f7d5a1c67c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2450,6 +2450,9 @@ void __dev_kfree_skb_irq(struct sk_buff *skb, enum 
skb_free_reason reason)
 {
unsigned long flags;
 
+   if (!skb)
+   return;
+
if (likely(atomic_read(>users) == 1)) {
smp_rmb();
atomic_set(>users, 0);
---

The question is: would David or Eric support the change. Any comments,
what's the prefered solution? Just patch __dev_kfree_skb_irq to make
it consistent with *kfree*, or patch the driver? I'm fine either way,
but I would prefere patching __dev_kfree_skb_irq.

Regards,
Christian


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Daniel Borkmann

On 04/09/2017 10:35 PM, David Miller wrote:
[...]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cc07c3b..f8ff49c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1892,6 +1892,7 @@ struct net_device {
struct lock_class_key   *qdisc_tx_busylock;
struct lock_class_key   *qdisc_running_key;
boolproto_down;
+   struct bpf_prog __rcu   *xdp_prog;


Should this be moved rather near other rx members for locality?
My config has last member of rx group in a new cacheline, which
is really waste ...

[...]
/* --- cacheline 14 boundary (896 bytes) --- */
struct hlist_node  index_hlist;  /*   89616 */

/* XXX 48 bytes hole, try to pack */
[...]

... but given this layout can change significantly for a given
config, perhaps could be a union with something exotic that is
currently always built in like ax25_ptr; would then require to
depend on ax25 not being built, though. :/ Otoh, we potentially
have to linearize the skb in data path, which is slow anyway,
so probably okay as is, too.

[...]

@@ -4247,6 +4248,83 @@ static int __netif_receive_skb(struct sk_buff *skb)
return ret;
  }

+static struct static_key generic_xdp_needed __read_mostly;
+
+static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
+{
+   struct bpf_prog *new = xdp->prog;
+   int ret = 0;
+
+   switch (xdp->command) {
+   case XDP_SETUP_PROG: {
+   struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
+
+   rcu_assign_pointer(dev->xdp_prog, new);
+   if (old)
+   bpf_prog_put(old);
+
+   if (old && !new)
+   static_key_slow_dec(_xdp_needed);
+   else if (new && !old)
+   static_key_slow_inc(_xdp_needed);
+   break;
+   }
+
+   case XDP_QUERY_PROG:
+   xdp->prog_attached = !!rcu_access_pointer(dev->xdp_prog);
+   break;


Would be nice to extend this here, so that in the current 'ip link'
output we could indicate next to the 'xdp' flag that this does /not/
run natively in the driver.

Also, when the user loads a prog f.e. via 'ip link set dev em1 xdp obj prog.o',
we should add a flag where it could be specified that running non-natively
/is/ okay. Thus that the rtnl bits bail out if there's no ndo_xdp callback.
I could imagine that this could be overseen quickly and people wonder why
performance is significantly less than usually expected.


+
+   default:
+   ret = -EINVAL;
+   break;
+   }
+
+   return ret;
+}
+
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+struct bpf_prog *xdp_prog)
+{
+   struct xdp_buff xdp;
+   u32 act = XDP_DROP;
+   void *orig_data;
+   int hlen, off;
+
+   if (skb_linearize(skb))
+   goto do_drop;
+
+   hlen = skb_headlen(skb);
+   xdp.data = skb->data;
+   xdp.data_end = xdp.data + hlen;
+   xdp.data_hard_start = xdp.data - skb_headroom(skb);
+   orig_data = xdp.data;


Wondering regarding XDP_PACKET_HEADROOM requirement,
presumably we would need to guarantee enough headroom
here as well, so that there's no difference to a native
driver implementation.


+   act = bpf_prog_run_xdp(xdp_prog, );
+
+   off = xdp.data - orig_data;
+   if (off)
+   __skb_push(skb, off);
+
+   switch (act) {
+   case XDP_PASS:
+   case XDP_TX:
+   break;
+
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   /* fall through */
+   case XDP_ABORTED:
+   trace_xdp_exception(skb->dev, xdp_prog, act);
+   /* fall through */
+   case XDP_DROP:
+   do_drop:
+   kfree_skb(skb);
+   break;
+   }
+
+   return act;
+}


Re: net: use-after-free in __ns_get_path

2017-04-10 Thread Cong Wang
On Mon, Apr 10, 2017 at 7:37 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 39da7c509acff13fc8cb12ec1bb20337c988ed36 (4.11-rc6).
>
> Unfortunately it's not reproducible.
>
> ==
> BUG: KASAN: use-after-free in __read_once_size
> include/linux/compiler.h:254 [inline] at addr 880059ce6590
> BUG: KASAN: use-after-free in atomic_read
> arch/x86/include/asm/atomic.h:26 [inline] at addr 880059ce6590
> BUG: KASAN: use-after-free in virt_spin_lock
> arch/x86/include/asm/qspinlock.h:62 [inline] at addr 880059ce6590
> BUG: KASAN: use-after-free in queued_spin_lock_slowpath+0xb0a/0xfd0
> kernel/locking/qspinlock.c:421 at addr 880059ce6590
> Read of size 4 by task syz-executor6/567
> CPU: 1 PID: 567 Comm: syz-executor6 Not tainted 4.11.0-rc6+ #206
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x292/0x398 lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>  print_address_description mm/kasan/report.c:202 [inline]
>  kasan_report_error mm/kasan/report.c:291 [inline]
>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
>  __read_once_size include/linux/compiler.h:254 [inline]
>  atomic_read arch/x86/include/asm/atomic.h:26 [inline]
>  virt_spin_lock arch/x86/include/asm/qspinlock.h:62 [inline]
>  queued_spin_lock_slowpath+0xb0a/0xfd0 kernel/locking/qspinlock.c:421
>  queued_spin_lock include/asm-generic/qspinlock.h:103 [inline]
>  do_raw_spin_lock+0x151/0x1e0 kernel/locking/spinlock_debug.c:113
>  __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
>  _raw_spin_lock+0x32/0x40 kernel/locking/spinlock.c:151
>  spin_lock include/linux/spinlock.h:299 [inline]
>  lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
>  __ns_get_path+0x197/0x860 fs/nsfs.c:66

This is almost not related to net, it is almost a pure fs bug.
So let's Cc Al.

I think the following one-line fix should work, since
ns->stashed is protected by RCU on the fast path, but
I am not brave enough to add a new dcache API for it. ;)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 1656843..ecfd3d1 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -90,6 +90,7 @@ static void *__ns_get_path(struct path *path, struct
ns_common *ns)
iput(inode);
return ERR_PTR(-ENOMEM);
}
+   dentry->d_flags |= DCACHE_RCUACCESS;
d_instantiate(dentry, inode);
dentry->d_fsdata = (void *)ns->ops;
d = atomic_long_cmpxchg(>stashed, 0, (unsigned long)dentry);



>  open_related_ns+0xda/0x200 fs/nsfs.c:143
>  sock_ioctl+0x39d/0x440 net/socket.c:1001
>  vfs_ioctl fs/ioctl.c:45 [inline]
>  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
>  SYSC_ioctl fs/ioctl.c:700 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> RIP: 0033:0x4458d9
> RSP: 002b:7f1c9259eb58 EFLAGS: 0286 ORIG_RAX: 0010
> RAX: ffda RBX: 0016 RCX: 004458d9
> RDX: 2090affc RSI: 894c RDI: 0016
> RBP: 006e1c50 R08:  R09: 
> R10:  R11: 0286 R12: 00708150
> R13:  R14: 7f1c9259f9c0 R15: 7f1c9259f700
> Object at 880059ce6510, in cache dentry size: 288
> Allocated:
> PID = 565
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
>  slab_post_alloc_hook mm/slab.h:456 [inline]
>  slab_alloc_node mm/slub.c:2718 [inline]
>  slab_alloc mm/slub.c:2726 [inline]
>  kmem_cache_alloc+0x1af/0x250 mm/slub.c:2731
>  __d_alloc+0xb3/0xbd0 fs/dcache.c:1571
>  d_alloc_pseudo+0x1d/0x30 fs/dcache.c:1692
>  __ns_get_path+0x3e8/0x860 fs/nsfs.c:88
>  open_related_ns+0xda/0x200 fs/nsfs.c:143
>  sock_ioctl+0x39d/0x440 net/socket.c:1001
>  vfs_ioctl fs/ioctl.c:45 [inline]
>  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
>  SYSC_ioctl fs/ioctl.c:700 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Freed:
> PID = 566
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>  slab_free_hook mm/slub.c:1357 [inline]
>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>  slab_free mm/slub.c:2961 [inline]
>  kmem_cache_free+0xb2/0x2c0 mm/slub.c:2983
>  __d_free fs/dcache.c:265 [inline]
>  dentry_free+0xd5/0x150 fs/dcache.c:314
>  __dentry_kill+0x485/0x6e0 fs/dcache.c:552
>  dentry_kill fs/dcache.c:579 [inline]
>  dput.part.25+0x5cd/0x7c0 fs/dcache.c:791
>  

Re: [PATCH] p54: add null pointer check before releasing socket buffer

2017-04-10 Thread Myungho Jung
On Mon, Apr 10, 2017 at 02:12:54PM +0200, Christian Lamparter wrote:
> (Added linux-wireless, since this is a wireless driver)
> 
> On Sunday, April 9, 2017 10:23:20 PM CEST Myungho Jung wrote:
> > Kernel panic is caused by trying to dereference null pointer. Check if
> > the pointer is null before freeing space.
> Do you have the kernel panic somewhere?
> I think you have an even bigger problem: You see, in order to get EEPROM
> readback and rx_stats feedback you need to sent a request to the firmware
> and if the response's req_id cookies don't match, you end up filling up 
> the very limited device address space.
> 
> As for adding if (!skb) checks. I think kfree, kfree_skb, dev_kfree_skb 
> (aka consume_skb) all check for null pointers already. So the logical
> thing to do would be to make dev_kfree_skb_irq (which would also fix
> dev_kfree_skb_any) consistent with kfree, kfree_skb, dev_kfree_skb and
> add the check there.
> 
> > Signed-off-by: Myungho Jung 
> > ---
> >  drivers/net/wireless/intersil/p54/txrx.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/wireless/intersil/p54/txrx.c 
> > b/drivers/net/wireless/intersil/p54/txrx.c
> > index 1af7da0..8956061 100644
> > --- a/drivers/net/wireless/intersil/p54/txrx.c
> > +++ b/drivers/net/wireless/intersil/p54/txrx.c
> > @@ -503,7 +503,9 @@ static void p54_rx_eeprom_readback(struct p54_common 
> > *priv,
> >  
> > priv->eeprom = NULL;
> > tmp = p54_find_and_unlink_skb(priv, hdr->req_id);
> > -   dev_kfree_skb_any(tmp);
> > +   if (unlikely(!tmp))
> > +   dev_kfree_skb_any(tmp);
> > +
> > complete(>eeprom_comp);
> >  }
> >  
> > @@ -597,7 +599,9 @@ static void p54_rx_stats(struct p54_common *priv, 
> > struct sk_buff *skb)
> > }
> >  
> > tmp = p54_find_and_unlink_skb(priv, hdr->req_id);
> > -   dev_kfree_skb_any(tmp);
> > +   if (unlikely(!tmp))
> > +   dev_kfree_skb_any(tmp);
> > +
> > complete(>stat_comp);
> >  }
> >  
> > 
> 
> 

I found that the fix was totally opposite to my thought. Sorry about
confusion. I'm not sure it actually caused kernel panic but guessed from
a bug report [https://bugzilla.kernel.org/show_bug.cgi?id=195289]. And
correct fix will be like this:
if (likely(tmp))
dev_kfree_skb_any(tmp);

But, like you said, I think null pointer should be checked in
dev_kfree_skb_irq although already checking before calling it in many
other places. I'll try another patch. Thank you for your advice.

Thanks,
Myungho


Re: [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

2017-04-10 Thread Jesper Dangaard Brouer

I will appreciate review of this patch. My micro-benchmarking show we
basically return to same page alloc+free cost as before 374ad05ab64d
("mm, page_alloc: only use per-cpu allocator for irq-safe requests").
Which sort of invalidates this attempt of optimizing the page allocator.
But Mel's micro-benchmarks still show an improvement.

Notice the slowdown comes from me checking irqs_disabled() ... if
someone can spot a way to get rid of that this, then this patch will be
a win.

I'm traveling out of Montreal today, and will rerun my benchmarks when
I get home.  Tariq will also do some more testing with 100G NIC (he
also participated in the Montreal conf, so he is likely in transit too).


On Mon, 10 Apr 2017 16:08:21 +0100
Mel Gorman  wrote:

> From: Jesper Dangaard Brouer 
> 
> IRQ context were excluded from using the Per-Cpu-Pages (PCP) lists caching
> of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use per-cpu
> allocator for irq-safe requests").
> 
> This unfortunately also included excluded SoftIRQ.  This hurt the performance
> for the use-case of refilling DMA RX rings in softirq context.
> 
> This patch re-allow softirq context, which should be safe by disabling
> BH/softirq, while accessing the list.  PCP-lists access from both hard-IRQ
> and NMI context must not be allowed.  Peter Zijlstra says in_nmi() code
> never access the page allocator, thus it should be sufficient to only test
> for !in_irq().
> 
> One concern with this change is adding a BH (enable) scheduling point at
> both PCP alloc and free. If further concerns are highlighted by this patch,
> the result wiill be to revert 374ad05ab64d and try again at a later date
> to offset the irq enable/disable overhead.
> 
> Fixes: 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe 
> requests")
> Signed-off-by: Jesper Dangaard Brouer 
> Signed-off-by: Mel Gorman 

Missing:

Reported-by: Tariq [...]

> ---
>  mm/page_alloc.c | 26 +-
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6cbde310abed..d7e986967910 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2351,9 +2351,9 @@ static void drain_local_pages_wq(struct work_struct 
> *work)
>* cpu which is allright but we also have to make sure to not move to
>* a different one.
>*/
> - preempt_disable();
> + local_bh_disable();
>   drain_local_pages(NULL);
> - preempt_enable();
> + local_bh_enable();
>  }
>  
>  /*
> @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>   unsigned long pfn = page_to_pfn(page);
>   int migratetype;
>  
> - if (in_interrupt()) {
> + /*
> +  * Exclude (hard) IRQ and NMI context from using the pcplists.
> +  * But allow softirq context, via disabling BH.
> +  */
> + if (in_irq() || irqs_disabled()) {
>   __free_pages_ok(page, 0);
>   return;
>   }
> @@ -2491,7 +2495,7 @@ void free_hot_cold_page(struct page *page, bool cold)
>  
>   migratetype = get_pfnblock_migratetype(page, pfn);
>   set_pcppage_migratetype(page, migratetype);
> - preempt_disable();
> + local_bh_disable();
>  
>   /*
>* We only track unmovable, reclaimable and movable on pcp lists.
> @@ -2522,7 +2526,7 @@ void free_hot_cold_page(struct page *page, bool cold)
>   }
>  
>  out:
> - preempt_enable();
> + local_bh_enable();
>  }
>  
>  /*
> @@ -2647,7 +2651,7 @@ static struct page *__rmqueue_pcplist(struct zone 
> *zone, int migratetype,
>  {
>   struct page *page;
>  
> - VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_irq() || irqs_disabled());
>  
>   do {
>   if (list_empty(list)) {
> @@ -2680,7 +2684,7 @@ static struct page *rmqueue_pcplist(struct zone 
> *preferred_zone,
>   bool cold = ((gfp_flags & __GFP_COLD) != 0);
>   struct page *page;
>  
> - preempt_disable();
> + local_bh_disable();
>   pcp = _cpu_ptr(zone->pageset)->pcp;
>   list = >lists[migratetype];
>   page = __rmqueue_pcplist(zone,  migratetype, cold, pcp, list);
> @@ -2688,7 +2692,7 @@ static struct page *rmqueue_pcplist(struct zone 
> *preferred_zone,
>   __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
>   zone_statistics(preferred_zone, zone);
>   }
> - preempt_enable();
> + local_bh_enable();
>   return page;
>  }
>  
> @@ -2704,7 +2708,11 @@ struct page *rmqueue(struct zone *preferred_zone,
>   unsigned long flags;
>   struct page *page;
>  
> - if (likely(order == 0) && !in_interrupt()) {
> + /*
> +  * Exclude (hard) IRQ and NMI context from using the pcplists.
> +  * But allow softirq context, via disabling BH.
> +  */
> + if (likely(order == 0) && !(in_irq() || irqs_disabled()) ) {
>   

Distinguishing between forwarded/bridged packets and locally generated packets

2017-04-10 Thread Anthony Lineham
Hi,


I'm trying to find a way to distinguish between a packet that is being 
forwarded/bridged and one that was locally generated by the host itself. I need 
to know this for a particular application at tx in an ethernet device driver. 
Forwarded/bridged packets are processed when received, but I need to make sure 
locally generated packets are processed too, and avoid reprocessing the 
forwarded packets.

Up until now we have been using the skb_iif field in the socket buffer to do 
this. The logic is that forwarded/bridged packet has a non-zero value in this 
field, indicating the incoming interface index and a locally generated packet 
has 0 as there is no incoming interface. However, we've recently realised that 
in some situations this information is scrubbed from the skb (by 
skb_scrub_packet), for example when the packet is being sent over a PPP link. 

I'm looking for an alternative method to use. Currently I'm thinking about 
using the sk field in the skb. In a locally generated packet this would have 
the socket the packet was generated by. In the forwarding case it would not be 
set. It looks like there are some cases that don't quite conform to this, such 
as ARP's, but I can probably live with that.

Can anyone comment on the suitability of this method or if there is a more 
appropriate way to achieve this.
Any information is greatly appreciated.

Thanks,
Anthony








Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Daniel Borkmann

On 04/10/2017 08:39 PM, Andy Gospodarek wrote:
[...]

I ran this on a desktop-class system I have (i7-6700 CPU @ 3.40GHz)
and used pktgen_sample03_burst_single_flow.sh from another system to
throw ~6.5Mpps as a single UDP stream towards the system running XDP.

I just commented out the ndo_xdp op in bnxt.c to test this patch.  The
differences were pretty dramatic.

Using the ndo_xdp ndo in bnxt_en, my perf report output looked like
this (not sure why X is still running on this system!):

 38.08%  swapper  [sysimgblt]   [k] 
0x5bd0
 11.80%  swapper  [kernel.vmlinux]  [k] intel_idle
 10.49%  swapper  [bnxt_en] [k] bnxt_rx_pkt
  6.31%  swapper  [bnxt_en] [k] bnxt_rx_xdp
  5.64%  swapper  [bnxt_en] [k] bnxt_poll
  4.22%  swapper  [kernel.vmlinux]  [k] poll_idle
  3.46%  swapper  [kernel.vmlinux]  [k] 
irq_entries_start
  2.95%  swapper  [kernel.vmlinux]  [k] 
napi_complete_done
  1.79%  swapper  [kernel.vmlinux]  [k] 
cpuidle_enter_state
  1.53%  swapper  [kernel.vmlinux]  [k] menu_select
  1.19%  swapper  [bnxt_en] [k] 
bnxt_reuse_rx_data
  1.00%  swapper  [sysimgblt]   [k] 
0x5c6f
  0.92%  swapper  [kernel.vmlinux]  [k] 
__next_timer_interrupt
  0.71%  swapper  [kernel.vmlinux]  [k] 
_raw_spin_lock_irqsave
  0.71%  swapper  [kernel.vmlinux]  [k] 
bpf_map_lookup_elem

mpstat reports that the CPU receiving and dropping the traffic is
basically running idle.  Dropping this amount of traffic in the driver
has very little impact on the system.

With v2 of this patch I see the following from perf report:

 19.69%  ksoftirqd/3  [kernel.vmlinux]  [k] memcpy_erms
 16.30%  ksoftirqd/3  [kernel.vmlinux]  [k] __bpf_prog_run


Forgot echo 1 > /proc/sys/net/core/bpf_jit_enable? Was it disabled in both 
cases?


 10.11%  ksoftirqd/3  [bnxt_en] [k] bnxt_rx_pkt
  7.69%  ksoftirqd/3  [kernel.vmlinux]  [k] __build_skb
  4.25%  ksoftirqd/3  [kernel.vmlinux]  [k] inet_gro_receive
  3.74%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_alloc
  3.53%  ksoftirqd/3  [kernel.vmlinux]  [k] dev_gro_receive
  3.43%  ksoftirqd/3  [kernel.vmlinux]  [k] page_frag_free
  3.12%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_free
  2.56%  ksoftirqd/3  [bnxt_en] [k] bnxt_poll
  2.46%  ksoftirqd/3  [kernel.vmlinux]  [k] 
netif_receive_skb_internal
  2.13%  ksoftirqd/3  [kernel.vmlinux]  [k] 
__udp4_lib_lookup
  1.63%  ksoftirqd/3  [kernel.vmlinux]  [k] __napi_alloc_skb
  1.51%  ksoftirqd/3  [kernel.vmlinux]  [k] eth_type_trans
  1.42%  ksoftirqd/3  [kernel.vmlinux]  [k] udp_gro_receive
  1.29%  ksoftirqd/3  [kernel.vmlinux]  [k] napi_gro_receive
  1.25%  ksoftirqd/3  [kernel.vmlinux]  [k] udp4_gro_receive
  1.18%  ksoftirqd/3  [bnxt_en] [k] bnxt_rx_xdp
  1.17%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_data
  1.11%  ksoftirqd/3  [bnxt_en] [k] 
bnxt_reuse_rx_data
  1.07%  ksoftirqd/3  [kernel.vmlinux]  [k] net_rx_action




Re: [PATCH net v3 0/2] bridge: Fix kernel oops during bridge creation

2017-04-10 Thread Stephen Hemminger
On Mon, 10 Apr 2017 14:59:26 +0300
ido...@idosch.org wrote:

> From: Ido Schimmel 
> 
> First patch adds a missing ndo_uninit() in the bridge driver, which is a
> prerequisite for the second patch that actually fixes the oops.
> 
> Please consider both patches for 4.4.y, 4.9.y and 4.10.y
> 
> Ido Schimmel (2):
>   bridge: implement missing ndo_uninit()
>   bridge: netlink: register netdevice before executing changelink
> 
>  net/bridge/br_device.c| 20 +++-
>  net/bridge/br_if.c|  1 -
>  net/bridge/br_multicast.c |  7 +--
>  net/bridge/br_netlink.c   |  7 +--
>  net/bridge/br_private.h   |  5 +
>  5 files changed, 26 insertions(+), 14 deletions(-)
> 

Looks good, thanks for following through on this.

Reviewed-by: Stephen Hemminger 


Re: [PATCH v2 00/12] ftgmac100: Rework batch 3 - TX path

2017-04-10 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Mon, 10 Apr 2017 11:15:14 +1000

> This is version 2 of the third batch of updates to
> the ftgmac100 driver.
> 
> This one tackles the TX path of the driver. This provides the
> bulk of the performance improvements by adding support for
> fragmented sends along with a bunch of cleanups.
> 
> Version 2 fixes a patch splitting mistake and uses
> eth_skb_pad() (which uses skb_put_padto) to pad ethernet
> frames rather than skb_padto(), thus removing the need to
> also pad the packet headlen in a couple of places. 
> 
> Subsequent batches will add various features (ethtool functions,
> vlan offlan, ...) and cleanups.

Series applied, thanks Ben.


Re: [PATCH 0/6] mvmdio updates

2017-04-10 Thread Marcin Wojtas
Hi Russel,

2017-04-10 17:27 GMT+02:00 Russell King - ARM Linux :
> This series of patches update mvmdio for Armada 8k CP110.  A number of
> issues were found:
>
> 1. The driver fails to disable an interrupt when something goes wrong
>in the probe function.
>
> 2. The interrupt is specified in DT to be optional, but the driver
>unconditionally writes to the interrupt mask register, which may
>not exist.
>
> 3. The DT binding specifies
> "reg: address and length of the SMI register"
>however, when supporting the interrupt, the size must cover the
>interrupt register as well.  Update the binding documentation
>with this information that was previously omitted.
>
> 4. If the register size is too small, have the driver print an error
>and disable use of the interrupt.
>
> 5. Armada 8k needs three clocks for the MDIO interface, otherwise the
>SoC hangs (since it is part of one of the ethernet interfaces.)
>GOP clock, MG core clock and MG clock are needed on 8k. Augment the
>binding and driver to allow three clocks to be specified.
>

Actually most of the interfaces on a7k/a8k require multiple clocks to
be enabled, however all those twisted dependencies are handled within:
drivers/clk/mvebu/cp110-system-controller.c
With the latest patch of Thomas Petazzoni, MG clock is already
specified as a child of MG_CORE, so I believe a just minor change will
resolve remaining GOP clock dependency. This way we will leave
orion-mdio driver untouched around clocks.

Thomas, what is your opinion?

Regards,
Marcin


Re: [PATCH 5/6] dt-bindings: allow up to three clocks for orion-mdio

2017-04-10 Thread Andrew Lunn
On Mon, Apr 10, 2017 at 04:28:25PM +0100, Russell King wrote:
> Armada 8040 needs three clocks to be enabled for MDIO accesses to work.
> Update the binding to allow the extra clocks to be specified.
> 
> Signed-off-by: Russell King 

Reviewed-by: Andrew Lunn 

Andrew


[PATCH nf-next] ipset: remove unused function __ip_set_get_netlink

2017-04-10 Thread Aaron Conole
There are no in-tree callers.

Signed-off-by: Aaron Conole 
---
 net/netfilter/ipset/ip_set_core.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c 
b/net/netfilter/ipset/ip_set_core.c
index c296f9b..68ba531 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -501,14 +501,6 @@ __ip_set_put(struct ip_set *set)
  * a separate reference counter
  */
 static inline void
-__ip_set_get_netlink(struct ip_set *set)
-{
-   write_lock_bh(_set_ref_lock);
-   set->ref_netlink++;
-   write_unlock_bh(_set_ref_lock);
-}
-
-static inline void
 __ip_set_put_netlink(struct ip_set *set)
 {
write_lock_bh(_set_ref_lock);
-- 
2.9.3



[PATCH nf-next] ipvs: remove unused function ip_vs_set_state_timeout

2017-04-10 Thread Aaron Conole
There are no in-tree callers of this function and it isn't exported.

Signed-off-by: Aaron Conole 
---
 include/net/ip_vs.h  |  2 --
 net/netfilter/ipvs/ip_vs_proto.c | 22 --
 2 files changed, 24 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 8a4a57b8..c76fedb 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1349,8 +1349,6 @@ int ip_vs_protocol_init(void);
 void ip_vs_protocol_cleanup(void);
 void ip_vs_protocol_timeout_change(struct netns_ipvs *ipvs, int flags);
 int *ip_vs_create_timeout_table(int *table, int size);
-int ip_vs_set_state_timeout(int *table, int num, const char *const *names,
-   const char *name, int to);
 void ip_vs_tcpudp_debug_packet(int af, struct ip_vs_protocol *pp,
   const struct sk_buff *skb, int offset,
   const char *msg);
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index 8ae4807..ca880a3 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -193,28 +193,6 @@ ip_vs_create_timeout_table(int *table, int size)
 }
 
 
-/*
- * Set timeout value for state specified by name
- */
-int
-ip_vs_set_state_timeout(int *table, int num, const char *const *names,
-   const char *name, int to)
-{
-   int i;
-
-   if (!table || !name || !to)
-   return -EINVAL;
-
-   for (i = 0; i < num; i++) {
-   if (strcmp(names[i], name))
-   continue;
-   table[i] = to * HZ;
-   return 0;
-   }
-   return -ENOENT;
-}
-
-
 const char * ip_vs_state_name(__u16 proto, int state)
 {
struct ip_vs_protocol *pp = ip_vs_proto_get(proto);
-- 
2.9.3



Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Daniel Borkmann

On 04/10/2017 04:18 AM, Alexei Starovoitov wrote:
[...]

+   xdp.data_end = xdp.data + hlen;
+   xdp.data_hard_start = xdp.data - skb_headroom(skb);
+   orig_data = xdp.data;
+   act = bpf_prog_run_xdp(xdp_prog, );
+
+   off = xdp.data - orig_data;
+   if (off)
+   __skb_push(skb, off);


and restore l2 back somehow and get new skb->protocol ?
if we simply do __skb_pull(skb, skb->mac_len); like
we do with cls_bpf, it will not work correctly,
since if the program did ip->ipip encap (like our balancer
does and the test tools/testing/selftests/bpf/test_xdp.c)
the skb metadata fields will be wrong.
So we need to repeat eth_type_trans() here if (xdp.data != orig_data)


Yeah, agree. Also, when we have gso skb and rewrite/resize parts
of the packet, we would need to update gso related shinfo meta
data accordingly (f.e. a rewrite from v4/v6, rewrite of whole pkt
as icmp reply, etc)?

Also, what about encap/decap, should inner skb headers get
updated as well along with skb->encapsulation, etc? How do we
handle checksumming on this layer?


In case of cls_bpf when we mess with skb sizes we always
adjust skb metafields in helpers, so there it's fine
and __skb_pull(skb, skb->mac_len); is enough.
Here we need to be a bit more careful.


In cls_bpf I was looking into something generic and fast for
encap/decap like bpf_xdp_adjust_head() but for skbs. Problem is
that they can be received from ingress/egress and transmitted
further from cls_bpf to ingress/egress, so keeping skb meta data
correct and up to date without exposing skb (implementation)
details like header pointers to users is crucial, as otherwise
these can get messed up potentially affecting the rest of the
system. We restricted helpers in cls_bpf to avoid that. Perhaps
we could make easier assumptions when this generic callback is
known to be called out of a physical driver's rx path, but when
being skb already (as mentioned below by Alexei's thoughts) ...


  static int netif_receive_skb_internal(struct sk_buff *skb)
  {
int ret;
@@ -4258,6 +4336,21 @@ static int netif_receive_skb_internal(struct sk_buff 
*skb)

rcu_read_lock();

+   if (static_key_false(_xdp_needed)) {
+   struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+
+   if (xdp_prog) {
+   u32 act = netif_receive_generic_xdp(skb, xdp_prog);


That's indeed the best attachment point in the stack.
I was trying to see whether it can be lowered into something like
dev_gro_receive(), but not everyone calls it.
Another option to put it into eth_type_trans() itself, then
there are no problems with gro, l2 headers, and adjust_head,
but changing all drivers is too much.


+
+   if (act != XDP_PASS) {
+   rcu_read_unlock();
+   if (act == XDP_TX)
+   dev_queue_xmit(skb);


It should be fine. For cls_bpf we do recursion check __bpf_tx_skb()
but I forgot specific details. May be here it's fine as-is.
Daniel, do we need recursion check here?


Yeah, Willem is correct. That was for sch_handle_egress() to
sch_handle_egress() as that is otherwise not accounted by the
main xmit_recursion check we have in __dev_queue_xmit().

Thanks,
Daniel


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread David Miller
From: Andy Gospodarek 
Date: Mon, 10 Apr 2017 14:39:35 -0400

> I also noted that...
> 
> xdp1 not parsing the protocol correctly.   All traffic was showing up as
> protocol '0' instead of '17' in the output.  
> 
> xdp2 was not actually sending the frames back out when using this patch.
> 
> I'll take a look in a bit and see if I can track these two issues down.

Read Alexei's earlier posting, he spotted the bugs you are seeing
here.

Basically, the MAC header is pulled already so we have to push it back
before we run XDP over it.  We also have to be similarly careful with
the MAC header for XDP_TX cases.


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread David Miller
From: Alexei Starovoitov 
Date: Sun, 9 Apr 2017 19:18:09 -0700

> On Sun, Apr 09, 2017 at 01:35:28PM -0700, David Miller wrote:
>> +if (skb_linearize(skb))
>> +goto do_drop;
> 
> do we need to force disable gro ?

I think we do.

> Otherwise if we linearize skb_is_gso packet it will be huge
> and not similar to normal xdp packets?

I completely agree.

> gso probably needs to disabled too to avoid veth surprises?

I'm not so sure about that.

> 
>> +
>> +hlen = skb_headlen(skb);
>> +xdp.data = skb->data;
> 
> it probably should be
> hlen = skb_headlen(skb) + skb->mac_len;
> xdp.data = skb->data - skb->mac_len;
> to make sure xdp program is looking at l2 header.

This is why Andy's tests aren't working properly, good find.

>> +xdp.data_end = xdp.data + hlen;
>> +xdp.data_hard_start = xdp.data - skb_headroom(skb);
>> +orig_data = xdp.data;
>> +act = bpf_prog_run_xdp(xdp_prog, );
>> +
>> +off = xdp.data - orig_data;
>> +if (off)
>> +__skb_push(skb, off);
> 
> and restore l2 back somehow and get new skb->protocol ?
> if we simply do __skb_pull(skb, skb->mac_len); like
> we do with cls_bpf, it will not work correctly,

Yep.

I completely overlooked the fact that the MAC header was pulled
by the time we get here already.

> I suspect there always be drivers that don't support xdp (like e100),
> so this generic_xdp_install() will stay with us forever.
> Since it will stay, can we enable it for xdp enabled drivers too?

Yes it will be useful for debugging.

Andy already ran into this, he wanted to test it with a driver
that supports XDP so he had to comment out the driver's ndo op.

> Another advantage that it will help to flush out the differences
> between skb- and raw- modes in the drivers that support xdp already.

Indeed, this will be very valuable.


Re: [PATCH 6/6] net: mvmdio: allow up to three clocks to be specified for orion-mdio

2017-04-10 Thread Andrew Lunn
On Mon, Apr 10, 2017 at 04:28:31PM +0100, Russell King wrote:
> Allow up to three clocks to be specified and enabled for the orion-mdio
> interface, which are required for this interface to be accessible on
> Armada 8k platforms.
> 
> Signed-off-by: Russell King 

Reviewed-by: Andrew Lunn 

Andrew


Re: Horrid balance-rr bonding udp throughput

2017-04-10 Thread Eric Dumazet
On Mon, 2017-04-10 at 14:50 -0400, Jarod Wilson wrote:
> On 2017-04-08 7:33 PM, Jarod Wilson wrote:
> > I'm digging into some bug reports covering performance issues with 
> > balance-rr, and discovered something even worse than the reporter. My 
> > test setup has a pair of NICs, one e1000e, one e1000 (but dual e1000e 
> > seems the same). When I do a test run in LNST with bonding mode 
> > balance-rr and either miimon or arpmon, the throughput of the UDP_STREAM 
> > netperf test is absolutely horrible:
> > 
> > TCP: 941.19 +-0.88 mbits/sec
> > UDP: 45.42 +-4.59 mbits/sec
> > 
> > I figured I'd try LNST's packet capture mode, so exact same test, add 
> > the -p flag and I get:
> > 
> > TCP: 941.21 +-0.82 mbits/sec
> > UDP: 961.54 +-0.01 mbits/sec
> > 
> > Uh. What? So yeah. I can't capture the traffic in the bad case, but I 
> > guess that gives some potential insight into what's not happening 
> > correctly in either the bonding driver or the NIC drivers... More 
> > digging forthcoming, but first I have a flooded basement to deal with, 
> > so if in the interim, anyone has some insight, I'd be happy to hear it. :)
> 
> Okay, ignore the bit about bonding, I should have eliminated the bond 
> from the picture entirely. I think the traffic simply ended up on the 
> e1000 on the non-capture test and on the e1000e for the capture test, as 
> those numbers match perfectly with straight NIC to NIC testing, no bond 
> involved. That said, really odd that the e1000 is so severely crippled 
> for UDP, while TCP is still respectable. Not sure if I have a flaky NIC 
> or what...
> 
> For reference, e1000 to e1000e netperf:
> 
> TCP_STREAM: Measured rate was 849.95 +-1.32 mbits/sec
> UDP_STREAM: Measured rate was 44.73 +-5.73 mbits/sec

In our experiments, we found e1000e had latency issue with UDP packets,
not with TCP.

Try e1000e -> e1000e , problem should persist, right ?







Re: [PATCH 4/6] net: mvmdio: disable interrupt if resource size is too small

2017-04-10 Thread Andrew Lunn
On Mon, Apr 10, 2017 at 04:28:20PM +0100, Russell King wrote:
> Disable the MDIO interrupt, falling back to polled mode, if the resource
> size does not allow us to access the interrupt registers.  All current
> DT bindings use a size of 0x84, which allows access, but verifying it is
> good practice.
> 
> Signed-off-by: Russell King 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread David Miller
From: Andy Gospodarek 
Date: Mon, 10 Apr 2017 14:39:35 -0400

> As promised, I did some testing today with bnxt_en's implementation
> of XDP and this one.

Thanks a lot Andy, obviously the patch needs some more work.

I noticed GRO stuff in your profile, and Alexei mentioned this earlier
today.  We probably should elide GRO if generic XDP is attached, since
in particular this makes the skb_linearize() really expensive.


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Stephen Hemminger
On Sun, 09 Apr 2017 13:35:28 -0700 (PDT)
David Miller  wrote:

> This provides a generic non-optimized XDP implementation when the
> device driver does not provide an optimized one.
> 
> It is arguable that perhaps I should have required something like
> this as part of the initial XDP feature merge.
> 
> I believe this is critical for two reasons:
> 
> 1) Accessibility.  More people can play with XDP with less
>dependencies.  Yes I know we have XDP support in virtio_net, but
>that just creates another depedency for learning how to use this
>facility.
> 
>I wrote this to make life easier for the XDP newbies.
> 
> 2) As a model for what the expected semantics are.  If there is a pure
>generic core implementation, it serves as a semantic example for
>driver folks adding XDP support.
> 
> This is just a rough draft and is untested.
> 
> Signed-off-by: David S. Miller 

Agreed, XDP really needs this

> v2:
>  - Add some "fall through" comments in switch statements based
>upon feedback from Andrew Lunn
>  - Use RCU for generic xdp_prog, thanks to Johannes Berg.
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index cc07c3b..f8ff49c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1892,6 +1892,7 @@ struct net_device {
>   struct lock_class_key   *qdisc_tx_busylock;
>   struct lock_class_key   *qdisc_running_key;
>   boolproto_down;
> + struct bpf_prog __rcu   *xdp_prog;
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7efb417..ad6de2d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c

...
 
> + if (static_key_false(_xdp_needed)) {
> + struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog);
> +
> + if (xdp_prog) {
> + u32 act = netif_receive_generic_xdp(skb, xdp_prog);
> +
> + if (act != XDP_PASS) {
> + rcu_read_unlock();
> + if (act == XDP_TX)
> + dev_queue_xmit(skb);
> + return NET_RX_DROP;
> + }

Wouldn't it be cleaner if XDP returned an enumerated type and a
switch was used? Safer than allowing arbitrary int.


Re: [PATCH 3/6] dt-bindings: correct marvell orion MDIO binding document

2017-04-10 Thread Andrew Lunn
On Mon, Apr 10, 2017 at 04:28:15PM +0100, Russell King wrote:
> Correct the Marvell Orion MDIO binding document to properly reflect the
> cases where an interrupt is present.  Augment the examples to show this.
> 
> Signed-off-by: Russell King 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH 2/6] net: mvmdio: fix interrupt disable in remove path

2017-04-10 Thread Andrew Lunn
On Mon, Apr 10, 2017 at 04:28:09PM +0100, Russell King wrote:
> The pre-existing write to disable interrupts on the remove path happens
> whether we have an interrupt or not.  While this may seem to be a good
> idea, this driver is re-used in many different implementations, some
> where the binding only specifies four bytes of register space.  This
> access causes us to access registers outside of the binding.
> 
> Make it conditional on the interrupt being present, which is the same
> condition used when enabling the interrupt in the first place.
> 
> Signed-off-by: Russell King 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH 1/6] net: mvmdio: disable interrupts in driver failure path

2017-04-10 Thread Andrew Lunn
On Mon, Apr 10, 2017 at 04:28:04PM +0100, Russell King wrote:
> When the mvmdio driver has an interrupt, it enables the "done" interrupt
> after requesting its interrupt handler.  However, probe failure results
> in the interrupt being left enabled.  Disable it on the failure path.
> 
> Signed-off-by: Russell King 

Reviewed-by: Andrew Lunn 

Andrew


Re: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-04-10 Thread Dave Jones
On Mon, Apr 10, 2017 at 07:03:30PM +, alexander.le...@verizon.com wrote:
 > Hi all,
 > 
 > I seem to be hitting this use-after-free on a -next kernel using trinity:
 >
 > [  531.036054] BUG: KASAN: use-after-free in prb_retire_rx_blk_timer_expired 
 > (net/packet/af_packet.c:688) 
 > [  531.036961] Read of size 8 at addr 88038c1fb0e8 by task 
 > swapper/1/0  
 >   [  531.037727] 
 >  
 >   [  531.037928] CPU: 1 PID: 0 Comm: swapper/1 Not 
 > tainted 4.11.0-rc5-next-20170407-dirty #24

Funny, I was just going over my old pending bugs, and found this one
from January that looks like what happens with the same bug, but without kasan..

context: PID: 0  TASK: 881ff2fa5100  CPU: 5   COMMAND: "swapper/5"
panic: general protection fault:  [#1]
netversion: 2.2-1 (Feb 2014)
Backtrace:
 #0 [881fffaa3c00] machine_kexec at 81044af8
 #1 [881fffaa3c60] __crash_kexec at 810ec755
 #2 [881fffaa3d28] crash_kexec at 810ec81f
 #3 [881fffaa3d40] oops_end at 8101e348
 #4 [881fffaa3d68] die at 8101e76b
 #5 [881fffaa3d98] do_general_protection at 8101be76
 #6 [881fffaa3dc0] general_protection at 817fe5a2
[exception RIP: prb_retire_rx_blk_timer_expired+65]
RIP: 817e6e41  RSP: 881fffaa3e78  RFLAGS: 00010246
RAX:   RBX: 881fd7075800  RCX: 
RDX: 883ff0a16bb0  RSI: 0074636361757063  RDI: 881fd70758bc
RBP: 881fffaa3e88   R8: 0001   R9: 0005
R10:   R11:   R12: 881fd7075b78
R13: 0100  R14: 817e6e00  R15: 881fd7075800
ORIG_RAX:   CS: 0010  SS: 0018
 #7 [881fffaa3e90] call_timer_fn at 810cec35
 #8 [881fffaa3ec8] run_timer_softirq at 810cf01c
 #9 [881fffaa3f28] __softirqentry_text_start at 817ff05c
#10 [881fffaa3f88] irq_exit at 8107d5fc
#11 [881fffaa3f98] smp_apic_timer_interrupt at 817feea2
#12 [881fffaa3fb0] apic_timer_interrupt at 817fd56f
---  ---
#13 [881ff2fbfdd0] apic_timer_interrupt at 817fd56f
RIP: 0018  RSP:   RFLAGS: 81ebbb60
RAX: e8e0002a0400  RBX: 0067b502e95f  RCX: 0006
RDX: 002e  RSI: 0034  RDI: 0001
RBP: 81150540   R8: 881ff2fbfee0   R9: 0001
R10: 0005  R11: 81ebbb60  R12: 881ff2fbfe48
R13: 881ff2fa5110  R14:   R15: 881ff2fa5100
ORIG_RAX: 881fffab5340  CS: 20c49ba5e353f7cf  SS: ff10
WARNING: possibly bogus exception frame
Dmesg:
Code: 00 00 48 8b 93 10 03 00 00 80 bb 21 03 00 00 00 44 0f b6 83 20 03 00 00 
0f b7 c8 48 8b 34 ca 75 57 <44> 8b 5e 0c 45 85 db 74 1d 8b 93 68 03 00 00 85 d2 
74 13 f3 90 

RIP 
 [] prb_retire_rx_blk_timer_expired+0x41/0x120
 RSP 
[ cut here ]



af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-04-10 Thread alexander . levin
Hi all,

I seem to be hitting this use-after-free on a -next kernel using trinity:

[  531.036054] BUG: KASAN: use-after-free in prb_retire_rx_blk_timer_expired 
(net/packet/af_packet.c:688)
 [  531.036961] Read of size 8 at addr 88038c1fb0e8 by task swapper/1/0 
   
[  531.037727]  
  [ 
 531.037928] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
4.11.0-rc5-next-20170407-dirty #24
[  531.038862] Call Trace:
[  531.039163]  
[  531.039447] dump_stack (lib/dump_stack.c:54) 
[  531.041612] print_address_description (mm/kasan/report.c:253) 
[  531.042809] kasan_report (mm/kasan/report.c:352 mm/kasan/report.c:408) 
[  531.043263] __asan_report_load8_noabort (mm/kasan/report.c:429) 
[  531.043829] prb_retire_rx_blk_timer_expired (net/packet/af_packet.c:688) 
[  531.048298] call_timer_fn.isra.15 (./arch/x86/include/asm/preempt.h:22 
kernel/time/timer.c:1246) 
[  531.048805] __run_timers (./include/linux/spinlock.h:324 
kernel/time/timer.c:1308 kernel/time/timer.c:1601) 
[  531.055404] run_timer_softirq (kernel/time/timer.c:1614) 
[  531.055883] __do_softirq (./arch/x86/include/asm/preempt.h:22 
kernel/softirq.c:286) 
[  531.057507] irq_exit (kernel/softirq.c:364 kernel/softirq.c:405) 
[  531.057893] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:965) 
[  531.058446] apic_timer_interrupt (arch/x86/entry/entry_64.S:704) 
[  531.058951] RIP: 0010:native_safe_halt (??:?) 
[  531.059718] RSP: 0018:88039aa8fe88 EFLAGS: 0246 ORIG_RAX: 
ff10
[  531.060593] RAX: 0008 RBX: 88039aa68fc0 RCX: 
[  531.061411] RDX: 11007354d1f8 RSI:  RDI: 
[  531.062203] RBP: 88039aa8fe88 R08: 880376251bc0 R09: 0001
[  531.063001] R10: 88038e0f7838 R11: 0001 R12: 88039aa68fc0
[  531.064007] R13: 83df0028 R14:  R15: 88039aa68fc0
[  531.064811]  
[  531.065886] default_idle (./arch/x86/include/asm/paravirt.h:98 
arch/x86/kernel/process.c:341) 
[  531.066284] arch_cpu_idle (arch/x86/kernel/process.c:333) 
[  531.066692] default_idle_call (kernel/sched/idle.c:101) 
[  531.067151] do_idle (kernel/sched/idle.c:156 kernel/sched/idle.c:245) 
[  531.067537] cpu_startup_entry (kernel/sched/idle.c:350 (discriminator 1)) 
[  531.067992] start_secondary (arch/x86/kernel/smpboot.c:276) 
[  531.068444] secondary_startup_64 (arch/x86/kernel/verify_cpu.S:37) 
[  531.068924]  
  [ 
 531.069109] Allocated by task 18982:   
[  
531.069522] save_stack_trace (arch/x86/kernel/stacktrace.c:60)  
   [  
531.069965] save_stack (mm/kasan/kasan.c:493 mm/kasan/kasan.c:514) 
[  531.070347] kasan_kmalloc (mm/kasan/kasan.c:525 mm/kasan/kasan.c:617) 
[  531.070757] __kmalloc (mm/slub.c:3747) 
[  531.071153] packet_set_ring (net/packet/af_packet.c:4130 
net/packet/af_packet.c:4218) 
[  531.072024] packet_setsockopt (net/packet/af_packet.c:3617) 
[  531.072525] SyS_setsockopt (net/socket.c:1797 net/socket.c:1777) 
[  531.072968] do_syscall_64 (arch/x86/entry/common.c:284) 
[  531.073405] return_from_SYSCALL_64 (arch/x86/entry/entry_64.S:249) 
[  531.073893] 
[  531.074076] Freed by task 7019:
[  531.074443] save_stack_trace (arch/x86/kernel/stacktrace.c:60) 
[  531.074882] save_stack (mm/kasan/kasan.c:493 mm/kasan/kasan.c:514)
[  531.075275] kasan_slab_free (mm/kasan/kasan.c:525 mm/kasan/kasan.c:590) 
[  531.075705] kfree (mm/slub.c:2966 mm/slub.c:3882) 
[  531.076052] free_pg_vec (net/packet/af_packet.c:4096) 
[  531.076448] packet_set_ring (net/packet/af_packet.c:4298) 
[  531.076922] packet_setsockopt (net/packet/af_packet.c:3617) 
[  531.077406] SyS_setsockopt (net/socket.c:1797 net/socket.c:1777) 
[  531.077848] do_syscall_64 (arch/x86/entry/common.c:284) 
[  531.078285] return_from_SYSCALL_64 (arch/x86/entry/entry_64.S:249) 
[  531.078773] 
[  531.078956] The buggy address belongs to the object at 88038c1fb0e8
[  531.078956]  which belongs to the cache kmalloc-8 of size 8
[  531.080341] The buggy address is located 0 bytes inside of
[  531.080341]  8-byte region [88038c1fb0e8, 88038c1fb0f0)
[  531.081600] The buggy address belongs to the page:
[  531.082150] page:ea000e307e80 count:1 mapcount:0 mapping:  
(null) index:0x88038c1fbd90 compound_mapcount: 0
[  531.083613] flags: 0x2fffc008100(slab|head)
[  531.084139] raw: 02fffc008100  88038c1fbd90 

Re: Horrid balance-rr bonding udp throughput

2017-04-10 Thread Ben Greear

On 04/10/2017 11:50 AM, Jarod Wilson wrote:

On 2017-04-08 7:33 PM, Jarod Wilson wrote:

I'm digging into some bug reports covering performance issues with balance-rr, 
and discovered something even worse than the reporter. My test setup has a pair
of NICs, one e1000e, one e1000 (but dual e1000e seems the same). When I do a 
test run in LNST with bonding mode balance-rr and either miimon or arpmon, the
throughput of the UDP_STREAM netperf test is absolutely horrible:

TCP: 941.19 +-0.88 mbits/sec
UDP: 45.42 +-4.59 mbits/sec

I figured I'd try LNST's packet capture mode, so exact same test, add the -p 
flag and I get:

TCP: 941.21 +-0.82 mbits/sec
UDP: 961.54 +-0.01 mbits/sec

Uh. What? So yeah. I can't capture the traffic in the bad case, but I guess 
that gives some potential insight into what's not happening correctly in either
the bonding driver or the NIC drivers... More digging forthcoming, but first I 
have a flooded basement to deal with, so if in the interim, anyone has some
insight, I'd be happy to hear it. :)


Okay, ignore the bit about bonding, I should have eliminated the bond from the 
picture entirely. I think the traffic simply ended up on the e1000 on the
non-capture test and on the e1000e for the capture test, as those numbers match 
perfectly with straight NIC to NIC testing, no bond involved. That said, really
odd that the e1000 is so severely crippled for UDP, while TCP is still 
respectable. Not sure if I have a flaky NIC or what...

For reference, e1000 to e1000e netperf:

TCP_STREAM: Measured rate was 849.95 +-1.32 mbits/sec
UDP_STREAM: Measured rate was 44.73 +-5.73 mbits/sec


Maybe check that you have re-ordering issues?  I ran into that with igb
recently and it took a while to realize my problem!

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Horrid balance-rr bonding udp throughput

2017-04-10 Thread Jarod Wilson

On 2017-04-08 7:33 PM, Jarod Wilson wrote:
I'm digging into some bug reports covering performance issues with 
balance-rr, and discovered something even worse than the reporter. My 
test setup has a pair of NICs, one e1000e, one e1000 (but dual e1000e 
seems the same). When I do a test run in LNST with bonding mode 
balance-rr and either miimon or arpmon, the throughput of the UDP_STREAM 
netperf test is absolutely horrible:


TCP: 941.19 +-0.88 mbits/sec
UDP: 45.42 +-4.59 mbits/sec

I figured I'd try LNST's packet capture mode, so exact same test, add 
the -p flag and I get:


TCP: 941.21 +-0.82 mbits/sec
UDP: 961.54 +-0.01 mbits/sec

Uh. What? So yeah. I can't capture the traffic in the bad case, but I 
guess that gives some potential insight into what's not happening 
correctly in either the bonding driver or the NIC drivers... More 
digging forthcoming, but first I have a flooded basement to deal with, 
so if in the interim, anyone has some insight, I'd be happy to hear it. :)


Okay, ignore the bit about bonding, I should have eliminated the bond 
from the picture entirely. I think the traffic simply ended up on the 
e1000 on the non-capture test and on the e1000e for the capture test, as 
those numbers match perfectly with straight NIC to NIC testing, no bond 
involved. That said, really odd that the e1000 is so severely crippled 
for UDP, while TCP is still respectable. Not sure if I have a flaky NIC 
or what...


For reference, e1000 to e1000e netperf:

TCP_STREAM: Measured rate was 849.95 +-1.32 mbits/sec
UDP_STREAM: Measured rate was 44.73 +-5.73 mbits/sec


--
Jarod Wilson
ja...@redhat.com


[PATCH net] xfrm: calculate L4 checksums also for GSO case before encrypting packets

2017-04-10 Thread Ansis Atteka
Otherwise, if L4 checksum calculation is done after encryption,
then all ESP packets end up being corrupted at the location
where pre-encryption L4 checksum field resides.

One of the ways to reproduce this bug is to have a VM with virtio_net
driver (UFO set to ON in the guest VM); and then encapsulate all guest's
Ethernet frames in GENEVE; and then further encrypt GENEVE with IPsec.
In this case following symptoms are observed:
1. If using ixgbe NIC, then the driver will also emit following
   warning message:
   ixgbe :01:00.1: partial checksum but l4 proto=32!
2. Receiving VM will drop all the corrupted ESP packets, hence UDP iperf test
   with large packets will fail completely or TCP iperf will get ridiculously
   low performance because TCP window will never grow above MTU.

Signed-off-by: Ansis Atteka 
---
 net/xfrm/xfrm_output.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 8ba29fe..7ad7e5f 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -168,7 +168,8 @@ static int xfrm_output2(struct net *net, struct sock *sk, 
struct sk_buff *skb)
 
 static int xfrm_output_gso(struct net *net, struct sock *sk, struct sk_buff 
*skb)
 {
-   struct sk_buff *segs;
+   struct sk_buff *segs, *nskb;
+   int err;
 
BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_SGO_CB_OFFSET);
BUILD_BUG_ON(sizeof(*IP6CB(skb)) > SKB_SGO_CB_OFFSET);
@@ -180,21 +181,27 @@ static int xfrm_output_gso(struct net *net, struct sock 
*sk, struct sk_buff *skb
return -EINVAL;
 
do {
-   struct sk_buff *nskb = segs->next;
-   int err;
+   nskb = segs->next;
 
segs->next = NULL;
-   err = xfrm_output2(net, sk, segs);
+   err = skb_checksum_help(segs);
+   if (unlikely(err)) {
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+   goto error;
+   }
 
+   err = xfrm_output2(net, sk, segs);
if (unlikely(err)) {
-   kfree_skb_list(nskb);
-   return err;
+   goto error;
}
 
segs = nskb;
} while (segs);
 
return 0;
+error:
+   kfree_skb_list(nskb);
+   return err;
 }
 
 int xfrm_output(struct sock *sk, struct sk_buff *skb)
-- 
1.9.1



Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Andy Gospodarek
On Sun, Apr 09, 2017 at 01:35:28PM -0700, David Miller wrote:
> 
> This provides a generic non-optimized XDP implementation when the
> device driver does not provide an optimized one.
> 
> It is arguable that perhaps I should have required something like
> this as part of the initial XDP feature merge.
> 
> I believe this is critical for two reasons:
> 
> 1) Accessibility.  More people can play with XDP with less
>dependencies.  Yes I know we have XDP support in virtio_net, but
>that just creates another depedency for learning how to use this
>facility.
> 
>I wrote this to make life easier for the XDP newbies.
> 
> 2) As a model for what the expected semantics are.  If there is a pure
>generic core implementation, it serves as a semantic example for
>driver folks adding XDP support.
> 
> This is just a rough draft and is untested.
> 
> Signed-off-by: David S. Miller 

As promised, I did some testing today with bnxt_en's implementation of
XDP and this one.

I ran this on a desktop-class system I have (i7-6700 CPU @ 3.40GHz)
and used pktgen_sample03_burst_single_flow.sh from another system to
throw ~6.5Mpps as a single UDP stream towards the system running XDP.

I just commented out the ndo_xdp op in bnxt.c to test this patch.  The
differences were pretty dramatic.

Using the ndo_xdp ndo in bnxt_en, my perf report output looked like
this (not sure why X is still running on this system!):

38.08%  swapper  [sysimgblt]   [k] 
0x5bd0
11.80%  swapper  [kernel.vmlinux]  [k] intel_idle
10.49%  swapper  [bnxt_en] [k] bnxt_rx_pkt
 6.31%  swapper  [bnxt_en] [k] bnxt_rx_xdp
 5.64%  swapper  [bnxt_en] [k] bnxt_poll
 4.22%  swapper  [kernel.vmlinux]  [k] poll_idle
 3.46%  swapper  [kernel.vmlinux]  [k] irq_entries_start
 2.95%  swapper  [kernel.vmlinux]  [k] 
napi_complete_done
 1.79%  swapper  [kernel.vmlinux]  [k] 
cpuidle_enter_state
 1.53%  swapper  [kernel.vmlinux]  [k] menu_select
 1.19%  swapper  [bnxt_en] [k] 
bnxt_reuse_rx_data
 1.00%  swapper  [sysimgblt]   [k] 
0x5c6f
 0.92%  swapper  [kernel.vmlinux]  [k] 
__next_timer_interrupt
 0.71%  swapper  [kernel.vmlinux]  [k] 
_raw_spin_lock_irqsave
 0.71%  swapper  [kernel.vmlinux]  [k] 
bpf_map_lookup_elem

mpstat reports that the CPU receiving and dropping the traffic is
basically running idle.  Dropping this amount of traffic in the driver
has very little impact on the system.

With v2 of this patch I see the following from perf report:

19.69%  ksoftirqd/3  [kernel.vmlinux]  [k] memcpy_erms
16.30%  ksoftirqd/3  [kernel.vmlinux]  [k] __bpf_prog_run
10.11%  ksoftirqd/3  [bnxt_en] [k] bnxt_rx_pkt
 7.69%  ksoftirqd/3  [kernel.vmlinux]  [k] __build_skb
 4.25%  ksoftirqd/3  [kernel.vmlinux]  [k] inet_gro_receive
 3.74%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_alloc
 3.53%  ksoftirqd/3  [kernel.vmlinux]  [k] dev_gro_receive
 3.43%  ksoftirqd/3  [kernel.vmlinux]  [k] page_frag_free
 3.12%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_free
 2.56%  ksoftirqd/3  [bnxt_en] [k] bnxt_poll
 2.46%  ksoftirqd/3  [kernel.vmlinux]  [k] 
netif_receive_skb_internal
 2.13%  ksoftirqd/3  [kernel.vmlinux]  [k] __udp4_lib_lookup
 1.63%  ksoftirqd/3  [kernel.vmlinux]  [k] __napi_alloc_skb
 1.51%  ksoftirqd/3  [kernel.vmlinux]  [k] eth_type_trans
 1.42%  ksoftirqd/3  [kernel.vmlinux]  [k] udp_gro_receive
 1.29%  ksoftirqd/3  [kernel.vmlinux]  [k] napi_gro_receive
 1.25%  ksoftirqd/3  [kernel.vmlinux]  [k] udp4_gro_receive
 1.18%  ksoftirqd/3  [bnxt_en] [k] bnxt_rx_xdp
 1.17%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_data
 1.11%  ksoftirqd/3  [bnxt_en] [k] 
bnxt_reuse_rx_data
 1.07%  ksoftirqd/3  [kernel.vmlinux]  [k] net_rx_action

mpstat reports that the CPU receiving and dropping the traffic is about
98% utilized running in softirq context.

Unsurprisingly, in-driver XDP is significantly more efficient than in-stack
XDP.

I also noted that...

xdp1 not parsing the protocol correctly.   All traffic was showing up as
protocol '0' instead of '17' in the output.  

xdp2 was not actually sending the frames back out when using this patch.

I'll take a look in a bit and see if 

Re: [PATCH v3 1/5] netlink: extended ACK reporting

2017-04-10 Thread Johannes Berg
On Mon, 2017-04-10 at 17:40 +0200, Johannes Berg wrote:
> 
> Another thought: if we add a new flag that indicates "message has
> been capped", which we introduce in this same patch, then we can
> disentangle this more easily, right?
> 
> Adding a new flag for "TLVs present" won't really help, but if you
> know the message was capped then you know the TLVs start after the
> inner nlmsghdr and you ignore that header's nlmsg_len.

Actually, the flag should be set if (and only if) the message was
capped *and* TLVs were requested (or present, doesn't matter.)

That way it becomes completely backward compatible and stateless:
 * on kernels that don't have extack you can ignore the setsockopt
   failure
 * checking if TLVs are present becomes
   flag set ||
   nlh->nlmsg_len > sizeof(*nlh) + sizeof(int) +
sizeof(*inner_nlh) + inner_nlh->nlmsg_len
 * TLV start offset is
   tlv_start_offs = sizeof(*nlh) + sizeof(int) + sizeof(inner_nlh)
   if (flag set)
   tlv_start_offs += inner_nlh->nlmsg_len

I need to resend anyway so I'll add that tomorrow.

johannes


[PATCH net-next] net: stmmac: set total length of the packet to be transmitted in TDES3

2017-04-10 Thread Niklas Cassel
From: Niklas Cassel 

Field FL/TPL in register TDES3 is not correctly set on GMAC4.
TX appears to be functional on GMAC 4.10a even if this field is not set,
however, to avoid relying on undefined behavior, set the length in TDES3.

The field has a different meaning depending on if the TSE bit in TDES3
is set or not (TSO). However, regardless of the TSE bit, the field is
not optional. The field is already set correctly when the TSE bit is set.

Since there is no limit for the number of descriptors that can be
used for a single packet, the field should be set to the sum of
the buffers contained in:
[ ...  ...
], which should be equal to skb->len.

Signed-off-by: Niklas Cassel 
---
 drivers/net/ethernet/stmicro/stmmac/chain_mode.c   | 6 +++---
 drivers/net/ethernet/stmicro/stmmac/common.h   | 2 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c | 3 ++-
 drivers/net/ethernet/stmicro/stmmac/enh_desc.c | 2 +-
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c| 2 +-
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c| 9 ++---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 5 +++--
 7 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c 
b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
index 37881f81319e..e93c40b4631e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
@@ -52,7 +52,7 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int 
csum)
tx_q->tx_skbuff_dma[entry].len = bmax;
/* do not close the descriptor and do not set own bit */
priv->hw->desc->prepare_tx_desc(desc, 1, bmax, csum, STMMAC_CHAIN_MODE,
-   0, false);
+   0, false, skb->len);
 
while (len != 0) {
tx_q->tx_skbuff[entry] = NULL;
@@ -70,7 +70,7 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int 
csum)
tx_q->tx_skbuff_dma[entry].len = bmax;
priv->hw->desc->prepare_tx_desc(desc, 0, bmax, csum,
STMMAC_CHAIN_MODE, 1,
-   false);
+   false, skb->len);
len -= bmax;
i++;
} else {
@@ -85,7 +85,7 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int 
csum)
/* last descriptor can be set now */
priv->hw->desc->prepare_tx_desc(desc, 0, len, csum,
STMMAC_CHAIN_MODE, 1,
-   true);
+   true, skb->len);
len = 0;
}
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 90d28bcad880..b7ce3fbb5375 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -373,7 +373,7 @@ struct stmmac_desc_ops {
/* Invoked by the xmit function to prepare the tx descriptor */
void (*prepare_tx_desc) (struct dma_desc *p, int is_fs, int len,
 bool csum_flag, int mode, bool tx_own,
-bool ls);
+bool ls, unsigned int tot_pkt_len);
void (*prepare_tso_tx_desc)(struct dma_desc *p, int is_fs, int len1,
int len2, bool tx_own, bool ls,
unsigned int tcphdrlen,
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
index 843ec69222ea..aa6476439aee 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
@@ -304,12 +304,13 @@ static void dwmac4_rd_init_tx_desc(struct dma_desc *p, 
int mode, int end)
 
 static void dwmac4_rd_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
  bool csum_flag, int mode, bool tx_own,
- bool ls)
+ bool ls, unsigned int tot_pkt_len)
 {
unsigned int tdes3 = le32_to_cpu(p->des3);
 
p->des2 |= cpu_to_le32(len & TDES2_BUFFER1_SIZE_MASK);
 
+   tdes3 |= tot_pkt_len & TDES3_PACKET_SIZE_MASK;
if (is_fs)
tdes3 |= TDES3_FIRST_DESCRIPTOR;
else
diff --git a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c 
b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
index 323b59ec74a3..7546b3664113 100644
--- a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
@@ -315,7 +315,7 @@ static 

RFC: Checksum offload and XDP

2017-04-10 Thread Tom Herbert
Not having checksum offload in XDP is going to get more painful once
we start seeing a lot programs doing packet modifications. One nice
thing we do for ILA router is pre-compute the checksum delta necessary
to maintain checksum neutral property in the packet. So that after
doing ILA routing in XDP the checksum complete value is still valid as
is the transport layer checksum.

It's conceivable we could generalize this by having a u16 checksum
delta returned from XDP program. If the checksum diff can be
pre-computed in a structure for doing the translation, then there
should be little cost other than making API a little more complex. On
return the checksum_complete value is updated jusy by adding in the
diff value.

Tom


Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma

2017-04-10 Thread Steve Wise

On 4/2/2017 8:41 AM, Sagi Grimberg wrote:

This patch set is aiming to automatically find the optimal
queue <-> irq multi-queue assignments in storage ULPs (demonstrated
on nvme-rdma) based on the underlying rdma device irq affinity
settings.

First two patches modify mlx5 core driver to use generic API
to allocate array of irq vectors with automatic affinity
settings instead of open-coding exactly what it does (and
slightly worse).

Then, in order to obtain an affinity map for a given completion
vector, we expose a new RDMA core API, and implement it in mlx5.

The third part is addition of a rdma-based queue mapping helper
to blk-mq that maps the tagset hctx's according to the device
affinity mappings.

I'd happily convert some more drivers, but I'll need volunteers
to test as I don't have access to any other devices.


I'll test cxgb4 if you convert it. :)



Re: net/ipv4: use-after-free in ip_queue_xmit

2017-04-10 Thread Andrey Konovalov
On Mon, Apr 10, 2017 at 7:42 PM, Cong Wang  wrote:
> On Mon, Apr 10, 2017 at 7:40 AM, Andrey Konovalov  
> wrote:
>> Hi,
>>
>> I've got the following error report while fuzzing the kernel with syzkaller.
>>
>> On commit 39da7c509acff13fc8cb12ec1bb20337c988ed36 (4.11-rc6).
>>
>> Unfortunately it's not reproducible.
>>
>> BUG: KASAN: use-after-free in ip_select_ttl include/net/dst.h:176
>> [inline] at addr 88006ab3602c
>> BUG: KASAN: use-after-free in ip_queue_xmit+0x1817/0x1a30
>> net/ipv4/ip_output.c:485 at addr 88006ab3602c
>> Read of size 4 by task syz-executor1/12627
>> CPU: 3 PID: 12627 Comm: syz-executor1 Not tainted 4.11.0-rc6+ #206
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:16 [inline]
>>  dump_stack+0x292/0x398 lib/dump_stack.c:52
>>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>>  print_address_description mm/kasan/report.c:202 [inline]
>>  kasan_report_error mm/kasan/report.c:291 [inline]
>>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
>>  ip_select_ttl include/net/dst.h:176 [inline]
>
> Probably same as the one you reported on ipv4_mtu(), it would
> be nice if you could test the patch I proposed:
>
> https://patchwork.ozlabs.org/patch/747556/

Applied your patch.

The bug gets triggered very rarely (only twice so far), but I'll let
you know if I see it again.

Thanks!

>
>
> Thanks!
>
>>  ip_queue_xmit+0x1817/0x1a30 net/ipv4/ip_output.c:485
>>  sctp_v4_xmit+0x10d/0x140 net/sctp/protocol.c:994
>>  sctp_packet_transmit+0x215c/0x3560 net/sctp/output.c:637
>>  sctp_outq_flush+0xade/0x3f90 net/sctp/outqueue.c:885
>>  sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:750
>>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1773 [inline]
>>  sctp_side_effects net/sctp/sm_sideeffect.c:1175 [inline]
>>  sctp_do_sm+0x5a0/0x6a50 net/sctp/sm_sideeffect.c:1147
>>  sctp_primitive_ASSOCIATE+0x9d/0xd0 net/sctp/primitive.c:88
>>  sctp_sendmsg+0x270d/0x3b50 net/sctp/socket.c:1954
>>  inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
>>  sock_sendmsg_nosec net/socket.c:633 [inline]
>>  sock_sendmsg+0xca/0x110 net/socket.c:643
>>  SYSC_sendto+0x660/0x810 net/socket.c:1696
>>  SyS_sendto+0x40/0x50 net/socket.c:1664
>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
>> RIP: 0033:0x4458d9
>> RSP: 002b:7fdceca85b58 EFLAGS: 0282 ORIG_RAX: 002c
>> RAX: ffda RBX: 0016 RCX: 004458d9
>> RDX: 0087 RSI: 20003000 RDI: 0016
>> RBP: 006e2fe0 R08: 20003000 R09: 0010
>> R10: 00040841 R11: 0282 R12: 007080a8
>> R13: 000a R14: 0005 R15: 0084
>> Object at 88006ab36008, in cache kmalloc-64 size: 64
>> Allocated:
>> PID = 7243
>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>>  set_track mm/kasan/kasan.c:525 [inline]
>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>>  kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745
>>  kmalloc include/linux/slab.h:490 [inline]
>>  kzalloc include/linux/slab.h:663 [inline]
>>  fib_create_info+0x8e0/0x3a30 net/ipv4/fib_semantics.c:1040
>>  fib_table_insert+0x1a5/0x1550 net/ipv4/fib_trie.c:1221
>>  ip_rt_ioctl+0xddc/0x1590 net/ipv4/fib_frontend.c:597
>>  inet_ioctl+0xf2/0x1c0 net/ipv4/af_inet.c:882
>>  sock_do_ioctl+0x65/0xb0 net/socket.c:906
>>  sock_ioctl+0x28f/0x440 net/socket.c:1004
>>  vfs_ioctl fs/ioctl.c:45 [inline]
>>  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
>>  SYSC_ioctl fs/ioctl.c:700 [inline]
>>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
>> Freed:
>> PID = 12622
>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>>  set_track mm/kasan/kasan.c:525 [inline]
>>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>>  slab_free_hook mm/slub.c:1357 [inline]
>>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>>  slab_free mm/slub.c:2961 [inline]
>>  kfree+0xe8/0x2b0 mm/slub.c:3882
>>  free_fib_info_rcu+0x4ba/0x5e0 net/ipv4/fib_semantics.c:218
>>  __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
>>  rcu_do_batch.isra.64+0x947/0xcc0 kernel/rcu/tree.c:2879
>>  invoke_rcu_callbacks kernel/rcu/tree.c:3142 [inline]
>>  __rcu_process_callbacks kernel/rcu/tree.c:3109 [inline]
>>  rcu_process_callbacks+0x2cc/0xb90 kernel/rcu/tree.c:3126
>>  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
>> Memory state around the buggy address:
>>  88006ab35f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>  88006ab35f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>>88006ab36000: fc fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
>>   ^
>>  88006ab36080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>  88006ab36100: fc fc fc fc fc fc fc fc fc fc fc fc 

Re: net/ipv4: use-after-free in ip_queue_xmit

2017-04-10 Thread Cong Wang
On Mon, Apr 10, 2017 at 7:40 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 39da7c509acff13fc8cb12ec1bb20337c988ed36 (4.11-rc6).
>
> Unfortunately it's not reproducible.
>
> BUG: KASAN: use-after-free in ip_select_ttl include/net/dst.h:176
> [inline] at addr 88006ab3602c
> BUG: KASAN: use-after-free in ip_queue_xmit+0x1817/0x1a30
> net/ipv4/ip_output.c:485 at addr 88006ab3602c
> Read of size 4 by task syz-executor1/12627
> CPU: 3 PID: 12627 Comm: syz-executor1 Not tainted 4.11.0-rc6+ #206
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x292/0x398 lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>  print_address_description mm/kasan/report.c:202 [inline]
>  kasan_report_error mm/kasan/report.c:291 [inline]
>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
>  ip_select_ttl include/net/dst.h:176 [inline]

Probably same as the one you reported on ipv4_mtu(), it would
be nice if you could test the patch I proposed:

https://patchwork.ozlabs.org/patch/747556/


Thanks!

>  ip_queue_xmit+0x1817/0x1a30 net/ipv4/ip_output.c:485
>  sctp_v4_xmit+0x10d/0x140 net/sctp/protocol.c:994
>  sctp_packet_transmit+0x215c/0x3560 net/sctp/output.c:637
>  sctp_outq_flush+0xade/0x3f90 net/sctp/outqueue.c:885
>  sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:750
>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1773 [inline]
>  sctp_side_effects net/sctp/sm_sideeffect.c:1175 [inline]
>  sctp_do_sm+0x5a0/0x6a50 net/sctp/sm_sideeffect.c:1147
>  sctp_primitive_ASSOCIATE+0x9d/0xd0 net/sctp/primitive.c:88
>  sctp_sendmsg+0x270d/0x3b50 net/sctp/socket.c:1954
>  inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  SYSC_sendto+0x660/0x810 net/socket.c:1696
>  SyS_sendto+0x40/0x50 net/socket.c:1664
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> RIP: 0033:0x4458d9
> RSP: 002b:7fdceca85b58 EFLAGS: 0282 ORIG_RAX: 002c
> RAX: ffda RBX: 0016 RCX: 004458d9
> RDX: 0087 RSI: 20003000 RDI: 0016
> RBP: 006e2fe0 R08: 20003000 R09: 0010
> R10: 00040841 R11: 0282 R12: 007080a8
> R13: 000a R14: 0005 R15: 0084
> Object at 88006ab36008, in cache kmalloc-64 size: 64
> Allocated:
> PID = 7243
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>  kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745
>  kmalloc include/linux/slab.h:490 [inline]
>  kzalloc include/linux/slab.h:663 [inline]
>  fib_create_info+0x8e0/0x3a30 net/ipv4/fib_semantics.c:1040
>  fib_table_insert+0x1a5/0x1550 net/ipv4/fib_trie.c:1221
>  ip_rt_ioctl+0xddc/0x1590 net/ipv4/fib_frontend.c:597
>  inet_ioctl+0xf2/0x1c0 net/ipv4/af_inet.c:882
>  sock_do_ioctl+0x65/0xb0 net/socket.c:906
>  sock_ioctl+0x28f/0x440 net/socket.c:1004
>  vfs_ioctl fs/ioctl.c:45 [inline]
>  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
>  SYSC_ioctl fs/ioctl.c:700 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Freed:
> PID = 12622
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>  slab_free_hook mm/slub.c:1357 [inline]
>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>  slab_free mm/slub.c:2961 [inline]
>  kfree+0xe8/0x2b0 mm/slub.c:3882
>  free_fib_info_rcu+0x4ba/0x5e0 net/ipv4/fib_semantics.c:218
>  __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
>  rcu_do_batch.isra.64+0x947/0xcc0 kernel/rcu/tree.c:2879
>  invoke_rcu_callbacks kernel/rcu/tree.c:3142 [inline]
>  __rcu_process_callbacks kernel/rcu/tree.c:3109 [inline]
>  rcu_process_callbacks+0x2cc/0xb90 kernel/rcu/tree.c:3126
>  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
> Memory state around the buggy address:
>  88006ab35f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  88006ab35f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>88006ab36000: fc fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
>   ^
>  88006ab36080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  88006ab36100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==


[PATCH 2/3] netfilter: ipvs: Replace kzalloc with kcalloc.

2017-04-10 Thread Simon Horman
From: Varsha Rao 

Replace kzalloc with kcalloc. As kcalloc is preferred for allocating an
array instead of kzalloc. This patch fixes the checkpatch issue.

Signed-off-by: Varsha Rao 
---
 net/netfilter/ipvs/ip_vs_sync.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index b03c28084f81..30d6b2cc00a0 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1849,7 +1849,7 @@ int start_sync_thread(struct netns_ipvs *ipvs, struct 
ipvs_sync_daemon_cfg *c,
if (state == IP_VS_STATE_MASTER) {
struct ipvs_master_sync_state *ms;
 
-   ipvs->ms = kzalloc(count * sizeof(ipvs->ms[0]), GFP_KERNEL);
+   ipvs->ms = kcalloc(count, sizeof(ipvs->ms[0]), GFP_KERNEL);
if (!ipvs->ms)
goto out;
ms = ipvs->ms;
@@ -1862,7 +1862,7 @@ int start_sync_thread(struct netns_ipvs *ipvs, struct 
ipvs_sync_daemon_cfg *c,
ms->ipvs = ipvs;
}
} else {
-   array = kzalloc(count * sizeof(struct task_struct *),
+   array = kcalloc(count, sizeof(struct task_struct *),
GFP_KERNEL);
if (!array)
goto out;
-- 
2.7.0.rc3.207.g0ac5344



[PATCH 1/3] netfilter: ipvs: don't check for presence of nat extension

2017-04-10 Thread Simon Horman
From: Florian Westphal 

Check for the NAT status bits, they are set once conntrack needs NAT in source 
or
reply direction, this is slightly faster than nfct_nat() as that has to check 
the
extension area.

Signed-off-by: Florian Westphal 
---
 net/netfilter/ipvs/ip_vs_ftp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
index d30c327bb578..2e2bf7428cd1 100644
--- a/net/netfilter/ipvs/ip_vs_ftp.c
+++ b/net/netfilter/ipvs/ip_vs_ftp.c
@@ -260,7 +260,7 @@ static int ip_vs_ftp_out(struct ip_vs_app *app, struct 
ip_vs_conn *cp,
buf_len = strlen(buf);
 
ct = nf_ct_get(skb, );
-   if (ct && !nf_ct_is_untracked(ct) && nfct_nat(ct)) {
+   if (ct && !nf_ct_is_untracked(ct) && (ct->status & 
IPS_NAT_MASK)) {
/* If mangling fails this function will return 0
 * which will cause the packet to be dropped.
 * Mangling can only fail under memory pressure,
-- 
2.7.0.rc3.207.g0ac5344



[PATCH 3/3] ipvs: remove unused variable

2017-04-10 Thread Simon Horman
From: Arushi Singhal 

This patch uses the following coccinelle script to remove
a variable that was simply used to store the return
value of a function call before returning it:

@@
identifier len,f;
@@

-int len;
 ... when != len
 when strict
-len =
+return
f(...);
-return len;

Signed-off-by: Arushi Singhal 
Signed-off-by: Simon Horman 
---
 net/netfilter/ipvs/ip_vs_ftp.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
index 2e2bf7428cd1..6caf4459e981 100644
--- a/net/netfilter/ipvs/ip_vs_ftp.c
+++ b/net/netfilter/ipvs/ip_vs_ftp.c
@@ -482,11 +482,8 @@ static struct pernet_operations ip_vs_ftp_ops = {
 
 static int __init ip_vs_ftp_init(void)
 {
-   int rv;
-
-   rv = register_pernet_subsys(_vs_ftp_ops);
/* rcu_barrier() is called by netns on error */
-   return rv;
+   return register_pernet_subsys(_vs_ftp_ops);
 }
 
 /*
-- 
2.7.0.rc3.207.g0ac5344



[GIT 0/3] Second Round of IPVS Updates for v4.12

2017-04-10 Thread Simon Horman
Hi Pablo,

please consider these clean-ups and enhancements to IPVS for v4.12.

* Removal unused variable
* Use kzalloc where appropriate
* More efficient detection of presence of NAT extension


The following changes since commit 592d42ac7fd36408979e09bf2f170f2595dab7b8:

  Merge branch 'qed-IOV-cleanups' (2017-03-21 19:02:38 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git 
ipvs2-for-v4.12

for you to fetch changes up to e24113769960980579610ecfd657bb17f19373a3:

  ipvs: remove unused variable (2017-03-30 14:54:47 +0200)


Arushi Singhal (1):
  ipvs: remove unused variable

Florian Westphal (1):
  netfilter: ipvs: don't check for presence of nat extension

Varsha Rao (1):
  netfilter: ipvs: Replace kzalloc with kcalloc.

 net/netfilter/ipvs/ip_vs_ftp.c  | 7 ++-
 net/netfilter/ipvs/ip_vs_sync.c | 4 ++--
 2 files changed, 4 insertions(+), 7 deletions(-)


Re: [PATCH v2 net-next RFC] Generic XDP

2017-04-10 Thread Willem de Bruijn
>>  static int netif_receive_skb_internal(struct sk_buff *skb)
>>  {
>>   int ret;
>> @@ -4258,6 +4336,21 @@ static int netif_receive_skb_internal(struct sk_buff 
>> *skb)
>>
>>   rcu_read_lock();
>>
>> + if (static_key_false(_xdp_needed)) {
>> + struct bpf_prog *xdp_prog = 
>> rcu_dereference(skb->dev->xdp_prog);
>> +
>> + if (xdp_prog) {
>> + u32 act = netif_receive_generic_xdp(skb, xdp_prog);
>
> That's indeed the best attachment point in the stack.
> I was trying to see whether it can be lowered into something like
> dev_gro_receive(), but not everyone calls it.

It would be a helpful (follow-on) optimization for packets that do
pass through it. It allows skb recycling with napi_reuse_skb and can
be used to protect if a vulnerability in the gro stack pops up.

> Another option to put it into eth_type_trans() itself, then
> there are no problems with gro, l2 headers, and adjust_head,
> but changing all drivers is too much.
>
>> +
>> + if (act != XDP_PASS) {
>> + rcu_read_unlock();
>> + if (act == XDP_TX)
>> + dev_queue_xmit(skb);
>
> It should be fine. For cls_bpf we do recursion check __bpf_tx_skb()
> but I forgot specific details. May be here it's fine as-is.
> Daniel, do we need recursion check here?

That limiter is for egress redirecting to egress, I believe. This
ingress to egress will go through netif_rx and a softirq if looping.

Another point on redirect is clearing skb state. queue_mapping and
sender_cpu will be dirty, but should be able to handle it. It seems
possible to attach to a virtual device, such as a tunnel. In that case
the packet may have gone through a complex receive path before
reaching the tunnel, including tc ingress, so even more skb fields may
be set (e.g., priority). The same holds for act_mirred or
__bpf_redirect, so I assume that this is safe.


Re: [PATCH] ipv6: Fix idev->addr_list corruption

2017-04-10 Thread David Ahern
On 4/10/17 12:36 AM, Rabin Vincent wrote:
> From: Rabin Vincent 
> 
> addrconf_ifdown() removes elements from the idev->addr_list without
> holding the idev->lock.
> 
> If this happens while the loop in __ipv6_dev_get_saddr() is handling the
> same element, that function ends up in an infinite loop:
> 
>   NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [test:1719]
>   Call Trace:
>ipv6_get_saddr_eval+0x13c/0x3a0
>__ipv6_dev_get_saddr+0xe4/0x1f0
>ipv6_dev_get_saddr+0x1b4/0x204
>ip6_dst_lookup_tail+0xcc/0x27c
>ip6_dst_lookup_flow+0x38/0x80
>udpv6_sendmsg+0x708/0xba8
>sock_sendmsg+0x18/0x30
>SyS_sendto+0xb8/0xf8
>syscall_common+0x34/0x58
> 
> Fixes: 6a923934c33 (Revert "ipv6: Revert optional address flusing on ifdown.")
> Signed-off-by: Rabin Vincent 
> ---
>  net/ipv6/addrconf.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 3631725..80ce478 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -3626,14 +3626,19 @@ static int addrconf_ifdown(struct net_device *dev, 
> int how)
>   INIT_LIST_HEAD(_list);
>   list_for_each_entry_safe(ifa, tmp, >addr_list, if_list) {
>   struct rt6_info *rt = NULL;
> + bool keep;
>  
>   addrconf_del_dad_work(ifa);
>  
> + keep = keep_addr && (ifa->flags & IFA_F_PERMANENT) &&
> + !addr_is_local(>addr);
> + if (!keep)
> + list_move(>if_list, _list);
> +
>   write_unlock_bh(>lock);

yes, the list manipulation should be done under the idev->lock. thanks
for fixing.

Acked-by: David Ahern 



Re: [PATCH net-next v3 2/2] bpf: remove struct bpf_map_type_list

2017-04-10 Thread Daniel Borkmann

On 04/10/2017 02:44 PM, Johannes Berg wrote:

From: Johannes Berg 

There's no need to have struct bpf_map_type_list since
it just contains a list_head, the type, and the ops
pointer. Since the types are densely packed and not
actually dynamically registered, it's much easier and
smaller to have an array of type->ops pointer. Also
initialize this array statically to remove code needed
to initialize it.

In order to save duplicating the list, move it to the
types header file added by the previous patch and
include it in the same fashion.

Signed-off-by: Johannes Berg 


Thanks!

Acked-by: Daniel Borkmann 


Re: [PATCH net-next v3 1/2] bpf: remove struct bpf_prog_type_list

2017-04-10 Thread Daniel Borkmann

On 04/10/2017 02:44 PM, Johannes Berg wrote:

From: Johannes Berg 

There's no need to have struct bpf_prog_type_list since
it just contains a list_head, the type, and the ops
pointer. Since the types are densely packed and not
actually dynamically registered, it's much easier and
smaller to have an array of type->ops pointer. Also
initialize this array statically to remove code needed
to initialize it.

In order to save duplicating the list, move it to a new
header file and include it in the places needing it.

Signed-off-by: Johannes Berg 


Looks awesome!

Acked-by: Daniel Borkmann 


[PATCH net-next] cxgb4: save tid while creating server filter

2017-04-10 Thread Ganesh Goudar
Save the filter tid while creating the server filter, which is used
later to retrieve the corresponding filter instance while handling
the filter reply.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index afb0967..aa71019 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2338,6 +2338,10 @@ int cxgb4_create_server_filter(const struct net_device 
*dev, unsigned int stid,
f->locked = 1;
f->fs.rpttid = 1;
 
+   /* Save the actual tid. We need this to get the corresponding
+* filter entry structure in filter_rpl.
+*/
+   f->tid = stid + adap->tids.ftid_base;
ret = set_filter_wr(adap, stid);
if (ret) {
clear_filter(adap, f);
-- 
2.1.0



Re: [PATCH net-next 1/8] rtnetlink: Do not generate notifications for MTU events

2017-04-10 Thread David Ahern
On 4/10/17 9:39 AM, Vlad Yasevich wrote:
> OK, so this will work for the events that are generated as a result of device 
> state change
> (like mtu, address, and others).
> 
> However, the original event data may be needed for other events that may be
> of use to userspace like NETDEV_NOTIFY_PEERS and NETDEV_RESEND_IGMP (possibly 
> others...)

sure. My objection is to multiple messages with identical content.

I think the rtnetlink_event message is unique for those 2 netdev events,
so no objections if it has value.


Re: [PATCH v3 1/5] netlink: extended ACK reporting

2017-04-10 Thread Johannes Berg
On Mon, 2017-04-10 at 09:35 -0600, David Ahern wrote:

> > Do you have any better ideas?
> 
> NETLINK_F_CAP_ACK and NETLINK_F_EXT_ACK should be incompatible -- if
> one is set the other can not be set. CAP_ACK means abbreviate the
> response and EXT_ACK means give me more data.

So you mean if I want EXT_ACK I cannot also cap the message?

I guess that's doable, but again wouldn't it hurt applications that
want to use CAP? I assume every application wants to use EXT_ACK
eventually, and those that use CAP right now wouldn't be able to.

Another thought: if we add a new flag that indicates "message has been
capped", which we introduce in this same patch, then we can disentangle
this more easily, right?

Adding a new flag for "TLVs present" won't really help, but if you know
the message was capped then you know the TLVs start after the inner
nlmsghdr and you ignore that header's nlmsg_len.

johannes


Re: [PATCH net-next 1/8] rtnetlink: Do not generate notifications for MTU events

2017-04-10 Thread Vlad Yasevich
On 04/08/2017 02:18 PM, Roopa Prabhu wrote:
> On 4/8/17, 11:13 AM, David Ahern wrote:
>> On 4/8/17 2:06 PM, Roopa Prabhu wrote:
>>> On 4/7/17, 2:25 PM, David Ahern wrote:
 Changing MTU on a link currently causes 3 messages to be sent to userspace:

 [LINK]11: dummy1:  mtu 1490 qdisc noqueue 
 state UNKNOWN group default event PRE_CHANGE_MTU
 link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff

 [LINK]11: dummy1:  mtu 1500 qdisc noqueue 
 state UNKNOWN group default event CHANGE_MTU
 link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff

 [LINK]11: dummy1:  mtu 1500 qdisc noqueue 
 state UNKNOWN group default
 link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff

 Remove the PRE_CHANGE_MTU and CHANGE_MTU messages.


>>> This change is good... multiple notifications for the same event does not 
>>> help in large scale links setups. However, this
>>> reverts what vlad was trying to do with his patchset. Vlad's patch-set 
>>> relies on the rtnl notifications generated from
>>> notifiers (rtnetlink_event) to add  specific event (IFLA_EVENT) in 
>>> notifications.
>>>
>>> The third notification in your example above is the correct one and is an 
>>> aggregate notification for a set of changes, but
>>> it cannot really fill in all types of events in the single IFLA_EVENT 
>>> attribute as it stands today.  IFLA_EVENT should be
>>> a bitmask to include all events in this case (i had indicated this in vlads 
>>> first version).
>>>
>> Agreed. I think it would be best to revert def12888c161 before the UAPI
>> goes out.
>>
>> The change can instead add the IFLA_EVENT as a bitmask mentioned here to
>> note the changes in a setlink. On top of that, remove the notifications
>> for the events I mentioned in this set to reduce the overhead on userspace.
> 
> ack
> 

OK, so this will work for the events that are generated as a result of device 
state change
(like mtu, address, and others).

However, the original event data may be needed for other events that may be
of use to userspace like NETDEV_NOTIFY_PEERS and NETDEV_RESEND_IGMP (possibly 
others...)

-vlad


Re: Non-standard TCP stack processing of packets with unacceptable ACK numbers

2017-04-10 Thread Eric Dumazet
On Sat, 2017-04-08 at 22:29 +0200, Paul Fiterau Brostean wrote:
> Hello,
> 
> My name is Paul Fiterau, I am a PhD student at Radboud University whose 
> focus for the past few years has been among others to develop and apply 
> inference techniques on TCP stacks in order to obtain nice models, and 
> to verify them if possible using formal methods.  We contacted you on 
> something similar 2 years back.
> 
> The older (3.19 kernel release) Linux TCP stack we analyze exhibits 
> behavior that seems odd to me. The scenario is as follows (all packets 
> have empty payloads, no window scaling, rcv/snd window size should not 
> be a factor):
> 
>TEST HARNESS (CLIENT)LINUX SERVER
> 
>1.  -  LISTEN (server listen, then 
> accepts)
> 
>2.  - --> 

[PATCH 1/1] drivers: net: usb: qmi_wwan: add QMI_QUIRK_SET_DTR for Telit PID 0x1201

2017-04-10 Thread Daniele Palmas
Telit LE920A4 uses the same pid 0x1201 of LE920, but modem
implementation is different, since it requires DTR to be set for
answering to qmi messages.

This patch replaces QMI_FIXED_INTF with QMI_QUIRK_SET_DTR: tests on
LE920 have been performed in order to verify backward compatibility.

Signed-off-by: Daniele Palmas 
---

Hi Bjørn,

as a side note in latest kernels I had troubles with qmi devices
(e.g. I/O error when using qmicli).

I found your suggestion in libqmi mailing list to revert commit

833415a3e781a26fe480a34d45086bdb4fe1e4c0
cdc-wdm: fix "out-of-sync" due to missing notifications

and this seems to work.

Regards,
Daniele

---
 drivers/net/usb/qmi_wwan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 156f7f8..2474618 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -908,7 +908,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x2357, 0x9000, 4)},/* TP-LINK MA260 */
{QMI_QUIRK_SET_DTR(0x1bc7, 0x1040, 2)}, /* Telit LE922A */
{QMI_FIXED_INTF(0x1bc7, 0x1200, 5)},/* Telit LE920 */
-   {QMI_FIXED_INTF(0x1bc7, 0x1201, 2)},/* Telit LE920 */
+   {QMI_QUIRK_SET_DTR(0x1bc7, 0x1201, 2)}, /* Telit LE920, LE920A4 */
{QMI_FIXED_INTF(0x1c9e, 0x9b01, 3)},/* XS Stick W100-2 from 4G 
Systems */
{QMI_FIXED_INTF(0x0b3c, 0xc000, 4)},/* Olivetti Olicard 100 */
{QMI_FIXED_INTF(0x0b3c, 0xc001, 4)},/* Olivetti Olicard 120 */
-- 
2.7.4



Re: [PATCH v3 1/5] netlink: extended ACK reporting

2017-04-10 Thread David Ahern
On 4/10/17 9:30 AM, Johannes Berg wrote:
> On Mon, 2017-04-10 at 09:26 -0600, David Ahern wrote:
>> On 4/8/17 2:24 PM, Johannes Berg wrote:
>>> @@ -2300,14 +2332,35 @@ void netlink_ack(struct sk_buff *in_skb,
>>> struct nlmsghdr *nlh, int err)
>>>   NLMSG_ERROR, payload, 0);
>>> errmsg = nlmsg_data(rep);
>>> errmsg->error = err;
>>> -   memcpy(>msg, nlh, payload > sizeof(*errmsg) ? nlh-
 nlmsg_len : sizeof(*nlh));
>>> +   memcpy(>msg, nlh,
>>> +  !(nlk->flags & NETLINK_F_CAP_ACK) ? nlh->nlmsg_len
>>> +: sizeof(*nlh));
>>> +
>>
>> generically this makes userspace parsing more problematic: the
>> parsing layer may not know if the socket option has been set to
>> precisely know the size of errmsg->msg and how much data needs to be
>> skipped to get to the new attributes.
> 
> Yes, I know. I'd hope that userspace can remember that per socket - I
> don't see a good other way to do this.
> 
> If we insert the TLVs in front of, or instead of (with a TLV containing
> it), the request message then at least libnl's debugging will need to
> be changed.
> 
> As it is, I can assume that libnl will not set the CAP setting, and
> everything works fine even if I don't change libnl, which makes things
> easier.
> 
> Do you have any better ideas?

NETLINK_F_CAP_ACK and NETLINK_F_EXT_ACK should be incompatible -- if one
is set the other can not be set. CAP_ACK means abbreviate the response
and EXT_ACK means give me more data.


Re: [PATCH v3 1/5] netlink: extended ACK reporting

2017-04-10 Thread Johannes Berg
On Mon, 2017-04-10 at 09:26 -0600, David Ahern wrote:
> On 4/8/17 2:24 PM, Johannes Berg wrote:
> > @@ -2300,14 +2332,35 @@ void netlink_ack(struct sk_buff *in_skb,
> > struct nlmsghdr *nlh, int err)
> >       NLMSG_ERROR, payload, 0);
> >     errmsg = nlmsg_data(rep);
> >     errmsg->error = err;
> > -   memcpy(>msg, nlh, payload > sizeof(*errmsg) ? nlh-
> > >nlmsg_len : sizeof(*nlh));
> > +   memcpy(>msg, nlh,
> > +      !(nlk->flags & NETLINK_F_CAP_ACK) ? nlh->nlmsg_len
> > +    : sizeof(*nlh));
> > +
> 
> generically this makes userspace parsing more problematic: the
> parsing layer may not know if the socket option has been set to
> precisely know the size of errmsg->msg and how much data needs to be
> skipped to get to the new attributes.

Yes, I know. I'd hope that userspace can remember that per socket - I
don't see a good other way to do this.

If we insert the TLVs in front of, or instead of (with a TLV containing
it), the request message then at least libnl's debugging will need to
be changed.

As it is, I can assume that libnl will not set the CAP setting, and
everything works fine even if I don't change libnl, which makes things
easier.

Do you have any better ideas?

johannes


[PATCH 6/6] net: mvmdio: allow up to three clocks to be specified for orion-mdio

2017-04-10 Thread Russell King
Allow up to three clocks to be specified and enabled for the orion-mdio
interface, which are required for this interface to be accessible on
Armada 8k platforms.

Signed-off-by: Russell King 
---
 drivers/net/ethernet/marvell/mvmdio.c | 33 -
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c 
b/drivers/net/ethernet/marvell/mvmdio.c
index 614dfde657fe..90a60b98c28e 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -53,7 +53,7 @@
 struct orion_mdio_dev {
struct mutex lock;
void __iomem *regs;
-   struct clk *clk;
+   struct clk *clk[3];
/*
 * If we have access to the error interrupt pin (which is
 * somewhat misnamed as it not only reflects internal errors
@@ -187,7 +187,7 @@ static int orion_mdio_probe(struct platform_device *pdev)
struct resource *r;
struct mii_bus *bus;
struct orion_mdio_dev *dev;
-   int ret;
+   int i, ret;
 
r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!r) {
@@ -216,9 +216,12 @@ static int orion_mdio_probe(struct platform_device *pdev)
 
init_waitqueue_head(>smi_busy_wait);
 
-   dev->clk = devm_clk_get(>dev, NULL);
-   if (!IS_ERR(dev->clk))
-   clk_prepare_enable(dev->clk);
+   for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
+   dev->clk[i] = of_clk_get(pdev->dev.of_node, i);
+   if (IS_ERR(dev->clk[i]))
+   break;
+   clk_prepare_enable(dev->clk[i]);
+   }
 
dev->err_interrupt = platform_get_irq(pdev, 0);
if (dev->err_interrupt > 0 &&
@@ -259,8 +262,14 @@ static int orion_mdio_probe(struct platform_device *pdev)
 out_mdio:
if (dev->err_interrupt > 0)
writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
-   if (!IS_ERR(dev->clk))
-   clk_disable_unprepare(dev->clk);
+
+   for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
+   if (IS_ERR(dev->clk[i]))
+   break;
+   clk_disable_unprepare(dev->clk[i]);
+   clk_put(dev->clk[i]);
+   }
+
return ret;
 }
 
@@ -268,12 +277,18 @@ static int orion_mdio_remove(struct platform_device *pdev)
 {
struct mii_bus *bus = platform_get_drvdata(pdev);
struct orion_mdio_dev *dev = bus->priv;
+   int i;
 
if (dev->err_interrupt > 0)
writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
mdiobus_unregister(bus);
-   if (!IS_ERR(dev->clk))
-   clk_disable_unprepare(dev->clk);
+
+   for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
+   if (IS_ERR(dev->clk[i]))
+   break;
+   clk_disable_unprepare(dev->clk[i]);
+   clk_put(dev->clk[i]);
+   }
 
return 0;
 }
-- 
2.7.4



[PATCH 4/6] net: mvmdio: disable interrupt if resource size is too small

2017-04-10 Thread Russell King
Disable the MDIO interrupt, falling back to polled mode, if the resource
size does not allow us to access the interrupt registers.  All current
DT bindings use a size of 0x84, which allows access, but verifying it is
good practice.

Signed-off-by: Russell King 
---
 drivers/net/ethernet/marvell/mvmdio.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c 
b/drivers/net/ethernet/marvell/mvmdio.c
index 6ea5caddca62..614dfde657fe 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -221,6 +221,12 @@ static int orion_mdio_probe(struct platform_device *pdev)
clk_prepare_enable(dev->clk);
 
dev->err_interrupt = platform_get_irq(pdev, 0);
+   if (dev->err_interrupt > 0 &&
+   resource_size(r) < MVMDIO_ERR_INT_MASK + 4) {
+   dev_err(>dev,
+   "disabling interrupt, resource size is too small\n");
+   dev->err_interrupt = 0;
+   }
if (dev->err_interrupt > 0) {
ret = devm_request_irq(>dev, dev->err_interrupt,
orion_mdio_err_irq,
-- 
2.7.4



[PATCH 5/6] dt-bindings: allow up to three clocks for orion-mdio

2017-04-10 Thread Russell King
Armada 8040 needs three clocks to be enabled for MDIO accesses to work.
Update the binding to allow the extra clocks to be specified.

Signed-off-by: Russell King 
---
 Documentation/devicetree/bindings/net/marvell-orion-mdio.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt 
b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
index ca733ff68ab9..ccdabdcc8618 100644
--- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
+++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
@@ -14,7 +14,7 @@ interface.
 
 Optional properties:
 - interrupts: interrupt line number for the SMI error/done interrupt
-- clocks: Phandle to the clock control device and gate bit
+- clocks: phandle for up to three required clocks for the MDIO instance
 
 The child nodes of the MDIO driver are the individual PHY devices
 connected to this MDIO bus. They must have a "reg" property given the
-- 
2.7.4



[PATCH 3/6] dt-bindings: correct marvell orion MDIO binding document

2017-04-10 Thread Russell King
Correct the Marvell Orion MDIO binding document to properly reflect the
cases where an interrupt is present.  Augment the examples to show this.

Signed-off-by: Russell King 
---
 .../devicetree/bindings/net/marvell-orion-mdio.txt  | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt 
b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
index 9417e54c26c0..ca733ff68ab9 100644
--- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
+++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
@@ -7,7 +7,10 @@ interface.
 
 Required properties:
 - compatible: "marvell,orion-mdio"
-- reg: address and length of the SMI register
+- reg: address and length of the MDIO registers.  When an interrupt is
+  not present, the length is the size of the SMI register (4 bytes)
+  otherwise it must be 0x84 bytes to cover the interrupt control
+  registers.
 
 Optional properties:
 - interrupts: interrupt line number for the SMI error/done interrupt
@@ -17,7 +20,7 @@ The child nodes of the MDIO driver are the individual PHY 
devices
 connected to this MDIO bus. They must have a "reg" property given the
 PHY address on the MDIO bus.
 
-Example at the SoC level:
+Example at the SoC level without an interrupt property:
 
 mdio {
#address-cells = <1>;
@@ -26,6 +29,16 @@ mdio {
reg = <0xd0072004 0x4>;
 };
 
+Example with an interrupt property:
+
+mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   compatible = "marvell,orion-mdio";
+   reg = <0xd0072004 0x84>;
+   interrupts = <30>;
+};
+
 And at the board level:
 
 mdio {
-- 
2.7.4



[PATCH 2/6] net: mvmdio: fix interrupt disable in remove path

2017-04-10 Thread Russell King
The pre-existing write to disable interrupts on the remove path happens
whether we have an interrupt or not.  While this may seem to be a good
idea, this driver is re-used in many different implementations, some
where the binding only specifies four bytes of register space.  This
access causes us to access registers outside of the binding.

Make it conditional on the interrupt being present, which is the same
condition used when enabling the interrupt in the first place.

Signed-off-by: Russell King 
---
 drivers/net/ethernet/marvell/mvmdio.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c 
b/drivers/net/ethernet/marvell/mvmdio.c
index 7aea0beca56e..6ea5caddca62 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -263,7 +263,8 @@ static int orion_mdio_remove(struct platform_device *pdev)
struct mii_bus *bus = platform_get_drvdata(pdev);
struct orion_mdio_dev *dev = bus->priv;
 
-   writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
+   if (dev->err_interrupt > 0)
+   writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
mdiobus_unregister(bus);
if (!IS_ERR(dev->clk))
clk_disable_unprepare(dev->clk);
-- 
2.7.4



[PATCH 1/6] net: mvmdio: disable interrupts in driver failure path

2017-04-10 Thread Russell King
When the mvmdio driver has an interrupt, it enables the "done" interrupt
after requesting its interrupt handler.  However, probe failure results
in the interrupt being left enabled.  Disable it on the failure path.

Signed-off-by: Russell King 
---
 drivers/net/ethernet/marvell/mvmdio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c 
b/drivers/net/ethernet/marvell/mvmdio.c
index a0d1b084ecec..7aea0beca56e 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -251,6 +251,8 @@ static int orion_mdio_probe(struct platform_device *pdev)
return 0;
 
 out_mdio:
+   if (dev->err_interrupt > 0)
+   writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
if (!IS_ERR(dev->clk))
clk_disable_unprepare(dev->clk);
return ret;
-- 
2.7.4



[PATCH 0/6] mvmdio updates

2017-04-10 Thread Russell King - ARM Linux
This series of patches update mvmdio for Armada 8k CP110.  A number of
issues were found:

1. The driver fails to disable an interrupt when something goes wrong
   in the probe function.

2. The interrupt is specified in DT to be optional, but the driver
   unconditionally writes to the interrupt mask register, which may
   not exist.

3. The DT binding specifies
"reg: address and length of the SMI register"
   however, when supporting the interrupt, the size must cover the
   interrupt register as well.  Update the binding documentation
   with this information that was previously omitted.

4. If the register size is too small, have the driver print an error
   and disable use of the interrupt.

5. Armada 8k needs three clocks for the MDIO interface, otherwise the
   SoC hangs (since it is part of one of the ethernet interfaces.)
   GOP clock, MG core clock and MG clock are needed on 8k. Augment the
   binding and driver to allow three clocks to be specified.

 .../devicetree/bindings/net/marvell-orion-mdio.txt | 19 --
 drivers/net/ethernet/marvell/mvmdio.c  | 44 +-
 2 files changed, 50 insertions(+), 13 deletions(-)

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCH v3 1/5] netlink: extended ACK reporting

2017-04-10 Thread David Ahern
On 4/8/17 2:24 PM, Johannes Berg wrote:
> @@ -2300,14 +2332,35 @@ void netlink_ack(struct sk_buff *in_skb, struct 
> nlmsghdr *nlh, int err)
> NLMSG_ERROR, payload, 0);
>   errmsg = nlmsg_data(rep);
>   errmsg->error = err;
> - memcpy(>msg, nlh, payload > sizeof(*errmsg) ? nlh->nlmsg_len : 
> sizeof(*nlh));
> + memcpy(>msg, nlh,
> +!(nlk->flags & NETLINK_F_CAP_ACK) ? nlh->nlmsg_len
> +  : sizeof(*nlh));
> +

generically this makes userspace parsing more problematic: the parsing
layer may not know if the socket option has been set to precisely know
the size of errmsg->msg and how much data needs to be skipped to get to
the new attributes.


RE: [PATCH net-next 5/7] s390/qeth: improve endianness handling

2017-04-10 Thread David Laight
From: Ursula Braun
> Sent: 07 April 2017 13:33
> On 04/07/2017 01:25 PM, David Laight wrote:
> > From: Ursula Braun
> >> Sent: 05 April 2017 09:40
> >> From: Hans Wippel 
> >>
> >> Avoid endianness warnings reported by sparse by (1) using endianness
> >> conversions for assigning and using network packet fields, and (2)
> >> removing unnecessary endianness conversions from qeth_l3_rebuild_skb. No
> >> functional changes.
> >>
> >> Signed-off-by: Hans Wippel 
> >> Signed-off-by: Ursula Braun 
> >> ---
> >>  drivers/s390/net/qeth_core.h  |  4 ++--
> >>  drivers/s390/net/qeth_core_main.c | 11 +-
> >>  drivers/s390/net/qeth_l3_main.c   | 46 
> >> +++
> >>  drivers/s390/net/qeth_l3_sys.c|  4 ++--
> >>  4 files changed, 33 insertions(+), 32 deletions(-)
> >>
> >> diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
> >> index 22aa2cd..6764ab9 100644
> >> --- a/drivers/s390/net/qeth_core.h
> >> +++ b/drivers/s390/net/qeth_core.h
> >> @@ -844,9 +844,9 @@ static inline int qeth_get_ip_version(struct sk_buff 
> >> *skb)
> >>  {
> >>__be16 *p = &((struct ethhdr *)skb->data)->h_proto;
> >>
> >> -  if (*p == ETH_P_8021Q)
> >> +  if (be16_to_cpu(*p) == ETH_P_8021Q)
> >>p += 2;
> >> -  switch (*p) {
> >> +  switch (be16_to_cpu(*p)) {
> >>case ETH_P_IPV6:
> >
> > These are definitely 'functional changes' on LE systems.
> s390 is Big Endian; thus the change is basicly a NOP.

> The only purpose of this kind of patches is getting rid
> of sparse warnings with our code.

Then say so in the commit message

What you are really doing is fixing the code in case anyone tries
to use it on a LE system.

David



RE: [PATCH net-next V2 5/7] s390/qeth: improve endianness handling

2017-04-10 Thread David Laight
From: Ursula Braun
> Sent: 07 April 2017 08:16
> Avoid endianness warnings reported by sparse by (1) using endianness
> conversions for assigning and using network packet fields, and (2)
> removing unnecessary endianness conversions from qeth_l3_rebuild_skb. No
> functional changes.
...
> - if (*p == ETH_P_8021Q)
> + if (be16_to_cpu(*p) == ETH_P_8021Q)
...

This is clearly different on little-endian systems.
Also (probably) better written as:

if (*p == cpu_to_be16(ETH_P_8021Q))

so that the constant can by byteswapped.

David





  1   2   >