[net:master 66/72] drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:700:1: warning: 'vpdstrtou16.constprop' uses dynamic stack allocation

2016-02-19 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
head:   d07c0278da1f4cfc91c3d46d0d07a0d13a949892
commit: 1003e19c466dc37812b5f88b2d5308ee63bb3fa0 [66/72] cxgb3: fix up vpd strings for kstrto*()
config: s390-allmodconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 1003e19c466dc37812b5f88b2d5308ee63bb3fa0
# save the attached .config to linux build tree
make.cross ARCH=s390 

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/chelsio/cxgb3/t3_hw.c: In function 'vpdstrtou16.constprop':
>> drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:700:1: warning: 'vpdstrtou16.constprop' uses dynamic stack allocation
 }
 ^
   drivers/net/ethernet/chelsio/cxgb3/t3_hw.c: In function 'vpdstrtouint.constprop':
>> drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:691:1: warning: 'vpdstrtouint.constprop' uses dynamic stack allocation
 }
 ^

vim +700 drivers/net/ethernet/chelsio/cxgb3/t3_hw.c

   685	{
   686		char tok[len + 1];
   687	
   688		memcpy(tok, s, len);
   689		tok[len] = 0;
   690		return kstrtouint(strim(tok), base, val);
 > 691	}
   692	
   693	static int vpdstrtou16(char *s, int len, unsigned int base, u16 *val)
   694	{
   695		char tok[len + 1];
   696	
   697		memcpy(tok, s, len);
   698		tok[len] = 0;
   699		return kstrtou16(strim(tok), base, val);
 > 700	}
   701  
   702  /**
   703   *  get_vpd_params - read VPD parameters from VPD EEPROM
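The warning is triggered by the variable-length array char tok[len + 1] above; the s390 allmodconfig build flags every dynamic stack allocation. A minimal sketch of one way to avoid the VLA, assuming VPD fields can be bounded by a small constant (VPD_MAX_LEN is hypothetical, chosen for illustration; not necessarily the upstream fix):

#include <linux/kernel.h>
#include <linux/string.h>

/* Hypothetical bound on a VPD token; pick a value that covers every
 * field actually read from the EEPROM. */
#define VPD_MAX_LEN 32

static int vpdstrtouint(char *s, int len, unsigned int base, unsigned int *val)
{
	char tok[VPD_MAX_LEN + 1];	/* fixed size: no dynamic stack use */

	if (len > VPD_MAX_LEN)
		return -EINVAL;
	memcpy(tok, s, len);
	tok[len] = 0;
	return kstrtouint(strim(tok), base, val);
}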

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




Re: [PATCH] Bluetooth: hci_core: Avoid mixing up req_complete and req_complete_skb

2016-02-19 Thread Johan Hedberg
Hi Douglas,

On Fri, Feb 19, 2016, Douglas Anderson wrote:
> In commit 44d271377479 ("Bluetooth: Compress the size of struct
> hci_ctrl") we squashed down the size of the structure by using a union
> with the assumption that all users would use the flag to determine
> whether we had a req_complete or a req_complete_skb.
> 
> Unfortunately we had a case in hci_req_cmd_complete() where we weren't
> looking at the flag.  This can result in a situation where we might be
> storing a hci_req_complete_skb_t in a hci_req_complete_t variable, or
> vice versa.
> 
> During some testing I found at least one case where the function
> hci_req_sync_complete() was called improperly because the kernel thought
> that it didn't require an SKB.  Looking through the stack in kgdb I
> found that it was called by hci_event_packet() and that
> hci_event_packet() had both of its locals "req_complete" and
> "req_complete_skb" pointing to the same place: both to
> hci_req_sync_complete().
> 
> Let's make sure we always check the flag.
> 
> For more details on debugging done, see .
> 
> Fixes: 44d271377479 ("Bluetooth: Compress the size of struct hci_ctrl")
> Signed-off-by: Douglas Anderson 
> ---
> Testing was done on a Chrome OS device on kernel 3.14 w/
> bluetooth-next backports.  Since I couldn't reliably reproduce the
> problem, I simply confirmed that existing functionality worked.
> 
>  net/bluetooth/hci_core.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)

Good catch.

Acked-by: Johan Hedberg 

Marcel, I think we want to send this through our stable tree to 4.5-rc
(where the patch introducing the bug is).

Johan
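The bug class is generic: when two callback types share a union, every reader must select the member through the discriminating flag. A minimal sketch of the pattern with illustrative names, not the actual hci_core structures:

#include <stdbool.h>

/* Illustrative stand-ins for the two callback types. */
typedef void (*complete_fn)(int status);
typedef void (*complete_skb_fn)(int status, void *skb);

struct req_ctrl {
	bool req_skb;			/* discriminates the union below */
	union {
		complete_fn req_complete;
		complete_skb_fn req_complete_skb;
	};
};

static void run_complete(struct req_ctrl *c, int status, void *skb)
{
	/* Always check the flag; reading the wrong member silently
	 * calls a function through the wrong signature. */
	if (c->req_skb)
		c->req_complete_skb(status, skb);
	else
		c->req_complete(status);
}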


Re: [PATCH net 0/3] bnxt_en: Phy related fixes.

2016-02-19 Thread Michael Chan
On Fri, Feb 19, 2016 at 8:40 PM, David Miller  wrote:
> From: Michael Chan 
> Date: Fri, 19 Feb 2016 19:43:18 -0500
>
>> 3 small patches to fix PHY related code.
>
> Series applied, thanks Michael.
>
> Although I'm not so sure how wise it is to not fail an ->open()
> if the PHY settings fail.

The firmware will still have default settings for the PHY.  More importantly,
for the VF, the call doesn't do anything real and we should allow the VF to
proceed if the call fails.  Thanks.


RE: [PATCH] xen-netfront: set real_num_tx_queues to zero to avoid triggering BUG_ON

2016-02-19 Thread Gonglei (Arei)
Hi,

Thanks for rapid feedback :)

> From: David Miller [mailto:da...@davemloft.net]
> Sent: Saturday, February 20, 2016 12:37 PM
> 
> From: Gonglei 
> Date: Sat, 20 Feb 2016 09:27:26 +0800
> 
> > It's possible for a race condition to exist between xennet_open() and
> > talk_to_netback(). After netfront_probe() returns, other threads or
> > processes (such as NetworkManager) may invoke xennet_open() immediately
> > and trigger the BUG_ON(). Besides, we should also reset
> > real_num_tx_queues in xennet_destroy_queues().
> 
> One should really never invoke register_netdev() until the device is
> 100% fully initialized.
> 
> This means you cannot call register_netdev() until it is completely
> legal to invoke your ->open() method.
> 
> And I think that is what the real problem is here.
> 
> If you follow the correct rules for ordering wrt. register_netdev()
> there are no "races".  Because ->open() must be legally invokable
> from the exact moment you call register_netdev().
> 

Yes, I agree. Though that's the historic legacy problem. ;)

> I'm not applying this, as it really sounds like the fundamental issue
> is the order in which the xen-netfront private data is initialized
> or setup before being registered.

That means register_netdev() should be invoked after xennet_connect(), right?

Regards,
-Gonglei


[PATCH iproute2] bridge: add support for dynamic fdb entries

2016-02-19 Thread Roopa Prabhu
From: Roopa Prabhu 

This patch is a follow up to the recently added
'static' fdb option.

It introduces a new option 'dynamic' which adds
dynamic fdb entries with NUD_REACHABLE.

$bridge fdb add 00:01:02:03:04:06 dev eth0 master dynamic

$bridge fdb show
00:01:02:03:04:06 dev eth0

This patch also documents all fdb types, and removes 'temp' from the
usage message since it is now replaced by 'static'. 'temp' still works
and is synonymous with 'static'.

Signed-off-by: Wilson Kok 
Signed-off-by: Roopa Prabhu 
---
 bridge/fdb.c  |  5 -
 man/man8/bridge.8 | 14 +-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/bridge/fdb.c b/bridge/fdb.c
index 9bc6b94..1400b65 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -33,7 +33,7 @@ static void usage(void)
 {
	fprintf(stderr, "Usage: bridge fdb { add | append | del | replace } ADDR dev DEV\n"
	                "       [ self ] [ master ] [ use ] [ router ]\n"
-	                "       [ local | temp | static ] [ dst IPADDR ] [ vlan VID ]\n"
+	                "       [ local | static | dynamic ] [ dst IPADDR ] [ vlan VID ]\n"
	                "       [ port PORT] [ vni VNI ] [ via DEV ]\n");
	fprintf(stderr, "       bridge fdb [ show [ br BRDEV ] [ brport DEV ] ]\n");
exit(-1);
@@ -304,6 +304,9 @@ static int fdb_modify(int cmd, int flags, int argc, char **argv)
} else if (matches(*argv, "temp") == 0 ||
   matches(*argv, "static") == 0) {
req.ndm.ndm_state |= NUD_REACHABLE;
+   } else if (matches(*argv, "dynamic") == 0) {
+   req.ndm.ndm_state |= NUD_REACHABLE;
+   req.ndm.ndm_state &= ~NUD_NOARP;
} else if (matches(*argv, "vlan") == 0) {
if (vid >= 0)
duparg2("vlan", *argv);
diff --git a/man/man8/bridge.8 b/man/man8/bridge.8
index 0ec6f17..efd416e 100644
--- a/man/man8/bridge.8
+++ b/man/man8/bridge.8
@@ -54,7 +54,7 @@ bridge \- show / manipulate bridge addresses and devices
 .I LLADDR
 .B dev
 .IR DEV " { "
-.BR local " | " temp " } [ "
+.BR local " | " static " | " dynamic " } [ "
 .BR self " ] [ " master " ] [ " router " ] [ " use " ] [ "
 .B dst
 .IR IPADDR " ] [ "
@@ -338,6 +338,18 @@ the Ethernet MAC address.
 .BI dev " DEV"
 the interface to which this address is associated.
 
+.B local
+- is a local permanent fdb entry
+.sp
+
+.B static
+- is a static (no arp) fdb entry
+.sp
+
+.B dynamic
+- is a dynamic reachable age-able fdb entry
+.sp
+
 .B self
 - the address is associated with the port drivers fdb. Usually hardware.
 .sp
-- 
1.9.1



Re: pull request [net]: batman-adv 20160216

2016-02-19 Thread Antonio Quartulli
On Fri, Feb 19, 2016 at 03:37:18PM -0500, David Miller wrote:
> And thanks for the heads up about the potential merge issues, I'll watch
> for that.
> 

Hi David,

actually I just realized that the patches that will create the conflict
are not yet in net-next, but they are still pending in my queue.

At this point I will wait for you to merge net into net-next first
(there should be no conflict at that point) and then I will rebase
my pending patches on top of that.


This should prevent you from dealing with any conflict.


Regards,


-- 
Antonio Quartulli




Re: [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map

2016-02-19 Thread David Miller
From: Alexei Starovoitov 
Date: Wed, 17 Feb 2016 19:58:56 -0800

> This patch set introduces new map type to store stack traces and
> corresponding bpf_get_stackid() helper.
 ...

Series applied, thanks Alexei.


Re: [PATCHv3 net 5/5] nfp: don't trust netif_running() in debug code

2016-02-19 Thread David Miller
From: Jakub Kicinski 
Date: Thu, 18 Feb 2016 20:38:13 +

> Since change_mtu() can fail and leave us with netif_running()
> returning true even though all rings were freed, we should
> look at the NFP_NET_CFG_CTRL_ENABLE flag to determine if the
> device is really open.
> 
> Signed-off-by: Jakub Kicinski 

This is exactly why I don't like how you are doing your MTU change at
all.

You must not make the device inoperative if you simply cannot perform
the MTU change.  I'm pretty sure I've told you this already, this
whole ->close(), MTU change, ->open() OOPS THAT FAILED sequence is a
non-starter.  You can't do this.

You are leaving the netdev object in an illegal state when this
happens.

You must return from ->change_mtu() with the device in the UP
and running state if you cannot make the MTU change.

I don't care how invasive it is, you must fix this properly if
you want to fix this because as-is this patch series is a cure
worse than the disease.

I'm not applying any of your currently submitted changes, both for net
and net-next, sorry.
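A sketch of the failure-safe pattern being asked for, with hypothetical helpers rather than the nfp driver's real API: allocate the resources for the new MTU first, and tear the old ones down only after that succeeded, so ->change_mtu() can never leave the device half-closed:

#include <linux/netdevice.h>

/* Hypothetical ring type and helpers, for illustration only. */
struct example_rings;

struct example_rings *example_alloc_rings(struct net_device *dev, int mtu);
void example_swap_rings(struct net_device *dev, struct example_rings *rings);

static int example_change_mtu(struct net_device *dev, int new_mtu)
{
	struct example_rings *new_rings;

	new_rings = example_alloc_rings(dev, new_mtu);
	if (!new_rings)
		return -ENOMEM;	/* device keeps running with the old MTU */

	example_swap_rings(dev, new_rings);	/* frees the old rings */
	dev->mtu = new_mtu;
	return 0;
}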


Re: [PATCH net] r8169: fix "rtl_counters_cond == 1 (loop: 1000, delay: 10)" log spam.

2016-02-19 Thread David Miller
From: Chunhao Lin 
Date: Thu, 18 Feb 2016 22:57:07 +0800

> There is log spam when no cable is plugged in.
> Please refer to the following links.
> https://bugzilla.kernel.org/show_bug.cgi?id=104351
> https://bugzilla.kernel.org/show_bug.cgi?id=107421
> 
> This issue is caused by runtime power management. When no cable is
> plugged in, the driver will be suspended (runtime suspend) by the OS and
> the NIC will be put into the D3 state. During this time, if the OS calls
> rtl8169_get_stats64() to dump the tally counters, the registers read by
> the driver will return all 0xff because the NIC is in the D3 state. This
> makes the driver think the tally counter flag was not toggled, and it
> then sends the warning message "rtl_counters_cond == 1 (loop: 1000,
> delay: 10)" to the kernel log.
> 
> Add a check of the driver's PM runtime status in rtl8169_get_stats64()
> to fix this issue.
> 
> Signed-off-by: Chunhao Lin 

If you are going to do this, then you need to quiesce the device RX/TX
and capture all of the statistics from the chip before suspending it.

Otherwise what we're returning from rtl8169_get_stats64() is inaccurate.
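A sketch of the quiesce-and-snapshot approach suggested above, with hypothetical driver types and helpers: capture the tally counters while the chip is still in D0, and serve get_stats64() from the snapshot while runtime-suspended:

#include <linux/netdevice.h>

/* Hypothetical private struct and counter reader, for illustration. */
struct example_priv {
	struct rtnl_link_stats64 saved;
	bool suspended;
};

void example_read_hw_counters(struct example_priv *priv,
			      struct rtnl_link_stats64 *s);

static int example_runtime_suspend(struct device *dev)
{
	struct example_priv *priv = dev_get_drvdata(dev);

	example_read_hw_counters(priv, &priv->saved);	/* chip still in D0 */
	priv->suspended = true;
	return 0;
}

static struct rtnl_link_stats64 *
example_get_stats64(struct net_device *ndev, struct rtnl_link_stats64 *s)
{
	struct example_priv *priv = netdev_priv(ndev);

	if (priv->suspended)
		*s = priv->saved;	/* don't touch registers in D3 */
	else
		example_read_hw_counters(priv, s);
	return s;
}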


Re: [PATCH net] af_unix: Don't use continue to re-execute unix_stream_read_generic loop

2016-02-19 Thread David Miller
From: Rainer Weikusat 
Date: Thu, 18 Feb 2016 12:39:46 +

> The unix_stream_read_generic function tries to use a continue statement
> to restart the receive loop after waiting for a message. This may not
> work as intended because the caller might use a recvmsg call to peek at
> control messages without specifying a message buffer. In that case,
> since the continue re-evaluates the loop condition, the function returns
> without an error and without the credential information whenever it had
> to wait for a message, even though it would have returned the
> credentials otherwise. Change to using goto to restart the loop without
> checking the condition first in this case, so that credentials are
> returned either way.
> 
> Signed-off-by: Rainer Weikusat 
> Acked-by: Hannes Frederic Sowa 

Applied, thanks.
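The subtlety is that continue in a do/while jumps to the loop condition, which in unix_stream_read_generic is the remaining size and can be zero when peeking at control messages. A tiny standalone illustration of why the goto restart is needed (illustrative, not the socket code itself):

#include <stdio.h>

int main(void)
{
	int waited = 0;
	int size = 0;	/* peeking at control messages: no data buffer */

	do {
redo:
		if (!waited) {
			waited = 1;
			/* 'continue' would jump to the while() test below,
			 * and with size == 0 the loop would exit before the
			 * credentials are delivered.  Restart the body
			 * without checking the condition instead: */
			goto redo;
		}
		puts("credentials delivered");
		break;
	} while (size > 0);

	return 0;
}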


Re: [PATCH] net: bcmgenet: Fix internal PHY link state

2016-02-19 Thread David Miller
From: Jaedon Shin 
Date: Fri, 19 Feb 2016 13:48:50 +0900

> The PHY link state is not changed in GENETv2 because the previous
> commit 49f7a471e4d1 ("net: bcmgenet: Properly configure PHY to ignore
> interrupt") set PHY_IGNORE_INTERRUPT in bcmgenet_mii_probe().
> 
> The internal PHY should use phy_mac_interrupt() when not using
> PHY_POLL. The statement for phy_mac_interrupt() has two conditions. The
> first condition, checking GENET_HAS_MDIO_INTR, is not related to the
> PHY link state, so this patch removes it.
> 
> Fixes: 49f7a471e4d1 ("net: bcmgenet: Properly configure PHY to ignore 
> interrupt")
> Signed-off-by: Jaedon Shin 

Applied, thanks.


Re: [PATCH] unix_diag: fix incorrect sign extension in unix_lookup_by_ino

2016-02-19 Thread David Miller
From: Cong Wang 
Date: Fri, 19 Feb 2016 16:21:14 -0800

> On Thu, Feb 18, 2016 at 5:27 PM, Dmitry V. Levin  wrote:
>> The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
>> __u32, but unix_lookup_by_ino's argument ino has type int, which is not
>> a problem yet.
>> However, when ino is compared with the sock_i_ino return value of type
>> unsigned long, ino is sign-extended to signed long, and this results
>> in an incorrect comparison on 64-bit architectures for inode numbers
>> greater than INT_MAX.
>>
>> This bug was found by strace test suite.
>>
>> Signed-off-by: Dmitry V. Levin 
>> Cc: 
> 
> Fixes: 5d3cae8bc39d ("unix_diag: Dumping exact socket core")
> Acked-by: Cong Wang 

Applied and queued up for -stable.
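The failure mode is easy to reproduce in isolation. A small standalone demonstration, assuming an LP64 target (32-bit int, 64-bit long):

#include <stdio.h>

int main(void)
{
	unsigned long sk_ino = 0x80000001UL;	/* inode > INT_MAX, as from sock_i_ino() */
	int ino = (int)0x80000001U;		/* same inode stored in a signed int */

	/* ino is sign-extended to 0xffffffff80000001 before the compare: */
	printf("signed int:   %s\n", (ino == sk_ino) ? "match" : "NO match");

	unsigned int fixed = 0x80000001U;	/* widened without sign extension */
	printf("unsigned int: %s\n", (fixed == sk_ino) ? "match" : "NO match");
	return 0;
}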


Re: [PATCH net-next] net: use skb_postpush_rcsum instead of own implementations

2016-02-19 Thread David Miller
From: Daniel Borkmann 
Date: Sat, 20 Feb 2016 00:29:30 +0100

> Replace individual implementations with the recently introduced
> skb_postpush_rcsum() helper.
> 
> Signed-off-by: Daniel Borkmann 

Applied, thanks Daniel.


Re: [PATCH net 0/3] bnxt_en: Phy related fixes.

2016-02-19 Thread David Miller
From: Michael Chan 
Date: Fri, 19 Feb 2016 19:43:18 -0500

> 3 small patches to fix PHY related code.

Series applied, thanks Michael.

Although I'm not so sure how wise it is to not fail an ->open()
if the PHY settings fail.


Re: [PATCH] xen-netfront: set real_num_tx_queues to zero to avoid triggering BUG_ON

2016-02-19 Thread David Miller
From: Gonglei 
Date: Sat, 20 Feb 2016 09:27:26 +0800

> It's possible for a race condition to exist between xennet_open() and
> talk_to_netback(). After netfront_probe() returns, other threads or
> processes (such as NetworkManager) may invoke xennet_open() immediately
> and trigger the BUG_ON(). Besides, we should also reset
> real_num_tx_queues in xennet_destroy_queues().

One should really never invoke register_netdev() until the device is
100% fully initialized.

This means you cannot call register_netdev() until it is completely
legal to invoke your ->open() method.

And I think that is what the real problem is here.

If you follow the correct rules for ordering wrt. register_netdev()
there are no "races".  Because ->open() must be legally invokable
from the exact moment you call register_netdev().

I'm not applying this, as it really sounds like the fundamental issue
is the order in which the xen-netfront private data is initialized
or setup before being registered.
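A sketch of the ordering rule David states, with hypothetical helpers: ->open() may legally be invoked the moment register_netdev() returns, so everything it touches must be set up first:

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

/* Hypothetical private struct and queue helpers, for illustration. */
struct example_priv {
	void *queues;
};

static int example_setup_queues(struct net_device *ndev);
static void example_destroy_queues(struct net_device *ndev);

static int example_probe(struct device *parent)
{
	struct net_device *ndev;
	int err;

	ndev = alloc_etherdev_mq(sizeof(struct example_priv), 8);
	if (!ndev)
		return -ENOMEM;

	err = example_setup_queues(ndev);	/* make ->open() legal now */
	if (err)
		goto err_free;

	err = register_netdev(ndev);		/* last step, never earlier */
	if (err)
		goto err_queues;
	return 0;

err_queues:
	example_destroy_queues(ndev);
err_free:
	free_netdev(ndev);
	return err;
}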


Re: [Patch net-next] phy: marvell/micrel: Fix Unpossible condition

2016-02-19 Thread David Miller
From: Andrew Lunn 
Date: Sat, 20 Feb 2016 00:35:29 +0100

> commit 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
> from Dec 30, 2015, leads to the following static checker
> warning:
> 
> drivers/net/phy/micrel.c:609 kszphy_get_stat()
> warn: unsigned 'val' is never less than zero.
 ...
> The same problem exists in the Marvell driver. Fix both.
> 
> Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
> Reported-by: Dan Carpenter 
> Reported-by: Julia.Lawall 
> Signed-off-by: Andrew Lunn 

Applied, thanks.


[PATCH 1/1] net-next: do not store needed_headroom in ip_tunnel_xmit

2016-02-19 Thread Francesco Ruggeri
Misconfigurations can result in local tunnel loops being created.
__dev_queue_xmit catches packets caught in a loop and drops them,
but the affected tunnels' needed_headroom can be corrupted in the
process as it is recursively updated.
The script below can be used to create a loop between two tunnels
and to send packets.

ip link add dummy1 type dummy
ip addr add 1.1.1.1/32 dev dummy1
ip link set dummy1 up

ip link add dummy3 type dummy
ip addr add 3.3.3.3/32 dev dummy3
ip link set dummy3 up

ip tunnel add t1 mode gre local 1.1.1.1 remote 2.2.2.2
ip link set t1 up

ip tunnel add t3 mode gre local 3.3.3.3 remote 4.4.4.4
ip link set t3 up

ip route add 2.2.2.2 dev t3
ip route add 4.4.4.4 dev t1

ping -c 5 2.2.2.2

Signed-off-by: Francesco Ruggeri 
---
 net/ipv4/ip_tunnel.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 4569da7..2eddbe3 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -601,6 +601,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
__be16 df;
struct rtable *rt;  /* Route to the other host */
unsigned int max_headroom;  /* The extra header space needed */
+   unsigned int needed_headroom;
__be32 dst;
bool connected;
 
@@ -731,10 +732,11 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 
max_headroom = LL_RESERVED_SPACE(rt->dst.dev) + sizeof(struct iphdr)
+ rt->dst.header_len + ip_encap_hlen(&tunnel->encap);
-   if (max_headroom > dev->needed_headroom)
-   dev->needed_headroom = max_headroom;
+   needed_headroom = dev->needed_headroom;
+   if (max_headroom > needed_headroom)
+   needed_headroom = max_headroom;
 
-   if (skb_cow_head(skb, dev->needed_headroom)) {
+   if (skb_cow_head(skb, needed_headroom)) {
ip_rt_put(rt);
dev->stats.tx_dropped++;
kfree_skb(skb);
-- 
1.8.1.4



Re: [PATCH V7 0/8] ethtool per queue parameters support

2016-02-19 Thread David Miller
From: Kan Liang 
Date: Fri, 19 Feb 2016 09:23:58 -0500

> Modern network interface controllers usually support multiple receive
> and transmit queues. Each queue may have its own parameters. For
> example, Intel XL710/X710 hardware supports per queue interrupt
> moderation. However, current ethtool does not support a per-queue
> parameters option; the user has to set parameters for the whole NIC.
> This series extends ethtool to support a per-queue parameters option.
> 
> Since the support of per-queue parameters varies with different cards,
> it is impossible to address all cards in one patch. This series only
> supports per-queue coalesce options on the i40e driver. The framework used
> in the patch can be easily extended to other cards and parameters.
> 
> The lib bitmap needs to be extended to facilitate exchanging queue bitmaps
> between user space and kernel space. Two patches from David's latest V8
> patch series are also cited in this series. You may refer to
> https://lkml.org/lkml/2016/2/9/919 for more details.
 ...

Series applied, thanks.


Re: [PATCH] fmvj18x_cs: The MAC address of FMV-J182 is stored at buf[5]

2016-02-19 Thread David Miller
From: Ken Kawasaki 
Date: Sat, 20 Feb 2016 10:57:06 +0900

> 
> The MAC address of FMV-J182 is stored at buf[5].
> 
> Signed-off-by: Ken Kawasaki 

This is an extremely poor and misleading description of your change.

The original code was accessing the MAC address properly at buf[5] by
starting the iterator 'i' at 5.

The bug is that the indexing of dev->dev_addr[] is incorrect.

Please describe your change more accurately and resubmit this patch.

Thanks.


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Jesse Gross
On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert  wrote:
> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross  wrote:
>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
 On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
> This patch series makes it so that we enable the outer Tx checksum for 
> IPv4
> tunnels by default.  This makes the behavior consistent with how we were
> handling this for IPv6.  In addition I have updated the internal flags for
> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
> match up will with the ZERO_CSUM6_TX flag which was already in use for
> IPv6.
>
> For most network devices this should be a net gain in terms of performance
> as having the outer header checksum present allows for devices to report
> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in 
> order
> to determine if the inner header checksum is valid.
>
> Below is some data I collected with ixgbe with an X540 that demonstrates
> this.  I located two PFs connected back to back in two different name
> spaces and then setup a pair of tunnels on each, one with checksum enabled
> and one without.
>
> Recv   Send    Send                          Utilization
> Socket Socket  Message  Elapsed              Send
> Size   Size    Size     Time     Throughput  local
> bytes  bytes   bytes    secs.    10^6bits/s  % S
>
> noudpcsum:
>  87380  16384  16384    30.00    8898.67   12.80
> udpcsum:
>  87380  16384  16384    30.00    9088.47   5.69
>
> The one spot where this may cause a performance regression is if the
> environment contains devices that can parse the inner headers and a device
> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
> the case of such a device we have to fall back to using GSO to segment the
> tunnel instead of TSO and as a result we may take a performance hit as 
> seen
> below with i40e.

 Do you have any numbers from 40G links? Obviously, at 10G the links
 are basically saturated and while I can see a difference in the
 utilization rate, I suspect that the change will be much more apparent
 at higher speeds.
>>>
>>> Unfortunately I don't have any true 40G links to test with.  The
>>> closest I can get is to run PF to VF on an i40e.  Running that I have
>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>>> difference being related to the fact that we are having to
>>> allocate/free more skbs and make more trips through the
>>> i40e_lan_xmit_frame function resulting in more descriptors.
>>
>> OK, I guess that is more or less in line with what I would expect off
>> the top my head. There is a reasonably significant drop in the worst
>> case.
>>
 I'm concerned about the drop in performance for devices that currently
 support offloads (almost none of which expose
 NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
 care most about tunnel performance are the ones that already have
 these NICs and will be the most impacted by the drop.
>>>
>>> The problem is being able to transmit fast is kind of pointless if the
>>> receiving end cannot handle it.  We hadn't gotten around to really
>>> getting the Rx checksum bits working until the 3.18 kernel which I
>>> don't suspect many people are running so at this point messing with
>>> the TSO bits isn't really making much of a difference.  Then on top of
>>> that most devices have certain limitations on how many ports they can
>>> handle and such.  I know the i40e is supposed to support something
>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>>> as I recall.  So this whole thing is already really brittle as it is.
>>> My goal with this change is to make the behavior more consistent
>>> across the board.
>>
>> That's true to some degree but there are certainly plenty of cases
>> where TSO makes a difference - lower CPU usage, transmitting to
>> multiple receivers, people will upgrade their kernels, etc. It's
>> clearly good to make things more consistent but hopefully not by
>> reducing existing performance. :)
>>
 My hope is that we can continue to use TSO on devices that only
 support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
 length field may vary across segments. However, in practice this is
 the only on the final segment and only in cases where the total length
 is not a multiple of the MSS. If we could detect cases where those
 conditions are met, we could continue to use TSO with the UDP checksum
 field pre-populated. A possible step even further would be to break
 off the final segment into a separate packet to make things conform if
 necessary. This would avoid a performance regression and I think make
 this more palatable to a lot of people.
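The detection described in the quoted proposal can be sketched in a few lines. A hypothetical helper (not an existing kernel function), under the assumption that TSO with a pre-populated outer UDP length/checksum is safe exactly when every segment, including the last, carries the same payload length:

#include <linux/skbuff.h>

/* Hypothetical helper for illustration: returns true when all TSO
 * segments of skb will be the same size, i.e. the payload length is a
 * multiple of the MSS, so the outer UDP header can be pre-populated. */
static bool tso_segments_uniform(const struct sk_buff *skb)
{
	unsigned int payload = skb->len - skb_transport_offset(skb);

	return (payload % skb_shinfo(skb)->gso_size) == 0;
}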

Re: [PATCH net-next] net: use skb_postpush_rcsum instead of own implementations

2016-02-19 Thread Alexei Starovoitov
On Sat, Feb 20, 2016 at 12:29:30AM +0100, Daniel Borkmann wrote:
> Replace individual implementations with the recently introduced
> skb_postpush_rcsum() helper.
> 
> Signed-off-by: Daniel Borkmann 

nice follow up to f8ffad69c9f8 indeed.
Acked-by: Alexei Starovoitov 



[PATCH] fmvj18x_cs: The MAC address of FMV-J182 is stored at buf[5]

2016-02-19 Thread Ken Kawasaki

The MAC address of FMV-J182 is stored at buf[5].

Signed-off-by: Ken Kawasaki 

---

--- linux-4.4.1/drivers/net/ethernet/fujitsu/fmvj18x_cs.c.orig	2016-02-19 20:48:40.143852346 +0900
+++ linux-4.4.1/drivers/net/ethernet/fujitsu/fmvj18x_cs.c	2016-02-20 10:33:42.137713831 +0900
@@ -469,8 +469,8 @@ static int fmvj18x_config(struct pcmcia_
goto failed;
}
/* Read MACID from CIS */
-   for (i = 5; i < 11; i++)
-   dev->dev_addr[i] = buf[i];
+   for (i = 0; i < 6; i++)
+   dev->dev_addr[i] = buf[i + 5];
kfree(buf);
} else {
if (pcmcia_get_mac_from_cis(link, dev))


[PATCH] xen-netfront: set real_num_tx_queues to zero to avoid triggering BUG_ON

2016-02-19 Thread Gonglei
It's possible for a race condition to exist between xennet_open() and
talk_to_netback(). After netfront_probe() returns, other threads or
processes (such as NetworkManager) may invoke xennet_open() immediately
and trigger the BUG_ON(). Besides, we should also reset
real_num_tx_queues in xennet_destroy_queues().

[ 3324.658057] kernel BUG at include/linux/netdevice.h:508!
[ 3324.658057] invalid opcode:  [#1] SMP
[ 3324.658057] CPU: 0 PID: 662 Comm: NetworkManager Tainted: G
[] ? raw_notifier_call_chain+0x16/0x20
[] __dev_open+0xce/0x150
[] __dev_change_flags+0xa1/0x170
[] dev_change_flags+0x29/0x70
[] do_setlink+0x39f/0xb40
[] ? nla_parse+0x32/0x120
[] rtnl_newlink+0x604/0x900
[] ? netlink_unicast+0x193/0x1c0
[] ? security_capable+0x18/0x20
[] ? ns_capable+0x2d/0x60
[] rtnetlink_rcv_msg+0xf5/0x270
[] ? rhashtable_lookup_compare+0x5d/0xa0
[] ? rtnetlink_rcv+0x40/0x40
[] netlink_rcv_skb+0xb9/0xe0
[] rtnetlink_rcv+0x2c/0x40
[] netlink_unicast+0x12d/0x1c0
[] netlink_sendmsg+0x4d3/0x630
[] ? sock_has_perm+0x72/0x90
[] do_sock_sendmsg+0x9f/0xc0
[ 3324.703482] RIP  [] xennet_open+0x180/0x182 
[xen_netfront]

CC: David S. Miller 
Signed-off-by: Gonglei 
---
 drivers/net/xen-netfront.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index d6abf19..da2 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -340,7 +340,7 @@ static int xennet_open(struct net_device *dev)
unsigned int i = 0;
struct netfront_queue *queue = NULL;
 
-   for (i = 0; i < num_queues; ++i) {
+   for (i = 0; i < num_queues && np->queues; ++i) {
queue = &np->queues[i];
napi_enable(&queue->napi);
 
@@ -1296,6 +1296,10 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
np   = netdev_priv(netdev);
np->xbdev= dev;
 
+   /* No need to use rtnl_lock() before the call below as it
+* happens before register_netdev().
+*/
+   netdev->real_num_tx_queues = 0;
np->queues = NULL;
 
err = -ENOMEM;
@@ -1748,7 +1752,7 @@ static void xennet_destroy_queues(struct netfront_info *info)
del_timer_sync(&queue->rx_refill_timer);
netif_napi_del(&queue->napi);
}
-
+   info->netdev->real_num_tx_queues = 0;
rtnl_unlock();
 
kfree(info->queues);
@@ -1951,6 +1955,9 @@ abort_transaction_no_dev_fatal:
xennet_disconnect_backend(info);
kfree(info->queues);
info->queues = NULL;
+   rtnl_lock();
+   info->netdev->real_num_tx_queues = 0;
+   rtnl_unlock();
  out:
return err;
 }
-- 
1.8.5.2




Re: [PATCH V7 0/8] ethtool per queue parameters support

2016-02-19 Thread David Miller
From: Jeff Kirsher 
Date: Fri, 19 Feb 2016 13:50:18 -0800

> Dave, I have pretty much cleared out my i40e queue of patches, so I am
> fine if you want to apply the entire series (of course after proper
> review) :-)

Ok, thanks Jeff.


Re: [PATCH] bgmac: support Ethernet device on BCM47094 SoC

2016-02-19 Thread David Miller
From: Rafał Miłecki 
Date: Fri, 19 Feb 2016 22:26:48 +0100

> Are you applying it to net-next? It was my intention, sorry if I was
> supposed to make it more clear.

You are always supposed to specify this explicitly, in the
Subject line, after PATCH, like "[PATCH net-next] bgmac: ..."


[Patch net-next] net_sched: add network namespace support for tc actions

2016-02-19 Thread Cong Wang
Currently tc actions are stored in a per-module hashtable and are
therefore visible to all network namespaces. This is probably the last
part of the tc subsystem that is not netns-aware. This patch makes them
per-netns; several tc action APIs need to be adjusted for this.

The tc action API code is ugly due to historical reasons,
we need to refactor that code in the future.

Also this patch is on top of my other patch
"net_sched: fix memory leaks when rmmod tc action modules",
therefore should be applied after -net is merged into
net-next.

Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 include/net/act_api.h|  28 +---
 net/sched/act_api.c  |  88 ++-
 net/sched/act_bpf.c  |  71 +--
 net/sched/act_connmark.c |  73 +--
 net/sched/act_csum.c |  78 +++--
 net/sched/act_gact.c |  74 ++--
 net/sched/act_ipt.c  | 178 +++
 net/sched/act_mirred.c   |  75 ++--
 net/sched/act_nat.c  |  73 +--
 net/sched/act_pedit.c|  73 +--
 net/sched/act_police.c   |  69 --
 net/sched/act_simple.c   |  76 ++--
 net/sched/act_skbedit.c  |  73 +--
 net/sched/act_vlan.c |  73 +--
 14 files changed, 969 insertions(+), 133 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 8c4e3ff..af74406 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -7,6 +7,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 struct tcf_common {
struct hlist_node   tcfc_head;
@@ -87,31 +89,36 @@ struct tc_action {
__u32   type; /* for backward compat(TCA_OLD_COMPAT) */
__u32   order;
struct list_headlist;
+   struct tcf_hashinfo *hinfo;
 };
 
 struct tc_action_ops {
struct list_head head;
-   struct tcf_hashinfo *hinfo;
charkind[IFNAMSIZ];
__u32   type; /* TBD to match kind */
struct module   *owner;
	int (*act)(struct sk_buff *, const struct tc_action *, struct tcf_result *);
int (*dump)(struct sk_buff *, struct tc_action *, int, int);
void(*cleanup)(struct tc_action *, int bind);
-   int (*lookup)(struct tc_action *, u32);
+   int (*lookup)(struct net *, struct tc_action *, u32);
int (*init)(struct net *net, struct nlattr *nla,
struct nlattr *est, struct tc_action *act, int ovr,
int bind);
-	int (*walk)(struct sk_buff *, struct netlink_callback *, int, struct tc_action *);
+	int (*walk)(struct net *, struct sk_buff *,
+		    struct netlink_callback *, int, struct tc_action *);
 };
 
-int tcf_hash_search(struct tc_action *a, u32 index);
+int tcf_generic_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
+  struct netlink_callback *cb, int type,
+  struct tc_action *a);
+int tcf_hash_search(struct tcf_hashinfo *hinfo, struct tc_action *a, u32 index);
 u32 tcf_hash_new_index(struct tcf_hashinfo *hinfo);
-int tcf_hash_check(u32 index, struct tc_action *a, int bind);
-int tcf_hash_create(u32 index, struct nlattr *est, struct tc_action *a,
-   int size, int bind, bool cpustats);
+int tcf_hash_check(struct tcf_hashinfo *hinfo, u32 index, struct tc_action *a,
+  int bind);
+int tcf_hash_create(struct tcf_hashinfo *hinfo, u32 index, struct nlattr *est,
+   struct tc_action *a, int size, int bind, bool cpustats);
 void tcf_hash_cleanup(struct tc_action *a, struct nlattr *est);
-void tcf_hash_insert(struct tc_action *a);
+void tcf_hash_insert(struct tcf_hashinfo *hinfo, struct tc_action *a);
 
 int __tcf_hash_release(struct tc_action *a, bool bind, bool strict);
 
@@ -120,7 +127,10 @@ static inline int tcf_hash_release(struct tc_action *a, bool bind)
return __tcf_hash_release(a, bind, false);
 }
 
-int tcf_register_action(struct tc_action_ops *a, unsigned int mask);
+void tcf_hashinfo_destroy(const struct tc_action_ops *ops,
+ struct tcf_hashinfo *hinfo);
+
+int tcf_register_action(struct tc_action_ops *a);
 int tcf_unregister_action(struct tc_action_ops *a);
 int tcf_action_destroy(struct list_head *actions, int bind);
 int tcf_action_exec(struct sk_buff *skb, const struct list_head *actions,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index acafaf7..ad3339c 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -36,10 +36,9 @@ static void free_tcf(struct rcu_head *head)
kfree(p);
 }
 
-static void tcf_hash_destroy(struct tc_action *a)
+static void tcf_hash_destroy(struct tcf_hashinfo *hinfo, struct tc_action *a)
 {
struct tcf_common *p = a->priv;
-   struct tcf_hashinfo *hinfo = a->ops->hinfo;
 

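For reference, the per-netns conversion follows the kernel's standard pernet_operations pattern. A minimal sketch with an illustrative action name, assuming the tcf_hashinfo_init() signature shown above (a simplified outline, not the patch itself):

#include <net/net_namespace.h>
#include <net/netns/generic.h>
#include <net/act_api.h>

/* Illustrative per-netns state for one action module. */
struct example_act_net {
	struct tcf_hashinfo hinfo;
};

static int example_act_net_id;

static __net_init int example_act_init_net(struct net *net)
{
	struct example_act_net *an = net_generic(net, example_act_net_id);

	return tcf_hashinfo_init(&an->hinfo, 15);	/* hash mask */
}

static __net_exit void example_act_exit_net(struct net *net)
{
	struct example_act_net *an = net_generic(net, example_act_net_id);

	kfree(an->hinfo.htab);	/* entries flushed before netns teardown */
}

static struct pernet_operations example_act_net_ops = {
	.init = example_act_init_net,
	.exit = example_act_exit_net,
	.id   = &example_act_net_id,
	.size = sizeof(struct example_act_net),
};

/* From module init: register_pernet_subsys(&example_act_net_ops); */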
[PATCH net 1/3] bnxt_en: Poll link at the end of __bnxt_open_nic().

2016-02-19 Thread Michael Chan
From: Michael Chan 

When shutting down the NIC, we shut down async event processing before
freeing all the rings.  If there is a link change event during reset, the
driver may miss it and the link state may be incorrect after the NIC is
re-opened.  Poll the link at the end of __bnxt_open_nic() to get the
correct link status.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 82f4e6d..9b56058 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4662,6 +4662,7 @@ static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
/* Enable TX queues */
bnxt_tx_enable(bp);
mod_timer(&bp->timer, jiffies + bp->current_interval);
+   bnxt_update_link(bp, true);
 
return 0;
 
-- 
1.8.3.1



[PATCH net 3/3] bnxt_en: Failure to update PHY is not fatal condition.

2016-02-19 Thread Michael Chan
From: Michael Chan 

If we fail to update the PHY, we should print a warning and continue.
The current code to exit is buggy as it has not freed up the NIC
resources yet.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 78f6b5a..8ab000d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4642,7 +4642,7 @@ static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
if (link_re_init) {
rc = bnxt_update_phy_setting(bp);
if (rc)
-   goto open_err;
+   netdev_warn(bp->dev, "failed to update phy settings\n");
}
 
if (irq_re_init) {
-- 
1.8.3.1



[PATCH net 2/3] bnxt_en: Remove unnecessary call to update PHY settings.

2016-02-19 Thread Michael Chan
From: Michael Chan 

Fix bnxt_update_phy_setting() to check the correct parameters when
determining whether to update the PHY.  Requested line speed/duplex should
only be checked for forced speed mode.  This avoids unnecessary link
interruptions when loading the driver.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9b56058..78f6b5a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4554,20 +4554,18 @@ static int bnxt_update_phy_setting(struct bnxt *bp)
if (!(link_info->autoneg & BNXT_AUTONEG_FLOW_CTRL) &&
link_info->force_pause_setting != link_info->req_flow_ctrl)
update_pause = true;
-   if (link_info->req_duplex != link_info->duplex_setting)
-   update_link = true;
if (!(link_info->autoneg & BNXT_AUTONEG_SPEED)) {
if (BNXT_AUTO_MODE(link_info->auto_mode))
update_link = true;
if (link_info->req_link_speed != link_info->force_link_speed)
update_link = true;
+   if (link_info->req_duplex != link_info->duplex_setting)
+   update_link = true;
} else {
if (link_info->auto_mode == BNXT_LINK_AUTO_NONE)
update_link = true;
if (link_info->advertising != link_info->auto_link_speeds)
update_link = true;
-   if (link_info->req_link_speed != link_info->auto_link_speed)
-   update_link = true;
}
 
if (update_link)
-- 
1.8.3.1



[PATCH net 0/3] bnxt_en: Phy related fixes.

2016-02-19 Thread Michael Chan
3 small patches to fix PHY related code.

Michael Chan (3):
  bnxt_en: Poll link at the end of __bnxt_open_nic().
  bnxt_en: Remove unnecessary call to update PHY settings.
  bnxt_en: Failure to update PHY is not fatal condition.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

-- 
1.8.3.1



Re: [PATCH next v3 1/3] ipvlan: scrub skb before routing in L3 mode.

2016-02-19 Thread Cong Wang
On Thu, Feb 18, 2016 at 5:06 PM, Mahesh Bandewar  wrote:
> On Thu, Feb 18, 2016 at 4:44 PM, Cong Wang  wrote:
>> On Thu, Feb 18, 2016 at 4:39 PM, Mahesh Bandewar  wrote:
>>> [snip]
> -   skb_dst_drop(skb);
> +   skb_scrub_packet(skb, true);

 At least this patch is still same with the previous version. Or am I
 missing anything?
>>>
>>>  xnet param is now set to 'true'.
>>
>> Oh, I was suggesting to set xnet based on the netns of both ipvlan
>> device and physical device, not setting it to be true or false
>> unconditionally.
>>
> Well, I thought about that, but I don't know of any use case / user who
> is using the ipvlan slave devices in the same ns as the master, hence I
> decided to do it this way.

In practice, probably, but in theory we still need to consider it. ;)


Re: [PATCH] unix_diag: fix incorrect sign extension in unix_lookup_by_ino

2016-02-19 Thread Cong Wang
On Thu, Feb 18, 2016 at 5:27 PM, Dmitry V. Levin  wrote:
> The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
> __u32, but unix_lookup_by_ino's argument ino has type int, which is not
> a problem yet.
> However, when ino is compared with the sock_i_ino return value of type
> unsigned long, ino is sign-extended to signed long, and this results
> in an incorrect comparison on 64-bit architectures for inode numbers
> greater than INT_MAX.
>
> This bug was found by strace test suite.
>
> Signed-off-by: Dmitry V. Levin 
> Cc: 

Fixes: 5d3cae8bc39d ("unix_diag: Dumping exact socket core")
Acked-by: Cong Wang 

Thanks.



Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Tom Herbert
On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross  wrote:
> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
 This patch series makes it so that we enable the outer Tx checksum for IPv4
 tunnels by default.  This makes the behavior consistent with how we were
 handling this for IPv6.  In addition I have updated the internal flags for
 these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
 match up will with the ZERO_CSUM6_TX flag which was already in use for
 IPv6.

 For most network devices this should be a net gain in terms of performance
 as having the outer header checksum present allows for devices to report
 CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in 
 order
 to determine if the inner header checksum is valid.

 Below is some data I collected with ixgbe with an X540 that demonstrates
 this.  I located two PFs connected back to back in two different name
 spaces and then setup a pair of tunnels on each, one with checksum enabled
 and one without.

 Recv   Send    Send                          Utilization
 Socket Socket  Message  Elapsed              Send
 Size   Size    Size     Time     Throughput  local
 bytes  bytes   bytes    secs.    10^6bits/s  % S

 noudpcsum:
  87380  16384  16384    30.00    8898.67   12.80
 udpcsum:
  87380  16384  16384    30.00    9088.47   5.69

 The one spot where this may cause a performance regression is if the
 environment contains devices that can parse the inner headers and a device
 supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
 the case of such a device we have to fall back to using GSO to segment the
 tunnel instead of TSO and as a result we may take a performance hit as seen
 below with i40e.
>>>
>>> Do you have any numbers from 40G links? Obviously, at 10G the links
>>> are basically saturated and while I can see a difference in the
>>> utilization rate, I suspect that the change will be much more apparent
>>> at higher speeds.
>>
>> Unfortunately I don't have any true 40G links to test with.  The
>> closest I can get is to run PF to VF on an i40e.  Running that I have
>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>> difference being related to the fact that we are having to
>> allocate/free more skbs and make more trips through the
>> i40e_lan_xmit_frame function resulting in more descriptors.
>
> OK, I guess that is more or less in line with what I would expect off
> the top my head. There is a reasonably significant drop in the worst
> case.
>
>>> I'm concerned about the drop in performance for devices that currently
>>> support offloads (almost none of which expose
>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>>> care most about tunnel performance are the ones that already have
>>> these NICs and will be the most impacted by the drop.
>>
>> The problem is being able to transmit fast is kind of pointless if the
>> receiving end cannot handle it.  We hadn't gotten around to really
>> getting the Rx checksum bits working until the 3.18 kernel which I
>> don't suspect many people are running so at this point messing with
>> the TSO bits isn't really making much of a difference.  Then on top of
>> that most devices have certain limitations on how many ports they can
>> handle and such.  I know the i40e is supposed to support something
>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>> as I recall.  So this whole thing is already really brittle as it is.
>> My goal with this change is to make the behavior more consistent
>> across the board.
>
> That's true to some degree but there are certainly plenty of cases
> where TSO makes a difference - lower CPU usage, transmitting to
> multiple receivers, people will upgrade their kernels, etc. It's
> clearly good to make things more consistent but hopefully not by
> reducing existing performance. :)
>
>>> My hope is that we can continue to use TSO on devices that only
>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>>> length field may vary across segments. However, in practice this is
>>> the only on the final segment and only in cases where the total length
>>> is not a multiple of the MSS. If we could detect cases where those
>>> conditions are met, we could continue to use TSO with the UDP checksum
>>> field pre-populated. A possible step even further would be to break
>>> off the final segment into a separate packet to make things conform if
>>> necessary. This would avoid a performance regression and I think make
>>> this more palatable to a lot of people.
>>
>> I think Tom and I had discussed this possibility a bit at netconf.
>> The GSO logic is something I planned on looking at over the next
>> several weeks as I suspect there is probably room for improvement
>> there.

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Jesse Gross
On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>> tunnels by default.  This makes the behavior consistent with how we were
>>> handling this for IPv6.  In addition I have updated the internal flags for
>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>> IPv6.
>>>
>>> For most network devices this should be a net gain in terms of performance
>>> as having the outer header checksum present allows for devices to report
>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>> to determine if the inner header checksum is valid.
>>>
>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>> this.  I located two PFs connected back to back in two different name
>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>> and one without.
>>>
>>> Recv   Send    Send                          Utilization
>>> Socket Socket  Message  Elapsed              Send
>>> Size   Size    Size     Time     Throughput  local
>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>
>>> noudpcsum:
>>>  87380  16384  16384    30.00    8898.67   12.80
>>> udpcsum:
>>>  87380  16384  16384    30.00    9088.47   5.69
>>>
>>> The one spot where this may cause a performance regression is if the
>>> environment contains devices that can parse the inner headers and a device
>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>> the case of such a device we have to fall back to using GSO to segment the
>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>> below with i40e.
>>
>> Do you have any numbers from 40G links? Obviously, at 10G the links
>> are basically saturated and while I can see a difference in the
>> utilization rate, I suspect that the change will be much more apparent
>> at higher speeds.
>
> Unfortunately I don't have any true 40G links to test with.  The
> closest I can get is to run PF to VF on an i40e.  Running that I have
> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
> difference being related to the fact that we are having to
> allocate/free more skbs and make more trips through the
> i40e_lan_xmit_frame function resulting in more descriptors.

OK, I guess that is more or less in line with what I would expect off
the top my head. There is a reasonably significant drop in the worst
case.

>> I'm concerned about the drop in performance for devices that currently
>> support offloads (almost none of which expose
>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>> care most about tunnel performance are the ones that already have
>> these NICs and will be the most impacted by the drop.
>
> The problem is being able to transmit fast is kind of pointless if the
> receiving end cannot handle it.  We hadn't gotten around to really
> getting the Rx checksum bits working until the 3.18 kernel which I
> don't suspect many people are running so at this point messing with
> the TSO bits isn't really making much of a difference.  Then on top of
> that most devices have certain limitations on how many ports they can
> handle and such.  I know the i40e is supposed to support something
> like 10 port numbers, but the fm10k and ixgbe are limited to one port
> as I recall.  So this whole thing is already really brittle as it is.
> My goal with this change is to make the behavior more consistent
> across the board.

That's true to some degree but there are certainly plenty of cases
where TSO makes a difference - lower CPU usage, transmitting to
multiple receivers, people will upgrade their kernels, etc. It's
clearly good to make things more consistent but hopefully not by
reducing existing performance. :)

>> My hope is that we can continue to use TSO on devices that only
>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>> length field may vary across segments. However, in practice this is
>> the only on the final segment and only in cases where the total length
>> is not a multiple of the MSS. If we could detect cases where those
>> conditions are met, we could continue to use TSO with the UDP checksum
>> field pre-populated. A possible step even further would be to break
>> off the final segment into a separate packet to make things conform if
>> necessary. This would avoid a performance regression and I think make
>> this more palatable to a lot of people.
>
> I think Tom and I had discussed this possibility a bit at netconf.
> The GSO logic is something I planned on looking at over the next
> several weeks as I suspect there is probably room for improvement
> there.

That sounds great.

>>> I also haven't investigated the effect this will ha

[Patch net] net_sched: fix memory leaks when rmmod tc action modules

2016-02-19 Thread Cong Wang
We only release the memory of the hashtable itself, not its
entries inside. We need to do both.

Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 include/net/act_api.h |  5 -
 net/sched/act_api.c   | 32 +---
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9d446f13..8c4e3ff 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -65,11 +65,6 @@ static inline int tcf_hashinfo_init(struct tcf_hashinfo *hf, unsigned int mask)
return 0;
 }
 
-static inline void tcf_hashinfo_destroy(struct tcf_hashinfo *hf)
-{
-   kfree(hf->htab);
-}
-
 /* Update lastuse only if needed, to avoid dirtying a cache line.
  * We use a temp variable to avoid fetching jiffies twice.
  */
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 06e7c4a..acafaf7 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -69,7 +69,7 @@ int __tcf_hash_release(struct tc_action *a, bool bind, bool strict)
if (a->ops->cleanup)
a->ops->cleanup(a, bind);
tcf_hash_destroy(a);
-   ret = 1;
+   ret = ACT_P_DELETED;
}
}
 
@@ -302,6 +302,32 @@ void tcf_hash_insert(struct tc_action *a)
 }
 EXPORT_SYMBOL(tcf_hash_insert);
 
+static void tcf_hashinfo_destroy(const struct tc_action_ops *ops)
+{
+   struct tcf_hashinfo *hinfo = ops->hinfo;
+   struct tc_action a = {
+   .ops = ops,
+   };
+   int i;
+
+   for (i = 0; i < hinfo->hmask + 1; i++) {
+   struct tcf_common *p;
+   struct hlist_node *n;
+
+   hlist_for_each_entry_safe(p, n, &hinfo->htab[i], tcfc_head) {
+   int ret;
+
+   a.priv = p;
+   ret = __tcf_hash_release(&a, false, true);
+   if (ret == ACT_P_DELETED)
+   module_put(ops->owner);
+   else if (ret < 0)
+   return;
+   }
+   }
+   kfree(hinfo->htab);
+}
+
 static LIST_HEAD(act_base);
 static DEFINE_RWLOCK(act_mod_lock);
 
@@ -333,7 +359,7 @@ int tcf_register_action(struct tc_action_ops *act, unsigned int mask)
list_for_each_entry(a, &act_base, head) {
if (act->type == a->type || (strcmp(act->kind, a->kind) == 0)) {
write_unlock(&act_mod_lock);
-   tcf_hashinfo_destroy(act->hinfo);
+   tcf_hashinfo_destroy(act);
kfree(act->hinfo);
return -EEXIST;
}
@@ -353,7 +379,7 @@ int tcf_unregister_action(struct tc_action_ops *act)
list_for_each_entry(a, &act_base, head) {
if (a == act) {
list_del(&act->head);
-   tcf_hashinfo_destroy(act->hinfo);
+   tcf_hashinfo_destroy(act);
kfree(act->hinfo);
err = 0;
break;
-- 
2.1.0



Re: [PATCH net-next] net: use skb_postpush_rcsum instead of own implementations

2016-02-19 Thread Tom Herbert
On Fri, Feb 19, 2016 at 3:29 PM, Daniel Borkmann  wrote:
> Replace individual implementations with the recently introduced
> skb_postpush_rcsum() helper.
>
> Signed-off-by: Daniel Borkmann 

Acked-by: Tom Herbert 

Looks like some nice cleanup!

> ---
>  net/core/skbuff.c  | 4 +---
>  net/ipv6/reassembly.c  | 6 ++
>  net/openvswitch/actions.c  | 8 +++-
>  net/openvswitch/vport-netdev.c | 2 +-
>  net/openvswitch/vport.h| 7 ---
>  5 files changed, 7 insertions(+), 20 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index a5bd067..8bd4b79 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4496,9 +4496,7 @@ int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
> skb->mac_len += VLAN_HLEN;
> __skb_pull(skb, offset);
>
> -   if (skb->ip_summed == CHECKSUM_COMPLETE)
> -   skb->csum = csum_add(skb->csum, csum_partial(skb->data
> -   + (2 * ETH_ALEN), VLAN_HLEN, 0));
> +   skb_postpush_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN);
> }
> __vlan_hwaccel_put_tag(skb, vlan_proto, vlan_tci);
> return 0;
> diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
> index 18f3498..e2ea311 100644
> --- a/net/ipv6/reassembly.c
> +++ b/net/ipv6/reassembly.c
> @@ -496,10 +496,8 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
> IP6CB(head)->flags |= IP6SKB_FRAGMENTED;
>
> /* Yes, and fold redundant checksum back. 8) */
> -   if (head->ip_summed == CHECKSUM_COMPLETE)
> -   head->csum = csum_partial(skb_network_header(head),
> - skb_network_header_len(head),
> - head->csum);
> +   skb_postpush_rcsum(head, skb_network_header(head),
> +  skb_network_header_len(head));
>
> rcu_read_lock();
> IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 2d59df5..e9dd47b 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -158,9 +158,7 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
> new_mpls_lse = (__be32 *)skb_mpls_header(skb);
> *new_mpls_lse = mpls->mpls_lse;
>
> -   if (skb->ip_summed == CHECKSUM_COMPLETE)
> -   skb->csum = csum_add(skb->csum, csum_partial(new_mpls_lse,
> -MPLS_HLEN, 0));
> +   skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN);
>
> hdr = eth_hdr(skb);
> hdr->h_proto = mpls->mpls_ethertype;
> @@ -280,7 +278,7 @@ static int set_eth_addr(struct sk_buff *skb, struct sw_flow_key *flow_key,
> ether_addr_copy_masked(eth_hdr(skb)->h_dest, key->eth_dst,
>mask->eth_dst);
>
> -   ovs_skb_postpush_rcsum(skb, eth_hdr(skb), ETH_ALEN * 2);
> +   skb_postpush_rcsum(skb, eth_hdr(skb), ETH_ALEN * 2);
>
> ether_addr_copy(flow_key->eth.src, eth_hdr(skb)->h_source);
> ether_addr_copy(flow_key->eth.dst, eth_hdr(skb)->h_dest);
> @@ -639,7 +637,7 @@ static int ovs_vport_output(struct net *net, struct sock *sk, struct sk_buff *sk
> /* Reconstruct the MAC header.  */
> skb_push(skb, data->l2_len);
> memcpy(skb->data, &data->l2_data, data->l2_len);
> -   ovs_skb_postpush_rcsum(skb, skb->data, data->l2_len);
> +   skb_postpush_rcsum(skb, skb->data, data->l2_len);
> skb_reset_mac_header(skb);
>
> ovs_vport_send(vport, skb);
> diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
> index 6a6adf3..4e39723 100644
> --- a/net/openvswitch/vport-netdev.c
> +++ b/net/openvswitch/vport-netdev.c
> @@ -58,7 +58,7 @@ static void netdev_port_receive(struct sk_buff *skb)
> return;
>
> skb_push(skb, ETH_HLEN);
> -   ovs_skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
> +   skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
> ovs_vport_receive(vport, skb, skb_tunnel_info(skb));
> return;
>  error:
> diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
> index c10899cb..f01f28a 100644
> --- a/net/openvswitch/vport.h
> +++ b/net/openvswitch/vport.h
> @@ -185,13 +185,6 @@ static inline struct vport *vport_from_priv(void *priv)
>  int ovs_vport_receive(struct vport *, struct sk_buff *,
>   const struct ip_tunnel_info *);
>
> -static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
> - const void *start, unsigned int len)
> -{
> -   if (skb->ip_summed == CHECKSUM_COMPLETE)
> -   skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
> -}
> -
>  static inline const char *ovs_vport_name(struct vport *vport)
>  {
> retur

[Patch net-next] phy: marvell/micrel: Fix Unpossible condition

2016-02-19 Thread Andrew Lunn
commit 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
from Dec 30, 2015, leads to the following static checker
warning:

drivers/net/phy/micrel.c:609 kszphy_get_stat()
warn: unsigned 'val' is never less than zero.

drivers/net/phy/micrel.c
   602  static u64 kszphy_get_stat(struct phy_device *phydev, int i)
   603  {
   604  struct kszphy_hw_stat stat = kszphy_hw_stats[i];
   605  struct kszphy_priv *priv = phydev->priv;
   606  u64 val;
   607
   608  val = phy_read(phydev, stat.reg);
   609  if (val < 0) {
^^^
Unpossible!

   610  val = UINT64_MAX;
   611  } else {
   612  val = val & ((1 << stat.bits) - 1);
   613  priv->stats[i] += val;
   614  val = priv->stats[i];
   615  }
   616
   617  return val;
   618  }

The same problem exists in the Marvell driver. Fix both.
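
As a standalone illustration of the pitfall (plain user-space C, not
kernel code): a negative error code stored in an unsigned 64-bit
variable becomes a huge positive value, so the error check can never
fire.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t val = -5;   /* like storing phy_read()'s -EIO in a u64 */

            /* An unsigned value is never < 0, so this prints 0 and the
             * equivalent error path in the driver is unreachable.
             */
            printf("val < 0: %d\n", val < 0);
            printf("val: %llu\n", (unsigned long long)val);
            return 0;
    }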

Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
Reported-by: Dan Carpenter 
Reported-by: Julia.Lawall 
Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 10 +-
 drivers/net/phy/micrel.c  |  9 +
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 1dcbd3ff9e38..d0168f1a1bf0 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -1065,8 +1065,8 @@ static u64 marvell_get_stat(struct phy_device *phydev, 
int i)
 {
struct marvell_hw_stat stat = marvell_hw_stats[i];
struct marvell_priv *priv = phydev->priv;
-   int err, oldpage;
-   u64 val;
+   int err, oldpage, val;
+   u64 ret;
 
oldpage = phy_read(phydev, MII_MARVELL_PHY_PAGE);
err = phy_write(phydev, MII_MARVELL_PHY_PAGE,
@@ -1076,16 +1076,16 @@ static u64 marvell_get_stat(struct phy_device *phydev, 
int i)
 
val = phy_read(phydev, stat.reg);
if (val < 0) {
-   val = UINT64_MAX;
+   ret = UINT64_MAX;
} else {
val = val & ((1 << stat.bits) - 1);
priv->stats[i] += val;
-   val = priv->stats[i];
+   ret = priv->stats[i];
}
 
phy_write(phydev, MII_MARVELL_PHY_PAGE, oldpage);
 
-   return val;
+   return ret;
 }
 
 static void marvell_get_stats(struct phy_device *phydev,
diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index 03833dbfca67..48219c83fb00 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -612,18 +612,19 @@ static u64 kszphy_get_stat(struct phy_device *phydev, int 
i)
 {
struct kszphy_hw_stat stat = kszphy_hw_stats[i];
struct kszphy_priv *priv = phydev->priv;
-   u64 val;
+   int val;
+   u64 ret;
 
val = phy_read(phydev, stat.reg);
if (val < 0) {
-   val = UINT64_MAX;
+   ret = UINT64_MAX;
} else {
val = val & ((1 << stat.bits) - 1);
priv->stats[i] += val;
-   val = priv->stats[i];
+   ret = priv->stats[i];
}
 
-   return val;
+   return ret;
 }
 
 static void kszphy_get_stats(struct phy_device *phydev,
-- 
2.7.0



[PATCH net-next] net: use skb_postpush_rcsum instead of own implementations

2016-02-19 Thread Daniel Borkmann
Replace individual implementations with the recently introduced
skb_postpush_rcsum() helper.
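
For context, the adopted helper has the same effect as the OVS-private
ovs_skb_postpush_rcsum() removed below; a sketch of its shape (compare
the net/openvswitch/vport.h hunk for the identical open-coded version):

    static inline void skb_postpush_rcsum(struct sk_buff *skb,
                                          const void *start, unsigned int len)
    {
            /* After pushing 'len' header bytes at 'start', fold them back
             * into the receive checksum so CHECKSUM_COMPLETE stays valid.
             */
            if (skb->ip_summed == CHECKSUM_COMPLETE)
                    skb->csum = csum_add(skb->csum,
                                         csum_partial(start, len, 0));
    }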

Signed-off-by: Daniel Borkmann 
---
 net/core/skbuff.c  | 4 +---
 net/ipv6/reassembly.c  | 6 ++
 net/openvswitch/actions.c  | 8 +++-
 net/openvswitch/vport-netdev.c | 2 +-
 net/openvswitch/vport.h| 7 ---
 5 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a5bd067..8bd4b79 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4496,9 +4496,7 @@ int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, 
u16 vlan_tci)
skb->mac_len += VLAN_HLEN;
__skb_pull(skb, offset);
 
-   if (skb->ip_summed == CHECKSUM_COMPLETE)
-   skb->csum = csum_add(skb->csum, csum_partial(skb->data
-   + (2 * ETH_ALEN), VLAN_HLEN, 0));
+   skb_postpush_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN);
}
__vlan_hwaccel_put_tag(skb, vlan_proto, vlan_tci);
return 0;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 18f3498..e2ea311 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -496,10 +496,8 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct 
sk_buff *prev,
IP6CB(head)->flags |= IP6SKB_FRAGMENTED;
 
/* Yes, and fold redundant checksum back. 8) */
-   if (head->ip_summed == CHECKSUM_COMPLETE)
-   head->csum = csum_partial(skb_network_header(head),
- skb_network_header_len(head),
- head->csum);
+   skb_postpush_rcsum(head, skb_network_header(head),
+  skb_network_header_len(head));
 
rcu_read_lock();
IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 2d59df5..e9dd47b 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -158,9 +158,7 @@ static int push_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
new_mpls_lse = (__be32 *)skb_mpls_header(skb);
*new_mpls_lse = mpls->mpls_lse;
 
-   if (skb->ip_summed == CHECKSUM_COMPLETE)
-   skb->csum = csum_add(skb->csum, csum_partial(new_mpls_lse,
-MPLS_HLEN, 0));
+   skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN);
 
hdr = eth_hdr(skb);
hdr->h_proto = mpls->mpls_ethertype;
@@ -280,7 +278,7 @@ static int set_eth_addr(struct sk_buff *skb, struct 
sw_flow_key *flow_key,
ether_addr_copy_masked(eth_hdr(skb)->h_dest, key->eth_dst,
   mask->eth_dst);
 
-   ovs_skb_postpush_rcsum(skb, eth_hdr(skb), ETH_ALEN * 2);
+   skb_postpush_rcsum(skb, eth_hdr(skb), ETH_ALEN * 2);
 
ether_addr_copy(flow_key->eth.src, eth_hdr(skb)->h_source);
ether_addr_copy(flow_key->eth.dst, eth_hdr(skb)->h_dest);
@@ -639,7 +637,7 @@ static int ovs_vport_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
/* Reconstruct the MAC header.  */
skb_push(skb, data->l2_len);
memcpy(skb->data, &data->l2_data, data->l2_len);
-   ovs_skb_postpush_rcsum(skb, skb->data, data->l2_len);
+   skb_postpush_rcsum(skb, skb->data, data->l2_len);
skb_reset_mac_header(skb);
 
ovs_vport_send(vport, skb);
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 6a6adf3..4e39723 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -58,7 +58,7 @@ static void netdev_port_receive(struct sk_buff *skb)
return;
 
skb_push(skb, ETH_HLEN);
-   ovs_skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
+   skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
ovs_vport_receive(vport, skb, skb_tunnel_info(skb));
return;
 error:
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index c10899cb..f01f28a 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -185,13 +185,6 @@ static inline struct vport *vport_from_priv(void *priv)
 int ovs_vport_receive(struct vport *, struct sk_buff *,
  const struct ip_tunnel_info *);
 
-static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
- const void *start, unsigned int len)
-{
-   if (skb->ip_summed == CHECKSUM_COMPLETE)
-   skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
-}
-
 static inline const char *ovs_vport_name(struct vport *vport)
 {
return vport->dev->name;
-- 
1.9.3



Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Alex Duyck
On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>> tunnels by default.  This makes the behavior consistent with how we were
>> handling this for IPv6.  In addition I have updated the internal flags for
>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>> match up well with the ZERO_CSUM6_TX flag which was already in use for
>> IPv6.
>>
>> For most network devices this should be a net gain in terms of performance
>> as having the outer header checksum present allows for devices to report
>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>> to determine if the inner header checksum is valid.
>>
>> Below is some data I collected with ixgbe with an X540 that demonstrates
>> this.  I located two PFs connected back to back in two different name
>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>> and one without.
>>
>> Recv   Send    Send                          Utilization
>> Socket Socket  Message  Elapsed              Send
>> Size   Size    Size     Time     Throughput  local
>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>
>> noudpcsum:
>>  87380  16384  16384    30.00    8898.67   12.80
>> udpcsum:
>>  87380  16384  16384    30.00    9088.47   5.69
>>
>> The one spot where this may cause a performance regression is if the
>> environment contains devices that can parse the inner headers and a device
>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>> the case of such a device we have to fall back to using GSO to segment the
>> tunnel instead of TSO and as a result we may take a performance hit as seen
>> below with i40e.
>
> Do you have any numbers from 40G links? Obviously, at 10G the links
> are basically saturated and while I can see a difference in the
> utilization rate, I suspect that the change will be much more apparent
> at higher speeds.

Unfortunately I don't have any true 40G links to test with.  The
closest I can get is to run PF to VF on an i40e.  Running that I have
seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
difference being related to the fact that we are having to
allocate/free more skbs and make more trips through the
i40e_lan_xmit_frame function resulting in more descriptors.

> I'm concerned about the drop in performance for devices that currently
> support offloads (almost none of which expose
> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
> care most about tunnel performance are the ones that already have
> these NICs and will be the most impacted by the drop.

The problem is that being able to transmit fast is kind of pointless if the
receiving end cannot handle it.  We hadn't gotten around to really
getting the Rx checksum bits working until the 3.18 kernel, which I
suspect not many people are running, so at this point messing with
the TSO bits isn't really making much of a difference.  Then on top of
that most devices have certain limitations on how many ports they can
handle and such.  I know the i40e is supposed to support something
like 10 port numbers, but the fm10k and ixgbe are limited to one port
as I recall.  So this whole thing is already really brittle as it is.
My goal with this change is to make the behavior more consistent
across the board.

> My hope is that we can continue to use TSO on devices that only
> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
> length field may vary across segments. However, in practice this
> happens only on the final segment, and only in cases where the total length
> is not a multiple of the MSS. If we could detect cases where those
> conditions are met, we could continue to use TSO with the UDP checksum
> field pre-populated. A possible step even further would be to break
> off the final segment into a separate packet to make things conform if
> necessary. This would avoid a performance regression and I think make
> this more palatable to a lot of people.

I think Tom and I had discussed this possibility a bit at netconf.
The GSO logic is something I planned on looking at over the next
several weeks as I suspect there is probably room for improvement
there.

>> I also haven't investigated the effect this will have on OVS.  However I
>> suspect the impact should be minimal as the worst case scenario should be
>> that Tx checksumming will become enabled by default which should be
>> consistent with the existing behavior for IPv6.
>
> I don't think that it should cause any problems.

Good to hear.

Do you know if OVS has some way to control the VXLAN configuration so
that it could disable Tx checksums?  If so that would probably be a
good way to address the 40G issues assuming someone is running an
environment that has nothing but NICs that can support the TSO and Rx
checksum on inner headers.


[PATCH] Bluetooth: hci_core: Avoid mixing up req_complete and req_complete_skb

2016-02-19 Thread Douglas Anderson
In commit 44d271377479 ("Bluetooth: Compress the size of struct
hci_ctrl") we squashed down the size of the structure by using a union
with the assumption that all users would use the flag to determine
whether we had a req_complete or a req_complete_skb.

Unfortunately we had a case in hci_req_cmd_complete() where we weren't
looking at the flag.  This can result in a situation where we might be
storing a hci_req_complete_skb_t in a hci_req_complete_t variable, or
vice versa.

During some testing I found at least one case where the function
hci_req_sync_complete() was called improperly because the kernel thought
that it didn't require an SKB.  Looking through the stack in kgdb I
found that it was called by hci_event_packet() and that
hci_event_packet() had both of its locals "req_complete" and
"req_complete_skb" pointing to the same place: both to
hci_req_sync_complete().

Let's make sure we always check the flag.

For more details on debugging done, see .

Fixes: 44d271377479 ("Bluetooth: Compress the size of struct hci_ctrl")
Signed-off-by: Douglas Anderson 
---
Testing was done on a Chrome OS device on kernel 3.14 w/
bluetooth-next backports.  Since I couldn't reliably reproduce the
problem, I simply confirmed that existing functionality worked.

 net/bluetooth/hci_core.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 541760fe53d4..9c0a6830ff92 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -4118,8 +4118,10 @@ void hci_req_cmd_complete(struct hci_dev *hdev, u16 
opcode, u8 status,
break;
}
 
-   *req_complete = bt_cb(skb)->hci.req_complete;
-   *req_complete_skb = bt_cb(skb)->hci.req_complete_skb;
+   if (bt_cb(skb)->hci.req_flags & HCI_REQ_SKB)
+   *req_complete_skb = bt_cb(skb)->hci.req_complete_skb;
+   else
+   *req_complete = bt_cb(skb)->hci.req_complete;
kfree_skb(skb);
}
spin_unlock_irqrestore(&hdev->cmd_q.lock, flags);
-- 
2.7.0.rc3.207.g0ac5344



Re: [PATCH v2 3/3] net: netcp: rework the code for get/set sw_data in dma desc

2016-02-19 Thread Arnd Bergmann
On Friday 19 February 2016 17:21:57 Murali Karicheri wrote:
> >>  get_pkt_info(&dma_buf, &tmp, &dma_desc, ndesc);
> >> -get_sw_data((u32 *)&buf_ptr, &buf_len, ndesc);
> >> +/* warning We are retrieving the virtual ptr in the 
> >> sw_data
> >> + * field as a 32bit value. Will not work on 64bit machines
> >> + */
> >> +buf_ptr = (void *)GET_SW_DATA0(ndesc);
> >> +buf_len = (int)GET_SW_DATA1(desc);
> > 
> > I would have abstracted the retrieval of a pointer again,
> > and added the comment in the helper function once, it doesn't
> > really need to be duplicated everywhere.
> > 
> Arnd,
> 
> I thought about adding it to the API. The API currently sets buffer
> and ptr. It would be an issue only if we store/retrieve a ptr in/from the sw_data.
> So for the comment to be really useful to someone who is changing the code,
> doesn't it make sense to add it at the point of invocation as done in this
> patch? No?
> 

Up to you, it was just an idea and you have my Ack either way.

Arnd


Re: [PATCH v2 3/3] net: netcp: rework the code for get/set sw_data in dma desc

2016-02-19 Thread Murali Karicheri
On 02/19/2016 03:55 PM, Arnd Bergmann wrote:
> On Friday 19 February 2016 12:58:44 Murali Karicheri wrote:
>> SW data field in descriptor can be used by software to hold private
>> data for the driver. As there are 4 words available for this purpose,
>> use separate macros to place it or retrieve the same to/from
>> descriptors. Also do type cast of data types accordingly.
>>
>> Cc: Wingman Kwok 
>> Cc: Mugunthan V N 
>> CC: Arnd Bergmann 
>> CC: Grygorii Strashko 
>> CC: David Laight 
>> Signed-off-by: Murali Karicheri 
> 
> Looks ok in principle.
> 
> Acked-by: Arnd Bergmann 
> 
>>  get_pkt_info(&dma_buf, &tmp, &dma_desc, ndesc);
>> -get_sw_data((u32 *)&buf_ptr, &buf_len, ndesc);
>> +/* warning We are retrieving the virtual ptr in the sw_data
>> + * field as a 32bit value. Will not work on 64bit machines
>> + */
>> +buf_ptr = (void *)GET_SW_DATA0(ndesc);
>> +buf_len = (int)GET_SW_DATA1(desc);
> 
> I would have abstracted the retrieval of a pointer again,
> and added the comment in the helper function once, it doesn't
> really need to be duplicated everywhere.
> 
Arnd,

I thought about adding it to the API. The API currently sets buffer
and ptr. It would be an issue only if we store/retrieve a ptr in/from the sw_data.
So for the comment to be really useful to someone who is changing the code,
doesn't it make sense to add it at the point of invocation as done in this
patch? No?

Murali

>   Arnd
> 


-- 
Murali Karicheri
Linux Kernel, Keystone


[PATCH net-next 4/6] bpf: try harder on clones when writing into skb

2016-02-19 Thread Daniel Borkmann
When we're dealing with clones and the area is not writable, try
harder and get a copy via pskb_expand_head(). Also replace other
occurrences in tc actions with the new skb_try_make_writable().
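
The intended call pattern, sketched (it mirrors the hunks below; a
non-zero return means the clone could not be unshared or expanded):

    if (skb_try_make_writable(skb, offset + len))
            return -EFAULT;   /* or goto drop/fail, depending on the caller */
    /* the first offset + len bytes of the skb are now safe to modify */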

Reported-by: Ashhad Sheikh 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/skbuff.h |  7 +++
 net/core/filter.c  | 19 ++-
 net/sched/act_csum.c   |  8 ++--
 net/sched/act_nat.c| 18 +-
 4 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 89b5367..6a57757 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2630,6 +2630,13 @@ static inline int skb_clone_writable(const struct 
sk_buff *skb, unsigned int len
   skb_headroom(skb) + len <= skb->hdr_len;
 }
 
+static inline int skb_try_make_writable(struct sk_buff *skb,
+   unsigned int write_len)
+{
+   return skb_cloned(skb) && !skb_clone_writable(skb, write_len) &&
+  pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+}
+
 static inline int __skb_cow(struct sk_buff *skb, unsigned int headroom,
int cloned)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index ea391e6..f031b82 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1364,9 +1364,7 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, 
u64 r4, u64 flags)
 */
if (unlikely((u32) offset > 0xffff || len > sizeof(sp->buff)))
return -EFAULT;
-
-   if (unlikely(skb_cloned(skb) &&
-!skb_clone_writable(skb, offset + len)))
+   if (unlikely(skb_try_make_writable(skb, offset + len)))
return -EFAULT;
 
ptr = skb_header_pointer(skb, offset, len, sp->buff);
@@ -1439,9 +1437,7 @@ static u64 bpf_l3_csum_replace(u64 r1, u64 r2, u64 from, 
u64 to, u64 flags)
return -EINVAL;
if (unlikely((u32) offset > 0xffff))
return -EFAULT;
-
-   if (unlikely(skb_cloned(skb) &&
-!skb_clone_writable(skb, offset + sizeof(sum))))
+   if (unlikely(skb_try_make_writable(skb, offset + sizeof(sum))))
return -EFAULT;
 
ptr = skb_header_pointer(skb, offset, sizeof(sum), &sum);
@@ -1488,9 +1484,7 @@ static u64 bpf_l4_csum_replace(u64 r1, u64 r2, u64 from, 
u64 to, u64 flags)
return -EINVAL;
if (unlikely((u32) offset > 0xffff))
return -EFAULT;
-
-   if (unlikely(skb_cloned(skb) &&
-!skb_clone_writable(skb, offset + sizeof(sum))))
+   if (unlikely(skb_try_make_writable(skb, offset + sizeof(sum))))
return -EFAULT;
 
ptr = skb_header_pointer(skb, offset, sizeof(sum), &sum);
@@ -1734,6 +1728,13 @@ bool bpf_helper_changes_skb_data(void *func)
return true;
if (func == bpf_skb_vlan_pop)
return true;
+   if (func == bpf_skb_store_bytes)
+   return true;
+   if (func == bpf_l3_csum_replace)
+   return true;
+   if (func == bpf_l4_csum_replace)
+   return true;
+
return false;
 }
 
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index b07c535..eeb3eb3 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -105,9 +105,7 @@ static void *tcf_csum_skb_nextlayer(struct sk_buff *skb,
int hl = ihl + jhl;
 
if (!pskb_may_pull(skb, ipl + ntkoff) || (ipl < hl) ||
-   (skb_cloned(skb) &&
-!skb_clone_writable(skb, hl + ntkoff) &&
-pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
+   skb_try_make_writable(skb, hl + ntkoff))
return NULL;
else
return (void *)(skb_network_header(skb) + ihl);
@@ -365,9 +363,7 @@ static int tcf_csum_ipv4(struct sk_buff *skb, u32 
update_flags)
}
 
if (update_flags & TCA_CSUM_UPDATE_FLAG_IPV4HDR) {
-   if (skb_cloned(skb) &&
-   !skb_clone_writable(skb, sizeof(*iph) + ntkoff) &&
-   pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
+   if (skb_try_make_writable(skb, sizeof(*iph) + ntkoff))
goto fail;
 
ip_send_check(ip_hdr(skb));
diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
index b7c4ead..27607b8 100644
--- a/net/sched/act_nat.c
+++ b/net/sched/act_nat.c
@@ -126,9 +126,7 @@ static int tcf_nat(struct sk_buff *skb, const struct 
tc_action *a,
addr = iph->daddr;
 
if (!((old_addr ^ addr) & mask)) {
-   if (skb_cloned(skb) &&
-   !skb_clone_writable(skb, sizeof(*iph) + noff) &&
-   pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
+   if (skb_try_make_writable(skb, sizeof(*iph) + noff))
goto drop;
 
new_addr &= mask;
@@ -156,9 +154,7 @@ static int tcf_nat(struct sk_buff *skb, const struct 
tc_action *a,
   

[PATCH net-next 2/6] bpf: add generic bpf_csum_diff helper

2016-02-19 Thread Daniel Borkmann
For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's
currently limited to handle 2 and 4 byte changes in a header and feeds the
from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When
working with IPv6, for example, this makes it rather cumbersome to deal
with, similarly when editing larger parts of a header.

Instead, extend the API in a more generic way: For bpf_l4_csum_replace(),
add a case for header field mask of 0 to change the checksum at a given
offset through inet_proto_csum_replace_by_diff(), and provide a helper
bpf_csum_diff() that can generically calculate a from/to diff for arbitrary
amounts of data.

This can be used in multiple ways: for the bpf_l4_csum_replace() only
part, this even provides us with the option to insert precalculated diffs
from user space f.e. from a map, or from bpf_csum_diff() during runtime.

bpf_csum_diff() has an optional from/to stack buffer input, so we can
calculate a diff by using a scratchbuffer for scenarios where we're
inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers
don't need to be of equal size) data. Also, bpf_csum_diff() allows to
feed a previous csum into csum_partial(), so the function can also be
cascaded.
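
A hypothetical cls_bpf usage sketch (the helper call conventions and
the offset variables are assumptions, not part of this patch): rewrite
an IPv6 destination address and fix the TCP checksum with one diff.

    struct in6_addr old_ip, new_ip;   /* both on the eBPF stack */
    __be32 diff;

    /* diff over the 16 changed bytes, no seed */
    diff = bpf_csum_diff((__be32 *)&old_ip, sizeof(old_ip),
                         (__be32 *)&new_ip, sizeof(new_ip), 0);
    bpf_skb_store_bytes(skb, daddr_off, &new_ip, sizeof(new_ip), 0);
    /* header field mask 0 plus from == 0 selects the new diff case;
     * the address is part of the pseudo header, so BPF_F_PSEUDO_HDR.
     */
    bpf_l4_csum_replace(skb, csum_off, 0, diff, BPF_F_PSEUDO_HDR);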

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 11 ++
 net/core/filter.c| 53 
 2 files changed, 64 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d3e77da..48d0a6c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -287,6 +287,17 @@ enum bpf_func_id {
 * Return: >= 0 stackid on success or negative error
 */
BPF_FUNC_get_stackid,
+
+   /**
+* bpf_csum_diff(from, from_size, to, to_size, seed) - calculate csum 
diff
+* @from: raw from buffer
+* @from_size: length of from buffer
+* @to: raw to buffer
+* @to_size: length of to buffer
+* @seed: optional seed
+* Return: csum result
+*/
+   BPF_FUNC_csum_diff,
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 2a6e956..bf504f8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1491,6 +1491,12 @@ static u64 bpf_l4_csum_replace(u64 r1, u64 r2, u64 from, 
u64 to, u64 flags)
return -EFAULT;
 
switch (flags & BPF_F_HDR_FIELD_MASK) {
+   case 0:
+   if (unlikely(from != 0))
+   return -EINVAL;
+
+   inet_proto_csum_replace_by_diff(ptr, skb, to, is_pseudo);
+   break;
case 2:
inet_proto_csum_replace2(ptr, skb, from, to, is_pseudo);
break;
@@ -1519,6 +1525,51 @@ const struct bpf_func_proto bpf_l4_csum_replace_proto = {
.arg5_type  = ARG_ANYTHING,
 };
 
+struct bpf_csum_scratchpad {
+   __be32 diff[128];
+};
+
+static DEFINE_PER_CPU(struct bpf_csum_scratchpad, bpf_csum_sp);
+
+static u64 bpf_csum_diff(u64 r1, u64 from_size, u64 r3, u64 to_size, u64 seed)
+{
+   struct bpf_csum_scratchpad *sp = this_cpu_ptr(&bpf_csum_sp);
+   u64 diff_size = from_size + to_size;
+   __be32 *from = (__be32 *) (long) r1;
+   __be32 *to   = (__be32 *) (long) r3;
+   int i, j = 0;
+
+   /* This is quite flexible, some examples:
+*
+* from_size == 0, to_size > 0,  seed := csum --> pushing data
+* from_size > 0,  to_size == 0, seed := csum --> pulling data
+* from_size > 0,  to_size > 0,  seed := 0--> diffing data
+*
+* Even for diffing, from_size and to_size don't need to be equal.
+*/
+   if (unlikely(((from_size | to_size) & (sizeof(__be32) - 1)) ||
+diff_size > sizeof(sp->diff)))
+   return -EINVAL;
+
+   for (i = 0; i < from_size / sizeof(__be32); i++, j++)
+   sp->diff[j] = ~from[i];
+   for (i = 0; i <   to_size / sizeof(__be32); i++, j++)
+   sp->diff[j] = to[i];
+
+   return csum_partial(sp->diff, diff_size, seed);
+}
+
+const struct bpf_func_proto bpf_csum_diff_proto = {
+   .func   = bpf_csum_diff,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_STACK,
+   .arg2_type  = ARG_CONST_STACK_SIZE_OR_ZERO,
+   .arg3_type  = ARG_PTR_TO_STACK,
+   .arg4_type  = ARG_CONST_STACK_SIZE_OR_ZERO,
+   .arg5_type  = ARG_ANYTHING,
+};
+
 static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5)
 {
struct sk_buff *skb = (struct sk_buff *) (long) r1, *skb2;
@@ -1849,6 +1900,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_skb_store_bytes_proto;
case BPF_FUNC_skb_load_bytes:
return &bpf_skb_load_bytes_proto;
+   case BPF_FUNC_csum_diff:
+   return &bpf_csum_diff_proto;
case BPF_FUNC_l3_

[PATCH net-next 3/6] bpf: remove artificial bpf_skb_{load,store}_bytes buffer limitation

2016-02-19 Thread Daniel Borkmann
We currently limit bpf_skb_store_bytes() and bpf_skb_load_bytes()
helpers to only store or load a maximum buffer of 16 bytes. Thus,
loading, rewriting and storing headers require several bpf_skb_load_bytes()
and bpf_skb_store_bytes() calls.

Also here we can use a per-cpu scratch buffer instead in order to not
pressure stack space any further. I do suspect that this limit was mainly
set in place for this particular reason. So, ease program development
by removing this limitation and make the scratchpad generic, so it can
be reused.
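
A minimal sketch of the per-cpu scratch pattern in isolation (names
hypothetical; this is safe here because these helpers run with
preemption disabled, so the buffer is not re-entered on the same CPU):

    struct scratch {
            u8 buff[512];
    };
    static DEFINE_PER_CPU(struct scratch, my_scratch);

    static void use_scratch(void)
    {
            struct scratch *sp = this_cpu_ptr(&my_scratch);

            /* use sp->buff instead of a large on-stack array */
            memset(sp->buff, 0, sizeof(sp->buff));
    }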

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/core/filter.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index bf504f8..ea391e6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1333,15 +1333,22 @@ int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk)
return 0;
 }
 
-#define BPF_LDST_LEN 16U
+struct bpf_scratchpad {
+   union {
+   __be32 diff[MAX_BPF_STACK / sizeof(__be32)];
+   u8 buff[MAX_BPF_STACK];
+   };
+};
+
+static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp);
 
 static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags)
 {
+   struct bpf_scratchpad *sp = this_cpu_ptr(&bpf_sp);
struct sk_buff *skb = (struct sk_buff *) (long) r1;
int offset = (int) r2;
void *from = (void *) (long) r3;
unsigned int len = (unsigned int) r4;
-   char buf[BPF_LDST_LEN];
void *ptr;
 
if (unlikely(flags & ~(BPF_F_RECOMPUTE_CSUM)))
@@ -1355,14 +1362,14 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, 
u64 r4, u64 flags)
 *
 * so check for invalid 'offset' and too large 'len'
 */
-   if (unlikely((u32) offset > 0xffff || len > sizeof(buf)))
+   if (unlikely((u32) offset > 0xffff || len > sizeof(sp->buff)))
return -EFAULT;
 
if (unlikely(skb_cloned(skb) &&
 !skb_clone_writable(skb, offset + len)))
return -EFAULT;
 
-   ptr = skb_header_pointer(skb, offset, len, buf);
+   ptr = skb_header_pointer(skb, offset, len, sp->buff);
if (unlikely(!ptr))
return -EFAULT;
 
@@ -1371,7 +1378,7 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, 
u64 r4, u64 flags)
 
memcpy(ptr, from, len);
 
-   if (ptr == buf)
+   if (ptr == sp->buff)
/* skb_store_bits cannot return -EFAULT here */
skb_store_bits(skb, offset, ptr, len);
 
@@ -1400,7 +1407,7 @@ static u64 bpf_skb_load_bytes(u64 r1, u64 r2, u64 r3, u64 
r4, u64 r5)
unsigned int len = (unsigned int) r4;
void *ptr;
 
-   if (unlikely((u32) offset > 0xffff || len > BPF_LDST_LEN))
+   if (unlikely((u32) offset > 0xffff || len > MAX_BPF_STACK))
return -EFAULT;
 
ptr = skb_header_pointer(skb, offset, len, to);
@@ -1525,15 +1532,9 @@ const struct bpf_func_proto bpf_l4_csum_replace_proto = {
.arg5_type  = ARG_ANYTHING,
 };
 
-struct bpf_csum_scratchpad {
-   __be32 diff[128];
-};
-
-static DEFINE_PER_CPU(struct bpf_csum_scratchpad, bpf_csum_sp);
-
 static u64 bpf_csum_diff(u64 r1, u64 from_size, u64 r3, u64 to_size, u64 seed)
 {
-   struct bpf_csum_scratchpad *sp = this_cpu_ptr(&bpf_csum_sp);
+   struct bpf_scratchpad *sp = this_cpu_ptr(&bpf_sp);
u64 diff_size = from_size + to_size;
__be32 *from = (__be32 *) (long) r1;
__be32 *to   = (__be32 *) (long) r3;
-- 
1.9.3



[PATCH net-next 5/6] bpf: fix csum update in bpf_l4_csum_replace helper for udp

2016-02-19 Thread Daniel Borkmann
When using this helper for updating UDP checksums, we need to extend
this in order to write CSUM_MANGLED_0 for csum computations that result
into 0 as sum. Reason we need this is because packets with a checksum
could otherwise become incorrectly marked as a packet without a checksum.
Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should
not turn packets without a checksum into ones with a checksum.
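
Background sketch: UDP over IPv4 uses a checksum field of 0 on the wire
to mean "no checksum" (RFC 768), so a computed sum of 0 has to be sent
as the all-ones form instead; CSUM_MANGLED_0 is that constant (0xffff):

    static __sum16 udp_fold_csum(__sum16 sum)
    {
            /* never emit 0 for a packet that does carry a checksum */
            return sum ? sum : CSUM_MANGLED_0;
    }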

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 1 +
 net/core/filter.c| 8 +++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 48d0a6c..6496f98 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -313,6 +313,7 @@ enum bpf_func_id {
 
 /* BPF_FUNC_l4_csum_replace flags. */
 #define BPF_F_PSEUDO_HDR   (1ULL << 4)
+#define BPF_F_MARK_MANGLED_0   (1ULL << 5)
 
 /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
 #define BPF_F_INGRESS  (1ULL << 0)
diff --git a/net/core/filter.c b/net/core/filter.c
index f031b82..8a0b8c3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1477,10 +1477,12 @@ static u64 bpf_l4_csum_replace(u64 r1, u64 r2, u64 
from, u64 to, u64 flags)
 {
struct sk_buff *skb = (struct sk_buff *) (long) r1;
bool is_pseudo = flags & BPF_F_PSEUDO_HDR;
+   bool is_mmzero = flags & BPF_F_MARK_MANGLED_0;
int offset = (int) r2;
__sum16 sum, *ptr;
 
-   if (unlikely(flags & ~(BPF_F_PSEUDO_HDR | BPF_F_HDR_FIELD_MASK)))
+   if (unlikely(flags & ~(BPF_F_MARK_MANGLED_0 | BPF_F_PSEUDO_HDR |
+  BPF_F_HDR_FIELD_MASK)))
return -EINVAL;
if (unlikely((u32) offset > 0xffff))
return -EFAULT;
@@ -1490,6 +1492,8 @@ static u64 bpf_l4_csum_replace(u64 r1, u64 r2, u64 from, 
u64 to, u64 flags)
ptr = skb_header_pointer(skb, offset, sizeof(sum), &sum);
if (unlikely(!ptr))
return -EFAULT;
+   if (is_mmzero && !*ptr)
+   return 0;
 
switch (flags & BPF_F_HDR_FIELD_MASK) {
case 0:
@@ -1508,6 +1512,8 @@ static u64 bpf_l4_csum_replace(u64 r1, u64 r2, u64 from, 
u64 to, u64 flags)
return -EINVAL;
}
 
+   if (is_mmzero && !*ptr)
+   *ptr = CSUM_MANGLED_0;
if (ptr == &sum)
/* skb_store_bits guaranteed to not return -EFAULT here */
skb_store_bits(skb, offset, ptr, sizeof(sum));
-- 
1.9.3



[PATCH net-next 6/6] bpf: don't emit mov A,A on return

2016-02-19 Thread Daniel Borkmann
While debugging with bpf_jit_disasm I noticed emissions of 'mov %eax,%eax',
and found that this comes from BPF_RET | BPF_A translations from classic
BPF. Emitting this is unnecessary as BPF_REG_A is mapped into BPF_REG_0
already, therefore only emit a mov when an immediate is used as the return value.
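
For reference, the classic-to-eBPF register mapping in net/core/filter.c
aliases A onto the eBPF return register, which is why RET_A already has
its value in place (sketch of the relevant define):

    /* net/core/filter.c */
    #define BPF_REG_A       BPF_REG_0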

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/core/filter.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 8a0b8c3..a3aba15 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -530,12 +530,14 @@ do_pass:
*insn = BPF_MOV64_REG(BPF_REG_A, BPF_REG_TMP);
break;
 
-   /* RET_K, RET_A are remaped into 2 insns. */
+   /* RET_K is remaped into 2 insns. RET_A case doesn't need an
+* extra mov as BPF_REG_0 is already mapped into BPF_REG_A.
+*/
case BPF_RET | BPF_A:
case BPF_RET | BPF_K:
-   *insn++ = BPF_MOV32_RAW(BPF_RVAL(fp->code) == BPF_K ?
-   BPF_K : BPF_X, BPF_REG_0,
-   BPF_REG_A, fp->k);
+   if (BPF_RVAL(fp->code) == BPF_K)
+   *insn++ = BPF_MOV32_RAW(BPF_K, BPF_REG_0,
+   0, fp->k);
*insn = BPF_EXIT_INSN();
break;
 
-- 
1.9.3



[PATCH net-next 0/6] BPF updates

2016-02-19 Thread Daniel Borkmann
This set contains various updates for eBPF, i.e. the addition of a
generic csum helper function and other misc bits that mostly improve
existing helpers and ease programming with eBPF on cls_bpf. For more
details, please see individual patches.

Set is rebased on top of http://patchwork.ozlabs.org/patch/584465/.

Thanks!

Daniel Borkmann (6):
  bpf: add new arg_type that allows for 0 sized stack buffer
  bpf: add generic bpf_csum_diff helper
  bpf: remove artificial bpf_skb_{load,store}_bytes buffer limitation
  bpf: try harder on clones when writing into skb
  bpf: fix csum update in bpf_l4_csum_replace helper for udp
  bpf: don't emit mov A,A on return

 include/linux/bpf.h  |   1 +
 include/linux/skbuff.h   |   7 
 include/uapi/linux/bpf.h |  12 ++
 kernel/bpf/verifier.c|  42 ++-
 net/core/filter.c| 103 ++-
 net/sched/act_csum.c |   8 +---
 net/sched/act_nat.c  |  18 +++--
 7 files changed, 142 insertions(+), 49 deletions(-)

-- 
1.9.3



[PATCH net-next 1/6] bpf: add new arg_type that allows for 0 sized stack buffer

2016-02-19 Thread Daniel Borkmann
Currently, when we pass a buffer from the eBPF stack into a helper
function, the function proto indicates argument types as ARG_PTR_TO_STACK
and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1>
must be of the latter type. Then, verifier checks whether the buffer
points into eBPF stack, is initialized, etc. The verifier also guarantees
that the constant value passed in R<X+1> is greater than 0, so helper
functions don't need to test for it and can always assume a non-NULL
initialized buffer as well as non-0 buffer size.

This patch adds a new argument type ARG_CONST_STACK_SIZE_OR_ZERO that
allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function.
Such helper functions, of course, need to be able to handle these cases
internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0
or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any
other combinations are not possible to load.
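
A sketch of what the verifier accepts for such an argument pair (the
helper name is hypothetical; semantics as described above):

    char buf[8];

    helper(buf, sizeof(buf));    /* ok: non-NULL pointer, non-0 size */
    helper(NULL, 0);             /* ok: newly allowed by _OR_ZERO    */
    helper(buf, 0);              /* rejected at load time            */
    helper(NULL, sizeof(buf));   /* rejected at load time            */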

I went through various options of extending the verifier, and introducing
the type ARG_CONST_STACK_SIZE_OR_ZERO seems to require the smallest
change to the verifier.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/verifier.c | 42 --
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0cadbb7..51e498e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -65,6 +65,7 @@ enum bpf_arg_type {
 */
ARG_PTR_TO_STACK,   /* any pointer to eBPF program stack */
ARG_CONST_STACK_SIZE,   /* number of bytes accessed from stack */
+   ARG_CONST_STACK_SIZE_OR_ZERO, /* number of bytes accessed from stack or 
0 */
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 42ba4cc..36dc497 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -779,15 +779,24 @@ static int check_xadd(struct verifier_env *env, struct 
bpf_insn *insn)
  * bytes from that pointer, make sure that it's within stack boundary
  * and all elements of stack are initialized
  */
-static int check_stack_boundary(struct verifier_env *env,
-   int regno, int access_size)
+static int check_stack_boundary(struct verifier_env *env, int regno,
+   int access_size, bool zero_size_allowed)
 {
struct verifier_state *state = &env->cur_state;
struct reg_state *regs = state->regs;
int off, i;
 
-   if (regs[regno].type != PTR_TO_STACK)
+   if (regs[regno].type != PTR_TO_STACK) {
+   if (zero_size_allowed && access_size == 0 &&
+   regs[regno].type == CONST_IMM &&
+   regs[regno].imm  == 0)
+   return 0;
+
+   verbose("R%d type=%s expected=%s\n", regno,
+   reg_type_str[regs[regno].type],
+   reg_type_str[PTR_TO_STACK]);
return -EACCES;
+   }
 
off = regs[regno].imm;
if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
@@ -830,15 +839,24 @@ static int check_func_arg(struct verifier_env *env, u32 
regno,
return 0;
}
 
-   if (arg_type == ARG_PTR_TO_STACK || arg_type == ARG_PTR_TO_MAP_KEY ||
+   if (arg_type == ARG_PTR_TO_MAP_KEY ||
arg_type == ARG_PTR_TO_MAP_VALUE) {
expected_type = PTR_TO_STACK;
-   } else if (arg_type == ARG_CONST_STACK_SIZE) {
+   } else if (arg_type == ARG_CONST_STACK_SIZE ||
+  arg_type == ARG_CONST_STACK_SIZE_OR_ZERO) {
expected_type = CONST_IMM;
} else if (arg_type == ARG_CONST_MAP_PTR) {
expected_type = CONST_PTR_TO_MAP;
} else if (arg_type == ARG_PTR_TO_CTX) {
expected_type = PTR_TO_CTX;
+   } else if (arg_type == ARG_PTR_TO_STACK) {
+   expected_type = PTR_TO_STACK;
+   /* One exception here. In case function allows for NULL to be
+* passed in as argument, it's a CONST_IMM type. Final test
+* happens during stack boundary checking.
+*/
+   if (reg->type == CONST_IMM && reg->imm == 0)
+   expected_type = CONST_IMM;
} else {
verbose("unsupported arg_type %d\n", arg_type);
return -EFAULT;
@@ -868,8 +886,8 @@ static int check_func_arg(struct verifier_env *env, u32 
regno,
verbose("invalid map_ptr to access map->key\n");
return -EACCES;
}
-   err = check_stack_boundary(env, regno, (*mapp)->key_size);
-
+   err = check_stack_boundary(env, regno, (*mapp)->key_size,
+  false);
} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {

Re: [PATCH V7 8/8] i40e/ethtool: support coalesce setting by queue

2016-02-19 Thread Jeff Kirsher
On Fri, 2016-02-19 at 09:24 -0500, Kan Liang wrote:
> From: Kan Liang 
> 
> This patch implements set_per_queue_coalesce for i40e driver.
> 
> Signed-off-by: Kan Liang 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 7 +++
>  1 file changed, 7 insertions(+)

Acked-by: Jeff Kirsher 

signature.asc
Description: This is a digitally signed message part


Re: [PATCH V7 7/8] i40e/ethtool: support coalesce getting by queue

2016-02-19 Thread Jeff Kirsher
On Fri, 2016-02-19 at 09:24 -0500, Kan Liang wrote:
> From: Kan Liang 
> 
> This patch implements get_per_queue_coalesce for i40e driver.
> 
> Signed-off-by: Kan Liang 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 7 +++
>  1 file changed, 7 insertions(+)

Acked-by: Jeff Kirsher 

signature.asc
Description: This is a digitally signed message part


Re: [PATCH V7 6/8] i40e: queue-specific settings for interrupt moderation

2016-02-19 Thread Jeff Kirsher
On Fri, 2016-02-19 at 09:24 -0500, Kan Liang wrote:
> From: Kan Liang 
> 
> For i40e driver, each vector has its own ITR register. However, there
> is no concept of queue-specific settings in the driver proper. Only a
> global variable is used to store ITR values. That will cause problems,
> especially when resetting the vector: the specific ITR values could be
> lost.
> This patch moves rx_itr_setting and tx_itr_setting to i40e_ring to store
> the specific ITR setting for each queue.
> i40e_get_coalesce and i40e_set_coalesce are also modified accordingly to
> support queue-specific settings. To keep compatibility with old ethtool,
> if the user doesn't specify a queue number, i40e_get_coalesce will return
> queue 0's value, while i40e_set_coalesce will apply the value to all
> queues.
> 
> Signed-off-by: Kan Liang 
> Acked-by: Shannon Nelson 

Acked-by: Jeff Kirsher 

There is one minor nitpick noted below, but that should not hold up
this patch.

> ---
>  drivers/net/ethernet/intel/i40e/i40e.h |   7 --
>  drivers/net/ethernet/intel/i40e/i40e_debugfs.c |  15 ++-
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 139
> -
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  12 +--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |   9 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   8 ++
>  6 files changed, 120 insertions(+), 70 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h
> b/drivers/net/ethernet/intel/i40e/i40e.h
> index e99be9f..2f6210a 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -521,13 +521,6 @@ struct i40e_vsi {
> struct i40e_ring **tx_rings;
>  
> u16 work_limit;
> -   /* high bit set means dynamic, use accessor routines to
> read/write.
> -    * hardware only supports 2us resolution for the ITR
> registers.
> -    * these values always store the USER setting, and must be
> converted
> -    * before programming to a register.
> -    */
> -   u16 rx_itr_setting;
> -   u16 tx_itr_setting;
> u16 int_rate_limit;  /* value in usecs */
>  
> u16 rss_table_size; /* HW RSS table size */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
> b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
> index 2a44f2e..0c97733 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
> @@ -302,6 +302,10 @@ static void i40e_dbg_dump_vsi_seid(struct
> i40e_pf *pf, int seid)
>  "    rx_rings[%i]: vsi = %p, q_vector =
> %p\n",
>  i, rx_ring->vsi,
>  rx_ring->q_vector);
> +   dev_info(&pf->pdev->dev,
> +    "    rx_rings[%i]: rx_itr_setting = %d
> (%s)\n",
> +    i, rx_ring->rx_itr_setting,
> +    ITR_IS_DYNAMIC(rx_ring->rx_itr_setting) ?
> "dynamic" : "fixed");
> }
> for (i = 0; i < vsi->num_queue_pairs; i++) {
> struct i40e_ring *tx_ring = ACCESS_ONCE(vsi-
> >tx_rings[i]);
> @@ -352,14 +356,15 @@ static void i40e_dbg_dump_vsi_seid(struct
> i40e_pf *pf, int seid)
> dev_info(&pf->pdev->dev,
>  "    tx_rings[%i]: DCB tc = %d\n",
>  i, tx_ring->dcb_tc);
> +   dev_info(&pf->pdev->dev,
> +    "    tx_rings[%i]: tx_itr_setting = %d
> (%s)\n",
> +    i, tx_ring->tx_itr_setting,
> +    ITR_IS_DYNAMIC(tx_ring->tx_itr_setting) ?
> "dynamic" : "fixed");
> }
> rcu_read_unlock();
> dev_info(&pf->pdev->dev,
> -    "    work_limit = %d, rx_itr_setting = %d (%s),
> tx_itr_setting = %d (%s)\n",
> -    vsi->work_limit, vsi->rx_itr_setting,
> -    ITR_IS_DYNAMIC(vsi->rx_itr_setting) ? "dynamic" :
> "fixed",
> -    vsi->tx_itr_setting,
> -    ITR_IS_DYNAMIC(vsi->tx_itr_setting) ? "dynamic" :
> "fixed");
> +    "    work_limit = %d\n",
> +    vsi->work_limit);
> dev_info(&pf->pdev->dev,
>  "    max_frame = %d, rx_hdr_len = %d, rx_buf_len =
> %d dtype = %d\n",
>  vsi->max_frame, vsi->rx_hdr_len, vsi->rx_buf_len,
> vsi->dtype);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> index a85bc94..a470599 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> @@ -1879,8 +1879,9 @@ static int i40e_set_phys_id(struct net_device
> *netdev,
>   * 125us (8000 interrupts per second) == ITR(62)
>   */
>  
> -static int i40e_get_coalesce(struct net_device *netdev,
> -    struct ethtool_coalesce *ec)
> +static int __i40e_get_coalesce(struct net_device *netdev,
> +  struct ethtool_coal

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Jesse Gross
On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
> This patch series makes it so that we enable the outer Tx checksum for IPv4
> tunnels by default.  This makes the behavior consistent with how we were
> handling this for IPv6.  In addition I have updated the internal flags for
> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
> match up well with the ZERO_CSUM6_TX flag which was already in use for
> IPv6.
>
> For most network devices this should be a net gain in terms of performance
> as having the outer header checksum present allows for devices to report
> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
> to determine if the inner header checksum is valid.
>
> Below is some data I collected with ixgbe with an X540 that demonstrates
> this.  I located two PFs connected back to back in two different name
> spaces and then setup a pair of tunnels on each, one with checksum enabled
> and one without.
>
> Recv   Send    Send                          Utilization
> Socket Socket  Message  Elapsed              Send
> Size   Size    Size     Time     Throughput  local
> bytes  bytes   bytes    secs.    10^6bits/s  % S
>
> noudpcsum:
>  87380  16384  16384    30.00    8898.67   12.80
> udpcsum:
>  87380  16384  16384    30.00    9088.47   5.69
>
> The one spot where this may cause a performance regression is if the
> environment contains devices that can parse the inner headers and a device
> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
> the case of such a device we have to fall back to using GSO to segment the
> tunnel instead of TSO and as a result we may take a performance hit as seen
> below with i40e.

Do you have any numbers from 40G links? Obviously, at 10G the links
are basically saturated and while I can see a difference in the
utilization rate, I suspect that the change will be much more apparent
at higher speeds.

I'm concerned about the drop in performance for devices that currently
support offloads (almost none of which expose
NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
care most about tunnel performance are the ones that already have
these NICs and will be the most impacted by the drop.

My hope is that we can continue to use TSO on devices that only
support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
length field may vary across segments. However, in practice this
happens only on the final segment, and only in cases where the total length
is not a multiple of the MSS. If we could detect cases where those
conditions are met, we could continue to use TSO with the UDP checksum
field pre-populated. A possible step even further would be to break
off the final segment into a separate packet to make things conform if
necessary. This would avoid a performance regression and I think make
this more palatable to a lot of people.

> I also haven't investigated the effect this will have on OVS.  However I
> suspect the impact should be minimal as the worst case scenario should be
> that Tx checksumming will become enabled by default which should be
> consistent with the existing behavior for IPv6.

I don't think that it should cause any problems.


Re: [PATCH V7 0/8] ethtool per queue parameters support

2016-02-19 Thread Jeff Kirsher
On Fri, 2016-02-19 at 09:23 -0500, Kan Liang wrote:
> Modern network interface controllers usually support multiple receive
> and transmit queues. Each queue may have its own parameters. For
> example, Intel XL710/X710 hardware supports per queue interrupt
> moderation. However, current ethtool does not support per queue
> parameters option. User has to set parameters for the whole NIC.
> This series extends ethtool to support per queue parameters option.
> 
> Since the support of per queue parameters varies with different cards,
> it is impossible to address all cards in one patch. This series only
> supports per queue coalesce options on i40e driver. The framework used
> in the patch can be easily extended to other cards and parameters.
> 
> The lib bitmap needs to be extended to facilitate exchanging queue bitmaps
> between user space and kernel space. Two patches from David's latest V8
> patch series are also cited in this series. You may refer to
> https://lkml.org/lkml/2016/2/9/919 for more details.
> 
> Changes since V6:
>  - Rebase on commit 76d13b568776. Did minor change in patch 6.
> 
> Changes since V5:
>  - Add test_bitmap.c and bitmap.sh to the series. They were forgotten
>    previously.
>  - Update the first two patches to David's latest V8 version. The
> changes
>    include
>   - bitmap u32 API returns number of bits copied, unit tests
> updated
>   - module_exit in test_bitmap
>  - Also change the mode of bitmap.sh to 755 according to Ben's
> suggestion
> 
> Changes since V4:
>  - Modify set/get_per_queue_coalesce function description
>  - Change the queue number to be u32
>  - Correct an error of calculating coalesce backup buffer address
>  - Rename queue_num to n_queues
>  - Don't log error message in __i40e_get_coalesce
> 
> Changes since V3:
>  - Based on David's lib bitmap.
>  - ETHTOOL_PERQUEUE should be handled before the containing switch
>  - Make the rollback code unconditional
>  - some minor changes according to Ben's feedback
> 
> Changes since V2:
>  - Add queue-specific settings for interrupt moderation in i40e
> 
> Changes since V1:
>  - Checking the sub-command number to determine whether the command
>    requires CAP_NET_ADMIN
>  - Refine the struct ethtool_per_queue_op and improve the comments
>  - Use bitmap functions to parse queue mask
>  - Improve comments
>  - Use bitmap functions to parse queue mask
>  - Improve comments
>  - Add rollback support
>  - Correct the way to find the vector for specific queue.
> 
> David Decotigny (2):
>   lib/bitmap.c: conversion routines to/from u32 array
>   test_bitmap: unit tests for lib/bitmap.c
> 
> Kan Liang (6):
>   net/ethtool: introduce a new ioctl for per queue setting
>   net/ethtool: support get coalesce per queue
>   net/ethtool: support set coalesce per queue
>   i40e: queue-specific settings for interrupt moderation
>   i40e/ethtool: support coalesce getting by queue
>   i40e/ethtool: support coalesce setting by queue
> 
>  drivers/net/ethernet/intel/i40e/i40e.h |   7 -
>  drivers/net/ethernet/intel/i40e/i40e_debugfs.c |  15 +-
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 151 +++
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  12 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |   9 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   8 +

Dave, I have pretty much cleared out my i40e queue of patches, so I am
fine if you want to apply the entire series (of course after proper
review) :-)

signature.asc
Description: This is a digitally signed message part


[PATCH V7 0/8] ethtool per queue parameters support

2016-02-19 Thread Kan Liang
Modern network interface controllers usually support multiple receive
and transmit queues. Each queue may have its own parameters. For
example, Intel XL710/X710 hardware supports per queue interrupt
moderation. However, current ethtool does not support per queue
parameters option. User has to set parameters for the whole NIC.
This series extends ethtool to support per queue parameters option.

Since the support of per queue parameters varies with different cards,
it is impossible to address all cards in one patch. This series only
supports per queue coalesce options on i40e driver. The framework used
in the patch can be easily extended to other cards and parameters.

The lib bitmap needs to be extended to facilitate exchanging queue bitmaps
between user space and kernel space. Two patches from David's latest V8
patch series are also cited in this series. You may refer to
https://lkml.org/lkml/2016/2/9/919 for more details.
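
A hedged user-space sketch of the resulting interface; the layout of
struct ethtool_per_queue_op comes from patch 3/8, which is not quoted
here, so treat the field names as assumptions (error handling omitted):

    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
            struct {
                    struct ethtool_per_queue_op op;
                    struct ethtool_coalesce coalesce[2]; /* one per mask bit */
            } req;
            struct ifreq ifr;

            memset(&req, 0, sizeof(req));
            req.op.cmd = ETHTOOL_PERQUEUE;
            req.op.sub_command = ETHTOOL_GCOALESCE;
            req.op.queue_mask[0] = 0x3;               /* queues 0 and 1 */

            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
            ifr.ifr_data = (void *)&req;

            return ioctl(fd, SIOCETHTOOL, &ifr);      /* fills req.coalesce[] */
    }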

Changes since V6:
 - Rebase on commit 76d13b568776. Did minor change in patch 6.

Changes since V5:
 - Add test_bitmap.c and bitmap.sh to the series. They were forgotten
   previously.
 - Update the first two patches to David's latest V8 version. The changes
   include
  - bitmap u32 API returns number of bits copied, unit tests updated
  - module_exit in test_bitmap
 - Also change the mode of bitmap.sh to 755 according to Ben's suggestion

Changes since V4:
 - Modify set/get_per_queue_coalesce function description
 - Change the queue number to be u32
 - Correct an error of calculating coalesce backup buffer address
 - Rename queue_num to n_queues
 - Don't log error message in __i40e_get_coalesce

Changes since V3:
 - Based on David's lib bitmap.
 - ETHTOOL_PERQUEUE should be handled before the containing switch
 - Make the rollback code unconditional
 - some minor changes according to Ben's feedback

Changes since V2:
 - Add queue-specific settings for interrupt moderation in i40e

Changes since V1:
 - Checking the sub-command number to determine whether the command
   requires CAP_NET_ADMIN
 - Refine the struct ethtool_per_queue_op and improve the comments
 - Use bitmap functions to parse queue mask
 - Improve comments
 - Use bitmap functions to parse queue mask
 - Improve comments
 - Add rollback support
 - Correct the way to find the vector for specific queue.

David Decotigny (2):
  lib/bitmap.c: conversion routines to/from u32 array
  test_bitmap: unit tests for lib/bitmap.c

Kan Liang (6):
  net/ethtool: introduce a new ioctl for per queue setting
  net/ethtool: support get coalesce per queue
  net/ethtool: support set coalesce per queue
  i40e: queue-specific settings for interrupt moderation
  i40e/ethtool: support coalesce getting by queue
  i40e/ethtool: support coalesce setting by queue

 drivers/net/ethernet/intel/i40e/i40e.h |   7 -
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |  15 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 151 +++
 drivers/net/ethernet/intel/i40e/i40e_main.c|  12 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   8 +
 include/linux/bitmap.h |  10 +
 include/linux/ethtool.h|  15 +-
 include/uapi/linux/ethtool.h   |  17 ++
 lib/Kconfig.debug  |   8 +
 lib/Makefile   |   1 +
 lib/bitmap.c   |  89 ++
 lib/test_bitmap.c  | 358 +
 net/core/ethtool.c | 121 -
 tools/testing/selftests/lib/Makefile   |   2 +-
 tools/testing/selftests/lib/bitmap.sh  |  10 +
 16 files changed, 760 insertions(+), 73 deletions(-)
 create mode 100644 lib/test_bitmap.c
 create mode 100755 tools/testing/selftests/lib/bitmap.sh

-- 
1.8.3.1



[PATCH V7 2/8] test_bitmap: unit tests for lib/bitmap.c

2016-02-19 Thread Kan Liang
From: David Decotigny 

This is mainly testing bitmap construction and conversion to/from u32[]
for now.
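
For reference, the conversion API exercised by these tests (signatures
from the companion lib/bitmap.c patch; both return the number of bits
copied), as a usage sketch:

    DECLARE_BITMAP(bmap, 64);
    u32 words[2];
    unsigned int bits;

    bits = bitmap_to_u32array(words, ARRAY_SIZE(words), bmap, 64);
    bits = bitmap_from_u32array(bmap, 64, words, ARRAY_SIZE(words));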

Tested:
  qemu i386, x86_64, ppc, ppc64 BE and LE, ARM.

Signed-off-by: David Decotigny 
---
 lib/Kconfig.debug |   8 +
 lib/Makefile  |   1 +
 lib/test_bitmap.c | 358 ++
 tools/testing/selftests/lib/Makefile  |   2 +-
 tools/testing/selftests/lib/bitmap.sh |  10 +
 5 files changed, 378 insertions(+), 1 deletion(-)
 create mode 100644 lib/test_bitmap.c
 create mode 100755 tools/testing/selftests/lib/bitmap.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ecb9e75..f890ee5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1738,6 +1738,14 @@ config TEST_KSTRTOX
 config TEST_PRINTF
tristate "Test printf() family of functions at runtime"
 
+config TEST_BITMAP
+   tristate "Test bitmap_*() family of functions at runtime"
+   default n
+   help
+ Enable this option to test the bitmap functions at boot.
+
+ If unsure, say N.
+
 config TEST_RHASHTABLE
tristate "Perform selftest on resizable hash table"
default n
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..dda4039 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_TEST_USER_COPY) += test_user_copy.o
 obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
 obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
 obj-$(CONFIG_TEST_PRINTF) += test_printf.o
+obj-$(CONFIG_TEST_BITMAP) += test_bitmap.o
 
 ifeq ($(CONFIG_DEBUG_KOBJECT),y)
 CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
new file mode 100644
index 000..e2cbd43
--- /dev/null
+++ b/lib/test_bitmap.c
@@ -0,0 +1,358 @@
+/*
+ * Test cases for printf facility.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static unsigned total_tests __initdata;
+static unsigned failed_tests __initdata;
+
+static char pbl_buffer[PAGE_SIZE] __initdata;
+
+
+static bool __init
+__check_eq_uint(const char *srcfile, unsigned int line,
+   const unsigned int exp_uint, unsigned int x)
+{
+   if (exp_uint != x) {
+   pr_warn("[%s:%u] expected %u, got %u\n",
+   srcfile, line, exp_uint, x);
+   return false;
+   }
+   return true;
+}
+
+
+static bool __init
+__check_eq_bitmap(const char *srcfile, unsigned int line,
+ const unsigned long *exp_bmap, unsigned int exp_nbits,
+ const unsigned long *bmap, unsigned int nbits)
+{
+   if (exp_nbits != nbits) {
+   pr_warn("[%s:%u] bitmap length mismatch: expected %u, got %u\n",
+   srcfile, line, exp_nbits, nbits);
+   return false;
+   }
+
+   if (!bitmap_equal(exp_bmap, bmap, nbits)) {
+   pr_warn("[%s:%u] bitmaps contents differ: expected \"%*pbl\", 
got \"%*pbl\"\n",
+   srcfile, line,
+   exp_nbits, exp_bmap, nbits, bmap);
+   return false;
+   }
+   return true;
+}
+
+static bool __init
+__check_eq_pbl(const char *srcfile, unsigned int line,
+  const char *expected_pbl,
+  const unsigned long *bitmap, unsigned int nbits)
+{
+   snprintf(pbl_buffer, sizeof(pbl_buffer), "%*pbl", nbits, bitmap);
+   if (strcmp(expected_pbl, pbl_buffer)) {
+   pr_warn("[%s:%u] expected \"%s\", got \"%s\"\n",
+   srcfile, line,
+   expected_pbl, pbl_buffer);
+   return false;
+   }
+   return true;
+}
+
+static bool __init
+__check_eq_u32_array(const char *srcfile, unsigned int line,
+const u32 *exp_arr, unsigned int exp_len,
+const u32 *arr, unsigned int len)
+{
+   if (exp_len != len) {
+   pr_warn("[%s:%u] array length differ: expected %u, got %u\n",
+   srcfile, line,
+   exp_len, len);
+   return false;
+   }
+
+   if (memcmp(exp_arr, arr, len*sizeof(*arr))) {
+   pr_warn("[%s:%u] array contents differ\n", srcfile, line);
+   print_hex_dump(KERN_WARNING, "  exp:  ", DUMP_PREFIX_OFFSET,
+  32, 4, exp_arr, exp_len*sizeof(*exp_arr), false);
+   print_hex_dump(KERN_WARNING, "  got:  ", DUMP_PREFIX_OFFSET,
+  32, 4, arr, len*sizeof(*arr), false);
+   return false;
+   }
+
+   return true;
+}
+
+#define __expect_eq(suffix, ...)   \
+   ({  \
+   int result = 0; \
+   total_tests++;  \
+   if (!__check_eq_ ## suffix(__FILE__, __

[PATCH V7 4/8] net/ethtool: support get coalesce per queue

2016-02-19 Thread Kan Liang
From: Kan Liang 

This patch implements the sub command ETHTOOL_GCOALESCE for the ioctl
ETHTOOL_PERQUEUE. It introduces an interface, get_per_queue_coalesce, to
get the coalescing parameters of each masked queue from the device
driver. The interrupt coalescing parameters are then copied back to user
space one by one.
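
For illustration, a minimal user-space sketch of the resulting call flow
(not part of the patch; "eth0", the queue numbers and the absent error
handling are placeholders, and the new structures/commands come from
patch 3/8 of this series, so this only builds against patched headers):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_per_queue_op *op;
	struct ethtool_coalesce *ec;
	struct ifreq ifr;
	int fd, n = 2;

	/* header followed by one ethtool_coalesce per masked queue */
	op = calloc(1, sizeof(*op) + n * sizeof(*ec));
	op->cmd = ETHTOOL_PERQUEUE;
	op->sub_command = ETHTOOL_GCOALESCE;
	op->queue_mask[0] = (1 << 0) | (1 << 2);	/* queues 0 and 2 */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)op;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {
		/* results land in op->data, lowest queue number first */
		ec = (struct ethtool_coalesce *)op->data;
		printf("queue 0: rx-usecs %u\n", ec[0].rx_coalesce_usecs);
		printf("queue 2: rx-usecs %u\n", ec[1].rx_coalesce_usecs);
	}
	free(op);
	return 0;
}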

Signed-off-by: Kan Liang 
Reviewed-by: Ben Hutchings 
---
 include/linux/ethtool.h |  8 +++-
 net/core/ethtool.c  | 35 ++-
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 653dc9c..de56600 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -201,6 +201,11 @@ static inline u32 ethtool_rxfh_indir_default(u32 index, 
u32 n_rx_rings)
  * @get_module_eeprom: Get the eeprom information from the plug-in module
  * @get_eee: Get Energy-Efficient (EEE) supported and status.
  * @set_eee: Set EEE status (enable/disable) as well as LPI timers.
+ * @get_per_queue_coalesce: Get interrupt coalescing parameters per queue.
+ * It must check that the given queue number is valid. If neither a RX nor
+ * a TX queue has this number, return -EINVAL. If only a RX queue or a TX
+ * queue has this number, set the inapplicable fields to ~0 and return 0.
+ * Returns a negative error code or zero.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -279,7 +284,8 @@ struct ethtool_ops {
   const struct ethtool_tunable *, void *);
int (*set_tunable)(struct net_device *,
   const struct ethtool_tunable *, const void *);
-
+   int (*get_per_queue_coalesce)(struct net_device *, u32,
+ struct ethtool_coalesce *);
 
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index d640ecf..2a6c3a2 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1888,6 +1888,38 @@ out:
return ret;
 }
 
+static int ethtool_get_per_queue_coalesce(struct net_device *dev,
+ void __user *useraddr,
+ struct ethtool_per_queue_op 
*per_queue_opt)
+{
+   u32 bit;
+   int ret;
+   DECLARE_BITMAP(queue_mask, MAX_NUM_QUEUE);
+
+   if (!dev->ethtool_ops->get_per_queue_coalesce)
+   return -EOPNOTSUPP;
+
+   useraddr += sizeof(*per_queue_opt);
+
+   bitmap_from_u32array(queue_mask,
+MAX_NUM_QUEUE,
+per_queue_opt->queue_mask,
+DIV_ROUND_UP(MAX_NUM_QUEUE, 32));
+
+   for_each_set_bit(bit, queue_mask, MAX_NUM_QUEUE) {
+   struct ethtool_coalesce coalesce = { .cmd = ETHTOOL_GCOALESCE };
+
+   ret = dev->ethtool_ops->get_per_queue_coalesce(dev, bit, 
&coalesce);
+   if (ret != 0)
+   return ret;
+   if (copy_to_user(useraddr, &coalesce, sizeof(coalesce)))
+   return -EFAULT;
+   useraddr += sizeof(coalesce);
+   }
+
+   return 0;
+}
+
 static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
 {
struct ethtool_per_queue_op per_queue_opt;
@@ -1896,7 +1928,8 @@ static int ethtool_set_per_queue(struct net_device *dev, 
void __user *useraddr)
return -EFAULT;
 
switch (per_queue_opt.sub_command) {
-
+   case ETHTOOL_GCOALESCE:
+   return ethtool_get_per_queue_coalesce(dev, useraddr, 
&per_queue_opt);
default:
return -EOPNOTSUPP;
};
-- 
1.8.3.1



[PATCH V7 7/8] i40e/ethtool: support coalesce getting by queue

2016-02-19 Thread Kan Liang
From: Kan Liang 

This patch implements get_per_queue_coalesce for i40e driver.

Signed-off-by: Kan Liang 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index a470599..dd572ab 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1925,6 +1925,12 @@ static int i40e_get_coalesce(struct net_device *netdev,
return __i40e_get_coalesce(netdev, ec, -1);
 }
 
+static int i40e_get_per_queue_coalesce(struct net_device *netdev, u32 queue,
+  struct ethtool_coalesce *ec)
+{
+   return __i40e_get_coalesce(netdev, ec, queue);
+}
+
 static void i40e_set_itr_per_queue(struct i40e_vsi *vsi,
   struct ethtool_coalesce *ec,
   int queue)
@@ -2914,6 +2920,7 @@ static const struct ethtool_ops i40e_ethtool_ops = {
.get_ts_info= i40e_get_ts_info,
.get_priv_flags = i40e_get_priv_flags,
.set_priv_flags = i40e_set_priv_flags,
+   .get_per_queue_coalesce = i40e_get_per_queue_coalesce,
 };
 
 void i40e_set_ethtool_ops(struct net_device *netdev)
-- 
1.8.3.1



[PATCH V7 8/8] i40e/ethtool: support coalesce setting by queue

2016-02-19 Thread Kan Liang
From: Kan Liang 

This patch implements set_per_queue_coalesce for i40e driver.

Signed-off-by: Kan Liang 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index dd572ab..784b165 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2035,6 +2035,12 @@ static int i40e_set_coalesce(struct net_device *netdev,
return __i40e_set_coalesce(netdev, ec, -1);
 }
 
+static int i40e_set_per_queue_coalesce(struct net_device *netdev, u32 queue,
+  struct ethtool_coalesce *ec)
+{
+   return __i40e_set_coalesce(netdev, ec, queue);
+}
+
 /**
  * i40e_get_rss_hash_opts - Get RSS hash Input Set for each flow type
  * @pf: pointer to the physical function struct
@@ -2921,6 +2927,7 @@ static const struct ethtool_ops i40e_ethtool_ops = {
.get_priv_flags = i40e_get_priv_flags,
.set_priv_flags = i40e_set_priv_flags,
.get_per_queue_coalesce = i40e_get_per_queue_coalesce,
+   .set_per_queue_coalesce = i40e_set_per_queue_coalesce,
 };
 
 void i40e_set_ethtool_ops(struct net_device *netdev)
-- 
1.8.3.1



[PATCH V7 6/8] i40e: queue-specific settings for interrupt moderation

2016-02-19 Thread Kan Liang
From: Kan Liang 

In the i40e driver, each vector has its own ITR register. However, there
is no concept of queue-specific settings in the driver proper; only a
global variable is used to store ITR values. That causes problems,
especially when resetting the vector: the specific ITR values can be
lost.
This patch moves rx_itr_setting and tx_itr_setting into i40e_ring to
store the specific ITR register value for each queue.
i40e_get_coalesce and i40e_set_coalesce are also modified accordingly to
support queue-specific settings. To stay compatible with old ethtool, if
the user doesn't specify a queue number, i40e_get_coalesce returns queue
0's values, while i40e_set_coalesce applies the values to all queues.
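
A condensed sketch of the compatibility rule described above (the field
names are taken from this patch, but the logic is simplified and is not
the driver's actual code):

static int sketch_get_coalesce(struct i40e_vsi *vsi,
			       struct ethtool_coalesce *ec, int queue)
{
	if (queue < 0)		/* legacy ethtool request: report queue 0 */
		queue = 0;

	if (queue >= vsi->num_queue_pairs)
		return -EINVAL;

	/* ITR_IS_DYNAMIC() handling omitted for brevity */
	ec->rx_coalesce_usecs = vsi->rx_rings[queue]->rx_itr_setting;
	ec->tx_coalesce_usecs = vsi->tx_rings[queue]->tx_itr_setting;
	return 0;
}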

Signed-off-by: Kan Liang 
Acked-by: Shannon Nelson 
---
 drivers/net/ethernet/intel/i40e/i40e.h |   7 --
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |  15 ++-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 139 -
 drivers/net/ethernet/intel/i40e/i40e_main.c|  12 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   8 ++
 6 files changed, 120 insertions(+), 70 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index e99be9f..2f6210a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -521,13 +521,6 @@ struct i40e_vsi {
struct i40e_ring **tx_rings;
 
u16 work_limit;
-   /* high bit set means dynamic, use accessor routines to read/write.
-* hardware only supports 2us resolution for the ITR registers.
-* these values always store the USER setting, and must be converted
-* before programming to a register.
-*/
-   u16 rx_itr_setting;
-   u16 tx_itr_setting;
u16 int_rate_limit;  /* value in usecs */
 
u16 rss_table_size; /* HW RSS table size */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c 
b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index 2a44f2e..0c97733 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
@@ -302,6 +302,10 @@ static void i40e_dbg_dump_vsi_seid(struct i40e_pf *pf, int 
seid)
 "rx_rings[%i]: vsi = %p, q_vector = %p\n",
 i, rx_ring->vsi,
 rx_ring->q_vector);
+   dev_info(&pf->pdev->dev,
+"rx_rings[%i]: rx_itr_setting = %d (%s)\n",
+i, rx_ring->rx_itr_setting,
+ITR_IS_DYNAMIC(rx_ring->rx_itr_setting) ? "dynamic" : 
"fixed");
}
for (i = 0; i < vsi->num_queue_pairs; i++) {
struct i40e_ring *tx_ring = ACCESS_ONCE(vsi->tx_rings[i]);
@@ -352,14 +356,15 @@ static void i40e_dbg_dump_vsi_seid(struct i40e_pf *pf, 
int seid)
dev_info(&pf->pdev->dev,
 "tx_rings[%i]: DCB tc = %d\n",
 i, tx_ring->dcb_tc);
+   dev_info(&pf->pdev->dev,
+"tx_rings[%i]: tx_itr_setting = %d (%s)\n",
+i, tx_ring->tx_itr_setting,
+ITR_IS_DYNAMIC(tx_ring->tx_itr_setting) ? "dynamic" : 
"fixed");
}
rcu_read_unlock();
dev_info(&pf->pdev->dev,
-"work_limit = %d, rx_itr_setting = %d (%s), tx_itr_setting 
= %d (%s)\n",
-vsi->work_limit, vsi->rx_itr_setting,
-ITR_IS_DYNAMIC(vsi->rx_itr_setting) ? "dynamic" : "fixed",
-vsi->tx_itr_setting,
-ITR_IS_DYNAMIC(vsi->tx_itr_setting) ? "dynamic" : "fixed");
+"work_limit = %d\n",
+vsi->work_limit);
dev_info(&pf->pdev->dev,
 "max_frame = %d, rx_hdr_len = %d, rx_buf_len = %d dtype = 
%d\n",
 vsi->max_frame, vsi->rx_hdr_len, vsi->rx_buf_len, vsi->dtype);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index a85bc94..a470599 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1879,8 +1879,9 @@ static int i40e_set_phys_id(struct net_device *netdev,
  * 125us (8000 interrupts per second) == ITR(62)
  */
 
-static int i40e_get_coalesce(struct net_device *netdev,
-struct ethtool_coalesce *ec)
+static int __i40e_get_coalesce(struct net_device *netdev,
+  struct ethtool_coalesce *ec,
+  int queue)
 {
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
@@ -1888,14 +1889,24 @@ static int i40e_get_coalesce(struct net_device *netdev,
ec->tx_max_coalesced_frames_irq = vsi->work_limit;
ec->rx_max_coalesced_frames_irq = vsi->work_limit;
 
-   if (ITR_IS_DYNAMIC(vsi->r

[PATCH V7 3/8] net/ethtool: introduce a new ioctl for per queue setting

2016-02-19 Thread Kan Liang
From: Kan Liang 

Introduce a new ioctl, ETHTOOL_PERQUEUE, for setting per-queue
parameters. The following patches will enable some SUB_COMMANDs for
per-queue settings.
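
The user-space buffer layout implied by the new ioctl looks roughly like
this (a sketch derived from struct ethtool_per_queue_op below; not
normative documentation):

	+------------------------------+
	| cmd = ETHTOOL_PERQUEUE       |
	| sub_command                  |
	| queue_mask[128]              |  4096 bits, one per queue
	+------------------------------+
	| data[]                       |  one sub-command struct per
	|                              |  set bit, ascending queue order
	+------------------------------+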

Signed-off-by: Kan Liang 
Reviewed-by: Ben Hutchings 
---
 include/uapi/linux/ethtool.h | 17 +
 net/core/ethtool.c   | 27 +--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 190aea0..f15ae02 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1202,6 +1202,21 @@ enum ethtool_sfeatures_retval_bits {
 #define ETHTOOL_F_WISH  (1 << ETHTOOL_F_WISH__BIT)
 #define ETHTOOL_F_COMPAT(1 << ETHTOOL_F_COMPAT__BIT)
 
+#define MAX_NUM_QUEUE  4096
+
+/**
+ * struct ethtool_per_queue_op - apply sub command to the queues in mask.
+ * @cmd: ETHTOOL_PERQUEUE
+ * @sub_command: the sub command to apply to each queue
+ * @queue_mask: bitmap of the queues to which the sub command applies
+ * @data: a complete command structure following for each of the queues
+ * addressed
+ */
+struct ethtool_per_queue_op {
+   __u32   cmd;
+   __u32   sub_command;
+   __u32   queue_mask[DIV_ROUND_UP(MAX_NUM_QUEUE, 32)];
+   chardata[];
+};
 
 /* CMDs currently supported */
 #define ETHTOOL_GSET   0x0001 /* Get settings. */
@@ -1285,6 +1300,8 @@ enum ethtool_sfeatures_retval_bits {
 #define ETHTOOL_STUNABLE   0x0049 /* Set tunable configuration */
 #define ETHTOOL_GPHYSTATS  0x004a /* get PHY-specific statistics */
 
+#define ETHTOOL_PERQUEUE   0x004b /* Set per queue options */
+
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
 #define SPARC_ETH_SSET ETHTOOL_SSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index c2d3118..d640ecf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1888,13 +1888,27 @@ out:
return ret;
 }
 
+static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
+{
+   struct ethtool_per_queue_op per_queue_opt;
+
+   if (copy_from_user(&per_queue_opt, useraddr, sizeof(per_queue_opt)))
+   return -EFAULT;
+
+   switch (per_queue_opt.sub_command) {
+
+   default:
+   return -EOPNOTSUPP;
+   };
+}
+
 /* The main entry point in this file.  Called from net/core/dev_ioctl.c */
 
 int dev_ethtool(struct net *net, struct ifreq *ifr)
 {
struct net_device *dev = __dev_get_by_name(net, ifr->ifr_name);
void __user *useraddr = ifr->ifr_data;
-   u32 ethcmd;
+   u32 ethcmd, sub_cmd;
int rc;
netdev_features_t old_features;
 
@@ -1904,8 +1918,14 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
if (copy_from_user(ðcmd, useraddr, sizeof(ethcmd)))
return -EFAULT;
 
+   if (ethcmd == ETHTOOL_PERQUEUE) {
+   if (copy_from_user(&sub_cmd, useraddr + sizeof(ethcmd), 
sizeof(sub_cmd)))
+   return -EFAULT;
+   } else {
+   sub_cmd = ethcmd;
+   }
/* Allow some commands to be done by anyone */
-   switch (ethcmd) {
+   switch (sub_cmd) {
case ETHTOOL_GSET:
case ETHTOOL_GDRVINFO:
case ETHTOOL_GMSGLVL:
@@ -2135,6 +2155,9 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_GPHYSTATS:
rc = ethtool_get_phy_stats(dev, useraddr);
break;
+   case ETHTOOL_PERQUEUE:
+   rc = ethtool_set_per_queue(dev, useraddr);
+   break;
default:
rc = -EOPNOTSUPP;
}
-- 
1.8.3.1



[PATCH V7 1/8] lib/bitmap.c: conversion routines to/from u32 array

2016-02-19 Thread Kan Liang
From: David Decotigny 

Aimed at transferring bitmaps to/from user-space in a 32/64-bit agnostic
way.

Tested:
  unit tests (next patch) on qemu i386, x86_64, ppc, ppc64 BE and LE,
  ARM.
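
A small usage sketch (illustrative only; the ethtool patches later in
this series use the same pattern for queue_mask):

#include <linux/bitmap.h>

static void bitmap_u32array_demo(void)
{
	DECLARE_BITMAP(mask, 64);
	u32 words[2];

	bitmap_zero(mask, 64);
	__set_bit(0, mask);
	__set_bit(33, mask);

	/* words[0] == 0x1, words[1] == 0x2, whether longs are 32 or 64 bit */
	bitmap_to_u32array(words, 2, mask, 64);

	/* restores the original bitmap contents */
	bitmap_from_u32array(mask, 64, words, 2);
}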

Signed-off-by: David Decotigny 
Reviewed-by: Ben Hutchings 
---
 include/linux/bitmap.h | 10 ++
 lib/bitmap.c   | 89 ++
 2 files changed, 99 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 9653fdb..e9b0b9a 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -59,6 +59,8 @@
  * bitmap_find_free_region(bitmap, bits, order)Find and allocate bit 
region
  * bitmap_release_region(bitmap, pos, order)   Free specified bit region
  * bitmap_allocate_region(bitmap, pos, order)  Allocate specified bit region
+ * bitmap_from_u32array(dst, nbits, buf, nwords) *dst = *buf (nwords 32b words)
+ * bitmap_to_u32array(buf, nwords, src, nbits) *buf = *dst (nwords 32b words)
  */
 
 /*
@@ -163,6 +165,14 @@ extern void bitmap_fold(unsigned long *dst, const unsigned 
long *orig,
 extern int bitmap_find_free_region(unsigned long *bitmap, unsigned int bits, 
int order);
 extern void bitmap_release_region(unsigned long *bitmap, unsigned int pos, int 
order);
 extern int bitmap_allocate_region(unsigned long *bitmap, unsigned int pos, int 
order);
+extern unsigned int bitmap_from_u32array(unsigned long *bitmap,
+unsigned int nbits,
+const u32 *buf,
+unsigned int nwords);
+extern unsigned int bitmap_to_u32array(u32 *buf,
+  unsigned int nwords,
+  const unsigned long *bitmap,
+  unsigned int nbits);
 #ifdef __BIG_ENDIAN
 extern void bitmap_copy_le(unsigned long *dst, const unsigned long *src, 
unsigned int nbits);
 #else
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 8148143..c66da50 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1060,6 +1062,93 @@ int bitmap_allocate_region(unsigned long *bitmap, 
unsigned int pos, int order)
 EXPORT_SYMBOL(bitmap_allocate_region);
 
 /**
+ * bitmap_from_u32array - copy the contents of a u32 array of bits to bitmap
+ * @bitmap: array of unsigned longs, the destination bitmap, non NULL
+ * @nbits: number of bits in @bitmap
+ * @buf: array of u32 (in host byte order), the source bitmap, non NULL
+ * @nwords: number of u32 words in @buf
+ *
+ * copy min(nbits, 32*nwords) bits from @buf to @bitmap; remaining
+ * bits between 32*nwords and nbits in @bitmap (if any) are cleared. In
+ * the last word of @bitmap, the bits beyond nbits (if any) are kept
+ * unchanged.
+ *
+ * Return the number of bits effectively copied.
+ */
+unsigned int
+bitmap_from_u32array(unsigned long *bitmap, unsigned int nbits,
+const u32 *buf, unsigned int nwords)
+{
+   unsigned int dst_idx, src_idx;
+
+   for (src_idx = dst_idx = 0; dst_idx < BITS_TO_LONGS(nbits); ++dst_idx) {
+   unsigned long part = 0;
+
+   if (src_idx < nwords)
+   part = buf[src_idx++];
+
+#if BITS_PER_LONG == 64
+   if (src_idx < nwords)
+   part |= ((unsigned long) buf[src_idx++]) << 32;
+#endif
+
+   if (dst_idx < nbits/BITS_PER_LONG)
+   bitmap[dst_idx] = part;
+   else {
+   unsigned long mask = BITMAP_LAST_WORD_MASK(nbits);
+
+   bitmap[dst_idx] = (bitmap[dst_idx] & ~mask)
+   | (part & mask);
+   }
+   }
+
+   return min_t(unsigned int, nbits, 32*nwords);
+}
+EXPORT_SYMBOL(bitmap_from_u32array);
+
+/**
+ * bitmap_to_u32array - copy the contents of bitmap to a u32 array of bits
+ * @buf: array of u32 (in host byte order), the dest bitmap, non NULL
+ * @nwords: number of u32 words in @buf
+ * @bitmap: array of unsigned longs, the source bitmap, non NULL
+ * @nbits: number of bits in @bitmap
+ *
+ * copy min(nbits, 32*nwords) bits from @bitmap to @buf. Remaining
+ * bits after nbits in @buf (if any) are cleared.
+ *
+ * Return the number of bits effectively copied.
+ */
+unsigned int
+bitmap_to_u32array(u32 *buf, unsigned int nwords,
+  const unsigned long *bitmap, unsigned int nbits)
+{
+   unsigned int dst_idx = 0, src_idx = 0;
+
+   while (dst_idx < nwords) {
+   unsigned long part = 0;
+
+   if (src_idx < BITS_TO_LONGS(nbits)) {
+   part = bitmap[src_idx];
+   if (src_idx >= nbits/BITS_PER_LONG)
+   part &= BITMAP_LAST_WORD_MASK(nbits);
+   src_idx++;
+   }
+
+   buf[dst_idx++] = part & 0xUL;
+
+#if BITS_PER_L

[PATCH V7 5/8] net/ethtool: support set coalesce per queue

2016-02-19 Thread Kan Liang
From: Kan Liang 

This patch implements the sub command ETHTOOL_SCOALESCE for the ioctl
ETHTOOL_PERQUEUE. It introduces an interface, set_per_queue_coalesce, to
pass the coalescing parameters of each masked queue to the device
driver. The wanted coalescing parameters are stored in "data" for each
masked queue and are copied from user space.
If setting the parameters in the device driver fails, the values already
applied to specific queues are rolled back.
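
Continuing the hypothetical user-space sketch shown under patch 4/8 (op,
ec, fd and ifr as set up there; the values are purely illustrative), a
set request pre-fills op->data before issuing the ioctl:

	op->sub_command = ETHTOOL_SCOALESCE;
	ec = (struct ethtool_coalesce *)op->data;
	ec[0].rx_coalesce_usecs = 50;		/* queue 0 */
	ec[1].rx_coalesce_usecs = 100;		/* queue 2 */
	if (ioctl(fd, SIOCETHTOOL, &ifr) != 0)
		perror("ETHTOOL_PERQUEUE/ETHTOOL_SCOALESCE");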

Signed-off-by: Kan Liang 
Reviewed-by: Ben Hutchings 
---
 include/linux/ethtool.h |  7 ++
 net/core/ethtool.c  | 61 +
 2 files changed, 68 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index de56600..472d7d7 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -206,6 +206,11 @@ static inline u32 ethtool_rxfh_indir_default(u32 index, 
u32 n_rx_rings)
  * a TX queue has this number, return -EINVAL. If only a RX queue or a TX
  * queue has this number, set the inapplicable fields to ~0 and return 0.
  * Returns a negative error code or zero.
+ * @set_per_queue_coalesce: Set interrupt coalescing parameters per queue.
+ * It must check that the given queue number is valid. If neither a RX nor
+ * a TX queue has this number, return -EINVAL. If only a RX queue or a TX
+ * queue has this number, ignore the inapplicable fields.
+ * Returns a negative error code or zero.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -286,6 +291,8 @@ struct ethtool_ops {
   const struct ethtool_tunable *, const void *);
int (*get_per_queue_coalesce)(struct net_device *, u32,
  struct ethtool_coalesce *);
+   int (*set_per_queue_coalesce)(struct net_device *, u32,
+ struct ethtool_coalesce *);
 
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 2a6c3a2..2406101 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1920,6 +1920,65 @@ static int ethtool_get_per_queue_coalesce(struct 
net_device *dev,
return 0;
 }
 
+static int ethtool_set_per_queue_coalesce(struct net_device *dev,
+ void __user *useraddr,
+ struct ethtool_per_queue_op 
*per_queue_opt)
+{
+   u32 bit;
+   int i, ret = 0;
+   int n_queue;
+   struct ethtool_coalesce *backup = NULL, *tmp = NULL;
+   DECLARE_BITMAP(queue_mask, MAX_NUM_QUEUE);
+
+   if ((!dev->ethtool_ops->set_per_queue_coalesce) ||
+   (!dev->ethtool_ops->get_per_queue_coalesce))
+   return -EOPNOTSUPP;
+
+   useraddr += sizeof(*per_queue_opt);
+
+   bitmap_from_u32array(queue_mask,
+MAX_NUM_QUEUE,
+per_queue_opt->queue_mask,
+DIV_ROUND_UP(MAX_NUM_QUEUE, 32));
+   n_queue = bitmap_weight(queue_mask, MAX_NUM_QUEUE);
+   tmp = backup = kmalloc_array(n_queue, sizeof(*backup), GFP_KERNEL);
+   if (!backup)
+   return -ENOMEM;
+
+   for_each_set_bit(bit, queue_mask, MAX_NUM_QUEUE) {
+   struct ethtool_coalesce coalesce;
+
+   ret = dev->ethtool_ops->get_per_queue_coalesce(dev, bit, tmp);
+   if (ret != 0)
+   goto roll_back;
+
+   tmp++;
+
+   if (copy_from_user(&coalesce, useraddr, sizeof(coalesce))) {
+   ret = -EFAULT;
+   goto roll_back;
+   }
+
+   ret = dev->ethtool_ops->set_per_queue_coalesce(dev, bit, 
&coalesce);
+   if (ret != 0)
+   goto roll_back;
+
+   useraddr += sizeof(coalesce);
+   }
+
+roll_back:
+   if (ret != 0) {
+   tmp = backup;
+   for_each_set_bit(i, queue_mask, bit) {
+   dev->ethtool_ops->set_per_queue_coalesce(dev, i, tmp);
+   tmp++;
+   }
+   }
+   kfree(backup);
+
+   return ret;
+}
+
 static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
 {
struct ethtool_per_queue_op per_queue_opt;
@@ -1930,6 +1989,8 @@ static int ethtool_set_per_queue(struct net_device *dev, 
void __user *useraddr)
switch (per_queue_opt.sub_command) {
case ETHTOOL_GCOALESCE:
return ethtool_get_per_queue_coalesce(dev, useraddr, 
&per_queue_opt);
+   case ETHTOOL_SCOALESCE:
+   return ethtool_set_per_queue_coalesce(dev, useraddr, 
&per_queue_opt);
default:
return -EOPNOTSUPP;
};
-- 
1.8.3.1



Re: [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default

2016-02-19 Thread Jesse Gross
On Fri, Feb 19, 2016 at 12:27 PM, Tom Herbert  wrote:
> I would also note RFC7348 specifies:
>
> UDP Checksum: It SHOULD be transmitted as zero. ...
>
> The RFC doesn't provide any rationale as to why this is a SHOULD
> (neither is there any discussion as to whether this pertains to IPv6
> which has stronger requirements for non-zero UDP checksum). I think
> there are two possibilities in the intent: 1) The authors assume that
> computing UDP checksums is a significant performance hit which is
> dis-proven by this patch 2) They are worried about devices that are
> unable to compute receive checksums, however this would be addressed
> by an allowance that devices can ignore non-zero UDP checksums for
> VXLAN ("When a decapsulating end point receives a packet with a
> non-zero checksum, it MAY choose to verify the checksum value.")

It's #2.

All of the performance concerns around checksums and tunneling stem
from devices implemented using switching ASICs. In those devices,
computing/verifying checksums is so slow (software path) that they are
effectively unable to do it.


Re: [PATCH] bgmac: support Ethernet device on BCM47094 SoC

2016-02-19 Thread Rafał Miłecki
On 19 February 2016 at 21:37, David Miller  wrote:
> From: Rafał Miłecki 
> Date: Wed, 17 Feb 2016 07:48:28 +0100
>
>> It needs very similar workarounds to the one on BCM4707. It was tested
>> on D-Link DIR-885L home router.
>>
>> Signed-off-by: Rafał Miłecki 
>
> This patch doesn't apply, I get rejects.

I developed it on top of:
547b9ca Merge branch 'ip-sysctl-namespaceify'
I just rebased it on top of origin/master:
76d13b5 hv_netvsc: add software transmit timestamp support
and it applied cleanly. I'm using the net-next git tree.

Are you applying it to net-next? That was my intention, sorry if I was
supposed to make that clearer. It depends on some extra patches,
e.g.:
387b75f bgmac: add helper checking for BCM4707 / BCM53018 chip id
61dba73 bcma: add support for BCM47094
so it's for net-next only.

-- 
Rafał


RE: [PATCH net-next,V2] Add LAN9352 Ethernet Driver

2016-02-19 Thread Bryan.Whitehead
Thanks Andrew,

Your tips are much appreciated.

Bryan


Re: [PATCH v1 1/4] net: ti: netcp: restore get/set_pad_info() functionality

2016-02-19 Thread Arnd Bergmann
On Friday 19 February 2016 13:01:59 Murali Karicheri wrote:
> > 
> I have just send v2. I will investigate your original patch that added
> regression this afternoon and respond with my observation as soon as
> my investigation is complete.

Thanks.

> I assume you are trying to make the change
> such that the virtual pointers stored in sw_data/pad work across
> 32-bit and 64-bit machines, right?

Yes, that was the idea. The way I intended it, pointers were supposed
to always get stored in a pair of swdata fields and take up 64 bits,
independent of the architecture.

Arnd



Re: [PATCH 1/1] ser_gigaset: use container_of() instead of detour

2016-02-19 Thread David Miller
From: Paul Bolle 
Date: Thu, 18 Feb 2016 21:29:08 +0100

> The purpose of gigaset_device_release() is to kfree() the struct
> ser_cardstate that contains our struct device. This is done via a bit of
> a detour. First we make our struct device's driver_data point to the
> container of our struct ser_cardstate (which is a struct cardstate). In
> gigaset_device_release() we then retrieve that driver_data again. And
> after that we finally kfree() the struct ser_cardstate that was saved in
> the struct cardstate.
> 
> All of this can be achieved much easier by using container_of() to get
> from our struct device to its container, struct ser_cardstate. Do so.
> 
> Note that at the time the detour was implemented commit b8b2c7d845d5
> ("base/platform: assert that dev_pm_domain callbacks are called
> unconditionally") had just entered the tree. That commit disconnected
> our platform_device and our platform_driver. These were reconnected
> again in v4.5-rc2 through commit 25cad69f21f5 ("base/platform: Fix
> platform drivers with no probe callback"). And one of the consequences
> of that fix was that it broke the detour via driver_data. That's because
> it made __device_release_driver() stop being a NOP for our struct device
> and actually do stuff again. One of the things it now does, is setting
> our driver_data to NULL. That, in turn, makes it impossible for
> gigaset_device_release() to get to our struct cardstate. Which has the
> net effect of leaking a struct ser_cardstate at every call of this
> driver's tty close() operation. So using container_of() has the
> additional benefit of actually working.
> 
> Reported-by: Dmitry Vyukov 
> Tested-by: Dmitry Vyukov 
> Signed-off-by: Paul Bolle 

Applied, thanks.
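
For reference, the core of the fix boils down to a pattern like the
following sketch (assuming the layout described in the changelog, where
struct ser_cardstate embeds a struct platform_device; not the literal
patch):

#include <linux/platform_device.h>
#include <linux/slab.h>

struct ser_cardstate {
	struct platform_device	dev;	/* embeds the struct device */
	/* ... */
};

static void gigaset_device_release(struct device *dev)
{
	/* go straight from the embedded struct device to its container */
	kfree(container_of(dev, struct ser_cardstate, dev.dev));
}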


Re: [PATCH v2 1/3] net: ti: netcp: restore get/set_pad_info() functionality

2016-02-19 Thread Arnd Bergmann
On Friday 19 February 2016 12:58:42 Murali Karicheri wrote:
> The commit 899077791403 ("netcp: try to reduce type confusion in
> descriptors") introduces a regression in Kernel 4.5-rc1 and it breaks
> get/set_pad_info() functionality.
> 
> The TI NETCP driver uses pad0 and pad1 fields of knav_dma_desc to
> store DMA/MEM buffer pointer and buffer size respectively. And in both
> cases for Keystone 2 the pointer type size is 32 bit regardless of
> LAPE enabled or not, because CONFIG_ARCH_DMA_ADDR_T_64BIT originally
> is not expected to be defined.
> 
> Unfortunately, above commit changed buffer's pointers save/restore
> code (get/set_pad_info()) and added an intermediate conversion to u64
> which works incorrectly on 32bit Keystone 2 and causes TI NETCP driver
> crash in RX/TX path due to "Unable to handle kernel NULL pointer"
> exception. This issue was reported and discussed in [1].
> 
> Hence, fix it by partially reverting above commit and restoring
> get/set_pad_info() functionality as it was before.
> 
> [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg95361.html
> Cc: Wingman Kwok 
> Cc: Mugunthan V N 
> CC: David Laight 
> CC: Arnd Bergmann 
> Reported-by: Franklin S Cooper Jr 
> Signed-off-by: Grygorii Strashko 
> Signed-off-by: Murali Karicheri 
> 

Acked-by: Arnd Bergmann 


Re: [PATCH v2 2/3] soc: ti: knav_dma: rename pad in struct knav_dma_desc to sw_data

2016-02-19 Thread Arnd Bergmann
On Friday 19 February 2016 12:58:43 Murali Karicheri wrote:
> Rename the pad to sw_data as per description of this field in the hardware
> spec(refer sprugr9 from www.ti.com). Latest version of the document is
> at http://www.ti.com/lit/ug/sprugr9h/sprugr9h.pdf and section 3.1
> Host Packet Descriptor describes this field.
> 
> Define and use a constant for the size of sw_data field similar to
> other fields in the struct for desc and document the sw_data field
> in the header. As the sw_data is not touched by hw, it's type can be
> changed to u32.
> 
> Rename the helpers to match with the updated dma desc field sw_data.
> 
> Cc: Wingman Kwok 
> Cc: Mugunthan V N 
> CC: Arnd Bergmann 
> CC: Grygorii Strashko 
> CC: David Laight 
> Signed-off-by: Murali Karicheri 
> 

Acked-by: Arnd Bergmann 


Re: [net-next 10/16] i40e: add check for null VSI

2016-02-19 Thread David Miller
From: "Underwood, JohnX" 
Date: Fri, 19 Feb 2016 19:51:39 +

> Sorry, folks.  I was confused.  The patch looked familiar, but I
> wasn't able to find the email where I originally sent it out.

Please stop top-posting

Quote the relevant material, then add your response, not the other way around!


Re: [PATCH v2 3/3] net: netcp: rework the code for get/set sw_data in dma desc

2016-02-19 Thread Arnd Bergmann
On Friday 19 February 2016 12:58:44 Murali Karicheri wrote:
> The SW data field in the descriptor can be used by software to hold
> private data for the driver. As there are 4 words available for this
> purpose, use separate macros to place them in or retrieve them from
> descriptors. Also cast the data types accordingly.
> 
> Cc: Wingman Kwok 
> Cc: Mugunthan V N 
> CC: Arnd Bergmann 
> CC: Grygorii Strashko 
> CC: David Laight 
> Signed-off-by: Murali Karicheri 

Looks ok in principle.

Acked-by: Arnd Bergmann 

>   get_pkt_info(&dma_buf, &tmp, &dma_desc, ndesc);
> - get_sw_data((u32 *)&buf_ptr, &buf_len, ndesc);
> + /* warning We are retrieving the virtual ptr in the sw_data
> +  * field as a 32bit value. Will not work on 64bit machines
> +  */
> + buf_ptr = (void *)GET_SW_DATA0(ndesc);
> + buf_len = (int)GET_SW_DATA1(desc);

I would have abstracted the retrieval of a pointer again,
and added the comment in the helper function once, it doesn't
really need to be duplicated everywhere.

Arnd


Re: pull-request: wireless-drivers 2016-02-18

2016-02-19 Thread David Miller
From: Kalle Valo 
Date: Thu, 18 Feb 2016 17:28:14 +0200

> I have some important fixes I would like to get 4.5 still, more info in
> the signed tag. Please let me know if you have problems.

Pulled, thanks.


Re: [PATCH v2 1/1] cxgb3: fix up vpd strings for kstrto*()

2016-02-19 Thread David Miller
From: Steve Wise 
Date: Thu, 18 Feb 2016 06:34:24 -0800

> The vpd strings are left justified, in a fixed length array, with possible
> trailing white space and no NUL.  So fix them up before calling kstrto*().
> 
> This is a recent regression which causes cxgb3 to fail to load.
> 
> Fixes:e72c932('cxgb3: Convert simple_strtoul to kstrtox')

Space after "Fixes: ", and after the commit ID too.  Double quotes to surround
the commit change log header line.

> 
> Signed-off-by: Steve Wise 

Applied, thanks.


Re: [PATCH net-next] hv_netvsc: add software transmit timestamp support

2016-02-19 Thread David Miller
From: Simon Xiao 
Date: Wed, 17 Feb 2016 16:43:59 -0800

> Enable skb_tx_timestamp in hyperv netvsc.
> 
> Signed-off-by: Simon Xiao 
> Reviewed-by: K. Y. Srinivasan 
> Reviewed-by: Haiyang Zhang 

Applied.


Re: [PATCH net-next] ipv6: pass up EMSGSIZE msg for UDP socket in Ipv6

2016-02-19 Thread David Miller
From: Wei Wang 
Date: Wed, 17 Feb 2016 13:58:22 -0800

> From: Wei Wang 
> 
> In IPv4, when the machine receives an ICMP_FRAG_NEEDED message, the
> connected UDP socket will get an EMSGSIZE error on its next read from
> the socket. However, this is not the case for IPv6.
> This fix modifies the UDP error handler in IPv6 for ICMP6_PKT_TOOBIG to
> make it match the IPv4 behavior: when the machine gets an
> ICMP6_PKT_TOOBIG message, the connected UDP socket will get an EMSGSIZE
> error on its next read from the socket.
> 
> Signed-off-by: Wei Wang 

Applied.
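
An illustrative user-space view of the new behaviour (a sketch, not code
from the patch):

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

static void handle_udp_read(int fd)
{
	char buf[2048];
	ssize_t n = recv(fd, buf, sizeof(buf), 0);

	/* after an ICMP6_PKT_TOOBIG, the next read on a connected IPv6
	 * UDP socket now fails this way, as it does on IPv4 after
	 * ICMP_FRAG_NEEDED */
	if (n < 0 && errno == EMSGSIZE) {
		/* path MTU shrank; send smaller datagrams */
	}
}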


Re: [PATCH V6 0/8] ethtool per queue parameters support

2016-02-19 Thread David Miller
From: Kan Liang 
Date: Thu, 18 Feb 2016 07:39:48 -0500

> Modern network interface controllers usually support multiple receive
> and transmit queues. Each queue may have its own parameters. For
> example, Intel XL710/X710 hardware supports per queue interrupt
> moderation. However, current ethtool does not support per queue
> parameters option. User has to set parameters for the whole NIC.
> This series extends ethtool to support per queue parameters option.

This series gets rejects on the i40e changes so you'll have to respin.

Also you are doing the versioning all wrong.

Please do not version all of the patches differently, that's so confusing
and nobody does things like that.

It's the "series" that has a version.

Also you stopped properly updating the changelog in this cover letter
after V5 or so.

Thanks.


Re: [PATCH net-next] be2net: Fix pcie error recovery in case of NIC+RoCE adapters

2016-02-19 Thread David Miller
From: Padmanabh Ratnakar 
Date: Thu, 18 Feb 2016 03:09:34 +0530

> Interrupts registered by the RoCE driver are not unregistered when
> MSI-X interrupts are disabled during error recovery, causing a
> crash. Detach the adapter instance from the RoCE driver when an error
> is detected to complete the cleanup. Attach the driver again after
> the adapter has recovered from the error.
> 
> Signed-off-by: Padmanabh Ratnakar 

Applied.


Re: [PATCH] tipc: unlock in error path

2016-02-19 Thread David Miller
From: Insu Yun 
Date: Wed, 17 Feb 2016 11:47:35 -0500

> tipc_bcast_unlock() needs to be called in the error path.
> 
> Signed-off-by: Insu Yun 

Applied.


Re: [PATCH net v2] lwt: fix rx checksum setting for lwt devices tunneling over ipv6

2016-02-19 Thread David Miller
From: Paolo Abeni 
Date: Wed, 17 Feb 2016 19:30:01 +0100

> the commit 35e2d1152b22 ("tunnels: Allow IPv6 UDP checksums to be
> correctly controlled.") changed the default xmit checksum setting
> for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not
> set into external UDP header.
> This commit changes the rx checksum setting for both lwt vxlan/geneve
> devices created by openvswitch accordingly, so that lwt over ipv6
> tunnel pairs are again able to communicate with default values.
> 
> Signed-off-by: Paolo Abeni 

Applied.


Re: [PATCH] bgmac: support Ethernet device on BCM47094 SoC

2016-02-19 Thread David Miller
From: Rafał Miłecki 
Date: Wed, 17 Feb 2016 07:48:28 +0100

> It needs very similar workarounds to the one on BCM4707. It was tested
> on D-Link DIR-885L home router.
> 
> Signed-off-by: Rafał Miłecki 

This patch doesn't apply, I get rejects.


Re: pull request [net]: batman-adv 20160216

2016-02-19 Thread David Miller
From: Antonio Quartulli 
Date: Tue, 16 Feb 2016 23:01:25 +0800

> this pull request is intended for net.
> 
> Two of the fixes included in this patchset prevent a wrong memory
> access - it was triggered when removing an object from a list
> after it was already free'd due to bad reference counting.
> This misbehaviour existed for both the gw_node and the
> orig_node_vlan object and has been fixed by Sven Eckelmann.
> 
> The last patch fixes our interface feasibility check and prevents
> it from looping indefinitely when two net_device objects
> reference each other via iflink index (i.e. veth pair), by
> Andrew Lunn.

Pulled, thanks Antonio.

And thanks for the heads up about the potential merge issues, I'll watch
for that.


Re: [PATCH] rtnl: RTM_GETNETCONF: fix wrong return value

2016-02-19 Thread David Miller
From: Anton Protopopov 
Date: Tue, 16 Feb 2016 21:43:16 -0500

> An error response from an RTM_GETNETCONF request can return the
> positive error value EINVAL in struct nlmsgerr, which can mislead
> userspace.
> 
> Signed-off-by: Anton Protopopov 

Applied and queued up for -stable, thanks.
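
The class of bug being fixed, in sketch form ("invalid" is a placeholder
condition; the point is that netlink handlers must return negative
errnos, or userspace sees a bogus positive value in nlmsgerr):

	if (invalid)
		return -EINVAL;	/* not "return EINVAL;" */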


Re: [PATCH net-next,V2] net: macb: make magic-packet property generic

2016-02-19 Thread David Miller
From: Sergio Prado 
Date: Tue, 16 Feb 2016 21:10:45 -0200

> As requested by Rob Herring on patch
> https://patchwork.ozlabs.org/patch/580862/.
> 
> This is a new property that is still in net-next and has never been
> used in production, so we are not breaking anything with the
> incompatible binding change.
> 
> Signed-off-by: Sergio Prado 

Applied, thanks.


Re: [PATCH net] net: make netdev_for_each_lower_dev safe for device removal

2016-02-19 Thread David Miller
From: David Ahern 
Date: Wed, 17 Feb 2016 12:01:10 -0700

> On 2/17/16 10:00 AM, Nikolay Aleksandrov wrote:
>> From: Nikolay Aleksandrov 
>>
>> When I used netdev_for_each_lower_dev in commit bad531623253 ("vrf:
>> remove slave queue and private slave struct") I thought that it acts
>> like netdev_for_each_lower_private and can be used to remove the
>> current device from the list while walking, but unfortunately it acts
>> more like netdev_for_each_lower_private_rcu and doesn't allow it. The
>> difference is where the "iter" points to; right now it points to the
>> current element and that makes it impossible to remove it. Change the
>> logic to be similar to netdev_for_each_lower_private and make it point
>> to the "next" element so we can safely delete the current one. VRF is
>> the only such user right now, there's no change for the read-only users.
 ...
>> Fixes: bad531623253 ("vrf: remove slave queue and private slave struct")
>> Signed-off-by: Nikolay Aleksandrov 
>> ---
>>   include/linux/netdevice.h | 2 +-
>>   net/core/dev.c| 4 ++--
>>   2 files changed, 3 insertions(+), 3 deletions(-)
> 
> Solves the problem for me. Thanks for the quick turnaround, Nik.
> 
> Reviewed-by / Tested-by: David Ahern 

David, please explicitly list these tags one by one, patchwork is not
able to pick them up if you try to free-form the tags.  I had to
incorporate them by hand, but that makes more work for me.

Applied, thanks.
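
A minimal sketch of what the fixed iterator now permits (assumed
VRF-style usage with RTNL held; not the actual driver code):

#include <linux/netdevice.h>

static void sketch_unlink_all_lowers(struct net_device *dev)
{
	struct net_device *ldev;
	struct list_head *iter;

	/* safe after this fix, because "iter" points at the next element */
	netdev_for_each_lower_dev(dev, ldev, iter)
		netdev_upper_dev_unlink(ldev, dev);
}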





Re: [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums by default

2016-02-19 Thread Tom Herbert
On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
> This change makes it so that if UDP CSUM is not specified we will default
> to enabling it.  The main motivation behind this is the fact that with the
> use of outer checksum we can greatly improve the performance for GENEVE
> tunnels on hardware that doesn't know how to parse them.
>
> Signed-off-by: Alexander Duyck 

Acked-by: Tom Herbert 

> ---
>  drivers/net/geneve.c |   16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 1b5deaf7e3c8..2acaf2b209cd 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -76,7 +76,7 @@ struct geneve_dev {
>  };
>
>  /* Geneve device flags */
> -#define GENEVE_F_UDP_CSUM  BIT(0)
> +#define GENEVE_F_UDP_ZERO_CSUM_TX  BIT(0)
>  #define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1)
>  #define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2)
>
> @@ -703,7 +703,7 @@ static int geneve_build_skb(struct rtable *rt, struct 
> sk_buff *skb,
> struct genevehdr *gnvh;
> int min_headroom;
> int err;
> -   bool udp_sum = !!(flags & GENEVE_F_UDP_CSUM);
> +   bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX);
>
> skb_scrub_packet(skb, xnet);
>
> @@ -944,9 +944,9 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, 
> struct net_device *dev,
> opts = ip_tunnel_info_opts(info);
>
> if (key->tun_flags & TUNNEL_CSUM)
> -   flags |= GENEVE_F_UDP_CSUM;
> +   flags &= ~GENEVE_F_UDP_ZERO_CSUM_TX;
> else
> -   flags &= ~GENEVE_F_UDP_CSUM;
> +   flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
>
> err = geneve_build_skb(rt, skb, key->tun_flags, vni,
>info->options_len, opts, flags, xnet);
> @@ -972,7 +972,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, 
> struct net_device *dev,
> udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr,
> tos, ttl, df, sport, geneve->dst_port,
> !net_eq(geneve->net, dev_net(geneve->dev)),
> -   !(flags & GENEVE_F_UDP_CSUM));
> +   !!(flags & GENEVE_F_UDP_ZERO_CSUM_TX));
>
> return NETDEV_TX_OK;
>
> @@ -1412,8 +1412,8 @@ static int geneve_newlink(struct net *net, struct 
> net_device *dev,
> metadata = true;
>
> if (data[IFLA_GENEVE_UDP_CSUM] &&
> -   nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
> -   flags |= GENEVE_F_UDP_CSUM;
> +   !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
> +   flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
>
> if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
> nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]))
> @@ -1483,7 +1483,7 @@ static int geneve_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
> }
>
> if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
> -  !!(geneve->flags & GENEVE_F_UDP_CSUM)) ||
> +  !(geneve->flags & GENEVE_F_UDP_ZERO_CSUM_TX)) ||
> nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
>!!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_TX)) ||
> nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
>


Re: [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default

2016-02-19 Thread Tom Herbert
On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
> This change makes it so that if UDP CSUM is not specified we will default
> to enabling it.  The main motivation behind this is the fact that with the
> use of outer checksum we can greatly improve the performance for VXLAN
> tunnels on devices that don't know how to parse tunnel headers.
>
> Signed-off-by: Alexander Duyck 
> ---
>  drivers/net/vxlan.c |   19 +--
>  include/net/vxlan.h |2 +-
>  2 files changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 766e6114a37f..909f7931c297 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -1957,13 +1957,6 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
> net_device *dev,
> goto drop;
> sk = vxlan->vn4_sock->sock->sk;
>
> -   if (info) {
> -   if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
> -   df = htons(IP_DF);
> -   } else {
> -   udp_sum = !!(flags & VXLAN_F_UDP_CSUM);
> -   }
> -
> rt = vxlan_get_route(vxlan, skb,
>  rdst ? rdst->remote_ifindex : 0, tos,
>  dst->sin.sin_addr.s_addr, &saddr,
> @@ -1997,6 +1990,11 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
> net_device *dev,
> return;
> }
>
> +   if (!info)
> +   udp_sum = !(flags & VXLAN_F_UDP_ZERO_CSUM_TX);
> +   else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
> +   df = htons(IP_DF);
> +
> tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
> ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
> err = vxlan_build_skb(skb, &rt->dst, sizeof(struct iphdr),
> @@ -2920,8 +2918,9 @@ static int vxlan_newlink(struct net *src_net, struct 
> net_device *dev,
> if (data[IFLA_VXLAN_PORT])
> conf.dst_port = nla_get_be16(data[IFLA_VXLAN_PORT]);
>
> -   if (data[IFLA_VXLAN_UDP_CSUM] && 
> nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
> -   conf.flags |= VXLAN_F_UDP_CSUM;
> +   if (data[IFLA_VXLAN_UDP_CSUM] &&
> +   !nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
> +   conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
>
> if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] &&
> nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
> @@ -3065,7 +3064,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
> nla_put_u32(skb, IFLA_VXLAN_LIMIT, vxlan->cfg.addrmax) ||
> nla_put_be16(skb, IFLA_VXLAN_PORT, vxlan->cfg.dst_port) ||
> nla_put_u8(skb, IFLA_VXLAN_UDP_CSUM,
> -   !!(vxlan->flags & VXLAN_F_UDP_CSUM)) ||
> +   !(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM_TX)) ||
> nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
> !!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_TX)) ||
> nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> diff --git a/include/net/vxlan.h b/include/net/vxlan.h
> index 748083de367a..6eda4ed4d78b 100644
> --- a/include/net/vxlan.h
> +++ b/include/net/vxlan.h
> @@ -197,7 +197,7 @@ struct vxlan_dev {
>  #define VXLAN_F_L2MISS 0x08
>  #define VXLAN_F_L3MISS 0x10
>  #define VXLAN_F_IPV6   0x20
> -#define VXLAN_F_UDP_CSUM   0x40
> +#define VXLAN_F_UDP_ZERO_CSUM_TX   0x40
>  #define VXLAN_F_UDP_ZERO_CSUM6_TX  0x80
>  #define VXLAN_F_UDP_ZERO_CSUM6_RX  0x100
>  #define VXLAN_F_REMCSUM_TX 0x200
>

Acked-by: Tom Herbert 

I would also note RFC7348 specifies:

UDP Checksum: It SHOULD be transmitted as zero. ...

The RFC doesn't provide any rationale as to why this is a SHOULD
(neither is there any discussion as to whether this pertains to IPv6
which has stronger requirements for non-zero UDP checksum). I think
there are two possibilities in the intent: 1) The authors assume that
computing UDP checksums is a significant performance hit which is
dis-proven by this patch 2) They are worried about devices that are
unable to compute receive checksums, however this would be addressed
by an allowance that devices can ignore non-zero UDP checksums for
VXLAN ("When a decapsulating end point receives a packet with a
non-zero checksum, it MAY choose to verify the checksum value.")




Re: [PATCH net-next 0/2] bridge: mdb: add support for extended attributes

2016-02-19 Thread David Miller
From: Nikolay Aleksandrov 
Date: Thu, 18 Feb 2016 21:51:26 +0100

> I just thought this version is a middle ground between the two solutions and
> still doesn't break user-space while being extensible.

Ok, I'll apply this series, thanks for taking me through your thought
process.


Re: [PATCH net-next,V2] Add LAN9352 Ethernet Driver

2016-02-19 Thread Andrew Lunn
On Fri, Feb 19, 2016 at 07:29:46PM +, bryan.whiteh...@microchip.com wrote:
> > From: Andrew Lunn [mailto:and...@lunn.ch]
> > Sent: Tuesday, February 16, 2016 5:16 PM
> > 
> > You are already 1/2 way to a DSA driver, since you have a MAC driver. So i
> > agree with David, do it right and add a simple DSA driver.
> > 
> Andrew,
> 
> I've done a little research on DSA.
> I've read Documentation/networking/dsa/dsa.txt
> And I've looked over some examples in drivers/net/dsa/
> 
> It appears there are about 40 functions to implement.

For a minimal driver, you need a lot less. mv88e6060.c is a standalone
DSA driver and has 5 functions. 250 lines of code. You will also need
a new tag implementation, named something like net/dsa/tag_lan935x.c,
but that is only two functions.

> I see examples from only 2 manufacturers, and they average more than
> 2000 lines of code.

These are more fully featured switches, and the drivers do more than
the minimum. You should be aiming for your first submission to be a
minimal driver. Expose two ports, with a PHY each, send and receive
frames on each port.

Later patches can add more features, like EEE, MIB counters, hardware
bridging of the ports, 802.1Q, reading and writing the EEPROM
etc. These are all optional features you can add later.

> I'm not yet sure how it attaches to the underlying Ethernet driver.

The DSA core does that for you. If you look at the device tree
binding:

dsa@0 {
compatible = "marvell,dsa";
#address-cells = <2>;
#size-cells = <0>;

interrupts = <10>;
dsa,ethernet = <ðernet0>;

So this says which Ethernet interface to use. net/dsa/dsa.c will
create a slave interface per external port of your switch. This slave
is a netdev. Frames transmitted by this slave are fed through the tag
code to add additional headers/trailers, and then passed to this
ethernet device. Frames received by the ethernet device again pass
through the tag code, which strips off any headers/trailers and demuxes
to the correct slave.

> And I have no idea how I would test it at the end. 

The whole concept behind this is that the ports of the switch
behave like normal network interfaces. So test it the same way you
test a Linux ethernet driver: ping, iperf, ethtool, etc. And add some
tests using a Linux bridge.
 
> Given these issues, it will be hard to sell it to my manager.

With time, you can expect to see more switch chips gaining mainline
support. There is interest in implementing a DSA driver for the
Micrel/Kendin KS8995M and KSZ8864RMN chips, for example, which i think
are in the same or similar market segment as the lan935x devices.
 
> If it can be partly implemented, which parts should be implemented first?

I would suggest you first understand what the tag_ files are about,
and look at implementing for the lan935x. The datasheet talks about a
special VLAN tag with bits 0 and 1 indicating the outgoing port, if
bit 3 is zero. Then do a minimal driver, equivalent to
mv88e6060.c. FYI the datasheet of this device is publicly available.
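
To make that concrete, a very rough sketch of the xmit hook such a tag
driver might have, based only on the tag format described above (the
masks, names and use of a plain 802.1Q tag are assumptions, not taken
from the datasheet):

#include <linux/if_vlan.h>
#include "dsa_priv.h"	/* struct dsa_slave_priv */

static struct sk_buff *lan935x_xmit(struct sk_buff *skb,
				    struct net_device *dev)
{
	struct dsa_slave_priv *p = netdev_priv(dev);

	/* bits 0-1 select the egress port; bit 3 clear means "use them" */
	u16 tci = p->port & 0x3;

	/* vlan_insert_tag() frees the skb and returns NULL on failure */
	return vlan_insert_tag(skb, htons(ETH_P_8021Q), tci);
}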

 Andrew


RE: [net-next 10/16] i40e: add check for null VSI

2016-02-19 Thread Underwood, JohnX
Sorry, folks.  I was confused.  The patch looked familiar, but I wasn't able to 
find the email where I originally sent it out.

-Original Message-
From: Kirsher, Jeffrey T 
Sent: Friday, February 19, 2016 11:49 AM
To: David Miller; Underwood, JohnX
Cc: netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com; 
jogre...@redhat.com
Subject: Re: [net-next 10/16] i40e: add check for null VSI

On Fri, 2016-02-19 at 14:41 -0500, David Miller wrote:
> From: "Underwood, JohnX" 
> Date: Fri, 19 Feb 2016 19:15:53 +
> 
> > ACK
> 
> You should never top-post on this mailing list.
> 
> But in this specific case it is even more important.
> 
> So just in case it is not clear:
> 
> Please, pretty please, DO NOT top-post ACKs to patches like 
> this.
> 
> It looks like a new, fresh, patch submission to patchwork, therefore
> you are creating a significant burden for me.

Not to mention he was ACK'ing his own patch that he authored. :-(

Sorry Dave, I will work with him offline.


Re: [net-next 10/16] i40e: add check for null VSI

2016-02-19 Thread Jeff Kirsher
On Fri, 2016-02-19 at 14:41 -0500, David Miller wrote:
> From: "Underwood, JohnX" 
> Date: Fri, 19 Feb 2016 19:15:53 +
> 
> > ACK
> 
> You should never top-post on this mailing list.
> 
> But in this specific case it is even more important.
> 
> So just in case it is not clear:
> 
> Please, pretty please, DO NOT top-post ACKs to patches like
> this.
> 
> It looks like a new, fresh, patch submission to patchwork, therefore
> you are creating a significant burden for me.

Not to mention he was ACK'ing his own patch that he authored. :-(

Sorry Dave, I will work with him offline.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH v2] bpf: grab rcu read lock for bpf_percpu_hash_update

2016-02-19 Thread David Miller
From: Sasha Levin 
Date: Fri, 19 Feb 2016 13:53:10 -0500

> bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not,
> which pointed out a missing rcu read lock.
> 
> Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array 
> maps")
> Signed-off-by: Sasha Levin 

Applied, thanks.


Re: [net-next 10/16] i40e: add check for null VSI

2016-02-19 Thread David Miller
From: "Underwood, JohnX" 
Date: Fri, 19 Feb 2016 19:15:53 +

> ACK

You should never top-post on this mailing list.

But in this specific case it is even more important.

So just in case it is not clear:

Please, pretty please, DO NOT top-post ACKs to patches like this.

It looks like a new, fresh, patch submission to patchwork, therefore you
are creating a significant burden for me.

Thanks.


RE: [PATCH net-next,V2] Add LAN9352 Ethernet Driver

2016-02-19 Thread Bryan.Whitehead
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Tuesday, February 16, 2016 5:16 PM
> 
> You are already 1/2 way to a DSA driver, since you have a MAC driver. So i
> agree with David, do it right and add a simple DSA driver.
> 
Andrew,

I've done a little research on DSA.
I've read Documentation/networking/dsa/dsa.txt
And I've looked over some examples in drivers/net/dsa/

It appears there are about 40 functions to implement.
I see examples from only 2 manufacturers, and they average more than 2000 lines 
of code.
I'm not yet sure how it attaches to the underlying Ethernet driver.
And I have no idea how I would test it at the end. 

Given these issues, it will be hard to sell it to my manager.
Since you claim it is simple, can you explain your reasons?

I've never done a DSA driver before. Given that, can you generally
outline the steps?
If it can be partly implemented, which parts should be implemented first?

I appreciate any pointers you can provide.

Bryan


[net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Alexander Duyck
This patch series makes it so that we enable the outer Tx checksum for IPv4
tunnels by default.  This makes the behavior consistent with how we were
handling this for IPv6.  In addition I have updated the internal flags for
these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
match up well with the ZERO_CSUM6_TX flag which was already in use for
IPv6.

For most network devices this should be a net gain in terms of performance
as having the outer header checksum present allows for devices to report
CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
to determine if the inner header checksum is valid.
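
The receive-side mechanism referred to here is roughly the following (a
sketch of the generic UDP receive path, with uh being the outer UDP
header; not code from this series):

	struct udphdr *uh = udp_hdr(skb);

	/* a non-zero outer UDP checksum lets a CHECKSUM_UNNECESSARY
	 * result from the NIC be converted to CHECKSUM_COMPLETE, so the
	 * inner checksum can be validated without parsing the tunnel */
	skb_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
				 inet_compute_pseudo);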

Below is some data I collected with ixgbe with an X540 that demonstrates
this.  I located two PFs connected back to back in two different name
spaces and then setup a pair of tunnels on each, one with checksum enabled
and one without.

Recv   Send    Send                          Utilization
Socket Socket  Message  Elapsed               Send
Size   Size    Size     Time     Throughput   local
bytes  bytes   bytes    secs.    10^6bits/s   % S

noudpcsum:
 87380  16384  16384    30.00     8898.67     12.80
udpcsum:
 87380  16384  16384    30.00     9088.47     5.69

The one spot where this may cause a performance regression is if the
environment contains devices that can parse the inner headers and a
device that supports NETIF_F_GSO_UDP_TUNNEL but not
NETIF_F_GSO_UDP_TUNNEL_CSUM.  With such a device we have to fall back
to using GSO to segment the tunnel traffic instead of TSO, and as a
result we may take a performance hit, as seen below with i40e.

Recv   Send    Send                          Utilization
Socket Socket  Message  Elapsed               Send
Size   Size    Size     Time     Throughput   local
bytes  bytes   bytes    secs.    10^6bits/s   % S

noudpcsum:
 87380  16384  16384    30.00     9085.21     3.32
udpcsum:
 87380  16384  16384    30.00     9089.23     5.54

In addition it will be necessary to update iproute2 so that we don't
provide the checksum attribute unless specified.  This way on older kernels
which don't have local checksum offload we will default to disabling the
outer checksum, and on newer kernels that have LCO we can default to
enabling it.

I also haven't investigated the effect this will have on OVS.  However I
suspect the impact should be minimal as the worst case scenario should be
that Tx checksumming will become enabled by default which should be
consistent with the existing behavior for IPv6.

---

Alexander Duyck (2):
  GENEVE: Support outer IPv4 Tx checksums by default
  VXLAN: Support outer IPv4 Tx checksums by default


 drivers/net/geneve.c |   16 
 drivers/net/vxlan.c  |   19 +--
 include/net/vxlan.h  |2 +-
 3 files changed, 18 insertions(+), 19 deletions(-)

--


[net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums by default

2016-02-19 Thread Alexander Duyck
This change makes it so that if UDP CSUM is not specified we will default
to enabling it.  The main motivation behind this is the fact that with the
use of outer checksum we can greatly improve the performance for GENEVE
tunnels on hardware that doesn't know how to parse them.

Signed-off-by: Alexander Duyck 
---
 drivers/net/geneve.c |   16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 1b5deaf7e3c8..2acaf2b209cd 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -76,7 +76,7 @@ struct geneve_dev {
 };
 
 /* Geneve device flags */
-#define GENEVE_F_UDP_CSUM  BIT(0)
+#define GENEVE_F_UDP_ZERO_CSUM_TX  BIT(0)
 #define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1)
 #define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2)
 
@@ -703,7 +703,7 @@ static int geneve_build_skb(struct rtable *rt, struct 
sk_buff *skb,
struct genevehdr *gnvh;
int min_headroom;
int err;
-   bool udp_sum = !!(flags & GENEVE_F_UDP_CSUM);
+   bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX);
 
skb_scrub_packet(skb, xnet);
 
@@ -944,9 +944,9 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
opts = ip_tunnel_info_opts(info);
 
if (key->tun_flags & TUNNEL_CSUM)
-   flags |= GENEVE_F_UDP_CSUM;
+   flags &= ~GENEVE_F_UDP_ZERO_CSUM_TX;
else
-   flags &= ~GENEVE_F_UDP_CSUM;
+   flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
 
err = geneve_build_skb(rt, skb, key->tun_flags, vni,
   info->options_len, opts, flags, xnet);
@@ -972,7 +972,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr,
tos, ttl, df, sport, geneve->dst_port,
!net_eq(geneve->net, dev_net(geneve->dev)),
-   !(flags & GENEVE_F_UDP_CSUM));
+   !!(flags & GENEVE_F_UDP_ZERO_CSUM_TX));
 
return NETDEV_TX_OK;
 
@@ -1412,8 +1412,8 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
metadata = true;
 
if (data[IFLA_GENEVE_UDP_CSUM] &&
-   nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
-   flags |= GENEVE_F_UDP_CSUM;
+   !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
+   flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
 
if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]))
@@ -1483,7 +1483,7 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
}
 
if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
-  !!(geneve->flags & GENEVE_F_UDP_CSUM)) ||
+  !(geneve->flags & GENEVE_F_UDP_ZERO_CSUM_TX)) ||
nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
   !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_TX)) ||
nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
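
A minimal demo, outside the driver, of why renaming the bit flips the
default: with a zero-initialized flags word, a positive "enable checksum"
bit starts out off, while the inverted "zero checksum" bit starts out
cleared, i.e. checksums on.  Only the flag name is taken from the patch;
the rest is illustrative:

#include <stdbool.h>
#include <stdio.h>

#define GENEVE_F_UDP_ZERO_CSUM_TX (1u << 0)

/* Mirrors the udp_sum computation in geneve_build_skb() above. */
static bool udp_sum(unsigned int flags)
{
	return !(flags & GENEVE_F_UDP_ZERO_CSUM_TX);
}

int main(void)
{
	printf("fresh device     -> udp_sum = %d\n", udp_sum(0));
	printf("explicit opt-out -> udp_sum = %d\n",
	       udp_sum(GENEVE_F_UDP_ZERO_CSUM_TX));
	return 0;
}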



[net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default

2016-02-19 Thread Alexander Duyck
This change makes it so that if UDP CSUM is not specified we will default
to enabling it.  The main motivation behind this is that the use of an
outer checksum can greatly improve the performance of VXLAN tunnels on
devices that don't know how to parse tunnel headers.

Signed-off-by: Alexander Duyck 
---
 drivers/net/vxlan.c |   19 +++++++++----------
 include/net/vxlan.h |    2 +-
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 766e6114a37f..909f7931c297 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1957,13 +1957,6 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
goto drop;
sk = vxlan->vn4_sock->sock->sk;
 
-   if (info) {
-   if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
-   df = htons(IP_DF);
-   } else {
-   udp_sum = !!(flags & VXLAN_F_UDP_CSUM);
-   }
-
rt = vxlan_get_route(vxlan, skb,
 rdst ? rdst->remote_ifindex : 0, tos,
 dst->sin.sin_addr.s_addr, &saddr,
@@ -1997,6 +1990,11 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
return;
}
 
+   if (!info)
+   udp_sum = !(flags & VXLAN_F_UDP_ZERO_CSUM_TX);
+   else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
+   df = htons(IP_DF);
+
tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
err = vxlan_build_skb(skb, &rt->dst, sizeof(struct iphdr),
@@ -2920,8 +2918,9 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev,
if (data[IFLA_VXLAN_PORT])
conf.dst_port = nla_get_be16(data[IFLA_VXLAN_PORT]);
 
-   if (data[IFLA_VXLAN_UDP_CSUM] && nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
-   conf.flags |= VXLAN_F_UDP_CSUM;
+   if (data[IFLA_VXLAN_UDP_CSUM] &&
+   !nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
+   conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
 
if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] &&
nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
@@ -3065,7 +3064,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
nla_put_u32(skb, IFLA_VXLAN_LIMIT, vxlan->cfg.addrmax) ||
nla_put_be16(skb, IFLA_VXLAN_PORT, vxlan->cfg.dst_port) ||
nla_put_u8(skb, IFLA_VXLAN_UDP_CSUM,
-   !!(vxlan->flags & VXLAN_F_UDP_CSUM)) ||
+   !(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM_TX)) ||
nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
!!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_TX)) ||
nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 748083de367a..6eda4ed4d78b 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -197,7 +197,7 @@ struct vxlan_dev {
 #define VXLAN_F_L2MISS 0x08
 #define VXLAN_F_L3MISS 0x10
 #define VXLAN_F_IPV6   0x20
-#define VXLAN_F_UDP_CSUM   0x40
+#define VXLAN_F_UDP_ZERO_CSUM_TX   0x40
 #define VXLAN_F_UDP_ZERO_CSUM6_TX  0x80
 #define VXLAN_F_UDP_ZERO_CSUM6_RX  0x100
 #define VXLAN_F_REMCSUM_TX 0x200
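
For illustration, a simplified model of the transmit-side choice after
this patch: metadata-based tunnels take TUNNEL_CSUM from the per-packet
info, while classic devices consult the inverted device flag.  struct
tun_info and choose_udp_sum() are stand-ins, not vxlan.c code:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define VXLAN_F_UDP_ZERO_CSUM_TX (1u << 0)
#define TUNNEL_CSUM              (1u << 1)

struct tun_info {
	unsigned int tun_flags;
};

static bool choose_udp_sum(const struct tun_info *info, unsigned int dev_flags)
{
	if (!info)  /* classic device: default on unless opted out */
		return !(dev_flags & VXLAN_F_UDP_ZERO_CSUM_TX);

	/* metadata tunnel: honor the per-packet TUNNEL_CSUM request */
	return !!(info->tun_flags & TUNNEL_CSUM);
}

int main(void)
{
	struct tun_info meta = { .tun_flags = TUNNEL_CSUM };

	printf("classic, default:  %d\n", choose_udp_sum(NULL, 0));
	printf("classic, opt-out:  %d\n",
	       choose_udp_sum(NULL, VXLAN_F_UDP_ZERO_CSUM_TX));
	printf("metadata, CSUM on: %d\n", choose_udp_sum(&meta, 0));
	return 0;
}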



[PATCH net] Driver: Vmxnet3: Update Rx ring 2 max size

2016-02-19 Thread Shrikrishna Khare
Device emulation supports a maximum ring size of 4096.

Signed-off-by: Shrikrishna Khare 
Signed-off-by: Bhavesh Davda 
---
 drivers/net/vmxnet3/vmxnet3_defs.h | 2 +-
 drivers/net/vmxnet3/vmxnet3_int.h  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_defs.h b/drivers/net/vmxnet3/vmxnet3_defs.h
index 221a530..72ba8ae 100644
--- a/drivers/net/vmxnet3/vmxnet3_defs.h
+++ b/drivers/net/vmxnet3/vmxnet3_defs.h
@@ -377,7 +377,7 @@ union Vmxnet3_GenericDesc {
 #define VMXNET3_TX_RING_MAX_SIZE   4096
 #define VMXNET3_TC_RING_MAX_SIZE   4096
 #define VMXNET3_RX_RING_MAX_SIZE   4096
-#define VMXNET3_RX_RING2_MAX_SIZE  2048
+#define VMXNET3_RX_RING2_MAX_SIZE  4096
 #define VMXNET3_RC_RING_MAX_SIZE   8192
 
 /* a list of reasons for queue stop */
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index bdb8a6c..729c344 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -69,10 +69,10 @@
 /*
  * Version numbers
  */
-#define VMXNET3_DRIVER_VERSION_STRING   "1.4.5.0-k"
+#define VMXNET3_DRIVER_VERSION_STRING   "1.4.6.0-k"
 
 /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
-#define VMXNET3_DRIVER_VERSION_NUM  0x01040500
+#define VMXNET3_DRIVER_VERSION_NUM  0x01040600
 
 #if defined(CONFIG_PCI_MSI)
/* RSS only makes sense if MSI-X is supported. */
-- 
1.8.5.6
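
As a quick sanity check of the one-byte-per-component encoding described
in the header comment, a standalone snippet that unpacks the new value;
it prints 1.4.6.0:

#include <stdio.h>

#define VMXNET3_DRIVER_VERSION_NUM 0x01040600

int main(void)
{
	unsigned int v = VMXNET3_DRIVER_VERSION_NUM;

	/* One byte per version component, most significant first. */
	printf("%u.%u.%u.%u\n",
	       (v >> 24) & 0xff, (v >> 16) & 0xff,
	       (v >> 8) & 0xff, v & 0xff);
	return 0;
}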



RE: [net-next 10/16] i40e: add check for null VSI

2016-02-19 Thread Underwood, JohnX
ACK

-Original Message-
From: Kirsher, Jeffrey T 
Sent: Friday, February 19, 2016 1:54 AM
To: da...@davemloft.net
Cc: Underwood, JohnX; netdev@vger.kernel.org; nhor...@redhat.com; 
sassm...@redhat.com; jogre...@redhat.com; Kirsher, Jeffrey T
Subject: [net-next 10/16] i40e: add check for null VSI

From: John Underwood 

Return from i40e_vsi_reinit_setup() if the vsi param is NULL.
This makes the code consistent with all the other code that checks for NULL
before using one of the VSI pointers accessed with an indexed variable.
(Indexed VSI pointers are intentionally set to NULL in i40e_vsi_clear() and
i40e_remove().)

Change-ID: I3bc8b909c70fd2439334eeae994d151f61480985
Signed-off-by: John Underwood 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 05def9f..3ff3e83 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9583,10 +9583,15 @@ vector_setup_out:
  **/
static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
{
-   struct i40e_pf *pf = vsi->back;
+   struct i40e_pf *pf;
u8 enabled_tc;
int ret;
 
+   if (!vsi)
+   return NULL;
+
+   pf = vsi->back;
+
i40e_put_lump(pf->qp_pile, vsi->base_queue, vsi->idx);
i40e_vsi_clear_rings(vsi);
 
--
2.5.0
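
A toy version of the pattern the patch enforces; struct vsi and
vsi_reinit() are illustrative stand-ins for the i40e types, not driver
code:

#include <stddef.h>
#include <stdio.h>

struct vsi {
	int id;
};

/* Guard first, exactly as the patch does, then dereference. */
static int vsi_reinit(struct vsi *vsi)
{
	if (!vsi)
		return -1;   /* slot was cleared on teardown */
	return vsi->id;
}

int main(void)
{
	struct vsi v = { .id = 42 };

	printf("%d\n", vsi_reinit(&v));    /* 42 */
	printf("%d\n", vsi_reinit(NULL));  /* -1, no crash */
	return 0;
}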


